
How many TOPS does your Skylake X / Cascade Lake X have?

restsugavan
Level 14
  • According to a quote from Qualcomm:

 " TOPS quantifies an NPU's processing capabilities by measuring the number of operations (additions, multiplies, etc.) in trillions executed within a second.

 

This standardized measurement strongly indicates an NPU's performance, serving as a crucial yardstick for comparing AI performance across different processors and architectures. Because TOPS serves as a cornerstone performance metric for NPUs, exploring the parameters that make up the TOPS equation and how they can dictate performance is essential. Doing so can offer a deeper understanding of an NPU's capabilities. 

A multiply-accumulate (MAC) operation executes the mathematical formulas at the core of AI workloads. A matrix multiply consists of a series of two fundamental operations: multiplication and addition to an accumulator. A MAC unit can, for example, run one of each per clock cycle, meaning it executes two operations per clock cycle. A given NPU has a set number of MAC units that can operate at varying levels of precision, depending on the NPU's architecture.

Frequency dictates the clock speed (or cycles per second) at which an NPU and its MAC units (as well as a CPU or GPU) operate, directly influencing overall performance. A higher frequency allows for more operations per unit of time, resulting in faster processing speeds. However, increasing frequency also leads to higher power consumption and heat generation, which impacts battery life and user experience. The TOPS number quoted for processors is generally at the peak operating frequency.

Precision refers to the granularity of calculations, with higher precision typically correlating with increased model accuracy at the expense of computational intensity. The most common high-precision AI models are 32-bit and 16-bit floating point, whereas faster, low-precision, low-power models typically use 8-bit and 4-bit integer precision. The current industry standard for measuring AI inference in TOPS is at INT8 precision.

To calculate TOPS, start with OPS, which equals two times the number of MAC units multiplied by their operating frequency. TOPS is the number of OPS divided by one trillion, making it simpler to list and compare, that is,

 

" TOPS = 2 × MAC unit count × Frequency / 1 trillion. " 
 
So our CPU's TOPS number = number of cores x 2 x MAC units per core x X.XX GHz / 1,000,000,000,000
For example, according to the Skylake microarchitecture, each core can crunch a maximum of 128 MACs (INT8*) per cycle using its 2 FMA ports.
 
 
00OKCVYTdJzghfuKSeCX7aK-15.fit_lim.size_600x600.v_1569469943.jpg

The Core i9 7980XE running at 4.4 GHz therefore provides

18 (cores) x 2 x 128 MAC ops/cycle x 4.4 GHz, divided by 1,000,000,000,000

= 18 x 2** x 2 x 128 x 4.4 GHz / 1,000,000,000,000 = 40.55 TOPS 😁 **

(Estimated from the maximum retire throughput of the Skylake X / Cascade Lake X microarchitecture; GPU not included. Datatype: INT8 ***)

(** Hyperthreading ON **)
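
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula as posted: the function name `peak_tops` and its parameters are made up for this example, and the `threads_per_core` factor reproduces the post's hyperthreading assumption rather than a confirmed hardware property.

```python
def peak_tops(cores: int, macs_per_core_per_cycle: int, frequency_hz: float,
              threads_per_core: int = 1) -> float:
    """Theoretical peak TOPS; each MAC counts as 2 ops (multiply + add)."""
    ops_per_second = (cores * threads_per_core * 2
                      * macs_per_core_per_cycle * frequency_hz)
    return ops_per_second / 1e12


# Figures taken from the post: 18 cores, 128 INT8 MACs per cycle, 4.4 GHz.
# The clock is passed in Hz, so dividing by 1e12 yields TOPS directly.
print(f"{peak_tops(18, 128, 4.4e9):.2f} TOPS")                      # 20.28 TOPS
# Assumption from the post: hyperthreading doubles the figure.
print(f"{peak_tops(18, 128, 4.4e9, threads_per_core=2):.2f} TOPS")  # 40.55 TOPS
```

Keeping the hyperthreading factor as an explicit parameter makes it easy to see how much of the final number depends on that one assumption.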

So we don't need an NPU for Windows Copilot or other AI-integrated apps at all.

COPILOTW.jpg

40.55 TOPS of CPU performance is OK for a 7-year-old CPU.

TOPSLUNA.jpg

😁 Our old Skylake X paired with a monster NVIDIA GPU easily leaves Lunar Lake in the dust. 😁

NVRTX4090TOP.jpg

Total combined system AI throughput is higher than 1321.2 + 40.55 = 1,361.75 TOPS 😁

This is the easiest way to bring our X299 platform into the AI age: combining your CPU with a monster GPU is more than enough.

1000028254.jpg

Now your rig can be a "Premium AI PC" with an RTX-series GPU.

1000028255.jpg

1000028260.jpg

  

1000028258.jpg

Note: the estimated performance figures refer to the maximum throughput of Intel's Skylake microarchitecture (2015).

NVIDIA RTX Reference TOPS link here 

https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-v2...

The current industry standard for measuring AI inference in TOPS is at INT8 precision ***.

 

 

W11 24H2 26257.5000 Core i9 7980XE 02007108 MCE ME 11.12.96.2535 R6E Modified BIOS 4001 SAMSUNG OG9 FW 1019.0 SSD 970 EVO PLUS 1 TB x 3 NVIDIA RTX 4090 GAME READY 560.70 64GB GSKILL DDR4 3200MHz JBL 9.1 Sound Bar DTS-X
297 Views
6 REPLIES 6

Int8bldr
Level 12

A couple of things:

I think a trillion = 10^12 and a gigahertz is 10^9, so you need to divide your number by 1000.

However, the 2 FMAs per core that we have in the Skylake and Cascade Lake Core i9 7980XE and 10980XE are extremely capable.

FLOAT 32:

For example, take the VFMADDPS AVX-512 instruction: at float32 precision, each of the FMAs per core can produce 16 multiplications plus 16 additions per 1/2 clock cycle (throughput) =>
18 cores * 2 (FMAs) * (16 (multiplications) + 16 (additions)) * 2 (since 1/2 clock cycle in throughput) * FMA clock frequency / 10^12

So using your clock frequency of 4.4 GHz we get

18 cores * 2 (FMAs) * (16 (multiplications) + 16 (additions)) * 2 (since 1/2 clock cycle in throughput) * 4.4 GHz / 10^12 = 10.1 "TOPS" at float32 precision

Not 1000 TOPS, but it is in full float32 precision, so still pretty good in my view.

INT8:

If we use the byte-sized instructions in the VNNI family (AVX512_VNNI flag), e.g. VPDPBUSD, which multiplies and adds unsigned and signed bytes, we can 4x the calculation:

The VPDPBUSD instruction at full AVX-512 width multiplies and adds 16 * 4 pairs (= 64) of bytes per 1/2 clock cycle (throughput) =>
18 (cores) * 2 (FMAs) * (16*4 + 16*4) * 2 (since 1/2 clock cycle in throughput) * 4.4 GHz / 10^12 = 40.6 "TOPS"
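
To make the "16 * 4 pairs" count concrete, here is a rough pure-Python emulation of what a single 512-bit VPDPBUSD does, as I understand the instruction: in each of the 16 int32 lanes, four unsigned bytes are multiplied with four signed bytes and the products are added to the accumulator. The function name and the list-based layout are invented for illustration; real code would use the intrinsic or a library kernel.

```python
def vpdpbusd_512(acc, a_u8, b_s8):
    """Emulate one 512-bit VPDPBUSD: 16 int32 lanes, 4 byte pairs per lane."""
    if len(acc) != 16 or len(a_u8) != 64 or len(b_s8) != 64:
        raise ValueError("expected 16 accumulators and 64 bytes per source")
    result = []
    for lane in range(16):
        total = acc[lane]
        for j in range(4):
            k = 4 * lane + j
            total += a_u8[k] * b_s8[k]  # unsigned byte times signed byte
        # Wrap to signed 32-bit, modelling the non-saturating variant.
        total = (total + 2**31) % 2**32 - 2**31
        result.append(total)
    return result


# 64 multiplies and 64 additions happen in one call, i.e. 128 "ops".
out = vpdpbusd_512([10] * 16, [1] * 64, [-2] * 64)
print(out[0], len(out))  # 2 16
```

Each call performs 64 MACs, which is where the (16*4 + 16*4) term in the calculation comes from.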

BUT raw TOPS is not everything. AVX-512 is much more than that. The AVX-512 instruction set is extremely versatile and rich. It's like a new instruction language altogether. It gives you the power to express everything in vectorized form and to do almost all other intensive operations at full AVX-512 width, speeding up all other parts of the code as well.

A few other problems:

Intel dropped AVX-512 for Alder Lake – big mistake in my view. Should have stuck to the plan!

The FMAs cannot operate at full speed and have to be downclocked with the AVX-512 offset in the BIOS. I have my AVX-512 offset set so the FMAs operate at 4 GHz. Still pretty good.

I love my 10980XE it's an unbelievably good CPU!

So extremely capable - if you know how to unlock all the vectorized silicon power! 

However (again), the operations are memory constrained, NOT FMA constrained, i.e. the CPU's AVX-512 engine can eat much faster than the memory channels can feed it. So going from 10 to 1000 TOPS is not going to do anything if the memory cannot keep up - and it cannot. So AMX matrix multiplication is likely the way to go, with much larger matrix registers producing full matrix multiplications in silicon without additional memory access (just one load of the two source matrices and one store of the final result).
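
A back-of-the-envelope roofline estimate illustrates the memory argument. The sketch below is hedged throughout: the helper names are invented, and the memory configuration (quad-channel DDR4-3200, 8 bytes per transfer per channel) is an assumption about a typical X299 setup, not a measured figure.

```python
def dram_bandwidth(channels: int, mega_transfers_per_s: float,
                   bytes_per_transfer: int = 8) -> float:
    """Theoretical DRAM bandwidth in bytes per second."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer


def required_ops_per_byte(peak_tops: float, bandwidth: float) -> float:
    """Ops needed per byte fetched to keep the arithmetic units busy."""
    return peak_tops * 1e12 / bandwidth


def attainable_tops(peak_tops: float, bandwidth: float, ops_per_byte: float) -> float:
    """Roofline model: the lower of the compute and memory ceilings."""
    return min(peak_tops, bandwidth * ops_per_byte / 1e12)


bw = dram_bandwidth(channels=4, mega_transfers_per_s=3200)  # assumed memory setup
print(f"bandwidth: {bw / 1e9:.1f} GB/s")
print(f"ops per byte needed for 40.6 TOPS: {required_ops_per_byte(40.6, bw):.0f}")
# An INT8 matrix-vector product performs about 2 ops per weight byte streamed.
print(f"attainable at 2 ops/byte: {attainable_tops(40.6, bw, 2):.2f} TOPS")
```

Under these assumptions, a kernel has to reuse every byte it loads a few hundred times from registers or cache before the FMAs, rather than main memory, become the limit.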

Int8bldr
Level 12

I share your enthusiasm for our CPUs. However, there are (IMHO) a few double countings in your calculations.

To see which TOPS operations can be performed in parallel inside each core, you need to look at the Skylake microarchitecture.

Skylake_architecture_diagram.png

Each core has 8 ports that can execute in parallel, but only ports 0, 1, and 5 can execute TOPS-like operations such as vector FMA add and multiply at 512-bit vector width.

Further, when executing 512-bit instructions (AVX-512 or any 512-bit VNNI instruction), ports 0 and 1 are fused into one while port 5 executes the full 512 bits. In other words, Skylake can retire two 512-bit FMAs per clock cycle. Hence my calculations in my other post here.
Using hyperthreading will not double the TOPS, since it is just a way for the CPU to use more ports in parallel (if possible); all arithmetic ports are already busy, so hyperthreading does nothing for the TOPS number (and potentially slows it down). Also, we cannot say that we can execute float ops and int ops in parallel, since they too use the same FMAs.

In other words, the TOPS number is limited by the throughput of ports 0+1 and 5, and it is always two 512-bit FMA ops per clock cycle regardless of the mix of arithmetic instructions (floats and ints). That is the theoretical maximum, limited by the actual physical silicon in a core. There is simply no other silicon available to execute more arithmetic instructions in a core (save for very basic address adding, which is a simple scalar address add operation on port 6 - 1 add per cycle = 0.001 TOPS = rounding error).

Hope I made myself understood.

Still, I love my 10980XE, and combined with a 3090 it is extremely capable, if you know how to unlock all the core performance. I do think AI will evolve to a point where our platform will be more than capable of running future AI models. Today, the AI models are still very crude brute force and, honestly, not very intelligent. Train with everything and you get a Gaussian average result (laws of large-sample statistics), and a Gaussian average answer is, well, not very precise or insightful. My point is that more brute force (i.e. more TOPS) is not the answer. Perhaps the 7980XE and 10980XE will have enough TOPS when the models are more intelligently refined in a few years.

We'll see...

Skylake X microarchitecture diagram: all L1, L2, and L3 caches run on a 512-bit-wide bus

 

Hi @Int8bldr

Your diagram above is of the Skylake client version, which is not correct for our Skylake X/W/SP / Cascade Lake X/W/SP/AP.

The cache buses at all levels are 64 bytes (512 bits) wide on the Skylake server side.

Thus the maximum output will be double on our CPUs compared with client SKUs.

So my estimate above is also doubled, based on the Skylake server diagram above.


Thanks! That is a better diagram.

You're right that the bus is 64B internally everywhere for high-end SKUs.

But it does not change the fact that there are only 2 FMA's per core (on port 5 and fused port 0&1) for all higher end HEDT and Xeon SKUs with more than 10 cores. (for CPUs with less than 10 cores, port 5 does not have an AVX-512 FMA)

And that is the essential point in what I tried (unsuccessfully) to get across. TOPS is multiply and add operations, and those are done at 512-bit width in these 2 FMAs, limiting the theoretical throughput to 2 FMA instructions per clock cycle. Empirically, we also see that in real-life testing, here for instance: uops.info/table.html. Select only Cascade Lake and AVX-512 (tested on a 10980XE), then type VPDPBUSD in the search field, click VPDPBUSD zmm, zmm, zmm for maximum throughput and scroll down to Cascade Lake: throughput = 0.5 (stated inverted) => 2 retired VPDPBUSD FMA instructions per clock (each FMA performing 16*4 mult + 16*4 adds), consistent with the theoretical maximum with two FMAs (on port 0&1 and 5). Try also VFMADD213PS => same result (0.5) (in float32, so each FMA performs 16 mult + 16 adds). The test was done on a 10980XE.
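
As a rough cross-check, the reciprocal throughput reported by uops.info can be turned into ops per clock with a couple of lines. The helper names below are illustrative only, and the 4.4 GHz clock is simply the figure used earlier in this thread, not the clock of the uops.info test machine. Note that the 0.5 figure already reflects both FMA ports, so this sketch does not multiply by the number of FMAs again.

```python
def ops_per_cycle_per_core(reciprocal_throughput: float,
                           macs_per_instruction: int) -> float:
    """Convert cycles-per-instruction into ops per cycle (1 MAC = 2 ops)."""
    instructions_per_cycle = 1.0 / reciprocal_throughput
    return instructions_per_cycle * macs_per_instruction * 2


def chip_tops(cores: int, frequency_hz: float, reciprocal_throughput: float,
              macs_per_instruction: int) -> float:
    """Scale the per-core figure to the whole chip and express it in TOPS."""
    per_core = ops_per_cycle_per_core(reciprocal_throughput, macs_per_instruction)
    return cores * frequency_hz * per_core / 1e12


print(ops_per_cycle_per_core(0.5, 64))  # VPDPBUSD zmm: 256.0 INT8 ops per cycle
print(ops_per_cycle_per_core(0.5, 16))  # VFMADD213PS zmm: 64.0 float32 ops per cycle
# Assumed clock: 4.4 GHz on all 18 cores.
print(f"{chip_tops(18, 4.4e9, 0.5, 64):.2f} TOPS at INT8")
```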

But again, the bottleneck is not going to be the FMAs' performance - it is excellent at two 512-bit FMAs retired per clock! In AI, it is going to be the bandwidth from main memory. The CPU's FMAs can "eat" faster than the memory subsystem can "feed" them. That is why AMX is likely the best way forward in future CPUs: with AMX you access memory only when reading the source matrices and when writing the final result matrix after a full matrix multiplication at BF16 or INT8 precision. And matrix multiplication is the dominant cost in AI's inner loop.

The fact still remains: the 7980XE and 10980XE are outstanding CPUs! Way ahead of their time, with tremendous performance potential for those who know how to unlock all the silicon, i.e. the FMAs at full speed.

Thank you for reading!

intel-cascade-lake-vector-throughput.jpg

Peak Skylake X / Cascade Lake X Throughput Slide from Intel

 

intel-cascade-lake-dl-boost-block.jpg

Total INT MAC units, from Intel

Going by the Intel slide above, with two 512-bit FMAs per core, we have 128 MACs per core per cycle.

TOPS estimated for INT8* computation = 2 x 128 x 4.4 GHz = 1.126 TOPS per core without HT*, or 2.253 TOPS per core with HT**

Thus 18 cores total 20.28 TOPS without HT*, or 40.55 TOPS with HT** at maximum

You're correct, @Int8bldr.

The numbers in the slide are based on Intel's reference spec clock speeds: 2.7 GHz base, 4.0 GHz turbo.

52zA8Ac439ts7L65HyMEmY-1200-80.jpg

 

 

 

 

 

 
