04-30-2024 04:18 AM - edited 05-27-2024 12:31 AM
" TOPS quantifies an NPU's processing capabilities by measuring the number of operations (additions, multiplies, etc.) in trillions executed within a second.
This standardized measurement strongly indicates an NPU's performance, serving as a crucial yardstick for comparing AI performance across different processors and architectures. Because TOPS serves as a cornerstone performance metric for NPUs, exploring the parameters that make up the TOPS equation and how they can dictate performance is essential. Doing so can offer a deeper understanding of an NPU's capabilities.
A multiply-accumulate (MAC) operation executes the mathematical formulas at the core of AI workloads. A matrix multiply consists of a series of two fundamental operations: multiplication and addition to an accumulator. A MAC unit can, for example, run one of each per clock cycle, meaning it executes two operations per clock cycle. A given NPU has a set number of MAC units that can operate at varying levels of precision, depending on the NPU's architecture.
Frequency dictates the clock speed (or cycles per second) at which an NPU and its MAC units (as well as a CPU or GPU) operate, directly influencing overall performance. A higher frequency allows for more operations per unit of time, resulting in faster processing speeds. However, increasing frequency also leads to higher power consumption and heat generation, which impacts battery life and user experience. The TOPS number quoted for processors is generally at the peak operating frequency.
Precision refers to the granularity of calculations, with higher precision typically correlating with increased model accuracy at the expense of computational intensity. The most common high-precision AI models are 32-bit and 16-bit floating point, whereas faster, low-precision, low-power models typically use 8-bit and 4-bit integer precision. The current industry standard for measuring AI inference in TOPS is at INT8 precision.
To calculate TOPS, start with OPS, which equals two times the number of MAC units multiplied by their operating frequency. TOPS is the number of OPS divided by one trillion, making it simpler to list and compare, that is,
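As a quick sketch of that arithmetic (the NPU figures in the example below are purely illustrative, not from any datasheet):

```python
def peak_tops(mac_units: int, freq_hz: float, ops_per_mac: int = 2) -> float:
    """Peak TOPS: each MAC unit does one multiply + one add per cycle (2 ops)."""
    return mac_units * ops_per_mac * freq_hz / 1e12

# Hypothetical NPU: 4096 INT8 MAC units clocked at 1.5 GHz
print(peak_tops(4096, 1.5e9))  # 12.288 TOPS
```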
A Core i9-7980XE running at 4.4 GHz also provides:

18 cores × 2 (HT**) × 2 ops × 128 MAC/cycle × 4.4 GHz ÷ 1,000,000,000,000 = 40.55 TOPS 😁

(Estimated from the maximum retire throughput of the Skylake X / Cascade Lake X microarchitecture at the INT8 datatype; GPU not included ***)
(** Hyper-Threading ON **)
So we don't need any NPU at all for Windows Copilot and other AI-integrated apps.
40.55 TOPS of CPU performance is not bad for a 7-year-old CPU.
😁 It easily leaves Lunar Lake in the dust when we pit our old Skylake X against it with a monster NVIDIA GPU. 😁
Total system AI throughput is higher than 1,321.2 + 40.55 = 1,361.75 TOPS 😁
The easiest way to enter the AI age from our X299 platform: combining your CPU with a monster GPU is more than enough.
Now your rig can be a "Premium AI PC" with an RTX-series GPU.
Note: estimated performance numbers refer to the maximum throughput of the 2015 Intel Skylake microarchitecture.
NVIDIA RTX Reference TOPS link here
*** The current industry standard for measuring AI inference in TOPS is INT8 precision.
04-30-2024 08:57 AM
A couple of things:
I think a trillion = 10^12 and gigahertz is 10^9, so you need to divide your number by 1,000.
However, the 2 FMAs per core that we have in the Skylake and Cascade Lake Core i9-7980XE and 10980XE are extremely capable.
FLOAT 32:
For example, the VFMADDPS AVX-512 instruction: at Float32 precision, each of the FMAs per core can produce 16 multiplications plus 16 additions per 1/2 clock cycle (throughput) =>
18 cores * 2 (FMAs) * (16 multiplications + 16 additions) * 2 (since 1/2 clock cycle in throughput) * FMA clock frequency / 10^12
So using your clock frequency of 4.4 GHz we get:
18 * 2 * (16 + 16) * 2 * 4.4 GHz / 10^12 = 10.1 "TOPS" at Float32 precision
Not 1000 TOPS but is in full Float 32 precision so still pretty good in my view.
INT8:
If we use the byte-size instruction in the VNNI family (AVX512_VNNI flag), e.g. VPDPBUSD, which multiplies and adds unsigned and signed bytes, we can 4x the calculation:
VPDPBUSD at full AVX-512 width multiplies and adds 16 * 4 pairs (= 64) of bytes per 1/2 clock cycle (throughput) =>
18 (cores) * 2 (FMAs) * (16*4 + 16*4) * 2 (since 1/2 clock cycle in throughput) * 4.4 GHz / 10^12 = 40.6 "TOPS"
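As a sanity check, the two formulas above can be evaluated directly. This just reproduces the arithmetic as written in this post (including its ×2 factor for the 1/2-cycle throughput, and the assumed 4.4 GHz AVX-512 clock):

```python
CORES = 18
FMAS = 2          # port 5 plus fused port 0&1, each 512-bit wide
THROUGHPUT = 2    # the post's factor for 1/2-clock-cycle throughput
FREQ_HZ = 4.4e9   # assumed all-core AVX-512 frequency

# FP32: VFMADDPS gives 16 multiplies + 16 adds per FMA
fp32 = CORES * FMAS * (16 + 16) * THROUGHPUT * FREQ_HZ / 1e12
# INT8: VPDPBUSD gives 16*4 multiplies + 16*4 adds per FMA
int8 = CORES * FMAS * (16*4 + 16*4) * THROUGHPUT * FREQ_HZ / 1e12

print(f"FP32: {fp32:.1f} TOPS")  # 10.1
print(f"INT8: {int8:.1f} TOPS")  # 40.6
```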
BUT raw TOPS is not everything. AVX-512 is much more than that. The AVX-512 instruction set is extremely versatile and rich; it's like a new instruction language altogether. It gives you the power to express everything in vectorized form and do almost all other intense operations at full AVX-512 width, speeding up all other parts of the code as well.
A few other problems:
Intel dropped AVX-512 for Alder Lake – a big mistake in my view. They should have stuck to the plan!
The FMAs cannot operate at full speed and have to be down-clocked with the AVX-512 offset in BIOS. I have my AVX-512 offset set so the FMAs operate at 4 GHz. Still pretty good.
I love my 10980XE it's an unbelievably good CPU!
So extremely capable - if you know how to unlock all the vectorized silicon power!
04-30-2024 01:05 PM
However (again), the operations are memory-constrained, NOT FMA-constrained; i.e., the CPU's AVX-512 engine can eat much faster than the memory channels can feed it. So going from 10 to 1,000 TOPS is not going to do anything if the memory cannot keep up – and it cannot. So AMX matrix multiplication is likely the way to go, with much larger matrix registers producing full matrix multiplications in silicon without additional memory access (just one load of the two source matrices and one store of the final result).
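A rough roofline-style estimate illustrates the memory wall. The bandwidth figure below is an assumption (theoretical quad-channel DDR4-2666, as on X299), not a measurement:

```python
# A sketch: how much data reuse the claimed peak would require.
peak_int8_tops = 40.6   # this thread's claimed peak INT8 figure
mem_bw_gb_s = 85.0      # assumed quad-channel DDR4-2666 theoretical bandwidth

# Operations the FMAs must perform per byte streamed from RAM
# just to stay busy at peak rate.
ops_per_byte = peak_int8_tops * 1e12 / (mem_bw_gb_s * 1e9)
print(f"{ops_per_byte:.0f} ops needed per byte streamed from RAM")
```

Unless every byte loaded is reused hundreds of times (which cache blocking, and AMX's large tile registers, are designed to achieve), the FMAs sit idle waiting on memory.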
05-09-2024 10:51 AM
I share your enthusiasm for our CPUs. However, there are (IMHO) a few instances of double counting in your calculations.
To see which TOPS operations can be performed in parallel inside each core, you need to look at the Skylake microarchitecture.
Each core has 8 ports that can execute in parallel, but only ports 0, 1, and 5 can execute the TOPS-like operations (vector FMA, add, and multiply) at 512-bit vector width.
Further, when executing 512-bit instructions (AVX-512 or any 512-bit VNNI instruction), ports 0 and 1 are fused into one while port 5 executes a full 512-bit operation. In other words, Skylake can retire two 512-bit FMAs per clock cycle. Hence my calculations in my other post here.
Using hyper-threading will not double the TOPS, since it is just a way for the CPU to use more ports in parallel (if possible); but all arithmetic ports are already busy, so hyper-threading does nothing to the TOPS number (while potentially slowing it down). Also, we cannot say that we can execute FLOAT ops and INT ops in parallel, since they too use the same FMAs.
In other words, the TOPS number is limited by the throughput of ports 0+1 and 5, and it is always two 512-bit FMA ops per clock cycle regardless of the mix of arithmetic instructions (FLOATs and INTs). That is the theoretical max, limited by the actual physical silicon in a core. There is simply no other silicon available to execute more arithmetic instructions in a core (save for very basic address adding, which is a simple scalar add operation on port 6 – 1 add per cycle = 0.001 TOPS = rounding error).
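The port argument can be reduced to a toy model (a sketch only; the port names follow this post):

```python
# 512-bit FMA issue slots per cycle in one Skylake-X core
FMA_PORTS = {"port0+1 (fused)": 1, "port5": 1}
fma_per_cycle = sum(FMA_PORTS.values())   # = 2

# Hyper-threading shares these same ports between threads;
# it adds no new arithmetic silicon, so the per-core cap is unchanged.
for threads in (1, 2):
    print(f"{threads} thread(s): {fma_per_cycle} x 512-bit FMA/cycle")
```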
Hope I made myself understood.
Still, I love my 10980XE, and combined with a 3090 it is extremely capable, if you know how to unlock all-core performance. I do think AI will evolve to a point where our platform will be more than capable of running future AI models. Today, AI models are still very crude brute force and, honestly, not very intelligent. Train with everything and you get a Gaussian average result (laws of large-sample statistics), and a Gaussian average answer is, well, not very precise or insightful. My point is that more brute force (i.e., more TOPS) is not the answer. Perhaps the 7980XE and 10980XE have enough TOPS once the models are more intelligently refined in a few years.
We'll see...
05-09-2024 11:19 PM - edited 05-09-2024 11:34 PM
Skylake X microarchitecture diagram – all L1/L2/L3 caches running on a 512-bit-wide bus
Hi @Int8bldr
Your diagram above was the Skylake client version, which is not correct for our Skylake X/W/SP / Cascade Lake X/W/SP/AP.
On the Skylake server side, the cache buses at all levels are 64 bytes (512 bits) wide.
Thus the maximum output will be double on our CPUs compared with the client SKUs.
So my estimated numbers above are also doubled, based on the Skylake server diagram.
05-10-2024 09:38 AM
Thanks! that is a better diagram.
You're right that the bus is 64 B internally everywhere for the high-end SKUs.
But it does not change the fact that there are only 2 FMAs per core (on port 5 and fused port 0&1) for all higher-end HEDT and Xeon SKUs with more than 10 cores (for CPUs with fewer than 10 cores, port 5 does not have an AVX-512 FMA),
and that is the essential point I tried (unsuccessfully) to bring across. TOPS is multiply and add operations, and those are done at 512-bit width in these 2 FMAs, limiting the theoretical throughput to 2 ops per clock cycle. Empirically, we also see this in real-life testing, for instance at uops.info/table.html: select only Cascade Lake and AVX-512 (tested on a 10980XE), then type VPDPBUSD in the search field, click VPDPBUSD zmm, zmm, zmm for maximum throughput, and scroll down to Cascade Lake: Throughput = 0.5 (stated inverted) => 2 retired VPDPBUSD FMA instructions per clock (each FMA performing 16*4 multiplies + 16*4 adds), consistent with the theoretical maximum with two FMAs (on port 0&1 and 5). Try also VFMADD213PS => same result (0.5) (in Float32, so we get 16 multiplies + 16 adds per clock). The test was done on a 10980XE.
But again, the bottleneck is not going to be the FMAs' performance – it is excellent at two 512-bit FMAs retired per clock! In AI, it is going to be the memory bandwidth from main memory. The CPU's FMAs can "eat" faster than the memory subsystem can "feed" them. This is why AMX is likely the best way forward in future CPUs: with AMX you access memory only when reading the source matrices and when writing the final result matrix after a full matrix multiplication at BF16 or INT8 precision. And matrix multiplication is the dominant complexity in AI's inner loop.
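A back-of-the-envelope comparison shows why tiling in registers helps. The tile dimensions below are the 16-row × 64-byte shape used by AMX tiles, but the traffic model itself is a simplification:

```python
# One INT8 tile multiply: C(16x16, int32) += A(16x64, int8) @ B(64x16, int8)
bytes_moved = 16*64 + 64*16 + 16*16*4   # load A, load B, store int32 C
ops = 2 * (16 * 16 * 64)                # one multiply + one add per (i, j, k)

print(f"arithmetic intensity: {ops / bytes_moved:.1f} ops/byte")  # 10.7
```

Each byte of memory traffic carries roughly ten operations of useful work, and the intensity grows with tile size – exactly the property that keeps the multipliers fed instead of starved.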
The fact still remains: the 7980XE and 10980XE are outstanding CPUs! Way ahead of their time, with tremendous performance potential for those who know how to unlock all the silicon, i.e., the FMAs at full speed.
Thank you for reading!
05-27-2024 12:20 AM
Peak Skylake X / Cascade Lake X Throughput Slide from Intel
Total MAC units for INT, from Intel.
From the Intel slide above, with 2 × 512-bit FMAs per core, we get 128 INT8 MACs per core per cycle.
Estimated TOPS for INT8* computation = 2 × 128 × 4.4 GHz = 1.126 TOPS/core without HT* or 2.252 TOPS/core with HT**.
Thus 18 cores total 20.26 TOPS without HT* or 40.52 TOPS with HT** maximum.
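For completeness, the per-core and 18-core figures above can be recomputed (the tiny difference from 20.26 comes from rounding the per-core value before multiplying by 18):

```python
MACS_PER_CORE = 128   # 2 x 512-bit FMAs, 64 INT8 MACs each, per cycle
FREQ_HZ = 4.4e9       # assumed all-core clock
CORES = 18

per_core_tops = 2 * MACS_PER_CORE * FREQ_HZ / 1e12   # 2 ops per MAC
print(f"{per_core_tops:.3f} TOPS/core")              # 1.126
print(f"{per_core_tops * CORES:.2f} TOPS total")     # 20.28
```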
You're correct @Int8bldr
The numbers are based on the Intel reference spec clock speeds: 2.7 GHz base, 4.0 GHz Turbo.