This is a short post investigating the behavior of AVX2 and AVX-512 related license-based downclocking on Intel’s newest Ice Lake and Rocket Lake chips.
license-based downclocking1 refers to the semi-famous effect where lower than nominal frequency limits are imposed when certain SIMD instructions are executed, especially heavy floating point instructions or 512-bit wide instructions.
More details about this type of downclocking are available at this StackOverflow answer and we’ve already covered in somewhat exhaustive detail the low level mechanics of these transitions. You can also find some guidelines on to how make use of wide SIMD given this issue2.
All of those were written in the context of Skylake-SP (SKX) which were the first generation of chips to support AVX-512.
So what about Ice Lake, the newest chips which support both the SKX flavor of AVX-512 and also have a whole host of new AVX-512 instructions? Will we be stuck gazing longly at these new instructions from afar while never being allowed to actually use them due to downclocking?
Ice Lake Frequency Behavior
We will use the avx-turbo utility to measure the core count and instruction mix dependent frequencies for a CPU. This tools works in a straightforward way: run a given mix of instructions on the given number of cores, while measuring the frequency achieved during the test.
For example, the
avx256_fma_t test – which measures the cost of heavy 256-bit instructions with high ILP – runs the following sequence of FMAs:
vfmadd132pd ymm0,ymm10,ymm11 vfmadd132pd ymm1,ymm10,ymm11 vfmadd132pd ymm2,ymm10,ymm11 vfmadd132pd ymm3,ymm10,ymm11 vfmadd132pd ymm4,ymm10,ymm11 vfmadd132pd ymm5,ymm10,ymm11 vfmadd132pd ymm6,ymm10,ymm11 vfmadd132pd ymm7,ymm10,ymm11 vfmadd132pd ymm8,ymm10,ymm11 vfmadd132pd ymm9,ymm10,ymm11 ; repeat 10x for a total of 100 FMAs
In total, we’ll use five tests to test every combination of light and heavy 256-bit and 512-bit instructions, as well as scalar instructions (128-bit SIMD behaves the same as scalar), using this command line:
Ice Lake Results
I ran avx-turbo as described above on an Ice Lake i5-1035G4, which is the middle-of-the-range Ice Lake client CPU running at up to 3.7 GHz. The full output is hidden away in a gist, but here are the all-important frequency results (all values in GHz):
|Instruction Mix||Active Cores|
As expected, maximum frequency decreases with active core count, but scan down each column to see the effect of instruction category. Along this axis, there is almost no downclocking at all! Only for a single active core count is there any decrease with wider instructions, and it is a paltry only 100 MHz: from 3,700 MHz to 3,600 MHz when any 512-bit instructions are used.
In any other scenario, including any time more than one core is active, or for heavy 256-bit instructions, there is zero license-based downclocking: everything runs as fast as scalar.
There another change here too. In SKX, there are three licenses, or categories of instructions with respect to downclocking: L0, L1 and L2. Here, in client ICL, there are only two3 and those don’t line up exactly with the three in SKX.
To be clearer, in SKX the licenses mapped to instruction width and weight as follows:
In particular, note that 256-bit heavy instructions have the same license as 512-bit light.
In ICL client, the mapping is:
Now, 256 heavy and 512 light are in different categories! In fact, the whole concept of light vs heavy doesn’t seem to apply here: the categorization is purely based on the width4.
Edison Chan has graciously provided the output of running avx-turbo on his Rocket Lake i9-11900K, the top of the line Rocket Lake chip. The full results are available, but I’ve summarized the achieved frequencies in the following table.
Rocket Lake i9-11900K Frequency Matrix
|Active Cores||Instruction Mix|
|Scalar and 128||Light 256||Heavy 256||Light 512||Heavy 512|
The results paint a very promising picture of Rocket Lake’s AVX-512 frequency behavior: there is no license-based downclocking evident at any combination of core count and frequency6. Even heavy AVX-512 instructions can execute at the same frequency as lightweight scalar code.
In fact, the frequency behavior of this chip appears very simple: the full Turbo Boost 2.0 frequency7 of 5.1 GHz is available for any instruction mix up and up to 4 active cores, then the speed drops to 4.9 for 5 and 6 active cores, and finally to 4.8 GHz for 7 or 8 active cores. This means that at 8 active cores and AVX-512, you are still achieving 94% of the frequency observed for 1 active core running light instructions.
Well, so what?
At least, it means we need to adjust our mental model of the frequency related cost of AVX-512 instructions. Rather than the prior-generation verdict of “AVX-512 generally causes significant downclocking”, on these Ice Lake and Rocket Lake client chips we can say that AVX-512 causes insignificant (usually, none at all) license-based downclocking and I expect this to be true on other ICL and RKL client chips as well.
Now, this adjustment of expectations comes with an important caveat: license-based downclocking is only one source of downclocking. It is also possible to hit power, thermal or current limits. Some configurations may only be able to run wide SIMD instructions on all cores for a short period of time before exceeding running power limits. In my case, the $250 laptop I’m testing this on has extremely poor cooling and rather than power limits I hit thermal limits (100°C limit) within a few seconds running anything heavy on all cores.
However, these other limits are qualitatively different than license based limits. They apply mostly8 in a pay for what you use way: if you use a wide or heavy instruction or two you incur only a microscopic amount of additional power or heat cost associated with only those instructions. This is unlike some license-based transitions where a core or chip-wide transition occurs that affects unrelated subsequent execution for a significant period of time.
Since wider operations are generally cheaper in power than an equivalent number of narrower operations9, you can determine up-front that a wide operation is worth it – at least for cases that scale well with width. In any case, the problem is most local: not depending on the behavior of the surrounding code.
Here’s what we’ve learned.
- The Ice Lake i5-1035 CPU exhibits only 100 MHz of license-based downclock with 1 active core when running 512-bit instructions, and no license downclock in any other scenario.
- The Rocket Lake i9-11900K CPU doesn’t exhibit any license-based downclock in the tested scenarios.
- The Ice Lake CPU has an all-core 512-bit turbo frequency of 3.3 GHz is 89% of the maximum single-core scalar frequency of 3.7 GHz, so within power and thermal limits this chip has a very “flat” frequency profile. The Rocket Lake 11900K is even flatter with an all-eight-cores frequency of 4.8 GHz clocking in at 94% of the 5.1 GHz single-core speed.
- Unlike SKX, this Ice Lake chip does not distinguish between “light” and “heavy” instructions for frequency scaling purposes: FMA operations behave the same as lighter operations.
So on ICL and RKL client, you don’t have to fear the downclock. Only time will tell if this applies also to the Ice Lake Xeon server chips.
Thanks to Edison Chan for the Rocket Lake i9-11900K results.
Discussion and Feedback
This post was discussed on Hacker News.
If you have a question or any type of feedback, you can leave a comment below. I’m also interested in results on other new Intel or AMD chips, like the i3 and i7 variants: let me know if you have one of those and we can collect results.
If you liked this post, check out the homepage for others you might enjoy.
It gets tiring to constantly repeat license-based downclock so I’ll often use simply “downclock” instead, but this should still be understood to refer to the license-based variety rather than other types of frequency throttling. ↩
Only two visible: it is possible that the three (or more) categories still exist, but they cause voltage transitions only, not any frequency transitions. ↩
One might imagine this is a consequence of ICL client having only one FMA unit on all SKUs: very heavy FP 512-bit operations aren’t possible. However, this doesn’t align with 256-bit heavy still being fast: you can still do 2x256-bit FMAs per cycle and this is the same FP intensity as 1x512-bit FMA per cycle. It’s more like, on this chip, FP operation don’t need more license based protection from other operations of the same width, and the main cost is 512-bit width. ↩
Those of a more critical bent might prefer long suffering or very long in the tooth as adjectives for this process. ↩
Some tests did show lower speeds, although these outlier results didn’t correlate well with heavy or light instructions, and the difference was generally 100 MHz or less. These likely represent other sources of reduced frequency, such as thermal throttling or switching to a higher active core count when a process not related to the test process has active threads. In any case, for each core count, we can find a test in each of the instruction categories that runs at full speed, allowing me to fill out the matrix even in the presence of these outliers. ↩
I mention Turbo Boost 2.0 specifically because this chip also has a higher Turbo Boost 3.0 maximum frequency of 5.2 GHz, and beyond that a high Thermal Velocity Boost frequency of 5.3 GHz. These higher frequencies apply only to specific chosen cores within the CPU selected at manufacturing based on their ability to reach these higher frequencies. We don’t see any of these higher speeds during the test, possibly because the cores the test pins itself to are not the chosen cores on this CPU. So the frequency behavior of this chip can be characterized as “very simple” only if you ignore these additional turbo levels and other complicating factors. ↩
I have to weasel-word with mostly here because even if there is no frequency transition, there may be a voltage transition which both incurs a halted period where nothing executes, and increases power for subsequent execution that may not require the elevated voltage. Also, there is the not-yet-discussed concept of implicit widening which may extend later narrow operations to maximum width if the upper parts of the registers are not zeroed with
For example, one 512-bit integer addition would generally be cheaper in energy use than the two 256-bit operations required to calculate the same result, because of execution overheads that don’t scale linearly with width (that’s almost everything outside of execution itself). ↩