This is a short post investigating the behavior of AVX2 and AVX-512 related licence-based downclocking on Intel’s newest Ice Lake chips.

Licence-based downclocking[1] refers to the semi-famous effect where lower-than-nominal frequency limits are imposed when certain SIMD instructions are executed, especially heavy floating point instructions or 512-bit wide instructions.

More details about this type of downclocking are available in this StackOverflow answer, and we’ve already covered the low-level mechanics of these transitions in somewhat exhaustive detail. You can also find some guidelines on how to make use of wide SIMD given this issue[2].

All of those were written in the context of Skylake-SP (SKX), which was the first generation of chips to support AVX-512.

So what about Ice Lake, the newest chips which support both the SKX flavor of AVX-512 and also have a whole host of new AVX-512 instructions? Will we be stuck gazing longingly at these new instructions from afar while never being allowed to actually use them due to downclocking?

Read on to find out, or just skip to the end.

AVX-Turbo

We will use the avx-turbo utility to measure how a CPU’s frequency depends on core count and instruction mix. This tool works in a straightforward way: it runs a given mix of instructions on the given number of cores, while measuring the frequency achieved during the test.

For example, the avx256_fma_t test – which measures the cost of heavy 256-bit instructions with high ILP – runs the following sequence of FMAs:

	vfmadd132pd ymm0,ymm10,ymm11
	vfmadd132pd ymm1,ymm10,ymm11
	vfmadd132pd ymm2,ymm10,ymm11
	vfmadd132pd ymm3,ymm10,ymm11
	vfmadd132pd ymm4,ymm10,ymm11
	vfmadd132pd ymm5,ymm10,ymm11
	vfmadd132pd ymm6,ymm10,ymm11
	vfmadd132pd ymm7,ymm10,ymm11
	vfmadd132pd ymm8,ymm10,ymm11
	vfmadd132pd ymm9,ymm10,ymm11
	; repeat 10x for a total of 100 FMAs
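
The shape of this test can be made explicit with a short sketch. The Python below only regenerates the listing above as text; it is not how avx-turbo builds its tests, and the function name and parameters are hypothetical. Each FMA in one pass writes a different destination register (ymm0 through ymm9), so the sequence forms ten independent dependency chains, which is what “high ILP” refers to here.

```python
def fma_listing(chains=10, repeats=10, width="ymm"):
    """Return the unrolled FMA sequence as a list of assembly lines.

    Hypothetical helper for illustration only, not part of avx-turbo.
    """
    # The two shared source operands sit just above the chain registers,
    # which gives ymm10 and ymm11 when chains == 10, as in the listing.
    src_a = f"{width}{chains}"
    src_b = f"{width}{chains + 1}"
    # One pass: one FMA per chain, each with its own destination register.
    one_pass = [f"vfmadd132pd {width}{r},{src_a},{src_b}" for r in range(chains)]
    # The test body is that pass repeated back to back.
    return one_pass * repeats


listing = fma_listing()
print(len(listing))   # number of FMAs in the unrolled body
print(listing[0])
print(listing[-1])
```

With the default arguments this yields 100 lines, starting with the ymm0 instruction and ending with the ymm9 one.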

In total, we’ll use five tests to test every combination of light and heavy 256-bit and 512-bit instructions, as well as scalar instructions (128-bit SIMD behaves the same as scalar), using this command line:

./avx-turbo --test=scalar_iadd,avx256_iadd,avx512_iadd,avx256_fma_t,avx512_fma_t

Ice Lake Results

I ran avx-turbo as described above on an Ice Lake i5-1035G4, which is the middle-of-the-range Ice Lake client CPU running at up to 3.7 GHz. The full output is hidden away in a gist, but here are the all-important frequency results (all values in GHz):

Instruction Mix   Active Cores
                  1     2     3     4
Scalar/128-bit    3.7   3.6   3.3   3.3
Light 256-bit     3.7   3.6   3.3   3.3
Heavy 256-bit     3.7   3.6   3.3   3.3
Light 512-bit     3.6   3.6   3.3   3.3
Heavy 512-bit     3.6   3.6   3.3   3.3

As expected, maximum frequency decreases with active core count, but scan down each column to see the effect of instruction category. Along this axis, there is almost no downclocking at all! Only with a single active core is there any decrease with wider instructions, and it is a paltry 100 MHz: from 3,700 MHz to 3,600 MHz when any 512-bit instructions are used.

In any other scenario, including any time more than one core is active, or for heavy 256-bit instructions, there is zero licence-based downclocking: everything runs as fast as scalar.
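
To make the column-wise comparison concrete, here is a small sketch that subtracts each row from the scalar row. The numbers are copied from the table above; the dictionary layout and the function name are my own choices for illustration, not anything avx-turbo produces.

```python
# Frequencies in GHz from the results table; list positions are 1..4 active cores.
FREQ_GHZ = {
    "Scalar/128-bit": [3.7, 3.6, 3.3, 3.3],
    "Light 256-bit":  [3.7, 3.6, 3.3, 3.3],
    "Heavy 256-bit":  [3.7, 3.6, 3.3, 3.3],
    "Light 512-bit":  [3.6, 3.6, 3.3, 3.3],
    "Heavy 512-bit":  [3.6, 3.6, 3.3, 3.3],
}


def downclock_mhz(table, baseline="Scalar/128-bit"):
    """Frequency drop of every mix relative to the baseline row, in MHz."""
    base = table[baseline]
    return {
        mix: [round((b - f) * 1000) for b, f in zip(base, freqs)]
        for mix, freqs in table.items()
    }


drops = downclock_mhz(FREQ_GHZ)
for mix, row in drops.items():
    print(f"{mix:15s} {row}")
```

The only nonzero entries are in the two 512-bit rows, in the one-core column, and both are 100 MHz.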

Licence Mapping

There is another change here too. In SKX, there are three licences, or categories of instructions with respect to downclocking: L0, L1 and L2. Here, in client ICL, there are only two[3] and those don’t line up exactly with the three in SKX.

To be clearer, in SKX the licences mapped to instruction width and weight as follows:

Width        Light   Heavy
Scalar/128   L0      L0
256          L0      L1
512          L1      L2

In particular, note that 256-bit heavy instructions have the same licence as 512-bit light.

In ICL client, the mapping is:

Width        Light   Heavy
Scalar/128   L0      L0
256          L0      L0
512          L1      L1

Now, 256 heavy and 512 light are in different categories! In fact, the whole concept of light vs heavy doesn’t seem to apply here: the categorization is purely based on the width[4].
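
The two mappings can also be written down as lookup tables, which makes the “width only” observation easy to check mechanically. This is just a sketch encoding the tables above; the key format and the helper name are assumptions for illustration.

```python
# Licence for each (width, weight) pair, copied from the two tables above.
SKX = {
    ("scalar/128", "light"): "L0", ("scalar/128", "heavy"): "L0",
    ("256", "light"): "L0",        ("256", "heavy"): "L1",
    ("512", "light"): "L1",        ("512", "heavy"): "L2",
}
ICL_CLIENT = {
    ("scalar/128", "light"): "L0", ("scalar/128", "heavy"): "L0",
    ("256", "light"): "L0",        ("256", "heavy"): "L0",
    ("512", "light"): "L1",        ("512", "heavy"): "L1",
}


def weight_matters(mapping):
    """True if any width gets a different licence for light vs heavy."""
    widths = {width for width, _ in mapping}
    return any(mapping[(w, "light")] != mapping[(w, "heavy")] for w in widths)


print(weight_matters(SKX))          # True: heavy bumps the licence at 256 and 512
print(weight_matters(ICL_CLIENT))   # False: the licence depends on width alone
# On SKX, 256-bit heavy and 512-bit light share a licence; on ICL client they don't.
print(SKX[("256", "heavy")] == SKX[("512", "light")])                 # True
print(ICL_CLIENT[("256", "heavy")] == ICL_CLIENT[("512", "light")])   # False
```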

So What?

Well, so what?

At least, it means we need to adjust our mental model of the frequency-related cost of AVX-512 instructions. Rather than “generally causing significant downclocking”, on this Ice Lake chip we can say that AVX-512 causes insignificant or zero licence-based downclocking and I expect this to be true on other Ice Lake client chips as well.

Now, this adjustment of expectations comes with an important caveat: licence-based downclocking is only one source of downclocking. It is also possible to hit power, thermal or current limits. Some configurations may only be able to run wide SIMD instructions on all cores for a short period of time before exceeding running power limits. In my case, the $250 laptop I’m testing this on has extremely poor cooling and rather than power limits I hit thermal limits (100°C limit) within a few seconds running anything heavy on all cores.

However, these other limits are qualitatively different from licence-based limits. They apply mostly[5] in a pay-for-what-you-use way: if you use a wide or heavy instruction or two, you incur only a microscopic amount of additional power or heat cost associated with only those instructions. This is unlike some licence-based transitions where a core or chip-wide transition occurs that affects unrelated subsequent execution for a significant period of time.

Since wider operations are generally cheaper in power than an equivalent number of narrower operations[6], you can determine up front that a wide operation is worth it, at least for cases that scale well with width. In any case, the problem is mostly local: it does not depend on the behavior of the surrounding code.

Summary

Here’s what we’ve learned.

  • The Ice Lake i5-1035G4 CPU exhibits only 100 MHz of licence-based downclock with 1 active core when running 512-bit instructions.
  • There is no downclock in any other scenario.
  • The all-core 512-bit turbo frequency of 3.3 GHz is 89% of the maximum single-core scalar frequency of 3.7 GHz, so within power and thermal limits this chip has a very “flat” frequency profile.
  • Unlike SKX, this Ice Lake chip does not distinguish between “light” and “heavy” instructions for frequency scaling purposes: FMA operations behave the same as lighter operations.
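
The 89% figure above is just the ratio of the two extreme entries in the results table. As a quick check (variable names are mine, values are from the table):

```python
single_core_scalar_ghz = 3.7   # 1 active core, scalar instructions
all_core_avx512_ghz = 3.3      # 4 active cores, 512-bit instructions

ratio = all_core_avx512_ghz / single_core_scalar_ghz
print(f"{ratio:.1%}")          # 89.2%
```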

So on ICL client, you don’t have to fear the downclock. Only time will tell if this applies also to ICL server.

Thanks

Stopwatch photo by Kevin Maillefer on Unsplash.

Discussion and Feedback

You can discuss this post on Hacker News.

If you have a question or any type of feedback, you can leave a comment below. I’m also interested in results on other ICL chips, like the i3 and i7 variants: let me know if you have one of those and we can collect results.





  1. It gets tiring to constantly repeat “licence-based downclock”, so I’ll often simply use “downclock” instead, but this should still be understood to refer to the licence-based variety rather than other types of frequency throttling.

  2. Note that Daniel has written much more than just that one. 

  3. Only two visible: it is possible that the three (or more) categories still exist, but they cause voltage transitions only, not any frequency transitions. 

  4. One might imagine this is a consequence of ICL client having only one FMA unit on all SKUs: very heavy FP 512-bit operations aren’t possible. However, this doesn’t align with 256-bit heavy still being fast: you can still do 2x256-bit FMAs per cycle and this is the same FP intensity as 1x512-bit FMA per cycle. It’s more like, on this chip, FP operations don’t need more licence-based protection than other operations of the same width, and the main cost is 512-bit width.

  5. I have to weasel-word with “mostly” here because even if there is no frequency transition, there may be a voltage transition which both incurs a halted period where nothing executes, and increases power for subsequent execution that may not require the elevated voltage. Also, there is the not-yet-discussed concept of implicit widening, which may extend later narrow operations to maximum width if the upper parts of the registers are not zeroed with vzeroupper or vzeroall.

  6. For example, one 512-bit integer addition would generally be cheaper in energy use than the two 256-bit operations required to calculate the same result, because of execution overheads that don’t scale linearly with width (that’s almost everything outside of execution itself).