Ice Lake AVX-512 Downclocking

Aug 19, 2020 • performance, x86, icelake, avx512

This is a short post investigating the behavior of AVX2 and AVX-512 related license-based downclocking on Intel’s newest Ice Lake and Rocket Lake chips.

License-based downclocking¹ refers to the semi-famous effect where lower-than-nominal frequency limits are imposed when certain SIMD instructions are executed, especially heavy floating point instructions or 512-bit wide instructions.

More details about this type of downclocking are available at this StackOverflow answer, and we’ve already covered the low-level mechanics of these transitions in somewhat exhaustive detail. You can also find some guidelines on how to make use of wide SIMD given this issue².

All of those were written in the context of Skylake-SP (SKX), the first generation of chips to support AVX-512.

So what about Ice Lake, the newest chips which support both the SKX flavor of AVX-512 and also have a whole host of new AVX-512 instructions? Will we be stuck gazing longingly at these new instructions from afar while never being allowed to actually use them due to downclocking?

Read on to find out, or just skip to the end. The original version of this post covered only Ice Lake, which remains the primary focus. On March 28th, 2021, I updated it with a Rocket Lake section.

Ice Lake Frequency Behavior

AVX-Turbo

We will use the avx-turbo utility to measure the core-count- and instruction-mix-dependent frequencies for a CPU. This tool works in a straightforward way: run a given mix of instructions on the given number of cores, while measuring the frequency achieved during the test.

For example, the avx256_fma_t test – which measures the cost of heavy 256-bit instructions with high ILP – runs the following sequence of FMAs:

	vfmadd132pd ymm0,ymm10,ymm11
	vfmadd132pd ymm1,ymm10,ymm11
	vfmadd132pd ymm2,ymm10,ymm11
	vfmadd132pd ymm3,ymm10,ymm11
	vfmadd132pd ymm4,ymm10,ymm11
	vfmadd132pd ymm5,ymm10,ymm11
	vfmadd132pd ymm6,ymm10,ymm11
	vfmadd132pd ymm7,ymm10,ymm11
	vfmadd132pd ymm8,ymm10,ymm11
	vfmadd132pd ymm9,ymm10,ymm11
	; repeat 10x for a total of 100 FMAs

In total, we’ll use five tests to test every combination of light and heavy 256-bit and 512-bit instructions, as well as scalar instructions (128-bit SIMD behaves the same as scalar), using this command line:

avx-turbo --test=scalar_iadd,avx256_iadd,avx512_iadd,avx256_fma_t,avx512_fma_t
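As a rough cross-check on the frequencies a tool like this reports, the throughput of a test with a known per-cycle rate can be converted into an implied clock speed. The sketch below is not how avx-turbo itself is implemented: the function name `implied_ghz`, the ops-per-cycle values and the sample measurements are all assumptions for illustration (for example, a serial dependency chain of 1-cycle adds retires roughly one operation per cycle).

```python
def implied_ghz(mops: float, ops_per_cycle: float) -> float:
    """Convert millions of operations per second into an implied clock in GHz.

    ops_per_cycle is an assumption about the test loop, e.g. 1.0 for a
    serial chain of 1-cycle-latency adds.
    """
    if ops_per_cycle <= 0:
        raise ValueError("ops_per_cycle must be positive")
    cycles_per_second_in_millions = mops / ops_per_cycle  # MHz
    return cycles_per_second_in_millions / 1000.0         # GHz


# Hypothetical measurement: 3700 Mops on a serial chain of 1-cycle adds.
print(round(implied_ghz(3700, 1.0), 2))
# Hypothetical measurement: 7400 Mops on a test sustaining 2 ops per cycle.
print(round(implied_ghz(7400, 2.0), 2))
```

Both hypothetical measurements imply the same clock, which is the point: the operation count is only meaningful once the per-cycle rate of the specific test is known.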

Ice Lake Results

I ran avx-turbo as described above on an Ice Lake i5-1035G4, which is the middle-of-the-range Ice Lake client CPU running at up to 3.7 GHz. The full output is hidden away in a gist, but here are the all-important frequency results (all values in GHz):

Instruction Mix	1 Core	2 Cores	3 Cores	4 Cores
Scalar/128-bit	3.7	3.6	3.3	3.3
Light 256-bit	3.7	3.6	3.3	3.3
Heavy 256-bit	3.7	3.6	3.3	3.3
Light 512-bit	3.6	3.6	3.3	3.3
Heavy 512-bit	3.6	3.6	3.3	3.3

As expected, maximum frequency decreases with active core count, but scan down each column to see the effect of instruction category. Along this axis, there is almost no downclocking at all! Only with a single active core is there any decrease with wider instructions, and it is a paltry 100 MHz: from 3,700 MHz to 3,600 MHz when any 512-bit instructions are used.

In any other scenario, including any time more than one core is active, or for heavy 256-bit instructions, there is zero license-based downclocking: everything runs as fast as scalar.
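To make the "scan down each column" reading concrete, here is a small sketch that computes the drop relative to the scalar row for every cell of the table. The frequencies are transcribed from the table above; the helper name `downclock_mhz` is just an illustrative choice.

```python
# Frequencies in GHz from the table above; columns are 1, 2, 3 and 4 active cores.
ICL_GHZ = {
    "Scalar/128-bit": [3.7, 3.6, 3.3, 3.3],
    "Light 256-bit":  [3.7, 3.6, 3.3, 3.3],
    "Heavy 256-bit":  [3.7, 3.6, 3.3, 3.3],
    "Light 512-bit":  [3.6, 3.6, 3.3, 3.3],
    "Heavy 512-bit":  [3.6, 3.6, 3.3, 3.3],
}


def downclock_mhz(table, baseline="Scalar/128-bit"):
    """Return {mix: [drop in MHz versus the baseline row, per core count]}."""
    base = table[baseline]
    return {
        mix: [round((b - f) * 1000) for b, f in zip(base, freqs)]
        for mix, freqs in table.items()
    }


drops = downclock_mhz(ICL_GHZ)
for mix, per_core in drops.items():
    print(f"{mix:15s} {per_core}")
```

The only non-zero entries are the single-core cells of the two 512-bit rows, each showing a 100 MHz drop.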

License Mapping

There is another change here too. In SKX, there are three licenses, or categories of instructions with respect to downclocking: L0, L1 and L2. Here, in client ICL, there are only two³ and those don’t line up exactly with the three in SKX.

To be clearer, in SKX the licenses mapped to instruction width and weight as follows:

Width	Light	Heavy
Scalar/128	L0	L0
256	L0	L1
512	L1	L2

In particular, note that 256-bit heavy instructions have the same license as 512-bit light.

In ICL client, the mapping is:

Width	Light	Heavy
Scalar/128	L0	L0
256	L0	L0
512	L1	L1

Now, 256 heavy and 512 light are in different categories! In fact, the whole concept of light vs heavy doesn’t seem to apply here: the categorization is purely based on the width⁴.
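The two mappings can be written as lookup tables, which makes the "width only" observation easy to check mechanically. This is only a sketch of the tables above, not anything the hardware exposes; the names `SKX`, `ICL_CLIENT`, `license_for` and `width_only` are illustrative.

```python
WIDTHS = (128, 256, 512)

# License per (width in bits, is_heavy), transcribed from the two tables above.
SKX = {
    (128, False): "L0", (128, True): "L0",
    (256, False): "L0", (256, True): "L1",
    (512, False): "L1", (512, True): "L2",
}
ICL_CLIENT = {
    (128, False): "L0", (128, True): "L0",
    (256, False): "L0", (256, True): "L0",
    (512, False): "L1", (512, True): "L1",
}


def license_for(mapping, width_bits, heavy):
    """Look up the license; scalar code is treated like 128-bit SIMD."""
    return mapping[(max(width_bits, 128), heavy)]


def width_only(mapping):
    """True if light and heavy instructions share a license at every width."""
    return all(mapping[(w, False)] == mapping[(w, True)] for w in WIDTHS)


print(license_for(SKX, 256, heavy=True), license_for(SKX, 512, heavy=False))
print(license_for(ICL_CLIENT, 256, heavy=True), license_for(ICL_CLIENT, 512, heavy=False))
print(width_only(SKX), width_only(ICL_CLIENT))
```

On SKX, heavy 256-bit and light 512-bit land in the same license, while on ICL client they differ and `width_only` holds.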

Rocket Lake

Rocket Lake (shortened as RKL, see wikipedia or wikichip for more) is more-or-less a backport of the 10nm Sunny Cove microarchitecture to Intel’s highly-tuned workhorse⁵ 14nm process.

Edison Chan has graciously provided the output of running avx-turbo on his Rocket Lake i9-11900K, the top of the line Rocket Lake chip. The full results are available, but I’ve summarized the achieved frequencies in the following table.

Rocket Lake i9-11900K Frequency Matrix

Active Cores	Scalar and 128	Light 256	Heavy 256	Light 512	Heavy 512
1 Core	5.1	5.1	5.1	5.1	5.1
2 Cores	5.1	5.1	5.1	5.1	5.1
3 Cores	5.1	5.1	5.1	5.1	5.1
4 Cores	5.1	5.1	5.1	5.1	5.1
5 Cores	4.9	4.9	4.9	4.9	4.9
6 Cores	4.9	4.9	4.9	4.9	4.9
7 Cores	4.8	4.8	4.8	4.8	4.8
8 Cores	4.8	4.8	4.8	4.8	4.8

The results paint a very promising picture of Rocket Lake’s AVX-512 frequency behavior: there is no license-based downclocking evident at any combination of core count and instruction mix⁶. Even heavy AVX-512 instructions can execute at the same frequency as lightweight scalar code.

In fact, the frequency behavior of this chip appears very simple: the full Turbo Boost 2.0 frequency⁷ of 5.1 GHz is available for any instruction mix and up to 4 active cores, then the speed drops to 4.9 GHz for 5 and 6 active cores, and finally to 4.8 GHz for 7 or 8 active cores. This means that at 8 active cores and AVX-512, you are still achieving 94% of the frequency observed for 1 active core running light instructions.
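The step behavior described here is simple enough to capture in a few lines. The following sketch encodes only the observed values from the table for this one i9-11900K, not any documented turbo specification; the function name `rkl_observed_ghz` is illustrative.

```python
def rkl_observed_ghz(active_cores: int) -> float:
    """Observed frequency on the tested i9-11900K, independent of instruction mix."""
    if not 1 <= active_cores <= 8:
        raise ValueError("the tested chip has 8 cores")
    if active_cores <= 4:
        return 5.1
    if active_cores <= 6:
        return 4.9
    return 4.8


# Ratio of the all-core frequency to the single-core frequency.
ratio = rkl_observed_ghz(8) / rkl_observed_ghz(1)
print(f"{ratio:.0%}")
```

The ratio works out to 4.8 / 5.1, or about 94%, matching the figure quoted above.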

So What?

Well, so what?

At the very least, it means we need to adjust our mental model of the frequency-related cost of AVX-512 instructions. Rather than the prior-generation verdict of “AVX-512 generally causes significant downclocking”, on these Ice Lake and Rocket Lake client chips we can say that AVX-512 causes insignificant (usually, none at all) license-based downclocking and I expect this to be true on other ICL and RKL client chips as well.

Now, this adjustment of expectations comes with an important caveat: license-based downclocking is only one source of downclocking. It is also possible to hit power, thermal or current limits. Some configurations may only be able to run wide SIMD instructions on all cores for a short period of time before exceeding running power limits. In my case, the $250 laptop I’m testing this on has extremely poor cooling and rather than power limits I hit thermal limits (100°C limit) within a few seconds running anything heavy on all cores.

However, these other limits are qualitatively different from license-based limits. They apply mostly⁸ in a pay-for-what-you-use way: if you use a wide or heavy instruction or two, you incur only a microscopic amount of additional power or heat cost associated with only those instructions. This is unlike some license-based transitions, where a core- or chip-wide transition occurs that affects unrelated subsequent execution for a significant period of time.

Since wider operations are generally cheaper in power than an equivalent number of narrower operations⁹, you can determine up-front that a wide operation is worth it – at least for cases that scale well with width. In any case, the problem is mostly local: it does not depend on the behavior of the surrounding code.
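The claim that wider operations are cheaper per unit of work can be illustrated with a toy energy model: each instruction pays a fixed overhead (fetch, decode, scheduling, retirement) plus an execution cost proportional to its width. The constants below are arbitrary units chosen purely for illustration, not measurements of any chip, and the names `OVERHEAD_PER_OP`, `EXEC_PER_BIT` and `energy_for_bits` are hypothetical.

```python
# Hypothetical constants in arbitrary energy units; NOT measured values.
OVERHEAD_PER_OP = 10.0   # per-instruction cost that does not scale with width
EXEC_PER_BIT = 0.05      # execution cost that scales linearly with width


def energy_for_bits(total_bits: int, op_width_bits: int) -> float:
    """Toy energy to process total_bits using operations of op_width_bits each."""
    if total_bits % op_width_bits != 0:
        raise ValueError("total_bits must be a multiple of op_width_bits")
    n_ops = total_bits // op_width_bits
    return n_ops * (OVERHEAD_PER_OP + EXEC_PER_BIT * op_width_bits)


one_wide = energy_for_bits(512, 512)    # one 512-bit operation
two_narrow = energy_for_bits(512, 256)  # two 256-bit operations
print(one_wide, two_narrow, two_narrow - one_wide)
```

In this model the execution cost is identical either way, so the difference is exactly one extra per-instruction overhead. How large that overhead is relative to execution on real hardware is not something this sketch can tell you.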

Summary

Here’s what we’ve learned.

The Ice Lake i5-1035G4 CPU exhibits only 100 MHz of license-based downclock with 1 active core when running 512-bit instructions, and no license downclock in any other scenario.
The Rocket Lake i9-11900K CPU doesn’t exhibit any license-based downclock in the tested scenarios.
The Ice Lake CPU has an all-core 512-bit turbo frequency of 3.3 GHz, which is 89% of the maximum single-core scalar frequency of 3.7 GHz, so within power and thermal limits this chip has a very “flat” frequency profile. The Rocket Lake 11900K is even flatter, with an all-eight-cores frequency of 4.8 GHz clocking in at 94% of the 5.1 GHz single-core speed.
Unlike SKX, this Ice Lake chip does not distinguish between “light” and “heavy” instructions for frequency scaling purposes: FMA operations behave the same as lighter operations.

So on ICL and RKL client, you don’t have to fear the downclock. Only time will tell if this applies also to the Ice Lake Xeon server chips.

Thanks

Thanks to Edison Chan for the Rocket Lake i9-11900K results.

Stopwatch photo by Kevin Andre on Unsplash.

Discussion and Feedback

This post was discussed on Hacker News.

If you have a question or any type of feedback, you can leave a comment below. I’m also interested in results on other new Intel or AMD chips, like the i3 and i7 variants: let me know if you have one of those and we can collect results.

If you liked this post, check out the homepage for others you might enjoy.

¹ It gets tiring to constantly repeat license-based downclock so I’ll often use simply “downclock” instead, but this should still be understood to refer to the license-based variety rather than other types of frequency throttling.
² Note that Daniel has written much more than just that one.
³ Only two visible: it is possible that the three (or more) categories still exist, but they cause voltage transitions only, not any frequency transitions.
⁴ One might imagine this is a consequence of ICL client having only one FMA unit on all SKUs: very heavy FP 512-bit operations aren’t possible. However, this doesn’t align with 256-bit heavy still being fast: you can still do 2x256-bit FMAs per cycle and this is the same FP intensity as 1x512-bit FMA per cycle. It’s more like, on this chip, FP operations don’t need more license-based protection than other operations of the same width, and the main cost is 512-bit width.
⁵ Those of a more critical bent might prefer long suffering or very long in the tooth as adjectives for this process.
⁶ Some tests did show lower speeds, although these outlier results didn’t correlate well with heavy or light instructions, and the difference was generally 100 MHz or less. These likely represent other sources of reduced frequency, such as thermal throttling or switching to a higher active core count when a process not related to the test process has active threads. In any case, for each core count, we can find a test in each of the instruction categories that runs at full speed, allowing me to fill out the matrix even in the presence of these outliers.
⁷ I mention Turbo Boost 2.0 specifically because this chip also has a higher Turbo Boost 3.0 maximum frequency of 5.2 GHz, and beyond that a high Thermal Velocity Boost frequency of 5.3 GHz. These higher frequencies apply only to specific chosen cores within the CPU selected at manufacturing based on their ability to reach these higher frequencies. We don’t see any of these higher speeds during the test, possibly because the cores the test pins itself to are not the chosen cores on this CPU. So the frequency behavior of this chip can be characterized as “very simple” only if you ignore these additional turbo levels and other complicating factors.
⁸ I have to weasel-word with mostly here because even if there is no frequency transition, there may be a voltage transition which both incurs a halted period where nothing executes, and increases power for subsequent execution that may not require the elevated voltage. Also, there is the not-yet-discussed concept of implicit widening which may extend later narrow operations to maximum width if the upper parts of the registers are not zeroed with vzeroupper or vzeroall.
⁹ For example, one 512-bit integer addition would generally be cheaper in energy use than the two 256-bit operations required to calculate the same result, because of execution overheads that don’t scale linearly with width (that’s almost everything outside of execution itself).

Comments

NOAH GOLDSTEIN • December 7th, 2021 20:08

@TravisDown if you’re interested, you might weigh in here: https://marc.info/?t=163876105100002&r=1&w=3

Noah Goldstein • October 14th, 2021 20:37

Hey Travis,

Any info on the reduced IPC transition time / V-only transition for RKL?

YsHaNg • January 20th, 2021 16:08

I tested on a Tiger Lake 1165G7 processor, although the MSR readings don’t seem to work under a VM. The results:

CPU brand string: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
4 available CPUs: [0, 1, 2, 3]
2 physical cores: [0, 2]
Will test up to 2 CPUs

Cores	ID	Description	OVRLP3	Mops
1	scalar_iadd	Scalar integer adds	1.000	4079
1	avx256_iadd	256-bit integer serial adds	1.000	4074
1	avx512_iadd	512-bit integer serial adds	1.000	4077
1	avx256_fma_t	256-bit parallel DP FMAs	1.000	8155
1	avx512_fma_t	512-bit parallel DP FMAs	1.000	4570

Cores	ID	Description	OVRLP3	Mops
2	scalar_iadd	Scalar integer adds	1.000	4076, 4077
2	avx256_iadd	256-bit integer serial adds	1.000	4076, 4076
2	avx512_iadd	512-bit integer serial adds	1.000	4078, 4079
2	avx256_fma_t	256-bit parallel DP FMAs	1.000	8143, 8153
2	avx512_fma_t	512-bit parallel DP FMAs	1.000	4076, 4078

Author Travis Downs • January 25th, 2021 03:00

Thanks, are you running this under a VM? The output indicates 2 physical cores but the 1165G7 should have 4?

In any case, the results are strange. Most tests run at ~4.1 GHz except the 1-core 512-bit FMA test, which runs at ~4.6 GHz, but if anything should be slower it should be that one. Perhaps virtualization is interfering, although the numbers do look fairly consistent to me (e.g., little variance around the typical Mops of 4076).

YsHaNg • January 29th, 2021 21:51

This is tested under Hyper-V. I have half CPU resources assigned to VM. (I didn’t know this is default.) My machine is Dell XPS 13 9310. Now 4 cores test:

CPU brand string: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
8 available CPUs: [0, 1, 2, 3, 4, 5, 6, 7]
4 physical cores: [0, 2, 4, 6]
Will test up to 4 CPUs

Cores	ID	Description	OVRLP3	Mops
1	scalar_iadd	Scalar integer adds	1.000	4670
1	avx256_iadd	256-bit integer serial adds	1.000	4670
1	avx512_iadd	512-bit integer serial adds	1.000	4570
1	avx256_fma_t	256-bit parallel DP FMAs	1.000	9157
1	avx512_fma_t	512-bit parallel DP FMAs	1.000	4568

Cores	ID	Description	OVRLP3	Mops
2	scalar_iadd	Scalar integer adds	1.000	4071, 4072
2	avx256_iadd	256-bit integer serial adds	1.000	4669, 4665
2	avx512_iadd	512-bit integer serial adds	1.000	4567, 4567
2	avx256_fma_t	256-bit parallel DP FMAs	1.000	8135, 8128
2	avx512_fma_t	512-bit parallel DP FMAs	1.000	4072, 4072

Cores	ID	Description	OVRLP3	Mops
3	scalar_iadd	Scalar integer adds	1.000	4211, 4665, 4208
3	avx256_iadd	256-bit integer serial adds	1.000	4069, 4070, 4071
3	avx512_iadd	512-bit integer serial adds	1.000	4145, 4138, 4567
3	avx256_fma_t	256-bit parallel DP FMAs	1.000	8145, 8148, 8145
3	avx512_fma_t	512-bit parallel DP FMAs	1.000	4567, 1982, 1979

Cores	ID	Description	OVRLP3	Mops
4	scalar_iadd	Scalar integer adds	1.000	3686, 4071, 4072, 3679
4	avx256_iadd	256-bit integer serial adds	1.000	4072, 4072, 3549, 3552
4	avx512_iadd	512-bit integer serial adds	1.000	3698, 3699, 3703, 3694
4	avx256_fma_t	256-bit parallel DP FMAs	1.000	8141, 8143, 8144, 4070
4	avx512_fma_t	512-bit parallel DP FMAs	1.000	4074, 1985, 1991, 4070

Hahem • August 22nd, 2020 07:17

5 licenses in ICL client. 128/256/512 x Light/Heavy.

Author Travis Downs • January 25th, 2021 02:49

Thanks for your comment. Just doing the multiplication naively, that would be 6, wouldn’t it?

Meehigh • August 20th, 2020 17:23

Hi, very nice article indeed! Do you happen to know whether the L1 transition is soft or hard?

Author Travis Downs • August 23rd, 2020 04:51

The transition is hard. As on SKX, there is a period where the instructions still execute at reduced throughput, but a single instruction triggers the transition, just like all the “wider” transitions in SKX.

Aleksey Ignatenko • August 20th, 2020 06:59

Highly expected! Thanks for the info

Andrew Hunter • August 19th, 2020 22:21

That’s great news, really!

Did you measure the voltage-only transitions from your Skylake post? Those are to some extent much scarier to me for some latency-sensitive cases. I’d weakly guess you wouldn’t see these with this improvement but I’m not confident.

Author Travis Downs • August 20th, 2020 03:29

Hi Andrew!

No, I haven’t measured those but a couple of other people asked too, so I might go ahead and do it. I wouldn’t be surprised to see some V-only transitions: it is entirely possible that three or more license levels still exist, but that they just aren’t evident from checking frequencies alone.

Some information from Intel at Hot Chips also indicates that ICX (ICL server) has very low transition times, even for frequency transitions, so it would be interesting to see if this applies to ICL too and seems like something that would be good news for you if you’re concerned about latency.

Author Travis Downs • August 23rd, 2020 04:53

I went ahead and measured them, and I also find voltage-only transitions and I find them to be about 2x as long as SKX (this might not represent any difference between the uarches, but between the voltage levels required, or something).

You can find some more details on Twitter.

Non-E-Moose • August 19th, 2020 21:33

Just like the absurd approaches taken by DEC with various VAX systems. Firmware throttling until the customer ‘buys’ the upgrade.
