Your CPU May Have Slowed Down on Wednesday
A Strange Performance Effect
The plot below shows the throughput of filling a region of the given size (varying on the x-axis) with zeros[1] on Skylake (and Ice Lake in the second tab).
The two series were generated under apparently identical conditions: the same binary on the same machine. Only the date the benchmark was run varies. That is, on Monday, June 7th, filling with zeros is substantially faster than the same benchmark on Wednesday, at least when the region no longer fits in the L2 cache[2].
Hump Day Strikes Back
What’s going on here? Are my Skylake and Ice Lake hosts simply work-weary by Wednesday and don’t put in as much effort? Is there a new crypto-coin based on who can store the most zeros and this is a countermeasure to avoid ballooning CPU prices in the face of this new workload?
Believe it or not, it is none of the above!
These hosts run Ubuntu 20.04, and on Wednesday June 9th an update to the intel-microcode OS package was released. After a reboot[3], this loads the CPU with new microcode (released a day earlier by Intel) that causes the behavior shown above. Specifically, this microcode[4] disables the hardware zero store optimization we discussed in a previous post. It was disabled to mitigate CVE-2020-24512, further described[5] in Intel security advisory INTEL-SA-00464.
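If you want to check which microcode revision your CPU is currently running, Linux exposes it in /proc/cpuinfo. Here’s a minimal sketch (Linux/x86 only) that prints it:

```cpp
// Minimal sketch (Linux/x86): print the currently loaded microcode
// revision by scanning /proc/cpuinfo for the "microcode" field.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("microcode", 0) == 0) {  // line starts with "microcode"
            std::cout << line << '\n';          // e.g., "microcode : 0xea"
            break;  // all cores normally report the same revision
        }
    }
}
```

On an updated host this should report the new revisions listed in footnote 4.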
To be clear, I don’t know for sure that the microcode disables the zero store optimization – but the evidence is rather overwhelming. After the update, performance when filling zeros is the same as for any other value, and the performance counters tracking L2 evictions suggest that substantially all evictions are now non-silent (recall from the previous posts that silent evictions were a hallmark of the optimization).
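If you want to check the eviction behavior yourself, the relevant counters can be read with `perf_event_open`. Below is a rough sketch: it counts what I believe are the Skylake events L2_LINES_OUT.SILENT (raw 0x01F2) and L2_LINES_OUT.NON_SILENT (raw 0x02F2) around a zero fill. The event codes are an assumption on my part, so verify them against your CPU’s event list:

```cpp
// Rough sketch: count silent vs. non-silent L2 evictions around a zero
// fill using perf_event_open. The raw event codes below are assumptions
// for Skylake: L2_LINES_OUT.SILENT = event 0xF2/umask 0x01, and
// L2_LINES_OUT.NON_SILENT = event 0xF2/umask 0x02.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static int open_counter(uint64_t raw_config) {
    perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = raw_config;  // (umask << 8) | event
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main() {
    int silent = open_counter(0x01F2);     // L2_LINES_OUT.SILENT (assumed)
    int nonsilent = open_counter(0x02F2);  // L2_LINES_OUT.NON_SILENT (assumed)
    std::vector<uint32_t> buf(16 * 1024 * 1024 / 4);  // 16 MiB: larger than L2

    ioctl(silent, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(nonsilent, PERF_EVENT_IOC_ENABLE, 0);
    std::fill(buf.begin(), buf.end(), 0u);  // the zero fill under test
    ioctl(silent, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(nonsilent, PERF_EVENT_IOC_DISABLE, 0);

    long long s = 0, ns = 0;
    read(silent, &s, sizeof(s));
    read(nonsilent, &ns, sizeof(ns));
    printf("silent: %lld  non-silent: %lld\n", s, ns);
}
```

With the old microcode you’d expect mostly silent evictions for the zero fill; with the new microcode, substantially all evictions should show up as non-silent.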
Although I suspect the performance impact will be minuscule on average[6], this surprise still serves as a reminder that raw CPU performance can silently change due to microcode updates, and that most Linux distributions and modern Windows have these updates enabled by default. We’ve seen this before. If you are trying to run reproducible benchmarks, you should always re-run your entire suite to make accurate comparisons, even on the same hardware, rather than just re-running the parts you think have changed.
Reproduction
The code to collect this performance data and reproduce my results is available in zero-fill-bench, with some instructions in the README.
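If you just want a quick-and-dirty check rather than the full benchmark, a sketch like the following (not the zero-fill-bench code itself; the buffer size and iteration count are arbitrary) captures the shape of the experiment: time a fill with 0 versus a fill with 1 and compare throughput.

```cpp
// Quick-and-dirty sketch (not zero-fill-bench itself): compare the
// throughput of filling a buffer with 0 versus a nonzero value.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// noinline keeps the fill from being specialized on the constant value
__attribute__((noinline))
void do_fill(std::vector<uint32_t>& buf, uint32_t value) {
    std::fill(buf.begin(), buf.end(), value);
}

double gb_per_sec(std::vector<uint32_t>& buf, uint32_t value, int iters) {
    do_fill(buf, value);  // warm-up: touch all pages before timing
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        do_fill(buf, value);
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return (double)iters * buf.size() * sizeof(uint32_t) / elapsed.count() / 1e9;
}

int main() {
    // 16 MiB: larger than L2, so lines are evicted toward L3/memory
    std::vector<uint32_t> buf(16 * 1024 * 1024 / sizeof(uint32_t));
    printf("fill 0: %6.1f GB/s\n", gb_per_sec(buf, 0, 100));
    printf("fill 1: %6.1f GB/s\n", gb_per_sec(buf, 1, 100));
}
```

With the optimization active, the zero fill should be measurably faster once the buffer exceeds L2; after the microcode update the two numbers should be essentially identical.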
Mea Culpa and an Unsustainable Path
In writing the earlier blog entries on this topic, I was interested in the performance aspects of this optimization, not its potential as an attack vector. However, merely by observing (and publishing) the results, the optimization was affected: the system under measurement changed as a result of the observation. Now I can’t be sure that the optimization wouldn’t have eventually been disabled anyway, but it does seem that my earlier post was the proximate cause of this change happening when it did.
I am not convinced that removing every optimization which can be used in a timing-based side channel is sustainable, and I am not sure this is a thread you want to keep pulling on: practically every aspect of a modern CPU can vary in performance and timing based on internal state[7]. Trying to draw the security boundaries tightly around co-located entities (e.g., processes on the same CPU, especially on the same core) without allowing any leaks seems destined to fail without a complete overhaul of CPU design, likely at the cost of a large amount of performance. There are just too many holes to plug.
I hope that once the wave of vulnerabilities and disclosures that started with Meltdown and Spectre begins to recede, we can start to work on a measured approach to classifying and mitigating timing and other side-channel attacks. This could start by enumerating which performance characteristics are reasonably guaranteed to hold, and which aren’t. For example, it could be specified whether memory access timing may vary based on the value accessed. If it is allowed to vary, the zero store optimization would be allowed.
In any case, I still plan to write about performance-related microarchitectural details. I just hope this outcome does not repeat itself.
Thanks
Thanks to JN, Chris Martin, Jonathan and m_ueberall for reporting or fixing typos in the text.
Stone photo by Colin Watts on Unsplash.
Discussion and Feedback
You can join the discussion on Twitter, Hacker News or r/Intel.
If you have a question or any type of feedback, you can leave a comment below.
If you liked this post, check out the homepage for others you might enjoy.
1. Specifically, it uses `std::fill` with a zero argument, with some inlining prevention, which ultimately results in a fill using a series of 32-byte vector stores, 256 bytes per unrolled iteration, with a loop body like this:

   ```nasm
   vmovdqu YMMWORD PTR [rax],ymm1
   vmovdqu YMMWORD PTR [rax+0x20],ymm1
   vmovdqu YMMWORD PTR [rax+0x40],ymm1
   vmovdqu YMMWORD PTR [rax+0x60],ymm1
   vmovdqu YMMWORD PTR [rax+0x80],ymm1
   vmovdqu YMMWORD PTR [rax+0xa0],ymm1
   vmovdqu YMMWORD PTR [rax+0xc0],ymm1
   vmovdqu YMMWORD PTR [rax+0xe0],ymm1
   ```

   So the compiler does a good job: you can’t ask for much better than that. ↩
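   For the curious, here is a minimal source sketch (not the exact zero-fill-bench code) that, with gcc or clang at `-O3 -march=skylake`, typically produces a loop like the one above; depending on compiler and element type it may instead be turned into a `memset` call:

   ```cpp
   #include <algorithm>
   #include <cstddef>
   #include <cstdint>

   // noinline prevents the compiler from specializing the fill at each
   // call site, so the value stays a runtime argument and the loop is
   // vectorized (broadcast + unrolled 32-byte stores) rather than being
   // constant-folded or replaced wholesale.
   __attribute__((noinline))
   void fill_buf(uint32_t* buf, size_t n, uint32_t value) {
       std::fill(buf, buf + n, value);
   }
   ```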
2. I’m doing a bit of a retcon here. The effect is present as described on the dates given, and I observed and benchmarked it on Wednesday myself, but the specific data series used for the plots were generated a week later, when I had time to collect the data properly in a relatively noise-free environment. So the two series were collected back-to-back on the same day, varying only the hidden parameter you’ll learn about two paragraphs from now. ↩
3. To be clear, the microcode is not persistent, so it needs to be loaded on every boot. If you remove or downgrade the `intel-microcode` package, you’ll be back to an older microcode after the next boot. That is, unless you also update your BIOS, which can also come with a microcode update: this will be persistent unless you downgrade your BIOS. ↩
4. The new June 8th microcode versions are `0xea` for Skylake (versus `0xe2` previously) and `0xa6` for Ice Lake (versus `0xa0` previously). ↩
5. barely ↩
6. The performance regression shown in the plots is close to a worst case: the benchmark only fills zeros and nothing else. Real code doesn’t spend that much time filling zeros, although zero is no doubt the dominant value in large block fills, at least because the OS must zero pages before returning them to user processes and memory-safe languages like Java will zero some objects and array types in bulk. ↩
7. This observation becomes almost universal once you consider that the values involved in any operation affect power use (see e.g. Schöne et al or Cornebize and Legrand). Since power use can be directly (e.g., RAPL or external measurements) or indirectly (e.g., because of heat-dependent frequency changes) observed, it means that in theory any operation, even those widely considered to be constant-time, may leak information. ↩
Comments
Hello Travis. Been reading your blog for a while but never bothered to comment.
I wonder if I could interest you in writing an article about factors affecting microbenchmark reproducibility? I’m trying to benchmark some CPU-bound kernels on small ARM single-board computers, such as the Raspberry Pi and Jetson Nano, and I used to see fluctuations of up to 15% after changing a small section of code that’s completely unrelated to the code being benchmarked. So far I’ve found the following causes:
After fixing these, I only get the occasional 2% from one version of the code to the next (again, without touching the code being actually benchmarked). Still, I know I can do much better than that — most of the benchmarks barely change 0.1% from one run to the next. Unfortunately I’m running out of avenues to investigate. I’ll probably inspect some OS-related factors now: task scheduling, priority, CPU affinity, background tasks, etc.
Unfortunately PMUs on ARM processors (at least for the cores I’m working with) severely lag behind those on Intel. There’s no probing inside the core, front-end/back-end, execution ports, etc. Just some cache and branch/speculative execution related stuff.
The feature is both a power and a performance optimisation. It takes effect when you write zero to an already-zeroed cache line. In addition to the page scrubbing case given, consider also a program stack with blocks of local variables initialised to zero.
The performance optimisation comes from reduced cache coherency overhead - the cache line doesn’t have to move out of the shared state if it doesn’t logically become modified. The power optimisation comes from reduced DIMM traffic, and reducing DIMM traffic is a very big deal for devices running on battery.
The problem is that it introduces data-dependent timing differences, and undermines a critical security property of hardened crypto libraries, which are designed to have timing which is invariant to the key, cypher and plaintext that they’re processing.
I’m sorry, but could someone point me to the calendar/point in time (read: year) where/when June 7th fell on a Tuesday? I’m confused.
Sorry, classic off-by-one on my part.
The CVE was made public and the microcode was released by Intel on Tuesday the 8th, but Ubuntu didn’t release the updated microcode package until the next day, the 9th. So the 7th is the “last definitely fast day” and the 8th is ambiguous depending on how you get your microcode (and when you reboot). I’ve changed the comparison days to the 7th and 9th and fixed the text, and credited you.
I’m also very confused. There was an Intel microcode patch released on 2021-06-09 at 04:23 (whatever time zone that is). On my calendar, that’s a Wednesday, not the 8th nor the 7th. The last year when June 7th fell on a Tuesday was 2016; the next will be 2022 (source: https://www.timeanddate.com/calendar/weekday-tuesday-7)
at the system level, zero fill might occur more than you expect. ex: https://stackoverflow.com/questions/18385556/does-windows-clear-memory-pages
Wednesday June 8th? Not in the UK
Yes, my bad, it should be the 9th for Ubuntu users. Tuesday June 8th is when the microcode was released by Intel. See also my longer reply to m_ueberall.
I’ve fixed the text and credited you.
I’m a little confused why this has security implications.
If you can write to a piece of memory, in what situation would you not be allowed to read that memory too? If you can read the memory, why does it matter if you can use a timing side-channel to observe whether the data is all zero?
The concern isn’t an attacker that can read and write to the memory, but one that can’t.
E.g., another process on the same machine could possibly observe whether a cache line it cannot access is zero or not, if it manages to evict it from cache and observe the timing. Or an attacker without access to the machine could time operations from the outside to determine if zero cache lines were involved. These are so-called “timing side-channel attacks”.
This phrase - which is a rather crucial one, and one that, anyway, is somewhat hard to understand - is missing a word: ‘it does seem that the proximate cause this change to the microcode was my earlier post’ (sic).
Thanks, JN, I’ve reworded this and fixed the typo.
The solution is “ghost parity.” For every bit-store in the CPU, create a parallel bit-store that always contains the complement bits. Thus parity is preserved.
CMOS behaves like that, but not all logic families do. ECL-based computers like the early Cray machines looked to the PSU like a giant resistor. It’s not a terribly likely direction of travel for modern computing, though.
I think the only place this will end is where the Burroughs engineers were trying to tell the shared-hardware advocates it would in the late 60s. The only secure (running) hardware is hardware that never shares resources between processes. Back then the argument was cost: RAM and CPU time were prohibitively expensive. Cost won over security only because all users on any particular system could be trusted, and inter-system networking didn’t exist as it does now. Now the shared-resource model is breaking down in the face of an entire world of potentially hostile users aimed at any particular machine. It’s not sustainable.
My (shaky) understanding is that, since modern computation is destructive and nonreversible and information is conserved, overwritten values are leaked as heat. I wonder if reversible computing would, by its inherent nature, shut a whole bunch of side channels off…
Even with reversible computing, most reasonable uses (e.g., performing irreversible computations) end up converting ancilla zero bits into “garbage”. This garbage is what gives the logical reversibility, and thus it will contain the necessary information to recompute the input. Now, you either clone the output & uncompute the garbage (to get zeroes again), with a huge performance overhead, or you destroy the garbage, which again dissipates/leaks the information. At least that’s the theory…
In any case, I would also love to see research on the intersection of security & reversible computing. To me any exciting future must involve NVRAM + error correction + reversible computing.
For most SCAs the issue is still secret-dependent control flow and data accesses; reversibility does not prevent any of that. In fact it arguably makes them worse: if everything is an isomorphism, won’t any given leak contain more information?
From a closed system heat cannot escape. If a CPU is a closed system, entropy would not change, therefore no new info would be generated.