Your CPU May Have Slowed Down on Wednesday
A Strange Performance Effect
The plot below shows the throughput of filling a region of the given size (varying on the x-axis) with zeros[1] on Skylake (and Ice Lake in the second tab).
The two series were generated under apparently identical conditions: the same binary on the same machine. Only the date the benchmark was run varies. That is, on Monday, June 7th, filling with zeros is substantially faster than the same benchmark on Wednesday, at least when the region no longer fits in the L2 cache[2].
Hump Day Strikes Back
What’s going on here? Are my Skylake and Ice Lake hosts simply work-weary by Wednesday and don’t put in as much effort? Is there a new crypto-coin based on who can store the most zeros and this is a countermeasure to avoid ballooning CPU prices in the face of this new workload?
Believe it or not, it is none of the above!
These hosts run Ubuntu 20.04, and on Wednesday June 9th an update to the intel-microcode OS package was released. After a reboot[3], this loads the CPU with new microcode (released a day earlier by Intel) that causes the behavior shown above. Specifically, this microcode[4] disables the hardware zero store optimization we discussed in a previous post. It was disabled to mitigate CVE-2020-24512, further described[5] in Intel security advisory INTEL-SA-00464.
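If you want to check which microcode revision your CPU is currently running, Linux exposes it in /proc/cpuinfo. Here’s a minimal sketch (Linux/x86 only) that prints it:

```cpp
// Minimal sketch (Linux/x86): print the currently loaded microcode
// revision by scanning /proc/cpuinfo for the "microcode" field.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("microcode", 0) == 0) {  // line starts with "microcode"
            std::cout << line << '\n';          // e.g., "microcode : 0xea"
            break;  // all cores normally report the same revision
        }
    }
}
```

On an updated host this should report the new revisions listed in footnote 4.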
To be clear, I don’t know for sure that the microcode disables the zero store optimization – but the evidence is rather overwhelming. After the update, performance when filling zeros is the same as for any other value, and the performance counters tracking L2 evictions suggest that substantially all evictions are now non-silent (recall from the previous posts that silent evictions were a hallmark of the optimization).
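If you want to check the eviction behavior yourself, the relevant counters can be read with `perf_event_open`. Below is a rough sketch: it counts what I believe are the Skylake events L2_LINES_OUT.SILENT (raw 0x01F2) and L2_LINES_OUT.NON_SILENT (raw 0x02F2) around a zero fill. The event codes are an assumption on my part, so verify them against your CPU’s event list:

```cpp
// Rough sketch: count silent vs. non-silent L2 evictions around a zero
// fill using perf_event_open. The raw event codes below are assumptions
// for Skylake: L2_LINES_OUT.SILENT = event 0xF2/umask 0x01, and
// L2_LINES_OUT.NON_SILENT = event 0xF2/umask 0x02.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static int open_counter(uint64_t raw_config) {
    perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = raw_config;  // (umask << 8) | event
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main() {
    int silent = open_counter(0x01F2);     // L2_LINES_OUT.SILENT (assumed)
    int nonsilent = open_counter(0x02F2);  // L2_LINES_OUT.NON_SILENT (assumed)
    std::vector<uint32_t> buf(16 * 1024 * 1024 / 4);  // 16 MiB: larger than L2

    ioctl(silent, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(nonsilent, PERF_EVENT_IOC_ENABLE, 0);
    std::fill(buf.begin(), buf.end(), 0u);  // the zero fill under test
    ioctl(silent, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(nonsilent, PERF_EVENT_IOC_DISABLE, 0);

    long long s = 0, ns = 0;
    read(silent, &s, sizeof(s));
    read(nonsilent, &ns, sizeof(ns));
    printf("silent: %lld  non-silent: %lld\n", s, ns);
}
```

With the old microcode you’d expect mostly silent evictions for the zero fill; with the new microcode, substantially all evictions should show up as non-silent.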
Although I suspect the performance impact will be minuscule on average[6], this surprise still serves as a reminder that raw CPU performance can silently change due to microcode updates, and that most Linux distributions and modern Windows have these updates enabled by default. We’ve seen this before. If you are trying to run reproducible benchmarks, you should always re-run your entire suite to make accurate comparisons, even on the same hardware, rather than just re-running the parts you think have changed.
Reproduction
The code to collect this performance data and reproduce my results is available in zero-fill-bench, with some instructions in the README.
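If you just want a quick-and-dirty check rather than the full benchmark, a sketch like the following (not the zero-fill-bench code itself; the buffer size and iteration count are arbitrary) captures the shape of the experiment: time a fill with 0 versus a fill with 1 and compare throughput.

```cpp
// Quick-and-dirty sketch (not zero-fill-bench itself): compare the
// throughput of filling a buffer with 0 versus a nonzero value.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// noinline keeps the fill from being specialized on the constant value
__attribute__((noinline))
void do_fill(std::vector<uint32_t>& buf, uint32_t value) {
    std::fill(buf.begin(), buf.end(), value);
}

double gb_per_sec(std::vector<uint32_t>& buf, uint32_t value, int iters) {
    do_fill(buf, value);  // warm-up: touch all pages before timing
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        do_fill(buf, value);
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return (double)iters * buf.size() * sizeof(uint32_t) / elapsed.count() / 1e9;
}

int main() {
    // 16 MiB: larger than L2, so lines are evicted toward L3/memory
    std::vector<uint32_t> buf(16 * 1024 * 1024 / sizeof(uint32_t));
    printf("fill 0: %6.1f GB/s\n", gb_per_sec(buf, 0, 100));
    printf("fill 1: %6.1f GB/s\n", gb_per_sec(buf, 1, 100));
}
```

With the optimization active, the zero fill should be measurably faster once the buffer exceeds L2; after the microcode update the two numbers should be essentially identical.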
Mea Culpa and an Unsustainable Path
In writing the earlier blog entries on this topic, I was interested in the performance aspects of this optimization, not its potential as an attack vector. However, merely by observing (and publishing) the results, the optimization was affected: the system under measurement changed as a result of the observation. Now I can’t be sure that the optimization wouldn’t have eventually been disabled anyway, but it does seem that my earlier post was the proximate cause of this change happening when it did.
I am not convinced that removing every optimization which can be used in a timing-based side channel is sustainable, and I am not sure this is a thread you want to keep pulling on: practically every aspect of a modern CPU can vary in performance and timing based on internal state[7]. Trying to draw the security boundaries tightly around co-located entities (e.g., processes on the same CPU, especially on the same core) without allowing any leaks seems destined to fail without a complete overhaul of CPU design, likely at the cost of a large amount of performance. There are just too many holes to plug.
I hope that once the wave of vulnerabilities and disclosures that started with Meltdown and Spectre begins to recede, we can start to work on a measured approach to classifying and mitigating timing and other side-channel attacks. This could start by enumerating which performance characteristics are reasonably guaranteed to hold, and which aren’t. For example, it could be specified whether memory access timing may vary based on the value accessed. If it is allowed to vary, the zero store optimization would be allowed.
In any case, I still plan to write about performance-related microarchitectural details. I just hope this outcome does not repeat itself.
Thanks
Thanks to JN, Chris Martin, Jonathan and m_ueberall for reporting or fixing typos in the text.
Stone photo by Colin Watts on Unsplash.
Discussion and Feedback
You can join the discussion on Twitter, Hacker News or r/Intel.
If you have a question or any type of feedback, you can leave a comment below.
If you liked this post, check out the homepage for others you might enjoy.
1. Specifically, it uses `std::fill` with a zero argument, with some inlining prevention, which ultimately results in a fill using a series of 32-byte vector stores, 256 bytes per unrolled iteration, with a loop body like this:

   ```nasm
   vmovdqu YMMWORD PTR [rax],ymm1
   vmovdqu YMMWORD PTR [rax+0x20],ymm1
   vmovdqu YMMWORD PTR [rax+0x40],ymm1
   vmovdqu YMMWORD PTR [rax+0x60],ymm1
   vmovdqu YMMWORD PTR [rax+0x80],ymm1
   vmovdqu YMMWORD PTR [rax+0xa0],ymm1
   vmovdqu YMMWORD PTR [rax+0xc0],ymm1
   vmovdqu YMMWORD PTR [rax+0xe0],ymm1
   ```

   So the compiler does a good job: you can’t ask for much better than that. ↩
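   For the curious, here is a minimal source sketch (not the exact zero-fill-bench code) that, with gcc or clang at `-O3 -march=skylake`, typically produces a loop like the one above; depending on compiler and element type it may instead be turned into a `memset` call:

   ```cpp
   #include <algorithm>
   #include <cstddef>
   #include <cstdint>

   // noinline prevents the compiler from specializing the fill at each
   // call site, so the value stays a runtime argument and the loop is
   // vectorized (broadcast + unrolled 32-byte stores) rather than being
   // constant-folded or replaced wholesale.
   __attribute__((noinline))
   void fill_buf(uint32_t* buf, size_t n, uint32_t value) {
       std::fill(buf, buf + n, value);
   }
   ```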
2. I’m doing a bit of a retcon here. The effect is present as described on the dates given, and I observed and benchmarked it on Wednesday myself, but the specific data series used for the plots were generated a week later, when I had time to collect the data properly in a relatively noise-free environment. So the two series were collected back-to-back on the same day, varying only the hidden parameter you’ll learn about two paragraphs from now. ↩
3. To be clear, the microcode is not persistent, so it needs to be loaded on every boot. If you remove or downgrade the `intel-microcode` package, you’ll be back to an older microcode after the next boot. That is, unless you also update your BIOS, which can also come with a microcode update: this will be persistent unless you downgrade your BIOS. ↩
4. The new June 8th microcode versions are `0xea` for Skylake (versus `0xe2` previously) and `0xa6` for Ice Lake (versus `0xa0` previously). ↩
5. barely ↩
6. The performance regression shown in the plots is close to a worst case: the benchmark only fills zeros and nothing else. Real code doesn’t spend that much time filling zeros, although zero is no doubt the dominant value in large block fills, at least because the OS must zero pages before returning them to user processes and memory-safe languages like Java will zero some objects and array types in bulk. ↩
7. This observation becomes almost universal once you consider that the values involved in any operation affect power use (see e.g. Schöne et al or Cornebize and Legrand). Since power use can be directly (e.g., RAPL or external measurements) or indirectly (e.g., because of heat-dependent frequency changes) observed, it means that in theory any operation, even those widely considered to be constant-time, may leak information. ↩
Comments
Hello Travis. Been reading your blog for a while but never bothered to comment.
I wonder if I could interest you in writing an article about factors affecting microbenchmark reproducibility? I’m trying to benchmark some CPU-bound kernels on small ARM single-board computers, such as the Raspberry Pi and Jetson Nano, and I used to see fluctuations of up to 15% after changing a small section of code that’s completely unrelated to the code being benchmarked. So far I’ve found the following causes:
After fixing these, I only get the occasional 2% from one version of the code to the next (again, without touching the code being actually benchmarked). Still, I know I can do much better than that — most of the benchmarks barely change 0.1% from one run to the next. Unfortunately I’m running out of avenues to investigate. I’ll probably inspect some OS-related factors now: task scheduling, priority, CPU affinity, background tasks, etc.
Unfortunately PMUs on ARM processors (at least for the cores I’m working with) severely lag behind those on Intel. There’s no probing inside the core, front-end/back-end, execution ports, etc. Just some cache and branch/speculative execution related stuff.
The feature is both a power and a performance optimisation. It takes effect when you write zero to an already-zeroed cache line. In addition to the page scrubbing case given, consider also a program stack with blocks of local variables initialised to zero.
The performance optimisation comes from reduced cache coherency overhead - the cache line doesn’t have to move out of the shared state if it doesn’t logically become modified. The power optimisation comes from reduced DIMM traffic, and reducing DIMM traffic is a very big deal for devices running on battery.
The problem is that it introduces data-dependent timing differences, and undermines a critical security property of hardened crypto libraries, which are designed to have timing which is invariant to the key, cypher and plaintext that they’re processing.
I’m sorry, but could someone point me to the calendar/point in time (read: year) where/when June 7th fell on a Tuesday? I’m confused.
Sorry, classic off-by-one on my part.
The CVE was made public and the microcode was released by Intel on Tuesday the 8th, but Ubuntu didn’t release the updated microcode package until the next day, the 9th. So the 7th is the “last definitely fast day” and the 8th is ambiguous depending on how you get your microcode (and when you reboot). I’ve changed the comparison days to the 7th and 9th and fixed the text, and credited you.
I’m also very confused. There was an Intel microcode patch released on 2021-06-09 at 04:23 (whatever time zone that is). On my calendar, that’s a Wednesday, not the 8th nor the 7th. The last year when June 7th fell on a Tuesday was 2016; the next will be 2022 (source: https://www.timeanddate.com/calendar/weekday-tuesday-7)
at the system level, zero fill might occur more than you expect. ex: https://stackoverflow.com/questions/18385556/does-windows-clear-memory-pages
Wednesday June 8th? Not in the UK
Yes, my bad, it should be the 9th for Ubuntu users. Tuesday June 8th is when the microcode was released by Intel. See also my longer reply to m_ueberall.
I’ve fixed the text and credited you.
I’m a little confused why this has security implications.
If you can write to a piece of memory, in what situation would you not be allowed to read that memory too? If you can read the memory, why does it matter if you can use a timing side-channel to observe whether the data is all zero?
The concern isn’t an attacker that can read and write to the memory, but one that can’t.
E.g., another process on the same machine could possibly observe whether a cache line it cannot access is zero or not, if it manages to evict it from cache and observe the timing. Or an attacker without access to the machine could time operations from the outside to determine if zero cache lines were involved. These are so-called “timing side-channel attacks”.
This phrase - which is a rather crucial one, and one that, anyway, is somewhat hard to understand - is missing a word: ‘it does seem that the proximate cause this change to the microcode was my earlier post’ (sic).
Thanks, JN, I’ve reworded this and fixed the typo.
The solution is “ghost parity.” For every bit-store in the CPU, create a parallel bit-store that always contains the complement bits. Thus parity is preserved.
CMOS behaves like that, but not all logic families do. ECL-based computers like the early Cray machines looked to the PSU like a giant resistor. It’s not a terribly likely direction of travel for modern computing, though.
I think the only place this will end is where the Burroughs engineers were trying to tell the shared-hardware advocates it would in the late 60s. The only secure (running) hardware is hardware that never shares resources between processes. Back then the argument was cost: RAM and CPU time were prohibitively expensive. Cost won over security only because all users on any particular system could be trusted, and inter-system networking didn’t exist as it does now. Now the shared-resource model is breaking down in the face of an entire world of potentially hostile users aimed at any particular machine. It’s not sustainable.
My (shaky) understanding is that, since modern computation is destructive and nonreversible and information is conserved, overwritten values are leaked as heat. I wonder if reversible computing would, by its inherent nature, shut a whole bunch of side channels off…
Even with reversible computing, most reasonable uses (e.g., performing irreversible computations) end up converting ancilla zero bits into “garbage”. This garbage is what gives the logical reversibility, and thus it will contain the necessary information to recompute the input. Now, you either clone the output & uncompute the garbage (to get zeroes again), with a huge performance overhead, or you destroy the garbage, which again dissipates/leaks the information. At least that’s the theory…
In any case, I would also love to see research on the intersection of security & reversible computing. To me any exciting future must involve NVRAM + error correction + reversible computing.
For most SCAs the issue is still secret-dependent control flow and data accesses; reversibility does not prevent any of that. In fact it arguably makes them worse: if everything is an isomorphism, won’t any given leak contain more information?
From a closed system heat cannot escape. If a CPU is a closed system, entropy would not change, therefore no new info would be generated.