Hardware Store Elimination
I had no plans to write another post about zeros, but when life throws you a zero make zeroaid, or something like that. Here we go!
If you want to jump over the winding reveal and just read the summary and advice, now is your chance.
When writing simple memory benchmarks, I have always taken the position that the value written to memory didn’t matter. Recently, while running a straightforward benchmark1 probing the interaction between AVX-512 stores and read for ownership, I ran into a weird performance deviation. This is that story2.
Table of Contents
Prelude
Data Dependent Performance
On current mainstream CPUs, the timing of most instructions isn’t data-dependent. That is, their performance is the same regardless of the value of the input(s) to the instruction. Unlike you3 or me, your CPU takes the same time to add 1 + 2
as it does to add 68040486 + 80866502
.
Now, there are some notable exceptions:
- Integer division is data-dependent on most x86 CPUs: larger inputs generally take longer, although the details vary widely among microarchitectures4 (a quick timing sketch follows this list).
- BMI2 instructions pdep and pext have famously terrible and data-dependent performance on AMD Zen and Zen2 chips.
- Floating point instructions often have slower performance when denormal numbers are encountered, although modes such as flush-to-zero may avoid this.
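Here is a rough sketch of how you might see the first point for yourself. This is not from the post’s benchmark: the loop body, divisor and iteration count are all invented for illustration, and the divisor is read through a volatile so the compiler can’t turn the division into a multiply.

#include <chrono>
#include <cstdint>
#include <cstdio>

// Time a dependent chain of 64-bit divisions with small vs. large operands.
static double div_chain_seconds(uint64_t start) {
    volatile uint64_t divisor = 3;   // volatile: no strength reduction to multiply
    uint64_t x = start;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 10'000'000; ++i)
        x = (x / divisor) | start;   // keeps the operands small or large throughout
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    volatile uint64_t sink = x;      // keep the chain alive
    (void)sink;
    return dt.count();
}

int main() {
    printf("small operands: %.3f s\n", div_chain_seconds(5));
    printf("large operands: %.3f s\n", div_chain_seconds(~0ull));
}

On most current x86 chips the large-operand run should be noticeably slower; on a chip with a constant-time divider (footnote 4 mentions Ice Lake) the two should be close.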
That list is not exhaustive: there are other cases of data-dependent performance, especially when you start digging into complex microcoded instructions such as cpuid. Still, it isn’t unreasonable to assume that most simple instructions not listed above execute in constant time.
How about memory operations, such as loads and stores?
Certainly, the address matters. After all the address determines the caching behavior, and caching can easily account for two orders of magnitude difference in performance5. On the other hand, I wouldn’t expect the data values loaded or stored to matter. There is not much reason to expect the memory or caching subsystem to care about the value of the bits loaded or stored, outside of scenarios such as hardware-compressed caches not widely deployed6 on x86.
Source
The full benchmark associated with this post (including some additional benchmarks not mentioned here) is available on GitHub.
Benchmarks
That’s enough prelude for now. Let’s write some benchmarks.
A Very Simple Loop
Let’s start with a very simple task. Write a function that takes an int value val and fills a buffer of a given size with copies of that value. Just like memset, but with an int value rather than a char one.
The canonical C implementation is probably some type of for loop, like this:
void fill_int(int* buf, size_t size, int val) {
for (size_t i = 0; i < size; ++i) {
buf[i] = val;
}
}
… or maybe this7:
void fill_int(int* buf, size_t size, int val) {
for (int* end = buf + size; buf != end; ++buf) {
*buf = val;
}
}
In C++, we don’t even need that much: we can simply delegate directly to std::fill, which does the same thing as a one-liner8:
std::fill(buf, buf + size, val);
There is nothing magic about std::fill: it also uses a loop, just like the C version above. Not surprisingly, gcc and clang compile them to the same machine code9.
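In case it helps to see it spelled out, the non-inlined function under test might look roughly like this (a sketch based on footnote 8; the actual benchmark code is on GitHub and may differ in names and details):

#include <algorithm>
#include <cstddef>

// One out-of-line copy shared by both fill values, so only the value differs.
__attribute__((noinline))
void fill_int(int* buf, size_t size, int val) {
    std::fill(buf, buf + size, val);
}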
With the right compiler arguments (-march=native -O3 -funroll-loops in our case), we expect this std::fill version (and all the others) to be implemented with AVX vector instructions, and so it is. The part which does the heavy lifting for large fills looks like this:
.L4:
vmovdqu YMMWORD PTR [rax + 0], ymm1
vmovdqu YMMWORD PTR [rax + 32], ymm1
vmovdqu YMMWORD PTR [rax + 64], ymm1
vmovdqu YMMWORD PTR [rax + 96], ymm1
vmovdqu YMMWORD PTR [rax + 128], ymm1
vmovdqu YMMWORD PTR [rax + 160], ymm1
vmovdqu YMMWORD PTR [rax + 192], ymm1
vmovdqu YMMWORD PTR [rax + 224], ymm1
add rax, 256
cmp rax, r9
jne .L4
It stores 256 bytes of data every iteration using eight 32-byte AVX2 store instructions. The full function is much larger, with a scalar portion for buffers smaller than 32 bytes (which also handles the odd elements after the vectorized part is done), and a vectorized jump table to handle up to seven 32-byte chunks before the main loop. No effort is made to align the destination, but we’ll align everything to 64 bytes in our benchmark, so this won’t matter.
Our First Benchmark
Enough foreplay: let’s take the C++ version out for a spin, with two different fill values (val) selected completely at random: zero (fill0) and one (fill1). We’ll use gcc 9.2.1 and the -march=native -O3 -funroll-loops flags mentioned above.
We organize it so that for both tests we call the same non-inlined function: the exact same instructions are executed and only the value differs. That is, the compiler isn’t making any data-dependent optimizations.
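As a rough sketch of the kind of harness involved (this is not the actual uarch-bench code; the sizes, iteration counts and helper names here are invented for illustration):

#include <chrono>
#include <cstdio>
#include <cstdlib>

void fill_int(int* buf, size_t size, int val);   // the noinline function above

// Time `iters` fills of an n-int buffer and return throughput in GB/s.
double fill_gbps(int* buf, size_t n, int val, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        fill_int(buf, n, val);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return double(n) * sizeof(int) * iters / dt.count() / 1e9;
}

int main() {
    size_t n = 1 << 20;                                                    // 4 MiB region
    int* buf = static_cast<int*>(std::aligned_alloc(64, n * sizeof(int))); // 64-byte aligned
    fill_gbps(buf, n, 0, 10);                                              // warmup, discarded
    printf("fill0: %.1f GB/s\n", fill_gbps(buf, n, 0, 100));
    printf("fill1: %.1f GB/s\n", fill_gbps(buf, n, 1, 100));
    std::free(buf);
}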
Here’s the fill throughput in GB/s for these two values, for region sizes ranging from 100 up to 100,000,000 bytes.
About this chart:
At each region size (that is, at each position along the x-axis), 17 semi-transparent samples10 are plotted, and although they usually overlap almost completely (resulting in a single circle), you can see cases where there are outliers that don’t line up with the rest of the samples. This plot tries to give you an idea of the spread of back-to-back samples without hiding them behind error bars11. Finally, the sizes of the various data caches (32, 256 and 6144 KiB for the L1D, L2 and L3, respectively) are marked for convenience.
L1 and L2
Not surprisingly, the performance depends heavily on what level of cache the filled region fits into.
Everything is fairly sane when the buffer fits in the L1 or L2 cache (up to ~256 KiB12). The relatively poor performance for very small region sizes is explained by the prologue and epilogue of the vectorized implementation: for small sizes a relatively large amount of time is spent in these int-at-a-time loops: rather than copying up to 32 bytes per cycle, we copy only 4.
This also explains the bumpy performance in the fastest region between ~1,000 and ~30,000 bytes: this is highly reproducible and not noise. It occurs because some sampled values have a larger remainder mod 32. For example, the sample at 740 bytes runs at ~73 GB/s while the next sample at 988 runs at a slower 64 GB/s. That’s because 740 % 32 is 4, while 988 % 32 is 28, so the latter size has 7x more cleanup work to do than the former13. Essentially, we are sampling semi-randomly a sawtooth function and if you plot this region with finer granularity14 you can see it quite clearly.
Getting Weird in the L3
So while there are some interesting effects in the first half of the results, covering L1 and L2, they are fairly easy to explain and, more to the point, performance for the zero and one cases are identical: the samples are all concentric. As soon as we dip our toes into the L3, however, things start to get weird.
Weird in that we see a clear divergence between stores of zero versus ones. Remember that this is the exact same function, the same machine code executing the same stream of instructions, only varying in the value of the ymm1 register passed to the store instruction. Storing zero is consistently about 17% to 18% faster than storing one, both in the region covered by the L3 (up to 6 MiB on my system), and beyond that where we expect misses to RAM (it looks like the difference narrows in the RAM region, but it’s mostly a trick of the eye: the relative performance difference is about the same).
What’s going on here? Why does the CPU care what values are being stored, and why is zero special?
We can get some additional insight by measuring the l2_lines_out.silent and l2_lines_out.non_silent events while we focus on the regions that fit in L2 or L3. These events measure the number of lines evicted from L2 either silently or non-silently.
Here are Intel’s descriptions of these events:
l2_lines_out.silent
Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state.
l2_lines_out.non_silent
Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines are in Modified state. Modified lines are written back to L3.
The states being referred to here are MESI cache states, commonly abbreviated M (modified), E (exclusive, but not modified) and S (possibly shared, not modified).
The second definition is not completely accurate. In particular, it implies that only modified lines trigger the non-silent event. However, I find that unmodified lines in E state can also trigger this event. Roughly, the behavior for unmodified lines seems to be that lines that miss in L2 and L3 usually get filled into the L2 in a state where they will be evicted non-silently, but unmodified lines that miss in L2 and hit in L3 will generally be evicted silently15. Of course, lines that are modified must be evicted non-silently in order to update the outer levels with the new data.
In summary: silent evictions are associated with unmodified lines in E or S state, while non-silent evictions are associated with M, E or (possibly) S state lines, with the silent vs non-silent choice for E and S being made in some unknown manner.
Let’s look at silent vs non-silent evictions for the fill0 and fill1 cases:
About this chart:
For clarity, I show only the median single sample for each size16. As before, the left axis is fill speed, and on the right axis the two types of eviction events are plotted, normalized to the number of cache lines accessed in the benchmark. That is, a value of 1.0 means that for every cache line accessed, the event occurred one time.
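Concretely, the normalization is just the raw counter delta divided by the number of lines touched; something like (names invented for illustration):

// Events per stored cache line: perf counter delta divided by the number of
// 64-byte lines the benchmark touched across all iterations.
double events_per_line(double event_count, size_t region_bytes, int iters) {
    return event_count / (double(region_bytes) / 64.0 * iters);
}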
The total number of evictions (sum of silent and non-silent) is the same for both cases: near zero17 when the region fits in L2, and then quickly increases to ~1 eviction per stored cache line. In the L3, fill1 also behaves as we’d expect: essentially all of the evictions are non-silent. This makes sense since modified lines must be evicted non-silently to write their modified data to the next layer of the cache subsystem.
For fill0, the story is different. Once the buffer size no longer fits in L2, we see the same total number of evictions from L2, but 63% of these are silent, the rest non-silent. Remember, only unmodified lines even have the hope of a silent eviction. This means that at least 63% of the time, the L218 is able to detect that the write is redundant: it doesn’t change the value of the line, and so the line is evicted silently. That is, it is never written back to the L3. This is presumably what causes the performance boost: the pressure on the L3 is reduced: although all the implied reads19 still need to go through the L3, only about 1 out of 3 of those lines ends up getting written back.
Once the test starts to exceed the L3 threshold, all of the evictions become non-silent even in the fill0 case. This doesn’t necessarily mean that the zero optimization stops occurring. As mentioned earlier15, this is a typical pattern even for read-only workloads: once lines arrive in L2 as a result of an L3 miss rather than a hit, their subsequent eviction becomes non-silent, even if never written. So we can assume that the lines are probably still detected as not modified, although we lose our visibility into the effect, at least as far as the l2_lines_out events go. That is, although all evictions are non-silent, some fraction of them presumably still indicate that the outgoing data is unmodified.
RAM: Still Weird
In fact, we can confirm that this apparent optimization still happens as we move into RAM, using a different set of events. There are several to choose from – and all of those that I tried tell the same story. We’ll focus on unc_arb_trk_requests.writes, documented as follows:
Number of writes allocated including any write transaction including full, partials and evictions.
It is important to note that the “uncore tracker” these events monitor is used by data flowing between the L3 and memory, not between the L2 and L3. So writes here generally refers to writes that will reach memory.
Here’s how this event scales for the same test we’ve been running this whole time (the size range has been shifted to focus on the area of interest)20:
The number of writes for well-behaved fill1 approaches one write per cache line as the buffer exceeds the size of L3 – again, this is as expected. For the more rebellious fill0, it is almost exactly half that amount. For every two lines written by the benchmark, we only write one back to memory! This same 2:1 ratio is reflected also if we measure memory writes at the integrated memory controller21: writing zeros results in only half the number of writes at the memory controller.
Wild, Irresponsible Speculation and Miscellaneous Musings
This is all fairly strange. It’s not weird that there would be a “redundant writes” optimization to avoid writing back identical values: this seems like it could benefit some common write patterns.
It is perhaps a bit unusual that it apparently applies only to all-zero values. Maybe this is because zeros overwriting zeros is one of the most common redundant write cases, and detecting zero values can be done more cheaply than a full compare. Also, the “is zero?” state can be communicated and stored as a single bit, which might be useful. For example, if the L2 is involved in the duplicate detection (and the l2_lines_out results suggest it is), perhaps the detection happens when the line is evicted, at which point you want to compare against the line in L3, but you certainly can’t store the entire old value in or near the L2 (that would require storage as large as the L2 itself). You could, however, store an indicator that the line was zero in a single bit, and compare the outgoing line against zero as part of the eviction process.
Predicting a New Predictor
What is weirdest of all, however, is that the optimization doesn’t kick in 100% of the time, but only for 40% to 60% of the lines, depending on various parameters22. What would lead to that effect? One could imagine some type of predictor which determines whether to apply this optimization or not, depending on e.g. whether the optimization has recently been effective – that is, whether redundant stores have been common recently. Perhaps this predictor also considers factors such as the occupancy of outbound queues23: when the bus is near capacity, eliminating redundant writes might be worth the power or latency penalty of checking for them, compared to the case when there is little apparent pressure on the bus.
In this benchmark, any predictor would find that the optimization is 100% effective: every write is redundant! So we might guess that the second condition (queue occupancy) results in a behavior where only some stores are eliminated: as more stores are eliminated, the load on the bus becomes lower, and so at some point the predictor no longer thinks it is worth it to eliminate stores, and you reach a kind of stable state where only a fraction of stores are eliminated, based on the predictor threshold.
Predictor Test
We can kind of test that theory: in this model, any store is capable of being eliminated, but the ratio of eliminated stores is bounded above by the predictor behavior. So if we find that a benchmark of pure redundant zero stores is eliminated at a 60% rate, we might expect that any benchmark with at least 60% redundant stores can reach the 60% rate, and with lower rates, you’d see full elimination of all redundant stores (since now the bus always stays active enough to trigger the predictor).
Apparently analogies are helpful, so an analogy here would be a person controlling the length of a line by redirecting some incoming people. For example, in an airport security line the handler tries to keep the line at a certain maximum length by redirecting (redirecting -> store elimination) people to the priority line if they are eligible and the main line is at or above its limit. Eligible people are those without carry-on luggage (eligible people -> zero-over-zero stores).
If everyone is eligible (-> 100% zero stores), this control will always be successful and the fraction of people redirected will depend on the relative rates of ingress and egress through security. If security only has a throughput of 40% of the ingress rate, 60% of people will be redirected in the steady state. Now, consider what happens if not everyone is eligible: if the eligible fraction is at least 60%, nothing changes. You still redirect 60% of people. Only if the eligible rate drops below 60% is there a problem: now you’ll be redirecting 100% of eligible people, but the primary line will grow beyond your limit.
Whew! Not sure if that was helpful after all?
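If a toy model is more your thing than airports, here is the same idea in code (entirely invented numbers and logic; this is a sketch of the hypothesized predictor, not of anything we know the hardware does):

#include <cstdio>

// Toy model of an occupancy-based elimination predictor: one store arrives per
// step; an eligible (zero-over-zero) store is eliminated only when the
// outbound queue is at or above a threshold; the queue drains at a fixed rate.
int main() {
    const double drain_per_step = 0.4;   // egress is 40% of the ingress rate
    const double threshold = 8.0;        // occupancy that trips the predictor
    const bool all_eligible = true;      // every store is redundant, as in fill0
    double queue = 0.0;
    long eliminated = 0, total = 10'000'000;
    for (long i = 0; i < total; ++i) {
        bool eligible = all_eligible || (i % 2 == 0);   // flip for an alt01-like mix
        if (eligible && queue >= threshold)
            ++eliminated;                // dropped: never occupies the queue
        else
            queue += 1.0;                // must be written back
        queue -= drain_per_step;
        if (queue < 0) queue = 0;
    }
    printf("eliminated: %.0f%%\n", 100.0 * eliminated / total);
}

With every store eligible this settles at roughly 60% eliminated (set by the drain rate); with only half the stores eligible it predicts about 50%, which is exactly what the next benchmark checks.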
Let’s try a benchmark which adds a new implementation, alt01, which alternates between writing a cache line of zeros and a cache line of ones. All the writes are redundant, but only 50% are zeros, so under the theory that a predictor is involved we expect that maybe 50% of the stores will be eliminated (i.e., 100% of the redundant zero stores are eliminated and they make up 50% of the total).
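A sketch of what such a kernel could look like (the real alt01 presumably uses the same vectorized stores as the other fills; this scalar version is only meant to pin down the access pattern):

#include <cstddef>

// Alternate whole 64-byte lines: even lines get zeros, odd lines get ones.
void alt01(int* buf, size_t size) {
    const size_t per_line = 64 / sizeof(int);   // 16 ints per cache line
    for (size_t i = 0; i < size; ++i)
        buf[i] = (i / per_line) % 2;            // 0,...,0, 1,...,1, 0,...
    // on the second and later passes over the buffer, every write is redundant
}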
Here we focus on the L3, similar to Fig. 2 above, showing silent evictions (the non-silent ones make up the rest, adding up to 1 total as before):
We don’t see 50% elimination. Rather, we see less than half the elimination of the all-zeros case: 27% versus 63%. Performance is better in the L3 region than the all-ones case, but only slightly so! So this doesn’t support the theory of a predictor capable of eliminating any store and operating primarily on outbound queue occupancy.
Similarly, we can examine the region where the buffer fits only in RAM, similar to Fig. 3 above:
Recall that the lines show the number of writes reaching the memory subsystem. Here we see that alt01 again splits the difference between the zeros and ones cases: about 75% of the writes reach memory, versus 48% in the all-zeros case, so the elimination is again roughly half as effective. In this case, the performance also splits the difference between all zeros and all ones: it falls almost exactly half-way between the two other cases.
So I don’t know what’s going on exactly. It seems like maybe only some fraction of lines are eligible for elimination, due to some unknown internal mechanism in the uarch.
Hardware Survey
Finally, here are the performance results (same as Figure 1) on a variety of other Intel and AMD x86 architectures, as well as IBM’s POWER9 and Amazon’s Graviton 2 ARM processor, one per tab.
Some observations on these results:
- The redundant write optimization isn’t evident in the performance profile of any of the other non-SKL hardware tested, not even closely related Intel hardware like Haswell or Skylake-X. I also did a few spot tests with performance counters, and didn’t see any evidence of a reduction in writes. So for now this might be a Skylake-client-only thing (of course, Skylake client is perhaps the most widely deployed Intel uarch ever due to the many identical-uarch-except-in-name variants: Kaby Lake, Coffee Lake, etc, etc). Note that the Skylake-S result here is for a different (desktop i7-6700) chip than the rest of this post, so we can at least confirm this occurs on two different chips.
- Except in the RAM region, Sandy Bridge throughput is half of its successors: a consequence of having only a 16-byte load/store path in the core, despite supporting 32-byte AVX instructions.
- AMD Zen2 has excellent write performance in the L2 and L3 regions. All of the Intel chips drop to about half throughput for writes in the L2: slightly above 16 bytes per cycle (around 50 GB/s for most of these chips). Zen2 maintains its L1 throughput and in fact has its highest results in L2: over 100 GB/s. Zen2 also manages more than 70 GB/s in the L3, much better than the Intel chips, in this test.
- Both Cannon Lake and Skylake-X exhibit a fair amount of inter-sample variance in the L2 resident region. My theory here would be prefetcher interference which behaves differently than earlier chips, but I am not sure.
- Skylake-X, with a different L3 design than the other chips, has quite poor L3 fill throughput, about half of contemporary Intel chips, and less than a third of Zen2.
- The POWER9 performance is neither terrible nor great. The most interesting part is probably the high L3 fill throughput: L3 throughput is as high or higher than L1 or L2 throughput, but still not in Zen2 territory.
- Amazon’s new Graviton processor is very interesting. It seems to be limited to one 16-byte store per cycle24, giving it a peak possible store throughput of 40 GB/s, so it doesn’t do well in the L1 region versus competitors that can hit 100 GB/s or more (they have both higher frequency and 32 byte stores), but it sustains the 40 GB/s all the way to RAM sizes, with a RAM result flat enough to serve drinks on, and this on a shared 64-CPU host where I paid for only a single core25! The RAM performance is the highest out of all hardware tested.
You might notice that Ice Lake, Intel’s newest microarchitecture, is missing from this list: that’s because there is a whole separate post on it.
Further Notes
Here’s a grab bag of notes and observations that don’t get their own full section, don’t have any associated plots, etc. That doesn’t mean they are less important! Don’t say that! These ones matter too, they really do.
- Despite any impressions I may have given above: you don’t need to fully overwrite a cache line with zeros for this to kick in, and you can even write non-zero values if you overwrite them “soon enough” with zeros. Rather, the line must initially be fully zero, and then all the escaping26 writes must be zero. Another way of thinking about this is that what matters is the value of the cache line written back, as well as the old value of the line that the writeback is replacing: these must both be fully zero, but that doesn’t mean that you need to overwrite the line with zeros: any locations not written are still zero “from before”. I directly test this in the one_per0 and one_per1 tests in the benchmark. These write only a single int value in each cache line, leaving the other values unchanged. In that benchmark the optimization triggers in exactly the same way when writing a single zero.
- Although we didn’t find evidence of this happening on other x86 hardware, nor on POWER or ARM, that doesn’t mean it isn’t or can’t happen. It is possible the conditions for it to happen aren’t triggered, or that it is happening but doesn’t make a difference in performance. Similarly for the POWER9 and ARM chips: we didn’t check any performance counters, so maybe the same thing is happening but it just doesn’t make any difference in performance. That’s especially feasible in the ARM case, where the performance is totally limited by the core-level 16-bytes-per-cycle write throughput: even if all writes are eliminated later on in the path to memory, we expect the performance to be the same.
- We could learn more about this effect by setting up a test which writes ones first to a region of some fixed size, then overwrites it with zeros, and repeats this in a rolling fashion over a larger buffer. This test basically lets the 1s escape to a certain level of the cache hierarchy, and seeing where the optimization kicks in will tell us something interesting.
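A sketch of that proposed test, reusing the fill routine from earlier (the rolling_fill name and the window parameter are hypothetical):

#include <cstddef>

void fill_int(int* buf, size_t size, int val);   // as earlier

// Roll a window over a large buffer: dirty it with ones, then restore it to
// all-zero, so the ones escape only as far as whatever level the window spills out of.
void rolling_fill(int* buf, size_t total, size_t window) {
    for (size_t start = 0; start + window <= total; start += window) {
        fill_int(buf + start, window, 1);   // ones escape up to some cache level
        fill_int(buf + start, window, 0);   // then the lines become all-zero again
    }
}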
Wrapping Up
Findings
Here’s a brief summary of what we found. This will be a bit redundant if you’ve just read the whole thing, but we need to accommodate everyone who just skipped down to this part, right?
- Intel chips can apparently eliminate some redundant stores when zeros are written to a cache line that is already all zero (the line doesn’t need to be fully overwritten).
- This optimization applies at least as early as L2 writeback to L3, so it applies to the extent that working sets don’t fit in L2.
- Depending on the working set size, the effect eliminates both write accesses to the L3 and writes to memory.
- For the pure store benchmark discussed here, the effect of this optimization is a reduction in the number of writes of ~63% (to L3) and ~50% (to memory), with a runtime reduction of between 15% and 20%.
- It is unclear why not all redundant zero-over-zero stores are eliminated.
Tuning “Advice”
So is any of this actually useful? Can we use this finding to quadruple the speed of the things that really matter in computation: tasks like bitcoin mining, high-frequency trading and targeting ads in real time?
Nothing like that, no – but it might provide a small boost for some cases.
Many of those cases are probably getting the benefit without any special effort. After all, zero is already a special value: it’s how memory arrives from the operating system, and it’s the value newly allocated objects get in some languages. So a lot of cases that could get this benefit probably already are.
Redundant zero-over-zero probably isn’t as rare as you might think either: consider that in low level languages, memory is often cleared after receiving it from the allocator, but in many cases this memory came directly from the OS so it is already zero27. Consider also cases like fairly-sparse matrix multiplication: where your matrix isn’t sparse enough to actually use dedicated sparse routines, but still has a lot of zeros. In that case, you are going to be writing 0 all the time in your final result and scratch buffers. This optimization will reduce the writeback traffic in that case.
If you are making a ton of redundant writes, the first thing you might want to do is look for a way to stop doing that. Beyond that, we can list some ways you might be able to take advantage of this new behavior:
- In the case you are likely to have redundant writes, prefer zero as the special value that is likely to be redundantly overwritten. For example if you are doing some blind writes, something like card marking where you don’t know if your write is redundant, you might consider writing zeros, rather than writing non-zeros, since in the case that some region of card marks gets repeatedly written, it will be all-zero and the optimization can apply. Of course, this cuts the wrong way when you go to clear the marked region: now you have to write non-zero so you don’t get the optimization during clearing (but maybe this happens out of line with the user code that matters). What ends up better depends on the actual write pattern.
- In case you might have redundant zero-over-zero writes, pay a bit more attention to 64-byte alignment than you normally would, because this optimization only kicks in when a full cache line is zero. So if you have some 64-byte structures that might often be all zero (but with non-zero neighbors), a forced 64-byte alignment will be useful since it would activate the optimization more frequently (see the sketch after this list).
- Probably the most practical advice of all: just keep this effect in mind because it can mess up your benchmarks and make you distrust performance counters. I found this when I noticed that the scalar version of a benchmark was writing 2x as much memory as the AVX version, despite them doing the same thing other than the choice of registers. As it happens, the dummy value I was storing in the vector register was zero, while in the scalar case it wasn’t: so there was a large difference that had nothing to do with scalar vs vector, but with non-zero vs zero instead. Prefer non-zero values in store microbenchmarks, unless you really expect them to be zero in real life!
- Keep an eye out for a more general version of this optimization: maybe one day we’ll see this effect apply to redundant writes that aren’t zero-over-zero.
- Floating point has two zero values: +0 and -0. The representation of +0 is all-bits-zero, so using +0 gives you the chance of getting this optimization. Of course, everyone is already using +0 whenever they explicitly want zero.
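For the alignment suggestion above, a hypothetical example (the structure and its contents are made up):

#include <cstdint>

// Force each often-all-zero record onto its own 64-byte cache line, so a zero
// record means a fully zero line and the elimination has a chance to apply.
struct alignas(64) CardGroup {
    uint8_t marks[64];   // frequently all zero; no non-zero neighbor shares the line
};

static_assert(sizeof(CardGroup) == 64, "one cache line per record");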
Of course, the fact that this seems to currently only apply on Skylake and Ice Lake client hardware makes specifically targeting this quite dubious indeed.
Thanks
Thanks to Daniel Lemire who provided access to the hardware used in the Hardware Survey part of this post.
Thanks Alex Blewitt and Zach Wegner who pointed out the CSS tab technique (I used the one linked in the comments of this post) and others who replied to this tweet about image carousels.
Thanks to Tarlinian, 不良大脑的所有者, Bruce Dawson, Zach Wegner and Andrey Penechko who pointed out typos or omissions in the text.
Discussion and Feedback
Leave a comment below, or discuss on Twitter, Hacker News, reddit or RWT.
Feedback is also warmly welcomed by email or as a GitHub issue.
-
Specifically, I was running uarch-bench.sh --test-name=memory/bandwidth/store/* from uarch-bench. ↩
-
Like many posts on this blog, what follows is essentially a reconstruction. I encountered the effect originally in a benchmark, as described, and then worked backwards from there to understand the underlying effect. Then, I wrote this post the other way around: building up a new benchmark to display the effect … but at that point I already knew what we’d find. So please don’t think I just started writing the benchmark you find on GitHub and then ran into this issue coincidentally: the arrow of causality points the other way. ↩
-
Probably? I don’t like to assume too much about the reader, but this seems like a fair bet. ↩
-
Starting with Ice Lake, it seems like Intel has implemented a constant-time integer divide unit. ↩
-
Latency-wise, something like 4-5 cycles for an L1 hit, versus 200-500 cycles for a typical miss to DRAM. Throughput wise there is also a very large gap (256 GB/s L1 throughput per core on a 512-bit wide machine versus usually less than < 100 GB/s per socket on recent Intel). ↩
-
Is it deployed anywhere at all on x86? Ping me if you know. ↩
-
It’s hard to say which is faster if they are compiled as written: x86 has indexed addressing modes that make the indexing more or less free, at least for arrays of element size 1, 2, 4 or 8, so the usual arguments against indexed access mostly don’t apply. Probably, it doesn’t matter: this detail might have made a big difference 20 years ago, but it is unlikely to make a difference on a decent compiler today, which can transform one into the other, depending on the target hardware. ↩
-
For benchmarking purposes, we wrap this in another function so we can slap a noinline attribute on it, to ensure that we have a single non-inlined version to call for different values. If we just called std::fill with a literal int value, it would be highly likely to get inlined at the call site and we’d have code with different alignment (and possibly other differences) for each value. ↩
-
Admittedly I didn’t go line-by-line through the long vectorized version produced by clang, but the line count is identical and if you squint so the assembly is just a big green and yellow blur, they look the same… ↩
-
There are 27 samples total at each size: the first 10 are discarded as warmup and the remaining 17 are plotted. ↩
-
The main problem with error bars is that most performance profiling results, and especially microbenchmarks, are mightily non-normal in their distribution, so displaying an error bar based on a statistic like the variance is often highly misleading. ↩
-
The ~ is there in ~256 KiB because, unless you use huge pages, you might start to see L2 misses even before 256 KiB: a 256 KiB buffer that is only virtually contiguous is not necessarily well behaved in terms of evictions, since it depends how those 4k pages are mapped to physical pages. As soon as you get too many 4k pages mapping to the same group of sets, you’ll see evictions even before 256 KiB. ↩
-
It is worth noting28 that this performance variation with buffer size isn’t exactly inescapable. Rather, it is just a consequence of poor remainder handling in the compiler’s auto-vectorizer. An approach that would be much faster and generate much less code to handle the remaining elements would be to do a final full-width vector store but aligned to the end of the buffer. So instead of doing up to 7 additional scalar stores, you do one additional vector store (and suffer up to one fewer branch misprediction for random lengths, since the scalar outro involves conditional jumps). ↩
-
Those melty bits where the pattern gets all weird, in the middle and near the right side are not random artifacts: they are consistently reproducible. I suspect a collision in the branch predictor history. ↩
-
This behavior is interesting and a bit puzzling. There are several reasons why you might want to do a non-silent eviction. (1a) would be to keep the L3 snoop filter up to date: if the L3 knows a core no longer has a copy of the line, later requests for that line can avoid snooping the core and are some 30 cycles faster. (1b) Similarly, if the L3 wants to evict this line, this is faster if it knows it can do it without writing back, versus snooping the owning core for a possibly modified line. (2) Keeping the L3 LRU more up to date: the L3 LRU wants to know which lines are hot, but most of the accesses are filtered through the L1 and L2, so the L3 doesn’t get much information – a non-silent eviction can provide some of the missing info (3) If the L3 serves as a victim cache, the L2 needs to write back the line for it to be stored in L3 at all. SKX L3 actually works this way, but despite being a very similar uarch, SKL apparently doesn’t. However, one can imagine that on a miss to DRAM it may be advantageous to send the line directly to the L2, updating the L3 tags (snoop filter) only, without writing the data into L3. The data only gets written when the line is subsequently evicted from the owning L2. When lines are frequently modified, this cuts the number of writes to L3 in half. This behavior warrants further investigation. ↩ ↩2
-
You’ve already seen in Fig. 1 that there is little inter-sample variation, and this keeps the noise down. You can always check the raw data if you want the detailed view. ↩
-
This shows that the L2 is a write-back cache, not write-through: modified lines can remain in L2 until they are evicted, rather than immediately being written to the outer levels of the memory hierarchy. This type of design is key for high store throughput, since otherwise the long-term store throughput is limited to the bandwidth of the slowest write-through cache level. ↩
-
I say the L2 because the behavior is already reflected in the L2 performance counters, but it could be teamwork between the L2 and other components, e.g., the L3 could say “OK, I’ve got that line you RFO’d and BTW it is all zeros”. ↩
-
Although only stores appear in the source, at the hardware level this benchmark does at least as many reads as stores: every store must do a read for ownership (RFO) to get the current value of the line before storing to it. ↩
-
Eagle-eyed readers, all two of them, might notice that the performance in the L3 region is different than the previous figure: here the performance slopes up gradually across most of the L3 range, while in the previous test it was very flat. Absolute performance is also somewhat lower. This is a testing artifact: reading the uncore performance counters necessarily involves a kernel call, taking over 1,000 cycles versus the < 100 cycles required for rdpmc to measure the CPU performance counters needed for the prior figure. Due to “flaws” (laziness) in the benchmark, this overhead is captured in the shown performance, and larger regions take longer, meaning that this fixed measurement overhead has a smaller relative impact, so you get this measured = actual - overhead/size type effect. It can be fixed, but I have to reboot my host into single-user mode to capture clean numbers, and I am feeling too lazy to do that right now, although as I look back at the size of the footnote I needed to explain it I am questioning my judgement. ↩
-
On SKL client CPUs we can do this with the uncore_imc/data_writes/ events, which poll internal counters in the memory controller itself. This is a socket-wide event, so it is important to do this measurement on as quiet a machine as possible. ↩
-
I tried a bunch of other stuff that I didn’t write up in detail. Many of them affect the behavior: we still see the optimization but with different levels of effectiveness. For example, with L2 prefetching off, only about 40% of the L2 evictions are eliminated (versus > 60% with prefetch on), and the performance difference between the two cases is close to zero despite the large number of eliminations. I tried other sizes of writes, and with narrower writes the effect is reduced until it is eliminated at 4-byte writes. I don’t think the write size directly affects the optimization, but rather narrower writes slow down the maximum possible performance, which interacts in some way with the hardware mechanisms that support this to reduce how often it occurs (a similar observation could apply to prefetching). ↩
-
By outbound queues I mean the path between an inner and outer cache level. So for the L2, the outbound bus is the so-called superqueue that connects the L2 to the uncore and L3 cache. ↩
-
The Graviton 2 uses the Cortex A76 uarch, which can execute 2 stores per cycle, but the L1 cache write ports limit sustained execution to only one 128-bit store per cycle. ↩
-
It was the first full day of general availability for Graviton, so perhaps these hosts are very lightly used at the moment because it certainly felt like I had the whole thing to myself. ↩
-
By escaping I mean a store that visibly gets to the cache level where this optimization happens. For example, if I write a 1 immediately followed by a 0, the 1 will never make it out of the L1 cache, so from the point of view of the L2 and beyond only a zero was written. I expect the optimization to still trigger in this case. ↩
-
This phenomenon is why calloc is sometimes considerably faster than malloc + memset. With calloc the zeroing happens within the allocator, and the allocator can track whether the memory it is about to return is known zero (usually because the block is fresh from the OS, which always zeros memory before handing it out to userspace), and in the case of calloc it can avoid the zeroing entirely (so calloc runs as fast as malloc in that case). The client code calling malloc doesn’t receive this information and can’t make the same optimization. If you stretch the analogy almost to the breaking point, one can see what Intel is doing here as “similar, but in hardware”. ↩
-
Ha! To me, everything is “worth noting” if it means another footnote. ↩
Comments
is any of actually useful how memory arrives comes from the operating system
On point 27., the memory you get from the OS is initially mapped to the shared zero page, but on the first access violation (non-zero write!), you get an interrupt which spends about half the time in updating the page tables, and the other half in performing a memset to 0. (Which by the way happens with a rep stos, not an AVX loop.)

The freshly mapped pages may have been already zeroed out, but not always are, and unless the OS would be zeroing eagerly, it wouldn’t know either.
For that interrupt, this is (or was…) an essential optimization. Partly because newer Intel CPUs have a failed design, where non-“thread”-safe interrupts (such as a page fault) must use a spin lock as the only possible safeguard against a corrupted page table, but too many cores in the spinlock burn up all the cache bandwidth and thereby slow down the work in the critical section. In summary, you may have 1TB of RAM, and a couple dozen cores, but if you ain’t careful, all cores get trapped in the interrupt while you can’t page fault at a rate of more than 10-15GB/s.
I agree with your description of how the zero page allocation happens[1], but it’s not inconsistent with anything in the footnote, is it?
The footnote talks about the userspace side of things: the userspace allocator (e.g., in libc) gets “effectively” zeroed memory from the OS, regardless of how that happens. Now if malloc was used by the application code, the caller can’t assume the returned region is zero, because malloc doesn’t in general return zero memory: the (userspace) allocator may have either recycled an existing free region (not zero) or returned fresh memory obtained from the OS (zero). So the caller must zero the memory, which may be redundant if the pages were known to be zero to the allocator.

In calloc, the zeroing happens inside the (userspace) allocator, which knows where it got the pages from, and hence whether they are guaranteed zero. So in the case of fresh memory from the OS, it can skip the zeroing.

[1] Although it’s configurable so not everyone is using the zero page, especially outside of x86, and some have argued that this whole zero-page thing is worse than the simple alternative: the benefit is that “sparse” mappings in which most pages are never written to, but are read, are efficient: they need only one physical page for every written page, but the downside is you need an extra fault for every page. Since most pages are written after allocation, is it even a win?
Quick question regarding Graviton performance. ARMv8 has an explicit instruction to zero out a cache line (DCZVA). Do you know if std::fill or memset have a short circuit for the case where the value to fill is zero which defaults to using DCZVA. I assume it would improve zero-fill significantly, although obviously, it is not a dynamic feature.
I doubt std::fill has a special case itself, because the standard library implementations rarely seem to use platform specific code to improve performance, relying instead on compiler transformations. Now, the compiler can transform std::fill to memset, and memset does have a DCZVA path in glibc at least!

So it would be interesting to see how much difference this makes.
Regarding:
and
If you didn’t have the L2 prefetchers off, the 40 - 60% success rate seems to correspond somewhat reasonably with the fact that the L2 spatial and streamer prefetchers will attempt to prefetch a 128 byte cache line pair, and if this was indeed what was causing that number (say something like only the 128 byte aligned line could be “store-eliminated”) then that might explain why the alt01 version didn’t have any effect.

I think your next blog post’s data on icelake about how ymm fill beats zmm fill might also support your theory that the “store-elimination” is done opportunistically when there is heavy store pressure, i.e. you get 2x the writes with ymm, so if your theory is correct twice the pressure would induce more efforts for “store-elimination”.

Mostly bullshitting but it kind of lines up :)
Tested this. No luck.
Hi Noah,
Thanks for your comment!
The results shown have prefetching on, but I did also try with prefetching off and some other variations as discussed briefly in a footnote. In fact, I found less elimination with prefetching off.
Nice writeup
This is a super interesting effect, very nice find! I took the liberty of adding it to my HW effects repository (https://github.com/kobzol/hardware-effects/tree/master/hardware-store-elimination).
Having a 512-wide arithmetic op–even just a wide OR–lying around in the cache hierarchy outside of a processor seems rather peculiar to me. There also aren’t any arithmetic instruction ports used on a vector store (per Agner Fog’s tables), which raises the question of where the “is zero” bit is being calculated. To speculate and conjecture wildly for a moment… I think the most natural place for an “is zero” bit to come from is use of the (renamed) zero register in either the integer or vector register file. In other words, there’s a little heuristic piece of hardware in the store port that detects if you’ve just covered a cacheline with writes from the zero register, and sets an “is zero” bit on the last write if so. My prediction, then, is that if you “trick” a register into being zero, rather than performing operations that may have it renamed to the zero register, you’ll see the effect go away. One way I can think of to do this would be to add three static vectors that you know in advance sum to all zeros.
I am not sure we are on the same page wrt the costs of a 512-bit wide compare-to-zero. I am not an ASIC designer (I feel like I’m pointing this out almost every day, now!), but I believe the cost of such a comparator is quite small in the overall scheme of things. Things like 128-bit (result size) integer multipliers are probably a couple of orders of magnitude larger, and they still seem to take up vanishingly small portions of the chip if you look at die shots. So I don’t think having a 512-bit wide comparator somewhere in the cache hierarchy is a problem.
Keep in mind, this is still very much inside the processor! The whole cache hierarchy we are talking about is inside the processor, and the L2 is right inside the core. So the cache is implemented in very much the same logic process as the core itself, so there is particular problem putting a comparator in there.
The test already does this: if you look at the assembly, no zeroing idiom is being used, and the function is compiled separately and so is generic for all values. The function takes an integer register, and broadcasts it to a vector one, so the zeroness of the register is lost (unless you think there is a special case for broadcast from a zero integer register, which I think is very unlikely). Still, I did double check and changed the code to read from a volatile int zero = 0 variable, and checked the assembly to confirm that the zero now comes from a memory read, not a zeroing idiom in an integer register, and I get the same results.

Apart from the actual results, I see at least a couple of problems with the suggested mechanism:
1) I think it would massively reduce the applicability of the optimization, because a lot of zeroing is done with not-zero-idiom initialized registers. I think also the zeroed-by-idiom state is lost after e.g., a context switch.
2) The thing you need to track is the zeroness of the entire 64-byte line. Constantly checking and updating this state on stores of smaller values seems both expensive and difficult. For example when you write a zero to a line, you need know if the rest of the line was zero to update the all zero state. Also, you might write the line millions of times before it is ever evicted from the L1, so you update the status millions of times when only the final state is useful.
Calculating this property at the moment (or nearby) to when the line is actually being evicted, as Intel seems to do, strikes me as a better strategy.
You really should run those tests in parallel, because server CPUs are designed to run workloads that use more than one core. It’s neat that Graviton can drive 40 GB/s from a single core, but that’s irrelevant unless you actually care about workloads where only one core of many is moving data.
Are you under the mistaken impression that I’m trying to do a server workload comparison or something like that?
I’m not. I’m running these tests only to probe this specific uarch effect and single-thread is ideal for that. As a side effect, I made some notes on single threaded performance, but I’m not doing multi-threaded numbers here because none of that is the point.
That said, single thread memory bandwidth is pretty important for server loads too: because it usually predicts well for 2, 3, 4 threads, etc. Not all server loads are pushing memory to the limit on every thread (in fact, I’d argue most don’t)!
In any case, using all the cores produces quite boring results on any decent server, whether it’s AMD, Intel or ARM: you reach some significant fraction of the implied maximum memory bandwidth, unless you screw up NUMA placement or something. Graviton is no different: IIRC it gets within 85% or 90% of the peak, based on tests run by others. You probably don’t want me to run those again – this is about a cool thing Intel is doing in their memory subsystem, not a server shootout.
Re: footnote 8: indeed, std::fill(0) vs. std::fill(1) with compile-time constant values can make very different code: https://stackoverflow.com/q/42558907/224132

std::fill(0) gets inlined as a REP MOVSB on some GCC versions, but GCC doesn’t do memset pattern recognition for patterns wider than 1 byte and can only auto-vectorize fill(1) the normal way. (Like for a runtime-variable value).
gcc recognizes fill-with-same-bytes since gcc-4.9, see const_with_all_bytes_same in gcc/tree-loop-distribution.c and this CE example: https://godbolt.org/z/Neo9tN
Hi Peter, nice to see you here and thanks for your comment!
Very interesting. Thanks for the article!
Fabalous write-up. Thanks! :+1: