Performance Matters, a blog about low-level software and hardware performance, by Travis Downs (travis.downs@gmail.com).

Your CPU May Have Slowed Down on Wednesday (2021-06-17, https://travisdowns.github.io/blog/2021/06/17/rip-zero-opt)<!-- boilerplate
page.assets: /assets/rip-zero-opt
assetpath: /assets/rip-zero-opt
tablepath: /misc/tables/rip-zero-opt
-->
<h2 id="a-strange-performance-effect">A Strange Performance Effect</h2>
<p>The plot below shows the throughput of filling a region of the given size (varying on the x-axis) with zeros<sup id="fnref:stdfill" role="doc-noteref"><a href="#fn:stdfill" class="footnote" rel="footnote">1</a></sup> on Skylake (and Ice Lake in the second tab).</p>
<p>The two series were generated under apparently identical conditions: the same binary on the same machine. Only the date the benchmark was run varies. That is, on Monday, June 7th, filling with zeros is substantially faster than the same benchmark on Wednesday, at least when the region no longer fits in the L2 cache<sup id="fnref:whitelie" role="doc-noteref"><a href="#fn:whitelie" class="footnote" rel="footnote">2</a></sup>.</p>
<div class="tabs" id="tabs-fig1">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig1-1" name="tab-group-fig1" checked="" />
<label class="tab-label" for="tab-fig1-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/rip-zero-opt/skl/fig1.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/post3/skl-combined/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/rip-zero-opt/skl/fig1.html">
<img class="figimg" src="/assets/rip-zero-opt/skl/fig1.svg" alt="Figure 1: A chart of region size (x-axis) versus fill throughput (y-axis) with two series, Monday and Wednesday, with the Wednesday series showing worse performance in L3 and RAM" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig1-2" name="tab-group-fig1" />
<label class="tab-label" for="tab-fig1-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/rip-zero-opt/icl/fig1.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/post3/icl-combined/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/rip-zero-opt/icl/fig1.html">
<img class="figimg" src="/assets/rip-zero-opt/icl/fig1.svg" alt="Figure 1: A chart of region size (x-axis) versus fill throughput (y-axis) with two series, Monday and Wednesday, with the Wednesday series showing worse performance in L3 and RAM" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<h3 id="hump-day-strikes-back">Hump Day Strikes Back</h3>
<p>What’s going on here? Are my Skylake and Ice Lake hosts simply work-weary by Wednesday and don’t put in as much effort? Is there a new crypto-coin based on who can store the most zeros and this is a countermeasure to avoid ballooning CPU prices in the face of this new workload?</p>
<p>Believe it or not, it is none of the above!</p>
<p>These hosts run Ubuntu 20.04 and on Wednesday June 9th an update to the <a href="https://launchpad.net/ubuntu/+source/intel-microcode">intel-<abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr></a> OS package was released. After a reboot<sup id="fnref:reboot" role="doc-noteref"><a href="#fn:reboot" class="footnote" rel="footnote">3</a></sup>, this loads the CPU with new <em><abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr></em> (released a day earlier by Intel) that causes the behavior shown above. Specifically, this <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr><sup id="fnref:versions" role="doc-noteref"><a href="#fn:versions" class="footnote" rel="footnote">4</a></sup> disables the <a href="/blog/2020/05/13/intel-zero-opt.html">hardware zero store</a> optimization we discussed in a previous post. It was disabled to mitigate <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-24512">CVE-2020-24512</a> further described<sup id="fnref:barely" role="doc-noteref"><a href="#fn:barely" class="footnote" rel="footnote">5</a></sup> in Intel security advisory <a href="https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00464.html">INTEL-SA-00464</a>.</p>
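<p>If you want to check which microcode revision your CPU is currently running, each logical CPU reports it in <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code> on Linux. Here is a minimal sketch of a parser for that output (the helper name is mine, not part of any tool mentioned in this post):</p>

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: extract the microcode revision from
// /proc/cpuinfo-style text, where each logical CPU reports a line
// like "microcode       : 0xea". Returns "" if no such line exists
// (as may happen inside some VMs).
std::string microcode_revision(const std::string& cpuinfo) {
    std::istringstream in(cpuinfo);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("microcode", 0) != 0) continue;  // must start the line
        auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        auto start = line.find_first_not_of(" \t", colon + 1);
        return start == std::string::npos ? "" : line.substr(start);
    }
    return "";
}
```

<p>On a live system you would feed this the contents of <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code>; comparing the value before and after a reboot is enough to spot an update like the <code class="language-plaintext highlighter-rouge">0xe2</code> to <code class="language-plaintext highlighter-rouge">0xea</code> change discussed below.</p>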
<p>To be clear, I don’t know <em>for sure</em> that the <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> disables the zero store optimization – but the evidence is rather overwhelming. After the update, filling with zeros performs the same as filling with any other value, and the performance counters tracking L2 evictions suggest that substantially all evictions are now non-silent (recall from the previous posts that silent evictions were a hallmark of the optimization).</p>
<p>Although I suspect the performance impact will be minuscule on average<sup id="fnref:impact" role="doc-noteref"><a href="#fn:impact" class="footnote" rel="footnote">6</a></sup>, this surprise still serves as a reminder that raw CPU performance can <em>silently</em> change due to <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> updates, and most Linux distributions and modern Windows have these updates enabled by default. We’ve <a href="/blog/2019/03/19/random-writes-and-microcode-oh-my.html">seen this before</a>. If you are trying to run reproducible benchmarks, you should always re-run your <em>entire</em> suite in order to make accurate comparisons, even on the same hardware, rather than just re-running the parts you think have changed.</p>
<h3 id="reproduction">Reproduction</h3>
<p>The code to collect this performance data and reproduce my results is available in <a href="https://github.com/travisdowns/zero-fill-bench/tree/post3">zero-fill-bench</a>, with some <a href="https://github.com/travisdowns/zero-fill-bench#third-post-rip-zero-store-optimization">instructions</a> in the README.</p>
<h3 id="mea-culpa-and-an-unsustainable-path">Mea Culpa and an Unsustainable Path</h3>
<p>In writing the earlier blog entries on this topic, I was interested in the <em>performance</em> aspects of this optimization, not its potential as an attack vector. However, merely by observing (and publishing) the results, the optimization was affected: <a href="https://en.wikipedia.org/wiki/Measurement_problem">the system under measurement changed as a result of the observation</a>. I can’t be sure that the optimization wouldn’t have eventually been disabled anyway, but it does seem that my earlier post is the reason this change happened now.</p>
<p>I am not convinced that removing any optimization which can be used in a timing-based side channel is sustainable. I am not sure this is a thread you want to keep pulling on: practically <em>every</em> aspect of a modern CPU can vary in performance and timing based on internal state<sup id="fnref:power" role="doc-noteref"><a href="#fn:power" class="footnote" rel="footnote">7</a></sup>. Trying to draw the security boundaries tightly around co-located entities (e.g., processes on the same CPU, especially on the same core), without allowing any leaks seems destined to fail without a complete overhaul of CPU design, likely at the cost of a large amount of performance. There are just too many holes to plug.</p>
<p>I hope that once the wave of vulnerabilities and disclosures that started with Meltdown and Spectre begins to recede, we can start to work on a measured approach to classifying and mitigating timing and other side-channel attacks. This could start by enumerating which performance characteristics are reasonably guaranteed to hold, and which aren’t. For example, it could be specified whether memory access timing may vary based on the <em>value</em> accessed. If such variation is permitted, the zero store optimization would be allowed.</p>
<p>In any case, I still plan to write about performance-related microarchitectural details. I just hope this outcome does not repeat itself.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to JN, Chris Martin, Jonathan and m_ueberall for reporting or fixing typos in the text.</p>
<p>Stone photo by Colin Watts on Unsplash.</p>
<h3 id="discussion-and-feedback">Discussion and Feedback</h3>
<p>You can join the discussion on <a href="https://twitter.com/trav_downs/status/1407110595761950720">Twitter</a>, <a href="https://news.ycombinator.com/item?id=27588258">Hacker News</a> or <a href="https://www.reddit.com/r/intel/comments/o5b9gg/your_intel_cpu_may_have_slowed_down_on_wednesday/">r/Intel</a>.</p>
<p>If you have a question or any type of feedback, you can leave a <a href="#comment-section">comment below</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:stdfill" role="doc-endnote">
<p>Specifically, it uses <a href="https://en.cppreference.com/w/cpp/algorithm/fill"><code class="language-plaintext highlighter-rouge">std::fill</code></a> with a zero argument, with some inlining prevention, which ultimately results in a fill which uses a series of 32-byte vector loads and stores to store 256 bytes per unrolled iteration, with a loop body like this:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0x20</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0x40</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0x60</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0x80</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0xa0</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0xc0</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0xe0</span><span class="p">],</span><span class="nv">ymm1</span>
</code></pre></div> </div>
<p>So the compiler does a good job: you can’t ask for much better than that. <a href="#fnref:stdfill" class="reversefootnote" role="doc-backlink">↩</a></p>
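<p>A minimal sketch of that setup (my paraphrase; the real harness is in zero-fill-bench): the fill is isolated in a function whose inlining is prevented, so the compiler cannot see that the value is always zero and specialize the fill (e.g., into a <code class="language-plaintext highlighter-rouge">memset</code> call), and instead emits a generic vectorized loop like the one above.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// noinline keeps the constant 0 out of sight of the optimizer at the
// call site, so the same generic fill loop runs for every value.
__attribute__((noinline))
void fill_region(uint8_t* buf, size_t size, uint8_t value) {
    // With -O2 and AVX2 enabled, gcc typically vectorizes this into
    // unrolled 32-byte vmovdqu stores, as shown above.
    std::fill(buf, buf + size, value);
}
```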
</li>
<li id="fn:whitelie" role="doc-endnote">
<p>I’m doing a bit of a retcon here. The effect is present as described on the dates given, and I observed and benchmarked it on that Wednesday myself, but the specific data series used for the plots were generated a week later, when I had time to collect the data properly in a relatively noise-free environment. So the two series were collected back-to-back on the same day, varying only the hidden parameter you’ll learn about two paragraphs from now. <a href="#fnref:whitelie" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:reboot" role="doc-endnote">
<p>To be clear, the <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> is not persistent, so it needs to be loaded on <em>every</em> boot. If you remove or downgrade the <code class="language-plaintext highlighter-rouge">intel-microcode</code> package, you’ll be back to an older <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> after the next boot. That is, unless you also update your BIOS which can <em>also</em> come with a <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> update: this will be persistent unless you downgrade your BIOS. <a href="#fnref:reboot" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:versions" role="doc-endnote">
<p>The new June 8th <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> versions are <code class="language-plaintext highlighter-rouge">0xea</code> for Skylake (versus <code class="language-plaintext highlighter-rouge">0xe2</code> previously) and <code class="language-plaintext highlighter-rouge">0xa6</code> for Ice Lake (versus <code class="language-plaintext highlighter-rouge">0xa0</code> previously). <a href="#fnref:versions" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:barely" role="doc-endnote">
<p><em>barely</em> <a href="#fnref:barely" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:impact" role="doc-endnote">
<p>The performance regression shown in the plots is close to a worst case: the benchmark only fills zeros and nothing else. Real code doesn’t spend <em>that much</em> time filling zeros, although zero <em>is</em> no doubt the dominant value in large block fills, at least because the OS must zero pages before returning them to user processes and memory-safe languages like Java will zero some objects and array types in bulk. <a href="#fnref:impact" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:power" role="doc-endnote">
<p>This observation becomes almost universal once you consider that the <em>values</em> involved in any operation affect power use (see e.g. <a href="https://arxiv.org/pdf/1905.12468.pdf">Schöne et al</a> or <a href="https://hal.inria.fr/hal-02401760/document">Cornebize and Legrand</a>). Since power use can be directly (e.g., RAPL or external measurements) or indirectly (e.g., because of heat-dependent frequency changes) observed, it means that in theory <em>any</em> operation, even those widely considered to be constant-time, may leak information. <a href="#fnref:power" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downs (travis.downs@gmail.com). The death of the hardware store optimization.

Ice Lake AVX-512 Downclocking (2020-08-19, https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq)<!-- boilerplate
page.assets: /assets/icl-avx512-freq
assetpath: /assets/icl-avx512-freq
tablepath: /misc/tables/icl-avx512-freq
-->
<p>This is a short post investigating the behavior of AVX2 and AVX-512 related <em>license-based downclocking</em> on Intel’s newest Ice Lake and Rocket Lake chips.</p>
<p>License-based downclocking<sup id="fnref:tiring" role="doc-noteref"><a href="#fn:tiring" class="footnote" rel="footnote">1</a></sup> refers to the <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">semi-famous</a> effect where lower-than-nominal frequency limits are imposed when certain <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions are executed, especially heavy floating point instructions or 512-bit wide instructions.</p>
<p>More details about this type of downclocking are available in <a href="https://stackoverflow.com/a/56861355">this StackOverflow answer</a>, and we’ve already <a href="/blog/2020/01/17/avxfreq1.html">covered in somewhat exhaustive detail</a> the low-level mechanics of these transitions. You can also find <a href="https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/">some guidelines</a> on how to make use of wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> given this issue<sup id="fnref:dmore" role="doc-noteref"><a href="#fn:dmore" class="footnote" rel="footnote">2</a></sup>.</p>
<p>All of those were written in the context of Skylake-SP (<abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>), the first generation of chips to support AVX-512.</p>
<p>So what about Ice Lake, the newest chips which support both the <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> flavor of AVX-512 and also have a <a href="https://branchfree.org/2019/05/29/why-ice-lake-is-important-a-bit-bashers-perspective/">whole host of new AVX-512 instructions</a>? Will we be stuck gazing longingly at these new instructions from afar while never being allowed to actually use them due to downclocking?</p>
<p>Read on to find out, or just skip to the <a href="#summary">end</a>. The original version of this post covered only Ice Lake; on March 28th, 2021 I updated it with a <a href="#rocket-lake">Rocket Lake section</a>.</p>
<h2 id="ice-lake-frequency-behavior">Ice Lake Frequency Behavior</h2>
<h3 id="avx-turbo">AVX-Turbo</h3>
<p>We will use the <a href="https://github.com/travisdowns/avx-turbo">avx-turbo</a> utility to measure the core count and instruction mix dependent frequencies for a CPU. This tool works in a straightforward way: it runs a given mix of instructions on a given number of cores, while measuring the frequency achieved during the test.</p>
<p>For example, the <code class="language-plaintext highlighter-rouge">avx256_fma_t</code> test – which measures the cost of <em>heavy</em> 256-bit instructions with high <abbr title="Instruction level parallelism: a measure of inter-instruction parallelism on a superscalar CPU">ILP</abbr> – runs the following sequence of FMAs:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nf">vfmadd132pd</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm1</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm3</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm5</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm6</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm7</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm8</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm9</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="c1">; repeat 10x for a total of 100 FMAs</span>
</code></pre></div></div>
<p>In total, we’ll use five tests that cover every combination of light and heavy 256-bit and 512-bit instructions, as well as scalar instructions (128-bit <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> behaves the same as scalar), using this command line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>avx-turbo --test=scalar_iadd,avx256_iadd,avx512_iadd,avx256_fma_t,avx512_fma_t
</code></pre></div></div>
<h3 id="ice-lake-results">Ice Lake Results</h3>
<p>I ran avx-turbo as described above on an Ice Lake i5-1035G4, which is the middle-of-the-range Ice Lake client CPU running at up to 3.7 GHz. The full output is <a href="https://gist.github.com/travisdowns/c53f40fc4dbbd944f5613eaab78f3189#file-icl-turbo-results-txt">hidden away in a gist</a>, but here are the all-important frequency results (all values in GHz):</p>
<table class="td-right">
<tbody>
<tr>
<th rowspan="2">Instruction Mix</th>
<th colspan="4">Active Cores</th>
</tr>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
</tr>
<tr>
<th>Scalar/128-bit</th>
<td>3.7</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<th>Light 256-bit</th>
<td>3.7</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<th>Heavy 256-bit</th>
<td>3.7</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<th>Light 512-bit</th>
<td>3.6</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<th>Heavy 512-bit</th>
<td>3.6</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
</tbody>
</table>
<p>As expected, maximum frequency decreases with active core count, but scan down each column to see the effect of instruction category. Along this axis, there is almost no downclocking at all! Only with a single active core is there any decrease with wider instructions, and it is a paltry 100 MHz: from 3,700 MHz to 3,600 MHz when any 512-bit instructions are used.</p>
<p>In any other scenario, including any time more than one core is active, or for heavy 256-bit instructions, there is <em>zero</em> license-based downclocking: everything runs as fast as scalar.</p>
<h4 id="license-mapping">License Mapping</h4>
<p>There’s another change here too. In <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, there are three licenses, or categories of instructions with respect to downclocking: L0, L1 and L2. Here, in client ICL, there are only two<sup id="fnref:visible" role="doc-noteref"><a href="#fn:visible" class="footnote" rel="footnote">3</a></sup>, and those don’t line up exactly with the three in <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>.</p>
<p>To be clearer, in <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> the licenses mapped to instruction width and weight as follows:</p>
<style>
.l0 {
background-color: hsl(118deg 96% calc(72% - var(--dark) * 55%));
}
.l1 {
background-color: hsl(63deg 100% calc(74% - var(--dark) * 59%));
}
.l2 {
background-color: hsl(2deg 92% calc(75% - var(--dark) * 44%));
}
</style>
<table>
<tbody>
<tr>
<th>Width</th>
<th>Light</th>
<th>Heavy</th>
</tr>
<tr>
<td>Scalar/128</td>
<td class="l0">L0</td>
<td class="l0">L0</td>
</tr>
<tr>
<td>256</td>
<td class="l0">L0</td>
<td class="l1">L1</td>
</tr>
<tr>
<td>512</td>
<td class="l1">L1</td>
<td class="l2">L2</td>
</tr>
</tbody>
</table>
<p>In particular, note that 256-bit heavy instructions have the same license as 512-bit light.</p>
<p>In ICL client, the mapping is:</p>
<table>
<tbody>
<tr>
<th>Width</th>
<th>Light</th>
<th>Heavy</th>
</tr>
<tr>
<td>Scalar/128</td>
<td class="l0">L0</td>
<td class="l0">L0</td>
</tr>
<tr>
<td>256</td>
<td class="l0">L0</td>
<td class="l0">L0</td>
</tr>
<tr>
<td>512</td>
<td class="l1">L1</td>
<td class="l1">L1</td>
</tr>
</tbody>
</table>
<p>Now, 256 heavy and 512 light are in different categories! In fact, the whole concept of light vs heavy doesn’t seem to apply here: the categorization is purely based on the width<sup id="fnref:onefma" role="doc-noteref"><a href="#fn:onefma" class="footnote" rel="footnote">4</a></sup>.</p>
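<p>The two mappings are small enough to state directly in code (a transcription of the tables above, not anything the hardware exposes):</p>

```cpp
enum class License { L0, L1, L2 };

// Skylake-SP: the license depends on both width and weight.
License skx_license(int width, bool heavy) {
    if (width <= 128) return License::L0;
    if (width == 256) return heavy ? License::L1 : License::L0;
    return heavy ? License::L2 : License::L1;  // width == 512
}

// Ice Lake client: weight is irrelevant; only 512-bit width pays.
License icl_license(int width, bool /*heavy*/) {
    return width == 512 ? License::L1 : License::L0;
}
```

<p>Note how <code class="language-plaintext highlighter-rouge">skx_license(256, true)</code> and <code class="language-plaintext highlighter-rouge">skx_license(512, false)</code> both land in L1, while on ICL client heavy 256-bit stays at L0 and light 512-bit moves to L1.</p>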
<h2 id="rocket-lake">Rocket Lake</h2>
<p>Rocket Lake (shortened as <abbr title="Intel's Rocket Lake architecture, aka 11th Generation Intel Core i3,i5,i7 and i9">RKL</abbr>, see <a href="https://en.wikipedia.org/wiki/Rocket_Lake">wikipedia</a> or <a href="https://en.wikichip.org/wiki/intel/microarchitectures/rocket_lake">wikichip</a> for more) is more-or-less a backport of the 10nm <abbr title="The 10nm microarchitecture used in Ice Lake CPUs.">Sunny Cove</abbr> microarchitecture to Intel’s highly-tuned workhorse<sup id="fnref:some" role="doc-noteref"><a href="#fn:some" class="footnote" rel="footnote">5</a></sup> 14nm process.</p>
<p>Edison Chan has graciously provided the output of running avx-turbo on his Rocket Lake i9-11900K, the top of the line Rocket Lake chip. The <a href="/assets/icl-avx512-freq/11900k-avx-freq-results.txt">full results</a> are available, but I’ve summarized the achieved frequencies in the following table.</p>
<p style="text-align: center"><strong>Rocket Lake i9-11900K Frequency Matrix</strong></p>
<table class="td-right">
<tbody>
<tr>
<th rowspan="2">Active Cores</th>
<th colspan="5" style="text-align:center">Instruction Mix</th>
</tr>
<tr>
<th>Scalar and 128</th>
<th>Light 256</th>
<th>Heavy 256</th>
<th>Light 512</th>
<th>Heavy 512</th>
</tr>
<tr>
<th>1 Core</th>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
</tr>
<tr>
<th>2 Cores</th>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
</tr>
<tr>
<th>3 Cores</th>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
</tr>
<tr>
<th>4 Cores</th>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
</tr>
<tr>
<th>5 Cores</th>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
</tr>
<tr>
<th>6 Cores</th>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
</tr>
<tr>
<th>7 Cores</th>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
</tr>
<tr>
<th>8 Cores</th>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
</tr>
</tbody>
</table>
<p>The results paint a very promising picture of Rocket Lake’s AVX-512 frequency behavior: there is <em>no</em> license-based downclocking evident at any combination of core count and instruction mix<sup id="fnref:rklcaveats" role="doc-noteref"><a href="#fn:rklcaveats" class="footnote" rel="footnote">6</a></sup>. Even heavy AVX-512 instructions can execute at the same frequency as lightweight scalar code.</p>
<p>In fact, the frequency behavior of this chip appears very simple: the full Turbo Boost 2.0 frequency<sup id="fnref:tb2" role="doc-noteref"><a href="#fn:tb2" class="footnote" rel="footnote">7</a></sup> of 5.1 GHz is available for any instruction mix with up to 4 active cores, then the speed drops to 4.9 GHz for 5 and 6 active cores, and finally to 4.8 GHz for 7 or 8 active cores. This means that with 8 active cores running AVX-512, you are still achieving 94% of the frequency observed for 1 active core running light instructions.</p>
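<p>Put another way, the entire frequency matrix collapses to a function of active core count alone (the values are read off the table above; this is measured behavior on one chip, not a documented Intel formula):</p>

```cpp
// Measured i9-11900K maximum turbo (GHz) by active core count,
// identical for every instruction mix from scalar to heavy AVX-512.
double rkl_11900k_turbo_ghz(int active_cores) {
    if (active_cores <= 4) return 5.1;
    if (active_cores <= 6) return 4.9;
    return 4.8;  // 7 or 8 active cores
}
```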
<h3 id="so-what">So What?</h3>
<p>Well, so what?</p>
<p>At least, it means we need to adjust our mental model of the frequency-related cost of AVX-512 instructions. Rather than the prior-generation verdict of “AVX-512 generally causes significant downclocking”, on these Ice Lake and Rocket Lake client chips we can say that AVX-512 causes insignificant license-based downclocking (usually none at all), and I expect this to be true on other ICL and <abbr title="Intel's Rocket Lake architecture, aka 11th Generation Intel Core i3,i5,i7 and i9">RKL</abbr> client chips as well.</p>
<p>Now, this adjustment of expectations comes with an important caveat: license-based downclocking is only <em>one</em> source of downclocking. It is also possible to hit power, thermal or current limits. Some configurations may only be able to run wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions on all cores for a short period of time before exceeding running power limits. In my case, the $250 laptop I’m testing this on has extremely poor cooling and rather than power limits I hit thermal limits (100°C limit) within a few seconds running anything heavy on all cores.</p>
<p>However, these other limits are qualitatively different than license based limits. They apply mostly<sup id="fnref:voltage" role="doc-noteref"><a href="#fn:voltage" class="footnote" rel="footnote">8</a></sup> in a <em>pay for what you use</em> way: if you use a wide or heavy instruction or two you incur only a microscopic amount of additional power or heat cost associated with only those instructions. This is unlike some license-based transitions where a core or chip-wide transition occurs that affects unrelated subsequent execution for a significant period of time.</p>
<p>Since wider operations are generally <em>cheaper</em> in power than an equivalent number of narrower operations<sup id="fnref:widenarrow" role="doc-noteref"><a href="#fn:widenarrow" class="footnote" rel="footnote">9</a></sup>, you can determine up-front that a wide operation is <em>worth it</em> – at least for cases that scale well with width. In any case, the problem is mostly local: it does not depend on the behavior of the surrounding code.</p>
<h3 id="summary">Summary</h3>
<p>Here’s what we’ve learned.</p>
<ul>
<li>The Ice Lake i5-1035 CPU exhibits only 100 MHz of license-based downclock with 1 active core when running 512-bit instructions, and <em>no</em> license downclock in any other scenario.</li>
<li>The Rocket Lake i9-11900K CPU doesn’t exhibit any license-based downclock in the tested scenarios.</li>
<li>The Ice Lake CPU has an all-core 512-bit turbo frequency of 3.3 GHz, which is 89% of the maximum single-core scalar frequency of 3.7 GHz, so within power and thermal limits this chip has a very “flat” frequency profile. The Rocket Lake 11900K is even flatter, with an all-eight-cores frequency of 4.8 GHz clocking in at 94% of the 5.1 GHz single-core speed.</li>
<li>Unlike <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, this Ice Lake chip does not distinguish between “light” and “heavy” instructions for frequency scaling purposes: FMA operations behave the same as lighter operations.</li>
</ul>
<p>So on ICL and <abbr title="Intel's Rocket Lake architecture, aka 11th Generation Intel Core i3,i5,i7 and i9">RKL</abbr> client, you don’t have to fear the downclock. Only time will tell if this applies also to the Ice Lake Xeon server chips.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to Edison Chan for the Rocket Lake i9-11900K results.</p>
<p>Stopwatch photo by <a href="https://unsplash.com/@kevinandrephotography">Kevin Andre</a> on <a href="https://unsplash.com/s/photos/stopwatch">Unsplash</a>.</p>
<h3 id="discussion-and-feedback">Discussion and Feedback</h3>
<p>This post was discussed <a href="https://news.ycombinator.com/item?id=24215022">on Hacker News</a>.</p>
<p>If you have a question or any type of feedback, you can leave a <a href="#comment-section">comment below</a>. I’m also interested in results on <em>other</em> new Intel or AMD chips, like the i3 and i7 variants: let me know if you have one of those and we can collect results.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:tiring" role="doc-endnote">
<p>It gets tiring to constantly repeat <em>license-based downclock</em> so I’ll often use simply “downclock” instead, but this should still be understood to refer to the license-based variety rather than other types of frequency throttling. <a href="#fnref:tiring" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:dmore" role="doc-endnote">
<p>Note that Daniel has <a href="https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-instructions-are-maybe-not-so-dangerous/">written</a> <a href="https://lemire.me/blog/2018/08/15/the-dangers-of-avx-512-throttling-a-3-impact/">much more</a> <a href="https://lemire.me/blog/2018/08/24/trying-harder-to-make-avx-512-look-bad-my-quantified-and-reproducible-results/">than</a> <a href="https://lemire.me/blog/2018/09/04/per-core-frequency-scaling-and-avx-512-an-experiment/">just that</a> one. <a href="#fnref:dmore" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:visible" role="doc-endnote">
<p>Only two <em>visible:</em> it is possible that the three (or more) categories still exist, but they cause voltage transitions only, not any frequency transitions. <a href="#fnref:visible" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:onefma" role="doc-endnote">
<p>One might imagine this is a consequence of ICL client having only one FMA unit on all SKUs: very heavy FP 512-bit operations aren’t possible. However, this doesn’t align with 256-bit heavy still being fast: you can still do 2x256-bit FMAs per cycle and this is the same FP intensity as 1x512-bit FMA per cycle. It’s more like, on this chip, FP operations don’t need more license-based protection than other operations of the same width, and the main cost is 512-bit width. <a href="#fnref:onefma" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:some" role="doc-endnote">
<p>Those of a more critical bent might prefer <em>long suffering</em> or <em>very long in the tooth</em> as adjectives for this process. <a href="#fnref:some" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rklcaveats" role="doc-endnote">
<p><em>Some</em> tests did show lower speeds, although these outlier results didn’t correlate well with heavy or light instructions, and the difference was generally 100 MHz or less. These likely represent <em>other</em> sources of reduced frequency, such as thermal throttling or switching to a higher active core count when a process not related to the test process has active threads. In any case, for <em>each</em> core count, we can find a test in each of the instruction categories that runs at full speed, allowing me to fill out the matrix even in the presence of these outliers. <a href="#fnref:rklcaveats" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tb2" role="doc-endnote">
<p>I mention Turbo Boost 2.0 specifically because <a href="https://ark.intel.com/content/www/us/en/ark/products/212325/intel-core-i9-11900k-processor-16m-cache-up-to-5-30-ghz.html">this chip</a> also has a higher Turbo Boost 3.0 maximum frequency of 5.2 GHz, and beyond that a high <em>Thermal Velocity Boost</em> frequency of 5.3 GHz. These higher frequencies apply only to specific <em>chosen cores</em> within the CPU selected at manufacturing based on their ability to reach these higher frequencies. We don’t see any of these higher speeds during the test, possibly because the cores the test pins itself to are not the <em>chosen cores</em> on this CPU. So the frequency behavior of this chip can be characterized as “very simple” only if you ignore these additional turbo levels and other complicating factors. <a href="#fnref:tb2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:voltage" role="doc-endnote">
<p>I have to weasel-word with <em>mostly</em> here because even if there is no frequency transition, there may be a voltage transition which both incurs a halted period where nothing executes, and increases power for subsequent execution that may not require the elevated voltage. Also, there is the not-yet-discussed concept of <em>implicit widening</em> which may extend later narrow operations to maximum width if the upper parts of the registers are not zeroed with <code class="language-plaintext highlighter-rouge">vzeroupper</code> or <code class="language-plaintext highlighter-rouge">vzeroall</code>. <a href="#fnref:voltage" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:widenarrow" role="doc-endnote">
<p>For example, one 512-bit integer addition would generally be cheaper in energy use than the two 256-bit operations required to calculate the same result, because of execution overheads that don’t scale linearly with width (that’s almost everything outside of execution itself). <a href="#fnref:widenarrow" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>Travis Downs (travis.downs@gmail.com)</p>
<p><em>Examining the extent of AVX related downclocking on Intel's Ice Lake CPU</em></p>
<h2>A Concurrency Cost Hierarchy</h2>
<p>2020-07-06 · <a href="https://travisdowns.github.io/blog/2020/07/06/concurrency-costs">https://travisdowns.github.io/blog/2020/07/06/concurrency-costs</a></p>
<!-- boilerplate
page.assets: /assets/concurrency-costs
assetpath: /assets/concurrency-costs
tablepath: /misc/tables/concurrency-costs
-->
<h2 id="introduction">Introduction</h2>
<p>Concurrency is hard to get <em>correct</em>, at least for those of us unlucky enough to be writing in languages which expose directly the guts of concurrent hardware: threads and shared memory. Getting concurrency correct <em>and</em> fast is hard, too. Your knowledge about single-threaded optimization often won’t help you: at a micro (instruction) level we can’t simply apply the usual rules of μops, dependency chains, throughput limits, and so on. The rules are different.</p>
<p>If that first paragraph got your hopes up, this second one is here to dash them: I’m not actually going to do a deep dive into the very low level aspects of concurrent performance. There are a lot of things we just don’t know about how atomic instructions and fences execute, and we’ll save that for another day.</p>
<p>Instead, I’m going to describe a higher level taxonomy that I use to think about concurrent performance. We’ll group the performance of concurrent operations into six broad <em>levels</em> running from fast to slow, with each level differing from its neighbors by roughly an order of magnitude in performance.</p>
<p>I often find myself thinking in terms of these categories when I need high performance concurrency: what is the best level I can practically achieve for the given problem? Keeping the levels in mind is useful both during initial design (sometimes a small change in requirements or high level design can allow you to achieve a better level), and also while evaluating existing systems (to better understand existing performance and evaluate the path of least resistance to improvements).</p>
<h3 id="a-real-world-example">A “Real World” Example</h3>
<p>I don’t want this to be totally abstract, so we will use a real-world-if-you-squint<sup id="fnref:realworld" role="doc-noteref"><a href="#fn:realworld" class="footnote" rel="footnote">1</a></sup> running example throughout: safely incrementing an integer counter across threads. By <em>safely</em> I mean without losing increments, producing out-of-thin air values, frying your RAM or making more than a minor rip in space-time.</p>
<h3 id="source-and-results">Source and Results</h3>
<p>The source for every benchmark here is <a href="https://github.com/travisdowns/concurrency-hierarchy-bench">available</a>, so you can follow along and even reproduce the results or run the benchmarks on your own hardware. All of the results discussed here (and more) are available in the same repository, and each plot includes a <code class="language-plaintext highlighter-rouge">[data table]</code> link to the specific subset used to generate the plot.</p>
<h3 id="hardware">Hardware</h3>
<p>All of the performance results are provided for several different hardware platforms: Intel Skylake, Ice Lake, Amazon Graviton and Graviton 2. However, except when I explicitly mention other hardware, the prose refers to the results on Skylake. Although the specific numbers vary, most of the qualitative relationships hold for the other hardware too, but <em>not always</em>. Not only does the hardware vary, but the OS and library implementations vary as well.</p>
<p>It’s almost inevitable that this will be used to compare across hardware (“wow, Graviton 2 sure kicks Graviton 1’s ass”), but that’s not my goal here. The benchmarks are written primarily to tease apart the characteristics of the different levels, and <em>not</em> as a hardware shootout.</p>
<p>Find below the details of the hardware used:</p>
<table>
<thead>
<tr>
<th>Micro-architecture</th>
<th>ISA</th>
<th>Model</th>
<th>Tested Frequency</th>
<th>Cores</th>
<th>OS</th>
<th>Instance Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Skylake</td>
<td>x86</td>
<td>i7-6700HQ</td>
<td>2.6 GHz</td>
<td>4</td>
<td>Ubuntu 20.04</td>
<td> </td>
</tr>
<tr>
<td>Ice Lake</td>
<td>x86</td>
<td>i5-1035G4</td>
<td>3.3 GHz</td>
<td>4</td>
<td>Ubuntu 19.10</td>
<td> </td>
</tr>
<tr>
<td>Graviton</td>
<td>AArch64</td>
<td>Cortex-A72</td>
<td>2.3 GHz</td>
<td>16</td>
<td>Ubuntu 20.04</td>
<td>a1.4xlarge</td>
</tr>
<tr>
<td>Graviton 2</td>
<td>AArch64</td>
<td>Neoverse N1</td>
<td>2.5 GHz</td>
<td>16<sup id="fnref:g2cores" role="doc-noteref"><a href="#fn:g2cores" class="footnote" rel="footnote">2</a></sup></td>
<td>Ubuntu 20.04</td>
<td>c6g.4xlarge</td>
</tr>
</tbody>
</table>
<h2 id="level-2-contended-atomics">Level 2: Contended Atomics</h2>
<p>You’d probably expect this hierarchy to be introduced from fast to slow, or vice-versa, but we’re all about defying expectations here and we are going to start in the <em>middle</em> and work our way outwards. The middle (rounding down) turns out to be <em>level 2</em> and that’s where we will jump in.</p>
<p>The most elementary way to safely modify any shared object is to use a lock. It mostly <em>just works</em> for any type of object, no matter its structure or the nature of the modifications. Almost any mainstream CPU from the last thirty years has some type of locking<sup id="fnref:parisc" role="doc-noteref"><a href="#fn:parisc" class="footnote" rel="footnote">3</a></sup> instruction accessible to userspace.</p>
<p>So our baseline increment implementation will use a simple mutex of type <code class="language-plaintext highlighter-rouge">T</code> to protect a plain integer variable:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">T</span> <span class="n">lock</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">counter</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">bench</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">iters</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">holder</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">counter</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We’ll call this implementation <em><abbr title="Uses a std::mutex and std::lock_guard to protect a plain integer counter.">mutex add</abbr></em>, and on my 4 CPU Skylake-S i7-6700HQ machine, when I use the vanilla <code class="language-plaintext highlighter-rouge">std::mutex</code> I get the following results for 2 to 4 threads:</p>
<div class="tabs" id="tabs-mutex">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-mutex-1" name="tab-group-mutex" checked="" />
<label class="tab-label" for="tab-mutex-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/mutex.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/mutex.html">
<img class="figimg" src="/assets/concurrency-costs/skl/mutex.svg" alt="Mutex" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-mutex-2" name="tab-group-mutex" />
<label class="tab-label" for="tab-mutex-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/mutex.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/mutex.html">
<img class="figimg" src="/assets/concurrency-costs/icl/mutex.svg" alt="Mutex" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-mutex-3" name="tab-group-mutex" />
<label class="tab-label" for="tab-mutex-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/mutex.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/mutex.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/mutex.svg" alt="Mutex" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-mutex-4" name="tab-group-mutex" />
<label class="tab-label" for="tab-mutex-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/mutex.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/mutex.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/mutex.svg" alt="Mutex" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p class="info">The reported value is the median of all trials, and the vertical black error lines at the top of each bar indicate the <em>interdecile range</em>, i.e., the values at the 10th and 90th percentile. Where the error bars don’t show up, it means there is no difference between the p10 and p90 values at all, at least within the limits of the reporting resolution (100 picoseconds).</p>
<p>This shows that the baseline contended cost to modify an integer protected by a lock starts at about 125 nanoseconds for two threads, and grows somewhat with increasing thread count.</p>
<p>I can already hear someone saying: <em>If you are just modifying a single 64-bit integer, skip the lock and just directly use the atomic operations that most ISAs support!</em></p>
<p>Sure, let’s add a couple of variants that do that. The <code class="language-plaintext highlighter-rouge">std::atomic<T></code> template makes this easy: we can wrap any type meeting some basic requirements and then manipulate it atomically. The easiest of all is to use <code class="language-plaintext highlighter-rouge">std::atomic<uint64_t>::operator++()</code><sup id="fnref:post" role="doc-noteref"><a href="#fn:post" class="footnote" rel="footnote">4</a></sup> and this gives us <em><abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr></em>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span> <span class="n">atomic_counter</span><span class="p">{};</span>
<span class="kt">void</span> <span class="nf">atomic_add</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">iters</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">atomic_counter</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The other common approach would be to use <a href="https://en.wikipedia.org/wiki/Compare-and-swap">compare and swap (<abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>)</a> to load the existing value, add one and then <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> it back if it hasn’t changed. If it <em>has</em> changed, the increment raced with another thread and we try again.</p>
<p>Note that even if you use increment at the source level, the assembly might actually end up using <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> if your hardware doesn’t support atomic increment<sup id="fnref:atomicsup" role="doc-noteref"><a href="#fn:atomicsup" class="footnote" rel="footnote">5</a></sup>, or if your compiler or runtime just doesn’t take advantage of atomic operations even though they are available (e.g., see what even the newest version of <a href="https://godbolt.org/z/5h4K7y">icc does</a> for atomic increment, and what Java did for years<sup id="fnref:java" role="doc-noteref"><a href="#fn:java" class="footnote" rel="footnote">6</a></sup>). This caveat doesn’t apply to any of our tested platforms, however.</p>
<p>Let’s add a counter implementation that uses <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> as described above, and we’ll call it <em><abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr></em>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span> <span class="n">cas_counter</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">cas_add</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">iters</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">v</span> <span class="o">=</span> <span class="n">cas_counter</span><span class="p">.</span><span class="n">load</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">cas_counter</span><span class="p">.</span><span class="n">compare_exchange_weak</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">v</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
<span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Here’s what these look like alongside our existing <code class="language-plaintext highlighter-rouge">std::mutex</code> benchmark:</p>
<div class="tabs" id="tabs-atomic-inc">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-atomic-inc-1" name="tab-group-atomic-inc" checked="" />
<label class="tab-label" for="tab-atomic-inc-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/atomic-inc.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/atomic-inc.html">
<img class="figimg" src="/assets/concurrency-costs/skl/atomic-inc.svg" alt="Atomic increment" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-atomic-inc-2" name="tab-group-atomic-inc" />
<label class="tab-label" for="tab-atomic-inc-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/atomic-inc.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/atomic-inc.html">
<img class="figimg" src="/assets/concurrency-costs/icl/atomic-inc.svg" alt="Atomic increment" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-atomic-inc-3" name="tab-group-atomic-inc" />
<label class="tab-label" for="tab-atomic-inc-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/atomic-inc.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/atomic-inc.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/atomic-inc.svg" alt="Atomic increment" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-atomic-inc-4" name="tab-group-atomic-inc" />
<label class="tab-label" for="tab-atomic-inc-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/atomic-inc.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/atomic-inc.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/atomic-inc.svg" alt="Atomic increment" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>The first takeaway is that, at least in this <em>unrealistic maximum contention</em> benchmark, using <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr> (<a href="https://www.felixcloutier.com/x86/xadd"><code class="language-plaintext highlighter-rouge">lock xadd</code></a> at the hardware level) is significantly better than <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>. The second would be that <code class="language-plaintext highlighter-rouge">std::mutex</code> doesn’t come out looking all that bad on Skylake. It is only slightly worse than the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> approach at 2 cores and beats it at 3 and 4 cores. It is slower than the atomic increment approach, but less than three times as slow and seems to be scaling in a reasonable way.</p>
<p>All of these operations belong to <em>level 2</em> in the hierarchy. The primary characteristic of level 2 is that they make a <em>contended access</em> to a shared variable. This means that at a minimum, the line containing the data needs to move out to the caching agent that manages coherency<sup id="fnref:l3" role="doc-noteref"><a href="#fn:l3" class="footnote" rel="footnote">7</a></sup>, and then back up to the core that will receive ownership next. That’s about 70 cycles minimum just for that operation<sup id="fnref:inter" role="doc-noteref"><a href="#fn:inter" class="footnote" rel="footnote">8</a></sup>.</p>
<p>Can it get slower? You bet it can. <em>Way</em> slower.</p>
<h3 id="level-3-system-calls">Level 3: System Calls</h3>
<p>The next level up (“up” is not good here…) is level 3. The key characteristic of implementations at this level is that they make a <em>system call on almost every operation</em>.</p>
<p>It is easy to write concurrency primitives that make a system call <em>unconditionally</em> (e.g., a lock which always tries to wake waiters via a <code class="language-plaintext highlighter-rouge">futex(2)</code> call, even if there aren’t any), but we won’t look at those here. Rather we’ll take a look at a case where the fast path is written to avoid a system call, but the design or way it is used implies that such a call usually happens anyway.</p>
<p>Specifically, we are going to look at some <em>fair locks</em>. Fair locks allow threads into the critical section in the same order they began waiting. That is, when the critical section becomes available, the thread that has been waiting the longest is given the chance to take it.</p>
<p>Sounds like a good idea, right? Sometimes yes, but as we will see it can have significant performance implications.</p>
<p>On the menu are three different fair locks.</p>
<p>The first is a <a href="https://en.wikipedia.org/wiki/Ticket_lock">ticket lock</a> with a <code class="language-plaintext highlighter-rouge">sched_yield</code> in the spin loop. The idea of the yield is to give other threads which may hold the lock time to run. This <code class="language-plaintext highlighter-rouge">yield()</code> approach is publicly frowned upon by concurrency experts<sup id="fnref:notwhat" role="doc-noteref"><a href="#fn:notwhat" class="footnote" rel="footnote">9</a></sup>, who then sometimes go right ahead and use it anyway.</p>
<p id="ys-lock">We will call it <abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr> and it looks like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* A ticket lock which uses sched_yield() while waiting
* for the ticket to be served.
*/</span>
<span class="k">class</span> <span class="nc">ticket_yield</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">size_t</span><span class="o">></span> <span class="n">dispenser</span><span class="p">{},</span> <span class="n">serving</span><span class="p">{};</span>
<span class="nl">public:</span>
<span class="kt">void</span> <span class="n">lock</span><span class="p">()</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">ticket</span> <span class="o">=</span> <span class="n">dispenser</span><span class="p">.</span><span class="n">fetch_add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">ticket</span> <span class="o">!=</span> <span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">))</span>
<span class="n">sched_yield</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">unlock</span><span class="p">()</span> <span class="p">{</span>
<span class="n">serving</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_release</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Let’s plot the performance results for this lock alongside the existing approaches:</p>
<div class="tabs" id="tabs-fair-yield">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fair-yield-1" name="tab-group-fair-yield" checked="" />
<label class="tab-label" for="tab-fair-yield-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/fair-yield.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/fair-yield.html">
<img class="figimg" src="/assets/concurrency-costs/skl/fair-yield.svg" alt="Increment Cost: Fair Yield" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fair-yield-2" name="tab-group-fair-yield" />
<label class="tab-label" for="tab-fair-yield-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/fair-yield.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/fair-yield.html">
<img class="figimg" src="/assets/concurrency-costs/icl/fair-yield.svg" alt="Increment Cost: Fair Yield" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fair-yield-3" name="tab-group-fair-yield" />
<label class="tab-label" for="tab-fair-yield-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/fair-yield.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/fair-yield.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/fair-yield.svg" alt="Increment Cost: Fair Yield" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fair-yield-4" name="tab-group-fair-yield" />
<label class="tab-label" for="tab-fair-yield-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/fair-yield.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/fair-yield.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/fair-yield.svg" alt="Increment Cost: Fair Yield" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>This is level 3 visualized: it is an order of magnitude slower than the level 2 approaches. The slowdown comes from the <code class="language-plaintext highlighter-rouge">sched_yield</code> call: it is a system call, and system calls generally cost on the order of 100s of nanoseconds<sup id="fnref:spectre" role="doc-noteref"><a href="#fn:spectre" class="footnote" rel="footnote">10</a></sup>, and it shows in the results.</p>
<p>This lock <em>does</em> have a fast path where <code class="language-plaintext highlighter-rouge">sched_yield</code> isn’t called: if the lock is available, no spinning occurs and <code class="language-plaintext highlighter-rouge">sched_yield</code> is never called. However, the combination of being a <em>fair</em> lock and the high contention in this test means that a lock convoy quickly forms (we’ll describe this in more detail later) and so the spin loop is entered basically every time <code class="language-plaintext highlighter-rouge">lock()</code> is called.</p>
<p>So have we <em>now</em> fully plumbed the depths of slow concurrency constructs? Not even close. We are only now just about to cross the River Styx.</p>
<h4 id="revisiting-stdmutex">Revisiting std::mutex</h4>
<p>Before we proceed, let’s quickly revisit the <code class="language-plaintext highlighter-rouge">std::mutex</code> implementation discussed in level 2 in light of our definition of level 3 as requiring a system call. Doesn’t <code class="language-plaintext highlighter-rouge">std::mutex</code> <em>also</em> make system calls? If a thread tries to lock a <code class="language-plaintext highlighter-rouge">std::mutex</code> object which is already locked, we expect that thread to block using OS-provided primitives. So why isn’t it level 3 and slow like <abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr>?</p>
<p>The primary reason is that it makes <em>few</em> system calls in practice. Through a combination of spinning and unfairness I measure only about 0.18 system calls per increment, with three threads on my Skylake box. So <em>most</em> increments happen without a system call. On the other hand, <abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr> makes about 2.4 system calls per increment, more than an order of magnitude more, and so it suffers a corresponding decrease in performance.</p>
<p>That out of the way, let’s get even slower.</p>
<h3 id="level-4-implied-context-switch">Level 4: Implied Context Switch</h3>
<p>The next level is when the implementation forces a significant number of concurrent operations to cause a <em>context switch</em>.</p>
<p>The yielding lock wasn’t resulting in many context switches, since we are not running more threads than there are cores, and so there usually is no other runnable process (except for the occasional background process). Therefore, the current thread stays on the CPU when we call <code class="language-plaintext highlighter-rouge">sched_yield</code>. Of course, this burns a lot of CPU.</p>
<p>As the experts recommend whenever one suggests <em>yielding</em> in a spin loop, let us try a <em>blocking lock</em> instead.</p>
<p class="info"><strong>Blocking Locks</strong><br />
<br />
A more resource-friendly design, and one that will often perform better, is a <em>blocking</em> lock.<br /><br />Rather than busy waiting, these locks ask the OS to put the current thread to sleep until the lock becomes available. On Linux, the <a href="http://man7.org/linux/man-pages/man2/futex.2.html"><code class="language-plaintext highlighter-rouge">futex(2)</code></a> system call is the preferred way to accomplish this, while on Windows you have the <a href="https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject"><code class="language-plaintext highlighter-rouge">WaitFor*Object</code></a> API family. Above the OS interfaces, things like C++’s <code class="language-plaintext highlighter-rouge">std::condition_variable</code> provide a general purpose mechanism to wait until an arbitrary condition is true.</p>
<p>Our first blocking lock is again a ticket-based design, except this time it uses a condition variable to block when it detects that it isn’t first in line to be served (i.e., that the lock was held by another thread). We’ll name it <abbr title="A ticket lock which blocks if it cannot immediately acquire the lock.">ticket blocking</abbr> and it looks like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">blocking_ticket</span><span class="o">::</span><span class="n">lock</span><span class="p">()</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">ticket</span> <span class="o">=</span> <span class="n">dispenser</span><span class="p">.</span><span class="n">fetch_add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ticket</span> <span class="o">==</span> <span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">))</span>
<span class="k">return</span><span class="p">;</span> <span class="c1">// uncontended case</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_lock</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">lock</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">ticket</span> <span class="o">!=</span> <span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">))</span> <span class="p">{</span>
<span class="n">cvar</span><span class="p">.</span><span class="n">wait</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">blocking_ticket</span><span class="o">::</span><span class="n">unlock</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_lock</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">lock</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">s</span> <span class="o">=</span> <span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">serving</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_release</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">d</span> <span class="o">=</span> <span class="n">dispenser</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">s</span> <span class="o"><=</span> <span class="n">d</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">s</span> <span class="o"><</span> <span class="n">d</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// wake all waiters</span>
<span class="n">cvar</span><span class="p">.</span><span class="n">notify_all</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The main difference with the earlier implementation occurs in the case where we don’t acquire the lock immediately (we don’t return at the location marked <code class="language-plaintext highlighter-rouge">// uncontended case</code>). Instead of yielding in a loop, we take the mutex associated with the condition variable and wait until notified. Every time we are notified we check if it is our turn.</p>
<p>Even without <abbr title="When a waiter on a condition variable is woken up even though no other thread notified it.">spurious wakeups</abbr> we might get woken many times, because this lock suffers from the <em>thundering herd</em> problem where every waiter is woken on <code class="language-plaintext highlighter-rouge">unlock()</code> even though only one will ultimately be able to get the lock.</p>
<p>We’ll try a second design too, one that doesn’t suffer from the thundering herd problem. This is a queued lock, where each waiter waits on its own private node in a queue of waiters, so only a single waiter (the new lock owner) is woken up on unlock. We will call it <abbr title="A blocking ticket lock where each waiter waits on a unique condition variable.">queued fifo</abbr> and if you’re interested in the implementation you <a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/9b8e0e0dfec7d38036d114038c6a9ed020b5b775/fairlocks.cpp#L61">can find it here</a>.</p>
<p>Here’s how our new locks perform against the existing crowd:</p>
<div class="tabs" id="tabs-more-fair">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-more-fair-1" name="tab-group-more-fair" checked="" />
<label class="tab-label" for="tab-more-fair-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/more-fair.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/more-fair.html">
<img class="figimg" src="/assets/concurrency-costs/skl/more-fair.svg" alt="Increment Cost: Fair Blocking" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-more-fair-2" name="tab-group-more-fair" />
<label class="tab-label" for="tab-more-fair-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/more-fair.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/more-fair.html">
<img class="figimg" src="/assets/concurrency-costs/icl/more-fair.svg" alt="Increment Cost: Fair Blocking" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-more-fair-3" name="tab-group-more-fair" />
<label class="tab-label" for="tab-more-fair-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/more-fair.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/more-fair.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/more-fair.svg" alt="Increment Cost: Fair Blocking" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-more-fair-4" name="tab-group-more-fair" />
<label class="tab-label" for="tab-more-fair-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/more-fair.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/more-fair.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/more-fair.svg" alt="Increment Cost: Fair Blocking" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>You’re probably seeing the pattern now: performance is again a new level of terrible compared to the previous contenders. About an order of magnitude slower than the yielding approach, which was already slower than the earlier approaches, which are now just slivers a few pixels high on the plots. The queued version of the lock does slightly better at increasing thread counts (<em>especially</em> on Graviton 2), as might be expected from the lack of the thundering herd effect, but is still very slow because the primary problem isn’t thundering herd, but rather a <a href="https://en.wikipedia.org/wiki/Lock_convoy"><em>lock convoy</em></a>.</p>
<p class="info"><strong>Lock Convoy</strong><br />
<br />
Unlike unfair locks, fair locks can result in sustained convoys involving only a single lock, once the contention reaches a certain point<sup id="fnref:hyst" role="doc-noteref"><a href="#fn:hyst" class="footnote" rel="footnote">11</a></sup>.<br />
<br />
Consider what happens when two threads, <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>, try to acquire the lock repeatedly. Let’s say <code class="language-plaintext highlighter-rouge">A</code> gets ticket 1 and <code class="language-plaintext highlighter-rouge">B</code> ticket 2. So <code class="language-plaintext highlighter-rouge">A</code> gets to go first and <code class="language-plaintext highlighter-rouge">B</code> has to wait, and for these implementations that means blocking (we can say the thread is <em>parked</em> by the OS). Now, <code class="language-plaintext highlighter-rouge">A</code> unlocks the lock and sees <code class="language-plaintext highlighter-rouge">B</code> waiting and wakes it. <code class="language-plaintext highlighter-rouge">A</code> is still running and soon tries to get the lock again, receiving ticket 3, but it cannot acquire the lock immediately because the lock is <em>fair</em>: <code class="language-plaintext highlighter-rouge">A</code> can’t jump the queue and acquire the lock with ticket 3 before <code class="language-plaintext highlighter-rouge">B</code>, holding ticket 2, gets its chance to enter the lock.<br />
<br />
Of course, <code class="language-plaintext highlighter-rouge">B</code> is going to be a while: it needs to be woken by the scheduler and this takes a microsecond or two, at least. Now <code class="language-plaintext highlighter-rouge">B</code> wakes and gets the lock, and the same scenario repeats itself with the roles reversed. The upshot is that there is a full context switch for each acquisition of the lock.<br />
<br />
Unfair locks avoid this problem because they allow queue jumping: in the scenario above, <code class="language-plaintext highlighter-rouge">A</code> (or any other thread) could re-acquire the lock after unlocking it, before <code class="language-plaintext highlighter-rouge">B</code> got its chance. So the use of the shared resource doesn’t grind to a halt while <code class="language-plaintext highlighter-rouge">B</code> wakes up.</p>
<p>So, are you tired of seeing mostly-white plots where the newly introduced algorithm relegates the rest of the pack to little chunks of color near the x-axis, yet?</p>
<p>I’ve just got one more left on the slow end of the scale. Unlike the other examples, I haven’t actually diagnosed something <em>this</em> bad in real life, but examples are out there.</p>
<h3 id="level-5-catastrophe">Level 5: Catastrophe</h3>
<p>Here’s a ticket lock which is identical to the <a href="#ys-lock">first ticket lock we saw</a>, except that the <code class="language-plaintext highlighter-rouge">sched_yield();</code> is replaced by <code class="language-plaintext highlighter-rouge">;</code>. That is, it busy waits instead of yielding (<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/9b8e0e0dfec7d38036d114038c6a9ed020b5b775/fairlocks.cpp#L31">and here are the spin flavors which specialize on a shared ticket lock template</a>). You could also replace this by a CPU-specific “relax” instruction like <a href="https://www.felixcloutier.com/x86/pause"><code class="language-plaintext highlighter-rouge">pause</code></a>, but it won’t <a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/9b8e0e0dfec7d38036d114038c6a9ed020b5b775/fairlocks.hpp#L26">change the outcome</a>. We call it <abbr title="A traditional spin-based ticket lock that does a hot spin while waiting for its ticket to be next.">ticket spin</abbr>, and here’s how it performs compared to the existing candidates:</p>
<div class="tabs" id="tabs-ts-4">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-4-1" name="tab-group-ts-4" checked="" />
<label class="tab-label" for="tab-ts-4-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/ts-4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/ts-4.html">
<img class="figimg" src="/assets/concurrency-costs/skl/ts-4.svg" alt="Increment Cost: Ticket Spin" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-4-2" name="tab-group-ts-4" />
<label class="tab-label" for="tab-ts-4-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/ts-4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/ts-4.html">
<img class="figimg" src="/assets/concurrency-costs/icl/ts-4.svg" alt="Increment Cost: Ticket Spin" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-4-3" name="tab-group-ts-4" />
<label class="tab-label" for="tab-ts-4-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/ts-4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/ts-4.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/ts-4.svg" alt="Increment Cost: Ticket Spin" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-4-4" name="tab-group-ts-4" />
<label class="tab-label" for="tab-ts-4-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/ts-4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/ts-4.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/ts-4.svg" alt="Increment Cost: Ticket Spin" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>What? That doesn’t look too bad at all. In fact, it is only slightly worse than the level 2 crew, the fastest we’ve seen so far<sup id="fnref:huh" role="doc-noteref"><a href="#fn:huh" class="footnote" rel="footnote">12</a></sup>.</p>
<p>The picture changes if we show the results for up to 6 threads, rather than just 4. Since I have 4 available cores<sup id="fnref:noht" role="doc-noteref"><a href="#fn:noht" class="footnote" rel="footnote">13</a></sup>, this means that not all the test threads will be able to run at once:</p>
<div class="tabs" id="tabs-ts-6">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-6-1" name="tab-group-ts-6" checked="" />
<label class="tab-label" for="tab-ts-6-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/ts-6.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/ts-6.html">
<img class="figimg" src="/assets/concurrency-costs/skl/ts-6.svg" alt="Increment Cost: Ticket Spin (Oversubscribed)" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-6-2" name="tab-group-ts-6" />
<label class="tab-label" for="tab-ts-6-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/ts-6.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/ts-6.html">
<img class="figimg" src="/assets/concurrency-costs/icl/ts-6.svg" alt="Increment Cost: Ticket Spin (Oversubscribed)" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-6-3" name="tab-group-ts-6" />
<label class="tab-label" for="tab-ts-6-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/ts-6.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/ts-6.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/ts-6.svg" alt="Increment Cost: Ticket Spin (Oversubscribed)" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-6-4" name="tab-group-ts-6" />
<label class="tab-label" for="tab-ts-6-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/ts-6.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/ts-6.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/ts-6.svg" alt="Increment Cost: Ticket Spin (Oversubscribed)" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>Now it becomes clear why this level is called <em>catastrophic</em>. As soon as we oversubscribe the number of available cores, performance gets about <em>five hundred times worse</em>. We go from 100s of nanoseconds to 100s of microseconds. I don’t show more threads, but it only gets worse as you add more.</p>
<p>We are also about an order of magnitude slower than the best solution (<abbr title="A blocking ticket lock where each waiter waits on a unique condition variable.">queued fifo</abbr>) of the previous level, although it varies a lot by hardware: on Ice Lake the difference is more like <em>forty</em> times, while on Graviton this solution is actually slightly faster than <abbr title="A ticket lock which blocks if it cannot immediately acquire the lock.">ticket blocking</abbr> (also level 4) at 17 threads. Note also the huge error bars: this is the least consistent benchmark of the bunch, and the slowest and fastest runs might differ by a factor of 100.</p>
<h4 id="lock-convoy-on-steroids">Lock Convoy on Steroids</h4>
<p>So what happens here?</p>
<p>It’s similar to the lock convoy described above: all the threads queue on the lock and acquire it in a round-robin order due to the fair design. The difference is that threads don’t block when they can’t acquire the lock. This works out great when the cores are not oversubscribed, but falls off a cliff otherwise.</p>
<p>Imagine 5 threads, <code class="language-plaintext highlighter-rouge">T1</code>, <code class="language-plaintext highlighter-rouge">T2</code>, …, <code class="language-plaintext highlighter-rouge">T5</code>, where <code class="language-plaintext highlighter-rouge">T5</code> is the one not currently running. As soon as <code class="language-plaintext highlighter-rouge">T5</code> is the thread that needs to acquire the lock next (i.e., <code class="language-plaintext highlighter-rouge">T5</code>’s saved ticket value is equal to <code class="language-plaintext highlighter-rouge">serving</code>), nothing will happen because <code class="language-plaintext highlighter-rouge">T1</code> through <code class="language-plaintext highlighter-rouge">T4</code> are busily spinning away waiting for their turn. The OS scheduler sees no reason to interrupt them until their time slice expires. Time slices are usually measured in milliseconds. Once one thread is preempted, say <code class="language-plaintext highlighter-rouge">T1</code>, <code class="language-plaintext highlighter-rouge">T5</code> will get the chance to run, but at most 4 total acquisitions can happen (<code class="language-plaintext highlighter-rouge">T5</code>, plus any of <code class="language-plaintext highlighter-rouge">T2</code>, <code class="language-plaintext highlighter-rouge">T3</code>, <code class="language-plaintext highlighter-rouge">T4</code>), before it’s <code class="language-plaintext highlighter-rouge">T1</code>’s turn. <code class="language-plaintext highlighter-rouge">T1</code> is waiting for its chance to run again, but since everyone is spinning this won’t occur until another time slice expires.</p>
<p>So the lock can only be acquired a few times (at most <code class="language-plaintext highlighter-rouge">$(nproc)</code> times), or as little as once<sup id="fnref:once" role="doc-noteref"><a href="#fn:once" class="footnote" rel="footnote">14</a></sup>, every time slice. Modern Linux using <a href="https://en.wikipedia.org/wiki/Completely_Fair_Scheduler">CFS</a> doesn’t have a fixed timeslice, but on my system, <code class="language-plaintext highlighter-rouge">sched_latency_ns</code> is 18,000,000 which means that we expect two threads competing for one core to get a typical timeslice of 9 ms. The measured numbers are roughly consistent with a timeslice of single-digit milliseconds.</p>
<p>If I was good at diagrams, there would be a diagram here.</p>
<p>Another way of thinking about this is that in this over-subscription scenario, the <abbr title="A traditional spin-based ticket lock that does a hot spin while waiting for its ticket to be next.">ticket spin</abbr> lock implies roughly the same number of context switches as the blocking ticket lock<sup id="fnref:perf" role="doc-noteref"><a href="#fn:perf" class="footnote" rel="footnote">15</a></sup>, but in the former case each context switch comes with a giant delay caused by the need to exhaust the timeslice, while in the blocking case we are only limited by how fast a context switch can occur.</p>
<p>Interestingly, although this benchmark uses 100% CPU on every core, its performance in the oversubscribed case almost doesn’t depend on your CPU speed! Performance is approximately the same whether I throttle my CPU to 1 GHz or enable turbo up to 3.5 GHz. All of the other implementations scale almost proportionally with CPU frequency. The benchmark does scale strongly with adjustments to <code class="language-plaintext highlighter-rouge">sched_latency_ns</code> (and <code class="language-plaintext highlighter-rouge">sched_min_granularity_ns</code> if the former is set low enough): lower scheduling latency values give proportionally better performance as the time slices shrink, helping to confirm our theory of how this works.</p>
<p>This behavior also explains the large amount of variance once the available cores are oversubscribed: by definition, not all threads will be running at once, so the test becomes very sensitive to exactly where the not-running threads took their context switch. At the beginning of the test, only 4 of 6 threads will be running, and the other two will be switched out, still waiting on the <a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/master/cyclic-barrier.hpp">barrier</a> that synchronizes the test start. Since the two switched-out threads haven’t tried to get the lock yet, the four running threads can quickly share the lock among themselves, because the six-thread convoy hasn’t been set up.</p>
<p>This runs up the “iteration count” (work done) during an initial period which varies randomly, until the first context switch lets the fifth thread join the competition and then the convoy gets set up<sup id="fnref:csdepend" role="doc-noteref"><a href="#fn:csdepend" class="footnote" rel="footnote">16</a></sup>. That’s when the catastrophe starts. This makes the results very noisy: for example, if you set a too-short time period for a trial, the <em>entire test</em> is composed of this initial phase and the results are artificially “good”.</p>
<p>We can probably invent something even worse, but that’s enough for now. Let’s move on to scenarios that are <em>faster</em> than the use of vanilla <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>.</p>
<h3 id="level-1-uncontended-atomics">Level 1: Uncontended Atomics</h3>
<p>Recall that we started at level 2: contended atomics. The name gives it away: the next faster level is when atomic operations are used but there is no contention, either by design or by luck. You might have noticed that so far we’ve only shown results for at least two threads. That’s because the single threaded case involves no contention, and so every implementation so far is level 1 if run on a single thread<sup id="fnref:notexx" role="doc-noteref"><a href="#fn:notexx" class="footnote" rel="footnote">17</a></sup>.</p>
<p>Here are the results for all the implementations we’ve looked at so far, for a single thread:</p>
<div class="tabs" id="tabs-single">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-single-1" name="tab-group-single" checked="" />
<label class="tab-label" for="tab-single-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/single.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/single.html">
<img class="figimg" src="/assets/concurrency-costs/skl/single.svg" alt="Increment Cost: Single Threaded" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-single-2" name="tab-group-single" />
<label class="tab-label" for="tab-single-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/single.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/single.html">
<img class="figimg" src="/assets/concurrency-costs/icl/single.svg" alt="Increment Cost: Single Threaded" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-single-3" name="tab-group-single" />
<label class="tab-label" for="tab-single-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/single.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/single.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/single.svg" alt="Increment Cost: Single Threaded" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-single-4" name="tab-group-single" />
<label class="tab-label" for="tab-single-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/single.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/single.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/single.svg" alt="Increment Cost: Single Threaded" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>The fastest implementations run in about 10 nanoseconds, which is 5x faster than the fastest solution for 2 or more threads. The <em>slowest</em> implementation (<abbr title="A blocking ticket lock where each waiter waits on a unique condition variable.">queued fifo</abbr>) for one thread ties the <em>fastest</em> implementation (<abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>) at two threads, and beats it handily at three or four.</p>
<p>The number overlaid on each bar is the number of atomic operations<sup id="fnref:atomhow" role="doc-noteref"><a href="#fn:atomhow" class="footnote" rel="footnote">18</a></sup> each implementation makes per increment. It is obvious that performance is almost directly proportional to the number of atomic instructions. On the other hand, performance does <em>not</em> have much of a relationship with the total number of instructions of any type, which varies a lot even between algorithms with the same performance, as the following table shows:</p>
<table>
<thead>
<tr>
<th>Algorithm</th>
<th style="text-align: right">Atomics</th>
<th style="text-align: right">Instructions</th>
<th style="text-align: right">Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td><abbr title="Uses a std::mutex and std::lock_guard to protect a plain integer counter.">mutex add</abbr></td>
<td style="text-align: right">2</td>
<td style="text-align: right">64</td>
<td style="text-align: right">~21 ns</td>
</tr>
<tr>
<td><abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">4</td>
<td style="text-align: right">~7 ns</td>
</tr>
<tr>
<td><abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">7</td>
<td style="text-align: right">~12 ns</td>
</tr>
<tr>
<td><abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">13</td>
<td style="text-align: right">~10 ns</td>
</tr>
<tr>
<td><abbr title="A ticket lock which blocks if it cannot immediately acquire the lock.">ticket blocking</abbr></td>
<td style="text-align: right">3</td>
<td style="text-align: right">107</td>
<td style="text-align: right">~32 ns</td>
</tr>
<tr>
<td><abbr title="A blocking ticket lock where each waiter waits on a unique condition variable.">queued fifo</abbr></td>
<td style="text-align: right">4</td>
<td style="text-align: right">167</td>
<td style="text-align: right">~45 ns</td>
</tr>
<tr>
<td><abbr title="A traditional spin-based ticket lock that does a hot spin while waiting for its ticket to be next.">ticket spin</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">13</td>
<td style="text-align: right">~10 ns</td>
</tr>
<tr>
<td><abbr title="A simple mutex from &quot;Futexes Are Tricky&quot;.">mutex3</abbr></td>
<td style="text-align: right">2</td>
<td style="text-align: right">17</td>
<td style="text-align: right">~20 ns</td>
</tr>
</tbody>
</table>
<p>In particular, note that <abbr title="Uses a std::mutex and std::lock_guard to protect a plain integer counter.">mutex add</abbr> has more than 9x the number of instructions compared to <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> yet still runs at half the speed, in line with the 2:1 ratio of atomics. Similarly, <abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr> and <abbr title="A traditional spin-based ticket lock that does a hot spin while waiting for its ticket to be next.">ticket spin</abbr> have slightly <em>better</em> performance than <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> despite having about twice the number of instructions, in line with them all having a single atomic operation<sup id="fnref:casworse" role="doc-noteref"><a href="#fn:casworse" class="footnote" rel="footnote">19</a></sup>.</p>
<p>The last row in the table shows the performance of <abbr title="A simple mutex from &quot;Futexes Are Tricky&quot;.">mutex3</abbr>, an implementation we haven’t discussed. It is a basic mutex offering functionality similar to <code class="language-plaintext highlighter-rouge">std::mutex</code>; its implementation is described in <a href="https://akkadia.org/drepper/futex.pdf">Futexes Are Tricky</a>. Because it doesn’t need to pass through two layers of abstraction<sup id="fnref:twolayer" role="doc-noteref"><a href="#fn:twolayer" class="footnote" rel="footnote">20</a></sup>, it has only about one third the instruction count of <code class="language-plaintext highlighter-rouge">std::mutex</code>, yet performance is almost exactly the same, differing by less than 10%.</p>
<p>So the idea that you can almost ignore things that are in a lower cost tier seems to hold here. Don’t take this too far: if you design a lock with a single atomic operation but 1,000 other instructions, it is not going to be fast. There are also reasons to keep your instruction count low other than microbenchmark performance: smaller instruction cache footprint, less space occupied in various <abbr title="Out-of-order execution allows CPUs to execute instructions out of order with respect to the source.">out-of-order</abbr> execution buffers, more favorable inlining tradeoffs, etc.</p>
<p>Here it is important to note that the change in level of our various functions didn’t require a change in implementation. These are exactly the same few implementations we discussed at the slower levels. Instead, we simply changed (by fiat, i.e., by adjusting the benchmark parameters) the contention level from “very high” to “zero”. So in this case the level doesn’t depend only on the code, but also on this external factor. Of course, just saying that we are going to get to level 1 by only running one thread is not very useful in real life: we often can’t simply ban multi-threaded operation.</p>
<p>So can we get to level 1 even under concurrent calls from multiple threads? For this particular problem, we can.</p>
<h4 id="adaptive-multi-counter">Adaptive Multi-Counter</h4>
<p>One option is to use multiple counters to represent the counter value. We try to organize it so that threads running concurrently on different CPUs will increment different counters. Thus the <em>logical</em> counter value is split across all of these internal <em>physical</em> counters, and so a read of the logical counter value now needs to add together all the physical counter values.</p>
<p>Here’s an implementation:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">cas_multi_counter</span> <span class="p">{</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">NUM_COUNTERS</span> <span class="o">=</span> <span class="mi">64</span><span class="p">;</span>
<span class="k">static</span> <span class="k">thread_local</span> <span class="kt">size_t</span> <span class="n">idx</span><span class="p">;</span>
<span class="n">multi_holder</span> <span class="n">array</span><span class="p">[</span><span class="n">NUM_COUNTERS</span><span class="p">];</span>
<span class="nl">public:</span>
<span class="cm">/** increment the logical counter value */</span>
<span class="kt">uint64_t</span> <span class="k">operator</span><span class="o">++</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="k">auto</span><span class="o">&</span> <span class="n">counter</span> <span class="o">=</span> <span class="n">array</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">counter</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">counter</span><span class="p">.</span><span class="n">compare_exchange_strong</span><span class="p">(</span><span class="n">cur</span><span class="p">,</span> <span class="n">cur</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">cur</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// CAS failure indicates contention,</span>
<span class="c1">// so try again at a different index</span>
<span class="n">idx</span> <span class="o">=</span> <span class="p">(</span><span class="n">idx</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">NUM_COUNTERS</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">uint64_t</span> <span class="n">read</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&</span> <span class="n">h</span> <span class="o">:</span> <span class="n">array</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">h</span><span class="p">.</span><span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>We’ll call this <abbr title="Uses a CAS on an adaptively selected per-CPU counter.">cas multi</abbr>, and the approach is relatively straightforward.</p>
<p>There are 64 padded<sup id="fnref:padded" role="doc-noteref"><a href="#fn:padded" class="footnote" rel="footnote">21</a></sup> physical counters whose sum makes up the logical counter value. There is a thread-local <code class="language-plaintext highlighter-rouge">idx</code> value, initially zero for every thread, that points to the physical counter that each thread should increment. When <code class="language-plaintext highlighter-rouge">operator++</code> is called, we attempt to increment the counter pointed to by <code class="language-plaintext highlighter-rouge">idx</code> using <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>.</p>
<p>If this fails, however, we don’t simply retry. Failure indicates contention<sup id="fnref:notallfailure" role="doc-noteref"><a href="#fn:notallfailure" class="footnote" rel="footnote">22</a></sup> (this is the only way the <em>strong</em> variant of <code class="language-plaintext highlighter-rouge">compare_exchange</code> can fail), so we add one to <code class="language-plaintext highlighter-rouge">idx</code> to try another counter on the next attempt.</p>
<p>In a high-contention scenario like our benchmark, every CPU quickly ends up pointing to a different index value. If there is low contention, it is possible that only the first physical counter will be used.</p>
<p>Let’s compare this to the <code class="language-plaintext highlighter-rouge">atomic add</code> version we looked at above, which was the fastest of the level 2 approaches. Recall that it uses an <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr> on a single counter.</p>
<div class="tabs" id="tabs-cas-multi">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-cas-multi-1" name="tab-group-cas-multi" checked="" />
<label class="tab-label" for="tab-cas-multi-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/cas-multi.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/cas-multi.html">
<img class="figimg" src="/assets/concurrency-costs/skl/cas-multi.svg" alt="Increment Cost: Contention Adaptive Multi-Counter" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-cas-multi-2" name="tab-group-cas-multi" />
<label class="tab-label" for="tab-cas-multi-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/cas-multi.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/cas-multi.html">
<img class="figimg" src="/assets/concurrency-costs/icl/cas-multi.svg" alt="Increment Cost: Contention Adaptive Multi-Counter" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-cas-multi-3" name="tab-group-cas-multi" />
<label class="tab-label" for="tab-cas-multi-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/cas-multi.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/cas-multi.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/cas-multi.svg" alt="Increment Cost: Contention Adaptive Multi-Counter" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-cas-multi-4" name="tab-group-cas-multi" />
<label class="tab-label" for="tab-cas-multi-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/cas-multi.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/cas-multi.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/cas-multi.svg" alt="Increment Cost: Contention Adaptive Multi-Counter" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>For 1 active core, the results are the same as we saw earlier: the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> approach performs the same as the <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> algorithm<sup id="fnref:perfsame" role="doc-noteref"><a href="#fn:perfsame" class="footnote" rel="footnote">23</a></sup>, which is somewhat slower than <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>, due to the need for an additional load (i.e., the line with <code class="language-plaintext highlighter-rouge">counter.load()</code>) to set up the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>.</p>
<p>For 2 to 4 cores, the situation changes dramatically. The multiple counter approach performs the <em>same</em> regardless of the number of active cores. That is, it exhibits perfect scaling with multiple cores – in contrast to the single-counter approach which scales poorly. At four cores, the relative speedup of the multi-counter approach is about 9x. On Amazon’s Graviton ARM processor the speedup approaches <em>eighty</em> times at 16 threads.</p>
<p>This improvement in increment performance comes at a cost, however:</p>
<ul>
<li>64 counters ought to be enough for anyone, but they take 4096 (!!) bytes of memory to store what takes only 8 bytes in the <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr> approach<sup id="fnref:eightbyte" role="doc-noteref"><a href="#fn:eightbyte" class="footnote" rel="footnote">24</a></sup>.</li>
<li>The <code class="language-plaintext highlighter-rouge">read()</code> method is much slower: it needs to iterate over and add all 64 values, versus a single load for the earlier approaches.</li>
<li>The implementation compiles to much larger code: 113 bytes versus 15 bytes for the single counter <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> approach or 7 bytes for the <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr> approach.</li>
<li>The concurrent behavior is considerably harder to reason about and document. For example, it is harder to explain the consistency condition provided by <code class="language-plaintext highlighter-rouge">read()</code> since it is no longer a single atomic read<sup id="fnref:read" role="doc-noteref"><a href="#fn:read" class="footnote" rel="footnote">25</a></sup>.</li>
<li>There is a single thread-local <code class="language-plaintext highlighter-rouge">idx</code> variable. So while different <code class="language-plaintext highlighter-rouge">cas_multi_counter</code> instances are logically independent, the shared <code class="language-plaintext highlighter-rouge">idx</code> variable means that things that happen in one counter can affect the non-functional behavior of the others<sup id="fnref:sharedidx" role="doc-noteref"><a href="#fn:sharedidx" class="footnote" rel="footnote">26</a></sup>.</li>
</ul>
<p>Some of these downsides can be partly mitigated:</p>
<ul>
<li>A much smaller number of counters would probably be better for most practical uses. We could also set the array size dynamically based on the detected number of logical CPUs since a larger array should not provide much of a performance increase. Better yet, we might make the size even more dynamic, based on contention: start with a single element and grow it only when contention is detected. This means that even on systems with many CPUs, the size will remain small if contention is never seen in practice. This has a runtime cost<sup id="fnref:rtcost" role="doc-noteref"><a href="#fn:rtcost" class="footnote" rel="footnote">27</a></sup>, however.</li>
<li>We could optimize the <code class="language-plaintext highlighter-rouge">read()</code> method by stopping when we see a zero counter. I believe a careful analysis shows that the non-zero counter values for any instance of this class are all in a contiguous region starting from the beginning of the counter array<sup id="fnref:subtle" role="doc-noteref"><a href="#fn:subtle" class="footnote" rel="footnote">28</a></sup>.</li>
<li>We could mitigate some of the code footprint by carefully carving the “less hot”<sup id="fnref:lesshot" role="doc-noteref"><a href="#fn:lesshot" class="footnote" rel="footnote">29</a></sup> slow path out into another function, and use our <a href="https://xania.org/201209/forcing-code-out-of-line-in-gcc">magic powers</a> to encourage the small fast path (the first <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>) to be inlined while the fallback remains out of line.</li>
<li>We could make the thread-local <code class="language-plaintext highlighter-rouge">idx</code> per-instance to solve the “shared <code class="language-plaintext highlighter-rouge">idx</code> across all instances” problem. This does require a non-negligible amount of work to implement a dynamic TLS system that can create as many thread-local keys as you want<sup id="fnref:dynamictls" role="doc-noteref"><a href="#fn:dynamictls" class="footnote" rel="footnote">30</a></sup>, and it is slower.</li>
</ul>
<p>So while we got a good looking chart, this solution doesn’t exactly dominate the simpler ones. You pay a price along several axes for the lack of contention and you shouldn’t blindly replace the simpler solutions with this one – it needs to be a carefully considered and use-case dependent decision.</p>
<p>Is it over yet? Can I close this browser tab and reclaim all that memory? Almost. Just one level to go.</p>
<h3 id="level-0-vanilla">Level 0: Vanilla</h3>
<p>The last and fastest level is achieved when only vanilla instructions are used (and without contention). By <em>vanilla instructions</em> I mean things like regular loads and stores which don’t imply additional synchronization above what the hardware memory model offers by default<sup id="fnref:noatomic" role="doc-noteref"><a href="#fn:noatomic" class="footnote" rel="footnote">31</a></sup>.</p>
<p>How can we increment a counter atomically while allowing it to be read from any thread? By ensuring there is only one writer for any given physical counter. If we keep a counter <em>per thread</em> and only allow the owning thread to write to it, there is no need for an atomic increment.</p>
<p>The obvious way to keep a per-thread counter is to use thread-local storage. Something like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* Keeps a counter per thread, readers need to sum
* the counters from all active threads and add the
* accumulated value from dead threads.
*/</span>
<span class="k">class</span> <span class="nc">tls_counter</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span> <span class="n">counter</span><span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="cm">/* protects all_counters and accumulator */</span>
<span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">mutex</span> <span class="n">lock</span><span class="p">;</span>
<span class="cm">/* list of all active counters */</span>
<span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">tls_counter</span> <span class="o">*></span> <span class="n">all_counters</span><span class="p">;</span>
<span class="cm">/* accumulated value of counters from dead threads */</span>
<span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">accumulator</span><span class="p">;</span>
<span class="cm">/* per-thread tls_counter object */</span>
<span class="k">static</span> <span class="k">thread_local</span> <span class="n">tls_counter</span> <span class="n">tls</span><span class="p">;</span>
<span class="cm">/** add ourselves to the counter list */</span>
<span class="n">tls_counter</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">g</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">all_counters</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/**
* destruction means the thread is going away, so
* we stash the current value in the accumulator and
* remove ourselves from the array
*/</span>
<span class="o">~</span><span class="n">tls_counter</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">g</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">accumulator</span> <span class="o">+=</span> <span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">all_counters</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">remove</span><span class="p">(</span><span class="n">all_counters</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">all_counters</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="k">this</span><span class="p">),</span> <span class="n">all_counters</span><span class="p">.</span><span class="n">end</span><span class="p">());</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">incr</span><span class="p">()</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">counter</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">cur</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
<span class="nl">public:</span>
<span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">read</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">g</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="kt">uint64_t</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">h</span> <span class="o">:</span> <span class="n">all_counters</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">h</span><span class="o">-></span><span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">count</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sum</span> <span class="o">+</span> <span class="n">accumulator</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">increment</span><span class="p">()</span> <span class="p">{</span>
<span class="n">tls</span><span class="p">.</span><span class="n">incr</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The approach is similar to the per-CPU counter, except that we keep one counter per thread, using <code class="language-plaintext highlighter-rouge">thread_local</code>. Unlike earlier implementations, you don’t create instances of this class: there is only one counter and you increment it by calling the static method <code class="language-plaintext highlighter-rouge">tls_counter::increment()</code>.</p>
<p>Let’s focus a moment on the actual increment inside the thread-local counter instance:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">incr</span><span class="p">()</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">counter</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">cur</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is just a verbose way of saying “add 1 to this <code class="language-plaintext highlighter-rouge">std::atomic<uint64_t></code> but it doesn’t have to be atomic”. We don’t need an atomic increment as there is only one writer<sup id="fnref:whyatomic" role="doc-noteref"><a href="#fn:whyatomic" class="footnote" rel="footnote">32</a></sup>. Using the <em>relaxed</em> memory order means that no barriers are inserted<sup id="fnref:barrier" role="doc-noteref"><a href="#fn:barrier" class="footnote" rel="footnote">33</a></sup>. We still need a way to read all the thread-local counters, and the rest of the code deals with that: there is a global vector of pointers to all the active <code class="language-plaintext highlighter-rouge">tls_counter</code> objects, and <code class="language-plaintext highlighter-rouge">read()</code> iterates over this. All access to this vector is protected by a <code class="language-plaintext highlighter-rouge">std::mutex</code>, since it will be accessed concurrently. When threads die, we remove their entry from the array, and add their final value to <code class="language-plaintext highlighter-rouge">tls_counter::accumulator</code> which is added to the sum of active counters in <code class="language-plaintext highlighter-rouge">read()</code>.</p>
<p>Whew.</p>
<p>So how does this <abbr title="Uses thread-local storage for a counter per thread.">tls add</abbr> implementation benchmark?</p>
<div class="tabs" id="tabs-tls">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-tls-1" name="tab-group-tls" checked="" />
<label class="tab-label" for="tab-tls-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/tls.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/tls.html">
<img class="figimg" src="/assets/concurrency-costs/skl/tls.svg" alt="Increment Cost: Thread Local Storage" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-tls-2" name="tab-group-tls" />
<label class="tab-label" for="tab-tls-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/tls.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/tls.html">
<img class="figimg" src="/assets/concurrency-costs/icl/tls.svg" alt="Increment Cost: Thread Local Storage" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-tls-3" name="tab-group-tls" />
<label class="tab-label" for="tab-tls-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/tls.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/tls.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/tls.svg" alt="Increment Cost: Thread Local Storage" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-tls-4" name="tab-group-tls" />
<label class="tab-label" for="tab-tls-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/tls.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/tls.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/tls.svg" alt="Increment Cost: Thread Local Storage" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>That’s two nanoseconds per increment, regardless of the number of active cores. This turns out to be exactly as fast as just incrementing a variable in memory with a single instruction like <code class="language-plaintext highlighter-rouge">inc [eax]</code> or <code class="language-plaintext highlighter-rouge">add [eax], 1</code>, so it’s somehow as fast as possible for any solution which ends up incrementing something in memory<sup id="fnref:whitelie" role="doc-noteref"><a href="#fn:whitelie" class="footnote" rel="footnote">34</a></sup>.</p>
<p>Let’s take a look at the number of atomics, total instructions and performance for the three implementations in the last plot, for four threads:</p>
<table>
<thead>
<tr>
<th>Algorithm</th>
<th style="text-align: right">Atomics</th>
<th style="text-align: right">Instructions</th>
<th style="text-align: right">Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td><abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">4</td>
<td style="text-align: right">~ 110 ns</td>
</tr>
<tr>
<td><abbr title="Uses a CAS on an adaptively per-CPU counter.">cas multi</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">7</td>
<td style="text-align: right">~ 12 ns</td>
</tr>
<tr>
<td><abbr title="Uses thread-local storage for a counter per thread.">tls add</abbr></td>
<td style="text-align: right">0</td>
<td style="text-align: right">12</td>
<td style="text-align: right">~ 2 ns</td>
</tr>
</tbody>
</table>
<p>This is a clear indication that the difference in performance has very little to do with the number of instructions: the ranking by instruction count is exactly the reverse of the ranking by performance! <abbr title="Uses thread-local storage for a counter per thread.">tls add</abbr> has three times the number of instructions, yet is more than <em>fifty times</em> faster (so the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> varies by a factor of more than 150x).</p>
<p>As we saw at the last level, this improvement in performance doesn’t come for free:</p>
<ul>
<li>The total code size is considerably larger than the per-CPU approach, although most of it is related to creation of the initial object on each thread, and not on the hot path.</li>
<li>We have one object per thread, instead of per CPU. For an application with many threads using the counter, this may mean the creation of many individual counters which use both more memory<sup id="fnref:tlsmem" role="doc-noteref"><a href="#fn:tlsmem" class="footnote" rel="footnote">35</a></sup> and result in a slower <code class="language-plaintext highlighter-rouge">read()</code> function.</li>
<li>This implementation only supports <em>one</em> counter: the key methods in <code class="language-plaintext highlighter-rouge">tls_counter</code> are static. This boils down to the need for a <code class="language-plaintext highlighter-rouge">thread_local</code> object for the physical counter, which must be static by the rules of C++. A template parameter could be added to allow multiple counters based on dummy types used as tags, but this is still more awkward to use than instances of a class (and some platforms <a href="https://docs.microsoft.com/en-us/windows/win32/procthread/thread-local-storage">have limits</a> on the number of <code class="language-plaintext highlighter-rouge">thread_local</code> variables). This limitation could be removed in the same way as discussed earlier for the <abbr title="Uses a CAS on an adaptively per-CPU counter.">cas multi</abbr> <code class="language-plaintext highlighter-rouge">idx</code> variable, but at a cost in performance and complexity.</li>
<li>A lock was introduced to protect the array of all counters. Although the important increment operation is still lock-free, things like the <code class="language-plaintext highlighter-rouge">read()</code> call, the first counter access on a given thread and thread destruction all compete for the same lock. This could be eased with a read-write lock or a concurrent data structure, but at a cost as always.</li>
</ul>
<h2 id="the-table">The Table</h2>
<style>
.yesno {
display: inline-block;
border-radius: 3px;
min-width: 25px;
text-align: center;
}
.yes {
background-color: #070;
padding: 3px;
}
.no {
background-color: orangered;
padding: 3px 5px;
}
</style>
<p>Let’s summarize all the levels in this table.</p>
<p>The <em>~Cost</em> column is a <em>very</em> approximate estimate of the cost of each “occurrence” of the expensive operation associated with the level. It should be taken as a very rough ballpark for current Intel and AMD hardware, but especially the later levels can vary a lot.</p>
<p>The <em>Perf Event</em> column lists a Linux <code class="language-plaintext highlighter-rouge">perf</code> event that you can use to count the number of times the operation associated with this level occurs, i.e., the thing that is slow. For example, in level 1, you count atomic operations using the <code class="language-plaintext highlighter-rouge">mem_inst_retired.lock_loads</code> counter, and if you get three counts per high level operation, you can expect roughly 3 x 10 ns = 30 ns cost. Of course, you don’t necessarily need perf in this case: you can inspect the assembly too.</p>
<p>The <em>Local</em> column records whether the behavior of this level is <em>core local</em>. If yes, it means that operations on different cores complete independently and don’t compete and so the performance scales with the number of cores. If not, there is contention or serialization, so the throughput of the entire system is often limited, regardless of how many cores are involved. For example, only one core at a time performs an atomic operation on a cache line, so the throughput of the whole system is fixed and the throughput per core decreases as more cores become involved.</p>
<p>The <em>Key Characteristic</em> tries to get across the idea of the level in one bite-sized chunk.</p>
<table>
<thead>
<tr>
<th>Level</th>
<th>Name</th>
<th style="text-align: right">~Cost (ns)</th>
<th style="text-align: center">Perf Event</th>
<th style="text-align: center">Local</th>
<th>Key Characteristic</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>Vanilla</td>
<td style="text-align: right">low</td>
<td style="text-align: center">depends</td>
<td style="text-align: center"><strong class="yes yesno">Yes</strong></td>
<td>No atomic instructions or contended accesses at all</td>
</tr>
<tr>
<td>1</td>
<td>Uncontended Atomic</td>
<td style="text-align: right">10</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">mem_inst_retired.</code> <code class="language-plaintext highlighter-rouge">lock_loads</code></td>
<td style="text-align: center"><strong class="yes yesno">Yes</strong></td>
<td>Atomic instructions without contention</td>
</tr>
<tr>
<td>2</td>
<td>True Sharing</td>
<td style="text-align: right">40 - 400</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">mem_load_l3_hit_retired.</code> <code class="language-plaintext highlighter-rouge">xsnp_hitm</code></td>
<td style="text-align: center"><strong class="no yesno">No</strong></td>
<td>Contended atomics or locks</td>
</tr>
<tr>
<td>3</td>
<td>Syscall</td>
<td style="text-align: right">1,000</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">raw_syscalls:sys_enter</code></td>
<td style="text-align: center"><strong class="no yesno">No</strong></td>
<td>System call</td>
</tr>
<tr>
<td>4</td>
<td>Context Switch</td>
<td style="text-align: right">10,000</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">context-switches</code></td>
<td style="text-align: center"><strong class="no yesno">No</strong></td>
<td>Forced context switch</td>
</tr>
<tr>
<td>5</td>
<td>Catastrophe</td>
<td style="text-align: right">huge</td>
<td style="text-align: center">depends</td>
<td style="text-align: center"><strong class="no yesno">No</strong></td>
<td>Stalls until quantum exhausted, or other sadness</td>
</tr>
</tbody>
</table>
<h2 id="so-what">So What?</h2>
<p>What’s the point of all this?</p>
<p>Primarily, I use the hierarchy as a simplification mechanism when thinking about concurrency and performance. As a first order approximation <em>you mostly only need to care about the operations related to the current level</em>. That is, if you are focusing on something which has contended atomic operations (level 2), you don’t need to worry too much about uncontended atomics or instruction counts: just focus on reducing contention. Similarly, if you are at level 1 (uncontended atomics) it is often worth using <em>more</em> instructions to reduce the number of atomics.</p>
<p>This guideline only goes so far: if you have to add 100 instructions to remove one atomic, it is probably not worth it.</p>
<p>Second, when optimizing a concurrent system I always try to consider how I can get to a (numerically) lower level. Can I remove the last atomic? Can I avoid contention? Successfully moving to a lower level can often provide an order-of-magnitude boost to performance, so it should be attempted first, before finer-grained optimizations within the current level. Don’t spend forever optimizing your contended lock, if there’s some way to get rid of the contention entirely.</p>
<p>Of course, this is not always possible, or not possible without tradeoffs you are unwilling to make.</p>
<h3 id="getting-there">Getting There</h3>
<p>Here’s a quick look at some usual and unusual ways of achieving levels lower on the hierarchy.</p>
<h4 id="level-4">Level 4</h4>
<p>You probably don’t want to really be in level 4 but it’s certainly better than level 5. So, if you still have your job and your users haven’t all abandoned you, it’s usually pretty easy to get out of level 5. More than half the battle is just recognizing what’s going on and from there the solution is often clear. Many times, you’ve simply violated some rule like “don’t use pure spinlocks in userspace” or “you built a spinlock by accident” or “so-and-so accidentally held that core lock during IO”. There’s almost never any inherent reason you’d need to stay in level 5 and you can usually find an almost tradeoff-free fix.</p>
<p>A better approach than targeting level 4 is just to skip to level 2, since that’s usually not too difficult.</p>
<h4 id="level-3">Level 3</h4>
<p>Getting to level 3 just means solving the underlying reason for so many context switches. In the example used in this post, it means giving up fairness. Other approaches include not using threads for small work units, using smarter thread pools, not oversubscribing cores, and keeping locked regions short.</p>
<p>You don’t usually really want to be in level 3 though: just skip right to level 2.</p>
<h4 id="level-2">Level 2</h4>
<p>Level 3 isn’t a <em>terrible</em> place to be, but you’ll always have that gnawing in your stomach that you’re leaving a 10x speedup on the table. You just need to get rid of that system call or context switch, bringing you to level 2.</p>
<p>Most library provided concurrency primitives already avoid system calls on the happy path. E.g., pthreads mutex, <code class="language-plaintext highlighter-rouge">std::mutex</code>, Windows <code class="language-plaintext highlighter-rouge">CRITICAL_SECTION</code> will avoid a system call while acquiring and releasing an uncontended lock. There are, however, some notable exceptions: if you are using a <a href="https://docs.microsoft.com/en-us/windows/win32/sync/mutex-objects">Win32 mutex object</a> or <a href="https://man7.org/linux/man-pages/man2/semop.2.html">System V semaphore</a> object, you are paying a system call on every operation. Double check if you can use an in-process alternative in this case.</p>
<p>For more general synchronization purposes which don’t fit the lock-unlock pattern, a condition variable often fits the bill and a quality implementation generally avoids system calls on the fast path. A relatively unknown and higher performance alternative to condition variables, especially suitable for coordinating blocking for otherwise lock-free structures, is an <a href="http://pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-reordering/#event-counts-with-x86-tso-and-futexes"><em>event count</em></a>. Paul’s implementation is <a href="https://github.com/concurrencykit/ck/blob/master/include/ck_ec.h">available in concurrency kit</a> and we’ll mention it again at Level 0.</p>
<p>System calls often creep in when home-grown synchronization solutions are used, e.g., using Windows events to build your own read-write lock or striped lock or whatever the flavor of the day is. You can often remove the call in the fast path by making a check in user-space to see if a system call is necessary. For example, rather than unconditionally unblocking any waiters when releasing some exclusive object, <em>check</em> to see if there are waiters<sup id="fnref:tricky" role="doc-noteref"><a href="#fn:tricky" class="footnote" rel="footnote">36</a></sup> in userspace and skip the system call if there are none.</p>
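<p>A rough illustration of this waiter-check pattern (the class and all names below are invented for the example, and the waiter-registration race is handled only loosely — this is the idea, not production code):</p>

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Sketch only: skip the notify (which may enter the kernel) when a
// user-space check shows nobody is waiting.
class cheap_event {
    std::atomic<int> waiters{0};
    std::atomic<bool> signaled{false};
    std::mutex m;
    std::condition_variable cv;
public:
    int notifies = 0;  // instrumentation for this example only

    void wait() {
        waiters.fetch_add(1);
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [&] { return signaled.load(); });
        waiters.fetch_sub(1);
    }

    void signal() {
        signaled.store(true);
        if (waiters.load() > 0) {  // no waiters: skip the notify entirely
            std::lock_guard<std::mutex> g(m);
            cv.notify_all();
            notifies++;
        }
    }
};
```

<p>On the fast path — signaling when no one is waiting — no notify is issued at all.</p>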
<p>If a lock is generally held for a short period, you can avoid unnecessary system calls and context switches with a hybrid lock that spins for an appropriate<sup id="fnref:spin" role="doc-noteref"><a href="#fn:spin" class="footnote" rel="footnote">37</a></sup> amount of time before blocking. This can trade tens of nanoseconds of spinning for hundreds or thousands of nanoseconds of system calls.</p>
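<p>A minimal sketch of such a hybrid lock (illustrative only: the spin count is a made-up tuning knob, and real implementations use a CPU pause hint inside the loop):</p>

```cpp
#include <mutex>

// Spin-then-block hybrid: try to grab the lock a bounded number of times
// before paying for a potentially blocking (syscall-backed) acquisition.
class hybrid_mutex {
    std::mutex m;
    static constexpr int kSpins = 100;  // illustrative tuning knob
public:
    void lock() {
        for (int i = 0; i < kSpins; i++) {
            if (m.try_lock())
                return;  // a short hold was resolved without blocking
            // real code would execute a pause/yield hint here
        }
        m.lock();  // give up spinning and (possibly) block in the kernel
    }
    void unlock() { m.unlock(); }
    bool try_lock() { return m.try_lock(); }
};
```

<p>Since it satisfies the <em>Lockable</em> requirements, it drops into <code class="language-plaintext highlighter-rouge">std::lock_guard</code> like any other mutex.</p>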
<p>Ensure your use of threads is “right sized” as much as possible. A lot of unnecessary context switches occur when many more threads are running than there are CPUs, and this increases the chance of a lock being held during a context switch (and makes it worse when it does happen: it takes longer for the holding thread to run again as the scheduler probably cycles through all the other runnable threads first).</p>
<h4 id="level-1">Level 1</h4>
<p>A lot of code that does the work to get to level 2 actually ends up in level 1. Recall that the primary difference between level 1 and 2 is the lack of contention in level 1. So if your process naturally or by design has low contention, simply using existing off-the-shelf synchronization like <code class="language-plaintext highlighter-rouge">std::mutex</code> can get you to level 1.</p>
<p>I can’t give a step-by-step recipe for reducing contention, but here’s a laundry list of things to consider:</p>
<ul>
<li>Keep your critical sections as short as possible. Ensure you do any heavy work that doesn’t directly involve a shared resource outside of the critical section. Sometimes this means making a copy of the shared data to work on it “outside” of the lock, which might increase the total amount of work done, but reduce contention.</li>
<li>For things like atomic counters, try to batch your updates: e.g., if you update the counter multiple times during some operation, update a local on the stack rather than the global counter and only “upload” the entire value once at the end.</li>
<li>Consider using structures that use fine-grained locks, striped locks or similar mechanisms that reduce contention by locking only portions of a container.</li>
<li>Consider per-CPU structures, as in the examples above, or some approximation of them (e.g., hashing the current thread ID into an array of structures). This post used an atomic counter as a simple example, but it applies more generally to any case where the mutations can be done independently and aggregated later.</li>
</ul>
<p>For all of the advice above, when I say <em>consider doing X</em> I really mean <em>consider finding and using an existing off-the-shelf component that does X</em>. Writing concurrent structures yourself should be considered a last resort – despite what you think, your use case is probably not all that unique.</p>
<p>Level 1 is where a lot of well written, straightforward and high-performing concurrent code lives. There is nothing wrong with this level – it is a happy place.</p>
<h4 id="level-0">Level 0</h4>
<p>It is not always easy or possible to remove the last atomic access from your fast paths, but if you just can’t live with the extra ~10 ns, here are some options:</p>
<ul>
<li>The general approach of using thread local storage, as discussed above, can also be extended to structures more complicated than counters.</li>
<li>You may be able to achieve fewer than one expensive atomic instruction per logical operation by <em>batching:</em> saving up multiple operations and then committing them all at once with a small fixed number of atomic operations. Some containers or concurrent structures may have a batched API which does this for you, but even if not you can sometimes add batching yourself, e.g., by inserting collections of elements rather than a single element<sup id="fnref:hiddenbatch" role="doc-noteref"><a href="#fn:hiddenbatch" class="footnote" rel="footnote">38</a></sup>.</li>
<li>Many lock-free structures offer atomic-free <em>read</em> paths, notably concurrent containers in garbage collected languages, such as <code class="language-plaintext highlighter-rouge">ConcurrentHashMap</code> in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a <a href="http://concurrencyfreaks.blogspot.com/2017/08/why-is-memory-reclamation-so-important.html">hard problem</a>, but there are still <a href="https://web.archive.org/web/20220626105301id_/http://concurrencykit.org/">some</a> <a href="https://software.intel.com/content/www/us/en/develop/documentation/tbb-documentation/top/intel-threading-building-blocks-developer-guide/containers.html">good</a> <a href="https://github.com/facebook/folly/tree/master/folly/concurrency">options</a> out there.</li>
<li>I find that <a href="https://liburcu.org/">RCU</a> is especially powerful and fairly general if you are using a garbage collected language, or can satisfy the requirements for an efficient reclamation method in a non-GC language.</li>
<li>The <a href="https://en.wikipedia.org/wiki/Seqlock">seqlock</a><sup id="fnref:despite" role="doc-noteref"><a href="#fn:despite" class="footnote" rel="footnote">39</a></sup> is an underrated and little known alternative to RCU without reclaim problems, although not as general. Concurrencykit has <a href="https://web.archive.org/web/20220711221845id_/http://concurrencykit.org/doc/ck_sequence.html">an implementation</a>. It has an atomic-free read path for readers. Unfortunately, seqlocks don’t integrate cleanly with either the Java<sup id="fnref:stampedlock" role="doc-noteref"><a href="#fn:stampedlock" class="footnote" rel="footnote">40</a></sup> or <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1478r1.html">C++</a> memory models.</li>
<li>It is also possible in some cases to do a per-CPU rather than a per-thread approach using only vanilla instructions, although the possibility of interruption at any point makes this tricky. <a href="https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/">Restartable sequences (rseq)</a> can help, and there are other tricks lurking out there.</li>
<li>Event counts, mentioned earlier, <a href="https://pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-reordering/#event-counts-with-x86-tso-and-futexes:~:text=However%2C%20if%20we%20go">can even be level 0</a> in a single writer scenario, as Paul shows.</li>
<li>This is the last point, but it should be the first: you can probably often redesign your algorithm or application to avoid sharing data in the first place, or to share much less. For example, rather than constantly updating a shared collection with intermediate results, do as much private computation as possible before only merging the final results.</li>
</ul>
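<p>As a taste of the seqlock idea mentioned above, here is a minimal single-writer sketch (illustrative only: as noted, a strictly conforming version is awkward in the C++ memory model, because the payload reads technically race with the writer):</p>

```cpp
#include <atomic>
#include <cstdint>

// Single-writer seqlock sketch: the sequence is odd while a write is in
// progress; readers retry if it was odd or changed under them, and never
// execute an atomic RMW on the read path.
struct seq_pair {
    std::atomic<uint32_t> seq{0};
    uint64_t a = 0, b = 0;  // protected payload (plain, hence the caveat)

    void write(uint64_t x, uint64_t y) {  // single writer only
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);      // odd: in progress
        std::atomic_thread_fence(std::memory_order_release);
        a = x;
        b = y;
        seq.store(s + 2, std::memory_order_release);      // even: complete
    }

    bool try_read(uint64_t& x, uint64_t& y) {  // atomic-free read path
        uint32_t s0 = seq.load(std::memory_order_acquire);
        if (s0 & 1) return false;             // write in progress, retry
        x = a;
        y = b;
        std::atomic_thread_fence(std::memory_order_acquire);
        return seq.load(std::memory_order_relaxed) == s0;  // unchanged?
    }
};
```

<p>A reader loops on <code class="language-plaintext highlighter-rouge">try_read</code> until it returns true; under a mostly-idle writer it succeeds on the first attempt with two plain loads of the sequence and no RMW at all.</p>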
<h3 id="summary">Summary</h3>
<p>We looked at the six different levels that make up this concurrency cost hierarchy. The slow half (3, 4 and 5) are all basically performance bugs. You should be able to achieve level 2 or level 1 (if you naturally have low contention) for most designs fairly easily and those are probably what you should target by default. Level 1 in a contended scenario and level 0 are harder to achieve and often come with difficult tradeoffs, but the performance boost can be significant: often one or more orders of magnitude.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to Paul Khuong who <a href="https://pvk.ca/Blog/2020/07/07/flatter-wait-free-hazard-pointers">showed me something</a> that made me reconsider in what scenarios level 0 is achievable, and for typo fixes.</p>
<p>Thanks to <a href="https://twitter.com/never_released">@never_released</a> for help on a problem I had bringing up an EC2 bare-metal instance (tip: just wait).</p>
<p>Special thanks to <a href="https://twitter.com/matt_dz">matt_dz</a> and Zach Wenger for helping fix about <em>sixty</em> typos between them.</p>
<p>Thanks to Alexander Monakov, Dave Andersen, Laurent and Kriau for reporting typos, and Aaron Jacobs for suggesting clarifications to the level 0 definition.</p>
<p>Traffic light photo by <a href="https://unsplash.com/@harshaldesai">Harshal Desai</a> on <a href="https://unsplash.com/s/photos/traffic-light">Unsplash</a>.</p>
<h3 id="discussion-and-feedback">Discussion and Feedback</h3>
<p>You can leave a <a href="#comment-section">comment below</a> or discuss on <a href="https://news.ycombinator.com/item?id=23749172">Hacker News</a>, <a href="https://www.reddit.com/r/programming/comments/hma5y1/a_concurrency_cost_hierarchy/">r/programming</a> or <a href="https://www.reddit.com/r/cpp/comments/hmaocb/a_concurrency_cost_hierarchy/">r/cpp</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:realworld" role="doc-endnote">
<p>Well, this is quite real world: such atomic counters are used widely for a variety of purposes. I throw the <em>if you squint</em> in there because, after all, we are using microbenchmarks which simulate a probably-unrealistic density of increments to this counter, and it is a <em>bit</em> of a stretch to make this one example span all five levels – but I tried! <a href="#fnref:realworld" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:g2cores" role="doc-endnote">
<p>The Graviton 2 bare metal hardware has 64 cores, but this instance size makes 16 of them available. This means that in principle the results can be affected by the coherency traffic of other tenants on the same hardware, but the relatively stable results seem to indicate it doesn’t affect the results much. <a href="#fnref:g2cores" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:parisc" role="doc-endnote">
<p>Some hardware supports very limited atomic operations, which may be mostly useful <em>only</em> for locking, although you can <a href="https://parisc.wiki.kernel.org/index.php/FutexImplementation">get tricky</a>. <a href="#fnref:parisc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:post" role="doc-endnote">
<p>We could actually use either the pre or post-increment version of the operator here. The usual advice is to prefer the pre-increment form <code class="language-plaintext highlighter-rouge">++c</code> as it can be faster as it can return the mutated value, rather than making a copy to return after the mutation. Now this advice rarely applies to primitive values, but atomic increment is actually an interesting case which turns it on its head: the post-increment version is probably better (at least, never slower) since the underlying hardware operation returns the previous value. So it’s <a href="https://godbolt.org/z/p4TDjX">at least one extra operation</a> to calculate the pre-increment value (or much worse, apparently, if icc gets involved). <a href="#fnref:post" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:atomicsup" role="doc-endnote">
<p>Many ISAs, including POWER and ARM, traditionally only included support for a <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>-like or <a href="https://en.wikipedia.org/wiki/Load-link/store-conditional">LL/SC</a> operation, without specific support for more specific atomic arithmetic operations. The idea, I think, was that you could build any operation you want on top of of these primitives, at the cost of “only” a small retry loop and that’s more RISC-y, right? This seems to be changing as ARMv8.1 got a bunch of atomic operations. <a href="#fnref:atomicsup" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:java" role="doc-endnote">
<p>From its introduction through Java 7, the <code class="language-plaintext highlighter-rouge">AtomicInteger</code> and related classes in Java implemented all their atomic operations on top of a <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> loop, as <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> was the only primitive implemented as an intrinsic. In Java 8, almost exactly a decade later, these were finally replaced with dedicated atomic RMWs where possible, with <a href="http://ashkrit.blogspot.com/2014/02/atomicinteger-java-7-vs-java-8.html">good results</a>. <a href="#fnref:java" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:l3" role="doc-endnote">
<p>On my system and most (all?) modern Intel systems this is essentially the L3 cache, as the caching home agent (CHA) lives in or adjacent to the L3 cache. <a href="#fnref:l3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:inter" role="doc-endnote">
<p>This doesn’t imply that each atomic operation needs to take 70 cycles under contention: a single core could do <em>multiple</em> operations on the cache line after it gains exclusive ownership, so the cost of obtaining the line could be amortized over all of these operations. How much of this occurs is a measure of fairness: a very fair CPU will not let any core monopolize the line for long, but this makes highly concurrent benchmarks like this slower. Recent Intel CPUs seem quite fair in this sense. <a href="#fnref:inter" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:notwhat" role="doc-endnote">
<p>It also might not <a href="https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752">work how you think</a>, depending on details of the OS scheduler. <a href="#fnref:notwhat" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spectre" role="doc-endnote">
<p>They used to be cheaper: based on my measurements the cost of system calls has more than doubled, on most Intel hardware, after the Spectre and Meltdown mitigations were applied. <a href="#fnref:spectre" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:hyst" role="doc-endnote">
<p>An interesting thing about convoys is that they exhibit hysteresis: once a convoy starts, it becomes self-sustaining, even if the conditions that started it are removed. Imagine two threads that lock a common lock for 1 nanosecond every 10,000 nanoseconds. Contention is low: the chance of any particular lock acquisition being contended is 0.01%. However, as soon as a contended acquisition occurs, the lock effectively becomes held for the amount of time it takes to do a full context switch (for the losing thread to block, and then to wake up). If that’s longer than 10,000 nanoseconds, the convoy will sustain itself indefinitely, until something happens to break the loop (e.g., one thread deciding to work on something else). A restart also “fixes” it, which is one of many possible explanations for processes that suddenly shoot to 100% CPU (while still making progress) and are cured only by a restart. Everything becomes worse with more than two threads, too. <a href="#fnref:hyst" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:huh" role="doc-endnote">
<p>Actually I find it remarkable that this performs about as well as the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>-based <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>, since the fairness necessarily implies that the lock is acquired in a round-robin order, so the cache line with the lock must at a minimum move around to each acquiring thread. This is a real stress test of the arbitration and coherency mechanisms offered by the CPU. <a href="#fnref:huh" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:noht" role="doc-endnote">
<p>And no SMT enabled, so there are 4 logical processors from the point of view of the OS. <a href="#fnref:noht" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:once" role="doc-endnote">
<p>In fact, the <em>once</em> scenario is the most likely, since one would assume with homogeneous threads the scheduler will approximate something like round-robin scheduling. So the thread that is descheduled is most likely the one that is also closest to the head of the lock queue, because it had been spinning the longest. <a href="#fnref:once" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:perf" role="doc-endnote">
<p>In fact, you can measure this with <code class="language-plaintext highlighter-rouge">perf</code> and see that the total number of context switches is usually within a factor of 2 for both tests, when oversubscribed. <a href="#fnref:perf" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:csdepend" role="doc-endnote">
<p>There is another level of complication here: the convoy only gets set up when the fifth thread joins the fun <em>if</em> the thread that gets switched out had expressed interest in the lock before it lost the CPU. That is, after a thread unlocks the lock, there is a period before it gets a new ticket as it tries to obtain the lock again. Before it gets that ticket, it is essentially invisible to the other threads, and if it gets context switched out, the catastrophic convoy won’t be set up (because the new set of four threads will be able to efficiently share the lock among themselves). <a href="#fnref:csdepend" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:notexx" role="doc-endnote">
<p>This won’t always <em>necessarily</em> be the case. You could write a primitive that always makes a system call, putting it at level 3, even if there is no contention, but here I’ve made sure to always have a no-syscall fast path for the no-contention case. <a href="#fnref:notexx" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:atomhow" role="doc-endnote">
<p>On Intel hardware you can use <a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/master/scripts/details.sh">details.sh</a> to collect the atomic instruction count easily, taking advantage of the <code class="language-plaintext highlighter-rouge">mem_inst_retired.lock_loads</code> performance counter. <a href="#fnref:atomhow" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:casworse" role="doc-endnote">
<p>The <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> implementation comes off looking slightly worse than the other single-atomic implementations here because the load required to set up the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> value effectively adds to the dependency chain involving the atomic operation, which explains the 5-cycle difference with <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>. This goes away if you can do a <em>blind <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr></em> (e.g., in locks’ acquire paths), but that’s not possible here. <a href="#fnref:casworse" class="reversefootnote" role="doc-backlink">↩</a></p>
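<p>A sketch of what a <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> increment looks like (my own illustration, not the benchmark’s exact code): the initial load that produces the expected value is exactly what lengthens the dependency chain relative to a plain <code class="language-plaintext highlighter-rouge">fetch_add</code>.</p>

```cpp
#include <atomic>
#include <cstdint>

// Increment via a CAS loop: the load feeds the CAS, adding to the
// dependency chain (the source of the ~5-cycle gap vs. atomic add).
void cas_add(std::atomic<uint64_t>& counter) {
    uint64_t cur = counter.load();  // sets up the expected value
    while (!counter.compare_exchange_weak(cur, cur + 1)) {
        // on failure, `cur` is refreshed with the observed value; retry
    }
}
```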
</li>
<li id="fn:twolayer" role="doc-endnote">
<p>Actually three layers, <a href="https://github.com/gcc-mirror/gcc/blob/4ff685a8705e8ee55fa86e75afb769ffb0975aea/libstdc%2B%2B-v3/include/bits/std_mutex.h#L98">libstdc++</a>, then <a href="https://github.com/gcc-mirror/gcc/blob/4ff685a8705e8ee55fa86e75afb769ffb0975aea/libgcc/gthr-posix.h#L775">libgcc</a> and then finally pthreads. I’ll count the first two as one though because those can all inline into the caller. Based on a rough accounting, probably 75% of the instruction count comes from pthreads, the rest from the other two layers. The pthreads mutexes are more general purpose than what <code class="language-plaintext highlighter-rouge">std::mutex</code> offers (e.g., they support recursion), and the features are configured at runtime on a per-mutex basis, so that explains a lot of the additional work these functions are doing. It’s only due to the cost of atomic operations that <code class="language-plaintext highlighter-rouge">std::mutex</code> doesn’t take a significant penalty compared to a more svelte design. <a href="#fnref:twolayer" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:padded" role="doc-endnote">
<p><em>Padded</em> means that the counters are aligned such that each falls into its own 64 byte cache line, to avoid <a href="https://en.wikipedia.org/wiki/False_sharing"><em>false sharing</em></a>. This means that even though each counter only has 8 bytes of logical payload, it requires 64 bytes of storage. Some people claim that you need to pad out to 128 bytes, not 64, to avoid the effect of the <em>adjacent line prefetcher</em> which fetches the 64-byte line that completes an aligned 128-byte pair of lines. However, I have not observed this effect often on modern CPUs. Maybe the prefetcher is conservative and doesn’t trigger unless past behavior indicates the fetches are likely to be used, or the prefetch logic can detect and avoid cases of false sharing (e.g., by noticing when prefetched lines are subsequently invalidated by a snoop). <a href="#fnref:padded" class="reversefootnote" role="doc-backlink">↩</a></p>
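<p>A concrete sketch of such padding (my own illustration): <code class="language-plaintext highlighter-rouge">alignas(64)</code> forces each counter into its own cache line, and the compiler pads the struct out to a multiple of its alignment.</p>

```cpp
#include <atomic>
#include <cstdint>

// One padded counter slot: 8 bytes of payload, 64 bytes of storage.
// alignas(64) guarantees each slot starts on its own cache line and
// (since size is rounded up to alignment) occupies it exclusively.
struct alignas(64) padded_counter {
    std::atomic<uint64_t> value{0};
    // 56 bytes of implicit trailing padding fill out the line
};

static_assert(sizeof(padded_counter) == 64,
              "8 bytes of payload costs a full cache line of storage");
```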
</li>
<li id="fn:notallfailure" role="doc-endnote">
<p>Actually, not <em>all</em> failure indicates contention: there is a small chance also that a context switch exactly splits the load and the subsequent <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>, and in this case the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> would fail when the thread was scheduled again if any thread that ran in the meantime updated the same counter. Treating this as contention doesn’t really cause any serious problems. <a href="#fnref:notallfailure" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:perfsame" role="doc-endnote">
<p>Not surprising, since there is no contention and the fast path looks the same for either algorithm: a single <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> that always succeeds. <a href="#fnref:perfsame" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:eightbyte" role="doc-endnote">
<p>Here I’m assuming that <code class="language-plaintext highlighter-rouge">sizeof(std::atomic<uint64_t>)</code> is 8, and this is the case on all current mainstream platforms. Also, you may or may not want to pad out the single-counter version to 64 bytes as well, to avoid some <em>potential</em> false sharing with nearby values, but this is different than the multi-counter case where padding is obligatory to avoid guaranteed false sharing. <a href="#fnref:eightbyte" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:read" role="doc-endnote">
<p>In this limited case, I <em>think</em> <code class="language-plaintext highlighter-rouge">read()</code> provides the same guarantees as the single-counter case. Informally, <code class="language-plaintext highlighter-rouge">read()</code> returns some value that the counter had at some point between the start and end of the <code class="language-plaintext highlighter-rouge">read()</code> call. Formally, there is a <em>linearization point</em> within <code class="language-plaintext highlighter-rouge">read()</code> although this point can only be determined in retrospect by examining the returned value (unlike the single-counter approaches, where the linearization is clear regardless of the value). However, <em>this is only true because the only mutating operation is <code class="language-plaintext highlighter-rouge">increment()</code></em>. If we also offered a <code class="language-plaintext highlighter-rouge">decrement()</code> method, this would no longer be true: you could read values that the logical counter never had based on the sequence of increments and decrements. Specifically, if you execute <code class="language-plaintext highlighter-rouge">increment(); decrement(); increment()</code> and even if you know these operations are strictly ordered (e.g., via locking), a concurrent call to <code class="language-plaintext highlighter-rouge">read()</code> could return <em>2</em>, even though the counter never logically exceeded 1. <a href="#fnref:read" class="reversefootnote" role="doc-backlink">↩</a></p>
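<p>A sketch of such a <code class="language-plaintext highlighter-rouge">read()</code> (the array size and names here are hypothetical, not the post’s exact code): it simply sums the per-slot counters, and because the only mutation is increment, the sum is a value the logical counter held at some instant during the call.</p>

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr size_t NUM_COUNTERS = 8;  // hypothetical fixed slot count

std::atomic<uint64_t> counters[NUM_COUNTERS];  // zero-initialized

// Sum all slots. With increment-only writers this has a (retrospective)
// linearization point; add a decrement() and that guarantee evaporates.
uint64_t read_counters() {
    uint64_t sum = 0;
    for (auto& c : counters) {
        sum += c.load(std::memory_order_relaxed);
    }
    return sum;
}
```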
</li>
<li id="fn:sharedidx" role="doc-endnote">
<p>In particular, if contention is seen on one object, the per-thread index will change to avoid it, which changes the index of all other objects as well, even if they have not seen any contention. This doesn’t seem like much of a problem for this simple implementation (which index we write to doesn’t matter much), but it could make some other optimizations more difficult: e.g., if we size the counter array dynamically, we don’t want to unnecessarily change the <code class="language-plaintext highlighter-rouge">idx</code> for uncontended objects, since it requires a larger counter array, unnecessarily. <a href="#fnref:sharedidx" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rtcost" role="doc-endnote">
<p>At least, an extra indirection to access the array which is no longer embedded in the object<sup id="fnref:soa" role="doc-noteref"><a href="#fn:soa" class="footnote" rel="footnote">41</a></sup>, and checks to ensure the array is large enough. Furthermore, we have another decision to make: when to expand the array. How much contention should we suffer before we decide the array is too small? <a href="#fnref:rtcost" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:subtle" role="doc-endnote">
<p>The intuition is that later counter positions only get written when an earlier position failed a compare and swap, which necessarily implies it was written to by some other thread and hence non-zero. There is some subtlety here: this wouldn’t hold if <code class="language-plaintext highlighter-rouge">compare_exchange_weak</code> was used instead of <code class="language-plaintext highlighter-rouge">compare_exchange_strong</code>, and it more obviously wouldn’t apply if we allowed decrements or wanted to change the “probe” strategy. <a href="#fnref:subtle" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lesshot" role="doc-endnote">
<p>I’m not sure if “less hot” means <code class="language-plaintext highlighter-rouge">__attribute__((cold))</code> necessarily, that might be <em>too</em> cold. We mostly just want to separate the first-cas-succeeds case and the rest of the logic so we don’t pay the dynamic code size impact except when the fallback path is taken. <a href="#fnref:lesshot" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:dynamictls" role="doc-endnote">
<p>A sketch of an implementation would be to use something like a single static <code class="language-plaintext highlighter-rouge">thread_local</code> pointer to an array or map, which maps an ID contained in the dynamic TLS key to the object data. Lookup speed is important, which favors an array, but you also need to be able to remove elements, which can favor some type of hash map. All of this is probably at least twice as slow as a plain <code class="language-plaintext highlighter-rouge">thread_local</code> access … or just use <a href="https://github.com/facebook/folly/blob/master/folly/docs/ThreadLocal.md">folly</a> or <a href="https://www.boost.org/doc/libs/1_73_0/doc/html/thread/thread_local_storage.html">boost</a>. <a href="#fnref:dynamictls" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:noatomic" role="doc-endnote">
<p>On x86, what’s vanilla and what isn’t is fairly cut and dried: regular memory accesses and read-modify-write instructions are <em>vanilla</em> while LOCKed <abbr title="Read-modify-write: an instruction that reads from a memory location, operates on the value, and writes the result back to the same location.">RMW</abbr> instructions, whether explicit like <code class="language-plaintext highlighter-rouge">lock inc</code> or implicit like <a href="https://www.felixcloutier.com/x86/xchg"><code class="language-plaintext highlighter-rouge">xchg [mem], reg</code></a>, are not and are an order of magnitude slower. Out of the fences, <code class="language-plaintext highlighter-rouge">mfence</code> is also a slow non-vanilla instruction, comparable in cost to a LOCKed instruction. On other platforms like ARM or POWER, there may be shades of grey: you still have vanilla accesses on one end, and expensive full barriers like <code class="language-plaintext highlighter-rouge">dmb</code> on ARM or <code class="language-plaintext highlighter-rouge">sync</code> on POWER at the other, but you also have things in the middle with some additional ordering guarantees but still short of sequential consistency. This includes things like <code class="language-plaintext highlighter-rouge">LDAR</code> and <code class="language-plaintext highlighter-rouge">LDAPR</code> on ARM which implement sort of a sliding scale of load ordering and performance. Still, on any given hardware, you might find that instructions generally fall into a “cheap” (vanilla) and “expensive” (non-vanilla) bucket. <a href="#fnref:noatomic" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whyatomic" role="doc-endnote">
<p>The only reason we even need <code class="language-plaintext highlighter-rouge">std::atomic<uint64_t></code> at all is because it is <em>undefined behavior</em> to have concurrent access to any variable if at least one access is a write. Since the owning thread is making writes, this would <em>technically</em> be a violation of the standard if there was a concurrent <code class="language-plaintext highlighter-rouge">tls_counter::read()</code> call. Most actual hardware has no problem with concurrent reads and writes like this, but it’s better to stay on the right side of the law. Some hardware could also exhibit <em>tearing</em> of the writes, and <code class="language-plaintext highlighter-rouge">std::atomic</code> guarantees this doesn’t happen. That is, the read and write are still <em>individually</em> atomic. <a href="#fnref:whyatomic" class="reversefootnote" role="doc-backlink">↩</a></p>
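<p>A minimal sketch of the idea (names are mine, and a single global stands in for one thread’s slot): the owner does a relaxed load/store pair, which compiles to vanilla instructions, while the <code class="language-plaintext highlighter-rouge">std::atomic</code> wrapper exists only to make concurrent reads defined and tear-free.</p>

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> slot{0};  // stands in for one thread's TLS slot

// Owner thread only: a relaxed load + store pair. No lock prefix, no
// fence -- on x86 this is just two plain movs plus an add.
void owner_increment() {
    uint64_t v = slot.load(std::memory_order_relaxed);
    slot.store(v + 1, std::memory_order_relaxed);
}

// Any thread: a relaxed atomic load can't tear, unlike a concurrent
// read of a plain uint64_t, which would be undefined behavior.
uint64_t reader_peek() {
    return slot.load(std::memory_order_relaxed);
}
```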
</li>
<li id="fn:barrier" role="doc-endnote">
<p>If you use the default <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code>, on x86 gcc inserts an <code class="language-plaintext highlighter-rouge">mfence</code> which makes this <em>even slower than an atomic increment</em> since <code class="language-plaintext highlighter-rouge">mfence</code> is generally slower than instructions with a lock prefix (it has slightly stronger barrier semantics). <a href="#fnref:barrier" class="reversefootnote" role="doc-backlink">↩</a></p>
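<p>In code, the trap looks like this (a sketch of my own): the two stores below are a one-token difference at the source level, but on x86 gcc compiles the default-ordered one to <code class="language-plaintext highlighter-rouge">mov</code> + <code class="language-plaintext highlighter-rouge">mfence</code> while the release store is a plain <code class="language-plaintext highlighter-rouge">mov</code>.</p>

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// Default ordering (seq_cst): on x86, gcc emits `mov` + `mfence`,
// which is generally slower than a lock-prefixed RMW.
void store_seq_cst(uint64_t v) {
    counter.store(v);  // memory_order_seq_cst implied
}

// Release ordering: a plain `mov` on x86 -- sufficient when we only
// need the value published, not ordering against unrelated accesses.
void store_release(uint64_t v) {
    counter.store(v, std::memory_order_release);
}
```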
</li>
<li id="fn:whitelie" role="doc-endnote">
<p>This is a very small white lie. I’ll explain more elsewhere. <a href="#fnref:whitelie" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tlsmem" role="doc-endnote">
<p>On the other hand, the TLS approach doesn’t need padding since the counters will generally appear next to other thread-local data, and not subject to false sharing, which means an 8x reduction (from 64 to 8 bytes) in the per-counter size, so if your process has a number of threads roughly equal to the number of cores, you will probably <em>save</em> memory over the per-CPU approach. <a href="#fnref:tlsmem" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tricky" role="doc-endnote">
<p>It’s easy to introduce a missed wakeup problem if this isn’t done correctly. The usual cause is a race condition between some waiter arriving at a lock-like thing, seeing that it’s locked and then indicating interest, but in the critical region of that check-then-act the owning thread left the lock and didn’t see any waiters. The waiter blocks but there is nobody to unblock them. These bugs often go undetected since the situation resolves itself as soon as another thread arrives, so in a busy system you might not notice the temporarily hung threads. The <code class="language-plaintext highlighter-rouge">futex</code> system call is basically designed to make solving this easy, while the Event stuff in Windows requires a bit more work (usually based on a compare-and-swap). <a href="#fnref:tricky" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spin" role="doc-endnote">
<p>An “appropriate” time is probably something like the typical runtime of the locked region. Basically you want to spin in any case where the lock is held by a currently running thread, which will release it soon. As soon as you’ve been spinning for more than the typical hold time of the lock, it becomes much more likely you are simply waiting for a lock held by a thread that is <em>not</em> running (e.g., it was unlucky enough to incur a context switch while it held the lock). In that case, you are better off sleeping. <a href="#fnref:spin" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:hiddenbatch" role="doc-endnote">
<p>An interesting design point is a data type that implements batching internally behind an API offering single-element operations. For example, a queue might decide that added elements won’t be immediately consumed (because there are already some elements in the queue), and hold them in a local staging area until several can be added as a batch, or until their absence would be noticed. <a href="#fnref:hiddenbatch" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:despite" role="doc-endnote">
<p>Despite the current claim in wikipedia that seqlocks are somehow a Linux-specific construct involving the kernel, they work great in userspace only and are not tied to Linux. It is likely they were not invented for use in Linux but <a href="https://twitter.com/davidtgoldblatt/status/1280189008803278848">pre-dated</a> the OS, although maybe the use in Linux was where the name <em>seqlock</em> first appeared? <a href="#fnref:despite" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:stampedlock" role="doc-endnote">
<p>Java does provide <a href="https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/locks/StampedLock.html">StampedLock</a> which offers seqlock functionality. <a href="#fnref:stampedlock" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:soa" role="doc-endnote">
<p>Of course, we could go even <em>one step further</em> and embed a small array of 1 or 2 elements in the counter object, in the hope that this is enough and only use a dynamically allocated array and suffer the additional indirection if we observe contention. <a href="#fnref:soa" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comConcurrent operations can be grouped relatively neatly into categories based on their costAVX-512 Mask Registers, Again2020-05-26T00:00:00+00:002020-05-26T00:00:00+00:00https://travisdowns.github.io/blog/2020/05/26/kreg2<!-- boilerplate
page.assets: /assets/kreg2
assetpath: /assets/kreg2
tablepath: /misc/tables/kreg2
-->
<h2 id="exposition">Exposition</h2>
<p><a href="/blog/2019/12/05/kreg-facts.html">Not that long ago</a> we looked at the AVX-512 mask registers. Specifically, the number of physical registers underlying the eight architectural ones, and some other behaviors such as zeroing idioms. Recently, a high resolution die shot of <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> appeared, and I thought it would be cool to verify our register count by visual inspection.</p>
<p>After all, rather than writing some complex software to test hardware, why not <em>simply</em> use a series of noxious chemicals and manual labor to painstakingly expose the CPU, then carefully photograph it with a microscope and stitch the photos together and finally, <em>just use our eyes</em> to count the registers? If that doesn’t sound all that easy, you are not alone, but as luck would have it someone else has already done that part.</p>
<p>While trying to simply count the mask registers, I ran across something else even more interesting<sup id="fnref:bar" role="doc-noteref"><a href="#fn:bar" class="footnote" rel="footnote">1</a></sup> instead…</p>
<ul id="markdown-toc">
<li><a href="#exposition" id="markdown-toc-exposition">Exposition</a></li>
<li><a href="#rising-action" id="markdown-toc-rising-action">Rising Action</a> <ul>
<li><a href="#the-die-shot" id="markdown-toc-the-die-shot">The Die Shot</a></li>
<li><a href="#the-register-files" id="markdown-toc-the-register-files">The Register Files</a></li>
<li><a href="#the-mystery-block" id="markdown-toc-the-mystery-block">The Mystery Block</a></li>
<li><a href="#lets-get-legacy" id="markdown-toc-lets-get-legacy">Let’s Get Legacy</a></li>
<li><a href="#testing-our-theory" id="markdown-toc-testing-our-theory">Testing Our Theory</a></li>
</ul>
</li>
<li><a href="#some-missing-pieces" id="markdown-toc-some-missing-pieces">Some Missing Pieces</a></li>
<li><a href="#thanks" id="markdown-toc-thanks">Thanks</a></li>
<li><a href="#discussion-and-feedback" id="markdown-toc-discussion-and-feedback">Discussion and Feedback</a></li>
</ul>
<h2 id="rising-action">Rising Action</h2>
<h3 id="the-die-shot">The Die Shot</h3>
<p>We’re interested in this die shot, recently released by <a href="https://twitter.com/FritzchensFritz">Fritzchens Fritz</a> on <a href="https://www.flickr.com/photos/130561288@N04/49825363402/in/photostream/">Flickr</a>. We’ll be focusing on the highlighted area, which seems to have all the <a href="https://en.wikipedia.org/wiki/Register_file"><em>register files</em></a> on the chip. If you want a full breakdown of the core, you can check guesses <a href="https://twitter.com/GPUsAreMagic/status/1256866465577394181">here on Twitter</a>, <a href="https://www.realworldtech.com/forum/?threadid=191663&curpostid=191916">on RWT</a> and on <a href="https://community.intel.com/t5/Software-Tuning-Performance/Diagram-for-Skylake-SP-core/m-p/1166819">Intel’s forums</a>.</p>
<p><img src="/assets/kreg2/skx-full-small.jpg" alt="SKL full" class="no-invert" /></p>
<h3 id="the-register-files">The Register Files</h3>
<p>Here’s a close-up of that section, with the assumed register files and their purpose labeled<sup id="fnref:xmmetc" role="doc-noteref"><a href="#fn:xmmetc" class="footnote" rel="footnote">2</a></sup>.</p>
<p><img src="/assets/kreg2/zoomed.png" alt="SKL zoomed" class="no-invert" /></p>
<p>We guess the register file identities based on:</p>
<ul>
<li>The general purpose register files are of the right relative width (64 bits), and are in the right position below the integer execution units, and seem to have <code class="language-plaintext highlighter-rouge">EFLAGS</code> nearby.</li>
<li>The <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers are obvious from their size and positioning underneath the vector pipes.</li>
<li>The upper 256 bits of the 512-bit <code class="language-plaintext highlighter-rouge">zmm</code> registers (labelled ZMM on the closeup) can be determined from comparing the <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr><sup id="fnref:kbl" role="doc-noteref"><a href="#fn:kbl" class="footnote" rel="footnote">3</a></sup> (no AVX-512) and <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> (has AVX-512) dies and noting that the bottom file is not present in <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> (a large empty area is present at that spot in <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr>).</li>
</ul>
<h3 id="the-mystery-block">The Mystery Block</h3>
<p>This leaves the mystery block in red. This block is in a prime spot below the vector execution units. Could it be the mask registers (kregs)? We found in the <a href="/blog/2019/12/05/kreg-facts.html">first post</a> that the mask registers aren’t shared with either the scalar or <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers, so we expect them to have their own physical register file. Maybe this is it?</p>
<p>Let’s compare the mystery register file to the integer register file, since they should be similar in size and appear to be implemented similarly:</p>
<p><img src="/assets/kreg2/compare.png" alt="SKL zoomed" /></p>
<p>Looking at the general purpose register file on the left, each block (6 of which are numbered on the general purpose file) seems to implement 16 bits, since if you zoom in you see a repeating structure of 16 elements, and 4 blocks make 64 bits total which is the expected width of the file. We know from published numbers that the integer register file has 180 entries, and since there are 6 rows of 4 blocks, we expect each row to implement 180 / 6 = 30 registers.</p>
<p>Now we turn our attention to the mystery file, with the idea that it may be the mask register file. There are a total of 30 blocks. Looking at the general purpose registers, we determined each block can hold 16 bits (horizontally) from 30 registers (vertically, I guess). So 30 blocks gives us: 30 blocks * 30 registers/block * 16 bits / 64 bits = 225 registers. It’s too much! We calculated last time that there are ~142 physical mask registers, so this is way too high.</p>
<p>There’s another problem: we only have three columns of 16-bit blocks, for a total of 48 bits, horizontally. However, we know that a mask register must hold up to 64 bits (when using a byte-wise mask for a full 512-bit vector register). Also, while our calculation above worked out to a whole number, the number of blocks (30) is not divisible by 4, so even if you assumed the arrangement of the blocks didn’t matter, there is no possible mapping from each register to 4 distinct blocks. Instead, we’d need something weird like 2 blocks providing 15 registers (instead of 30), but 64 bits wide (instead of 32). That seems very unlikely.</p>
<p>So let’s look just at the two paired columns on the left for now: a total of 20 blocks. If we take the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers as an example, it is not necessary that the full width of the register is present horizontally in a single row: the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers have only 256 bits in a row (split into two 128-bit lines), and the other 256 bits of a 512-bit zmm register appear vertically below, in the register file marked ZMM in the diagram. So there’s a kind of over-under arrangement<sup id="fnref:overunder" role="doc-noteref"><a href="#fn:overunder" class="footnote" rel="footnote">4</a></sup>.</p>
<p>Since the mask registers are associated with elements of the vector registers, maybe they are split up in the same way? That is, a 64-bit mask register uses one 2x16-bit (32-bit) chunk from the top half and one from the bottom half, to make up 64 bits? This is 20 total blocks, giving 150 registers by the same calculation above. This is much closer to the 142 we found by experiment.</p>
<p>Still… that nagging feeling. 142 is not equal to 150, and what about that third column of blocks? Doubt crept in: I started to have second thoughts about whether this was the mask register file after all. What could it be then?</p>
<h3 id="lets-get-legacy">Let’s Get Legacy</h3>
<p>I realized there was one register file unaccounted for: the file for legacy x87 floating point and MMX. We expect that x87 floating point and MMX use the <em>same</em> physical file because MMX registers are architecturally aliased onto the x87 registers<sup id="fnref:why" role="doc-noteref"><a href="#fn:why" class="footnote" rel="footnote">5</a></sup>. So where is <em>this</em> file on the die? I looked all around<sup id="fnref:lied" role="doc-noteref"><a href="#fn:lied" class="footnote" rel="footnote">6</a></sup> the die shot. There are no good candidates.</p>
<p>So maybe <em>this</em> thing we’ve been looking at is actually the x87/MMX register file? In one way it’s a better fit: the x87 FP register file needs to be ~80 bits wide, so that would explain the extra column: if we assume each row is half of a register as before, that’s 96 bits. That’s enough to hold 80-bit extended precision values, and the 16 bits left over could be used to store the FPU status word accessed by <a href="https://www.felixcloutier.com/x86/fstsw:fnstsw">fstsw</a> and related instructions. This status word is updated after every operation so must <em>also</em> be renamed for reasonable performance<sup id="fnref:intflags" role="doc-noteref"><a href="#fn:intflags" class="footnote" rel="footnote">7</a></sup>.</p>
<p>Additional evidence that this might be the x87/MMX register file comes from this <a href="https://flic.kr/p/YhuBWc"><abbr title="Intel's Kaby Lake client CPU architecture (7th, 8th gen): substantially identical to Skylake">KBL</abbr> die shot</a> also from Fritz:</p>
<p><img src="/assets/kreg2/kbl-compare.png" alt="Kaby Lake" class="no-invert" /></p>
<p>Note that while the high 256 bits of the register file are masked out (this chip supports only AVX2, not AVX-512 so there are no <code class="language-plaintext highlighter-rouge">zmm</code> registers), the register file we are considering is present in its entirety.</p>
<p>Cool theory bro, but aren’t we back to square zero? If this is the file for the x87/MMX registers, where do the mask registers live?</p>
<p>There’s one possibility we haven’t discussed, although some of you might be screaming it at your monitors by now: maybe the x87/MMX and the kreg register files are <em>shared</em>. That is, physically aliased<sup id="fnref:aliasing" role="doc-noteref"><a href="#fn:aliasing" class="footnote" rel="footnote">8</a></sup> to the same register file, shared competitively.</p>
<h3 id="testing-our-theory">Testing Our Theory</h3>
<p>The good news? We can test for this, in software. That’s good, because I was never really <em>that</em> comfortable with this die shot thing and there was the risk that I would BS more than usual. Software-based <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr> probing is a bit more my thing.</p>
<p>We’ll use the test method originally <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">described by Henry Wong</a> and which we used in the <a href="/blog/2019/12/05/kreg-facts.html">last post</a> on this topic, and implemented in the <a href="https://github.com/travisdowns/robsize">robsize</a> tool. Here’s the quick description of the technique, a straight copy/paste from that post:</p>
<blockquote>
<p>We separate two cache miss load instructions by a variable number of <em>filler instructions</em> which vary based on the CPU resource we are probing. When the number of filler instructions is small enough, the two cache misses execute in parallel and their latencies are overlapped so the total execution time is roughly as long as a single miss.</p>
<p>However, once the number of filler instructions reaches a critical threshold, all of the targeted resources are consumed and instruction allocation stalls before the second miss is issued and so the cache misses can no longer run in parallel. This causes the runtime to spike to about twice the baseline cache miss latency.</p>
<p>Finally, we ensure that each filler instruction consumes exactly one of the resource we are interested in, so that the location of the spike indicates the size of the underlying resource. For example, regular <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> instructions usually consume one physical register from the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> so are a good choice to measure the size of that resource.</p>
</blockquote>
<p>The trick we use to see if two register files are shared is first to measure the size of each register file alone, using filler instructions that target only that file, and then to run a third test whose filler <em>alternates</em> between instructions that use each register file. If the register files are shared, we expect all three tests to produce the same result, since they are all drawing from the same pool. If the register files are not shared, the third (alternating) test should show a much higher apparent resource limit, since two different pools are being drawn from and so it will take twice as many<sup id="fnref:roblimit" role="doc-noteref"><a href="#fn:roblimit" class="footnote" rel="footnote">9</a></sup> filler instructions to hit the RF limit.</p>
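<p>The logic of the shared-vs-separate outcome can be captured in a toy model (this is an illustration only, not part of robsize: filler instructions are modeled as allocating one rename register each from a named pool, and allocation stalls when the next pool needed is empty):</p>

```python
# Toy model of the register-file sharing test: each filler instruction
# allocates one physical register from a pool, and allocation stalls
# as soon as the pool the next filler needs is exhausted.

def max_fillers(pools, pattern):
    """Count fillers retired before a pool runs dry.

    pools:   dict of pool name -> free physical registers
    pattern: per-instruction pool names, repeated cyclically
    """
    free = dict(pools)
    count = 0
    while True:
        pool = pattern[count % len(pattern)]
        if free[pool] == 0:  # allocation stalls here
            return count
        free[pool] -= 1
        count += 1

# Two separate 128-entry pools: the alternating test reaches 2x the limit.
separate = {'mmx': 128, 'kreg': 128}
assert max_fillers(separate, ['mmx']) == 128
assert max_fillers(separate, ['mmx', 'kreg']) == 256

# One shared 128-entry pool: the alternating test hits the same limit.
shared = {'both': 128}
assert max_fillers(shared, ['both', 'both']) == 128
```

In real hardware the 2x case tops out somewhat lower, because the ROB limit is reached first, as the footnote notes.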
<p>Enough talk, let’s do this. I implemented several new tests in robsize to probe possible register sharing. First, we look at <strong>Test 38</strong>, which uses MMX instructions<sup id="fnref:whymmx" role="doc-noteref"><a href="#fn:whymmx" class="footnote" rel="footnote">10</a></sup> to target the size of the x87/MMX register file:</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-38" id="skx-38">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-38.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-38.svg" alt="Test 38" width="648" height="432" />
</div>
<p>We see a clear spike at 128 instructions, so it seems like the size of the speculative<sup id="fnref:spec" role="doc-noteref"><a href="#fn:spec" class="footnote" rel="footnote">11</a></sup> x87/MMX register file is 128 entries.</p>
<p>Next, we have <strong>Test 43</strong>, which follows the same pattern as <strong>Test 38</strong> but uses <code class="language-plaintext highlighter-rouge">kaddd</code> as the filler instruction, so it targets the mask (kreg) register file:</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-43" id="skx-43">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-43.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-43.svg" alt="Test 43" width="648" height="432" />
</div>
<p>This is mostly indistinguishable from the previous chart and we conclude that the size of the speculative mask register file is also 128.</p>
<p>Let’s see what happens when we alternate MMX with another instruction type. <strong>Test 39</strong> alternates MMX with integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions, and <strong>Test 40</strong> alternates with general purpose scalar instructions:</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-39" id="skx-39">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-39.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-39.svg" alt="Test 39" width="648" height="432" />
</div>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-40" id="skx-40">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-40.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-40.svg" alt="Test 40" width="648" height="432" />
</div>
<p>Both of these show the same effect: the effective resource limitation is much higher, around 210 filler instructions. This strongly indicates that the x87/MMX register file is not shared with either the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> or scalar register files.</p>
<p>Finally, we get to the end of this tale, <strong>Test 41</strong>. This test mixes MMX and mask register instructions<sup id="fnref:test41" role="doc-noteref"><a href="#fn:test41" class="footnote" rel="footnote">12</a></sup>:</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-41" id="skx-41">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-41.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-41.svg" alt="Test 41" width="648" height="432" />
</div>
<p>This one is definitely not like the others. We see that the resource limit is now 128, the same as for the single-type tests. We can immediately conclude that mask registers and MMX registers are allocated from the same resource pool: <em>they use the same physical register file</em>.</p>
<p>This resolves the mystery of the missing register file: nothing is missing but rather this one register file simply serves double duty.</p>
<p>Normally a shared register file might be something to watch out for, performance-wise, but it is hard to imagine it mattering in any non-artificial example. Who is going to make heavy use of x87 or MMX (both obsolete) alongside AVX-512 mask registers (the opposite end of the spectrum from “obsolete”)? It seems extremely unlikely, and even then the register file is large enough that hitting its limit is improbable.</p>
<p>So sharing these register files is a neat trick to reduce power and area: the register files aren’t all that big, but they live in pretty prime real-estate close to the execution units.</p>
<p>What’s cool about this one, though, is that it’s the first time I’ve <em>looked at a chip</em> (that this is even possible is remarkable to me) and come up with a theory about the hardware that we could then test and confirm with a targeted microbenchmark. I was already aware of the possibility of register file sharing (Henry had tests for this right in robsize from the start), but although I considered other sharing scenarios I never considered sharing between x87/MMX and the mask registers until I tried to identify the register files on Fritz’s die shots.</p>
<h2 id="some-missing-pieces">Some Missing Pieces</h2>
<p>It seems like we’ve wrapped everything up nicely, but there are still a few rough edges.</p>
<ul>
<li>We measured a total of 128 speculative registers; adding 16 non-speculative registers (to hold the 8 x87/MMX regs and the 8 kregs) gives 144, but our ballpark estimate based on the regfile size was 150. Perhaps more importantly, with 5 rows of registers, we expect the total number of registers to be a multiple of 5. Perhaps there are a handful of registers used for an unknown purpose, or some other flaw in the test.</li>
<li>I noticed an unexplained difference in results between a test that uses a single fixed instruction like <code class="language-plaintext highlighter-rouge">kaddd k1, k2, k3</code> (test 27) and one that rotates through all 8 registers: <code class="language-plaintext highlighter-rouge">kaddd k0, k1, k1</code> then <code class="language-plaintext highlighter-rouge">kaddd k1, k2, k2</code>, etc. (test 43). The former test results in a register file size about 5 larger than the latter. The same holds for tests using the MMX registers (compare tests 37 and 38). This post uses the rotate-through-all-registers approach, while the original post used the fixed-register variant in some cases, so the numbers vary slightly. I have some theories but no definite explanation for this behavior.</li>
</ul>
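<p>The bookkeeping in the first bullet, spelled out (the 150 estimate and the multiple-of-5 expectation come from the die-shot analysis above):</p>

```python
# Register-count bookkeeping for the shared x87/MMX + kreg file,
# using the measured and estimated numbers from this post.
speculative = 128        # measured limit from Tests 38, 41 and 43
architectural = 8 + 8    # 8 x87/MMX registers + 8 mask registers
total = speculative + architectural

assert total == 144      # vs. the ~150 ballpark from the die shot
assert total % 5 != 0    # with 5 rows, we expected a multiple of 5
```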
<h2 id="thanks">Thanks</h2>
<p>Thanks to <a href="https://www.flickr.com/people/130561288@N04/">Fritzchens Fritz</a> who created the die shots analyzed here, and who graciously put them into the public domain.</p>
<p>Thanks to <a href="http://www.stuffedcow.net/">Henry Wong</a>, who wrote the <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">original article</a> which introduced me to this technique and subsequently shared the code for his tool, which is now <a href="https://github.com/travisdowns/robsize">hosted on github</a>.</p>
<p><a href="https://twitter.com/GPUsAreMagic">Nemez</a> who did a <a href="https://twitter.com/GPUsAreMagic/status/1256866465577394181">breakdown</a> of the die shot, noting the register file in question as some type of integer register file, which originally piqued my curiosity.</p>
<p>Thanks to <a href="https://lemire.me">Daniel Lemire</a> who provided access to the <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> hardware used in this post.</p>
<p>Thanks to Matt Godbolt and Vijay who pointed out typos in the text.</p>
<h2 id="discussion-and-feedback">Discussion and Feedback</h2>
<p>If you have something to say, leave a comment below or discuss this article on <a href="https://news.ycombinator.com/item?id=23309034">Hacker News</a>.</p>
<p>Feedback is also warmly welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:bar" role="doc-endnote">
<p>Admittedly, “how many physical mask registers does the CPU have” is probably not a very high bar of interestingness to clear, to most people. <a href="#fnref:bar" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:xmmetc" role="doc-endnote">
<p>The zmm, ymm and xmm registers all overlap, architecturally. That is, <code class="language-plaintext highlighter-rouge">xmm0</code> is just the bottom 128 bits of <code class="language-plaintext highlighter-rouge">ymm0</code>, and similarly for <code class="language-plaintext highlighter-rouge">ymm0</code> and <code class="language-plaintext highlighter-rouge">zmm0</code>. Physically, there are really <em>only</em> <code class="language-plaintext highlighter-rouge">zmm</code> registers and the other two are simply specific ranges of bits of those larger registers. So the area marked <strong>YMM</strong> on the die shot really means: <em>the upper parts of the <code class="language-plaintext highlighter-rouge">ymm</code> registers which are not part of the corresponding xmm register</em>. <a href="#fnref:xmmetc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:kbl" role="doc-endnote">
<p>Actually Kaby Lake, since the best die shots we have are from that chip, but it’s the same thing. <a href="#fnref:kbl" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:overunder" role="doc-endnote">
<p>Incidentally, this lines up with an inspection of the execution units, which seem to have the same over-under arrangement: the port 5 FMA, for example, looks like it has two rows each with 4x 64-bit FMA units, rather than say a single row with 8 units. <a href="#fnref:overunder" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:why" role="doc-endnote">
<p>As a trick, I guess, to allow MMX registers to be saved and restored by operating systems and other code that weren’t aware of their presence. A similar mess occurred with the transition from SSE to AVX, where code unaware of AVX could accidentally clobber the upper part of AVX registers using SSE instructions (if SSE zeroed the upper bits), so instead we get the ongoing issue with <a href="https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake">legacy SSE and dirty uppers</a>. <a href="#fnref:why" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lied" role="doc-endnote">
<p>This is a lie: I didn’t really look <em>all around</em> the die: I looked near the execution units, where the register file would be with very high probability. <a href="#fnref:lied" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:intflags" role="doc-endnote">
<p>The integer flags (so-called <em>EFLAGS</em> register) also need to be renamed and I believe they pull a similar trick: writing their results to the same physical register allocated for the result: I’ve marked the file that I think holds the so-called <em>SPAZO</em> group on the zoomed view, and the C flag may be stored in the same place or in the thin (single bit?) file immediately to the right of the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> file. <a href="#fnref:intflags" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:aliasing" role="doc-endnote">
<p>I talk of <em>physical</em> aliasing here, to distinguish it from the logical/architectural aliasing. Logical aliasing is that which is visible to software: the <code class="language-plaintext highlighter-rouge">ymm</code> and <code class="language-plaintext highlighter-rouge">xmm</code> registers are logically aliased in that writes to <code class="language-plaintext highlighter-rouge">xmm0</code> show up in the low bits of <code class="language-plaintext highlighter-rouge">ymm0</code>. Similarly, the MMX and x87 register files are aliased in that writes to MMX registers modify values in the FP register stack, although the rules are more complicated. Logical aliasing usually implies physical aliasing, but not the other way around. Physical aliasing, then, means that two register sets are renamed onto the same pool of physical registers, but this is usually invisible to software (except through careful performance tests, as we do here). <a href="#fnref:aliasing" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:roblimit" role="doc-endnote">
<p>In practice, you don’t actually get all the way to 2x: you hit something close to the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> limit first. <a href="#fnref:roblimit" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whymmx" role="doc-endnote">
<p>I use MMX rather than x87 so I don’t have to deal with the x87 FP stack abstraction and understand how that maps to renaming. <a href="#fnref:whymmx" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spec" role="doc-endnote">
<p>The <em>speculative</em> register file because we expect some entries also to be used to hold the non-speculative values of the architectural registers. We’ll return to this point in a moment. <a href="#fnref:spec" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:test41" role="doc-endnote">
<p>Specifically, it mixes the same <code class="language-plaintext highlighter-rouge">kaddd</code> and <code class="language-plaintext highlighter-rouge">por</code> instructions we used in the single-type tests <strong>Test 38</strong> and <strong>Test 43</strong>. <a href="#fnref:test41" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comTaking a second look at the newly introduced mask registers, this time with the benefit of a SKX die shot from Fritzchens Fritz.Ice Lake Store Elimination2020-05-18T00:00:00+00:002020-05-18T00:00:00+00:00https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt<!-- boilerplate
page.assets: /assets/intel-zero-opt
assetpath: /assets/intel-zero-opt
tablepath: /misc/tables//intel-zero-opt
-->
<h2 id="introduction">Introduction</h2>
<p>If you made it down to the <a href="/blog/2020/05/13/intel-zero-opt.html#hardware-survey">hardware survey</a> on the last post, you might have <a href="https://twitter.com/tarlinian/status/1260629853000265728">wondered</a> where Intel’s newest mainstream architecture was. <em>Ice Lake was missing!</em></p>
<p>Well good news: it’s here… and it’s interesting. We’ll jump right into the same analysis we did last time for Skylake client. If you haven’t read the <a href="/blog/2020/05/13/intel-zero-opt.html">first article</a> you’ll probably want to start there, because we’ll refer to concepts introduced there without reexplaining them here.</p>
<p>As usual, you can skip to the <a href="#summary">summary</a> for the bite sized version of the findings.</p>
<h2 id="icl-results">ICL Results</h2>
<h3 id="the-compiler-has-an-opinion">The Compiler Has an Opinion</h3>
<p>Let’s first take a look at the overall performance: facing off <code class="language-plaintext highlighter-rouge">fill0</code> vs <code class="language-plaintext highlighter-rouge">fill1</code> as we’ve been doing for every microarchitecture. Remember, <code class="language-plaintext highlighter-rouge">fill0</code> fills a region with zeros, while <code class="language-plaintext highlighter-rouge">fill1</code> fills a region with the value one (as a 4-byte <code class="language-plaintext highlighter-rouge">int</code>).</p>
<p class="warning">All of these tests run at 3.5 GHz. The max single-core turbo for this chip is 3.7 GHz, but it is difficult to sustain that frequency, because of AVX-512 clocking effects and because other cores occasionally become active. 3.5 GHz is a good compromise that keeps the chip running at a constant frequency, while remaining close to the ideal turbo. Disabling turbo is not a good option, because this chip runs at 1.1 GHz without turbo, which would introduce a large distortion when exercising the uncore and RAM.</p>
<center><strong>Figure 7a</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig7a" id="fig7a">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig7a.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl512/overall-warm.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig7a.html">
<img class="figimg" src="/assets/intel-zero-opt/fig7a.svg" alt="Figure 7a" width="648" height="432" />
</a>
</div>
<p>Actually, I lied. <em>This</em> is the right plot for Ice Lake:</p>
<center><strong>Figure 7b</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig7b" id="fig7b">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig7b.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/overall-warm.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig7b.html">
<img class="figimg" src="/assets/intel-zero-opt/fig7b.svg" alt="Figure 7b" width="648" height="432" />
</a>
</div>
<p>Well, which is it?</p>
<p>Those two have a couple of key differences. The first is this weird thing that <strong>Figure 7a</strong> has going on in the right half of the L1 region: there are two obvious and distinct performance levels visible, each with roughly half the samples.</p>
<!-- https://stackoverflow.com/questions/43806515/position-svg-elements-over-an-image -->
<style>
.img-overlay-wrap {
position: relative;
display: block; /* <= shrinks container to image size */
}
.img-overlay-wrap svg {
position: absolute;
top: 0;
left: 0;
}
</style>
<div class="img-overlay-wrap">
<img class="figimg" src="/assets/intel-zero-opt/fig7a.svg" alt="Figure 7a Annotated" />
<svg viewBox="0 0 90 60">
<g stroke-width=".5" fill="none" opacity="0.5">
<ellipse transform="rotate(-25 34 10)" cx="34" cy="10" rx="12" ry="5" stroke="green" />
<ellipse transform="rotate(-25 34 34.5)" cx="34" cy="34.5" rx="12" ry="5" stroke="red" />
</g>
</svg>
</div>
<p>The second thing is that while both of the plots show <em>some</em> of the zero optimization effect in the L3 and RAM regions, the effect is <em>much larger</em> in <strong>Figure 7b</strong>:</p>
<div class="img-overlay-wrap">
<img class="figimg" src="/assets/intel-zero-opt/fig7b.svg" alt="Figure 7b Annotated" />
<svg viewBox="0 0 90 60">
<g stroke-width=".5" fill="none" opacity="0.5">
<ellipse cx="63" cy="40" rx="10" ry="8" stroke="blue" />
</g>
</svg>
</div>
<p>So what’s the difference between these two plots? The top one was compiled with <code class="language-plaintext highlighter-rouge">-march=native</code>, the second with <code class="language-plaintext highlighter-rouge">-march=icelake-client</code>.</p>
<p>Since I’m compiling this <em>on</em> the Ice Lake client system, I would expect these to do the same thing, but for <a href="https://twitter.com/stdlib/status/1261038662751522826">some reason they don’t</a>. The primary difference is that <code class="language-plaintext highlighter-rouge">-march=native</code> <a href="https://godbolt.org/z/gm3vRa">generates</a> 512-bit instructions like so (for the main loop):</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.L4:</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">add</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">512</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">448</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">384</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">320</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">256</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">192</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">128</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">64</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">r9</span>
<span class="nf">jne</span> <span class="nv">.L4</span>
</code></pre></div></div>
<p>Using <code class="language-plaintext highlighter-rouge">-march=icelake-client</code> uses 256-bit instructions<sup id="fnref:still512" role="doc-noteref"><a href="#fn:still512" class="footnote" rel="footnote">1</a></sup>:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.L4:</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">32</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">64</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">96</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">128</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">160</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">192</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">224</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">add</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">256</span>
<span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">r9</span>
<span class="nf">jne</span> <span class="nv">.L4</span>
</code></pre></div></div>
<p>Most compilers use 256-bit instructions by default even for targets that support AVX-512 (reason: <a href="https://reviews.llvm.org/D67259">downclocking</a>), so the <code class="language-plaintext highlighter-rouge">-march=native</code> version is the weird one here. All of the earlier x86 tests used 256-bit instructions.</p>
<p>The observation that <strong>Figure 7a</strong> results from running 512-bit instructions, combined with a peek at the data lets us immediately resolve the mystery of the bi-modal behavior.</p>
<p>Here’s the raw data for the 17 samples at a buffer size of 9864:</p>
<div class="table-wrapper">
<table class="dataframe" style="max-width:500px; min-width: 50%; font-size:80%;">
<thead>
<tr>
<td colspan="2"></td>
<th colspan="2" halign="left">GB/s</th>
</tr>
<tr>
<th>Size</th>
<th>Trial</th>
<th>fill0</th>
<th>fill1</th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="17" valign="top">9864</th>
<th>0</th>
<td>92.3</td>
<td>92.0</td>
</tr>
<tr>
<th>1</th>
<td>91.9</td>
<td>91.9</td>
</tr>
<tr>
<th>2</th>
<td>91.9</td>
<td>91.9</td>
</tr>
<tr>
<th>3</th>
<td>92.4</td>
<td>92.2</td>
</tr>
<tr>
<th>4</th>
<td>92.0</td>
<td>92.3</td>
</tr>
<tr>
<th>5</th>
<td>92.1</td>
<td>92.1</td>
</tr>
<tr>
<th>6</th>
<td>92.0</td>
<td>92.0</td>
</tr>
<tr>
<th>7</th>
<td>92.3</td>
<td>92.1</td>
</tr>
<tr>
<th>8</th>
<td>92.2</td>
<td>92.0</td>
</tr>
<tr>
<th>9</th>
<td>92.0</td>
<td>92.1</td>
</tr>
<tr>
<th>10</th>
<td>183.3</td>
<td>93.9</td>
</tr>
<tr>
<th>11</th>
<td>197.3</td>
<td>196.9</td>
</tr>
<tr>
<th>12</th>
<td>197.3</td>
<td>196.6</td>
</tr>
<tr>
<th>13</th>
<td>196.6</td>
<td>197.3</td>
</tr>
<tr>
<th>14</th>
<td>197.3</td>
<td>196.6</td>
</tr>
<tr>
<th>15</th>
<td>196.6</td>
<td>197.3</td>
</tr>
<tr>
<th>16</th>
<td>196.6</td>
<td>196.6</td>
</tr>
</tbody>
</table>
</div>
<p>The performance follows a specific pattern with respect to the trials for both <code class="language-plaintext highlighter-rouge">fill0</code> and <code class="language-plaintext highlighter-rouge">fill1</code>: it starts out slow (about 90 GB/s) for the first 9-10 samples, then suddenly jumps up to the higher performance level (close to 200 GB/s). It turns out this is just <a href="/blog/2020/01/17/avxfreq1.html">voltage and frequency management</a> biting us again. In this case there is no frequency change: the <a href="https://github.com/travisdowns/zero-fill-bench/blob/post2/results/icl512/overall-warm.csv#L546">raw data</a> has a frequency column that shows the trials always run at 3.5 GHz. There is only a voltage change, and while the voltage is changing, the CPU runs with reduced dispatch throughput<sup id="fnref:iclbetter" role="doc-noteref"><a href="#fn:iclbetter" class="footnote" rel="footnote">2</a></sup>.</p>
<p>The reason this effect repeats for every new set of trials (new buffer size value) is that each new set of trials is preceded by a 100 ms spin wait: this spin wait doesn’t run any AVX-512 instructions, so the CPU drops back to the lower voltage level and this process repeats. The effect stops when the benchmark moves into the L2 region, because there it is slow enough that the 10 discarded warmup trials are enough to absorb the time to switch to the higher voltage level.</p>
<p>We can avoid this problem simply by removing the 100 ms warmup (passing <code class="language-plaintext highlighter-rouge">--warmup-ms=0</code> to the benchmark), and for the rest of this post we’ll discuss the no-warmup version (we keep the 10 warmup <em>trials</em> and they should be enough).</p>
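<p>A minimal sketch of the trial structure described above (hypothetical harness code, not the real benchmark, which lives in the zero-fill-bench repository): the warmup trials are discarded precisely so that transients like the voltage ramp don't pollute the reported figure.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Run `warmup + trials` passes of a benchmark, discard the first
// `warmup` results (they absorb transients such as the voltage ramp
// discussed above), and report the median of the remaining trials.
// `run_trial` is any callable returning a throughput in GB/s.
template <typename Trial>
double measure(Trial run_trial, std::size_t warmup = 10,
               std::size_t trials = 17) {
    std::vector<double> results;
    for (std::size_t i = 0; i < warmup + trials; ++i) {
        double gbps = run_trial();
        if (i >= warmup) {
            results.push_back(gbps);
        }
    }
    std::sort(results.begin(), results.end());
    return results[results.size() / 2];  // median (trials is odd)
}
```

<p>With this structure, a transient that lasts about 10 trials (as in the table above) lands entirely in the discarded warmup, as long as no extra non-AVX-512 delay resets the voltage level in between.</p>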
<h2 id="elimination-in-ice-lake">Elimination in Ice Lake</h2>
<p>So we’re left with the second effect, which is that the 256-bit store version shows <em>very</em> effective elimination, as opposed to the 512-bit version. For now let’s stop picking favorites between 256 and 512 (push that on your stack, we’ll get back to it), and just focus on the elimination behavior for 256-bit stores.</p>
<p>Here’s the closeup of the L3 region for the 256-bit store version, showing also the L2 eviction type, as discussed in the previous post:</p>
<center><strong>Figure 8</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig8" id="fig8">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig8.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig8.html">
<img class="figimg" src="/assets/intel-zero-opt/fig8.svg" alt="Figure 8" width="648" height="432" />
</a>
</div>
<p>We finally have the elusive (near) 100% elimination of redundant zero stores! The <code class="language-plaintext highlighter-rouge">fill0</code> case peaks at 96% silent (eliminated<sup id="fnref:stricly" role="doc-noteref"><a href="#fn:stricly" class="footnote" rel="footnote">3</a></sup>) evictions. Typical L3 bandwidth is ~59 GB/s with elimination and ~42 GB/s without, for a better than 40% speedup! So this is potentially a big deal on Ice Lake.</p>
<p>Like last time, we can also check the uncore tracker performance counters, to see what happens for larger buffers which would normally write back to memory.</p>
<center><strong>Figure 9</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig9" id="fig9">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig9.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/l3-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig9.html">
<img class="figimg" src="/assets/intel-zero-opt/fig9.svg" alt="Figure 9" width="648" height="432" />
</a>
</div>
<p class="info"><strong>Note:</strong> the way to interpret the events in this plot is the reverse of the above: more uncore tracker writes means <em>less</em> elimination, while in the earlier chart more silent writebacks means <em>more</em> elimination (since every silent writeback replaces a non-silent one).</p>
<p>As with the L3 case, we see that the store elimination appears 96% effective: the number of uncore to memory writebacks flatlines at 4% for the <code class="language-plaintext highlighter-rouge">fill0</code> case. Compare this to <a href="/assets/intel-zero-opt/fig3.svg"><strong>Figure 3</strong></a>, which is the same benchmark running on Skylake-S, and note that only half the writes to RAM are eliminated.</p>
<p>This chart also includes results for the <code class="language-plaintext highlighter-rouge">alt01</code> benchmark. Recall that this benchmark writes 64 bytes of zeros alternating with 64 bytes of ones. This means that, at best, only half the lines can be eliminated by zero-over-zero elimination. On Skylake-S, only about 50% of eligible (zero) lines were eliminated, but here we again get to 96% elimination! That is, in the <code class="language-plaintext highlighter-rouge">alt01</code> case, 48% of all writes were eliminated, half of which are all-ones and not eligible.</p>
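<p>As a quick sanity check, the percentages above can be reproduced with a couple of lines of arithmetic (the constants below are just the figures quoted in this post, not new measurements):</p>

```cpp
// Back-of-the-envelope for the numbers quoted above. For alt01, only
// half the lines are zero (eligible for elimination), and those are
// eliminated at the same ~96% rate as in the fill0 case:
constexpr double eligible_fraction  = 0.5;
constexpr double elimination_rate   = 0.96;
constexpr double overall_eliminated = eligible_fraction * elimination_rate;

// L3 fill0 speedup from the bandwidths quoted earlier (~59 vs ~42 GB/s):
constexpr double l3_speedup = 59.0 / 42.0 - 1.0;  // ~0.405, "better than 40%"
```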
<p>The asymptotic speedup for the all zero case for the RAM region is less than the L3 region, at about 23% but that’s still not exactly something to sneeze at. The speedup for the alternating case is 10%, somewhat less than half the benefit of the all zero case<sup id="fnref:altwrites" role="doc-noteref"><a href="#fn:altwrites" class="footnote" rel="footnote">4</a></sup>. In the L3 region, we also note that the benefit of elimination for <code class="language-plaintext highlighter-rouge">alt01</code> is only about 7%, much smaller than the ~20% benefit you’d expect if you cut the 40% benefit the all-zeros case sees. We saw a similar effect in Skylake-S.</p>
<p>Finally it’s worth noting this little uptick in uncore writes in the <code class="language-plaintext highlighter-rouge">fill0</code> case:</p>
<p><img src="/assets/intel-zero-opt/little-uptick.png" alt="Little Uptick" /></p>
<p>This happens right around the transition from L3 to RAM, and after this, the writes flatline down to 0.04 per line, but this uptick is fairly consistently reproducible. So there’s probably some interesting effect there, perhaps related to the adaptive nature of the L3 caching<sup id="fnref:l3adapt" role="doc-noteref"><a href="#fn:l3adapt" class="footnote" rel="footnote">5</a></sup>.</p>
<h3 id="512-bit-stores">512-bit Stores</h3>
<p>Now it’s time to pop the mental stack and return to something we noticed earlier: that 256-bit stores seemed to get superior performance for the L3 region compared to 512-bit ones.</p>
<p>Remember that we ended up with 256-bit and 512-bit versions due to unexpected behavior in the <code class="language-plaintext highlighter-rouge">-march</code> flag. Rather than <em>relying</em> on this weirdness<sup id="fnref:gccfix" role="doc-noteref"><a href="#fn:gccfix" class="footnote" rel="footnote">6</a></sup>, let’s just write slightly lazy<sup id="fnref:lazy" role="doc-noteref"><a href="#fn:lazy" class="footnote" rel="footnote">7</a></sup> <a href="https://github.com/travisdowns/zero-fill-bench/blob/master/algos.cpp#L151">methods</a> that explicitly use 256-bit and 512-bit stores but are otherwise identical. <code class="language-plaintext highlighter-rouge">fill256_0</code> uses 256-bit stores and writes zeros, and I’ll let you pattern match the rest of the names.</p>
<p>Here’s how they perform on my ICL hardware:</p>
<center><strong>Figure 10</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig10" id="fig10">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig10.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/256-512.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig10.html">
<img class="figimg" src="/assets/intel-zero-opt/fig10.svg" alt="Figure 10" width="648" height="432" />
</a>
</div>
<p class="warning">This chart shows only the median of 17 trials. You can look at the raw data for an idea of the trial variance, but it is generally low.</p>
<p>In the L1 region, the 512-bit approach usually wins and there is no apparent difference between writing 0 or 1 (the two halves of the moon mostly line up). Still, 256-bit stores are roughly <em>competitive</em> with 512-bit: they aren’t running at half the throughput. That’s thanks to the second store port on Ice Lake. Without that feature, you’d be limited to 112 GB/s at 3.5 GHz, but here we handily reach ~190 GB/s with 256-bit stores, and ~195 GB/s with 512-bit stores. 512-bit stores probably have a slight advantage just because of fewer total instructions executed (about half of the 256-bit case) and associated second order effects.</p>
<p class="info">Ice Lake has two <em>store ports</em> which lets it execute two stores per cycle, but only a single cache line can be written per cycle. However, if two consecutive stores fall into the <em>same</em> cache line, they will generally both be written in the same cycle. So the maximum sustained throughput is up to two stores per cycle, <em>if</em> they fall in the same line<sup id="fnref:l1port" role="doc-noteref"><a href="#fn:l1port" class="footnote" rel="footnote">8</a></sup>.</p>
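<p>The 112 GB/s figure falls out of simple arithmetic (assuming, as described in the note above, one 32-byte store committing per cycle at 3.5 GHz, versus a full 64-byte line per cycle when stores pair):</p>

```cpp
// Store bandwidth ceilings at 3.5 GHz. Without same-line pairing, only
// one 256-bit (32-byte) store can commit per cycle:
constexpr double ghz              = 3.5;
constexpr double single_port_gbps = 32 * ghz;  // 112 GB/s
// With two same-line stores committing per cycle, the ceiling is a
// full 64-byte cache line per cycle:
constexpr double paired_gbps      = 64 * ghz;  // 224 GB/s
```

<p>The observed ~190-195 GB/s sits below the 224 GB/s ceiling but well above the single-store limit, consistent with same-line pairing doing its job most of the time.</p>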
<p>In the L2 region, however, the 256-bit approaches seem to pull ahead. This is a bit like the Buffalo Bills winning the Super Bowl: it just isn’t supposed to happen.</p>
<p>Let’s zoom in:</p>
<center><strong>Figure 11</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig11" id="fig11">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig11.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/256-512-l2-l3.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig11.html">
<img class="figimg" src="/assets/intel-zero-opt/fig11.svg" alt="Figure 11" width="648" height="432" />
</a>
</div>
<p>The 256-bit benchmarks start roughly tied with their 512-bit cousins, but then steadily pull away as the region approaches the full size of the L2. By the end of the L2 region, they have a ~13% edge. This applies to <em>both</em> <code class="language-plaintext highlighter-rouge">fill256</code> versions – the zeros-writing and ones-writing flavors. So this effect doesn’t seem explicable by store elimination: we already know ones are not eliminated and, also, elimination only starts to play an obvious role when the region is L3-sized.</p>
<p>In the L3, the situation changes: now the 256-bit version really pulls ahead, <em>but only the version that writes zeros</em>. The 256-bit and 512-bit one-fill versions fall down in throughput, nearly to the same level (but the 256-bit version still seems <em>slightly but measurably ahead</em> at ~2% faster). The 256-bit zero fill version is now ahead by roughly 45%!</p>
<p>Let’s concentrate only on the two benchmarks that write zero: <code class="language-plaintext highlighter-rouge">fill256_0</code> and <code class="language-plaintext highlighter-rouge">fill512_0</code>, and turn on the L2 eviction counters (you probably saw that one coming by now):</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig12" id="fig12">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig12.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/256-512-l2-l3.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig12.html">
<img class="figimg" src="/assets/intel-zero-opt/fig12.svg" alt="Figure 12" width="648" height="432" />
</a>
</div>
<p class="warning">Only the <em>L2 Lines Out Silent</em> event is shown – the balance of the evictions are <em>non-silent</em> as usual.</p>
<p>Despite the fact that I had to leave the right axis legend just kind of floating around in the middle of the plot, I hope the story is clear: 256-bit stores get eliminated at the usual 96% rate, but 512-bit stores are hovering at a decidedly Skylake-like ~56%. I can’t be sure, but I expect this difference in store elimination largely explains the performance difference.</p>
<p>I checked also the behavior with prefetching off, but the pattern is very similar, except with both approaches having reduced performance in L3 (you can <a href="/assets/intel-zero-opt/fig12-nopf.svg">see for yourself</a>). It is interesting to note that for zero-over-zero stores, the 256-bit store performance <em>in L3</em> is almost the same as the 512-bit store performance <em>in L2!</em> It buys you almost a whole level in the cache hierarchy, performance-wise (in this benchmark).</p>
<p>Normally I’d take a shot at guessing what’s going on here, but this time I’m not going to do it. I just don’t know<sup id="fnref:lied" role="doc-noteref"><a href="#fn:lied" class="footnote" rel="footnote">9</a></sup>. The whole thing is very puzzling, because everything after the L1 operates on a cache-line basis: we expect the fine-grained pattern of stores made by the core, <em>within a line</em> to basically be invisible to the rest of the caching system which sees only full lines. Yet there is some large effect in the L3 and even in RAM<sup id="fnref:RAM" role="doc-noteref"><a href="#fn:RAM" class="footnote" rel="footnote">10</a></sup> related to whether the core is writing a cache line in two 256-bit chunks or a single 512-bit chunk.</p>
<h2 id="summary">Summary</h2>
<p>We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.</p>
<p>We find that when zero-filling writes occur to a region sized for the L2 cache or larger, 256-bit writes are often significantly <em>faster</em> than 512-bit writes. The effect is largest for the L2, where 256-bit zero-over-zero writes are up to <em>45% faster</em> than 512-bit writes. We find a similar effect even for non-zeroing writes, but only in the L2.</p>
<h2 id="future">Future</h2>
<p>It is an interesting open question whether the as-yet-unreleased <abbr title="The new 10nm microarchitecture used in Ice Lake CPUs.">Sunny Cove</abbr> server chips will exhibit this same optimization.</p>
<h2 id="advice">Advice</h2>
<p>Unless you are developing only for your own laptop, as of May 2020 Ice Lake is deployed on a microscopic fraction of total hosts you would care about, so the headline advice in the previous post applies: this optimization doesn’t apply to enough hardware for you to target it specifically. This might change in the future as Ice Lake and sequels roll out in force. In that case, the magnitude of the effect might make it worth optimizing for in some cases.</p>
<p>For fine-grained advice, see the <a href="/blog/2020/05/13/intel-zero-opt.html#tuning-advice">list in the previous post</a>.</p>
<h2 id="thanks">Thanks</h2>
<p>Vijay and Zach Wegner for pointing out typos.</p>
<p>Ice Lake photo by <a href="https://unsplash.com/@marcuslofvenberg">Marcus Löfvenberg</a> on Unsplash.</p>
<p>Saagar Jha for helping me track down and fix a WebKit rendering <a href="https://github.com/travisdowns/travisdowns.github.io/issues/102">issue</a>.</p>
<h2 id="discussion-and-feedback">Discussion and Feedback</h2>
<p>If you have something to say, leave a comment below. There are also discussions on <a href="https://twitter.com/trav_downs/status/1262428350511022081">Twitter</a> and <a href="https://news.ycombinator.com/item?id=23225260">Hacker News</a>.</p>
<p>Feedback is also warmly welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:still512" role="doc-endnote">
<p>It’s actually still using the EVEX-encoded AVX-512 instruction <code class="language-plaintext highlighter-rouge">vmovdqu32</code>, which is somewhat more efficient here because AVX-512 has more compact encoding of offsets that are a multiple of the vector size (as they usually are). <a href="#fnref:still512" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:iclbetter" role="doc-endnote">
<p>In this case, the throughput is only halved, versus the 1/4 throughput when we looked at dispatch throttling on <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, so based on this very preliminary result it seems like the dispatch throttling might be less severe in Ice Lake (this needs a deeper look: we never used stores to test on <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>). <a href="#fnref:iclbetter" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:stricly" role="doc-endnote">
<p>Strictly speaking, a silent writeback is a <em>sufficient</em>, but not a <em>necessary</em> condition for elimination, so it is a lower bound on the number of eliminated stores. For all I know, 100% of stores are eliminated, but out of those 4% are written back not-silently (but not in a modified state). <a href="#fnref:stricly" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:altwrites" role="doc-endnote">
<p>One reason could be that writing only alternating lines is somewhat more expensive than writing half the data but contiguously. Of course this is obviously true closer to the core, since you touch half the number of the pages in the contiguous case, need half the number of page walks, prefetching is more effective since you cross half as many 4K boundaries (prefetch stops at 4K boundaries) and so on. Even at the memory interface, alternating line writes might be less efficient because you get less benefit from opening each DRAM page, can’t do longer than 64-byte bursts, etc. In a pathological case, alternating lines could be <em>half</em> the bandwidth if the controller maps alternating lines to alternating channels, since you’ll only be accessing a single channel. We could try to isolate this effect by trying more coarse grained interleaving. <a href="#fnref:altwrites" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:l3adapt" role="doc-endnote">
<p>The L3 is capable of determining if the current access pattern would be better served by something like an <abbr title="Most recently used - an eviction strategy suitable for data with little temporal locality">MRU</abbr> eviction strategy, for example when a stream of data is being accessed without reuse, it would be better to kick that data out of the cache quickly, rather than evicting other data that may be useful. <a href="#fnref:l3adapt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:gccfix" role="doc-endnote">
<p>After all, there’s a good chance it will be fixed in a later version of gcc. <a href="#fnref:gccfix" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lazy" role="doc-endnote">
<p>These are lazy in the sense that I don’t do any scalar head or tail handling: the final iteration just does a full width <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> store even if there aren’t 64 bytes left: we overwrite the buffer by up to 63 bytes. We account for this when we allocate the buffer by ensuring the allocation is oversized by at least that amount. This doesn’t matter for larger buffers, but it means this version will get a boost for very small buffers versus approaches that do the fill exactly. In any case, we are interested in large buffers here. <a href="#fnref:lazy" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:l1port" role="doc-endnote">
<p>Most likely, the L1 has a single 64 byte wide write port, like <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, and the commit logic at the head of the store buffer can look ahead one store to see if it is in the same line in order to dequeue two stores in a single cycle. Without this feature, you could <em>execute</em> two stores per cycle, but only commit one, so the long-run store throughput would be limited to one per cycle. <a href="#fnref:l1port" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lied" role="doc-endnote">
<p>Well I lied. I at least have some ideas. It may be that the CPU power budget is dynamically partitioned between the core and uncore, and with 512-bit stores triggering the AVX-512 power budget, there is less power for the uncore and it runs at a lower frequency (that could be checked). This seems unlikely given that it should not obviously affect the elimination chance. <a href="#fnref:lied" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:RAM" role="doc-endnote">
<p>We didn’t take a close look at the effect in RAM but it persists, albeit at a lower magnitude. 256-bit zero-over-zero writes are about 10% faster than 512-bit writes of the same type. <a href="#fnref:RAM" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comWe look at the zero store optimization as it applies to Intel's newest micro-architecture.Hardware Store Elimination2020-05-13T00:00:00+00:002020-05-13T00:00:00+00:00https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt<!-- boilerplate
page.assets: /assets/intel-zero-opt
assetpath: /assets/intel-zero-opt
tablepath: /misc/tables/intel-zero-opt
-->
<p>I had no plans to write <a href="/blog/2020/01/20/zero.html">another post</a> about zeros, but when life throws you a zero make zeroaid, or something like that. Here we go!</p>
<p>If you want to jump over the winding reveal and just read the summary and advice, <a href="#summary-perma">now is your chance</a>.</p>
<p>When writing simple memory benchmarks I have always taken the position the <em>value</em> written to memory didn’t matter. Recently, while running a straightforward benchmark<sup id="fnref:ubstore" role="doc-noteref"><a href="#fn:ubstore" class="footnote" rel="footnote">1</a></sup> probing the interaction between AVX-512 stores and <a href="https://en.wikipedia.org/wiki/MESI_protocol#Read_For_Ownership">read for ownership</a> I ran into a weird performance deviation. This is that story<sup id="fnref:story" role="doc-noteref"><a href="#fn:story" class="footnote" rel="footnote">2</a></sup>.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#prelude" id="markdown-toc-prelude">Prelude</a> <ul>
<li><a href="#data-dependent-performance" id="markdown-toc-data-dependent-performance">Data Dependent Performance</a></li>
<li><a href="#source" id="markdown-toc-source">Source</a></li>
</ul>
</li>
<li><a href="#benchmarks" id="markdown-toc-benchmarks">Benchmarks</a> <ul>
<li><a href="#a-very-simple-loop" id="markdown-toc-a-very-simple-loop">A Very Simple Loop</a></li>
<li><a href="#our-first-benchmark" id="markdown-toc-our-first-benchmark">Our First Benchmark</a> <ul>
<li><a href="#l1-and-l2" id="markdown-toc-l1-and-l2">L1 and L2</a></li>
<li><a href="#getting-weird-in-the-l3" id="markdown-toc-getting-weird-in-the-l3">Getting Weird in the L3</a></li>
<li><a href="#ram-still-weird" id="markdown-toc-ram-still-weird">RAM: Still Weird</a></li>
</ul>
</li>
<li><a href="#wild-irresponsible-speculation-and-miscellanous-musings" id="markdown-toc-wild-irresponsible-speculation-and-miscellanous-musings">Wild, Irresponsible Speculation and Miscellanous Musings</a> <ul>
<li><a href="#predicting-a-new-predictor" id="markdown-toc-predicting-a-new-predictor">Predicting a New Predictor</a></li>
<li><a href="#predictor-test" id="markdown-toc-predictor-test">Predictor Test</a></li>
</ul>
</li>
<li><a href="#hardware-survey" id="markdown-toc-hardware-survey">Hardware Survey</a></li>
<li><a href="#further-notes" id="markdown-toc-further-notes">Further Notes</a></li>
</ul>
</li>
<li><a href="#wrapping-up" id="markdown-toc-wrapping-up">Wrapping Up</a> <ul>
<li><a href="#findings" id="markdown-toc-findings">Findings</a></li>
<li><a href="#tuning-advice" id="markdown-toc-tuning-advice">Tuning “Advice”</a></li>
<li><a href="#thanks" id="markdown-toc-thanks">Thanks</a></li>
<li><a href="#discussion-and-feedback" id="markdown-toc-discussion-and-feedback">Discussion and Feedback</a></li>
</ul>
</li>
</ul>
<h2 id="prelude">Prelude</h2>
<h3 id="data-dependent-performance">Data Dependent Performance</h3>
<p>On current mainstream CPUs, the timing of most instructions isn’t data-dependent. That is, their performance is the same regardless of the <em>value</em> of the input(s) to the instruction. Unlike you<sup id="fnref:assume" role="doc-noteref"><a href="#fn:assume" class="footnote" rel="footnote">3</a></sup> or me, your CPU takes the same time to add <code class="language-plaintext highlighter-rouge">1 + 2</code> as it does to add <code class="language-plaintext highlighter-rouge">68040486 + 80866502</code>.</p>
<p>Now, there are some notable exceptions:</p>
<ul>
<li>Integer division is data-dependent on most x86 CPUs: larger inputs generally take longer although the details vary widely among microarchitectures<sup id="fnref:icldiv" role="doc-noteref"><a href="#fn:icldiv" class="footnote" rel="footnote">4</a></sup>.</li>
<li>BMI2 instructions <code class="language-plaintext highlighter-rouge">pdep</code> and <code class="language-plaintext highlighter-rouge">pext</code> have <a href="https://twitter.com/uops_info/status/1202950247900684290">famously terrible</a> and data-dependent performance on AMD Zen and Zen2 chips.</li>
<li>Floating point instructions often have slower performance when <a href="https://en.wikipedia.org/wiki/Denormal_number#Performance_issues">denormal numbers</a> are encountered, although some rounding modes such as <em>flush to zero</em> may avoid this.</li>
</ul>
<p>That list is not exhaustive: there are other cases of data-dependent performance, especially when you start digging into complex microcoded instructions such as <a href="https://www.felixcloutier.com/x86/cpuid"><code class="language-plaintext highlighter-rouge">cpuid</code></a>. Still, it isn’t unreasonable to assume that most simple instructions not listed above execute in constant time.</p>
<p>How about memory operations, such as loads and stores?</p>
<p>Certainly, the <em>address</em> matters. After all, the address determines the caching behavior, and caching can easily account for two orders of magnitude difference in performance<sup id="fnref:memperf" role="doc-noteref"><a href="#fn:memperf" class="footnote" rel="footnote">5</a></sup>. On the other hand, I wouldn’t expect the <em>data values</em> loaded or stored to matter. There is not much reason to expect the memory or caching subsystem to care about the value of the bits loaded or stored, outside of scenarios such as hardware-compressed caches, which are not widely deployed<sup id="fnref:atall" role="doc-noteref"><a href="#fn:atall" class="footnote" rel="footnote">6</a></sup> on x86.</p>
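<p>If you wanted to probe that assumption directly, a minimal (and deliberately naive) sketch is to time the identical fill with two different values; any consistent gap would indicate value-dependent behavior:</p>

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Minimal sketch of a value-dependence check: time the identical fill
// with two different values. (The real benchmark, linked below, is far
// more careful about warmup and repeated trials; this only shows the
// idea.)
double time_fill(std::vector<int>& buf, int val) {
    auto start = std::chrono::steady_clock::now();
    std::fill(buf.begin(), buf.end(), val);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();  // seconds
}
```

<p>On hardware with the store elimination behavior discussed in this post, the zero-filling case can come out measurably faster once the buffer spills out of the L2.</p>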
<h3 id="source">Source</h3>
<p>The full benchmark associated with this post (including some additional benchmarks not mentioned here) is <a href="https://github.com/travisdowns/zero-fill-bench">available on GitHub</a>.</p>
<h2 id="benchmarks">Benchmarks</h2>
<p>That’s enough prelude <img src="/assets/intel-zero-opt/prelude.jpg" alt="a small red car" style="display:inline; height: 1.2em; aspect-ratio: 120/62" /> for now. Let’s write some benchmarks.</p>
<h3 id="a-very-simple-loop">A Very Simple Loop</h3>
<p>Let’s start with a very simple task. Write a function that takes an <code class="language-plaintext highlighter-rouge">int</code> value <code class="language-plaintext highlighter-rouge">val</code> and fills a buffer of a given size with copies of that value. Just like <a href="https://en.cppreference.com/w/c/string/byte/memset"><code class="language-plaintext highlighter-rouge">memset</code></a>, but with an <code class="language-plaintext highlighter-rouge">int</code> value rather than a <code class="language-plaintext highlighter-rouge">char</code> one.</p>
<p>The canonical C implementation is probably some type of for loop, like this:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill_int</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span> <span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">int</span> <span class="n">val</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">size</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>… or maybe this<sup id="fnref:otherc" role="doc-noteref"><a href="#fn:otherc" class="footnote" rel="footnote">7</a></sup>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill_int</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span> <span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">int</span> <span class="n">val</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span><span class="o">*</span> <span class="n">end</span> <span class="o">=</span> <span class="n">buf</span> <span class="o">+</span> <span class="n">size</span><span class="p">;</span> <span class="n">buf</span> <span class="o">!=</span> <span class="n">end</span><span class="p">;</span> <span class="o">++</span><span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
<span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In C++, we don’t even need that much: we can simply delegate directly to <code class="language-plaintext highlighter-rouge">std::fill</code> which does the same thing as a one-liner<sup id="fnref:bpurp" role="doc-noteref"><a href="#fn:bpurp" class="footnote" rel="footnote">8</a></sup>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">buf</span> <span class="o">+</span> <span class="n">size</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>
</code></pre></div></div>
<p>There is nothing magic about <code class="language-plaintext highlighter-rouge">std::fill</code>: it also <a href="https://github.com/gcc-mirror/gcc/blob/866cd688d1b72b0700a7e001428bdf2fe73fbf64/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L698">uses a loop</a> just like the C version above. Not surprisingly, gcc and clang compile them to the <a href="https://godbolt.org/z/R5bJiE">same machine code</a><sup id="fnref:clangv" role="doc-noteref"><a href="#fn:clangv" class="footnote" rel="footnote">9</a></sup>.</p>
<p>With the right compiler arguments (<code class="language-plaintext highlighter-rouge">-march=native -O3 -funroll-loops</code> in our case), we expect this <code class="language-plaintext highlighter-rouge">std::fill</code> version (and all the others) to be implemented with AVX vector instructions, and <a href="https://godbolt.org/z/SfGVEC">it is so</a>. The part which does the heavy lifting for large fills looks like this:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.L4:</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">0</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">32</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">64</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">96</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">128</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">160</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">192</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">224</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">add</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">256</span>
<span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">r9</span>
<span class="nf">jne</span> <span class="nv">.L4</span>
</code></pre></div></div>
<p>It stores 256 bytes of data every iteration using eight 32-byte AVX2 store instructions. The full function is much larger, with a scalar portion for buffers smaller than 32 bytes (which also handles any odd elements left over after the vectorized part is done), and a vectorized jump table to handle up to seven 32-byte chunks before the main loop. No effort is made to align the destination, but we’ll align everything to 64 bytes in our benchmark so this won’t matter.</p>
<h3 id="our-first-benchmark">Our First Benchmark</h3>
<p>Enough foreplay: let’s take the C++ version out for a spin, with two different fill values (<code class="language-plaintext highlighter-rouge">val</code>) selected completely at random: zero (<code class="language-plaintext highlighter-rouge">fill0</code>) and one (<code class="language-plaintext highlighter-rouge">fill1</code>). We’ll use gcc 9.2.1 and the <code class="language-plaintext highlighter-rouge">-march=native -O3 -funroll-loops</code> flags mentioned above.</p>
<p>We organize it so that for both tests we call the <em>same</em> non-inlined function: the exact same instructions are executed and only the value differs. That is, the compiler isn’t making any data-dependent optimizations.</p>
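<p>To make that concrete, here is a minimal sketch of the setup (hypothetical and simplified, not the actual benchmark harness): the <code class="language-plaintext highlighter-rouge">noinline</code> attribute prevents the compiler from specializing either call site for its constant argument.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;vector&gt;

// noinline ensures both tests execute the exact same machine code;
// only the runtime value of val differs.
__attribute__((noinline))
void fill_val(int* buf, std::size_t size, int val) {
    for (std::size_t i = 0; i &lt; size; ++i) {
        buf[i] = val;
    }
}

int main() {
    std::vector&lt;int&gt; buf(1024 * 1024);
    fill_val(buf.data(), buf.size(), 0);  // fill0
    fill_val(buf.data(), buf.size(), 1);  // fill1
}
</code></pre></div></div>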
<p>Here’s the fill throughput in GB/s for these two values, for region sizes ranging from 100 up to 100,000,000 bytes.</p>
<center><strong>Figure 1</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig1" id="fig1">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig1.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/overall.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig1.html">
<img class="figimg" src="/assets/intel-zero-opt/fig1.svg" alt="Figure 1" width="648" height="432" />
</a>
</div>
<p class="info"><strong>About this chart:</strong><br />
At each region size (that is, at each position along the x-axis) 17 semi-transparent samples<sup id="fnref:warm" role="doc-noteref"><a href="#fn:warm" class="footnote" rel="footnote">10</a></sup> are plotted and although they usually overlap almost completely (resulting in a single circle), you can see cases where there are outliers that don’t line up with the rest of the samples. This plot tries to give you an idea of the spread of back-to-back samples without hiding them behind error bars<sup id="fnref:errorbars" role="doc-noteref"><a href="#fn:errorbars" class="footnote" rel="footnote">11</a></sup>. Finally, the sizes of the various data caches (32, 256 and 6144 KiB for the L1D, L2 and L3, respectively) are marked for convenience.</p>
<h4 id="l1-and-l2">L1 and L2</h4>
<p>Not surprisingly, the performance depends heavily on what level of cache the filled region fits into.</p>
<p>Everything is fairly sane when the buffer fits in the L1 or L2 cache (up to ~256 KiB<sup id="fnref:l2" role="doc-noteref"><a href="#fn:l2" class="footnote" rel="footnote">12</a></sup>). The relatively poor performance for very small region sizes is explained by the prologue and epilogue of the vectorized implementation: for small sizes, a relatively large amount of time is spent in the int-at-a-time loops, which store only 4 bytes per cycle rather than up to 32.</p>
<p>This also explains the bumpy performance in the fastest region, between ~1,000 and ~30,000 bytes: this is highly reproducible and not noise. It occurs because some sampled values have a larger remainder mod 32. For example, the sample at 740 bytes runs at ~73 GB/s while the next sample at 988 runs at a slower 64 GB/s. That’s because 740 % 32 is 4, while 988 % 32 is 28, so the latter size has 7x more cleanup work to do than the former<sup id="fnref:badvec" role="doc-noteref"><a href="#fn:badvec" class="footnote" rel="footnote">13</a></sup>. Essentially, we are semi-randomly sampling a sawtooth function, and if you <a href="/assets/intel-zero-opt/sawtooth.svg">plot this region with finer granularity</a><sup id="fnref:melty" role="doc-noteref"><a href="#fn:melty" class="footnote" rel="footnote">14</a></sup> you can see it quite clearly.</p>
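<p>The cleanup cost can be computed directly: the scalar tail has to handle whatever is left after the 32-byte chunks. A trivial helper (for illustration only) reproduces the numbers above.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;

// The vectorized body stores 32 bytes (8 ints) at a time; the leftover
// size % 32 bytes fall to the 4-byte-at-a-time scalar tail.
std::size_t scalar_tail_bytes(std::size_t size_bytes) {
    return size_bytes % 32;
}
</code></pre></div></div>
<p>For example, <code class="language-plaintext highlighter-rouge">scalar_tail_bytes(740)</code> is 4, while <code class="language-plaintext highlighter-rouge">scalar_tail_bytes(988)</code> is 28: the 7x difference in cleanup work mentioned above.</p>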
<h4 id="getting-weird-in-the-l3">Getting Weird in the L3</h4>
<p>So while there are some interesting effects in the first half of the results, covering L1 and L2, they are fairly easy to explain and, more to the point, performance for the zero and one cases are identical: the samples are all concentric. As soon as we dip our toes into the L3, however, things start to get <em>weird</em>.</p>
<p>Weird in that we see a clear divergence between stores of zeros versus ones. Remember that this is the exact same function, the same <em>machine</em> code executing the same stream of instructions, only varying in the value of the <code class="language-plaintext highlighter-rouge">ymm1</code> register passed to the store instruction. Storing zero is consistently about 17% to 18% faster than storing one, both in the region covered by the L3 (up to 6 MiB on my system), and beyond that where we expect misses to RAM (it looks like the difference narrows in the RAM region, but it’s mostly a trick of the eye: the relative performance difference is about the same).</p>
<p>What’s going on here? Why does the CPU care <em>what values</em> are being stored, and why is zero special?</p>
<p>We can get some additional insight by measuring the <code class="language-plaintext highlighter-rouge">l2_lines_out.silent</code> and <code class="language-plaintext highlighter-rouge">l2_lines_out.non_silent</code> events while we focus on the regions that fit in L2 or L3. These events measure the number of lines evicted from L2 either <em>silently</em> or <em>non-silently</em>.</p>
<p>Here are Intel’s descriptions of these events:</p>
<p><strong>l2_lines_out.silent</strong></p>
<blockquote>
<p>Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state.</p>
</blockquote>
<p><strong>l2_lines_out.non_silent</strong></p>
<blockquote>
<p>Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines are in Modified state. Modified lines are written back to L3.</p>
</blockquote>
<p>The states being referred to here are <a href="https://en.wikipedia.org/wiki/MESI_protocol">MESI</a> cache states, commonly abbreviated M (modified), E (exclusive, but not modified) and S (possibly shared, not modified).</p>
<p>The second definition is not completely accurate. In particular, it implies that only modified lines trigger the <em>non-silent</em> event. However, <a href="https://stackoverflow.com/q/52565303/149138">I find</a> that unmodified lines in E state can also trigger this event. Roughly, the behavior for unmodified lines seems to be that lines that miss in L2 <em>and</em> L3 usually get filled into the L2 in a state where they will be evicted <em>non-silently</em>, but unmodified lines that miss in L2 and <em>hit</em> in L3 will generally be evicted silently<sup id="fnref:silent" role="doc-noteref"><a href="#fn:silent" class="footnote" rel="footnote">15</a></sup>. Of course, lines that are modified <em>must</em> be evicted non-silently in order to update the outer levels with the new data.</p>
<p>In summary: silent evictions are associated with unmodified lines in E or S state, while non-silent evictions are associated with M, E or (possibly) S state lines, with the silent vs non-silent choice for E and S being made in some unknown manner.</p>
<p>Let’s look at silent vs non-silent evictions for the <code class="language-plaintext highlighter-rouge">fill0</code> and <code class="language-plaintext highlighter-rouge">fill1</code> cases:</p>
<center><strong>Figure 2</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig2" id="fig2">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig2.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig2.html">
<img class="figimg" src="/assets/intel-zero-opt/fig2.svg" alt="Figure 2" width="648" height="432" />
</a>
</div>
<p class="info"><strong>About this chart:</strong><br />
For clarity, I show only the single median sample for each size<sup id="fnref:trust" role="doc-noteref"><a href="#fn:trust" class="footnote" rel="footnote">16</a></sup>. As before, the left axis is fill speed and on the right axis the two types of eviction events are plotted, normalized to the number of cache lines accessed in the benchmark. That is, a value of 1.0 means that for every cache line accessed, the event occurred one time.</p>
<p>The <em>total</em> number of evictions (sum of silent and non-silent) is the same for both cases: near zero<sup id="fnref:wb" role="doc-noteref"><a href="#fn:wb" class="footnote" rel="footnote">17</a></sup> when the region fits in L2, and then quickly increases to ~1 eviction per stored cache line. In the L3, <code class="language-plaintext highlighter-rouge">fill1</code> also behaves as we’d expect: essentially all of the evictions are non-silent. This makes sense since modified lines <em>must</em> be evicted non-silently to write their modified data to the next layer of the cache subsystem.</p>
<p>For <code class="language-plaintext highlighter-rouge">fill0</code>, the story is different. Once the buffer size no longer fits in L2, we see the same <em>total</em> number of evictions from L2, but 63% of these are silent, the rest non-silent. Remember, only unmodified lines even have the hope of a silent eviction. This means that at least 63% of the time, the L2<sup id="fnref:orl3" role="doc-noteref"><a href="#fn:orl3" class="footnote" rel="footnote">18</a></sup> is able to detect that the write is <em>redundant</em>: it doesn’t change the value of the line, and so the line is evicted silently. That is, it is never written back to the L3. This is presumably what causes the performance boost: the pressure on the L3 is reduced: although all the implied reads<sup id="fnref:rfo" role="doc-noteref"><a href="#fn:rfo" class="footnote" rel="footnote">19</a></sup> still need to go through the L3, only about 1 out of 3 of those lines ends up getting written back.</p>
<p>Once the test starts to exceed the L3 threshold, all of the evictions become non-silent even in the <code class="language-plaintext highlighter-rouge">fill0</code> case. This doesn’t necessarily mean that the zero optimization stops occurring. As mentioned earlier<sup id="fnref:silent:1" role="doc-noteref"><a href="#fn:silent" class="footnote" rel="footnote">15</a></sup>, it is a typical pattern even for read-only workloads: once lines arrive in L2 as a result of an L3 miss rather than a hit, their subsequent eviction becomes non-silent, even if never written. So we can assume that the lines are probably still detected as not modified, although we lose our visibility into the effect at least as far as the <code class="language-plaintext highlighter-rouge">l2_lines_out</code> events go. That is, although all evictions are non-silent, some fraction of the evictions are still indicating that the outgoing data is unmodified.</p>
<h4 id="ram-still-weird">RAM: Still Weird</h4>
<p>In fact, we can confirm that this apparent optimization still happens as we move into RAM, using a different set of events. There are several to choose from – and all of those that I tried tell the same story. We’ll focus on <code class="language-plaintext highlighter-rouge">unc_arb_trk_requests.writes</code>, <a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/6th-gen-core-family-uncore-performance-monitoring-manual.pdf">documented</a> as follows:</p>
<blockquote>
<p>Number of writes allocated including any write transaction including full, partials and evictions.</p>
</blockquote>
<p>It is important to note that the “uncore tracker” these events monitor is used by data flowing between L3 and memory, not between L2 and L3. So <em>writes</em> here generally refers to writes that will reach memory.</p>
<p>Here’s how this event scales for the same test we’ve been running this whole time (the size range has been shifted for focus on the area of interest)<sup id="fnref:sneaky" role="doc-noteref"><a href="#fn:sneaky" class="footnote" rel="footnote">20</a></sup>:</p>
<center><strong>Figure 3</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig3" id="fig3">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig3.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/l3-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig3.html">
<img class="figimg" src="/assets/intel-zero-opt/fig3.svg" alt="Figure 3" width="648" height="432" />
</a>
</div>
<p>The number of writes for well-behaved <code class="language-plaintext highlighter-rouge">fill1</code> approaches one write per cache line as the buffer exceeds the size of L3 – again, this is as expected. For the more rebellious <code class="language-plaintext highlighter-rouge">fill0</code>, it is almost exactly half that amount. For every two lines written by the benchmark, we only write one back to memory! This same 2:1 ratio is also reflected if we measure memory writes at the integrated memory controller<sup id="fnref:imcevent" role="doc-noteref"><a href="#fn:imcevent" class="footnote" rel="footnote">21</a></sup>: writing zeros results in only half the number of writes at the memory controller.</p>
<h3 id="wild-irresponsible-speculation-and-miscellanous-musings">Wild, Irresponsible Speculation and Miscellanous Musings</h3>
<p>This is all fairly strange. It’s not weird that there would be a “redundant writes” optimization to avoid writing back identical values: this seems like it could benefit some common write patterns.</p>
<p>It is perhaps a bit unusual that it apparently applies only to all-zero values. Maybe this is because zeros overwriting zeros is one of the most common redundant write cases, and detecting zero values can be done more cheaply than a full compare. Also, the “is zero?” state can be communicated and stored as a single bit, which might be useful. For example, if the L2 is involved in the duplicate detection (and the <code class="language-plaintext highlighter-rouge">l2_lines_out</code> results suggest it is), perhaps the detection happens when the line is evicted, at which point you want to compare to the line in L3, but you certainly can’t store the entire old value in or near the L2 (that would require storage as large as the L2 itself). You could, however, store a single bit indicating that the line was zero, and check whether the outgoing line is still all zero as part of the eviction process.</p>
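<p>As an entirely speculative illustration of that single-bit scheme (my own toy model, not a description of the real hardware): one bit of history per line is enough to decide, at eviction time, whether a writeback can be skipped.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;algorithm&gt;
#include &lt;array&gt;
#include &lt;cstdint&gt;

// Toy model: keep one "filled as all-zero" bit per line; on eviction, a
// line whose bit is set and whose contents are still all zero needs no
// writeback, because the copy in the outer level is unchanged.
struct LineModel {
    std::array&lt;std::uint64_t, 8&gt; data{};  // one 64-byte cache line
    bool zero_on_fill = true;             // the single stored bit of history
    bool all_zero() const {
        return std::all_of(data.begin(), data.end(),
                           [](std::uint64_t w) { return w == 0; });
    }
    bool can_evict_silently() const { return zero_on_fill &amp;&amp; all_zero(); }
};
</code></pre></div></div>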
<h4 id="predicting-a-new-predictor">Predicting a New Predictor</h4>
<p>What is weirdest of all, however, is that the optimization doesn’t kick in 100% of the time, but only for 40% to 60% of the lines, depending on various parameters<sup id="fnref:params" role="doc-noteref"><a href="#fn:params" class="footnote" rel="footnote">22</a></sup>. What would lead to that effect? One could imagine some type of predictor which determines whether to apply this optimization, depending on, e.g., whether the optimization has recently been effective – that is, whether redundant stores have been common recently. Perhaps this predictor also considers factors such as the occupancy of outbound queues<sup id="fnref:obbus" role="doc-noteref"><a href="#fn:obbus" class="footnote" rel="footnote">23</a></sup>: when the bus is near capacity, searching for and eliminating redundant writes might be worth the power or latency penalty, compared to the case where there is little apparent pressure on the bus.</p>
<p>In this benchmark, any predictor would find that the optimization is 100% effective: <em>every</em> write is redundant! So we might guess that the second condition (queue occupancy) results in a behavior where only some stores are eliminated: as more stores are eliminated, the load on the bus becomes lower, so at some point the predictor no longer thinks eliminating stores is worth it, and you reach a kind of stable state where only a fraction of stores are eliminated, based on the predictor threshold.</p>
<h4 id="predictor-test">Predictor Test</h4>
<p>We can kind of test that theory: in this model, any store is <em>capable</em> of being eliminated, but the ratio of eliminated stores is bounded above by the predictor behavior. So if we find that a benchmark of <em>pure</em> redundant zero stores is eliminated at a 60% rate, we might expect that any benchmark with at least 60% redundant stores can reach the 60% rate, and with lower rates, you’d see full elimination of all redundant stores (since now the bus always stays active enough to trigger the predictor).</p>
<p class="info">Apparently analogies are helpful, so an analogy here would be a person controlling the length of a line by redirecting some incoming people. For example, in an airport security line the handler tries to keep the line at a certain maximum length by redirecting (redirecting -> store elimination) people to the priority line if they are eligible and the main line is at or above its limit. Eligible people are those without carry-on luggage (eligible people -> zero-over-zero stores).<br />
<br />
If everyone is eligible (-> 100% zero stores), this control will always be successful and the fraction of people redirected will depend on the relative rates of ingress and egress through security. If security only has a throughput of 40% of the ingress rate, 60% of people will be redirected in the steady state. Now, consider what happens if not everyone is eligible: if the eligible fraction is at least 60%, nothing changes. You still redirect 60% of people. Only if the eligible rate drops below 60% is there a problem: now you’ll be redirecting 100% of eligible people, but the primary line will grow beyond your limit.<br />
<br />
Whew! Not sure if that was helpful after all?</p>
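<p>Maybe a toy simulation of the analogy helps (my own construction, nothing measured from hardware): people arrive once per tick, security drains the line at 40% of the arrival rate, and an arrival is redirected only when the line is at its limit.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstdio&gt;

// Toy model of the queue-occupancy theory: with everyone eligible, the
// steady-state redirect fraction approaches 1 - egress_rate.
double redirected_fraction(int ticks, double egress_rate) {
    const int limit = 10;   // maximum main-line length
    int line = 0, redirected = 0;
    double served = 0.0;    // fractional egress accumulator
    for (int t = 0; t &lt; ticks; ++t) {
        served += egress_rate;  // security processes people
        if (served &gt;= 1.0 &amp;&amp; line &gt; 0) { served -= 1.0; --line; }
        if (line &gt;= limit) ++redirected;  // line full: redirect this arrival
        else ++line;                      // otherwise join the line
    }
    return (double)redirected / ticks;
}

int main() {
    // egress at 40% of ingress: expect roughly 0.60 redirected
    std::printf("%.2f\n", redirected_fraction(100000, 0.4));
}
</code></pre></div></div>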
<p>Let’s try a benchmark which adds a new implementation, <code class="language-plaintext highlighter-rouge">alt01</code>, which alternates between writing a cache line of zeros and a cache line of ones. All the writes are redundant, but only 50% are zeros, so under the theory that a predictor is involved we expect that maybe 50% of the stores will be eliminated (i.e., 100% of the zero stores are eliminated, and they make up 50% of the total).</p>
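<p>A hypothetical sketch of what such an <code class="language-plaintext highlighter-rouge">alt01</code> fill might look like (the real implementation in the benchmark may differ): even-numbered 64-byte lines get zeros and odd-numbered lines get ones, so on a repeated fill every store is redundant, but only half of them store zero.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;

// Write alternating 64-byte lines of zeros and ones into an int buffer.
void alt01(int* buf, std::size_t size) {
    const std::size_t ints_per_line = 64 / sizeof(int);  // 16 ints per line
    for (std::size_t i = 0; i &lt; size; ++i) {
        buf[i] = static_cast&lt;int&gt;((i / ints_per_line) % 2);
    }
}
</code></pre></div></div>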
<p>Here we focus on the L3, similar to Fig. 2 above, showing silent evictions (the non-silent ones make up the rest, adding up to 1 total as before):</p>
<center><strong>Figure 4</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig4" id="fig4">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig4.html">
<img class="figimg" src="/assets/intel-zero-opt/fig4.svg" alt="Figure 4" width="648" height="432" />
</a>
</div>
<p>We don’t see 50% elimination. Rather, we see less than half the elimination rate of the all-zeros case: 27% versus 63%. Performance is better in the L3 region than in the all-ones case, but only slightly! So this doesn’t support the theory of a predictor capable of eliminating any store and operating primarily on outbound queue occupancy.</p>
<p>Similarly, we can examine the region where the buffer fits only in RAM, similar to Fig. 3 above:</p>
<center><strong>Figure 5</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig5" id="fig5">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig5.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/l3-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig5.html">
<img class="figimg" src="/assets/intel-zero-opt/fig5.svg" alt="Figure 5" width="648" height="432" />
</a>
</div>
<p>Recall that the lines show the number of writes reaching the memory subsystem. Here we see that <code class="language-plaintext highlighter-rouge">alt01</code> again splits the difference between the zeros and ones cases: about 75% of the writes reach memory, versus 48% in the all-zeros case, so the elimination is again roughly half as effective. In this case, the performance also splits the difference between all zeros and all ones: it falls almost exactly half-way between the two other cases.</p>
<p>So I don’t know what’s going on exactly. It seems like maybe only some fraction of lines are eligible for elimination, due to some unknown internal mechanism in the <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr>.</p>
<h3 id="hardware-survey">Hardware Survey</h3>
<p>Finally, here are the performance results (same as <strong>Figure 1</strong>) on a variety of other Intel and AMD x86 architectures, as well as IBM’s POWER9 and Amazon’s Graviton 2 ARM processor, one per tab.</p>
<!-- uresults: snb/remote.csv,hsw/remote.csv,skl/remote.csv,skx/remote.csv,cnl/remote.csv,zen2/remote.csv,power9/remote.csv,gra2/remote.csv -->
<div class="tabs" id="tabs-fig6">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-1" name="tab-group-fig6" checked="" />
<label class="tab-label" for="tab-fig6-1">Sandy Bridge</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-snb.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/snb/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-snb.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-snb.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-2" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-2">Haswell</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-hsw.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/hsw/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-hsw.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-hsw.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-3" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-3">Skylake-S</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-skl.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/skl/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-skl.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-skl.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-4" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-4">Skylake-X</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-skx.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/skx/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-skx.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-skx.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-5" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-5">Cannon Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-cnl.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/cnl/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-cnl.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-cnl.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-6" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-6">Zen2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-zen2.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/zen2/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-zen2.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-zen2.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-7" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-7">POWER9</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-power9.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/power9/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-power9.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-power9.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-8" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-8">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-gra2.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/gra2/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-gra2.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-gra2.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>Some observations on these results:</p>
<ul>
<li>The redundant write optimization isn’t evident in the performance profile for <em>any</em> of the other non-<abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> hardware tested. Not even closely related Intel hardware like Haswell or Skylake-X. I also did a few spot tests with performance counters, and didn’t see any evidence of a reduction in writes. So for now this might be a Skylake client only thing (of course, Skylake client is perhaps the most widely deployed Intel <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr> ever due to the many identical-<abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr>-except-in-name variants: Kaby Lake, Coffee Lake, etc, etc). Note that the Skylake-S result here is for a different (desktop i7-6700) chip than the rest of this post, so we can at least confirm this occurs on two different chips.</li>
<li>Except in the RAM region, Sandy Bridge throughput is half of its successors: a consequence of having only a 16-byte load/store path in the core, despite supporting 32-byte AVX instructions.</li>
<li>AMD Zen2 has <em>excellent</em> write performance in the L2 and L3 regions. All of the Intel chips drop to about half throughput for writes in the L2: slightly above 16 bytes per cycle (around 50 GB/s for most of these chips). Zen2 maintains its L1 throughput and in fact has its highest results in L2: over 100 GB/s. Zen2 also manages more than 70 GB/s in the L3, much better than the Intel chips, in this test.</li>
<li>Both Cannon Lake and Skylake-X exhibit a fair amount of inter-sample variance in the L2-resident region. My theory here would be prefetcher interference that behaves differently than on earlier chips, but I am not sure.</li>
<li>Skylake-X, with a different L3 design than the other chips, has quite poor L3 fill throughput, about half of contemporary Intel chips, and less than a third of Zen2.</li>
<li>The POWER9 performance is neither terrible nor great. The most interesting part is probably the high L3 fill throughput: L3 throughput is as high or higher than L1 or L2 throughput, but still not in Zen2 territory.</li>
<li>Amazon’s new Graviton 2 processor is very interesting. It seems to be limited to one 16-byte store per cycle<sup id="fnref:armcompile" role="doc-noteref"><a href="#fn:armcompile" class="footnote" rel="footnote">24</a></sup>, giving it a peak possible store throughput of 40 GB/s, so it doesn’t do well in the L1 region versus competitors that can hit 100 GB/s or more (they have both higher frequency and 32 byte stores), but it sustains the 40 GB/s all the way to RAM sizes, with a RAM result flat enough to serve drinks on, and this on a shared 64-CPU host where I paid for only a single core<sup id="fnref:g2ga" role="doc-noteref"><a href="#fn:g2ga" class="footnote" rel="footnote">25</a></sup>! The RAM performance is the highest out of all hardware tested.</li>
</ul>
<p>You might notice that Ice Lake, Intel’s newest microarchitecture, is missing from this list: that’s because there is a <a href="/blog/2020/05/18/icelake-zero-opt.html">whole separate post</a> on it.</p>
<h3 id="further-notes">Further Notes</h3>
<p>Here’s a grab bag of notes and observations that don’t get their own full section, don’t have any associated plots, etc. That doesn’t mean they are less important! Don’t say that! These ones matter too, they really do.</p>
<ul>
<li>Despite any impressions I may have given above: you don’t need to <em>fully</em> overwrite a cache line with zeros for this to kick in, and you can even write <em>non-zero</em> values if you overwrite them “soon enough” with zeros. Rather, the line must initially be fully zero, and then all the <em>escaping<sup id="fnref:escaping" role="doc-noteref"><a href="#fn:escaping" class="footnote" rel="footnote">26</a></sup></em> writes must be zero. Another way of thinking about this is that the thing that matters is the value of the cache line written back, as well as the old value of the line that the writeback is replacing: these must both be fully zero, but that doesn’t mean that you need to overwrite the line with zeros: any locations not written are still zero “from before”. I directly test this in the <code class="language-plaintext highlighter-rouge">one_per0</code> and <code class="language-plaintext highlighter-rouge">one_per1</code> tests in the benchmark. These write only a single <code class="language-plaintext highlighter-rouge">int</code> value in each cache line, leaving the other values unchanged. In that benchmark the optimization triggers in exactly the same way when writing a single zero.</li>
<li>Although we didn’t find evidence of this happening on other x86 hardware, nor on POWER or ARM, that doesn’t mean it isn’t or can’t happen: the conditions for it may simply not have been triggered. For the POWER9 and ARM chips in particular, we didn’t check any performance counters, so the elimination could be occurring without making any difference in performance. That’s especially feasible in the ARM case, where the performance is totally limited by the core-level 16-bytes-per-cycle write throughput: even if all writes are eliminated later on in the path to memory, we expect the performance to be the same.</li>
<li>We could learn more about this effect by setting up a test which writes ones first to a region of some fixed size, then overwrites it with zeros, and repeats this in a rolling fashion over a larger buffer. This test basically lets the 1s escape to a certain level of the cache hierarchy, and seeing where the optimization kicks in will tell us something interesting.</li>
</ul>
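<p>A minimal sketch of the rolling test proposed above (the window and buffer sizes are illustrative assumptions, not values from the post): by varying the window size relative to the L1/L2 capacities, you control which cache level the ones escape to before being overwritten with zeros, which should reveal at which level the elimination happens.</p>

```cpp
#include <cstddef>
#include <cstring>

// Sketch (not the benchmark's actual code): write ones to a window,
// then overwrite the same window with zeros, sliding the window over a
// larger buffer. If the window exceeds a cache level's capacity, the
// ones "escape" to the next level, so the subsequent zero fill replaces
// a line whose old value is non-zero at that level.
void rolling_ones_then_zeros(unsigned char* buf, size_t bufsize, size_t window) {
    for (size_t off = 0; off + window <= bufsize; off += window) {
        std::memset(buf + off, 1, window);  // ones escape up to some cache level
        std::memset(buf + off, 0, window);  // then zeros overwrite them
    }
}
```

Measuring L2/L3 writeback counters while sweeping the window size would then show where (if anywhere) the optimization stops applying.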
<p><span id="summary-perma"></span></p>
<h2 id="wrapping-up">Wrapping Up</h2>
<h3 id="findings">Findings</h3>
<p>Here’s a brief summary of what we found. This will be a bit redundant if you’ve just read the whole thing, but we need to accommodate everyone who just skipped down to this part, right?</p>
<ul>
<li>Intel chips can apparently eliminate some redundant stores when zeros are written to a cache line that is already all zero (the line doesn’t need to be fully overwritten).</li>
<li>This optimization applies at least as early as L2 writeback to L3, so it would apply to the extent that working sets don’t fit in L2.</li>
<li>The effect eliminates both write accesses to L3, and writes to memory depending on the working set size.</li>
<li>For the pure store benchmark discussed here, the effect of this optimization is a reduction in the number of writes of ~63% (to L3) and ~50% (to memory), with a runtime reduction of between 15% and 20%.</li>
<li>It is unclear why not all redundant zero-over-zero stores are eliminated.</li>
</ul>
<h3 id="tuning-advice">Tuning “Advice”</h3>
<p>So is any of this actually useful? Can we use this finding to quadruple the speed of the things that really matter in computation: tasks like bitcoin mining, high-frequency trading and targeting ads in real time?</p>
<p>Nothing like that, no – but it might provide a small boost for some cases.</p>
<p>Many of those cases are probably getting the benefit without any special effort. After all, zero is already a special value: it’s how memory arrives from the operating system, and how freshly allocated memory arrives at the language level in some languages. So a lot of cases that could get this benefit probably already are.</p>
<p>Redundant zero-over-zero probably isn’t as rare as you might think either: consider that in low level languages, memory is often cleared after receiving it from the allocator, but in many cases this memory came directly from the OS, so it is already zero<sup id="fnref:calloc" role="doc-noteref"><a href="#fn:calloc" class="footnote" rel="footnote">27</a></sup>. Consider also cases like fairly-sparse matrix multiplication, where your matrix isn’t sparse enough to actually use dedicated sparse routines, but still has a lot of zeros. In that case, you are going to be writing 0 all the time in your final result and scratch buffers. This optimization will reduce the writeback traffic in that case.</p>
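<p>The allocator-level version of this idea can be sketched as follows (a minimal illustration, with arbitrary sizes; the function names are mine, not from any library):</p>

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Sketch: calloc can skip the explicit clear when the block comes
// straight from the OS, because the OS guarantees fresh pages are
// already zero. The malloc + memset version must write every byte,
// since the caller has no way to know the memory is already zero.
int* alloc_zeroed_calloc(size_t n) {
    return static_cast<int*>(calloc(n, sizeof(int)));
}

int* alloc_zeroed_memset(size_t n) {
    int* p = static_cast<int*>(malloc(n * sizeof(int)));
    if (p) {
        memset(p, 0, n * sizeof(int));  // redundant when the memory is OS-fresh
    }
    return p;
}
```

Both return fully zeroed memory, but only the second is guaranteed to actually perform the stores, which is why the memset path is the one that can redundantly write zero over zero.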
<p>If you are making a ton of redundant writes, the first thing you might want to do is look for a way to stop doing that. Beyond that, we can list some ways you <em>might</em> be able to take advantage of this new behavior:</p>
<ul>
<li>In the case you are likely to have redundant writes, prefer zero as the special value that is likely to be redundantly overwritten. For example if you are doing some blind writes, something like <a href="https://richardstartin.github.io/posts/garbage-collector-code-artifacts-card-marking">card marking</a> where you don’t know if your write is redundant, you might consider writing zeros, rather than writing non-zeros, since in the case that some region of card marks gets repeatedly written, it will be all-zero and the optimization can apply. Of course, this cuts the wrong way when you go to clear the marked region: now you have to write non-zero so you don’t get the optimization during clearing (but maybe this happens out of line with the user code that matters). What ends up better depends on the actual write pattern.</li>
<li>In case you might have redundant zero-over-zero writes, pay a bit more attention to 64-byte alignment than you normally would because this optimization only kicks in when a full cache line is zero. So if you have some 64-byte structures that might often be all zero (but with non-zero neighbors), a forced 64-byte alignment will be useful since it would activate the optimization more frequently.</li>
<li>Probably the most practical advice of all: just keep this effect in mind because it can mess up your benchmarks and make you distrust performance counters. I found this when I noticed that the scalar version of a benchmark was writing 2x as much memory as the AVX version, despite them doing the same thing other than the choice of registers. As it happens, the dummy value in the vector register I was storing was zero, while in the scalar case it wasn’t: so there was a large difference that had nothing to do with scalar vs vector, but non-zero vs zero instead. Prefer non-zero values in store microbenchmarks, unless you really expect them to be zero in real life!</li>
<li>Keep an eye out for a more general version of this optimization: maybe one day we’ll see this effect apply to redundant writes that aren’t zero-over-zero.</li>
<li>Floating point has two zero values: +0 and -0. The representation of +0 is all-bits-zero, so using +0 gives you the chance of getting this optimization. Of course, everyone is already using +0 whenever they explicitly want zero.</li>
</ul>
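<p>The 64-byte alignment point above can be made concrete with a small sketch (the structure layout is illustrative; only <code class="language-plaintext highlighter-rouge">alignas(64)</code> is the actual technique):</p>

```cpp
#include <cstdint>

// Sketch: if a 64-byte structure is often all-zero, forcing 64-byte
// alignment means each all-zero instance occupies exactly one cache
// line, so the line-granular zero-over-zero elimination can apply even
// when neighboring objects hold non-zero data. Without the alignment,
// an object could straddle two lines, each "contaminated" by neighbors.
struct alignas(64) Counters {
    uint64_t vals[8];  // 64 bytes total, frequently all zero
};

static_assert(sizeof(Counters) == 64, "fully covers one cache line");
static_assert(alignof(Counters) == 64, "starts on a cache line boundary");
```

Arrays of <code class="language-plaintext highlighter-rouge">Counters</code> then map one object per line, which is the case where zeroing an already-zero object can benefit.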
<p>Of course, the fact that this seems to currently only apply on Skylake and <a href="/blog/2020/05/18/icelake-zero-opt.html">Ice Lake</a> client hardware makes specifically targeting this quite dubious indeed.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to Daniel Lemire who provided access to the hardware used in the <a href="#hardware-survey">Hardware Survey</a> part of this post.</p>
<p>Thanks Alex Blewitt and Zach Wegner who pointed out the CSS tab technique (I used the one linked in the <a href="https://twitter.com/zwegner/status/1223701307078402048">comments of this post</a>) and others who replied to <a href="https://twitter.com/trav_downs/status/1223690150175236102">this tweet</a> about image carousels.</p>
<p>Thanks to Tarlinian, 不良大脑的所有者, Bruce Dawson, Zach Wegner and Andrey Penechko who pointed out typos or omissions in the text.</p>
<h3 id="discussion-and-feedback">Discussion and Feedback</h3>
<p>Leave a comment below, or discuss on <a href="https://twitter.com/trav_downs/status/1260620313483771905">Twitter</a>, <a href="https://news.ycombinator.com/item?id=23169605">Hacker News</a>, <a href="https://www.reddit.com/r/asm/comments/gj3xq7/hardware_store_elimination/">reddit</a> or <a href="https://www.realworldtech.com/forum/?threadid=191798&curpostid=191798">RWT</a>.</p>
<p>Feedback is also warmly welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:ubstore" role="doc-endnote">
<p>Specifically, I was running <code class="language-plaintext highlighter-rouge">uarch-bench.sh --test-name=memory/bandwidth/store/*</code> from <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr>-bench. <a href="#fnref:ubstore" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:story" role="doc-endnote">
<p>Like many posts on this blog, what follows is essentially a <em>reconstruction</em>. I encountered the effect originally in a benchmark, as described, and then worked backwards from there to understand the underlying effect. Then, I wrote this post the other way around: building up a new benchmark to display the effect … but at that point I already knew what we’d find. So please don’t think I just started writing the benchmark you find on GitHub and then ran into this issue coincidentally: the arrow of causality points the other way. <a href="#fnref:story" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:assume" role="doc-endnote">
<p>Probably? I don’t like to assume too much about the reader, but this seems like a fair bet. <a href="#fnref:assume" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:icldiv" role="doc-endnote">
<p>Starting with Ice Lake, it seems like Intel has implemented a constant-time integer divide unit. <a href="#fnref:icldiv" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:memperf" role="doc-endnote">
<p>Latency-wise, something like 4-5 cycles for an L1 hit, versus 200-500 cycles for a typical miss to DRAM. Throughput-wise there is also a very large gap (256 GB/s L1 throughput <em>per core</em> on a 512-bit wide machine versus usually less than 100 GB/s <em>per socket</em> on recent Intel). <a href="#fnref:memperf" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:atall" role="doc-endnote">
<p>Is it deployed anywhere at all on x86? Ping me if you know. <a href="#fnref:atall" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:otherc" role="doc-endnote">
<p>It’s hard to say which is faster if they are compiled as written: x86 has indexed addressing modes that make the indexing more or less free, at least for arrays of element size 1, 2, 4 or 8, so the usual arguments against indexed access mostly don’t apply. Probably, it doesn’t matter: this detail might have made a big difference 20 years ago, but it is unlikely to make a difference on a decent compiler today, which can transform one into the other, depending on the target hardware. <a href="#fnref:otherc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bpurp" role="doc-endnote">
<p>For benchmarking purposes, we wrap this in <a href="https://github.com/travisdowns/zero-fill-bench/blob/post1/algos.cpp#L28">another function</a> so we can slap a <code class="language-plaintext highlighter-rouge">noinline</code> attribute on this function to ensure that we have a single non-inlined version to call for different values. If we just called <code class="language-plaintext highlighter-rouge">std::fill</code> with a literal <code class="language-plaintext highlighter-rouge">int</code> value, it is highly likely to get inlined at the call site and we’d have code with different alignment (and possibly other differences) for each value. <a href="#fnref:bpurp" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:clangv" role="doc-endnote">
<p>Admittedly I didn’t go line-by-line through the long vectorized version produced by clang, but the line count is identical and if you squint so the assembly is just a big green and yellow blur, they look the same… <a href="#fnref:clangv" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:warm" role="doc-endnote">
<p>There are 27 samples total at each size: the first 10 are discarded as warmup and the remaining 17 are plotted. <a href="#fnref:warm" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:errorbars" role="doc-endnote">
<p>The main problem with error bars is that most performance profiling results, and especially microbenchmarks, are mightily non-normal in their distribution, so displaying an error bar based on a statistic like the variance is often highly misleading. <a href="#fnref:errorbars" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:l2" role="doc-endnote">
<p>The ~ is there in ~256 KiB because unless you use huge pages, you might start to see L2 misses even before 256 KiB: a 256 KiB buffer that is only <em>virtually contiguous</em> is not necessarily well behaved in terms of evictions; it depends on how those 4k pages are mapped to physical pages. As soon as you get too many 4k pages mapping to the same group of sets, you’ll see evictions even before 256 KiB. <a href="#fnref:l2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:badvec" role="doc-endnote">
<p>It is worth noting<sup id="fnref:nested" role="doc-noteref"><a href="#fn:nested" class="footnote" rel="footnote">28</a></sup> that this performance variation with buffer size isn’t exactly inescapable. Rather, it is just a consequence of poor remainder handling in the compiler’s auto-vectorizer. An approach that would be much faster and generate much less code to handle the remaining elements would be to do a final full-width vector store but aligned to the end of the buffer. So instead of doing up to 7 additional scalar stores, you do one additional vector store (and suffer up to one fewer branch misprediction for random lengths, since the scalar outro involves conditional jumps). <a href="#fnref:badvec" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:melty" role="doc-endnote">
<p>Those melty bits where the pattern gets all weird, in the middle and near the right side are not random artifacts: they are consistently reproducible. I suspect a collision in the branch predictor history. <a href="#fnref:melty" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:silent" role="doc-endnote">
<p>This behavior is interesting and a bit puzzling. There are several reasons why you might want to do a non-silent eviction. (1a) would be to keep the L3 snoop filter up to date: if the L3 knows a core no longer has a copy of the line, later requests for that line can avoid snooping the core and are some 30 cycles faster. (1b) Similarly, if the L3 wants to evict this line, this is faster if it knows it can do it without writing back, versus snooping the owning core for a possibly modified line. (2) Keeping the L3 <abbr title="Least recently used - an eviction strategy suitable for data with temporal locality">LRU</abbr> more up to date: the L3 <abbr title="Least recently used - an eviction strategy suitable for data with temporal locality">LRU</abbr> wants to know which lines are hot, but most of the accesses are filtered through the L1 and L2, so the L3 doesn’t get much information – a non-silent eviction can provide some of the missing info (3) If the L3 serves as a victim cache, the L2 needs to write back the line for it to be stored in L3 at all. <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> L3 actually works this way, but despite being a very similar <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr>, <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> apparently doesn’t. However, one can imagine that on a miss to DRAM it may be advantageous to send the line directly to the L2, updating the L3 tags (snoop filter) only, without writing the data into L3. The data only gets written when the line is subsequently evicted from the owning L2. When lines are frequently modified, this cuts the number of writes to L3 in half. This behavior warrants further investigation. 
<a href="#fnref:silent" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:silent:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:trust" role="doc-endnote">
<p>You’ve already seen in Fig. 1 that there is little inter-sample variation, and this keeps the noise down. You can always check the raw data if you want the detailed view. <a href="#fnref:trust" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:wb" role="doc-endnote">
<p>This shows that the L2 is a write-back cache, not write-through: modified lines can remain in L2 until they are evicted, rather than immediately being written to the outer levels of the memory hierarchy. This type of design is key for high store throughput, since otherwise the long-term store throughput is limited to the bandwidth of the slowest write-through cache level. <a href="#fnref:wb" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:orl3" role="doc-endnote">
<p>I say the L2 because the behavior is already reflected in the L2 performance counters, but it could be teamwork between the L2 and other components, e.g., the L3 could say “OK, I’ve got that line you <abbr title="Request for ownership: when a request for a cache line originates from a store, or a type of prefetch that predicts the location is likely to be the target of a store, an RFO is performed which gets the line in an exclusive MESI state.">RFO</abbr>’d and BTW it is all zeros”. <a href="#fnref:orl3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rfo" role="doc-endnote">
<p>Although only stores appear in the source, at the hardware level this benchmark does at least as many reads as stores: every store must do a <em>read for ownership</em> (<abbr title="Request for ownership: when a request for a cache line originates from a store, or a type of prefetch that predicts the location is likely to be the target of a store, an RFO is performed which gets the line in an exclusive MESI state.">RFO</abbr>) to get the current value of the line before storing to it. <a href="#fnref:rfo" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sneaky" role="doc-endnote">
<p>Eagle-eyed readers, all two of them, might notice that the performance in the L3 region is different than the previous figure: here the performance slopes up gradually across most of the L3 range, while in the previous test it was very flat. Absolute performance is also somewhat lower. This is a testing artifact: reading the uncore performance counters necessarily involves a kernel call, taking over 1,000 cycles versus the < 100 cycles required for <code class="language-plaintext highlighter-rouge">rdpmc</code> to measure the CPU performance counters needed for the prior figure. Due to “flaws” (laziness) in the benchmark, this overhead is captured in the shown performance, and larger regions take longer, meaning that this fixed measurement overhead has a smaller relative impact, so you get this <code class="language-plaintext highlighter-rouge">measured = actual - overhead/size</code> type effect. It can be fixed, but I have to reboot my host into single-user mode to capture clean numbers, and I am feeling too lazy to do that right now, although as I look back at the size of the footnote I needed to explain it I am questioning my judgement. <a href="#fnref:sneaky" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:imcevent" role="doc-endnote">
<p>On <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> client CPUs we can do this with the <code class="language-plaintext highlighter-rouge">uncore_imc/data_writes/</code> events, which polls internal counters in the memory controller itself. This is a socket-wide event, so it is important to do this measurement on as quiet a machine as possible. <a href="#fnref:imcevent" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:params" role="doc-endnote">
<p>I tried a bunch of other stuff that I didn’t write up in detail. Many of them affect the behavior: we still see the optimization but with different levels of effectiveness. For example, with L2 prefetching off, only about 40% of the L2 evictions are eliminated (versus > 60% with prefetch on), and the performance difference between is close to zero despite the large number of eliminations. I tried other sizes of writes, and with narrow writes the effect is reduced until it is eliminated at 4-byte writes. I don’t think the write size <em>directly</em> affects the optimization, but rather narrower writes slow down the maximum possible performance which interacts in some way with the hardware mechanisms that support this to reduce how often it occurs (a similar observation could apply to prefetching). <a href="#fnref:params" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:obbus" role="doc-endnote">
<p>By <em>outbound queues</em> I mean the path between an inner and outer cache level. So for the L2, the outbound bus is the so-called <em>superqueue</em> that connects the L2 to the uncore and L3 cache. <a href="#fnref:obbus" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:armcompile" role="doc-endnote">
<p>The Graviton 2 uses the Cortex A76 <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr>, which can <em>execute</em> 2 stores per cycle, but the L1 cache write ports limits sustained execution to only one 128-bit store per cycle. <a href="#fnref:armcompile" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:g2ga" role="doc-endnote">
<p>It was the first full day of general availability for Graviton, so perhaps these hosts are very lightly used at the moment because it certainly felt like I had the whole thing to myself. <a href="#fnref:g2ga" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:escaping" role="doc-endnote">
<p>By <em>escaping</em> I mean that a store that visibly gets to the cache level where this optimization happens. For example, if I write a 1 immediately followed by a 0, the 1 will never make it out of the L1 cache, so from the point of view of the L2 and beyond only a zero was written. I expect the optimization to still trigger in this case. <a href="#fnref:escaping" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:calloc" role="doc-endnote">
<p>This phenomenon is why <code class="language-plaintext highlighter-rouge">calloc</code> is sometimes considerably faster than <code class="language-plaintext highlighter-rouge">malloc + memset</code>. With <code class="language-plaintext highlighter-rouge">calloc</code> the zeroing happens within the allocator, and the allocator can track whether the memory it is about to return is <em>known zero</em> (usually because the block is fresh from the OS, which always zeros memory before handing it out to userspace), and in the case of <code class="language-plaintext highlighter-rouge">calloc</code> it can avoid the zeroing entirely (so <code class="language-plaintext highlighter-rouge">calloc</code> runs as fast as <code class="language-plaintext highlighter-rouge">malloc</code> in that case). The client code calling <code class="language-plaintext highlighter-rouge">malloc</code> doesn’t receive this information and can’t make the same optimization. If you stretch the analogy almost to the breaking point, one can see what Intel is doing here as “similar, but in hardware”. <a href="#fnref:calloc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:nested" role="doc-endnote">
<p>Ha! To me, everything is “worth noting” if it means another footnote. <a href="#fnref:nested" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comProbing a previously undocumented zero-related optimization on Intel CPUs.Adding Staticman Comments2020-02-05T00:00:00+00:002020-02-05T00:00:00+00:00https://travisdowns.github.io/blog/2020/02/05/now-with-comments<p>I’ve added comments to my blog. You can find the existing comments, if any, and the new comment form <a href="#comment-section">at the bottom</a> of any post.</p>
<p>I thought this would take a couple hours, but it actually took <strong>[REDACTED]</strong>. Estimates are hard.</p>
<p>Here’s what I did.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
<li><a href="#set-up-github-bot-account" id="markdown-toc-set-up-github-bot-account">Set Up GitHub Bot Account</a> <ul>
<li><a href="#generate-personal-access-token" id="markdown-toc-generate-personal-access-token">Generate Personal Access Token</a></li>
</ul>
</li>
<li><a href="#set-up-the-blog-repository-configuration" id="markdown-toc-set-up-the-blog-repository-configuration">Set Up the Blog Repository Configuration</a> <ul>
<li><a href="#configuring-staticmanyml" id="markdown-toc-configuring-staticmanyml">Configuring staticman.yml</a></li>
<li><a href="#configuring-_configyml" id="markdown-toc-configuring-_configyml">Configuring _config.yml</a></li>
</ul>
</li>
<li><a href="#set-up-the-api-bridge" id="markdown-toc-set-up-the-api-bridge">Set Up the API Bridge</a> <ul>
<li><a href="#generate-an-rsa-keypair" id="markdown-toc-generate-an-rsa-keypair">Generate an RSA Keypair</a></li>
<li><a href="#sign-up-for-heroku" id="markdown-toc-sign-up-for-heroku">Sign Up for Heroku</a></li>
<li><a href="#deploy-staticman-bridge-to-heroku" id="markdown-toc-deploy-staticman-bridge-to-heroku">Deploy Staticman Bridge to Heroku</a></li>
<li><a href="#configure-bridge-secrets" id="markdown-toc-configure-bridge-secrets">Configure Bridge Secrets</a></li>
</ul>
</li>
<li><a href="#invite-and-accept-bot-to-blog-repo" id="markdown-toc-invite-and-accept-bot-to-blog-repo">Invite and Accept Bot to Blog Repo</a></li>
<li><a href="#enable-recaptcha" id="markdown-toc-enable-recaptcha">Enable reCAPTCHA</a> <ul>
<li><a href="#sign-up-for-recaptcha" id="markdown-toc-sign-up-for-recaptcha">Sign Up for reCAPTCHA</a></li>
<li><a href="#configure-recaptcha" id="markdown-toc-configure-recaptcha">Configure reCAPTCHA</a></li>
</ul>
</li>
<li><a href="#integrate-comments-into-site" id="markdown-toc-integrate-comments-into-site">Integrate Comments Into Site</a> <ul>
<li><a href="#markdown-part" id="markdown-toc-markdown-part">Markdown Part</a></li>
</ul>
</li>
<li><a href="#testing-on-this-post" id="markdown-toc-testing-on-this-post">Testing on This Post</a></li>
<li><a href="#thanks" id="markdown-toc-thanks">Thanks</a></li>
<li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>
<h2 id="introduction">Introduction</h2>
<p>I am using <a href="https://staticman.net/">staticman</a>, created by <a href="https://github.com/eduardoboucas">Eduardo Bouças</a>, as my comments system for this static site.</p>
<p>The basic flow for comment submission is as follows:</p>
<ol>
<li>A reader submits the comment form on a blog post.</li>
<li>Javascript<sup id="fnref:backup" role="doc-noteref"><a href="#fn:backup" class="footnote" rel="footnote">1</a></sup> attached to the form submits it to my <em>staticman API bridge<sup id="fnref:bridge" role="doc-noteref"><a href="#fn:bridge" class="footnote" rel="footnote">2</a></sup></em> running on Heroku.</li>
<li>The API bridge does some validation of the request and submits a <a href="https://github.com/travisdowns/travisdowns.github.io/issues">pull request</a> to the GitHub repo hosting my blog, consisting of a .yml file with the comment content and metadata.</li>
<li>When I accept the pull request, it triggers a regeneration and republishing of the content (this is a GitHub pages feature), so the reply appears almost immediately<sup id="fnref:cache" role="doc-noteref"><a href="#fn:cache" class="footnote" rel="footnote">3</a></sup>.</li>
</ol>
<p>Here are the detailed steps to get this working. There are several other tutorials out there, with varying degrees of exhaustiveness, some of which I found only after writing most of this, but I’m going to add to the pile anyways. There have been several changes to deploying staticman, which means that existing resources (and this one, of course) are marked by which “era” they were written in.</p>
<p>The major changes are:</p>
<ul>
<li>At one point the idea was that everyone would use the public staticman API bridge, but this proved unsustainable. A large amount of the work in setting up staticman is associated with running your own instance of the bridge.</li>
<li>There are three versions of the staticman API: v1, v2 and v3. This guide uses v2 (although v3 is almost identical<sup id="fnref:v3" role="doc-noteref"><a href="#fn:v3" class="footnote" rel="footnote">4</a></sup>), but the v1 version is considerably different.</li>
</ul>
<h2 id="set-up-github-bot-account">Set Up GitHub Bot Account</h2>
<p>You’ll want to create a GitHub <em>bot account</em> which will be the account that the API bridge uses to actually submit the pull requests to your blog repository. In principle, you can skip this step entirely and simply use your existing GitHub account, but I wouldn’t recommend it:</p>
<ul>
<li>You’ll be generating a <em>personal access token</em> for this account, and uploading it to the cloud (Heroku) and if this somehow gets compromised, it’s better that it’s a throwaway bot account than your real account.</li>
<li>Having a dedicated account makes it easy to segregate work done by the bot, versus what you’ve done yourself. That is, you probably don’t want all the commits and pushes the bot does to show up on your personal account.</li>
</ul>
<p>The <em>bot account</em> is nothing special: it is just a regular personal account that you’ll only be using from the API bridge. So, open a private browser window, go to <a href="https://github.com">GitHub</a> and choose “Sign Up”. Call your bot something specific, which I’ll refer to as <em>GITHUB-BOT-NAME</em> from here forwards.</p>
<h3 id="generate-personal-access-token">Generate Personal Access Token</h3>
<p>Next, you’ll need to generate a GitHub <em>personal access token</em>, for your bot account. The <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token">GitHub doc</a> does a better job of explaining this than I can. If you just want everything to work for sure now and in the future, select every single scope when it prompts you, but if you care about security you should only need the <em>repo</em> and <em>user</em> scopes (today):</p>
<p><strong>Repo scope:</strong></p>
<p><img src="/assets/now-with-comments/scopes-repo.png" alt="Repo scope" /></p>
<p><strong>User scope:</strong></p>
<p><img src="/assets/now-with-comments/scopes-user.png" alt="User scope" /></p>
<p>Copy and paste the displayed token somewhere safe: you’ll need this token in a later step where I’ll refer to it as <em>${github_token}</em>. Once you close this page there is no way to recover the access token.</p>
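<p>Before uploading the token anywhere, you can optionally sanity-check it against the GitHub API — the response should be a JSON blob whose <code class="language-plaintext highlighter-rouge">login</code> field is your bot account’s name:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># verify the token authenticates as GITHUB-BOT-NAME
curl -H "Authorization: token ${github_token}" https://api.github.com/user
</code></pre></div></div>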
<h2 id="set-up-the-blog-repository-configuration">Set Up the Blog Repository Configuration</h2>
<p>You’ll need to include configuration for staticman in two separate places in your blog repository: <code class="language-plaintext highlighter-rouge">_config.yml</code> (the primary Jekyll config file) and <code class="language-plaintext highlighter-rouge">staticman.yml</code>, both at the top level of the repository.</p>
<p>In general, the stuff that goes in <code class="language-plaintext highlighter-rouge">_config.yml</code> is for use within the static generation phase of your site, e.g., controlling the generation of the comment form and the associated JavaScript. The stuff in <code class="language-plaintext highlighter-rouge">staticman.yml</code> isn’t used during generation, but is used dynamically by the API bridge (read directly from GitHub on each request) to configure the activities of the bridge. A few things are duplicated in both places.</p>
<h3 id="configuring-staticmanyml">Configuring staticman.yml</h3>
<p>Most of the configuration for the API bridge is set in <code class="language-plaintext highlighter-rouge">staticman.yml</code> which lives in the top level of your <em>blog repository</em>. This means that one API bridge can support many different blog repositories, each with their own configuration (indeed, this feature was critical for the original design of a shared API bridge).</p>
<p><a href="https://github.com/eduardoboucas/staticman/blob/master/staticman.sample.yml">Here’s a sample file</a> from the staticman GitHub repository, but you might want to use <a href="https://github.com/travisdowns/travisdowns.github.io/blob/master/staticman.yml">this one</a> from my repository as it is a bit more fleshed out.</p>
<p>The main things you want to change are shown below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1"># all of these fields are nested under the comments key, which corresponds to the final element</span>
<span class="c1"># of the API bridge endpoint, i.e., you can have different configurations even within the same staticman.yml</span>
<span class="c1"># file all under different keys</span>
<span class="na">comments</span><span class="pi">:</span>
<span class="c1"># There are many more required config values here, not shown:</span>
<span class="c1"># use the file linked above as a template</span>
<span class="c1"># I guess used only for email notifications?</span>
<span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Performance</span><span class="nv"> </span><span class="s">Matters</span><span class="nv"> </span><span class="s">Blog"</span>
<span class="c1"># You may want a different set of "required fields". Staticman will</span>
<span class="c1"># reject posts without all of these fields</span>
<span class="na">requiredFields</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">name"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">email"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">message"</span><span class="pi">]</span>
<span class="c1"># you are going to want reCaptcha set up, but for now leave it disabled because we need the API</span>
<span class="c1"># bridge up and running in order to encrypt the secrets that go in this section</span>
<span class="na">reCaptcha</span><span class="pi">:</span>
<span class="na">enabled</span><span class="pi">:</span> <span class="no">false</span>
<span class="c1"># siteKey: 6LcWstQUAAAAALoGBcmKsgCFbMQqkiGiEt361nK1</span>
<span class="c1"># secret: a big encrypted secret (see Note above)</span>
</code></pre></div></div>
<h3 id="configuring-_configyml">Configuring _config.yml</h3>
<p>The remainder of the configuration goes in <code class="language-plaintext highlighter-rouge">_config.yml</code>. Here’s the configuration I added to start with (we’ll add a bit more later):</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The URL for the staticman API bridge endpoint</span>
<span class="c1"># You will want to modify some of the values:</span>
<span class="c1"># ${github-username}: the username of the account with which you publish your blog</span>
<span class="c1"># ${blog-repo}: the name of your blog repository in github</span>
<span class="c1"># master: this is the branch out of which your blog is published, often master or gh-pages</span>
<span class="c1"># ${bridge_app_name}: the name you chose in Heroku for your bridge API</span>
<span class="c1"># comments: the so-called property, this defines the key in staticman.yml where the configuration is found</span>
<span class="c1">#</span>
<span class="c1"># for me, this line reads:</span>
<span class="c1"># https://staticman-travisdownsio.herokuapp.com/v2/entry/travisdowns/travisdowns.github.io/master/comments</span>
<span class="na">staticman_url</span><span class="pi">:</span> <span class="s">https://${bridge_app_name}.herokuapp.com/v2/entry/${github-username}/${blog-repo}/master/comments</span>
</code></pre></div></div>
<h2 id="set-up-the-api-bridge">Set Up the API Bridge</h2>
<p>This section covers deploying a private instance of the API bridge to Heroku.</p>
<h3 id="generate-an-rsa-keypair">Generate an RSA Keypair</h3>
<p>This keypair will be used to encrypt secrets that will be stored in public places, such as your reCAPTCHA secret key. The secrets will be encrypted with the public half of the keypair, and decrypted by the API bridge server with the private half.</p>
<p>Use the following on your local machine to generate the pair:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh-keygen -m PEM -t rsa -b 4096 -C "staticman key" -f ~/.ssh/staticman_key
</code></pre></div></div>
<p>Don’t use any passphrase<sup id="fnref:pass" role="doc-noteref"><a href="#fn:pass" class="footnote" rel="footnote">5</a></sup>. You can change the <code class="language-plaintext highlighter-rouge">-f</code> argument if you want to save the key somewhere else, in which case you’ll have to use the new location when setting up the Heroku config below.</p>
<p>You can verify the key was generated by running:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>head -2 ~/.ssh/staticman_key
</code></pre></div></div>
<p>Which should output something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-----BEGIN RSA PRIVATE KEY-----
MIIJKAIBAAKCAgEAud7+fPWXzuxCoyyGbQTYCGi9C1N984roI/Tr7yJi074F+Cfp
</code></pre></div></div>
<p>Your second line will vary of course, but the first line must be <code class="language-plaintext highlighter-rouge">-----BEGIN RSA PRIVATE KEY-----</code>. If you see something else, perhaps mentioning <code class="language-plaintext highlighter-rouge">OPENSSH PRIVATE KEY</code>, it won’t work.</p>
<h3 id="sign-up-for-heroku">Sign Up for Heroku</h3>
<p>The original idea of staticman was to have a public API bridge that everyone uses for free. However, in practice this hasn’t proved sustainable: whatever free tier the shared instance is running on tends to hit its limits, and then the fun stops. The current recommendation is therefore to set up your own free instance of the API bridge on Heroku, so let’s do that.</p>
<p><a href="https://signup.heroku.com/">Sign up</a> for a free account on Heroku. No credit card is required and a free account should give you enough juice for at least 1,000 comments a month<sup id="fnref:juice" role="doc-noteref"><a href="#fn:juice" class="footnote" rel="footnote">6</a></sup>.</p>
<h3 id="deploy-staticman-bridge-to-heroku">Deploy Staticman Bridge to Heroku</h3>
<p>The easiest way to do this is simply to click the <em>Deploy to Heroku</em> button in the <a href="https://github.com/eduardoboucas/staticman">README on the staticman repo</a>:</p>
<p><img src="/assets/now-with-comments/deploy.png" alt="Deploy" width="50%" /></p>
<p>You’ll see probably some logging indicating that the project is downloading, building and then successfully deployed.</p>
<h3 id="configure-bridge-secrets">Configure Bridge Secrets</h3>
<p>The bridge needs a couple of secrets to do its job:</p>
<ul>
<li>The <em>GitHub personal access token</em> of your bot account. This lets it do work on behalf of your bot account (in particular, submit pull requests to your blog repository).</li>
<li>The private key of the keypair you generated earlier.</li>
</ul>
<p>If you want, you can add both of these through the Heroku web dashboard: go to Settings -> Reveal Config Vars, and enter them <a href="/assets/now-with-comments/config-vars.png">like this</a>.</p>
<p>However, you might as well get familiar with the Heroku command line: it’s pretty handy, it lets you complete this flow without your GitHub token passing through your clipboard, and it makes it easy to strip the newline characters from the private key.</p>
<p>Follow <a href="https://devcenter.heroku.com/articles/heroku-cli">the instructions</a> to install and login to the Heroku CLI, then issue the following commands from any directory (note that <code class="language-plaintext highlighter-rouge">${github_token}</code> is the <em>personal access token</em> you generated earlier: copy and paste it into the command):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heroku config:add <span class="nt">--app</span> <span class="k">${</span><span class="nv">bridge_app_name</span><span class="k">}</span> <span class="s2">"RSA_PRIVATE_KEY=</span><span class="si">$(</span><span class="nb">cat</span> ~/.ssh/staticman_key | <span class="nb">tr</span> <span class="nt">-d</span> <span class="s1">'\n'</span><span class="si">)</span><span class="s2">"</span>
heroku config:add <span class="nt">--app</span> <span class="k">${</span><span class="nv">bridge_app_name</span><span class="k">}</span> <span class="s2">"GITHUB_TOKEN=</span><span class="k">${</span><span class="nv">github_token</span><span class="k">}</span><span class="s2">"</span>
</code></pre></div></div>
<p>Here, the <code class="language-plaintext highlighter-rouge">tr -d '\n'</code> part of the pipeline removes the newlines from the private key, since Heroku config variables (or the API bridge, or both) can’t handle embedded newlines.</p>
<p>You can check that the config was correctly set by outputting it as follows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heroku config <span class="nt">--app</span> <span class="k">${</span><span class="nv">bridge_app_name</span><span class="k">}</span>
</code></pre></div></div>
<h2 id="invite-and-accept-bot-to-blog-repo">Invite and Accept Bot to Blog Repo</h2>
<p>Finally, you need to invite your GitHub <em>bot account</em> that you created earlier to your blog repository<sup id="fnref:whycollab" role="doc-noteref"><a href="#fn:whycollab" class="footnote" rel="footnote">7</a></sup> and accept the invite.</p>
<p>Open your blog repository, go to <em>Settings -> Collaborators</em> and search for and add the GitHub bot account that you created earlier as a collaborator:</p>
<p><img src="/assets/now-with-comments/add-collab.png" alt="Adding Collaborators" /></p>
<p>Next, accept<sup id="fnref:invite" role="doc-noteref"><a href="#fn:invite" class="footnote" rel="footnote">8</a></sup> the invitation using the bridge API, by going to the following URL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://${bridge_app_name}.herokuapp.com/v2/connect/${github-username}/${blog-repo}
</code></pre></div></div>
<p>You should see <code class="language-plaintext highlighter-rouge">OK!</code> as the output if it worked: this appears only <em>once</em>, when the invitation is accepted; at all other times it will show <code class="language-plaintext highlighter-rouge">Invitation not found</code>.</p>
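<p>If you prefer the command line, you can make the same invitation-accepting request with <code class="language-plaintext highlighter-rouge">curl</code>, substituting your own app and repository names for the placeholders:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># prints OK! on first success, Invitation not found on later calls
curl "https://${bridge_app_name}.herokuapp.com/v2/connect/${github-username}/${blog-repo}"
</code></pre></div></div>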
<h2 id="enable-recaptcha">Enable reCAPTCHA</h2>
<p>You are going to want to gate comment submission using reCAPTCHA or a similar system so you don’t get destroyed by spam (even if you have moderation enabled, dealing with all the pull requests will probably be tiring).</p>
<p>Here we’ll cover setting up reCAPTCHA, which has built-in support in staticman. Although it involves modifying the same <code class="language-plaintext highlighter-rouge">_config.yml</code> and <code class="language-plaintext highlighter-rouge">staticman.yml</code> files that we’ve modified before, this part of the configuration needs to occur after the bridge is running because we use the <code class="language-plaintext highlighter-rouge">/encrypt</code> endpoint on the bridge as part of the setup.</p>
<h3 id="sign-up-for-recaptcha">Sign Up for reCAPTCHA</h3>
<p>Go to <a href="https://developers.google.com/recaptcha">reCAPTCHA</a> and sign up if you haven’t already, and create a new site. We are going to use the “v2, Checkbox” variant (<a href="https://developers.google.com/recaptcha/docs/display">docs here</a>), although I’m interested to hear how it works out with other variants.</p>
<p>You will need the reCAPTCHA <em>site key</em> and <em>secret key</em> for configuration in the next section.</p>
<h3 id="configure-recaptcha">Configure reCAPTCHA</h3>
<p>Next, we need to add the <em>site key</em> and <em>secret key</em> to the <code class="language-plaintext highlighter-rouge">_config.yml</code> and <code class="language-plaintext highlighter-rouge">staticman.yml</code> config files.</p>
<p>The <em>site key</em> will be used as-is, but the <em>secret key</em> property will be <a href="https://staticman.net/docs/encryption"><em>encrypted</em></a> so that it is not exposed in plaintext in your configuration files. To encrypt the secret key, copy the secret from the reCAPTCHA admin console, and load the following URL from your API bridge, replacing <code class="language-plaintext highlighter-rouge">YOUR_SECRET_KEY</code> with the copied secret key.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://${bridge_app_name}.herokuapp.com/v2/encrypt/YOUR_SECRET_KEY
</code></pre></div></div>
<p>You should get a blob of characters back as a result (considerably longer than the original secret) – it is <em>this</em> value that you need to include as <code class="language-plaintext highlighter-rouge">reCaptcha.secret</code> in both <code class="language-plaintext highlighter-rouge">staticman.yml</code> and in <code class="language-plaintext highlighter-rouge">_config.yml</code>.</p>
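<p>For example, you can do the encryption from the command line and capture the result in one step (the app name is a placeholder; the secret is the <em>secret key</em> from your reCAPTCHA admin console):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># the response body is the encrypted blob to paste into both config files
encrypted=$(curl -s "https://${bridge_app_name}.herokuapp.com/v2/encrypt/YOUR_SECRET_KEY")
echo "$encrypted"
</code></pre></div></div>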
<p>The reCAPTCHA configuration for both files is almost the same. It looks like this for <code class="language-plaintext highlighter-rouge">staticman.yml</code>:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">comments</span><span class="pi">:</span>
<span class="c1"># more stuff</span>
<span class="c1"># note that reCaptcha is nested under comments</span>
<span class="na">reCaptcha</span><span class="pi">:</span>
<span class="na">enabled</span><span class="pi">:</span> <span class="no">true</span>
<span class="c1"># the siteKey is used as-is (no encryption)</span>
<span class="na">siteKey</span><span class="pi">:</span> <span class="s">6LcWstQUAAAAALoGBcmKsgCFbMQqkiGiEt361nK1</span>
<span class="c1"># the secret is the encrypted blob you got back from the encrypt call</span>
<span class="na">secret</span><span class="pi">:</span> <span class="s">a big encrypted secret (see description above)</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">_config.yml</code> version is similar, except that the key appears at the top level and there is no enabled property:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># reCaptcha configuration info: the exact same site key and *encrypted* secret that you used in staticman.yml</span>
<span class="c1"># I personally don't think the secret needs to be included in the generated site, but the staticman API bridge uses</span>
<span class="c1"># it to ensure the site configuration and bridge configuration match (but why not just compare the site key?)</span>
<span class="na">reCaptcha</span><span class="pi">:</span>
<span class="na">siteKey</span><span class="pi">:</span> <span class="s">6LcWstQUAAAAALoGBcmKsgCFbMQqkiGiEt361nK1</span>
<span class="na">secret</span><span class="pi">:</span> <span class="s">exactly the same secret as the staticman.yml file</span>
</code></pre></div></div>
<h2 id="integrate-comments-into-site">Integrate Comments Into Site</h2>
<p>Finally, you need to integrate code to display the existing comments and submit new comments.</p>
<p>I used a mash up of commenting code from the <a href="https://spinningnumbers.org">spinningnumbers.org</a> blog as well as the staticman integration in the <a href="https://mmistakes.github.io/minimal-mistakes/docs/configuration/#static-based-comments-via-staticman">minimal mistakes theme</a>. The advantage the former has over the latter is that the comments allow one level of nesting (replies to top-level comments are nested beneath it).</p>
<p>I planned to extract the associated markdown, liquid and JavaScript code to a separate repository as a single point where people could collaborate on this part of the integration, but man, I’ve already spent way too long on this. I may still do it, but for now here’s how I did the integration.</p>
<h3 id="markdown-part">Markdown Part</h3>
<p>The key thing you need to do is include a blob of HTML and associated JavaScript in any page where you want to display and accept comments. I do this as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% if page.comments == true %}
{% include comments.html %}
{% endif %}
</code></pre></div></div>
<p>You can paste it into any post, or better add it to the <code class="language-plaintext highlighter-rouge">footer.html</code> include or something like that (details depend on your theme). The invariant is that wherever this appears, the existing comments appear, followed by a form to submit new comments. You can see the <a href="https://github.com/travisdowns/travisdowns.github.io/blob/master/_includes/comments.html"><code class="language-plaintext highlighter-rouge">comments.html</code> include here</a> – in turn, it includes <code class="language-plaintext highlighter-rouge">comment.html</code> (once per comment, generates the comment html) and <code class="language-plaintext highlighter-rouge">comment_form.html</code> which generates the new comment form.</p>
<p>This ultimately includes <a href="https://github.com/travisdowns/travisdowns.github.io/blob/4a57e4f6d8cb4ef0ac8801a31740d1ca32dfa8ae/_includes/comments.html#L28">external JavaScript</a> for JQuery and reCAPTCHA, as well as <a href="https://github.com/travisdowns/travisdowns.github.io/blob/master/assets/main.js">main.js</a> which includes the JavaScript to implement the replies (moving the form when the “reply to” button is clicked, and submitting the form via AJAX to the API bridge).</p>
<p>You can try to use this same integration in your Jekyll blog. You’d need to:</p>
<ul>
<li>Copy the <code class="language-plaintext highlighter-rouge">_includes/comment.html</code>, <code class="language-plaintext highlighter-rouge">_includes/comments.html</code>, <code class="language-plaintext highlighter-rouge">_includes/comment_form.html</code>, <code class="language-plaintext highlighter-rouge">assets/main.js</code>, and <code class="language-plaintext highlighter-rouge">_sass/comment-styles.css</code> from my <a href="https://github.com/travisdowns/travisdowns.github.io">blog files</a> to your blog repository.</li>
<li>In <code class="language-plaintext highlighter-rouge">assets/main.js</code>, replace the link to <code class="language-plaintext highlighter-rouge">https://github.com/travisdowns/travisdowns.github.io/pulls</code> with a link to your own repository (or otherwise customize the “success” message as you see fit).</li>
<li>Include <code class="language-plaintext highlighter-rouge">@import "comment-styles";</code> in your <code class="language-plaintext highlighter-rouge">assets/main.scss</code> file. If you don’t have one, you’ll need to create it following the rules for your theme. Usually this just means a <code class="language-plaintext highlighter-rouge">main.scss</code> with empty front-matter and an <code class="language-plaintext highlighter-rouge">@import "your-theme";</code> line to import the theme SCSS. Alternately, you could avoid putting anything in <code class="language-plaintext highlighter-rouge">main.scss</code> and just include the comment styles as a separate file, but this adds another request to each post.</li>
<li>Do the <code class="language-plaintext highlighter-rouge">include comments.html</code> thing shown above in an appropriate place in your template/theme.</li>
<li>Set <code class="language-plaintext highlighter-rouge">comments: true</code> in the front matter of posts you want to have comments (or set it as a default in <code class="language-plaintext highlighter-rouge">_config.yml</code>).</li>
</ul>
<h2 id="testing-on-this-post">Testing on This Post</h2>
<p>If you want to leave a non-trivial comment on the content of this post, you can do so below. However, if you’d instead like to make a “just testing” post, to see how the HTTP request works, or check the created pull request, etc., please do it <a href="/misc/comments-test.html">over here</a>. Testing-only comments made on this post will be closed without being accepted.</p>
<h2 id="thanks">Thanks</h2>
<p>Thanks to Eduardo Boucas for creating staticman.</p>
<p>Thanks to <a href="https://spinningnumbers.org/">Willy McAllister</a> for the nested comment display work I unabashedly cribbed, for helping me sort out an RSA key generation problem, and for pointing out some inconsistencies in the doc.</p>
<h2 id="references">References</h2>
<p>Things that were handy references while getting this working.</p>
<p>This comment on <a href="https://github.com/eduardoboucas/staticman/issues/318#issuecomment-552755165">GitHub issue #318</a> was the list that I more or less followed (I didn’t use the dev branch, though).</p>
<p>Willy McAllister describes setting up staticman <a href="https://spinningnumbers.org/a/staticman.html">in this post</a> – his implementation of nested comments forms the basis of the one I used.</p>
<p>Another <a href="https://gist.github.com/jannispaul/3787603317fc9bbb96e99c51fe169731">list of steps</a> to get staticman working and some troubleshooting.</p>
<p>Michael Rose, the author of the Minimal Mistakes Jekyll theme, <a href="https://mademistakes.com/articles/improving-jekyll-static-comments/">describes setting up nested staticman comments</a> – I cribbed some things from there, such as the submitting spinner.</p>
<p>Willy McAllister subsequently wrote a <a href="https://spinningnumbers.org/a/staticman-heroku.html">great guide</a> to setting up Staticman, similar in nature to this one but with some additional sections such as <em>troubleshooting</em> and setting up reply notifications via MailGun. If that one was around when I started out, I wouldn’t have felt the need to write this one.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:backup" role="doc-endnote">
<p>If javascript is disabled, a regular POST action takes over. <a href="#fnref:backup" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bridge" role="doc-endnote">
<p>I don’t think you’ll find this <em>bridge</em> term in the official documentation, but I’m going to use it here. <a href="#fnref:bridge" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:cache" role="doc-endnote">
<p>Well, subject to whatever edge caching GitHub pages is using – btw you can bust the cache by appending any random query parameter to the page: <code class="language-plaintext highlighter-rouge">...post.html?foo=1234</code>. <a href="#fnref:cache" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:v3" role="doc-endnote">
<p>v3 mostly just extends to the URL format for the <code class="language-plaintext highlighter-rouge">/event</code> endpoint to include the hosting provider (either GitHub or GitLab), allowing the use of GitHub in addition to GitLab. Almost everything in this guide would remain unchanged. <a href="#fnref:v3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:pass" role="doc-endnote">
<p>You could use a passphrase, but then you’ll have to change the <code class="language-plaintext highlighter-rouge">cat</code> used below to echo the key into the Heroku config. If you want to be super safe, best is to generate the key to a transient location like ramfs and then simply delete the private portion after you’ve uploaded it to the Heroku config. <a href="#fnref:pass" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:juice" role="doc-endnote">
<p>In particular, the <em>unverified</em> (no credit card) free tier gives you 550 hours of uptime a month, and since the <em>dyno</em> (heroku speak for their on-demand host) sleeps after 30 minutes, I figure you can handle 550/0.5 = 1100 sparsely submitted comments. Of course, if comments come in bursts, you could handle much more than that, since you’ve already “paid” for the 30 minute uptime. <a href="#fnref:juice" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whycollab" role="doc-endnote">
<p>The bot needs to be a collaborator to, at a minimum, commit comments to the repository, and to delete branches (using the delete branches webhook which cleans up comment related branches). However, it is possible to not use either of these features if you have moderation enabled (in which case comments arrive as a PR, which doesn’t require any particular permissions), and aren’t using the webhook. So maybe you could do without the collaborator status in that case? I haven’t tested it. <a href="#fnref:whycollab" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:invite" role="doc-endnote">
<p>I guess you can also just accept the invitation by opening the email sent to you by GitHub and following the link there. This workflow involving the <code class="language-plaintext highlighter-rouge">v2/connect</code> endpoint probably made more sense when the API was meant to be shared among many users via a common GitHub bot account. <a href="#fnref:invite" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comAdding static comments to a static blog using staticman. Static.The Hunt for the Fastest Zero2020-01-20T00:00:00+00:002020-01-20T00:00:00+00:00https://travisdowns.github.io/blog/2020/01/20/zero<p>Let’s say I ask you to fill a <code class="language-plaintext highlighter-rouge">char</code> array of size <code class="language-plaintext highlighter-rouge">n</code> with zeros. I don’t know why, exactly, but please play along for now.</p>
<p>If this were C, we would probably reach for <code class="language-plaintext highlighter-rouge">memset</code>, but let’s pretend we are trying to write idiomatic C++ instead.</p>
<p>You might come up with something like<sup id="fnref:function" role="doc-noteref"><a href="#fn:function" class="footnote" rel="footnote">1</a></sup>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill1</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I’d give this solution full marks. In fact, I’d call it more or less the canonical modern C++ solution to the problem.</p>
<p>What if I told you there was a solution that was up to about 29 times faster? It doesn’t even require sacrificing any goats to the C++ gods, either: just adding three characters:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill2</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="sc">'\0'</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Yes, switching <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">'\0'</code> speeds this up by nearly a factor of <em>thirty</em> on my <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> box<sup id="fnref:qb" role="doc-noteref"><a href="#fn:qb" class="footnote" rel="footnote">2</a></sup>, at least with my default compiler (gcc) and optimization level (-O2):</p>
<div class="table-wrapper table-nowrap-header">
<table>
<thead>
<tr>
<th>Function</th>
<th style="text-align: right">Bytes / Cycle</th>
</tr>
</thead>
<tbody>
<tr>
<td>fill1</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td>fill2</td>
<td style="text-align: right">29.1</td>
</tr>
</tbody>
</table>
</div>
<p>The why becomes obvious if you look at the <a href="https://godbolt.org/z/O_f3Jx">assembly</a>:</p>
<p><strong>fill1:</strong></p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">fill1</span><span class="p">(</span><span class="nb">ch</span><span class="nv">ar</span><span class="o">*</span><span class="p">,</span> <span class="nv">unsigned</span> <span class="nv">long</span><span class="p">):</span>
<span class="nf">add</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">rdi</span>
<span class="nf">cmp</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">rdi</span>
<span class="nf">je</span> <span class="nv">.L1</span>
<span class="nl">.L3:</span>
<span class="nf">mov</span> <span class="kt">BYTE</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="mi">0</span> <span class="c1">; store 0 into memory at [rdi]</span>
<span class="nf">add</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">1</span> <span class="c1">; increment rdi</span>
<span class="nf">cmp</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">rdi</span> <span class="c1">; compare rdi to size</span>
<span class="nf">jne</span> <span class="nv">.L3</span> <span class="c1">; keep going if rdi < size</span>
<span class="nl">.L1:</span>
<span class="nf">ret</span>
</code></pre></div></div>
<p>This version is using a byte-by-byte copy loop, which I’ve annotated – it is more or less a 1:1 translation of how you’d imagine <code class="language-plaintext highlighter-rouge">std::fill</code> is written. The result of 1 cycle per byte is exactly what we’d expect using <a href="/blog/2019/06/11/speed-limits.html">speed limit analysis</a>: it is simultaneously limited by two different bottlenecks: 1 taken branch per cycle, and 1 store per cycle.</p>
<p>The <code class="language-plaintext highlighter-rouge">fill2</code> version doesn’t have a loop at all:</p>
<p><strong>fill2:</strong></p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">fill2</span><span class="p">(</span><span class="nb">ch</span><span class="nv">ar</span><span class="o">*</span><span class="p">,</span> <span class="nv">unsigned</span> <span class="nv">long</span><span class="p">):</span>
<span class="nf">test</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">jne</span> <span class="nv">.L8</span> <span class="c1">; skip the memcpy call if size == 0</span>
<span class="nf">ret</span>
<span class="nl">.L8:</span>
<span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span>
<span class="nf">jmp</span> <span class="nv">memset</span> <span class="c1">; tailcall to memset</span>
</code></pre></div></div>
<p>Rather, it simply defers immediately to <code class="language-plaintext highlighter-rouge">memset</code>. We aren’t going to dig into the assembly for <code class="language-plaintext highlighter-rouge">memset</code> here, but the fastest possible <code class="language-plaintext highlighter-rouge">memset</code> would run at 32 bytes/cycle, limited by 1 store/cycle and the maximum vector width of 32 bytes on my machine, so the measured value of 29 bytes/cycle indicates it’s using an implementation something along those lines.</p>
<p>So that’s the <em>why</em>, but what’s the <em>why of the why</em> (second order why)?</p>
<p>I thought this had something to do with the optimizer. After all, at <code class="language-plaintext highlighter-rouge">-O3</code> even the <code class="language-plaintext highlighter-rouge">fill1</code> version using the plain <code class="language-plaintext highlighter-rouge">0</code> constant calls <code class="language-plaintext highlighter-rouge">memset</code>.</p>
<p>I was wrong, however. The answer actually lies in the implementation of the C++ standard library (there are various, gcc is using libstdc++ in this case). Let’s take a look at the implementation of <code class="language-plaintext highlighter-rouge">std::fill</code> (I’ve reformatted the code for clarity and removed some compile-time concept checks):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/*
* ...
*
* This function fills a range with copies of the same value. For char
* types filling contiguous areas of memory, this becomes an inline call
* to @c memset or @c wmemset.
*/</span>
<span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_ForwardIterator</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">_Tp</span><span class="p">></span>
<span class="kr">inline</span> <span class="kt">void</span> <span class="nf">fill</span><span class="p">(</span><span class="n">_ForwardIterator</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_ForwardIterator</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__value</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">__fill_a</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">__niter_base</span><span class="p">(</span><span class="n">__first</span><span class="p">),</span> <span class="n">std</span><span class="o">::</span><span class="n">__niter_base</span><span class="p">(</span><span class="n">__last</span><span class="p">),</span> <span class="n">__value</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The included part of the comment<sup id="fnref:wmem" role="doc-noteref"><a href="#fn:wmem" class="footnote" rel="footnote">3</a></sup> already hints at what is to come: the implementor of <code class="language-plaintext highlighter-rouge">std::fill</code> has apparently considered specifically optimizing the call to a <code class="language-plaintext highlighter-rouge">memset</code> in some scenarios. So we keep following the trail, which brings us to the helper method <code class="language-plaintext highlighter-rouge">std::__fill_a</code>. There are two overloads that are relevant here, the general method and an overload which handles the special case:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_ForwardIterator</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">_Tp</span><span class="p">></span>
<span class="kr">inline</span> <span class="k">typename</span>
<span class="n">__gnu_cxx</span><span class="o">::</span><span class="n">__enable_if</span><span class="o"><!</span><span class="n">__is_scalar</span><span class="o"><</span><span class="n">_Tp</span><span class="o">>::</span><span class="n">__value</span><span class="p">,</span> <span class="kt">void</span><span class="o">>::</span><span class="n">__type</span>
<span class="nf">__fill_a</span><span class="p">(</span><span class="n">_ForwardIterator</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_ForwardIterator</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__value</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">__first</span> <span class="o">!=</span> <span class="n">__last</span><span class="p">;</span> <span class="o">++</span><span class="n">__first</span><span class="p">)</span>
<span class="o">*</span><span class="n">__first</span> <span class="o">=</span> <span class="n">__value</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Specialization: for char types we can use memset.</span>
<span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_Tp</span><span class="p">></span>
<span class="kr">inline</span> <span class="k">typename</span>
<span class="n">__gnu_cxx</span><span class="o">::</span><span class="n">__enable_if</span><span class="o"><</span><span class="n">__is_byte</span><span class="o"><</span><span class="n">_Tp</span><span class="o">>::</span><span class="n">__value</span><span class="p">,</span> <span class="kt">void</span><span class="o">>::</span><span class="n">__type</span>
<span class="nf">__fill_a</span><span class="p">(</span><span class="n">_Tp</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_Tp</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__c</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="n">_Tp</span> <span class="n">__tmp</span> <span class="o">=</span> <span class="n">__c</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="k">const</span> <span class="kt">size_t</span> <span class="n">__len</span> <span class="o">=</span> <span class="n">__last</span> <span class="o">-</span> <span class="n">__first</span><span class="p">)</span>
<span class="n">__builtin_memset</span><span class="p">(</span><span class="n">__first</span><span class="p">,</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">__tmp</span><span class="p">),</span> <span class="n">__len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now we see how the <code class="language-plaintext highlighter-rouge">memset</code> appears. It is called explicitly by the second implementation shown above, selected by <code class="language-plaintext highlighter-rouge">enable_if</code> when the SFINAE condition <code class="language-plaintext highlighter-rouge">__is_byte<_Tp></code> is true. Note, however, that unlike the general function, this variant has a single template argument: <code class="language-plaintext highlighter-rouge">template<typename _Tp></code>, and the function signature is:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__fill_a</span><span class="p">(</span><span class="n">_Tp</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_Tp</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__c</span><span class="p">)</span>
</code></pre></div></div>
<p>Hence, it will only be considered when the <code class="language-plaintext highlighter-rouge">__first</code> and <code class="language-plaintext highlighter-rouge">__last</code> pointers which delimit the range have the <em>exact same type as the value being filled</em>. When you write <code class="language-plaintext highlighter-rouge">std::fill(p, p + n, 0)</code> where <code class="language-plaintext highlighter-rouge">p</code> is <code class="language-plaintext highlighter-rouge">char *</code>, you rely on template type deduction for the parameters, which ends up deducing <code class="language-plaintext highlighter-rouge">char *</code> and <code class="language-plaintext highlighter-rouge">int</code> for the iterator type and value-to-fill type, <em>because <code class="language-plaintext highlighter-rouge">0</code> is an integer constant</em>.</p>
<p>That is, it is as if you had written:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="o"><</span><span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
<p>This prevents the clever <code class="language-plaintext highlighter-rouge">memset</code> optimization from taking place: the overload that does it is never called because the iterator value type is different than the value-to-fill type.</p>
<p>This suggests a fix: we can simply force the template argument types rather than rely on type deduction:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill3</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="o"><</span><span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This way, we <a href="https://godbolt.org/z/VTssh9">get the <code class="language-plaintext highlighter-rouge">memset</code> version</a>.</p>
<p>Finally, why does <code class="language-plaintext highlighter-rouge">fill2</code> using <code class="language-plaintext highlighter-rouge">'\0'</code> get the fast version, without forcing the template arguments? Well, <code class="language-plaintext highlighter-rouge">'\0'</code> is a <code class="language-plaintext highlighter-rouge">char</code> constant, so the value-to-assign type is <code class="language-plaintext highlighter-rouge">char</code>. You could achieve the same effect with a cast, e.g., <code class="language-plaintext highlighter-rouge">static_cast&lt;char&gt;(0)</code> – and for buffers which have types like <code class="language-plaintext highlighter-rouge">unsigned char</code> this is necessary because <code class="language-plaintext highlighter-rouge">'\0'</code> does not have the same type as <code class="language-plaintext highlighter-rouge">unsigned char</code> (at least <a href="https://godbolt.org/z/YQKp7V">on gcc</a>).</p>
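For an <code class="language-plaintext highlighter-rouge">unsigned char</code> buffer, the cast approach looks like this sketch (the function name <code class="language-plaintext highlighter-rouge">fill4</code> is mine, chosen to continue the numbering):

```cpp
#include <algorithm>
#include <cstddef>

// Sketch (fill4 is a hypothetical name): casting the fill value to
// the element type keeps the deduced value-to-fill type equal to the
// iterator's element type, so the memset specialization can apply.
void fill4(unsigned char *p, std::size_t n) {
    std::fill(p, p + n, static_cast<unsigned char>(0));
}
```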
<p>One might reasonably ask if this could be fixed in the standard library. I think so.</p>
<p>One idea would be to key off <em>only</em> the type of the <code class="language-plaintext highlighter-rouge">first</code> and <code class="language-plaintext highlighter-rouge">last</code> pointers, like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_Tp</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">_Tvalue</span><span class="p">></span>
<span class="kr">inline</span> <span class="k">typename</span>
<span class="n">__gnu_cxx</span><span class="o">::</span><span class="n">__enable_if</span><span class="o"><</span><span class="n">__is_byte</span><span class="o"><</span><span class="n">_Tp</span><span class="o">>::</span><span class="n">__value</span><span class="p">,</span> <span class="kt">void</span><span class="o">>::</span><span class="n">__type</span>
<span class="nf">__fill_a</span><span class="p">(</span><span class="n">_Tp</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_Tp</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tvalue</span><span class="o">&</span> <span class="n">__c</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="n">_Tvalue</span> <span class="n">__tmp</span> <span class="o">=</span> <span class="n">__c</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="k">const</span> <span class="kt">size_t</span> <span class="n">__len</span> <span class="o">=</span> <span class="n">__last</span> <span class="o">-</span> <span class="n">__first</span><span class="p">)</span>
<span class="n">__builtin_memset</span><span class="p">(</span><span class="n">__first</span><span class="p">,</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">__tmp</span><span class="p">),</span> <span class="n">__len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This says: who cares about the type of the value, it is going to get converted during assignment to the value type of the pointer anyways, so just look at the pointer type. E.g., if the type of the value-to-assign <code class="language-plaintext highlighter-rouge">_Tvalue</code> is <code class="language-plaintext highlighter-rouge">int</code>, but <code class="language-plaintext highlighter-rouge">_Tp</code> is <code class="language-plaintext highlighter-rouge">char</code> then this expands to this version, which is totally equivalent:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">__fill_a</span><span class="p">(</span><span class="kt">char</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="kt">int</span><span class="o">&</span> <span class="n">__c</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">__tmp</span> <span class="o">=</span> <span class="n">__c</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="k">const</span> <span class="kt">size_t</span> <span class="n">__len</span> <span class="o">=</span> <span class="n">__last</span> <span class="o">-</span> <span class="n">__first</span><span class="p">)</span>
<span class="n">__builtin_memset</span><span class="p">(</span><span class="n">__first</span><span class="p">,</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">__tmp</span><span class="p">),</span> <span class="n">__len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This works … for simple types like <code class="language-plaintext highlighter-rouge">int</code>. It fails, however, when the value to fill has a tricky, non-primitive type, like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct conv_counting_int {
int v_;
mutable size_t count_ = 0;
operator char() const {
count_++;
return (char)v_;
}
};
size_t fill5(char *p, size_t n) {
conv_counting_int zero{0};
std::fill(p, p + n, zero);
return zero.count_;
}
</code></pre></div></div>
<p>Here, the pointer type passed to <code class="language-plaintext highlighter-rouge">std::fill</code> is <code class="language-plaintext highlighter-rouge">char</code>, but you cannot safely apply the <code class="language-plaintext highlighter-rouge">memset</code> optimization above, since the <code class="language-plaintext highlighter-rouge">conv_counting_int</code> counts the number of times it is converted to <code class="language-plaintext highlighter-rouge">char</code>, and this value will be wrong (in particular, it will be <code class="language-plaintext highlighter-rouge">1</code>, not <code class="language-plaintext highlighter-rouge">n</code>) if you perform the above optimization.</p>
<p>This can be fixed. You could limit the optimization to the case where the pointer type is char-like <em>and</em> the value-to-assign type is “simple” in the sense that it won’t notice how many times it has been converted. A sufficient check would be that the type is scalar, i.e. <code class="language-plaintext highlighter-rouge">std::is_scalar<T></code> – although there is probably a less conservative check possible. So something like this for the SFINAE check:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_Tpointer</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">_Tp</span><span class="p">></span>
<span class="kr">inline</span> <span class="k">typename</span>
<span class="n">__gnu_cxx</span><span class="o">::</span><span class="n">__enable_if</span><span class="o"><</span><span class="n">__is_byte</span><span class="o"><</span><span class="n">_Tpointer</span><span class="o">>::</span><span class="n">__value</span> <span class="o">&&</span> <span class="n">__is_scalar</span><span class="o"><</span><span class="n">_Tp</span><span class="o">>::</span><span class="n">__value</span><span class="p">,</span> <span class="kt">void</span><span class="o">>::</span><span class="n">__type</span>
<span class="nf">__fill_a</span><span class="p">(</span> <span class="n">_Tpointer</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_Tpointer</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__value</span><span class="p">)</span> <span class="p">{</span>
<span class="p">...</span>
</code></pre></div></div>
<p>Here’s <a href="https://godbolt.org/z/PXRWSB">an example</a> of how that would work. It’s not fully fleshed out but shows the idea.</p>
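The same gating idea can also be sketched in user code, without touching libstdc++. This is a hypothetical helper (the name <code class="language-plaintext highlighter-rouge">fill_zeroish</code> is mine, and it requires C++17 for <code class="language-plaintext highlighter-rouge">if constexpr</code>):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical user-land helper mirroring the proposed library check:
// use memset only when the value type is scalar (so extra conversions
// are unobservable), otherwise fall back to the generic std::fill.
template <typename T>
void fill_zeroish(char *first, char *last, const T &value) {
    if constexpr (std::is_scalar<T>::value) {
        if (const std::size_t len = last - first)
            std::memset(first, static_cast<unsigned char>(value), len);
    } else {
        std::fill(first, last, value);
    }
}
```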
<p>Finally, one might ask why <code class="language-plaintext highlighter-rouge">memset</code> <em>is</em> used when gcc is run at <code class="language-plaintext highlighter-rouge">-O3</code> or when clang is used (<a href="https://godbolt.org/z/9nhWAh">like this</a>). The answer is the optimizer. Even if the compile-time semantics of the language select what appears to be a byte-by-byte copy loop, the compiler itself can transform that into <code class="language-plaintext highlighter-rouge">memset</code>, or something else like a vectorized loop, if it can prove it is <code class="language-plaintext highlighter-rouge">as-if</code> equivalent. That recognition happens at <code class="language-plaintext highlighter-rouge">-O3</code> for <code class="language-plaintext highlighter-rouge">gcc</code> but at <code class="language-plaintext highlighter-rouge">-O2</code> for clang.</p>
<h3 id="what-does-it-mean">What Does It Mean</h3>
<p>So what does it all mean? Is there a moral to this story?</p>
<p>Some use this as evidence that somehow C++ and/or the STL are irreparably broken. I don’t agree. Some other languages, even “fast” ones, will <em>never</em> give you the <code class="language-plaintext highlighter-rouge">memset</code> speed; many will, but those that do (e.g., <code class="language-plaintext highlighter-rouge">java.util.Arrays.fill()</code>) often do it via special recognition or handling of the function by the compiler or runtime. In the C++ standard library, the optimization the library writers have done is available to anyone, which is a big advantage. That the optimization fails, perhaps unexpectedly, in some cases is unfortunate, but it’s nice that you can fix it yourself.</p>
<p>Also, C++ gets <em>two</em> shots at this one: many other languages rely on the compiler to optimize these patterns, and this also occurs in C++. It’s just a bit of a quirk of gcc that optimization doesn’t help here: it doesn’t vectorize at -O2, nor does it do <em>idiom recognition</em>. Both of those result in much faster code: we’ve seen the effect of idiom recognition already; it results in a <code class="language-plaintext highlighter-rouge">memset</code>. Even if idiom recognition wasn’t enabled or didn’t work, vectorization would help a lot: here’s <a href="https://godbolt.org/z/53c6W5561">gcc at -O3</a>, but with idiom recognition disabled. It uses 32-byte stores (<code class="language-plaintext highlighter-rouge">vmovdqu YMMWORD PTR [rax], ymm0</code>), which will be close to <code class="language-plaintext highlighter-rouge">memset</code> speed (but a bit of unrolling would have helped). In many other languages it would only be up to the compiler: there wouldn’t be a chance to get <code class="language-plaintext highlighter-rouge">memset</code> even with no optimization as there is in C++.</p>
<p>Do we throw out modern C++ idioms, at least where performance matters, for example by replacing <code class="language-plaintext highlighter-rouge">std::fill</code> with <code class="language-plaintext highlighter-rouge">memset</code>? I don’t think so. It is far from clear where <code class="language-plaintext highlighter-rouge">memset</code> can even be used safely in C++. Unlike say <code class="language-plaintext highlighter-rouge">memcpy</code> and <em>trivially copyable</em>, there is no type trait for “memset is equivalent to zeroing”. It’s probably OK for byte-like types, and is widely used for other primitive types (which we can be sure are trivial, but can’t always be sure of the representation), but even that may not be safe. Once you introduced even simple structures or classes, the footguns multiply. I recommend <code class="language-plaintext highlighter-rouge">std::fill</code> and more generally sticking to modern idioms, except in very rare cases where profiling has identified a hotspot, and even then you should take the safest approach that still provides the performance you need (e.g., by passing <code class="language-plaintext highlighter-rouge">(char)0</code> in this case).</p>
<h3 id="source">Source</h3>
<p>The source for my benchmark is <a href="https://github.com/travisdowns/fill-bench">available on GitHub</a>.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to <a href="https://twitter.com/mattgodbolt">Matt Godbolt</a> for creating <a href="https://godbolt.org/">Compiler Explorer</a>, without which this type of investigation would be much more painful – to the point where it often wouldn’t happen at all.</p>
<p>Thanks also to Matt Godbolt, tc, Nathan Kurz and Pocak for finding typos.</p>
<h3 id="discuss">Discuss</h3>
<p>I am <em>still</em> working on my comments system (no, I don’t want Disqus), but in the meantime you can discuss this post on <a href="https://news.ycombinator.com/item?id=22104576">Hacker News</a>, <a href="https://www.reddit.com/r/cpp/comments/erialk/the_hunt_for_the_fastest_zero/">Reddit</a> or <a href="https://lobste.rs/s/bylri4/hunt_for_fastest_zero">lobste.rs</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:function" role="doc-endnote">
<p>Of course, you wouldn’t wrap the <code class="language-plaintext highlighter-rouge">std::fill</code> function in another <code class="language-plaintext highlighter-rouge">fill</code> function that just forwards directly to the standard function: you’d just call <code class="language-plaintext highlighter-rouge">std::fill</code> directly. We use a function here so you can see the parameter types and we can examine the disassembly easily. <a href="#fnref:function" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:qb" role="doc-endnote">
<p>On <a href="http://quick-bench.com/yGy2Mzlr2ZZhWVxoH7HscmbEC94">quickbench</a>, the difference varies slightly from run to run but is usually around 31 to 32 times. <a href="#fnref:qb" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:wmem" role="doc-endnote">
<p>Interestingly, the comment mentions <code class="language-plaintext highlighter-rouge">wmemset</code> in addition to <code class="language-plaintext highlighter-rouge">memset</code> which would presumably be applied for values of type <code class="language-plaintext highlighter-rouge">wchar_t</code> (32-bits on this platform), but I don’t find any evidence that is actually the case via experiment or by examining the code – the optimization appears to only be currently implemented for byte-like values and <code class="language-plaintext highlighter-rouge">memset</code>. <a href="#fnref:wmem" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comUnexpected performance deviations depending on how you spell zero.Gathering Intel on Intel AVX-512 Transitions2020-01-17T00:00:00+00:002020-01-17T00:00:00+00:00https://travisdowns.github.io/blog/2020/01/17/avxfreq1<h2 id="introduction">Introduction</h2>
<p>This is a post about AVX and AVX-512 related frequency scaling<sup id="fnref:first" role="doc-noteref"><a href="#fn:first" class="footnote" rel="footnote">1</a></sup>.</p>
<p>Now, something more than nothing has been written about this already, including <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">cautionary tales</a> of performance loss and some <a href="https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/">broad guidelines</a><sup id="fnref:dmore" role="doc-noteref"><a href="#fn:dmore" class="footnote" rel="footnote">2</a></sup>, so do we really need to add to the pile?</p>
<p>Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, <em>quantitative</em> advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.</p>
<p>Now I wasn’t really planning on writing about this just now, but I got off on a (nested) tangent<sup id="fnref:intro" role="doc-noteref"><a href="#fn:intro" class="footnote" rel="footnote">3</a></sup>, so let’s examine the AVX-512 downclocking behavior using targeted tests. At a minimum, this is necessary background for the next post, but I hope it is also interesting on its own.</p>
<p class="info"><strong>Note:</strong> For the really short version, you can <a href="#summary">skip to the summary</a>, but then what will do you for the rest of the day?</p>
<h3 id="table-of-contents">Table of Contents</h3>
<p>You could perhaps try skipping ahead to a section that interests you using this obligatory table of contents, but sections are not self-contained, so you’ll be better off reading the whole thing linearly.</p>
<ul id="markdown-toc">
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a> <ul>
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
</ul>
</li>
<li><a href="#the-source" id="markdown-toc-the-source">The Source</a></li>
<li><a href="#test-structure" id="markdown-toc-test-structure">Test Structure</a> <ul>
<li><a href="#hardware" id="markdown-toc-hardware">Hardware</a></li>
</ul>
</li>
<li><a href="#tests" id="markdown-toc-tests">Tests</a> <ul>
<li><a href="#256-bit-integer-simd-avx" id="markdown-toc-256-bit-integer-simd-avx">256-bit Integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> (AVX)</a></li>
<li><a href="#512-bit-integer-simd-avx-512" id="markdown-toc-512-bit-integer-simd-avx-512">512-bit Integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> (AVX-512)</a> <ul>
<li><a href="#enter-ipc" id="markdown-toc-enter-ipc">Enter <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr></a></li>
<li><a href="#voltage-only-transitions" id="markdown-toc-voltage-only-transitions">Voltage Only Transitions</a></li>
<li><a href="#attenuation" id="markdown-toc-attenuation">Attenuation</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#what-was-left-out" id="markdown-toc-what-was-left-out">What Was Left Out</a></li>
<li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
<li><a href="#thanks" id="markdown-toc-thanks">Thanks</a></li>
<li><a href="#discuss" id="markdown-toc-discuss">Discuss</a></li>
</ul>
<h2 id="the-source">The Source</h2>
<p>All of the code underlying this post is available in the <a href="https://github.com/travisdowns/freq-bench/tree/post1">post1 branch of freq-bench</a>, so you can follow along at home, check my work, and check out the behavior on your own hardware. It requires Linux and the <a href="https://github.com/travisdowns/freq-bench/blob/post1/README.md">README</a> gives basic clues on getting started.</p>
<p>The source includes the <a href="https://github.com/travisdowns/freq-bench/blob/post1/scripts/data.sh">data generation scripts</a> as well as those to <a href="https://github.com/travisdowns/freq-bench/blob/post1/scripts/plots.sh">generate the plots</a>. Neither shell scripting nor Python are my forte, so be gentle.</p>
<h2 id="test-structure">Test Structure</h2>
<p>We want to investigate what happens when instruction stream related performance transitions occur. The most famous example is what happens when you execute an AVX-512 instruction<sup id="fnref:widthmatters" role="doc-noteref"><a href="#fn:widthmatters" class="footnote" rel="footnote">4</a></sup> for the first time in a while, but as we will see there are other cases.</p>
<p>The basic idea is that the test has a <em>duty period</em> and every time this period elapses, we run a test-specific payload for the duration of the <em>payload period</em> which consists of one or more “interesting” instructions (which depend on the test). During the entire test we sample various metrics at a best-effort fixed frequency. This repeats for the entire test period. The sample period will generally be much smaller than the duty period<sup id="fnref:speriod" role="doc-noteref"><a href="#fn:speriod" class="footnote" rel="footnote">5</a></sup>: in our tests we use a 5,000 μs duty period and a sample period of 1 μs, mostly.</p>
<p>Visually, it is something like this (showing a single duty period: one benchmark is composed of multiple duty cycles back to back):</p>
<p><img src="/assets/avxfreq1/test-structure.png" alt="Test Structure" /></p>
<p>This diagram shows the payload period as occupying a non-negligible amount of time. However, in the first few tests, the payload period is essentially zero: we run the payload function (which consists of only a couple of instructions) only once, so it is really a payload <em>moment</em> rather than period.</p>
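<p>As a rough sketch, the timing structure described above can be simulated in a few lines of Python. All names here are mine and the simulation is purely illustrative; the real benchmark is C++ in the freq-bench repository and measures hardware counters rather than counting iterations:</p>

```python
# Simplified model of the benchmark's timing structure (illustrative only).
DUTY_PERIOD_US = 5_000   # payload runs at the start of each duty period
SAMPLE_PERIOD_US = 1     # metrics sampled (best effort) every microsecond
TOTAL_US = 31_000        # total test duration

def run_test(payload_period_us=0):
    """Count payload executions and samples over one simulated test."""
    payloads = samples = 0
    for t in range(TOTAL_US):
        # A zero-length payload period degenerates to a single payload
        # "moment" at each duty-period boundary.
        if t % DUTY_PERIOD_US < max(payload_period_us, 1):
            payloads += 1
        if t % SAMPLE_PERIOD_US == 0:
            samples += 1
    return payloads, samples

# Payload executes at t = 0, 5000, ..., 30000: 7 times in a 31,000 us test.
print(run_test())  # (7, 31000)
```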
<h3 id="hardware">Hardware</h3>
<p>We are running these tests on a <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> architecture W-series CPU: a W-2104<sup id="fnref:f2104" role="doc-noteref"><a href="#fn:f2104" class="footnote" rel="footnote">6</a></sup> with the following <a href="https://stackoverflow.com/a/56861355/149138">license-based</a> frequencies<sup id="fnref:avxt" role="doc-noteref"><a href="#fn:avxt" class="footnote" rel="footnote">7</a></sup>:</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>License</th>
<th>Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-AVX Turbo</td>
<td>L0</td>
<td>3.2 GHz</td>
</tr>
<tr>
<td>AVX Turbo</td>
<td>L1</td>
<td>2.8 GHz</td>
</tr>
<tr>
<td>AVX-512 Turbo</td>
<td>L2</td>
<td>2.4 GHz</td>
</tr>
</tbody>
</table>
<p>For one (voltage) test I also use my Skylake (mobile) <a href="https://ark.intel.com/content/www/us/en/ark/products/88967/intel-core-i7-6700hq-processor-6m-cache-up-to-3-50-ghz.html">i7-6700HQ</a>, running at either its nominal frequency of 2.6 GHz, or the turbo frequency of 3.5 GHz.</p>
<h2 id="tests">Tests</h2>
<p>The basic approach this post will take is examining the CPU behavior using the test framework above, primarily varying what the payload is, and what metrics we look at. Let’s get the ball rolling with 256-bit instructions.</p>
<h3 id="256-bit-integer-simd-avx">256-bit Integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> (AVX)</h3>
<p>For the first test we will use as payload the <code class="language-plaintext highlighter-rouge">vporymm_vz</code> function, which is just a single 256-bit <code class="language-plaintext highlighter-rouge">vpor</code> instruction, followed by a <code class="language-plaintext highlighter-rouge">vzeroupper</code><sup id="fnref:vz" role="doc-noteref"><a href="#fn:vz" class="footnote" rel="footnote">8</a></sup>:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">vporymm_vz:</span>
<span class="nf">vpor</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span>
<span class="nf">vzeroupper</span>
<span class="nf">ret</span>
</code></pre></div></div>
<p>We call the payload function only once at the start of each duty period<sup id="fnref:pperiod" role="doc-noteref"><a href="#fn:pperiod" class="footnote" rel="footnote">9</a></sup>. The duty period is set to 5000 μs and the sample period to 1 μs, and the total test time is set to 31,000 μs (so the payload will execute 7 times).</p>
<p>Here’s the result (plot notes<sup id="fnref:plotnotes" role="doc-noteref"><a href="#fn:plotnotes" class="footnote" rel="footnote">10</a></sup>), with time along the x axis<sup id="fnref:falk" role="doc-noteref"><a href="#fn:falk" class="footnote" rel="footnote">11</a></sup>, showing the measured frequency at each sample (there are three separate test runs shown<sup id="fnref:threerun" role="doc-noteref"><a href="#fn:threerun" class="footnote" rel="footnote">12</a></sup>):</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-vporvz256-740w.png" srcset="
/assets/avxfreq1/fig-vporvz256-740w.png 740w,
/assets/avxfreq1/fig-vporvz256-1480w.png 1480w,
/assets/avxfreq1/fig-vporvz256-2220w.png 2220w,
/assets/avxfreq1/fig-vporvz256-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="256-bit vpor transitions" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-vporvz256.svg">SVG version of this plot</a>
</div>
<p>Well that’s really boring. The entire test runs consistently at 3.2 GHz, the nominal (L0 license) frequency, if we ignore a few uninteresting outliers<sup id="fnref:outlie" role="doc-noteref"><a href="#fn:outlie" class="footnote" rel="footnote">13</a></sup>.</p>
<h3 id="512-bit-integer-simd-avx-512">512-bit Integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> (AVX-512)</h3>
<p>Before the crowd gets too rowdy, let’s quickly move on to the next test, which is identical except that it uses 512-bit <code class="language-plaintext highlighter-rouge">zmm</code> registers:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">vporzmm_vz:</span>
<span class="nf">vpor</span> <span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span>
<span class="nf">vzeroupper</span>
<span class="nf">ret</span>
</code></pre></div></div>
<p>Here is the result:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-vporvz512-740w.png" srcset="
/assets/avxfreq1/fig-vporvz512-740w.png 740w,
/assets/avxfreq1/fig-vporvz512-1480w.png 1480w,
/assets/avxfreq1/fig-vporvz512-2220w.png 2220w,
/assets/avxfreq1/fig-vporvz512-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="512-bit vpor transitions" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-vporvz512.svg">SVG version of this plot</a>
</div>
<p>We’ve got something to sink our teeth into!</p>
<p>Remember that the duty cycle is 5000 μs, so at each x-axis tick we execute the payload. Now the behavior is clear: every time the payload instruction executes (at multiples of 5000 μs), the frequency drops from the 3.2 GHz L0 license down to 2.8 GHz L1 license frequency. So far this is all pretty much as expected.</p>
<p>Let’s zoom in on one of the transition points at 15,000 μs:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-vpor-zoomed-740w.png" srcset="
/assets/avxfreq1/fig-vpor-zoomed-740w.png 740w,
/assets/avxfreq1/fig-vpor-zoomed-1480w.png 1480w,
/assets/avxfreq1/fig-vpor-zoomed-2220w.png 2220w,
/assets/avxfreq1/fig-vpor-zoomed-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="512-bit vpor transitions (zoomed)" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-vpor-zoomed.svg">SVG version of this plot</a>
</div>
<p>We can make the following observations:</p>
<ol>
<li>There is a transition period (the rightmost of the two shaded regions, in orange<sup id="fnref:peachpuff" role="doc-noteref"><a href="#fn:peachpuff" class="footnote" rel="footnote">14</a></sup>) of ~11 μs<sup id="fnref:resnote" role="doc-noteref"><a href="#fn:resnote" class="footnote" rel="footnote">15</a></sup> where the CPU is halted: no samples occur during this period<sup id="fnref:halted" role="doc-noteref"><a href="#fn:halted" class="footnote" rel="footnote">16</a></sup>. For fun, I’ll call this a <em>frequency transition</em>.</li>
<li>The leftmost shaded region, shown in purple<sup id="fnref:thistle" role="doc-noteref"><a href="#fn:thistle" class="footnote" rel="footnote">17</a></sup>, immediately following the payload execution at 15,000 μs and prior to the halted region, is ~9 μs long and the frequency remains unchanged. This is not just a test issue or measurement error: this period occurs after the payload and is consistently reproducible<sup id="fnref:confirm" role="doc-noteref"><a href="#fn:confirm" class="footnote" rel="footnote">18</a></sup>. Although it looks like nothing interesting is going on in this region, we’ll soon see it is indeed special and will call this region a <em>voltage-only</em> transition.</li>
<li>Although not fully shown in the zoomed plot, the lower 2.8 GHz frequency period lasts for ~650 μs.</li>
<li>Not shown in the zoomed plot (but seen as a second downwards spike on the full plot, after the ~650 μs period of low frequency), there is another fully halted period of ~11 μs, after which the CPU returns to its maximum speed of 3.2 GHz (L0 license).</li>
<li>These attributes are mostly consistent across the three runs (so much that the series, in green, mostly overlaps and obscures the others) – but there are a few outliers where the return to 3.2 GHz takes somewhat longer. This is consistent across runs: recovery is never <em>faster</em> than ~650 μs, but sometimes longer. I believe it occurs when an interrupt during the L1 region “resets the timer”.</li>
</ol>
<h4 id="enter-ipc">Enter <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr></h4>
<p>Although it is not visible in this plot, there is something special about the behavior of 512-bit instructions in the first shaded (purple) region – that is, in the 9 microseconds between the execution of the payload instruction and the subsequent halted period: <em>they execute much slower than usual</em>.</p>
<p>This is easiest to see if we extend the payload period: instead of executing the payload function once every 5000 μs and then looping on <code class="language-plaintext highlighter-rouge">rdtsc</code> waiting for the next sample, we will continue to execute the payload function for 100 μs after a new duty period starts (that is, the <em>payload period</em> is set to 100 μs). During this time we still take samples as usual, every 1 μs – but in between samples we are executing the payload instruction(s)<sup id="fnref:whynot" role="doc-noteref"><a href="#fn:whynot" class="footnote" rel="footnote">19</a></sup>. So one duty period now looks like 100 μs of payload followed by 4900 μs of normal payload-free hot spinning.</p>
<p>We lengthen the payload period in order to examine the performance of the payload instructions. There are several metrics we could look at, but a simple one is to look at <em>instructions per cycle</em>. As long as we make sure the large majority of the executed instructions are payload instructions, the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> will largely reflect the execution of the payload<sup id="fnref:ideal" role="doc-noteref"><a href="#fn:ideal" class="footnote" rel="footnote">20</a></sup>.</p>
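<p>Concretely, the per-sample IPC can be derived from cumulative instruction and cycle counter readings, one pair per sample. The function below is a sketch with made-up counter values for illustration:</p>

```python
# Derive IPC for each sample interval from cumulative hardware counter
# readings (instructions retired, core cycles). Values are illustrative.
def ipc_per_sample(instructions, cycles):
    """instructions/cycles: cumulative counter readings, one per sample."""
    out = []
    for i in range(1, len(instructions)):
        d_instr = instructions[i] - instructions[i - 1]
        d_cyc = cycles[i] - cycles[i - 1]
        out.append(d_instr / d_cyc)
    return out

# Roughly 3200 cycles elapse per 1 us sample at 3.2 GHz; if the payload
# dominates the interval, its IPC dominates the measurement.
instr = [0, 3200, 6400, 7200]
cyc = [0, 3200, 6400, 9600]
print(ipc_per_sample(instr, cyc))  # [1.0, 1.0, 0.25]
```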
<p>As payload, we will use a function composed simply of 1,000 dependent 512-bit <code class="language-plaintext highlighter-rouge">vpord</code> instructions:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">vpord</span> <span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span>
<span class="nf">vpord</span> <span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span>
<span class="c1">; ... 997 more like this</span>
<span class="nf">vpord</span> <span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span>
<span class="nf">vzeroupper</span>
<span class="nf">ret</span>
</code></pre></div></div>
<p>We <a href="https://uops.info/table.html?search=vpord%20(zmm%2C%20zmm%2C%20zmm)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKX=on&cb_measurements=on&cb_iaca30=on&cb_avx512=on">know</a> these <code class="language-plaintext highlighter-rouge">vpord</code> instructions have a latency of 1 cycle and here they are serially dependent so we expect this function to take 1,000 cycles, give or take<sup id="fnref:give" role="doc-noteref"><a href="#fn:give" class="footnote" rel="footnote">21</a></sup>, for an <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> of 1.0.</p>
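<p>The expected IPC of 1.0 follows directly from the dependency chain: for a serial chain, throughput is bounded by latency, so total cycles are roughly chain length times per-instruction latency. A trivial sanity check (helper name is mine):</p>

```python
# For a serially dependent chain, total cycles ~= n_instructions * latency,
# so IPC = n_instructions / cycles = 1 / latency.
def chain_ipc(n_instructions, latency_cycles):
    cycles = n_instructions * latency_cycles
    return n_instructions / cycles

print(chain_ipc(1000, 1))  # 1.0: full-speed IPC for 1-cycle latency vpord
print(chain_ipc(1000, 4))  # 0.25: the same chain if latency were 4 cycles
```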
<p>Here’s what the same zoomed transition point looks like for this test, with <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> plotted on the secondary axis:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-ipc-zoomed-zmm-740w.png" srcset="
/assets/avxfreq1/fig-ipc-zoomed-zmm-740w.png 740w,
/assets/avxfreq1/fig-ipc-zoomed-zmm-1480w.png 1480w,
/assets/avxfreq1/fig-ipc-zoomed-zmm-2220w.png 2220w,
/assets/avxfreq1/fig-ipc-zoomed-zmm-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="512-bit vpor transitions (with IPC)" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-ipc-zoomed-zmm.svg">SVG version of this plot</a>
</div>
<p>First, note that in the unshaded regions on the left (before 15,000 μs) and right (after 15,100 μs), the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> is basically irrelevant: no payload instructions are being executed, so the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> there is just whatever the measurement code happens to net out to. We only care about the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> in the shaded regions, where the payload is executing.</p>
<p>Let’s tackle the regions from right to left, which happens to go from most obvious to least obvious.</p>
<p>We have the blue region, running from ~15020 μs to 15100 μs (where the extra payload period ends). Here the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> is right at 1 instruction per cycle. So the payload is executing right at the expected rate, i.e., <em>full speed</em>. Keeners may point out that at the very beginning of the blue period, the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> (and the measured frequency) is a bit noisier and slightly above 1. This is not a CPU effect, but rather a measurement one: during this phase the benchmark is <em>catching up</em> on samples missed during the previous halted period, which changes the payload to overhead ratio and bumps up the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> (details<sup id="fnref:catchup" role="doc-noteref"><a href="#fn:catchup" class="footnote" rel="footnote">22</a></sup>).</p>
<p>The middle, orange, region shows us what we’ve already seen: the CPU is halted, so no samples occur. <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> doesn’t tell us much here.</p>
<h4 id="voltage-only-transitions">Voltage Only Transitions</h4>
<p>The most interesting part is the first shaded region (purple): after the payload starts running but before the halt, which I call a <em>voltage only</em> transition for reasons that will soon become clear.</p>
<p>Here, we see that the payload executes <em>much</em> more slowly, with an <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> of ~0.25. So in this region, the <code class="language-plaintext highlighter-rouge">vpord</code> instructions are apparently executing at <em>four times</em> their normal latency. I also observe an identical 4x slowdown for <code class="language-plaintext highlighter-rouge">vpord</code> throughput, using an identical <a href="https://github.com/travisdowns/freq-bench/blob/434c7cf5db73e2d48061e78525c7bbf7eb7757a3/basic-impls.cpp#L64">test</a> except with independent <code class="language-plaintext highlighter-rouge">vpord</code> instructions<sup id="fnref:tput" role="doc-noteref"><a href="#fn:tput" class="footnote" rel="footnote">23</a></sup>.</p>
<p>Perhaps surprisingly, this same slowdown occurs for 256-bit <code class="language-plaintext highlighter-rouge">ymm</code> instructions as well. This contradicts the conventional wisdom that on AVX-512 chips there is no penalty to using light 256-bit instructions:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-ipc-zoomed-ymm-740w.png" srcset="
/assets/avxfreq1/fig-ipc-zoomed-ymm-740w.png 740w,
/assets/avxfreq1/fig-ipc-zoomed-ymm-1480w.png 1480w,
/assets/avxfreq1/fig-ipc-zoomed-ymm-2220w.png 2220w,
/assets/avxfreq1/fig-ipc-zoomed-ymm-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="256-bit vpor transitions (with IPC)" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-ipc-zoomed-ymm.svg">SVG version of this plot</a>
</div>
<p>The results shown above are for a test identical to the 512-bit version except that it uses 256-bit <code class="language-plaintext highlighter-rouge">vpor ymm0, ymm0, ymm0</code> as the payload. It shows the same slowdown for ~9 μs after the payload starts executing, but no subsequent halt and no frequency transition. That is, it shows a voltage-only transition (the lack of a frequency transition is expected because we don’t expect a turbo license change for light 256-bit instructions).</p>
<p><a name="xmmeffect"></a>By now, you are probably wondering about 128-bit <code class="language-plaintext highlighter-rouge">xmm</code> registers. The good news is that these show no effect at all:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-ipc-zoomed-xmm-740w.png" srcset="
/assets/avxfreq1/fig-ipc-zoomed-xmm-740w.png 740w,
/assets/avxfreq1/fig-ipc-zoomed-xmm-1480w.png 1480w,
/assets/avxfreq1/fig-ipc-zoomed-xmm-2220w.png 2220w,
/assets/avxfreq1/fig-ipc-zoomed-xmm-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="128-bit vpor transitions (with IPC)" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-ipc-zoomed-xmm.svg">SVG version of this plot</a>
</div>
<p>Here, the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> jumps immediately to the expected value. So it appears that the CPU runs in a state where the 128-bit lanes are ready to go at all times<sup id="fnref:orisit" role="doc-noteref"><a href="#fn:orisit" class="footnote" rel="footnote">24</a></sup>.</p>
<p>The conventional wisdom regarding this “warmup” period is that the upper part<sup id="fnref:upper" role="doc-noteref"><a href="#fn:upper" class="footnote" rel="footnote">25</a></sup> of the vector units is shut down when not in use, and takes time to power up. The story goes that during this power-up period the CPU does not need to halt but it runs <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions at a reduced throughput <em>by splitting up the input</em> into 128-bit chunks and passing the data two or more times through the powered-on 128-bit lanes<sup id="fnref:amd" role="doc-noteref"><a href="#fn:amd" class="footnote" rel="footnote">26</a></sup>.</p>
<p>However, there are some observations that seem to contradict this hypothesis (in rough order from least to most convincing):</p>
<ol>
<li>The observed impact to latency and throughput is ~4x, whereas I would expect 2x for simple instructions such as <code class="language-plaintext highlighter-rouge">vpor</code>.</li>
<li>The timing is the same for 256-bit and 512-bit instructions, despite the fact that 512-bit instructions represent at least 2x the work, i.e., need to be passed through the 128-bit unit at least 4 times.</li>
<li>Some instructions are more difficult to implement using this type of splitting, e.g., instructions where both high and low output lanes depend on all of the input lanes<sup id="fnref:ewise" role="doc-noteref"><a href="#fn:ewise" class="footnote" rel="footnote">27</a></sup> (see how slow they are on Zen). I expected that maybe these instructions would be slower when running in split mode, but I tested <code class="language-plaintext highlighter-rouge">vpermd</code> and found that it runs at 4L4T<sup id="fnref:lt" role="doc-noteref"><a href="#fn:lt" class="footnote" rel="footnote">28</a></sup>, compared to 3L1T normally. So <code class="language-plaintext highlighter-rouge">vpermd</code> (including the 512-bit version) didn’t slow down more than <code class="language-plaintext highlighter-rouge">vpor</code>, and in fact in a relative sense it slowed down <em>less</em> (e.g., the latency only changed from 3 to 4). The fact that the latency and throughput reacted differently for this instruction seems odd, and that it now has the exact same 4L4T timing as <code class="language-plaintext highlighter-rouge">vpor</code> seems like a strange coincidence.</li>
<li>Oddly, when I tried to time the slowdown more precisely, I kept coming up with a fractional value around 4.2x, not 4.0x, kind of contradicting the idea that the instruction is simply operating in a different mode, which should still have an integral latency.</li>
<li>As it turns out, <em>all ALU<sup id="fnref:alu" role="doc-noteref"><a href="#fn:alu" class="footnote" rel="footnote">29</a></sup> instructions</em> are slower in this mode, not just wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> ones.</li>
</ol>
<p>It was 5 that sealed the deal on this not being a slowdown related to split execution. I believe what is actually happening is that the CPU is doing very fine-grained throttling when wider instructions are executing in the core. That is, the upper lanes <em>are</em> being used in this mode (they are either not gated at all, or are gated but enabling them is very quick, less than 1 μs) but execution frequency is reduced by 4x because CPU power delivery is not yet in a state that can handle full-speed execution of these wider instructions. While the CPU waits (e.g., for voltage to rise, fattening the guardband) for higher power execution to be allowed, this fine-grained throttling occurs.</p>
<p>This throttling affects non-<abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions too, causing them to execute at 4x their normal latency and inverse throughput. We can show this with the following test, which combines a single <code class="language-plaintext highlighter-rouge">vpor ymm0, ymm0, ymm0</code> with N chained <code class="language-plaintext highlighter-rouge">add eax, 0</code> instructions, shown here for N = 3:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">vpor</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span>
<span class="nf">add</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0x0</span>
<span class="nf">add</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0x0</span>
<span class="nf">add</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0x0</span>
<span class="c1">; repeated 9 more times</span>
</code></pre></div></div>
<p>If only <code class="language-plaintext highlighter-rouge">vpor</code> is slowed down, each block of 4 instructions will take 4 cycles, limited by the <code class="language-plaintext highlighter-rouge">vpor</code> chain (the <code class="language-plaintext highlighter-rouge">add</code> chain is 3 cycles long). However, I actually measure ~12 cycles, indicating that we are instead limited by the <code class="language-plaintext highlighter-rouge">add</code> chain, each of which takes 4 cycles for a total of 12.</p>
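<p>This reasoning can be captured in a tiny model comparing the two hypotheses (the latencies are the nominal values from the text; the helper name is mine):</p>

```python
# Cycles for one vpor + N-add block: the vpor dependency chain and the add
# dependency chain run in parallel, so the block is limited by the slower
# of the two chains.
def block_cycles(n_adds, add_latency, vpor_latency):
    return max(vpor_latency, n_adds * add_latency)

# Hypothesis A: only vpor is slowed to 4 cycles; the 3-add chain (1 cycle
# each) hides behind the vpor chain.
print(block_cycles(3, add_latency=1, vpor_latency=4))   # 4
# Hypothesis B: all ALU instructions run at 4x latency; the add chain
# dominates, matching the ~12 cycles actually measured.
print(block_cycles(3, add_latency=4, vpor_latency=4))   # 12
```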
<p><a name="throttle-anchor"></a>We can vary the number of <code class="language-plaintext highlighter-rouge">add</code> instructions (N) to see how long this effect persists. This table is the result:</p>
<table>
<thead>
<tr>
<th style="text-align: right">ADD instructions (N)</th>
<th style="text-align: right">Cycles/ADD</th>
<th style="text-align: right">Delta Cycles (slow)</th>
<th style="text-align: right">Delta Cycles (fast)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">2.3</td>
<td style="text-align: right">-0.2</td>
</tr>
<tr>
<td style="text-align: right">3</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">0.8</td>
</tr>
<tr>
<td style="text-align: right">4</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">1.1</td>
</tr>
<tr>
<td style="text-align: right">5</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">3.9</td>
<td style="text-align: right">1.1</td>
</tr>
<tr>
<td style="text-align: right">6</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.3</td>
<td style="text-align: right">0.7</td>
</tr>
<tr>
<td style="text-align: right">7</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">3.4</td>
<td style="text-align: right">1.1</td>
</tr>
<tr>
<td style="text-align: right">8</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">4.2</td>
<td style="text-align: right">0.9</td>
</tr>
<tr>
<td style="text-align: right">9</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">3.9</td>
<td style="text-align: right">0.8</td>
</tr>
<tr>
<td style="text-align: right">10</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.3</td>
<td style="text-align: right">1.1</td>
</tr>
<tr>
<td style="text-align: right">20</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">30</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">40</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.3</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">50</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">3.9</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">60</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.4</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">70</td>
<td style="text-align: right">4.2</td>
<td style="text-align: right">4.5</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">80</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">3.4</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">90</td>
<td style="text-align: right">3.6</td>
<td style="text-align: right">-0.2</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">100</td>
<td style="text-align: right">3.3</td>
<td style="text-align: right">1.1</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">120</td>
<td style="text-align: right">2.9</td>
<td style="text-align: right">0.9</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">140</td>
<td style="text-align: right">2.7</td>
<td style="text-align: right">1.1</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">160</td>
<td style="text-align: right">2.5</td>
<td style="text-align: right">1.2</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">180</td>
<td style="text-align: right">2.3</td>
<td style="text-align: right">0.7</td>
<td style="text-align: right">0.9</td>
</tr>
<tr>
<td style="text-align: right">200</td>
<td style="text-align: right">2.2</td>
<td style="text-align: right">0.8</td>
<td style="text-align: right">1.0</td>
</tr>
</tbody>
</table>
<p>The <strong>Cycles/ADD</strong> column shows the number of cycles taken per add instruction over the entire slow region (roughly the first 8-10 μs after the payload starts executing). The <strong>Delta Cycles (slow)</strong> column shows how many cycles each additional <code class="language-plaintext highlighter-rouge">add</code> instruction took compared to the previous row: i.e., for row N = 30, it determines how much longer the 10 additional <code class="language-plaintext highlighter-rouge">add</code> instructions took compared to row N = 20. The <strong>Delta Cycles (fast)</strong> column is the same thing, but applies to the samples after ~10 μs when the CPU is back up to full speed (that column shows the expected 1.0 cycles per additional add).</p>
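<p>To make the column definitions concrete, here’s how a delta row can be computed from two measured rows. This is an illustrative helper (the name <code class="language-plaintext highlighter-rouge">delta_cycles</code> is mine, not part of the benchmark):</p>

```python
def delta_cycles(n_small, cyc_per_add_small, n_large, cyc_per_add_large):
    """Marginal cycles per additional add between two payload sizes.

    Takes the N and Cycles/ADD values from two rows of the table and
    returns the cost of each add that the larger payload adds on top
    of the smaller one.
    """
    total_small = n_small * cyc_per_add_small  # total cycles, smaller payload
    total_large = n_large * cyc_per_add_large  # total cycles, larger payload
    return (total_large - total_small) / (n_large - n_small)

# Rows N = 20 and N = 30 both run at ~4.0 cycles/add, so each of the 10
# extra adds also costs ~4 cycles (we are still in the slow region):
print(delta_cycles(20, 4.0, 30, 4.0))  # → 4.0
```

<p>(The deltas in the table were computed from the raw cycle totals rather than the rounded Cycles/ADD column, so reproducing them this way can be off by a few tenths.)</p>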
<p>Here we clearly see that up to roughly 70 <code class="language-plaintext highlighter-rouge">add</code> instructions, interleaved with a single <code class="language-plaintext highlighter-rouge">vpor</code>, all the <code class="language-plaintext highlighter-rouge">add</code> instructions are taking 4 cycles, i.e., the CPU is throttled. Somewhere between 80 and 90 a transition happens: <em>additional</em> <code class="language-plaintext highlighter-rouge">add</code> instructions now take 1 cycle, but the overall time per <code class="language-plaintext highlighter-rouge">add</code> is (initially) close to 4. This shows that when <code class="language-plaintext highlighter-rouge">add</code> (and presumably any non-wide instruction) is far enough away from the closest wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instruction, they start executing at full speed. So the timings for the larger N values can be understood as a blend of a slow section of ~70-80 <code class="language-plaintext highlighter-rouge">add</code> instructions near the <code class="language-plaintext highlighter-rouge">vpor</code> which run at 1 per 4 cycles, and the remaining section where they run at full speed: 1 per cycle.</p>
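<p>This blend model is easy to check numerically. Here’s a tiny sketch; the ~75-add slow window is a number I’m fitting by eye, since the measurements only pin the transition somewhere between 70 and 90:</p>

```python
def blended_cycles_per_add(n, slow_window=75, slow_cost=4.0, fast_cost=1.0):
    """Predicted average cycles/ADD when n adds separate the vpor
    instructions: the slow_window adds nearest the vpor run at
    slow_cost cycles each, and the remainder run at fast_cost."""
    slow = min(n, slow_window)          # adds stuck in the slow region
    fast = max(0, n - slow_window)      # adds running at full speed
    return (slow * slow_cost + fast * fast_cost) / n

# n <= slow_window: everything is slow, 4.0 cycles/add.
# n = 100 -> 3.25 (table shows 3.3); n = 200 -> 2.125 (table shows 2.2).
```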
<p>We can probably conclude the CPU is not just throttling frequency or “duty cycling”: in that case every instruction would be slowed down by the same factor, but instead the rule is more like “latency extended to the next multiple of 4 cycles”, e.g., a latency 3 instruction like <code class="language-plaintext highlighter-rouge">imul eax, eax, 0</code> ends up taking 4 cycles when the CPU is throttling. It is likely that the throttling happens at some part of the pipeline before execution, e.g., at issue or dispatch.</p>
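<p>As a toy model, the “next multiple of 4 cycles” rule is just a ceiling operation. To be clear, this is my inference from the measurements, not documented CPU behavior:</p>

```python
def throttled_latency(latency, quantum=4):
    """Toy model of the observed throttling: instruction latency is
    extended to the next multiple of `quantum` cycles. An inference
    from the measurements, not a documented rule."""
    return -(-latency // quantum) * quantum  # ceiling division, rescaled

print(throttled_latency(1))  # add, normally 1 cycle   → 4
print(throttled_latency(3))  # imul, normally 3 cycles → 4
```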
<p>The transition to fast mode when the <code class="language-plaintext highlighter-rouge">vpor</code> instructions are spread sufficiently apart probably reflects the size of some structure such as the <abbr title="Queue that collects incoming instructions from the decoder, uop cache or microcode engine and delivers them to the renamer (RAT).">IDQ</abbr> (64 entries in Skylake) or scheduler (97 entries claimed<sup id="fnref:ratentries" role="doc-noteref"><a href="#fn:ratentries" class="footnote" rel="footnote">30</a></sup>). The core could track whether <em>any</em> wide instruction is currently in that structure, and enforce the slow mode if so. When the <code class="language-plaintext highlighter-rouge">vpor</code> instructions are close enough together, there is <em>always</em> at least one present, but once they are spaced out enough, you get periods of fast mode.</p>
<p><strong>Voltage Effects</strong></p>
<p>We can actually test the theory that this transition is associated with waiting for a change in power delivery configuration. Specifically, we can observe the CPU core voltage, using bits 47:32 of the <code class="language-plaintext highlighter-rouge">MSR_PERF_STATUS</code> MSR. Volume 4 of the Intel Software Developer’s Manual lets us in on a secret: these bits expose the <em>core voltage<sup id="fnref:vcc" role="doc-noteref"><a href="#fn:vcc" class="footnote" rel="footnote">31</a></sup></em>:</p>
<p><img src="/assets/avxfreq1/msr_198h.png" alt="Intel SDM Volume 4: Table 2-20" /></p>
<p>Let’s zoom as usual on a transition point, in this case using a 256-bit (ymm) payload of 1000 dependent <code class="language-plaintext highlighter-rouge">vpor</code> instructions. This 256-bit payload means no frequency transition, only a dispatch throttling period associated with running 256-bit instructions for the first time in a while. We plot the time it takes to run an iteration of the payload<sup id="fnref:whyp" role="doc-noteref"><a href="#fn:whyp" class="footnote" rel="footnote">32</a></sup>, along with the measured voltage:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-volts256-1-740w.png" srcset="
/assets/avxfreq1/fig-volts256-1-740w.png 740w,
/assets/avxfreq1/fig-volts256-1-1480w.png 1480w,
/assets/avxfreq1/fig-volts256-1-2220w.png 2220w,
/assets/avxfreq1/fig-volts256-1-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="Voltage Changes" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-volts256-1.svg">SVG version of this plot</a>
</div>
<p>The length of the throttling period is around 10 μs as usual, as shown by the period where the payload takes ~4,000 cycles (the usual 4x throttling), and the voltage is unchanged from the pre-transition period (at about 0.951 V) during the throttling period. At the moment the throttling stops, the voltage jumps to about 0.957 V, a change of about 6 mV. This happens at 2.6 GHz, the nominal non-turbo speed of my i7-6700HQ. At 3.5 GHz, the transition is from 1.167 V to 1.182 V, so both the absolute voltages and the difference (about 15 mV) are larger, consistent with the basic idea that higher frequencies need higher voltages.</p>
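<p>For reference, the voltage readout itself is simple to reproduce on Linux (needs root and the <code class="language-plaintext highlighter-rouge">msr</code> kernel module): bits 47:32 of <code class="language-plaintext highlighter-rouge">MSR_PERF_STATUS</code> hold the voltage in units of 2<sup>-13</sup> V. A minimal sketch:</p>

```python
import struct

MSR_PERF_STATUS = 0x198  # a.k.a. IA32_PERF_STATUS

def decode_core_voltage(msr_value):
    """Bits 47:32 of MSR_PERF_STATUS hold the core voltage in units
    of 2**-13 V (per SDM Volume 4, Table 2-20)."""
    return ((msr_value >> 32) & 0xFFFF) / 2**13

def read_core_voltage(cpu=0):
    """Read the current core voltage (Linux, needs root + msr module)."""
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(MSR_PERF_STATUS)
        (raw,) = struct.unpack("<Q", f.read(8))  # MSRs are 64-bit
    return decode_core_voltage(raw)

# Example: a raw field value of 7791 decodes to 7791 / 8192 ≈ 0.951 V
```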
<p>So one theory is that this type of transition represents the period between when the CPU has requested a higher voltage (because wider 256-bit instructions imply a larger worst-case current delta event, hence a worst-case voltage drop) and when the higher voltage is delivered. While the core waits for the change to take effect, throttling is in effect in order to reduce the worst-case drop: without throttling<sup id="fnref:cthrottle" role="doc-noteref"><a href="#fn:cthrottle" class="footnote" rel="footnote">33</a></sup> there is no guarantee that a burst of wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions won’t drop the voltage below the minimum voltage required for safe operation at this frequency.</p>
<h4 id="attenuation">Attenuation</h4>
<p>We might check if there is any <em>attenuation</em> of either type of transition. By <em>attenuation</em> I mean that if a core is transitioning between frequencies too frequently, the power management algorithm may decide to simply keep the core at the lower frequency, which can provide more overall performance when considering the halted periods needed in each transition. This is exactly what happens for active core count<sup id="fnref:acc" role="doc-noteref"><a href="#fn:acc" class="footnote" rel="footnote">34</a></sup> transitions: too many transitions in a short period of time and the CPU will just decide to run at the lower frequency rather than incurring the halts needed to transition between e.g. the 1-core and 2-core turbos<sup id="fnref:hiddenbo" role="doc-noteref"><a href="#fn:hiddenbo" class="footnote" rel="footnote">35</a></sup>.</p>
<p>We check this by setting a duty period which is just above the observed recovery time from 2.8 to 3.2 GHz, to see if we still see transitions. Here’s a duty cycle of 760 μs, about 10 μs more than the observed recovery period for this test<sup id="fnref:recovery" role="doc-noteref"><a href="#fn:recovery" class="footnote" rel="footnote">36</a></sup>:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-vporvz512-ipc-p760-740w.png" srcset="
/assets/avxfreq1/fig-vporvz512-ipc-p760-740w.png 740w,
/assets/avxfreq1/fig-vporvz512-ipc-p760-1480w.png 1480w,
/assets/avxfreq1/fig-vporvz512-ipc-p760-2220w.png 2220w,
/assets/avxfreq1/fig-vporvz512-ipc-p760-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="760 μs period closeup" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-vporvz512-ipc-p760.svg">SVG version of this plot</a>
</div>
<p>I’m not going to color the regions here, as by now I think you are probably (over?) familiar with them. The key points are:</p>
<ul>
<li>The payload starts executing at 7600 μs, which is <em>before</em> the upwards frequency transition, we are still executing at 2.8 GHz - so initially the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> is high, 1 per cycle.</li>
<li>Despite the fact that we are already executing 512-bit instructions again, the frequency adjusts upwards a few μs later. Most likely what happened is that the power logic already evaluated earlier (say at ~7558 μs, just before the payload started) that an upwards transition should occur, but as we’ve seen the response is generally delayed by 8 to 10 μs, so it occurs after the payload has already started executing.</li>
<li>Of course, as soon as the transition occurs, the core is no longer in a suitable state for full-speed wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> execution, so <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> drops to ~0.25.</li>
<li>Another transition back to low frequency occurs ~10 μs later and then full speed execution can resume.</li>
</ul>
<p>So there is no attenuation, but attenuation isn’t really needed: the long (~650 μs) cooldown period between the last wide instruction and subsequent frequency boost means that the damage from halt periods is fairly limited: this is unlike the active core count scenario where the CPU has no control over the transition frequency (rather it is driven by interrupts and changes in runnable processes and threads). Here, we have the worst case scenario of transitions packed as closely as possible, but we lose only ~20 μs (for 2 transitions) out of 760 μs, less than a 3% impact. The impact of running at the lower frequency is much higher: 2.8 vs 3.2 GHz: a 12.5% impact in the case that the lowered frequency was not useful (i.e., because the wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> payload represents a vanishingly small part of the total work).</p>
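<p>The arithmetic behind those two percentages, spelled out (helper names are mine):</p>

```python
def halt_overhead(halt_us, period_us):
    """Fraction of a duty period lost to halted transition time."""
    return halt_us / period_us

def frequency_penalty(low_ghz, high_ghz):
    """Relative slowdown from simply staying at the lower frequency."""
    return 1 - low_ghz / high_ghz

# Two ~10 μs halts per 760 μs duty period, vs. staying at 2.8 GHz:
print(f"{halt_overhead(20, 760):.1%}")       # → 2.6%
print(f"{frequency_penalty(2.8, 3.2):.1%}")  # → 12.5%
```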
<h2 id="what-was-left-out">What Was Left Out</h2>
<p>There’s lots we’ve left out. We haven’t even touched:</p>
<ul>
<li>Checking whether xmm registers also cause a voltage-only transition, if they haven’t been used for a while. We didn’t find <a href="#xmmeffect">any effect</a>, but it is also certain that some 128-bit instructions appear in the measurement loop, which would hide the effect.</li>
<li>Checking whether the voltage-only transitions implied by 256-bit instructions are disjoint from those for 512-bit. That is, if you execute a 256-bit instruction after a while without any, you get a voltage-only transition (confirmed above). If you then execute a 512-bit instruction before the relaxation period expires, do you get a second throttling period prior to the frequency transition? I believe so, but I haven’t checked it.</li>
<li>Any type of investigation of “heavy” 256-bit or 512-bit instructions. These require a license one level (numerically) higher than light instructions, and knowing if any of the key timings change would be interesting<sup id="fnref:heavy" role="doc-noteref"><a href="#fn:heavy" class="footnote" rel="footnote">37</a></sup>.</li>
<li>Almost no investigation was made into how any of these timings (and the magnitude of voltage changes) vary with frequency. For example, if we are already running at a lower frequency, frequency transitions are presumably not needed, and voltage-only transitions may be shorter.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>For the benefit of anyone who just skipped to the bottom, or whose eyes glazed over at some point, here’s a summary of the key findings:</p>
<ul>
<li>After a period of about 680 μs of not using the <em>AVX upper bits</em> (255:128) or <em>AVX-512 upper bits</em> (511:256), the processor enters a mode where using those bits again requires at least a voltage transition, and sometimes a frequency transition.</li>
<li>The processor continues executing instructions during a voltage transition, but at a greatly reduced speed: 1/4th the usual instruction dispatch rate. However, this throttling is fine-grained: it only applies when wide instructions are <em>in flight</em> (<a href="#throttle-anchor">details</a>).</li>
<li>Voltage transitions end when the voltage reaches the desired level; the duration depends on the magnitude of the transition, but 8 to 20 μs is common on the hardware I tested.</li>
<li>In some cases a frequency transition is also required, e.g., because the involved instruction requires a higher power license. These transitions seem to <em>first</em> incur a throttling period similar to a voltage-only transition, and then a halted period of 8 to 10 μs while the frequency changes.</li>
<li>A key motivator for this post was to give concrete, qualitative guidance on how to write code that is as fast as possible given this behavior. It got bumped to part 2.</li>
</ul>
<p>We also summarize the key timings in this beautifully rendered table:</p>
<table>
<thead>
<tr>
<th>What</th>
<th>Time</th>
<th>Description</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>Voltage Transition</td>
<td>~8 to 20 μs</td>
<td>Time required for a voltage transition, depends on frequency</td>
<td><sup id="fnref:t1deets" role="doc-noteref"><a href="#fn:t1deets" class="footnote" rel="footnote">38</a></sup></td>
</tr>
<tr>
<td>Frequency Transition</td>
<td>~10 μs</td>
<td>Time required for the halted part of a frequency transition</td>
<td><sup id="fnref:fdeets" role="doc-noteref"><a href="#fn:fdeets" class="footnote" rel="footnote">39</a></sup></td>
</tr>
<tr>
<td>Relaxation Period</td>
<td>~680 μs</td>
<td>Time required to go back to a lower power license, measured from the last instruction requiring the higher license</td>
<td><sup id="fnref:lldeets" role="doc-noteref"><a href="#fn:lldeets" class="footnote" rel="footnote">40</a></sup></td>
</tr>
</tbody>
</table>
<h2 id="thanks">Thanks</h2>
<p><a href="https://lemire.me">Daniel Lemire</a> who provided access to the AVX-512 system I used for testing.</p>
<p><a href="https://twitter.com/thekanter">David Kanter</a> of <a href="http://www.realworldtech.com">RWT</a> for a fruitful discussion on power and voltage management in modern chips.</p>
<p>RWT forum members anon³, Ray, Etienne, Ricardo B, Foyle and Tim McCaffrey who provided feedback on this post and helped me understand the VR landscape for recent Intel chips.</p>
<p>Alexander Monakov, Kharzette and Justin Lebar for finding typos.</p>
<p><a href="https://twitter.com/JeffSmith888">Jeff Smith</a> for teaching me about spread spectrum clocking.</p>
<h2 id="discuss">Discuss</h2>
<p>Discussion on <a href="https://news.ycombinator.com/item?id=22077974">Hacker News</a>, <a href="https://twitter.com/trav_downs/status/1218238653354344449">Twitter</a> and <a href="https://lobste.rs/s/qaqmyo/gathering_intel_on_intel_avx_512">lobste.rs</a>.</p>
<p>Direct feedback also welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<p><br /></p>
<p><a name="footnotes"></a></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:first" role="doc-endnote">
<p>… and also <em>non frequency</em> related performance events, which I mention only in a footnote not to spoil the fun for non-footnote type people and also to pad my footnote count. That’s why I call this <em>Performance</em> Transitions, instead of Frequency Transitions. <a href="#fnref:first" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:dmore" role="doc-endnote">
<p>Note that Daniel has <a href="https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-instructions-are-maybe-not-so-dangerous/">written</a> <a href="https://lemire.me/blog/2018/08/15/the-dangers-of-avx-512-throttling-a-3-impact/">much more</a> <a href="https://lemire.me/blog/2018/08/24/trying-harder-to-make-avx-512-look-bad-my-quantified-and-reproducible-results/">than</a> <a href="https://lemire.me/blog/2018/09/04/per-core-frequency-scaling-and-avx-512-an-experiment/">just that</a> one. <a href="#fnref:dmore" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:intro" role="doc-endnote">
<p>This was going to be the actual post I was trying to write when I went off on <a href="/blog/2019/11/19/toupper.html">this tangent about clang-format</a>. In fact I <em>was</em> writing that post, when I went off this current tangent, but then a footnote turned into several paragraphs, then got its own section and ultimately graduated into a whole post: the one you are reading. So consider this background reading for the “interesting” post still to come, although honestly the stuff here is probably more generally useful than the next part. <a href="#fnref:intro" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:widthmatters" role="doc-endnote">
<p>We should be clear here: when I say AVX-512 instruction in this context, I mean specifically a <em>512-bit wide instruction</em> (which currently only exists in AVX-512). The distinction is that AVX-512 includes 128-bit and 256-bit versions of almost every instruction it introduces, yet these narrower instructions behave just like 128-bit SSE* and 256-bit AVX* instructions in terms of performance transitions. So, for example, <code class="language-plaintext highlighter-rouge">vpermw</code> is unambiguously an <em>AVX-512</em> instruction: it only exists in AVX-512BW, but only the version that takes <code class="language-plaintext highlighter-rouge">zmm</code> registers causes “AVX-512 like” performance transitions: the versions that take <code class="language-plaintext highlighter-rouge">ymm</code> or <code class="language-plaintext highlighter-rouge">xmm</code> registers act as any other integer 256-bit AVX2 or 128-bit SSE instruction. <a href="#fnref:widthmatters" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:speriod" role="doc-endnote">
<p>In fact, we generally want the sample period to be as small as possible, to give the best resolution and insight into short-lived events. We can’t make it <em>too</em> short though as the samples themselves have a minimum time to capture, and very short samples tend to be noisy due to non-atomicity, quantization effects, etc. <a href="#fnref:speriod" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:f2104" role="doc-endnote">
<p>The CPU is an <a href="https://en.wikichip.org/wiki/intel/xeon_w/w-2104">Intel W-2104</a>, a Xeon-W chip based on the <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr>. It has no turbo and an on-the-box speed of 3200 MHz, but accurate tools will probably report it running at 3192 MHz due to <em><a href="https://twitter.com/JeffSmith888/status/1211821823035351043">spread spectrum clocking</a></em> (SSC). In fact, we can see the typical 0.5% spread spectrum triangle wave on almost any of the plots in this post, at the right zoom level, <a href="/assets/avxfreq1/fig-ssc.svg">like this one</a>. This occurs because we measure time (the x-axis) based on <code class="language-plaintext highlighter-rouge">rdtsc</code> which is based off of a different clock not subject to SSC, while the unhalted cycles counter counts CPU cycles, which are based off of the 100 MHz BLCK which is subject to SSC. <a href="#fnref:f2104" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:avxt" role="doc-endnote">
<p><a href="https://www.realworldtech.com/forum/?threadid=179358&curpostid=179652">Measured</a> with <a href="https://github.com/travisdowns/avx-turbo">avx-turbo</a>. <a href="#fnref:avxt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:vz" role="doc-endnote">
<p>This is the part where I just gloss over what that <code class="language-plaintext highlighter-rouge">vzeroupper</code> is doing there. It’s there due to <em>implicit widening</em>. That’s a new term I just invented and it’s the first and last time I’m mentioning it in this post, because it really deserves an entire post of its own. The very short version is that any time an <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instruction writes to an N-bit register (N in {256, 512}), all subsequent <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions are <em>implicitly</em> N bits wide, regardless of their actual width, for the purposes of determining turbo licenses and other transitions discussed here. This sounds a bit like the <a href="https://stackoverflow.com/q/41303780/149138">mixed-VEX penalties thing</a>, but it is very different. This is a mini-bombshell hidden in a footnote, so if you want to scoop me you can, but I’m not coming to your birthday party. <a href="#fnref:vz" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:pperiod" role="doc-endnote">
<p>I.e., the <em>payload period</em> is zero, but the structure of the test ensures the function is called once, at the start of the payload period, no matter how small the payload period is. <a href="#fnref:pperiod" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:plotnotes" role="doc-endnote">
<p>All of the plots in this post are SVG images, meaning they can be zoomed arbitrarily: so if you want to zoom in on any region feel free (the browser limits the zoom amount, but just save as a file and open it with any SVG viewer). <a href="#fnref:plotnotes" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:falk" role="doc-endnote">
<p>These are basically a poor man’s version of the <em>Falk Diagrams</em> that Brandon Falk describes in <a href="https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.html">this post</a> among others. Poor in the sense that they have ~3000 cycle resolution instead of 1 cycle resolution, and because the measurement code has to be integrated directly into the system under test. Basically they are nothing like <em>Falk Diagrams</em> except that they have time on the x-axis and some performance counter event on the y-axis, but they are good enough for our purposes. <a href="#fnref:falk" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:threerun" role="doc-endnote">
<p>Specifically, all three runs are identical and I show them just to give a rough impression of which effects are reproducible and which are outliers. The second and third runs have the suffixes <code class="language-plaintext highlighter-rouge">_1</code> and <code class="language-plaintext highlighter-rouge">_2</code>, respectively, in the legends. <a href="#fnref:threerun" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:outlie" role="doc-endnote">
<p>These outliers usually occur when an interrupt occurs during the measurement. Originally I used a 3 μs sample time, and there were almost no visible outliers with that period, but the 1 μs value I settled on is much better in most respects other than outliers. The main problem is when an interrupt takes more than the sample time of ~1 μs: this causes one or more samples to be very short, because a long sample will cause subsequent ones to be short (maybe <em>very</em> short) until we catch up to the fixed sample schedule, and very short samples are subject to much more noise because the absolute metric values are much smaller but the error sources usually have fixed absolute error. Another source of outliers is when an interrupt <em>splits the stamp</em>: the <em>stamp</em> is the series of metrics we calculate at the sample point. These various metrics aren’t sampled atomically: if an interrupt occurs in the middle of the sampling, some metrics will reflect a much shorter time period (before the interrupt) and some a longer one (after). This effect tends to cause bidirectional spike outliers: where an upspike and downspike occur in consecutive samples. I try to avoid this by retrying the stamp if I think I’ve detected an interrupt during measurement (up to a retry limit). We could avoid all this nonsense by running the benchmark itself in the kernel, where we can disable interrupts (although some SMIs might still sneak in). Maybe next time! <a href="#fnref:outlie" class="reversefootnote" role="doc-backlink">↩</a></p>
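<p>The retry-on-interrupt idea described above looks roughly like this (an illustrative sketch; the names and the threshold are made up, not the benchmark’s actual code):</p>

```python
def take_stamp(read_stamp, max_retries=3, threshold_ns=2000):
    """Take a metrics stamp, retrying if it looks like an interrupt
    landed in the middle (detected by the stamp taking too long).

    read_stamp() returns (start_ns, metrics, end_ns); names and the
    2 μs threshold are illustrative placeholders.
    """
    stamp = None
    for _ in range(max_retries):
        start_ns, metrics, end_ns = read_stamp()
        stamp = metrics
        if end_ns - start_ns <= threshold_ns:
            break  # stamp looks clean: no interrupt split it
    return stamp  # after max_retries, accept a possibly split stamp

# First attempt takes 5 μs (split by an interrupt), second is clean:
stamps = iter([(0, "split", 5000), (0, "clean", 1000)])
print(take_stamp(lambda: next(stamps)))  # → clean
```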
</li>
<li id="fn:peachpuff" role="doc-endnote">
<p>More precisely, the color is <a href="https://encycolorpedia.com/ffdab9">peachpuff</a>. <a href="#fnref:peachpuff" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:resnote" role="doc-endnote">
<p>Note that the resolution of the sampling is 1 μs, so when we say things like <em>9 μs</em> and <em>11 μs</em> it could be off by up to 1 μs. This 1 μs “error” isn’t exactly randomly distributed, because the sampling interval is “exact” and in phase with the payload execution. So the samples are always at 1.0, 2.0, 3.0, … μs after the payload executes (plus or minus small variation on the order of 10 nanoseconds), so if the true time until halt is anywhere between 9.00 and 9.99 μs, we will always read 9 μs (because the time shown for the sample is the <em>end</em> of the sample and covers the preceding 1 μs). For the halt time, the scenario is reversed: the start of the interval is uncertain, but the end should be exact modulo the delay in taking the sample. <a href="#fnref:resnote" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:halted" role="doc-endnote">
<p>Generally you can detect halted periods when <code class="language-plaintext highlighter-rouge">rdtsc</code> jumps forward, but performance counters like “non-halted cycles” do not, and there are not indications of a larger interruption such as a context switch. You can find a similar case of halted periods <a href="https://stackoverflow.com/q/45472147/149138">in this question</a> which was also caused by frequency transitions (in this case, to obey the different active core count turbo ratios). <a href="#fnref:halted" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:thistle" role="doc-endnote">
<p>More precisely, the color is <a href="https://encycolorpedia.com/d8bfd8">thistle</a>. <a href="#fnref:thistle" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:confirm" role="doc-endnote">
<p>In particular, I have confirmed that these three samples all occur after the payload has executed and retired using the <em>period</em> column in the output, which indicates clearly which samples are before or after a given execution of the payload. The payload itself is followed by an <code class="language-plaintext highlighter-rouge">lfence</code> to ensure it has retired before taking further samples (and in any case the number of instructions per sample is too large to be accommodated by the <abbr title="Out-of-order execution allows CPUs to execute instructions out of order with respect to the source.">OoO</abbr> window). <a href="#fnref:confirm" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whynot" role="doc-endnote">
<p>Those who are still awake at this point might be wondering why introduce this new variant of the test now: why not just execute the payload instructions during the wait period in the original tests too? One reason is that by hot spinning on <code class="language-plaintext highlighter-rouge">rdtsc</code> we get somewhat more consistent results when we care only about measuring the frequency in that we almost always sample at exactly the specified period (plus or minus the <code class="language-plaintext highlighter-rouge">rdtsc</code> latency, more or less). When we execute the longish payload function during the wait, the sample point diverges a bit more from the ideal, and the number of spins per sample suffers more quantization effects (i.e., the pattern of spin counts might be 3,3,3,2,3,3,3,2… rather than 450,450,451,450…), which can sometimes lead to a slight sawtooth effect in the samples. <a href="#fnref:whynot" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:ideal" role="doc-endnote">
<p>In practice, we don’t exactly reach this ideal: we execute the payload function 2 or 3 times, for 2,000 or 3,000 <code class="language-plaintext highlighter-rouge">vpor</code> instructions, but there are about 600 additional instructions of overhead associated with taking a sample, so the overhead instructions are a significant portion. Probably 600 instructions is too many; I haven’t optimized that and it could likely be lowered significantly. However, we can also improve the ratio simply by decreasing the sampling resolution (i.e., increasing the sampling time). I selected 1 μs as a tradeoff between these competing factors. Note: since I wrote this footnote I optimized several things in the sampling loop, so measured IPCs are now very close to their theoretical values, but I guess this footnote still has value? <a href="#fnref:ideal" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:give" role="doc-endnote">
<p>The main uncertainty in the timing of the function itself concerns the boundary conditions: if we run this function 10 times <em>without</em> touching <code class="language-plaintext highlighter-rouge">zmm0</code> in between, the dependency on <code class="language-plaintext highlighter-rouge">zmm0</code> will be carried between functions and the total time will be very close to 10 x 1000 = 10,000 cycles. However, if some compiler generated code in between calls to the payload function happens to write to <code class="language-plaintext highlighter-rouge">zmm0</code>, breaking the dependency, the individual chains for each function no longer depend on each other, so some overlap is possible. The amount of this overlap is limited by the size of the RS, so the effect won’t be <em>huge</em> but it could be noticeable (with say 100 payload instructions rather than 1,000 it could be very significant). We basically sidestep this whole issue by putting an <code class="language-plaintext highlighter-rouge">lfence</code> between each call to the payload function, which forms an <em>execution barrier</em> between calls. <a href="#fnref:give" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:catchup" role="doc-endnote">
<p>The way the sampling works in this test could be described as <em>locked interval without skipping</em>. Here, <em>locked interval</em> means we calculate the next target sample time based on the previous <em>target</em> sample time (rather than the <em>actual</em> sample time), plus the period. So if we are sampling at 10 μs, the target sample times will always be 10 μs, 20 μs, etc. In particular, the series of target sample times doesn’t depend on what happens during the test, e.g., it doesn’t depend on the actual sample time: even if we actually end up sampling at time = 12 μs rather than 10, we target time = 20 μs as the next sample, not 12 + 10 = 22 μs. This raises the question of what happens when some delay (e.g., an interrupt, a frequency transition) causes us to miss more than one entire sample period.</p>
<p>E.g., with 10 μs resolution we just sampled at 90 μs, so the next sample target is 100 μs, but a delay causes the next sample to be taken at 125 μs. We are now behind! The next sample should occur at 110 μs, but of course that is in the past. The current test design still takes all the required samples, as quickly as possible (but with a minimum of one payload call if we are in the <em>extra payload period</em>) – that’s the <em>no skipping</em> part. In the current example, it means we would take subsequent samples quickly until we are caught up, at say 125 μs (target 110 μs), 126 μs (target 120 μs), 130 μs (target 130 μs), with that last sample being “on target”.</p>
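<p>This policy is compact enough to sketch in code (a hypothetical model, not the actual test harness; times are in μs and the work of taking a sample is modeled as instantaneous):</p>

```python
def sample_times(delays, period, end):
    """Locked interval: each target is the previous *target* plus the period,
    so targets are fixed at period, 2*period, ... regardless of when samples
    actually happen.  No skipping: late targets are still sampled, as quickly
    as possible, until we catch up."""
    samples = []   # list of (target_time, actual_time)
    now = 0
    target = period
    while target <= end:
        now = max(now, target)        # wait if early; continue at once if late
        now += delays.get(target, 0)  # e.g., an interrupt or frequency halt
        samples.append((target, now))
        target += period              # based on the target, not on 'now'
    return samples

# A 25 us delay around t=100 makes the next few samples late; they are taken
# back-to-back until the actual time catches up with the targets:
print(sample_times({100: 25}, period=10, end=130))
```

With the delay at t = 100, the targets 110 and 120 are both taken at t = 125 in this model, matching the catch-up behavior described above.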
<p>These samples occur quickly: very quickly in the case of normal samples which take less than 0.2 μs each, or more slowly in the case of the <em>extra payload</em> region, where the payload call bumps that to about 0.5 μs. So that’s what’s happening in the green region: we just had a frequency transition halt of ~10 μs, so we are ~10 samples behind, and the following samples are taken more quickly (as you can see because the data point markers are spaced more closely), with only a single payload call each (versus 2 or 3 usually). This changes the ratio of payload instructions to overhead instructions, which tends to bump the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> a bit (for the same reason the caught-up blue region almost shows <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> > 1). This effect can be reduced or eliminated by increasing the sampling period, since that reduces the number of catchup samples.</p>
<p>Incidentally, this also explains the oscillating pattern you see in the blue region: the ideal number of payload calls to get a 1 μs sample rate is ~2.5, so the sampling strategy tends to alternate between two and three calls. More calls means an <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> closer to 1, so the peaks of the oscillation are the two-call samples and the valleys the three-call samples. <a href="#fnref:catchup" class="reversefootnote" role="doc-backlink">↩</a></p>
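<p>A toy numerical model of this ratio effect (the payload figures come from the text; the overhead IPC of 2.0 is an assumed value, chosen only to be above 1):</p>

```python
# Measured IPC mixes a latency-bound payload (1,000 vpor instructions in
# ~1,000 cycles, i.e., IPC 1.0) with ~600 overhead instructions per sample
# that run at a higher IPC (assumed 2.0 here).
def measured_ipc(payload_calls, payload_instrs=1000, payload_cycles=1000,
                 overhead_instrs=600, overhead_ipc=2.0):
    instrs = payload_calls * payload_instrs + overhead_instrs
    cycles = payload_calls * payload_cycles + overhead_instrs / overhead_ipc
    return instrs / cycles

# Fewer payload calls per sample means the overhead is a larger fraction, so
# measured IPC is higher: 2-call samples form the peaks, 3-call the valleys.
print(round(measured_ipc(2), 3), round(measured_ipc(3), 3))
```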
</li>
<li id="fn:tput" role="doc-endnote">
<p>I’m not going to fully analyze the throughput case, but the thorough among us can <a href="/assets/avxfreq1/fig-ipc-zoomed-zmm-tput.svg">find the chart here</a>. Note that the overhead here cuts the other way: pushing the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> below the expected value of 2.0 (there are 2 512-bit vector units capable of executing <code class="language-plaintext highlighter-rouge">vpord</code>) because here the throughput limited instructions compete for ports with the overhead instructions. <a href="#fnref:tput" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:orisit" role="doc-endnote">
<p>Of course, another possibility is that something in the rest of the test loop uses <code class="language-plaintext highlighter-rouge">xmm</code> registers so they remain “hot”: they are “baseline” for x86-64 after all, so the compiler is free to use them without any special flags. We could test this theory with a more compact test loop audited to be free of any <code class="language-plaintext highlighter-rouge">xmm</code> use… but I’m not going to bother. I’m pretty sure these guys are powered up all the time as their use is pervasive in most x86-64 code. <a href="#fnref:orisit" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:upper" role="doc-endnote">
<p>Specifically, the part handling the second lane (bits 128:255) in the <code class="language-plaintext highlighter-rouge">ymm</code> case, and the upper 3 lanes (bits 128:511) in the <code class="language-plaintext highlighter-rouge">zmm</code> case. <a href="#fnref:upper" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:amd" role="doc-endnote">
<p>This isn’t far-fetched: that’s exactly how AMD Zen always executes 256-bit AVX instructions with its 128-bit units, and it’s similar to how <abbr title="Intel's Sandy Bridge architecture, aka 2nd Generation Intel Core i3,i5,i7">SNB</abbr> and <abbr title="Intel's Ivy Bridge architecture, aka 3rd Generation Intel Core i3,i5,i7">IVB</abbr> did the same for 256-bit load and store instructions. So using narrower vector units to implement wider instructions is definitely a thing. <a href="#fnref:amd" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:ewise" role="doc-endnote">
<p>These are usually called <em>cross lane</em> instructions (where a lane is 128 bits on Intel). For example, <code class="language-plaintext highlighter-rouge">vpor</code> is <em>not</em> like this: it is an <em>element-wise</em> operation where each output element (down to each bit, in this case) depends only on the element in the same position in the input vectors. On the other hand, <code class="language-plaintext highlighter-rouge">vpermd</code> is: each 32-bit element in the output can come from <em>any</em> position in the input vector (but it still behaves as element-wise wrt the mask register<sup id="fnref:bonus" role="doc-noteref"><a href="#fn:bonus" class="footnote" rel="footnote">41</a></sup>). <a href="#fnref:ewise" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lt" role="doc-endnote">
<p>The notation 4L4T means: “4 cycles of latency and 4 cycles of inverse throughput”. That is, a given instance of this instruction takes 4 cycles to finish (latency), and a new instance can start every 4 cycles (inverse throughput, hence a throughput of 0.25 per cycle). When the CPU is running normally, most single-<abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions have a latency of 1 (in lane), 3 (most cross-lane integer or shuffle ops) or 4 (most FP arithmetic). <a href="#fnref:lt" class="reversefootnote" role="doc-backlink">↩</a></p>
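<p>As a sanity check on the notation, here is the cycle arithmetic for a dependent chain versus independent instances (schematic counts, not measurements):</p>

```python
# Latency-bound: each instruction must wait for the previous result.
def chain_cycles(n, latency=4):
    return n * latency

# Throughput-bound: a new instance starts every inv_tput cycles, and the
# last one finishes 'latency' cycles after it starts.
def independent_cycles(n, inv_tput=4, latency=4):
    return (n - 1) * inv_tput + latency

# For a 4L4T instruction the two cases coincide: the inverse throughput
# already equals the latency, so a dependency chain costs nothing extra.
# Compare a hypothetical 4L1T instruction, where the chain is far slower.
print(chain_cycles(100), independent_cycles(100))              # 4L4T
print(chain_cycles(100), independent_cycles(100, inv_tput=1))  # 4L1T
```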
</li>
<li id="fn:alu" role="doc-endnote">
<p>I say <em>ALU instructions</em> here, but I strongly suspect it might be all instructions: you can certainly test that at home using the same test (the <code class="language-plaintext highlighter-rouge">vporxymm250_*</code> group of tests) but with other types of instructions such as loads replacing the <code class="language-plaintext highlighter-rouge">add</code>. I didn’t really test <em>all</em> ALU instructions either: just a few - but it is fair to assume that if <code class="language-plaintext highlighter-rouge">add</code> is slowed down, it is something generic, probably affecting at least all ALU stuff, not something instruction specific. <a href="#fnref:alu" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:ratentries" role="doc-endnote">
<p>Documentation claims 97 entries, but my testing seems to indicate they are not unified in Skylake: apparently only 64 can be used for ALU ops, and 33 for memory ops. <a href="#fnref:ratentries" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:vcc" role="doc-endnote">
<p>Try as I might, I can’t determine if this refers to <em>measured</em> core voltage, i.e., true voltage as determined at some sensor within the core, or <em>demanded</em> voltage, i.e., the voltage the processor wants right now based on the various power-relevant parameters, sometimes called the VID. In any case, we expect those values to track each other fairly closely, perhaps with some offset, and since we are looking for voltage <em>changes</em> either one works. That said, I am very interested if you know the answer to this question. <a href="#fnref:vcc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whyp" role="doc-endnote">
<p>This payload time series is meant to show the exact same thing as the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> series in earlier charts: we just want an indication of when the Type 1 throttling starts and stops. I used <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> initially because it was easy (I didn’t have to instrument the payload section specially) - but it doesn’t work when reading volts because that measurement involves a ton of additional instructions and a user-kernel transition, so it throws the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> off completely. So I went ahead and instrumented the payload section directly, so we can still see the throttling, but no way I want to go back and change the other plots that use <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr>. <a href="#fnref:whyp" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:cthrottle" role="doc-endnote">
<p>The throttling here is quite conservative I think: this is only a very small voltage change (less than 1%), so it is hard to believe that 4x throttling is <em>necessary.</em> It seems likely, for example, that cutting the dispatch rate in half would be enough to compensate for the missing 6 mV – but it is easy to imagine that just having a big conservative throttling number for all these voltage-too-low throttling scenarios is easy and safe, and these periods don’t occur often enough for it to really matter. <a href="#fnref:cthrottle" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:acc" role="doc-endnote">
<p>Essentially all modern Intel CPUs have varying maximum turbo ratios depending on the number of active cores (not halted or in a sleep state). E.g., my Skylake CPU can run at a max speed of 3.5, 3.3, 3.2 or 3.1 GHz if 1, 2, 3 or 4 cores are active, respectively. If only one core is currently running, at 3.5 GHz, and another core becomes active (e.g., because the scheduler found something to run) – the first core has to immediately transition down to 3.3 GHz and as we’ve seen above, it takes an ~8-10 μs halt to do so. When the other core stops running, the first can return to 3.5 GHz. If cores are flipping between inactive and active quickly enough, those halts add up and cut into your effective frequency. At some point, you might get less work done by trying to run at max turbo, versus just running at 3.3 GHz all the time, since in that case no halts need to be taken when the second core starts. Further explored <a href="https://stackoverflow.com/a/45592838/149138">over here</a>. <a href="#fnref:acc" class="reversefootnote" role="doc-backlink">↩</a></p>
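<p>The break-even intuition can be put into rough numbers (a toy model: the ~10 μs halt cost matches the measurements above, but the transition rates are purely illustrative):</p>

```python
# Effective frequency of a core nominally at turbo_ghz, when active-core-count
# changes force halts of halt_us each, transitions_per_sec times per second.
def effective_ghz(turbo_ghz, transitions_per_sec, halt_us=10):
    halted_fraction = transitions_per_sec * halt_us * 1e-6
    return turbo_ghz * (1 - halted_fraction)

# With enough core flapping, a fixed 3.3 GHz beats a nominal 3.5 GHz turbo:
for rate in (1000, 5000, 10000):
    print(rate, round(effective_ghz(3.5, rate), 2))
```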
</li>
<li id="fn:hiddenbo" role="doc-endnote">
<p>Avoiding these transitions is a hidden bonus of making the 1 and 2-core turbos the same, and more generally of “grouping” the turbo ratios in a more coarse-grained way across core counts: you don’t need any transition when the “from” and “to” core counts have the same max turbo ratio. <a href="#fnref:hiddenbo" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:recovery" role="doc-endnote">
<p>Earlier we mentioned a low frequency duration of 650 μs, but that was the test that ran only a single payload instruction. The recovery period is measured from the <em>last</em> wide instruction and in this test (that shows the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr>) we execute 100 μs of payload, so the recovery time will be 100 μs + 650 μs = ~750 μs. <a href="#fnref:recovery" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:heavy" role="doc-endnote">
<p>Unlike the transitions discussed here, transitions related to heavy instructions are <em>soft</em> transitions: they do not occur unconditionally after a single instruction of the relevant type is executed, but rather only after some <em>density</em> threshold is reached for those instructions. Exploring this threshold would be interesting. There is another effect mentioned in the Intel optimization manual: heavy instructions may cause a license transition even in cases where it wouldn’t normally occur, when light instructions of one license level are mixed with <em>half-width</em> heavy instructions. That is, 128-bit heavy instructions can use the fastest L0 license, as can 256-bit light instructions. However, apparently, mixing these can cause a request for the L1 license. Similarly for 256-bit heavy instructions and 512-bit light instructions, where a downgrade from L1 to L2 could occur. <a href="#fnref:heavy" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:t1deets" role="doc-endnote">
<p>I give a range of 8 to 20 μs because that’s what I measured in my testing, but the highest frequency at which I measured a voltage-only transition was 3.5 GHz, with a 15 mV delta. It is entirely possible that at higher frequencies and voltages, the times are much longer. It could also depend on the hardware, e.g., the presence or absence of a <abbr title="Fully Integrated Voltage Regulator">FIVR</abbr>. <a href="#fnref:t1deets" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fdeets" role="doc-endnote">
<p>This transition time seems to be required by <em>any</em> frequency transition, whether up or down in frequency and regardless of the cause. This includes transition causes not tested here: for example, when the max turbo ratio changes because the active core count changes, or when the ideal frequency changes for any other reason. <a href="#fnref:fdeets" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lldeets" role="doc-endnote">
<p>This same relaxation period appears to apply to both of the transition types discussed in this post, i.e., both voltage and frequency. The relaxation timer is reset any time an instruction that needs the current license is executed. In this case, the 680 μs period is measured from the instruction that causes the transition (which is also the last relevant instruction, since only a single payload instruction is used) until the time that the CPU resumes executing again at the higher frequency. This period includes one dispatch throttling period and two frequency transitions, so only about 650 μs of the 680 μs is spent executing instructions at full speed. <a href="#fnref:lldeets" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bonus" role="doc-endnote">
<p>Bonus question: are there <em>any</em> single-<abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> AVX/AVX-512 instructions which are <abbr title="A SIMD operation whose lane-wise output depends on elements from lanes other than the same lane in the inputs (lanes are 128 bits in x86).">cross-lane</abbr> in at least two of their inputs? There are 3-input shuffles, like <code class="language-plaintext highlighter-rouge">VPERMI2B</code>, which have 2 of their 3 inputs as <abbr title="A SIMD operation whose lane-wise output depends on elements from lanes other than the same lane in the inputs (lanes are 128 bits in x86).">cross-lane</abbr> (the two input tables), but they need 2 uops. <a href="#fnref:bonus" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

Travis Downs (travis.downs@gmail.com)
Investigating some details of SIMD related frequency transitions on Intel CPUs.

A Note on Mask Registers
2019-12-05T16:30:00+00:00
https://travisdowns.github.io/blog/2019/12/05/kreg-facts

<p>AVX-512 introduced eight so-called <em>mask registers</em><sup id="fnref:naming" role="doc-noteref"><a href="#fn:naming" class="footnote" rel="footnote">1</a></sup>, <code class="language-plaintext highlighter-rouge">k0</code><sup id="fnref:k0note" role="doc-noteref"><a href="#fn:k0note" class="footnote" rel="footnote">2</a></sup> through <code class="language-plaintext highlighter-rouge">k7</code>, which apply to most ALU operations and let you perform a zero-masking or merging<sup id="fnref:maskmerge" role="doc-noteref"><a href="#fn:maskmerge" class="footnote" rel="footnote">3</a></sup> operation on a per-element basis, speeding up code that would otherwise require extra blending operations in AVX2 and earlier.</p>
<p>If that single sentence doesn’t immediately indoctrinate you into the mask register religion, here’s a copy and paste from <a href="https://en.wikipedia.org/wiki/AVX-512#Opmask_registers">Wikipedia</a> that should fill in the gaps and close the deal:</p>
<blockquote>
<p>Most AVX-512 instructions may indicate one of 8 opmask registers (k0–k7). For instructions which use a mask register as an opmask, register <code class="language-plaintext highlighter-rouge">k0</code> is special: a hardcoded constant used to indicate unmasked operations. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, <code class="language-plaintext highlighter-rouge">k0</code> is a functioning, valid register. In most instructions, the opmask is used to control which values are written to the destination. A flag controls the opmask behavior, which can either be “zero”, which zeros everything not selected by the mask, or “merge”, which leaves everything not selected untouched. The merge behavior is identical to the blend instructions.</p>
</blockquote>
<p>So mask registers<sup id="fnref:kreg" role="doc-noteref"><a href="#fn:kreg" class="footnote" rel="footnote">4</a></sup> are important, but are not household names unlike say general purpose registers (<code class="language-plaintext highlighter-rouge">eax</code>, <code class="language-plaintext highlighter-rouge">rsi</code> and friends) or <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers (<code class="language-plaintext highlighter-rouge">xmm0</code>, <code class="language-plaintext highlighter-rouge">ymm5</code>, etc). They certainly aren’t going to show up on Intel slides disclosing the size of <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr> resources, like these:</p>
<p><img src="/assets/kreg/min/intel-skx-slide.png" alt="Intel Slide" class="invert-rotate-img" /></p>
<p><br /></p>
<p>In particular, I don’t think the size of the mask register physical register file (<abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>) has ever been reported. Let’s fix that today.</p>
<p>We use an updated version of the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size <a href="https://github.com/travisdowns/robsize">probing tool</a> originally authored and <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">described by Henry Wong</a><sup id="fnref:hcite" role="doc-noteref"><a href="#fn:hcite" class="footnote" rel="footnote">5</a></sup> (hereafter simply <em>Henry</em>), who used it to probe the size of various documented and undocumented <abbr title="Out-of-order execution allows CPUs to execute instructions out of order with respect to the source.">out-of-order</abbr> structures on earlier architectures. If you haven’t already read that post, stop now and do it. This post will be here when you get back.</p>
<p>You’ve already read Henry’s blog for a full description (right?), but for the naughty among you here’s the fast food version:</p>
<h4 id="fast-food-method-of-operation">Fast Food Method of Operation</h4>
<p>We separate two cache miss load instructions<sup id="fnref:misstime" role="doc-noteref"><a href="#fn:misstime" class="footnote" rel="footnote">6</a></sup> by a variable number of <em>filler instructions</em> which vary based on the CPU resource we are probing. When the number of filler instructions is small enough, the two cache misses execute in parallel and their latencies are overlapped so the total execution time is roughly<sup id="fnref:roughly" role="doc-noteref"><a href="#fn:roughly" class="footnote" rel="footnote">7</a></sup> as long as a single miss.</p>
<p>However, once the number of filler instructions reaches a critical threshold, all of the targeted resources are consumed and instruction allocation stalls before the second miss is issued, so the cache misses can no longer run in parallel. This causes the runtime to spike to about twice the baseline cache miss latency.</p>
<p>Finally, we ensure that each filler instruction consumes exactly one unit of the resource we are interested in, so that the location of the spike indicates the size of the underlying resource. For example, regular <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> instructions usually consume one physical register from the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>, so they are a good choice for measuring the size of that resource.</p>
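<p>The expected shape of the results can be sketched as a step function (a deliberately simplified model: the 100 ns miss latency and 0.25 ns per filler instruction are illustrative numbers, and real runs also show a gradual transition region near the limit, discussed below):</p>

```python
# Two cache misses separated by n filler instructions, each filler consuming
# one unit of a resource with 'size' entries: once the resource is exhausted
# before the second miss issues, the misses serialize and runtime jumps by
# roughly one full miss latency.
def runtime_ns(n_filler, size, miss_ns=100, filler_ns=0.25):
    overlapped = n_filler <= size
    return (1 if overlapped else 2) * miss_ns + n_filler * filler_ns

print(runtime_ns(134, 134), runtime_ns(135, 134))  # spike right past the size
```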
<h4 id="mask-register-prf-size">Mask Register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> Size</h4>
<p>Here, we use filler instructions that write a mask register, so we can measure the size of the mask register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>.</p>
<p>To start, we use a series of <code class="language-plaintext highlighter-rouge">kaddd k1, k2, k3</code> instructions, as such (shown for 16 filler instructions):</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span> <span class="c1">; first cache miss load</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rdx</span><span class="p">]</span> <span class="c1">; second cache miss load</span>
<span class="nf">lfence</span> <span class="c1">; stop issue until the above block completes</span>
<span class="c1">; this block is repeated 16 more times</span>
</code></pre></div></div>
<p>Each <code class="language-plaintext highlighter-rouge">kaddd</code> instruction consumes one physical mask register. If the number of filler instructions is less than or equal to the number of available physical mask registers, we expect the misses to happen in parallel; otherwise, the misses will be resolved serially. So at that point we expect to see a large spike in the running time.</p>
<p>That’s exactly what we see:</p>
<p><img src="/assets/kreg/min/skx-27.svg" alt="Test 27 kaddd instructions" /></p>
<p>Let’s zoom in on the critical region, where the spike occurs:</p>
<p><img src="/assets/kreg/min/skx-27-zoomed.svg" alt="Test 27 zoomed" /></p>
<p>Here we clearly see that the transition isn’t <em>sharp</em>: when the filler instruction count is between 130 and 134, the runtime is intermediate, falling between the low and high levels. Henry calls this <em>non-ideal</em> behavior and I have seen it repeatedly across many (but not all) of these resource size tests. The idea is that the hardware implementation doesn’t always allow all of the resources to be used as you approach the limit<sup id="fnref:nonideal" role="doc-noteref"><a href="#fn:nonideal" class="footnote" rel="footnote">8</a></sup>: sometimes you get to use every last resource, but in other cases you may hit the limit a few filler instructions before the theoretical maximum.</p>
<p>Under this assumption, we want to look at the last (rightmost) point which is still faster than the slow performance level, since it indicates that <em>sometimes</em> that many resources are available, implying that at least that many are physically present. Here, we see that final point occurs at 134 filler instructions.</p>
<p>So we conclude that <em><abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> has 134 physical registers available to hold speculative mask register values</em>. As Henry indicates in the original post, it is likely that there are 8 physical registers dedicated to holding the non-speculative architectural state of the 8 mask registers, so our best guess at the total size of the mask register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> is 142. That’s somewhat smaller than the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> (180 entries) or the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> (168 entries), but still quite large (see <a href="/blog/2019/06/11/speed-limits.html#ooo-table">this table of out of order resource sizes</a> for sizes on other platforms).</p>
<p>In particular, it is definitely large enough that you aren’t likely to run into this limit in practical code: it’s hard to imagine non-contrived code where almost 60%<sup id="fnref:twothirds" role="doc-noteref"><a href="#fn:twothirds" class="footnote" rel="footnote">9</a></sup> of the instructions <em>write</em><sup id="fnref:write" role="doc-noteref"><a href="#fn:write" class="footnote" rel="footnote">10</a></sup> to mask registers, because that’s what you’d need to hit this limit.</p>
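<p>That 60% figure comes from comparing the measured speculative mask PRF size against the out-of-order window: Skylake's ROB holds 224 entries, so keeping 134 mask-writing instructions in flight means well over half the window must write mask registers:</p>

```python
speculative_prf = 134  # measured above
rob_entries = 224      # Skylake re-order buffer size
print(round(speculative_prf / rob_entries * 100))  # percent of the ROB window
```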
<h4 id="are-they-distinct-prfs">Are They Distinct PRFs?</h4>
<p>You may have noticed that so far I’m simply <em>assuming</em> that the mask register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> is distinct from the others. I think this is highly likely, given the way mask registers are used and since they are part of a disjoint renaming domain<sup id="fnref:rename" role="doc-noteref"><a href="#fn:rename" class="footnote" rel="footnote">11</a></sup>. It is also supported by the fact that the apparent mask register PRF size doesn’t match either the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> or <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> sizes, but we can go further and actually test it!</p>
<p>To do that, we use a similar test to the above, but with the filler instructions alternating between the same <code class="language-plaintext highlighter-rouge">kaddd</code> instruction as the original test and an instruction that uses either a <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> or <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> register. If the register file is shared, we expect to hit a limit at the size of the shared <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>. If the PRFs are not shared, we expect that neither <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> limit will be hit, and we will instead hit a different limit such as the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size.</p>
<p><a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-29.asm">Test 29</a> alternates <code class="language-plaintext highlighter-rouge">kaddd</code> and scalar <code class="language-plaintext highlighter-rouge">add</code> instructions, like this:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
<span class="nf">add</span> <span class="nb">ebx</span><span class="p">,</span><span class="nb">ebx</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">esi</span><span class="p">,</span><span class="nb">esi</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">ebx</span><span class="p">,</span><span class="nb">ebx</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">esi</span><span class="p">,</span><span class="nb">esi</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">ebx</span><span class="p">,</span><span class="nb">ebx</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">esi</span><span class="p">,</span><span class="nb">esi</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">ebx</span><span class="p">,</span><span class="nb">ebx</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rdx</span><span class="p">]</span>
<span class="nf">lfence</span>
</code></pre></div></div>
<p>Here’s the chart:</p>
<p><img src="/assets/kreg/min/skx-29.svg" alt="Test 29: alternating kaddd and scalar add" /></p>
<p>We see that the spike is at a filler count larger than both the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and mask <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> sizes. So we can conclude that the mask and <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> PRFs are not shared.</p>
<p>Maybe the mask register is shared with the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>? After all, mask registers are more closely associated with <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions than general purpose ones, so maybe there is some synergy there.</p>
<p>To check, here’s <a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-35.asm">Test 35</a>, which is similar to 29 except that it alternates between <code class="language-plaintext highlighter-rouge">kaddd</code> and <code class="language-plaintext highlighter-rouge">vxorps</code>, like so:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
<span class="nf">vxorps</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm1</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm5</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm6</span><span class="p">,</span><span class="nv">ymm6</span><span class="p">,</span><span class="nv">ymm7</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm1</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm5</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rdx</span><span class="p">]</span>
<span class="nf">lfence</span>
</code></pre></div></div>
<p>Here’s the corresponding chart:</p>
<p><img src="/assets/kreg/min/skx-35.svg" alt="Test 35: alternating kaddd and SIMD xor" /></p>
<p>The behavior is basically identical to the prior test, so we conclude that there is no direct sharing between the mask register and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> PRFs either.</p>
<p class="warning">This turned out not to be the end of the story. The mask registers <em>are</em> shared, just not with the general purpose or SSE/AVX register file. For all the details, see this <a href="/blog/2020/05/26/kreg2.html">follow up post</a>.</p>
<h4 id="an-unresolved-puzzle">An Unresolved Puzzle</h4>
<p>Something we notice in both of the above tests, however, is that the spike seems to finish around 212 filler instructions, while the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size for this microarchitecture is 224. Is this just <em>non-ideal behavior</em> as we saw earlier? Well, we can test this by comparing against Test 4, which just uses <code class="language-plaintext highlighter-rouge">nop</code> instructions as the filler: these should consume almost no resources beyond <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> entries. Here’s Test 4 (<code class="language-plaintext highlighter-rouge">nop</code> filler) versus Test 29 (alternating <code class="language-plaintext highlighter-rouge">kaddd</code> and scalar <code class="language-plaintext highlighter-rouge">add</code>):</p>
<p><img src="/assets/kreg/min/skx-4-29.svg" alt="Test 4 vs 29" /></p>
<p>The <code class="language-plaintext highlighter-rouge">nop</code>-using <a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-4.asm">Test 4</a> <em>nails</em> the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size at exactly 224 (these charts are SVG so feel free to “View Image” and zoom in to confirm). So it seems that we hit some other limit around 212 when we mix mask and <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> registers, or when we mix mask and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers. In fact the same limit applies even between <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers, if we compare Test 4 and <a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-21.asm">Test 21</a> (which mixes <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> adds with <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> <code class="language-plaintext highlighter-rouge">vxorps</code>):</p>
<p><img src="/assets/kreg/min/skx-4-21.svg" alt="Test 4 vs 21" /></p>
<p>Henry mentions a more extreme version of the same thing in the original <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">blog entry</a>, in the section also headed <strong>Unresolved Puzzle</strong>:</p>
<blockquote>
<p>Sandy Bridge AVX or SSE interleaved with integer instructions seems to be limited to looking ahead ~147 instructions by something other than the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr>. Having tried other combinations (e.g., varying the ordering and proportion of AVX vs. integer instructions, inserting some NOPs into the mix), it seems as though both SSE/AVX and integer instructions consume registers from some form of shared pool, as the instruction window is always limited to around 147 regardless of how many of each type of instruction are used, as long as neither type exhausts its own <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> supply on its own.</p>
</blockquote>
<p>Read the full section for all the details. The effect is similar here but smaller: we at least get 95% of the way to the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size, but still stop before it. It is possible the shared resource is related to register reclamation, e.g., the PRRT<sup id="fnref:prrt" role="doc-noteref"><a href="#fn:prrt" class="footnote" rel="footnote">12</a></sup>, a table which keeps track of which registers can be reclaimed when a given instruction retires.</p>
<p>Finally, we finish this party off with a few miscellaneous notes on mask registers, checking for parity with some features available to <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers.</p>
<h3 id="move-elimination">Move Elimination</h3>
<p>Both <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers are eligible for so-called <em>move elimination</em>. This means that a register to register move like <code class="language-plaintext highlighter-rouge">mov eax, edx</code> or <code class="language-plaintext highlighter-rouge">vmovdqu ymm1, ymm2</code> can be eliminated at rename by “simply”<sup id="fnref:simply" role="doc-noteref"><a href="#fn:simply" class="footnote" rel="footnote">13</a></sup> pointing the destination register entry in the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> to the same physical register as the source, without involving the ALU.</p>
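<p>To make the payoff concrete, here’s a hypothetical latency-chain kernel (not one of the numbered robsize tests) that distinguishes eliminated from non-eliminated moves:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add eax, 1    ; 1 cycle of real ALU latency, carried through eax
mov edx, eax  ; candidate for move elimination
mov eax, edx  ; candidate for move elimination
; repeated many times
</code></pre></div></div>
<p>If the moves are eliminated, each iteration of the carried chain costs only the 1 cycle of the <code class="language-plaintext highlighter-rouge">add</code>; if they instead execute as ordinary 1-cycle ALU moves, the chain costs about 3 cycles per iteration.</p>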
<p>Let’s check if something like <code class="language-plaintext highlighter-rouge">kmov k1, k2</code> also qualifies for move elimination. First, we check the chart for <a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-28.asm">Test 28</a>, where the filler instruction is <code class="language-plaintext highlighter-rouge">kmovd k1, k2</code>:</p>
<p><img src="/assets/kreg/min/skx-28.svg" alt="Test 28" /></p>
<p>It looks exactly like Test 27, which we saw earlier with <code class="language-plaintext highlighter-rouge">kaddd</code>. So we would suspect that physical registers are being consumed, unless we happen to have hit a different move-elimination-related limit with exactly the same size and limiting behavior<sup id="fnref:moves" role="doc-noteref"><a href="#fn:moves" class="footnote" rel="footnote">14</a></sup>.</p>
<p>Additional confirmation comes from uops.info, which <a href="https://uops.info/table.html?search=kmov%20(K%2C%20K)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKX=on&cb_measurements=on&cb_avx512=on">shows that</a> all variants of mask to mask register <code class="language-plaintext highlighter-rouge">kmov</code> take one <abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> dispatched to <abbr title="port 0 (GP and SIMD ALU, not-taken branches)">p0</abbr>. If the move were eliminated, we wouldn’t see any dispatched uops.</p>
<p>Therefore I conclude that register to register<sup id="fnref:regreg" role="doc-noteref"><a href="#fn:regreg" class="footnote" rel="footnote">15</a></sup> moves involving mask registers are not eliminated.</p>
<h3 id="dependency-breaking-idioms">Dependency Breaking Idioms</h3>
<p>The <a href="https://stackoverflow.com/a/33668295/149138">best way</a> to set a <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> register to zero in x86 is via the xor zeroing idiom: <code class="language-plaintext highlighter-rouge">xor reg, reg</code>. This works because any value xored with itself is zero. This is smaller (fewer instruction bytes) than the more obvious <code class="language-plaintext highlighter-rouge">mov eax, 0</code>, and also faster since the processor recognizes it as a <em>zeroing idiom</em> and performs the necessary work at rename<sup id="fnref:zero" role="doc-noteref"><a href="#fn:zero" class="footnote" rel="footnote">16</a></sup>, so no ALU is involved and no <abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> is dispatched.</p>
<p>Furthermore, the idiom is <em>dependency breaking:</em> although <code class="language-plaintext highlighter-rouge">xor reg1, reg2</code> in general depends on the value of both <code class="language-plaintext highlighter-rouge">reg1</code> and <code class="language-plaintext highlighter-rouge">reg2</code>, in the special case that <code class="language-plaintext highlighter-rouge">reg1</code> and <code class="language-plaintext highlighter-rouge">reg2</code> are the same, there is no dependency as the result is zero regardless of the inputs. All modern x86 CPUs recognize this<sup id="fnref:otherzero" role="doc-noteref"><a href="#fn:otherzero" class="footnote" rel="footnote">17</a></sup> special case for <code class="language-plaintext highlighter-rouge">xor</code>. The same applies to <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> versions of xor such as integer <a href="https://www.felixcloutier.com/x86/pxor"><code class="language-plaintext highlighter-rouge">vpxor</code></a> and floating point <a href="https://www.felixcloutier.com/x86/xorps"><code class="language-plaintext highlighter-rouge">vxorps</code></a> and <a href="https://www.felixcloutier.com/x86/xorpd"><code class="language-plaintext highlighter-rouge">vxorpd</code></a>.</p>
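<p>As a quick refresher on the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> case (instruction sizes are for the 32-bit register forms):</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mov eax, 0              ; 5 bytes, an ordinary instruction
xor eax, eax            ; 2 bytes, recognized zeroing idiom: handled at
                        ; rename, no uop dispatched, breaks the
                        ; dependency on eax
xor eax, ebx            ; not an idiom: a real xor that depends on both inputs
vpxor xmm0, xmm0, xmm0  ; SIMD equivalent, also recognized
</code></pre></div></div>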
<p>That background out of the way, a curious person might wonder if the <code class="language-plaintext highlighter-rouge">kxor</code> <a href="https://www.felixcloutier.com/x86/kxorw:kxorb:kxorq:kxord">variants</a> are treated the same way. Is <code class="language-plaintext highlighter-rouge">kxorb k1, k1, k1</code><sup id="fnref:notall" role="doc-noteref"><a href="#fn:notall" class="footnote" rel="footnote">18</a></sup> treated as a zeroing idiom?</p>
<p>This is actually two separate questions, since there are two aspects to zeroing idioms:</p>
<ul>
<li>Zero latency execution with no execution unit (elimination)</li>
<li>Dependency breaking</li>
</ul>
<p>Let’s look at each in turn.</p>
<h4 id="execution-elimination">Execution Elimination</h4>
<p>So are zeroing xors like <code class="language-plaintext highlighter-rouge">kxorb k1, k1, k1</code> executed at rename without latency and without needing an execution unit?</p>
<p>No.</p>
<p>Here, I don’t even have to do any work: uops.info has our back because they’ve performed <a href="https://uops.info/html-tp/SKX/KXORD_K_K_K-Measurements.html#sameReg">this exact test</a> and report a latency of 1 cycle and one <abbr title="port 0 (GP and SIMD ALU, not-taken branches)">p0</abbr> <abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> used. So we can conclude that zeroing xors of mask registers are not eliminated.</p>
<h4 id="dependency-breaking">Dependency Breaking</h4>
<p>Well maybe zeroing kxors are dependency breaking, even though they require an execution unit?</p>
<p>In this case, we can’t simply check uops.info. <code class="language-plaintext highlighter-rouge">kxor</code> is a one-cycle latency instruction that runs only on a single execution port (<abbr title="port 0 (GP and SIMD ALU, not-taken branches)">p0</abbr>), so we hit the interesting (?) case where a chain of <code class="language-plaintext highlighter-rouge">kxor</code> runs at the same speed regardless of whether they are dependent or independent: the throughput bottleneck of 1/cycle is the same as the latency bottleneck of 1/cycle!</p>
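<p>To see the problem concretely, both of the following (illustrative) sequences run at about one <code class="language-plaintext highlighter-rouge">kxorb</code> per cycle on <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, so a naive measurement can’t tell them apart:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>; dependent chain: each kxorb reads the previous result,
; so we are bound by the 1-cycle latency
kxorb k1, k1, k2
kxorb k1, k1, k2
; ... and so on

; independent operations: no dependencies, but kxorb only
; runs on p0, so we are bound by the 1/cycle port throughput
kxorb k1, k2, k3
kxorb k4, k5, k6
; ... and so on
</code></pre></div></div>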
<p>Don’t worry, we’ve got other tricks up our sleeve. We can test this by constructing a test which involves a <code class="language-plaintext highlighter-rouge">kxor</code> in a carried dependency chain with enough total latency that the chain latency is the bottleneck. If the <code class="language-plaintext highlighter-rouge">kxor</code> carries a dependency, the runtime will be equal to the sum of the latencies in the chain. If the instruction is dependency breaking, the chain is broken: the disconnected chains can overlap, and performance will likely be limited by some throughput restriction (e.g., <a href="/blog/2019/06/11/speed-limits.html#portexecution-unit-limits">port contention</a>). This could use a good diagram, but I’m not good at diagrams.</p>
<p>All the tests are in <a href="https://github.com/travisdowns/uarch-bench/blob/ccbebbec39ab02d6460a1837857d052e120c0946/x86_avx512.asm#L20"><abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr> bench</a>, but I’ll show the key parts here.</p>
<p>First we get a baseline measurement for the latency of moving from a mask register to a <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> register and back:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nf">kmovb</span> <span class="nb">eax</span><span class="p">,</span> <span class="nv">k0</span>
<span class="c1">; repeated 127 more times</span>
</code></pre></div></div>
<p>This pair clocks in<sup id="fnref:runit" role="doc-noteref"><a href="#fn:runit" class="footnote" rel="footnote">19</a></sup> at 4 cycles. It’s hard to know how to partition the latency between the two instructions: are they each 2 cycles, or is there a 3-1 split one way or the other<sup id="fnref:fyiuops" role="doc-noteref"><a href="#fn:fyiuops" class="footnote" rel="footnote">20</a></sup>? For our purposes it doesn’t matter, because we just care about the latency of the round-trip. Importantly, the port-based throughput limit of this sequence is 1/cycle, 4x faster than the latency limit, because each instruction goes to a different port (<abbr title="port 5 (GP and SIMD ALU, vector shuffles)">p5</abbr> and <abbr title="port 0 (GP and SIMD ALU, not-taken branches)">p0</abbr>, respectively). This means we will be able to tease out latency effects independent of throughput.</p>
<p>Next, we throw a <code class="language-plaintext highlighter-rouge">kxor</code> into the chain that we know is <em>not</em> zeroing:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nf">kxorb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nv">k0</span><span class="p">,</span> <span class="nv">k1</span>
<span class="nf">kmovb</span> <span class="nb">eax</span><span class="p">,</span> <span class="nv">k0</span>
<span class="c1">; repeated 127 more times</span>
</code></pre></div></div>
<p>Since <a href="https://uops.info/table.html?search=kxorb&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKX=on&cb_measurements=on&cb_avx512=on">we know</a> <code class="language-plaintext highlighter-rouge">kxorb</code> has 1 cycle of latency, we expect to increase the latency to 5 cycles and that’s exactly what we measure (the first two tests shown):</p>
<pre>
** Running group avx512 : AVX512 stuff **
Benchmark Cycles Nanos
kreg-GP roundtrip latency 4.00 1.25
kreg-GP roundtrip + nonzeroing kxorb 5.00 1.57
</pre>
<p>Finally, the key test:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nf">kxorb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nv">k0</span><span class="p">,</span> <span class="nv">k0</span>
<span class="nf">kmovb</span> <span class="nb">eax</span><span class="p">,</span> <span class="nv">k0</span>
<span class="c1">; repeated 127 more times</span>
</code></pre></div></div>
<p>This has a zeroing <code class="language-plaintext highlighter-rouge">kxorb k0, k0, k0</code>. If it breaks the dependency on k0, it would mean that the <code class="language-plaintext highlighter-rouge">kmovb eax, k0</code> no longer depends on the earlier <code class="language-plaintext highlighter-rouge">kmovb k0, eax</code>: the carried chain would be broken and we’d see a lower cycle time.</p>
<p>Drumroll…</p>
<p>We measure this at the exact same 5.0 cycles as the prior example:</p>
<pre>
** Running group avx512 : AVX512 stuff **
Benchmark Cycles Nanos
kreg-GP roundtrip latency 4.00 1.25
kreg-GP roundtrip + nonzeroing kxorb 5.00 1.57
<span style="background: green;"> kreg-GP roundtrip + zeroing kxorb 5.00 1.57</span>
</pre>
<p>So we tentatively conclude that zeroing idioms aren’t recognized at all when they involve mask registers.</p>
<p>Finally, as a check on our logic, we use the following test which replaces the <code class="language-plaintext highlighter-rouge">kxor</code> with a <code class="language-plaintext highlighter-rouge">kmov</code> which we know is <em>always</em> dependency breaking:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">ecx</span>
<span class="nf">kmovb</span> <span class="nb">eax</span><span class="p">,</span> <span class="nv">k0</span>
<span class="c1">; repeated 127 more times</span>
</code></pre></div></div>
<p>This is the final result shown in the output above, and it runs much more quickly at 2 cycles, bottlenecked on <abbr title="port 5 (GP and SIMD ALU, vector shuffles)">p5</abbr> (the two <code class="language-plaintext highlighter-rouge">kmov k, r32</code> instructions both go only to <abbr title="port 5 (GP and SIMD ALU, vector shuffles)">p5</abbr>):</p>
<pre>
** Running group avx512 : AVX512 stuff **
Benchmark Cycles Nanos
kreg-GP roundtrip latency 4.00 1.25
kreg-GP roundtrip + nonzeroing kxorb 5.00 1.57
kreg-GP roundtrip + zeroing kxorb 5.00 1.57
<span style="background: green;"> kreg-GP roundtrip + mov from GP 2.00 0.63</span>
</pre>
<p>So our experiment seems to check out.</p>
<h3 id="reproduction">Reproduction</h3>
<p>You can reproduce these results yourself with the <a href="https://github.com/travisdowns/robsize">robsize</a> binary on Linux or Windows (using WSL). The specific results for this article are <a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results">also available</a> as are the <a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/">scripts</a> used to collect them and generate the plots.</p>
<h3 id="summary">Summary</h3>
<ul>
<li><abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> has a separate <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> for mask registers with a speculative size of 134 and an estimated total size of 142</li>
<li>This is large enough compared to the other <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> size and the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> to make it unlikely to be a bottleneck</li>
<li>Mask registers are not eligible for move elimination</li>
<li>Zeroing idioms<sup id="fnref:tech" role="doc-noteref"><a href="#fn:tech" class="footnote" rel="footnote">21</a></sup> in mask registers are not recognized for execution elimination or dependency breaking</li>
</ul>
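<p>As a concrete illustration of the last three bullets, here is a rough sketch contrasting general-purpose registers, where these rename-time tricks apply, with mask registers, where they don't (register choices are arbitrary):</p>

```nasm
; general-purpose registers on SKX:
xor   eax, eax       ; recognized zeroing idiom: dependency-breaking,
                     ; handled at rename without using an execution unit
mov   ebx, eax       ; eligible for move elimination at rename

; mask registers on SKX:
kxorb k1, k1, k1     ; NOT recognized as zeroing: executes normally and
                     ; still carries a dependency on the old value of k1
kmovb k2, k1         ; NOT move-eliminated: occupies an execution unit
```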
<h3 id="part-ii">Part II</h3>
<p>I didn’t expect it to happen, but it did: there is a <a href="/blog/2020/05/26/kreg2.html">follow up post</a> about mask registers, where we (roughly) confirm the register file size by looking at an image of a <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> CPU captured via microscope, and make an interesting discovery regarding sharing.</p>
<h3 id="comments">Comments</h3>
<p>Discussion on <a href="https://news.ycombinator.com/item?id=21714390">Hacker News</a>, Reddit (<a href="https://www.reddit.com/r/asm/comments/e6kokb/x86_avx512_a_note_on_mask_registers/">r/asm</a> and <a href="https://www.reddit.com/r/programming/comments/e6ko7i/a_note_on_mask_registers_avx512/">r/programming</a>) or <a href="https://twitter.com/trav_downs/status/1202637229606264833">Twitter</a>.</p>
<p>Direct feedback also welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<h3 id="thanks">Thanks</h3>
<p><a href="https://lemire.me">Daniel Lemire</a>, who provided access to the AVX-512 system I used for testing.</p>
<p><a href="http://www.stuffedcow.net/">Henry Wong</a>, who wrote the <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">original article</a> which introduced me to this technique and graciously shared the code for his tool, which I now <a href="https://github.com/travisdowns/robsize">host on github</a>.</p>
<p><a href="https://twitter.com/Jeffinatorator/status/1202642436406669314">Jeff Baker</a> and <a href="http://0x80.pl">Wojciech Muła</a> for reporting typos.</p>
<p>Image credit: <a href="https://www.flickr.com/photos/like_the_grand_canyon/31064064387">Kellogg’s Special K</a> by <a href="https://www.flickr.com/photos/like_the_grand_canyon/">Like_the_Grand_Canyon</a> is licensed under <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:naming" role="doc-endnote">
<p>These <em>mask registers</em> are often called <em>k</em> registers or simply <em>kregs</em> based on their naming scheme. <a href="https://twitter.com/tom_forsyth/status/1202666300591337472">Rumor has it</a> that this letter was chosen randomly only after a long and bloody naming battle between MFs. <a href="#fnref:naming" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:k0note" role="doc-endnote">
<p>There is sometimes a misconception (until recently even on the AVX-512 wikipedia article) that <code class="language-plaintext highlighter-rouge">k0</code> is not a normal mask register, but just a hardcoded indicator that no masking should be used. That’s not true: <code class="language-plaintext highlighter-rouge">k0</code> is a valid mask register and you can read and write to it with the <code class="language-plaintext highlighter-rouge">k</code>-prefixed instructions and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions that write mask registers (e.g., any AVX-512 <a href="https://www.felixcloutier.com/x86/pcmpeqb:pcmpeqw:pcmpeqd">comparison</a>). However, the encoding that would normally be used for <code class="language-plaintext highlighter-rouge">k0</code> as a writemask register in a <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> operation indicates instead “no masking”, so the contents of <code class="language-plaintext highlighter-rouge">k0</code> cannot be used for that purpose. <a href="#fnref:k0note" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:maskmerge" role="doc-endnote">
<p>The distinction being that a zero-masking operation results in zeroed destination elements at positions not selected by the mask, while merging leaves the existing elements in the destination register unchanged at those positions. As a side effect, this means that with merging, the destination register becomes a type of destructive source-destination register and there is an input dependency on this register. <a href="#fnref:maskmerge" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:kreg" role="doc-endnote">
<p>I’ll try to use the full term <em>mask register</em> here, but I may also use <em>kreg</em>, a common nickname based on the labels <code class="language-plaintext highlighter-rouge">k0</code>, <code class="language-plaintext highlighter-rouge">k1</code>, etc. So just mentally swap <em>kreg</em> for <em>mask register</em> if and when you see it (or vice-versa). <a href="#fnref:kreg" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:hcite" role="doc-endnote">
<p>H. Wong, <em>Measuring Reorder Buffer Capacity</em>, May, 2013. [Online]. Available: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ <a href="#fnref:hcite" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:misstime" role="doc-endnote">
<p>Generally taking 100 to 300 cycles each (latency-wise). The wide range is because the cache miss wall clock time varies by a factor of about 2x, generally between 50 and 100 nanoseconds, depending on platform and <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr> details, and the CPU frequency varies by a factor of about 2.5x (say from 2 GHz to 5 GHz). However, on a given host, with equivalent TLB miss/hit behavior, we expect the time to be roughly constant. <a href="#fnref:misstime" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:roughly" role="doc-endnote">
<p>The reason I have to add <em>roughly</em> as a weasel word here is itself interesting. A glance at the charts shows that they are certainly not totally flat in either the fast or slow regions surrounding the spike. Rather there are various noticeable regions with distinct behavior and other artifacts: e.g., in Test 29 a very flat region up to about 104 filler instructions, followed by a bump and then a linearly ramping region up to the spike somewhat after 200 instructions. Some of those features are explicable by mentally (or <a href="https://godbolt.org/z/eAGxhH">actually</a>) simulating the pipeline, which reveals that at some point the filler instructions will contribute (although only a cycle or so) to the runtime, but some features are still unexplained (for now). <a href="#fnref:roughly" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:nonideal" role="doc-endnote">
<p>For example, a given rename slot may only be able to write a subset of all the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> entries, and uses the first available. When the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> is almost full, it is possible that none of the allowed entries are empty, so it is as if the structure is full even though some free entries remain, but accessible only to other uops. Since the allowed entries may be essentially random across iterations, this ends up with a more-or-less linear ramp between the low and high performance levels in the non-ideal region. <a href="#fnref:nonideal" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:twothirds" role="doc-endnote">
<p>The “60 percent” comes from 134 / 224, i.e., the speculative mask register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> size, divided by the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size. The idea is that you’ll hit the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size limit no matter what once you have 224 instructions in flight, so you’d need to have 60% of those instructions be mask register writes<sup id="fnref:write:1" role="doc-noteref"><a href="#fn:write" class="footnote" rel="footnote">10</a></sup> in order to hit the 134 limit first. Of course, you might also hit some <em>other</em> limit first, so even 60% might not be enough, but the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size puts a lower bound on this figure since it <em>always</em> applies. <a href="#fnref:twothirds" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:write" role="doc-endnote">
<p>Importantly, only instructions which write a mask register consume a physical register. Instructions that simply read a mask register (e.g., <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions using a writemask) do not consume a new physical mask register. <a href="#fnref:write" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:write:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:rename" role="doc-endnote">
<p>More renaming domains makes things easier on the renamer for a given number of input registers. That is, it is easier to rename 2 <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and 2 <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> input registers (separate domains) than 4 <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> registers. <a href="#fnref:rename" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:prrt" role="doc-endnote">
<p>This is either the <em>Physical Register Reclaim Table</em> or <em>Post Retirement Reclaim Table</em> depending on who you ask. <a href="#fnref:prrt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:simply" role="doc-endnote">
<p>Of course, it is not actually so simple. For one, you now need to track these “move elimination sets” (sets of registers all pointing to the same physical register) in order to know when the physical register can be released (once the set is empty), and these sets are themselves a limited resource which must be tracked. Flags introduce another complication since flags are apparently stored along with the destination register, so the presence and liveness of the flags must be tracked as well. <a href="#fnref:simply" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:moves" role="doc-endnote">
<p>In particular, in the corresponding test for <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> registers (Test 7), the chart looks very different as move elimination reduces the <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> demand down to almost zero and we get to the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> limit. <a href="#fnref:moves" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:regreg" role="doc-endnote">
<p>Note that I am not restricting my statement to moves between two mask registers only, but any registers. That is, moves between a <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> register and a mask register are also not eliminated (the latter fact is obvious if you consider that they use distinct register files, so move elimination seems impossible). <a href="#fnref:regreg" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:zero" role="doc-endnote">
<p>Probably by pointing the entry in the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> to a fixed, shared zero register, or setting a flag in the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> that indicates it is zero. <a href="#fnref:zero" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:otherzero" role="doc-endnote">
<p>Although <code class="language-plaintext highlighter-rouge">xor</code> is the most reliable, other idioms may be recognized as zeroing or dependency breaking idioms by some CPUs as well, e.g., <code class="language-plaintext highlighter-rouge">sub reg,reg</code> and even <code class="language-plaintext highlighter-rouge">sbb reg, reg</code> which is not a zeroing idiom, but rather sets the value of <code class="language-plaintext highlighter-rouge">reg</code> to zero or -1 (all bits set) depending on the value of the carry flag. This doesn’t depend on the value of <code class="language-plaintext highlighter-rouge">reg</code> but only the carry flag, and some CPUs recognize that and break the dependency. Agner’s <a href="https://www.agner.org/optimize/#manual_microarch">microarchitecture guide</a> covers the <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr>-dependent support for these idioms very well. <a href="#fnref:otherzero" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:notall" role="doc-endnote">
<p>Note that only the two source registers really need to be the same: if <code class="language-plaintext highlighter-rouge">kxorb k1, k1, k1</code> is treated as zeroing, I would expect the same for <code class="language-plaintext highlighter-rouge">kxorb k1, k2, k2</code>. <a href="#fnref:notall" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:runit" role="doc-endnote">
<p>Run all the tests in this section using <code class="language-plaintext highlighter-rouge">./uarch-bench.sh --test-name=avx512/*</code>. <a href="#fnref:runit" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fyiuops" role="doc-endnote">
<p>This is why uops.info reports the latency for both <code class="language-plaintext highlighter-rouge">kmov r32, k</code> and <code class="language-plaintext highlighter-rouge">kmov k, r32</code> as <code class="language-plaintext highlighter-rouge"><= 3</code>. They know the pair takes 4 cycles in total and under the assumption that each instruction must take <em>at least</em> one cycle the only thing you can really say is that each instruction takes at most 3 cycles. <a href="#fnref:fyiuops" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tech" role="doc-endnote">
<p>Technically, I only tested the xor zeroing idiom, but since that’s the ground-zero, most basic idiom we can be pretty sure nothing else will be recognized as zeroing. I’m open to being proven wrong: the code is public and easy to modify to test whatever idiom you want. <a href="#fnref:tech" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>