Introduction

Concurrency is hard to get correct, at least for those of us unlucky enough to be writing in languages which expose directly the guts of concurrent hardware: threads and shared memory. Getting concurrency correct and fast is hard, too. Your knowledge about single-threaded optimization often won’t help you: at a micro (instruction) level we can’t simply apply the usual rules of μops, dependency chains, throughput limits, and so on. The rules are different.

If that first paragraph got your hopes up, this second one is here to dash them: I’m not actually going to do a deep dive into the very low level aspects of concurrent performance. There are a lot of things we just don’t know about how atomic instructions and fences execute, and we’ll save that for another day.

Instead, I’m going to describe a higher level taxonomy that I use to think about concurrent performance. We’ll group the performance of concurrent operations into six broad levels running from fast to slow, with each level differing from its neighbors by roughly an order of magnitude in performance.

I often find myself thinking in terms of these categories when I need high performance concurrency: what is the best level I can practically achieve for the given problem? Keeping the levels in mind is useful both during initial design (sometimes a small change in requirements or high level design can allow you to achieve a better level), and also while evaluating existing systems (to better understand existing performance and evaluate the path of least resistance to improvements).

A “Real World” Example

I don’t want this to be totally abstract, so we will use a real-world-if-you-squint1 running example throughout: safely incrementing an integer counter across threads. By safely I mean without losing increments, producing out-of-thin air values, frying your RAM or making more than a minor rip in space-time.

Source and Results

The source for every benchmark here is available, so you can follow along and even reproduce the results or run the benchmarks on your own hardware. All of the results discussed here (and more) are available in the same repository, and each plot includes a [data table] link to the specific subset used to generate the plot.

Hardware

All of the performance results are provided for several different hardware platforms: Intel Skylake, Ice Lake, Amazon Graviton and Graviton 2. However, except when I explicitly mention other hardware, the prose refers to the results on Skylake. Although the specific numbers vary, most of the qualitative relationships hold on the other hardware too, but not always. Not only does the hardware vary, but the OS and library implementations vary as well.

It’s almost inevitable that this will be used to compare across hardware (“wow, Graviton 2 sure kicks Graviton 1’s ass”), but that’s not my goal here. The benchmarks are written primarily to tease apart the characteristics of the different levels, and not as a hardware shootout.

Find below the details of the hardware used:

| Micro-architecture | ISA     | Model       | Tested Frequency | Cores | OS           | Instance Type |
| ------------------ | ------- | ----------- | ---------------- | ----- | ------------ | ------------- |
| Skylake            | x86     | i7-6700HQ   | 2.6 GHz          | 4     | Ubuntu 20.04 |               |
| Ice Lake           | x86     | i5-1035G4   | 3.3 GHz          | 4     | Ubuntu 19.10 |               |
| Graviton           | AArch64 | Cortex-A72  | 2.3 GHz          | 16    | Ubuntu 20.04 | a1.4xlarge    |
| Graviton 2         | AArch64 | Neoverse N1 | 2.5 GHz          | 16²   | Ubuntu 20.04 | c6g.4xlarge   |

Level 2: Contended Atomics

You’d probably expect this hierarchy to be introduced from fast to slow, or vice-versa, but we’re all about defying expectations here and we are going to start in the middle and work our way outwards. The middle (rounding down) turns out to be level 2 and that’s where we will jump in.

The most elementary way to safely modify any shared object is to use a lock. It mostly just works for any type of object, no matter its structure or the nature of the modifications. Almost any mainstream CPU from the last thirty years has some type of locking3 instruction accessible to userspace.

So our baseline increment implementation will use a simple mutex of type T to protect a plain integer variable:

T lock;            // T is the lock type under test (std::mutex to start)
uint64_t counter;  // protected by lock

void bench(size_t iters) {
    while (iters--) {
        std::lock_guard<T> holder(lock);
        counter++;
    }
}

We’ll call this implementation mutex add, and on my 4 CPU Skylake-S i7-6700HQ machine, when I use the vanilla std::mutex I get the following results for 2 to 4 threads:

The reported value is the median of all trials, and the vertical black error lines at the top of each bar indicate the interdecile range, i.e., the values at the 10th and 90th percentile. Where the error bars don’t show up, it means there is no difference between the p10 and p90 values at all, at least within the limits of the reporting resolution (100 picoseconds).

This shows that the baseline contended cost to modify an integer protected by a lock starts at about 125 nanoseconds for two threads, and grows somewhat with increasing thread count.

I can already hear someone saying: If you are just modifying a single 64-bit integer, skip the lock and just directly use the atomic operations that most ISAs support!

Sure, let’s add a couple of variants that do that. The std::atomic<T> template makes this easy: we can wrap any type meeting some basic requirements and then manipulate it atomically. The easiest of all is to use std::atomic<uint64_t>::operator++(int)4 and this gives us atomic add:

std::atomic<uint64_t> atomic_counter{};

void atomic_add(size_t iters) {
    while (iters--) {
        atomic_counter++;
    }
}

The other common approach would be to use compare and swap (CAS) to load the existing value, add one and then CAS it back if it hasn’t changed. If it has changed, the increment raced with another thread and we try again.

Note that even if you use increment at the source level, the assembly might actually end up using CAS if your hardware doesn’t support atomic increment5, or if your compiler or runtime just doesn’t take advantage of atomic operations even though they are available (e.g., see what even the newest version of icc does for atomic increment, and what Java did for years6). This caveat doesn’t apply to any of our tested platforms, however.

Let’s add a counter implementation that uses CAS as described above, and we’ll call it cas add:

std::atomic<uint64_t> cas_counter;

void cas_add(size_t iters) {
    while (iters--) {
        uint64_t v = cas_counter.load();
        while (!cas_counter.compare_exchange_weak(v, v + 1))
            ;
    }
}

Here’s what these look like alongside our existing std::mutex benchmark:

The first takeaway is that, at least in this unrealistic maximum contention benchmark, using atomic add (lock xadd at the hardware level) is significantly better than CAS. The second would be that std::mutex doesn’t come out looking all that bad on Skylake. It is only slightly worse than the CAS approach at 2 cores and beats it at 3 and 4 cores. It is slower than the atomic increment approach, but less than three times as slow and seems to be scaling in a reasonable way.

All of these operations belong to level 2 in the hierarchy. The primary characteristic of level 2 is that they make a contended access to a shared variable. This means that, at a minimum, the cache line containing the data needs to move out to the caching agent that manages coherency7, and then back up to the core that will receive ownership next. That’s about 70 cycles minimum just for that operation8.

Can it get slower? You bet it can. Way slower.

Level 3: System Calls

The next level up (“up” is not good here…) is level 3. The key characteristic of implementations at this level is that they make a system call on almost every operation.

It is easy to write concurrency primitives that make a system call unconditionally (e.g., a lock which always tries to wake waiters via a futex(2) call, even if there aren’t any), but we won’t look at those here. Rather we’ll take a look at a case where the fast path is written to avoid a system call, but the design or way it is used implies that such a call usually happens anyway.

Specifically, we are going to look at some fair locks. Fair locks allow threads into the critical section in the same order they began waiting. That is, when the critical section becomes available, the thread that has been waiting the longest is given the chance to take it.

Sounds like a good idea, right? Sometimes yes, but as we will see it can have significant performance implications.

On the menu are three different fair locks.

The first is a ticket lock with a sched_yield in the spin loop. The idea of the yield is to give other threads which may hold the lock time to run. This yield() approach is publicly frowned upon by concurrency experts9, who then sometimes go right ahead and use it anyway.

We will call it ticket yield and it looks like this:

/**
 * A ticket lock which uses sched_yield() while waiting
 * for the ticket to be served.
 */
class ticket_yield {
    std::atomic<size_t> dispenser{}, serving{};

public:
    void lock() {
        auto ticket = dispenser.fetch_add(1, std::memory_order_relaxed);

        while (ticket != serving.load(std::memory_order_acquire))
            sched_yield();
    }

    void unlock() {
        // only the lock holder ever writes serving, so a separate load and
        // store (rather than an atomic increment) is sufficient
        serving.store(serving.load() + 1, std::memory_order_release);
    }
};

Let’s plot the performance results for this lock alongside the existing approaches:

This is level 3 visualized: it is an order of magnitude slower than the level 2 approaches. The slowdown comes from the sched_yield call: this is a system call and these are generally on the order of 100s of nanoseconds10, and it shows in the results.

This lock does have a fast path where sched_yield isn’t called: if the lock is available, no spinning occurs and sched_yield is never called. However, the combination of being a fair lock and the high contention in this test means that a lock convoy quickly forms (we’ll describe this in more detail later) and so the spin loop is entered basically every time lock() is called.

So have we now fully plumbed the depths of slow concurrency constructs? Not even close. We are only now just about to cross the River Styx.

Revisiting std::mutex

Before we proceed, let’s quickly revisit the std::mutex implementation discussed in level 2 in light of our definition of level 3 as requiring a system call. Doesn’t std::mutex also make system calls? If a thread tries to lock a std::mutex object which is already locked, we expect that thread to block using OS-provided primitives. So why isn’t it level 3 and slow like ticket yield?

The primary reason is that it makes few system calls in practice. Through a combination of spinning and unfairness I measure only about 0.18 system calls per increment, with three threads on my Skylake box. So most increments happen without a system call. On the other hand, ticket yield makes about 2.4 system calls per increment, more than an order of magnitude more, and so it suffers a corresponding decrease in performance.

That out of the way, let’s get even slower.

Level 4: Implied Context Switch

The next level is when the implementation forces a significant number of concurrent operations to cause a context switch.

The yielding lock wasn’t resulting in many context switches, since we are not running more threads than there are cores, and so there usually is no other runnable process (except for the occasional background process). Therefore, the current thread stays on the CPU when we call sched_yield. Of course, this burns a lot of CPU.

As the experts recommend whenever one suggests yielding in a spin loop, let us try a blocking lock instead.

Blocking Locks

A more resource-friendly design, and one that will often perform better, is a blocking lock.

Rather than busy waiting, these locks ask the OS to put the current thread to sleep until the lock becomes available. On Linux, the futex(2) system call is the preferred way to accomplish this, while on Windows you have the WaitFor*Object API family. Above the OS interfaces, things like C++’s std::condition_variable provide a general purpose mechanism to wait until an arbitrary condition is true.

Our first blocking lock is again a ticket-based design, except this time it uses a condition variable to block when it detects that it isn’t first in line to be served (i.e., that the lock was held by another thread). We’ll name it ticket blocking and it looks like this:

void blocking_ticket::lock() {
    auto ticket = dispenser.fetch_add(1, std::memory_order_relaxed);

    if (ticket == serving.load(std::memory_order_acquire))
        return; // uncontended case

    std::unique_lock<std::mutex> lock(mutex);
    while (ticket != serving.load(std::memory_order_acquire)) {
        cvar.wait(lock);
    }
}

void blocking_ticket::unlock() {
    std::unique_lock<std::mutex> lock(mutex);
    auto s = serving.load(std::memory_order_relaxed) + 1;
    serving.store(s, std::memory_order_release);
    auto d = dispenser.load(std::memory_order_relaxed);
    assert(s <= d);
    if (s < d) {
        // wake all waiters
        cvar.notify_all();
    }
}

The main difference with the earlier implementation occurs in the case where we don’t acquire the lock immediately (we don’t return at the location marked // uncontended case). Instead of yielding in a loop, we take the mutex associated with the condition variable and wait until notified. Every time we are notified we check if it is our turn.

Even without spurious wakeups we might get woken many times, because this lock suffers from the thundering herd problem where every waiter is woken on unlock() even though only one will ultimately be able to get the lock.

We’ll try a second design too, that doesn’t suffer from thundering herd. This is a queued lock, where each lock waits on its own private node in a queue of waiters, so only a single waiter (the new lock owner) is woken up on unlock. We will call it queued fifo and if you’re interested in the implementation you can find it here.
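
Purely to make the idea concrete, here is a minimal sketch of a queued lock (this is not the benchmarked implementation, which lives in the repository): each waiter sleeps on its own private node with its own condition variable, so unlock() wakes exactly one thread.

#include <condition_variable>
#include <deque>
#include <mutex>

// Minimal sketch of a queued FIFO lock: each waiter has a private node, so
// unlock() wakes only the thread at the head of the queue.
class queued_fifo_sketch {
    struct waiter {
        std::condition_variable cv;
        bool ready = false;
    };
    std::mutex m;               // protects the queue and the locked flag
    std::deque<waiter*> queue;
    bool locked = false;

public:
    void lock() {
        std::unique_lock<std::mutex> g(m);
        if (!locked) {          // uncontended case
            locked = true;
            return;
        }
        waiter self;
        queue.push_back(&self);
        self.cv.wait(g, [&] { return self.ready; });
        // our predecessor handed the lock directly to us; 'locked' stays true
    }

    void unlock() {
        std::unique_lock<std::mutex> g(m);
        if (queue.empty()) {
            locked = false;
            return;
        }
        waiter* next = queue.front();  // hand off to the longest waiter only
        queue.pop_front();
        next->ready = true;
        next->cv.notify_one();
    }
};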

Here’s how our new locks perform against the existing crowd:

You’re probably seeing the pattern now: performance is again a new level of terrible compared to the previous contenders. About an order of magnitude slower than the yielding approach, which was already slower than the earlier approaches, which are now just slivers a few pixels high on the plots. The queued version of the lock does slightly better at increasing thread counts (especially on Graviton 2), as might be expected from the lack of the thundering herd effect, but is still very slow because the primary problem isn’t thundering herd, but rather a lock convoy.

Lock Convoy

Unlike unfair locks, fair locks can result in sustained convoys involving only a single lock, once the contention reaches a certain point11.

Consider what happens when two threads, A and B, try to acquire the lock repeatedly. Let’s say A gets ticket 1 and B ticket 2. So A gets to go first and B has to wait, and for these implementations that means blocking (we can say the thread is parked by the OS). Now, A unlocks the lock and sees B waiting and wakes it. A is still running and soon tries to get the lock again, receiving ticket 3, but it cannot acquire the lock immediately because the lock is fair: A can’t jump the queue and acquire the lock with ticket 3 before B, holding ticket 2, gets its chance to enter the lock.

Of course, B is going to be a while: it needs to be woken by the scheduler and this takes a microsecond or two, at least. Now B wakes and gets the lock, and the same scenario repeats itself with the roles reversed. The upshot is that there is a full context switch for each acquisition of the lock.

Unfair locks avoid this problem because they allow queue jumping: in the scenario above, A (or any other thread) could re-acquire the lock after unlocking it, before B got its chance. So the use of the shared resource doesn’t grind to a halt while B wakes up.

So, are you tired of seeing mostly-white plots where the newly introduced algorithm relegates the rest of the pack to little chunks of color near the x-axis, yet?

I’ve just got one more left on the slow end of the scale. Unlike the other examples, I haven’t actually diagnosed something this bad in real life, but examples are out there.

Level 5: Catastrophe

Here’s a ticket lock which is identical to the first ticket lock we saw, except that the sched_yield(); is replaced by ;. That is, it busy waits instead of yielding (and here are the spin flavors which specialize on a shared ticket lock template). You could also replace this by a CPU-specific “relax” instruction like pause, but it won’t change the outcome. We call it ticket spin.
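
As a sketch (the benchmarked code specializes a shared ticket lock template rather than duplicating the class), the spin variant differs from ticket yield only in the empty spin body:

class ticket_spin {
    std::atomic<size_t> dispenser{}, serving{};

public:
    void lock() {
        auto ticket = dispenser.fetch_add(1, std::memory_order_relaxed);

        while (ticket != serving.load(std::memory_order_acquire))
            ; // busy wait: no yield, no blocking
    }

    void unlock() {
        serving.store(serving.load() + 1, std::memory_order_release);
    }
};

Here’s how ticket spin performs compared to the existing candidates: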

What? That doesn’t look too bad at all. In fact, it is only slightly worse than the level 2 crew, the fastest we’ve seen so far12.

The picture changes if we show the results for up to 6 threads, rather than just 4. Since I have 4 available cores13, this means that not all the test threads will be able to run at once:

Now it becomes clear why this level is called catastrophic. As soon as we oversubscribe the number of available cores, performance gets about five hundred times worse. We go from 100s of nanoseconds to 100s of microseconds. I don’t show more threads, but it only gets worse as you add more.

We are also about an order of magnitude slower than the best solution (queued fifo) of the previous level, although it varies a lot by hardware: on Ice Lake the difference is more like forty times, while on Graviton this solution is actually slightly faster than ticket blocking (also level 4) at 17 threads. Note also the huge error bars: this is the least consistent benchmark of the bunch, exhibiting a lot of variance, and the slowest and fastest runs might vary by a factor of 100.

Lock Convoy on Steroids

So what happens here?

It’s similar to the lock convoy described above: all the threads queue on the lock and acquire it in a round-robin order due to the fair design. The difference is that threads don’t block when they can’t acquire the lock. This works out great when the cores are not oversubscribed, but falls off a cliff otherwise.

Imagine 5 threads, T1, T2, …, T5, where T5 is the one not currently running. As soon as T5 is the thread that needs to acquire the lock next (i.e., T5’s saved ticket value is equal to serving), nothing will happen, because T1 through T4 are busily spinning away waiting for their turn. The OS scheduler sees no reason to interrupt them until their time slice expires. Time slices are usually measured in milliseconds. Once one thread is preempted, say T1, T5 will get the chance to run, but at most 4 total acquisitions can happen (T5, plus any of T2, T3, T4) before it’s T1’s turn. T1 is waiting for its chance to run again, but since everyone is spinning this won’t occur until another time slice expires.

So the lock can only be acquired a few times (at most $(nproc) times), or as little as once14, every time slice. Modern Linux using CFS doesn’t have a fixed timeslice, but on my system, sched_latency_ns is 18,000,000 which means that we expect two threads competing for one core to get a typical timeslice of 9 ms. The measured numbers are roughly consistent with a timeslice of single-digit milliseconds.

If I was good at diagrams, there would be a diagram here.

Another way of thinking about this is that in this over-subscription scenario, the ticket spin lock implies roughly the same number of context switches as the blocking ticket lock15, but in the former case each context switch comes with a giant delay caused by the need to exhaust the timeslice, while in the blocking case we are only limited by how fast a context switch can occur.

Interestingly, although this benchmark uses 100% CPU on every core, the performance of the benchmark in the oversubscribed case almost doesn’t depend on your CPU speed! Performance is approximately the same whether I throttle my CPU to 1 GHz or enable turbo up to 3.5 GHz. All of the other implementations scale almost proportionally with CPU frequency. The benchmark does scale strongly with adjustments to sched_latency_ns (and sched_min_granularity_ns if the former is set low enough): lower scheduling latency values give proportionally better performance as the time slices shrink, helping to confirm our theory of how this works.

This behavior also explains the large amount of variance once the available cores are oversubscribed: by definition, not all threads will be running at once, so the test becomes very sensitive to exactly where the not-running threads took their context switch. At the beginning of the test, only 4 of the 6 threads will be running, and the other two will be switched out, still waiting on the barrier that synchronizes the test start. Since the two switched-out threads haven’t tried to get the lock yet, the four running threads will be able to quickly share the lock between themselves, since the six-thread convoy hasn’t been set up.

This runs up the “iteration count” (work done) during an initial period which varies randomly, until the first context switch lets the fifth thread join the competition and then the convoy gets set up16. That’s when the catastrophe starts. This makes the results very noisy: for example, if you set a too-short time period for a trial, the entire test is composed of this initial phase and the results are artificially “good”.

We can probably invent something even worse, but that’s enough for now. Let’s move on to scenarios that are faster than the use of vanilla atomic add.

Level 1: Uncontended Atomics

Recall that we started at level 2: contended atomics. The name gives it away: the next faster level is when atomic operations are used but there is no contention, either by design or by luck. You might have noticed that so far we’ve only shown results for at least two threads. That’s because the single threaded case involves no contention, and so every implementation so far is level 1 if run on a single thread17.

Here are the results for all the implementations we’ve looked at so far, for a single thread:

The fastest implementations run in about 10 nanoseconds, which is 5x faster than the fastest solution for 2 or more threads. The slowest implementation (queued fifo) for one thread ties the fastest implementation (atomic add) at two threads, and beats it handily at three or four.

The number overlaid on each bar is the number of atomic operations18 each implementation makes per increment. It is obvious that the performance is almost directly proportional to the number of atomic instructions. On the other hand, performance does not have much of a relationship with the total number of instructions of any type, which varies a lot even between algorithms with the same performance, as the following table shows:

| Algorithm       | Atomics | Instructions | Performance |
| --------------- | ------- | ------------ | ----------- |
| mutex add       | 2       | 64           | ~21 ns      |
| atomic add      | 1       | 4            | ~7 ns       |
| cas add         | 1       | 7            | ~12 ns      |
| ticket yield    | 1       | 13           | ~10 ns      |
| ticket blocking | 3       | 107          | ~32 ns      |
| queued fifo     | 4       | 167          | ~45 ns      |
| ticket spin     | 1       | 13           | ~10 ns      |
| mutex3          | 2       | 17           | ~20 ns      |

In particular, note that mutex add has more than 9x the number of instructions compared to cas add yet still runs at half the speed, in line with the 2:1 ratio of atomics. Similarly, ticket yield and ticket spin have slightly better performance than cas add despite having about twice the number of instructions, in line with them all having a single atomic operation19.

The last row in the table shows the performance of mutex3, an implementation we haven’t discussed. It is a basic mutex offering similar functionality to std::mutex and whose implementation is described in Futexes Are Tricky. Because it doesn’t need to pass through two layers of abstraction20, it has only about one third the instruction count of std::mutex, yet performance is almost exactly the same, differing by less than 10%.
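
For reference, the heart of a futex-based mutex along the lines described in that paper looks roughly like this (my paraphrase of the “mutex3” design, not the exact code benchmarked here): the state is 0 (unlocked), 1 (locked, no waiters) or 2 (locked, possibly with waiters), and the futex system calls happen only when there might actually be a waiter.

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Rough paraphrase of the "mutex3" design from Futexes Are Tricky, not the
// exact code benchmarked above. State: 0 = unlocked, 1 = locked without
// waiters, 2 = locked with (possible) waiters.
class futex_mutex {
    std::atomic<int> state{0};

    static long futex(std::atomic<int>* addr, int op, int val) {
        // assumes std::atomic<int> has the same layout as int (true on Linux)
        return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                       nullptr, nullptr, 0);
    }

public:
    void lock() {
        int c = 0;
        if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
            return; // fast path: lock was free, no syscall
        if (c != 2)
            c = state.exchange(2, std::memory_order_acquire);
        while (c != 0) {
            futex(&state, FUTEX_WAIT_PRIVATE, 2); // sleep only if still contended
            c = state.exchange(2, std::memory_order_acquire);
        }
    }

    void unlock() {
        if (state.fetch_sub(1, std::memory_order_release) != 1) {
            // there may be waiters: fully release and wake one of them
            state.store(0, std::memory_order_release);
            futex(&state, FUTEX_WAKE_PRIVATE, 1);
        }
    }
};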

So the idea that you can almost ignore things that are in a lower cost tier seems to hold here. Don’t take this too far: if you design a lock with a single atomic operation but 1,000 other instructions, it is not going to be fast. There are also reasons to keep your instruction count low other than microbenchmark performance: smaller instruction cache footprint, less space occupied in various out-of-order execution buffers, more favorable inlining tradeoffs, etc.

Here it is important to note that the change in level of our various functions didn’t require a change in implementation. These are exactly the same few implementations we discussed in the slower levels. Instead, we simply changed (by fiat, i.e., by adjusting the benchmark parameters) the contention level from “very high” to “zero”. So in this case the level doesn’t depend only on the code, but also on this external factor. Of course, just saying that we are going to get to level 1 by only running one thread is not very useful in real life: we often can’t simply ban multi-threaded operation.

So can we get to level 1 even under concurrent calls from multiple threads? For this particular problem, we can.

Adaptive Multi-Counter

One option is to use multiple counters to represent the counter value. We try to organize it so that threads running concurrently on different CPUs will increment different counters. Thus the logical counter value is split across all of these internal physical counters, and so a read of the logical counter value now needs to add together all the physical counter values.

Here’s an implementation:

class cas_multi_counter {
    static constexpr size_t NUM_COUNTERS = 64;

    static thread_local size_t idx;
    // multi_holder (definition not shown): a cache-line padded wrapper
    // around a std::atomic<uint64_t> member named counter
    multi_holder array[NUM_COUNTERS];

public:

    /** increment the logical counter value */
    uint64_t operator++(int) {
        while (true) {
            auto& counter = array[idx].counter;

            auto cur = counter.load();
            if (counter.compare_exchange_strong(cur, cur + 1)) {
                return cur;
            }

            // CAS failure indicates contention,
            // so try again at a different index
            idx = (idx + 1) % NUM_COUNTERS;
        }
    }

    uint64_t read() {
        uint64_t sum = 0;
        for (auto& h : array) {
            sum += h.counter.load();
        }
        return sum;
    }
};

We’ll call this cas multi, and the approach is relatively straightforward.

There are 64 padded21 physical counters whose sum makes up the logical counter value. There is a thread-local idx value, initially zero for every thread, that points to the physical counter that each thread should increment. When operator++ is called, we attempt to increment the counter pointed to by idx using CAS.

If this fails, however, we don’t simply retry. Failure indicates contention22 (this is the only way the strong variant of compare_exchange can fail), so we add one to idx to try another counter on the next attempt.

In a high-contention scenario like our benchmark, every CPU quickly ends up pointing to a different index value. If there is low contention, it is possible that only the first physical counter will be used.

Let’s compare this to the atomic add version we looked at above, which was the fastest of the level 2 approaches. Recall that it uses an atomic add on a single counter.

For 1 active core, the results are the same as we saw earlier: cas multi performs the same as the cas add algorithm23, which is somewhat slower than atomic add due to the need for an additional load (i.e., the line with counter.load()) to set up the CAS.

For 2 to 4 cores, the situation changes dramatically. The multiple counter approach performs the same regardless of the number of active cores. That is, it exhibits perfect scaling with multiple cores – in contrast to the single-counter approach which scales poorly. At four cores, the relative speedup of the multi-counter approach is about 9x. On Amazon’s Graviton ARM processor the speedup approaches eighty times at 16 threads.

This improvement in increment performance comes at a cost, however:

  • 64 counters ought to be enough for anyone, but they take 4096 (!!) bytes of memory to store what takes only 8 bytes in the atomic add approach24.
  • The read() method is much slower: it needs to iterate over and add all 64 values, versus a single load for the earlier approaches.
  • The implementation compiles to much larger code: 113 bytes versus 15 bytes for the single counter CAS approach or 7 bytes for the atomic add approach.
  • The concurrent behavior is considerably harder to reason about and document. For example, it is harder to explain the consistency condition provided by read() since it is no longer a single atomic read25.
  • There is a single thread-local idx variable. So while different cas_multi_counter instances are logically independent, the shared idx variable means that things that happen in one counter can affect the non-functional behavior of the others26.

Some of these downsides can be partly mitigated:

  • A much smaller number of counters would probably be better for most practical uses. We could also set the array size dynamically based on the detected number of logical CPUs since a larger array should not provide much of a performance increase. Better yet, we might make the size even more dynamic, based on contention: start with a single element and grow it only when contention is detected. This means that even on systems with many CPUs, the size will remain small if contention is never seen in practice. This has a runtime cost27, however.
  • We could optimize the read() method by stopping when we see a zero counter (a sketch of this appears after this list). I believe a careful analysis shows that the non-zero counter values for any instance of this class are all in a contiguous region starting from the beginning of the counter array28.
  • We could mitigate some of the code footprint by carefully carving the “less hot”29 slow path out into another function, and use our magic powers to encourage the small fast path (the first CAS) to be inlined while the fallback remains not inlined.
  • We could make the thread-local idx per instance specific to solve the “shared idx across all instances” problem. This does require some non-negligible amount of work to implement a dynamic TLS system which can create as many thread local keys as you want30, and it is slower.
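
For example, the early-exit read() for cas_multi_counter might look something like this (assuming, per the analysis above, that the non-zero counters form a contiguous prefix of the array):

uint64_t read() {
    uint64_t sum = 0;
    for (auto& h : array) {
        auto v = h.counter.load();
        if (v == 0)
            break; // later counters are only written after earlier ones, so the rest are zero too
        sum += v;
    }
    return sum;
}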

So while we got a good looking chart, this solution doesn’t exactly dominate the simpler ones. You pay a price along several axes for the lack of contention and you shouldn’t blindly replace the simpler solutions with this one – it needs to be a carefully considered and use-case dependent decision.

Is it over yet? Can I close this browser tab and reclaim all that memory? Almost. Just one level to go.

Level 0: Vanilla

The last and fastest level is achieved when only vanilla instructions are used (and without contention). By vanilla instructions I mean things like regular loads and stores which don’t imply additional synchronization above what the hardware memory model offers by default31.

How can we increment a counter atomically while allowing it to be read from any thread? By ensuring there is only one writer for any given physical counter. If we keep a counter per thread and only allow the owning thread to write to it, there is no need for an atomic increment.

The obvious way to keep a per-thread counter is to use thread-local storage. Something like this:

/**
 * Keeps a counter per thread, readers need to sum
 * the counters from all active threads and add the
 * accumulated value from dead threads.
 */
class tls_counter {
    std::atomic<uint64_t> counter{0};

    /* protects all_counters and accumulator */
    static std::mutex lock;
    /* list of all active counters */
    static std::vector<tls_counter *> all_counters;
    /* accumulated value of counters from dead threads */
    static uint64_t accumulator;
    /* per-thread tls_counter object */
    static thread_local tls_counter tls;

    /** add ourselves to the counter list */
    tls_counter() {
        std::lock_guard<std::mutex> g(lock);
        all_counters.push_back(this);
    }

    /**
     * destruction means the thread is going away, so
     * we stash the current value in the accumulator and
     * remove ourselves from the array
     */
    ~tls_counter() {
        std::lock_guard<std::mutex> g(lock);
        accumulator += counter.load(std::memory_order_relaxed);
        all_counters.erase(std::remove(all_counters.begin(), all_counters.end(), this), all_counters.end());
    }

    void incr() {
        auto cur = counter.load(std::memory_order_relaxed);
        counter.store(cur + 1, std::memory_order_relaxed);
    }

public:

    static uint64_t read() {
        std::lock_guard<std::mutex> g(lock);
        uint64_t sum = 0;
        for (auto h : all_counters) {
            sum += h->counter.load(std::memory_order_relaxed);
        }
        return sum + accumulator;
    }

    static void increment() {
        tls.incr();
    }
};

The approach is similar to the per-CPU counter, except that we keep one counter per thread, using thread_local. Unlike the earlier implementations, you don’t create instances of this class: there is only one counter and you increment it by calling the static method tls_counter::increment().

Let’s focus a moment on the actual increment inside the thread-local counter instance:

void incr() {
    auto cur = counter.load(std::memory_order_relaxed);
    counter.store(cur + 1, std::memory_order_relaxed);
}

This is just a verbose way of saying “add 1 to this std::atomic<uint64_t> but it doesn’t have to be atomic”. We don’t need an atomic increment as there is only one writer32. Using the relaxed memory order means that no barriers are inserted33. We still need a way to read all the thread-local counters, and the rest of the code deals with that: there is a global vector of pointers to all the active tls_counter objects, and read() iterates over this. All access to this vector is protected by a std::mutex, since it will be accessed concurrently. When threads die, we remove their entry from the array, and add their final value to tls_counter::accumulator which is added to the sum of active counters in read().

Whew.

So how does this tls add implementation benchmark?

That’s two nanoseconds per increment, regardless of the number of active cores. This turns out to be exactly as fast as just incrementing a variable in memory with a single instruction like inc [eax] or add [eax], 1, so in some sense it is as fast as possible for any solution which ends up incrementing something in memory34.

Let’s take a look at the number of atomics, total instructions and performance for the three implementations in the last plot, for four threads:

| Algorithm  | Atomics | Instructions | Performance |
| ---------- | ------- | ------------ | ----------- |
| atomic add | 1       | 4            | ~110 ns     |
| cas multi  | 1       | 7            | ~12 ns      |
| tls add    | 0       | 12           | ~2 ns       |

This is a clear indication that the difference in performance has very little to do with the number of instructions: the ranking by instruction count is exactly the reverse of the ranking by performance! tls add has three times the number of instructions, yet is more than fifty times faster (so the IPC varies by a factor of more than 150x).

As we saw at the last level, this improvement in performance doesn’t come for free:

  • The total code size is considerably larger than the per-CPU approach, although most of it is related to creation of the initial object on each thread, and not on the hot path.
  • We have one object per thread, instead of per CPU. For an application with many threads using the counter, this may mean the creation of many individual counters which both use more memory35 and result in a slower read() function.
  • This implementation only supports one counter: the key methods in tls_counter are static. This boils down to the need for a thread_local object for the physical counter, which must be static by the rules of C++. A template parameter could be added to allow multiple counters based on dummy types used as tags (see the sketch after this list), but this is still more awkward to use than instances of a class (and some platforms have limits on the number of thread_local variables). This limitation could be removed in the same way as discussed earlier for the cas multi idx variable, but at a cost in performance and complexity.
  • A lock was introduced to protect the array of all counters. Although the important increment operation is still lock-free, things like the read() call, the first counter access on a given thread and thread destruction all compete for the same lock. This could be eased with a read-write lock or a concurrent data structure, but at a cost as always.
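
Here is a rough sketch of the tag-type idea mentioned above (hypothetical code: it shows only the increment side and omits the per-thread registration machinery that read() needs):

#include <cstdint>

// Hypothetical sketch: each tag type gets its own instantiation of the
// template, and therefore its own static thread_local counter. The
// registration needed to implement read() (as in tls_counter) is omitted.
template <typename Tag>
class tagged_tls_counter {
    static thread_local uint64_t counter;
public:
    static void increment() { counter++; } // still a vanilla, single-writer increment
};

template <typename Tag>
thread_local uint64_t tagged_tls_counter<Tag>::counter = 0;

// two independent logical counters, distinguished only by their tag types
struct bytes_read_tag {};
struct bytes_written_tag {};

void example() {
    tagged_tls_counter<bytes_read_tag>::increment();
    tagged_tls_counter<bytes_written_tag>::increment();
}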

The Table

Let’s summarize all the levels in this table.

The ~Cost column is a very approximate estimate of the cost of each “occurrence” of the expensive operation associated with the level. It should be taken as a very rough ballpark for current Intel and AMD hardware, but especially the later levels can vary a lot.

The Perf Event column lists a Linux perf event that you can use to count the number of times the operation associated with this level occurs, i.e., the thing that is slow. For example, in level 1, you count atomic operations using the mem_inst_retired.lock_loads counter, and if you get three counts per high level operation, you can expect roughly 3 x 10 ns = 30 ns cost. Of course, you don’t necessarily need perf in this case: you can inspect the assembly too.

The Local column records whether the behavior of this level is core local. If yes, it means that operations on different cores complete independently and don’t compete and so the performance scales with the number of cores. If not, there is contention or serialization, so the throughput of the entire system is often limited, regardless of how many cores are involved. For example, only one core at a time performs an atomic operation on a cache line, so the throughput of the whole system is fixed and the throughput per core decreases as more cores become involved.

The Key Characteristic tries to get across the idea of the level in one bite-sized chunk.

| Level | Name               | ~Cost (ns) | Perf Event                        | Local | Key Characteristic                                |
| ----- | ------------------ | ---------- | --------------------------------- | ----- | ------------------------------------------------- |
| 0     | Vanilla            | low        | depends                           | Yes   | No atomic instructions or contended accesses at all |
| 1     | Uncontended Atomic | 10         | mem_inst_retired.lock_loads       | Yes   | Atomic instructions without contention             |
| 2     | True Sharing       | 40 - 400   | mem_load_l3_hit_retired.xsnp_hitm | No    | Contended atomics or locks                          |
| 3     | Syscall            | 1,000      | raw_syscalls:sys_enter            | No    | System call                                         |
| 4     | Context Switch     | 10,000     | context-switches                  | No    | Forced context switch                               |
| 5     | Catastrophe        | huge       | depends                           | No    | Stalls until quantum exhausted, or other sadness    |

So What?

What’s the point of all this?

Primarily, I use the hierarchy as a simplification mechanism when thinking about concurrency and performance. As a first order approximation you mostly only need to care about the operations related to the current level. That is, if you are focusing on something which has contended atomic operations (level 2), you don’t need to worry too much about uncontended atomics or instruction counts: just focus on reducing contention. Similarly, if you are at level 1 (uncontended atomics) it is often worth using more instructions to reduce the number of atomics.

This guideline only goes so far: if you have to add 100 instructions to remove one atomic, it is probably not worth it.

Second, when optimizing a concurrent system I always try to consider how I can get to a (numerically) lower level. Can I remove the last atomic? Can I avoid contention? Successfully moving to a lower level can often provide an order-of-magnitude boost to performance, so it should be attempted first, before finer-grained optimizations within the current level. Don’t spend forever optimizing your contended lock, if there’s some way to get rid of the contention entirely.

Of course, this is not always possible, or not possible without tradeoffs you are unwilling to make.

Getting There

Here’s a quick look at some usual and unusual ways of achieving levels lower on the hierarchy.

Level 4

You probably don’t want to really be in level 4 but it’s certainly better than level 5. So, if you still have your job and your users haven’t all abandoned you, it’s usually pretty easy to get out of level 5. More than half the battle is just recognizing what’s going on and from there the solution is often clear. Many times, you’ve simply violated some rule like “don’t use pure spinlocks in userspace” or “you built a spinlock by accident” or “so-and-so accidentally held that core lock during IO”. There’s almost never any inherent reason you’d need to stay in level 5 and you can usually find an almost tradeoff-free fix.

A better approach than targeting level 4 is just to skip to level 2, since that’s usually not too difficult.

Level 3

Getting to level 3 just means solving the underlying reason for so many context switches. In the example used in this post, it means giving up fairness. Other approaches include not using threads for small work units, using smarter thread pools, not oversubscribing cores, and keeping locked regions short.

You don’t usually really want to be in level 3 though: just skip right to level 2.

Level 2

Level 3 isn’t a terrible place to be, but you’ll always have that gnawing in your stomach that you’re leaving a 10x speedup on the table. You just need to get rid of that system call or context switch, bringing you to level 2.

Most library-provided concurrency primitives already avoid system calls on the happy path. E.g., pthreads mutexes, std::mutex and Windows CRITICAL_SECTION all avoid a system call while acquiring and releasing an uncontended lock. There are, however, some notable exceptions: if you are using a Win32 mutex object or a System V semaphore, you are paying a system call on every operation. Double check if you can use an in-process alternative in this case.

For more general synchronization purposes which don’t fit the lock-unlock pattern, a condition variable often fits the bill and a quality implementation generally avoids system calls on the fast path. A relatively unknown and higher performance alternative to condition variables, especially suitable for coordinating blocking for otherwise lock-free structures, is an event count. Paul’s implementation is available in concurrency kit and we’ll mention it again at Level 0.

System calls often creep in when home-grown synchronization solutions are used, e.g., using Windows events to build your own read-write lock or striped lock or whatever the flavor of the day is. You can often remove the call in the fast path by making a check in user-space to see if a system call is necessary. For example, rather than unconditionally unblocking any waiters when releasing some exclusive object, check to see if there are waiters36 in userspace and skip the system call if there are none.

If a lock is generally held for a short period, you can avoid unnecessary system calls and context switches with a hybrid lock that spins for an appropriate37 amount of time before blocking. This can trade tens of nanoseconds of spinning for hundreds or thousands of nanoseconds of system calls.
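
As a hypothetical sketch (the spin limit here is made up; real implementations tune or adapt it), a hybrid lock can be layered on top of any blocking lock:

#include <mutex>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

// Hypothetical sketch: spin briefly on try_lock() before paying for a
// potentially blocking lock(). kSpinLimit is a made-up number.
class hybrid_mutex {
    static constexpr int kSpinLimit = 100;
    std::mutex m;

public:
    void lock() {
        for (int i = 0; i < kSpinLimit; i++) {
            if (m.try_lock())
                return;        // acquired while spinning: no blocking needed
#if defined(__x86_64__) || defined(__i386__)
            _mm_pause();       // be polite to a sibling hyperthread while spinning
#endif
        }
        m.lock();              // slow path: may make a system call and block
    }

    void unlock() { m.unlock(); }
};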

Ensure your use of threads is “right sized” as much as possible. A lot of unnecessary context switches occur when many more threads are running than there are CPUs, and this increases the chance of a lock being held during a context switch (and makes it worse when it does happen: it takes longer for the holding thread to run again as the scheduler probably cycles through all the other runnable threads first).

Level 1

A lot of code that does the work to get to level 2 actually ends up in level 1. Recall that the primary difference between level 1 and 2 is the lack of contention in level 1. So if your process naturally or by design has low contention, simply using existing off-the-shelf synchronization like std::mutex can get you to level 1.

I can’t give a step-by-step recipe for reducing contention, but here’s a laundry list of things to consider:

  • Keep your critical sections as short as possible. Ensure you do any heavy work that doesn’t directly involve a shared resource outside of the critical section. Sometimes this means making a copy of the shared data to work on it “outside” of the lock, which might increase the total amount of work done, but reduce contention.
  • For things like atomic counters, try to batch your updates: e.g., if you update the counter multiple times during some operation, update a local on the stack rather than the global counter and only “upload” the entire value once at the end (see the sketch after this list).
  • Consider using structures that use fine-grained locks, striped locks or similar mechanisms that reduce contention by locking only portions of a container.
  • Consider per-CPU structures, as in the examples above, or some approximation of them (e.g., hashing the current thread ID into an array of structures). This post used an atomic counter as a simple example, but it applies more generally to any case where the mutations can be done independently and aggregated later.
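
For example, the batching idea for a counter might look like this (process_items and the surrounding names are hypothetical, just for illustration):

#include <atomic>
#include <cstdint>
#include <vector>

std::atomic<uint64_t> items_processed{}; // shared, possibly contended counter

// Hypothetical example: batch counter updates so the shared atomic is
// touched once per call rather than once per item.
void process_items(const std::vector<int>& items) {
    uint64_t done = 0;                   // plain local, no atomics
    for (int item : items) {
        // ... do the real work with 'item' here ...
        (void)item;
        done++;                          // cheap, core-local
    }
    items_processed.fetch_add(done, std::memory_order_relaxed); // one atomic RMW per batch
}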

For all of the advice above, when I say consider doing X I really mean consider finding and using an existing off-the-shelf component that does X. Writing concurrent structures yourself should be considered a last resort – despite what you think, your use case is probably not all that unique.

Level 1 is where a lot of well written, straightforward and high-performing concurrent code lives. There is nothing wrong with this level – it is a happy place.

Level 0

It is not always easy or possible to remove the last atomic access from your fast paths, but if you just can’t live with the extra ~10 ns, here are some options:

  • The general approach of using thread local storage, as discussed above, can also be extended to structures more complicated than counters.
  • You may be able to achieve fewer than one expensive atomic instruction per logical operation by batching: saving up multiple operations and then committing them all at once with a small fixed number of atomic operations. Some containers or concurrent structures may have a batched API which does this for you, but even if not you can sometimes add batching yourself, e.g., by inserting collections of elements rather than a single element38.
  • Many lock-free structures offer atomic-free read paths, notably concurrent containers in garbage collected languages, such as ConcurrentHashMap in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a hard problem, but there are still some good options out there.
  • I find that RCU is especially powerful and fairly general if you are using a garbage collected language, or can satisfy the requirements for an efficient reclamation method in a non-GC language.
  • The seqlock39 is an underrated and little known alternative to RCU without reclaim problems, although not as general. Concurrencykit has an implementation. It has an atomic-free read path for readers. Unfortunately, seqlocks don’t integrate cleanly with either the Java40 or C++ memory models.
  • It is also possible in some cases to do a per-CPU rather than a per-thread approach using only vanilla instructions, although the possibility of interruption at any point makes this tricky. Restartable sequences (rseq) can help, and there are other tricks lurking out there.
  • Event counts, mentioned earlier, can even be level 0 in a single writer scenario, as Paul shows.
  • This is the last point, but it should be the first: you can often redesign your algorithm or application to avoid sharing data in the first place, or to share much less. For example, rather than constantly updating a shared collection with intermediate results, do as much private computation as possible before merging only the final results.

Summary

We looked at the six different levels that make up this concurrency cost hierarchy. The slow half (3, 4 and 5) are all basically performance bugs. You should be able to achieve level 2 or level 1 (if you naturally have low contention) for most designs fairly easily and those are probably what you should target by default. Level 1 in a contended scenario and level 0 are harder to achieve and often come with difficult tradeoffs, but the performance boost can be significant: often one or more orders of magnitude.

Thanks

Thanks to Paul Khuong, who showed me something that made me reconsider in what scenarios level 0 is achievable, and for typo fixes.

Thanks to @never_released for help on a problem I had bringing up an EC2 bare-metal instance (tip: just wait).

Special thanks to matt_dz and Zach Wenger for helping fix about sixty typos between them.

Thanks to Alexander Monakov, Dave Andersen, Laurent and Kriau for reporting typos, and Aaron Jacobs for suggesting clarifications to the level 0 definition.

Traffic light photo by Harshal Desai on Unsplash.

Discussion and Feedback

You can leave a comment below or discuss on Hacker News, r/programming or r/cpp.

If you liked this post, check out the homepage for others you might enjoy.




  1. Well, this is quite real world: such atomic counters are used widely for a variety of purposes. I throw the if you squint in there because, after all, we are using microbenchmarks which simulate a probably-unrealistic density of increments to this counter, and it is a bit of a stretch to make this one example span all five levels – but I tried! 

  2. The Graviton 2 bare metal hardware has 64 cores, but this instance size makes 16 of them available. This means that in principle the results can be affected by the coherency traffic of other tenants on the same hardware, but the relatively stable results seem to indicate it doesn’t affect the results much. 

  3. Some hardware supports very limited atomic operations, which may be mostly useful only for locking, although you can get tricky

  4. We could actually use either the pre or post-increment version of the operator here. The usual advice is to prefer the pre-increment form ++c as it can be faster as it can return the mutated value, rather than making a copy to return after the mutation. Now this advice rarely applies to primitive values, but atomic increment is actually an interesting case which turns it on its head: the post-increment version is probably better (at least, never slower) since the underlying hardware operation returns the previous value. So it’s at least one extra operation to calculate the pre-increment value (or much worse, apparently, if icc gets involved). 

  5. Many ISAs, including POWER and ARM, traditionally only included support for a CAS-like or LL/SC operation, without dedicated support for more specific atomic arithmetic operations. The idea, I think, was that you could build any operation you want on top of these primitives, at the cost of “only” a small retry loop and that’s more RISC-y, right? This seems to be changing as ARMv8.1 got a bunch of atomic operations. 

  6. From its introduction through Java 7, the AtomicInteger and related classes in Java implemented all their atomic operations on top of a CAS loop, as CAS was the only primitive implemented as an intrinsic. In Java 8, almost exactly a decade later, these were finally replaced with dedicated atomic RMWs where possible, with good results

  7. On my system and most (all?) modern Intel systems this is essentially the L3 cache, as the caching home agent (CHA) lives in or adjacent to the L3 cache. 

  8. This doesn’t imply that each atomic operation needs to take 70 cycles under contention: a single core could do multiple operations on the cache line after it gains exclusive ownership, so the cost of obtaining the line could be amortized over all of these operations. How much of this occurs is a measure of fairness: a very fair CPU will not let any core monopolize the line for long, but this makes highly concurrent benchmarks like this slower. Recent Intel CPUs seem quite fair in this sense. 

  9. It also might not work how you think, depending on details of the OS scheduler. 

  10. They used to be cheaper: based on my measurements the cost of system calls has more than doubled, on most Intel hardware, after the Spectre and Meltdown mitigations have been applied. 

  11. An interesting thing about convoys is that they exhibit hysteresis: once you start having a convoy, they become self-sustaining, even if the conditions that started it are removed. Imagine two threads that lock a common lock for 1 nanosecond every 10,000 nanoseconds. Contention is low: the chance of any particular lock acquisition being contended is 0.01%. However, as soon as a contended acquisition occurs, the lock effectively becomes held for the amount of time it takes to do a full context switch (for the losing thread to block, and then to wake up). If that’s longer than 10,000 nanoseconds, the convoy will sustain itself indefinitely, until something happens to break the loop (e.g., one thread deciding to work on something else). A restart also “fixes” it, which is one of many possible explanations for processes that suddenly shoot to 100% CPU (but are still making progress), but can be fixed by a restart. Everything becomes worse with more than two threads, too. 

  12. Actually I find it remarkable that this performs about as well as the CAS-based atomic add, since the fairness necessarily implies that the lock is acquired in a round-robin order, so the cache line with the lock must at a minimum move around to each acquiring thread. This is a real stress test of the coherency arbitration mechanisms offered by the CPU. 

  13. And no SMT enabled, so there are 4 logical processors from the point of view of the OS. 

  14. In fact, the once scenario is the most likely, since one would assume with homogeneous threads the scheduler will approximate something like round-robin scheduling. So the thread that is descheduled is most likely the one that is also closest to the head of the lock queue, because it had been spinning the longest. 

  15. In fact, you can measure this with perf and see that the total number of context switches is usually within a factor of 2 for both tests, when oversubscribed. 

  16. There is another level of complication here: the convoy only gets set up when the fifth thread joins the fun if the thread that gets switched out had expressed interest in the lock before it lost the CPU. That is, after a thread unlocks the lock, there is a period before it gets a new ticket as it tries to obtain the lock again. Before it gets that ticket, it is essentially invisible to the other threads, and if it gets context switched out, the catastrophic convoy won’t be set up (because the new set of four threads will be able to efficiently share the lock among themselves). 

  17. This won’t always necessarily be the case. You could write a primitive that always makes a system call, putting it at level 3, even if there is no contention, but here I’ve made sure to always have a no-syscall fast path for the no-contention case. 

  18. On Intel hardware you can use details.sh to collect the atomic instruction count easily, taking advantage of the mem_inst_retired.lock_loads performance counter. 

  19. The cas add implementation comes off looking slightly worse than the other single-atomic implementations here because the load required to set up the CAS value effectively adds to the dependency chain involving the atomic operation, which explains the 5-cycle difference with atomic add. This goes away if you can do a blind CAS (e.g., in locks’ acquire paths), but that’s not possible here. 

  20. Actually three layers: libstdc++, then libgcc, and finally pthreads. I’ll count the first two as one though because those can all inline into the caller. Based on a rough accounting, probably 75% of the instruction count comes from pthreads, the rest from the other two layers. The pthreads mutexes are more general purpose than what std::mutex offers (e.g., they support recursion), and the features are configured at runtime on a per-mutex basis, so that explains a lot of the additional work these functions are doing. It’s only due to the cost of atomic operations that std::mutex doesn’t take a significant penalty compared to a more svelte design. 

  21. Padded means that the counters are aligned such that each falls into its own 64 byte cache line, to avoid false sharing. This means that even though each counter only has 8 bytes of logical payload, it requires 64 bytes of storage. Some people claim that you need to pad out to 128 bytes, not 64, to avoid the effect of the adjacent line prefetcher which fetches the 64-byte line that completes an aligned 128-byte pair of lines. However, I have not observed this effect often on modern CPUs. Maybe the prefetcher is conservative and doesn’t trigger unless past behavior indicates the fetches are likely to be used, or the prefetch logic can detect and avoid cases of false sharing (e.g., by noticing when prefetched lines are subsequently invalidated by a snoop). 

  22. Actually, not all failures indicate contention: there is also a small chance that a context switch falls exactly between the load and the subsequent CAS, in which case the CAS will fail when the thread is scheduled again if any thread that ran in the meantime updated the same counter. Treating this as contention doesn’t really cause any serious problems. 

  23. Not surprising, since there is no contention and the fast path looks the same for either algorithm: a single CAS that always succeeds. 

  24. Here I’m assuming that sizeof(std::atomic<uint64_t>) is 8, and this is the case on all current mainstream platforms. Also, you may or may not want to pad out the single-counter version to 64 bytes as well, to avoid some potential false sharing with nearby values, but this is different than the multi-counter case where padding is obligatory to avoid guaranteed false sharing. 

  25. In this limited case, I think read() provides the same guarantees as the single-counter case. Informally, read() returns some value that the counter had at some point between the start and end of the read() call. Formally, there is a linearization point within read() although this point can only be determined in retrospect by examining the returned value (unlike the single-counter approaches, where the linearization is clear regardless of the value). However, this is only true because the only mutating operation is increment(). If we also offered a decrement() method, this would no longer be true: you could read values that the logical counter never had based on the sequence of increments and decrements. Specifically, if you execute increment(); decrement(); increment() and even if you know these operations are strictly ordered (e.g., via locking), a concurrent call to read() could return 2, even though the counter never logically exceeded 1: a read() that sums the slots one at a time can observe both increments while missing the decrement, e.g., if it reads the decremented slot before the decrement lands. 

  26. In particular, if contention is seen on one object, the per-thread index will change to avoid it, which changes the index of all other objects as well, even if they have not seen any contention. This doesn’t seem like much of a problem for this simple implementation (which index we write to doesn’t matter much), but it could make some other optimizations more difficult: e.g., if we size the counter array dynamically, we don’t want to change the idx for uncontended objects, since that unnecessarily requires a larger counter array. 

  27. At least, an extra indirection to access the array which is no longer embedded in the object (see note 41), and checks to ensure the array is large enough. Furthermore, we have another decision to make: when to expand the array. How much contention should we suffer before we decide the array is too small? 

  28. The intuition is that later counter positions only get written when an earlier position failed a compare and swap, which necessarily implies it was written to by some other thread and hence non-zero. There is some subtlety here: this wouldn’t hold if compare_exchange_weak was used instead of compare_exchange_strong, and it more obviously wouldn’t apply if we allowed decrements or wanted to change the “probe” strategy. 
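
      A sketch of a read() that exploits this invariant (assuming a hypothetical fixed-size array of counter slots): once a zero slot is seen, no later slot can have been written, so the summation can stop early.

      ```cpp
      #include <atomic>
      #include <cstddef>
      #include <cstdint>

      uint64_t read_all(const std::atomic<uint64_t>* slots, std::size_t n) {
          uint64_t sum = 0;
          for (std::size_t i = 0; i < n; i++) {
              uint64_t v = slots[i].load(std::memory_order_relaxed);
              if (v == 0) break;  // invariant: later slots cannot have been written yet
              sum += v;
          }
          return sum;
      }
      ```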

  29. I’m not sure if “less hot” necessarily means __attribute__((cold)); that might be too cold. We mostly just want to separate the first-cas-succeeds case from the rest of the logic so we don’t pay the dynamic code size impact except when the fallback path is taken. 
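
      One possible shape, sketched here with made-up names (not the benchmark implementation): keep the single-CAS fast path small and inlinable, and push the fallback out of line, optionally marked cold.

      ```cpp
      #include <atomic>
      #include <cstdint>

      std::atomic<uint64_t> first_slot{0};  // stand-in for the first counter slot

      // Out-of-line fallback; cold may or may not be appropriate, as noted above.
      __attribute__((noinline, cold)) void increment_fallback() {
          first_slot.fetch_add(1);  // placeholder: the real fallback would probe other slots
      }

      // Hot path: the CAS that almost always succeeds, kept small and inlinable.
      inline void increment() {
          uint64_t expected = first_slot.load(std::memory_order_relaxed);
          if (!first_slot.compare_exchange_strong(expected, expected + 1))
              increment_fallback();
      }
      ```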

  30. A sketch of an implementation would be to use something like a single static thread_local pointer to an array or map, which maps an ID contained in the dynamic TLS key to the object data. Lookup speed is important, which favors an array, but you also need to be able to remove elements, which can favor some type of hash map. All of this is probably at least twice as slow as a plain thread_local access … or just use folly or boost. 
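
      A bare-bones version of that sketch might look like this (the names are made up, and a real implementation would need removal, sizing, and probably an array rather than a hash map):

      ```cpp
      #include <cstdint>
      #include <unordered_map>

      // One real thread_local per thread, holding this thread's slot for every object.
      struct per_thread_slots {
          std::unordered_map<uint64_t, uint64_t> slots;  // object id -> this thread's counter
      };

      inline uint64_t& slot_for(uint64_t object_id) {
          static thread_local per_thread_slots tls;
          return tls.slots[object_id];  // this lookup is the extra cost vs. plain thread_local
      }
      ```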

  31. On x86, what’s vanilla and what isn’t is fairly cut and dried: regular memory accesses and read-modify-write instructions are vanilla while LOCKed RMW instructions, whether explicit like lock inc or implicit like xchg [mem], reg, are not and are an order of magnitude slower. Out of the fences, mfence is also a slow non-vanilla instruction, comparable in cost to a LOCKed instruction. On other platforms like ARM or POWER, there may be shades of grey: you still have vanilla accesses on one end, and expensive full barriers like dmb on ARM or sync on POWER at the other, but you also have things in the middle with some additional ordering guarantees but still short of sequential consistency. This includes things like LDAR and LDAPR on ARM, which implement a sort of sliding scale of load ordering and performance. Still, on any given hardware, you might find that instructions generally fall into either a “cheap” (vanilla) or an “expensive” (non-vanilla) bucket. 
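
      To make the buckets concrete, here is roughly how a few std::atomic operations map to x86 instructions with gcc or clang (typical codegen; exact instruction selection can vary):

      ```cpp
      #include <atomic>
      #include <cstdint>

      std::atomic<uint64_t> x{0};

      void buckets() {
          uint64_t v = x.load(std::memory_order_relaxed);       // plain mov             (vanilla)
          x.store(v + 1, std::memory_order_relaxed);            // plain mov             (vanilla)
          x.fetch_add(1, std::memory_order_relaxed);            // lock add              (non-vanilla)
          x.exchange(1);                                        // xchg, implicit lock   (non-vanilla)
          std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence or a locked op (non-vanilla)
      }
      ```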

  32. The only reason we even need std::atomic<uint64_t> at all is because it is undefined behavior to have concurrent access to any variable if at least one access is a write. Since the owning thread is making writes, this would technically be a violation of the standard if there was a concurrent tls_counter::read() call. Most actual hardware has no problem with concurrent reads and writes like this, but it’s better to stay on the right side of the law. Some hardware could also exhibit tearing of the writes, and std::atomic guarantees this doesn’t happen. That is, the read and write are still individually atomic. 

  33. If you use the default std::memory_order_seq_cst, on x86 gcc inserts an mfence which makes this even slower than an atomic increment since mfence is generally slower than instructions with a lock prefix (it has slightly stronger barrier semantics). 

  34. This is a very small white lie. I’ll explain more elsewhere. 

  35. On the other hand, the TLS approach doesn’t need padding since the counters will generally appear next to other thread-local data and are not subject to false sharing, which means an 8x reduction (from 64 to 8 bytes) in the per-counter size. So if your process has a number of threads roughly equal to the number of cores, you will probably save memory over the per-CPU approach. 

  36. It’s easy to introduce a missed wakeup problem if this isn’t done correctly. The usual cause is a race: a waiter arrives at a lock-like thing, sees that it’s locked and then indicates interest, but in the window between that check and the act the owning thread released the lock and didn’t see any waiters. The waiter blocks but there is nobody to unblock them. These bugs often go undetected since the situation resolves itself as soon as another thread arrives, so in a busy system you might not notice the temporarily hung threads. The futex system call is basically designed to make solving this easy, while the Event stuff in Windows requires a bit more work (usually based on a compare-and-swap). 
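
      For the Linux case, a minimal futex wrapper shows why the race is easy to close (Linux-specific sketch; the helper names are mine): FUTEX_WAIT re-checks the value atomically in the kernel before sleeping, so a waiter can never sleep past a wakeup that raced with its check.

      ```cpp
      #include <atomic>
      #include <linux/futex.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      // Sleep only if *addr still equals expected; if a waker already changed the
      // value (and called FUTEX_WAKE), this returns immediately instead of hanging.
      inline void futex_wait(std::atomic<int>* addr, int expected) {
          syscall(SYS_futex, addr, FUTEX_WAIT, expected, nullptr, nullptr, 0);
      }

      inline void futex_wake_one(std::atomic<int>* addr) {
          syscall(SYS_futex, addr, FUTEX_WAKE, 1, nullptr, nullptr, 0);
      }
      ```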

  37. An “appropriate” time is probably something like the typical runtime of the locked region. Basically you want to spin in any case where the lock is held by a currently running thread, which will release it soon. As soon as you’ve been spinning for more than the typical hold time of the lock, it becomes much more likely you are simply waiting for a lock held by a thread that is not running (e.g., it was unlucky enough to incur a context switch while it held the lock). In that case, you are better off sleeping. 
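
      In code, the heuristic is roughly the following sketch (the lock interface and SPIN_LIMIT are assumptions; SPIN_LIMIT would be tuned to approximate the typical hold time):

      ```cpp
      // Spin briefly in case the owner is running and about to release, then block.
      template <typename Lock>
      void acquire(Lock& lock) {
          constexpr int SPIN_LIMIT = 100;  // assumed value; tune to the typical hold time
          for (int i = 0; i < SPIN_LIMIT; i++) {
              if (lock.try_lock())         // fast path: the lock was (or became) free
                  return;
              // a pause/yield hint to the CPU would go here (e.g., pause on x86)
          }
          // Spun longer than a typical hold: the owner is probably descheduled, so sleep.
          lock.lock_blocking();            // hypothetical blocking slow path (futex, etc.)
      }
      ```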

  38. An interesting design point is a data type that implements batching internally behind an API offering single-element operations. For example, a queue might decide that added elements won’t be immediately consumed (because there are already some elements in the queue), and hold them in a local staging area until several can be added as a batch, or until their absence would be noticed. 

  39. Despite the current claim on Wikipedia that seqlocks are somehow a Linux-specific construct involving the kernel, they work great in purely userspace code and are not tied to Linux. It is likely they were not invented for use in Linux but pre-dated the OS, although maybe the use in Linux was where the name seqlock first appeared? 

  40. Java does provide StampedLock which offers seqlock functionality. 

  41. Of course, we could go one step further and embed a small array of 1 or 2 elements in the counter object itself, in the hope that this is enough, and only fall back to a dynamically allocated array (and the additional indirection it implies) if we observe contention.
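
      As a layout-only sketch (field names are illustrative): a couple of embedded slots handle the common case, and the heap-allocated overflow array, with its extra indirection, exists only if contention ever demands it.

      ```cpp
      #include <atomic>
      #include <cstdint>

      struct small_multi_counter {
          std::atomic<uint64_t> inline_slots[2] = {{0}, {0}};     // embedded: no extra indirection
          std::atomic<std::atomic<uint64_t>*> overflow{nullptr};  // allocated lazily on contention
      };
      ```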