Beating up on qsort
Recently, Daniel Lemire tackled the topic of selecting N distinct numbers at random. In the case we want sorted output, an obvious solution presents itself: sorting randomly chosen values and deduplicating the list, which is easy since identical values are now adjacent.^{1}
While Daniel suggests a clever method of avoiding a sort entirely^{2}, I’m also interested in they why for the underlying performace of the sort method: it takes more than 100 ns per element, which means 100s of CPU clock cycles and usually even more instructions than that (on a superscalar processor)! As a sanity check, a quick benchmark (perf record ./bench && perf report
) shows that more than 90% of the time spent in this approach is in the sorting routine, qsort  so we are right to focus on this step, rather than say the deduplication step or the initial random number generation. This naturally, this raises the question: how fast is qsort when it comes to sorting integers and can we do better?
All of the code for this post is available on GitHub, so if you’d like to follow along with the code open in an editor, go right ahead (warning: there are obviously some spoilers if you dig through the code first).
Benchmarking Qsort
First, let’s take a look at what qsort
is doing, to see if there is any delicous lowhanging performance fruit. We use perf record ./bench qsort
to capture profiling data, and perf report stdio
to print a summary^{3}:
# Samples: 101K of event 'cycles:ppp'
# Event count (approx.): 65312285835
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..............................................
#
64.90% bench libc2.23.so [.] msort_with_tmp.part.0
21.45% bench bench [.] compare_uint64_t
8.65% bench libc2.23.so [.] __memcpy_sse2
0.87% bench libc2.23.so [.] __memcpy_avx_unaligned
0.83% bench bench [.] main
0.41% bench [kernel.kallsyms] [k] clear_page_erms
0.34% bench [kernel.kallsyms] [k] native_irq_return_iret
0.31% bench bench [.] bench_one
The assembly for the biggest offender, msort_with_tmp
looks like this^{4} :
Percent  Address  Disassembly

30.55 : 39200: mov rax,QWORD PTR [r15]
0.61 : 39203: sub rbp,0x1
0.52 : 39207: add r15,0x8
7.30 : 3920b: mov QWORD PTR [rbx],rax
0.39 : 3920e: add rbx,0x8
0.07 : 39212: test r12,r12
0.09 : 39215: je 390e0 ; merge finished
1.11 : 3921b: test rbp,rbp
0.01 : 3921e: je 390e0 ; merge finished
5.24 : 39224: mov rdx,QWORD PTR [rsp+0x8]
0.42 : 39229: mov rsi,r15
0.19 : 3922c: mov rdi,r13
6.08 : 3922f: call r14
0.59 : 39232: test eax,eax
3.52 : 39234: jg 39200
32.69 : 39236: mov rax,QWORD PTR [r13+0x0]
1.31 : 3923a: sub r12,0x1
1.01 : 3923e: add r13,0x8
1.09 : 39242: jmp 3920b <bsearch@@GLIBC_2.2.5+0x205b>
Depending on your level of assembly reading skill, it may not be obvious, but this is basically a classic merge routine: it is merging two lists by comparing the top elements of each list (pointed to by r13
and r15
), and then storing the smaller element (the line QWORD PTR [rbx],rax
) and loading the next element from that list. There are also two checks for termination (test r12,r12
and test rbp,rbp
). This hot loop corresponds directly to this code from glibc
(from the filemsort.c
^{5}) :
while (n1 > 0 && n2 > 0)
{
if ((*cmp) (b1, b2, arg) <= 0)
{
*(uint64_t *) tmp = *(uint64_t *) b1;
b1 += sizeof (uint64_t);
n1;
}
else
{
*(uint64_t *) tmp = *(uint64_t *) b2;
b2 += sizeof (uint64_t);
n2;
}
tmp += sizeof (uint64_t);
}
This loop suffers heavily from branch mispredictions, since the “which element is larger” branch is highly unpredictable (at least for randomlooking input data). Indeed, we see roughly 128 million mispredicts while sorting ~11 million elements: close to 12 mispredicts per element.
We also note the presence of the indirect call at the call r14
line. This corresponds to the (*cmp) (b1, b2, arg)
expression in the source: it is calling the user provided comparator function through a function pointer. Since the qsort()
code is compiled ahead of time and is found inside the shared libc binary, there is no chance that the comparator, passed as a function pointer, can be inlined.
The comparator function I provide looks like:
int compare_uint64_t(const void *l_, const void *r_) {
uint64_t l = *(const uint64_t *)l_;
uint64_t r = *(const uint64_t *)r_;
if (l < r) return 1;
if (l > r) return 1;
return 0;
}
which on gcc compiles to branchfree code:
mov rax,QWORD PTR [rsi]
mov edx,0xffffffff
cmp QWORD PTR [rdi],rax
seta al
movzx eax,al
cmovb eax,edx
ret
Note that the comparator has to redundantly load from memory the two locations to compare, something the merge loop already did (the merge loop reads them because it is responsible for moving the elements).
How much better could things get if we inline the comparator into the merge loop? That’s what we do in qsortinlined
^{6}, and here’s the main loop which now includes the comparator function^{7} :
0.07 : 401dc8: test rbp,rbp
0.66 : 401dcb: je 401e0c <void msort_with_tmp<CompareU64>(msort_param const*, void*, unsigned long, CompareU64)+0xbc>
3.51 : 401dcd: mov rax,QWORD PTR [r9]
5.00 : 401dd0: lea rdx,[rbx+0x8]
1.62 : 401dd4: mov rcx,QWORD PTR [rbx]
0.24 : 401dd7: lea r8,[r9+0x8]
6.96 : 401ddb: cmp rax,rcx
20.83 : 401dde: cmovbe r9,r8
8.88 : 401de2: cmova rbx,rdx
0.27 : 401de6: cmp rcx,rax
6.23 : 401de9: sbb r8,r8
0.74 : 401dec: cmp rcx,rax
4.93 : 401def: sbb rdx,rdx
0.24 : 401df2: not r8
6.69 : 401df5: add rbp,rdx
0.44 : 401df8: cmp rax,rcx
5.34 : 401dfb: cmova rax,rcx
5.96 : 401dff: add rdi,0x8
7.48 : 401e03: mov QWORD PTR [rdi0x8],rax
0.00 : 401e07: add r15,r8
0.71 : 401e0a: jne 401dc8 <void msort_with_tmp<CompareU64>(msort_param const*, void*, unsigned long, CompareU64)+0x78>
A key differences is that the loop is now basically branch free. There are still two conditional jumps, but they are both just checking for the termination condition (that one of the lists to merge is exhausted), so we expect this loop to be free of branch mispredictions other than the final iteration. Indeed, we measure with perf stat
that the misprediction rate has dropped from to close to 12 mispredicts per element to around 0.75 per element. The loop has only two loads and one store, so the memory access redundancy between the merge code and the comparator has been eliminated^{8}. Finally, the comparator does a threeway compare (returning distrinct results for <
, >
and ==
), but the merge code only needs a twoway compare (<=
or >
)  inlining the comparator manages to remove extra code associated with distinguishing the <
and ==
cases.
What’s the payoff? It’s pretty big:
The speedup hovers right around 1.77x. Note that this is much larger than simply eliminating all the time spent in the separate comparator function in the original version (about 17% of the time implying a speedup of 1.2x if all the function time disapeared). This is a good example of how inlining isn’t just about removing function call overhead but enabling further knock on optimizations which can have a much larger effect than just removing the overhead associated with function calls.
What about C++?
Short of copying the existing glibc (note: LGPL licenced) sorting code to allow inlining, what else can we do to speed things up? I’m writing in C++, so how about the C++ sort functions available in the <algorithm>
header? Unlike C’s qsort
which is generic by virtue of taking a function pointer and information about the object size, the C++ sort functions use templates to achieve genericity and so are implemented directly in header files. Since the sort code and the comparator are being compiler together, we expect the comparator to be easily inlined, and perhaps other optimizations may occur.
Without further ado, let’s just throw std::sort
, std::stable_sort
and std::partial_sort
into the mix:
The C++ sort functions, other than perhaps std::partial_sort
^{9}, put in a good showing. It is interesting that std::stable_sort
which has stricly more requirements on its implementation than std::sort
(i.e., any stable sort is also suitable for std::sort
) ends up faster. I rewrote this paragaph several times, since sometimes after a reboot stable_sort
was slower and sometimes it was faster (as shown above). When it was “fast” it had less than 2% branch mispredictions, and when it was slow it was at 15%. So perhaps there was some type of aliasing issue in the branch predictor which depends on the physical addresses assigned, which can vary from run to run, I’m not sure. See ^{10} for an old note from when std::stable_sort
was slower.
Can we do better?
So that’s as fast as it gets, right? We aren’t going to beat std::sort
or std::stable_sort
without a huge amount of effort, I think? After all, these are presumably highly optimized sorting routines written by the standard library implementors. Sure, we might expect to be able to beat qsort()
, but that’s mostly because of builtin disadvantages that qsort
has, lacking the ability to inline the comparator, etc.
Radix Sort Attempt 1
Well, one thing we can try is a noncomparison sort. We know we have integer keys, so why stick to comparing numbers pairwise  maybe we can use something like radix sort to stick them directly in their final location.
We can pretty much copy the description from the wikipedia article into C++ code that looks like this:
const size_t RADIX_BITS = 8;
const size_t RADIX_SIZE = (size_t)1 << RADIX_BITS;
const size_t RADIX_LEVELS = (63 / RADIX_BITS) + 1;
const uint64_t RADIX_MASK = RADIX_SIZE  1;
using queuetype = std::vector<uint64_t>;
void radix_sort1(uint64_t *a, size_t count)
{
for (size_t pass = 0; pass < RADIX_LEVELS; pass++) {
uint64_t shift = pass * RADIX_BITS;
std::array<queuetype, RADIX_SIZE> queues;
// copy each element into the appropriate queue based on the current RADIX_BITS sized
// "digit" within it
for (size_t i = 0; i < count; i++) {
size_t value = a[i];
size_t index = (value >> shift) & RADIX_MASK;
queues[index].push_back(value);
}
// copy all the queues back over top of the original array in order
uint64_t* aptr = a;
for (auto& queue : queues) {
aptr = std::copy(queue.begin(), queue.end(), aptr);
}
}
}
That’s about as simple as it gets. We decide to use one byte (i.e., radix256) as the size of our “digit” (although it’s easy to change by adjusting the RADIX_BITS
constant) and so we make 8 passes over our uint64_t
array from the least to most significant byte. At each pass we assign the current value to one of 256 “queues” (vectors in this case) based on the value of the current byte, and once all elements have been processed we copy each queue in order back to the original array. We’re done  the list is sorted.
How does it perform against the usual suspects?
Well it’s not terrible, and while it certainly has some issues at low element counts, it actaully squeezes into first place at 1,000,000 elements and is competitive at 100,000 and 10,000,000. Not bad for a dozen lines of code.[^rewrite]
A quick check of time ./bench Radix1
shows something interesting:
real 0m1.099s
user 0m0.552s
sys 0m0.548s
We spent about the same amount of time in the kernel (sys
time) as in user space. The other algorithms spend only a few lonely % in the kernel, and almost most of that is in the setup code, not in the actual sort.
A deeper look with perf record && perf report
shows:
# Samples: 4K of event 'cycles:ppp'
# Event count (approx.): 2858148287
#
# Overhead Command Shared Object Symbol
# ........ ....... ................... ................................................
#
29.02% bench bench [.] radix_sort1
26.16% bench libc2.23.so [.] __memmove_avx_unaligned
4.93% bench [kernel.kallsyms] [k] clear_page_erms
4.46% bench [kernel.kallsyms] [k] native_irq_return_iret
3.61% bench [kernel.kallsyms] [k] error_entry
3.04% bench [kernel.kallsyms] [k] swapgs_restore_regs_and_return_to_usermode
2.99% bench [kernel.kallsyms] [k] sync_regs
1.94% bench bench [.] main
1.91% bench [kernel.kallsyms] [k] get_page_from_freelist
1.64% bench libc2.23.so [.] __memcpy_avx_unaligned
1.59% bench [kernel.kallsyms] [k] release_pages
1.40% bench [kernel.kallsyms] [k] __handle_mm_fault
1.37% bench [kernel.kallsyms] [k] _raw_spin_lock
1.01% bench [kernel.kallsyms] [k] __pagevec_lru_add_fn
0.88% bench [kernel.kallsyms] [k] handle_mm_fault
0.86% bench [kernel.kallsyms] [k] __alloc_pages_nodemask
0.83% bench [kernel.kallsyms] [k] unmap_page_range
0.75% bench [kernel.kallsyms] [k] try_charge
0.74% bench [kernel.kallsyms] [k] get_mem_cgroup_from_mm
0.70% bench bench [.] bench_one
0.63% bench [kernel.kallsyms] [k] __do_page_fault
0.63% bench [kernel.kallsyms] [k] __mod_zone_page_state
0.49% bench [kernel.kallsyms] [k] free_pcppages_bulk
0.45% bench [kernel.kallsyms] [k] page_add_new_anon_rmap
0.43% bench [kernel.kallsyms] [k] up_read
0.40% bench [kernel.kallsyms] [k] page_remove_rmap
0.36% bench [kernel.kallsyms] [k] __mod_node_page_state
I’m not even going to try to explain what __pagevec_lru_add_fn
does, but the basic idea here is that we are spending a lot of time in the kernel, and we are doing that because we are allocating and freeing a lot of memory. Every pass we push_back
every element into one of 256 vectors, which will be constantly growing to accomodate new elements, and then finally all the nowgiant vectors are freed at the end of every allocation. That’s a lot of stress on the memory allocation paths in the kernel^{11}.
Radix Sort Attempt 2
Let’s try the firstthingyoudowhenvectorisinvolvedandperformancematters; that is, let us reserve()
memory for each vector before we start adding elements. Just throw this at the start of each pass:
for (auto& queue : queues) {
queue.reserve(count / RADIX_SIZE * 1.2);
}
Here, 1.2
is an arbitrary fudge factor to account for the fact that some vectors will get more than the average number of elements. The exact value doesn’t matter much as long as it’s not too small (0.9 is a bad value, almost evey vector needs a final doubling). This gives use radix_sort2
and let’s jump straight to the results (I’ve removed a couple of the less interesting sorts to reduce clutter):
I guess it’s a bit better? It does better for small array sizes, probably because the overhead of constantly resizing the small vectors is more significant there, but it is actually a bit slower for the middle sizes. System time is lower but still quite high:
real 0m0.904s
user 0m0.523s
sys 0m0.380s
What we really want is to stop throwing away the memory we allocated every pass. Let’s move the queues outside of the loop and just clear them every iteration:
void radix_sort3(uint64_t *a, size_t count)
{
using queuetype = std::vector<uint64_t>;
std::array<queuetype, RADIX_SIZE> queues;
// we keep the reservation code (now outside the loop),
// although it matters less now since the resizing will
// generally only happen in the first iteration
for (auto& queue : queues) {
queue.reserve(count / RADIX_SIZE * 1.2);
}
for (size_t pass = 0; pass < RADIX_LEVELS; pass++) {
// ... as before
// copy all the queues back over top of the original array in order
uint64_t* aptr = a;
for (auto& queue : queues) {
aptr = std::copy(queue.begin(), queue.end(), aptr);
queue.clear(); // < this is new, clear the queues
}
}
}
Yes another graph with Radix3 included this time:
That looks a lot better! This radix sort is always faster than our earlier attempts and the fastest overall for sizes 10,000 and above. It still falls behind the std::
algorithms for the 1,000 element size, where the O(n)
vs O(n*log(n))
difference doesn’t play as much of a role. Despite this minor victory, and while system is reduced, we are still spending 30% of our time in the kernel:
real 0m0.612s
user 0m0.428s
sys 0m0.184s
Presizing the Queues
Sorting should be about my code, not the kernel  so let’s get rid of this kernel time for good.
To do that, we’ll move away from std::vector
entirely and just allocate one large temporary region for all of our queues. Although we know the total final size of all the queues (it’s the same size as the input array), we don’t know how big any particular queue will be. This means we don’t know exactly how to divide up the region. A wellknown solution to this problem is to first count the number of number of values that will fall into each queue so they can be sized appropriately (also known as taking the histogram of the data). As a bonus, we can count the frequencies for all radix passes in a single trip over the data, so we expect this part to be much cheaper than the radix sort proper which needs a separate pass for each “digit”.
Knowing the size of each queue allows us to pack all the values exactly within a single temporary region. The copy at the end of each stage is just a single linear copy. The code is longer now since we need to implement the frequency counting gives us radix_sort4. Results:
It’s a significant speedup over Radix 3, especially at small sizes (speedup about 3x) but still good at large sizes (about 1.3x for 10m elements). The speedup over poor qsort
ranges from 3.7x to 5.45x, increasing at larger sizes. Even compared to the best contented from the standard library, std::stable_sort
, the speedup averages about 2x.
Are We Done Yet?
So are we done yet? Can we squeeze out some more performance?
One little trick is to note that the temporary “queue area” and the original array are now of the same size and type, so rather than always performing the radix passes from the original array to the temporary area (which requires a copy back to the original array each time), we can instead copy back and forth between these two areas, alternating the “from” and “to” areas each time. This saves a copy each pass.
The results in radix_sort5 and it provides a small but measurable benefit:
It’s also interesting how small the improvement is. This change actually cuts the memory bandwidth requirements of the algorithm almost exactly in half: rather than reading and writing each element twice during each pass (once during the sort and once in the final copy), we read them only once (it’s not exactly half because the single histogramming pass adds another read). Yet the overall speedup is small, in the range of 1.05x to 1.2x. From this we can conclude that we are not approaching the memory bandwidth limits in the radix passes.
There is a catch here: at the end of the sort, if we have done an odd number of passes, the final sorted results will be in the temporary area, not in the original array, so we need to copy back to the original array  but 1 extra copy is better than 8! In any case, with RADIX_BITS == 8
as we’ve chosen, there are an even number of copies, so this code never executes in our benchmark.
Pointless Work is Pointless
Another observation we can make is that for this input (and many inputs in the real world), many of the radix passes do nothing. All the input values are less than 40,000,000,000. In 64bit hex that looks like 0x00000009502F9000
 the top 28 bits are always zero. Any radix pass that uses these allzero bits is pointless: every element will be copied to the first queue entry, one by one: essentially it’s a slow, convoluted memcpy
.
We can simply skip these “trivial” passes by examining the frequency count: if all counts are zero except a single entry, the pass does nothing. This gives us radix_sort6, which ends up cutting out 3 of the 8 radix passes leading to performance like this (I’ve changed the scale to emphasize the faster algorithms as they were getting crowed down at the bottom):
In relative terms this provides a significant speedup ranging from 1.2x to 1.5x over radix_sort5
. The theoretical speedup from skipping 3 of the 8 passes is 1.6x, but we don’t achieve that because there is work outside of the core passes (counting the frequencies, for example) and also because the 3 trivial passes were actually slightly faster than the nontrivial ones because of better caching behavior.
Unpointless Prefetch
So how much more juice can we squeeze from this performance orange? Will this post ever come to an end? Has anyone even made it this far?
As it turns out we’re not done yet, and the next change is perhaps the easiest one yet, a oneliner. First, let us observe that the core radix sort loop does a linear read through the elements (very prefetch and cacheline locality friendly), and then a scattered store to one of 256 locations depending on the value. We might expect that those scattered stores are problematic, since we don’t expect the prefetcher to track 256 different streams. On the other hand, stores are somewhat “fire and forget”, because after we execute the store, they can just sit around in the store buffer for as long as needed while the associated cache lines are fetched: they don’t participate in any dependency chains^{12}. So maybe they aren’t causing a problem?
Let’s check that theory using the resource_stalls.sb
event, which tells us how many cycles we stalled store buffer was full, using this magical invocation:
for i in {3..8}; do s=$((10**$i)); rm f bench.o; echo "SIZE=$s, KB=$(($s*8/1000))"; make EFLAGS=DSIZE=$s; perf stat e cycles,instructions,resource_stalls.any,resource_stalls.sb ./bench Radix6; done
This tests a variety of different sizes and here’s typical output when the array to sort has 1 million elements (8 MB):
SIZE=1000000, KB=8000
...
Performance counter stats for './bench Radix6':
5,544,710,461 cycles
8,219,301,287 instructions # 1.48 insn per cycle
2,917,340,919 resource_stalls.any
2,454,399,834 resource_stalls.sb
out of 5.54 billion cycles, we are stalled because the store buffer is full in 2.45 billion of them. So … a lot.
One fix for this is a oneliner in the main radixsort loop:
for (size_t i = 0; i < count; i++) {
size_t value = from[i];
size_t index = (value >> shift) & RADIX_MASK;
*queue_ptrs[index]++ = value;
__builtin_prefetch(queue_ptrs[index] + 1); // < the magic happens here
}
That’s it. We prefetch the next plus one position in the queue after writing to the queue. 87.5% of the time this does nothing, since the next position is either in the same cache line (6 out of 8 times) or we already prefetched it the last time we wrote to this queue (1 out of 8 times).
The other 12.5% of the time it helps, producing results like this:
The speedup is zero at the smallest size (1000 elements, aka 8 KB) which fits in L1, but ranges between 1.31x and 1.45x as soon as the data set exceeds L1. In principle, I wouldn’t expect prefetch to help here: we expect it to help for loads if we can start the load early, but for stores, with a full store buffer the CPU can already pick from 50+ stores to start prefetching. That is, the CPU already knows what stores are coming becaue the store buffer is full of them. However, in practice, theory and practice are different and in particular Intel CPUs seem to struggle when loads that hit in L1 are interleaved with loads that don’t, especially with recent microcode^{13}.
That interleaved scenario will happen all the time with this type of scattered write pattern, and as noted in the earlier post, prefetch is one way of mitigating this. For the 8 MB working set we now have the following perf stat
results:
SIZE=1000000, KB=8000
...
Performance counter stats for './bench Radix7':
4,342,298,986 cycles
8,719,315,140 instructions # 2.01 insn per cycle
1,555,574,009 resource_stalls.any
623,342,154 resource_stalls.sb
1.675302704 seconds time elapsed
About a fourfold reduction in storebuffer stalls. Reductions are even more dramatic for other sizes: the 100,000 element (8 KB) size has an even larger reduction, from being stalled 43% of the time down to 5% after prefetching is added.
What’s Next?
What’s next is that I have to eat. So while we’re not done here yet, the remaining part of this trip down the radix sort hole will have to wait for part 2.

Note that wanting distinct values here is key. Without that restriction, the problem is simple: just use the output of any decent random number generator. Returning only distinct entries is trickier: you have to find a way to avoid or remove duplicate elements. Well it’s not all that tricky: you can simply remember existing elements using a
std::set
and reject any duplicate elements. What is tricky is doing it quickly. ↩ 
In particular, the method he suggests picks random values oneatatime, already in sorted order, deciding on the gap to the next number by drawing randomly from a geometric distribution. Read his post for full details, but the one sentence summary is that Dan finds that the geometric distribution approach has favorable performace results compared to the sorted one (more than 2x as fast). Both this method and the sorting method have the little problem that the resulting list may be smaller than requested. For example in the case of the sorting method, the output after deduplication is smaller if there are any collisions (and due to the Birthday Problem that is a lot likelier than you might think). One could cope with this by generating slightly more numbers than required, and then removing randomly selected elements until the list reaches its desired size. ↩

There is a long tail of other functions here, but they add up to only about 2% of the runtime, so you can safely ignore them. ↩

I get the annotated assembly with
perfc annotate Mintel stdio symbol=msort_with_tmp.part.0
 this only shows assmebly because the function is in glibc and there are no debug symbols for that library on this host. In any case, assembly is probably what we want. ↩ 
msort.c
implements a mergesort algoirhtm. There is also a fileqsort.c
which implements a traditional partitionaroundapivot based quicksort, but it seems not used to implementqsort()
on recent gcc versions. ↩ 
It turns out that inlining the comparator is not simply a matter of compiling
msort.c
and the comparator in the same translation unit while still passing the comparator as a function pointer, so the compiler can see the definition. This doens’t pan out because msort (a) is a recursive function, which means unlimited inlining isn’t possible (so at some point there is an outofline call where the identity of the comparator will be lost) and (b) the comparator function is saved to memory (in themsort_param
struct) and used from there, which makes it harder for compilers to prove the comparator is always the one originally passed in. Instead, I use a template version of msort which takes the comparator of arbitrary typeC
, and in the case that a functor object is passed, the comparator is built right into the signature of the function, making inlining the comparator basically automatic for any compiler. ↩ 
We obtain this output with
perf record ./bench qsortinlined && perfc annotate Mintel stdio nosource symbol='msort_with_tmp<CompareU64>'
, if your version ofperf
supports demangling names (otherwise you’ll have to use the mangled name). ↩ 
Technically, there is still some redundancy since we load two elements every iteration, whereas we really only need to load one: the new element from whichever list an element was removed from. You can still do that in a branchfree way, but the transformation is aparently beyond the capability of the compiler. ↩

It’s hard to blame
std::partial_sort
here, after all it is a specialized sort for cases where you want need sort only a subset of the input to be sorted, e.g., the first 100 elements of a 100,000 element sequence. However, we can use it as a full sort simply by specifying that we want the full range, but one would not expect the algorithm to be optimized for that case. ↩ 
I had to edit this part of the blog entry because something weird happened: originally, my results always showed that
std::stable_sort
was faster thanstd::sort
across all input sizes. After installing a bunch of OS packages and restarting, however,std::stable_sort
performance came back down to earth, around 1.4x slower than before and now slower thanstd::sort
across all input sizes. I don’t know what changed. I did find that before the changestd::stable_sort
had very few (< 1%) branch mispredictions, while after it had many mispredictions (about 15%). ↩ 
This behavior actually depends pretty heavily on your memory allocator. The default glibc allocator I’m using likes to give memory allocations above a certain size back to the OS whenever they are freed, which means this workload turns into a
mmap
andmunmap
workout for the kernel. Using an allocator that wasn’t too worried about memory use and kept these pages around for its own use would result in a very different profile. ↩ 
Stores never participate in normal “register carried” dependency chains, since they do not write any registers, but they can still participate in chains through memory, e.g., if a store is followed by a load that reads the same location, the load depends on the store (and efficiently this dependency is handled depends on a lot on the hardware). This case doesn’t apply here because we don’t read the recently written queue locations any time soon: our stores are truly “fire and forget”. ↩

In fact, if you have an older microcode on your machine, you will see different results: there will be very little difference between Radix6 and Radix7 because Radix6 is considerably faster if you don’t update your microcode. ↩