On Twitter, Paul Khuong asks: Has anyone done a study on the distribution of interrupt points in OOO processors?

Personally, I’m not aware of any such study for modern x86, and I have also wondered the same thing. In particular, when a CPU receives an externally triggered1 interrupt, at what point in the instruction stream is the CPU interrupted?

For a simple 1-wide in-order, non-pipelined CPU the answer might be as simple as: the CPU is interrupted either before or after instruction that is currently running2. For anything more complicated it’s not going to be easy. On a modern out-of-order processor there may be hundreds of instructions in-flight at any time, some waiting to execute, a dozen or more currently executing, and others waiting to retire. From all these choices, which instruction will be chosen as the victim?

Among other reasons, the answer is interesting because it helps us understand how useful the exact interrupt position is when profiling via interrupt: can we extract useful information from the instruction position, or should we only trust it at a higher level (e.g., over regions of say 100s of instructions).

So let’s go figure out how interruption works, at least on my Skylake i7-6700HQ, by compiling a bunch of small pure-asm programs and running them. The source for all the tests is available in the associated git repo so you can follow along or write your own tests. All the tests are written in assembly because we want full control over the instructions and because they are all short and simple. In any case, we can’t avoid assembly-level analysis when talking about what instructions get interrupted.

First, let’s take a look at some asm that doesn’t have any instruction that sticks out in any way at all, just a bunch of mov instructions3. The key part of the source looks like this4:

.loop:

%rep 10
	mov  eax, 1
	mov  ebx, 2
	mov  edi, 3
	mov  edx, 4
	mov  r8d, 5
	mov  r9d, 6
	mov r10d, 7
	mov r11d, 8
%endrep

	dec rcx
	jne .loop

Just constant moves into registers, 8 of them repeated 10 times. This code executes with an expected and measured5 IPC of 4.

Next, we get to the meat of the investigation. We run the binary using perf record -e task-clock ./indep-mov, which will periodically interrupt the process and record the IP. Next, we examine the interrupted locations with perf report6. Here’s the output (hereafter, I’m going to cut out the header and just show the samples):

Samples | Source code & Disassembly of indep-mov for task-clock (1769 samples, percent: local period) ----------------------------------------------------------------------------------------------------------- : : Disassembly of section .text: : : 00000000004000ae <_start.loop>: : _start.loop(): : indep-mov.asm:15 16 : 4000ae: mov eax,0x1 15 : 4000b3: mov ebx,0x2 22 : 4000b8: mov edi,0x3 25 : 4000bd: mov edx,0x4 14 : 4000c2: mov r8d,0x5 19 : 4000c8: mov r9d,0x6 25 : 4000ce: mov r10d,0x7 18 : 4000d4: mov r11d,0x8 22 : 4000da: mov eax,0x1 24 : 4000df: mov ebx,0x2 20 : 4000e4: mov edi,0x3 29 : 4000e9: mov edx,0x4 28 : 4000ee: mov r8d,0x5 18 : 4000f4: mov r9d,0x6 21 : 4000fa: mov r10d,0x7 19 : 400100: mov r11d,0x8 26 : 400106: mov eax,0x1 18 : 40010b: mov ebx,0x2 29 : 400110: mov edi,0x3 19 : 400115: mov edx,0x4

The first column shows the number of interrupts received for each instruction. Specially, the number of times an instruction would be the next instruction to execute following the interrupt.

Without doing any deep statistical analysis, I don’t see any particular pattern here. Every instruction gets its time in the sun. Some columns have somewhat higher values than others, but if you repeat the measurements, the columns with higher values don’t necessarily repeat.

We can try the exact same thing, but with add instructions like this:

	add  eax, 1
	add  ebx, 2
	add  edi, 3
	add  edx, 4
	add  r8d, 5
	add  r9d, 6
	add r10d, 7
	add r11d, 8

We expect the execution behavior to be similar to the mov case: we do have dependency chains here but 8 separate ones (for each destination register) for a 1 cycle instruction so there should be little practical impact. Indeed, the results are basically identical to the last experiment so I won’t show them here (you can see them yourself with the indep-add test).

Let’s get moving here and try something more interesting. This time we will again use all add instructions, but two of the adds will depend on each other, while the other two will be independent. So the chain shared by those two adds will be twice as long (2 cycles) as the other chains (1 cycle each). Like this:

    add  rax, 1 ; 2-cycle chain
    add  rax, 2 ; 2-cycle chain
    add  rsi, 3
    add  rdi, 4

Here the chain through rax should limit the throughput of the above repeated block to 1 per 2 cycles, and indeed I measure an IPC of 2 (4 instructions / 2 cycles = 2 IPC).

Here’s the interrupt distribution:

0 : 4000ae: add rax,0x1 82 : 4000b2: add rax,0x2 112 : 4000b6: add rsi,0x3 0 : 4000ba: add rdi,0x4 0 : 4000be: add rax,0x1 45 : 4000c2: add rax,0x2 144 : 4000c6: add rsi,0x3 0 : 4000ca: add rdi,0x4 0 : 4000ce: add rax,0x1 44 : 4000d2: add rax,0x2 107 : 4000d6: add rsi,0x3 (pattern repeats...)

This is certainly something new. We see that all the interrupts fall on the middle two instructions, one of which is part of the addition chain and one which is not. The second of the two locations also gets about 2-3 times as many interrupts as the first.

A Hypothesis

Let’s make a hypothesis now so we can design more tests.

Let’s guess that interrupts select instructions which are the oldest unretired instruction, and that this selected instruction is allowed to complete hence samples fall on the next instruction (let us call this next instruction the sampled instruction). I am making the distinction between selected and sampled instructions rather than just saying “interrupts sample instructions that follow the oldest unretired instruction” because we are going to build our model almost entirely around the selected instructions, so we want to name them. The characteristics of the ultimately sampled instructions (except their positioning after selected instructions) hardly matters7.

Without a more detailed model of instruction retirement, we can’t yet explain everything we see - but the basic idea is instructions that take longer, hence are more likely to be the oldest unretired instruction, are the ones that get sampled. In particular, if there is a critical dependency chain, instructions in that chain are likely8 be sampled at some point9.

Let’s take a look at some more examples. I’m going to switch using mov rax, [rax] as my long latency instruction (4 cycles latency) and nop as the filler instruction not part of any chain. Don’t worry, nop has to allocate and retire just like any other instruction: it simply gets to skip execution. You can build all these examples with a real instruction like add and they’ll work in the same way10.

Let’s take a look at a load followed by 10 nops:

.loop:

%rep 10
    mov  rax, [rax]
    times 10 nop
%endrep

    dec rcx
    jne .loop

The result:

0 : 4000ba: mov rax,QWORD PTR [rax] 33 : 4000bd: nop 0 : 4000be: nop 0 : 4000bf: nop 0 : 4000c0: nop 11 : 4000c1: nop 0 : 4000c2: nop 0 : 4000c3: nop 0 : 4000c4: nop 22 : 4000c5: nop 0 : 4000c6: nop 0 : 4000c7: mov rax,QWORD PTR [rax] 15 : 4000ca: nop 0 : 4000cb: nop 0 : 4000cc: nop 0 : 4000cd: nop 13 : 4000ce: nop 1 : 4000cf: nop 0 : 4000d0: nop 0 : 4000d1: nop 35 : 4000d2: nop 0 : 4000d3: nop 0 : 4000d4: mov rax,QWORD PTR [rax] 16 : 4000d7: nop 0 : 4000d8: nop 0 : 4000d9: nop 0 : 4000da: nop 14 : 4000db: nop 0 : 4000dc: nop 0 : 4000dd: nop 0 : 4000de: nop 31 : 4000df: nop 0 : 4000e0: nop 0 : 4000e1: mov rax,QWORD PTR [rax] 22 : 4000e4: nop 0 : 4000e5: nop 0 : 4000e6: nop 0 : 4000e7: nop 16 : 4000e8: nop 0 : 4000e9: nop 0 : 4000ea: nop 0 : 4000eb: nop 24 : 4000ec: nop 0 : 4000ed: nop

The selected instructions are the long-latency mov-chain, but also two specific nop instructions out of the 10 that follow: those which fall 4 and 8 instructions after the mov. Here we can see the impact of retirement throughput. Although the mov-chain is the only thing that contributes to execution latency, this Skylake CPU can only retire up to 4 instructions per cycle (per thread). So when the mov finally retires, there will be two cycles of retiring blocks of 4 nop instructions before we get to the next mov:

;                            retire cycle
mov    rax,QWORD PTR [rax] ; 0 (execution limited)
nop                        ; 0
nop                        ; 0
nop                        ; 0
nop                        ; 1 <-- selected nop
nop                        ; 1
nop                        ; 1
nop                        ; 1
nop                        ; 2 <-- selected nop
nop                        ; 2
nop                        ; 2
mov    rax,QWORD PTR [rax] ; 4 (execution limited)
nop                        ; 4
nop                        ; 4

In this example, the retirement of mov instructions are “execution limited” - i.e., their retirement cycle is determined by when they are done executing, not by any details of the retirement engine. The retirement of the other instructions, on the other hand, is determined by the retirement behavior: they are “ready” early but cannot retire because the in-order retirement pointer hasn’t reached them yet (it is held up waiting for the mov to execute).

So those selected nop instructions aren’t particularly special: they aren’t slower than the rest or causing any bottleneck. They are simply selected because the pattern of retirement following the mov is predictable. Note also that it is the first nop that in the group of 4 that would otherwise retire that is selected11.

This means that we can construct an example where an instruction on the critical path never gets selected:

%rep 10
    mov  rax, [rax]
    nop
    nop
    nop
    nop
    nop
    add  rax, 0
%endrep

The results:

0 : 4000ba: mov rax,QWORD PTR [rax] 94 : 4000bd: nop 0 : 4000be: nop 0 : 4000bf: nop 0 : 4000c0: nop 17 : 4000c1: nop 0 : 4000c2: add rax,0x0 0 : 4000c6: mov rax,QWORD PTR [rax] 78 : 4000c9: nop 0 : 4000ca: nop 0 : 4000cb: nop 0 : 4000cc: nop 18 : 4000cd: nop 0 : 4000ce: add rax,0x0

The add instruction is on the critical path: it increases the execution time of the block from 4 cycles to 6 cycles12, yet it is never selected. The retirement pattern looks like:

                        /- scheduled
                       |  /- ready
                       |  |  /- complete
                       |  |  |  /- retired
    mov  rax, [rax]  ; 0  0  5  5 <-- selected
    nop              ; 0  0  0  5 <-- sample
    nop              ; 0  0  0  5
    nop              ; 0  0  0  5
    nop              ; 1  1  1  6 <-- selected
    nop              ; 1  1  1  6 <-- sampled
    add  rax, 0      ; 1  5  6  6
    mov  rax, [rax]  ; 1  6 11 11 <-- selected
    nop              ; 2  2  2 11 <-- sampled
    nop              ; 2  2  2 11
    nop              ; 2  2  2 11
    nop              ; 2  2  2 12 <-- selected
    nop              ; 3  3  3 12 <-- sampled
    add  rax, 0      ; 3 11 12 12
    mov  rax, [rax]  ; 3 12 17 17 <-- selected
    nop              ; 3  3  3 17 <-- sampled

On the right hand side, I’ve annotated each instruction with several key cycle values, described below.

The scheduled column indicates when the instruction enters the scheduler and hence could execute if all its dependencies were met. This column is very simple: we assume that there are no front-end bottlenecks and hence we schedule (aka “allocate”) 4 instructions every cycle. This part is in-order: instructions enter the scheduler in program order.

The ready column indicates when all dependencies of a scheduled instruction have executed and hence the instruction is ready to execute. In this simple model, an instruction always begins executing when it is ready. A more complicated model would also need to model contention for execution ports, but here we don’t have any such contention. Instruction readiness occurs out of order: you can see that many instructions become ready before older instructions (e.g., the nop instructions are generally ready before the preceding mov or add instructions). To calculate this column take the maximum of the ready column for this instruction and the completed column for all previous instructions whose outputs are inputs to this instruction.

The complete column indicates when an instruction finishes execution. In this model it simply takes the value of the ready column plus the instruction latency, which is 0 for the nop instructions (they don’t execute at all, so they have 0 effective latency), 1 for the add instruction and 5 for the mov. Like the ready column this happens out of order.

Finally, the retired column, what we’re really after, shows when the instruction retires. The rule is fairly simple: an instruction cannot retire until it is complete, and the instruction before it must be retired or retiring on this cycle. No more than 4 instructions can retire in a cycle. As a consequence of the “previous instruction must retired” part, this column is only increasing and so like the first column, retirement is in order.

Once we have the retired column filled out, we can identify the <-- selected instructions13: they are the ones where the retirement cycle increases. In this case, selected instructions are always either the mov instruction (because of its long latency, it holds up retirement), or the fourth nop after the mov (because of the “only retire 4 per cycle” rule, this nop is at the head of the group that retires in the cycle after the mov retires). Finally, the sampled instructions which are those will actually show up in the interrupt report are simply the instruction following each selected instruction.

Here, the add is never selected because it executes in the cycle after the mov, so it is eligible for retirement in the next cycle and hence doesn’t slow down retirement and so doesn’t behave much differently than a nop for the purposes of retirement. We can change the position of the add slightly, so it falls in the same 4-instruction retirement window as the mov, like this:

%rep 10
    mov  rax, [rax]
    nop
    nop
    add  rax, 0
    nop
    nop
    nop
%endrep

We’ve only slid the add up a few places. The number of instructions is the same and this block executes in 6 cycles, identical to the last example. However, the add instruction now always gets selected:

21 : 4000ba: mov rax,QWORD PTR [rax] 54 : 4000bd: nop 0 : 4000be: nop 0 : 4000bf: add rax,0x0 15 : 4000c3: nop 0 : 4000c4: nop 0 : 4000c5: nop 0 : 4000c6: mov rax,QWORD PTR [rax] 88 : 4000c9: nop 0 : 4000ca: nop 0 : 4000cb: add rax,0x0 14 : 4000cf: nop 0 : 4000d0: nop 0 : 4000d1: nop 0 : 4000d2: mov rax,QWORD PTR [rax] 91 : 4000d5: nop 0 : 4000d6: nop 0 : 4000d7: add rax,0x0 13 : 4000db: nop

Here’s the cycle analysis and retirement pattern for this version:

                        /- scheduled
                       |  /- ready
                       |  |  /- complete
                       |  |  |  /- retired
    mov  rax, [rax]  ; 0  0  5  5 <-- selected
    nop              ; 0  0  0  5 <-- sample
    nop              ; 0  0  0  5
    add  rax, 0      ; 0  5  6  6 <-- selected
    nop              ; 1  1  1  6 <-- sampled
    nop              ; 1  1  1  6
    nop              ; 1  1  1  6
    mov  rax, [rax]  ; 1  6 11 11 <-- selected
    nop              ; 2  2  2 11 <-- sampled
    nop              ; 2  2  2 11
    add  rax, 0      ; 2 11 12 12 <-- selected
    nop              ; 2  2  2 12 <-- sampled
    nop              ; 3  3  3 12
    nop              ; 3  3  3 12
    mov  rax, [rax]  ; 3 12 17 17 <-- selected
    nop              ; 3  3  3 17 <-- sampled

Now might be a good time to note that we also care about the actual sample counts, and not just their presence or absence. Here, the samples associated with the mov are more frequent than the samples associated with the add. In fact, there are about 4.9 samples for mov for every sample for add (calculated over the full results). That lines up almost exactly with the mov having a latency of 5 and the add a latency of 1: the mov will be the oldest unretired instruction 5 times as often as the add. So the sample counts are very meaningful in this case.

Going back to the cycle charts, we know the selected instructions are those where the retirement cycle increases. To that we add that the size of the increase determines their selection weight: the mov instruction has a weight of 5, since it jumps (for example) from 6 to 11 so it is the oldest unretired instruction for 5 cycles, while the nop instructions have a weight of 1.

This lets you measure in-situ the latency of various instructions, as long as you have accounted for the retirement behavior. For example, measuring the following block (rdx is zero at runtime):

    mov  rax, [rax]
    mov  rax, [rax + rdx]

Results in samples accumulating in a 4:5 ratio for the first and second lines: reflecting the fact that the second load has a latency of 5 due to complex addressing, while the first load takes only 4 cycles.

Branches

What about branches? I don’t find anything special about branches: whether taken or untaken they seem to retire normally and fit the pattern described above. I am not going to show the results but you can play with the branches test yourself if you want.

Atomic Operations

What about atomic operations? Here the story does get interesting.

I’m going to use lock add QWORD [rbx], 1 as my default atomic instruction, but the story seems similar for all of them. Alone, this instruction has a “latency”14 of 18 cycles. Let’s put it in parallel with a couple of SIMD instructions that have a total latency of 20 cycles alone:

    vpmulld xmm0, xmm0, xmm0
    vpmulld xmm0, xmm0, xmm0
    lock add QWORD [rbx], 1

This loop still takes 20 cycles to execute. That is, the atomic costs nothing in runtime: the performance is the same if you comment it out. The vpmulld dependency chain is long enough to hide the cost of the atomic. Let’s take a look at the interrupt distribution for this code:

0 : 4000c8: vpmulld xmm0,xmm0,xmm0 12 : 4000cd: vpmulld xmm0,xmm0,xmm0 10 : 4000d2: lock add QWORD PTR [rbx],0x1 244 : 4000d7: vpmulld xmm0,xmm0,xmm0 0 : 4000dc: vpmulld xmm0,xmm0,xmm0 26 : 4000e1: lock add QWORD PTR [rbx],0x1 299 : 4000e6: vpmulld xmm0,xmm0,xmm0 0 : 4000eb: vpmulld xmm0,xmm0,xmm0 35 : 4000f0: lock add QWORD PTR [rbx],0x1 277 : 4000f5: vpmulld xmm0,xmm0,xmm0 0 : 4000fa: vpmulld xmm0,xmm0,xmm0 33 : 4000ff: lock add QWORD PTR [rbx],0x1 302 : 400104: vpmulld xmm0,xmm0,xmm0 0 : 400109: vpmulld xmm0,xmm0,xmm0 33 : 40010e: lock add QWORD PTR [rbx],0x1 272 : 400113: vpmulld xmm0,xmm0,xmm0 0 : 400118: vpmulld xmm0,xmm0,xmm0 31 : 40011d: lock add QWORD PTR [rbx],0x1 280 : 400122: vpmulld xmm0,xmm0,xmm0 0 : 400127: vpmulld xmm0,xmm0,xmm0 40 : 40012c: lock add QWORD PTR [rbx],0x1 277 : 400131: vpmulld xmm0,xmm0,xmm0 0 : 400136: vpmulld xmm0,xmm0,xmm0 21 : 40013b: lock add QWORD PTR [rbx],0x1 282 : 400140: vpmulld xmm0,xmm0,xmm0 0 : 400145: vpmulld xmm0,xmm0,xmm0 35 : 40014a: lock add QWORD PTR [rbx],0x1 291 : 40014f: vpmulld xmm0,xmm0,xmm0 0 : 400154: vpmulld xmm0,xmm0,xmm0 35 : 400159: lock add QWORD PTR [rbx],0x1 270 : 40015e: dec rcx 0 : 400161: jne 4000c8 <_start.loop>

The lock add instructions are selected by the interrupt close to 90% of the time, despite not contributing to the execution time. Based on our mental model, these instructions should be able to run ahead of the vpmulld loop and hence be ready to retire as soon as they are the head of the ROB. The effect that we see here is because lock-prefixed instructions are execute at retire. This is a special type of instruction that waits until it at the head of the ROB before it executes15.

So this instruction will always take a certain minimum amount of time as the oldest unretired instruction: it never retires immediately regardless of the surrounding instructions. In this case, it means retirement spends most of its time waiting for the locked instructions, while execution spends most of its time waiting for the vpmulld. Note that the retirement time added by the locked instructions wasn’t additive with that from vpmulld: the time it spends waiting for retirement is subtracted from the time that would otherwise be spent waiting on the multiplication retirement. That’s why you end up with a lopsided split, not 50/50. We can see this more clearly if we double the number of multiplications to 4:

    vpmulld xmm0, xmm0, xmm0
    vpmulld xmm0, xmm0, xmm0
    vpmulld xmm0, xmm0, xmm0
    vpmulld xmm0, xmm0, xmm0
    lock add QWORD [rbx], 1

This takes 40 cycles to execute, and the interrupt pattern looks like:

0 : 4000c8: vpmulld xmm0,xmm0,xmm0 18 : 4000cd: vpmulld xmm0,xmm0,xmm0 49 : 4000d2: vpmulld xmm0,xmm0,xmm0 133 : 4000d7: vpmulld xmm0,xmm0,xmm0 152 : 4000dc: lock add QWORD PTR [rbx],0x1 263 : 4000e1: vpmulld xmm0,xmm0,xmm0 0 : 4000e6: vpmulld xmm0,xmm0,xmm0 61 : 4000eb: vpmulld xmm0,xmm0,xmm0 160 : 4000f0: vpmulld xmm0,xmm0,xmm0 168 : 4000f5: lock add QWORD PTR [rbx],0x1 251 : 4000fa: vpmulld xmm0,xmm0,xmm0 0 : 4000ff: vpmulld xmm0,xmm0,xmm0 59 : 400104: vpmulld xmm0,xmm0,xmm0 166 : 400109: vpmulld xmm0,xmm0,xmm0 162 : 40010e: lock add QWORD PTR [rbx],0x1 267 : 400113: vpmulld xmm0,xmm0,xmm0 0 : 400118: vpmulld xmm0,xmm0,xmm0 60 : 40011d: vpmulld xmm0,xmm0,xmm0 160 : 400122: vpmulld xmm0,xmm0,xmm0 155 : 400127: lock add QWORD PTR [rbx],0x1 218 : 40012c: vpmulld xmm0,xmm0,xmm0 0 : 400131: vpmulld xmm0,xmm0,xmm0 58 : 400136: vpmulld xmm0,xmm0,xmm0 144 : 40013b: vpmulld xmm0,xmm0,xmm0 154 : 400140: lock add QWORD PTR [rbx],0x1 pattern continues...

The sample counts are similar for lock add but they’ve now increased to a comparable amount (in total) for the vmulld instructions16. In fact, we can calculate how long the lock add instruction takes to retire, using the ratio of the times it was selected compared to the known block throughput of 40 cycles. I get about a 38%-40% rate over a couple of runs which corresponds to a retire time of 15-16 cycles, only slightly less than then back-to-back latency14 of this instruction.

Can we do anything with this information?

Well one idea is that it lets us fairly precisely map out the retirement timing of instructions. For example, we can set up an instruction to test, and a parallel series of instructions with a known latency. Then we observe what is selected by interrupts: the instruction under test or the end of the known-latency chain. Whichever is selected has longer retirement latency and the known-latency chain can be adjusted to narrow it down exactly.

Of course, this sounds way harder than the usual way of measuring latency: a long series of back-to-back instrucitons, but it does let us measure some things “in situ” without a long chain, and we can measure instructions that don’t have an obvious way to chain (e.g,. have no output like stores or different-domain instructions).

TODO

  • Explain the variable (non-self synchronizing) results in terms of retire window patterns
  • Check interruptible instructions
  • Check mfence and friends
  • Check the execution effect of atomic instructions (e.g., blocking load ports)

Comments

Feedback of any type is welcome. I don’t have a comments system17 yet, so as usual I’ll outsource discussion to this HackerNews thread.

Thanks and Attribution

Thanks to HN user rrss for pointing out errors in my cycle chart.

Intel 8259 image by Wikipedia user German under CC BY-SA 3.0.




  1. By externally triggered I mean pretty much any interrupt that doesn’t result directly from the code currently running on the logical core (such as an int instruction or some fault/trap). So it includes includes timer interrupts, cross CPU inter-processor interrupts, etc. 

  2. The two choices basically correspond to letting the current instruction finish, or aborting it in order to run the interrupt. 

  3. I use mov here to avoid creating any dependency chains, i.e., the destination of mov is write-only unlike most x86 integer instructions which both read and write their first operand. 

  4. Here I show the full loop, and the %rep directive which repeats (unrolls) the contained block the specified number of times, but from this point on I may just show the inner block itself with the understanding that it is unrolled with %rep some number of times (to reduce the impact of the loop) and contained with a loop of ~1,000,000 iterations so we get a reasonable number of samples. For the exact code you can always refer to the repo

  5. In general I’ll just say “this code executes like this” or “the bottleneck is this”, without futher justification as the examples are usually simple with one obvious bottleneck. Even though I won’t always mentioned it, I have checked that the examples perform as expected, usually with a perf stat to confirm the IPC or cycles per iteration or whatever. 

  6. The full command I use is something like perfc annotate -Mintel --stdio --no-source --stdio-color --show-nr-samples. For interactive analysis just drop the --stdio part to get the tui UI. 

  7. Of course we will see that in some cases the selected and sampled instructions can actually be the same, in the case of interruptible instructions. 

  8. Originally I thought it was guaranteed that instructions on the critical path would be eligible for sampling, but it is not: later we construct an example where a critical instruction never gets selected. 

  9. Note that I am specifically not saying the only selected instructions will be from the chain. As we will see soon, retirement limits mean that unless the chain instructions are fairly dense (i.e., at least 1 every 4 instructions), other instructions may appear also. 

  10. I use nop because it has no inputs or outputs so I don’t have to be careful not to create dependency chains (e.g., ensuring each add goes to a distinct register), and it doesn’t use any execution units, so we won’t steal a unit from the chain on the critical path. 

  11. We have already seen this effect in all of the earlier examples: it is why the long-latency add is selected, even though that in the cycle that it retires the three subsequent instructions also retire. This behavior is nice because it ensures that longer latency instructions are generally selected, rather than some unrelated instruction (i.e., the usual skid is +1 not +4). 

  12. The increase is from 4 to 6, rather than 4 to 5, because of +1 cycle for the add itself (obvious) and +1 cycle because any ALU instruction in the address computation path of a pointer chasing loop disables the 4-cycle load fast path (much less obvious). 

  13. Of course this doesn’t mean the instruction will be selected, we only have a few thousand interrupts over one million or more interrupts, so usually nothign happens at all. These are just the locations that are highly likely to be chosen when an interrupt does arrive. 

  14. Latency is in scare quotes here because the apparent latency of atomic operations doesn’t behave like most other instructions: if you measure the throughput of back-to-back atomic instructions you get higher performance if the instructions form a dependency chain, rather than if they don’t! The latency of atomics doesn’t fit neatly into an input-to-output model as their execute at retirement behavior causes other bottlenecks.  2

  15. That’s the simple way of looking at it: reality is more complex as usual. Locked instructions may execute some of their instructions before they are ready to retire: e.g., loading the the required value and the ALU operation. However, the key “unlock” operation which verifies the result of the earlier operations and commits the results, in an atomic way, happens at retire and this is responsible for the majority of the cost of this instruction. 

  16. The uneven pattern among the vpmulld instructions is because the lock add instruction eats the retire cycles of the first vpmulld that occur after (they can retire quickly because they are already complete), so only the later ones have a full allocation. 

  17. If anyone has a recommendation or a if anyone knows of a comments system that works with static sites, and which is not Disqus, has no ads, is free and fast, and lets me own the comment data (or at least export it in a reasonable format), I am all ears.