Performance Matters, a blog about low-level software and hardware performance, by Travis Downs (travis.downs@gmail.com).

Your CPU May Have Slowed Down on Wednesday (2021-06-17, https://travisdowns.github.io/blog/2021/06/17/rip-zero-opt)<!-- boilerplate
page.assets: /assets/rip-zero-opt
assetpath: /assets/rip-zero-opt
tablepath: /misc/tables/rip-zero-opt
-->
<h2 id="a-strange-performance-effect">A Strange Performance Effect</h2>
<p>The plot below shows the throughput of filling a region of the given size (varying on the x-axis) with zeros<sup id="fnref:stdfill" role="doc-noteref"><a href="#fn:stdfill" class="footnote" rel="footnote">1</a></sup> on Skylake (and Ice Lake in the second tab).</p>
<p>The two series were generated under apparently identical conditions: the same binary on the same machine. Only the date the benchmark was run varies. That is, on Monday, June 7th, filling with zeros is substantially faster than the same benchmark on Wednesday, at least when the region no longer fits in the L2 cache<sup id="fnref:whitelie" role="doc-noteref"><a href="#fn:whitelie" class="footnote" rel="footnote">2</a></sup>.</p>
<div class="tabs" id="tabs-fig1">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig1-1" name="tab-group-fig1" checked="" />
<label class="tab-label" for="tab-fig1-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/rip-zero-opt/skl/fig1.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/post3/skl-combined/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/rip-zero-opt/skl/fig1.html">
<img class="figimg" src="/assets/rip-zero-opt/skl/fig1.svg" alt="Figure 1: A chart of region size (x-axis) versus fill throughput (y-axis) with two series, Monday and Wednesday, with the Wednesday series showing worse performance in L3 and RAM" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig1-2" name="tab-group-fig1" />
<label class="tab-label" for="tab-fig1-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/rip-zero-opt/icl/fig1.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/post3/icl-combined/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/rip-zero-opt/icl/fig1.html">
<img class="figimg" src="/assets/rip-zero-opt/icl/fig1.svg" alt="Figure 1: A chart of region size (x-axis) versus fill throughput (y-axis) with two series, Monday and Wednesday, with the Wednesday series showing worse performance in L3 and RAM" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<h3 id="hump-day-strikes-back">Hump Day Strikes Back</h3>
<p>What’s going on here? Are my Skylake and Ice Lake hosts simply work-weary by Wednesday and don’t put in as much effort? Is there a new crypto-coin based on who can store the most zeros and this is a countermeasure to avoid ballooning CPU prices in the face of this new workload?</p>
<p>Believe it or not, it is none of the above!</p>
<p>These hosts run Ubuntu 20.04 and on Wednesday June 9th an update to the <a href="https://launchpad.net/ubuntu/+source/intel-microcode">intel-<abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr></a> OS package was released. After a reboot<sup id="fnref:reboot" role="doc-noteref"><a href="#fn:reboot" class="footnote" rel="footnote">3</a></sup>, this loads the CPU with new <em><abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr></em> (released a day earlier by Intel) that causes the behavior shown above. Specifically, this <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr><sup id="fnref:versions" role="doc-noteref"><a href="#fn:versions" class="footnote" rel="footnote">4</a></sup> disables the <a href="/blog/2020/05/13/intel-zero-opt.html">hardware zero store</a> optimization we discussed in a previous post. It was disabled to mitigate <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-24512">CVE-2020-24512</a> further described<sup id="fnref:barely" role="doc-noteref"><a href="#fn:barely" class="footnote" rel="footnote">5</a></sup> in Intel security advisory <a href="https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00464.html">INTEL-SA-00464</a>.</p>
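<p>If you want to check which microcode revision your CPU is currently running, each logical CPU reports it in <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code> on Linux. Here is a minimal sketch of a parser for that output (the helper name is mine, not part of any tool mentioned in this post):</p>

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: extract the microcode revision from
// /proc/cpuinfo-style text, where each logical CPU reports a line
// like "microcode       : 0xea". Returns "" if no such line exists
// (as may happen inside some VMs).
std::string microcode_revision(const std::string& cpuinfo) {
    std::istringstream in(cpuinfo);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("microcode", 0) != 0) continue;  // must start the line
        auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        auto start = line.find_first_not_of(" \t", colon + 1);
        return start == std::string::npos ? "" : line.substr(start);
    }
    return "";
}
```

<p>On a live system you would feed this the contents of <code class="language-plaintext highlighter-rouge">/proc/cpuinfo</code>; comparing the value before and after a reboot is enough to spot an update like the <code class="language-plaintext highlighter-rouge">0xe2</code> to <code class="language-plaintext highlighter-rouge">0xea</code> change discussed below.</p>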
<p>To be clear, I don’t know <em>for sure</em> that the <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> disables the zero store optimization – but the evidence is rather overwhelming. After the update, filling with zeros performs the same as filling with any other value, and the performance counters tracking L2 evictions suggest that substantially all evictions are now non-silent (recall from the previous posts that silent evictions were a hallmark of the optimization).</p>
<p>Although I suspect the performance impact will be minuscule on average<sup id="fnref:impact" role="doc-noteref"><a href="#fn:impact" class="footnote" rel="footnote">6</a></sup>, this surprise still serves as a reminder that raw CPU performance can <em>silently</em> change due to <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> updates, and most Linux distributions and modern Windows have these updates enabled by default. We’ve <a href="/blog/2019/03/19/random-writes-and-microcode-oh-my.html">seen this before</a>. If you are trying to run reproducible benchmarks, you should always re-run your <em>entire</em> suite in order to make accurate comparisons, even on the same hardware, rather than just re-running the parts you think have changed.</p>
<h3 id="reproduction">Reproduction</h3>
<p>The code to collect this performance data and reproduce my results is available in <a href="https://github.com/travisdowns/zero-fill-bench/tree/post3">zero-fill-bench</a>, with some <a href="https://github.com/travisdowns/zero-fill-bench#third-post-rip-zero-store-optimization">instructions</a> in the README.</p>
<h3 id="mea-culpa-and-an-unsustainable-path">Mea Culpa and an Unsustainable Path</h3>
<p>In writing the earlier blog entries on this topic, I was interested in the <em>performance</em> aspects of this optimization, not its potential as an attack vector. However, merely by observing (and publishing) the results, the optimization was affected: <a href="https://en.wikipedia.org/wiki/Measurement_problem">the system under measurement changed as a result of the observation</a>. I can’t be sure that the optimization wouldn’t have eventually been disabled anyway, but it does seem that my earlier post is the reason this change happened now.</p>
<p>I am not convinced that removing any optimization which can be used in a timing-based side channel is sustainable. I am not sure this is a thread you want to keep pulling on: practically <em>every</em> aspect of a modern CPU can vary in performance and timing based on internal state<sup id="fnref:power" role="doc-noteref"><a href="#fn:power" class="footnote" rel="footnote">7</a></sup>. Trying to draw the security boundaries tightly around co-located entities (e.g., processes on the same CPU, especially on the same core), without allowing any leaks seems destined to fail without a complete overhaul of CPU design, likely at the cost of a large amount of performance. There are just too many holes to plug.</p>
<p>I hope that once the wave of vulnerabilities and disclosures that started with Meltdown and Spectre begins to recede, we can start to work on a measured approach to classifying and mitigating timing and other side-channel attacks. This could start by enumerating which performance characteristics are reasonably guaranteed to hold, and which aren’t. For example, it could be specified whether memory access timing may vary based on the <em>value</em> accessed. If such variation is permitted, the zero store optimization would be allowed.</p>
<p>In any case, I still plan to write about performance-related microarchitectural details. I just hope this outcome does not repeat itself.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to JN, Chris Martin, Jonathan and m_ueberall for reporting or fixing typos in the text.</p>
<p>Stone photo by Colin Watts on Unsplash.</p>
<h3 id="discussion-and-feedback">Discussion and Feedback</h3>
<p>You can join the discussion on <a href="https://twitter.com/trav_downs/status/1407110595761950720">Twitter</a>, <a href="https://news.ycombinator.com/item?id=27588258">Hacker News</a> or <a href="https://www.reddit.com/r/intel/comments/o5b9gg/your_intel_cpu_may_have_slowed_down_on_wednesday/">r/Intel</a>.</p>
<p>If you have a question or any type of feedback, you can leave a <a href="#comment-section">comment below</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:stdfill" role="doc-endnote">
<p>Specifically, it uses <a href="https://en.cppreference.com/w/cpp/algorithm/fill"><code class="language-plaintext highlighter-rouge">std::fill</code></a> with a zero argument, with some inlining prevention, which ultimately results in a fill which uses a series of 32-byte vector loads and stores to store 256 bytes per unrolled iteration, with a loop body like this:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0x20</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0x40</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0x60</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0x80</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0xa0</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0xc0</span><span class="p">],</span><span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mh">0xe0</span><span class="p">],</span><span class="nv">ymm1</span>
</code></pre></div> </div>
<p>So the compiler does a good job: you can’t ask for much better than that. <a href="#fnref:stdfill" class="reversefootnote" role="doc-backlink">↩</a></p>
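<p>A minimal sketch of that setup (my paraphrase; the real harness is in zero-fill-bench): the fill is isolated in a function whose inlining is prevented, so the compiler cannot see that the value is always zero and specialize the fill (e.g., into a <code class="language-plaintext highlighter-rouge">memset</code> call), and instead emits a generic vectorized loop like the one above.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// noinline keeps the constant 0 out of sight of the optimizer at the
// call site, so the same generic fill loop runs for every value.
__attribute__((noinline))
void fill_region(uint8_t* buf, size_t size, uint8_t value) {
    // With -O2 and AVX2 enabled, gcc typically vectorizes this into
    // unrolled 32-byte vmovdqu stores, as shown above.
    std::fill(buf, buf + size, value);
}
```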
</li>
<li id="fn:whitelie" role="doc-endnote">
<p>I’m doing a bit of a retcon here. The effect is present as described on the dates given, and I observed and benchmarked it on that Wednesday myself, but the specific data series used for the plots were generated a week later, when I had time to collect the data properly in a relatively noise-free environment. So the two series were collected back-to-back on the same day, varying only the hidden parameter you’ll learn about two paragraphs from now. <a href="#fnref:whitelie" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:reboot" role="doc-endnote">
<p>To be clear, the <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> is not persistent, so it needs to be loaded on <em>every</em> boot. If you remove or downgrade the <code class="language-plaintext highlighter-rouge">intel-microcode</code> package, you’ll be back to an older <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> after the next boot. That is, unless you also update your BIOS which can <em>also</em> come with a <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> update: this will be persistent unless you downgrade your BIOS. <a href="#fnref:reboot" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:versions" role="doc-endnote">
<p>The new June 8th <abbr title="Internal instructions and other logic forming part of a CPU which may be used to implement user-visible instructions and control other aspects of CPU behavior and which may be modified dynamically by vendor-provided updates.">microcode</abbr> versions are <code class="language-plaintext highlighter-rouge">0xea</code> for Skylake (versus <code class="language-plaintext highlighter-rouge">0xe2</code> previously) and <code class="language-plaintext highlighter-rouge">0xa6</code> for Ice Lake (versus <code class="language-plaintext highlighter-rouge">0xa0</code> previously). <a href="#fnref:versions" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:barely" role="doc-endnote">
<p><em>barely</em> <a href="#fnref:barely" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:impact" role="doc-endnote">
<p>The performance regression shown in the plots is close to a worst case: the benchmark only fills zeros and nothing else. Real code doesn’t spend <em>that much</em> time filling zeros, although zero <em>is</em> no doubt the dominant value in large block fills, at least because the OS must zero pages before returning them to user processes and memory-safe languages like Java will zero some objects and array types in bulk. <a href="#fnref:impact" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:power" role="doc-endnote">
<p>This observation becomes almost universal once you consider that the <em>values</em> involved in any operation affect power use (see e.g. <a href="https://arxiv.org/pdf/1905.12468.pdf">Schöne et al</a> or <a href="https://hal.inria.fr/hal-02401760/document">Cornebize and Legrand</a>). Since power use can be directly (e.g., RAPL or external measurements) or indirectly (e.g., because of heat-dependent frequency changes) observed, it means that in theory <em>any</em> operation, even those widely considered to be constant-time, may leak information. <a href="#fnref:power" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downs (travis.downs@gmail.com). The death of the hardware store optimization.

Ice Lake AVX-512 Downclocking (2020-08-19, https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq)<!-- boilerplate
page.assets: /assets/icl-avx512-freq
assetpath: /assets/icl-avx512-freq
tablepath: /misc/tables/icl-avx512-freq
-->
<p>This is a short post investigating the behavior of AVX2 and AVX-512 related <em>license-based downclocking</em> on Intel’s newest Ice Lake and Rocket Lake chips.</p>
<p>License-based downclocking<sup id="fnref:tiring" role="doc-noteref"><a href="#fn:tiring" class="footnote" rel="footnote">1</a></sup> refers to the <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">semi-famous</a> effect where lower-than-nominal frequency limits are imposed when certain <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions are executed, especially heavy floating point instructions or 512-bit wide instructions.</p>
<p>More details about this type of downclocking are available in <a href="https://stackoverflow.com/a/56861355">this StackOverflow answer</a>, and we’ve already <a href="/blog/2020/01/17/avxfreq1.html">covered in somewhat exhaustive detail</a> the low-level mechanics of these transitions. You can also find <a href="https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/">some guidelines</a> on how to make use of wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> given this issue<sup id="fnref:dmore" role="doc-noteref"><a href="#fn:dmore" class="footnote" rel="footnote">2</a></sup>.</p>
<p>All of those were written in the context of Skylake-SP (<abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>), the first generation of chips to support AVX-512.</p>
<p>So what about Ice Lake, the newest chips which support both the <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> flavor of AVX-512 and also have a <a href="https://branchfree.org/2019/05/29/why-ice-lake-is-important-a-bit-bashers-perspective/">whole host of new AVX-512 instructions</a>? Will we be stuck gazing longingly at these new instructions from afar while never being allowed to actually use them due to downclocking?</p>
<p>Read on to find out, or just skip to the <a href="#summary">end</a>. The original version of this post covered only Ice Lake; on March 28th, 2021 I updated it with a <a href="#rocket-lake">Rocket Lake section</a>.</p>
<h2 id="ice-lake-frequency-behavior">Ice Lake Frequency Behavior</h2>
<h3 id="avx-turbo">AVX-Turbo</h3>
<p>We will use the <a href="https://github.com/travisdowns/avx-turbo">avx-turbo</a> utility to measure the core count and instruction mix dependent frequencies for a CPU. This tool works in a straightforward way: it runs a given mix of instructions on a given number of cores, while measuring the frequency achieved during the test.</p>
<p>For example, the <code class="language-plaintext highlighter-rouge">avx256_fma_t</code> test – which measures the cost of <em>heavy</em> 256-bit instructions with high <abbr title="Instruction level parallelism: a measure of inter-instruction parallelism on a superscalar CPU">ILP</abbr> – runs the following sequence of FMAs:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nf">vfmadd132pd</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm1</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm3</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm5</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm6</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm7</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm8</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="nf">vfmadd132pd</span> <span class="nv">ymm9</span><span class="p">,</span><span class="nv">ymm10</span><span class="p">,</span><span class="nv">ymm11</span>
<span class="c1">; repeat 10x for a total of 100 FMAs</span>
</code></pre></div></div>
<p>In total, we’ll use five tests that cover every combination of light and heavy 256-bit and 512-bit instructions, as well as scalar instructions (128-bit <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> behaves the same as scalar), using this command line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>avx-turbo --test=scalar_iadd,avx256_iadd,avx512_iadd,avx256_fma_t,avx512_fma_t
</code></pre></div></div>
<h3 id="ice-lake-results">Ice Lake Results</h3>
<p>I ran avx-turbo as described above on an Ice Lake i5-1035G4, which is the middle-of-the-range Ice Lake client CPU running at up to 3.7 GHz. The full output is <a href="https://gist.github.com/travisdowns/c53f40fc4dbbd944f5613eaab78f3189#file-icl-turbo-results-txt">hidden away in a gist</a>, but here are the all-important frequency results (all values in GHz):</p>
<table class="td-right">
<tbody>
<tr>
<th rowspan="2">Instruction Mix</th>
<th colspan="4">Active Cores</th>
</tr>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
</tr>
<tr>
<th>Scalar/128-bit</th>
<td>3.7</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<th>Light 256-bit</th>
<td>3.7</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<th>Heavy 256-bit</th>
<td>3.7</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<th>Light 512-bit</th>
<td>3.6</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<th>Heavy 512-bit</th>
<td>3.6</td>
<td>3.6</td>
<td>3.3</td>
<td>3.3</td>
</tr>
</tbody>
</table>
<p>As expected, maximum frequency decreases with active core count, but scan down each column to see the effect of instruction category. Along this axis, there is almost no downclocking at all! Only with a single active core is there any decrease with wider instructions, and it is a paltry 100 MHz: from 3,700 MHz to 3,600 MHz when any 512-bit instructions are used.</p>
<p>In any other scenario, including any time more than one core is active, or for heavy 256-bit instructions, there is <em>zero</em> license-based downclocking: everything runs as fast as scalar.</p>
<h4 id="license-mapping">License Mapping</h4>
<p>There’s another change here too. In <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, there are three licenses, or categories of instructions with respect to downclocking: L0, L1 and L2. Here, in client ICL, there are only two<sup id="fnref:visible" role="doc-noteref"><a href="#fn:visible" class="footnote" rel="footnote">3</a></sup>, and those don’t line up exactly with the three in <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>.</p>
<p>To be clearer, in <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> the licenses mapped to instruction width and weight as follows:</p>
<style>
.l0 {
background-color: hsl(118deg 96% calc(72% - var(--dark) * 55%));
}
.l1 {
background-color: hsl(63deg 100% calc(74% - var(--dark) * 59%));
}
.l2 {
background-color: hsl(2deg 92% calc(75% - var(--dark) * 44%));
}
</style>
<table>
<tbody>
<tr>
<th>Width</th>
<th>Light</th>
<th>Heavy</th>
</tr>
<tr>
<td>Scalar/128</td>
<td class="l0">L0</td>
<td class="l0">L0</td>
</tr>
<tr>
<td>256</td>
<td class="l0">L0</td>
<td class="l1">L1</td>
</tr>
<tr>
<td>512</td>
<td class="l1">L1</td>
<td class="l2">L2</td>
</tr>
</tbody>
</table>
<p>In particular, note that 256-bit heavy instructions have the same license as 512-bit light.</p>
<p>In ICL client, the mapping is:</p>
<table>
<tbody>
<tr>
<th>Width</th>
<th>Light</th>
<th>Heavy</th>
</tr>
<tr>
<td>Scalar/128</td>
<td class="l0">L0</td>
<td class="l0">L0</td>
</tr>
<tr>
<td>256</td>
<td class="l0">L0</td>
<td class="l0">L0</td>
</tr>
<tr>
<td>512</td>
<td class="l1">L1</td>
<td class="l1">L1</td>
</tr>
</tbody>
</table>
<p>Now, 256 heavy and 512 light are in different categories! In fact, the whole concept of light vs heavy doesn’t seem to apply here: the categorization is purely based on the width<sup id="fnref:onefma" role="doc-noteref"><a href="#fn:onefma" class="footnote" rel="footnote">4</a></sup>.</p>
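<p>The two mappings are small enough to state directly in code (a transcription of the tables above, not anything the hardware exposes):</p>

```cpp
enum class License { L0, L1, L2 };

// Skylake-SP: the license depends on both width and weight.
License skx_license(int width, bool heavy) {
    if (width <= 128) return License::L0;
    if (width == 256) return heavy ? License::L1 : License::L0;
    return heavy ? License::L2 : License::L1;  // width == 512
}

// Ice Lake client: weight is irrelevant; only 512-bit width pays.
License icl_license(int width, bool /*heavy*/) {
    return width == 512 ? License::L1 : License::L0;
}
```

<p>Note how <code class="language-plaintext highlighter-rouge">skx_license(256, true)</code> and <code class="language-plaintext highlighter-rouge">skx_license(512, false)</code> both land in L1, while on ICL client heavy 256-bit stays at L0 and light 512-bit moves to L1.</p>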
<h2 id="rocket-lake">Rocket Lake</h2>
<p>Rocket Lake (shortened as <abbr title="Intel's Rocket Lake architecture, aka 11th Generation Intel Core i3,i5,i7 and i9">RKL</abbr>, see <a href="https://en.wikipedia.org/wiki/Rocket_Lake">wikipedia</a> or <a href="https://en.wikichip.org/wiki/intel/microarchitectures/rocket_lake">wikichip</a> for more) is more-or-less a backport of the 10nm <abbr title="The 10nm microarchitecture used in Ice Lake CPUs.">Sunny Cove</abbr> microarchitecture to Intel’s highly-tuned workhorse<sup id="fnref:some" role="doc-noteref"><a href="#fn:some" class="footnote" rel="footnote">5</a></sup> 14nm process.</p>
<p>Edison Chan has graciously provided the output of running avx-turbo on his Rocket Lake i9-11900K, the top of the line Rocket Lake chip. The <a href="/assets/icl-avx512-freq/11900k-avx-freq-results.txt">full results</a> are available, but I’ve summarized the achieved frequencies in the following table.</p>
<p style="text-align: center"><strong>Rocket Lake i9-11900K Frequency Matrix</strong></p>
<table class="td-right">
<tbody>
<tr>
<th rowspan="2">Active Cores</th>
<th colspan="5" style="text-align:center">Instruction Mix</th>
</tr>
<tr>
<th>Scalar and 128</th>
<th>Light 256</th>
<th>Heavy 256</th>
<th>Light 512</th>
<th>Heavy 512</th>
</tr>
<tr>
<th>1 Core</th>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
</tr>
<tr>
<th>2 Cores</th>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
</tr>
<tr>
<th>3 Cores</th>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
</tr>
<tr>
<th>4 Cores</th>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
<td>5.1</td>
</tr>
<tr>
<th>5 Cores</th>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
</tr>
<tr>
<th>6 Cores</th>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
<td>4.9</td>
</tr>
<tr>
<th>7 Cores</th>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
</tr>
<tr>
<th>8 Cores</th>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
</tr>
</tbody>
</table>
<p>The results paint a very promising picture of Rocket Lake’s AVX-512 frequency behavior: there is <em>no</em> license-based downclocking evident at any combination of core count and instruction mix<sup id="fnref:rklcaveats" role="doc-noteref"><a href="#fn:rklcaveats" class="footnote" rel="footnote">6</a></sup>. Even heavy AVX-512 instructions can execute at the same frequency as lightweight scalar code.</p>
<p>In fact, the frequency behavior of this chip appears very simple: the full Turbo Boost 2.0 frequency<sup id="fnref:tb2" role="doc-noteref"><a href="#fn:tb2" class="footnote" rel="footnote">7</a></sup> of 5.1 GHz is available for any instruction mix with up to 4 active cores, then the speed drops to 4.9 GHz for 5 and 6 active cores, and finally to 4.8 GHz for 7 or 8 active cores. This means that with 8 active cores running AVX-512, you are still achieving 94% of the frequency observed for 1 active core running light instructions.</p>
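<p>Put another way, the entire frequency matrix collapses to a function of active core count alone (the values are read off the table above; this is measured behavior on one chip, not a documented Intel formula):</p>

```cpp
// Measured i9-11900K maximum turbo (GHz) by active core count,
// identical for every instruction mix from scalar to heavy AVX-512.
double rkl_11900k_turbo_ghz(int active_cores) {
    if (active_cores <= 4) return 5.1;
    if (active_cores <= 6) return 4.9;
    return 4.8;  // 7 or 8 active cores
}
```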
<h3 id="so-what">So What?</h3>
<p>Well, so what?</p>
<p>At least, it means we need to adjust our mental model of the frequency-related cost of AVX-512 instructions. Rather than the prior-generation verdict of “AVX-512 generally causes significant downclocking”, on these Ice Lake and Rocket Lake client chips we can say that AVX-512 causes insignificant license-based downclocking (usually none at all), and I expect this to be true on other ICL and <abbr title="Intel's Rocket Lake architecture, aka 11th Generation Intel Core i3,i5,i7 and i9">RKL</abbr> client chips as well.</p>
<p>Now, this adjustment of expectations comes with an important caveat: license-based downclocking is only <em>one</em> source of downclocking. It is also possible to hit power, thermal or current limits. Some configurations may only be able to run wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions on all cores for a short period of time before exceeding running power limits. In my case, the $250 laptop I’m testing this on has extremely poor cooling and rather than power limits I hit thermal limits (100°C limit) within a few seconds running anything heavy on all cores.</p>
<p>However, these other limits are qualitatively different than license based limits. They apply mostly<sup id="fnref:voltage" role="doc-noteref"><a href="#fn:voltage" class="footnote" rel="footnote">8</a></sup> in a <em>pay for what you use</em> way: if you use a wide or heavy instruction or two you incur only a microscopic amount of additional power or heat cost associated with only those instructions. This is unlike some license-based transitions where a core or chip-wide transition occurs that affects unrelated subsequent execution for a significant period of time.</p>
<p>Since wider operations are generally <em>cheaper</em> in power than an equivalent number of narrower operations<sup id="fnref:widenarrow" role="doc-noteref"><a href="#fn:widenarrow" class="footnote" rel="footnote">9</a></sup>, you can determine up-front that a wide operation is <em>worth it</em> – at least for cases that scale well with width. In any case, the problem is mostly local: it does not depend on the behavior of the surrounding code.</p>
<h3 id="summary">Summary</h3>
<p>Here’s what we’ve learned.</p>
<ul>
<li>The Ice Lake i5-1035 CPU exhibits only 100 MHz of license-based downclock with 1 active core when running 512-bit instructions, and <em>no</em> license downclock in any other scenario.</li>
<li>The Rocket Lake i9-11900K CPU doesn’t exhibit any license-based downclock in the tested scenarios.</li>
<li>The Ice Lake CPU has an all-core 512-bit turbo frequency of 3.3 GHz, which is 89% of the maximum single-core scalar frequency of 3.7 GHz, so within power and thermal limits this chip has a very “flat” frequency profile. The Rocket Lake 11900K is even flatter, with an all-eight-cores frequency of 4.8 GHz clocking in at 94% of the 5.1 GHz single-core speed.</li>
<li>Unlike <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, this Ice Lake chip does not distinguish between “light” and “heavy” instructions for frequency scaling purposes: FMA operations behave the same as lighter operations.</li>
</ul>
<p>So on ICL and <abbr title="Intel's Rocket Lake architecture, aka 11th Generation Intel Core i3,i5,i7 and i9">RKL</abbr> client, you don’t have to fear the downclock. Only time will tell if this applies also to the Ice Lake Xeon server chips.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to Edison Chan for the Rocket Lake i9-11900K results.</p>
<p>Stopwatch photo by <a href="https://unsplash.com/@kevinandrephotography">Kevin Andre</a> on <a href="https://unsplash.com/s/photos/stopwatch">Unsplash</a>.</p>
<h3 id="discussion-and-feedback">Discussion and Feedback</h3>
<p>This post was discussed <a href="https://news.ycombinator.com/item?id=24215022">on Hacker News</a>.</p>
<p>If you have a question or any type of feedback, you can leave a <a href="#comment-section">comment below</a>. I’m also interested in results on <em>other</em> new Intel or AMD chips, like the i3 and i7 variants: let me know if you have one of those and we can collect results.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:tiring" role="doc-endnote">
<p>It gets tiring to constantly repeat <em>license-based downclock</em> so I’ll often use simply “downclock” instead, but this should still be understood to refer to the license-based variety rather than other types of frequency throttling. <a href="#fnref:tiring" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:dmore" role="doc-endnote">
<p>Note that Daniel has <a href="https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-instructions-are-maybe-not-so-dangerous/">written</a> <a href="https://lemire.me/blog/2018/08/15/the-dangers-of-avx-512-throttling-a-3-impact/">much more</a> <a href="https://lemire.me/blog/2018/08/24/trying-harder-to-make-avx-512-look-bad-my-quantified-and-reproducible-results/">than</a> <a href="https://lemire.me/blog/2018/09/04/per-core-frequency-scaling-and-avx-512-an-experiment/">just that</a> one. <a href="#fnref:dmore" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:visible" role="doc-endnote">
<p>Only two <em>visible:</em> it is possible that the three (or more) categories still exist, but they cause voltage transitions only, not any frequency transitions. <a href="#fnref:visible" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:onefma" role="doc-endnote">
<p>One might imagine this is a consequence of ICL client having only one FMA unit on all SKUs: very heavy FP 512-bit operations aren’t possible. However, this doesn’t align with 256-bit heavy still being fast: you can still do 2x256-bit FMAs per cycle and this is the same FP intensity as 1x512-bit FMA per cycle. It’s more like, on this chip, FP operations don’t need more license-based protection than other operations of the same width, and the main cost is 512-bit width. <a href="#fnref:onefma" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:some" role="doc-endnote">
<p>Those of a more critical bent might prefer <em>long suffering</em> or <em>very long in the tooth</em> as adjectives for this process. <a href="#fnref:some" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rklcaveats" role="doc-endnote">
<p><em>Some</em> tests did show lower speeds, although these outlier results didn’t correlate well with heavy or light instructions, and the difference was generally 100 MHz or less. These likely represent <em>other</em> sources of reduced frequency, such as thermal throttling or switching to a higher active core count when a process not related to the test process has active threads. In any case, for <em>each</em> core count, we can find a test in each of the instruction categories that runs at full speed, allowing me to fill out the matrix even in the presence of these outliers. <a href="#fnref:rklcaveats" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tb2" role="doc-endnote">
<p>I mention Turbo Boost 2.0 specifically because <a href="https://ark.intel.com/content/www/us/en/ark/products/212325/intel-core-i9-11900k-processor-16m-cache-up-to-5-30-ghz.html">this chip</a> also has a higher Turbo Boost 3.0 maximum frequency of 5.2 GHz, and beyond that a high <em>Thermal Velocity Boost</em> frequency of 5.3 GHz. These higher frequencies apply only to specific <em>chosen cores</em> within the CPU selected at manufacturing based on their ability to reach these higher frequencies. We don’t see any of these higher speeds during the test, possibly because the cores the test pins itself to are not the <em>chosen cores</em> on this CPU. So the frequency behavior of this chip can be characterized as “very simple” only if you ignore these additional turbo levels and other complicating factors. <a href="#fnref:tb2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:voltage" role="doc-endnote">
<p>I have to weasel-word with <em>mostly</em> here because even if there is no frequency transition, there may be a voltage transition which both incurs a halted period where nothing executes, and increases power for subsequent execution that may not require the elevated voltage. Also, there is the not-yet-discussed concept of <em>implicit widening</em> which may extend later narrow operations to maximum width if the upper parts of the registers are not zeroed with <code class="language-plaintext highlighter-rouge">vzeroupper</code> or <code class="language-plaintext highlighter-rouge">vzeroall</code>. <a href="#fnref:voltage" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:widenarrow" role="doc-endnote">
<p>For example, one 512-bit integer addition would generally be cheaper in energy use than the two 256-bit operations required to calculate the same result, because of execution overheads that don’t scale linearly with width (that’s almost everything outside of execution itself). <a href="#fnref:widenarrow" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>Travis Downs (travis.downs@gmail.com)</p>
<p><em>Examining the extent of AVX related downclocking on Intel's Ice Lake CPU</em></p>
<h2>A Concurrency Cost Hierarchy</h2>
<p>2020-07-06 · <a href="https://travisdowns.github.io/blog/2020/07/06/concurrency-costs">https://travisdowns.github.io/blog/2020/07/06/concurrency-costs</a></p>
<!-- boilerplate
page.assets: /assets/concurrency-costs
assetpath: /assets/concurrency-costs
tablepath: /misc/tables/concurrency-costs
-->
<h2 id="introduction">Introduction</h2>
<p>Concurrency is hard to get <em>correct</em>, at least for those of us unlucky enough to be writing in languages which expose directly the guts of concurrent hardware: threads and shared memory. Getting concurrency correct <em>and</em> fast is hard, too. Your knowledge about single-threaded optimization often won’t help you: at a micro (instruction) level we can’t simply apply the usual rules of μops, dependency chains, throughput limits, and so on. The rules are different.</p>
<p>If that first paragraph got your hopes up, this second one is here to dash them: I’m not actually going to do a deep dive into the very low level aspects of concurrent performance. There are a lot of things we just don’t know about how atomic instructions and fences execute, and we’ll save that for another day.</p>
<p>Instead, I’m going to describe a higher level taxonomy that I use to think about concurrent performance. We’ll group the performance of concurrent operations into six broad <em>levels</em> running from fast to slow, with each level differing from its neighbors by roughly an order of magnitude in performance.</p>
<p>I often find myself thinking in terms of these categories when I need high performance concurrency: what is the best level I can practically achieve for the given problem? Keeping the levels in mind is useful both during initial design (sometimes a small change in requirements or high level design can allow you to achieve a better level), and also while evaluating existing systems (to better understand existing performance and evaluate the path of least resistance to improvements).</p>
<h3 id="a-real-world-example">A “Real World” Example</h3>
<p>I don’t want this to be totally abstract, so we will use a real-world-if-you-squint<sup id="fnref:realworld" role="doc-noteref"><a href="#fn:realworld" class="footnote" rel="footnote">1</a></sup> running example throughout: safely incrementing an integer counter across threads. By <em>safely</em> I mean without losing increments, producing out-of-thin air values, frying your RAM or making more than a minor rip in space-time.</p>
<h3 id="source-and-results">Source and Results</h3>
<p>The source for every benchmark here is <a href="https://github.com/travisdowns/concurrency-hierarchy-bench">available</a>, so you can follow along and even reproduce the results or run the benchmarks on your own hardware. All of the results discussed here (and more) are available in the same repository, and each plot includes a <code class="language-plaintext highlighter-rouge">[data table]</code> link to the specific subset used to generate the plot.</p>
<h3 id="hardware">Hardware</h3>
<p>All of the performance results are provided for several different hardware platforms: Intel Skylake, Ice Lake, Amazon Graviton and Graviton 2. However, except when I explicitly mention other hardware, the prose refers to the results on Skylake. Although the specific numbers vary, most of the qualitative relationships hold for the other hardware too, but <em>not always</em>. Not only does the hardware vary, but the OS and library implementations vary as well.</p>
<p>It’s almost inevitable that this will be used to compare across hardware (“wow, Graviton 2 sure kicks Graviton 1’s ass”), but that’s not my goal here. The benchmarks are written primarily to tease apart the characteristics of the different levels, and <em>not</em> as a hardware shootout.</p>
<p>Find below the details of the hardware used:</p>
<table>
<thead>
<tr>
<th>Micro-architecture</th>
<th>ISA</th>
<th>Model</th>
<th>Tested Frequency</th>
<th>Cores</th>
<th>OS</th>
<th>Instance Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Skylake</td>
<td>x86</td>
<td>i7-6700HQ</td>
<td>2.6 GHz</td>
<td>4</td>
<td>Ubuntu 20.04</td>
<td> </td>
</tr>
<tr>
<td>Ice Lake</td>
<td>x86</td>
<td>i5-1035G4</td>
<td>3.3 GHz</td>
<td>4</td>
<td>Ubuntu 19.10</td>
<td> </td>
</tr>
<tr>
<td>Graviton</td>
<td>AArch64</td>
<td>Cortex-A72</td>
<td>2.3 GHz</td>
<td>16</td>
<td>Ubuntu 20.04</td>
<td>a1.4xlarge</td>
</tr>
<tr>
<td>Graviton 2</td>
<td>AArch64</td>
<td>Neoverse N1</td>
<td>2.5 GHz</td>
<td>16<sup id="fnref:g2cores" role="doc-noteref"><a href="#fn:g2cores" class="footnote" rel="footnote">2</a></sup></td>
<td>Ubuntu 20.04</td>
<td>c6g.4xlarge</td>
</tr>
</tbody>
</table>
<h2 id="level-2-contended-atomics">Level 2: Contended Atomics</h2>
<p>You’d probably expect this hierarchy to be introduced from fast to slow, or vice-versa, but we’re all about defying expectations here and we are going to start in the <em>middle</em> and work our way outwards. The middle (rounding down) turns out to be <em>level 2</em> and that’s where we will jump in.</p>
<p>The most elementary way to safely modify any shared object is to use a lock. It mostly <em>just works</em> for any type of object, no matter its structure or the nature of the modifications. Almost any mainstream CPU from the last thirty years has some type of locking<sup id="fnref:parisc" role="doc-noteref"><a href="#fn:parisc" class="footnote" rel="footnote">3</a></sup> instruction accessible to userspace.</p>
<p>So our baseline increment implementation will use a simple mutex of type <code class="language-plaintext highlighter-rouge">T</code> to protect a plain integer variable:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">T</span> <span class="n">lock</span><span class="p">;</span>
<span class="kt">uint64_t</span> <span class="n">counter</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">bench</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">iters</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">holder</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">counter</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We’ll call this implementation <em><abbr title="Uses a std::mutex and std::lock_guard to protect a plain integer counter.">mutex add</abbr></em>, and on my 4 CPU Skylake-S i7-6700HQ machine, when I use the vanilla <code class="language-plaintext highlighter-rouge">std::mutex</code> I get the following results for 2 to 4 threads:</p>
<div class="tabs" id="tabs-mutex">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-mutex-1" name="tab-group-mutex" checked="" />
<label class="tab-label" for="tab-mutex-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/mutex.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/mutex.html">
<img class="figimg" src="/assets/concurrency-costs/skl/mutex.svg" alt="Mutex" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-mutex-2" name="tab-group-mutex" />
<label class="tab-label" for="tab-mutex-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/mutex.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/mutex.html">
<img class="figimg" src="/assets/concurrency-costs/icl/mutex.svg" alt="Mutex" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-mutex-3" name="tab-group-mutex" />
<label class="tab-label" for="tab-mutex-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/mutex.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/mutex.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/mutex.svg" alt="Mutex" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-mutex-4" name="tab-group-mutex" />
<label class="tab-label" for="tab-mutex-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/mutex.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/mutex.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/mutex.svg" alt="Mutex" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p class="info">The reported value is the median of all trials, and the vertical black error lines at the top of each bar indicate the <em>interdecile range</em>, i.e., the values at the 10th and 90th percentile. Where the error bars don’t show up, it means there is no difference between the p10 and p90 values at all, at least within the limits of the reporting resolution (100 picoseconds).</p>
<p>This shows that the baseline contended cost to modify an integer protected by a lock starts at about 125 nanoseconds for two threads, and grows somewhat with increasing thread count.</p>
<p>I can already hear someone saying: <em>If you are just modifying a single 64-bit integer, skip the lock and just directly use the atomic operations that most ISAs support!</em></p>
<p>Sure, let’s add a couple of variants that do that. The <code class="language-plaintext highlighter-rouge">std::atomic<T></code> template makes this easy: we can wrap any type meeting some basic requirements and then manipulate it atomically. The easiest of all is to use <code class="language-plaintext highlighter-rouge">std::atomic<uint64_t>::operator++()</code><sup id="fnref:post" role="doc-noteref"><a href="#fn:post" class="footnote" rel="footnote">4</a></sup> and this gives us <em><abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr></em>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span> <span class="n">atomic_counter</span><span class="p">{};</span>
<span class="kt">void</span> <span class="nf">atomic_add</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">iters</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="n">atomic_counter</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The other common approach would be to use <a href="https://en.wikipedia.org/wiki/Compare-and-swap">compare and swap (<abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>)</a> to load the existing value, add one and then <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> it back if it hasn’t changed. If it <em>has</em> changed, the increment raced with another thread and we try again.</p>
<p>Note that even if you use increment at the source level, the assembly might actually end up using <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> if your hardware doesn’t support atomic increment<sup id="fnref:atomicsup" role="doc-noteref"><a href="#fn:atomicsup" class="footnote" rel="footnote">5</a></sup>, or if your compiler or runtime just doesn’t take advantage of atomic operations even though they are available (e.g., see what even the newest version of <a href="https://godbolt.org/z/5h4K7y">icc does</a> for atomic increment, and what Java did for years<sup id="fnref:java" role="doc-noteref"><a href="#fn:java" class="footnote" rel="footnote">6</a></sup>). This caveat doesn’t apply to any of our tested platforms, however.</p>
<p>Let’s add a counter implementation that uses <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> as described above, and we’ll call it <em><abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr></em>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span> <span class="n">cas_counter</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">cas_add</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">iters</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">v</span> <span class="o">=</span> <span class="n">cas_counter</span><span class="p">.</span><span class="n">load</span><span class="p">();</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">cas_counter</span><span class="p">.</span><span class="n">compare_exchange_weak</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">v</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
<span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Here’s what these look like alongside our existing <code class="language-plaintext highlighter-rouge">std::mutex</code> benchmark:</p>
<div class="tabs" id="tabs-atomic-inc">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-atomic-inc-1" name="tab-group-atomic-inc" checked="" />
<label class="tab-label" for="tab-atomic-inc-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/atomic-inc.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/atomic-inc.html">
<img class="figimg" src="/assets/concurrency-costs/skl/atomic-inc.svg" alt="Atomic increment" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-atomic-inc-2" name="tab-group-atomic-inc" />
<label class="tab-label" for="tab-atomic-inc-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/atomic-inc.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/atomic-inc.html">
<img class="figimg" src="/assets/concurrency-costs/icl/atomic-inc.svg" alt="Atomic increment" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-atomic-inc-3" name="tab-group-atomic-inc" />
<label class="tab-label" for="tab-atomic-inc-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/atomic-inc.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/atomic-inc.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/atomic-inc.svg" alt="Atomic increment" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-atomic-inc-4" name="tab-group-atomic-inc" />
<label class="tab-label" for="tab-atomic-inc-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/atomic-inc.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/atomic-inc.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/atomic-inc.svg" alt="Atomic increment" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>The first takeaway is that, at least in this <em>unrealistic maximum contention</em> benchmark, using <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr> (<a href="https://www.felixcloutier.com/x86/xadd"><code class="language-plaintext highlighter-rouge">lock xadd</code></a> at the hardware level) is significantly better than <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>. The second would be that <code class="language-plaintext highlighter-rouge">std::mutex</code> doesn’t come out looking all that bad on Skylake. It is only slightly worse than the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> approach at 2 cores and beats it at 3 and 4 cores. It is slower than the atomic increment approach, but less than three times as slow and seems to be scaling in a reasonable way.</p>
<p>All of these operations belong to <em>level 2</em> in the hierarchy. The primary characteristic of level 2 is that they make a <em>contended access</em> to a shared variable. This means that at a minimum, the line containing the data needs to move out to the caching agent that manages coherency<sup id="fnref:l3" role="doc-noteref"><a href="#fn:l3" class="footnote" rel="footnote">7</a></sup>, and then back up to the core that will receive ownership next. That’s about 70 cycles minimum just for that operation<sup id="fnref:inter" role="doc-noteref"><a href="#fn:inter" class="footnote" rel="footnote">8</a></sup>.</p>
<p>Can it get slower? You bet it can. <em>Way</em> slower.</p>
<h3 id="level-3-system-calls">Level 3: System Calls</h3>
<p>The next level up (“up” is not good here…) is level 3. The key characteristic of implementations at this level is that they make a <em>system call on almost every operation</em>.</p>
<p>It is easy to write concurrency primitives that make a system call <em>unconditionally</em> (e.g., a lock which always tries to wake waiters via a <code class="language-plaintext highlighter-rouge">futex(2)</code> call, even if there aren’t any), but we won’t look at those here. Rather we’ll take a look at a case where the fast path is written to avoid a system call, but the design or way it is used implies that such a call usually happens anyway.</p>
<p>Specifically, we are going to look at some <em>fair locks</em>. Fair locks allow threads into the critical section in the same order they began waiting. That is, when the critical section becomes available, the thread that has been waiting the longest is given the chance to take it.</p>
<p>Sounds like a good idea, right? Sometimes yes, but as we will see it can have significant performance implications.</p>
<p>On the menu are three different fair locks.</p>
<p>The first is a <a href="https://en.wikipedia.org/wiki/Ticket_lock">ticket lock</a> with a <code class="language-plaintext highlighter-rouge">sched_yield</code> in the spin loop. The idea of the yield is to give other threads which may hold the lock time to run. This <code class="language-plaintext highlighter-rouge">yield()</code> approach is publicly frowned upon by concurrency experts<sup id="fnref:notwhat" role="doc-noteref"><a href="#fn:notwhat" class="footnote" rel="footnote">9</a></sup>, who then sometimes go right ahead and use it anyway.</p>
<p id="ys-lock">We will call it <abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr> and it looks like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* A ticket lock which uses sched_yield() while waiting
* for the ticket to be served.
*/</span>
<span class="k">class</span> <span class="nc">ticket_yield</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">size_t</span><span class="o">></span> <span class="n">dispenser</span><span class="p">{},</span> <span class="n">serving</span><span class="p">{};</span>
<span class="nl">public:</span>
<span class="kt">void</span> <span class="n">lock</span><span class="p">()</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">ticket</span> <span class="o">=</span> <span class="n">dispenser</span><span class="p">.</span><span class="n">fetch_add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">ticket</span> <span class="o">!=</span> <span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">))</span>
<span class="n">sched_yield</span><span class="p">();</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">unlock</span><span class="p">()</span> <span class="p">{</span>
<span class="n">serving</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_release</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Let’s plot the performance results for this lock alongside the existing approaches:</p>
<div class="tabs" id="tabs-fair-yield">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fair-yield-1" name="tab-group-fair-yield" checked="" />
<label class="tab-label" for="tab-fair-yield-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/fair-yield.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/fair-yield.html">
<img class="figimg" src="/assets/concurrency-costs/skl/fair-yield.svg" alt="Increment Cost: Fair Yield" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fair-yield-2" name="tab-group-fair-yield" />
<label class="tab-label" for="tab-fair-yield-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/fair-yield.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/fair-yield.html">
<img class="figimg" src="/assets/concurrency-costs/icl/fair-yield.svg" alt="Increment Cost: Fair Yield" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fair-yield-3" name="tab-group-fair-yield" />
<label class="tab-label" for="tab-fair-yield-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/fair-yield.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/fair-yield.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/fair-yield.svg" alt="Increment Cost: Fair Yield" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fair-yield-4" name="tab-group-fair-yield" />
<label class="tab-label" for="tab-fair-yield-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/fair-yield.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/fair-yield.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/fair-yield.svg" alt="Increment Cost: Fair Yield" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>This is level 3 visualized: it is an order of magnitude slower than the level 2 approaches. The slowdown comes from the <code class="language-plaintext highlighter-rouge">sched_yield</code> call: it is a system call, and system calls generally cost on the order of 100s of nanoseconds<sup id="fnref:spectre" role="doc-noteref"><a href="#fn:spectre" class="footnote" rel="footnote">10</a></sup>, and it shows in the results.</p>
<p>This lock <em>does</em> have a fast path where <code class="language-plaintext highlighter-rouge">sched_yield</code> isn’t called: if the lock is available, no spinning occurs and <code class="language-plaintext highlighter-rouge">sched_yield</code> is never called. However, the combination of being a <em>fair</em> lock and the high contention in this test means that a lock convoy quickly forms (we’ll describe this in more detail later) and so the spin loop is entered basically every time <code class="language-plaintext highlighter-rouge">lock()</code> is called.</p>
<p>So have we <em>now</em> fully plumbed the depths of slow concurrency constructs? Not even close. We are only now just about to cross the River Styx.</p>
<h4 id="revisiting-stdmutex">Revisiting std::mutex</h4>
<p>Before we proceed, let’s quickly revisit the <code class="language-plaintext highlighter-rouge">std::mutex</code> implementation discussed in level 2 in light of our definition of level 3 as requiring a system call. Doesn’t <code class="language-plaintext highlighter-rouge">std::mutex</code> <em>also</em> make system calls? If a thread tries to lock a <code class="language-plaintext highlighter-rouge">std::mutex</code> object which is already locked, we expect that thread to block using OS-provided primitives. So why isn’t it level 3 and slow like <abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr>?</p>
<p>The primary reason is that it makes <em>few</em> system calls in practice. Through a combination of spinning and unfairness I measure only about 0.18 system calls per increment, with three threads on my Skylake box. So <em>most</em> increments happen without a system call. On the other hand, <abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr> makes about 2.4 system calls per increment, more than an order of magnitude more, and so it suffers a corresponding decrease in performance.</p>
<p>That out of the way, let’s get even slower.</p>
<h3 id="level-4-implied-context-switch">Level 4: Implied Context Switch</h3>
<p>The next level is when the implementation forces a significant number of concurrent operations to cause a <em>context switch</em>.</p>
<p>The yielding lock wasn’t resulting in many context switches, since we are not running more threads than there are cores, and so there usually is no other runnable process (except for the occasional background process). Therefore, the current thread stays on the CPU when we call <code class="language-plaintext highlighter-rouge">sched_yield</code>. Of course, this burns a lot of CPU.</p>
<p>As the experts recommend whenever one suggests <em>yielding</em> in a spin loop, let us try a <em>blocking lock</em> instead.</p>
<p class="info"><strong>Blocking Locks</strong><br />
<br />
A more resource-friendly design, and one that will often perform better, is a <em>blocking</em> lock.<br /><br />Rather than busy waiting, these locks ask the OS to put the current thread to sleep until the lock becomes available. On Linux, the <a href="http://man7.org/linux/man-pages/man2/futex.2.html"><code class="language-plaintext highlighter-rouge">futex(2)</code></a> system call is the preferred way to accomplish this, while on Windows you have the <a href="https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject"><code class="language-plaintext highlighter-rouge">WaitFor*Object</code></a> API family. Above the OS interfaces, things like C++’s <code class="language-plaintext highlighter-rouge">std::condition_variable</code> provide a general purpose mechanism to wait until an arbitrary condition is true.</p>
<p>Our first blocking lock is again a ticket-based design, except this time it uses a condition variable to block when it detects that it isn’t first in line to be served (i.e., that the lock was held by another thread). We’ll name it <abbr title="A ticket lock which blocks if it cannot immediately acquire the lock.">ticket blocking</abbr> and it looks like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">blocking_ticket</span><span class="o">::</span><span class="n">lock</span><span class="p">()</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">ticket</span> <span class="o">=</span> <span class="n">dispenser</span><span class="p">.</span><span class="n">fetch_add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ticket</span> <span class="o">==</span> <span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">))</span>
<span class="k">return</span><span class="p">;</span> <span class="c1">// uncontended case</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_lock</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">lock</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="k">while</span> <span class="p">(</span><span class="n">ticket</span> <span class="o">!=</span> <span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">))</span> <span class="p">{</span>
<span class="n">cvar</span><span class="p">.</span><span class="n">wait</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">blocking_ticket</span><span class="o">::</span><span class="n">unlock</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">unique_lock</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">lock</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">s</span> <span class="o">=</span> <span class="n">serving</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">serving</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_release</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">d</span> <span class="o">=</span> <span class="n">dispenser</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">assert</span><span class="p">(</span><span class="n">s</span> <span class="o"><=</span> <span class="n">d</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">s</span> <span class="o"><</span> <span class="n">d</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// wake all waiters</span>
<span class="n">cvar</span><span class="p">.</span><span class="n">notify_all</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The main difference with the earlier implementation occurs in the case where we don’t acquire the lock immediately (we don’t return at the location marked <code class="language-plaintext highlighter-rouge">// uncontended case</code>). Instead of yielding in a loop, we take the mutex associated with the condition variable and wait until notified. Every time we are notified we check if it is our turn.</p>
<p>Even without <abbr title="When a waiter on a condition variable is woken up even though no other thread notified it.">spurious wakeups</abbr> we might get woken many times, because this lock suffers from the <em>thundering herd</em> problem where every waiter is woken on <code class="language-plaintext highlighter-rouge">unlock()</code> even though only one will ultimately be able to get the lock.</p>
<p>We’ll try a second design too, one that doesn’t suffer from the thundering herd problem. This is a queued lock, where each waiter waits on its own private node in a queue of waiters, so only a single waiter (the new lock owner) is woken up on unlock. We will call it <abbr title="A blocking ticket lock where each waiter waits on a unique condition variable.">queued fifo</abbr> and if you’re interested in the implementation you <a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/9b8e0e0dfec7d38036d114038c6a9ed020b5b775/fairlocks.cpp#L61">can find it here</a>.</p>
<p>Here’s how our new locks perform against the existing crowd:</p>
<div class="tabs" id="tabs-more-fair">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-more-fair-1" name="tab-group-more-fair" checked="" />
<label class="tab-label" for="tab-more-fair-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/more-fair.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/more-fair.html">
<img class="figimg" src="/assets/concurrency-costs/skl/more-fair.svg" alt="Increment Cost: Fair Blocking" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-more-fair-2" name="tab-group-more-fair" />
<label class="tab-label" for="tab-more-fair-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/more-fair.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/more-fair.html">
<img class="figimg" src="/assets/concurrency-costs/icl/more-fair.svg" alt="Increment Cost: Fair Blocking" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-more-fair-3" name="tab-group-more-fair" />
<label class="tab-label" for="tab-more-fair-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/more-fair.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/more-fair.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/more-fair.svg" alt="Increment Cost: Fair Blocking" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-more-fair-4" name="tab-group-more-fair" />
<label class="tab-label" for="tab-more-fair-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/more-fair.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/more-fair.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/more-fair.svg" alt="Increment Cost: Fair Blocking" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>You’re probably seeing the pattern now: performance is again a new level of terrible compared to the previous contenders. About an order of magnitude slower than the yielding approach, which was already slower than the earlier approaches, which are now just slivers a few pixels high on the plots. The queued version of the lock does slightly better at increasing thread counts (<em>especially</em> on Graviton 2), as might be expected from the lack of the thundering herd effect, but is still very slow because the primary problem isn’t thundering herd, but rather a <a href="https://en.wikipedia.org/wiki/Lock_convoy"><em>lock convoy</em></a>.</p>
<p class="info"><strong>Lock Convoy</strong><br />
<br />
Unlike unfair locks, fair locks can result in sustained convoys involving only a single lock, once the contention reaches a certain point<sup id="fnref:hyst" role="doc-noteref"><a href="#fn:hyst" class="footnote" rel="footnote">11</a></sup>.<br />
<br />
Consider what happens when two threads, <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>, try to acquire the lock repeatedly. Let’s say <code class="language-plaintext highlighter-rouge">A</code> gets ticket 1 and <code class="language-plaintext highlighter-rouge">B</code> ticket 2. So <code class="language-plaintext highlighter-rouge">A</code> gets to go first and <code class="language-plaintext highlighter-rouge">B</code> has to wait, and for these implementations that means blocking (we can say the thread is <em>parked</em> by the OS). Now, <code class="language-plaintext highlighter-rouge">A</code> unlocks the lock and sees <code class="language-plaintext highlighter-rouge">B</code> waiting and wakes it. <code class="language-plaintext highlighter-rouge">A</code> is still running and soon tries to get the lock again, receiving ticket 3, but it cannot acquire the lock immediately because the lock is <em>fair</em>: <code class="language-plaintext highlighter-rouge">A</code> can’t jump the queue and acquire the lock with ticket 3 before <code class="language-plaintext highlighter-rouge">B</code>, holding ticket 2, gets its chance to enter the lock.<br />
<br />
Of course, <code class="language-plaintext highlighter-rouge">B</code> is going to be a while: it needs to be woken by the scheduler and this takes a microsecond or two, at least. Now <code class="language-plaintext highlighter-rouge">B</code> wakes and gets the lock, and the same scenario repeats itself with the roles reversed. The upshot is that there is a full context switch for each acquisition of the lock.<br />
<br />
Unfair locks avoid this problem because they allow queue jumping: in the scenario above, <code class="language-plaintext highlighter-rouge">A</code> (or any other thread) could re-acquire the lock after unlocking it, before <code class="language-plaintext highlighter-rouge">B</code> got its chance. So the use of the shared resource doesn’t grind to a halt while <code class="language-plaintext highlighter-rouge">B</code> wakes up.</p>
<p>So, are you tired of seeing mostly-white plots where the newly introduced algorithm relegates the rest of the pack to little chunks of color near the x-axis, yet?</p>
<p>I’ve just got one more left on the slow end of the scale. Unlike the other examples, I haven’t actually diagnosed something <em>this</em> bad in real life, but examples are out there.</p>
<h3 id="level-5-catastrophe">Level 5: Catastrophe</h3>
<p>Here’s a ticket lock which is identical to the <a href="#ys-lock">first ticket lock we saw</a>, except that the <code class="language-plaintext highlighter-rouge">sched_yield();</code> is replaced by <code class="language-plaintext highlighter-rouge">;</code>. That is, it busy waits instead of yielding (<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/9b8e0e0dfec7d38036d114038c6a9ed020b5b775/fairlocks.cpp#L31">and here are the spin flavors which specialize on a shared ticket lock template</a>). You could also replace this by a CPU-specific “relax” instruction like <a href="https://www.felixcloutier.com/x86/pause"><code class="language-plaintext highlighter-rouge">pause</code></a>, but it won’t <a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/9b8e0e0dfec7d38036d114038c6a9ed020b5b775/fairlocks.hpp#L26">change the outcome</a>. We call it <abbr title="A traditional spin-based ticket lock that does a hot spin while waiting for its ticket to be next.">ticket spin</abbr>, and here’s how it performs compared to the existing candidates:</p>
<div class="tabs" id="tabs-ts-4">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-4-1" name="tab-group-ts-4" checked="" />
<label class="tab-label" for="tab-ts-4-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/ts-4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/ts-4.html">
<img class="figimg" src="/assets/concurrency-costs/skl/ts-4.svg" alt="Increment Cost: Ticket Spin" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-4-2" name="tab-group-ts-4" />
<label class="tab-label" for="tab-ts-4-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/ts-4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/ts-4.html">
<img class="figimg" src="/assets/concurrency-costs/icl/ts-4.svg" alt="Increment Cost: Ticket Spin" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-4-3" name="tab-group-ts-4" />
<label class="tab-label" for="tab-ts-4-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/ts-4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/ts-4.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/ts-4.svg" alt="Increment Cost: Ticket Spin" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-4-4" name="tab-group-ts-4" />
<label class="tab-label" for="tab-ts-4-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/ts-4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/ts-4.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/ts-4.svg" alt="Increment Cost: Ticket Spin" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>What? That doesn’t look too bad at all. In fact, it is only slightly worse than the level 2 crew, the fastest we’ve seen so far<sup id="fnref:huh" role="doc-noteref"><a href="#fn:huh" class="footnote" rel="footnote">12</a></sup>.</p>
<p>The picture changes if we show the results for up to 6 threads, rather than just 4. Since I have 4 available cores<sup id="fnref:noht" role="doc-noteref"><a href="#fn:noht" class="footnote" rel="footnote">13</a></sup>, this means that not all the test threads will be able to run at once:</p>
<div class="tabs" id="tabs-ts-6">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-6-1" name="tab-group-ts-6" checked="" />
<label class="tab-label" for="tab-ts-6-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/ts-6.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/ts-6.html">
<img class="figimg" src="/assets/concurrency-costs/skl/ts-6.svg" alt="Increment Cost: Ticket Spin (Oversubscribed)" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-6-2" name="tab-group-ts-6" />
<label class="tab-label" for="tab-ts-6-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/ts-6.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/ts-6.html">
<img class="figimg" src="/assets/concurrency-costs/icl/ts-6.svg" alt="Increment Cost: Ticket Spin (Oversubscribed)" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-6-3" name="tab-group-ts-6" />
<label class="tab-label" for="tab-ts-6-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/ts-6.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/ts-6.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/ts-6.svg" alt="Increment Cost: Ticket Spin (Oversubscribed)" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-ts-6-4" name="tab-group-ts-6" />
<label class="tab-label" for="tab-ts-6-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/ts-6.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/ts-6.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/ts-6.svg" alt="Increment Cost: Ticket Spin (Oversubscribed)" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>Now it becomes clear why this level is called <em>catastrophic</em>. As soon as we oversubscribe the number of available cores, performance gets about <em>five hundred times worse</em>. We go from 100s of nanoseconds to 100s of microseconds. I don’t show more threads, but it only gets worse as you add more.</p>
<p>We are also about an order of magnitude slower than the best solution (<abbr title="A blocking ticket lock where each waiter waits on a unique condition variable.">queued fifo</abbr>) of the previous level, although it varies a lot by hardware: on Ice Lake the difference is more like <em>forty</em> times, while on Graviton this solution is actually slightly faster than <abbr title="A ticket lock which blocks if it cannot immediately acquire the lock.">ticket blocking</abbr> (also level 4) at 17 threads. Note also the huge error bars: this is the least consistent benchmark of the bunch, and the slowest and fastest runs might differ by a factor of 100.</p>
<h4 id="lock-convoy-on-steroids">Lock Convoy on Steroids</h4>
<p>So what happens here?</p>
<p>It’s similar to the lock convoy described above: all the threads queue on the lock and acquire it in a round-robin order due to the fair design. The difference is that threads don’t block when they can’t acquire the lock. This works out great when the cores are not oversubscribed, but falls off a cliff otherwise.</p>
<p>Imagine 5 threads, <code class="language-plaintext highlighter-rouge">T1</code>, <code class="language-plaintext highlighter-rouge">T2</code>, …, <code class="language-plaintext highlighter-rouge">T5</code>, where <code class="language-plaintext highlighter-rouge">T5</code> is the one not currently running. As soon as <code class="language-plaintext highlighter-rouge">T5</code> is the thread that needs to acquire the lock next (i.e., <code class="language-plaintext highlighter-rouge">T5</code>’s saved ticket value is equal to <code class="language-plaintext highlighter-rouge">serving</code>), nothing will happen because <code class="language-plaintext highlighter-rouge">T1</code> through <code class="language-plaintext highlighter-rouge">T4</code> are busily spinning away waiting for their turn. The OS scheduler sees no reason to interrupt them until their time slice expires. Time slices are usually measured in milliseconds. Once one thread is preempted, say <code class="language-plaintext highlighter-rouge">T1</code>, <code class="language-plaintext highlighter-rouge">T5</code> will get the chance to run, but at most 4 total acquisitions can happen (<code class="language-plaintext highlighter-rouge">T5</code>, plus any of <code class="language-plaintext highlighter-rouge">T2</code>, <code class="language-plaintext highlighter-rouge">T3</code>, <code class="language-plaintext highlighter-rouge">T4</code>), before it’s <code class="language-plaintext highlighter-rouge">T1</code>’s turn. <code class="language-plaintext highlighter-rouge">T1</code> is waiting for its chance to run again, but since everyone is spinning this won’t occur until another time slice expires.</p>
<p>So the lock can only be acquired a few times (at most <code class="language-plaintext highlighter-rouge">$(nproc)</code> times), or as little as once<sup id="fnref:once" role="doc-noteref"><a href="#fn:once" class="footnote" rel="footnote">14</a></sup>, every time slice. Modern Linux using <a href="https://en.wikipedia.org/wiki/Completely_Fair_Scheduler">CFS</a> doesn’t have a fixed timeslice, but on my system, <code class="language-plaintext highlighter-rouge">sched_latency_ns</code> is 18,000,000 which means that we expect two threads competing for one core to get a typical timeslice of 9 ms. The measured numbers are roughly consistent with a timeslice of single-digit milliseconds.</p>
<p>If I was good at diagrams, there would be a diagram here.</p>
<p>Another way of thinking about this is that in this over-subscription scenario, the <abbr title="A traditional spin-based ticket lock that does a hot spin while waiting for its ticket to be next.">ticket spin</abbr> lock implies roughly the same number of context switches as the blocking ticket lock<sup id="fnref:perf" role="doc-noteref"><a href="#fn:perf" class="footnote" rel="footnote">15</a></sup>, but in the former case each context switch comes with a giant delay caused by the need to exhaust the timeslice, while in the blocking case we are only limited by how fast a context switch can occur.</p>
<p>Interestingly, although this benchmark uses 100% CPU on every core, its performance in the oversubscribed case almost doesn’t depend on your CPU speed! Performance is approximately the same whether I throttle my CPU to 1 GHz or enable turbo up to 3.5 GHz. All of the other implementations scale almost proportionally with CPU frequency. The benchmark does scale strongly with adjustments to <code class="language-plaintext highlighter-rouge">sched_latency_ns</code> (and <code class="language-plaintext highlighter-rouge">sched_min_granularity_ns</code> if the former is set low enough): lower scheduling latency values give proportionally better performance as the time slices shrink, helping to confirm our theory of how this works.</p>
<p>This behavior also explains the large amount of variance once the available cores are oversubscribed: by definition, not all threads will be running at once, so the test becomes very sensitive to exactly where the not-running threads took their context switch. At the beginning of the test, only 4 of 6 threads will be running, and the other two will be switched out, still waiting on the <a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/master/cyclic-barrier.hpp">barrier</a> that synchronizes the test start. Since the two switched-out threads haven’t tried to get the lock yet, the four running threads can quickly share the lock among themselves, because the six-thread convoy hasn’t been set up.</p>
<p>This runs up the “iteration count” (work done) during an initial period which varies randomly, until the first context switch lets the fifth thread join the competition and then the convoy gets set up<sup id="fnref:csdepend" role="doc-noteref"><a href="#fn:csdepend" class="footnote" rel="footnote">16</a></sup>. That’s when the catastrophe starts. This makes the results very noisy: for example, if you set a too-short time period for a trial, the <em>entire test</em> is composed of this initial phase and the results are artificially “good”.</p>
<p>We can probably invent something even worse, but that’s enough for now. Let’s move on to scenarios that are <em>faster</em> than the use of vanilla <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>.</p>
<h3 id="level-1-uncontended-atomics">Level 1: Uncontended Atomics</h3>
<p>Recall that we started at level 2: contended atomics. The name gives it away: the next faster level is when atomic operations are used but there is no contention, either by design or by luck. You might have noticed that so far we’ve only shown results for at least two threads. That’s because the single threaded case involves no contention, and so every implementation so far is level 1 if run on a single thread<sup id="fnref:notexx" role="doc-noteref"><a href="#fn:notexx" class="footnote" rel="footnote">17</a></sup>.</p>
<p>Here are the results for all the implementations we’ve looked at so far, for a single thread:</p>
<div class="tabs" id="tabs-single">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-single-1" name="tab-group-single" checked="" />
<label class="tab-label" for="tab-single-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/single.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/single.html">
<img class="figimg" src="/assets/concurrency-costs/skl/single.svg" alt="Increment Cost: Single Threaded" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-single-2" name="tab-group-single" />
<label class="tab-label" for="tab-single-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/single.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/single.html">
<img class="figimg" src="/assets/concurrency-costs/icl/single.svg" alt="Increment Cost: Single Threaded" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-single-3" name="tab-group-single" />
<label class="tab-label" for="tab-single-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/single.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/single.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/single.svg" alt="Increment Cost: Single Threaded" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-single-4" name="tab-group-single" />
<label class="tab-label" for="tab-single-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/single.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/single.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/single.svg" alt="Increment Cost: Single Threaded" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>The fastest implementations run in about 10 nanoseconds, which is 5x faster than the fastest solution for 2 or more threads. The <em>slowest</em> implementation (<abbr title="A blocking ticket lock where each waiter waits on a unique condition variable.">queued fifo</abbr>) for one thread ties the <em>fastest</em> implementation (<abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>) at two threads, and beats it handily at three or four.</p>
<p>The number overlaid on each bar is the number of atomic operations<sup id="fnref:atomhow" role="doc-noteref"><a href="#fn:atomhow" class="footnote" rel="footnote">18</a></sup> each implementation makes per increment. It is obvious that performance is almost directly proportional to the number of atomic instructions. On the other hand, performance does <em>not</em> have much of a relationship with the total number of instructions of any type, which varies a lot even between algorithms with the same performance, as the following table shows:</p>
<table>
<thead>
<tr>
<th>Algorithm</th>
<th style="text-align: right">Atomics</th>
<th style="text-align: right">Instructions</th>
<th style="text-align: right">Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td><abbr title="Uses a std::mutex and std::lock_guard to protect a plain integer counter.">mutex add</abbr></td>
<td style="text-align: right">2</td>
<td style="text-align: right">64</td>
<td style="text-align: right">~21 ns</td>
</tr>
<tr>
<td><abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">4</td>
<td style="text-align: right">~7 ns</td>
</tr>
<tr>
<td><abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">7</td>
<td style="text-align: right">~12 ns</td>
</tr>
<tr>
<td><abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">13</td>
<td style="text-align: right">~10 ns</td>
</tr>
<tr>
<td><abbr title="A ticket lock which blocks if it cannot immediately acquire the lock.">ticket blocking</abbr></td>
<td style="text-align: right">3</td>
<td style="text-align: right">107</td>
<td style="text-align: right">~32 ns</td>
</tr>
<tr>
<td><abbr title="A blocking ticket lock where each waiter waits on a unique condition variable.">queued fifo</abbr></td>
<td style="text-align: right">4</td>
<td style="text-align: right">167</td>
<td style="text-align: right">~45 ns</td>
</tr>
<tr>
<td><abbr title="A traditional spin-based ticket lock that does a hot spin while waiting for its ticket to be next.">ticket spin</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">13</td>
<td style="text-align: right">~10 ns</td>
</tr>
<tr>
<td><abbr title="A simple mutex from &quot;Futexes Are Tricky&quot;.">mutex3</abbr></td>
<td style="text-align: right">2</td>
<td style="text-align: right">17</td>
<td style="text-align: right">~20 ns</td>
</tr>
</tbody>
</table>
<p>In particular, note that <abbr title="Uses a std::mutex and std::lock_guard to protect a plain integer counter.">mutex add</abbr> has more than 9x the number of instructions compared to <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> yet still runs at half the speed, in line with the 2:1 ratio of atomics. Similarly, <abbr title="A ticket lock that calls sched_yield in a spin loop while waiting for its turn.">ticket yield</abbr> and <abbr title="A traditional spin-based ticket lock that does a hot spin while waiting for its ticket to be next.">ticket spin</abbr> have slightly <em>better</em> performance than <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> despite having about twice the number of instructions, in line with them all having a single atomic operation<sup id="fnref:casworse" role="doc-noteref"><a href="#fn:casworse" class="footnote" rel="footnote">19</a></sup>.</p>
<p>The last row in the table shows the performance of <abbr title="A simple mutex from &quot;Futexes Are Tricky&quot;.">mutex3</abbr>, an implementation we haven’t discussed. It is a basic mutex offering functionality similar to <code class="language-plaintext highlighter-rouge">std::mutex</code>; its implementation is described in <a href="https://akkadia.org/drepper/futex.pdf">Futexes Are Tricky</a>. Because it doesn’t need to pass through two layers of abstraction<sup id="fnref:twolayer" role="doc-noteref"><a href="#fn:twolayer" class="footnote" rel="footnote">20</a></sup>, it has only about one third the instruction count of <code class="language-plaintext highlighter-rouge">std::mutex</code>, yet performance is almost exactly the same, differing by less than 10%.</p>
<p>So the idea that you can almost ignore things that are in a lower cost tier seems to hold here. Don’t take this too far: if you design a lock with a single atomic operation but 1,000 other instructions, it is not going to be fast. There are also reasons to keep your instruction count low other than microbenchmark performance: smaller instruction cache footprint, less space occupied in various <abbr title="Out-of-order execution allows CPUs to execute instructions out of order with respect to the source.">out-of-order</abbr> execution buffers, more favorable inlining tradeoffs, etc.</p>
<p>Here it is important to note that the change in level of our various functions didn’t require a change in implementation. These are exactly the same few implementations we discussed at the slower levels. Instead, we simply changed (by fiat, i.e., by adjusting the benchmark parameters) the contention level from “very high” to “zero”. So in this case the level doesn’t depend only on the code, but also on this external factor. Of course, just saying that we are going to get to level 1 by only running one thread is not very useful in real life: we often can’t simply ban multi-threaded operation.</p>
<p>So can we get to level 1 even under concurrent calls from multiple threads? For this particular problem, we can.</p>
<h4 id="adaptive-multi-counter">Adaptive Multi-Counter</h4>
<p>One option is to use multiple counters to represent the counter value. We try to organize it so that threads running concurrently on different CPUs will increment different counters. Thus the <em>logical</em> counter value is split across all of these internal <em>physical</em> counters, and so a read of the logical counter value now needs to add together all the physical counter values.</p>
<p>Here’s an implementation:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">cas_multi_counter</span> <span class="p">{</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">NUM_COUNTERS</span> <span class="o">=</span> <span class="mi">64</span><span class="p">;</span>
<span class="k">static</span> <span class="k">thread_local</span> <span class="kt">size_t</span> <span class="n">idx</span><span class="p">;</span>
<span class="n">multi_holder</span> <span class="n">array</span><span class="p">[</span><span class="n">NUM_COUNTERS</span><span class="p">];</span>
<span class="nl">public:</span>
<span class="cm">/** increment the logical counter value */</span>
<span class="kt">uint64_t</span> <span class="k">operator</span><span class="o">++</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
<span class="k">auto</span><span class="o">&</span> <span class="n">counter</span> <span class="o">=</span> <span class="n">array</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">counter</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">counter</span><span class="p">.</span><span class="n">compare_exchange_strong</span><span class="p">(</span><span class="n">cur</span><span class="p">,</span> <span class="n">cur</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">cur</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// CAS failure indicates contention,</span>
<span class="c1">// so try again at a different index</span>
<span class="n">idx</span> <span class="o">=</span> <span class="p">(</span><span class="n">idx</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">NUM_COUNTERS</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">uint64_t</span> <span class="n">read</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">uint64_t</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&</span> <span class="n">h</span> <span class="o">:</span> <span class="n">array</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">h</span><span class="p">.</span><span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>We’ll call this <abbr title="Uses a CAS on an adaptively selected per-CPU counter.">cas multi</abbr>, and the approach is relatively straightforward.</p>
<p>There are 64 padded<sup id="fnref:padded" role="doc-noteref"><a href="#fn:padded" class="footnote" rel="footnote">21</a></sup> physical counters whose sum makes up the logical counter value. There is a thread-local <code class="language-plaintext highlighter-rouge">idx</code> value, initially zero for every thread, that points to the physical counter that each thread should increment. When <code class="language-plaintext highlighter-rouge">operator++</code> is called, we attempt to increment the counter pointed to by <code class="language-plaintext highlighter-rouge">idx</code> using <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>.</p>
<p>If this fails, however, we don’t simply retry. Failure indicates contention<sup id="fnref:notallfailure" role="doc-noteref"><a href="#fn:notallfailure" class="footnote" rel="footnote">22</a></sup> (this is the only way the <em>strong</em> variant of <code class="language-plaintext highlighter-rouge">compare_exchange</code> can fail), so we add one to <code class="language-plaintext highlighter-rouge">idx</code> to try another counter on the next attempt.</p>
<p>In a high-contention scenario like our benchmark, every CPU quickly ends up pointing to a different index value. If there is low contention, it is possible that only the first physical counter will be used.</p>
<p>Let’s compare this to the <code class="language-plaintext highlighter-rouge">atomic add</code> version we looked at above, which was the fastest of the level 2 approaches. Recall that it uses an <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr> on a single counter.</p>
<div class="tabs" id="tabs-cas-multi">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-cas-multi-1" name="tab-group-cas-multi" checked="" />
<label class="tab-label" for="tab-cas-multi-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/cas-multi.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/cas-multi.html">
<img class="figimg" src="/assets/concurrency-costs/skl/cas-multi.svg" alt="Increment Cost: Contention Adaptive Multi-Counter" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-cas-multi-2" name="tab-group-cas-multi" />
<label class="tab-label" for="tab-cas-multi-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/cas-multi.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/cas-multi.html">
<img class="figimg" src="/assets/concurrency-costs/icl/cas-multi.svg" alt="Increment Cost: Contention Adaptive Multi-Counter" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-cas-multi-3" name="tab-group-cas-multi" />
<label class="tab-label" for="tab-cas-multi-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/cas-multi.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/cas-multi.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/cas-multi.svg" alt="Increment Cost: Contention Adaptive Multi-Counter" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-cas-multi-4" name="tab-group-cas-multi" />
<label class="tab-label" for="tab-cas-multi-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/cas-multi.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/cas-multi.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/cas-multi.svg" alt="Increment Cost: Contention Adaptive Multi-Counter" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>For 1 active core, the results are the same as we saw earlier: the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> approach performs the same as the <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> algorithm<sup id="fnref:perfsame" role="doc-noteref"><a href="#fn:perfsame" class="footnote" rel="footnote">23</a></sup>, which is somewhat slower than <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>, due to the need for an additional load (i.e., the line with <code class="language-plaintext highlighter-rouge">counter.load()</code>) to set up the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>.</p>
<p>For 2 to 4 cores, the situation changes dramatically. The multiple counter approach performs the <em>same</em> regardless of the number of active cores. That is, it exhibits perfect scaling with multiple cores – in contrast to the single-counter approach which scales poorly. At four cores, the relative speedup of the multi-counter approach is about 9x. On Amazon’s Graviton ARM processor the speedup approaches <em>eighty</em> times at 16 threads.</p>
<p>This improvement in increment performance comes at a cost, however:</p>
<ul>
<li>64 counters ought to be enough for anyone, but they take 4096 (!!) bytes of memory to store what takes only 8 bytes in the <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr> approach<sup id="fnref:eightbyte" role="doc-noteref"><a href="#fn:eightbyte" class="footnote" rel="footnote">24</a></sup>.</li>
<li>The <code class="language-plaintext highlighter-rouge">read()</code> method is much slower: it needs to iterate over and add all 64 values, versus a single load for the earlier approaches.</li>
<li>The implementation compiles to much larger code: 113 bytes versus 15 bytes for the single counter <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> approach or 7 bytes for the <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr> approach.</li>
<li>The concurrent behavior is considerably harder to reason about and document. For example, it is harder to explain the consistency condition provided by <code class="language-plaintext highlighter-rouge">read()</code> since it is no longer a single atomic read<sup id="fnref:read" role="doc-noteref"><a href="#fn:read" class="footnote" rel="footnote">25</a></sup>.</li>
<li>There is a single thread-local <code class="language-plaintext highlighter-rouge">idx</code> variable. So while different <code class="language-plaintext highlighter-rouge">cas_multi_counter</code> instances are logically independent, the shared <code class="language-plaintext highlighter-rouge">idx</code> variable means that things that happen in one counter can affect the non-functional behavior of the others<sup id="fnref:sharedidx" role="doc-noteref"><a href="#fn:sharedidx" class="footnote" rel="footnote">26</a></sup>.</li>
</ul>
<p>Some of these downsides can be partly mitigated:</p>
<ul>
<li>A much smaller number of counters would probably be better for most practical uses. We could also set the array size dynamically based on the detected number of logical CPUs since a larger array should not provide much of a performance increase. Better yet, we might make the size even more dynamic, based on contention: start with a single element and grow it only when contention is detected. This means that even on systems with many CPUs, the size will remain small if contention is never seen in practice. This has a runtime cost<sup id="fnref:rtcost" role="doc-noteref"><a href="#fn:rtcost" class="footnote" rel="footnote">27</a></sup>, however.</li>
<li>We could optimize the <code class="language-plaintext highlighter-rouge">read()</code> method by stopping when we see a zero counter. I believe a careful analysis shows that the non-zero counter values for any instance of this class are all in a contiguous region starting from the beginning of the counter array<sup id="fnref:subtle" role="doc-noteref"><a href="#fn:subtle" class="footnote" rel="footnote">28</a></sup>.</li>
<li>We could mitigate some of the code footprint by carefully carving the “less hot”<sup id="fnref:lesshot" role="doc-noteref"><a href="#fn:lesshot" class="footnote" rel="footnote">29</a></sup> slow path out into another function, and use our <a href="https://xania.org/201209/forcing-code-out-of-line-in-gcc">magic powers</a> to encourage the small fast path (the first <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>) to be inlined while the fallback remains out of line.</li>
<li>We could make the thread-local <code class="language-plaintext highlighter-rouge">idx</code> per-instance to solve the “shared <code class="language-plaintext highlighter-rouge">idx</code> across all instances” problem. This does require a non-negligible amount of work to implement a dynamic TLS system that can create as many thread-local keys as you want<sup id="fnref:dynamictls" role="doc-noteref"><a href="#fn:dynamictls" class="footnote" rel="footnote">30</a></sup>, and it is slower.</li>
</ul>
<p>So while we got a good looking chart, this solution doesn’t exactly dominate the simpler ones. You pay a price along several axes for the lack of contention and you shouldn’t blindly replace the simpler solutions with this one – it needs to be a carefully considered and use-case dependent decision.</p>
<p>Is it over yet? Can I close this browser tab and reclaim all that memory? Almost. Just one level to go.</p>
<h3 id="level-0-vanilla">Level 0: Vanilla</h3>
<p>The last and fastest level is achieved when only vanilla instructions are used (and without contention). By <em>vanilla instructions</em> I mean things like regular loads and stores which don’t imply additional synchronization above what the hardware memory model offers by default<sup id="fnref:noatomic" role="doc-noteref"><a href="#fn:noatomic" class="footnote" rel="footnote">31</a></sup>.</p>
<p>How can we increment a counter atomically while allowing it to be read from any thread? By ensuring there is only one writer for any given physical counter. If we keep a counter <em>per thread</em> and only allow the owning thread to write to it, there is no need for an atomic increment.</p>
<p>The obvious way to keep a per-thread counter is to use thread-local storage. Something like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* Keeps a counter per thread, readers need to sum
* the counters from all active threads and add the
* accumulated value from dead threads.
*/</span>
<span class="k">class</span> <span class="nc">tls_counter</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">uint64_t</span><span class="o">></span> <span class="n">counter</span><span class="p">{</span><span class="mi">0</span><span class="p">};</span>
<span class="cm">/* protects all_counters and accumulator */</span>
<span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">mutex</span> <span class="n">lock</span><span class="p">;</span>
<span class="cm">/* list of all active counters */</span>
<span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">tls_counter</span> <span class="o">*></span> <span class="n">all_counters</span><span class="p">;</span>
<span class="cm">/* accumulated value of counters from dead threads */</span>
<span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">accumulator</span><span class="p">;</span>
<span class="cm">/* per-thread tls_counter object */</span>
<span class="k">static</span> <span class="k">thread_local</span> <span class="n">tls_counter</span> <span class="n">tls</span><span class="p">;</span>
<span class="cm">/** add ourselves to the counter list */</span>
<span class="n">tls_counter</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">g</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">all_counters</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="k">this</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/**
* destruction means the thread is going away, so
* we stash the current value in the accumulator and
* remove ourselves from the array
*/</span>
<span class="o">~</span><span class="n">tls_counter</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">g</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="n">accumulator</span> <span class="o">+=</span> <span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">all_counters</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">remove</span><span class="p">(</span><span class="n">all_counters</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">all_counters</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="k">this</span><span class="p">),</span> <span class="n">all_counters</span><span class="p">.</span><span class="n">end</span><span class="p">());</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">incr</span><span class="p">()</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">counter</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">cur</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
<span class="nl">public:</span>
<span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">read</span><span class="p">()</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">g</span><span class="p">(</span><span class="n">lock</span><span class="p">);</span>
<span class="kt">uint64_t</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">h</span> <span class="o">:</span> <span class="n">all_counters</span><span class="p">)</span> <span class="p">{</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">h</span><span class="o">-></span><span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">count</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sum</span> <span class="o">+</span> <span class="n">accumulator</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">increment</span><span class="p">()</span> <span class="p">{</span>
<span class="n">tls</span><span class="p">.</span><span class="n">incr</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The approach is similar to the per-CPU counter, except that we keep one counter per thread, using <code class="language-plaintext highlighter-rouge">thread_local</code>. Unlike earlier implementations, you don’t create instances of this class: there is only one counter and you increment it by calling the static method <code class="language-plaintext highlighter-rouge">tls_counter::increment()</code>.</p>
<p>Let’s focus a moment on the actual increment inside the thread-local counter instance:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">incr</span><span class="p">()</span> <span class="p">{</span>
<span class="k">auto</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">counter</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="n">counter</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">cur</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_relaxed</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is just a verbose way of saying “add 1 to this <code class="language-plaintext highlighter-rouge">std::atomic<uint64_t></code> but it doesn’t have to be atomic”. We don’t need an atomic increment as there is only one writer<sup id="fnref:whyatomic" role="doc-noteref"><a href="#fn:whyatomic" class="footnote" rel="footnote">32</a></sup>. Using the <em>relaxed</em> memory order means that no barriers are inserted<sup id="fnref:barrier" role="doc-noteref"><a href="#fn:barrier" class="footnote" rel="footnote">33</a></sup>. We still need a way to read all the thread-local counters, and the rest of the code deals with that: there is a global vector of pointers to all the active <code class="language-plaintext highlighter-rouge">tls_counter</code> objects, and <code class="language-plaintext highlighter-rouge">read()</code> iterates over this. All access to this vector is protected by a <code class="language-plaintext highlighter-rouge">std::mutex</code>, since it will be accessed concurrently. When threads die, we remove their entry from the array, and add their final value to <code class="language-plaintext highlighter-rouge">tls_counter::accumulator</code> which is added to the sum of active counters in <code class="language-plaintext highlighter-rouge">read()</code>.</p>
<p>Whew.</p>
<p>So how does this <abbr title="Uses thread-local storage for a counter per thread.">tls add</abbr> implementation benchmark?</p>
<div class="tabs" id="tabs-tls">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-tls-1" name="tab-group-tls" checked="" />
<label class="tab-label" for="tab-tls-1">Skylake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/skl/tls.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/skl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/skl/tls.html">
<img class="figimg" src="/assets/concurrency-costs/skl/tls.svg" alt="Increment Cost: Thread Local Storage" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-tls-2" name="tab-group-tls" />
<label class="tab-label" for="tab-tls-2">Ice Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/icl/tls.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/icl/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/icl/tls.html">
<img class="figimg" src="/assets/concurrency-costs/icl/tls.svg" alt="Increment Cost: Thread Local Storage" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-tls-3" name="tab-group-tls" />
<label class="tab-label" for="tab-tls-3">Graviton</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g1-16/tls.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g1-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g1-16/tls.html">
<img class="figimg" src="/assets/concurrency-costs/g1-16/tls.svg" alt="Increment Cost: Thread Local Storage" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-tls-4" name="tab-group-tls" />
<label class="tab-label" for="tab-tls-4">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/concurrency-costs/g2-16/tls.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/concurrency-hierarchy-bench/tree/master/results/g2-16/combined.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/concurrency-costs/g2-16/tls.html">
<img class="figimg" src="/assets/concurrency-costs/g2-16/tls.svg" alt="Increment Cost: Thread Local Storage" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>That’s two nanoseconds per increment, regardless of the number of active cores. This turns out to be exactly as fast as just incrementing a variable in memory with a single instruction like <code class="language-plaintext highlighter-rouge">inc [eax]</code> or <code class="language-plaintext highlighter-rouge">add [eax], 1</code>, so it’s somehow as fast as possible for any solution which ends up incrementing something in memory<sup id="fnref:whitelie" role="doc-noteref"><a href="#fn:whitelie" class="footnote" rel="footnote">34</a></sup>.</p>
<p>Let’s take a look at the number of atomics, total instructions and performance for the three implementations in the last plot, for four threads:</p>
<table>
<thead>
<tr>
<th>Algorithm</th>
<th style="text-align: right">Atomics</th>
<th style="text-align: right">Instructions</th>
<th style="text-align: right">Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td><abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">4</td>
<td style="text-align: right">~ 110 ns</td>
</tr>
<tr>
<td><abbr title="Uses a CAS on an adaptively per-CPU counter.">cas multi</abbr></td>
<td style="text-align: right">1</td>
<td style="text-align: right">7</td>
<td style="text-align: right">~ 12 ns</td>
</tr>
<tr>
<td><abbr title="Uses thread-local storage for a counter per thread.">tls add</abbr></td>
<td style="text-align: right">0</td>
<td style="text-align: right">12</td>
<td style="text-align: right">~ 2 ns</td>
</tr>
</tbody>
</table>
<p>This is a clear indication that the difference in performance has very little to do with the number of instructions: the ranking by instruction count is exactly the reverse of the ranking by performance! <abbr title="Uses thread-local storage for a counter per thread.">tls add</abbr> has three times the number of instructions, yet is more than <em>fifty times</em> faster (so the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> varies by a factor of more than 150x).</p>
<p>As we saw at the last level, this improvement in performance doesn’t come for free:</p>
<ul>
<li>The total code size is considerably larger than the per-CPU approach, although most of it is related to creation of the initial object on each thread, and not on the hot path.</li>
<li>We have one object per thread, instead of per CPU. For an application with many threads using the counter, this may mean the creation of many individual counters which use both more memory<sup id="fnref:tlsmem" role="doc-noteref"><a href="#fn:tlsmem" class="footnote" rel="footnote">35</a></sup> and result in a slower <code class="language-plaintext highlighter-rouge">read()</code> function.</li>
<li>This implementation only supports <em>one</em> counter: the key methods in <code class="language-plaintext highlighter-rouge">tls_counter</code> are static. This boils down to the need for a <code class="language-plaintext highlighter-rouge">thread_local</code> object for the physical counter, which must be static by the rules of C++. A template parameter could be added to allow multiple counters based on dummy types used as tags, but this is still more awkward to use than instances of a class (and some platforms <a href="https://docs.microsoft.com/en-us/windows/win32/procthread/thread-local-storage">have limits</a> on the number of <code class="language-plaintext highlighter-rouge">thread_local</code> variables). This limitation could be removed in the same way as discussed earlier for the <abbr title="Uses a CAS on an adaptively per-CPU counter.">cas multi</abbr> <code class="language-plaintext highlighter-rouge">idx</code> variable, but at a cost in performance and complexity.</li>
<li>A lock was introduced to protect the array of all counters. Although the important increment operation is still lock-free, things like the <code class="language-plaintext highlighter-rouge">read()</code> call, the first counter access on a given thread and thread destruction all compete for the same lock. This could be eased with a read-write lock or a concurrent data structure, but at a cost as always.</li>
</ul>
<h2 id="the-table">The Table</h2>
<style>
.yesno {
display: inline-block;
border-radius: 3px;
min-width: 25px;
text-align: center;
}
.yes {
background-color: #070;
padding: 3px;
}
.no {
background-color: orangered;
padding: 3px 5px;
}
</style>
<p>Let’s summarize all the levels in this table.</p>
<p>The <em>~Cost</em> column is a <em>very</em> approximate estimate of the cost of each “occurrence” of the expensive operation associated with the level. It should be taken as a very rough ballpark for current Intel and AMD hardware, but especially the later levels can vary a lot.</p>
<p>The <em>Perf Event</em> column lists a Linux <code class="language-plaintext highlighter-rouge">perf</code> event that you can use to count the number of times the operation associated with this level occurs, i.e., the thing that is slow. For example, in level 1, you count atomic operations using the <code class="language-plaintext highlighter-rouge">mem_inst_retired.lock_loads</code> counter, and if you get three counts per high level operation, you can expect roughly 3 x 10 ns = 30 ns cost. Of course, you don’t necessarily need perf in this case: you can inspect the assembly too.</p>
<p>The <em>Local</em> column records whether the behavior of this level is <em>core local</em>. If yes, it means that operations on different cores complete independently and don’t compete and so the performance scales with the number of cores. If not, there is contention or serialization, so the throughput of the entire system is often limited, regardless of how many cores are involved. For example, only one core at a time performs an atomic operation on a cache line, so the throughput of the whole system is fixed and the throughput per core decreases as more cores become involved.</p>
<p>The <em>Key Characteristic</em> tries to get across the idea of the level in one bite-sized chunk.</p>
<table>
<thead>
<tr>
<th>Level</th>
<th>Name</th>
<th style="text-align: right">~Cost (ns)</th>
<th style="text-align: center">Perf Event</th>
<th style="text-align: center">Local</th>
<th>Key Characteristic</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>Vanilla</td>
<td style="text-align: right">low</td>
<td style="text-align: center">depends</td>
<td style="text-align: center"><strong class="yes yesno">Yes</strong></td>
<td>No atomic instructions or contended accesses at all</td>
</tr>
<tr>
<td>1</td>
<td>Uncontended Atomic</td>
<td style="text-align: right">10</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">mem_inst_retired.</code> <code class="language-plaintext highlighter-rouge">lock_loads</code></td>
<td style="text-align: center"><strong class="yes yesno">Yes</strong></td>
<td>Atomic instructions without contention</td>
</tr>
<tr>
<td>2</td>
<td>True Sharing</td>
<td style="text-align: right">40 - 400</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">mem_load_l3_hit_retired.</code> <code class="language-plaintext highlighter-rouge">xsnp_hitm</code></td>
<td style="text-align: center"><strong class="no yesno">No</strong></td>
<td>Contended atomics or locks</td>
</tr>
<tr>
<td>3</td>
<td>Syscall</td>
<td style="text-align: right">1,000</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">raw_syscalls:sys_enter</code></td>
<td style="text-align: center"><strong class="no yesno">No</strong></td>
<td>System call</td>
</tr>
<tr>
<td>4</td>
<td>Context Switch</td>
<td style="text-align: right">10,000</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">context-switches</code></td>
<td style="text-align: center"><strong class="no yesno">No</strong></td>
<td>Forced context switch</td>
</tr>
<tr>
<td>5</td>
<td>Catastrophe</td>
<td style="text-align: right">huge</td>
<td style="text-align: center">depends</td>
<td style="text-align: center"><strong class="no yesno">No</strong></td>
<td>Stalls until quantum exhausted, or other sadness</td>
</tr>
</tbody>
</table>
<h2 id="so-what">So What?</h2>
<p>What’s the point of all this?</p>
<p>Primarily, I use the hierarchy as a simplification mechanism when thinking about concurrency and performance. As a first order approximation <em>you mostly only need to care about the operations related to the current level</em>. That is, if you are focusing on something which has contended atomic operations (level 2), you don’t need to worry too much about uncontended atomics or instruction counts: just focus on reducing contention. Similarly, if you are at level 1 (uncontended atomics) it is often worth using <em>more</em> instructions to reduce the number of atomics.</p>
<p>This guideline only goes so far: if you have to add 100 instructions to remove one atomic, it is probably not worth it.</p>
<p>Second, when optimizing a concurrent system I always try to consider how I can get to a (numerically) lower level. Can I remove the last atomic? Can I avoid contention? Successfully moving to a lower level can often provide an order-of-magnitude boost to performance, so it should be attempted first, before finer-grained optimizations within the current level. Don’t spend forever optimizing your contended lock, if there’s some way to get rid of the contention entirely.</p>
<p>Of course, this is not always possible, or not possible without tradeoffs you are unwilling to make.</p>
<h3 id="getting-there">Getting There</h3>
<p>Here’s a quick look at some usual and unusual ways of achieving levels lower on the hierarchy.</p>
<h4 id="level-4">Level 4</h4>
<p>You probably don’t want to really be in level 4 but it’s certainly better than level 5. So, if you still have your job and your users haven’t all abandoned you, it’s usually pretty easy to get out of level 5. More than half the battle is just recognizing what’s going on and from there the solution is often clear. Many times, you’ve simply violated some rule like “don’t use pure spinlocks in userspace” or “you built a spinlock by accident” or “so-and-so accidentally held that core lock during IO”. There’s almost never any inherent reason you’d need to stay in level 5 and you can usually find an almost tradeoff-free fix.</p>
<p>A better approach than targeting level 4 is just to skip to level 2, since that’s usually not too difficult.</p>
<h4 id="level-3">Level 3</h4>
<p>Getting to level 3 just means solving the underlying reason for so many context switches. In the example used in this post, it means giving up fairness. Other approaches include not using threads for small work units, using smarter thread pools, not oversubscribing cores, and keeping locked regions short.</p>
<p>You don’t usually really want to be in level 3 though: just skip right to level 2.</p>
<h4 id="level-2">Level 2</h4>
<p>Level 3 isn’t a <em>terrible</em> place to be, but you’ll always have that gnawing in your stomach that you’re leaving a 10x speedup on the table. You just need to get rid of that system call or context switch, bringing you to level 2.</p>
<p>Most library provided concurrency primitives already avoid system calls on the happy path. E.g., pthreads mutex, <code class="language-plaintext highlighter-rouge">std::mutex</code>, Windows <code class="language-plaintext highlighter-rouge">CRITICAL_SECTION</code> will avoid a system call while acquiring and releasing an uncontended lock. There are, however, some notable exceptions: if you are using a <a href="https://docs.microsoft.com/en-us/windows/win32/sync/mutex-objects">Win32 mutex object</a> or <a href="https://man7.org/linux/man-pages/man2/semop.2.html">System V semaphore</a> object, you are paying a system call on every operation. Double check if you can use an in-process alternative in this case.</p>
<p>For more general synchronization purposes which don’t fit the lock-unlock pattern, a condition variable often fits the bill and a quality implementation generally avoids system calls on the fast path. A relatively unknown and higher performance alternative to condition variables, especially suitable for coordinating blocking for otherwise lock-free structures, is an <a href="http://pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-reordering/#event-counts-with-x86-tso-and-futexes"><em>event count</em></a>. Paul’s implementation is <a href="https://github.com/concurrencykit/ck/blob/master/include/ck_ec.h">available in concurrency kit</a> and we’ll mention it again at Level 0.</p>
<p>System calls often creep in when home-grown synchronization solutions are used, e.g., using Windows events to build your own read-write lock or striped lock or whatever the flavor of the day is. You can often remove the call in the fast path by making a check in user-space to see if a system call is necessary. For example, rather than unconditionally unblocking any waiters when releasing some exclusive object, <em>check</em> to see if there are waiters<sup id="fnref:tricky" role="doc-noteref"><a href="#fn:tricky" class="footnote" rel="footnote">36</a></sup> in userspace and skip the system call if there are none.</p>
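<p>A rough illustration of this waiter-check pattern (the class and all names below are invented for the example, and the waiter-registration race is handled only loosely — this is the idea, not production code):</p>

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Sketch only: skip the notify (which may enter the kernel) when a
// user-space check shows nobody is waiting.
class cheap_event {
    std::atomic<int> waiters{0};
    std::atomic<bool> signaled{false};
    std::mutex m;
    std::condition_variable cv;
public:
    int notifies = 0;  // instrumentation for this example only

    void wait() {
        waiters.fetch_add(1);
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [&] { return signaled.load(); });
        waiters.fetch_sub(1);
    }

    void signal() {
        signaled.store(true);
        if (waiters.load() > 0) {  // no waiters: skip the notify entirely
            std::lock_guard<std::mutex> g(m);
            cv.notify_all();
            notifies++;
        }
    }
};
```

<p>On the fast path — signaling when no one is waiting — no notify is issued at all.</p>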
<p>If a lock is generally held for a short period, you can avoid unnecessary system calls and context switches with a hybrid lock that spins for an appropriate<sup id="fnref:spin" role="doc-noteref"><a href="#fn:spin" class="footnote" rel="footnote">37</a></sup> amount of time before blocking. This can trade tens of nanoseconds of spinning for hundreds or thousands of nanoseconds of system calls.</p>
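<p>A minimal sketch of such a hybrid lock (illustrative only: the spin count is a made-up tuning knob, and real implementations use a CPU pause hint inside the loop):</p>

```cpp
#include <mutex>

// Spin-then-block hybrid: try to grab the lock a bounded number of times
// before paying for a potentially blocking (syscall-backed) acquisition.
class hybrid_mutex {
    std::mutex m;
    static constexpr int kSpins = 100;  // illustrative tuning knob
public:
    void lock() {
        for (int i = 0; i < kSpins; i++) {
            if (m.try_lock())
                return;  // a short hold was resolved without blocking
            // real code would execute a pause/yield hint here
        }
        m.lock();  // give up spinning and (possibly) block in the kernel
    }
    void unlock() { m.unlock(); }
    bool try_lock() { return m.try_lock(); }
};
```

<p>Since it satisfies the <em>Lockable</em> requirements, it drops into <code class="language-plaintext highlighter-rouge">std::lock_guard</code> like any other mutex.</p>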
<p>Ensure your use of threads is “right sized” as much as possible. A lot of unnecessary context switches occur when many more threads are running than there are CPUs, and this increases the chance of a lock being held during a context switch (and makes it worse when it does happen: it takes longer for the holding thread to run again as the scheduler probably cycles through all the other runnable threads first).</p>
<h4 id="level-1">Level 1</h4>
<p>A lot of code that does the work to get to level 2 actually ends up in level 1. Recall that the primary difference between level 1 and 2 is the lack of contention in level 1. So if your process naturally or by design has low contention, simply using existing off-the-shelf synchronization like <code class="language-plaintext highlighter-rouge">std::mutex</code> can get you to level 1.</p>
<p>I can’t give a step-by-step recipe for reducing contention, but here’s a laundry list of things to consider:</p>
<ul>
<li>Keep your critical sections as short as possible. Ensure you do any heavy work that doesn’t directly involve a shared resource outside of the critical section. Sometimes this means making a copy of the shared data to work on it “outside” of the lock, which might increase the total amount of work done, but reduce contention.</li>
<li>For things like atomic counters, try to batch your updates: e.g., if you update the counter multiple times during some operation, update a local on the stack rather than the global counter and only “upload” the entire value once at the end.</li>
<li>Consider using structures that use fine-grained locks, striped locks or similar mechanisms that reduce contention by locking only portions of a container.</li>
<li>Consider per-CPU structures, as in the examples above, or some approximation of them (e.g., hashing the current thread ID into an array of structures). This post used an atomic counter as a simple example, but it applies more generally to any case where the mutations can be done independently and aggregated later.</li>
</ul>
<p>For all of the advice above, when I say <em>consider doing X</em> I really mean <em>consider finding and using an existing off-the-shelf component that does X</em>. Writing concurrent structures yourself should be considered a last resort – despite what you think, your use case is probably not all that unique.</p>
<p>Level 1 is where a lot of well written, straightforward and high-performing concurrent code lives. There is nothing wrong with this level – it is a happy place.</p>
<h4 id="level-0">Level 0</h4>
<p>It is not always easy or possible to remove the last atomic access from your fast paths, but if you just can’t live with the extra ~10 ns, here are some options:</p>
<ul>
<li>The general approach of using thread local storage, as discussed above, can also be extended to structures more complicated than counters.</li>
<li>You may be able to achieve fewer than one expensive atomic instruction per logical operation by <em>batching:</em> saving up multiple operations and then committing them all at once with a small fixed number of atomic operations. Some containers or concurrent structures may have a batched API which does this for you, but even if not you can sometimes add batching yourself, e.g., by inserting collections of elements rather than a single element<sup id="fnref:hiddenbatch" role="doc-noteref"><a href="#fn:hiddenbatch" class="footnote" rel="footnote">38</a></sup>.</li>
<li>Many lock-free structures offer atomic-free <em>read</em> paths, notably concurrent containers in garbage collected languages, such as <code class="language-plaintext highlighter-rouge">ConcurrentHashMap</code> in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a <a href="http://concurrencyfreaks.blogspot.com/2017/08/why-is-memory-reclamation-so-important.html">hard problem</a>, but there are still <a href="https://web.archive.org/web/20220626105301id_/http://concurrencykit.org/">some</a> <a href="https://software.intel.com/content/www/us/en/develop/documentation/tbb-documentation/top/intel-threading-building-blocks-developer-guide/containers.html">good</a> <a href="https://github.com/facebook/folly/tree/master/folly/concurrency">options</a> out there.</li>
<li>I find that <a href="https://liburcu.org/">RCU</a> is especially powerful and fairly general if you are using a garbage collected language, or can satisfy the requirements for an efficient reclamation method in a non-GC language.</li>
<li>The <a href="https://en.wikipedia.org/wiki/Seqlock">seqlock</a><sup id="fnref:despite" role="doc-noteref"><a href="#fn:despite" class="footnote" rel="footnote">39</a></sup> is an underrated and little known alternative to RCU without reclaim problems, although not as general. Concurrencykit has <a href="https://web.archive.org/web/20220711221845id_/http://concurrencykit.org/doc/ck_sequence.html">an implementation</a>. It has an atomic-free read path for readers. Unfortunately, seqlocks don’t integrate cleanly with either the Java<sup id="fnref:stampedlock" role="doc-noteref"><a href="#fn:stampedlock" class="footnote" rel="footnote">40</a></sup> or <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1478r1.html">C++</a> memory models.</li>
<li>It is also possible in some cases to do a per-CPU rather than a per-thread approach using only vanilla instructions, although the possibility of interruption at any point makes this tricky. <a href="https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/">Restartable sequences (rseq)</a> can help, and there are other tricks lurking out there.</li>
<li>Event counts, mentioned earlier, <a href="https://pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-reordering/#event-counts-with-x86-tso-and-futexes:~:text=However%2C%20if%20we%20go">can even be level 0</a> in a single writer scenario, as Paul shows.</li>
<li>This is the last point, but it should be the first: you can probably often redesign your algorithm or application to avoid sharing data in the first place, or to share much less. For example, rather than constantly updating a shared collection with intermediate results, do as much private computation as possible before only merging the final results.</li>
</ul>
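<p>As a taste of the seqlock idea mentioned above, here is a minimal single-writer sketch (illustrative only: as noted, a strictly conforming version is awkward in the C++ memory model, because the payload reads technically race with the writer):</p>

```cpp
#include <atomic>
#include <cstdint>

// Single-writer seqlock sketch: the sequence is odd while a write is in
// progress; readers retry if it was odd or changed under them, and never
// execute an atomic RMW on the read path.
struct seq_pair {
    std::atomic<uint32_t> seq{0};
    uint64_t a = 0, b = 0;  // protected payload (plain, hence the caveat)

    void write(uint64_t x, uint64_t y) {  // single writer only
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);      // odd: in progress
        std::atomic_thread_fence(std::memory_order_release);
        a = x;
        b = y;
        seq.store(s + 2, std::memory_order_release);      // even: complete
    }

    bool try_read(uint64_t& x, uint64_t& y) {  // atomic-free read path
        uint32_t s0 = seq.load(std::memory_order_acquire);
        if (s0 & 1) return false;             // write in progress, retry
        x = a;
        y = b;
        std::atomic_thread_fence(std::memory_order_acquire);
        return seq.load(std::memory_order_relaxed) == s0;  // unchanged?
    }
};
```

<p>A reader loops on <code class="language-plaintext highlighter-rouge">try_read</code> until it returns true; under a mostly-idle writer it succeeds on the first attempt with two plain loads of the sequence and no RMW at all.</p>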
<h3 id="summary">Summary</h3>
<p>We looked at the six different levels that make up this concurrency cost hierarchy. The slow half (3, 4 and 5) are all basically performance bugs. You should be able to achieve level 2 or level 1 (if you naturally have low contention) for most designs fairly easily and those are probably what you should target by default. Level 1 in a contended scenario and level 0 are harder to achieve and often come with difficult tradeoffs, but the performance boost can be significant: often one or more orders of magnitude.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to Paul Khuong who <a href="https://pvk.ca/Blog/2020/07/07/flatter-wait-free-hazard-pointers">showed me something</a> that made me reconsider in what scenarios level 0 is achievable, and for typo fixes.</p>
<p>Thanks to <a href="https://twitter.com/never_released">@never_released</a> for help on a problem I had bringing up an EC2 bare-metal instance (tip: just wait).</p>
<p>Special thanks to <a href="https://twitter.com/matt_dz">matt_dz</a> and Zach Wenger for helping fix about <em>sixty</em> typos between them.</p>
<p>Thanks to Alexander Monakov, Dave Andersen, Laurent and Kriau for reporting typos, and Aaron Jacobs for suggesting clarifications to the level 0 definition.</p>
<p>Traffic light photo by <a href="https://unsplash.com/@harshaldesai">Harshal Desai</a> on <a href="https://unsplash.com/s/photos/traffic-light">Unsplash</a>.</p>
<h3 id="discussion-and-feedback">Discussion and Feedback</h3>
<p>You can leave a <a href="#comment-section">comment below</a> or discuss on <a href="https://news.ycombinator.com/item?id=23749172">Hacker News</a>, <a href="https://www.reddit.com/r/programming/comments/hma5y1/a_concurrency_cost_hierarchy/">r/programming</a> or <a href="https://www.reddit.com/r/cpp/comments/hmaocb/a_concurrency_cost_hierarchy/">r/cpp</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:realworld" role="doc-endnote">
<p>Well, this is quite real world: such atomic counters are used widely for a variety of purposes. I throw the <em>if you squint</em> in there because, after all, we are using microbenchmarks which simulate a probably-unrealistic density of increments to this counter, and it is a <em>bit</em> of a stretch to make this one example span all five levels – but I tried! <a href="#fnref:realworld" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:g2cores" role="doc-endnote">
<p>The Graviton 2 bare metal hardware has 64 cores, but this instance size makes 16 of them available. This means that in principle the results can be affected by the coherency traffic of other tenants on the same hardware, but the relatively stable results seem to indicate it doesn’t affect the results much. <a href="#fnref:g2cores" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:parisc" role="doc-endnote">
<p>Some hardware supports very limited atomic operations, which may be mostly useful <em>only</em> for locking, although you can <a href="https://parisc.wiki.kernel.org/index.php/FutexImplementation">get tricky</a>. <a href="#fnref:parisc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:post" role="doc-endnote">
<p>We could actually use either the pre or post-increment version of the operator here. The usual advice is to prefer the pre-increment form <code class="language-plaintext highlighter-rouge">++c</code> as it can be faster as it can return the mutated value, rather than making a copy to return after the mutation. Now this advice rarely applies to primitive values, but atomic increment is actually an interesting case which turns it on its head: the post-increment version is probably better (at least, never slower) since the underlying hardware operation returns the previous value. So it’s <a href="https://godbolt.org/z/p4TDjX">at least one extra operation</a> to calculate the pre-increment value (or much worse, apparently, if icc gets involved). <a href="#fnref:post" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:atomicsup" role="doc-endnote">
<p>Many ISAs, including POWER and ARM, traditionally only included support for a <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>-like or <a href="https://en.wikipedia.org/wiki/Load-link/store-conditional">LL/SC</a> operation, without specific support for more specific atomic arithmetic operations. The idea, I think, was that you could build any operation you want on top of of these primitives, at the cost of “only” a small retry loop and that’s more RISC-y, right? This seems to be changing as ARMv8.1 got a bunch of atomic operations. <a href="#fnref:atomicsup" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:java" role="doc-endnote">
<p>From its introduction through Java 7, the <code class="language-plaintext highlighter-rouge">AtomicInteger</code> and related classes in Java implemented all their atomic operations on top of a <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> loop, as <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> was the only primitive implemented as an intrinsic. In Java 8, almost exactly a decade later, these were finally replaced with dedicated atomic RMWs where possible, with <a href="http://ashkrit.blogspot.com/2014/02/atomicinteger-java-7-vs-java-8.html">good results</a>. <a href="#fnref:java" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:l3" role="doc-endnote">
<p>On my system and most (all?) modern Intel systems this is essentially the L3 cache, as the caching home agent (CHA) lives in or adjacent to the L3 cache. <a href="#fnref:l3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:inter" role="doc-endnote">
<p>This doesn’t imply that each atomic operation needs to take 70 cycles under contention: a single core could do <em>multiple</em> operations on the cache line after it gains exclusive ownership, so the cost of obtaining the line could be amortized over all of these operations. How much of this occurs is a measure of fairness: a very fair CPU will not let any core monopolize the line for long, but this makes highly concurrent benchmarks like this slower. Recent Intel CPUs seem quite fair in this sense. <a href="#fnref:inter" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:notwhat" role="doc-endnote">
<p>It also might not <a href="https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752">work how you think</a>, depending on details of the OS scheduler. <a href="#fnref:notwhat" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spectre" role="doc-endnote">
<p>They used to be cheaper: based on my measurements the cost of system calls has more than doubled, on most Intel hardware, after the Spectre and Meltdown mitigations were applied. <a href="#fnref:spectre" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:hyst" role="doc-endnote">
<p>An interesting thing about convoys is that they exhibit hysteresis: once a convoy starts, it becomes self-sustaining, even if the conditions that started it are removed. Imagine two threads that lock a common lock for 1 nanosecond every 10,000 nanoseconds. Contention is low: the chance of any particular lock acquisition being contended is 0.01%. However, as soon as a contended acquisition occurs, the lock effectively becomes held for the amount of time it takes to do a full context switch (for the losing thread to block, and then to wake up). If that’s longer than 10,000 nanoseconds, the convoy will sustain itself indefinitely, until something happens to break the loop (e.g., one thread deciding to work on something else). A restart also “fixes” it, which is one of many possible explanations for processes that suddenly shoot to 100% CPU (while still making progress) and are cured only by a restart. Everything becomes worse with more than two threads, too. <a href="#fnref:hyst" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:huh" role="doc-endnote">
<p>Actually I find it remarkable that this performs about as well as the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>-based <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>, since the fairness necessarily implies that the lock is acquired in a round-robin order, so the cache line with the lock must at a minimum move around to each acquiring thread. This is a real stress test of the arbitration and coherency mechanisms offered by the CPU. <a href="#fnref:huh" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:noht" role="doc-endnote">
<p>And no SMT enabled, so there are 4 logical processors from the point of view of the OS. <a href="#fnref:noht" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:once" role="doc-endnote">
<p>In fact, the <em>once</em> scenario is the most likely, since one would assume with homogeneous threads the scheduler will approximate something like round-robin scheduling. So the thread that is descheduled is most likely the one that is also closest to the head of the lock queue, because it had been spinning the longest. <a href="#fnref:once" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:perf" role="doc-endnote">
<p>In fact, you can measure this with <code class="language-plaintext highlighter-rouge">perf</code> and see that the total number of context switches is usually within a factor of 2 for both tests, when oversubscribed. <a href="#fnref:perf" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:csdepend" role="doc-endnote">
<p>There is another level of complication here: the convoy only gets set up when the fifth thread joins the fun <em>if</em> the thread that gets switched out had expressed interest in the lock before it lost the CPU. That is, after a thread unlocks the lock, there is a period before it gets a new ticket as it tries to obtain the lock again. Before it gets that ticket, it is essentially invisible to the other threads, and if it gets context switched out, the catastrophic convoy won’t be set up (because the new set of four threads will be able to efficiently share the lock among themselves). <a href="#fnref:csdepend" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:notexx" role="doc-endnote">
<p>This won’t always <em>necessarily</em> be the case. You could write a primitive that always makes a system call, putting it at level 3, even if there is no contention, but here I’ve made sure to always have a no-syscall fast path for the no-contention case. <a href="#fnref:notexx" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:atomhow" role="doc-endnote">
<p>On Intel hardware you can use <a href="https://github.com/travisdowns/concurrency-hierarchy-bench/blob/master/scripts/details.sh">details.sh</a> to collect the atomic instruction count easily, taking advantage of the <code class="language-plaintext highlighter-rouge">mem_inst_retired.lock_loads</code> performance counter. <a href="#fnref:atomhow" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:casworse" role="doc-endnote">
<p>The <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> implementation comes off looking slightly worse than the other single-atomic implementations here because the load required to set up the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> value effectively adds to the dependency chain involving the atomic operation, which explains the 5-cycle difference with <abbr title="Uses an atomic increment on a single shared counter.">atomic add</abbr>. This goes away if you can do a <em>blind <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr></em> (e.g., in locks’ acquire paths), but that’s not possible here. <a href="#fnref:casworse" class="reversefootnote" role="doc-backlink">↩</a></p>
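<p>A sketch of what a <abbr title="Uses a CAS loop to increment a single shared counter.">cas add</abbr> increment looks like (my own illustration, not the benchmark’s exact code): the initial load that produces the expected value is exactly what lengthens the dependency chain relative to a plain <code class="language-plaintext highlighter-rouge">fetch_add</code>.</p>

```cpp
#include <atomic>
#include <cstdint>

// Increment via a CAS loop: the load feeds the CAS, adding to the
// dependency chain (the source of the ~5-cycle gap vs. atomic add).
void cas_add(std::atomic<uint64_t>& counter) {
    uint64_t cur = counter.load();  // sets up the expected value
    while (!counter.compare_exchange_weak(cur, cur + 1)) {
        // on failure, `cur` is refreshed with the observed value; retry
    }
}
```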
</li>
<li id="fn:twolayer" role="doc-endnote">
<p>Actually three layers, <a href="https://github.com/gcc-mirror/gcc/blob/4ff685a8705e8ee55fa86e75afb769ffb0975aea/libstdc%2B%2B-v3/include/bits/std_mutex.h#L98">libstdc++</a>, then <a href="https://github.com/gcc-mirror/gcc/blob/4ff685a8705e8ee55fa86e75afb769ffb0975aea/libgcc/gthr-posix.h#L775">libgcc</a> and then finally pthreads. I’ll count the first two as one though because those can all inline into the caller. Based on a rough accounting, probably 75% of the instruction count comes from pthreads, the rest from the other two layers. The pthreads mutexes are more general purpose than what <code class="language-plaintext highlighter-rouge">std::mutex</code> offers (e.g., they support recursion), and the features are configured at runtime on a per-mutex basis, so that explains a lot of the additional work these functions are doing. It’s only due to the cost of atomic operations that <code class="language-plaintext highlighter-rouge">std::mutex</code> doesn’t take a significant penalty compared to a more svelte design. <a href="#fnref:twolayer" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:padded" role="doc-endnote">
<p><em>Padded</em> means that the counters are aligned such that each falls into its own 64 byte cache line, to avoid <a href="https://en.wikipedia.org/wiki/False_sharing"><em>false sharing</em></a>. This means that even though each counter only has 8 bytes of logical payload, it requires 64 bytes of storage. Some people claim that you need to pad out to 128 bytes, not 64, to avoid the effect of the <em>adjacent line prefetcher</em> which fetches the 64-byte line that completes an aligned 128-byte pair of lines. However, I have not observed this effect often on modern CPUs. Maybe the prefetcher is conservative and doesn’t trigger unless past behavior indicates the fetches are likely to be used, or the prefetch logic can detect and avoid cases of false sharing (e.g., by noticing when prefetched lines are subsequently invalidated by a snoop). <a href="#fnref:padded" class="reversefootnote" role="doc-backlink">↩</a></p>
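<p>A concrete sketch of such padding (my own illustration): <code class="language-plaintext highlighter-rouge">alignas(64)</code> forces each counter into its own cache line, and the compiler pads the struct out to a multiple of its alignment.</p>

```cpp
#include <atomic>
#include <cstdint>

// One padded counter slot: 8 bytes of payload, 64 bytes of storage.
// alignas(64) guarantees each slot starts on its own cache line and
// (since size is rounded up to alignment) occupies it exclusively.
struct alignas(64) padded_counter {
    std::atomic<uint64_t> value{0};
    // 56 bytes of implicit trailing padding fill out the line
};

static_assert(sizeof(padded_counter) == 64,
              "8 bytes of payload costs a full cache line of storage");
```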
</li>
<li id="fn:notallfailure" role="doc-endnote">
<p>Actually, not <em>all</em> failure indicates contention: there is a small chance also that a context switch exactly splits the load and the subsequent <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr>, and in this case the <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> would fail when the thread was scheduled again if any thread that ran in the meantime updated the same counter. Treating this as contention doesn’t really cause any serious problems. <a href="#fnref:notallfailure" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:perfsame" role="doc-endnote">
<p>Not surprising, since there is no contention and the fast path looks the same for either algorithm: a single <abbr title="Compare-and-swap: an atomic operation implemented on x86 and other CPUs.">CAS</abbr> that always succeeds. <a href="#fnref:perfsame" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:eightbyte" role="doc-endnote">
<p>Here I’m assuming that <code class="language-plaintext highlighter-rouge">sizeof(std::atomic<uint64_t>)</code> is 8, and this is the case on all current mainstream platforms. Also, you may or may not want to pad out the single-counter version to 64 bytes as well, to avoid some <em>potential</em> false sharing with nearby values, but this is different than the multi-counter case where padding is obligatory to avoid guaranteed false sharing. <a href="#fnref:eightbyte" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:read" role="doc-endnote">
<p>In this limited case, I <em>think</em> <code class="language-plaintext highlighter-rouge">read()</code> provides the same guarantees as the single-counter case. Informally, <code class="language-plaintext highlighter-rouge">read()</code> returns some value that the counter had at some point between the start and end of the <code class="language-plaintext highlighter-rouge">read()</code> call. Formally, there is a <em>linearization point</em> within <code class="language-plaintext highlighter-rouge">read()</code> although this point can only be determined in retrospect by examining the returned value (unlike the single-counter approaches, where the linearization is clear regardless of the value). However, <em>this is only true because the only mutating operation is <code class="language-plaintext highlighter-rouge">increment()</code></em>. If we also offered a <code class="language-plaintext highlighter-rouge">decrement()</code> method, this would no longer be true: you could read values that the logical counter never had based on the sequence of increments and decrements. Specifically, if you execute <code class="language-plaintext highlighter-rouge">increment(); decrement(); increment()</code> and even if you know these operations are strictly ordered (e.g., via locking), a concurrent call to <code class="language-plaintext highlighter-rouge">read()</code> could return <em>2</em>, even though the counter never logically exceeded 1. <a href="#fnref:read" class="reversefootnote" role="doc-backlink">↩</a></p>
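<p>A sketch of such a <code class="language-plaintext highlighter-rouge">read()</code> (the array size and names here are hypothetical, not the post’s exact code): it simply sums the per-slot counters, and because the only mutation is increment, the sum is a value the logical counter held at some instant during the call.</p>

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr size_t NUM_COUNTERS = 8;  // hypothetical fixed slot count

std::atomic<uint64_t> counters[NUM_COUNTERS];  // zero-initialized

// Sum all slots. With increment-only writers this has a (retrospective)
// linearization point; add a decrement() and that guarantee evaporates.
uint64_t read_counters() {
    uint64_t sum = 0;
    for (auto& c : counters) {
        sum += c.load(std::memory_order_relaxed);
    }
    return sum;
}
```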
</li>
<li id="fn:sharedidx" role="doc-endnote">
<p>In particular, if contention is seen on one object, the per-thread index will change to avoid it, which changes the index of all other objects as well, even if they have not seen any contention. This doesn’t seem like much of a problem for this simple implementation (which index we write to doesn’t matter much), but it could make some other optimizations more difficult: e.g., if we size the counter array dynamically, we don’t want to unnecessarily change the <code class="language-plaintext highlighter-rouge">idx</code> for uncontended objects, since it requires a larger counter array, unnecessarily. <a href="#fnref:sharedidx" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rtcost" role="doc-endnote">
<p>At least, an extra indirection to access the array which is no longer embedded in the object<sup id="fnref:soa" role="doc-noteref"><a href="#fn:soa" class="footnote" rel="footnote">41</a></sup>, and checks to ensure the array is large enough. Furthermore, we have another decision to make: when to expand the array. How much contention should we suffer before we decide the array is too small? <a href="#fnref:rtcost" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:subtle" role="doc-endnote">
<p>The intuition is that later counter positions only get written when an earlier position failed a compare and swap, which necessarily implies it was written to by some other thread and hence non-zero. There is some subtlety here: this wouldn’t hold if <code class="language-plaintext highlighter-rouge">compare_exchange_weak</code> was used instead of <code class="language-plaintext highlighter-rouge">compare_exchange_strong</code>, and it more obviously wouldn’t apply if we allowed decrements or wanted to change the “probe” strategy. <a href="#fnref:subtle" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lesshot" role="doc-endnote">
<p>I’m not sure if “less hot” means <code class="language-plaintext highlighter-rouge">__attribute__((cold))</code> necessarily, that might be <em>too</em> cold. We mostly just want to separate the first-cas-succeeds case and the rest of the logic so we don’t pay the dynamic code size impact except when the fallback path is taken. <a href="#fnref:lesshot" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:dynamictls" role="doc-endnote">
<p>A sketch of an implementation would be to use something like a single static <code class="language-plaintext highlighter-rouge">thread_local</code> pointer to an array or map, which maps an ID contained in the dynamic TLS key to the object data. Lookup speed is important, which favors an array, but you also need to be able to remove elements, which can favor some type of hash map. All of this is probably at least twice as slow as a plain <code class="language-plaintext highlighter-rouge">thread_local</code> access … or just use <a href="https://github.com/facebook/folly/blob/master/folly/docs/ThreadLocal.md">folly</a> or <a href="https://www.boost.org/doc/libs/1_73_0/doc/html/thread/thread_local_storage.html">boost</a>. <a href="#fnref:dynamictls" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:noatomic" role="doc-endnote">
<p>On x86, what’s vanilla and what isn’t is fairly cut and dried: regular memory accesses and read-modify-write instructions are <em>vanilla</em> while LOCKed <abbr title="Read-modify-write: an instruction that reads from a memory location, operates on the value, and writes the result back to the same location.">RMW</abbr> instructions, whether explicit like <code class="language-plaintext highlighter-rouge">lock inc</code> or implicit like <a href="https://www.felixcloutier.com/x86/xchg"><code class="language-plaintext highlighter-rouge">xchg [mem], reg</code></a>, are not and are an order of magnitude slower. Out of the fences, <code class="language-plaintext highlighter-rouge">mfence</code> is also a slow non-vanilla instruction, comparable in cost to a LOCKed instruction. On other platforms like ARM or POWER, there may be shades of grey: you still have vanilla accesses on one end, and expensive full barriers like <code class="language-plaintext highlighter-rouge">dmb</code> on ARM or <code class="language-plaintext highlighter-rouge">sync</code> on POWER at the other, but you also have things in the middle with some additional ordering guarantees but still short of sequential consistency. This includes things like <code class="language-plaintext highlighter-rouge">LDAR</code> and <code class="language-plaintext highlighter-rouge">LDAPR</code> on ARM which implement sort of a sliding scale of load ordering and performance. Still, on any given hardware, you might find that instructions generally fall into a “cheap” (vanilla) and “expensive” (non-vanilla) bucket. <a href="#fnref:noatomic" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whyatomic" role="doc-endnote">
<p>The only reason we even need <code class="language-plaintext highlighter-rouge">std::atomic<uint64_t></code> at all is because it is <em>undefined behavior</em> to have concurrent access to any variable if at least one access is a write. Since the owning thread is making writes, this would <em>technically</em> be a violation of the standard if there was a concurrent <code class="language-plaintext highlighter-rouge">tls_counter::read()</code> call. Most actual hardware has no problem with concurrent reads and writes like this, but it’s better to stay on the right side of the law. Some hardware could also exhibit <em>tearing</em> of the writes, and <code class="language-plaintext highlighter-rouge">std::atomic</code> guarantees this doesn’t happen. That is, the read and write are still <em>individually</em> atomic. <a href="#fnref:whyatomic" class="reversefootnote" role="doc-backlink">↩</a></p>
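<p>A minimal sketch of the idea (names are mine, and a single global stands in for one thread’s slot): the owner does a relaxed load/store pair, which compiles to vanilla instructions, while the <code class="language-plaintext highlighter-rouge">std::atomic</code> wrapper exists only to make concurrent reads defined and tear-free.</p>

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> slot{0};  // stands in for one thread's TLS slot

// Owner thread only: a relaxed load + store pair. No lock prefix, no
// fence -- on x86 this is just two plain movs plus an add.
void owner_increment() {
    uint64_t v = slot.load(std::memory_order_relaxed);
    slot.store(v + 1, std::memory_order_relaxed);
}

// Any thread: a relaxed atomic load can't tear, unlike a concurrent
// read of a plain uint64_t, which would be undefined behavior.
uint64_t reader_peek() {
    return slot.load(std::memory_order_relaxed);
}
```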
</li>
<li id="fn:barrier" role="doc-endnote">
<p>If you use the default <code class="language-plaintext highlighter-rouge">std::memory_order_seq_cst</code>, on x86 gcc inserts an <code class="language-plaintext highlighter-rouge">mfence</code> which makes this <em>even slower than an atomic increment</em> since <code class="language-plaintext highlighter-rouge">mfence</code> is generally slower than instructions with a lock prefix (it has slightly stronger barrier semantics). <a href="#fnref:barrier" class="reversefootnote" role="doc-backlink">↩</a></p>
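<p>In code, the trap looks like this (a sketch of my own): the two stores below are a one-token difference at the source level, but on x86 gcc compiles the default-ordered one to <code class="language-plaintext highlighter-rouge">mov</code> + <code class="language-plaintext highlighter-rouge">mfence</code> while the release store is a plain <code class="language-plaintext highlighter-rouge">mov</code>.</p>

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// Default ordering (seq_cst): on x86, gcc emits `mov` + `mfence`,
// which is generally slower than a lock-prefixed RMW.
void store_seq_cst(uint64_t v) {
    counter.store(v);  // memory_order_seq_cst implied
}

// Release ordering: a plain `mov` on x86 -- sufficient when we only
// need the value published, not ordering against unrelated accesses.
void store_release(uint64_t v) {
    counter.store(v, std::memory_order_release);
}
```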
</li>
<li id="fn:whitelie" role="doc-endnote">
<p>This is a very small white lie. I’ll explain more elsewhere. <a href="#fnref:whitelie" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tlsmem" role="doc-endnote">
<p>On the other hand, the TLS approach doesn’t need padding since the counters will generally appear next to other thread-local data, and not subject to false sharing, which means an 8x reduction (from 64 to 8 bytes) in the per-counter size, so if your process has a number of threads roughly equal to the number of cores, you will probably <em>save</em> memory over the per-CPU approach. <a href="#fnref:tlsmem" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tricky" role="doc-endnote">
<p>It’s easy to introduce a missed wakeup problem if this isn’t done correctly. The usual cause is a race condition between some waiter arriving at a lock-like thing, seeing that it’s locked and then indicating interest, but in the critical region of that check-then-act the owning thread left the lock and didn’t see any waiters. The waiter blocks but there is nobody to unblock them. These bugs often go undetected since the situation resolves itself as soon as another thread arrives, so in a busy system you might not notice the temporarily hung threads. The <code class="language-plaintext highlighter-rouge">futex</code> system call is basically designed to make solving this easy, while the Event stuff in Windows requires a bit more work (usually based on a compare-and-swap). <a href="#fnref:tricky" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spin" role="doc-endnote">
<p>An “appropriate” time is probably something like the typical runtime of the locked region. Basically you want to spin in any case where the lock is held by a currently running thread, which will release it soon. As soon as you’ve been spinning for more than the typical hold time of the lock, it becomes much more likely you are simply waiting for a lock held by a thread that is <em>not</em> running (e.g., it was unlucky enough to incur a context switch while it held the lock). In that case, you are better off sleeping. <a href="#fnref:spin" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:hiddenbatch" role="doc-endnote">
<p>An interesting design point is a data type that implements batching internally behind an API offering single-element operations. For example, a queue might decide that added elements won’t be immediately consumed (because there are already some elements in the queue), and hold them in a local staging area until several can be added as a batch, or until their absence would be noticed. <a href="#fnref:hiddenbatch" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:despite" role="doc-endnote">
<p>Despite the current claim in wikipedia that seqlocks are somehow a Linux-specific construct involving the kernel, they work great in userspace only and are not tied to Linux. It is likely they were not invented for use in Linux but <a href="https://twitter.com/davidtgoldblatt/status/1280189008803278848">pre-dated</a> the OS, although maybe the use in Linux was where the name <em>seqlock</em> first appeared? <a href="#fnref:despite" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:stampedlock" role="doc-endnote">
<p>Java does provide <a href="https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/locks/StampedLock.html">StampedLock</a> which offers seqlock functionality. <a href="#fnref:stampedlock" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:soa" role="doc-endnote">
<p>Of course, we could go even <em>one step further</em> and embed a small array of 1 or 2 elements in the counter object, in the hope that this is enough and only use a dynamically allocated array and suffer the additional indirection if we observe contention. <a href="#fnref:soa" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comConcurrent operations can be grouped relatively neatly into categories based on their costAVX-512 Mask Registers, Again2020-05-26T00:00:00+00:002020-05-26T00:00:00+00:00https://travisdowns.github.io/blog/2020/05/26/kreg2<!-- boilerplate
page.assets: /assets/kreg2
assetpath: /assets/kreg2
tablepath: /misc/tables/kreg2
-->
<h2 id="exposition">Exposition</h2>
<p><a href="/blog/2019/12/05/kreg-facts.html">Not that long ago</a> we looked at the AVX-512 mask registers. Specifically, the number of physical registers underlying the eight architectural ones, and some other behaviors such as zeroing idioms. Recently, a high resolution die shot of <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> appeared, and I thought it would be cool to verify our register count by visual inspection.</p>
<p>After all, rather than writing some complex software to test hardware, why not <em>simply</em> use a series of noxious chemicals and manual labor to painstakingly expose the CPU, then carefully photograph it with a microscope and stitch the photos together and finally, <em>just use our eyes</em> to count the registers? If that doesn’t sound all that easy, you are not alone, but as luck would have it someone else has already done that part.</p>
<p>While trying to simply count the mask registers, I ran across something else even more interesting<sup id="fnref:bar" role="doc-noteref"><a href="#fn:bar" class="footnote" rel="footnote">1</a></sup> instead…</p>
<ul id="markdown-toc">
<li><a href="#exposition" id="markdown-toc-exposition">Exposition</a></li>
<li><a href="#rising-action" id="markdown-toc-rising-action">Rising Action</a> <ul>
<li><a href="#the-die-shot" id="markdown-toc-the-die-shot">The Die Shot</a></li>
<li><a href="#the-register-files" id="markdown-toc-the-register-files">The Register Files</a></li>
<li><a href="#the-mystery-block" id="markdown-toc-the-mystery-block">The Mystery Block</a></li>
<li><a href="#lets-get-legacy" id="markdown-toc-lets-get-legacy">Let’s Get Legacy</a></li>
<li><a href="#testing-our-theory" id="markdown-toc-testing-our-theory">Testing Our Theory</a></li>
</ul>
</li>
<li><a href="#some-missing-pieces" id="markdown-toc-some-missing-pieces">Some Missing Pieces</a></li>
<li><a href="#thanks" id="markdown-toc-thanks">Thanks</a></li>
<li><a href="#discussion-and-feedback" id="markdown-toc-discussion-and-feedback">Discussion and Feedback</a></li>
</ul>
<h2 id="rising-action">Rising Action</h2>
<h3 id="the-die-shot">The Die Shot</h3>
<p>We’re interested in this die shot, recently released by <a href="https://twitter.com/FritzchensFritz">Fritzchens Fritz</a> on <a href="https://www.flickr.com/photos/130561288@N04/49825363402/in/photostream/">Flickr</a>. We’ll be focusing on the highlighted area, which seems to have all the <a href="https://en.wikipedia.org/wiki/Register_file"><em>register files</em></a> on the chip. If you want a full breakdown of the core, you can check guesses <a href="https://twitter.com/GPUsAreMagic/status/1256866465577394181">here on Twitter</a>, <a href="https://www.realworldtech.com/forum/?threadid=191663&curpostid=191916">on RWT</a> and on <a href="https://community.intel.com/t5/Software-Tuning-Performance/Diagram-for-Skylake-SP-core/m-p/1166819">Intel’s forums</a>.</p>
<p><img src="/assets/kreg2/skx-full-small.jpg" alt="SKL full" class="no-invert" /></p>
<h3 id="the-register-files">The Register Files</h3>
<p>Here’s a close-up of that section, with the assumed register files and their purpose labeled<sup id="fnref:xmmetc" role="doc-noteref"><a href="#fn:xmmetc" class="footnote" rel="footnote">2</a></sup>.</p>
<p><img src="/assets/kreg2/zoomed.png" alt="SKL zoomed" class="no-invert" /></p>
<p>We guess the register file identities based on:</p>
<ul>
<li>The general purpose register files are of the right relative width (64 bits), and are in the right position below the integer execution units, and seem to have <code class="language-plaintext highlighter-rouge">EFLAGS</code> nearby.</li>
<li>The <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers are obvious from their size and positioning underneath the vector pipes.</li>
<li>The upper 256 bits of the 512-bit <code class="language-plaintext highlighter-rouge">zmm</code> registers (labelled ZMM on the closeup) can be determined from comparing the <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr><sup id="fnref:kbl" role="doc-noteref"><a href="#fn:kbl" class="footnote" rel="footnote">3</a></sup> (no AVX-512) and <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> (has AVX-512) dies and noting that the bottom file is not present in <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> (a large empty area is present at that spot in <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr>).</li>
</ul>
<h3 id="the-mystery-block">The Mystery Block</h3>
<p>This leaves the mystery block in red. This block is in a prime spot below the vector execution units. Could it be the mask registers (kregs)? We found in the <a href="/blog/2019/12/05/kreg-facts.html">first post</a> that the mask registers aren’t shared with either the scalar or <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers, so we expect them to have their own physical register file. Maybe this is it?</p>
<p>Let’s compare the mystery register file to the integer register file, since they should be similar in size and appear to be implemented similarly:</p>
<p><img src="/assets/kreg2/compare.png" alt="SKL zoomed" /></p>
<p>Looking at the general purpose register file on the left, each block (6 of which are numbered on the general purpose file) seems to implement 16 bits, since if you zoom in you see a repeating structure of 16 elements, and 4 blocks make 64 bits total which is the expected width of the file. We know from published numbers that the integer register file has 180 entries, and since there are 6 rows of 4 blocks, we expect each row to implement 180 / 6 = 30 registers.</p>
<p>Now we turn our attention to the mystery file, with the idea that it may be the mask register file. There are a total of 30 blocks. Looking at the general purpose registers, we determined each block can hold 16 bits (horizontally) from 30 registers (vertically, I guess). So 30 blocks gives us: 30 blocks * 30 registers/block * 16 bits / 64 bits = 225 registers. It’s too much! We calculated last time that there are ~142 physical mask registers, so this is way too high.</p>
<p>There’s another problem: we only have three columns of 16-bit blocks, for a total of 48 bits, horizontally. However, we know that a mask register must hold up to 64 bits (when using a byte-wise mask for a full 512-bit vector register). Also, while our calculation above worked out to a whole number, the number of blocks (30) is not divisible by 4, so even if you assumed the arrangement of the blocks didn’t matter, there is no possible mapping from each register to 4 distinct blocks. Instead, we’d need something weird like 2 blocks providing 15 registers (instead of 30), but 64 bits wide (instead of 32). That seems very unlikely.</p>
<p>So let’s look just at the two paired columns on the left for now: a total of 20 blocks. If we take the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers as an example, it is not necessary that the full width of the register is present horizontally in a single row: the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers have only 256 bits in a row (split into two 128-bit lines), and the other 256 bits of a 512-bit zmm register appear vertically below, in the register file marked ZMM in the diagram. So there’s a kind of over-under arrangement<sup id="fnref:overunder" role="doc-noteref"><a href="#fn:overunder" class="footnote" rel="footnote">4</a></sup>.</p>
<p>Since the mask registers are associated with elements of the vector registers, maybe they are split up in the same way? That is, a 64-bit mask register uses one 2x16-bit (32-bit) chunk from the top half and one from the bottom half, to make up 64 bits? This is 20 total blocks, giving 150 registers by the same calculation above. This is much closer to the 142 we found by experiment.</p>
<p>Still… that nagging feeling. 142 is not equal to 150, and what about that third column of blocks? Doubt crept in: I started to have second thoughts about whether this was the mask register file after all. What could it be then?</p>
<h3 id="lets-get-legacy">Let’s Get Legacy</h3>
<p>I realized there was one register file unaccounted for: the file for legacy x87 floating point and MMX. We expect that x87 floating point and MMX use the <em>same</em> physical file because MMX registers are architecturally aliased onto the x87 registers<sup id="fnref:why" role="doc-noteref"><a href="#fn:why" class="footnote" rel="footnote">5</a></sup>. So where is <em>this</em> file on the die? I looked all around<sup id="fnref:lied" role="doc-noteref"><a href="#fn:lied" class="footnote" rel="footnote">6</a></sup> the die shot. There are no good candidates.</p>
<p>So maybe <em>this</em> thing we’ve been looking at is actually the x87/MMX register file? In one way it’s a better fit: the x87 FP register file needs to be ~80 bits wide, so that would explain the extra column: if we assume each row is half of a register as before, that’s 96 bits. That’s enough to hold 80-bit extended precision values, and the 16 bits left over could be used to store the FPU status word accessed by <a href="https://www.felixcloutier.com/x86/fstsw:fnstsw">fstsw</a> and related instructions. This status word is updated after every operation so must <em>also</em> be renamed for reasonable performance<sup id="fnref:intflags" role="doc-noteref"><a href="#fn:intflags" class="footnote" rel="footnote">7</a></sup>.</p>
<p>Additional evidence that this might be the x87/MMX register file comes from this <a href="https://flic.kr/p/YhuBWc"><abbr title="Intel's Kaby Lake client CPU architecture (7th, 8th gen): substantially identical to Skylake">KBL</abbr> die shot</a> also from Fritz:</p>
<p><img src="/assets/kreg2/kbl-compare.png" alt="Kaby Lake" class="no-invert" /></p>
<p>Note that while the high 256 bits of the register file are masked out (this chip supports only AVX2, not AVX-512 so there are no <code class="language-plaintext highlighter-rouge">zmm</code> registers), the register file we are considering is present in its entirety.</p>
<p>Cool theory bro, but aren’t we back to square zero? If this is the file for the x87/MMX registers, where do the mask registers live?</p>
<p>There’s one possibility we haven’t discussed, although some of you might be screaming it at your monitors by now: maybe the x87/MMX and the kreg register files are <em>shared</em>. That is, physically aliased<sup id="fnref:aliasing" role="doc-noteref"><a href="#fn:aliasing" class="footnote" rel="footnote">8</a></sup> to the same register file, shared competitively.</p>
<h3 id="testing-our-theory">Testing Our Theory</h3>
<p>The good news? We can test for this, in software. That’s good, because I was never really <em>that</em> comfortable with this die shot thing and there was the risk that I would BS more than usual. Software-based <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr> probing is a bit more my thing.</p>
<p>We’ll use the test method originally <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">described by Henry Wong</a> and which we used in the <a href="/blog/2019/12/05/kreg-facts.html">last post</a> on this topic, and implemented in the <a href="https://github.com/travisdowns/robsize">robsize</a> tool. Here’s the quick description of the technique, a straight copy/paste from that post:</p>
<blockquote>
<p>We separate two cache miss load instructions by a variable number of <em>filler instructions</em> which vary based on the CPU resource we are probing. When the number of filler instructions is small enough, the two cache misses execute in parallel and their latencies are overlapped so the total execution time is roughly as long as a single miss.</p>
<p>However, once the number of filler instructions reaches a critical threshold, all of the targeted resources are consumed and instruction allocation stalls before the second miss is issued and so the cache misses can no longer run in parallel. This causes the runtime to spike to about twice the baseline cache miss latency.</p>
<p>Finally, we ensure that each filler instruction consumes exactly one of the resource we are interested in, so that the location of the spike indicates the size of the underlying resource. For example, regular <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> instructions usually consume one physical register from the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> so are a good choice to measure the size of that resource.</p>
</blockquote>
<p>The trick we use to see if two register files are shared is first to measure the size of each register file alone, using filler instructions that target only that file, and then to run a third test whose filler <em>alternates</em> between instructions that use each register file. If the register files are shared, we expect all three tests to produce the same result, since they are all drawing from the same pool. If the register files are not shared, the third (alternating) test should show a much higher apparent resource limit, since two different pools are being drawn from and so it will take twice as many<sup id="fnref:roblimit" role="doc-noteref"><a href="#fn:roblimit" class="footnote" rel="footnote">9</a></sup> filler instructions to hit the RF limit.</p>
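<p>The logic of the shared-vs-separate outcome can be captured in a toy model (this is an illustration only, not part of robsize: filler instructions are modeled as allocating one rename register each from a named pool, and allocation stalls when the next pool needed is empty):</p>

```python
# Toy model of the register-file sharing test: each filler instruction
# allocates one physical register from a pool, and allocation stalls
# as soon as the pool the next filler needs is exhausted.

def max_fillers(pools, pattern):
    """Count fillers retired before a pool runs dry.

    pools:   dict of pool name -> free physical registers
    pattern: per-instruction pool names, repeated cyclically
    """
    free = dict(pools)
    count = 0
    while True:
        pool = pattern[count % len(pattern)]
        if free[pool] == 0:  # allocation stalls here
            return count
        free[pool] -= 1
        count += 1

# Two separate 128-entry pools: the alternating test reaches 2x the limit.
separate = {'mmx': 128, 'kreg': 128}
assert max_fillers(separate, ['mmx']) == 128
assert max_fillers(separate, ['mmx', 'kreg']) == 256

# One shared 128-entry pool: the alternating test hits the same limit.
shared = {'both': 128}
assert max_fillers(shared, ['both', 'both']) == 128
```

In real hardware the 2x case tops out somewhat lower, because the ROB limit is reached first, as the footnote notes.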
<p>Enough talk, let’s do this. I implemented several new tests in robsize to probe possible register sharing. First, we look at <strong>Test 38</strong>, which uses MMX instructions<sup id="fnref:whymmx" role="doc-noteref"><a href="#fn:whymmx" class="footnote" rel="footnote">10</a></sup> to target the size of the x87/MMX register file:</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-38" id="skx-38">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-38.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-38.svg" alt="Test 38" width="648" height="432" />
</div>
<p>We see a clear spike at 128 instructions, so it seems like the size of the speculative<sup id="fnref:spec" role="doc-noteref"><a href="#fn:spec" class="footnote" rel="footnote">11</a></sup> x87/MMX register file is 128 entries.</p>
<p>Next, we have <strong>Test 43</strong>, which follows the same pattern as <strong>Test 38</strong> but uses <code class="language-plaintext highlighter-rouge">kaddd</code> as the filler instruction, so it targets the mask (kreg) register file:</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-43" id="skx-43">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-43.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-43.svg" alt="Test 43" width="648" height="432" />
</div>
<p>This is mostly indistinguishable from the previous chart and we conclude that the size of the speculative mask register file is also 128.</p>
<p>Let’s see what happens when we alternate MMX with another instruction type. <strong>Test 39</strong> alternates MMX with integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions, and <strong>Test 40</strong> alternates with general purpose scalar instructions:</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-39" id="skx-39">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-39.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-39.svg" alt="Test 39" width="648" height="432" />
</div>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-40" id="skx-40">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-40.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-40.svg" alt="Test 40" width="648" height="432" />
</div>
<p>Both of these show the same effect: the effective resource limitation is much higher, around 210 filler instructions. This strongly indicates that the x87/MMX register file is not shared with either the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> or scalar register files.</p>
<p>Finally, we get to the end of this tale, <strong>Test 41</strong>. This test mixes MMX and mask register instructions<sup id="fnref:test41" role="doc-noteref"><a href="#fn:test41" class="footnote" rel="footnote">12</a></sup>:</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#skx-41" id="skx-41">[link<span class="only-large"> to this chart</span>]</a>
<a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results2/skx-41.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<img class="figimg" src="/assets/kreg2/skx-41.svg" alt="Test 41" width="648" height="432" />
</div>
<p>This one is definitely not like the others. We see that the resource limit is now 128, the same as for the single-type tests. We can immediately conclude that mask registers and MMX registers are allocated from the same resource pool: <em>they use the same physical register file</em>.</p>
<p>This resolves the mystery of the missing register file: nothing is missing but rather this one register file simply serves double duty.</p>
<p>Normally a shared register file might be something to watch out for, performance-wise, but it is hard to imagine it mattering in any non-artificial example. Who is going to make heavy use of x87 or MMX (both obsolete) alongside AVX-512 mask registers (the opposite end of the spectrum from “obsolete”)? It seems extremely unlikely, and even then the register file is large enough that hitting its limit is improbable.</p>
<p>So sharing these register files is a neat trick to reduce power and area: the register files aren’t all that big, but they live in pretty prime real-estate close to the execution units.</p>
<p>What’s cool about this one, though, is that it’s the first time I’ve <em>looked at a chip</em> (that this is even possible is remarkable to me) and come up with a theory about the hardware that we could then test and confirm with a targeted microbenchmark. I was already aware of the possibility of register file sharing (Henry had tests for this right in robsize from the start), but although I considered other sharing scenarios I never considered sharing between x87/MMX and the mask registers until I tried to identify the register files on Fritz’s die shots.</p>
<h2 id="some-missing-pieces">Some Missing Pieces</h2>
<p>It seems like we’ve wrapped everything up nicely, but there are still a few rough edges.</p>
<ul>
<li>We measured a total of 128 speculative registers; adding 16 non-speculative registers (to hold the 8 x87/MMX regs and the 8 kregs) gives 144, but our ballpark estimate based on the regfile size was 150. Perhaps more importantly, with 5 rows of registers, we expect the total number of registers to be a multiple of 5. Perhaps there are a handful of registers used for an unknown purpose, or some other flaw in the test.</li>
<li>I noticed an unexplained difference in results between a test that uses a single fixed instruction like <code class="language-plaintext highlighter-rouge">kaddd k1, k2, k3</code> (test 27) and one that rotates through all 8 registers: <code class="language-plaintext highlighter-rouge">kaddd k0, k1, k1</code> then <code class="language-plaintext highlighter-rouge">kaddd k1, k2, k2</code>, etc. (test 43). The former test results in a register file size about 5 larger than the latter. The same holds for tests using the MMX registers (compare tests 37 and 38). This post uses the rotate-through-all-registers approach, while the original post used the fixed-register variant in some cases, so the numbers vary slightly. I have some theories but no definite explanation for this behavior.</li>
</ul>
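<p>The bookkeeping in the first bullet, spelled out (the 150 estimate and the multiple-of-5 expectation come from the die-shot analysis above):</p>

```python
# Register-count bookkeeping for the shared x87/MMX + kreg file,
# using the measured and estimated numbers from this post.
speculative = 128        # measured limit from Tests 38, 41 and 43
architectural = 8 + 8    # 8 x87/MMX registers + 8 mask registers
total = speculative + architectural

assert total == 144      # vs. the ~150 ballpark from the die shot
assert total % 5 != 0    # with 5 rows, we expected a multiple of 5
```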
<h2 id="thanks">Thanks</h2>
<p>Thanks to <a href="https://www.flickr.com/people/130561288@N04/">Fritzchens Fritz</a> who created the die shots analyzed here, and who graciously put them into the public domain.</p>
<p>Thanks to <a href="http://www.stuffedcow.net/">Henry Wong</a>, who wrote the <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">original article</a> which introduced me to this technique and subsequently shared the code for his tool, which is now <a href="https://github.com/travisdowns/robsize">hosted on github</a>.</p>
<p><a href="https://twitter.com/GPUsAreMagic">Nemez</a> who did a <a href="https://twitter.com/GPUsAreMagic/status/1256866465577394181">breakdown</a> of the die shot, noting the register file in question as some type of integer register file, which originally piqued my curiosity.</p>
<p>Thanks to <a href="https://lemire.me">Daniel Lemire</a> who provided access to the <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> hardware used in this post.</p>
<p>Thanks to Matt Godbolt and Vijay who pointed out typos in the text.</p>
<h2 id="discussion-and-feedback">Discussion and Feedback</h2>
<p>If you have something to say, leave a comment below or discuss this article on <a href="https://news.ycombinator.com/item?id=23309034">Hacker News</a>.</p>
<p>Feedback is also warmly welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:bar" role="doc-endnote">
<p>Admittedly, “how many physical mask registers does the CPU have” is probably not a very high bar of interestingness to clear, to most people. <a href="#fnref:bar" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:xmmetc" role="doc-endnote">
<p>The zmm, ymm and xmm registers all overlap, architecturally. That is, <code class="language-plaintext highlighter-rouge">xmm0</code> is just the bottom 128 bits of <code class="language-plaintext highlighter-rouge">ymm0</code>, and similarly for <code class="language-plaintext highlighter-rouge">ymm0</code> and <code class="language-plaintext highlighter-rouge">zmm0</code>. Physically, there are really <em>only</em> <code class="language-plaintext highlighter-rouge">zmm</code> registers and the other two are simply specific ranges of bits of those larger registers. So the area marked <strong>YMM</strong> on the die shot really means: <em>the upper parts of the <code class="language-plaintext highlighter-rouge">ymm</code> registers which are not part of the corresponding xmm register</em>. <a href="#fnref:xmmetc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:kbl" role="doc-endnote">
<p>Actually Kaby Lake, since the best die shots we have are from that chip, but it’s the same thing. <a href="#fnref:kbl" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:overunder" role="doc-endnote">
<p>Incidentally, this lines up with an inspection of the execution units, which seem to have the same over-under arrangement: the port 5 FMA, for example, looks like it has two rows each with 4x 64-bit FMA units, rather than say a single row with 8 units. <a href="#fnref:overunder" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:why" role="doc-endnote">
<p>As a trick, I guess, to allow MMX registers to be saved and restored by operating systems and other code that weren’t aware of their presence. A similar mess occurred with the transition from SSE to AVX, where code unaware of AVX could accidentally clobber the upper part of AVX registers using SSE instructions (if SSE zeroed the upper bits), so instead we get the ongoing issue with <a href="https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake">legacy SSE and dirty uppers</a>. <a href="#fnref:why" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lied" role="doc-endnote">
<p>This is a lie: I didn’t really look <em>all around</em> the die: I looked near the execution units, where the register file would be with very high probability. <a href="#fnref:lied" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:intflags" role="doc-endnote">
<p>The integer flags (so-called <em>EFLAGS</em> register) also need to be renamed and I believe they pull a similar trick: writing their results to the same physical register allocated for the result: I’ve marked the file that I think holds the so-called <em>SPAZO</em> group on the zoomed view, and the C flag may be stored in the same place or in the thin (single bit?) file immediately to the right of the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> file. <a href="#fnref:intflags" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:aliasing" role="doc-endnote">
<p>I talk of <em>physical</em> aliasing here, to distinguish it from the logical/architectural aliasing. Logical aliasing is that which is visible to software: the <code class="language-plaintext highlighter-rouge">ymm</code> and <code class="language-plaintext highlighter-rouge">xmm</code> registers are logically aliased in that writes to <code class="language-plaintext highlighter-rouge">xmm0</code> show up in the low bits of <code class="language-plaintext highlighter-rouge">ymm0</code>. Similarly, the MMX and x87 register files are aliased in that writes to MMX registers modify values in the FP register stack, although the rules are more complicated. Logical aliasing usually implies physical aliasing, but not the other way around. Physical aliasing, then, means that two register sets are renamed onto the same pool of physical registers, but this is usually invisible to software (except through careful performance tests, as we do here). <a href="#fnref:aliasing" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:roblimit" role="doc-endnote">
<p>In practice, you don’t actually get all the way to 2x: you hit something close to the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> limit first. <a href="#fnref:roblimit" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whymmx" role="doc-endnote">
<p>I use MMX rather than x87 so I don’t have to deal with the x87 FP stack abstraction and understand how that maps to renaming. <a href="#fnref:whymmx" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spec" role="doc-endnote">
<p>The <em>speculative</em> register file because we expect some entries also to be used to hold the non-speculative values of the architectural registers. We’ll return to this point in a moment. <a href="#fnref:spec" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:test41" role="doc-endnote">
<p>Specifically, it mixes the same <code class="language-plaintext highlighter-rouge">kaddd</code> and <code class="language-plaintext highlighter-rouge">por</code> instructions we used in the single-type tests <strong>Test 38</strong> and <strong>Test 43</strong>. <a href="#fnref:test41" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comTaking a second look at the newly introduced mask registers, this time with the benefit of a SKX die shot from Fritzchens Fritz.Ice Lake Store Elimination2020-05-18T00:00:00+00:002020-05-18T00:00:00+00:00https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt<!-- boilerplate
page.assets: /assets/intel-zero-opt
assetpath: /assets/intel-zero-opt
tablepath: /misc/tables//intel-zero-opt
-->
<h2 id="introduction">Introduction</h2>
<p>If you made it down to the <a href="/blog/2020/05/13/intel-zero-opt.html#hardware-survey">hardware survey</a> on the last post, you might have <a href="https://twitter.com/tarlinian/status/1260629853000265728">wondered</a> where Intel’s newest mainstream architecture was. <em>Ice Lake was missing!</em></p>
<p>Well good news: it’s here… and it’s interesting. We’ll jump right into the same analysis we did last time for Skylake client. If you haven’t read the <a href="/blog/2020/05/13/intel-zero-opt.html">first article</a> you’ll probably want to start there, because we’ll refer to concepts introduced there without reexplaining them here.</p>
<p>As usual, you can skip to the <a href="#summary">summary</a> for the bite sized version of the findings.</p>
<h2 id="icl-results">ICL Results</h2>
<h3 id="the-compiler-has-an-opinion">The Compiler Has an Opinion</h3>
<p>Let’s first take a look at the overall performance: facing off <code class="language-plaintext highlighter-rouge">fill0</code> vs <code class="language-plaintext highlighter-rouge">fill1</code> as we’ve been doing for every microarchitecture. Remember, <code class="language-plaintext highlighter-rouge">fill0</code> fills a region with zeros, while <code class="language-plaintext highlighter-rouge">fill1</code> fills a region with the value one (as a 4-byte <code class="language-plaintext highlighter-rouge">int</code>).</p>
<p class="warning">All of these tests run at 3.5 GHz. The max single-core turbo for this chip is 3.7 GHz, but it is difficult to sustain that frequency, because of AVX-512 clocking effects and because other cores occasionally become active. 3.5 GHz is a good compromise that keeps the chip running at a constant frequency, while remaining close to the ideal turbo. Disabling turbo is not a good option, because this chip runs at 1.1 GHz without turbo, which would introduce a large distortion when exercising the uncore and RAM.</p>
<center><strong>Figure 7a</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig7a" id="fig7a">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig7a.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl512/overall-warm.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig7a.html">
<img class="figimg" src="/assets/intel-zero-opt/fig7a.svg" alt="Figure 7a" width="648" height="432" />
</a>
</div>
<p>Actually, I lied. <em>This</em> is the right plot for Ice Lake:</p>
<center><strong>Figure 7b</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig7b" id="fig7b">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig7b.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/overall-warm.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig7b.html">
<img class="figimg" src="/assets/intel-zero-opt/fig7b.svg" alt="Figure 7b" width="648" height="432" />
</a>
</div>
<p>Well, which is it?</p>
<p>Those two have a couple of key differences. The first is this weird thing that <strong>Figure 7a</strong> has going on in the right half of the L1 region: there are two obvious and distinct performance levels visible, each with roughly half the samples.</p>
<!-- https://stackoverflow.com/questions/43806515/position-svg-elements-over-an-image -->
<style>
.img-overlay-wrap {
position: relative;
display: block; /* <= shrinks container to image size */
}
.img-overlay-wrap svg {
position: absolute;
top: 0;
left: 0;
}
</style>
<div class="img-overlay-wrap">
<img class="figimg" src="/assets/intel-zero-opt/fig7a.svg" alt="Figure 7a Annotated" />
<svg viewBox="0 0 90 60">
<g stroke-width=".5" fill="none" opacity="0.5">
<ellipse transform="rotate(-25 34 10)" cx="34" cy="10" rx="12" ry="5" stroke="green" />
<ellipse transform="rotate(-25 34 34.5)" cx="34" cy="34.5" rx="12" ry="5" stroke="red" />
</g>
</svg>
</div>
<p>The second thing is that while both of the plots show <em>some</em> of the zero optimization effect in the L3 and RAM regions, the effect is <em>much larger</em> in <strong>Figure 7b</strong>:</p>
<div class="img-overlay-wrap">
<img class="figimg" src="/assets/intel-zero-opt/fig7b.svg" alt="Figure 7b Annotated" />
<svg viewBox="0 0 90 60">
<g stroke-width=".5" fill="none" opacity="0.5">
<ellipse cx="63" cy="40" rx="10" ry="8" stroke="blue" />
</g>
</svg>
</div>
<p>So what’s the difference between these two plots? The top one was compiled with <code class="language-plaintext highlighter-rouge">-march=native</code>, the second with <code class="language-plaintext highlighter-rouge">-march=icelake-client</code>.</p>
<p>Since I’m compiling this <em>on</em> the Ice Lake client system, I would expect these to do the same thing, but for <a href="https://twitter.com/stdlib/status/1261038662751522826">some reason they don’t</a>. The primary difference is that <code class="language-plaintext highlighter-rouge">-march=native</code> <a href="https://godbolt.org/z/gm3vRa">generates</a> 512-bit instructions like so (for the main loop):</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.L4:</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">add</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">512</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">448</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">384</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">320</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">256</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">192</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">128</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">-</span><span class="mi">64</span><span class="p">],</span> <span class="nv">zmm0</span>
<span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">r9</span>
<span class="nf">jne</span> <span class="nv">.L4</span>
</code></pre></div></div>
<p>Using <code class="language-plaintext highlighter-rouge">-march=icelake-client</code> uses 256-bit instructions<sup id="fnref:still512" role="doc-noteref"><a href="#fn:still512" class="footnote" rel="footnote">1</a></sup>:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.L4:</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">32</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">64</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">96</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">128</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">160</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">192</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">vmovdqu32</span> <span class="p">[</span><span class="nb">rax</span><span class="o">+</span><span class="mi">224</span><span class="p">],</span> <span class="nv">ymm0</span>
<span class="nf">add</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">256</span>
<span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">r9</span>
<span class="nf">jne</span> <span class="nv">.L4</span>
</code></pre></div></div>
<p>Most compilers use 256-bit instructions by default even for targets that support AVX-512 (reason: <a href="https://reviews.llvm.org/D67259">downclocking</a>), so the <code class="language-plaintext highlighter-rouge">-march=native</code> version is the weird one here. All of the earlier x86 tests used 256-bit instructions.</p>
<p>The observation that <strong>Figure 7a</strong> results from running 512-bit instructions, combined with a peek at the data lets us immediately resolve the mystery of the bi-modal behavior.</p>
<p>Here’s the raw data for the 17 samples at a buffer size of 9864:</p>
<div class="table-wrapper">
<table class="dataframe" style="max-width:500px; min-width: 50%; font-size:80%;">
<thead>
<tr>
<td colspan="2"></td>
<th colspan="2" halign="left">GB/s</th>
</tr>
<tr>
<th>Size</th>
<th>Trial</th>
<th>fill0</th>
<th>fill1</th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="17" valign="top">9864</th>
<th>0</th>
<td>92.3</td>
<td>92.0</td>
</tr>
<tr>
<th>1</th>
<td>91.9</td>
<td>91.9</td>
</tr>
<tr>
<th>2</th>
<td>91.9</td>
<td>91.9</td>
</tr>
<tr>
<th>3</th>
<td>92.4</td>
<td>92.2</td>
</tr>
<tr>
<th>4</th>
<td>92.0</td>
<td>92.3</td>
</tr>
<tr>
<th>5</th>
<td>92.1</td>
<td>92.1</td>
</tr>
<tr>
<th>6</th>
<td>92.0</td>
<td>92.0</td>
</tr>
<tr>
<th>7</th>
<td>92.3</td>
<td>92.1</td>
</tr>
<tr>
<th>8</th>
<td>92.2</td>
<td>92.0</td>
</tr>
<tr>
<th>9</th>
<td>92.0</td>
<td>92.1</td>
</tr>
<tr>
<th>10</th>
<td>183.3</td>
<td>93.9</td>
</tr>
<tr>
<th>11</th>
<td>197.3</td>
<td>196.9</td>
</tr>
<tr>
<th>12</th>
<td>197.3</td>
<td>196.6</td>
</tr>
<tr>
<th>13</th>
<td>196.6</td>
<td>197.3</td>
</tr>
<tr>
<th>14</th>
<td>197.3</td>
<td>196.6</td>
</tr>
<tr>
<th>15</th>
<td>196.6</td>
<td>197.3</td>
</tr>
<tr>
<th>16</th>
<td>196.6</td>
<td>196.6</td>
</tr>
</tbody>
</table>
</div>
<p>The performance follows a specific pattern with respect to the trials for both <code class="language-plaintext highlighter-rouge">fill0</code> and <code class="language-plaintext highlighter-rouge">fill1</code>: it starts out slow (about 90 GB/s) for the first 9-10 samples, then suddenly jumps up to the higher performance level (close to 200 GB/s). It turns out this is just <a href="/blog/2020/01/17/avxfreq1.html">voltage and frequency management</a> biting us again. In this case there is no frequency change: the <a href="https://github.com/travisdowns/zero-fill-bench/blob/post2/results/icl512/overall-warm.csv#L546">raw data</a> has a frequency column that shows the trials always run at 3.5 GHz. There is only a voltage change, and while the voltage is changing, the CPU runs with reduced dispatch throughput<sup id="fnref:iclbetter" role="doc-noteref"><a href="#fn:iclbetter" class="footnote" rel="footnote">2</a></sup>.</p>
<p>The reason this effect repeats for every new set of trials (new buffer size value) is that each new set of trials is preceded by a 100 ms spin wait: this spin wait doesn’t run any AVX-512 instructions, so the CPU drops back to the lower voltage level and this process repeats. The effect stops when the benchmark moves into the L2 region, because there it is slow enough that the 10 discarded warmup trials are enough to absorb the time to switch to the higher voltage level.</p>
<p>We can avoid this problem simply by removing the 100 ms warmup (passing <code class="language-plaintext highlighter-rouge">--warmup-ms=0</code> to the benchmark), and for the rest of this post we’ll discuss the no-warmup version (we keep the 10 warmup <em>trials</em> and they should be enough).</p>
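<p>A minimal sketch of the trial structure described above (hypothetical harness code, not the real benchmark, which lives in the zero-fill-bench repository): the warmup trials are discarded precisely so that transients like the voltage ramp don't pollute the reported figure.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Run `warmup + trials` passes of a benchmark, discard the first
// `warmup` results (they absorb transients such as the voltage ramp
// discussed above), and report the median of the remaining trials.
// `run_trial` is any callable returning a throughput in GB/s.
template <typename Trial>
double measure(Trial run_trial, std::size_t warmup = 10,
               std::size_t trials = 17) {
    std::vector<double> results;
    for (std::size_t i = 0; i < warmup + trials; ++i) {
        double gbps = run_trial();
        if (i >= warmup) {
            results.push_back(gbps);
        }
    }
    std::sort(results.begin(), results.end());
    return results[results.size() / 2];  // median (trials is odd)
}
```

<p>With this structure, a transient that lasts about 10 trials (as in the table above) lands entirely in the discarded warmup, as long as no extra non-AVX-512 delay resets the voltage level in between.</p>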
<h2 id="elimination-in-ice-lake">Elimination in Ice Lake</h2>
<p>So we’re left with the second effect, which is that the 256-bit store version shows <em>very</em> effective elimination, as opposed to the 512-bit version. For now let’s stop picking favorites between 256 and 512 (push that on your stack, we’ll get back to it), and just focus on the elimination behavior for 256-bit stores.</p>
<p>Here’s the closeup of the L3 region for the 256-bit store version, showing also the L2 eviction type, as discussed in the previous post:</p>
<center><strong>Figure 8</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig8" id="fig8">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig8.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig8.html">
<img class="figimg" src="/assets/intel-zero-opt/fig8.svg" alt="Figure 8" width="648" height="432" />
</a>
</div>
<p>We finally have the elusive (near) 100% elimination of redundant zero stores! The <code class="language-plaintext highlighter-rouge">fill0</code> case peaks at 96% silent (eliminated<sup id="fnref:stricly" role="doc-noteref"><a href="#fn:stricly" class="footnote" rel="footnote">3</a></sup>) evictions. Typical L3 bandwidth is ~59 GB/s with elimination and ~42 GB/s without, for a better than 40% speedup! So this is potentially a big deal on Ice Lake.</p>
<p>Like last time, we can also check the uncore tracker performance counters, to see what happens for larger buffers which would normally write back to memory.</p>
<center><strong>Figure 9</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig9" id="fig9">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig9.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/l3-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig9.html">
<img class="figimg" src="/assets/intel-zero-opt/fig9.svg" alt="Figure 9" width="648" height="432" />
</a>
</div>
<p class="info"><strong>Note:</strong> the way to interpret the events in this plot is the reverse of the above: more uncore tracker writes means <em>less</em> elimination, while in the earlier chart more silent writebacks means <em>more</em> elimination (since every silent writeback replaces a non-silent one).</p>
<p>As with the L3 case, we see that the store elimination appears 96% effective: the number of uncore to memory writebacks flatlines at 4% for the <code class="language-plaintext highlighter-rouge">fill0</code> case. Compare this to <a href="/assets/intel-zero-opt/fig3.svg"><strong>Figure 3</strong></a>, which is the same benchmark running on Skylake-S, and note that only half the writes to RAM are eliminated.</p>
<p>This chart also includes results for the <code class="language-plaintext highlighter-rouge">alt01</code> benchmark. Recall that this benchmark writes 64 bytes of zeros alternating with 64 bytes of ones. This means that, at best, only half the lines can be eliminated by zero-over-zero elimination. On Skylake-S, only about 50% of eligible (zero) lines were eliminated, but here we again get to 96% elimination! That is, in the <code class="language-plaintext highlighter-rouge">alt01</code> case, 48% of all writes were eliminated, half of which are all-ones and not eligible.</p>
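<p>As a quick sanity check, the percentages above can be reproduced with a couple of lines of arithmetic (the constants below are just the figures quoted in this post, not new measurements):</p>

```cpp
// Back-of-the-envelope for the numbers quoted above. For alt01, only
// half the lines are zero (eligible for elimination), and those are
// eliminated at the same ~96% rate as in the fill0 case:
constexpr double eligible_fraction  = 0.5;
constexpr double elimination_rate   = 0.96;
constexpr double overall_eliminated = eligible_fraction * elimination_rate;

// L3 fill0 speedup from the bandwidths quoted earlier (~59 vs ~42 GB/s):
constexpr double l3_speedup = 59.0 / 42.0 - 1.0;  // ~0.405, "better than 40%"
```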
<p>The asymptotic speedup for the all zero case for the RAM region is less than the L3 region, at about 23% but that’s still not exactly something to sneeze at. The speedup for the alternating case is 10%, somewhat less than half the benefit of the all zero case<sup id="fnref:altwrites" role="doc-noteref"><a href="#fn:altwrites" class="footnote" rel="footnote">4</a></sup>. In the L3 region, we also note that the benefit of elimination for <code class="language-plaintext highlighter-rouge">alt01</code> is only about 7%, much smaller than the ~20% benefit you’d expect if you cut the 40% benefit the all-zeros case sees. We saw a similar effect in Skylake-S.</p>
<p>Finally it’s worth noting this little uptick in uncore writes in the <code class="language-plaintext highlighter-rouge">fill0</code> case:</p>
<p><img src="/assets/intel-zero-opt/little-uptick.png" alt="Little Uptick" /></p>
<p>This happens right around the transition from L3 to RAM, and after this, the writes flatline down to 0.04 per line, but this uptick is fairly consistently reproducible. So there’s probably some interesting effect there, perhaps related to the adaptive nature of the L3 caching<sup id="fnref:l3adapt" role="doc-noteref"><a href="#fn:l3adapt" class="footnote" rel="footnote">5</a></sup>.</p>
<h3 id="512-bit-stores">512-bit Stores</h3>
<p>Now it’s time to pop the mental stack and return to something we noticed earlier: that 256-bit stores seemed to get superior performance for the L3 region compared to 512-bit ones.</p>
<p>Remember that we ended up with 256-bit and 512-bit versions due to unexpected behavior in the <code class="language-plaintext highlighter-rouge">-march</code> flag. Rather than <em>relying</em> on this weirdness<sup id="fnref:gccfix" role="doc-noteref"><a href="#fn:gccfix" class="footnote" rel="footnote">6</a></sup>, let’s just write slightly lazy<sup id="fnref:lazy" role="doc-noteref"><a href="#fn:lazy" class="footnote" rel="footnote">7</a></sup> <a href="https://github.com/travisdowns/zero-fill-bench/blob/master/algos.cpp#L151">methods</a> that explicitly use 256-bit and 512-bit stores but are otherwise identical. <code class="language-plaintext highlighter-rouge">fill256_0</code> uses 256-bit stores and writes zeros, and I’ll let you pattern match the rest of the names.</p>
<p>Here’s how they perform on my ICL hardware:</p>
<center><strong>Figure 10</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig10" id="fig10">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig10.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/256-512.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig10.html">
<img class="figimg" src="/assets/intel-zero-opt/fig10.svg" alt="Figure 10" width="648" height="432" />
</a>
</div>
<p class="warning">This chart shows only the median of 17 trials. You can look at the raw data for an idea of the trial variance, but it is generally low.</p>
<p>In the L1 region, the 512-bit approach usually wins and there is no apparent difference between writing 0 or 1 (the two halves of the moon mostly line up). Still, 256-bit stores are roughly <em>competitive</em> with 512-bit: they aren’t running at half the throughput. That’s thanks to the second store port on Ice Lake. Without that feature, you’d be limited to 112 GB/s at 3.5 GHz, but here we handily reach ~190 GB/s with 256-bit stores, and ~195 GB/s with 512-bit stores. 512-bit stores probably have a slight advantage just because of fewer total instructions executed (about half of the 256-bit case) and associated second order effects.</p>
<p class="info">Ice Lake has two <em>store ports</em> which lets it execute two stores per cycle, but only a single cache line can be written per cycle. However, if two consecutive stores fall into the <em>same</em> cache line, they will generally both be written in the same cycle. So the maximum sustained throughput is up to two stores per cycle, <em>if</em> they fall in the same line<sup id="fnref:l1port" role="doc-noteref"><a href="#fn:l1port" class="footnote" rel="footnote">8</a></sup>.</p>
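<p>The 112 GB/s figure falls out of simple arithmetic (assuming, as described in the note above, one 32-byte store committing per cycle at 3.5 GHz, versus a full 64-byte line per cycle when stores pair):</p>

```cpp
// Store bandwidth ceilings at 3.5 GHz. Without same-line pairing, only
// one 256-bit (32-byte) store can commit per cycle:
constexpr double ghz              = 3.5;
constexpr double single_port_gbps = 32 * ghz;  // 112 GB/s
// With two same-line stores committing per cycle, the ceiling is a
// full 64-byte cache line per cycle:
constexpr double paired_gbps      = 64 * ghz;  // 224 GB/s
```

<p>The observed ~190-195 GB/s sits below the 224 GB/s ceiling but well above the single-store limit, consistent with same-line pairing doing its job most of the time.</p>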
<p>In the L2 region, however, the 256-bit approaches seem to pull ahead. This is a bit like the Buffalo Bills winning the Super Bowl: it just isn’t supposed to happen.</p>
<p>Let’s zoom in:</p>
<center><strong>Figure 11</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig11" id="fig11">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig11.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/256-512-l2-l3.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig11.html">
<img class="figimg" src="/assets/intel-zero-opt/fig11.svg" alt="Figure 11" width="648" height="432" />
</a>
</div>
<p>The 256-bit benchmarks start roughly tied with their 512-bit cousins, but then steadily pull away as the region approaches the full size of the L2. By the end of the L2 region, they have a ~13% edge. This applies to <em>both</em> <code class="language-plaintext highlighter-rouge">fill256</code> versions – the zeros-writing and ones-writing flavors. So this effect doesn’t seem explicable by store elimination: we already know ones are not eliminated and, also, elimination only starts to play an obvious role when the region is L3-sized.</p>
<p>In the L3, the situation changes: now the 256-bit version really pulls ahead, <em>but only the version that writes zeros</em>. The 256-bit and 512-bit one-fill versions fall down in throughput, nearly to the same level (but the 256-bit version still seems <em>slightly but measurably ahead</em> at ~2% faster). The 256-bit zero fill version is now ahead by roughly 45%!</p>
<p>Let’s concentrate only on the two benchmarks that write zero: <code class="language-plaintext highlighter-rouge">fill256_0</code> and <code class="language-plaintext highlighter-rouge">fill512_0</code>, and turn on the L2 eviction counters (you probably saw that one coming by now):</p>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig12" id="fig12">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables//intel-zero-opt/fig12.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/icl/256-512-l2-l3.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables//intel-zero-opt/fig12.html">
<img class="figimg" src="/assets/intel-zero-opt/fig12.svg" alt="Figure 12" width="648" height="432" />
</a>
</div>
<p class="warning">Only the <em>L2 Lines Out Silent</em> event is shown – the balance of the evictions are <em>non-silent</em> as usual.</p>
<p>Despite the fact that I had to leave the right axis legend just kind of floating around in the middle of the plot, I hope the story is clear: 256-bit stores get eliminated at the usual 96% rate, but 512-bit stores are hovering at a decidedly Skylake-like ~56%. I can’t be sure, but I expect this difference in store elimination largely explains the performance difference.</p>
<p>I checked also the behavior with prefetching off, but the pattern is very similar, except with both approaches having reduced performance in L3 (you can <a href="/assets/intel-zero-opt/fig12-nopf.svg">see for yourself</a>). It is interesting to note that for zero-over-zero stores, the 256-bit store performance <em>in L3</em> is almost the same as the 512-bit store performance <em>in L2!</em> It buys you almost a whole level in the cache hierarchy, performance-wise (in this benchmark).</p>
<p>Normally I’d take a shot at guessing what’s going on here, but this time I’m not going to do it. I just don’t know<sup id="fnref:lied" role="doc-noteref"><a href="#fn:lied" class="footnote" rel="footnote">9</a></sup>. The whole thing is very puzzling, because everything after the L1 operates on a cache-line basis: we expect the fine-grained pattern of stores made by the core, <em>within a line</em> to basically be invisible to the rest of the caching system which sees only full lines. Yet there is some large effect in the L3 and even in RAM<sup id="fnref:RAM" role="doc-noteref"><a href="#fn:RAM" class="footnote" rel="footnote">10</a></sup> related to whether the core is writing a cache line in two 256-bit chunks or a single 512-bit chunk.</p>
<h2 id="summary">Summary</h2>
<p>We have found that the store elimination optimization originally uncovered on Skylake client is still present in Ice Lake and is roughly twice as effective in our fill benchmarks. Elimination of 96% L2 writebacks (to L3) and L3 writebacks (to RAM) was observed, compared to 50% to 60% on Skylake. We found speedups of up to 45% in the L3 region and speedups of about 25% in RAM, compared to improvements of less than 20% in Skylake.</p>
<p>We find that when zero-filling writes occur to a region sized for the L2 cache or larger, 256-bit writes are often significantly <em>faster</em> than 512-bit writes. The effect is largest for the L2, where 256-bit zero-over-zero writes are up to <em>45% faster</em> than 512-bit writes. We find a similar effect even for non-zeroing writes, but only in the L2.</p>
<h2 id="future">Future</h2>
<p>It is an interesting open question whether the as-yet-unreleased <abbr title="The new 10nm microarchitecture used in Ice Lake CPUs.">Sunny Cove</abbr> server chips will exhibit this same optimization.</p>
<h2 id="advice">Advice</h2>
<p>Unless you are developing only for your own laptop, as of May 2020 Ice Lake is deployed on a microscopic fraction of total hosts you would care about, so the headline advice in the previous post applies: this optimization doesn’t apply to enough hardware for you to target it specifically. This might change in the future as Ice Lake and sequels roll out in force. In that case, the magnitude of the effect might make it worth optimizing for in some cases.</p>
<p>For fine-grained advice, see the <a href="/blog/2020/05/13/intel-zero-opt.html#tuning-advice">list in the previous post</a>.</p>
<h2 id="thanks">Thanks</h2>
<p>Vijay and Zach Wegner for pointing out typos.</p>
<p>Ice Lake photo by <a href="https://unsplash.com/@marcuslofvenberg">Marcus Löfvenberg</a> on Unsplash.</p>
<p>Saagar Jha for helping me track down and fix a WebKit rendering <a href="https://github.com/travisdowns/travisdowns.github.io/issues/102">issue</a>.</p>
<h2 id="discussion-and-feedback">Discussion and Feedback</h2>
<p>If you have something to say, leave a comment below. There are also discussions on <a href="https://twitter.com/trav_downs/status/1262428350511022081">Twitter</a> and <a href="https://news.ycombinator.com/item?id=23225260">Hacker News</a>.</p>
<p>Feedback is also warmly welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:still512" role="doc-endnote">
<p>It’s actually still using the EVEX-encoded AVX-512 instruction <code class="language-plaintext highlighter-rouge">vmovdqu32</code>, which is somewhat more efficient here because AVX-512 has more compact encoding of offsets that are a multiple of the vector size (as they usually are). <a href="#fnref:still512" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:iclbetter" role="doc-endnote">
<p>In this case, the throughput is only halved, versus the 1/4 throughput when we looked at dispatch throttling on <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, so based on this very preliminary result it seems like the dispatch throttling might be less severe in Ice Lake (this needs a deeper look: we never used stores to test on <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>). <a href="#fnref:iclbetter" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:stricly" role="doc-endnote">
<p>Strictly speaking, a silent writeback is a <em>sufficient</em>, but not a <em>necessary</em> condition for elimination, so it is a lower bound on the number of eliminated stores. For all I know, 100% of stores are eliminated, but out of those 4% are written back not-silently (but not in a modified state). <a href="#fnref:stricly" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:altwrites" role="doc-endnote">
<p>One reason could be that writing only alternating lines is somewhat more expensive than writing half the data but contiguously. Of course this is obviously true closer to the core, since you touch half the number of the pages in the contiguous case, need half the number of page walks, prefetching is more effective since you cross half as many 4K boundaries (prefetch stops at 4K boundaries) and so on. Even at the memory interface, alternating line writes might be less efficient because you get less benefit from opening each DRAM page, can’t do longer than 64-byte bursts, etc. In a pathological case, alternating lines could be <em>half</em> the bandwidth if the controller maps alternating lines to alternating channels, since you’ll only be accessing a single channel. We could try to isolate this effect by trying more coarse grained interleaving. <a href="#fnref:altwrites" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:l3adapt" role="doc-endnote">
<p>The L3 is capable of determining if the current access pattern would be better served by something like an <abbr title="Most recently used - an eviction strategy suitable for data with little temporal locality">MRU</abbr> eviction strategy, for example when a stream of data is being accessed without reuse, it would be better to kick that data out of the cache quickly, rather than evicting other data that may be useful. <a href="#fnref:l3adapt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:gccfix" role="doc-endnote">
<p>After all, there’s a good chance it will be fixed in a later version of gcc. <a href="#fnref:gccfix" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lazy" role="doc-endnote">
<p>These are lazy in the sense that I don’t do any scalar head or tail handling: the final iteration just does a full width <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> store even if there aren’t 64 bytes left: we overwrite the buffer by up to 63 bytes. We account for this when we allocate the buffer by ensuring the allocation is oversized by at least that amount. This doesn’t matter for larger buffers, but it means this version will get a boost for very small buffers versus approaches that do the fill exactly. In any case, we are interested in large buffers here. <a href="#fnref:lazy" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:l1port" role="doc-endnote">
<p>Most likely, the L1 has a single 64 byte wide write port, like <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, and the commit logic at the head of the store buffer can look ahead one store to see if it is in the same line in order to dequeue two stores in a single cycle. Without this feature, you could <em>execute</em> two stores per cycle, but only commit one, so the long-run store throughput would be limited to one per cycle. <a href="#fnref:l1port" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lied" role="doc-endnote">
<p>Well I lied. I at least have some ideas. It may be that the CPU power budget is dynamically partitioned between the core and uncore, and with 512-bit stores triggering the AVX-512 power budget, there is less power for the uncore and it runs at a lower frequency (that could be checked). This seems unlikely given that it should not obviously affect the elimination chance. <a href="#fnref:lied" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:RAM" role="doc-endnote">
<p>We didn’t take a close look at the effect in RAM but it persists, albeit at a lower magnitude. 256-bit zero-over-zero writes are about 10% faster than 512-bit writes of the same type. <a href="#fnref:RAM" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comWe look at the zero store optimization as it applies to Intel's newest micro-architecture.Hardware Store Elimination2020-05-13T00:00:00+00:002020-05-13T00:00:00+00:00https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt<!-- boilerplate
page.assets: /assets/intel-zero-opt
assetpath: /assets/intel-zero-opt
tablepath: /misc/tables/intel-zero-opt
-->
<p>I had no plans to write <a href="/blog/2020/01/20/zero.html">another post</a> about zeros, but when life throws you a zero make zeroaid, or something like that. Here we go!</p>
<p>If you want to jump over the winding reveal and just read the summary and advice, <a href="#summary-perma">now is your chance</a>.</p>
<p>When writing simple memory benchmarks I have always taken the position the <em>value</em> written to memory didn’t matter. Recently, while running a straightforward benchmark<sup id="fnref:ubstore" role="doc-noteref"><a href="#fn:ubstore" class="footnote" rel="footnote">1</a></sup> probing the interaction between AVX-512 stores and <a href="https://en.wikipedia.org/wiki/MESI_protocol#Read_For_Ownership">read for ownership</a> I ran into a weird performance deviation. This is that story<sup id="fnref:story" role="doc-noteref"><a href="#fn:story" class="footnote" rel="footnote">2</a></sup>.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#prelude" id="markdown-toc-prelude">Prelude</a> <ul>
<li><a href="#data-dependent-performance" id="markdown-toc-data-dependent-performance">Data Dependent Performance</a></li>
<li><a href="#source" id="markdown-toc-source">Source</a></li>
</ul>
</li>
<li><a href="#benchmarks" id="markdown-toc-benchmarks">Benchmarks</a> <ul>
<li><a href="#a-very-simple-loop" id="markdown-toc-a-very-simple-loop">A Very Simple Loop</a></li>
<li><a href="#our-first-benchmark" id="markdown-toc-our-first-benchmark">Our First Benchmark</a> <ul>
<li><a href="#l1-and-l2" id="markdown-toc-l1-and-l2">L1 and L2</a></li>
<li><a href="#getting-weird-in-the-l3" id="markdown-toc-getting-weird-in-the-l3">Getting Weird in the L3</a></li>
<li><a href="#ram-still-weird" id="markdown-toc-ram-still-weird">RAM: Still Weird</a></li>
</ul>
</li>
<li><a href="#wild-irresponsible-speculation-and-miscellanous-musings" id="markdown-toc-wild-irresponsible-speculation-and-miscellanous-musings">Wild, Irresponsible Speculation and Miscellanous Musings</a> <ul>
<li><a href="#predicting-a-new-predictor" id="markdown-toc-predicting-a-new-predictor">Predicting a New Predictor</a></li>
<li><a href="#predictor-test" id="markdown-toc-predictor-test">Predictor Test</a></li>
</ul>
</li>
<li><a href="#hardware-survey" id="markdown-toc-hardware-survey">Hardware Survey</a></li>
<li><a href="#further-notes" id="markdown-toc-further-notes">Further Notes</a></li>
</ul>
</li>
<li><a href="#wrapping-up" id="markdown-toc-wrapping-up">Wrapping Up</a> <ul>
<li><a href="#findings" id="markdown-toc-findings">Findings</a></li>
<li><a href="#tuning-advice" id="markdown-toc-tuning-advice">Tuning “Advice”</a></li>
<li><a href="#thanks" id="markdown-toc-thanks">Thanks</a></li>
<li><a href="#discussion-and-feedback" id="markdown-toc-discussion-and-feedback">Discussion and Feedback</a></li>
</ul>
</li>
</ul>
<h2 id="prelude">Prelude</h2>
<h3 id="data-dependent-performance">Data Dependent Performance</h3>
<p>On current mainstream CPUs, the timing of most instructions isn’t data-dependent. That is, their performance is the same regardless of the <em>value</em> of the input(s) to the instruction. Unlike you<sup id="fnref:assume" role="doc-noteref"><a href="#fn:assume" class="footnote" rel="footnote">3</a></sup> or me, your CPU takes the same time to add <code class="language-plaintext highlighter-rouge">1 + 2</code> as it does to add <code class="language-plaintext highlighter-rouge">68040486 + 80866502</code>.</p>
<p>Now, there are some notable exceptions:</p>
<ul>
<li>Integer division is data-dependent on most x86 CPUs: larger inputs generally take longer although the details vary widely among microarchitectures<sup id="fnref:icldiv" role="doc-noteref"><a href="#fn:icldiv" class="footnote" rel="footnote">4</a></sup>.</li>
<li>BMI2 instructions <code class="language-plaintext highlighter-rouge">pdep</code> and <code class="language-plaintext highlighter-rouge">pext</code> have <a href="https://twitter.com/uops_info/status/1202950247900684290">famously terrible</a> and data-dependent performance on AMD Zen and Zen2 chips.</li>
<li>Floating point instructions often have slower performance when <a href="https://en.wikipedia.org/wiki/Denormal_number#Performance_issues">denormal numbers</a> are encountered, although some rounding modes such as <em>flush to zero</em> may avoid this.</li>
</ul>
<p>That list is not exhaustive: there are other cases of data-dependent performance, especially when you start digging into complex microcoded instructions such as <a href="https://www.felixcloutier.com/x86/cpuid"><code class="language-plaintext highlighter-rouge">cpuid</code></a>. Still, it isn’t unreasonable to assume that most simple instructions not listed above execute in constant time.</p>
<p>How about memory operations, such as loads and stores?</p>
<p>Certainly, the <em>address</em> matters. After all, the address determines the caching behavior, and caching can easily account for two orders of magnitude difference in performance<sup id="fnref:memperf" role="doc-noteref"><a href="#fn:memperf" class="footnote" rel="footnote">5</a></sup>. On the other hand, I wouldn’t expect the <em>data values</em> loaded or stored to matter. There is not much reason to expect the memory or caching subsystem to care about the value of the bits loaded or stored, outside of scenarios such as hardware-compressed caches, which are not widely deployed<sup id="fnref:atall" role="doc-noteref"><a href="#fn:atall" class="footnote" rel="footnote">6</a></sup> on x86.</p>
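<p>If you wanted to probe that assumption directly, a minimal (and deliberately naive) sketch is to time the identical fill with two different values; any consistent gap would indicate value-dependent behavior:</p>

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Minimal sketch of a value-dependence check: time the identical fill
// with two different values. (The real benchmark, linked below, is far
// more careful about warmup and repeated trials; this only shows the
// idea.)
double time_fill(std::vector<int>& buf, int val) {
    auto start = std::chrono::steady_clock::now();
    std::fill(buf.begin(), buf.end(), val);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();  // seconds
}
```

<p>On hardware with the store elimination behavior discussed in this post, the zero-filling case can come out measurably faster once the buffer spills out of the L2.</p>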
<h3 id="source">Source</h3>
<p>The full benchmark associated with this post (including some additional benchmarks not mentioned here) is <a href="https://github.com/travisdowns/zero-fill-bench">available on GitHub</a>.</p>
<h2 id="benchmarks">Benchmarks</h2>
<p>That’s enough prelude <img src="/assets/intel-zero-opt/prelude.jpg" alt="a small red car" style="display:inline; height: 1.2em; aspect-ratio: 120/62" /> for now. Let’s write some benchmarks.</p>
<h3 id="a-very-simple-loop">A Very Simple Loop</h3>
<p>Let’s start with a very simple task. Write a function that takes an <code class="language-plaintext highlighter-rouge">int</code> value <code class="language-plaintext highlighter-rouge">val</code> and fills a buffer of a given size with copies of that value. Just like <a href="https://en.cppreference.com/w/c/string/byte/memset"><code class="language-plaintext highlighter-rouge">memset</code></a>, but with an <code class="language-plaintext highlighter-rouge">int</code> value rather than a <code class="language-plaintext highlighter-rouge">char</code> one.</p>
<p>The canonical C implementation is probably some type of for loop, like this:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill_int</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span> <span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">int</span> <span class="n">val</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">size</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>… or maybe this<sup id="fnref:otherc" role="doc-noteref"><a href="#fn:otherc" class="footnote" rel="footnote">7</a></sup>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill_int</span><span class="p">(</span><span class="kt">int</span><span class="o">*</span> <span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">int</span> <span class="n">val</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span><span class="o">*</span> <span class="n">end</span> <span class="o">=</span> <span class="n">buf</span> <span class="o">+</span> <span class="n">size</span><span class="p">;</span> <span class="n">buf</span> <span class="o">!=</span> <span class="n">end</span><span class="p">;</span> <span class="o">++</span><span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
<span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In C++, we don’t even need that much: we can simply delegate directly to <code class="language-plaintext highlighter-rouge">std::fill</code> which does the same thing as a one-liner<sup id="fnref:bpurp" role="doc-noteref"><a href="#fn:bpurp" class="footnote" rel="footnote">8</a></sup>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">buf</span> <span class="o">+</span> <span class="n">size</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>
</code></pre></div></div>
<p>There is nothing magic about <code class="language-plaintext highlighter-rouge">std::fill</code>: it also <a href="https://github.com/gcc-mirror/gcc/blob/866cd688d1b72b0700a7e001428bdf2fe73fbf64/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L698">uses a loop</a> just like the C version above. Not surprisingly, gcc and clang compile them to the <a href="https://godbolt.org/z/R5bJiE">same machine code</a><sup id="fnref:clangv" role="doc-noteref"><a href="#fn:clangv" class="footnote" rel="footnote">9</a></sup>.</p>
<p>With the right compiler arguments (<code class="language-plaintext highlighter-rouge">-march=native -O3 -funroll-loops</code> in our case), we expect this <code class="language-plaintext highlighter-rouge">std::fill</code> version (and all the others) to be implemented with AVX vector instructions, and <a href="https://godbolt.org/z/SfGVEC">it is so</a>. The part which does the heavy lifting for large fills looks like this:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.L4:</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">0</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">32</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">64</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">96</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">128</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">160</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">192</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">vmovdqu</span> <span class="nv">YMMWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rax</span> <span class="o">+</span> <span class="mi">224</span><span class="p">],</span> <span class="nv">ymm1</span>
<span class="nf">add</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">256</span>
<span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span> <span class="nv">r9</span>
<span class="nf">jne</span> <span class="nv">.L4</span>
</code></pre></div></div>
<p>It stores 256 bytes of data every iteration using eight 32-byte AVX2 store instructions. The full function is much larger, with a scalar portion for buffers smaller than 32 bytes (which also handles any odd elements left over after the vectorized part is done), and a vectorized jump table to handle up to seven 32-byte chunks before the main loop. No effort is made to align the destination, but we’ll align everything to 64 bytes in our benchmark so this won’t matter.</p>
<h3 id="our-first-benchmark">Our First Benchmark</h3>
<p>Enough foreplay: let’s take the C++ version out for a spin, with two different fill values (<code class="language-plaintext highlighter-rouge">val</code>) selected completely at random: zero (<code class="language-plaintext highlighter-rouge">fill0</code>) and one (<code class="language-plaintext highlighter-rouge">fill1</code>). We’ll use gcc 9.2.1 and the <code class="language-plaintext highlighter-rouge">-march=native -O3 -funroll-loops</code> flags mentioned above.</p>
<p>We organize it so that for both tests we call the <em>same</em> non-inlined function: the exact same instructions are executed and only the value differs. That is, the compiler isn’t making any data-dependent optimizations.</p>
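<p>To make that concrete, here is a minimal sketch of the setup (hypothetical and simplified, not the actual benchmark harness): the <code class="language-plaintext highlighter-rouge">noinline</code> attribute prevents the compiler from specializing either call site for its constant argument.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;vector&gt;

// noinline ensures both tests execute the exact same machine code;
// only the runtime value of val differs.
__attribute__((noinline))
void fill_val(int* buf, std::size_t size, int val) {
    for (std::size_t i = 0; i &lt; size; ++i) {
        buf[i] = val;
    }
}

int main() {
    std::vector&lt;int&gt; buf(1024 * 1024);
    fill_val(buf.data(), buf.size(), 0);  // fill0
    fill_val(buf.data(), buf.size(), 1);  // fill1
}
</code></pre></div></div>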
<p>Here’s the fill throughput in GB/s for these two values, for region sizes ranging from 100 up to 100,000,000 bytes.</p>
<center><strong>Figure 1</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig1" id="fig1">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig1.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/overall.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig1.html">
<img class="figimg" src="/assets/intel-zero-opt/fig1.svg" alt="Figure 1" width="648" height="432" />
</a>
</div>
<p class="info"><strong>About this chart:</strong><br />
At each region size (that is, at each position along the x-axis) 17 semi-transparent samples<sup id="fnref:warm" role="doc-noteref"><a href="#fn:warm" class="footnote" rel="footnote">10</a></sup> are plotted and although they usually overlap almost completely (resulting in a single circle), you can see cases where there are outliers that don’t line up with the rest of the samples. This plot tries to give you an idea of the spread of back-to-back samples without hiding them behind error bars<sup id="fnref:errorbars" role="doc-noteref"><a href="#fn:errorbars" class="footnote" rel="footnote">11</a></sup>. Finally, the sizes of the various data caches (32, 256 and 6144 KiB for the L1D, L2 and L3, respectively) are marked for convenience.</p>
<h4 id="l1-and-l2">L1 and L2</h4>
<p>Not surprisingly, the performance depends heavily on what level of cache the filled region fits into.</p>
<p>Everything is fairly sane when the buffer fits in the L1 or L2 cache (up to ~256 KiB<sup id="fnref:l2" role="doc-noteref"><a href="#fn:l2" class="footnote" rel="footnote">12</a></sup>). The relatively poor performance for very small region sizes is explained by the prologue and epilogue of the vectorized implementation: for small sizes, a relatively large amount of time is spent in the int-at-a-time loops, which store only 4 bytes per cycle rather than up to 32.</p>
<p>This also explains the bumpy performance in the fastest region, between ~1,000 and ~30,000 bytes: this is highly reproducible and not noise. It occurs because some sampled values have a larger remainder mod 32. For example, the sample at 740 bytes runs at ~73 GB/s while the next sample at 988 runs at a slower 64 GB/s. That’s because 740 % 32 is 4, while 988 % 32 is 28, so the latter size has 7x more cleanup work to do than the former<sup id="fnref:badvec" role="doc-noteref"><a href="#fn:badvec" class="footnote" rel="footnote">13</a></sup>. Essentially, we are semi-randomly sampling a sawtooth function, and if you <a href="/assets/intel-zero-opt/sawtooth.svg">plot this region with finer granularity</a><sup id="fnref:melty" role="doc-noteref"><a href="#fn:melty" class="footnote" rel="footnote">14</a></sup> you can see it quite clearly.</p>
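<p>The cleanup cost can be computed directly: the scalar tail has to handle whatever is left after the 32-byte chunks. A trivial helper (for illustration only) reproduces the numbers above.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;

// The vectorized body stores 32 bytes (8 ints) at a time; the leftover
// size % 32 bytes fall to the 4-byte-at-a-time scalar tail.
std::size_t scalar_tail_bytes(std::size_t size_bytes) {
    return size_bytes % 32;
}
</code></pre></div></div>
<p>For example, <code class="language-plaintext highlighter-rouge">scalar_tail_bytes(740)</code> is 4, while <code class="language-plaintext highlighter-rouge">scalar_tail_bytes(988)</code> is 28: the 7x difference in cleanup work mentioned above.</p>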
<h4 id="getting-weird-in-the-l3">Getting Weird in the L3</h4>
<p>So while there are some interesting effects in the first half of the results, covering L1 and L2, they are fairly easy to explain and, more to the point, performance for the zero and one cases are identical: the samples are all concentric. As soon as we dip our toes into the L3, however, things start to get <em>weird</em>.</p>
<p>Weird in that we see a clear divergence between stores of zeros versus ones. Remember that this is the exact same function, the same <em>machine</em> code executing the same stream of instructions, only varying in the value of the <code class="language-plaintext highlighter-rouge">ymm1</code> register passed to the store instruction. Storing zero is consistently about 17% to 18% faster than storing one, both in the region covered by the L3 (up to 6 MiB on my system), and beyond that where we expect misses to RAM (it looks like the difference narrows in the RAM region, but it’s mostly a trick of the eye: the relative performance difference is about the same).</p>
<p>What’s going on here? Why does the CPU care <em>what values</em> are being stored, and why is zero special?</p>
<p>We can get some additional insight by measuring the <code class="language-plaintext highlighter-rouge">l2_lines_out.silent</code> and <code class="language-plaintext highlighter-rouge">l2_lines_out.non_silent</code> events while we focus on the regions that fit in L2 or L3. These events measure the number of lines evicted from L2 either <em>silently</em> or <em>non-silently</em>.</p>
<p>Here are Intel’s descriptions of these events:</p>
<p><strong>l2_lines_out.silent</strong></p>
<blockquote>
<p>Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state.</p>
</blockquote>
<p><strong>l2_lines_out.non_silent</strong></p>
<blockquote>
<p>Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines are in Modified state. Modified lines are written back to L3.</p>
</blockquote>
<p>The states being referred to here are <a href="https://en.wikipedia.org/wiki/MESI_protocol">MESI</a> cache states, commonly abbreviated M (modified), E (exclusive, but not modified) and S (possibly shared, not modified).</p>
<p>The second definition is not completely accurate. In particular, it implies that only modified lines trigger the <em>non-silent</em> event. However, <a href="https://stackoverflow.com/q/52565303/149138">I find</a> that unmodified lines in E state can also trigger this event. Roughly, the behavior for unmodified lines seems to be that lines that miss in L2 <em>and</em> L3 usually get filled into the L2 in a state where they will be evicted <em>non-silently</em>, but unmodified lines that miss in L2 and <em>hit</em> in L3 will generally be evicted silently<sup id="fnref:silent" role="doc-noteref"><a href="#fn:silent" class="footnote" rel="footnote">15</a></sup>. Of course, lines that are modified <em>must</em> be evicted non-silently in order to update the outer levels with the new data.</p>
<p>In summary: silent evictions are associated with unmodified lines in E or S state, while non-silent evictions are associated with M, E or (possibly) S state lines, with the silent vs non-silent choice for E and S being made in some unknown manner.</p>
<p>Let’s look at silent vs non-silent evictions for the <code class="language-plaintext highlighter-rouge">fill0</code> and <code class="language-plaintext highlighter-rouge">fill1</code> cases:</p>
<center><strong>Figure 2</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig2" id="fig2">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig2.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig2.html">
<img class="figimg" src="/assets/intel-zero-opt/fig2.svg" alt="Figure 2" width="648" height="432" />
</a>
</div>
<p class="info"><strong>About this chart:</strong><br />
For clarity, I show only the single median sample for each size<sup id="fnref:trust" role="doc-noteref"><a href="#fn:trust" class="footnote" rel="footnote">16</a></sup>. As before, the left axis is fill speed and on the right axis the two types of eviction events are plotted, normalized to the number of cache lines accessed in the benchmark. That is, a value of 1.0 means that for every cache line accessed, the event occurred one time.</p>
<p>The <em>total</em> number of evictions (sum of silent and non-silent) is the same for both cases: near zero<sup id="fnref:wb" role="doc-noteref"><a href="#fn:wb" class="footnote" rel="footnote">17</a></sup> when the region fits in L2, and then quickly increases to ~1 eviction per stored cache line. In the L3, <code class="language-plaintext highlighter-rouge">fill1</code> also behaves as we’d expect: essentially all of the evictions are non-silent. This makes sense since modified lines <em>must</em> be evicted non-silently to write their modified data to the next layer of the cache subsystem.</p>
<p>For <code class="language-plaintext highlighter-rouge">fill0</code>, the story is different. Once the buffer size no longer fits in L2, we see the same <em>total</em> number of evictions from L2, but 63% of these are silent, the rest non-silent. Remember, only unmodified lines even have the hope of a silent eviction. This means that at least 63% of the time, the L2<sup id="fnref:orl3" role="doc-noteref"><a href="#fn:orl3" class="footnote" rel="footnote">18</a></sup> is able to detect that the write is <em>redundant</em>: it doesn’t change the value of the line, and so the line is evicted silently. That is, it is never written back to the L3. This is presumably what causes the performance boost: the pressure on the L3 is reduced: although all the implied reads<sup id="fnref:rfo" role="doc-noteref"><a href="#fn:rfo" class="footnote" rel="footnote">19</a></sup> still need to go through the L3, only about 1 out of 3 of those lines ends up getting written back.</p>
<p>Once the test starts to exceed the L3 threshold, all of the evictions become non-silent even in the <code class="language-plaintext highlighter-rouge">fill0</code> case. This doesn’t necessarily mean that the zero optimization stops occurring. As mentioned earlier<sup id="fnref:silent:1" role="doc-noteref"><a href="#fn:silent" class="footnote" rel="footnote">15</a></sup>, it is a typical pattern even for read-only workloads: once lines arrive in L2 as a result of an L3 miss rather than a hit, their subsequent eviction becomes non-silent, even if never written. So we can assume that the lines are probably still detected as not modified, although we lose our visibility into the effect at least as far as the <code class="language-plaintext highlighter-rouge">l2_lines_out</code> events go. That is, although all evictions are non-silent, some fraction of the evictions are still indicating that the outgoing data is unmodified.</p>
<h4 id="ram-still-weird">RAM: Still Weird</h4>
<p>In fact, we can confirm that this apparent optimization still happens as we move into RAM, using a different set of events. There are several to choose from – and all of those that I tried tell the same story. We’ll focus on <code class="language-plaintext highlighter-rouge">unc_arb_trk_requests.writes</code>, <a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/6th-gen-core-family-uncore-performance-monitoring-manual.pdf">documented</a> as follows:</p>
<blockquote>
<p>Number of writes allocated including any write transaction including full, partials and evictions.</p>
</blockquote>
<p>It is important to note that the “uncore tracker” these events monitor is used by data flowing between L3 and memory, not between L2 and L3. So <em>writes</em> here generally refers to writes that will reach memory.</p>
<p>Here’s how this event scales for the same test we’ve been running this whole time (the size range has been shifted for focus on the area of interest)<sup id="fnref:sneaky" role="doc-noteref"><a href="#fn:sneaky" class="footnote" rel="footnote">20</a></sup>:</p>
<center><strong>Figure 3</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig3" id="fig3">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig3.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/l3-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig3.html">
<img class="figimg" src="/assets/intel-zero-opt/fig3.svg" alt="Figure 3" width="648" height="432" />
</a>
</div>
<p>The number of writes for well-behaved <code class="language-plaintext highlighter-rouge">fill1</code> approaches one write per cache line as the buffer exceeds the size of L3 – again, this is as expected. For the more rebellious <code class="language-plaintext highlighter-rouge">fill0</code>, it is almost exactly half that amount. For every two lines written by the benchmark, we only write one back to memory! This same 2:1 ratio is also reflected if we measure memory writes at the integrated memory controller<sup id="fnref:imcevent" role="doc-noteref"><a href="#fn:imcevent" class="footnote" rel="footnote">21</a></sup>: writing zeros results in only half the number of writes at the memory controller.</p>
<h3 id="wild-irresponsible-speculation-and-miscellanous-musings">Wild, Irresponsible Speculation and Miscellanous Musings</h3>
<p>This is all fairly strange. It’s not weird that there would be a “redundant writes” optimization to avoid writing back identical values: this seems like it could benefit some common write patterns.</p>
<p>It is perhaps a bit unusual that it apparently applies only to all-zero values. Maybe this is because zeros overwriting zeros is one of the most common redundant write cases, and detecting zero values can be done more cheaply than a full compare. Also, the “is zero?” state can be communicated and stored as a single bit, which might be useful. For example, if the L2 is involved in the duplicate detection (and the <code class="language-plaintext highlighter-rouge">l2_lines_out</code> results suggest it is), perhaps the detection happens when the line is evicted, at which point you want to compare to the line in L3, but you certainly can’t store the entire old value in or near the L2 (that would require storage as large as the L2 itself). You could, however, store a single bit indicating that the line was zero, and check whether the outgoing line is still all zero as part of the eviction process.</p>
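<p>As an entirely speculative illustration of that single-bit scheme (my own toy model, not a description of the real hardware): one bit of history per line is enough to decide, at eviction time, whether a writeback can be skipped.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;algorithm&gt;
#include &lt;array&gt;
#include &lt;cstdint&gt;

// Toy model: keep one "filled as all-zero" bit per line; on eviction, a
// line whose bit is set and whose contents are still all zero needs no
// writeback, because the copy in the outer level is unchanged.
struct LineModel {
    std::array&lt;std::uint64_t, 8&gt; data{};  // one 64-byte cache line
    bool zero_on_fill = true;             // the single stored bit of history
    bool all_zero() const {
        return std::all_of(data.begin(), data.end(),
                           [](std::uint64_t w) { return w == 0; });
    }
    bool can_evict_silently() const { return zero_on_fill &amp;&amp; all_zero(); }
};
</code></pre></div></div>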
<h4 id="predicting-a-new-predictor">Predicting a New Predictor</h4>
<p>What is weirdest of all, however, is that the optimization doesn’t kick in 100% of the time, but only for 40% to 60% of the lines, depending on various parameters<sup id="fnref:params" role="doc-noteref"><a href="#fn:params" class="footnote" rel="footnote">22</a></sup>. What would lead to that effect? One could imagine some type of predictor which determines whether to apply this optimization, depending on, e.g., whether the optimization has recently been effective – that is, whether redundant stores have been common recently. Perhaps this predictor also considers factors such as the occupancy of outbound queues<sup id="fnref:obbus" role="doc-noteref"><a href="#fn:obbus" class="footnote" rel="footnote">23</a></sup>: when the bus is near capacity, searching for and eliminating redundant writes might be worth the power or latency penalty, compared to the case where there is little apparent pressure on the bus.</p>
<p>In this benchmark, any predictor would find that the optimization is 100% effective: <em>every</em> write is redundant! So we might guess that the second condition (queue occupancy) results in a behavior where only some stores are eliminated: as more stores are eliminated, the load on the bus becomes lower, so at some point the predictor no longer thinks eliminating stores is worth it, and you reach a kind of stable state where only a fraction of stores are eliminated, based on the predictor threshold.</p>
<h4 id="predictor-test">Predictor Test</h4>
<p>We can kind of test that theory: in this model, any store is <em>capable</em> of being eliminated, but the ratio of eliminated stores is bounded above by the predictor behavior. So if we find that a benchmark of <em>pure</em> redundant zero stores is eliminated at a 60% rate, we might expect that any benchmark with at least 60% redundant stores can reach the 60% rate, and with lower rates, you’d see full elimination of all redundant stores (since now the bus always stays active enough to trigger the predictor).</p>
<p class="info">Apparently analogies are helpful, so an analogy here would be a person controlling the length of a line by redirecting some incoming people. For example, in an airport security line the handler tries to keep the line at a certain maximum length by redirecting (redirecting -> store elimination) people to the priority line if they are eligible and the main line is at or above its limit. Eligible people are those without carry-on luggage (eligible people -> zero-over-zero stores).<br />
<br />
If everyone is eligible (-> 100% zero stores), this control will always be successful and the fraction of people redirected will depend on the relative rates of ingress and egress through security. If security only has a throughput of 40% of the ingress rate, 60% of people will be redirected in the steady state. Now, consider what happens if not everyone is eligible: if the eligible fraction is at least 60%, nothing changes. You still redirect 60% of people. Only if the eligible rate drops below 60% is there a problem: now you’ll be redirecting 100% of eligible people, but the primary line will grow beyond your limit.<br />
<br />
Whew! Not sure if that was helpful after all?</p>
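<p>Maybe a toy simulation of the analogy helps (my own construction, nothing measured from hardware): people arrive once per tick, security drains the line at 40% of the arrival rate, and an arrival is redirected only when the line is at its limit.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstdio&gt;

// Toy model of the queue-occupancy theory: with everyone eligible, the
// steady-state redirect fraction approaches 1 - egress_rate.
double redirected_fraction(int ticks, double egress_rate) {
    const int limit = 10;   // maximum main-line length
    int line = 0, redirected = 0;
    double served = 0.0;    // fractional egress accumulator
    for (int t = 0; t &lt; ticks; ++t) {
        served += egress_rate;  // security processes people
        if (served &gt;= 1.0 &amp;&amp; line &gt; 0) { served -= 1.0; --line; }
        if (line &gt;= limit) ++redirected;  // line full: redirect this arrival
        else ++line;                      // otherwise join the line
    }
    return (double)redirected / ticks;
}

int main() {
    // egress at 40% of ingress: expect roughly 0.60 redirected
    std::printf("%.2f\n", redirected_fraction(100000, 0.4));
}
</code></pre></div></div>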
<p>Let’s try a benchmark which adds a new implementation, <code class="language-plaintext highlighter-rouge">alt01</code>, which alternates between writing a cache line of zeros and a cache line of ones. All the writes are redundant, but only 50% are zeros, so under the theory that a predictor is involved we expect that maybe 50% of the stores will be eliminated (i.e., 100% of the zero stores are eliminated, and they make up 50% of the total).</p>
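<p>A hypothetical sketch of what such an <code class="language-plaintext highlighter-rouge">alt01</code> fill might look like (the real implementation in the benchmark may differ): even-numbered 64-byte lines get zeros and odd-numbered lines get ones, so on a repeated fill every store is redundant, but only half of them store zero.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;

// Write alternating 64-byte lines of zeros and ones into an int buffer.
void alt01(int* buf, std::size_t size) {
    const std::size_t ints_per_line = 64 / sizeof(int);  // 16 ints per line
    for (std::size_t i = 0; i &lt; size; ++i) {
        buf[i] = static_cast&lt;int&gt;((i / ints_per_line) % 2);
    }
}
</code></pre></div></div>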
<p>Here we focus on the L3, similar to Fig. 2 above, showing silent evictions (the non-silent ones make up the rest, adding up to 1 total as before):</p>
<center><strong>Figure 4</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig4" id="fig4">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig4.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/l2-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig4.html">
<img class="figimg" src="/assets/intel-zero-opt/fig4.svg" alt="Figure 4" width="648" height="432" />
</a>
</div>
<p>We don’t see 50% elimination. Rather, we see less than half the elimination rate of the all-zeros case: 27% versus 63%. Performance is better in the L3 region than in the all-ones case, but only slightly! So this doesn’t support the theory of a predictor capable of eliminating any store and operating primarily on outbound queue occupancy.</p>
<p>Similarly, we can examine the region where the buffer fits only in RAM, similar to Fig. 3 above:</p>
<center><strong>Figure 5</strong></center>
<div class="svg-fig">
<div class="svg-fig-links">
<a href="#fig5" id="fig5">[link<span class="only-large"> to this chart</span>]</a>
<a href="/misc/tables/intel-zero-opt/fig5.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/l3-focus.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig5.html">
<img class="figimg" src="/assets/intel-zero-opt/fig5.svg" alt="Figure 5" width="648" height="432" />
</a>
</div>
<p>Recall that the lines show the number of writes reaching the memory subsystem. Here we see that <code class="language-plaintext highlighter-rouge">alt01</code> again splits the difference between the zeros and ones cases: about 75% of the writes reach memory, versus 48% in the all-zeros case, so the elimination is again roughly half as effective. In this case, the performance also splits the difference between all zeros and all ones: it falls almost exactly half-way between the two other cases.</p>
<p>So I don’t know what’s going on exactly. It seems like maybe only some fraction of lines are eligible for elimination, due to some unknown internal mechanism in the <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr>.</p>
<h3 id="hardware-survey">Hardware Survey</h3>
<p>Finally, here are the performance results (same as <strong>Figure 1</strong>) on a variety of other Intel and AMD x86 architectures, as well as IBM’s POWER9 and Amazon’s Graviton 2 ARM processor, one per tab.</p>
<!-- uresults: snb/remote.csv,hsw/remote.csv,skl/remote.csv,skx/remote.csv,cnl/remote.csv,zen2/remote.csv,power9/remote.csv,gra2/remote.csv -->
<div class="tabs" id="tabs-fig6">
<!-- Courtesy of https://codepen.io/Merri/pen/bytea -->
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-1" name="tab-group-fig6" checked="" />
<label class="tab-label" for="tab-fig6-1">Sandy Bridge</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-snb.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/snb/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-snb.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-snb.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-2" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-2">Haswell</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-hsw.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/hsw/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-hsw.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-hsw.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-3" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-3">Skylake-S</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-skl.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/skl/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-skl.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-skl.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-4" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-4">Skylake-X</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-skx.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/skx/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-skx.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-skx.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-5" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-5">Cannon Lake</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-cnl.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/cnl/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-cnl.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-cnl.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-6" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-6">Zen2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-zen2.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/zen2/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-zen2.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-zen2.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-7" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-7">POWER9</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-power9.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/power9/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-power9.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-power9.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
<div class="tab">
<input class="tab-radio" type="radio" id="tab-fig6-8" name="tab-group-fig6" />
<label class="tab-label" for="tab-fig6-8">Graviton 2</label>
<div class="tab-panel">
<div class="tab-content">
<div class="svg-fig">
<div class="svg-fig-links">
<a href="/misc/tables/intel-zero-opt/fig6-gra2.html">[data<span class="only-large"> table</span>]</a>
<a href="https://github.com/travisdowns/zero-fill-bench/tree/master/results/gra2/remote.csv">[raw<span class="only-large"> data</span>]</a>
</div>
<a href="/misc/tables/intel-zero-opt/fig6-gra2.html">
<img class="figimg" src="/assets/intel-zero-opt/fig6-gra2.svg" alt="Figure" width="648" height="432" />
</a>
</div>
</div>
</div>
</div>
</div>
<p>Some observations on these results:</p>
<ul>
<li>The redundant write optimization isn’t evident in the performance profile for <em>any</em> of the other non-<abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> hardware tested. Not even closely related Intel hardware like Haswell or Skylake-X. I also did a few spot tests with performance counters, and didn’t see any evidence of a reduction in writes. So for now this might be a Skylake client only thing (of course, Skylake client is perhaps the most widely deployed Intel <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr> ever due to the many identical-<abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr>-except-in-name variants: Kaby Lake, Coffee Lake, etc, etc). Note that the Skylake-S result here is for a different (desktop i7-6700) chip than the rest of this post, so we can at least confirm this occurs on two different chips.</li>
<li>Except in the RAM region, Sandy Bridge throughput is half of its successors: a consequence of having only a 16-byte load/store path in the core, despite supporting 32-byte AVX instructions.</li>
<li>AMD Zen2 has <em>excellent</em> write performance in the L2 and L3 regions. All of the Intel chips drop to about half throughput for writes in the L2: slightly above 16 bytes per cycle (around 50 GB/s for most of these chips). Zen2 maintains its L1 throughput and in fact has its highest results in L2: over 100 GB/s. Zen2 also manages more than 70 GB/s in the L3, much better than the Intel chips, in this test.</li>
<li>Both Cannon Lake and Skylake-X exhibit a fair amount of inter-sample variance in the L2-resident region. My theory here would be prefetcher interference that behaves differently than on earlier chips, but I am not sure.</li>
<li>Skylake-X, with a different L3 design than the other chips, has quite poor L3 fill throughput, about half of contemporary Intel chips, and less than a third of Zen2.</li>
<li>The POWER9 performance is neither terrible nor great. The most interesting part is probably the high L3 fill throughput: L3 throughput is as high or higher than L1 or L2 throughput, but still not in Zen2 territory.</li>
<li>Amazon’s new Graviton 2 processor is very interesting. It seems to be limited to one 16-byte store per cycle<sup id="fnref:armcompile" role="doc-noteref"><a href="#fn:armcompile" class="footnote" rel="footnote">24</a></sup>, giving it a peak possible store throughput of 40 GB/s, so it doesn’t do well in the L1 region versus competitors that can hit 100 GB/s or more (they have both higher frequency and 32 byte stores), but it sustains the 40 GB/s all the way to RAM sizes, with a RAM result flat enough to serve drinks on, and this on a shared 64-CPU host where I paid for only a single core<sup id="fnref:g2ga" role="doc-noteref"><a href="#fn:g2ga" class="footnote" rel="footnote">25</a></sup>! The RAM performance is the highest out of all hardware tested.</li>
</ul>
<p>You might notice that Ice Lake, Intel’s newest microarchitecture, is missing from this list: that’s because there is a <a href="/blog/2020/05/18/icelake-zero-opt.html">whole separate post</a> on it.</p>
<h3 id="further-notes">Further Notes</h3>
<p>Here’s a grab bag of notes and observations that don’t get their own full section, don’t have any associated plots, etc. That doesn’t mean they are less important! Don’t say that! These ones matter too, they really do.</p>
<ul>
<li>Despite any impressions I may have given above: you don’t need to <em>fully</em> overwrite a cache line with zeros for this to kick in, and you can even write <em>non-zero</em> values if you overwrite them “soon enough” with zeros. Rather, the line must initially be fully zero, and then all the <em>escaping<sup id="fnref:escaping" role="doc-noteref"><a href="#fn:escaping" class="footnote" rel="footnote">26</a></sup></em> writes must be zero. Another way of thinking about this is that the thing that matters is the value of the cache line written back, as well as the old value of the line that the writeback is replacing: these must both be fully zero, but that doesn’t mean that you need to overwrite the line with zeros: any locations not written are still zero “from before”. I directly test this in the <code class="language-plaintext highlighter-rouge">one_per0</code> and <code class="language-plaintext highlighter-rouge">one_per1</code> tests in the benchmark. These write only a single <code class="language-plaintext highlighter-rouge">int</code> value in each cache line, leaving the other values unchanged. In that benchmark the optimization triggers in exactly the same way when writing a single zero.</li>
<li>Although we didn’t find evidence of this happening on other x86 hardware, nor on POWER or ARM, that doesn’t mean it isn’t or can’t happen: the conditions for it may simply not have been triggered. For the POWER9 and ARM chips in particular, we didn’t check any performance counters, so the elimination could be occurring without making any difference in performance. That’s especially feasible in the ARM case, where the performance is totally limited by the core-level 16-bytes-per-cycle write throughput: even if all writes are eliminated later on in the path to memory, we expect the performance to be the same.</li>
<li>We could learn more about this effect by setting up a test which writes ones first to a region of some fixed size, then overwrites it with zeros, and repeats this in a rolling fashion over a larger buffer. This test basically lets the 1s escape to a certain level of the cache hierarchy, and seeing where the optimization kicks in will tell us something interesting.</li>
</ul>
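<p>A minimal sketch of the rolling test proposed above (the window and buffer sizes are illustrative assumptions, not values from the post): by varying the window size relative to the L1/L2 capacities, you control which cache level the ones escape to before being overwritten with zeros, which should reveal at which level the elimination happens.</p>

```cpp
#include <cstddef>
#include <cstring>

// Sketch (not the benchmark's actual code): write ones to a window,
// then overwrite the same window with zeros, sliding the window over a
// larger buffer. If the window exceeds a cache level's capacity, the
// ones "escape" to the next level, so the subsequent zero fill replaces
// a line whose old value is non-zero at that level.
void rolling_ones_then_zeros(unsigned char* buf, size_t bufsize, size_t window) {
    for (size_t off = 0; off + window <= bufsize; off += window) {
        std::memset(buf + off, 1, window);  // ones escape up to some cache level
        std::memset(buf + off, 0, window);  // then zeros overwrite them
    }
}
```

Measuring L2/L3 writeback counters while sweeping the window size would then show where (if anywhere) the optimization stops applying.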
<p><span id="summary-perma"></span></p>
<h2 id="wrapping-up">Wrapping Up</h2>
<h3 id="findings">Findings</h3>
<p>Here’s a brief summary of what we found. This will be a bit redundant if you’ve just read the whole thing, but we need to accommodate everyone who just skipped down to this part, right?</p>
<ul>
<li>Intel chips can apparently eliminate some redundant stores when zeros are written to a cache line that is already all zero (the line doesn’t need to be fully overwritten).</li>
<li>This optimization applies at least as early as L2 writeback to L3, so it would apply to the extent that working sets don’t fit in L2.</li>
<li>The effect eliminates both write accesses to L3, and writes to memory depending on the working set size.</li>
<li>For the pure store benchmark discussed here, the effect of this optimization is a reduction in the number of writes of ~63% (to L3) and ~50% (to memory), with a runtime reduction of between 15% and 20%.</li>
<li>It is unclear why not all redundant zero-over-zero stores are eliminated.</li>
</ul>
<h3 id="tuning-advice">Tuning “Advice”</h3>
<p>So is any of this actually useful? Can we use this finding to quadruple the speed of the things that really matter in computation: tasks like bitcoin mining, high-frequency trading and targeting ads in real time?</p>
<p>Nothing like that, no – but it might provide a small boost for some cases.</p>
<p>Many of those cases are probably getting the benefit without any special effort. After all, zero is already a special value: it’s how memory arrives from the operating system, and how freshly allocated memory arrives at the language level in some languages. So a lot of cases that could get this benefit probably already are.</p>
<p>Redundant zero-over-zero probably isn’t as rare as you might think either: consider that in low level languages, memory is often cleared after receiving it from the allocator, but in many cases this memory came directly from the OS, so it is already zero<sup id="fnref:calloc" role="doc-noteref"><a href="#fn:calloc" class="footnote" rel="footnote">27</a></sup>. Consider also cases like fairly-sparse matrix multiplication, where your matrix isn’t sparse enough to actually use dedicated sparse routines, but still has a lot of zeros. In that case, you are going to be writing 0 all the time in your final result and scratch buffers. This optimization will reduce the writeback traffic in that case.</p>
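<p>The allocator-level version of this idea can be sketched as follows (a minimal illustration, with arbitrary sizes; the function names are mine, not from any library):</p>

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Sketch: calloc can skip the explicit clear when the block comes
// straight from the OS, because the OS guarantees fresh pages are
// already zero. The malloc + memset version must write every byte,
// since the caller has no way to know the memory is already zero.
int* alloc_zeroed_calloc(size_t n) {
    return static_cast<int*>(calloc(n, sizeof(int)));
}

int* alloc_zeroed_memset(size_t n) {
    int* p = static_cast<int*>(malloc(n * sizeof(int)));
    if (p) {
        memset(p, 0, n * sizeof(int));  // redundant when the memory is OS-fresh
    }
    return p;
}
```

Both return fully zeroed memory, but only the second is guaranteed to actually perform the stores, which is why the memset path is the one that can redundantly write zero over zero.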
<p>If you are making a ton of redundant writes, the first thing you might want to do is look for a way to stop doing that. Beyond that, we can list some ways you <em>might</em> be able to take advantage of this new behavior:</p>
<ul>
<li>In the case you are likely to have redundant writes, prefer zero as the special value that is likely to be redundantly overwritten. For example if you are doing some blind writes, something like <a href="https://richardstartin.github.io/posts/garbage-collector-code-artifacts-card-marking">card marking</a> where you don’t know if your write is redundant, you might consider writing zeros, rather than writing non-zeros, since in the case that some region of card marks gets repeatedly written, it will be all-zero and the optimization can apply. Of course, this cuts the wrong way when you go to clear the marked region: now you have to write non-zero so you don’t get the optimization during clearing (but maybe this happens out of line with the user code that matters). What ends up better depends on the actual write pattern.</li>
<li>In case you might have redundant zero-over-zero writes, pay a bit more attention to 64-byte alignment than you normally would because this optimization only kicks in when a full cache line is zero. So if you have some 64-byte structures that might often be all zero (but with non-zero neighbors), a forced 64-byte alignment will be useful since it would activate the optimization more frequently.</li>
<li>Probably the most practical advice of all: just keep this effect in mind because it can mess up your benchmarks and make you distrust performance counters. I found this when I noticed that the scalar version of a benchmark was writing 2x as much memory as the AVX version, despite them doing the same thing other than the choice of registers. As it happens, the dummy value in the vector register I was storing was zero, while in the scalar case it wasn’t: so there was a large difference that had nothing to do with scalar vs vector, but non-zero vs zero instead. Prefer non-zero values in store microbenchmarks, unless you really expect them to be zero in real life!</li>
<li>Keep an eye out for a more general version of this optimization: maybe one day we’ll see this effect apply to redundant writes that aren’t zero-over-zero.</li>
<li>Floating point has two zero values: +0 and -0. The representation of +0 is all-bits-zero, so using +0 gives you the chance of getting this optimization. Of course, everyone is already using +0 whenever they explicitly want zero.</li>
</ul>
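<p>The 64-byte alignment point above can be made concrete with a small sketch (the structure layout is illustrative; only <code class="language-plaintext highlighter-rouge">alignas(64)</code> is the actual technique):</p>

```cpp
#include <cstdint>

// Sketch: if a 64-byte structure is often all-zero, forcing 64-byte
// alignment means each all-zero instance occupies exactly one cache
// line, so the line-granular zero-over-zero elimination can apply even
// when neighboring objects hold non-zero data. Without the alignment,
// an object could straddle two lines, each "contaminated" by neighbors.
struct alignas(64) Counters {
    uint64_t vals[8];  // 64 bytes total, frequently all zero
};

static_assert(sizeof(Counters) == 64, "fully covers one cache line");
static_assert(alignof(Counters) == 64, "starts on a cache line boundary");
```

Arrays of <code class="language-plaintext highlighter-rouge">Counters</code> then map one object per line, which is the case where zeroing an already-zero object can benefit.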
<p>Of course, the fact that this seems to currently only apply on Skylake and <a href="/blog/2020/05/18/icelake-zero-opt.html">Ice Lake</a> client hardware makes specifically targeting this quite dubious indeed.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to Daniel Lemire who provided access to the hardware used in the <a href="#hardware-survey">Hardware Survey</a> part of this post.</p>
<p>Thanks Alex Blewitt and Zach Wegner who pointed out the CSS tab technique (I used the one linked in the <a href="https://twitter.com/zwegner/status/1223701307078402048">comments of this post</a>) and others who replied to <a href="https://twitter.com/trav_downs/status/1223690150175236102">this tweet</a> about image carousels.</p>
<p>Thanks to Tarlinian, 不良大脑的所有者, Bruce Dawson, Zach Wegner and Andrey Penechko who pointed out typos or omissions in the text.</p>
<h3 id="discussion-and-feedback">Discussion and Feedback</h3>
<p>Leave a comment below, or discuss on <a href="https://twitter.com/trav_downs/status/1260620313483771905">Twitter</a>, <a href="https://news.ycombinator.com/item?id=23169605">Hacker News</a>, <a href="https://www.reddit.com/r/asm/comments/gj3xq7/hardware_store_elimination/">reddit</a> or <a href="https://www.realworldtech.com/forum/?threadid=191798&curpostid=191798">RWT</a>.</p>
<p>Feedback is also warmly welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:ubstore" role="doc-endnote">
<p>Specifically, I was running <code class="language-plaintext highlighter-rouge">uarch-bench.sh --test-name=memory/bandwidth/store/*</code> from <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr>-bench. <a href="#fnref:ubstore" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:story" role="doc-endnote">
<p>Like many posts on this blog, what follows is essentially a <em>reconstruction</em>. I encountered the effect originally in a benchmark, as described, and then worked backwards from there to understand the underlying effect. Then, I wrote this post the other way around: building up a new benchmark to display the effect … but at that point I already knew what we’d find. So please don’t think I just started writing the benchmark you find on GitHub and then ran into this issue coincidentally: the arrow of causality points the other way. <a href="#fnref:story" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:assume" role="doc-endnote">
<p>Probably? I don’t like to assume too much about the reader, but this seems like a fair bet. <a href="#fnref:assume" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:icldiv" role="doc-endnote">
<p>Starting with Ice Lake, it seems like Intel has implemented a constant-time integer divide unit. <a href="#fnref:icldiv" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:memperf" role="doc-endnote">
<p>Latency-wise, something like 4-5 cycles for an L1 hit, versus 200-500 cycles for a typical miss to DRAM. Throughput-wise there is also a very large gap (256 GB/s L1 throughput <em>per core</em> on a 512-bit wide machine versus usually less than 100 GB/s <em>per socket</em> on recent Intel). <a href="#fnref:memperf" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:atall" role="doc-endnote">
<p>Is it deployed anywhere at all on x86? Ping me if you know. <a href="#fnref:atall" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:otherc" role="doc-endnote">
<p>It’s hard to say which is faster if they are compiled as written: x86 has indexed addressing modes that make the indexing more or less free, at least for arrays of element size 1, 2, 4 or 8, so the usual arguments against indexed access mostly don’t apply. Probably, it doesn’t matter: this detail might have made a big difference 20 years ago, but it is unlikely to make a difference on a decent compiler today, which can transform one into the other, depending on the target hardware. <a href="#fnref:otherc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bpurp" role="doc-endnote">
<p>For benchmarking purposes, we wrap this in <a href="https://github.com/travisdowns/zero-fill-bench/blob/post1/algos.cpp#L28">another function</a> so we can slap a <code class="language-plaintext highlighter-rouge">noinline</code> attribute on this function to ensure that we have a single non-inlined version to call for different values. If we just called <code class="language-plaintext highlighter-rouge">std::fill</code> with a literal <code class="language-plaintext highlighter-rouge">int</code> value, it is highly likely to get inlined at the call site and we’d have code with different alignment (and possibly other differences) for each value. <a href="#fnref:bpurp" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:clangv" role="doc-endnote">
<p>Admittedly I didn’t go line-by-line through the long vectorized version produced by clang, but the line count is identical and if you squint so the assembly is just a big green and yellow blur, they look the same… <a href="#fnref:clangv" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:warm" role="doc-endnote">
<p>There are 27 samples total at each size: the first 10 are discarded as warmup and the remaining 17 are plotted. <a href="#fnref:warm" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:errorbars" role="doc-endnote">
<p>The main problem with error bars is that most performance profiling results, and especially microbenchmarks, are mightily non-normal in their distribution, so displaying an error bar based on a statistic like the variance is often highly misleading. <a href="#fnref:errorbars" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:l2" role="doc-endnote">
<p>The ~ is there in ~256 KiB because unless you use huge pages, you might start to see L2 misses even before 256 KiB: a 256 KiB buffer that is only <em>virtually contiguous</em> is not necessarily well behaved in terms of evictions; it depends on how those 4k pages are mapped to physical pages. As soon as you get too many 4k pages mapping to the same group of sets, you’ll see evictions even before 256 KiB. <a href="#fnref:l2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:badvec" role="doc-endnote">
<p>It is worth noting<sup id="fnref:nested" role="doc-noteref"><a href="#fn:nested" class="footnote" rel="footnote">28</a></sup> that this performance variation with buffer size isn’t exactly inescapable. Rather, it is just a consequence of poor remainder handling in the compiler’s auto-vectorizer. An approach that would be much faster and generate much less code to handle the remaining elements would be to do a final full-width vector store but aligned to the end of the buffer. So instead of doing up to 7 additional scalar stores, you do one additional vector store (and suffer up to one fewer branch misprediction for random lengths, since the scalar outro involves conditional jumps). <a href="#fnref:badvec" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:melty" role="doc-endnote">
<p>Those melty bits where the pattern gets all weird, in the middle and near the right side are not random artifacts: they are consistently reproducible. I suspect a collision in the branch predictor history. <a href="#fnref:melty" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:silent" role="doc-endnote">
<p>This behavior is interesting and a bit puzzling. There are several reasons why you might want to do a non-silent eviction. (1a) would be to keep the L3 snoop filter up to date: if the L3 knows a core no longer has a copy of the line, later requests for that line can avoid snooping the core and are some 30 cycles faster. (1b) Similarly, if the L3 wants to evict this line, this is faster if it knows it can do it without writing back, versus snooping the owning core for a possibly modified line. (2) Keeping the L3 <abbr title="Least recently used - an eviction strategy suitable for data with temporal locality">LRU</abbr> more up to date: the L3 <abbr title="Least recently used - an eviction strategy suitable for data with temporal locality">LRU</abbr> wants to know which lines are hot, but most of the accesses are filtered through the L1 and L2, so the L3 doesn’t get much information – a non-silent eviction can provide some of the missing info (3) If the L3 serves as a victim cache, the L2 needs to write back the line for it to be stored in L3 at all. <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> L3 actually works this way, but despite being a very similar <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr>, <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> apparently doesn’t. However, one can imagine that on a miss to DRAM it may be advantageous to send the line directly to the L2, updating the L3 tags (snoop filter) only, without writing the data into L3. The data only gets written when the line is subsequently evicted from the owning L2. When lines are frequently modified, this cuts the number of writes to L3 in half. This behavior warrants further investigation. 
<a href="#fnref:silent" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:silent:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:trust" role="doc-endnote">
<p>You’ve already seen in Fig. 1 that there is little inter-sample variation, and this keeps the noise down. You can always check the raw data if you want the detailed view. <a href="#fnref:trust" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:wb" role="doc-endnote">
<p>This shows that the L2 is a write-back cache, not write-through: modified lines can remain in L2 until they are evicted, rather than immediately being written to the outer levels of the memory hierarchy. This type of design is key for high store throughput, since otherwise the long-term store throughput is limited to the bandwidth of the slowest write-through cache level. <a href="#fnref:wb" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:orl3" role="doc-endnote">
<p>I say the L2 because the behavior is already reflected in the L2 performance counters, but it could be teamwork between the L2 and other components, e.g., the L3 could say “OK, I’ve got that line you <abbr title="Request for ownership: when a request for a cache line originates from a store, or a type of prefetch that predicts the location is likely to be the target of a store, an RFO is performed which gets the line in an exclusive MESI state.">RFO</abbr>’d and BTW it is all zeros”. <a href="#fnref:orl3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rfo" role="doc-endnote">
<p>Although only stores appear in the source, at the hardware level this benchmark does at least as many reads as stores: every store must do a <em>read for ownership</em> (<abbr title="Request for ownership: when a request for a cache line originates from a store, or a type of prefetch that predicts the location is likely to be the target of a store, an RFO is performed which gets the line in an exclusive MESI state.">RFO</abbr>) to get the current value of the line before storing to it. <a href="#fnref:rfo" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sneaky" role="doc-endnote">
<p>Eagle-eyed readers, all two of them, might notice that the performance in the L3 region is different than the previous figure: here the performance slopes up gradually across most of the L3 range, while in the previous test it was very flat. Absolute performance is also somewhat lower. This is a testing artifact: reading the uncore performance counters necessarily involves a kernel call, taking over 1,000 cycles versus the < 100 cycles required for <code class="language-plaintext highlighter-rouge">rdpmc</code> to measure the CPU performance counters needed for the prior figure. Due to “flaws” (laziness) in the benchmark, this overhead is captured in the shown performance, and larger regions take longer, meaning that this fixed measurement overhead has a smaller relative impact, so you get this <code class="language-plaintext highlighter-rouge">measured = actual - overhead/size</code> type effect. It can be fixed, but I have to reboot my host into single-user mode to capture clean numbers, and I am feeling too lazy to do that right now, although as I look back at the size of the footnote I needed to explain it I am questioning my judgement. <a href="#fnref:sneaky" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:imcevent" role="doc-endnote">
<p>On <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> client CPUs we can do this with the <code class="language-plaintext highlighter-rouge">uncore_imc/data_writes/</code> events, which polls internal counters in the memory controller itself. This is a socket-wide event, so it is important to do this measurement on as quiet a machine as possible. <a href="#fnref:imcevent" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:params" role="doc-endnote">
<p>I tried a bunch of other stuff that I didn’t write up in detail. Many of them affect the behavior: we still see the optimization but with different levels of effectiveness. For example, with L2 prefetching off, only about 40% of the L2 evictions are eliminated (versus > 60% with prefetch on), and the performance difference between is close to zero despite the large number of eliminations. I tried other sizes of writes, and with narrow writes the effect is reduced until it is eliminated at 4-byte writes. I don’t think the write size <em>directly</em> affects the optimization, but rather narrower writes slow down the maximum possible performance which interacts in some way with the hardware mechanisms that support this to reduce how often it occurs (a similar observation could apply to prefetching). <a href="#fnref:params" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:obbus" role="doc-endnote">
<p>By <em>outbound queues</em> I mean the path between an inner and outer cache level. So for the L2, the outbound bus is the so-called <em>superqueue</em> that connects the L2 to the uncore and L3 cache. <a href="#fnref:obbus" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:armcompile" role="doc-endnote">
<p>The Graviton 2 uses the Cortex A76 <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr>, which can <em>execute</em> 2 stores per cycle, but the L1 cache write ports limits sustained execution to only one 128-bit store per cycle. <a href="#fnref:armcompile" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:g2ga" role="doc-endnote">
<p>It was the first full day of general availability for Graviton, so perhaps these hosts are very lightly used at the moment because it certainly felt like I had the whole thing to myself. <a href="#fnref:g2ga" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:escaping" role="doc-endnote">
<p>By <em>escaping</em> I mean that a store that visibly gets to the cache level where this optimization happens. For example, if I write a 1 immediately followed by a 0, the 1 will never make it out of the L1 cache, so from the point of view of the L2 and beyond only a zero was written. I expect the optimization to still trigger in this case. <a href="#fnref:escaping" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:calloc" role="doc-endnote">
<p>This phenomenon is why <code class="language-plaintext highlighter-rouge">calloc</code> is sometimes considerably faster than <code class="language-plaintext highlighter-rouge">malloc + memset</code>. With <code class="language-plaintext highlighter-rouge">calloc</code> the zeroing happens within the allocator, and the allocator can track whether the memory it is about to return is <em>known zero</em> (usually because the block is fresh from the OS, which always zeros memory before handing it out to userspace), and in the case of <code class="language-plaintext highlighter-rouge">calloc</code> it can avoid the zeroing entirely (so <code class="language-plaintext highlighter-rouge">calloc</code> runs as fast as <code class="language-plaintext highlighter-rouge">malloc</code> in that case). The client code calling <code class="language-plaintext highlighter-rouge">malloc</code> doesn’t receive this information and can’t make the same optimization. If you stretch the analogy almost to the breaking point, one can see what Intel is doing here as “similar, but in hardware”. <a href="#fnref:calloc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:nested" role="doc-endnote">
<p>Ha! To me, everything is “worth noting” if it means another footnote. <a href="#fnref:nested" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comProbing a previously undocumented zero-related optimization on Intel CPUs.Adding Staticman Comments2020-02-05T00:00:00+00:002020-02-05T00:00:00+00:00https://travisdowns.github.io/blog/2020/02/05/now-with-comments<p>I’ve added comments to my blog. You can find the existing comments, if any, and the new comment form <a href="#comment-section">at the bottom</a> of any post.</p>
<p>I thought this would take a couple hours, but it actually took <strong>[REDACTED]</strong>. Estimates are hard.</p>
<p>Here’s what I did.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
<li><a href="#set-up-github-bot-account" id="markdown-toc-set-up-github-bot-account">Set Up GitHub Bot Account</a> <ul>
<li><a href="#generate-personal-access-token" id="markdown-toc-generate-personal-access-token">Generate Personal Access Token</a></li>
</ul>
</li>
<li><a href="#set-up-the-blog-repository-configuration" id="markdown-toc-set-up-the-blog-repository-configuration">Set Up the Blog Repository Configuration</a> <ul>
<li><a href="#configuring-staticmanyml" id="markdown-toc-configuring-staticmanyml">Configuring staticman.yml</a></li>
<li><a href="#configuring-_configyml" id="markdown-toc-configuring-_configyml">Configuring _config.yml</a></li>
</ul>
</li>
<li><a href="#set-up-the-api-bridge" id="markdown-toc-set-up-the-api-bridge">Set Up the API Bridge</a> <ul>
<li><a href="#generate-an-rsa-keypair" id="markdown-toc-generate-an-rsa-keypair">Generate an RSA Keypair</a></li>
<li><a href="#sign-up-for-heroku" id="markdown-toc-sign-up-for-heroku">Sign Up for Heroku</a></li>
<li><a href="#deploy-staticman-bridge-to-heroku" id="markdown-toc-deploy-staticman-bridge-to-heroku">Deploy Staticman Bridge to Heroku</a></li>
<li><a href="#configure-bridge-secrets" id="markdown-toc-configure-bridge-secrets">Configure Bridge Secrets</a></li>
</ul>
</li>
<li><a href="#invite-and-accept-bot-to-blog-repo" id="markdown-toc-invite-and-accept-bot-to-blog-repo">Invite and Accept Bot to Blog Repo</a></li>
<li><a href="#enable-recaptcha" id="markdown-toc-enable-recaptcha">Enable reCAPTCHA</a> <ul>
<li><a href="#sign-up-for-recaptcha" id="markdown-toc-sign-up-for-recaptcha">Sign Up for reCAPTCHA</a></li>
<li><a href="#configure-recaptcha" id="markdown-toc-configure-recaptcha">Configure reCAPTCHA</a></li>
</ul>
</li>
<li><a href="#integrate-comments-into-site" id="markdown-toc-integrate-comments-into-site">Integrate Comments Into Site</a> <ul>
<li><a href="#markdown-part" id="markdown-toc-markdown-part">Markdown Part</a></li>
</ul>
</li>
<li><a href="#testing-on-this-post" id="markdown-toc-testing-on-this-post">Testing on This Post</a></li>
<li><a href="#thanks" id="markdown-toc-thanks">Thanks</a></li>
<li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>
<h2 id="introduction">Introduction</h2>
<p>I am using <a href="https://staticman.net/">staticman</a>, created by <a href="https://github.com/eduardoboucas">Eduardo Bouças</a>, as my comments system for this static site.</p>
<p>The basic flow for comment submission is as follows:</p>
<ol>
<li>A reader submits the comment form on a blog post.</li>
<li>Javascript<sup id="fnref:backup" role="doc-noteref"><a href="#fn:backup" class="footnote" rel="footnote">1</a></sup> attached to the form submits it to my <em>staticman API bridge<sup id="fnref:bridge" role="doc-noteref"><a href="#fn:bridge" class="footnote" rel="footnote">2</a></sup></em> running on Heroku.</li>
<li>The API bridge does some validation of the request and submits a <a href="https://github.com/travisdowns/travisdowns.github.io/issues">pull request</a> to the GitHub repo hosting my blog, consisting of a .yml file with the comment content and metadata.</li>
<li>When I accept the pull request, it triggers a regeneration and republishing of the content (this is a GitHub pages feature), so the reply appears almost immediately<sup id="fnref:cache" role="doc-noteref"><a href="#fn:cache" class="footnote" rel="footnote">3</a></sup>.</li>
</ol>
<p>Here are the detailed steps to get this working. There are several other tutorials out there, with varying degrees of exhaustiveness, some of which I found only after writing most of this, but I’m going to add to the pile anyways. There have been several changes to deploying staticman, which means that existing resources (and this one, of course) are marked by which “era” they were written in.</p>
<p>The major changes are:</p>
<ul>
<li>At one point the idea was that everyone would use the public staticman API bridge, but this proved unsustainable. A large amount of the work in setting up staticman is associated with running your own instance of the bridge.</li>
<li>There are three versions of the staticman API: v1, v2 and v3. This guide uses v2 (although v3 is almost identical<sup id="fnref:v3" role="doc-noteref"><a href="#fn:v3" class="footnote" rel="footnote">4</a></sup>), but the v1 version is considerably different.</li>
</ul>
<h2 id="set-up-github-bot-account">Set Up GitHub Bot Account</h2>
<p>You’ll want to create a GitHub <em>bot account</em> which will be the account that the API bridge uses to actually submit the pull requests to your blog repository. In principle, you can skip this step entirely and simply use your existing GitHub account, but I wouldn’t recommend it:</p>
<ul>
<li>You’ll be generating a <em>personal access token</em> for this account, and uploading it to the cloud (Heroku) and if this somehow gets compromised, it’s better that it’s a throwaway bot account than your real account.</li>
<li>Having a dedicated account makes it easy to segregate work done by the bot, versus what you’ve done yourself. That is, you probably don’t want all the commits and pushes the bot does to show up on your personal account.</li>
</ul>
<p>The <em>bot account</em> is nothing special: it is just a regular personal account that you’ll only be using from the API bridge. So, open a private browser window, go to <a href="https://github.com">GitHub</a> and choose “Sign Up”. Call your bot something specific, which I’ll refer to as <em>GITHUB-BOT-NAME</em> from here forwards.</p>
<h3 id="generate-personal-access-token">Generate Personal Access Token</h3>
<p>Next, you’ll need to generate a GitHub <em>personal access token</em>, for your bot account. The <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token">GitHub doc</a> does a better job of explaining this than I can. If you just want everything to work for sure now and in the future, select every single scope when it prompts you, but if you care about security you should only need the <em>repo</em> and <em>user</em> scopes (today):</p>
<p><strong>Repo scope:</strong></p>
<p><img src="/assets/now-with-comments/scopes-repo.png" alt="Repo scope" /></p>
<p><strong>User scope:</strong></p>
<p><img src="/assets/now-with-comments/scopes-user.png" alt="User scope" /></p>
<p>Copy and paste the displayed token somewhere safe: you’ll need this token in a later step where I’ll refer to it as <em>${github_token}</em>. Once you close this page there is no way to recover the access token.</p>
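<p>Before uploading the token anywhere, you can optionally sanity-check it against the GitHub API — the response should be a JSON blob whose <code class="language-plaintext highlighter-rouge">login</code> field is your bot account’s name:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># verify the token authenticates as GITHUB-BOT-NAME
curl -H "Authorization: token ${github_token}" https://api.github.com/user
</code></pre></div></div>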
<h2 id="set-up-the-blog-repository-configuration">Set Up the Blog Repository Configuration</h2>
<p>You’ll need to include configuration for staticman in two separate places in your blog repository: <code class="language-plaintext highlighter-rouge">_config.yml</code> (the primary Jekyll config file) and <code class="language-plaintext highlighter-rouge">staticman.yml</code>, both at the top level of the repository.</p>
<p>In general, the stuff that goes in <code class="language-plaintext highlighter-rouge">_config.yml</code> is for use within the static generation phase of your site, e.g., controlling the generation of the comment form and the associated JavaScript. The stuff in <code class="language-plaintext highlighter-rouge">staticman.yml</code> isn’t used during generation, but is used dynamically by the API bridge (read directly from GitHub on each request) to configure the activities of the bridge. A few things are duplicated in both places.</p>
<h3 id="configuring-staticmanyml">Configuring staticman.yml</h3>
<p>Most of the configuration for the API bridge is set in <code class="language-plaintext highlighter-rouge">staticman.yml</code> which lives in the top level of your <em>blog repository</em>. This means that one API bridge can support many different blog repositories, each with their own configuration (indeed, this feature was critical for the original design of a shared API bridge).</p>
<p><a href="https://github.com/eduardoboucas/staticman/blob/master/staticman.sample.yml">Here’s a sample file</a> from the staticman GitHub repository, but you might want to use <a href="https://github.com/travisdowns/travisdowns.github.io/blob/master/staticman.yml">this one</a> from my repository as it is a bit more fleshed out.</p>
<p>The main things you want to change are shown below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1"># all of these fields are nested under the comments key, which corresponds to the final element</span>
<span class="c1"># of the API bridge endpoint, i.e., you can have different configurations even within the same staticman.yml</span>
<span class="c1"># file all under different keys</span>
<span class="na">comments</span><span class="pi">:</span>
<span class="c1"># There are many more required config values here, not shown:</span>
<span class="c1"># use the file linked above as a template</span>
<span class="c1"># I guess used only for email notifications?</span>
<span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Performance</span><span class="nv"> </span><span class="s">Matters</span><span class="nv"> </span><span class="s">Blog"</span>
<span class="c1"># You may want a different set of "required fields". Staticman will</span>
<span class="c1"># reject posts without all of these fields</span>
<span class="na">requiredFields</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">name"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">email"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">message"</span><span class="pi">]</span>
<span class="c1"># you are going to want reCaptcha set up, but for now leave it disabled because we need the API</span>
<span class="c1"># bridge up and running in order to encrypt the secrets that go in this section</span>
<span class="na">reCaptcha</span><span class="pi">:</span>
<span class="na">enabled</span><span class="pi">:</span> <span class="no">false</span>
<span class="c1"># siteKey: 6LcWstQUAAAAALoGBcmKsgCFbMQqkiGiEt361nK1</span>
<span class="c1"># secret: a big encrypted secret (see Note above)</span>
</code></pre></div></div>
<h3 id="configuring-_configyml">Configuring _config.yml</h3>
<p>The remainder of the configuration goes in <code class="language-plaintext highlighter-rouge">_config.yml</code>. Here’s the configuration I added to start with (we’ll add a bit more later):</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The URL for the staticman API bridge endpoint</span>
<span class="c1"># You will want to modify some of the values:</span>
<span class="c1"># ${github-username}: the username of the account with which you publish your blog</span>
<span class="c1"># ${blog-repo}: the name of your blog repository in github</span>
<span class="c1"># master: this is the branch out of which your blog is published, often master or gh-pages</span>
<span class="c1"># ${bridge_app_name}: the name you chose in Heroku for your bridge API</span>
<span class="c1"># comments: the so-called property, this defines the key in staticman.yml where the configuration is found</span>
<span class="c1">#</span>
<span class="c1"># for me, this line reads:</span>
<span class="c1"># https://staticman-travisdownsio.herokuapp.com/v2/entry/travisdowns/travisdowns.github.io/master/comments</span>
<span class="na">staticman_url</span><span class="pi">:</span> <span class="s">https://${bridge_app_name}.herokuapp.com/v2/entry/${github-username}/${blog-repo}/master/comments</span>
</code></pre></div></div>
<h2 id="set-up-the-api-bridge">Set Up the API Bridge</h2>
<p>This section covers deploying a private instance of the API bridge to Heroku.</p>
<h3 id="generate-an-rsa-keypair">Generate an RSA Keypair</h3>
<p>This keypair will be used to encrypt secrets that will be stored in public places, such as your reCAPTCHA secret key. The secrets will be encrypted with the public half of the keypair, and decrypted by the API bridge server with the private half.</p>
<p>Use the following on your local machine to generate the pair:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh-keygen -m PEM -t rsa -b 4096 -C "staticman key" -f ~/.ssh/staticman_key
</code></pre></div></div>
<p>Don’t use any passphrase<sup id="fnref:pass" role="doc-noteref"><a href="#fn:pass" class="footnote" rel="footnote">5</a></sup>. You can change the <code class="language-plaintext highlighter-rouge">-f</code> argument if you want to save the key somewhere else, in which case you’ll have to use the new location when setting up the Heroku config below.</p>
<p>You can verify the key was generated by running:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>head -2 ~/.ssh/staticman_key
</code></pre></div></div>
<p>Which should output something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-----BEGIN RSA PRIVATE KEY-----
MIIJKAIBAAKCAgEAud7+fPWXzuxCoyyGbQTYCGi9C1N984roI/Tr7yJi074F+Cfp
</code></pre></div></div>
<p>Your second line will vary of course, but the first line must be <code class="language-plaintext highlighter-rouge">-----BEGIN RSA PRIVATE KEY-----</code>. If you see something else, perhaps mentioning <code class="language-plaintext highlighter-rouge">OPENSSH PRIVATE KEY</code>, it won’t work.</p>
<h3 id="sign-up-for-heroku">Sign Up for Heroku</h3>
<p>The original idea of staticman was to have a public API bridge that everyone uses for free. However, in practice this hasn’t proved sustainable: whatever free tier the shared instance is running on tends to hit its limits, and then the fun stops. The current recommendation is therefore to set up your own free instance of the API bridge on Heroku, so let’s do that.</p>
<p><a href="https://signup.heroku.com/">Sign up</a> for a free account on Heroku. No credit card is required and a free account should give you enough juice for at least 1,000 comments a month<sup id="fnref:juice" role="doc-noteref"><a href="#fn:juice" class="footnote" rel="footnote">6</a></sup>.</p>
<h3 id="deploy-staticman-bridge-to-heroku">Deploy Staticman Bridge to Heroku</h3>
<p>The easiest way to do this is simply to click the <em>Deploy to Heroku</em> button in the <a href="https://github.com/eduardoboucas/staticman">README on the staticman repo</a>:</p>
<p><img src="/assets/now-with-comments/deploy.png" alt="Deploy" width="50%" /></p>
<p>You’ll see probably some logging indicating that the project is downloading, building and then successfully deployed.</p>
<h3 id="configure-bridge-secrets">Configure Bridge Secrets</h3>
<p>The bridge needs a couple of secrets to do its job:</p>
<ul>
<li>The <em>GitHub personal access token</em> of your bot account. This lets it do work on behalf of your bot account (in particular, submit pull requests to your blog repository).</li>
<li>The private key of the keypair you generated earlier.</li>
</ul>
<p>If you want, you can add both of these through the Heroku web dashboard: go to Settings -> Reveal Config Vars, and enter them <a href="/assets/now-with-comments/config-vars.png">like this</a>.</p>
<p>However, you might as well get familiar with the Heroku command line: it’s pretty handy, it lets you complete this flow without your GitHub token passing through your clipboard, and it makes it easy to strip the newline characters from the private key.</p>
<p>Follow <a href="https://devcenter.heroku.com/articles/heroku-cli">the instructions</a> to install and login to the Heroku CLI, then issue the following commands from any directory (note that <code class="language-plaintext highlighter-rouge">${github_token}</code> is the <em>personal access token</em> you generated earlier: copy and paste it into the command):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heroku config:add <span class="nt">--app</span> <span class="k">${</span><span class="nv">bridge_app_name</span><span class="k">}</span> <span class="s2">"RSA_PRIVATE_KEY=</span><span class="si">$(</span><span class="nb">cat</span> ~/.ssh/staticman_key | <span class="nb">tr</span> <span class="nt">-d</span> <span class="s1">'\n'</span><span class="si">)</span><span class="s2">"</span>
heroku config:add <span class="nt">--app</span> <span class="k">${</span><span class="nv">bridge_app_name</span><span class="k">}</span> <span class="s2">"GITHUB_TOKEN=</span><span class="k">${</span><span class="nv">github_token</span><span class="k">}</span><span class="s2">"</span>
</code></pre></div></div>
<p>Here, the <code class="language-plaintext highlighter-rouge">tr -d '\n'</code> part of the pipeline removes the newlines from the private key, since Heroku config variables (or the API bridge, or both) can’t handle embedded newlines.</p>
<p>You can check that the config was correctly set by outputting it as follows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>heroku config <span class="nt">--app</span> <span class="k">${</span><span class="nv">bridge_app_name</span><span class="k">}</span>
</code></pre></div></div>
<h2 id="invite-and-accept-bot-to-blog-repo">Invite and Accept Bot to Blog Repo</h2>
<p>Finally, you need to invite your GitHub <em>bot account</em> that you created earlier to your blog repository<sup id="fnref:whycollab" role="doc-noteref"><a href="#fn:whycollab" class="footnote" rel="footnote">7</a></sup> and accept the invite.</p>
<p>Open your blog repository, go to <em>Settings -> Collaborators</em> and search for and add the GitHub bot account that you created earlier as a collaborator:</p>
<p><img src="/assets/now-with-comments/add-collab.png" alt="Adding Collaborators" /></p>
<p>Next, accept<sup id="fnref:invite" role="doc-noteref"><a href="#fn:invite" class="footnote" rel="footnote">8</a></sup> the invitation using the bridge API, by going to the following URL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://${bridge_app_name}.herokuapp.com/v2/connect/${github-username}/${blog-repo}
</code></pre></div></div>
<p>You should see <code class="language-plaintext highlighter-rouge">OK!</code> as the output if it worked: this appears only <em>once</em>, when the invitation is accepted; at all other times it will show <code class="language-plaintext highlighter-rouge">Invitation not found</code>.</p>
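<p>If you prefer the command line, you can make the same invitation-accepting request with <code class="language-plaintext highlighter-rouge">curl</code>, substituting your own app and repository names for the placeholders:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># prints OK! on first success, Invitation not found on later calls
curl "https://${bridge_app_name}.herokuapp.com/v2/connect/${github-username}/${blog-repo}"
</code></pre></div></div>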
<h2 id="enable-recaptcha">Enable reCAPTCHA</h2>
<p>You are going to want to gate comment submission using reCAPTCHA or a similar system so you don’t get destroyed by spam (even if you have moderation enabled, dealing with all the pull requests will probably be tiring).</p>
<p>Here we’ll cover setting up reCAPTCHA, which has built-in support in staticman. Although it involves modifying the same <code class="language-plaintext highlighter-rouge">_config.yml</code> and <code class="language-plaintext highlighter-rouge">staticman.yml</code> files that we’ve modified before, this part of the configuration needs to occur after the bridge is running because we use the <code class="language-plaintext highlighter-rouge">/encrypt</code> endpoint on the bridge as part of the setup.</p>
<h3 id="sign-up-for-recaptcha">Sign Up for reCAPTCHA</h3>
<p>Go to <a href="https://developers.google.com/recaptcha">reCAPTCHA</a> and sign up if you haven’t already, and create a new site. We are going to use the “v2, Checkbox” variant (<a href="https://developers.google.com/recaptcha/docs/display">docs here</a>), although I’m interested to hear how it works out with other variants.</p>
<p>You will need the reCAPTCHA <em>site key</em> and <em>secret key</em> for configuration in the next section.</p>
<h3 id="configure-recaptcha">Configure reCAPTCHA</h3>
<p>Next, we need to add the <em>site key</em> and <em>secret key</em> to the <code class="language-plaintext highlighter-rouge">_config.yml</code> and <code class="language-plaintext highlighter-rouge">staticman.yml</code> config files.</p>
<p>The <em>site key</em> will be used as-is, but the <em>secret key</em> property will be <a href="https://staticman.net/docs/encryption"><em>encrypted</em></a> so that it is not exposed in plaintext in your configuration files. To encrypt the secret key, copy the secret from the reCAPTCHA admin console, and load the following URL from your API bridge, replacing <code class="language-plaintext highlighter-rouge">YOUR_SECRET_KEY</code> with the copied secret key.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://${bridge_app_name}.herokuapp.com/v2/encrypt/YOUR_SECRET_KEY
</code></pre></div></div>
<p>You should get a blob of characters back as a result (considerably longer than the original secret) – it is <em>this</em> value that you need to include as <code class="language-plaintext highlighter-rouge">reCaptcha.secret</code> in both <code class="language-plaintext highlighter-rouge">staticman.yml</code> and in <code class="language-plaintext highlighter-rouge">_config.yml</code>.</p>
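<p>For example, you can do the encryption from the command line and capture the result in one step (the app name is a placeholder; the secret is the <em>secret key</em> from your reCAPTCHA admin console):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># the response body is the encrypted blob to paste into both config files
encrypted=$(curl -s "https://${bridge_app_name}.herokuapp.com/v2/encrypt/YOUR_SECRET_KEY")
echo "$encrypted"
</code></pre></div></div>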
<p>The reCAPTCHA configuration for both files is almost the same. It looks like this for <code class="language-plaintext highlighter-rouge">staticman.yml</code>:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">comments</span><span class="pi">:</span>
<span class="c1"># more stuff</span>
<span class="c1"># note that reCaptcha is nested under comments</span>
<span class="na">reCaptcha</span><span class="pi">:</span>
<span class="na">enabled</span><span class="pi">:</span> <span class="no">true</span>
<span class="c1"># the siteKey is used as-is (no encryption)</span>
<span class="na">siteKey</span><span class="pi">:</span> <span class="s">6LcWstQUAAAAALoGBcmKsgCFbMQqkiGiEt361nK1</span>
<span class="c1"># the secret is the encrypted blob you got back from the encrypt call</span>
<span class="na">secret</span><span class="pi">:</span> <span class="s">a big encrypted secret (see description above)</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">_config.yml</code> version is similar, except that the key appears at the top level and there is no enabled property:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># reCaptcha configuration info: the exact same site key and *encrypted* secret that you used in staticman.yml</span>
<span class="c1"># I personally don't think the secret needs to be included in the generated site, but the staticman API bridge uses</span>
<span class="c1"># it to ensure the site configuration and bridge configuration match (but why not just compare the site key?)</span>
<span class="na">reCaptcha</span><span class="pi">:</span>
<span class="na">siteKey</span><span class="pi">:</span> <span class="s">6LcWstQUAAAAALoGBcmKsgCFbMQqkiGiEt361nK1</span>
<span class="na">secret</span><span class="pi">:</span> <span class="s">exactly the same secret as the staticman.yml file</span>
</code></pre></div></div>
<h2 id="integrate-comments-into-site">Integrate Comments Into Site</h2>
<p>Finally, you need to integrate code to display the existing comments and submit new comments.</p>
<p>I used a mash up of commenting code from the <a href="https://spinningnumbers.org">spinningnumbers.org</a> blog as well as the staticman integration in the <a href="https://mmistakes.github.io/minimal-mistakes/docs/configuration/#static-based-comments-via-staticman">minimal mistakes theme</a>. The advantage the former has over the latter is that the comments allow one level of nesting (replies to top-level comments are nested beneath it).</p>
<p>I planned to extract the associated markdown, liquid and JavaScript code to a separate repository as a single point where people could collaborate on this part of the integration, but man, I’ve already spent way too long on this. I may still do it, but for now here’s how I did the integration.</p>
<h3 id="markdown-part">Markdown Part</h3>
<p>The key thing you need to do is include a blob of HTML and associated JavaScript in any page where you want to display and accept comments. I do this as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% if page.comments == true %}
{% include comments.html %}
{% endif %}
</code></pre></div></div>
<p>You can paste it into any post, or better add it to the <code class="language-plaintext highlighter-rouge">footer.html</code> include or something like that (details depend on your theme). The invariant is that wherever this appears, the existing comments appear, followed by a form to submit new comments. You can see the <a href="https://github.com/travisdowns/travisdowns.github.io/blob/master/_includes/comments.html"><code class="language-plaintext highlighter-rouge">comments.html</code> include here</a> – in turn, it includes <code class="language-plaintext highlighter-rouge">comment.html</code> (once per comment, generates the comment html) and <code class="language-plaintext highlighter-rouge">comment_form.html</code> which generates the new comment form.</p>
<p>This ultimately includes <a href="https://github.com/travisdowns/travisdowns.github.io/blob/4a57e4f6d8cb4ef0ac8801a31740d1ca32dfa8ae/_includes/comments.html#L28">external JavaScript</a> for JQuery and reCAPTCHA, as well as <a href="https://github.com/travisdowns/travisdowns.github.io/blob/master/assets/main.js">main.js</a> which includes the JavaScript to implement the replies (moving the form when the “reply to” button is clicked, and submitting the form via AJAX to the API bridge).</p>
<p>You can try to use this same integration in your Jekyll blog. You’d need to:</p>
<ul>
<li>Copy the <code class="language-plaintext highlighter-rouge">_includes/comment.html</code>, <code class="language-plaintext highlighter-rouge">_includes/comments.html</code>, <code class="language-plaintext highlighter-rouge">_includes/comment_form.html</code>, <code class="language-plaintext highlighter-rouge">assets/main.js</code>, and <code class="language-plaintext highlighter-rouge">_sass/comment-styles.css</code> from my <a href="https://github.com/travisdowns/travisdowns.github.io">blog files</a> to your blog repository.</li>
<li>In <code class="language-plaintext highlighter-rouge">assets/main.js</code>, replace the link to <code class="language-plaintext highlighter-rouge">https://github.com/travisdowns/travisdowns.github.io/pulls</code> with a link to your own repository (or otherwise customize the “success” message as you see fit).</li>
<li>Include <code class="language-plaintext highlighter-rouge">@import "comment-styles";</code> in your <code class="language-plaintext highlighter-rouge">assets/main.scss</code> file. If you don’t have one, you’ll need to create it following the rules for your theme. Usually this just means a <code class="language-plaintext highlighter-rouge">main.scss</code> with empty front-matter and an <code class="language-plaintext highlighter-rouge">@import "your-theme";</code> line to import the theme SCSS. Alternately, you could avoid putting anything in <code class="language-plaintext highlighter-rouge">main.scss</code> and just include the comment styles as a separate file, but this adds another request to each post.</li>
<li>Do the <code class="language-plaintext highlighter-rouge">include comments.html</code> thing shown above in an appropriate place in your template/theme.</li>
<li>Set <code class="language-plaintext highlighter-rouge">comments: true</code> in the front matter of posts you want to have comments (or set it as a default in <code class="language-plaintext highlighter-rouge">_config.yml</code>).</li>
</ul>
<h2 id="testing-on-this-post">Testing on This Post</h2>
<p>If you want to leave a non-trivial comment on the content of this post, you can do so below. However, if you’d instead like to make a “just testing” post, to see how the HTTP request works, or check the created pull request, etc., please do it <a href="/misc/comments-test.html">over here</a>. Testing-only comments made on this post will be closed without being accepted.</p>
<h2 id="thanks">Thanks</h2>
<p>Thanks to Eduardo Boucas for creating staticman.</p>
<p>Thanks to <a href="https://spinningnumbers.org/">Willy McAllister</a> for the nested comment display work I unabashedly cribbed, for helping me sort out an RSA key generation problem, and for pointing out some inconsistencies in the doc.</p>
<h2 id="references">References</h2>
<p>Things that were handy references while getting this working.</p>
<p>This comment on <a href="https://github.com/eduardoboucas/staticman/issues/318#issuecomment-552755165">GitHub issue #318</a> was the list that I more or less followed (I didn’t use the dev branch, though).</p>
<p>Willy McAllister describes setting up staticman <a href="https://spinningnumbers.org/a/staticman.html">in this post</a> – his implementation of nested comments forms the basis of the one I used.</p>
<p>Another <a href="https://gist.github.com/jannispaul/3787603317fc9bbb96e99c51fe169731">list of steps</a> to get staticman working and some troubleshooting.</p>
<p>Michael Rose, the author of the Minimal Mistakes Jekyll theme, <a href="https://mademistakes.com/articles/improving-jekyll-static-comments/">describes setting up nested staticman comments</a> – I cribbed some things from there, such as the submitting spinner.</p>
<p>Willy McAllister subsequently wrote a <a href="https://spinningnumbers.org/a/staticman-heroku.html">great guide</a> to setting up Staticman, similar in nature to this one but with some additional sections such as <em>troubleshooting</em> and setting up reply notifications via MailGun. If that one was around when I started out, I wouldn’t have felt the need to write this one.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:backup" role="doc-endnote">
<p>If javascript is disabled, a regular POST action takes over. <a href="#fnref:backup" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bridge" role="doc-endnote">
<p>I don’t think you’ll find this <em>bridge</em> term in the official documentation, but I’m going to use it here. <a href="#fnref:bridge" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:cache" role="doc-endnote">
<p>Well, subject to whatever edge caching GitHub pages is using – btw you can bust the cache by appending any random query parameter to the page: <code class="language-plaintext highlighter-rouge">...post.html?foo=1234</code>. <a href="#fnref:cache" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:v3" role="doc-endnote">
<p>v3 mostly just extends to the URL format for the <code class="language-plaintext highlighter-rouge">/event</code> endpoint to include the hosting provider (either GitHub or GitLab), allowing the use of GitHub in addition to GitLab. Almost everything in this guide would remain unchanged. <a href="#fnref:v3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:pass" role="doc-endnote">
<p>You could use a passphrase, but then you’ll have to change the <code class="language-plaintext highlighter-rouge">cat</code> used below to echo the key into the Heroku config. If you want to be super safe, best is to generate the key to a transient location like ramfs and then simply delete the private portion after you’ve uploaded it to the Heroku config. <a href="#fnref:pass" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:juice" role="doc-endnote">
<p>In particular, the <em>unverified</em> (no credit card) free tier gives you 550 hours of uptime a month, and since the <em>dyno</em> (heroku speak for their on-demand host) sleeps after 30 minutes, I figure you can handle 550/0.5 = 1100 sparsely submitted comments. Of course, if comments come in bursts, you could handle much more than that, since you’ve already “paid” for the 30 minute uptime. <a href="#fnref:juice" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whycollab" role="doc-endnote">
<p>The bot needs to be a collaborator to, at a minimum, commit comments to the repository, and to delete branches (using the delete branches webhook which cleans up comment related branches). However, it is possible to not use either of these features if you have moderation enabled (in which case comments arrive as a PR, which doesn’t require any particular permissions), and aren’t using the webhook. So maybe you could do without the collaborator status in that case? I haven’t tested it. <a href="#fnref:whycollab" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:invite" role="doc-endnote">
<p>I guess you can also just accept the invitation by opening the email sent to you by GitHub and following the link there. This workflow involving the <code class="language-plaintext highlighter-rouge">v2/connect</code> endpoint probably made more sense when the API was meant to be shared among many users via a common GitHub bot account. <a href="#fnref:invite" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comAdding static comments to a static blog using staticman. Static.The Hunt for the Fastest Zero2020-01-20T00:00:00+00:002020-01-20T00:00:00+00:00https://travisdowns.github.io/blog/2020/01/20/zero<p>Let’s say I ask you to fill a <code class="language-plaintext highlighter-rouge">char</code> array of size <code class="language-plaintext highlighter-rouge">n</code> with zeros. I don’t know why, exactly, but please play along for now.</p>
<p>If this were C, we would probably reach for <code class="language-plaintext highlighter-rouge">memset</code>, but let’s pretend we are trying to write idiomatic C++ instead.</p>
<p>You might come up with something like<sup id="fnref:function" role="doc-noteref"><a href="#fn:function" class="footnote" rel="footnote">1</a></sup>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill1</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I’d give this solution full marks. In fact, I’d call it more or less the canonical modern C++ solution to the problem.</p>
<p>What if I told you there was a solution that was up to about 29 times faster? It doesn’t even require sacrificing any goats to the C++ gods, either: just adding three characters:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill2</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="sc">'\0'</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Yes, switching <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">'\0'</code> speeds this up by nearly a factor of <em>thirty</em> on my <abbr title="Intel's Skylake (client) architecture, aka 6th Generation Intel Core i3,i5,i7">SKL</abbr> box<sup id="fnref:qb" role="doc-noteref"><a href="#fn:qb" class="footnote" rel="footnote">2</a></sup>, at least with my default compiler (gcc) and optimization level (-O2):</p>
<div class="table-wrapper table-nowrap-header">
<table>
<thead>
<tr>
<th>Function</th>
<th style="text-align: right">Bytes / Cycle</th>
</tr>
</thead>
<tbody>
<tr>
<td>fill1</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td>fill2</td>
<td style="text-align: right">29.1</td>
</tr>
</tbody>
</table>
</div>
<p>The why becomes obvious if you look at the <a href="https://godbolt.org/z/O_f3Jx">assembly</a>:</p>
<p><strong>fill1:</strong></p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">fill1</span><span class="p">(</span><span class="nb">ch</span><span class="nv">ar</span><span class="o">*</span><span class="p">,</span> <span class="nv">unsigned</span> <span class="nv">long</span><span class="p">):</span>
<span class="nf">add</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">rdi</span>
<span class="nf">cmp</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">rdi</span>
<span class="nf">je</span> <span class="nv">.L1</span>
<span class="nl">.L3:</span>
<span class="nf">mov</span> <span class="kt">BYTE</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="mi">0</span> <span class="c1">; store 0 into memory at [rdi]</span>
<span class="nf">add</span> <span class="nb">rdi</span><span class="p">,</span> <span class="mi">1</span> <span class="c1">; increment rdi</span>
<span class="nf">cmp</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">rdi</span> <span class="c1">; compare rdi to size</span>
<span class="nf">jne</span> <span class="nv">.L3</span> <span class="c1">; keep going if rdi < size</span>
<span class="nl">.L1:</span>
<span class="nf">ret</span>
</code></pre></div></div>
<p>This version is using a byte-by-byte copy loop, which I’ve annotated – it is more or less a 1:1 translation of how you’d imagine <code class="language-plaintext highlighter-rouge">std::fill</code> is written. The result of 1 cycle per byte is exactly what we’d expect using <a href="/blog/2019/06/11/speed-limits.html">speed limit analysis</a>: it is simultaneously limited by two different bottlenecks: 1 taken branch per cycle, and 1 store per cycle.</p>
<p>The <code class="language-plaintext highlighter-rouge">fill2</code> version doesn’t have a loop at all:</p>
<p><strong>fill2:</strong></p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">fill2</span><span class="p">(</span><span class="nb">ch</span><span class="nv">ar</span><span class="o">*</span><span class="p">,</span> <span class="nv">unsigned</span> <span class="nv">long</span><span class="p">):</span>
<span class="nf">test</span> <span class="nb">rsi</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">jne</span> <span class="nv">.L8</span> <span class="c1">; skip the memcpy call if size == 0</span>
<span class="nf">ret</span>
<span class="nl">.L8:</span>
<span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span> <span class="nb">rsi</span>
<span class="nf">xor</span> <span class="nb">esi</span><span class="p">,</span> <span class="nb">esi</span>
<span class="nf">jmp</span> <span class="nv">memset</span> <span class="c1">; tailcall to memset</span>
</code></pre></div></div>
<p>Rather, it simply defers immediately to <code class="language-plaintext highlighter-rouge">memset</code>. We aren’t going to dig into the assembly for <code class="language-plaintext highlighter-rouge">memset</code> here, but the fastest possible <code class="language-plaintext highlighter-rouge">memset</code> would run at 32 bytes/cycle, limited by 1 store/cycle and the maximum vector width of 32 bytes on my machine, so the measured value of 29 bytes/cycle indicates it’s using an implementation something along those lines.</p>
<p>So that’s the <em>why</em>, but what’s the <em>why of the why</em> (second order why)?</p>
<p>I thought this had something to do with the optimizer. After all, at <code class="language-plaintext highlighter-rouge">-O3</code> even the <code class="language-plaintext highlighter-rouge">fill1</code> version using the plain <code class="language-plaintext highlighter-rouge">0</code> constant calls <code class="language-plaintext highlighter-rouge">memset</code>.</p>
<p>I was wrong, however. The answer actually lies in the implementation of the C++ standard library (there are various, gcc is using libstdc++ in this case). Let’s take a look at the implementation of <code class="language-plaintext highlighter-rouge">std::fill</code> (I’ve reformatted the code for clarity and removed some compile-time concept checks):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/*
* ...
*
* This function fills a range with copies of the same value. For char
* types filling contiguous areas of memory, this becomes an inline call
* to @c memset or @c wmemset.
*/</span>
<span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_ForwardIterator</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">_Tp</span><span class="p">></span>
<span class="kr">inline</span> <span class="kt">void</span> <span class="nf">fill</span><span class="p">(</span><span class="n">_ForwardIterator</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_ForwardIterator</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__value</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">__fill_a</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">__niter_base</span><span class="p">(</span><span class="n">__first</span><span class="p">),</span> <span class="n">std</span><span class="o">::</span><span class="n">__niter_base</span><span class="p">(</span><span class="n">__last</span><span class="p">),</span> <span class="n">__value</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The included part of the comment<sup id="fnref:wmem" role="doc-noteref"><a href="#fn:wmem" class="footnote" rel="footnote">3</a></sup> already hints at what is to come: the implementor of <code class="language-plaintext highlighter-rouge">std::fill</code> has apparently considered specifically optimizing the call to a <code class="language-plaintext highlighter-rouge">memset</code> in some scenarios. So we keep following the trail, which brings us to the helper method <code class="language-plaintext highlighter-rouge">std::__fill_a</code>. There are two overloads that are relevant here, the general method and an overload which handles the special case:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_ForwardIterator</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">_Tp</span><span class="p">></span>
<span class="kr">inline</span> <span class="k">typename</span>
<span class="n">__gnu_cxx</span><span class="o">::</span><span class="n">__enable_if</span><span class="o"><!</span><span class="n">__is_scalar</span><span class="o"><</span><span class="n">_Tp</span><span class="o">>::</span><span class="n">__value</span><span class="p">,</span> <span class="kt">void</span><span class="o">>::</span><span class="n">__type</span>
<span class="nf">__fill_a</span><span class="p">(</span><span class="n">_ForwardIterator</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_ForwardIterator</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__value</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">__first</span> <span class="o">!=</span> <span class="n">__last</span><span class="p">;</span> <span class="o">++</span><span class="n">__first</span><span class="p">)</span>
<span class="o">*</span><span class="n">__first</span> <span class="o">=</span> <span class="n">__value</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Specialization: for char types we can use memset.</span>
<span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_Tp</span><span class="p">></span>
<span class="kr">inline</span> <span class="k">typename</span>
<span class="n">__gnu_cxx</span><span class="o">::</span><span class="n">__enable_if</span><span class="o"><</span><span class="n">__is_byte</span><span class="o"><</span><span class="n">_Tp</span><span class="o">>::</span><span class="n">__value</span><span class="p">,</span> <span class="kt">void</span><span class="o">>::</span><span class="n">__type</span>
<span class="nf">__fill_a</span><span class="p">(</span><span class="n">_Tp</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_Tp</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__c</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="n">_Tp</span> <span class="n">__tmp</span> <span class="o">=</span> <span class="n">__c</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="k">const</span> <span class="kt">size_t</span> <span class="n">__len</span> <span class="o">=</span> <span class="n">__last</span> <span class="o">-</span> <span class="n">__first</span><span class="p">)</span>
<span class="n">__builtin_memset</span><span class="p">(</span><span class="n">__first</span><span class="p">,</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">__tmp</span><span class="p">),</span> <span class="n">__len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now we see how the <code class="language-plaintext highlighter-rouge">memset</code> appears. It is called explicitly by the second implementation shown above, selected by <code class="language-plaintext highlighter-rouge">enable_if</code> when the SFINAE condition <code class="language-plaintext highlighter-rouge">__is_byte<_Tp></code> is true. Note, however, that unlike the general function, this variant has a single template argument: <code class="language-plaintext highlighter-rouge">template<typename _Tp></code>, and the function signature is:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__fill_a</span><span class="p">(</span><span class="n">_Tp</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_Tp</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__c</span><span class="p">)</span>
</code></pre></div></div>
<p>Hence, it will only be considered when the <code class="language-plaintext highlighter-rouge">__first</code> and <code class="language-plaintext highlighter-rouge">__last</code> pointers which delimit the range have the <em>exact same type as the value being filled</em>. When you write <code class="language-plaintext highlighter-rouge">std::fill(p, p + n, 0)</code> where <code class="language-plaintext highlighter-rouge">p</code> is <code class="language-plaintext highlighter-rouge">char *</code>, you rely on template type deduction for the parameters, which ends up deducing <code class="language-plaintext highlighter-rouge">char *</code> and <code class="language-plaintext highlighter-rouge">int</code> for the iterator type and value-to-fill type, <em>because <code class="language-plaintext highlighter-rouge">0</code> is an integer constant</em>.</p>
<p>That is, it is as if you had written:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="o"><</span><span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
<p>This prevents the clever <code class="language-plaintext highlighter-rouge">memset</code> optimization from taking place: the overload that does it is never called because the iterator value type is different than the value-to-fill type.</p>
<p>This suggests a fix: we can simply force the template argument types rather than rely on type deduction:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">fill3</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">p</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">fill</span><span class="o"><</span><span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">n</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This way, we <a href="https://godbolt.org/z/VTssh9">get the <code class="language-plaintext highlighter-rouge">memset</code> version</a>.</p>
<p>Finally, why does <code class="language-plaintext highlighter-rouge">fill2</code> using <code class="language-plaintext highlighter-rouge">'\0'</code> get the fast version, without forcing the template arguments? Well, <code class="language-plaintext highlighter-rouge">'\0'</code> is a <code class="language-plaintext highlighter-rouge">char</code> constant, so the value-to-assign type is <code class="language-plaintext highlighter-rouge">char</code>. You could achieve the same effect with a cast, e.g., <code class="language-plaintext highlighter-rouge">static_cast&lt;char&gt;(0)</code> – and for buffers which have types like <code class="language-plaintext highlighter-rouge">unsigned char</code> this is necessary because <code class="language-plaintext highlighter-rouge">'\0'</code> does not have the same type as <code class="language-plaintext highlighter-rouge">unsigned char</code> (at least <a href="https://godbolt.org/z/YQKp7V">on gcc</a>).</p>
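For an <code class="language-plaintext highlighter-rouge">unsigned char</code> buffer, the cast approach looks like this sketch (the function name <code class="language-plaintext highlighter-rouge">fill4</code> is mine, chosen to continue the numbering):

```cpp
#include <algorithm>
#include <cstddef>

// Sketch (fill4 is a hypothetical name): casting the fill value to
// the element type keeps the deduced value-to-fill type equal to the
// iterator's element type, so the memset specialization can apply.
void fill4(unsigned char *p, std::size_t n) {
    std::fill(p, p + n, static_cast<unsigned char>(0));
}
```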
<p>One might reasonably ask if this could be fixed in the standard library. I think so.</p>
<p>One idea would be to key off <em>only</em> the type of the <code class="language-plaintext highlighter-rouge">first</code> and <code class="language-plaintext highlighter-rouge">last</code> pointers, like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_Tp</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">_Tvalue</span><span class="p">></span>
<span class="kr">inline</span> <span class="k">typename</span>
<span class="n">__gnu_cxx</span><span class="o">::</span><span class="n">__enable_if</span><span class="o"><</span><span class="n">__is_byte</span><span class="o"><</span><span class="n">_Tp</span><span class="o">>::</span><span class="n">__value</span><span class="p">,</span> <span class="kt">void</span><span class="o">>::</span><span class="n">__type</span>
<span class="nf">__fill_a</span><span class="p">(</span><span class="n">_Tp</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_Tp</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tvalue</span><span class="o">&</span> <span class="n">__c</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="n">_Tvalue</span> <span class="n">__tmp</span> <span class="o">=</span> <span class="n">__c</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="k">const</span> <span class="kt">size_t</span> <span class="n">__len</span> <span class="o">=</span> <span class="n">__last</span> <span class="o">-</span> <span class="n">__first</span><span class="p">)</span>
<span class="n">__builtin_memset</span><span class="p">(</span><span class="n">__first</span><span class="p">,</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">__tmp</span><span class="p">),</span> <span class="n">__len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This says: who cares about the type of the value, it is going to get converted during assignment to the value type of the pointer anyways, so just look at the pointer type. E.g., if the type of the value-to-assign <code class="language-plaintext highlighter-rouge">_Tvalue</code> is <code class="language-plaintext highlighter-rouge">int</code>, but <code class="language-plaintext highlighter-rouge">_Tp</code> is <code class="language-plaintext highlighter-rouge">char</code> then this expands to this version, which is totally equivalent:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">__fill_a</span><span class="p">(</span><span class="kt">char</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="kt">int</span><span class="o">&</span> <span class="n">__c</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">int</span> <span class="n">__tmp</span> <span class="o">=</span> <span class="n">__c</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="k">const</span> <span class="kt">size_t</span> <span class="n">__len</span> <span class="o">=</span> <span class="n">__last</span> <span class="o">-</span> <span class="n">__first</span><span class="p">)</span>
<span class="n">__builtin_memset</span><span class="p">(</span><span class="n">__first</span><span class="p">,</span> <span class="k">static_cast</span><span class="o"><</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">></span><span class="p">(</span><span class="n">__tmp</span><span class="p">),</span> <span class="n">__len</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This works … for simple types like <code class="language-plaintext highlighter-rouge">int</code>. It fails, however, when the value to fill has a tricky, non-primitive type, like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct conv_counting_int {
int v_;
mutable size_t count_ = 0;
operator char() const {
count_++;
return (char)v_;
}
};
size_t fill5(char *p, size_t n) {
conv_counting_int zero{0};
std::fill(p, p + n, zero);
return zero.count_;
}
</code></pre></div></div>
<p>Here, the pointer type passed to <code class="language-plaintext highlighter-rouge">std::fill</code> is <code class="language-plaintext highlighter-rouge">char</code>, but you cannot safely apply the <code class="language-plaintext highlighter-rouge">memset</code> optimization above, since the <code class="language-plaintext highlighter-rouge">conv_counting_int</code> counts the number of times it is converted to <code class="language-plaintext highlighter-rouge">char</code>, and this value will be wrong (in particular, it will be <code class="language-plaintext highlighter-rouge">1</code>, not <code class="language-plaintext highlighter-rouge">n</code>) if you perform the above optimization.</p>
<p>This can be fixed. You could limit the optimization to the case where the pointer type is char-like <em>and</em> the value-to-assign type is “simple” in the sense that it won’t notice how many times it has been converted. A sufficient check would be that the type is scalar, i.e. <code class="language-plaintext highlighter-rouge">std::is_scalar<T></code> – although there is probably a less conservative check possible. So something like this for the SFINAE check:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">_Tpointer</span><span class="p">,</span> <span class="k">typename</span> <span class="nc">_Tp</span><span class="p">></span>
<span class="kr">inline</span> <span class="k">typename</span>
<span class="n">__gnu_cxx</span><span class="o">::</span><span class="n">__enable_if</span><span class="o"><</span><span class="n">__is_byte</span><span class="o"><</span><span class="n">_Tpointer</span><span class="o">>::</span><span class="n">__value</span> <span class="o">&&</span> <span class="n">__is_scalar</span><span class="o"><</span><span class="n">_Tp</span><span class="o">>::</span><span class="n">__value</span><span class="p">,</span> <span class="kt">void</span><span class="o">>::</span><span class="n">__type</span>
<span class="nf">__fill_a</span><span class="p">(</span> <span class="n">_Tpointer</span><span class="o">*</span> <span class="n">__first</span><span class="p">,</span> <span class="n">_Tpointer</span><span class="o">*</span> <span class="n">__last</span><span class="p">,</span> <span class="k">const</span> <span class="n">_Tp</span><span class="o">&</span> <span class="n">__value</span><span class="p">)</span> <span class="p">{</span>
<span class="p">...</span>
</code></pre></div></div>
<p>Here’s <a href="https://godbolt.org/z/PXRWSB">an example</a> of how that would work. It’s not fully fleshed out but shows the idea.</p>
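The same gating idea can also be sketched in user code, without touching libstdc++. This is a hypothetical helper (the name <code class="language-plaintext highlighter-rouge">fill_zeroish</code> is mine, and it requires C++17 for <code class="language-plaintext highlighter-rouge">if constexpr</code>):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical user-land helper mirroring the proposed library check:
// use memset only when the value type is scalar (so extra conversions
// are unobservable), otherwise fall back to the generic std::fill.
template <typename T>
void fill_zeroish(char *first, char *last, const T &value) {
    if constexpr (std::is_scalar<T>::value) {
        if (const std::size_t len = last - first)
            std::memset(first, static_cast<unsigned char>(value), len);
    } else {
        std::fill(first, last, value);
    }
}
```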
<p>Finally, one might ask why <code class="language-plaintext highlighter-rouge">memset</code> <em>is</em> used when gcc is run at <code class="language-plaintext highlighter-rouge">-O3</code> or when clang is used (<a href="https://godbolt.org/z/9nhWAh">like this</a>). The answer is the optimizer. Even if the compile-time semantics of the language select what appears to be a byte-by-byte copy loop, the compiler itself can transform that into <code class="language-plaintext highlighter-rouge">memset</code>, or something else like a vectorized loop, if it can prove it is <code class="language-plaintext highlighter-rouge">as-if</code> equivalent. That recognition happens at <code class="language-plaintext highlighter-rouge">-O3</code> for <code class="language-plaintext highlighter-rouge">gcc</code> but at <code class="language-plaintext highlighter-rouge">-O2</code> for clang.</p>
<h3 id="what-does-it-mean">What Does It Mean</h3>
<p>So what does it all mean? Is there a moral to this story?</p>
<p>Some use this as evidence that somehow C++ and/or the STL are irreparably broken. I don’t agree. Some other languages, even “fast” ones, will <em>never</em> give you the <code class="language-plaintext highlighter-rouge">memset</code> speed; many will, but those that do (e.g., <code class="language-plaintext highlighter-rouge">java.util.Arrays.fill()</code>) often do it via special recognition or handling of the function by the compiler or runtime. In the C++ standard library, the optimization the library writers have done is available to anyone, which is a big advantage. That the optimization fails, perhaps unexpectedly, in some cases is unfortunate, but it’s nice that you can fix it yourself.</p>
<p>Also, C++ gets <em>two</em> shots at this one: many other languages rely on the compiler to optimize these patterns, and this also occurs in C++. It’s just a bit of a quirk of gcc that optimization doesn’t help here: it doesn’t vectorize at -O2, nor does it do <em>idiom recognition</em>. Both of those result in much faster code: we’ve seen the effect of idiom recognition already; it results in a <code class="language-plaintext highlighter-rouge">memset</code>. Even if idiom recognition wasn’t enabled or didn’t work, vectorization would help a lot: here’s <a href="https://godbolt.org/z/53c6W5561">gcc at -O3</a>, but with idiom recognition disabled. It uses 32-byte stores (<code class="language-plaintext highlighter-rouge">vmovdqu YMMWORD PTR [rax], ymm0</code>), which will be close to <code class="language-plaintext highlighter-rouge">memset</code> speed (but a bit of unrolling would have helped). In many other languages it would only be up to the compiler: there wouldn’t be a chance to get <code class="language-plaintext highlighter-rouge">memset</code> even with no optimization as there is in C++.</p>
<p>Do we throw out modern C++ idioms, at least where performance matters, for example by replacing <code class="language-plaintext highlighter-rouge">std::fill</code> with <code class="language-plaintext highlighter-rouge">memset</code>? I don’t think so. It is far from clear where <code class="language-plaintext highlighter-rouge">memset</code> can even be used safely in C++. Unlike say <code class="language-plaintext highlighter-rouge">memcpy</code> and <em>trivially copyable</em>, there is no type trait for “memset is equivalent to zeroing”. It’s probably OK for byte-like types, and is widely used for other primitive types (which we can be sure are trivial, but can’t always be sure of the representation), but even that may not be safe. Once you introduced even simple structures or classes, the footguns multiply. I recommend <code class="language-plaintext highlighter-rouge">std::fill</code> and more generally sticking to modern idioms, except in very rare cases where profiling has identified a hotspot, and even then you should take the safest approach that still provides the performance you need (e.g., by passing <code class="language-plaintext highlighter-rouge">(char)0</code> in this case).</p>
<h3 id="source">Source</h3>
<p>The source for my benchmark is <a href="https://github.com/travisdowns/fill-bench">available on GitHub</a>.</p>
<h3 id="thanks">Thanks</h3>
<p>Thanks to <a href="https://twitter.com/mattgodbolt">Matt Godbolt</a> for creating <a href="https://godbolt.org/">Compiler Explorer</a>, without which this type of investigation would be much more painful – to the point where it often wouldn’t happen at all.</p>
<p>Thanks also to Matt Godbolt, tc, Nathan Kurz and Pocak for finding typos.</p>
<h3 id="discuss">Discuss</h3>
<p>I am <em>still</em> working on my comments system (no, I don’t want Disqus), but in the meantime you can discuss this post on <a href="https://news.ycombinator.com/item?id=22104576">Hacker News</a>, <a href="https://www.reddit.com/r/cpp/comments/erialk/the_hunt_for_the_fastest_zero/">Reddit</a> or <a href="https://lobste.rs/s/bylri4/hunt_for_fastest_zero">lobste.rs</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:function" role="doc-endnote">
<p>Of course, you wouldn’t wrap the <code class="language-plaintext highlighter-rouge">std::fill</code> function in another <code class="language-plaintext highlighter-rouge">fill</code> function that just forwards directly to the standard function: you’d just call <code class="language-plaintext highlighter-rouge">std::fill</code> directly. We use a function here so you can see the parameter types and we can examine the disassembly easily. <a href="#fnref:function" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:qb" role="doc-endnote">
<p>On <a href="http://quick-bench.com/yGy2Mzlr2ZZhWVxoH7HscmbEC94">quickbench</a>, the difference varies slightly from run to run but is usually around 31 to 32 times. <a href="#fnref:qb" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:wmem" role="doc-endnote">
<p>Interestingly, the comment mentions <code class="language-plaintext highlighter-rouge">wmemset</code> in addition to <code class="language-plaintext highlighter-rouge">memset</code> which would presumably be applied for values of type <code class="language-plaintext highlighter-rouge">wchar_t</code> (32-bits on this platform), but I don’t find any evidence that is actually the case via experiment or by examining the code – the optimization appears to only be currently implemented for byte-like values and <code class="language-plaintext highlighter-rouge">memset</code>. <a href="#fnref:wmem" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Travis Downstravis.downs@gmail.comUnexpected performance deviations depending on how you spell zero.Gathering Intel on Intel AVX-512 Transitions2020-01-17T00:00:00+00:002020-01-17T00:00:00+00:00https://travisdowns.github.io/blog/2020/01/17/avxfreq1<h2 id="introduction">Introduction</h2>
<p>This is a post about AVX and AVX-512 related frequency scaling<sup id="fnref:first" role="doc-noteref"><a href="#fn:first" class="footnote" rel="footnote">1</a></sup>.</p>
<p>Now, something more than nothing has been written about this already, including <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/">cautionary tales</a> of performance loss and some <a href="https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/">broad guidelines</a><sup id="fnref:dmore" role="doc-noteref"><a href="#fn:dmore" class="footnote" rel="footnote">2</a></sup>, so do we really need to add to the pile?</p>
<p>Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, <em>quantitative</em> advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.</p>
<p>Now I wasn’t really planning on writing about this just now, but I got off on a (nested) tangent<sup id="fnref:intro" role="doc-noteref"><a href="#fn:intro" class="footnote" rel="footnote">3</a></sup>, so let’s examine the AVX-512 downclocking behavior using targeted tests. At a minimum, this is necessary background for the next post, but I hope it is also interesting on its own.</p>
<p class="info"><strong>Note:</strong> For the really short version, you can <a href="#summary">skip to the summary</a>, but then what will do you for the rest of the day?</p>
<h3 id="table-of-contents">Table of Contents</h3>
<p>You could perhaps try skipping ahead to a section that interests you using this obligatory table of contents, but sections are not self-contained, so you’ll be better off reading the whole thing linearly.</p>
<ul id="markdown-toc">
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a> <ul>
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
</ul>
</li>
<li><a href="#the-source" id="markdown-toc-the-source">The Source</a></li>
<li><a href="#test-structure" id="markdown-toc-test-structure">Test Structure</a> <ul>
<li><a href="#hardware" id="markdown-toc-hardware">Hardware</a></li>
</ul>
</li>
<li><a href="#tests" id="markdown-toc-tests">Tests</a> <ul>
<li><a href="#256-bit-integer-simd-avx" id="markdown-toc-256-bit-integer-simd-avx">256-bit Integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> (AVX)</a></li>
<li><a href="#512-bit-integer-simd-avx-512" id="markdown-toc-512-bit-integer-simd-avx-512">512-bit Integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> (AVX-512)</a> <ul>
<li><a href="#enter-ipc" id="markdown-toc-enter-ipc">Enter <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr></a></li>
<li><a href="#voltage-only-transitions" id="markdown-toc-voltage-only-transitions">Voltage Only Transitions</a></li>
<li><a href="#attenuation" id="markdown-toc-attenuation">Attenuation</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#what-was-left-out" id="markdown-toc-what-was-left-out">What Was Left Out</a></li>
<li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
<li><a href="#thanks" id="markdown-toc-thanks">Thanks</a></li>
<li><a href="#discuss" id="markdown-toc-discuss">Discuss</a></li>
</ul>
<h2 id="the-source">The Source</h2>
<p>All of the code underlying this post is available in the <a href="https://github.com/travisdowns/freq-bench/tree/post1">post1 branch of freq-bench</a>, so you can follow along at home, check my work, and check out the behavior on your own hardware. It requires Linux and the <a href="https://github.com/travisdowns/freq-bench/blob/post1/README.md">README</a> gives basic clues on getting started.</p>
<p>The source includes the <a href="https://github.com/travisdowns/freq-bench/blob/post1/scripts/data.sh">data generation scripts</a> as well as those to <a href="https://github.com/travisdowns/freq-bench/blob/post1/scripts/plots.sh">generate the plots</a>. Neither shell scripting nor Python are my forte, so be gentle.</p>
<h2 id="test-structure">Test Structure</h2>
<p>We want to investigate what happens when instruction stream related performance transitions occur. The most famous example is what happens when you execute an AVX-512 instruction<sup id="fnref:widthmatters" role="doc-noteref"><a href="#fn:widthmatters" class="footnote" rel="footnote">4</a></sup> for the first time in a while, but as we will see there are other cases.</p>
<p>The basic idea is that the test has a <em>duty period</em> and every time this period elapses, we run a test-specific payload for the duration of the <em>payload period</em> which consists of one or more “interesting” instructions (which depend on the test). During the entire test we sample various metrics at a best-effort fixed frequency. This repeats for the entire test period. The sample period will generally be much smaller than the duty period<sup id="fnref:speriod" role="doc-noteref"><a href="#fn:speriod" class="footnote" rel="footnote">5</a></sup>: in our tests we use a 5,000 μs duty period and a sample period of 1 μs, mostly.</p>
<p>Visually, it is something like this (showing a single duty period: one benchmark is composed of multiple duty cycles back to back):</p>
<p><img src="/assets/avxfreq1/test-structure.png" alt="Test Structure" /></p>
<p>This diagram shows the payload period as occupying a non-negligible amount of time. However, in the first few tests, the payload period is essentially zero: we run the payload function (which consists of only a couple of instructions) only once, so it is really a payload <em>moment</em> rather than period.</p>
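<p>As a rough sketch, the timing structure described above can be simulated in a few lines of Python. All names here are mine and the simulation is purely illustrative; the real benchmark is C++ in the freq-bench repository and measures hardware counters rather than counting iterations:</p>

```python
# Simplified model of the benchmark's timing structure (illustrative only).
DUTY_PERIOD_US = 5_000   # payload runs at the start of each duty period
SAMPLE_PERIOD_US = 1     # metrics sampled (best effort) every microsecond
TOTAL_US = 31_000        # total test duration

def run_test(payload_period_us=0):
    """Count payload executions and samples over one simulated test."""
    payloads = samples = 0
    for t in range(TOTAL_US):
        # A zero-length payload period degenerates to a single payload
        # "moment" at each duty-period boundary.
        if t % DUTY_PERIOD_US < max(payload_period_us, 1):
            payloads += 1
        if t % SAMPLE_PERIOD_US == 0:
            samples += 1
    return payloads, samples

# Payload executes at t = 0, 5000, ..., 30000: 7 times in a 31,000 us test.
print(run_test())  # (7, 31000)
```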
<h3 id="hardware">Hardware</h3>
<p>We are running these tests on a <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> architecture W-series CPU: a W-2104<sup id="fnref:f2104" role="doc-noteref"><a href="#fn:f2104" class="footnote" rel="footnote">6</a></sup> with the following <a href="https://stackoverflow.com/a/56861355/149138">license-based</a> frequencies<sup id="fnref:avxt" role="doc-noteref"><a href="#fn:avxt" class="footnote" rel="footnote">7</a></sup>:</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>License</th>
<th>Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-AVX Turbo</td>
<td>L0</td>
<td>3.2 GHz</td>
</tr>
<tr>
<td>AVX Turbo</td>
<td>L1</td>
<td>2.8 GHz</td>
</tr>
<tr>
<td>AVX-512 Turbo</td>
<td>L2</td>
<td>2.4 GHz</td>
</tr>
</tbody>
</table>
<p>For one (voltage) test I also use my Skylake (mobile) <a href="https://ark.intel.com/content/www/us/en/ark/products/88967/intel-core-i7-6700hq-processor-6m-cache-up-to-3-50-ghz.html">i7-6700HQ</a>, running at either its nominal frequency of 2.6 GHz, or the turbo frequency of 3.5 GHz.</p>
<h2 id="tests">Tests</h2>
<p>The basic approach this post will take is examining the CPU behavior using the test framework above, primarily varying what the payload is, and what metrics we look at. Let’s get the ball rolling with 256-bit instructions.</p>
<h3 id="256-bit-integer-simd-avx">256-bit Integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> (AVX)</h3>
<p>For the first test we will use as payload the <code class="language-plaintext highlighter-rouge">vporymm_vz</code> function, which is just a single 256-bit <code class="language-plaintext highlighter-rouge">vpor</code> instruction, followed by a <code class="language-plaintext highlighter-rouge">vzeroupper</code><sup id="fnref:vz" role="doc-noteref"><a href="#fn:vz" class="footnote" rel="footnote">8</a></sup>:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">vporymm_vz:</span>
<span class="nf">vpor</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span>
<span class="nf">vzeroupper</span>
<span class="nf">ret</span>
</code></pre></div></div>
<p>We call the payload function only once at the start of each duty period<sup id="fnref:pperiod" role="doc-noteref"><a href="#fn:pperiod" class="footnote" rel="footnote">9</a></sup>. The duty period is set to 5000 μs and the sample period to 1 μs, and the total test time is set to 31,000 μs (so the payload will execute 7 times).</p>
<p>Here’s the result (plot notes<sup id="fnref:plotnotes" role="doc-noteref"><a href="#fn:plotnotes" class="footnote" rel="footnote">10</a></sup>), with time along the x axis<sup id="fnref:falk" role="doc-noteref"><a href="#fn:falk" class="footnote" rel="footnote">11</a></sup>, showing the measured frequency at each sample (there are three separate test runs shown<sup id="fnref:threerun" role="doc-noteref"><a href="#fn:threerun" class="footnote" rel="footnote">12</a></sup>):</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-vporvz256-740w.png" srcset="
/assets/avxfreq1/fig-vporvz256-740w.png 740w,
/assets/avxfreq1/fig-vporvz256-1480w.png 1480w,
/assets/avxfreq1/fig-vporvz256-2220w.png 2220w,
/assets/avxfreq1/fig-vporvz256-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="256-bit vpor transitions" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-vporvz256.svg">SVG version of this plot</a>
</div>
<p>Well that’s really boring. The entire test runs consistently at 3.2 GHz, the nominal (L0 license) frequency, if we ignore a few uninteresting outliers<sup id="fnref:outlie" role="doc-noteref"><a href="#fn:outlie" class="footnote" rel="footnote">13</a></sup>.</p>
<h3 id="512-bit-integer-simd-avx-512">512-bit Integer <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> (AVX-512)</h3>
<p>Before the crowd gets too rowdy, let’s quickly move on to the next test, which is identical except that it uses 512-bit <code class="language-plaintext highlighter-rouge">zmm</code> registers:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">vporzmm_vz:</span>
<span class="nf">vpor</span> <span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span>
<span class="nf">vzeroupper</span>
<span class="nf">ret</span>
</code></pre></div></div>
<p>Here is the result:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-vporvz512-740w.png" srcset="
/assets/avxfreq1/fig-vporvz512-740w.png 740w,
/assets/avxfreq1/fig-vporvz512-1480w.png 1480w,
/assets/avxfreq1/fig-vporvz512-2220w.png 2220w,
/assets/avxfreq1/fig-vporvz512-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="512-bit vpor transitions" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-vporvz512.svg">SVG version of this plot</a>
</div>
<p>We’ve got something to sink our teeth into!</p>
<p>Remember that the duty cycle is 5000 μs, so at each x-axis tick we execute the payload. Now the behavior is clear: every time the payload instruction executes (at multiples of 5000 μs), the frequency drops from the 3.2 GHz L0 license down to 2.8 GHz L1 license frequency. So far this is all pretty much as expected.</p>
<p>Let’s zoom in on one of the transition points at 15,000 μs:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-vpor-zoomed-740w.png" srcset="
/assets/avxfreq1/fig-vpor-zoomed-740w.png 740w,
/assets/avxfreq1/fig-vpor-zoomed-1480w.png 1480w,
/assets/avxfreq1/fig-vpor-zoomed-2220w.png 2220w,
/assets/avxfreq1/fig-vpor-zoomed-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="512-bit vpor transitions (zoomed)" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-vpor-zoomed.svg">SVG version of this plot</a>
</div>
<p>We can make the following observations:</p>
<ol>
<li>There is a transition period (the rightmost of the two shaded regions, in orange<sup id="fnref:peachpuff" role="doc-noteref"><a href="#fn:peachpuff" class="footnote" rel="footnote">14</a></sup>) of ~11 μs<sup id="fnref:resnote" role="doc-noteref"><a href="#fn:resnote" class="footnote" rel="footnote">15</a></sup> where the CPU is halted: no samples occur during this period<sup id="fnref:halted" role="doc-noteref"><a href="#fn:halted" class="footnote" rel="footnote">16</a></sup>. For fun, I’ll call this a <em>frequency transition</em>.</li>
<li>The leftmost shaded region, shown in purple<sup id="fnref:thistle" role="doc-noteref"><a href="#fn:thistle" class="footnote" rel="footnote">17</a></sup>, immediately following the payload execution at 15,000 μs and prior to the halted region, is ~9 μs long and the frequency remains unchanged. This is not just a test issue or measurement error: this period occurs after the payload and is consistently reproducible<sup id="fnref:confirm" role="doc-noteref"><a href="#fn:confirm" class="footnote" rel="footnote">18</a></sup>. Although it looks like nothing interesting is going on in this region, we’ll soon see it is indeed special and will call this region a <em>voltage-only</em> transition.</li>
<li>Although not fully shown in the zoomed plot, the lower 2.8 GHz frequency period lasts for ~650 μs.</li>
<li>Not shown in the zoomed plot (but seen as a second downwards spike on the full plot, after the ~650 μs period of low frequency), there is another fully halted period of ~11 μs, after which the CPU returns to its maximum speed of 3.2 GHz (L0 license).</li>
<li>These attributes are mostly consistent across the three runs (so much that the series, in green, mostly overlaps and obscures the others) – but there are a few outliers where the return to 3.2 GHz takes somewhat longer. This is consistent across runs: recovery is never <em>faster</em> than ~650 μs, but sometimes longer. I believe it occurs when an interrupt during the L1 region “resets the timer”.</li>
</ol>
<h4 id="enter-ipc">Enter <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr></h4>
<p>Although it is not visible in this plot, there is something special about the behavior of 512-bit instructions in the first shaded (purple) region – that is, in the 9 microseconds between the execution of the payload instruction and the subsequent halted period: <em>they execute much slower than usual</em>.</p>
<p>This is easiest to see if we extend the payload period: instead of executing the payload function once every 5000 μs and then looping on <code class="language-plaintext highlighter-rouge">rdtsc</code> waiting for the next sample, we will continue to execute the payload function for 100 μs after a new duty period starts (that is, the <em>payload period</em> is set to 100 μs). During this time we still take samples as usual, every 1 μs – but in between samples we are executing the payload instruction(s)<sup id="fnref:whynot" role="doc-noteref"><a href="#fn:whynot" class="footnote" rel="footnote">19</a></sup>. So one duty period now looks like 100 μs of payload followed by 4900 μs of normal payload-free hot spinning.</p>
<p>We lengthen the payload period in order to examine the performance of the payload instructions. There are several metrics we could look at, but a simple one is to look at <em>instructions per cycle</em>. As long as we make sure the large majority of the executed instructions are payload instructions, the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> will largely reflect the execution of the payload<sup id="fnref:ideal" role="doc-noteref"><a href="#fn:ideal" class="footnote" rel="footnote">20</a></sup>.</p>
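<p>Concretely, the per-sample IPC can be derived from cumulative instruction and cycle counter readings, one pair per sample. The function below is a sketch with made-up counter values for illustration:</p>

```python
# Derive IPC for each sample interval from cumulative hardware counter
# readings (instructions retired, core cycles). Values are illustrative.
def ipc_per_sample(instructions, cycles):
    """instructions/cycles: cumulative counter readings, one per sample."""
    out = []
    for i in range(1, len(instructions)):
        d_instr = instructions[i] - instructions[i - 1]
        d_cyc = cycles[i] - cycles[i - 1]
        out.append(d_instr / d_cyc)
    return out

# Roughly 3200 cycles elapse per 1 us sample at 3.2 GHz; if the payload
# dominates the interval, its IPC dominates the measurement.
instr = [0, 3200, 6400, 7200]
cyc = [0, 3200, 6400, 9600]
print(ipc_per_sample(instr, cyc))  # [1.0, 1.0, 0.25]
```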
<p>As payload, we will use a function composed simply of 1,000 dependent 512-bit <code class="language-plaintext highlighter-rouge">vpord</code> instructions:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">vpord</span> <span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span>
<span class="nf">vpord</span> <span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span>
<span class="c1">; ... 997 more like this</span>
<span class="nf">vpord</span> <span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span><span class="p">,</span><span class="nv">zmm0</span>
<span class="nf">vzeroupper</span>
<span class="nf">ret</span>
</code></pre></div></div>
<p>We <a href="https://uops.info/table.html?search=vpord%20(zmm%2C%20zmm%2C%20zmm)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKX=on&cb_measurements=on&cb_iaca30=on&cb_avx512=on">know</a> these <code class="language-plaintext highlighter-rouge">vpord</code> instructions have a latency of 1 cycle and here they are serially dependent so we expect this function to take 1,000 cycles, give or take<sup id="fnref:give" role="doc-noteref"><a href="#fn:give" class="footnote" rel="footnote">21</a></sup>, for an <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> of 1.0.</p>
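<p>The expected IPC of 1.0 follows directly from the dependency chain: for a serial chain, throughput is bounded by latency, so total cycles are roughly chain length times per-instruction latency. A trivial sanity check (helper name is mine):</p>

```python
# For a serially dependent chain, total cycles ~= n_instructions * latency,
# so IPC = n_instructions / cycles = 1 / latency.
def chain_ipc(n_instructions, latency_cycles):
    cycles = n_instructions * latency_cycles
    return n_instructions / cycles

print(chain_ipc(1000, 1))  # 1.0: full-speed IPC for 1-cycle latency vpord
print(chain_ipc(1000, 4))  # 0.25: the same chain if latency were 4 cycles
```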
<p>Here’s what the same zoomed transition point looks like for this test, with <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> plotted on the secondary axis:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-ipc-zoomed-zmm-740w.png" srcset="
/assets/avxfreq1/fig-ipc-zoomed-zmm-740w.png 740w,
/assets/avxfreq1/fig-ipc-zoomed-zmm-1480w.png 1480w,
/assets/avxfreq1/fig-ipc-zoomed-zmm-2220w.png 2220w,
/assets/avxfreq1/fig-ipc-zoomed-zmm-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="512-bit vpor transitions (with IPC)" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-ipc-zoomed-zmm.svg">SVG version of this plot</a>
</div>
<p>First, note that in the unshaded regions on the left (before 15,000 μs) and right (after 15,100 μs), the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> is basically irrelevant: no payload instructions are being executed, so the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> there is just whatever the measurement code happens to net out to. We only care about the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> in the shaded regions, where the payload is executing.</p>
<p>Let’s tackle the regions from right to left, which happens to go from most obvious to least obvious.</p>
<p>We have the blue region, running from ~15020 μs to 15100 μs (where the extra payload period ends). Here the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> is right at 1 instruction per cycle. So the payload is executing right at the expected rate, i.e., <em>full speed</em>. Keeners may point out that at the very beginning of the blue period, the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> (and the measured frequency) is a bit noisier and slightly above 1. This is not a CPU effect, but rather a measurement one: during this phase the benchmark is <em>catching up</em> on samples missed during the previous halted period, which changes the payload to overhead ratio and bumps up the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> (details<sup id="fnref:catchup" role="doc-noteref"><a href="#fn:catchup" class="footnote" rel="footnote">22</a></sup>).</p>
<p>The middle, orange, region shows us what we’ve already seen: the CPU is halted, so no samples occur. <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> doesn’t tell us much here.</p>
<h4 id="voltage-only-transitions">Voltage Only Transitions</h4>
<p>The most interesting part is the first shaded region (purple): after the payload starts running but before the halt, which I call a <em>voltage only</em> transition for reasons that will soon become clear.</p>
<p>Here, we see that the payload executes <em>much</em> more slowly, with an <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> of ~0.25. So in this region, the <code class="language-plaintext highlighter-rouge">vpord</code> instructions are apparently executing at <em>four times</em> their normal latency. I also observe an identical 4x slowdown for <code class="language-plaintext highlighter-rouge">vpord</code> throughput, using an identical <a href="https://github.com/travisdowns/freq-bench/blob/434c7cf5db73e2d48061e78525c7bbf7eb7757a3/basic-impls.cpp#L64">test</a> except with independent <code class="language-plaintext highlighter-rouge">vpord</code> instructions<sup id="fnref:tput" role="doc-noteref"><a href="#fn:tput" class="footnote" rel="footnote">23</a></sup>.</p>
<p>Perhaps surprisingly, this same slowdown occurs for 256-bit <code class="language-plaintext highlighter-rouge">ymm</code> instructions as well. This contradicts the conventional wisdom that on AVX-512 chips there is no penalty to using light 256-bit instructions:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-ipc-zoomed-ymm-740w.png" srcset="
/assets/avxfreq1/fig-ipc-zoomed-ymm-740w.png 740w,
/assets/avxfreq1/fig-ipc-zoomed-ymm-1480w.png 1480w,
/assets/avxfreq1/fig-ipc-zoomed-ymm-2220w.png 2220w,
/assets/avxfreq1/fig-ipc-zoomed-ymm-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="256-bit vpor transitions (with IPC)" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-ipc-zoomed-ymm.svg">SVG version of this plot</a>
</div>
<p>The results shown above are for a test identical to the 512-bit version except that it uses 256-bit <code class="language-plaintext highlighter-rouge">vpor ymm0, ymm0, ymm0</code> as the payload. It shows the same slowdown for ~9 μs after the payload starts executing, but no subsequent halt and no frequency transition. That is, it shows a voltage-only transition (the lack of a frequency transition is expected because we don’t expect a turbo license change for light 256-bit instructions).</p>
<p><a name="xmmeffect"></a>By now, you are probably wondering about 128-bit <code class="language-plaintext highlighter-rouge">xmm</code> registers. The good news is that these show no effect at all:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-ipc-zoomed-xmm-740w.png" srcset="
/assets/avxfreq1/fig-ipc-zoomed-xmm-740w.png 740w,
/assets/avxfreq1/fig-ipc-zoomed-xmm-1480w.png 1480w,
/assets/avxfreq1/fig-ipc-zoomed-xmm-2220w.png 2220w,
/assets/avxfreq1/fig-ipc-zoomed-xmm-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="128-bit vpor transitions (with IPC)" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-ipc-zoomed-xmm.svg">SVG version of this plot</a>
</div>
<p>Here, the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> jumps immediately to the expected value. So it appears that the CPU runs in a state where the 128-bit lanes are ready to go at all times<sup id="fnref:orisit" role="doc-noteref"><a href="#fn:orisit" class="footnote" rel="footnote">24</a></sup>.</p>
<p>The conventional wisdom regarding this “warmup” period is that the upper part<sup id="fnref:upper" role="doc-noteref"><a href="#fn:upper" class="footnote" rel="footnote">25</a></sup> of the vector units is shut down when not in use, and takes time to power up. The story goes that during this power-up period the CPU does not need to halt but it runs <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions at a reduced throughput <em>by splitting up the input</em> into 128-bit chunks and passing the data two or more times through the powered-on 128-bit lanes<sup id="fnref:amd" role="doc-noteref"><a href="#fn:amd" class="footnote" rel="footnote">26</a></sup>.</p>
<p>However, there are some observations that seem to contradict this hypothesis (in rough order from least to most convincing):</p>
<ol>
<li>The observed impact to latency and throughput is ~4x, whereas I would expect 2x for simple instructions such as <code class="language-plaintext highlighter-rouge">vpor</code>.</li>
<li>The timing is the same for 256-bit and 512-bit instructions, despite the fact that 512-bit instructions represent at least 2x the work, i.e., need to be passed through the 128-bit unit at least 4 times.</li>
<li>Some instructions are more difficult to implement using this type of splitting, e.g., instructions where both high and low output lanes depend on all of the input lanes<sup id="fnref:ewise" role="doc-noteref"><a href="#fn:ewise" class="footnote" rel="footnote">27</a></sup> (see how slow they are on Zen). I expected that maybe these instructions would be slower when running in split mode, but I tested <code class="language-plaintext highlighter-rouge">vpermd</code> and found that it runs at 4L4T<sup id="fnref:lt" role="doc-noteref"><a href="#fn:lt" class="footnote" rel="footnote">28</a></sup>, compared to 3L1T normally. So <code class="language-plaintext highlighter-rouge">vpermd</code> (including the 512-bit version) didn’t slow down more than <code class="language-plaintext highlighter-rouge">vpor</code>, and in fact in a relative sense it slowed down <em>less</em> (e.g., the latency only changed from 3 to 4). The fact that the latency and throughput reacted differently for this instruction seems odd, and that it now has the exact same 4L4T timing as <code class="language-plaintext highlighter-rouge">vpor</code> seems like a strange coincidence.</li>
<li>Oddly, when I tried to time the slowdown more precisely, I kept coming up with a fractional value around 4.2x, not 4.0x, kind of contradicting the idea that the instruction is simply operating in a different mode, which should still have an integral latency.</li>
<li>As it turns out, <em>all ALU<sup id="fnref:alu" role="doc-noteref"><a href="#fn:alu" class="footnote" rel="footnote">29</a></sup> instructions</em> are slower in this mode, not just wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> ones.</li>
</ol>
<p>It was 5 that sealed the deal on this not being a slowdown related to split execution. I believe what is actually happening is that the CPU is doing very fine-grained throttling when wider instructions are executing in the core. That is, the upper lanes <em>are</em> being used in this mode (they are either not gated at all, or are gated but enabling them is very quick, less than 1 μs) but execution frequency is reduced by 4x because CPU power delivery is not yet in a state that can handle full-speed execution of these wider instructions. While the CPU waits (e.g., for voltage to rise, fattening the guardband) for higher power execution to be allowed, this fine-grained throttling occurs.</p>
<p>This throttling affects non-<abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions too, causing them to execute at 4x their normal latency and inverse throughput. We can show this with the following test, which combines a single <code class="language-plaintext highlighter-rouge">vpor ymm0, ymm0, ymm0</code> with N chained <code class="language-plaintext highlighter-rouge">add eax, 0</code> instructions, shown here for N = 3:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">vpor</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span>
<span class="nf">add</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0x0</span>
<span class="nf">add</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0x0</span>
<span class="nf">add</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0x0</span>
<span class="c1">; repeated 9 more times</span>
</code></pre></div></div>
<p>If only <code class="language-plaintext highlighter-rouge">vpor</code> is slowed down, each block of 4 instructions will take 4 cycles, limited by the <code class="language-plaintext highlighter-rouge">vpor</code> chain (the <code class="language-plaintext highlighter-rouge">add</code> chain is 3 cycles long). However, I actually measure ~12 cycles, indicating that we are instead limited by the <code class="language-plaintext highlighter-rouge">add</code> chain, each of which takes 4 cycles for a total of 12.</p>
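<p>This reasoning can be captured in a tiny model comparing the two hypotheses (the latencies are the nominal values from the text; the helper name is mine):</p>

```python
# Cycles for one vpor + N-add block: the vpor dependency chain and the add
# dependency chain run in parallel, so the block is limited by the slower
# of the two chains.
def block_cycles(n_adds, add_latency, vpor_latency):
    return max(vpor_latency, n_adds * add_latency)

# Hypothesis A: only vpor is slowed to 4 cycles; the 3-add chain (1 cycle
# each) hides behind the vpor chain.
print(block_cycles(3, add_latency=1, vpor_latency=4))   # 4
# Hypothesis B: all ALU instructions run at 4x latency; the add chain
# dominates, matching the ~12 cycles actually measured.
print(block_cycles(3, add_latency=4, vpor_latency=4))   # 12
```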
<p><a name="throttle-anchor"></a>We can vary the number of <code class="language-plaintext highlighter-rouge">add</code> instructions (N) to see how long this effect persists. This table is the result:</p>
<table>
<thead>
<tr>
<th style="text-align: right">ADD instructions (N)</th>
<th style="text-align: right">Cycles/ADD</th>
<th style="text-align: right">Delta Cycles (slow)</th>
<th style="text-align: right">Delta Cycles (fast)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">2.3</td>
<td style="text-align: right">-0.2</td>
</tr>
<tr>
<td style="text-align: right">3</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">0.8</td>
</tr>
<tr>
<td style="text-align: right">4</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">1.1</td>
</tr>
<tr>
<td style="text-align: right">5</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">3.9</td>
<td style="text-align: right">1.1</td>
</tr>
<tr>
<td style="text-align: right">6</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.3</td>
<td style="text-align: right">0.7</td>
</tr>
<tr>
<td style="text-align: right">7</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">3.4</td>
<td style="text-align: right">1.1</td>
</tr>
<tr>
<td style="text-align: right">8</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">4.2</td>
<td style="text-align: right">0.9</td>
</tr>
<tr>
<td style="text-align: right">9</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">3.9</td>
<td style="text-align: right">0.8</td>
</tr>
<tr>
<td style="text-align: right">10</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.3</td>
<td style="text-align: right">1.1</td>
</tr>
<tr>
<td style="text-align: right">20</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">30</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">4.0</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">40</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.3</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">50</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">3.9</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">60</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">4.4</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">70</td>
<td style="text-align: right">4.2</td>
<td style="text-align: right">4.5</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">80</td>
<td style="text-align: right">4.1</td>
<td style="text-align: right">3.4</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">90</td>
<td style="text-align: right">3.6</td>
<td style="text-align: right">-0.2</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">100</td>
<td style="text-align: right">3.3</td>
<td style="text-align: right">1.1</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">120</td>
<td style="text-align: right">2.9</td>
<td style="text-align: right">0.9</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">140</td>
<td style="text-align: right">2.7</td>
<td style="text-align: right">1.1</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">160</td>
<td style="text-align: right">2.5</td>
<td style="text-align: right">1.2</td>
<td style="text-align: right">1.0</td>
</tr>
<tr>
<td style="text-align: right">180</td>
<td style="text-align: right">2.3</td>
<td style="text-align: right">0.7</td>
<td style="text-align: right">0.9</td>
</tr>
<tr>
<td style="text-align: right">200</td>
<td style="text-align: right">2.2</td>
<td style="text-align: right">0.8</td>
<td style="text-align: right">1.0</td>
</tr>
</tbody>
</table>
<p>The <strong>Cycles/ADD</strong> column shows the number of cycles taken per add instruction over the entire slow region (roughly the first 8-10 μs after the payload starts executing). The <strong>Delta Cycles (slow)</strong> column shows how many cycles each additional <code class="language-plaintext highlighter-rouge">add</code> instruction took compared to the previous row: i.e., for row N = 30, it determines how much longer the 10 additional <code class="language-plaintext highlighter-rouge">add</code> instructions took compared to row N = 20. The <strong>Delta Cycles (fast)</strong> column is the same thing, but applies to the samples after ~10 μs when the CPU is back up to full speed (that column shows the expected 1.0 cycles per additional add).</p>
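<p>To make the column definitions concrete, here’s how a delta row can be computed from two measured rows. This is an illustrative helper (the name <code class="language-plaintext highlighter-rouge">delta_cycles</code> is mine, not part of the benchmark):</p>

```python
def delta_cycles(n_small, cyc_per_add_small, n_large, cyc_per_add_large):
    """Marginal cycles per additional add between two payload sizes.

    Takes the N and Cycles/ADD values from two rows of the table and
    returns the cost of each add that the larger payload adds on top
    of the smaller one.
    """
    total_small = n_small * cyc_per_add_small  # total cycles, smaller payload
    total_large = n_large * cyc_per_add_large  # total cycles, larger payload
    return (total_large - total_small) / (n_large - n_small)

# Rows N = 20 and N = 30 both run at ~4.0 cycles/add, so each of the 10
# extra adds also costs ~4 cycles (we are still in the slow region):
print(delta_cycles(20, 4.0, 30, 4.0))  # → 4.0
```

<p>(The deltas in the table were computed from the raw cycle totals rather than the rounded Cycles/ADD column, so reproducing them this way can be off by a few tenths.)</p>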
<p>Here we clearly see that up to roughly 70 <code class="language-plaintext highlighter-rouge">add</code> instructions, interleaved with a single <code class="language-plaintext highlighter-rouge">vpor</code>, all the <code class="language-plaintext highlighter-rouge">add</code> instructions are taking 4 cycles, i.e., the CPU is throttled. Somewhere between 80 and 90 a transition happens: <em>additional</em> <code class="language-plaintext highlighter-rouge">add</code> instructions now take 1 cycle, but the overall time per <code class="language-plaintext highlighter-rouge">add</code> is (initially) close to 4. This shows that when <code class="language-plaintext highlighter-rouge">add</code> (and presumably any non-wide instruction) is far enough away from the closest wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instruction, they start executing at full speed. So the timings for the larger N values can be understood as a blend of a slow section of ~70-80 <code class="language-plaintext highlighter-rouge">add</code> instructions near the <code class="language-plaintext highlighter-rouge">vpor</code> which run at 1 per 4 cycles, and the remaining section where they run at full speed: 1 per cycle.</p>
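<p>This blend model is easy to check numerically. Here’s a tiny sketch; the ~75-add slow window is a number I’m fitting by eye, since the measurements only pin the transition somewhere between 70 and 90:</p>

```python
def blended_cycles_per_add(n, slow_window=75, slow_cost=4.0, fast_cost=1.0):
    """Predicted average cycles/ADD when n adds separate the vpor
    instructions: the slow_window adds nearest the vpor run at
    slow_cost cycles each, and the remainder run at fast_cost."""
    slow = min(n, slow_window)          # adds stuck in the slow region
    fast = max(0, n - slow_window)      # adds running at full speed
    return (slow * slow_cost + fast * fast_cost) / n

# n <= slow_window: everything is slow, 4.0 cycles/add.
# n = 100 -> 3.25 (table shows 3.3); n = 200 -> 2.125 (table shows 2.2).
```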
<p>We can probably conclude the CPU is not just throttling frequency or “duty cycling”: in that case every instruction would be slowed down by the same factor, but instead the rule is more like “latency extended to the next multiple of 4 cycles”, e.g., a latency 3 instruction like <code class="language-plaintext highlighter-rouge">imul eax, eax, 0</code> ends up taking 4 cycles when the CPU is throttling. It is likely that the throttling happens at some part of the pipeline before execution, e.g., at issue or dispatch.</p>
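<p>As a toy model, the “next multiple of 4 cycles” rule is just a ceiling operation. To be clear, this is my inference from the measurements, not documented CPU behavior:</p>

```python
def throttled_latency(latency, quantum=4):
    """Toy model of the observed throttling: instruction latency is
    extended to the next multiple of `quantum` cycles. An inference
    from the measurements, not a documented rule."""
    return -(-latency // quantum) * quantum  # ceiling division, rescaled

print(throttled_latency(1))  # add, normally 1 cycle   → 4
print(throttled_latency(3))  # imul, normally 3 cycles → 4
```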
<p>The transition to fast mode when the <code class="language-plaintext highlighter-rouge">vpor</code> instructions are spread sufficiently apart probably reflects the size of some structure such as the <abbr title="Queue that collects incoming instructions from the decoder, uop cache or microcode engine and delivers them to the renamer (RAT).">IDQ</abbr> (64 entries in Skylake) or scheduler (97 entries claimed<sup id="fnref:ratentries" role="doc-noteref"><a href="#fn:ratentries" class="footnote" rel="footnote">30</a></sup>). The core could track whether <em>any</em> wide instruction is currently in that structure, and enforce the slow mode if so. When the <code class="language-plaintext highlighter-rouge">vpor</code> instructions are close enough together, there is <em>always</em> at least one present, but once they are spaced out enough, you get periods of fast mode.</p>
<p><strong>Voltage Effects</strong></p>
<p>We can actually test the theory that this transition is associated with waiting for a change in power delivery configuration. Specifically, we can observe the CPU core voltage, using bits 47:32 of the <code class="language-plaintext highlighter-rouge">MSR_PERF_STATUS</code> MSR. Volume 4 of the Intel Software Developer’s Manual lets us in on a secret: these bits expose the <em>core voltage<sup id="fnref:vcc" role="doc-noteref"><a href="#fn:vcc" class="footnote" rel="footnote">31</a></sup></em>:</p>
<p><img src="/assets/avxfreq1/msr_198h.png" alt="Intel SDM Volume 4: Table 2-20" /></p>
<p>Let’s zoom as usual on a transition point, in this case using a 256-bit (ymm) payload of 1000 dependent <code class="language-plaintext highlighter-rouge">vpor</code> instructions. This 256-bit payload means no frequency transition, only a dispatch throttling period associated with running 256-bit instructions for the first time in a while. We plot the time it takes to run an iteration of the payload<sup id="fnref:whyp" role="doc-noteref"><a href="#fn:whyp" class="footnote" rel="footnote">32</a></sup>, along with the measured voltage:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-volts256-1-740w.png" srcset="
/assets/avxfreq1/fig-volts256-1-740w.png 740w,
/assets/avxfreq1/fig-volts256-1-1480w.png 1480w,
/assets/avxfreq1/fig-volts256-1-2220w.png 2220w,
/assets/avxfreq1/fig-volts256-1-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="Voltage Changes" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-volts256-1.svg">SVG version of this plot</a>
</div>
<p>The length of the throttling period is around 10 μs as usual, as shown by the period where the payload takes ~4,000 cycles (the usual 4x throttling), and the voltage is unchanged from the pre-transition period (at about 0.951 V) during the throttling period. At the moment the throttling stops, the voltage jumps to about 0.957 V, a change of about 6 mV. This happens at 2.6 GHz, the nominal non-turbo speed of my i7-6700HQ. At 3.5 GHz, the transition is from 1.167 V to 1.182 V, so both the absolute voltages and the difference (about 15 mV) are larger, consistent with the basic idea that higher frequencies need higher voltages.</p>
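<p>For reference, the voltage readout itself is simple to reproduce on Linux (needs root and the <code class="language-plaintext highlighter-rouge">msr</code> kernel module): bits 47:32 of <code class="language-plaintext highlighter-rouge">MSR_PERF_STATUS</code> hold the voltage in units of 2<sup>-13</sup> V. A minimal sketch:</p>

```python
import struct

MSR_PERF_STATUS = 0x198  # a.k.a. IA32_PERF_STATUS

def decode_core_voltage(msr_value):
    """Bits 47:32 of MSR_PERF_STATUS hold the core voltage in units
    of 2**-13 V (per SDM Volume 4, Table 2-20)."""
    return ((msr_value >> 32) & 0xFFFF) / 2**13

def read_core_voltage(cpu=0):
    """Read the current core voltage (Linux, needs root + msr module)."""
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(MSR_PERF_STATUS)
        (raw,) = struct.unpack("<Q", f.read(8))  # MSRs are 64-bit
    return decode_core_voltage(raw)

# Example: a raw field value of 7791 decodes to 7791 / 8192 ≈ 0.951 V
```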
<p>So one theory is that this type of transition represents the period between when the CPU has requested a higher voltage (because wider 256-bit instructions imply a larger worst-case current delta event, hence a worst-case voltage drop) and when the higher voltage is delivered. While the core waits for the change to take effect, throttling is in effect in order to reduce the worst-case drop: without throttling<sup id="fnref:cthrottle" role="doc-noteref"><a href="#fn:cthrottle" class="footnote" rel="footnote">33</a></sup> there is no guarantee that a burst of wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions won’t drop the voltage below the minimum voltage required for safe operation at this frequency.</p>
<h4 id="attenuation">Attenuation</h4>
<p>We might check if there is any <em>attenuation</em> of either type of transition. By <em>attenuation</em> I mean that if a core is transitioning between frequencies too frequently, the power management algorithm may decide to simply keep the core at the lower frequency, which can provide more overall performance when considering the halted periods needed in each transition. This is exactly what happens for active core count<sup id="fnref:acc" role="doc-noteref"><a href="#fn:acc" class="footnote" rel="footnote">34</a></sup> transitions: too many transitions in a short period of time and the CPU will just decide to run at the lower frequency rather than incurring the halts needed to transition between e.g. the 1-core and 2-core turbos<sup id="fnref:hiddenbo" role="doc-noteref"><a href="#fn:hiddenbo" class="footnote" rel="footnote">35</a></sup>.</p>
<p>We check this by setting a duty period which is just above the observed recovery time from 2.8 to 3.2 GHz, to see if we still see transitions. Here’s a duty cycle of 760 μs, about 10 μs more than the observed recovery period for this test<sup id="fnref:recovery" role="doc-noteref"><a href="#fn:recovery" class="footnote" rel="footnote">36</a></sup>:</p>
<div style="position: relative; margin-bottom: 1em;">
<img src="/assets/avxfreq1/fig-vporvz512-ipc-p760-740w.png" srcset="
/assets/avxfreq1/fig-vporvz512-ipc-p760-740w.png 740w,
/assets/avxfreq1/fig-vporvz512-ipc-p760-1480w.png 1480w,
/assets/avxfreq1/fig-vporvz512-ipc-p760-2220w.png 2220w,
/assets/avxfreq1/fig-vporvz512-ipc-p760-2960w.png 2960w
" sizes="(max-width: 800px) calc(100vw - 30px), 740px" alt="760 μs period closeup" />
<a style="position: absolute; font-size: 70%; right: 0px; bottom: 0px" href="/assets/avxfreq1/fig-vporvz512-ipc-p760.svg">SVG version of this plot</a>
</div>
<p>I’m not going to color the regions here, as by now I think you are probably (over?) familiar with them. The key points are:</p>
<ul>
<li>The payload starts executing at 7600 μs, which is <em>before</em> the upwards frequency transition, we are still executing at 2.8 GHz - so initially the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> is high, 1 per cycle.</li>
<li>Despite the fact that we are already executing 512-bit instructions again, the frequency adjusts upwards a few μs later. Most likely what happened is that the power logic already evaluated earlier (say at ~7558 μs, just before the payload started) that an upwards transition should occur, but as we’ve seen the response is generally delayed by 8 to 10 μs, so it occurs after the payload has already started executing.</li>
<li>Of course, as soon as the transition occurs, the core is no longer in a suitable state for full-speed wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> execution, so <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> drops to ~0.25.</li>
<li>Another transition back to low frequency occurs ~10 μs later and then full speed execution can resume.</li>
</ul>
<p>So there is no attenuation, but attenuation isn’t really needed: the long (~650 μs) cooldown period between the last wide instruction and subsequent frequency boost means that the damage from halt periods is fairly limited: this is unlike the active core count scenario where the CPU has no control over the transition frequency (rather it is driven by interrupts and changes in runnable processes and threads). Here, we have the worst case scenario of transitions packed as closely as possible, but we lose only ~20 μs (for 2 transitions) out of 760 μs, less than a 3% impact. The impact of running at the lower frequency is much higher: 2.8 vs 3.2 GHz: a 12.5% impact in the case that the lowered frequency was not useful (i.e., because the wide <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> payload represents a vanishingly small part of the total work).</p>
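<p>The arithmetic behind those two percentages, spelled out (helper names are mine):</p>

```python
def halt_overhead(halt_us, period_us):
    """Fraction of a duty period lost to halted transition time."""
    return halt_us / period_us

def frequency_penalty(low_ghz, high_ghz):
    """Relative slowdown from simply staying at the lower frequency."""
    return 1 - low_ghz / high_ghz

# Two ~10 μs halts per 760 μs duty period, vs. staying at 2.8 GHz:
print(f"{halt_overhead(20, 760):.1%}")       # → 2.6%
print(f"{frequency_penalty(2.8, 3.2):.1%}")  # → 12.5%
```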
<h2 id="what-was-left-out">What Was Left Out</h2>
<p>There’s lots we’ve left out. We haven’t even touched:</p>
<ul>
<li>Checking whether xmm registers also cause a voltage-only transition, if they haven’t been used for a while. We didn’t find <a href="#xmmeffect">any effect</a>, but it is also certain that some 128-bit instructions appear in the measurement loop, which would hide the effect.</li>
<li>Checking whether the voltage-only transitions implied by 256-bit instructions are disjoint from those for 512-bit. That is, if you execute a 256-bit instruction after a while without any, you get a voltage-only transition (confirmed above). If you then execute a 512-bit instruction before the relaxation period expires, do you get a second throttling period prior to the frequency transition? I believe so, but I haven’t checked it.</li>
<li>Any type of investigation of “heavy” 256-bit or 512-bit instructions. These require a license one level (numerically) higher than light instructions, and knowing if any of the key timings change would be interesting<sup id="fnref:heavy" role="doc-noteref"><a href="#fn:heavy" class="footnote" rel="footnote">37</a></sup>.</li>
<li>Almost no investigation was made into how any of these timings (and the magnitude of voltage changes) vary with frequency. For example, if we are already running at a lower frequency, frequency transitions are presumably not needed, and voltage-only transitions may be shorter.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>For the benefit of anyone who just skipped to the bottom, or whose eyes glazed over at some point, here’s a summary of the key findings:</p>
<ul>
<li>After a period of about 680 μs of not using the <em>AVX upper bits</em> (255:128) or <em>AVX-512 upper bits</em> (511:256), the processor enters a mode where using those bits again requires at least a voltage transition, and sometimes a frequency transition.</li>
<li>The processor continues executing instructions during a voltage transition, but at a greatly reduced speed: 1/4th the usual instruction dispatch rate. However, this throttling is fine-grained: it only applies when wide instructions are <em>in flight</em> (<a href="#throttle-anchor">details</a>).</li>
<li>Voltage transitions end when the voltage reaches the desired level; the duration depends on the magnitude of the transition, but 8 to 20 μs is common on the hardware I tested.</li>
<li>In some cases a frequency transition is also required, e.g., because the involved instruction requires a higher power license. These transitions seem to <em>first</em> incur a throttling period similar to a voltage-only transition, and then a halted period of 8 to 10 μs while the frequency changes.</li>
<li>A key motivator for this post was to give concrete, qualitative guidance on how to write code that is as fast as possible given this behavior. It got bumped to part 2.</li>
</ul>
<p>We also summarize the key timings in this beautifully rendered table:</p>
<table>
<thead>
<tr>
<th>What</th>
<th>Time</th>
<th>Description</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>Voltage Transition</td>
<td>~8 to 20 μs</td>
<td>Time required for a voltage transition, depends on frequency</td>
<td><sup id="fnref:t1deets" role="doc-noteref"><a href="#fn:t1deets" class="footnote" rel="footnote">38</a></sup></td>
</tr>
<tr>
<td>Frequency Transition</td>
<td>~10 μs</td>
<td>Time required for the halted part of a frequency transition</td>
<td><sup id="fnref:fdeets" role="doc-noteref"><a href="#fn:fdeets" class="footnote" rel="footnote">39</a></sup></td>
</tr>
<tr>
<td>Relaxation Period</td>
<td>~680 μs</td>
<td>Time required to go back to a lower power license, measured from the last instruction requiring the higher license</td>
<td><sup id="fnref:lldeets" role="doc-noteref"><a href="#fn:lldeets" class="footnote" rel="footnote">40</a></sup></td>
</tr>
</tbody>
</table>
<h2 id="thanks">Thanks</h2>
<p><a href="https://lemire.me">Daniel Lemire</a> who provided access to the AVX-512 system I used for testing.</p>
<p><a href="https://twitter.com/thekanter">David Kanter</a> of <a href="http://www.realworldtech.com">RWT</a> for a fruitful discussion on power and voltage management in modern chips.</p>
<p>RWT forum members anon³, Ray, Etienne, Ricardo B, Foyle and Tim McCaffrey who provided feedback on this post and helped me understand the VR landscape for recent Intel chips.</p>
<p>Alexander Monakov, Kharzette and Justin Lebar for finding typos.</p>
<p><a href="https://twitter.com/JeffSmith888">Jeff Smith</a> for teaching me about spread spectrum clocking.</p>
<h2 id="discuss">Discuss</h2>
<p>Discussion on <a href="https://news.ycombinator.com/item?id=22077974">Hacker News</a>, <a href="https://twitter.com/trav_downs/status/1218238653354344449">Twitter</a> and <a href="https://lobste.rs/s/qaqmyo/gathering_intel_on_intel_avx_512">lobste.rs</a>.</p>
<p>Direct feedback also welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<p><br /></p>
<p><a name="footnotes"></a></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:first" role="doc-endnote">
<p>… and also <em>non frequency</em> related performance events, which I mention only in a footnote not to spoil the fun for non-footnote type people and also to pad my footnote count. That’s why I call this <em>Performance</em> Transitions, instead of Frequency Transitions. <a href="#fnref:first" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:dmore" role="doc-endnote">
<p>Note that Daniel has <a href="https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-instructions-are-maybe-not-so-dangerous/">written</a> <a href="https://lemire.me/blog/2018/08/15/the-dangers-of-avx-512-throttling-a-3-impact/">much more</a> <a href="https://lemire.me/blog/2018/08/24/trying-harder-to-make-avx-512-look-bad-my-quantified-and-reproducible-results/">than</a> <a href="https://lemire.me/blog/2018/09/04/per-core-frequency-scaling-and-avx-512-an-experiment/">just that</a> one. <a href="#fnref:dmore" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:intro" role="doc-endnote">
<p>This was going to be the actual post I was trying to write when I went off on <a href="/blog/2019/11/19/toupper.html">this tangent about clang-format</a>. In fact I <em>was</em> writing that post, when I went off this current tangent, but then a footnote turned into several paragraphs, then got its own section and ultimately graduated into a whole post: the one you are reading. So consider this background reading for the “interesting” post still to come, although honestly the stuff here is probably more generally useful than the next part. <a href="#fnref:intro" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:widthmatters" role="doc-endnote">
<p>We should be clear here: when I say AVX-512 instruction in this context, I mean specifically a <em>512-bit wide instruction</em> (which currently only exists in AVX-512). The distinction is that AVX-512 includes 128-bit and 256-bit versions of almost every instruction it introduces, yet these narrower instructions behave just like 128-bit SSE* and 256-bit AVX* instructions in terms of performance transitions. So, for example, <code class="language-plaintext highlighter-rouge">vpermw</code> is unambiguously an <em>AVX-512</em> instruction: it only exists in AVX-512BW, but only the version that takes <code class="language-plaintext highlighter-rouge">zmm</code> registers causes “AVX-512 like” performance transitions: the versions that take <code class="language-plaintext highlighter-rouge">ymm</code> or <code class="language-plaintext highlighter-rouge">xmm</code> registers act as any other integer 256-bit AVX2 or 128-bit SSE instruction. <a href="#fnref:widthmatters" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:speriod" role="doc-endnote">
<p>In fact, we generally want the sample period to be as small as possible, to give the best resolution and insight into short-lived events. We can’t make it <em>too</em> short though as the samples themselves have a minimum time to capture, and very short samples tend to be noisy due to non-atomicity, quantization effects, etc. <a href="#fnref:speriod" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:f2104" role="doc-endnote">
<p>The CPU is an <a href="https://en.wikichip.org/wiki/intel/xeon_w/w-2104">Intel W-2104</a>, a Xeon-W chip based on the <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr>. It has no turbo and an on-the-box speed of 3200 MHz, but accurate tools will probably report it running at 3192 MHz due to <em><a href="https://twitter.com/JeffSmith888/status/1211821823035351043">spread spectrum clocking</a></em> (SSC). In fact, we can see the typical 0.5% spread spectrum triangle wave on almost any of the plots in this post, at the right zoom level, <a href="/assets/avxfreq1/fig-ssc.svg">like this one</a>. This occurs because we measure time (the x-axis) based on <code class="language-plaintext highlighter-rouge">rdtsc</code> which is based off of a different clock not subject to SSC, while the unhalted cycles counter counts CPU cycles, which are based off of the 100 MHz BLCK which is subject to SSC. <a href="#fnref:f2104" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:avxt" role="doc-endnote">
<p><a href="https://www.realworldtech.com/forum/?threadid=179358&curpostid=179652">Measured</a> with <a href="https://github.com/travisdowns/avx-turbo">avx-turbo</a>. <a href="#fnref:avxt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:vz" role="doc-endnote">
<p>This is the part where I just gloss over what that <code class="language-plaintext highlighter-rouge">vzeroupper</code> is doing there. It’s there due to <em>implicit widening</em>. That’s a new term I just invented and it’s the first and last time I’m mentioning it in this post, because it really deserves an entire post of its own. The very short version is that any time an <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instruction writes to an N-bit register (N in {256, 512}), all subsequent <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions are <em>implicitly</em> N bits wide, regardless of their actual width, for the purposes of determining turbo licenses and other transitions discussed here. This sounds a bit like the <a href="https://stackoverflow.com/q/41303780/149138">mixed-VEX penalties thing</a>, but it is very different. This is a mini-bombshell hidden in a footnote, so if you want to scoop me you can, but I’m not coming to your birthday party. <a href="#fnref:vz" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:pperiod" role="doc-endnote">
<p>I.e., the <em>payload period</em> is zero, but the structure of the test ensures the function is called once, at the start of the payload period, no matter how small the payload period is. <a href="#fnref:pperiod" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:plotnotes" role="doc-endnote">
<p>All of the plots in this post are SVG images, meaning they can be zoomed arbitrarily: so if you want to zoom in on any region feel free (the browser limits the zoom amount, but just save as a file and open it with any SVG viewer). <a href="#fnref:plotnotes" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:falk" role="doc-endnote">
<p>These are basically a poor man’s version of the <em>Falk Diagrams</em> that Brandon Falk describes in <a href="https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.html">this post</a> among others. Poor in the sense that they have ~3000 cycle resolution instead of 1 cycle resolution, and because the measurement code has to be integrated directly into the system under test. Basically they are nothing like <em>Falk Diagrams</em> except that they have time on the x-axis and some performance counter event on the y-axis, but they are good enough for our purposes. <a href="#fnref:falk" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:threerun" role="doc-endnote">
<p>Specifically, all three runs are identical and I show them just to give a rough impression of which effects are reproducible and which are outliers. The second and third runs have the suffixes <code class="language-plaintext highlighter-rouge">_1</code> and <code class="language-plaintext highlighter-rouge">_2</code>, respectively, in the legends. <a href="#fnref:threerun" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:outlie" role="doc-endnote">
<p>These outliers usually occur when an interrupt occurs during the measurement. Originally I used a 3 μs sample time, and there were almost no visible outliers with that period, but the 1 μs value I settled on is much better in most respects other than outliers. The main problem is when an interrupt takes more than the sample time of ~1 μs: this causes one or more samples to be very short, because a long sample will cause subsequent ones to be short (maybe <em>very</em> short) until we catch up to the fixed sample schedule, and very short samples are subject to much more noise because the absolute metric values are much smaller but the error sources usually have fixed absolute error. Another source of outliers is when an interrupt <em>splits the stamp</em>: the <em>stamp</em> is the series of metrics we calculate at the sample point. These various metrics aren’t sampled atomically: if an interrupt occurs in the middle of the sampling, some metrics will reflect a much shorter time period (before the interrupt) and some a longer one (after). This effect tends to cause bidirectional spike outliers: where an upspike and downspike occur in consecutive samples. I try to avoid this by retrying the stamp if I think I’ve detected an interrupt during measurement (up to a retry limit). We could avoid all this nonsense by running the benchmark itself in the kernel, where we can disable interrupts (although some SMIs might still sneak in). Maybe next time! <a href="#fnref:outlie" class="reversefootnote" role="doc-backlink">↩</a></p>
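<p>The retry-on-interrupt idea described above looks roughly like this (an illustrative sketch; the names and the threshold are made up, not the benchmark’s actual code):</p>

```python
def take_stamp(read_stamp, max_retries=3, threshold_ns=2000):
    """Take a metrics stamp, retrying if it looks like an interrupt
    landed in the middle (detected by the stamp taking too long).

    read_stamp() returns (start_ns, metrics, end_ns); names and the
    2 μs threshold are illustrative placeholders.
    """
    stamp = None
    for _ in range(max_retries):
        start_ns, metrics, end_ns = read_stamp()
        stamp = metrics
        if end_ns - start_ns <= threshold_ns:
            break  # stamp looks clean: no interrupt split it
    return stamp  # after max_retries, accept a possibly split stamp

# First attempt takes 5 μs (split by an interrupt), second is clean:
stamps = iter([(0, "split", 5000), (0, "clean", 1000)])
print(take_stamp(lambda: next(stamps)))  # → clean
```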
</li>
<li id="fn:peachpuff" role="doc-endnote">
<p>More precisely, the color is <a href="https://encycolorpedia.com/ffdab9">peachpuff</a>. <a href="#fnref:peachpuff" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:resnote" role="doc-endnote">
<p>Note that the resolution of the sampling is 1 μs, so when we say things like <em>9 μs</em> and <em>11 μs</em> it could be off by up to 1 μs. This 1 μs “error” isn’t exactly randomly distributed, because the sampling interval is “exact” and in phase with the payload execution. So the samples are always at 1.0, 2.0, 3.0, … μs after the payload executes (plus or minus small variation on the order of 10 nanoseconds), so if the true time until halt is anywhere between 9.00 and 9.99 μs, we will always read 9 μs (because the time shown for the sample is the <em>end</em> of the sample and covers the preceding 1 μs). For the halt time, the scenario is reversed: the start of the interval is uncertain, but the end should be exact modulo the delay in taking the sample. <a href="#fnref:resnote" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:halted" role="doc-endnote">
<p>Generally you can detect halted periods when <code class="language-plaintext highlighter-rouge">rdtsc</code> jumps forward, but performance counters like “non-halted cycles” do not, and there are not indications of a larger interruption such as a context switch. You can find a similar case of halted periods <a href="https://stackoverflow.com/q/45472147/149138">in this question</a> which was also caused by frequency transitions (in this case, to obey the different active core count turbo ratios). <a href="#fnref:halted" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:thistle" role="doc-endnote">
<p>More precisely, the color is <a href="https://encycolorpedia.com/d8bfd8">thistle</a>. <a href="#fnref:thistle" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:confirm" role="doc-endnote">
<p>In particular, I have confirmed that these three samples all occur after the payload has executed and retired using the <em>period</em> column in the output, which indicates clearly which samples are before or after a given execution of the payload. The payload itself is followed by an <code class="language-plaintext highlighter-rouge">lfence</code> to ensure it has retired before taking further samples (and in any case the number of instructions per sample is too large to be accommodated by the <abbr title="Out-of-order execution allows CPUs to execute instructions out of order with respect to the source.">OoO</abbr> window). <a href="#fnref:confirm" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whynot" role="doc-endnote">
<p>Those who are still awake at this point might be wondering why introduce this new variant of the test now: why not just execute the payload instructions during the wait period in the original tests too? One reason is that by hot spinning on <code class="language-plaintext highlighter-rouge">rdtsc</code> we get somewhat more consistent results when we care only about measuring the frequency in that we almost always sample at exactly the specified period (plus or minus the <code class="language-plaintext highlighter-rouge">rdtsc</code> latency, more or less). When we execute the longish payload function during the wait, the sample point diverges a bit more from the ideal, and the number of spins per sample suffers more quantization effects (i.e., the pattern of spin counts might be 3,3,3,2,3,3,3,2… rather than 450,450,451,450…), which can sometimes lead to a slight sawtooth effect in the samples. <a href="#fnref:whynot" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:ideal" role="doc-endnote">
<p>In practice, we don’t exactly reach this ideal: we execute the payload function 2 or 3 times, for 2,000 or 3,000 <code class="language-plaintext highlighter-rouge">vpor</code> instructions, but there are about 600 additional instructions of overhead associated with taking a sample, so the overhead instructions are a significant portion. Probably 600 instructions is too many; I haven’t optimized that and it could likely be lowered significantly. However, we can also improve the ratio simply by decreasing the sampling resolution (i.e., increasing the sampling time). I selected 1 μs as a tradeoff between these competing factors. Note: since I wrote this footnote I optimized several things in the sampling loop, so measured IPCs are now very close to their theoretical values, but I guess this footnote still has value? <a href="#fnref:ideal" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:give" role="doc-endnote">
<p>The main uncertainty in the timing of the function itself concerns the boundary conditions: if we run this function 10 times <em>without</em> touching <code class="language-plaintext highlighter-rouge">zmm0</code> in between, the dependency on <code class="language-plaintext highlighter-rouge">zmm0</code> will be carried between functions and the total time will be very close to 10 x 1000 = 10,000 cycles. However, if some compiler generated code in between calls to the payload function happens to write to <code class="language-plaintext highlighter-rouge">zmm0</code>, breaking the dependency, the individual chains for each function no longer depend on each other, so some overlap is possible. The amount of this overlap is limited by the size of the RS, so the effect won’t be <em>huge</em> but it could be noticeable (with say 100 payload instructions rather than 1,000 it could be very significant). We basically sidestep this whole issue by putting an <code class="language-plaintext highlighter-rouge">lfence</code> between each call to the payload function, which forms an <em>execution barrier</em> between calls. <a href="#fnref:give" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:catchup" role="doc-endnote">
<p>The way the sampling works in this test could be described as <em>locked interval without skipping</em>. Here, <em>locked interval</em> means we calculate the next target sample time based on the previous <em>target</em> sample time (rather than the <em>actual</em> sample time), plus the period. So if we are sampling at 10 μs, the target sample times will always be 10 μs, 20 μs, etc. In particular, the series of target sample times doesn’t depend on what happens during the test, e.g., it doesn’t depend on the actual sample time: even if we actually end up sampling at time = 12 μs rather than 10, we target time = 20 μs as the next sample, not 12 + 10 = 22 μs. This raises the question of what happens when some delay (e.g., an interrupt, a frequency transition) causes us to miss more than one entire sample period.</p>
<p>E.g., with 10 μs resolution we just sampled at 90 μs, so the next sample target is 100 μs, but a delay causes the next sample to be taken at 125 μs. We are now behind! The next sample should occur at 110 μs, but of course that is in the past. The current test design still takes all the required samples, as quickly as possible (but with a minimum of one payload call if we are in the <em>extra payload period</em>) – that’s the <em>no skipping</em> part. In the current example, it means we would take subsequent samples quickly until we are caught up, at say 125 μs (target 110 μs), 126 μs (target 120 μs), 130 μs (target 130 μs), with that last sample being “on target”.</p>
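<p>This policy is compact enough to sketch in code (a hypothetical model, not the actual test harness; times are in μs and the work of taking a sample is modeled as instantaneous):</p>

```python
def sample_times(delays, period, end):
    """Locked interval: each target is the previous *target* plus the period,
    so targets are fixed at period, 2*period, ... regardless of when samples
    actually happen.  No skipping: late targets are still sampled, as quickly
    as possible, until we catch up."""
    samples = []   # list of (target_time, actual_time)
    now = 0
    target = period
    while target <= end:
        now = max(now, target)        # wait if early; continue at once if late
        now += delays.get(target, 0)  # e.g., an interrupt or frequency halt
        samples.append((target, now))
        target += period              # based on the target, not on 'now'
    return samples

# A 25 us delay around t=100 makes the next few samples late; they are taken
# back-to-back until the actual time catches up with the targets:
print(sample_times({100: 25}, period=10, end=130))
```

With the delay at t = 100, the targets 110 and 120 are both taken at t = 125 in this model, matching the catch-up behavior described above.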
<p>These samples occur quickly: very quickly in the case of normal samples which take less than 0.2 μs each, or more slowly in the case of the <em>extra payload</em> region, where the payload call bumps that to about 0.5 μs. So that’s what’s happening in the green region: we just had a frequency transition halt of ~10 μs, so we are ~10 samples behind, and the following samples are taken more quickly (as you can see because the data point markers are spaced more closely), with only a single payload call each (versus 2 or 3 usually). This changes the ratio of payload instructions to overhead instructions, which tends to bump the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> a bit (for the same reason the caught-up blue region almost shows <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> > 1). This effect can be reduced or eliminated by increasing the sampling period, since that reduces the number of catchup samples.</p>
<p>Incidentally, this also explains the oscillating pattern you see in the blue region: the ideal number of payload calls to get a 1 μs sample rate is ~2.5, so the sampling strategy tends to alternate between two and three calls. More calls means an <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> closer to 1, so the peaks of the oscillation are the two-call samples and the valleys the three-call samples. <a href="#fnref:catchup" class="reversefootnote" role="doc-backlink">↩</a></p>
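<p>A toy numerical model of this ratio effect (the payload figures come from the text; the overhead IPC of 2.0 is an assumed value, chosen only to be above 1):</p>

```python
# Measured IPC mixes a latency-bound payload (1,000 vpor instructions in
# ~1,000 cycles, i.e., IPC 1.0) with ~600 overhead instructions per sample
# that run at a higher IPC (assumed 2.0 here).
def measured_ipc(payload_calls, payload_instrs=1000, payload_cycles=1000,
                 overhead_instrs=600, overhead_ipc=2.0):
    instrs = payload_calls * payload_instrs + overhead_instrs
    cycles = payload_calls * payload_cycles + overhead_instrs / overhead_ipc
    return instrs / cycles

# Fewer payload calls per sample means the overhead is a larger fraction, so
# measured IPC is higher: 2-call samples form the peaks, 3-call the valleys.
print(round(measured_ipc(2), 3), round(measured_ipc(3), 3))
```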
</li>
<li id="fn:tput" role="doc-endnote">
<p>I’m not going to fully analyze the throughput case, but the thorough among us can <a href="/assets/avxfreq1/fig-ipc-zoomed-zmm-tput.svg">find the chart here</a>. Note that the overhead here cuts the other way: pushing the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> below the expected value of 2.0 (there are 2 512-bit vector units capable of executing <code class="language-plaintext highlighter-rouge">vpord</code>) because here the throughput limited instructions compete for ports with the overhead instructions. <a href="#fnref:tput" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:orisit" role="doc-endnote">
<p>Of course, another possibility is that something in the rest of the test loop uses <code class="language-plaintext highlighter-rouge">xmm</code> registers so they remain “hot”: they are “baseline” for x86-64 after all, so the compiler is free to use them without any special flags. We could test this theory with a more compact test loop audited to be free of any <code class="language-plaintext highlighter-rouge">xmm</code> use… but I’m not going to bother. I’m pretty sure these guys are powered up all the time as their use is pervasive in most x86-64 code. <a href="#fnref:orisit" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:upper" role="doc-endnote">
<p>Specifically, the part handling the second lane (bits 128:255) in the <code class="language-plaintext highlighter-rouge">ymm</code> case, and the upper 3 lanes (bits 128:511) in the <code class="language-plaintext highlighter-rouge">zmm</code> case. <a href="#fnref:upper" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:amd" role="doc-endnote">
<p>This isn’t far-fetched: that’s exactly how AMD Zen always executes 256-bit AVX instructions with its 128-bit units, and it’s similar to how <abbr title="Intel's Sandy Bridge architecture, aka 2nd Generation Intel Core i3,i5,i7">SNB</abbr> and <abbr title="Intel's Ivy Bridge architecture, aka 3rd Generation Intel Core i3,i5,i7">IVB</abbr> did the same for 256-bit load and store instructions. So using narrower vector units to implement wider instructions is definitely a thing. <a href="#fnref:amd" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:ewise" role="doc-endnote">
<p>These are usually called <em>cross lane</em> instructions (where a lane is 128 bits on Intel). For example, <code class="language-plaintext highlighter-rouge">vpor</code> is <em>not</em> like this: it is an <em>element-wise</em> operation where each output element (down to each bit, in this case) depends only on the element in the same position in the input vectors. On the other hand, <code class="language-plaintext highlighter-rouge">vpermd</code> is: each 32-bit element in the output can come from <em>any</em> position in the input vector (but it still behaves as element-wise wrt the mask register<sup id="fnref:bonus" role="doc-noteref"><a href="#fn:bonus" class="footnote" rel="footnote">41</a></sup>). <a href="#fnref:ewise" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lt" role="doc-endnote">
<p>The notation 4L4T means: “4 cycles of latency and 4 cycles of inverse throughput”. That is, a given instance of this instruction takes 4 cycles to finish (latency), and a new instance can start every 4 cycles (inverse throughput, hence a throughput of 0.25 per cycle). When the CPU is running normally, most single-<abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions have a latency of 1 (in lane), 3 (most cross-lane integer or shuffle ops) or 4 (most FP arithmetic). <a href="#fnref:lt" class="reversefootnote" role="doc-backlink">↩</a></p>
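<p>As a sanity check on the notation, here is the cycle arithmetic for a dependent chain versus independent instances (schematic counts, not measurements):</p>

```python
# Latency-bound: each instruction must wait for the previous result.
def chain_cycles(n, latency=4):
    return n * latency

# Throughput-bound: a new instance starts every inv_tput cycles, and the
# last one finishes 'latency' cycles after it starts.
def independent_cycles(n, inv_tput=4, latency=4):
    return (n - 1) * inv_tput + latency

# For a 4L4T instruction the two cases coincide: the inverse throughput
# already equals the latency, so a dependency chain costs nothing extra.
# Compare a hypothetical 4L1T instruction, where the chain is far slower.
print(chain_cycles(100), independent_cycles(100))              # 4L4T
print(chain_cycles(100), independent_cycles(100, inv_tput=1))  # 4L1T
```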
</li>
<li id="fn:alu" role="doc-endnote">
<p>I say <em>ALU instructions</em> here, but I strongly suspect it might be all instructions: you can certainly test that at home using the same test (the <code class="language-plaintext highlighter-rouge">vporxymm250_*</code> group of tests) but with other types of instructions such as loads replacing the <code class="language-plaintext highlighter-rouge">add</code>. I didn’t really test <em>all</em> ALU instructions either: just a few - but it is fair to assume that if <code class="language-plaintext highlighter-rouge">add</code> is slowed down, it is something generic, probably affecting at least all ALU stuff, not something instruction specific. <a href="#fnref:alu" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:ratentries" role="doc-endnote">
<p>Documentation claims 97 entries, but my testing seems to indicate they are not unified in Skylake: apparently only 64 can be used for ALU ops, and 33 for memory ops. <a href="#fnref:ratentries" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:vcc" role="doc-endnote">
<p>Try as I might, I can’t determine if this refers to <em>measured</em> core voltage, i.e., true voltage as determined at some sensor within the core, or <em>demanded</em> voltage, i.e., the voltage the processor wants right now based on the various power-relevant parameters, sometimes called the VID. In any case, we expect those values to track each other fairly closely, perhaps with some offset, and since we are looking for voltage <em>changes</em> either one works. That said, I am very interested if you know the answer to this question. <a href="#fnref:vcc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:whyp" role="doc-endnote">
<p>This payload time series is meant to show the exact same thing as the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> series in earlier charts: we just want an indication of when the Type 1 throttling starts and stops. I used <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> initially because it was easy (I didn’t have to instrument the payload section specially) - but it doesn’t work when reading volts because that measurement involves a ton of additional instructions and a user-kernel transition, so it throws the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr> off completely. So I went ahead and instrumented the payload section directly, so we can still see the throttling, but no way I want to go back and change the other plots that use <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr>. <a href="#fnref:whyp" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:cthrottle" role="doc-endnote">
<p>The throttling here is quite conservative I think: this is only a very small voltage change (less than 1%), so it is hard to believe that 4x throttling is <em>necessary.</em> It seems likely, for example, that cutting the dispatch rate in half would be enough to compensate for the missing 6 mV – but it is easy to imagine that just having a big conservative throttling number for all these voltage-too-low throttling scenarios is easy and safe, and these periods don’t occur often enough for it to really matter. <a href="#fnref:cthrottle" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:acc" role="doc-endnote">
<p>Essentially all modern Intel CPUs have varying maximum turbo ratios depending on the number of active cores (not halted or in a sleep state). E.g., my Skylake CPU can run at a max speed of 3.5, 3.3, 3.2 or 3.1 GHz if 1, 2, 3 or 4 cores are active, respectively. If only one core is currently running, at 3.5 GHz, and another core becomes active (e.g., because the scheduler found something to run) – the first core has to immediately transition down to 3.3 GHz and as we’ve seen above, it takes an ~8-10 μs halt to do so. When the other core stops running, the first can return to 3.5 GHz. If cores are flipping between inactive and active quickly enough, those halts add up and cut into your effective frequency. At some point, you might get less work done by trying to run at max turbo, versus just running at 3.3 GHz all the time, since in that case no halts need to be taken when the second core starts. Further explored <a href="https://stackoverflow.com/a/45592838/149138">over here</a>. <a href="#fnref:acc" class="reversefootnote" role="doc-backlink">↩</a></p>
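<p>The break-even intuition can be put into rough numbers (a toy model: the ~10 μs halt cost matches the measurements above, but the transition rates are purely illustrative):</p>

```python
# Effective frequency of a core nominally at turbo_ghz, when active-core-count
# changes force halts of halt_us each, transitions_per_sec times per second.
def effective_ghz(turbo_ghz, transitions_per_sec, halt_us=10):
    halted_fraction = transitions_per_sec * halt_us * 1e-6
    return turbo_ghz * (1 - halted_fraction)

# With enough core flapping, a fixed 3.3 GHz beats a nominal 3.5 GHz turbo:
for rate in (1000, 5000, 10000):
    print(rate, round(effective_ghz(3.5, rate), 2))
```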
</li>
<li id="fn:hiddenbo" role="doc-endnote">
<p>Avoiding these transitions is a hidden bonus of making the 1 and 2-core turbos the same, and more generally of “grouping” the turbo ratios in a more coarse-grained way across core counts: you don’t need any transition when the “from” and “to” core counts have the same max turbo ratio. <a href="#fnref:hiddenbo" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:recovery" role="doc-endnote">
<p>Earlier we mentioned a low frequency duration of 650 μs, but that was the test that ran only a single payload instruction. The recovery period is measured from the <em>last</em> wide instruction and in this test (that shows the <abbr title="Instructions per cycle: calculated over an interval by measuring the number of instructions executed and the duration in cycles.">IPC</abbr>) we execute 100 μs of payload, so the recovery time will be 100 μs + 650 μs = ~750 μs. <a href="#fnref:recovery" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:heavy" role="doc-endnote">
<p>Unlike the transitions discussed here, transitions related to heavy instructions are <em>soft</em> transitions: they do not occur unconditionally after a single instruction of the relevant type is executed, but rather only after some <em>density</em> threshold is reached for those instructions. Exploring this threshold would be interesting. There is another effect mentioned in the Intel optimization manual: heavy instructions may cause a license transition even in cases where it wouldn’t normally occur, when light instructions of one license level are mixed with <em>half-width</em> heavy instructions. That is, 128-bit heavy instructions can use the fastest L0 license, as can 256-bit light instructions. However, apparently, mixing these can cause a request for the L1 license. Similarly for 256-bit heavy instructions and 512-bit light instructions, where a downgrade from L1 to L2 could occur. <a href="#fnref:heavy" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:t1deets" role="doc-endnote">
<p>I give a range of 8 to 20 μs because that’s what I measured in my testing, but the highest frequency at which I measured a voltage-only transition was 3.5 GHz, with a 15 mV delta. It is entirely possible that at higher frequencies and voltages, the times are much longer. It could also depend on the hardware, e.g., the presence or absence of a <abbr title="Fully Integrated Voltage Regulator">FIVR</abbr>. <a href="#fnref:t1deets" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fdeets" role="doc-endnote">
<p>This transition time seems to be required by <em>any</em> frequency transition, whether up or down in frequency and regardless of the cause. This includes transition causes not tested here: for example, when the max turbo ratio changes because the active core count changes, or when the ideal frequency changes for any other reason. <a href="#fnref:fdeets" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lldeets" role="doc-endnote">
<p>This same relaxation period appears to apply to both of the transition types discussed in this post, i.e., both voltage and frequency. The relaxation timer is reset any time an instruction that needs the current license is executed. In this case, the 680 μs period is measured from the instruction that causes the transition (which is also the last relevant instruction, since only a single payload instruction is used) until the time that the CPU resumes executing again at the higher frequency. This period includes one dispatch throttling period and two frequency transitions, so only about 650 μs of the 680 μs is spent executing instructions at full speed. <a href="#fnref:lldeets" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bonus" role="doc-endnote">
<p>Bonus question: are there <em>any</em> single-<abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> AVX/AVX-512 instructions which are <abbr title="A SIMD operation whose lane-wise output depends on elements from lanes other than the same lane in the inputs (lanes are 128 bits in x86).">cross-lane</abbr> in at least two of their inputs? There are 3-input shuffles, like <code class="language-plaintext highlighter-rouge">VPERMI2B</code>, which have 2 of their 3 inputs as <abbr title="A SIMD operation whose lane-wise output depends on elements from lanes other than the same lane in the inputs (lanes are 128 bits in x86).">cross-lane</abbr> (the two input tables), but they need 2 uops. <a href="#fnref:bonus" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

Travis Downs (travis.downs@gmail.com)
Investigating some details of SIMD related frequency transitions on Intel CPUs.

A Note on Mask Registers
2019-12-05T16:30:00+00:00
https://travisdowns.github.io/blog/2019/12/05/kreg-facts

<p>AVX-512 introduced eight so-called <em>mask registers</em><sup id="fnref:naming" role="doc-noteref"><a href="#fn:naming" class="footnote" rel="footnote">1</a></sup>, <code class="language-plaintext highlighter-rouge">k0</code><sup id="fnref:k0note" role="doc-noteref"><a href="#fn:k0note" class="footnote" rel="footnote">2</a></sup> through <code class="language-plaintext highlighter-rouge">k7</code>, which apply to most ALU operations and let you perform a zero-masking or merging<sup id="fnref:maskmerge" role="doc-noteref"><a href="#fn:maskmerge" class="footnote" rel="footnote">3</a></sup> operation on a per-element basis, speeding up code that would otherwise require extra blending operations in AVX2 and earlier.</p>
<p>If that single sentence doesn’t immediately indoctrinate you into the mask register religion, here’s a copy and paste from <a href="https://en.wikipedia.org/wiki/AVX-512#Opmask_registers">Wikipedia</a> that should fill in the gaps and close the deal:</p>
<blockquote>
<p>Most AVX-512 instructions may indicate one of 8 opmask registers (k0–k7). For instructions which use a mask register as an opmask, register <code class="language-plaintext highlighter-rouge">k0</code> is special: a hardcoded constant used to indicate unmasked operations. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, <code class="language-plaintext highlighter-rouge">k0</code> is a functioning, valid register. In most instructions, the opmask is used to control which values are written to the destination. A flag controls the opmask behavior, which can either be “zero”, which zeros everything not selected by the mask, or “merge”, which leaves everything not selected untouched. The merge behavior is identical to the blend instructions.</p>
</blockquote>
<p>So mask registers<sup id="fnref:kreg" role="doc-noteref"><a href="#fn:kreg" class="footnote" rel="footnote">4</a></sup> are important, but are not household names unlike say general purpose registers (<code class="language-plaintext highlighter-rouge">eax</code>, <code class="language-plaintext highlighter-rouge">rsi</code> and friends) or <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers (<code class="language-plaintext highlighter-rouge">xmm0</code>, <code class="language-plaintext highlighter-rouge">ymm5</code>, etc). They certainly aren’t going to show up on Intel slides disclosing the size of <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr> resources, like these:</p>
<p><img src="/assets/kreg/min/intel-skx-slide.png" alt="Intel Slide" class="invert-rotate-img" /></p>
<p><br /></p>
<p>In particular, I don’t think the size of the mask register physical register file (<abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>) has ever been reported. Let’s fix that today.</p>
<p>We use an updated version of the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size <a href="https://github.com/travisdowns/robsize">probing tool</a> originally authored and <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">described by Henry Wong</a><sup id="fnref:hcite" role="doc-noteref"><a href="#fn:hcite" class="footnote" rel="footnote">5</a></sup> (hereafter simply <em>Henry</em>), who used it to probe the size of various documented and undocumented <abbr title="Out-of-order execution allows CPUs to execute instructions out of order with respect to the source.">out-of-order</abbr> structures on earlier architectures. If you haven’t already read that post, stop now and do it. This post will be here when you get back.</p>
<p>You’ve already read Henry’s blog for a full description (right?), but for the naughty among you here’s the fast food version:</p>
<h4 id="fast-food-method-of-operation">Fast Food Method of Operation</h4>
<p>We separate two cache miss load instructions<sup id="fnref:misstime" role="doc-noteref"><a href="#fn:misstime" class="footnote" rel="footnote">6</a></sup> by a variable number of <em>filler instructions</em> which vary based on the CPU resource we are probing. When the number of filler instructions is small enough, the two cache misses execute in parallel and their latencies are overlapped so the total execution time is roughly<sup id="fnref:roughly" role="doc-noteref"><a href="#fn:roughly" class="footnote" rel="footnote">7</a></sup> as long as a single miss.</p>
<p>However, once the number of filler instructions reaches a critical threshold, all of the targeted resources are consumed and instruction allocation stalls before the second miss is issued, so the cache misses can no longer run in parallel. This causes the runtime to spike to about twice the baseline cache miss latency.</p>
<p>Finally, we ensure that each filler instruction consumes exactly one unit of the resource we are interested in, so that the location of the spike indicates the size of the underlying resource. For example, regular <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> instructions usually consume one physical register from the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>, so they are a good choice for measuring the size of that resource.</p>
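<p>The expected shape of the results can be sketched as a step function (a deliberately simplified model: the 100 ns miss latency and 0.25 ns per filler instruction are illustrative numbers, and real runs also show a gradual transition region near the limit, discussed below):</p>

```python
# Two cache misses separated by n filler instructions, each filler consuming
# one unit of a resource with 'size' entries: once the resource is exhausted
# before the second miss issues, the misses serialize and runtime jumps by
# roughly one full miss latency.
def runtime_ns(n_filler, size, miss_ns=100, filler_ns=0.25):
    overlapped = n_filler <= size
    return (1 if overlapped else 2) * miss_ns + n_filler * filler_ns

print(runtime_ns(134, 134), runtime_ns(135, 134))  # spike right past the size
```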
<h4 id="mask-register-prf-size">Mask Register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> Size</h4>
<p>Here, we use filler instructions that write a mask register, so we can measure the size of the mask register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>.</p>
<p>To start, we use a series of <code class="language-plaintext highlighter-rouge">kaddd k1, k2, k3</code> instructions, as such (shown for 16 filler instructions):</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span> <span class="c1">; first cache miss load</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rdx</span><span class="p">]</span> <span class="c1">; second cache miss load</span>
<span class="nf">lfence</span> <span class="c1">; stop issue until the above block completes</span>
<span class="c1">; this block is repeated 16 more times</span>
</code></pre></div></div>
<p>Each <code class="language-plaintext highlighter-rouge">kaddd</code> instruction consumes one physical mask register. If the number of filler instructions is less than or equal to the number of available physical mask registers, we expect the misses to happen in parallel; otherwise, the misses will be resolved serially. So at that point we expect to see a large spike in the running time.</p>
<p>That’s exactly what we see:</p>
<p><img src="/assets/kreg/min/skx-27.svg" alt="Test 27 kaddd instructions" /></p>
<p>Let’s zoom in on the critical region, where the spike occurs:</p>
<p><img src="/assets/kreg/min/skx-27-zoomed.svg" alt="Test 27 zoomed" /></p>
<p>Here we clearly see that the transition isn’t <em>sharp</em>: when the filler instruction count is between 130 and 134, the runtime is intermediate, falling between the low and high levels. Henry calls this <em>non-ideal</em> behavior and I have seen it repeatedly across many (but not all) of these resource size tests. The idea is that the hardware implementation doesn’t always allow all of the resources to be used as you approach the limit<sup id="fnref:nonideal" role="doc-noteref"><a href="#fn:nonideal" class="footnote" rel="footnote">8</a></sup>: sometimes you get to use every last resource, but in other cases you may hit the limit a few filler instructions before the theoretical maximum.</p>
<p>Under this assumption, we want to look at the last (rightmost) point which is still faster than the slow performance level, since it indicates that <em>sometimes</em> that many resources are available, implying that at least that many are physically present. Here, we see that final point occurs at 134 filler instructions.</p>
<p>So we conclude that <em><abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> has 134 physical registers available to hold speculative mask register values</em>. As Henry indicates in the original post, it is likely that there are 8 physical registers dedicated to holding the non-speculative architectural state of the 8 mask registers, so our best guess at the total size of the mask register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> is 142. That’s somewhat smaller than the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> (180 entries) or the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> (168 entries), but still quite large (see <a href="/blog/2019/06/11/speed-limits.html#ooo-table">this table of out of order resource sizes</a> for sizes on other platforms).</p>
<p>In particular, it is definitely large enough that you aren’t likely to run into this limit in practical code: it’s hard to imagine non-contrived code where almost 60%<sup id="fnref:twothirds" role="doc-noteref"><a href="#fn:twothirds" class="footnote" rel="footnote">9</a></sup> of the instructions <em>write</em><sup id="fnref:write" role="doc-noteref"><a href="#fn:write" class="footnote" rel="footnote">10</a></sup> to mask registers, because that’s what you’d need to hit this limit.</p>
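<p>That 60% figure comes from comparing the measured speculative mask PRF size against the out-of-order window: Skylake's ROB holds 224 entries, so keeping 134 mask-writing instructions in flight means well over half the window must write mask registers:</p>

```python
speculative_prf = 134  # measured above
rob_entries = 224      # Skylake re-order buffer size
print(round(speculative_prf / rob_entries * 100))  # percent of the ROB window
```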
<h4 id="are-they-distinct-prfs">Are They Distinct PRFs?</h4>
<p>You may have noticed that so far I’m simply <em>assuming</em> that the mask register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> is distinct from the others. I think this is highly likely, given the way mask registers are used and since they are part of a disjoint renaming domain<sup id="fnref:rename" role="doc-noteref"><a href="#fn:rename" class="footnote" rel="footnote">11</a></sup>. It is also supported by the fact that the apparent mask register PRF size doesn’t match either the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> or <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> sizes, but we can go further and actually test it!</p>
<p>To do that, we use a similar test to the above, but with the filler instructions alternating between the same <code class="language-plaintext highlighter-rouge">kaddd</code> instruction as the original test and an instruction that uses either a <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> or <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> register. If the register file is shared, we expect to hit a limit at the size of the shared <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>. If the PRFs are not shared, we expect that neither <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> limit will be hit, and we will instead hit a different limit such as the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size.</p>
<p><a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-29.asm">Test 29</a> alternates <code class="language-plaintext highlighter-rouge">kaddd</code> and scalar <code class="language-plaintext highlighter-rouge">add</code> instructions, like this:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
<span class="nf">add</span> <span class="nb">ebx</span><span class="p">,</span><span class="nb">ebx</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">esi</span><span class="p">,</span><span class="nb">esi</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">ebx</span><span class="p">,</span><span class="nb">ebx</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">esi</span><span class="p">,</span><span class="nb">esi</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">ebx</span><span class="p">,</span><span class="nb">ebx</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">esi</span><span class="p">,</span><span class="nb">esi</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">add</span> <span class="nb">ebx</span><span class="p">,</span><span class="nb">ebx</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rdx</span><span class="p">]</span>
<span class="nf">lfence</span>
</code></pre></div></div>
<p>Here’s the chart:</p>
<p><img src="/assets/kreg/min/skx-29.svg" alt="Test 29: alternating kaddd and scalar add" /></p>
<p>We see that the spike is at a filler count larger than both the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and mask <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> sizes. So we can conclude that the mask and <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> PRFs are not shared.</p>
<p>Maybe the mask register is shared with the <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr>? After all, mask registers are more closely associated with <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions than general purpose ones, so maybe there is some synergy there.</p>
<p>To check, here’s <a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-35.asm">Test 35</a>, which is similar to 29 except that it alternates between <code class="language-plaintext highlighter-rouge">kaddd</code> and <code class="language-plaintext highlighter-rouge">vxorps</code>, like so:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rcx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rcx</span><span class="p">]</span>
<span class="nf">vxorps</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm1</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm5</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm6</span><span class="p">,</span><span class="nv">ymm6</span><span class="p">,</span><span class="nv">ymm7</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm0</span><span class="p">,</span><span class="nv">ymm1</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm2</span><span class="p">,</span><span class="nv">ymm3</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">vxorps</span> <span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm4</span><span class="p">,</span><span class="nv">ymm5</span>
<span class="nf">kaddd</span> <span class="nv">k1</span><span class="p">,</span><span class="nv">k2</span><span class="p">,</span><span class="nv">k3</span>
<span class="nf">mov</span> <span class="nb">rdx</span><span class="p">,</span><span class="kt">QWORD</span> <span class="nv">PTR</span> <span class="p">[</span><span class="nb">rdx</span><span class="p">]</span>
<span class="nf">lfence</span>
</code></pre></div></div>
<p>Here’s the corresponding chart:</p>
<p><img src="/assets/kreg/min/skx-35.svg" alt="Test 35: alternating kaddd and SIMD xor" /></p>
<p>The behavior is basically identical to the prior test, so we conclude that there is no direct sharing between the mask register and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> PRFs either.</p>
<p class="warning">This turned out not to be the end of the story. The mask registers <em>are</em> shared, just not with the general purpose or SSE/AVX register file. For all the details, see this <a href="/blog/2020/05/26/kreg2.html">follow up post</a>.</p>
<h4 id="an-unresolved-puzzle">An Unresolved Puzzle</h4>
<p>Something we notice in both of the above tests, however, is that the spike seems to finish around 212 filler instructions, while the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size for this microarchitecture is 224. Is this just <em>non-ideal behavior</em> as we saw earlier? Well, we can test this by comparing against Test 4, which just uses <code class="language-plaintext highlighter-rouge">nop</code> instructions as the filler: these should consume almost no resources beyond <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> entries. Here’s Test 4 (<code class="language-plaintext highlighter-rouge">nop</code> filler) versus Test 29 (alternating <code class="language-plaintext highlighter-rouge">kaddd</code> and scalar <code class="language-plaintext highlighter-rouge">add</code>):</p>
<p><img src="/assets/kreg/min/skx-4-29.svg" alt="Test 4 vs 29" /></p>
<p>The <code class="language-plaintext highlighter-rouge">nop</code>-using <a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-4.asm">Test 4</a> <em>nails</em> the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size at exactly 224 (these charts are SVG so feel free to “View Image” and zoom in to confirm). So it seems that we hit some other limit around 212 when we mix mask and <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> registers, or when we mix mask and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers. In fact the same limit applies even between <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers, if we compare Test 4 and <a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-21.asm">Test 21</a> (which mixes <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> adds with <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> <code class="language-plaintext highlighter-rouge">vxorps</code>):</p>
<p><img src="/assets/kreg/min/skx-4-21.svg" alt="Test 4 vs 21" /></p>
<p>Henry mentions a more extreme version of the same thing in the original <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">blog entry</a>, in the section also headed <strong>Unresolved Puzzle</strong>:</p>
<blockquote>
<p>Sandy Bridge AVX or SSE interleaved with integer instructions seems to be limited to looking ahead ~147 instructions by something other than the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr>. Having tried other combinations (e.g., varying the ordering and proportion of AVX vs. integer instructions, inserting some NOPs into the mix), it seems as though both SSE/AVX and integer instructions consume registers from some form of shared pool, as the instruction window is always limited to around 147 regardless of how many of each type of instruction are used, as long as neither type exhausts its own <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> supply on its own.</p>
</blockquote>
<p>Read the full section for all the details. The effect is similar here but smaller: we at least get 95% of the way to the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size, but still stop before it. It is possible the shared resource is related to register reclamation, e.g., the PRRT<sup id="fnref:prrt" role="doc-noteref"><a href="#fn:prrt" class="footnote" rel="footnote">12</a></sup>, a table which keeps track of which registers can be reclaimed when a given instruction retires.</p>
<p>Finally, we finish this party off with a few miscellaneous notes on mask registers, checking for parity with some features available to <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers.</p>
<h3 id="move-elimination">Move Elimination</h3>
<p>Both <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> registers are eligible for so-called <em>move elimination</em>. This means that a register to register move like <code class="language-plaintext highlighter-rouge">mov eax, edx</code> or <code class="language-plaintext highlighter-rouge">vmovdqu ymm1, ymm2</code> can be eliminated at rename by “simply”<sup id="fnref:simply" role="doc-noteref"><a href="#fn:simply" class="footnote" rel="footnote">13</a></sup> pointing the destination register entry in the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> to the same physical register as the source, without involving the ALU.</p>
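<p>To make the payoff concrete, here’s a hypothetical latency-chain kernel (not one of the numbered robsize tests) that distinguishes eliminated from non-eliminated moves:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add eax, 1    ; 1 cycle of real ALU latency, carried through eax
mov edx, eax  ; candidate for move elimination
mov eax, edx  ; candidate for move elimination
; repeated many times
</code></pre></div></div>
<p>If the moves are eliminated, each iteration of the carried chain costs only the 1 cycle of the <code class="language-plaintext highlighter-rouge">add</code>; if they instead execute as ordinary 1-cycle ALU moves, the chain costs about 3 cycles per iteration.</p>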
<p>Let’s check if something like <code class="language-plaintext highlighter-rouge">kmov k1, k2</code> also qualifies for move elimination. First, we check the chart for <a href="https://github.com/travisdowns/robsize/blob/fb039f212f1364e2e65b8cb2a0c3f8023c85777f/asm-gold/asm-28.asm">Test 28</a>, where the filler instruction is <code class="language-plaintext highlighter-rouge">kmovd k1, k2</code>:</p>
<p><img src="/assets/kreg/min/skx-28.svg" alt="Test 28" /></p>
<p>It looks exactly like Test 27, which we saw earlier with <code class="language-plaintext highlighter-rouge">kaddd</code>. So we would suspect that physical registers are being consumed, unless we happen to have hit a different move-elimination-related limit with exactly the same size and limiting behavior<sup id="fnref:moves" role="doc-noteref"><a href="#fn:moves" class="footnote" rel="footnote">14</a></sup>.</p>
<p>Additional confirmation comes from uops.info, which <a href="https://uops.info/table.html?search=kmov%20(K%2C%20K)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKX=on&cb_measurements=on&cb_avx512=on">shows that</a> all variants of mask to mask register <code class="language-plaintext highlighter-rouge">kmov</code> take one <abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> dispatched to <abbr title="port 0 (GP and SIMD ALU, not-taken branches)">p0</abbr>. If the move were eliminated, we wouldn’t see any dispatched uops.</p>
<p>Therefore I conclude that register to register<sup id="fnref:regreg" role="doc-noteref"><a href="#fn:regreg" class="footnote" rel="footnote">15</a></sup> moves involving mask registers are not eliminated.</p>
<h3 id="dependency-breaking-idioms">Dependency Breaking Idioms</h3>
<p>The <a href="https://stackoverflow.com/a/33668295/149138">best way</a> to set a <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> register to zero in x86 is via the xor zeroing idiom: <code class="language-plaintext highlighter-rouge">xor reg, reg</code>. This works because any value xored with itself is zero. This is smaller (fewer instruction bytes) than the more obvious <code class="language-plaintext highlighter-rouge">mov eax, 0</code>, and also faster since the processor recognizes it as a <em>zeroing idiom</em> and performs the necessary work at rename<sup id="fnref:zero" role="doc-noteref"><a href="#fn:zero" class="footnote" rel="footnote">16</a></sup>, so no ALU is involved and no <abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> is dispatched.</p>
<p>Furthermore, the idiom is <em>dependency breaking:</em> although <code class="language-plaintext highlighter-rouge">xor reg1, reg2</code> in general depends on the value of both <code class="language-plaintext highlighter-rouge">reg1</code> and <code class="language-plaintext highlighter-rouge">reg2</code>, in the special case that <code class="language-plaintext highlighter-rouge">reg1</code> and <code class="language-plaintext highlighter-rouge">reg2</code> are the same, there is no dependency as the result is zero regardless of the inputs. All modern x86 CPUs recognize this<sup id="fnref:otherzero" role="doc-noteref"><a href="#fn:otherzero" class="footnote" rel="footnote">17</a></sup> special case for <code class="language-plaintext highlighter-rouge">xor</code>. The same applies to <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> versions of xor such as integer <a href="https://www.felixcloutier.com/x86/pxor"><code class="language-plaintext highlighter-rouge">vpxor</code></a> and floating point <a href="https://www.felixcloutier.com/x86/xorps"><code class="language-plaintext highlighter-rouge">vxorps</code></a> and <a href="https://www.felixcloutier.com/x86/xorpd"><code class="language-plaintext highlighter-rouge">vxorpd</code></a>.</p>
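<p>As a quick refresher on the <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> case (instruction sizes are for the 32-bit register forms):</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mov eax, 0              ; 5 bytes, an ordinary instruction
xor eax, eax            ; 2 bytes, recognized zeroing idiom: handled at
                        ; rename, no uop dispatched, breaks the
                        ; dependency on eax
xor eax, ebx            ; not an idiom: a real xor that depends on both inputs
vpxor xmm0, xmm0, xmm0  ; SIMD equivalent, also recognized
</code></pre></div></div>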
<p>That background out of the way, a curious person might wonder if the <code class="language-plaintext highlighter-rouge">kxor</code> <a href="https://www.felixcloutier.com/x86/kxorw:kxorb:kxorq:kxord">variants</a> are treated the same way. Is <code class="language-plaintext highlighter-rouge">kxorb k1, k1, k1</code><sup id="fnref:notall" role="doc-noteref"><a href="#fn:notall" class="footnote" rel="footnote">18</a></sup> treated as a zeroing idiom?</p>
<p>This is actually two separate questions, since there are two aspects to zeroing idioms:</p>
<ul>
<li>Zero latency execution with no execution unit (elimination)</li>
<li>Dependency breaking</li>
</ul>
<p>Let’s look at each in turn.</p>
<h4 id="execution-elimination">Execution Elimination</h4>
<p>So are zeroing xors like <code class="language-plaintext highlighter-rouge">kxorb k1, k1, k1</code> executed at rename without latency and without needing an execution unit?</p>
<p>No.</p>
<p>Here, I don’t even have to do any work: uops.info has our back because they’ve performed <a href="https://uops.info/html-tp/SKX/KXORD_K_K_K-Measurements.html#sameReg">this exact test</a> and report a latency of 1 cycle and one <abbr title="port 0 (GP and SIMD ALU, not-taken branches)">p0</abbr> <abbr title="Micro-operation: instructions are translated into one or more uops, which are simple operations executed by the CPU's execution units.">uop</abbr> used. So we can conclude that zeroing xors of mask registers are not eliminated.</p>
<h4 id="dependency-breaking">Dependency Breaking</h4>
<p>Well maybe zeroing kxors are dependency breaking, even though they require an execution unit?</p>
<p>In this case, we can’t simply check uops.info. <code class="language-plaintext highlighter-rouge">kxor</code> is a one-cycle latency instruction that runs only on a single execution port (<abbr title="port 0 (GP and SIMD ALU, not-taken branches)">p0</abbr>), so we hit the interesting (?) case where a chain of <code class="language-plaintext highlighter-rouge">kxor</code> runs at the same speed regardless of whether they are dependent or independent: the throughput bottleneck of 1/cycle is the same as the latency bottleneck of 1/cycle!</p>
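<p>To see the problem concretely, both of the following (illustrative) sequences run at about one <code class="language-plaintext highlighter-rouge">kxorb</code> per cycle on <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr>, so a naive measurement can’t tell them apart:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>; dependent chain: each kxorb reads the previous result,
; so we are bound by the 1-cycle latency
kxorb k1, k1, k2
kxorb k1, k1, k2
; ... and so on

; independent operations: no dependencies, but kxorb only
; runs on p0, so we are bound by the 1/cycle port throughput
kxorb k1, k2, k3
kxorb k4, k5, k6
; ... and so on
</code></pre></div></div>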
<p>Don’t worry, we’ve got other tricks up our sleeve. We can test this by constructing a test which involves a <code class="language-plaintext highlighter-rouge">kxor</code> in a carried dependency chain with enough total latency that the chain latency is the bottleneck. If the <code class="language-plaintext highlighter-rouge">kxor</code> carries a dependency, the runtime will be equal to the sum of the latencies in the chain. If the instruction is dependency breaking, the chain is broken: the disconnected chains can overlap, and performance will likely be limited by some throughput restriction (e.g., <a href="/blog/2019/06/11/speed-limits.html#portexecution-unit-limits">port contention</a>). This could use a good diagram, but I’m not good at diagrams.</p>
<p>All the tests are in <a href="https://github.com/travisdowns/uarch-bench/blob/ccbebbec39ab02d6460a1837857d052e120c0946/x86_avx512.asm#L20"><abbr title="Microarchitecture: a specific implementation of an ISA, e.g., "Haswell microarchitecture".">uarch</abbr> bench</a>, but I’ll show the key parts here.</p>
<p>First we get a baseline measurement for the latency of moving from a mask register to a <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> register and back:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nf">kmovb</span> <span class="nb">eax</span><span class="p">,</span> <span class="nv">k0</span>
<span class="c1">; repeated 127 more times</span>
</code></pre></div></div>
<p>This pair clocks in<sup id="fnref:runit" role="doc-noteref"><a href="#fn:runit" class="footnote" rel="footnote">19</a></sup> at 4 cycles. It’s hard to know how to partition the latency between the two instructions: are they each 2 cycles, or is there a 3-1 split one way or the other<sup id="fnref:fyiuops" role="doc-noteref"><a href="#fn:fyiuops" class="footnote" rel="footnote">20</a></sup>? For our purposes it doesn’t matter, because we just care about the latency of the round-trip. Importantly, the port-based throughput limit of this sequence is 1/cycle, 4x faster than the latency limit, because each instruction goes to a different port (<abbr title="port 5 (GP and SIMD ALU, vector shuffles)">p5</abbr> and <abbr title="port 0 (GP and SIMD ALU, not-taken branches)">p0</abbr>, respectively). This means we will be able to tease out latency effects independent of throughput.</p>
<p>Next, we throw a <code class="language-plaintext highlighter-rouge">kxor</code> into the chain that we know is <em>not</em> zeroing:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nf">kxorb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nv">k0</span><span class="p">,</span> <span class="nv">k1</span>
<span class="nf">kmovb</span> <span class="nb">eax</span><span class="p">,</span> <span class="nv">k0</span>
<span class="c1">; repeated 127 more times</span>
</code></pre></div></div>
<p>Since <a href="https://uops.info/table.html?search=kxorb&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKX=on&cb_measurements=on&cb_avx512=on">we know</a> <code class="language-plaintext highlighter-rouge">kxorb</code> has 1 cycle of latency, we expect to increase the latency to 5 cycles and that’s exactly what we measure (the first two tests shown):</p>
<pre>
** Running group avx512 : AVX512 stuff **
Benchmark Cycles Nanos
kreg-GP roundtrip latency 4.00 1.25
kreg-GP roundtrip + nonzeroing kxorb 5.00 1.57
</pre>
<p>Finally, the key test:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nf">kxorb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nv">k0</span><span class="p">,</span> <span class="nv">k0</span>
<span class="nf">kmovb</span> <span class="nb">eax</span><span class="p">,</span> <span class="nv">k0</span>
<span class="c1">; repeated 127 more times</span>
</code></pre></div></div>
<p>This has a zeroing <code class="language-plaintext highlighter-rouge">kxorb k0, k0, k0</code>. If it breaks the dependency on k0, it would mean that the <code class="language-plaintext highlighter-rouge">kmovb eax, k0</code> no longer depends on the earlier <code class="language-plaintext highlighter-rouge">kmovb k0, eax</code>: the carried chain would be broken and we’d see a lower cycle time.</p>
<p>Drumroll…</p>
<p>We measure this at the exact same 5.0 cycles as the prior example:</p>
<pre>
** Running group avx512 : AVX512 stuff **
Benchmark Cycles Nanos
kreg-GP roundtrip latency 4.00 1.25
kreg-GP roundtrip + nonzeroing kxorb 5.00 1.57
<span style="background: green;"> kreg-GP roundtrip + zeroing kxorb 5.00 1.57</span>
</pre>
<p>So we tentatively conclude that zeroing idioms aren’t recognized at all when they involve mask registers.</p>
<p>Finally, as a check on our logic, we use the following test which replaces the <code class="language-plaintext highlighter-rouge">kxor</code> with a <code class="language-plaintext highlighter-rouge">kmov</code> which we know is <em>always</em> dependency breaking:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">eax</span>
<span class="nf">kmovb</span> <span class="nv">k0</span><span class="p">,</span> <span class="nb">ecx</span>
<span class="nf">kmovb</span> <span class="nb">eax</span><span class="p">,</span> <span class="nv">k0</span>
<span class="c1">; repeated 127 more times</span>
</code></pre></div></div>
<p>This is the final result shown in the output above, and it runs much more quickly at 2 cycles, bottlenecked on <abbr title="port 5 (GP and SIMD ALU, vector shuffles)">p5</abbr> (the two <code class="language-plaintext highlighter-rouge">kmov k, r32</code> instructions both go only to <abbr title="port 5 (GP and SIMD ALU, vector shuffles)">p5</abbr>):</p>
<pre>
** Running group avx512 : AVX512 stuff **
Benchmark Cycles Nanos
kreg-GP roundtrip latency 4.00 1.25
kreg-GP roundtrip + nonzeroing kxorb 5.00 1.57
kreg-GP roundtrip + zeroing kxorb 5.00 1.57
<span style="background: green;"> kreg-GP roundtrip + mov from GP 2.00 0.63</span>
</pre>
<p>So our experiment seems to check out.</p>
<h3 id="reproduction">Reproduction</h3>
<p>You can reproduce these results yourself with the <a href="https://github.com/travisdowns/robsize">robsize</a> binary on Linux or Windows (using WSL). The specific results for this article are <a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/results">also available</a> as are the <a href="https://github.com/travisdowns/robsize/tree/master/scripts/kreg/">scripts</a> used to collect them and generate the plots.</p>
<h3 id="summary">Summary</h3>
<ul>
<li><abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> has a separate <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> for mask registers with a speculative size of 134 and an estimated total size of 142</li>
<li>This is large enough compared to the other <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> size and the <abbr title="Re-order buffer: n ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> to make it unlikely to be a bottleneck</li>
<li>Mask registers are not eligible for move elimination</li>
<li>Zeroing idioms<sup id="fnref:tech" role="doc-noteref"><a href="#fn:tech" class="footnote" rel="footnote">21</a></sup> in mask registers are not recognized for execution elimination or dependency breaking</li>
</ul>
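<p>As a concrete illustration of the last three bullets, here is a rough sketch contrasting general-purpose registers, where these rename-time tricks apply, with mask registers, where they don't (register choices are arbitrary):</p>

```nasm
; general-purpose registers on SKX:
xor   eax, eax       ; recognized zeroing idiom: dependency-breaking,
                     ; handled at rename without using an execution unit
mov   ebx, eax       ; eligible for move elimination at rename

; mask registers on SKX:
kxorb k1, k1, k1     ; NOT recognized as zeroing: executes normally and
                     ; still carries a dependency on the old value of k1
kmovb k2, k1         ; NOT move-eliminated: occupies an execution unit
```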
<h3 id="part-ii">Part II</h3>
<p>I didn’t expect it to happen, but it did: there is a <a href="/blog/2020/05/26/kreg2.html">follow up post</a> about mask registers, where we (roughly) confirm the register file size by looking at an image of a <abbr title="Intel's Skylake (server) architecture including Skylake-SP, Skylake-X and Skylake-W">SKX</abbr> CPU captured via microscope, and make an interesting discovery regarding sharing.</p>
<h3 id="comments">Comments</h3>
<p>Discussion on <a href="https://news.ycombinator.com/item?id=21714390">Hacker News</a>, Reddit (<a href="https://www.reddit.com/r/asm/comments/e6kokb/x86_avx512_a_note_on_mask_registers/">r/asm</a> and <a href="https://www.reddit.com/r/programming/comments/e6ko7i/a_note_on_mask_registers_avx512/">r/programming</a>) or <a href="https://twitter.com/trav_downs/status/1202637229606264833">Twitter</a>.</p>
<p>Direct feedback also welcomed by <a href="mailto:travis.downs@gmail.com">email</a> or as <a href="https://github.com/travisdowns/travisdowns.github.io/issues">a GitHub issue</a>.</p>
<h3 id="thanks">Thanks</h3>
<p><a href="https://lemire.me">Daniel Lemire</a>, who provided access to the AVX-512 system I used for testing.</p>
<p><a href="http://www.stuffedcow.net/">Henry Wong</a>, who wrote the <a href="http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/">original article</a> which introduced me to this technique and graciously shared the code for his tool, which I now <a href="https://github.com/travisdowns/robsize">host on github</a>.</p>
<p><a href="https://twitter.com/Jeffinatorator/status/1202642436406669314">Jeff Baker</a> and <a href="http://0x80.pl">Wojciech Muła</a> for reporting typos.</p>
<p>Image credit: <a href="https://www.flickr.com/photos/like_the_grand_canyon/31064064387">Kellogg’s Special K</a> by <a href="https://www.flickr.com/photos/like_the_grand_canyon/">Like_the_Grand_Canyon</a> is licensed under <a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a>.</p>
<p class="info">If you liked this post, check out the <a href="/">homepage</a> for others you might enjoy.</p>
<hr />
<p><br /></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:naming" role="doc-endnote">
<p>These <em>mask registers</em> are often called <em>k</em> registers or simply <em>kregs</em> based on their naming scheme. <a href="https://twitter.com/tom_forsyth/status/1202666300591337472">Rumor has it</a> that this letter was chosen randomly only after a long and bloody naming battle between MFs. <a href="#fnref:naming" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:k0note" role="doc-endnote">
<p>There is sometimes a misconception (until recently even on the AVX-512 wikipedia article) that <code class="language-plaintext highlighter-rouge">k0</code> is not a normal mask register, but just a hardcoded indicator that no masking should be used. That’s not true: <code class="language-plaintext highlighter-rouge">k0</code> is a valid mask register and you can read and write to it with the <code class="language-plaintext highlighter-rouge">k</code>-prefixed instructions and <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions that write mask registers (e.g., any AVX-512 <a href="https://www.felixcloutier.com/x86/pcmpeqb:pcmpeqw:pcmpeqd">comparison</a>). However, the encoding that would normally be used for <code class="language-plaintext highlighter-rouge">k0</code> as a writemask register in a <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> operation indicates instead “no masking”, so the contents of <code class="language-plaintext highlighter-rouge">k0</code> cannot be used for that purpose. <a href="#fnref:k0note" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:maskmerge" role="doc-endnote">
<p>The distinction being that a zero-masking operation results in zeroed destination elements at positions not selected by the mask, while merging leaves the existing elements in the destination register unchanged at those positions. As a side effect, this means that with merging, the destination register becomes a type of destructive source-destination register and there is an input dependency on this register. <a href="#fnref:maskmerge" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:kreg" role="doc-endnote">
<p>I’ll try to use the full term <em>mask register</em> here, but I may also use <em>kreg</em>, a common nickname based on the labels <code class="language-plaintext highlighter-rouge">k0</code>, <code class="language-plaintext highlighter-rouge">k1</code>, etc. So just mentally swap <em>kreg</em> for <em>mask register</em> if and when you see it (or vice-versa). <a href="#fnref:kreg" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:hcite" role="doc-endnote">
<p>H. Wong, <em>Measuring Reorder Buffer Capacity</em>, May, 2013. [Online]. Available: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ <a href="#fnref:hcite" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:misstime" role="doc-endnote">
<p>Generally taking 100 to 300 cycles each (latency-wise). The wide range is because the cache miss wall clock time varies by a factor of about 2x, generally between 50 and 100 nanoseconds, depending on platform and <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr> details, and the CPU frequency varies by a factor of about 2.5x (say from 2 GHz to 5 GHz). However, on a given host, with equivalent TLB miss/hit behavior, we expect the time to be roughly constant. <a href="#fnref:misstime" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:roughly" role="doc-endnote">
<p>The reason I have to add <em>roughly</em> as a weasel word here is itself interesting. A glance at the charts shows that they are certainly not totally flat in either the fast or slow regions surrounding the spike. Rather there are various noticeable regions with distinct behavior and other artifacts: e.g., in Test 29 a very flat region up to about 104 filler instructions, followed by a bump and then a linearly ramping region up to the spike somewhat after 200 instructions. Some of those features are explicable by mentally (or <a href="https://godbolt.org/z/eAGxhH">actually</a>) simulating the pipeline, which reveals that at some point the filler instructions will contribute (although only a cycle or so) to the runtime, but some features are still unexplained (for now). <a href="#fnref:roughly" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:nonideal" role="doc-endnote">
<p>For example, a given rename slot may only be able to write a subset of all the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> entries, and uses the first available. When the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> is almost full, it is possible that none of the allowed entries are empty, so it is as if the structure is full even though some free entries remain, but accessible only to other uops. Since the allowed entries may be essentially random across iterations, this ends up with a more-or-less linear ramp between the low and high performance levels in the non-ideal region. <a href="#fnref:nonideal" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:twothirds" role="doc-endnote">
<p>The “60 percent” comes from 134 / 224, i.e., the speculative mask register <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> size, divided by the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size. The idea is that you’ll hit the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size limit no matter what once you have 224 instructions in flight, so you’d need to have 60% of those instructions be mask register writes<sup id="fnref:write:1" role="doc-noteref"><a href="#fn:write" class="footnote" rel="footnote">10</a></sup> in order to hit the 134 limit first. Of course, you might also hit some <em>other</em> limit first, so even 60% might not be enough, but the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> size puts a lower bound on this figure since it <em>always</em> applies. <a href="#fnref:twothirds" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:write" role="doc-endnote">
<p>Importantly, only instructions which write a mask register consume a physical register. Instructions that simply read a mask register (e.g., <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> instructions using a writemask) do not consume a new physical mask register. <a href="#fnref:write" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:write:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:rename" role="doc-endnote">
<p>More renaming domains makes things easier on the renamer for a given number of input registers. That is, it is easier to rename 2 <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> and 2 <abbr title="Single Instruction Multiple Data: an ISA type or ISA extension like Intel's AVX or ARM's NEON that can perform multiple identical operations on elements packed into a SIMD register.">SIMD</abbr> input registers (separate domains) than 4 <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> registers. <a href="#fnref:rename" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:prrt" role="doc-endnote">
<p>This is either the <em>Physical Register Reclaim Table</em> or <em>Post Retirement Reclaim Table</em> depending on who you ask. <a href="#fnref:prrt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:simply" role="doc-endnote">
<p>Of course, it is not actually so simple. For one, you now need to track these “move elimination sets” (sets of registers all pointing to the same physical register) in order to know when the physical register can be released (once the set is empty), and these sets are themselves a limited resource which must be tracked. Flags introduce another complication since flags are apparently stored along with the destination register, so the presence and liveness of the flags must be tracked as well. <a href="#fnref:simply" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:moves" role="doc-endnote">
<p>In particular, in the corresponding test for <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> registers (Test 7), the chart looks very different as move elimination reduces the <abbr title="Physical register file: The hardware registers used for renaming architectural (source visible) registers, usually much larger in number than the architectural register count.">PRF</abbr> demand down to almost zero and we get to the <abbr title="Re-order buffer: an ordered buffer which stores in-progress instructions on an out-of-order processor.">ROB</abbr> limit. <a href="#fnref:moves" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:regreg" role="doc-endnote">
<p>Note that I am not restricting my statement to moves between two mask registers only, but any registers. That is, moves between a <abbr title="General purpose: as opposed to SIMD or FP. On x86 often refers to instructions such as integer addition, or registers such as eax.">GP</abbr> register and a mask register are also not eliminated (the latter fact is obvious if you consider that they use distinct register files, so move elimination seems impossible). <a href="#fnref:regreg" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:zero" role="doc-endnote">
<p>Probably by pointing the entry in the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> to a fixed, shared zero register, or setting a flag in the <abbr title="Register alias table: a table which maps an architectural register identifier to a physical register.">RAT</abbr> that indicates it is zero. <a href="#fnref:zero" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:otherzero" role="doc-endnote">
<p>Although <code class="language-plaintext highlighter-rouge">xor</code> is the most reliable, other idioms may be recognized as zeroing or dependency breaking idioms by some CPUs as well, e.g., <code class="language-plaintext highlighter-rouge">sub reg,reg</code> and even <code class="language-plaintext highlighter-rouge">sbb reg, reg</code> which is not a zeroing idiom, but rather sets the value of <code class="language-plaintext highlighter-rouge">reg</code> to zero or -1 (all bits set) depending on the value of the carry flag. This doesn’t depend on the value of <code class="language-plaintext highlighter-rouge">reg</code> but only the carry flag, and some CPUs recognize that and break the dependency. Agner’s <a href="https://www.agner.org/optimize/#manual_microarch">microarchitecture guide</a> covers the <abbr title="Microarchitecture: a specific implementation of an ISA, e.g., &quot;Haswell microarchitecture&quot;.">uarch</abbr>-dependent support for these idioms very well. <a href="#fnref:otherzero" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:notall" role="doc-endnote">
<p>Note that only the two source registers really need to be the same: if <code class="language-plaintext highlighter-rouge">kxorb k1, k1, k1</code> is treated as zeroing, I would expect the same for <code class="language-plaintext highlighter-rouge">kxorb k1, k2, k2</code>. <a href="#fnref:notall" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:runit" role="doc-endnote">
<p>Run all the tests in this section using <code class="language-plaintext highlighter-rouge">./uarch-bench.sh --test-name=avx512/*</code>. <a href="#fnref:runit" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fyiuops" role="doc-endnote">
<p>This is why uops.info reports the latency for both <code class="language-plaintext highlighter-rouge">kmov r32, k</code> and <code class="language-plaintext highlighter-rouge">kmov k, r32</code> as <code class="language-plaintext highlighter-rouge"><= 3</code>. They know the pair takes 4 cycles in total and under the assumption that each instruction must take <em>at least</em> one cycle the only thing you can really say is that each instruction takes at most 3 cycles. <a href="#fnref:fyiuops" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tech" role="doc-endnote">
<p>Technically, I only tested the xor zeroing idiom, but since that’s the ground-zero, most basic idiom we can be pretty sure nothing else will be recognized as zeroing. I’m open to being proven wrong: the code is public and easy to modify to test whatever idiom you want. <a href="#fnref:tech" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>