I recently stumbled on a blog post called Building Faster AMD64 Memset Routines, written by Joe Bialek of the Microsoft Security Response Center (MSRC). The blog describes his efforts to improve the performance of the Windows kernel memset() function, across all sizes of memory to set. The reported optimizations are quite fascinating, and could be summed by avoiding branches even at the cost of doing redundant stores. Basically, stores are free while branches are expensive.
Stores are Free?
When I learnt programming back in the days of 8-bit microcomputers, every cycle counted and every operation counted. Fundamentally, doing fewer operations was always better. The same intuition still drives a lot of programming habits – doing less at the programmer-visible level ought to make for better code. Over the years, programmers have had to learn that things are not always so simple. With the introduction of caches, and pipelines, and branch predictors, and prefetchers, and other hardware optimizations, things are not necessarily all that simple anymore.
In the particular case of this memset optimization campaign, Joe Bialek makes some observations that really surprised me, and that goes to show how far hardware mechanisms have evolved to make code run well even if it does not seem to really adhere to best practice.
Unaligned store instructions that straddle a cache line incur a bit of a performance penalty. Unaligned store instructions that straddle a page boundary are substantially slower (taking about 4x as long to execute). Unaligned store instructions that do not straddle either of these boundaries are almost free on modern CPU’s.
While other notes do align with what I would have expected:
Current Intel/AMD CPU’s can retire a single store instruction per clock cycle. Assuming alignment is okay, the CPU can retire one 16-byte store per cycle just like it can retire one 1-byte store per cycle; for best performance you should use the biggest stores possible.
As well as the fact that branches are potentially really expensive. They are in fact surprisingly expensive, much more so than stores, and a key observation to improve performance is this:
We want to minimize the number of branch mispredictions since this can cripple performance. For small memsets, a single branch misprediction can consume more CPU cycles than all the stores. Common sizes should be handled with very few mispredictions and maximum stores retired per cycle, less common sizes can be handled in more expensive ways.
The Neat Trick – Stores are Free!
The previous points together lead to the trick that caused me to write this blog post.
The obvious core of an efficient memory setting routine is to do large aligned writes, to get through the memory block with the fewest possible stores. “Large” can mean 4 or 8 or 16 (or even 64 bytes for AVX-512) bytes depending on the size of the block to set and the instructions available. If the block to set is not aligned to the operation size, there will be some “extra” bytes at one or both ends of the block.
For the extra bytes (prefix and suffix in the simple illustration above), the naïve solution is to just write them a byte at a time. Simple, easy, and no extraneous work done in terms of driving bytes to memory. However, the smarter faster way is just to do a full-size unaligned store that targets the extra bytes + also some that happen to fall into the aligned main block.
This is much faster than trying to avoid writing the same byte multiple times, since it avoids a lot of looping (and looping means branching) and uses the one store you seem to get per cycle in the most efficient way. The explanation is that high-performance processors collect writes into a store buffer before sending them to the memory system. Essentially, multiple store instructions are combined to become a single transfer to caches and memory. Writing the same location multiple times in close succession is basically free since the processor will just capture the operations in the buffer.
The blog post shows how this can be taken further. For example, the final design would write a 24-byte block using two 16-byte writes to its first 16 bytes, and two additional 16-byte writes to the last 16 bytes. Meaning that the middle 8 bytes are written four times! This is still faster than the alternatives, since all the code does is to unconditionally compute the offsets to write. Several redundant writes are better than the risk of a branch misprediction.
Branches are expensive, redundant stores are free. That is totally logical, and also definitely not how computers used to work. This optimization reportedly works well on a range of processors from both Intel and AMD from the past decade. Such universality means that it makes sense for a very popular operating system to have one of its common functions built to take advantage of it. Let’s just hope that keeps being the case.
The Feedback Loop
What I found fascinating about this optimization is that it to some extent goes counter to the idea that software should express what it wants to achieve at a fairly high level of abstraction – while the processor core is the designed to execute the expressed intention as quickly as possible at run time.
A classic simple memset that only writes each byte once presents a clear message to the processor and its designers: the intention of the software is to write a set of bytes. That makes sense as a statement of intent. A computer architect could look at traces of such software and consider ways to improve the core or add new instructions for the use case.
An optimized memset like that presented in the blog post is happily writing the same bytes several times over. The achieves the goal in a way that works very well on current processors. However, if you were to look at a trace of this as part of understanding what software does in order to design a better processor, you will see that software likes to overwrite previous writes. In this case, the intention gets muddled.
It is hard to imagine how this specific observation could lead hardware designers to do something dumb, since the “obvious” mechanism is the already existing write buffer. But it still bugs me. It feels “wrong”, and I wonder where such feedback loops lead in the long run. Maybe we already have hardware and software features out in the world that are very far from their most “natural” form, driven by successive iterations of this feedback loop? I brings up analogies to biology, where evolution produces all kinds of “objectively” bizarre behaviors and physical traits that only make sense as a result of a history of evolution.
When do Savings Matter?
Joe makes a very good point at the end of the post, addressing the eternal question of when it is worth spending time on optimization, and what makes for a worthwhile optimization in general. Some people tend to want obviously visible improvements like 10% or more to call an optimization or improvement worth doing. Then again, there are cases where a 1% saving is incredibly valuable. I remember back in the early days of the Internet-of-Things discussion, where some manager at a traditional industrial firm pointed out that if you have a cost base of billions of dollars, a single percent is still 10s of millions of dollars. I.e., you cannot be blinded by percentages only. Scale matters, and memset in Windows is undoubtedly something that runs at scale:
With a function that is used incredibly heavily such as memset, performance matters even if you don’t have an end-to-end scenario that shows obvious improvements. At datacenter scale, shaving CPU cycles off memset saves money, power, and helps the environment.
Read the blog post! It is a very interesting story of a small but insightful optimization of a common operation. It offers some experiment-based insights into how modern hardware works, and I did not cover all of it here (leaving out the effectiveness of REP STOSB in particular).