The Mill is a new general-purpose high-performance processor design from Out-of-the-Box Computing (http://ootbcomp.com/). They claim to beat typical high-end out-of-order (OOO) designs like the Intel Haswell generation by crazy factors, such as being 2.3x faster while using 2.3x less power compared to a Haswell, all the while costing less. Ignoring the cost aspect, the power and performance numbers are truly impressive – especially for general code. How can they do something so much better than what we have today? For general code? That requires some serious innovation. With that perspective, I ask myself where the Mill is really significantly different from what we have seen before.
The 30,000-foot view is that the Mill is a design that minimizes the hardware overhead spent on deciding what the processor should do and on where to find operands and deposit results. Since a modern out-of-order machine is really more than 90% such overhead, cutting it and replacing it with more functional units should give tremendous benefits in efficiency. The Mill is a VLIW, as that is probably the most efficient instruction design for high performance… provided you can keep the functional units fed, which is the traditional bane of VLIW designs. The Mill is a VLIW made to work for modern general-purpose code.
Looking a bit more closely at the Mill, they have designed an instruction set and computer architecture that turns many of the tricks that an OOO machine does under the hood, behind the scenes, into compiler-visible, statically generated scheduling decisions. Doing this has required rethinking a large part of traditional instruction set design, using a richer information model for temporary data, and doing away with registers. Allowing for OOO-style mechanisms in this way makes it possible to keep the VLIW units fed, and avoids the many pitfalls that have plagued current VLIW designs for all applications except DSP.
Yet another perspective is that the Mill is a 2.0 design that has taken many good ideas that failed in their version 1.0 incarnations, refreshed them, fixed their flaws, and made them work. The Itanium/EPIC design by Intel and HP was in many ways based on the same ideas, but that design did not really work out commercially (technically, it is not really fair to call it a failure, since it has provided some impressive absolute performance numbers for certain codes). Statically known latencies and compiler scheduling have been tried before, but did not work for general-purpose code. Function calls with a fresh register area for the called function was the idea behind register windows on the SPARC, but it turned into a nightmare as processors sped up. Here, they do a similar thing in a different way.
On to some of the details.
By doing away with registers, the Mill avoids the very expensive work of renaming registers and managing register hazards in a conventional OOO machine. It is potentially a big win in terms of power efficiency, since managing hundreds of rename registers for a user-visible set of a few dozen registers is a major pain point in OOO processor design. The replacement is a fixed-length FIFO called the Belt (http://www.youtube.com/watch?v=QGw-cy0ylCc). Architecturally, values are read from the Belt, fed into functional units, and the results are put at the front of the Belt. It is a bit like a stack machine, but with the ability to read any value, not just the top of the stack. The Mill also separates temporary values that are used to implement data flow between functional units, which are put on the Belt, from longer-term storage of values being used many times, which are put in a compiler-managed scratchpad memory. According to prior research, only some 13% of values are used more than once; the rest can easily be handled by temporary storage in a Belt.
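To make the Belt idea a bit more concrete, here is a minimal Python sketch of how I picture it. This is a toy model under my own assumptions, not the Mill's actual implementation: the class name Belt, the methods push, read, and snapshot, and the belt length of 4 are all made up for illustration.

```python
from collections import deque


class Belt:
    """Toy model of a fixed-length belt (hypothetical, for illustration).

    New results enter at the front (position 0); when the belt is full,
    the oldest value silently falls off the far end.
    """

    def __init__(self, length=8):
        # A deque with maxlen discards items from the opposite end on overflow.
        self._slots = deque(maxlen=length)

    def push(self, value):
        """Drop a new result at the front of the belt."""
        self._slots.appendleft(value)

    def read(self, position):
        """Read any belt position; 0 is the most recent result."""
        return self._slots[position]

    def snapshot(self):
        """Return the belt contents, front first."""
        return list(self._slots)


belt = Belt(length=4)
belt.push(3)
belt.push(4)
belt.push(belt.read(0) + belt.read(1))  # "add": operands named by position
belt.push(belt.read(0) * belt.read(2))  # "mul": reads the sum and the old 3
belt.push(5)                            # belt is full, so the oldest value (3) falls off
print(belt.snapshot())
```

Note that operands are named by how long ago they were produced rather than by a register name, so there is nothing to rename. Anything that has to live longer than the belt length has to be saved elsewhere, which is where the scratchpad comes in.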
The dependence on the compiler to schedule code and to manage Belt offsets and operation latencies is the same idea as in the first RISC machines: move complexity from the machine into the compilation stage. It is absolutely right in principle, but for RISC the idea of having the compiler schedule code for the machine broke down when pipelines started to get longer and vary in length, and thus there was no longer a statically known set of latencies. Instead, we ended up with superscalar out-of-order designs that could take code compiled for the initial set of latencies and make it execute fast. At a terrifying cost in overhead. This is one area where I am really curious to see how things work out down the road. Is the Mill architected in such a way that they will avoid that trap? Will they keep the same latencies forever, or is there some other trick that they will pull out for their next generation? In general, reducing the latencies for operations has been a huge source of performance improvements in processors, but if latencies are given by the architecture, it would seem that the Mill could not use that to improve performance. It does sound fishy.
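The worry about baked-in latencies can be illustrated with a small Python sketch. It assumes a toy machine where results reach the belt in the order they retire, so that the belt positions the compiler hard-codes depend on the latency table. The function drop_order, the schedule format, and the latency numbers are all hypothetical; this says nothing about how the Mill actually deals with the problem.

```python
def drop_order(schedule, latencies):
    """Return result names in the order they would reach the belt.

    schedule:  list of (result_name, operation_kind, issue_cycle)
    latencies: dict mapping operation_kind to its latency in cycles
    Ties are broken by position in the schedule.
    """
    retire = [
        (issue + latencies[kind], index, name)
        for index, (name, kind, issue) in enumerate(schedule)
    ]
    return [name for _, _, name in sorted(retire)]


# The same statically scheduled code on two hypothetical generations.
schedule = [("t1", "mul", 0), ("t2", "add", 1)]

gen1 = drop_order(schedule, {"mul": 3, "add": 1})  # slow multiplier
gen2 = drop_order(schedule, {"mul": 1, "add": 1})  # faster multiplier

print(gen1)  # t2 arrives before t1
print(gen2)  # t1 arrives before t2
```

With the slower multiplier the add result lands on the belt first; with the faster one the order flips, so any belt offsets computed for the first machine would point at the wrong values on the second. In this toy model, a binary is only valid for the latency table it was scheduled against.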
The management of interrupts and function calls is truly novel. Compared to conventional architectures, the Mill way of doing it shows that they have designed the concept of a function call more deeply into the architecture than any other machine I have seen. In a conventional machine, you use mechanisms to implement function calls, argument passing, saving registers in the called function to avoid hurting the caller, etc. In the talk on the Belt (http://www.youtube.com/watch?v=QGw-cy0ylCc), the function call semantics of the Mill are shown to be such that calls are just another instruction with a logical latency of 1 cycle from the perspective of the calling function. You can schedule long-latency results across a function call – the results do not pop out in the middle of the called function, but kindly wait until you are back in the function you came from. Such semantics make the job of the compiler much easier, and avoid one major source of inefficiency in previous statically scheduled machines. Interrupts are handled in the same way, as a spontaneous function call, and do not need to flush pipelines or do other time-consuming saves and restores in order not to disturb the code that got interrupted. Elegant, there is no other word for it.
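Here is how I would sketch those call semantics in Python. It is a toy model under my own assumptions: each function activation gets its own belt and its own logical cycle counter, a call costs the caller exactly one logical cycle, and results still in flight in the caller are held back until the caller's own clock reaches their due cycle. The names Frame, issue, tick, and call are hypothetical and do not come from the Mill documentation.

```python
from collections import deque


class Frame:
    """One function activation with a private belt and logical clock."""

    def __init__(self, belt_length=8):
        self.belt = deque(maxlen=belt_length)
        self.cycle = 0
        self.in_flight = []  # (due_cycle, value) pairs not yet on the belt

    def issue(self, value, latency):
        """Start an operation whose result arrives `latency` cycles later."""
        self.in_flight.append((self.cycle + latency, value))

    def tick(self):
        """Advance this frame's logical clock and retire due results."""
        self.cycle += 1
        still_waiting = []
        for due, value in self.in_flight:
            if due <= self.cycle:
                self.belt.appendleft(value)
            else:
                still_waiting.append((due, value))
        self.in_flight = still_waiting


def call(caller, function, *args):
    """Run `function` in a fresh frame; the caller sees a 1-cycle operation."""
    callee = Frame()
    for arg in reversed(args):
        callee.belt.appendleft(arg)  # first argument ends up at position 0
    result = function(callee)
    caller.issue(result, latency=1)
    caller.tick()


def square_plus_one(frame):
    x = frame.belt[0]
    frame.issue(x * x, latency=3)
    for _ in range(3):  # the callee burns three cycles of its own
        frame.tick()
    return frame.belt[0] + 1


main = Frame()
main.issue(6 * 7, latency=3)     # long-latency operation issued before the call
call(main, square_plus_one, 5)   # the callee never sees the pending 42
after_call = list(main.belt)     # only the call result is on the caller's belt
main.tick()
main.tick()                      # caller reaches cycle 3 and the 42 arrives
print(after_call, list(main.belt))
```

The callee spends three cycles of its own, yet the caller's clock advances by only one, and the pending multiply result shows up on the caller's belt exactly where a scheduler that treats the call as a 1-cycle operation would expect it.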
It is striking how the Mill incorporates inspiration from modern programming. The fact that metadata (http://www.youtube.com/watch?v=DZ8HN9Cnjhc) is attached to data on the Belt, telling the machine how big the data is, and that an explicit representation of “none” is available, feels more like Python than C in style. It is very interesting to note that while data is tagged with size, it is not tagged with type. Thus, a floating-point operation or an integer operation can be performed on the same data, and the type of the operation is given in the instruction encoding. One reason that this works is that today, we have pretty much arrived at a point where all types of data in a machine use the same size – 64-bit integers, floats, and pointers are the norm. It used to be that floats were very different in size from integers, and thus needed special handling in special units. For the past decade, this distinction has been blurred, as various vector operations have become popular in all mainstream computer architectures.
The fact that this information moves around with the data it belongs to feels like it should be more efficient than keeping it in a separate set of flag registers or speculative exception registers. I can see that not having types on the data makes it much easier to do loads, stores, picks, and other operations where you just move bits around without computing on them. No need for special floating or integer versions of them, which is a nice simplification of the design.
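A small Python sketch of the "size travels with the data, type lives in the operation" idea follows. It is an illustration under my own assumptions, not the Mill encoding: the names BeltValue, load, add_int, add_float, and as_float are hypothetical, and the float operation is restricted to 8-byte values to keep the example short.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class BeltValue:
    bits: int   # raw payload, with no notion of int versus float
    width: int  # size in bytes: the metadata that travels with the value


def load(raw: bytes) -> BeltValue:
    """One load for all data: the width comes from the access, not a type."""
    return BeltValue(int.from_bytes(raw, "little"), len(raw))


def as_float(value: BeltValue) -> float:
    """Reinterpret an 8-byte payload as an IEEE 754 double."""
    return struct.unpack("<d", value.bits.to_bytes(8, "little"))[0]


def add_int(a: BeltValue, b: BeltValue) -> BeltValue:
    """Integer add: the operation, not the data, says 'integer'."""
    if a.width != b.width:
        raise ValueError("operand widths differ")
    mask = (1 << (8 * a.width)) - 1
    return BeltValue((a.bits + b.bits) & mask, a.width)


def add_float(a: BeltValue, b: BeltValue) -> BeltValue:
    """Float add on the very same kind of tagged value."""
    if a.width != 8 or b.width != 8:
        raise ValueError("this sketch only handles 8-byte floats")
    return load(struct.pack("<d", as_float(a) + as_float(b)))


v = load(struct.pack("<d", 1.5))  # eight raw bytes, tagged only with width 8

float_sum = add_float(v, v)       # the bits are treated as a double
int_sum = add_int(v, v)           # the same bits are treated as an integer

print(as_float(float_sum), hex(int_sum.bits))
```

The single load function is the point: moving bits around needs the width but never the type, so only the arithmetic operations come in integer and floating-point flavors.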
The way that you can operate on data with latent exceptions (“NaR”, Not a Result) is obvious when you see it, but coming up with a way to integrate it into the system is very clever. I still wonder about how memory management and page faults work, but I am sure it can be solved. The support for extensive speculation and vectorization with instructions like “smear” is cleverly done, but is something that a conventional architecture might also pick up on. It feels more like traditional ISA-tweaking progress. It is very well done and looks very beneficial, but it could also be done in a regular architecture (not as elegantly, though). Those operations and the metadata are a key part of how the Mill can implement OOO-style behavior with statically scheduled code. Very clever. It is indeed thinking out of the box.
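The latent-exception idea can be sketched in Python roughly as follows. This is my own toy reading of it: a speculative operation that would fault produces a NaR marker instead, the marker flows through later operations, and the fault is only raised if the value is actually committed, which I model here as a store. The names NaR, spec_div, spec_add, pick, and store are hypothetical.

```python
class NaR:
    """Not a Result: a latent exception carried along as metadata."""

    def __init__(self, reason):
        self.reason = reason


def spec_div(a, b):
    """Speculative divide: never faults, produces a NaR instead."""
    if isinstance(a, NaR):
        return a
    if isinstance(b, NaR):
        return b
    if b == 0:
        return NaR("divide by zero")
    return a // b


def spec_add(a, b):
    """Speculative add: a NaR operand simply propagates."""
    if isinstance(a, NaR):
        return a
    if isinstance(b, NaR):
        return b
    return a + b


def pick(condition, if_true, if_false):
    """Select one of two already computed values, NaR or not."""
    return if_true if condition else if_false


def store(memory, address, value):
    """Committing a value is where a latent exception finally fires."""
    if isinstance(value, NaR):
        raise ArithmeticError(value.reason)
    memory[address] = value


# Source-level intent: result = (10 // d) + 1 if d != 0 else 0
memory = {}
d = 0
quotient = spec_add(spec_div(10, d), 1)  # computed before the test, yields a NaR
result = pick(d != 0, quotient, 0)       # the NaR is on the path not taken
store(memory, 0, result)                 # no fault: the NaR was never used
print(memory)
```

The division is hoisted above the test that guards it, which is exactly what a static scheduler wants to do to fill the wide instruction. Because the bad result is never committed, no exception is ever seen.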
So, if I look for reasons why the Mill could work as advertised, the ways in which it is fundamentally different from what we have in today’s IA, ARM, and Power Architecture are:
- Cutting away overhead as a guiding principle
- Reinventing good old ideas
- Replacing OOO hardware scheduling with compiler-directed speculation
- Replacing registers with the Belt
- Metadata moves around with data on the Belt
- Splitting long-term value storage from temporary values
- Size and vector size metadata on data, rebalancing the operand/instruction information content compared to conventional architectures
- NaR and None metadata to support compiler-OOO
- Function call semantics that simplify scheduling
- Interrupt semantics that are much better than those typically found on VLIW DSPs
This could actually work.
How the market reacts to it is a different question. The old boring issue of compatibility with existing software could be a killer. Or maybe the current wave of cloud computing can help buoy the Mill to success – it seems that with the talk of ARM servers as an alternative to Intel, and with code running inside a server controlled by a service vendor, alternative architectures might have a chance to break out again!
I recommend watching some of the Mill movies – they have kept me captivated for quite a few hours in the gym and while traveling. Maybe my idea of entertainment is different from most other people’s.