The Mill is a new general-purpose high-performance processor design from Out-of-the-Box Computing (http://ootbcomp.com/). They claim to beat typical high-end out-of-order (OOO) designs like the Intel Haswell generation by crazy factors, such as being 2.3x faster while using 2.3x less power compared to a Haswell, all the while costing less. Ignoring the cost aspect, the power and performance numbers are truly impressive – especially for general code. How can they do something so much better than what we have today? For general code? That requires some serious innovation. With that perspective, I ask myself where the Mill is really significantly different from what we have seen before.
The 30000 foot view is that the Mill is a design where hardware overhead that is spent deciding on what a processor should do and where to find and deposit results has been minimized. Since a modern out-of-order machine is really more than 90% such overhead, cutting it and replacing it with more functional units should give tremendous benefits in efficiency. The Mill is a VLIW, as that is probably the most efficient instruction design for high performance… provided you can keep the functional units fed, which is the traditional bane of VLIW designs. The Mill is a VLIW made to work for modern general-purpose code.
Looking a bit more closely at the Mill, they have designed an instruction set and computer architecture that turns many of the tricks that an OOO machine does under the hood, behind the scenes, into compiler-visible statically generated scheduling decisions. Doing this has required rethinking a large part of traditional instruction set design, using a richer information model for temporary data, and doing away with registers. Exposing OOO-style mechanisms to the compiler in this way makes it possible to keep the VLIW units fed, and avoids the many pitfalls that have plagued VLIW designs for all applications except DSP.
Yet another perspective is that the Mill is a 2.0 design that has taken many good ideas that failed in their version 1.0 incarnations, refreshed them, fixed their flaws, and made them work. The Itanium/EPIC design by Intel and HP was in many ways based on the same ideas, but that design did not really work out commercially (technically, it is not really fair to call it a failure, since it has provided some impressive absolute performance numbers for certain codes). Statically known latencies and compiler scheduling have been tried before, but did not work for general-purpose code. Function calls with a fresh register area for the called function were the idea behind register windows on the SPARC, but that turned into a nightmare as processors sped up. Here, they do a similar thing in a different way.
On to some of the details.
By doing away with registers, the Mill avoids the very expensive renaming and management of register hazards found in a conventional OOO machine. It is a potential big win in terms of power efficiency, since managing hundreds of rename registers for a user-visible set of a few dozen registers is a major pain point in OOO processor design. The replacement is a fixed-length FIFO called the Belt (http://www.youtube.com/watch?v=QGw-cy0ylCc). Architecturally, values are read from the belt, fed into functional units, and the results are put at the front of the belt. It is a bit like a stack machine, but with the ability to read any value, not just the top of the stack. The Mill also separates temporary values that are used to implement data flow between functional units, which are put on the Belt, from longer-term storage of values being used many times, which are put in a compiler-managed scratchpad memory. According to prior research, only some 13% of values are used more than once; the rest can easily be handled by temporary storage in a Belt.
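To make the Belt idea concrete, here is a minimal sketch of a fixed-length belt in Python. It only models the addressing scheme described above (results enter at the front, operands are named by position, the oldest value falls off the end). The class name, method names, and the belt length are assumptions made for illustration and are not taken from the Mill documentation.

```python
from collections import deque


class Belt:
    """Toy model of a fixed-length belt (names and length are hypothetical)."""

    def __init__(self, length=8):
        # A deque with maxlen silently discards from the far end when full,
        # which mimics old values falling off the belt.
        self._slots = deque(maxlen=length)

    def drop(self, value):
        # New results always enter at the front (position 0).
        self._slots.appendleft(value)

    def read(self, position):
        # Any position can be read, not just the front.
        return self._slots[position]

    def contents(self):
        return list(self._slots)


def add(belt, a, b):
    # An operation names its operands by belt position and drops its result.
    belt.drop(belt.read(a) + belt.read(b))


belt = Belt(length=4)
belt.drop(3)
belt.drop(4)        # belt: 4, 3
add(belt, 0, 1)     # belt: 7, 4, 3
add(belt, 0, 2)     # belt: 10, 7, 4, 3
belt.drop(99)       # belt: 99, 10, 7, 4  (the 3 has fallen off)
```

Note that the position of a value changes every time something is dropped, which is why the compiler has to track belt offsets statically.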
The dependence on the compiler for scheduling and for managing the offsets on the Belt and the operation latencies is the same idea as in the first RISC machines: move complexity from the machine into the compilation stage. It is absolutely right in principle, but for RISC the idea of having the compiler schedule code for the machine broke down when pipelines started to get longer and vary in length, and thus there was no longer a statically known set of latencies. Instead, we ended up with superscalar out-of-order designs that could take code compiled for the initial set of latencies and make it execute fast. At a terrifying cost in overhead. This is one area where I am really curious to see how things work out down the road. Is the Mill architected in such a way that they will avoid that trap? Will they keep the same latencies forever, or is there some other trick that they will pull out for their next generation? In general, reducing the latencies for operations has been a huge source of performance improvements in processors, but if latencies are given by the architecture, it would seem that the Mill could not use that to improve performance. It does sound fishy.
The management of interrupts and function calls is truly novel. Compared to conventional architectures, the Mill has designed the concept of a function call more deeply into the architecture than any other machine I have seen. In a conventional machine, you use general-purpose mechanisms to implement function calls, argument passing, saving registers in the called function to avoid hurting the caller, etc. In the talk on the Belt (http://www.youtube.com/watch?v=QGw-cy0ylCc), the function call semantics of the Mill are shown to be such that a call is just another instruction with a logical latency of 1 cycle from the perspective of the calling function. You can schedule long-latency results across a function call – the results do not pop out in the middle of the called function, but kindly wait until you are back in the function you came from. Such semantics make the job of the compiler much easier, and avoid one major source of inefficiency in previous statically scheduled machines. Interrupts are handled in the same way, as a spontaneous function call, and do not need to flush pipelines or do other time-consuming saves and restores in order not to disturb the code that got interrupted. Elegant, there is no other word for it.
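As a rough illustration of the call semantics described above, the following Python sketch gives the called function a fresh belt holding only its arguments, and has the call drop a single result on the caller's belt, as if it were one ordinary operation. This is a toy model under my own assumptions (the helper names and the belt length are hypothetical); it does not model in-flight long-latency results or interrupts.

```python
from collections import deque

BELT_LENGTH = 8  # arbitrary length for this sketch


def new_belt(values=()):
    """Create a belt; each value is dropped at the front in turn."""
    belt = deque(maxlen=BELT_LENGTH)
    for value in values:
        belt.appendleft(value)
    return belt


def call(caller_belt, function, arg_positions):
    """Model a call as a single operation seen from the caller."""
    # Arguments are named by position on the caller's belt.
    args = [caller_belt[p] for p in arg_positions]
    # The callee starts with a fresh belt containing only its arguments.
    callee_belt = new_belt(args)
    result = function(callee_belt)
    # The caller's belt is untouched except for the dropped result.
    caller_belt.appendleft(result)


def square_plus_one(belt):
    belt.appendleft(belt[0] * belt[0])  # x * x
    belt.appendleft(belt[0] + 1)        # x * x + 1
    return belt[0]


caller = new_belt([5, 6])               # caller belt: 6, 5
call(caller, square_plus_one, [1])      # argument is the 5 at position 1
```

After the call the caller's belt reads 26, 6, 5: the callee's intermediate values never appeared on it, so nothing needed to be saved or restored.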
It is striking how the Mill incorporates inspiration from modern programming. The fact that Metadata (http://www.youtube.com/watch?v=DZ8HN9Cnjhc) is attached to data on the Belt, telling the machine how big the data is, and having an explicit representation of “none” available feels more like Python than C in style. It is very interesting to note that while data is tagged with size, it is not tagged with type. Thus, a floating-point operation or an integer operation can be performed on the same data, and the type of the operation is given in the instruction encoding. One reason that this works is that today, we have pretty much arrived at a point where all types of data in a machine use the same size – 64-bit integers, floats, and pointers are the norm. It used to be that floats were very different in size from integers, and thus needed special handling in special units. For the past decade, this distinction has been blurred, as various vector operations have become popular in all mainstream computer architectures.
The fact that this information moves around with the data it belongs to feels like it should be more efficient than keeping it in a separate set of flag registers or speculative exception registers. I can see that not having types on the data makes it much easier to do loads, stores, picks, and other operations where you just move bits around without computing on them. No need for special floating or integer versions of them, which is a nice simplification of the design.
The way that you can operate on data with latent exceptions (“NaR”, Not a Result) seems obvious once you see it, but coming up with a way to integrate it into the system is very clever. I still wonder about how memory management and page faults work, but I am sure it can be solved. The support for extensive speculation and vectorization with instructions like “smear” is cleverly done, but is something that a conventional architecture might also pick up on. It feels more like traditional ISA-tweaking progress. It is very well done and looks very beneficial, but it could also be done in a regular architecture (not as elegantly, though). Those operations and the metadata are a key part of how the Mill can implement OOO-style behavior with statically scheduled code. Very clever. It is indeed thinking out of the box.
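The metadata idea can be sketched in a few lines of Python. The sketch assumes a simplified model of my own: each value carries a width, a NaR flag, and a None flag; arithmetic propagates NaR instead of raising; and only a non-speculative operation (a store, in this sketch) turns a NaR into a real fault, while storing a None does nothing. The class and function names, and the choice that NaR takes precedence over None when both appear, are assumptions of this sketch and not statements about the actual Mill rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeltValue:
    payload: int = 0
    width: int = 64      # size metadata in bits; note there is no type tag
    nar: bool = False    # "Not a Result": a latent exception
    none: bool = False   # explicit "no value"


NAR = BeltValue(nar=True)
NONE = BeltValue(none=True)


def div(a, b):
    # Precedence of NaR over None is an arbitrary choice in this sketch.
    if a.nar or b.nar:
        return NAR
    if a.none or b.none:
        return NONE
    if b.payload == 0:
        return NAR       # record the fault in the data instead of raising
    return BeltValue(a.payload // b.payload, max(a.width, b.width))


def store(memory, address, value):
    # Only here does a latent exception become a real one.
    if value.nar:
        raise ArithmeticError("NaR reached a non-speculative operation")
    if value.none:
        return           # storing "no value" is a no-op in this sketch
    memory[address] = value.payload


memory = {}
speculative = div(BeltValue(10), BeltValue(0))   # no exception raised here
store(memory, 0x20, div(BeltValue(10), BeltValue(2)))
store(memory, 0x28, NONE)
```

The speculative division by zero is harmless as long as its result is never stored, which is what lets a compiler hoist operations above the branch that guards them.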
So, if I look for reasons that the Mill will work as advertised by being fundamentally different from what we have in today’s IA, ARM, and Power Architecture:
- Cutting away overhead as a guiding principle
- Reinventing good old ideas
- Replacing OOO hardware scheduling with compiler-directed speculation
- Replacing registers with the Belt
- Metadata moves around with data on the Belt
- Splitting long-term value storage from temporary values
- Size and vector size metadata on data, rebalancing the operand/instruction information content compared to conventional architectures
- NaR and None metadata to support compiler-OOO
- Function call semantics that simplify scheduling
- Interrupt semantics that are much better than those typically found on VLIW DSPs
This could actually work.
How the market reacts to it is a different question, the old boring issue of compatibility with existing software could be a killer. Or maybe the current wave of cloud computing can help buoy the Mill to success – it seems that with the talk of ARM servers as an alternative to Intel, and having code running inside a server controlled by a service vendor, alternative architectures might have a chance to break out again!
I recommend watching some of the Mill movies – they have kept me captivated for quite a few hours in the gym and on travel. Maybe my idea of entertainment is different from most other people’s.
It will work no doubt but how fast? An FPGA prototype will give a hint of the actual silicon chip speed. If the speed is there commercial OSes can be implemented with relative ease.
Let’s not forget that hairstyle!
You can make a computer project called Work. It will work by default no matter what it is about.
Thanks for the praise – it’s so gratifying when people understand and appreciate the work.
Don’t worry about latency compatibility going forward – Mill family members already vary in latency. And not only latency – they vary in binary encoding, both across family members and across slots in one member. Yet any Mill program runs on any Mill family member.
To achieve code portability we adopted the device used by the IBM AS/400 family: a load module contains the program in a representation for the abstract architecture, and at install time (or ROM build time) the program is “specialized” for the concrete encoding that is used by the target. The specialized binary is then cached back into the load module, so specialization doesn’t have to be done again unless the code is run on a different family member.
The specializer does two jobs when presented with a load module. First it replaces any abstract operation that is not in hardware on the target (for example, the low-end Tin member has no floating point) with an equivalent emulation sequence. It then schedules the code into the available slots/functional units/latencies. The abstract-program data structure produced by the tool chain (and input to the specializer) is designed so that specialization is one pass and very fast, on the order of a dynamic linker time.
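The two jobs can be sketched as a toy Python function: expand operations the target lacks, then place each operation in the earliest cycle where its inputs are ready and a slot is free. Everything concrete here is an assumption for illustration: the program and target representations, the operation names (`unpack`, `iadd`, `pack`), the latencies, and the greedy scheduling policy are hypothetical and not the actual specializer.

```python
def expand(op, emulation):
    """Replace one abstract op with a chain of emulating ops."""
    steps = emulation[op["op"]]
    ops, srcs = [], op["srcs"]
    for i, name in enumerate(steps):
        last = i == len(steps) - 1
        # Intermediate results get temporary names; the last keeps the name.
        dst = op["dst"] if last else f'{op["dst"]}.t{i}'
        ops.append({"op": name, "dst": dst, "srcs": srcs})
        srcs = [dst]
    return ops


def specialize(program, target):
    # Job 1: substitute emulation sequences for ops missing in hardware.
    expanded = []
    for op in program:
        if op["op"] in target["latency"]:
            expanded.append(op)
        else:
            expanded.extend(expand(op, target["emulation"]))

    # Job 2: one pass over the ops, honoring latencies and slot count.
    ready_at, used, schedule = {}, {}, []
    for op in expanded:
        cycle = max((ready_at.get(s, 0) for s in op["srcs"]), default=0)
        while used.get(cycle, 0) >= target["slots"]:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready_at[op["dst"]] = cycle + target["latency"][op["op"]]
        schedule.append((cycle, op["op"], op["dst"]))
    return schedule


program = [
    {"op": "load", "dst": "a", "srcs": []},
    {"op": "load", "dst": "b", "srcs": []},
    {"op": "fadd", "dst": "c", "srcs": ["a", "b"]},
]
# Two hypothetical family members: one with hardware floating point,
# one narrower member that emulates it with integer operations.
big = {"slots": 2, "latency": {"load": 3, "fadd": 4}, "emulation": {}}
small = {
    "slots": 1,
    "latency": {"load": 3, "unpack": 1, "iadd": 1, "pack": 1},
    "emulation": {"fadd": ["unpack", "iadd", "pack"]},
}
big_schedule = specialize(program, big)
small_schedule = specialize(program, small)
```

The same abstract program yields a different schedule for each member, which is the point: latencies and widths belong to the target description, not to the distributed program.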
With this scheme, the individual member (and slot) can have an entropy-optimal encoding all of its own, while obviating lateral or longitudinal compatibility issues. The actual encoding of each member/slot is mechanically generated from a specification; no human actually lays out bits manually. The generation process produces an assembler, debugger, specializer, and drives the Verilog generators to create the decoding hardware.
You can find out more at http://millcomputing.com – a new video (on the Mill security model) just went up the week this is written. There’s an active forum where we answer the innumerable questions raised by the Mill novelty.
Thanks again. Your blog is rewarding even when you are not talking about our baby, but please do that more too 🙂
Ivan
You missed what, to me, is the most important innovation the Mill presents: how it solves the aliasing problem. One of the big problems that architectures like the Itanium hit was that you want to start a load as early as possible, but what happens when an intervening store aliases the memory? This requires the compiler to be able to determine whether the intervening store is an aliasing store or not, which is a very complicated problem. On the Mill, loads return the value the memory has when the load retires, not when it issues. So aliasing isn’t a problem: if the intervening store aliases, its value is reflected in the load’s result. So now the compiler doesn’t have to worry about aliasing. It can hoist loads above stores with wild abandon, and everything works.
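A small Python sketch of that load behavior, under assumptions of my own: memory is a dictionary, time advances with an explicit `tick()`, and a load issued with some latency reads memory only in the cycle it retires. The class and method names are hypothetical and the model ignores caches and everything else about a real memory system.

```python
class Memory:
    """Toy memory where a load observes memory at retire time."""

    def __init__(self):
        self.cells = {}
        self.cycle = 0
        self.pending = []    # (retire_cycle, address, destination name)
        self.results = {}

    def issue_load(self, address, dst, latency):
        # Issuing only records the request; no value is read yet.
        self.pending.append((self.cycle + latency, address, dst))

    def store(self, address, value):
        self.cells[address] = value

    def tick(self):
        self.cycle += 1
        still_pending = []
        for retire, address, dst in self.pending:
            if retire <= self.cycle:
                # The value is sampled now, at retire time.
                self.results[dst] = self.cells.get(address, 0)
            else:
                still_pending.append((retire, address, dst))
        self.pending = still_pending


mem = Memory()
mem.store(0x10, 1)
mem.issue_load(0x10, "x", latency=3)   # hoisted above the store below
mem.tick()
mem.store(0x10, 2)                     # intervening store to the same address
mem.tick()
mem.tick()                             # load retires and sees the new value
```

Here `mem.results["x"]` ends up as 2, the same value the program would have seen if the load had stayed after the store, so hoisting the load did not change the result.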
This is an important bit of the philosophy of the Mill, at least to me: yes, it depends upon intelligence in the compiler, but not genius as it were. The Itanium explicitly took the attitude that if Intel built it, compiler advances would come (especially the solution to the aliasing problem). This wasn’t true, however; the compiler advances never materialized, and the Itanium failed. None of the compiler smarts the Mill is relying on are novel or recent, let alone not yet worked out.
This is truly a historic moment: Intel has been caught with its pants down. If Godard’s team sells out, it will be the biggest acquisition of IP, dwarfing recent Facebook acquisitions by multiple orders of magnitude. Money much better spent.
This architecture is such a marvel, combining so many smaller insights into their glorious consequences. It all comes together through the Belt. The _obvious_ is the hallmark of great design. The Mill has got it, and the Belt embodies it. OutOfTheBox itself was like a Belt, a production line, compelled to produce, and the Mill simply rolls off, perfectly manufactured. Indeed, I watch in awe, for they have accomplished the simplicity and multiplicity of ideas I have pursued in this area, as a pastime, myself. It is a supernova, and supernovas behold the birth of Christ.
Yes, we are witnessing the birth of a supernova, a great blessing on those who created it. Indeed, kings will be dwarfed. No good shall go unpunished though. Many enemies will rise.
In times past, and where I happen to live, you would be killed for this, unless you are “smart”, and “humbly” surrender your conquest to whomever has the guns, and obey and kneel before the Law of Men. Little men that think you are a dog that bites its master, little men society is built to enforce.
If wealth has any purpose at all beyond inflating little men beyond their abilities, into the world of distortion they impose, it is indeed to herald the pivotal entity compelled to crowbar the world into righteousness and strength, inverting reality, and die doing so.
Maybe. But it is a matter of engineering.
Thanks, Ivan! Good to know I did understand things correctly.
You have a truly impressive design here.
The AS/400 reference makes sense too – that is another great design that rarely gets much attention today. Another set of good old proven ideas renovated and reused for the modern day.
Inspiring.
I know what else doesn’t work: you Mr. Jakob, you are NOT working no more on your blog. Please power up(wake up) that part of your organic neural network.
The Burroughs 5500 and its descendants incorporated the concept of the “unexpected procedure call”, which it used to encapsulate interrupts and traps. It worked exactly as if the running code had issued a procedure call inserted “between” two instructions. When the synthetic procedure call returned after attending to whatever demanded attention, the previously running code proceeded as if nothing had happened.