I find the subject of fault tolerance and resiliency in computers quite interesting. It is also very interesting to look into what kinds of faults actually happen in the real world, and what impact they have. I recently found a couple of good sources on this. The first is a paper from Supercomputing 2012 by Fiala et al., called “Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing” (ACM Digital Library). One of its references was to a 2011 talk by Al Geist, “What is the Monster in the Closet”, which provided some more data on how common faults are.
The first interesting fact from these sources was that ECC (Error Correction Codes) as done today in standard computers is weak at the scale of HPC setups: errors that hit enough bits at a time to get around the correction, and even the detection, of errors will happen. Such errors happened daily in the Jaguar machine in 2011 according to Al Geist, and would be expected to happen every few minutes in exascale systems.
What this really indicates is that errors scale in a non-linear way. Unless resiliency and redundancy are built in from the start, a larger system will present a larger target. With more components in the system, the probability that some component will suffer a failure within a certain time frame goes from “very unlikely” to “virtually certain”. The value of the work also tends to go up as the machines get bigger and runs get longer.
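To put rough numbers on the “very unlikely” to “virtually certain” shift, here is a minimal sketch. It assumes that components fail independently and that each has the same failure probability within the time window. The 1e-6 figure and the function name are made up for illustration and are not taken from the paper or the talk.

```python
import math


def prob_any_failure(p_component: float, n_components: int) -> float:
    """Probability that at least one of n independent components fails.

    Computes 1 - (1 - p)^n using log1p/expm1 so that tiny values of p
    do not get lost to floating-point rounding.
    """
    return -math.expm1(n_components * math.log1p(-p_component))


if __name__ == "__main__":
    # Hypothetical per-component failure probability within the window.
    p = 1e-6
    for n in (1, 1_000, 100_000, 10_000_000):
        print(f"{n:>10} components: {prob_any_failure(p, n):.6f}")
```

With these assumed numbers, a single component almost never fails, while a machine with ten million of them almost certainly sees a failure somewhere in the same window.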
This is why a desktop computer can get away without ECC memory: if it randomly crashes once a year from a memory corruption, or some computation running a game or browsing the web gets the wrong result, users will just restart the machine or application and everything will be fine. This is the nature of intermittent faults, as I discussed in an earlier blog post.
For an HPC machine, I think that a program crash from memory corruption is very unlikely, as most of memory is data. Rather, data corruption will result in corrupted results, and some higher-level mechanism is needed to catch that. The insidious thing with data corruption is that you cannot really tell from the final result that it happened. Rather, you have to do a few runs and compare. But even that is not necessarily enough to catch errors: if errors happen with certainty during each run, you are more likely to get N wrong results than one wrong and N-1 correct results. The paper by Fiala describes how error detection and correction can be implemented essentially using the classic three-way redundancy with voting concept used in all safety-critical systems.
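The voting idea itself is easy to sketch. The snippet below is a generic majority voter over replica results, with a hypothetical function name. It only illustrates the concept and is not the mechanism from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value that a strict majority of replicas agree on.

    Raises ValueError when there is no strict majority, which means the
    corruption was detected but cannot be corrected by voting.
    """
    if not results:
        raise ValueError("no replica results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > len(results):
        return value
    raise ValueError("replicas disagree and no majority exists")


if __name__ == "__main__":
    # Three replicas computed the same value; one suffered a bit flip.
    print(majority_vote([1024, 1024, 1025]))
```

With two replicas you can only detect a mismatch. The third replica is what lets the voter pick a winner and continue without a restart.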
Errors at run time can also manifest themselves as crashes. Al Geist has an example of a voltage regulator (actually, 18000 of them) that caused failures in the Jaguar machine. Such a failure manifests itself as a node crash. Another example is if an ECC system detects a corruption but cannot correct it. Reboot is the only good solution.
The paper by Fiala et al. makes the important point that current HPC practice is to use checkpoint and restart in all long-running applications. The reason is that node failures have become common enough that they have to be handled. But checkpoint and restart have a cost, and if restarts happen too often, they will start to eat into the useful time on the machine. They do some modeling, and show that eventually, all time will be spent restarting and no time will be spent computing.
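The shape of that result can be reproduced with a back-of-the-envelope model. The sketch below is not the model from the paper. It uses the textbook first-order approximation with Young's checkpoint interval, and it assumes that the system mean time between failures is the node MTBF divided by the node count. All parameter values and names are made up for illustration.

```python
import math


def useful_fraction(node_mtbf_h: float, nodes: int,
                    checkpoint_h: float, restart_h: float) -> float:
    """Rough fraction of machine time spent on useful computation.

    First-order approximation, only reasonable while the checkpoint
    interval is well below the system MTBF; clamped at zero once the
    overhead exceeds the available time.
    """
    # Assumption: independent node failures, so system MTBF shrinks as 1/N.
    system_mtbf = node_mtbf_h / nodes
    # Young's approximation for the checkpoint interval.
    interval = math.sqrt(2.0 * checkpoint_h * system_mtbf)
    # Time lost to writing checkpoints, plus rework and restart per failure.
    waste = checkpoint_h / interval + (interval / 2.0 + restart_h) / system_mtbf
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Hypothetical numbers: 5-year node MTBF, 6-minute checkpoint and restart.
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9} nodes: {useful_fraction(43_800.0, n, 0.1, 0.1):.2f}")
```

Under these assumed numbers the useful fraction drops steadily as the node count grows and reaches zero at a million nodes, which is the qualitative point: at some scale the machine does nothing but checkpoint and restart.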
If errors can be corrected on the fly without the need for restarts, more time will be spent computing. It is simply much more light-weight to correct a small issue when it happens, in a localized fashion, than to stop and restart everything. This means that a machine that is triple-redundant and corrects errors on the fly will eventually outperform a single-way machine. Without checkpoint-restart the single-way machine will never get anything computed. With checkpoint-restart, the triple-redundant machine will eventually be faster as the machine scale goes up and failures happen more often.
In this situation, using continuous error detection and correction makes a lot more sense. And redundancy starts to make economic sense – a massively redundant machine is actually cheaper than a simplistic machine. This is really interesting as a result!
The reason is really that HPC and a few other areas like financial services and safety-critical control cannot really afford to be “almost right”.
This is quite unlike the other category of vastly large machines that are being built today, the web-services data centers. These are built with ECC memory as anything else would be too error prone, but then the assumption is that jobs can come and go and be restarted without much impact. There is no “right” answer, really, and no real cost to being a little wrong. In that world, the main issues being dealt with are machine crashes and recovering from jobs dying due to intermittent data-caused bugs. Then, having one “spare” machine per N machines is enough.
Telecom gear uses a similar model, where you often have master-slave or N+1 redundancy. Once again, the threat is software crashing or hardware going down. But a little bit of memory corruption causing some wrong result somewhere in the chain does not really matter. The steady flow of new independent events and data packets washes away any errors pretty quickly. In HPC, by contrast, if a result starts to drift off, errors will accumulate and you will have produced incorrect results. That is simply not acceptable, since the errors can be arbitrarily large.
In finance, you have the same issue. Being wrong means losing people’s money, and that does not work. That’s why the big iron that powers the banks uses far more error checking and resiliency than anything else. You have to notice if something goes wrong, even when cosmic rays hit the processor registers. The hardware to do this is much more expensive, MIPS for MIPS, than a web server. But it is also reliable in a way that a web server is not and does not have to be.
I also noted that an old acquaintance from my research days was on the SC paper: Frank Mueller at NCSU.