I just found and read an old text in the computer systems field, “Why Do Computers Fail and What Can Be Done About It?”, written by Jim Gray at Tandem Computers in 1985. It is a really nice overview of the issues that Tandem had encountered in their customer base back in the early 1980s. The report is a classic in the computer systems field, but I did not read it until now. Tandem was an early manufacturer of explicitly fault-tolerant, highly reliable, and highly available computers. In this technical report, Jim Gray describes the basic principles of fault tolerance and the kinds of faults that happen in the field and need to be tolerated.
The obvious question when reading a text this old is just how relevant the information is for the world of computing today. I see no reason to think that the fundamental principles have changed, even though all numbers in the report are way off compared to current technology. For example, in a description of a restart scenario, he cites a time of about 90 minutes from start to a live system. Today, most systems would come up much faster than that. Networking has changed dramatically since 1985 in terms of speed, latency, and robustness. Even so, it is still true that communication links are normally the weakest link in any distributed system.
The report contains some wonderful quotes.
“Tandem supplies about 4 million lines of code to the customer. Despite careful efforts, bugs are present in this software.”
I appreciate his honesty about this fact, even if nobody since 1950 would be foolish enough to believe any software free of bugs (except possibly TeX). On the topic of human (administrator) errors in handling the systems and causing outages, he notes that most maintenance can be done while the system is online, without bringing it down for a restart (that is better than most systems of today). Still, mistakes that forced the system to go down did happen, but only once every 31 years of system operation.
“The notion that mere humans make a single critical mistake every few decades amazed me – clearly these people are very careful and the design tolerates human faults.”
However, this might be under-reported:
“If a system fails because of the operator, he is less likely to tell us about it.”
In his numbers, it is worth noting that hardware was already much more stable than software in 1985. The main causes of outages were communications failures, power failures, and software. The hardware was the least likely of all to fail: as it was built to be dual-redundant and to recover from component faults, its MTBF was on the order of 1,000 years for the system as a whole. Each individual component was not that reliable, but thanks to redundancy the system as a whole was.
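To see how redundancy can turn modest components into a system with an MTBF of this magnitude, here is a small back-of-the-envelope sketch. It uses the common textbook approximation for a repairable pair, MTBF_pair ≈ MTBF² / (2 × MTTR), which assumes independent failures and a repair time much shorter than the module MTBF. The one-year module MTBF and four-hour repair time are my own illustrative assumptions, not numbers taken from the report.

```python
HOURS_PER_YEAR = 24 * 365


def duplex_mtbf_hours(module_mtbf_hours: float, repair_hours: float) -> float:
    """Approximate MTBF of a pair where either module can carry the load alone.

    The pair is only down if the second module fails while the first one is
    still being repaired. Assumes independent failures and
    repair_hours << module_mtbf_hours.
    """
    return module_mtbf_hours ** 2 / (2 * repair_hours)


# Illustrative assumptions: each module fails about once a year,
# and a failed module is repaired or replaced within four hours.
module_mtbf = 1 * HOURS_PER_YEAR
repair_time = 4.0

pair_years = duplex_mtbf_hours(module_mtbf, repair_time) / HOURS_PER_YEAR
print(f"Pair MTBF: about {pair_years:.0f} years")  # about 1095 years
```

Under these assumptions, a module that fails once a year becomes a pair that fails roughly once a millennium. The result is very sensitive to the repair time, which is why quick detection and repair matter as much as the duplication itself.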
The main conclusion is a familiar one, and one always worth keeping in mind:
“the key to high-availability is tolerating operations and software faults”
His principles for achieving this still hold true:
- Fail fast – make sure to detect errors and react to them quickly. The quicker an error is caught, the lower the risk that it spreads and brings the entire system down.
- Modularity – make sure that individual components can be restarted in isolation and that their failures do not spill over into other components. In 1985, separating processes in an operating system using the MMU was still cutting-edge, but today this is standard for most operating systems.
- Paired components – both software and hardware should run in pairs, with one component in the pair ready to take over if the other fails.
- Transaction mechanisms – his favorite system design is based on using transactions to make it possible to abort and restart operations without causing system state corruption.
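As a toy illustration of how these principles fit together, here is a minimal sketch in Python. It is my own simplification and not Tandem's actual mechanism: the names `FailFast`, `run_transaction`, `run_on_pair`, and `transfer` are all hypothetical, and a real process pair would run on separate processors with checkpointed state. The operation fails fast by raising an exception, the transaction works on a private copy so an abort leaves the committed state untouched, and the backup reruns the operation when the primary fails.

```python
import copy


class FailFast(Exception):
    """Raised as soon as a component detects that something is wrong."""


def run_transaction(state, operation):
    """Apply operation to a private copy; only a clean finish yields new state."""
    working = copy.deepcopy(state)
    operation(working)  # may raise FailFast; the original state is untouched
    return working


def run_on_pair(state, primary, backup):
    """Try the primary; if it fails fast, abort and let the backup take over."""
    try:
        return run_transaction(state, primary), "primary"
    except FailFast:
        return run_transaction(state, backup), "backup"


def transfer(amount, fail=False):
    """Build a hypothetical two-step operation that can fail halfway through."""
    def operation(accounts):
        accounts["a"] -= amount
        if fail:
            raise FailFast("simulated fault between debit and credit")
        accounts["b"] += amount
    return operation


accounts = {"a": 100, "b": 0}
new_accounts, ran_on = run_on_pair(accounts, transfer(30, fail=True), transfer(30))
print(ran_on, new_accounts)  # backup {'a': 70, 'b': 30}
print(accounts)              # {'a': 100, 'b': 0}, the aborted attempt left no trace
```

The point of the copy is that the half-finished debit from the failed primary never becomes visible. Without the transaction, the takeover would start from corrupted state.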
The reasoning behind using paired components is actually quite interesting. In their experience, most faults in the field are transient. If a software unit fails, it is most likely to be due to some one-off corruption event, timing issue, or similar that creates a non-repeatable bug. This is the classic Heisenbug, the bug that cannot be repeated and goes away if you try to repeat it. Heisenbugs are highly annoying if you are trying to fix the software, but they are obviously of great value if you are looking for reliability. Just retry the operation or restart the transaction or process, and most likely it will work.
“The assertion that most production software bugs are soft – Heisenbugs that go away when you look at them – is well known to systems programmers. Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring. But Heisenbugs may elude a bugcatcher for years of execution. Indeed, the bugcatcher may perturb the situation just enough to make the Heisenbug disappear. This is analogous to the Heisenberg Uncertainty Principle in Physics.”
Jim Gray makes the excellent point that the simple repeatable and easily triggered Bohrbugs are going to be found in testing and not make it to the field. An issue that makes a system crash or stop every time will be found early. The bugs that make it into the deployed systems are thus not going to be the easy bugs. This definitely fits my experience – with the noteworthy exception that you get unexpected, untested, but repeatable environmental conditions or inputs out in the field that do uncover Bohrbugs. For Tandem, working in a fairly restricted environment, this was probably less of an issue than for embedded systems. Jim Gray did not have very good data on how common it was that Bohrbugs escaped into deployed systems, but in one experiment he found that out of 132 logged software errors in a “spooler” subsystem, only one was a Bohrbug that failed both in the primary and secondary.
For an embedded system operating out in a complex world, I believe that unexpected data triggering latent Bohrbugs is unfortunately more likely to happen. On the other hand, if the unexpected data itself is transient, tolerating the issue via a restart seems perfectly valid. Even though it would be repeatable given the same inputs, if the same inputs do not repeat, neither will the bug. Thus, given input variance and non-repeatability, the difference between a Bohrbug and a Heisenbug is starting to get a bit blurred.
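To make the distinction concrete, here is a small sketch of how a plain retry treats the two kinds of bugs. Everything in it is hypothetical and simulated: `make_heisenbug` fakes a transient fault that disappears on the next attempt, while `bohrbug` fails the same way every time it is called.

```python
def retry(operation, attempts=3):
    """Run operation up to `attempts` times and return the first success."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
    raise last_error


def make_heisenbug(failures_before_success=1):
    """Simulate a transient fault that goes away after a few attempts."""
    remaining = [failures_before_success]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient fault (simulated)")
        return "ok"

    return operation


def bohrbug():
    """Simulate a deterministic bug: same input, same failure, every time."""
    return 1 // 0


print(retry(make_heisenbug()))  # ok

try:
    retry(bohrbug)
except ZeroDivisionError:
    print("retrying did not help")
```

If the input that triggers `bohrbug` never shows up again after the restart, the retry succeeds anyway, which is exactly the blurring described above.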
A consequence of the Heisenbug hypothesis is that there is no point in doing multi-version programming. All components can run the same software, since errors are likely to come from the environment rather than being built into the software. The cost of multi-version programming simply does not pay off in increased reliability in practice, at least not for this type of system.
One thing that is very different today compared to 1985 is the sheer scale of systems, so hardware faults are probably more common in today’s datacenters than in those of 1985. This makes tolerating hardware faults mandatory, but today it is not done with paired machines. Instead, a system of thousands of units keeps some spare capacity, so that a couple of failing units can be taken out and swapped without much harm.
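A rough way to reason about how much spare capacity such a system needs is a simple binomial model. The sketch below is my own illustration, not something from the report: it assumes units fail independently with the same probability during one repair window, which real datacenters violate whenever a rack, switch, or power feed takes out many units at once. The fleet size and failure probability are made-up numbers.

```python
from math import comb


def prob_more_failures_than_spares(units: int, p_fail: float, spares: int) -> float:
    """Probability that more than `spares` of `units` fail in one repair window.

    Assumes every unit fails independently with probability p_fail.
    """
    p_covered = sum(
        comb(units, k) * p_fail ** k * (1 - p_fail) ** (units - k)
        for k in range(spares + 1)
    )
    return 1 - p_covered


# Made-up example: 1000 units, each with a 0.1% chance of failing
# before a broken unit can be swapped out.
for spares in (0, 2, 5):
    risk = prob_more_failures_than_spares(1000, 0.001, spares)
    print(f"{spares} spares: risk of running short = {risk:.4f}")
```

Under the independence assumption, each additional spare lowers the risk of running short, so a handful of spares in a large fleet can play the role that the paired component played at Tandem.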
The paper is still worth a read, both as a beautiful piece of analysis and writing and for its simple, well-expressed basic principles of robust systems design.