Fault injection is a topic that has fascinated me for a long time. Not just software-to-software fault injection, but even more so how you inject faults into hardware using hardware (and how to conveniently approximate this using a simulator). I just stumbled on a short but interesting note about such hardware-actuated fault injection in a Fujitsu article.
The Fujitsu Scientific and Technical Journal is Fujitsu's equivalent of IBM's Journal of Research and Development. Thankfully, the FSTJ is still free, while IBM has erected a paywall around its articles. Issue number 2 of 2011 had servers as its theme, and it contains an article on Quality Assurance for Stable Server Operation by Masafumi Matsuo, Yuji Uchiyama, and Yuichi Kurita.
The article describes the process of ensuring that the final servers shipped to customers (apparently from the SPARC-based line of Fujitsu computers, though it might well apply equally to their mainframes and x86-based servers) are as stable as possible. Apart from designing things right, this also requires testing the fault-handling and recovery operations.
I found it noteworthy that they do a lot of configuration testing, where various hardware and software configurations are played off against each other. In this way, corner cases are explored, and coverage of the actual configurations that customers will be using becomes more likely (it is always dangerous to test on only one or a few configurations). They push memory-system and processor loads to very high levels to ensure continued operation even in extreme cases, and also stress the actual chips to make sure they will operate reliably across a wide range of environmental conditions. Indeed, a large focus is placed on pure physical reliability, as that is the basis for system reliability.
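The article does not say how Fujitsu enumerates these configurations, but the core idea of combinatorial configuration coverage is easy to sketch. The configuration axes below (CPU count, memory size, OS version, I/O cards) are entirely hypothetical, picked just to illustrate how quickly the full cross product grows:

```python
import itertools

# Hypothetical configuration axes -- the article does not list the actual
# parameters Fujitsu varies, so these names and values are illustrative only.
config_axes = {
    "cpus": [1, 2, 4, 8],
    "memory_gb": [16, 64, 256],
    "os": ["Solaris 10", "Solaris 11"],
    "io_cards": [0, 1, 4],
}

def all_configurations(axes):
    """Yield every combination of the given configuration axes.

    This is exhaustive coverage: the number of test configurations is the
    product of the axis sizes, so it grows multiplicatively with each axis.
    """
    names = list(axes)
    for values in itertools.product(*(axes[n] for n in names)):
        yield dict(zip(names, values))

configs = list(all_configurations(config_axes))
print(len(configs))  # 4 * 3 * 2 * 3 = 72 combinations
```

Even four small axes already give 72 configurations, which is why real test organizations often fall back on pairwise or risk-weighted sampling rather than the full cross product.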
The best part, however, is on page four of the article, where they show the physical fault-injection robot that is applied to chips mounted on boards. The robot shorts out individual pins on the chips, clamping them to zero volts. It goes over all the pins, one at a time, and the test system checks what happens to the system in each case. Some kind of exhaustive testing going on here.
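The campaign the robot runs is essentially a loop over pins with a stuck-at-zero fault injected at each one, which is also how you would approximate it in a simulator. Here is a minimal sketch of that loop; everything in it is hypothetical (the article gives no interface details), and the lambda stand-ins replace what would really be a robot controller and a system monitor:

```python
# Illustrative sketch of an exhaustive stuck-at-zero fault campaign, the
# kind of loop the pin-shorting robot effectively executes in hardware.
# All names are hypothetical; a real setup would drive a robot controller
# and query a system monitor instead of these toy callbacks.

def run_fault_campaign(pins, inject, release, observe):
    """Clamp each pin to ground in turn and record the system's reaction."""
    results = {}
    for pin in pins:
        inject(pin)               # short the pin to 0 V
        results[pin] = observe()  # e.g. "recovered", "degraded", "crashed"
        release(pin)              # remove the short before the next pin
    return results

# Toy stand-ins for demonstration: pretend pin 3 is the one unprotected pin.
log = []
outcome = run_fault_campaign(
    pins=range(8),
    inject=lambda p: log.append(("short", p)),
    release=lambda p: log.append(("release", p)),
    observe=lambda: "crashed" if log[-1] == ("short", 3) else "recovered",
)
print(outcome[3], outcome[0])  # crashed recovered
```

The value of making the loop exhaustive is exactly what the article implies: a single unprotected pin shows up in the results table instead of in a customer's machine room.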
Neat. I have heard other stories of physical fault injection, ranging from complex mechanisms like passing computer boards through irradiation chambers all the way to brutally simple methods like taking an axe to a board to break it, or pulling boards out of racks to simulate a sudden catastrophic failure. I would like to see more of just how these things are done in the real world. I suspect there are quite a few interesting robotics setups out there doing fault injection.
In any case, the article offered an interesting glimpse of many of the techniques used to make computer systems robust and reliable. Recommended.
It ends by noting that deeply consolidated SoC designs and aggressive dynamic power management are challenging from a testing and observation perspective, and that they also create more single points of failure. If a single system-on-chip fails, that whole system is gone…