Injecting faults into systems and subjecting them to extreme situations at or beyond their nominal operating conditions is an important part of making sure they keep working even when things go bad. It was realized very early in the history of Simics (and the same observation had been made by other virtual platform and simulator providers) that using a virtual platform makes it much easier to provide cheap, reliable, and repeatable fault injection for software testing. In an Intel Developer Zone (IDZ) blog post, I describe some early cases of fault injection with Simics.
I have a two-part series (one, two) on testing posted on my Software Evangelist blog on the Intel Developer Zone. This is a long piece where I get back to the interesting question of how you test things and the fact that testing is not just the same as development. I call the posts Mindset and Toolset
I just added a new blog post on the Wind River blog, about how you do fault injection with Simics. This blog post covers the new fault injection framework we added in Simics 5, and the interesting things you can do when you add record and replay capabilities to spontaneous interactive work with Simics. There is also a Youtube demo video of the system in action.
I find the subject of fault tolerance and resiliency in computers quite interesting. It also very interesting to look into what kinds of faults actually do happen in the real world, and what impact they have. I recently found a couple of good sources on this. First of all, a paper from Super Computing 2012 by Fiala et al, called “Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing” (ACM Digital Library). One of its references was to a 2011 talk by Al Geist, “What is the Monster in the Closet”, which provided some more data on how common faults are.
There is a new post at my Wind River blog, an interview with Andreas Buchwieser from the Wind River office in München. It discusses how Simics can be applied to the field of safety-critical systems, including helping test the software to get it certified. Really interesting, and in particular it is worth noting that qualifying tools in the IEC 61508 and ISO 26262 context is much easier than in DO-178B/C. The industrial family of safety standards have been created to allow for tools to help validate an application without forcing incredibly high demands on the development of those tools.
There is a new post at my Wind River blog, about how you actually do fault injection in Simics. This particular post is pretty detailed, showing the actual architecture of a fault injector in Simics, not just “yes you can do it”. It includes actual diagrams of system components and how you can insert fault injection into an existing system, so it is a bit more technical than most my Wind River blog posts that tend to be more conceptual.
Fault Injection is a topic that has fascinated me for a long time. Not just the area of software-to-software fault injection, but more so how you inject faults into hardware using hardware (and how to conveniently approximate this using a simulator). I just stumbled on a short interesting note about such hardware-actuated fault injection in a Fujitsu article.
Last week, I posted a discussion about fault injection in virtual systems, using Basil Fawlty as the perfect example of a fault injection agent.
I just spotted a fun little application on Freescale’s homepage: an interactive demo of the fault tolerance functions of the MPC564XL dual-core microcontroller.
I just read the panel interview at the start of the latest issue (Number 4, 2008) of ACM Queue. Here, you have Bryan Cantrill of Sun (the man behind dTrace) bemoan the difficulty of testing faults. In particular:
Part of the reason I’m interested in virtualization is as a development methodology. It has not delivered on this, but one of the things that I ask is can I use virtualization to automate someone pulling the Ethernet cable out of the jack? I can get a lot closer to simulating it if you let me create a toy virtual machine than I can running on the live machine.
Well, this already exists. It is a common feature to any virtual platform that is not a datacenter-oriented runtime engine like VmWare, Xen, LPAR, and its ilk. Doing fault injection is a primary use case for virtual platforms, especially for larger servers and systems featuring redundancy and fault tolerance.