I just read the panel interview at the start of the latest issue (Number 4, 2008) of ACM Queue. In it, Bryan Cantrill of Sun (the man behind DTrace) bemoans the difficulty of testing faults. In particular:
Part of the reason I’m interested in virtualization is as a development methodology. It has not delivered on this, but one of the things that I ask is can I use virtualization to automate someone pulling the Ethernet cable out of the jack? I can get a lot closer to simulating it if you let me create a toy virtual machine than I can running on the live machine.
Well, this already exists. It is a common feature of any virtual platform that is not a datacenter-oriented runtime engine like VMware, Xen, LPAR, and their ilk. Doing fault injection is a primary use case for virtual platforms, especially for larger servers and systems featuring redundancy and fault tolerance.
I am of course wonderfully excited to hear that this problem has already been solved! So tell me: which virtualization platform allows me to pull a virtual 10 GigE NIC? Or allows me to pull out one half of a LACP’d 10 GigE link aggregation? Or allows me to pull an IPMP’d link under load and validate that the other path picks up full bandwidth within the response times that I must meet to deliver service? If the virtualized hardware is not “datacenter-oriented”, you can forget it — testing my software on a toy system has little value for me.
Background technology first: The basic property that a virtual system needs to fulfill is that it does indeed model a particular system’s complete hardware. Once that is in place, pulling bits and pieces is pretty simple. And the software will react in whatever way it is designed to (getting an alert interrupt, noting in a timeout that something has gone dead, etc.).
This is something that I have been part of doing for rack-based telecom systems, which are on the same scale as typical servers (tens to hundreds of processors, tens of boards). Simics also has some models of older SPARC servers like the UltraSPARC III/IV/IV+-based Sun Fires.
The nice thing with modeling the hardware directly rather than protocols in the middle of the stack is that killing things is pretty easy to do. Just turn off part of the model, or send in some “I am dead” interrupt. And then let the software react in whatever manner it is written to. The hardware/software interface is very well-behaved in that respect, since it is narrow and well-defined.
Pulling out virtual cables is as simple as a single-line command like “link0.disconnect machine0_phy4”.
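To make the two detection paths concrete, here is a minimal sketch in plain Python. It is not Simics code and uses no Simics API: the `Board` and `Supervisor` classes, the heartbeat register, and the three-poll timeout are all assumptions invented for illustration. One board dies loudly with an “I am dead” interrupt, another dies silently and is only noticed when its heartbeat stops advancing.

```python
class Board:
    """Toy hardware model (hypothetical): a board with a heartbeat register."""

    def __init__(self, name):
        self.name = name
        self.powered = True
        self.heartbeat = 0
        self.irq_handler = None  # installed by the "software" side

    def tick(self):
        # A live board advances its heartbeat once per simulated cycle.
        if self.powered:
            self.heartbeat += 1

    def kill(self, announce=False):
        # Fault injection: turn off this part of the model.
        self.powered = False
        if announce and self.irq_handler:
            self.irq_handler(self.name)  # the "I am dead" interrupt


class Supervisor:
    """Stand-in for the target software that monitors the boards."""

    def __init__(self, boards, timeout=3):
        self.boards = boards
        self.timeout = timeout  # polls without progress before declaring death
        self.last_seen = {b.name: (b.heartbeat, 0) for b in boards}
        self.dead = []
        for b in boards:
            b.irq_handler = self.on_dead_irq

    def on_dead_irq(self, name):
        # Fast path: the hardware told us explicitly.
        if name not in self.dead:
            self.dead.append(name)

    def poll(self):
        # Slow path: notice that a heartbeat has stopped advancing.
        for b in self.boards:
            last_value, stale_polls = self.last_seen[b.name]
            stale_polls = stale_polls + 1 if b.heartbeat == last_value else 0
            self.last_seen[b.name] = (b.heartbeat, stale_polls)
            if stale_polls >= self.timeout and b.name not in self.dead:
                self.dead.append(b.name)


boards = [Board("board0"), Board("board1"), Board("board2")]
sup = Supervisor(boards)
for cycle in range(10):
    if cycle == 2:
        boards[0].kill(announce=True)  # detected at once via interrupt
    if cycle == 4:
        boards[1].kill()               # silent death, found by timeout
    for b in boards:
        b.tick()
    sup.poll()

print(sup.dead)
```

The point of the sketch is that the fault is injected purely on the model side, while the software side is left unmodified and reacts through whatever mechanism it already has.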
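As a rough illustration of what such a disconnect means for redundant links, the following plain-Python sketch models two cables and a software-side bonded interface that fails over between them. It does not reproduce the Simics command line or its API: the `Link` and `BondedInterface` classes and the endpoint names other than `link0` and `machine0_phy4` are hypothetical.

```python
class Link:
    """Toy model of a cable (hypothetical): tracks which PHYs are plugged in."""

    def __init__(self, name):
        self.name = name
        self.endpoints = set()

    def connect(self, phy):
        self.endpoints.add(phy)

    def disconnect(self, phy):
        # Pulling the cable out of one jack.
        self.endpoints.discard(phy)

    def carries(self, src, dst):
        # Traffic flows only if both ends are still plugged in.
        return src in self.endpoints and dst in self.endpoints


class BondedInterface:
    """Stand-in for target software that fails over between member links."""

    def __init__(self, paths):
        self.paths = paths  # list of (local_phy, link, remote_phy)

    def send(self, frame):
        # Use the first member link that still has both ends connected.
        for local_phy, link, remote_phy in self.paths:
            if link.carries(local_phy, remote_phy):
                return link.name
        return None  # no path left, the frame is dropped


link0, link1 = Link("link0"), Link("link1")
paths = [
    ("machine0_phy4", link0, "machine1_phy4"),
    ("machine0_phy5", link1, "machine1_phy5"),
]
for local_phy, link, remote_phy in paths:
    link.connect(local_phy)
    link.connect(remote_phy)

bond = BondedInterface(paths)
before = bond.send("ping")
link0.disconnect("machine0_phy4")  # pull one half of the redundant pair
after = bond.send("ping")
link1.disconnect("machine0_phy5")  # pull the other half as well
isolated = bond.send("ping")

print(before, after, isolated)
```

A real model would also track timing, so that the failover latency could be measured against a response-time requirement; this sketch only shows the connectivity logic.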