I listened to an interesting FLOSS Weekly interview with Adam Hall and Achim Hasenmuller of VirtualBox. For someone interested in virtual machines and hardware simulation, the interview was full of interested tidbits. I think the best part was the discussion on multiprocessing in Virtualbox.
VirtualBox is able to give a guest OS up to 32 virtual processors for its use. The system virtualizes cores so that you can allocate more cores than you have on your host. You don’t want more active cores than you have physically, or performance might suffer badly (there is a good blog post hosted by blogs.sun.com about VirtualBox 3 and its SMP support, read it before Oracle decides to kill off all the old content…).
Second, the development of that SMP support only took some six months. But getting it to work right took 18 months. Sounds like a familiar story for parallelizing software of this kind. The interviewees made it sound like the code for this was utterly complex, and I can believe that too.
VirtualBox makes extensive use of hardware virtualization support on x86 hosts to enhance performance (Intel VT-X and AMD AMD-V). As they see it, all other alternatives are inferior, including the VmWare tradition of binary translation and patching. They claimed that VmWare actually interpreted all of the code in a guest, but I find that a bit hard to believe.
It is interesting to compare their approach to something like Simics, which was made parallel in the Spring of 2008 with Simics 4.0 Acclerator. The advantage an IT run-time virtual machine like VirtualBox has over a virtual platform like Simics in creating a parallel system is that they can make use of the cache coherence on the host to propagate information between the cores. A virtual platform cannot assume that the host and target are of the same type, which is necessary for this to make sense. It also makes VirtualBox entirely nondeterministic, but that is just plain normal for a physical computer system. One interesting intermediate form here is the IBM CECsim system, where a z-series mainframe is used to simulate a slightly different z-series mainframe, including running multiple simulated processors in parallel. CECsim also makes use of the hardware cache coherence and is nondeterministic.