I have read some recent IBM articles about the POWER8 processor and its hardware debug and trace facilities. They are very impressive, and quite interesting to compare to what is usually found in the embedded world. Instead of being designed to help with software debug, the hardware mechanisms in the POWER8 are focused on silicon bringup and on performance analysis and verification in IBM’s own labs – as well as on supporting virtual machines and JIT-based systems!
The articles I read were “IBM POWER8 performance features and evaluation” by Mericas et al., and “Debugging post-silicon fails in the IBM POWER8 bring-up lab” by Dusanapudi et al. Both appeared in the IBM Journal of Research and Development, No. 1, January/February 2015.
The hardware debug features described in the article are really only intended for post-silicon system bringup! The hardware debug that I am used to typically targets software developers, to help them debug their software, but the hardware debug described for the POWER8 is there to help hardware designers find hardware bugs in silicon. It is used to validate the POWER8 design using real chips, to find the flaws that made it through pre-silicon verification, so that they can be fixed for the next silicon spin. Eventually, this results in a better final shipping product. It was not clear from the article how many spins IBM had to make of the POWER8, but I would guess they needed at least two passes to find all issues.
When the first silicon arrives, it is used to extensively test the system – since real silicon runs so much faster than any chip-level simulator or emulator, it only takes a few days to accumulate more runtime than the entire pre-silicon verification phase!
The validation phase on hardware is driven by bare-metal test cases that have been prepared pre-silicon using both IBM’s fairly fast in-house transaction-level “Mambo” simulator and various hardware simulation solutions. In addition, the hardware can actually provide test inputs to itself by invoking various special operational modes! I’ll get back to that below.
Since repeatability helps with debug, there is an operational mode where a POWER8 processor core along with its L1, L2 and L3 caches can be run in a “cycle-repeatable” way, with no variation between runs starting with the same state and running the same software. Even more impressively, the state of this repeatable island can be extracted from the processor and injected into a software simulation model! That is actual working state transfer from hardware to simulation, which I don’t think I have seen before. I am a big fan of imposing determinism and repeatability in order to help debug, and doing that in hardware is very impressive indeed! This only works for a single core, however, and beyond that, you have to resort to rerunning tests with certain inputs that “should” trigger bugs.
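The repeatable-island idea can be sketched in software terms: a fully deterministic model whose complete state can be snapshotted and re-injected into another instance, so the same run can be reproduced exactly. All names and the state layout here are invented for illustration – the real hardware state is obviously vastly richer.

```python
# Hypothetical sketch of the "repeatable island": a deterministic model
# whose full state can be extracted and injected elsewhere. Names and
# state contents are invented; the real chip state is far richer.
import copy

class CoreModel:
    def __init__(self, seed_state):
        self.state = dict(seed_state)  # registers, cache contents, etc.

    def step(self, instruction):
        # A purely deterministic transition: the same state plus the
        # same input always yields the same next state.
        self.state["pc"] = self.state.get("pc", 0) + 4
        self.state["last"] = instruction

    def snapshot(self):
        # Extract the complete state of the island...
        return copy.deepcopy(self.state)

    def restore(self, snap):
        # ...and inject it into another model instance.
        self.state = copy.deepcopy(snap)

hw = CoreModel({"pc": 0})
for insn in ["ld", "add", "st"]:
    hw.step(insn)
snap = hw.snapshot()      # state pulled out of the "hardware"
sim = CoreModel({})
sim.restore(snap)         # injected into the "simulator"
assert sim.snapshot() == hw.snapshot()
```

Because the transition function is deterministic, continuing execution from the restored state in the simulator reproduces exactly what the hardware would have done next.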
When issues are found in hardware, the failing test is first run on a different machine to rule out hardware manufacturing issues, electrical faults, and similar. Once a failure has been reproduced, the team then searches for a minimal test case that can still replicate it. This can apparently easily take days for a failure that originally happened after a few seconds, but in the end, it is better to have a good minimal test case than a very large and difficult one. Once a reliable “fail recipe” is found, the test team goes on to analyze it and its causes using the on-chip debug hardware.
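The article does not say how the minimization is actually done, but a classic automated approach to shrinking a failing test is greedy reduction in the spirit of delta debugging: repeatedly drop steps and keep the smaller test whenever it still fails. A toy sketch, with an invented failure oracle:

```python
# Illustrative greedy test-case minimization (delta-debugging style).
# This is my own sketch, not IBM's method, which the article does not
# describe. The failure oracle below is a toy stand-in.
def minimize(test, still_fails):
    """Shrink a failing test (a list of steps) while it keeps failing."""
    i = 0
    while i < len(test):
        candidate = test[:i] + test[i + 1:]   # try dropping one step
        if still_fails(candidate):
            test = candidate                  # smaller test still fails: keep it
        else:
            i += 1                            # that step was needed: move on
    return test

# Toy oracle: the "bug" triggers whenever steps 2 and 7 co-occur.
fails = lambda t: 2 in t and 7 in t
print(minimize(list(range(10)), fails))   # -> [2, 7]
```

The real process is of course far harder, since a hardware failure may be timing-dependent and only probabilistically reproducible, which is exactly why it can take days.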
In all cases, a bug is only resolved once it is repeated in the pre-silicon environment. In this way, hardware discoveries are fed back to the “source code” and design models, so that they can be properly resolved before the next hardware spin.
To help debug issues, the hardware also features built-in trace buffers that capture state changes or bus traffic in various locations of the chip. These debug features are only used for silicon validation, as they are too low-level to serve software debug. In their design, though, they look a lot like embedded debug solutions such as ARM CoreSight – including triggering machines that watch events and stop the processor when an error is found.
There is a hardware “attention” instruction that test software can use to totally stop the processor when it detects an error. I don’t think I have seen that before either; it is a bit like triggering a highest-priority non-maskable interrupt (NMI) – except that it jumps to the hardware and pulls the emergency brake. I have to assume that the instruction is highly privileged, as it sounds like a great way to crash a system in real life, and a very nice target for an intruder mounting a denial-of-service (DoS) attack – presumably, it can be turned off hard by the boot code. Another way to stop the chip is to have the built-in trace checker machines issue a machine checkstop that freezes all clocks pretty much immediately, which means that the captured state is closer to the cause of the failure. Once the chip is stopped, JTAG chains can be used to extract data, and the chip can be restarted to extract trace buffer data – provided the stop did not damage the data. All useful stuff that allows software to exercise the hardware while still retaining hardware-level insight into what is going on.
For the specific case of cache coherency, the hardware even has built-in checkers that apply some fairly simple rules to spot errors:
One example of this type of hardware check is a master machine check which detects when it has received more data than they are expecting for a read type command. This simple check serves as an indication that two caches have errantly provided data for the same request which is a sign that both caches errantly think they have exclusive ownership of the same line.
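The quoted rule is simple enough to sketch: a read master counts data responses per outstanding request, and more than one response means two caches both believed they held the line exclusively. Everything below is my own illustrative model, not IBM's implementation.

```python
# Sketch of the quoted coherency check: a read master expects exactly
# one data response per read; a second responder indicates that two
# caches both think they exclusively own the line. Purely illustrative.
class ReadMaster:
    def __init__(self):
        self.expected = {}   # request id -> data responses still expected

    def issue_read(self, req_id):
        self.expected[req_id] = 1    # exactly one data response expected

    def on_data(self, req_id):
        self.expected[req_id] -= 1
        if self.expected[req_id] < 0:
            # Two caches answered the same read: coherency violation,
            # which in hardware would raise a checkstop.
            raise RuntimeError(f"coherency error on request {req_id}")

m = ReadMaster()
m.issue_read("ld-0x1000")
m.on_data("ld-0x1000")       # fine: the single expected response
try:
    m.on_data("ld-0x1000")   # a second responder trips the checker
except RuntimeError as e:
    print(e)
```

The appeal of such checkers is that they are cheap – a counter and a comparison – yet catch a whole class of protocol bugs the instant they manifest, rather than megacycles later when corrupted data is finally consumed.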
In addition to the trace and detection support, there are also hardware modes that help force rare conditions to appear. The memory system has several built-in “stresser” operating modes that make caches appear smaller than they are, that force random evictions from a cache, and that help create enough traffic to purposely overload parts of the system. The software test cases use these modes plus software actions to uncover deadlocks, performance degradations, and cache coherency issues. It is hardware-assisted fault injection, nice.
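The effect of such stresser modes is easy to demonstrate on a toy cache model: shrink the effective capacity and randomly evict lines, and refill/eviction activity that would be rare in normal operation becomes constant. The model and its parameters below are entirely invented.

```python
# Illustrative sketch of "stresser" modes: the same cache model run with
# reduced effective size and forced random evictions, so rare eviction
# and refill interleavings happen far more often. All details invented.
import random

class StressableCache:
    def __init__(self, size, stress=False, evict_prob=0.25, seed=1):
        self.size = size // 4 if stress else size  # appear smaller
        self.stress = stress
        self.evict_prob = evict_prob
        self.rng = random.Random(seed)
        self.lines = {}

    def access(self, addr):
        hit = addr in self.lines
        if not hit:
            if len(self.lines) >= self.size:       # capacity eviction
                self.lines.pop(next(iter(self.lines)))
            self.lines[addr] = True
        if self.stress and self.rng.random() < self.evict_prob:
            self.lines.pop(addr)                   # forced random eviction
        return hit

normal = StressableCache(64)
stressed = StressableCache(64, stress=True)
addrs = [a % 32 for a in range(1000)]
print(sum(normal.access(a) for a in addrs))    # many hits
print(sum(stressed.access(a) for a in addrs))  # far fewer hits
```

With a working set that fits comfortably in the normal cache, the stressed configuration still generates a steady stream of misses and evictions – which is the point: the logic that handles those corner cases gets exercised thousands of times more per second.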
Optimization support is logically part of the software-visible performance monitoring system of the POWER8. Just like all other modern processors, the POWER8 features a large set of possible events to count in its PMU (Performance Monitoring Unit). I found it somewhat interesting that the hypervisor and operating-system (supervisor) levels of the software stack have their own separate PMUs – but it makes sense, since they need to watch for different things and be protected from interference by the software running above them.
The support for dynamic optimization found in the POWER8 is not like anything I have seen before. There is a “hotness table”, hardware-aided trace identification, and an “event-based branch” facility – all part of the processor’s software-visible PMU. What is interesting here is that the processor core hardware itself helps identify hot regions in the code being executed, rather than having the software keep track of that information itself.
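The article does not detail how the hotness table works, but a plausible reading is a small table of counters indexed by code region, bumped by the hardware on branches, from which software reads off the regions that crossed a threshold. The sketch below invents all the details (granularity, threshold, table structure) purely to illustrate the concept.

```python
# Invented sketch of a "hotness table": counters indexed by code region,
# updated on every branch, with regions past a threshold reported as hot.
# Granularity, threshold, and structure are all my own assumptions.
from collections import Counter

class HotnessTable:
    def __init__(self, threshold=100, region_bits=6):
        self.counts = Counter()
        self.threshold = threshold
        self.mask = ~((1 << region_bits) - 1)   # region granularity: 64 bytes

    def on_branch(self, target_addr):
        region = target_addr & self.mask        # done by hardware, for free
        self.counts[region] += 1

    def hot_regions(self):
        # Software reads this out to decide what to optimize.
        return [r for r, c in self.counts.items() if c >= self.threshold]

ht = HotnessTable(threshold=100)
for _ in range(500):
    ht.on_branch(0x1040)          # hot inner loop, branched to constantly
ht.on_branch(0x9000)              # cold code, executed once
print([hex(r) for r in ht.hot_regions()])   # -> ['0x1040']
```

The win over pure software profiling is that the per-branch bookkeeping costs nothing in instrumented code – the application runs at full speed while the table fills up.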
The hardware-assisted optimization support is used to enable yet another take on the old idea of using dynamic recompilation of binaries to improve performance. For the POWER8, this is called DCO, dynamic binary code optimization, and is found in AIX. I have not found any open information about this feature, so it is hard to know if it is even shipping today.
For virtual machines like the Java VM (JVM), it is harder to directly identify hot binary code using the processor features – since the starting point is interpretation. Instead, event-based branches (EBBs) and a facility called the BHRB (Branch History Rolling Buffer) support building performance instrumentation into the JVM with “low overhead”. I am not entirely sure how this works out, but IBM claims that by using the hardware, the overhead is low enough that it is worth the effort to profile and optimize even fairly flat programs:
The JIT compiler can use these two facilities together to obtain accurate execution profiles with very low overhead, and focus its optimization on the portions of the code that will yield the most benefit. This is particularly important for large applications with relatively flat execution profiles, where existing software profiling mechanisms have a relatively high overhead that limits their ability to identify hot spots that can be optimized aggressively.
Not quite sure how to interpret that, to be honest. It would be interesting to know a bit more about this dynamic optimization support, as it seems rather unique. But I have not found a single reference to “DCO” with the POWER8 outside of this article, so maybe the feature never shipped? Or got hidden inside an OS as a feature IBM does not talk about?
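My reading of the quoted mechanism, sketched below with entirely invented names and data: each EBB-style event delivers a small buffer of recent branch targets (the BHRB-like part), the runtime maps those addresses back to methods, and only methods that accumulate enough samples get promoted to aggressive recompilation.

```python
# Invented sketch of how a JIT might consume sampled branch records:
# each sample carries a buffer of recent branch targets, the runtime
# maps targets to methods, and methods over a threshold are picked for
# optimized compilation. Names, map, and data are all hypothetical.
from collections import Counter

def profile_and_pick(samples, addr_to_method, threshold=3):
    counts = Counter()
    for branch_buffer in samples:            # one buffer per sample event
        for target in branch_buffer:
            counts[addr_to_method(target)] += 1
    return {m for m, c in counts.items() if c >= threshold}

# Toy address-to-method map: method id = address / 0x100.
method_of = lambda addr: addr // 0x100
samples = [[0x100, 0x110, 0x200],
           [0x105, 0x100, 0x300],
           [0x120, 0x100, 0x200]]
print(profile_and_pick(samples, method_of))   # -> {1}
```

The key property for flat profiles is that each sample contributes many branch targets at once, so even a method that is only modestly hot accumulates statistically useful counts without instrumenting every call.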
In summary, it seems that IBM (as usual) has come up with a very impressive computer system that offers very good performance, and where they reap the benefits of being a true system builder that essentially starts with sand and ships completed computers.