Last week, I attended my fourth System, Software, SoC and Silicon Debug conference (S4D) in a row. I think the silicon part is getting less attention these days; most of the papers were about how to debug software, often with the help of hardware, and with an angle on how software runs in SoCs and systems. I presented a paper reviewing the technology and history of reverse debugging, which went down pretty well.
This year, S4D took place in Vienna, colocated with FDL at TU Wien. The main building of the TU is really nice (but we were in the more modern microelectronics building). Here is a night shot of the newly renovated main building:
S4D is a small workshop, and the papers presented only give part of the picture of which hot topics were under discussion. This year, the main themes I picked up were:
- The bandwidth limitation for hardware-based debug.
- Log-based debugging, automatically looking for errors in logs.
The bandwidth limitation is very important. If you want to use hardware debug circuitry inside an SoC, you also need to be able to talk to it. The complexity of the chip and the capability of on-chip debug grow with Moore’s law – but off-chip bandwidth does not. The number of pins is also growing slowly, if at all, so there is strong pressure to use the pins and bandwidth for “real work” rather than for debug. The result is a few trends in debug technology:
- Doing more debug processing on the chip, without having to make a round trip through an off-chip interface box or debugger host (several research papers described approaches where checkers and inspection code run on the chip, in a coprocessor or even on one of the regular cores).
- Aggressive compression of the data sent off-chip (the ARMv8 debug architecture presented by Michael Williams of ARM only traces mispredicted branches off-chip, expecting the debugger to reconstruct the flow from a minimal amount of information; see the sketch after this list).
- Software debug agents, and the software interface of on-chip debug hardware, are becoming more important. In particular for devices such as smartphones, there is no dedicated hardware debug port, and debug might be done over USB, Bluetooth, or Wi-Fi. Exposing hardware breakpoints and similar functions to software is what lets users actually take advantage of the debug power of a modern SoC. Hopefully, all other silicon vendors will follow ARM’s lead and expose really powerful hardware features to software agents, so we can get away from silly things like rewriting code to plant breakpoints (and get full data read and write breakpoints in software agents).
- Simulator-based debug offers a way to get around the issue by having virtually infinite bandwidth (potentially at the cost of slowing down the target, obviously).
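To make the trace-compression bullet above a bit more concrete, here is a minimal sketch of how a debugger could rebuild an execution path when the chip only reports the conditional branches that went against their predicted direction. This is my own toy construction, not the actual ARM trace format: the control-flow graph, the static prediction per branch, and the count-based trace encoding are all assumptions made for illustration.

```c
/* Toy reconstruction of an execution path from a trace that only
 * contains mispredicted conditional branches, identified by their
 * dynamic branch count. The debugger is assumed to know the control
 * flow graph and the static prediction of every branch from the
 * program binary. Purely illustrative -- not any real trace format. */
#include <stdio.h>

#define END -1   /* marker for program exit */

typedef struct {
    int is_conditional;   /* does this block end in a conditional branch?   */
    int predicted;        /* successor if the branch follows its prediction */
    int other;            /* successor if the branch is mispredicted        */
} block_t;

/* Tiny example CFG: block 0 tests a loop condition (predicted taken,
 * i.e. stay in the loop), block 1 is the loop body with an
 * unconditional back edge, block 2 is the loop exit. */
static const block_t cfg[] = {
    { 1, 1,   2   },   /* block 0: conditional, predicted -> 1, else -> 2 */
    { 0, 0,   0   },   /* block 1: unconditional jump back to block 0     */
    { 0, END, END }    /* block 2: exit                                   */
};

int main(void)
{
    /* The only thing the chip sent off-chip: the loop-test branch
     * mispredicted on its 4th dynamic execution (the loop ran 3 times). */
    const int mispredicted_at[] = { 4 };
    const int num_events = sizeof mispredicted_at / sizeof mispredicted_at[0];

    int block = 0;          /* reconstructed execution starts at block 0 */
    int branch_count = 0;   /* dynamic count of conditional branches     */
    int next_event = 0;     /* index into the misprediction trace        */

    while (block != END) {
        printf("executed block %d\n", block);
        if (cfg[block].is_conditional) {
            branch_count++;
            if (next_event < num_events &&
                mispredicted_at[next_event] == branch_count) {
                block = cfg[block].other;     /* trace says: mispredicted */
                next_event++;
            } else {
                block = cfg[block].predicted; /* silence = prediction held */
            }
        } else {
            block = cfg[block].predicted;     /* unconditional successor   */
        }
    }
    return 0;
}
```

The point is that the chip never has to send the taken/not-taken outcome of every branch; silence means "as predicted", and the host does the bookkeeping.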
Log-based debug is a favorite topic of mine, and it has been on the agenda for S4D since it started (see reports from 2010 and 2009).
- This year, the most interesting idea was a hardware unit (generated into an FPGA) that watched a target as it executed, looking for traces that satisfied properties expressed in past-time Linear Temporal Logic (ptLTL). ptLTL seems well-suited to watching traces of events fly by: it allows looking backwards just a bit, which makes it much more powerful than just looking at the current state, yet it can still be implemented very efficiently (see the sketch after this list).
- Users are clearly using logs to diagnose issues in running systems, and a key problem there is finding issues in huge logs. This is nothing new.
- There was a discussion over how to handle explicit log and instrumentation calls in software. Should they remain in the target software as it ships, or be removed? How does that affect certification and validation?
- If we introduce hardware-supported log instructions in the ISA, couldn’t they also be used as a timing-fault-injection mechanism, basically a settable pipeline stall? Such single-cycle-overhead instructions should be possible to keep in the shipping software, since they do not lower performance much. And if single-cycle disturbances kill your real-time system, it is too close to the edge anyway.
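To illustrate why ptLTL is cheap to check as events stream past, here is a minimal software sketch of a monitor. It is my own construction, not the FPGA unit from the paper, and the property, signal names, and trace are made up for illustration: "every grant must be preceded by a request, with no reset in between", i.e. grant implies (!reset) Since request. The key observation is that each past-time subformula needs only a single bit of state, updated once per observed event.

```c
/* Minimal software sketch of a ptLTL monitor (illustrative only, not
 * the FPGA unit from the paper). Property checked at every event:
 *
 *     grant  ->  (!reset) Since request
 *
 * The "Since" operator needs just one bit of state, updated with the
 * standard recurrence  S_now = request || (!reset && S_prev),
 * which is why ptLTL maps so nicely onto a small piece of hardware. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    bool request;
    bool grant;
    bool reset;
} event_t;

int main(void)
{
    /* A made-up event trace: request, grant (ok), reset, grant (violation). */
    const event_t trace[] = {
        { .request = true },
        { .grant   = true },
        { .reset   = true },
        { .grant   = true },
    };
    const int n = sizeof trace / sizeof trace[0];

    bool since = false;   /* state bit for "(!reset) Since request" */

    for (int i = 0; i < n; i++) {
        const event_t e = trace[i];

        /* Update the past-time subformula for this cycle. */
        since = e.request || (!e.reset && since);

        /* Check the top-level property at this cycle. */
        if (e.grant && !since)
            printf("cycle %d: property violated (grant without request)\n", i);
        else
            printf("cycle %d: ok\n", i);
    }
    return 0;
}
```

A hardware version is essentially the same recurrence in a flip-flop per subformula, which is why an FPGA can keep up with the event stream at full speed.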
One idea that I threw out but that met very little agreement from the S4D and FDL participants was the notion that we should build systems that accept, tolerate, and recover from errors, rather than hoping to make them bug-free and with timing under perfect control. I instinctively find the idea a bit repugnant, being schooled in the precise tradition of computer science where we expect programmers to fix bugs, not just work around them. But in practice, I realize that this might be the right thing to do as our systems get so complex that we cannot hope to precisely understand them or diagnose issues in the lab. A typical example of this approach was published in ACM Queue last year: basically, a malloc system that minimizes the effects of buffer overruns, double frees, and similar common causes of crashes (a toy sketch of the idea follows below). A variant of this is actually shipping in Windows 7 already. People building safety-critical systems do not want to have to do this, but at some point we probably need to go statistical rather than showing that our software is correct. There is some interesting work on making software more continuous than discrete in behavior, paving the way for statistical analysis of errors. But that was not a topic of S4D, at least not this year.
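Here is a toy illustration of the "tolerate rather than fix" allocator style. This is my own sketch of the general idea, not the actual allocator from the ACM Queue article or the Windows 7 fault-tolerant heap: it pads every allocation so that small overruns do not clobber a neighboring object, and it detects and ignores double frees instead of letting them corrupt the heap.

```c
/* Toy "tolerating" allocator -- a sketch of the general idea only.
 * Pads allocations to absorb small buffer overruns, and detects and
 * ignores double frees rather than corrupting the heap. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLACK       64            /* extra bytes to absorb small overruns */
#define MAGIC_LIVE  0xA11C0DEDu
#define MAGIC_FREED 0xDEADBEEFu

typedef struct {
    unsigned magic;               /* MAGIC_LIVE or MAGIC_FREED            */
} header_t;

void *tolerant_malloc(size_t size)
{
    header_t *h = malloc(sizeof(header_t) + size + SLACK);
    if (!h)
        return NULL;
    h->magic = MAGIC_LIVE;
    return h + 1;                 /* hand out the memory after the header */
}

void tolerant_free(void *p)
{
    if (!p)
        return;
    header_t *h = (header_t *)p - 1;
    if (h->magic == MAGIC_FREED) {
        /* Double free: report and ignore instead of crashing later. */
        fprintf(stderr, "tolerated double free of %p\n", p);
        return;
    }
    h->magic = MAGIC_FREED;
    /* Crude quarantine: do not hand the block back to the underlying
     * allocator right away, so stale pointers still point at memory we
     * own. A real allocator would recycle quarantined blocks later. */
}

int main(void)
{
    char *buf = tolerant_malloc(16);
    strcpy(buf, "this string is longer than 16 bytes"); /* small overrun,
                                                           absorbed by SLACK */
    tolerant_free(buf);
    tolerant_free(buf);           /* double free, detected and ignored */
    return 0;
}
```

The program does not crash and does not silently corrupt other data; it degrades gracefully, which is exactly the philosophy that made the audience uneasy.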
I presented a paper on the history and techniques of reverse debugging, and received some good feedback from the audience. Someone pointed out that with a weak memory model, record-replay on hardware is not guaranteed to reproduce all bugs, since concurrency bugs related to the memory model are outside the controlled area (see the sketch below). On x86, where most work has been done, the TSO-like memory model makes this point fairly unimportant, but on ARM and Power Architecture it is indeed relevant. Another member of the audience found it funny that he had believed he invented record-based debugging back in 2001, while my overview of the history showed that there was ample work before that. It just goes to show how hard it is to know the history of computer science; many ideas are never widely circulated. Finally, there is a lead indicating that some kind of reverse-breakpoint trace-based debugger was on the market around 1999. I hope to learn more and do a full blog post on it once more data emerges.
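To make the memory-model point concrete, here is the classic message-passing litmus test, written by me as an illustration (it is not from the paper or the discussion). With relaxed atomics, the consumer can see flag == 1 but data == 0 on ARM or Power, while x86 TSO rules that outcome out; a record-replay scheme that only captures inputs and scheduling decisions has no handle on which of the legal memory orderings the hardware happened to pick.

```c
/* Message-passing litmus test: the kind of bug a hardware record-replay
 * scheme may fail to reproduce, because the outcome depends on memory
 * ordering rather than on recorded inputs or scheduling.
 * With relaxed atomics, "data = 0" is a legal outcome on ARM/Power but
 * will not be observed on x86 (TSO keeps the stores and loads in order). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int data = 0;
static atomic_int flag = 0;

static void *producer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* no release! */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    /* Wait until the flag is set... */
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;
    /* ...then read the data. On a weakly ordered machine this can still
     * observe the old value 0, since nothing orders the two stores on
     * the producer side or the two loads on this side. */
    int r = atomic_load_explicit(&data, memory_order_relaxed);
    printf("consumer read data = %d\n", r);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Replaying the same thread schedule on the same binary does not pin down this behavior, which is exactly why the audience member's point matters on weakly ordered architectures.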
Once again a good workshop, and my only wish for next year is that many more people show up so we can get a broader discussion!
My take on software product testing before release:
1. Extensive tests that cover lots of scenarios with lots of data.
2. Stress test the target machine inside its functional envelope.
3. Long-running nonstop tests: stress the system for an acceptable service period (e.g., a month).
If the software and hardware perform well, there is no reason for the system to crash at the customer site. If it does, make sure you have a spare system around to ensure continuous service.
After years of nonstop service, systems DO break down, and you should have a replacement strategy handy.
============================================================================
After years of usage, my 3D graphics card started to randomly crash my PC. The optimal solution would be to replace the whole PC. In the meantime, I under-clocked the GPU and the PC stopped crashing.