I love bug and debug stories in general. Bugs are a fun and interesting part of software engineering, programming, and systems development. Stories that involve running Simics on Simics to find bugs are a particular category that is fascinating, as it shows how to apply serious software technology to solve problems related to said serious software technology. On the Intel Software and Services blog, I just posted a story about just that: debugging a Linux kernel bug provoked by Simics, by running Simics on a small network of machines inside of Simics. See https://blogs.intel.com/evangelists/2016/05/30/finding-kernel-1-2-3-bug-running-wind-river-simics-simics/ for the full story.
A new record, replay, and reverse debugger has appeared, and I just had to take a look at what it does and how it does it. “rr” has been developed by the Firefox developers at Mozilla Corporation, initially for the purpose of debugging Firefox itself. Approaching a debugger from the angle of attacking a particular program does let you get things going quickly, but the resulting tool is clearly generally useful, at least for Linux user-land programs on x86. Since I have tried to keep up with the developments in this field, a write-up seems to be called for.
I have read some recent IBM articles about the POWER8 processor and its hardware debug and trace facilities. They are very impressive, and quite interesting to compare to what is usually found in the embedded world. Instead of being designed to help with software debug, the hardware mechanisms in the POWER8 seem to be focused on silicon bringup, performance analysis, and verification in IBM’s own labs, as well as on supporting virtual machines and JIT-based systems.
At the ISCA 2014 conference (the biggest event in computer architecture research), a group of researchers from Microsoft Research presented a paper on their Catapult system. The full title of the paper is “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services”, and it is about using FPGAs to accelerate search engine queries at datacenter scale. It has 23 authors, which is probably the most I have ever seen on an interesting paper. There are many things to be learnt from and discussed about this paper, and here are my thoughts on it.
Simics can run and debug UEFI BIOSes, and that is the topic of my latest blog at Wind River. UEFI is actually pretty interesting once you get to know it, and building a good debug experience for UEFI took a bit of work. Still, it was built as just another target for the standard uniform Simics debugger, which is not the way most other UEFI and BIOS debuggers are built. I guess that in the past, debugging a BIOS required such specialized tools that it made sense to also build a custom specialized frontend for the purpose. With a simulator as the backend, things do become simpler and more uniform, and Eclipse CDT is actually a very good basis for a debugger for any kind of C and C++ code.
For more reading on UEFI itself, I can recommend the 2011 Intel Tech Journal on the topic.
I have a silly demo program that I have been using for a few years to demonstrate the Simics Analyzer ability to track software programs as they are executing and plot which threads run where and when. This demo involves using that plot window to virtually draw text, in a way akin to how I used to make my old ZX Spectrum “draw” things in the border. But when I brought it up in a new setting it failed to run properly and actually started hanging on me. Strange, but also quite funny when I realized that I had originally foreseen this very problem and consciously decided not to put in a fix for it… which now came back to bite me in a pretty spectacular way. But at least I did get an interesting bug to write about.
Debugging – the 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems by David Agans was published in 2002, based on several decades of practical experience in debugging embedded systems. Compared to the other debugging book I read this summer, Debugging is much more a book for the active professional currently working on embedded products. It is more of a guidebook for the practitioner than a textbook for students who need to learn the basics.
This blog post is a review of the book “If I Only Changed the Software, why is the Phone on Fire” (see more information on Amazon, for example), by Lisa Simone. The book was released in 2007, on the Elsevier Newnes imprint. It is a book about debugging embedded systems, written in a murder-mystery style with a back story about the dynamics of an embedded development team. It sounds strange, but it works well.
On my Wind River blog, you can now find a description of how we have used the Eclipse TCF (Target Communication Framework) to build the Simics GUI. Or rather, the connection between the Simics GUI and the Simics simulation process. It is actually quite revolutionary what you can do with the TCF, compared to older debug protocols. In particular, TCF lets you combine many different services across a single connection.
Last year, I did a Simics webinar which included a two-part demo of how to use Simics to debug an endianness bug in a networked system as it migrates from a big-endian to a little-endian system. Along the way, I also showed off various Simics features like reverse execution, checkpointing, and scripted execution.
The demo is now online at the Wind River YouTube channel, and the setup is explained in a blog post at the Wind River company blog which is worth reading before watching the video.
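For readers who have not run into this class of bug, here is a minimal sketch of how an endianness bug typically shows up when code moves between hosts. It is not the code from the webinar demo; the message layout (a single 32-bit sequence number) and all function names are assumptions made up for illustration.

```python
import struct


def encode_on_big_endian_host(seq: int) -> bytes:
    """Bytes as the original big-endian host would put them on the wire."""
    return struct.pack(">I", seq)


def decode_buggy_on_little_endian_host(payload: bytes) -> int:
    """Ported code that reads the field in the new host's little-endian order."""
    return struct.unpack("<I", payload)[0]


def decode_fixed(payload: bytes) -> int:
    """Portable decode: always read the field in network (big-endian) order."""
    return struct.unpack("!I", payload)[0]


if __name__ == "__main__":
    wire = encode_on_big_endian_host(1)
    print(hex(decode_buggy_on_little_endian_host(wire)))  # byte-swapped value
    print(decode_fixed(wire))
```

The buggy decoder works fine as long as both ends share a byte order, which is why this kind of bug tends to surface only during a migration.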
Logging as a debug method is not new, and I have been writing about it on and off over the past few years myself. At the S4D conference, tracing and logging keep coming up as a topic (see my reports from 2009, 2010 and 2012). I recently found an interesting piece on logging from the IT world in the ACM Queue (“Advances and Challenges in Log Analysis”, Adam Oliner, ACM Queue December 2011). Here, I want to address some of my current thoughts on the topic.
There is a new blog post on my Wind River blog, about the Landslide system from CMU. It is a pretty impressive Master’s Thesis project that used the control that Simics has over interrupts to systematically try different OS kernel thread interleavings to find concurrency bugs. The blog is an interview with Ben Blum, the student who did the work. Ben is now a PhD student, and I bet that he will continue to generate cool stuff in the future.
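To give a feel for what systematically trying thread interleavings means, here is a toy sketch. It is not Landslide's algorithm and does not involve Simics or interrupts at all; it just enumerates every schedule of two hypothetical threads that each perform a non-atomic increment (a load followed by a store) on a shared counter, and reports the schedules that lose an update. All names are made up for illustration.

```python
from itertools import combinations


def interleavings(steps_a: int, steps_b: int):
    """Yield every schedule of thread ids that keeps each thread's own step order."""
    total = steps_a + steps_b
    for positions in combinations(range(total), steps_a):
        slots = set(positions)
        yield tuple("A" if i in slots else "B" for i in range(total))


def run(schedule) -> int:
    """Execute one schedule; each thread does load (step 0) then store (step 1)."""
    shared = 0
    local = {"A": 0, "B": 0}
    step = {"A": 0, "B": 0}
    for tid in schedule:
        if step[tid] == 0:
            local[tid] = shared          # load the shared counter
        else:
            shared = local[tid] + 1      # store the incremented private copy
        step[tid] += 1
    return shared


def find_buggy_schedules():
    """Return all schedules where the final counter is not the expected 2."""
    return [s for s in interleavings(2, 2) if run(s) != 2]


if __name__ == "__main__":
    for schedule in find_buggy_schedules():
        print("lost update with schedule", "".join(schedule))
```

With real kernel code the number of possible interleavings explodes, so a practical tool has to be selective about where it considers preempting a thread; the exhaustive loop here only works because the example is tiny.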
Last week, I attended my fourth System, Software, SoC and Silicon Debug conference (S4D) in a row. I think the silicon part is getting less attention these days; most of the papers were on how to debug software, often with the help of hardware, and with an angle on how software runs in SoCs and systems. I presented a paper reviewing the technology and history of reverse debugging, which went down pretty well.
I am going to be talking about how to transport bugs with virtual platform checkpoints, in the Software Tools track at the Embedded Conference Scandinavia, on October 3, 2012, in Stockholm (Sweden). The ECS is a nice event, and there are several tracks to choose from both on October 2 and October 3. In addition to the tracks, Jan Bosch from Chalmers is going to present a keynote that I am sure will be very entertaining (see my notes from a presentation he did in Göteborg last year).
I am scheduled to talk at the SiCS multicore day 2012 (like I did back in 2009 and 2008). The event takes place on September 13, at SiCS in Kista. My topic will be System-Level Debug: how we can make debuggers that work for big systems.
This year, the multicore day is part of a bigger Software Week event, which also covers cloud and internet of things. See you there!
In this final part of my series on the history of reverse debugging I will look at the products that launched around the mid-2000s and that finally made reverse debugging available in commercially packaged products and not just research prototypes. Part one of this series provided a background on the technology and part two discussed various research papers on the topic going back to the early 1970s. The first commercial product featuring reverse debugging was launched in 2003, and since then there has been a steady trickle of new products up until today.
It used to be that Microsoft was the big, boring, evil company that nobody felt was very inspiring. Today, with competition from Google and Apple as well as a strong internal research department, Microsoft feels very different. There are really interesting and innovative ideas and papers coming out of Microsoft today. It seems that their investments in research and software engineering are generating very sophisticated software tools (and good software products).
I have recently seen a number of examples of what Microsoft does with the user feedback data they collect from their massive installed base. I am not talking about Google-style personal information collection, but rather anonymous collection of user interface and error data in a way that is designed more to build better products than to target ads.
Since I have a certain interest in debugging, I was happy to find the article “Guidelines for SystemC – Debugger Integration” at the usually interesting Design and Reuse website. However, I must say that it was pretty disappointing.
There is a new post at my Wind River blog, about some computing history. Wind River turns thirty this year, Simics twenty, and simulation for debug (and probably debug in general) turns sixty. Computing has come a long way.
This post features some additional notes on the topic of transporting bugs with checkpoints, which is the subject of a paper at the S4D 2010 conference.
The idea of transporting bugs with checkpoints is in some ways obvious. If you have a checkpoint of a state, of course you move it. Right? However, changing how you think about reporting bugs takes time. There are also some practical issues to be resolved. The S4D paper goes into some of the aspects of making checkpointing practical.
I have a paper about “Transporting Bugs with Checkpoints” to be presented at the S4D (System, Software, SoC and Silicon Debug) conference in Southampton, UK, on September 15 and 16, 2010. The core concept presented is to leverage Simics checkpointing to capture and move a bug from the bug reporter to the responsible developer. It is a fairly simple idea, but getting it to work efficiently does require that some things are done right. See the longer Wind River blog posting about this topic for a few more details.
Part of my daily work at Virtutech is building demos. One particularly interesting and frustrating aspect of demo-building is getting good raw material. I might have an idea like “let’s show how we unravel a randomly occurring hard-to-reproduce bug using Simics”. The hard part then turns out to be hunting for a program with a suitable bug in it… not building the Simics tooling to resolve the bug. For some reason, when I most need bugs, I have a hard time getting them into my code.
I guess it is Murphy’s law — if you really set out to want a bug to show up in your code, your code will stubbornly be perfect and refuse to break. If you set out to build a perfect piece of software, it will never work…
So I was actually quite happy a few weeks ago when I started to get random freezes in a test program I wrote to show multicore scaling. It was the perfect bug! It broke some demos that I wanted to have working, but fixing the code to make the other demos work was a very instructive lesson in multicore debug that would make for a nice demo in its own right. In the end, it managed to nicely illustrate some common wisdom about multicore software. It was not a trivial problem, fortunately.
An unplanned and unexpected bonus of my trip to the FDL 2009 conference was the co-located S4D conference. S4D means System, Software, SoC and Silicon Debug, and is a conference that has grown out of some recent workshops on the topic of debugging, as seen from the perspective of hardware designers (mostly). S4D was part of the same package as FDL and DASIP; entrance to one conference got you into the other two as well. As I did not know about S4D until quite late in the process, this was a great opportunity for me to look at what they were doing.
In my series (well, I have one previous post about checkpointing) about misunderstood simulation technology items, the turn has come to the most difficult of all it seems: determinism. Determinism is often misunderstood as meaning “unchanging” or “constant” behavior of the simulation. People tend to assume that a deterministic simulation will not reveal errors due to nondeterministic behavior or races in the modeled system, which is a complete misunderstanding. Determinism is a necessary feature of any simulation system that wants to be really helpful to its users, not an evil that hides errors.
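A tiny sketch can make the distinction concrete. The toy "simulator" below is an assumption-laden illustration, not how Simics or any real simulator works: it schedules two hypothetical threads that each do a non-atomic increment, and every scheduling decision comes from a seeded pseudo-random generator. The same seed always replays the same run, yet different seeds still expose the lost-update race.

```python
import random


def simulate(seed: int) -> int:
    """Run two racy increments under a schedule fully determined by the seed."""
    rng = random.Random(seed)            # the only source of scheduling decisions
    shared = 0
    local = {}
    pending = {"A": ["load", "store"], "B": ["load", "store"]}
    while pending:
        tid = rng.choice(sorted(pending))  # pick which thread runs its next step
        op = pending[tid].pop(0)
        if op == "load":
            local[tid] = shared          # read the shared counter
        else:
            shared = local[tid] + 1      # write back the incremented private copy
        if not pending[tid]:
            del pending[tid]
    return shared


if __name__ == "__main__":
    outcomes = {seed: simulate(seed) for seed in range(20)}
    racy = [seed for seed, value in outcomes.items() if value != 2]
    print("seeds that expose the race:", racy)
    # Re-running any of those seeds reproduces exactly the same failure.
    print(all(simulate(seed) == outcomes[seed] for seed in outcomes))
```

Determinism here means repeatability for a given starting state, not that only one behavior is ever explored; varying the seed (or, in a real system, the initial state or inputs) is what provides the variation.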
I just read the panel interview at the start of the latest issue (Number 4, 2008) of ACM Queue. Here, you have Bryan Cantrill of Sun (the man behind DTrace) bemoaning the difficulty of testing faults. In particular:
Part of the reason I’m interested in virtualization is as a development methodology. It has not delivered on this, but one of the things that I ask is can I use virtualization to automate someone pulling the Ethernet cable out of the jack? I can get a lot closer to simulating it if you let me create a toy virtual machine than I can running on the live machine.
Well, this already exists. It is a common feature of any virtual platform that is not a datacenter-oriented runtime engine like VMware, Xen, LPAR, and their ilk. Doing fault injection is a primary use case for virtual platforms, especially for larger servers and systems featuring redundancy and fault tolerance.
I have another short technical piece published about Multicore Debug at the EETimes (and their network of related publications, like Embedded.com). It is a pretty short piece, and they cut out some bits to make it fit their format. There is nothing new here for fans of virtual platforms for software development: basically, we can use virtual platforms to reintroduce control over parallel and, for all practical purposes, chaotic hardware/software systems.
In the book “Programming Embedded Systems: With C and GNU Development Tools”, authors Michael Barr and Anthony Massa make some statements on simulation that I just have to disagree with on principle. Read on to find out which. Note that overall this is a good book; I am not claiming otherwise. The Amazon reviews are pretty good, and having a foreword by Jack Ganssle is always a sign of quality. But I just have to correct them on one little fact…
The TimeSys Embedded Linux Podcast (also called LinuxLink Radio) is a nice listen about embedded computing using Linux. Sometimes they are a bit too open-source centric, though, and ignore very good tools that live in the classic commercial world. One such example is the recent episode 20 on debugging tools, where they totally ignore modern high-powered hardware-based debugging.