Logging as as debug method is not new, and I have been writing about it to and from over the past few years myself. At the S4D conference, tracing and logging keeps coming up as a topic (see my reports from 2009, 2010 and 2012 ). I recently found an interesting piece on logging from the IT world in the ACM Queue (“Advances and Challenges in Log Analysis“, Adam Oliner, ACM Queue December 2011). Here, I want to address some of my current thoughts on the topic.
For some people, log printouts from a program is dirty and primitive. Very often when discussing debug methodologies, everyone starts with “printf debug” and implicitly considers everything else superior. In many cases this is true. When solving comparatively simple linear problems, proper use of a typical debugger ought to solve a problem in much less time and with much less intrusion into the target system. However, once you get to system composed of many quite independent parts, with unpredictable sequencing of actions and concurrent activity, logging is actually a pretty practical and powerful way to diagnose a system. In such a case, the key is to be systematic about how logging is done and to make sure it is much more than just a sprinkling of print statements in random places in the program. What good logging brings you is the executable embodiment of the knowledge about the most interesting and important events in the flow of a program, put there by the programmer writing the code. It is high-level semantic information, much easier to digest than a line by line trace or stepping of the code of the program.I find it hard to see a better way to put that in than by annotating the source code with log statements – but they had better be statements that are clearly about logging, and that direct the logs to some kind of processing system that allows selection and sorting of the output.
Of course, logging can also be done badly, in which case nothing important is ever logged, or millions of uninteresting things are logged, hiding the few relevant events. I have seen this many times. But when done right, logs help you zero in on a problem at a very high pace, without having to really sit down and inspect the code. Often, good logs let you pinpoint a problem and report it to the programmer without needing more than a trace of the log messages that show that something bad is going on in the code modules executing.
One problem with logging that almost always comes is that you have many different log sources and many different log systems (the ACM Queue article has several examples of this), and thus getting a consistent correctly ordered view of events can be very difficult or impossible. Obviously, this is something that can be solved for a system over which you have some amount of control. It would seem to me that logging all systems through a single conduit makes the most sense – but also requires some changes to existing mechanism. In many cases, it would seem possible to just change existing log collectors to push the final results through to some unified infrastructure.
The typical real-world case is shown below, contrasted with an ideal case of a single log conduit that creates a single consistent view of what is going on.
Back at the S4D in 2010, Pat Brouillette presented a paper on Intel’s internal SVEN system that still presents the best logging design I have seen (but I have not seen every system on the planet so there are probably more that are as sophisticated). What I found particularly attractive and useful in their approach was that they logged in user-level software, drivers, operating systems, and the hardware itself. Thus, you could follow a chain of events up and down through the stack. As illustrated above, most existing log systems only apply at one of these levels, making it hard to really see what is happening. This system pretty much achieves the unified log illustrated above.
The result of a unified log is that you can follow the execution through the software stack, even across changes from user-level to kernel-level code and the actions of drivers on the hardware. This is incredibly useful for systems where software and hardware need to interact closely. This is something really hard to do using other debug methods, as most debuggers lose the trail when changing execution modes. Even when they can indeed step along, the volume of code that needs to be followed is so big that it is not very likely that debug information and source code is available for all of it.
A key to making the unified log approach work is to have hardware support for logging. This gets around the common complaint about logging that the overhead affects the system and swamps IO. In particular for resource-constrained system, logging can quickly start to impact performance and in particular IO performance (it seems to be less of an issue in the server space). But there are ways around that, if we allow the hardware to help us. Thus, we want to get something like this into our embedded system:
The simplest approach is to have some special memory-mapped locations where log information is dumped, and then passed on via a debug interface to the outside world. This creates a side channel which avoids mixing in log messages into the useful data streams being sent over serial lines, networks, or other application communications channels. ARM has support for this in their trace facilities for hardware debug today, and I have seen it used to both implement efficient logging as well a general printf (using the output channel instead of serial ports to get data back to the debugger). We also did something similar with Simics and the LDRA code-coverage tools.
In a panel at the S4D in 2012 I proposed using a special instruction in a processor’s ISA to do logging (of a single integer or a few integer values stored in registers). I think this could work very well, but it would reuire some kind of discipline to allocate log ID numbers to various parts of the stack. This kind of approach would work best when the goal is to know where you have been executing with a small amount of context data. If there is a need to dump large amounts of data (like typical IT logs), it probably does not work that well. The advantage is that it can be used within any code, as there is no need for a call into a system library (print statements are typically verboten inside interrupt code, for example). Today, we can do this kind of logging inside virtual platforms by using “magic instructions” (no-ops that the simulator recognizes as special and which can be handed off to instrumentation modules or scripts), but it would be a very handy thing to have in hardware as well.
Analyzing the results
Given some kind of log mechanism, the final step is to analyze the logs to figure out what is going on and whether something is going on that should not. This is part that I am currently looking at ways to address – it is much easier getting things out of a system than to really understand them and draw the right conclusions. There is a huge body of research that should apply here, but just not sure about how the pieces should fit together for my target of “general embedded systems”.