I have a silly demo program that I have been using for a few years to demonstrate the Simics Analyzer's ability to track software programs as they execute and plot which threads run where and when. This demo involves using that plot window to virtually draw text, akin to how I used to make my old ZX Spectrum "draw" things in the border. But when I brought it up in a new setting it failed to run properly and actually started hanging on me. Strange, but also quite funny when I realized that I had originally foreseen this very problem and consciously decided not to put in a fix for it… which now came back to bite me in a pretty spectacular way. But at least I got an interesting bug to write about.
Anyway, this is what the program is supposed to look like when it runs right:
Silly indeed. Each pixel blob in the plot is a piece of code that keeps a particular thread 100% busy computing on the processor, with no disk or other IO needed. The structure of the program is quite simple: there are 7 worker threads and 1 master thread that feeds the workers with work for each vertical line, waits for them to finish, and immediately kicks them off on the next line. If you zoom in and look closely, there is some noise in when the threads start computing, but not enough to matter when zoomed out.
I moved this software stack off of the old Linux 2.6.23 where it had been running before, onto a new machine with a Linux kernel 3.4-based software stack. The program still ran fine, but I noticed a slight gap in the resulting plot:
Interesting. Even more interesting, when I reran the program immediately on the same target, it completely hung and never returned to the prompt. Putting in some breakpoints and engaging the debugger, I managed to step through the first few computational blocks and realized that the hang appeared on the first vertical line of the "S": the first line where more than one pixel was active. Here is a zoomed-in plot of what was happening:
Note how the top three pixels line up nicely. The fourth pixel, however, is offset to the right. Looking very closely at the issue, it was clear that it started only after the previous three had finished. Not by much, but enough to make a difference.
Stepping through the synchronization code (a nice thing you can do with a globally stopping debugger based on a simulator), it seemed that my program was failing in the case where the master thread sent some threads out to work after other threads had already terminated. This scenario was not supposed to happen, as it meant the "text drawing" would not work. It was a classic race condition: when this particular program timing occurred, the program hung, because an assumption built into the code logic was violated. There was no attempt to enforce that assumption.
Why was this happening, though? Zooming in even more in the timeline view revealed the answer:
The color-coding here shows that the violet thread, which is the master thread, is running on the same olive-green processor as one of the worker threads. When that worker thread is dispatched, the master thread immediately gives up the processor and gets scheduled out by the Linux OS. By the time Linux gets it back into the running state, it is too late to save the situation: the master sends off the fourth and last worker thread and then sits down and waits for four threads to call home, while only one of them ever will, and the program hangs. The synchronization code was written in such a way that it expected all the dispatched threads to still be running when it started waiting.
So how to fix this? Either I could try giving the master thread a higher priority to run (which would have been bullet-proof on a proper RTOS), or I could simply use the same kind of processor affinity already used to tie the seven worker threads to their designated processors. When I went to the starting point of the master thread in the source code, I found this:
```c
void *andraw_master_thread(void *args) {
  //printf("Andraw master thread started\n");
  // Not clear that we need to tie the master to a certain
  // core. Not doing it might give some interesting effects.
  //
  // tie_thread_to_core(WORKER_COUNT);
```
It seems that a certain programmer (me) realized that this could happen back when the code was written. But then I decided that it was more interesting not to code in the safest way and to see what would happen. And three years later, interesting things did indeed happen.
There are two lessons that can be drawn from this story, I think:
- For proper program functionality, do code to make sure your assumptions are met, with the most safety you can imagine. That would have avoided this bug biting me "in production", so to speak.
- For interesting stories to tell, leave a few well-chosen glitches in your code so you can come back later and show off how good you are at finding and fixing bugs.
Better late than never. I look forward to the next catch.