<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; demo</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/demo/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>My Bug Doesn&#8217;t Work!</title>
		<link>http://jakob.engbloms.se/archives/1489?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1489#comments</comments>
		<pubDate>Wed, 14 Sep 2011 03:27:07 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[compilers]]></category>
		<category><![CDATA[demo]]></category>
		<category><![CDATA[VxWorks]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1489</guid>
		<description><![CDATA[Every once in a while I need to build demo setups to show debugging in action. As I have blogged before, finding a good bug when you need one isn&#8217;t always easy.  The solution is to try to invent artificial bugs, and I was very happy when I managed to stage a buffer overrun in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/butterfly.png"><img class="alignleft size-full wp-image-982" title="butterfly" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/butterfly.png" alt="" width="90" height="91" /></a>Every once in a while I need to build demo setups to show debugging in action. As I have blogged before, <a href="http://jakob.engbloms.se/archives/975">finding a good bug when you need one isn&#8217;t always easy</a>.  The solution is to try to invent artificial bugs, and I was very happy when I managed to stage a buffer overrun in a VxWorks program.</p>
<p>It is a very nice demo in which you first start a periodic program A, which prints the value of an incrementing counter every target second.  You then run a supposedly unrelated program B, which causes the values that program A prints to become corrupted.  Perfect to show off reverse execution and data breakpoints in reverse as you go from the point where the corrupted value is printed to the piece of code that overwrote the variable.</p>
<p>But then I ported the demo to a new platform&#8230; and the bug didn&#8217;t work anymore. My bug had caught a bug and was now not working, or at least not the way I expected it to. What had happened?<br />
<span id="more-1489"></span><br />
Very simple. I changed the compiler I used. Since my bug relied on something that C leaves unspecified, namely the order in which global variables are placed in memory, the change was totally valid and really to be expected.  Still, it was interesting to see how things played out&#8230; in the end, we got a different bug from the same code thanks to the change.</p>
<p>The code is essentially the following, with some simplifications that make it easier to read for those not familiar with VxWorks, and ignoring all the code to start tasks initially.</p>
<pre>// Global variables
int     iDataArray[100];
int     myWdISRcount;
WDOG_ID myWatchDogId;

// Periodic task - program A
void myWdISR(void)
{
  /* Increment ISR invocation count */
  myWdISRcount = myWdISRcount+1;
  printf("wd Fired %d times\n",myWdISRcount);

  /* Start off next invocation */
  wdStart (
    myWatchDogId,
    WD_INTERVAL,
    (FUNCPTR) myWdISR,
    (int) NULL
  );   
}

// Overwrite code - program B
int myCompletelySafeRoutine(void)
{
  uint32_t *a;
  uint32_t  i;
  // View the int array as 32-bit words
  a = (uint32_t *) iDataArray;
  // This loop writes one word beyond the
  // limit of the iDataArray
  for (i=0; i&lt;=100; i++) {
    a[i] = 0x7fffffff;
  }
  return OK;
}</pre>
<p>In the original setup, compiled for a Power Architecture target, iDataArray ended up right before myWdISRcount in memory.  Thus, the buffer overflow changed the value of the counter from something like 10 to 0x7fffffff.  Very noticeable in the printouts from the periodic task.</p>
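<p>One way to see which layout a given compiler produced, before running the demo, is to compare the addresses of the globals. The following is a minimal sketch for a hosted C environment, not the VxWorks build used in the demo: the helper names word_after_array and print_layout are hypothetical, and WDOG_ID is replaced by a plain pointer as a stand-in.</p>

```c
#include <stdint.h>
#include <stdio.h>

/* Same globals as in the demo. WDOG_ID is replaced by a plain pointer
   here as a stand-in, since this sketch is not built against VxWorks. */
int   iDataArray[100];
int   myWdISRcount;
void *myWatchDogId;

/* Hypothetical helper: report which global (if any) starts exactly
   where iDataArray ends, i.e. what a one-word overrun would hit. */
const char *word_after_array(void)
{
    uintptr_t end = (uintptr_t) &iDataArray[100]; /* one past the end */

    if (end == (uintptr_t) &myWdISRcount)
        return "myWdISRcount";
    if (end == (uintptr_t) &myWatchDogId)
        return "myWatchDogId";
    return "something else";
}

/* Hypothetical helper: print the addresses so the layout can be
   compared between compilers and targets. */
void print_layout(void)
{
    printf("iDataArray   at %p\n", (void *) iDataArray);
    printf("myWdISRcount at %p\n", (void *) &myWdISRcount);
    printf("myWatchDogId at %p\n", (void *) &myWatchDogId);
    printf("word after iDataArray: %s\n", word_after_array());
}
```

<p>Calling print_layout once at startup shows which variable, if any, the overrun would hit for that particular build. The answer can change with the compiler, the target, and even small edits to the source.</p>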
<p>When I changed to an x86 target (using a compiler from the same family, but obviously with a different code generator since the target was different), the variable order in memory changed and it seems that we got iDataArray placed last.  Suddenly, the effect of running the safe code was that nothing happened at all.  A bit annoying for a demo.  Some small source-code changes and a recompile later, the effect was instead to crash the target with a triple fault (page fault inside a page fault handler). Seems the program now managed to corrupt some kernel state. While impressive as a bug, it was not quite what I was looking for.</p>
<p>I then switched to a different compiler (call it compiler 2), and the data layout changed once more.  This produced a very useful bug, but it took me a while to actually understand this.  Now, when program B was run, program A stopped.  This looked like a bug in my program, and I actually started trying to fix this &#8211; until I realized that this was the bug I was looking for.  Having program B kill program A is just as good a bug as corrupting a counter value, after all.  In this case, the array overrun hits the myWatchDogId variable, and when that gets corrupted, the wdStart call ignores the request since it does not recognize the ID it gets.</p>
<p>So, in the end, I got a bug that was just as good as the first, and arguably a bit more intriguing. It is still obviously a contrived example &#8211; but I think that a good demo or lab exercise can be artificial as long as it gets the point across.  Judging from how people who have done the lab react, the goal seems to have been achieved.</p>
<p>The moral of the story is really that compilers are free to change things which are explicitly implementation-defined or not specified at all in the C standard. That is a good thing as it gives the compiler freedom to optimize the code. If you want to control how variables are laid out in memory, I guess you have to resort to linker scripts or similar &#8211; but that was too much pain for me in this case. Just changing things around until I got a good bug, and then freezing the binary (and not recompiling it ever again) is a sufficient strategy.</p>
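<p>A lighter alternative to a linker script, for a demo like this one, might be to put the array and its intended victim into a single struct. C lays out struct members in declaration order, so the counter always comes after the array. Padding between members is still up to the implementation, which is why the sketch below checks it with offsetof. The names demo_globals, demoGlobals and counter_follows_array are hypothetical and not part of the original demo.</p>

```c
#include <stddef.h>

/* Hypothetical wrapper struct: members of a struct are laid out in
   declaration order, so myWdISRcount always comes after iDataArray,
   whatever the compiler decides to do with separate globals. */
struct demo_globals {
    int iDataArray[100];
    int myWdISRcount;
};

struct demo_globals demoGlobals;

/* Returns 1 if the counter starts exactly where the array ends.
   Padding between members is implementation-defined, so this is
   worth checking once per compiler and target. */
int counter_follows_array(void)
{
    return offsetof(struct demo_globals, myWdISRcount)
           == sizeof(demoGlobals.iDataArray);
}
```

<p>The overrunning write itself remains undefined behavior as far as the C standard is concerned. The struct only pins down where the stray word lands in practice, so freezing a known-good binary is still a sensible precaution.</p>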
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1489/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Finally, a Bug!</title>
		<link>http://jakob.engbloms.se/archives/975?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/975#comments</comments>
		<pubDate>Sun, 25 Oct 2009 20:41:20 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Checkpointing]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[demo]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=975</guid>
		<description><![CDATA[Part of my daily work at Virtutech is building demos. One particularly interesting and frustrating aspect of demo-building is getting good raw material. I might have an idea like &#8220;let&#8217;s show how we unravel a randomly occurring hard-to-reproduce bug using Simics&#8221;. This then turns into a hard hunt for a program with a suitable bug [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/butterfly.png"><img class="alignleft size-full wp-image-982" title="butterfly" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/butterfly.png" alt="butterfly" width="90" height="91" /></a>Part of my daily work at Virtutech is building demos. One particularly interesting and frustrating aspect of demo-building is getting good raw material. I might have an idea like &#8220;let&#8217;s show how we unravel a randomly occurring hard-to-reproduce bug using <a href="http://www.virtutech.com/products/simics_hindsight.html">Simics</a>&#8221;. The hard part then turns out to be the hunt for a program with a suitable bug in it&#8230; not the Simics tooling to resolve the bug. For some reason, when I most need bugs, I have a hard time getting them into my code.</p>
<p>I guess it is Murphy&#8217;s law &#8212; if you really set out to want a bug to show up in your code,  your code will stubbornly be perfect and refuse to break. If you set out to build a perfect piece of software, it will never work&#8230;</p>
<p>So I was actually quite happy a few weeks ago when I started to get random freezes in a test program I wrote to show multicore scaling. It was the perfect bug! It broke some demos that I wanted to have working, but fixing the code to make the other demos work was a very instructive lesson in multicore debug that would make for a nice demo in its own right. In the end, it managed to nicely illustrate some common wisdom about multicore software. It was not a trivial problem, fortunately.</p>
<p><span id="more-975"></span>First, some notes about the program. It is a producer-consumer system using pthreads, with a single producer thread feeding a variable number of compute threads with data, over a shared queue structure (a simple one that uses a single lock to protect it, making it not very scalable for small data messages and lots of workers).</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure-2.png"><img class="aligncenter size-full wp-image-980" title="program structure 2" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure-2.png" alt="program structure 2" width="411" height="237" /></a></p>
<p>The queue contains a circular buffer, managed using a standard set of full/empty/tail/head kinds of variables. There is also a flag &#8220;done&#8221; which is set once we are out of data, to tell the compute threads to shut down and terminate the program. As this program is used to demonstrate and test scaling, it is actually something that terminates. The main program spawns off all the threads, and then waits for all threads to finish before it terminates itself.</p>
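<p>For readers who want something concrete, the bookkeeping described above can be sketched roughly as follows. This is not the code from the actual program: the type and function names, the fixed capacity, and the int payload are all assumptions made for illustration, and locking is left out (in the real program every put and get happens with the single queue lock held).</p>

```c
#define QUEUE_SLOTS 8   /* hypothetical capacity */

typedef struct {
    int buf[QUEUE_SLOTS];
    int head;    /* index of the next element to take out */
    int tail;    /* index of the next free slot */
    int full;    /* 1 when no free slot remains */
    int empty;   /* 1 when no element is stored */
    int done;    /* 1 once the producer has run out of data */
} sketch_queue_t;

void sketch_queue_init(sketch_queue_t *q)
{
    q->head  = 0;
    q->tail  = 0;
    q->full  = 0;
    q->empty = 1;
    q->done  = 0;
}

/* Returns 0 on success, -1 if the queue is full. */
int sketch_queue_put(sketch_queue_t *q, int value)
{
    if (q->full)
        return -1;
    q->buf[q->tail] = value;
    q->tail  = (q->tail + 1) % QUEUE_SLOTS;
    q->empty = 0;
    q->full  = (q->tail == q->head);  /* tail caught up with head */
    return 0;
}

/* Returns 0 on success, -1 if the queue is empty. */
int sketch_queue_get(sketch_queue_t *q, int *value)
{
    if (q->empty)
        return -1;
    *value   = q->buf[q->head];
    q->head  = (q->head + 1) % QUEUE_SLOTS;
    q->full  = 0;
    q->empty = (q->head == q->tail);  /* head caught up with tail */
    return 0;
}
```

<p>The separate full and empty flags are what distinguish the two situations in which head and tail are equal.</p>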
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure.png"><img class="aligncenter size-full wp-image-981" title="program structure" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure.png" alt="program structure" width="300" height="458" /></a></p>
<p>This program and the queue subsystem had worked perfectly for a long time for me, running on an MPC8641 machine with a Linux 2.6.23 kernel, with 1 to 8 cores and 1 to 16 threads. Regardless of settings like thread counts, data sizes, number of packets to compute, it always ran smoothly and terminated.</p>
<p>However, the other week, I moved the program, the exact same binary even, over to a new software stack built on a Linux 2.6.27 kernel. Still on the same MPC8641 machine. Suddenly, I started to see occasional freezes where the program would never terminate. I added some more diagnostic printouts to the program, and saw that the main program would simply freeze waiting for the other threads to terminate and report in. The freezes had no real relationship to input variables. Maybe they were a bit more common with short packets, but no real pattern emerged. They also happened randomly: running the program with the same parameters a few times in a row would sometimes result in a freeze. Quitting it with control-C and restarting would give a new instance of the program that ran well. Doing some other demo work, I found the same effect on a P4080 machine with 8 cores and a 2.6.30 Linux kernel.</p>
<p>This is a common pattern for parallelism bugs: they only manifest themselves as actual visible crashes or freezes or bad computation results once something in the software stack has changed, even though the fundamental issues have been there all the time. In this case, I think it was the Linux scheduler, but it is really hard to tell. Just because a program runs fine today does not mean it will run fine tomorrow.</p>
<p>After deciding to finally sit down and turn this lemon into lemonade, I had to reproduce the error. Thankfully, that is easy when you have a simulator. The first few times I had to run the target program 20 times or so before hitting the issue, but with some parameter and timing variations I managed to create a script that would open a <a href="http://jakob.engbloms.se/archives/714">checkpoint</a>, and run the program a few times under script control, triggering the bug on the fourth run (every time, thanks to determinism).</p>
<p>To diagnose the problem I wrote some Simics script code that I actually felt was fairly cool. I guessed that the problem had something to do with the queue and its handling of &#8220;done&#8221;, since that is what told the threads to terminate.</p>
<p>The first problem was that the queue was not a global variable. Instead, it was dynamically allocated on the heap by a function, and a pointer to it was passed around but never stored in a global variable (a good computer science graduate never uses a global variable other than as the means of last resort). In the end, my script set a breakpoint on the line in the setup function that came after the allocation. With the program stopped at that point, I could read the local variable pointing to the queue, and find and store the addresses of all the interesting members of the structure.</p>
<p>The code looked like this (Simics CLI), for the record:</p>
<pre> $mbp = ($ctx.break ($st.pos (rule30_threaded.c:222)))
 $cpu = (wait-for-breakpoint $mbp)
 $pq_addr  = ($cpu.sym "pq")
 $pq_tail  = ($cpu.sym "&amp;(pq-&gt;tail)")
 $pq_empty = ($cpu.sym "&amp;(pq-&gt;empty)")
 $pq_full  = ($cpu.sym "&amp;(pq-&gt;full)")
 $pq_head  = ($cpu.sym "&amp;(pq-&gt;head)")
 $pq_done  = ($cpu.sym "&amp;(pq-&gt;done)")</pre>
<p>Next, I set breakpoints on all writes to empty, full, and done. This was the most expedient route to catch actual puts and gets to the queue. Breakpoints on the queue_put() and queue_get() functions do not really show the true flow, as these functions start by contending for the lock. Looking at writes to the actual queue members gave me the point where the tasks had grabbed the lock.</p>
<p>The script caught all writes to done, full, and empty, and on each write it dumped the state of the queue, including computing the number of elements in the circular buffer (without having to run any code on the target). To get an idea of who was active, it also used OS awareness to find the currently executing thread ID, and scripted debugging to convert the current program counter into a position in the program source code (actually, the important issue was the name of the function we were executing in).</p>
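<p>The element count can be derived from the bookkeeping values alone, which is why no code has to run on the target. The actual script did this in Simics CLI on values read from target memory; the C version below only illustrates the arithmetic. It assumes that head is the index of the oldest element and tail the index of the next free slot, which is an assumption about the queue, and the names queue_state_t, queue_depth and slots are hypothetical.</p>

```c
/* Hypothetical snapshot of the queue bookkeeping fields, as read
   from target memory by a debugger script. */
typedef struct {
    int head;   /* assumed: index of the oldest element */
    int tail;   /* assumed: index of the next free slot */
    int full;
    int empty;
} queue_state_t;

/* Number of elements currently stored, for a buffer with the given
   number of slots. head == tail is ambiguous on its own, so the
   full and empty flags are consulted first. */
int queue_depth(const queue_state_t *s, int slots)
{
    if (s->full)
        return slots;
    if (s->empty)
        return 0;
    return (s->tail - s->head + slots) % slots;
}
```

<p>Adding slots before taking the remainder keeps the result non-negative when tail has wrapped around and is numerically smaller than head.</p>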
<p>This trace of activity showed quite an interesting pair of patterns. When the program ran well, the queue was mostly full, and it looked like the producer task always got some kind of priority to fill it before consumers could get in and drain it. When the program froze, the queue was seldom more than a few elements deep. This was the same program, on the same kernel, just run a few milliseconds later.</p>
<p>Clearly, the Linux kernel can exhibit quite variable behavior even for a program this simple. I guess that&#8217;s why this is called &#8220;soft real time&#8221;&#8230; Another parallelism lesson here: the scheduler is very important, and a smart adaptive scheduler can wreak havoc with software that was accidentally tuned for a different scheduler.</p>
<p>In the end, the crucial hint was that whenever the program froze, the &#8220;done&#8221; flag was set with a queue that was empty or contained just a few elements. I was sure that I had handled this case in my code, checking specifically for that and making sure to wake up the other threads with a signal that &#8220;the queue is not empty any more, please come check for more work&#8221;&#8230; but looking closely at the code, it turned out the code only woke up a single thread. Thus, the freeze resulted from the producer setting &#8220;done&#8221; with an empty queue, waking up a single compute thread, and then having the other threads wait forever for more data to be put into the queue. The fix was easy: use a broadcast signal rather than a single signal.</p>
<p>In retrospect, it seems really strange that this ever worked reliably&#8230; I almost suspect the old Linux kernel of having a flawed pthreads implementation where signals always wake up all waiting threads, and not just a single one like the documentation says. But that will wait for another day to be investigated.</p>
<p>Here is the code, for reference:</p>
<pre>void rule30_packet_queue_signal_done(rule30_packet_queue_t *q) {
 //
 // Grab lock, set the done signal atomically
 //
 pthread_mutex_lock (&amp;(q-&gt;mutex));
 q-&gt;done = 1;
 pthread_mutex_unlock (&amp;(q-&gt;mutex));
 // Signal any threads waiting for data to wake up
 // and discover that we are indeed done
 //
 // This is the bug:
 // - It only wakes up one thread...
 pthread_cond_signal (&amp;(q-&gt;notEmpty));
 // To be correct:
 // pthread_cond_broadcast (&amp;(q-&gt;notEmpty));
}</pre>
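<p>To see the broadcast version in isolation, here is a small self-contained sketch that reduces the compute threads to their shutdown path. It is not the code from the actual program: the names shutdown_queue_t, waiter, run_shutdown_demo and WAITERS are hypothetical, and only the mutex, notEmpty, empty and done members mirror the ones shown above.</p>

```c
#include <pthread.h>
#include <stddef.h>

#define WAITERS 4   /* hypothetical number of compute threads */

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  notEmpty;
    int empty;
    int done;
    int exited;     /* number of waiters that saw done and left */
} shutdown_queue_t;

/* Compute thread reduced to its shutdown path: wait until there is
   data or the producer has signalled done, then leave. */
static void *waiter(void *arg)
{
    shutdown_queue_t *q = arg;

    pthread_mutex_lock(&q->mutex);
    while (q->empty && !q->done) {
        pthread_cond_wait(&q->notEmpty, &q->mutex);
    }
    q->exited++;
    pthread_mutex_unlock(&q->mutex);
    return NULL;
}

/* Starts WAITERS threads on an empty queue, signals done with a
   broadcast, joins the threads, and returns how many of them exited. */
int run_shutdown_demo(void)
{
    shutdown_queue_t q;
    pthread_t threads[WAITERS];
    int i;

    pthread_mutex_init(&q.mutex, NULL);
    pthread_cond_init(&q.notEmpty, NULL);
    q.empty  = 1;
    q.done   = 0;
    q.exited = 0;

    for (i = 0; i < WAITERS; i++) {
        pthread_create(&threads[i], NULL, waiter, &q);
    }

    /* Set done under the lock, then wake every waiting thread. */
    pthread_mutex_lock(&q.mutex);
    q.done = 1;
    pthread_mutex_unlock(&q.mutex);
    pthread_cond_broadcast(&q.notEmpty);

    for (i = 0; i < WAITERS; i++) {
        pthread_join(threads[i], NULL);
    }

    pthread_cond_destroy(&q.notEmpty);
    pthread_mutex_destroy(&q.mutex);
    return q.exited;
}
```

<p>With pthread_cond_broadcast, every thread either is woken from the wait or arrives late and sees done already set, so all of them exit and the joins return. Swapping in pthread_cond_signal would only guarantee that one waiting thread wakes up, and the joins could then block forever, which is the freeze described above.</p>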
<p><em>Updated analysis:</em></p>
<p>My initial analysis was that when things worked, the &#8220;done&#8221; flag was set with enough data left in the queue that all threads had a chance to pull in data and come in and see the done flag being set.</p>
<p>However, today I went back and wrote a deeper analysis script that also checked for reads from the done flag (turning this check on only after the write to &#8216;done&#8217; to reduce the noise). I expected there to be a single reader when the freeze happened&#8230; but that was not the case. In my current test case, three out of five threads actually got in to read the done flag and terminate.  The crucial code for the compute threads looks like this:</p>
<pre> // Grab mutex,
 //   Check if the queue is empty, if so wait for someone
 //   to push something onto the queue, or signal done.
 //   both of which are done by signaling the notEmpty condition variable
 pthread_mutex_lock (&amp;(queue-&gt;mutex));
 while ((queue-&gt;empty) &amp;&amp; !(queue-&gt;done)) {
   pthread_cond_wait (&amp;(queue-&gt;notEmpty), &amp;(queue-&gt;mutex));
 }</pre>
<p>To freeze, a thread actually has to be doing the conditional wait here. There are plenty of other places threads can be as the program is finishing. For example, they can be waiting to grab the initial mutex lock, or actually doing compute work. That explains why some threads actually still terminate even with the buggy version. It certainly also illustrates just how chaotic concurrent programs can be. More so than you can ever imagine, really.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/975/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

