<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; race condition</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/race-condition/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>Neat Register Design to Avoid Races</title>
		<link>http://jakob.engbloms.se/archives/1070?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1070#comments</comments>
		<pubDate>Thu, 28 Jan 2010 18:59:53 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[64-bit computing]]></category>
		<category><![CDATA[device driver]]></category>
		<category><![CDATA[Gary Stringham]]></category>
		<category><![CDATA[high-level synthesis]]></category>
		<category><![CDATA[programming register]]></category>
		<category><![CDATA[race condition]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1070</guid>
		<description><![CDATA[In his most recent Embedded Bridge Newsletter, Gary Stringham describes a solution to a common read-modify-write race-condition hazard on device registers accessed by multiple software units in parallel. Some of the solutions are really neat! I have seen the &#8220;write 1 clears&#8221; solution before in real hardware, but I was not aware of the other [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-589" style="margin: 5px 10px;" title="racecondition" src="http://jakob.engbloms.se/wp-content/uploads/2008/01/racecondition.png" alt="racecondition" width="99" height="78" />In his most recent <a href="http://garystringham.com/newsletter.shtml?nid=039">Embedded Bridge Newsletter</a>, Gary Stringham describes several solutions to a common read-modify-write race-condition hazard on device registers accessed by multiple software units in parallel. Some of them are really neat!</p>
<p>I have seen the &#8220;write 1 clears&#8221; solution before in real hardware, but I was not aware of the other two variants. The idea of having a &#8220;write mask&#8221; in one half of a 32-bit word is really clever.</p>
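<p>To make the two ideas concrete, here is a minimal software model of them. The layout is an assumption made for illustration (mask in the upper 16 bits, values in the lower 16 bits), and all function names are hypothetical; the newsletter may define the bits differently.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 31..16 are a write mask, bits 15..0 carry the new
 * values. Only bits whose mask bit is set are changed by a write. */
static inline uint32_t masked_write_word(uint16_t mask, uint16_t values)
{
    return ((uint32_t)mask << 16) | (uint32_t)(values & mask);
}

/* Build the word that sets or clears a single setting bit. */
static inline uint32_t masked_bit_write(unsigned bit, int on)
{
    assert(bit < 16);
    uint16_t m = (uint16_t)(1u << bit);
    return masked_write_word(m, on ? m : 0);
}

/* Model of what the device does when it receives such a word. */
static inline uint16_t apply_masked_write(uint16_t current, uint32_t word)
{
    uint16_t mask = (uint16_t)(word >> 16);
    uint16_t values = (uint16_t)(word & 0xFFFFu);
    return (uint16_t)((current & (uint16_t)~mask) | (values & mask));
}

/* Model of a "write 1 clears" register: every 1 bit written clears the
 * corresponding bit, every 0 bit leaves the register untouched. */
static inline uint16_t apply_write1_clear(uint16_t current, uint16_t written)
{
    return (uint16_t)(current & (uint16_t)~written);
}
```

<p>In both models the software never needs to read the register first, so there is no window in which another software unit can slip in a conflicting update.</p>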
<p>However, this got me thinking about what the fundamental issue here really is.</p>
<p><span id="more-1070"></span></p>
<p>As I see it, it is the fact that the processor cannot address small enough units atomically. The <a href="http://garystringham.com/newsletter.shtml?nid=037">read-modify-write that was used to start the discussion in Embedded Bridge #37</a> was needed in order to get the current state of a configuration register, change some setting that only occupied a few bits in it, and write the result back to the register. This is how most configuration registers that I have seen in practice work.</p>
<p>But if each setting could be given its own register, the problem would go away. Each operation would target a unique address, achieving the same effect as the bit-wise masks or write-1 solutions proposed. The core problem is that hardware tends to pack several settings into each register, as it has been considered too expensive to put information that might cover a range as small as [0,1] into a 32-bit register. The likely reason is a shortage of register addresses: you cannot let each simple device with 1000 settings use up 1000 words of physical address space.</p>
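<p>The hazard and the one-register-per-setting alternative can be sketched as follows. This is a plain software replay of one bad interleaving, not real driver code; the bit positions and the register names are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENABLE_BIT (1u << 0)  /* hypothetical setting owned by unit A */
#define IRQ_EN_BIT (1u << 5)  /* hypothetical setting owned by unit B */

/* Replays one bad interleaving of two read-modify-write sequences on a
 * shared register and returns the final register value. */
static uint32_t rmw_lost_update(uint32_t reg)
{
    uint32_t a = reg;   /* unit A reads */
    uint32_t b = reg;   /* unit B reads the same old value */
    a |= ENABLE_BIT;    /* A modifies its private copy */
    b |= IRQ_EN_BIT;    /* B modifies its private copy */
    reg = a;            /* A writes back */
    reg = b;            /* B writes back and wipes out A's change */
    return reg;
}

/* One register per setting: every write targets its own address, so no
 * read is needed and no other setting can be overwritten. */
struct per_setting_regs {
    volatile uint32_t enable;
    volatile uint32_t irq_enable;
};

static void set_both(struct per_setting_regs *dev)
{
    assert(dev != NULL);
    dev->enable = 1;      /* what unit A would write */
    dev->irq_enable = 1;  /* what unit B would write */
}
```

<p>In the first function the enable bit set by unit A is gone at the end; in the second layout the order of the two writes does not matter.</p>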
<p>But is that really an issue, if we look forward?</p>
<p>It seems to me that, as 64-bit instruction sets and addressing systems penetrate down into more and more embedded systems, a simple solution would be to throw address space at the problem. I don&#8217;t think it is uneconomical to allocate huge chunks of memory space to each device, giving each setting its own register, when you have 64-bit virtual addresses to work with. There is no way you can fill up a physical memory system of that size (I guess that will some day come back to haunt me)&#8230; even the highest-end machines today only use something like 40 bits for actually addressing physical memories.</p>
<p>The software would be simpler and more robust, with virtually no cost.</p>
<p>Another solution that I have also seen starting to appear is to dispense with register settings altogether, and instead define a command API that the processor &#8220;calls&#8221; by placing command packets in some memory area. This does require quite a bit of silicon for a decoder, but it provides for a much higher level of interaction with devices. As hardware devices get defined in successively higher-level languages (C, C++, UML, MatLab, &#8230;), and <a href="http://jakob.engbloms.se/archives/871">their programming interfaces and associated drivers get autogenerated</a>, this solution makes eminent sense.</p>
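<p>A command-packet interface of this kind could look roughly like the ring buffer below. It is only a sketch under stated assumptions: the packet format, the ring size and all names are hypothetical, there is one producer and one consumer, and a real device would also need memory barriers and some doorbell mechanism that are left out here.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CMD_RING_SLOTS 4u  /* hypothetical ring size */

struct cmd_packet {
    uint32_t opcode;  /* which operation the device should perform */
    uint32_t arg;     /* single argument, to keep the sketch short */
};

struct cmd_ring {
    struct cmd_packet slot[CMD_RING_SLOTS];
    uint32_t head;  /* count of packets queued by the processor */
    uint32_t tail;  /* count of packets consumed by the device */
};

/* Processor side: queue one command. Returns 1 on success, 0 if full. */
static int cmd_submit(struct cmd_ring *r, uint32_t opcode, uint32_t arg)
{
    assert(r != NULL);
    if (r->head - r->tail == CMD_RING_SLOTS)
        return 0;
    r->slot[r->head % CMD_RING_SLOTS].opcode = opcode;
    r->slot[r->head % CMD_RING_SLOTS].arg = arg;
    r->head++;
    return 1;
}

/* Device side (modelled in software): fetch the oldest command.
 * Returns 1 if a packet was copied to *out, 0 if the ring is empty. */
static int cmd_consume(struct cmd_ring *r, struct cmd_packet *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return 0;
    *out = r->slot[r->tail % CMD_RING_SLOTS];
    r->tail++;
    return 1;
}
```

<p>Each command is a complete, self-describing unit, so the driver never has to read, modify and write back shared device state.</p>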
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1070/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hardware-Software Race Condition in Interrupt Controller</title>
		<link>http://jakob.engbloms.se/archives/588?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/588#comments</comments>
		<pubDate>Sat, 17 Jan 2009 21:16:14 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[interrupt controller]]></category>
		<category><![CDATA[learning by doing]]></category>
		<category><![CDATA[OpenPIC]]></category>
		<category><![CDATA[operating systems]]></category>
		<category><![CDATA[race condition]]></category>
		<category><![CDATA[teaching setup]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=588</guid>
		<description><![CDATA[The best way to learn something is to try, fail, and then try again. That is how I just learned the basics of multiprocessor interrupt management. For an educational setup, I have been creating a purely virtual virtual platform from scratch. This setup contains a large number of processors with local memory, and then a [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-589" style="margin: 5px 10px;" title="racecondition" src="http://jakob.engbloms.se/wp-content/uploads/2008/01/racecondition.png" alt="racecondition" width="99" height="78" />The best way to learn something is to try, fail, and then try again. That is how I just learned the basics of multiprocessor interrupt management. For an educational setup, I have been creating a purely virtual virtual platform from scratch. This setup contains a large number of processors with local memory, and then a global shared memory, as well as a means for the processors to interrupt each other in order to notify about the presence of a message or synchronize in general. Getting this really right turned out to be not so easy.</p>
<p><span id="more-588"></span></p>
<p>I started out with a simple model where each processor had an interrupt location mapped in global memory, and writing to this location would interrupt the processor. As a bonus, the written value was communicated to the receiving processor. Then, the processor being interrupted would acknowledge the interrupt to its local interrupt controller by writing into a local address.  Worked like a charm in simple tests.</p>
<p>It broke completely when I started sending messages from multiple nodes to the same node&#8230; if an interrupt from node B reached node A when A was busy processing an interrupt from C, the interrupt from B would simply be ignored. There was no queuing, no fairness, no arbitration. The software could not solve this, since in order to create a lock around the global interrupt location for a processor, it needs some kind of global signaling mechanism. Which was what this interrupt system was supposed to provide.</p>
<p>I must have had some suspicion that something was not quite right, as I had equipped the interrupt controller with a counter of interrupts raised versus interrupts cleared. The difference between the two increased monotonically, indicating an accumulation of unnoticed interrupt attempts.</p>
<p>One obvious solution that did not work either was to provide a way to check that an interrupt was successfully sent. Since the interrupt send register for a processor was put in a shared global memory space, a processor that wrote the interrupt send register and then read the status register would have no way to guarantee that the status it read actually dealt with the interrupt it had tried to send. It would be very likely to read the status resulting from some other processor&#8217;s interrupt attempt. Basically, it would be doing non-protected access to a shared mutable area&#8230; known not to be a good idea.</p>
<p>Another solution would be to use an atomic load-and-store operation that would store a value in a register and then return a value to the processor as well. However, I have never seen this supported for device space, even if atomic operations of this type are available on most machines for regular memory.</p>
<p>So it was back to the drawing board. It is clear that in order to do interrupts in a multiprocessor, it must be possible for any processor to interrupt any other processor without the message getting lost due to simultaneous actions in other processors. How to solve this?</p>
<p>And why did I just not copy an existing design or read a book to tell me how to do this? The problem is that I have not managed to find any good readable text on this kind of subject: how does a multiprocessor (shared-memory or local memory, does not matter really) really handle interrupts and coordinate the code that is actually running locally on each individual processor with that running on other processors &#8212; at the lowest level. A description of the hardware-software interaction design needed to make this work must exist somewhere, but I have not managed to find it, and I suspect that in many cases this is just passed down as lore from one generation of system designers to the next. If someone knows a good text on this subject, please do point it out to me!</p>
<p>My first design was to use N x N registers for an N-processor machine. Essentially, each processor would have a bank of registers with one register for each other processor, indicating the sending processor. Thus, if processors A and B decide to interrupt C simultaneously, they would write into two different locations, and C could scan its register array to tell that both A and B were calling. However, this eats memory space pretty quickly, since it requires 2 times N squared registers:</p>
<ul>
<li>N registers local to a processor, to read out the message sent in.</li>
<li>N registers for each processor, to write messages to. This can be either a local set for each processor, or be put in global memory.</li>
</ul>
<p>In essence, this is the design of the OpenPIC controller common in PowerPC land. It codes the processors using bits rather than full registers, but it works with a local set of data for each processor where it can set bits to interrupt any other processor.</p>
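<p>The per-sender register bank from my first design can be sketched as a small software model. The machine size and every name below are hypothetical, and the model uses full registers per sender rather than the bit encoding that OpenPIC uses.</p>

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CPUS 4  /* hypothetical machine size */

/* One bank per receiving processor, one slot per possible sender. */
struct ipi_bank {
    uint32_t pending[NUM_CPUS];  /* nonzero if that sender is calling */
    uint32_t message[NUM_CPUS];  /* value written by that sender */
};

static struct ipi_bank banks[NUM_CPUS];

/* Sender side: each sender writes only to its own slot in the target's
 * bank, so simultaneous senders never touch the same location. */
static void ipi_post(int from, int to, uint32_t msg)
{
    assert(from >= 0 && from < NUM_CPUS && to >= 0 && to < NUM_CPUS);
    banks[to].message[from] = msg;
    banks[to].pending[from] = 1;
}

/* Receiver side: scan the bank, collect and acknowledge every caller.
 * Returns a bit mask of the senders that were pending. */
static uint32_t ipi_scan(int me, uint32_t msgs[NUM_CPUS])
{
    uint32_t senders = 0;
    assert(me >= 0 && me < NUM_CPUS);
    for (int from = 0; from < NUM_CPUS; from++) {
        if (banks[me].pending[from]) {
            msgs[from] = banks[me].message[from];
            banks[me].pending[from] = 0;
            senders |= 1u << from;
        }
    }
    return senders;
}
```

<p>Two processors posting to the same target end up in different slots, which is exactly what removes the race, at the cost of storage that grows with the square of the processor count.</p>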
<p>A colleague of mine pointed out that the SPARC systems do things a bit differently. There, you have a single register into which you send the number of the receiving processor, and a status flag to tell you if you were successful in sending. The sending software is thus responsible for retrying if the remote side is busy. This scales nicely to quite large systems, since there is no need to represent or manage interrupt registers many hundreds of bits wide, the vast majority of which would not be used anyway at any particular point in time. What you lose is the ability of a single processor to do arbitrary multicast interrupting, which I don&#8217;t think is that commonly needed (though it might well be, this is a bit of a dark art).</p>
<p>Since both these controller registers are present in memory that is local to a processor, there is no need to worry about races between different processors interrupting the same target processor simultaneously. The hardware interrupt bus will work out so that only one wins, and the software on only one processor will see a successful flag status and continue. The others will spin, or do more sophisticated waits if needed.</p>
<p>In the end, the code for sending an interrupt that I used was this:</p>
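<pre>/* my_intr_dest, my_intr_send_data and my_intr_send_status point at
 * registers of the local interrupt controller (assumed to be volatile). */
void interrupt_cpu(int cpu_num, int message) {
  *my_intr_dest = cpu_num;        /* select the target processor */
  *my_intr_send_data = message;   /* writing the message triggers the send */
  while(*my_intr_send_status == 0) {
    /* target was busy: retry until the hardware reports success */
    *my_intr_send_data = message;
  }
}</pre>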
<p>Note that I still send a 32-bit message, mostly since that is handy in educational and demo setups that are not completely limited by what current hardware does. In this design, writing to the message register is what triggers the interrupt (or rather, an attempt to send an interrupt) on the other processor. The hardware (or in my case, the virtual hardware model) does the rest, in a way that is guaranteed to eventually deliver every interrupt safely to its end point. It does so without any complex buffering in the hardware itself; retrying is best handled in software, which has an easier time managing state. This also lets the software use other strategies, such as possibly treating a busy status as a signal to try some other processor that is less busy.</p>
<p>Anyway, it was an interesting experience to try this and to see how hardware devices and software interact in a concurrent machine to create races. Not just software, but also hardware, must be designed right to prevent races from occurring. And races caused by hardware are at times quite impossible to work around in software.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/588/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Shaking a Linux Device Driver on a Virtual Platform</title>
		<link>http://jakob.engbloms.se/archives/337?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/337#comments</comments>
		<pubDate>Sun, 09 Nov 2008 22:23:13 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[teaching]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[device driver]]></category>
		<category><![CDATA[interrupt]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[operating systems]]></category>
		<category><![CDATA[power architecture]]></category>
		<category><![CDATA[race condition]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=337</guid>
		<description><![CDATA[To continue from last week&#8217;s post about my Linux device driver and hardware teaching setup in Simics, here is a lesson I learnt this week when doing some performance analysis based on various hardware speeds. First some background. A key idea in the setup is to use the approach of assuming some processing time for [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-medium wp-image-329" style="margin: 5px 10px;" title="penguin-variant" src="http://jakob.engbloms.se/wp-content/uploads/2008/11/penguin-variant.png" alt="" width="100" height="118" />To continue from <a href="http://jakob.engbloms.se/archives/330">last week&#8217;s post </a>about my Linux device driver and hardware teaching setup in <a href="http://www.virtutech.com/academia">Simics</a>, here is a lesson I learnt this week when doing some performance analysis based on various hardware speeds.</p>
<p><span id="more-337"></span></p>
<p>First some background.</p>
<p>A key idea in the setup is to use the approach of <em>assuming some processing time </em>for the hardware accelerator, rather than creating detailed code and determining the actual processing time for a particular implementation. Given some assumed time, we can then see how it impacts program performance. This is a way of designing hardware where we look at how fast something needs to be to have a positive impact, rather than trying to make it as fast as possible. It also lets us analyze how hardware performance looks when using a complete OS stack and a real device driver rather than simple bare-metal software (which tends to show the performance in the best possible light). Essentially, it is loosely timed design-space exploration.</p>
<p>Initial tests of the driver used very short completion times, on the order of 1 microsecond. The read() call at this point simply waited for the hardware completion flag to become true, and then returned the results. That is not the kind of behavior that a driver should have, since if the hardware gets some kind of hiccup, we will be stuck looping  inside a kernel context. Instead, I implemented a blocking read variant that would put the calling process to sleep until a result arrives.</p>
<p class="MsoNormal">In order to test that my driver did the sleep function correctly, I raised the processing delay to the order of seconds&#8230; and promptly found a set of issues that forced several rewrites of the code. The most important were the need to switch to a software flag for completion rather than relying on the hardware flag, and the implementation of an interrupt handler to get a notification from the hardware.</p>
<p>Then, on Friday, I demonstrated the setup along with some new performance analysis tools to go with it to some students testing the setup. And the test program suddenly stopped working, obviously hanging at the first call to read() without ever getting unblocked.</p>
<p>The reason was a classic race condition: the code in the <tt>write()</tt> device driver call that sent input data into the hardware device waited until after the writing was complete (and then some more) before clearing the operation complete flag. Here is the relevant piece of code:</p>
<pre>for(i=0;i&lt;words;i++) {
  write_register(SIMPLE_INPUT, kbuf[i]);
}
*f_pos = 0;
kfree(kbuf);
clear_completion_state();</pre>
<p class="MsoNormal">With a sufficiently short delay to completion, the completion interrupt fired, was handled, and set the completion flag before the <tt>write()</tt> function even got to <tt>clear_completion_state()</tt>. After this, the test program called <tt>read()</tt> to read the result, and was blocked as the completion flag was not set. The interrupt to signal completion from the hardware had already triggered and its result had been deposited in the software flag, which had then been promptly overwritten inside <tt>write()</tt>. Thus, inside <tt>read()</tt>, the flag never became set, and the process waited forever.</p>
<p class="MsoNormal">The fix is obvious: just move the clearing of the flag to <em>before </em>the writing to the hardware begins.</p>
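<p class="MsoNormal">The difference between the two orderings can be replayed in a small stand-alone model. This is a sketch and not the actual driver: the model device, the handler and all names except <tt>clear_completion_state()</tt> are hypothetical, and the device is assumed to raise its completion interrupt the moment the last input word is written, which corresponds to a very short processing time.</p>

```c
#include <assert.h>
#include <stddef.h>

static int completion_flag;    /* software flag set by the interrupt handler */
static size_t words_expected;  /* model device: done after this many words */
static size_t words_seen;

static void completion_irq_handler(void) { completion_flag = 1; }
static void clear_completion_state(void) { completion_flag = 0; }

/* Model device with near-zero processing time: the completion interrupt
 * fires as soon as the last input word has been written. */
static void write_register_model(int value)
{
    (void)value;
    if (++words_seen == words_expected)
        completion_irq_handler();
}

/* Original order: the flag is cleared after the device was started.
 * Returns the flag value that a later read() would observe. */
static int driver_write_buggy(const int *kbuf, size_t words)
{
    assert(kbuf != NULL && words > 0);
    words_seen = 0;
    words_expected = words;
    for (size_t i = 0; i < words; i++)
        write_register_model(kbuf[i]);
    clear_completion_state();  /* wipes out a completion that already arrived */
    return completion_flag;
}

/* Fixed order: clear the flag first, then start feeding the device. */
static int driver_write_fixed(const int *kbuf, size_t words)
{
    assert(kbuf != NULL && words > 0);
    words_seen = 0;
    words_expected = words;
    clear_completion_state();
    for (size_t i = 0; i < words; i++)
        write_register_model(kbuf[i]);
    return completion_flag;
}
```

<p class="MsoNormal">With the buggy order the flag reads 0 afterwards, so a blocking <tt>read()</tt> would sleep forever; with the fixed order it reads 1.</p>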
<p class="MsoNormal">To generalize from this brilliant example of concurrency carelessness, this is a really good accidental demonstration of the power of varying timing in a virtual platform to shake code and find timing-related bugs in a manner much more efficient than possible on physical hardware.</p>
<p class="MsoNormal">Had I described the exact (or even approximate) timing of a particular hardware implementation, this kind of bug would not have been found and the driver code would not have been as robust. An implementation relying on a very short completion time could check the hardware operation complete flag directly, but that broke down when the delay was long. The buggy implementation above worked fine with a long completion time, but broke down with a short one. The fixed implementation works across a span of times from 10 ns to 10 s or more, which is all you can ask for, I think.</p>
<p class="MsoNormal">A short fun Simics note on this: changing that timing parameter is a run-time change. It is possible to change it during a run, from the Simics command-line, using a simple one-line command:</p>
<pre class="MsoNormal" style="padding-left: 30px;"><span style="color: #0000ff;">simics&gt; </span>sd0-&gt;time_to_result = 10.0e-9</pre>
<p class="MsoNormal">It is really nice working with a system like that!</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/337/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The 1970 rule strikes again: Virtual Platform Principles in 1967</title>
		<link>http://jakob.engbloms.se/archives/130?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/130#comments</comments>
		<pubDate>Fri, 30 May 2008 20:37:31 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[1969]]></category>
		<category><![CDATA[HITAC-8400]]></category>
		<category><![CDATA[Hitachi]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[operating systems]]></category>
		<category><![CDATA[race condition]]></category>
		<category><![CDATA[Temporal decoupling]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=130</guid>
		<description><![CDATA[Being a bit of a computer history buff, I am often struck by how most key concepts and ideas in computer science and computer architecture were all invented in some form or the other before 1970. And commonly by IBM. This goes for caches, virtual memory, pipelining, out-of-order execution, virtual machines, operating systems, multitasking, byte-code [...]]]></description>
			<content:encoded><![CDATA[<p>Being a bit of a computer history buff, I am often struck by how most key concepts and ideas in computer science and computer architecture were all invented in some form or the other before 1970. And commonly by IBM. This goes for caches, virtual memory, pipelining, out-of-order execution, virtual machines, operating systems, multitasking, byte-code machines, etc. Even so, I have found a quite extraordinary example of this that actually surprised me in its range of modern techniques employed. This is a follow-up to a previous post, after having actually digested <a href="http://jakob.engbloms.se/archives/121">the paper I talked about earlier</a>.</p>
<p><span id="more-130"></span></p>
<p>The paper in question was published in 1969, and is titled &#8220;<a href="http://portal.acm.org/citation.cfm?id=961053.961092&amp;coll=ACM&amp;dl=ACM&amp;CFID=67556471&amp;CFTOKEN=25257537">A program simulator by partial interpretation</a>&#8221;. In the previous post, I took note of its use of direct execution of software plus trapping of privileged instructions, but those were not really the most interesting bits in there.</p>
<p>The authors lay out in quite simple terms most of the key ideas behind today&#8217;s fast virtual platforms. Here are the best parts:</p>
<ul>
<li>They note that simulation of a computer is often used to overcome debugging difficulties, in particular repeating failed runs and tracing all that is going on in the target machine.</li>
<li>They are hunting down race conditions using the simulator.</li>
<li>They use recorded input and output to drive a deterministic simulation even of workloads involving communication with the external world.</li>
<li>They simulate multiple processors on top of a single physical processor by means of giving each processor a certain time slice to do its work before switching to the next processor. This is known as temporal decoupling or quantized simulation today, and is a key to the high speed of solutions such as Simics. They note the same tradeoffs as we see today, 40 years later, for doing this: shorter slices more accurately depict the parallelism, but also cost performance.</li>
<li>The temporally decoupled simulation also includes timers and similar non-CPU-hardware. Just like we do it today for virtual platforms.</li>
<li>In a temporally decoupled simulation, they optimize the simulation of the IDL, Idle, instruction. When it is encountered, they skip immediately to the end of the time slice. This is what we today call idle-loop optimization or hypersimulation, and which is absolutely key to achieving scalable simulation of large multiprocessor and multi-machine setups (since most parts of a system are not usually maximally loaded).</li>
<li>They are debugging operating systems on the simulator, not just user-level code.</li>
</ul>
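<p>The time-slice scheme and the idle optimization from the list above can be sketched as follows. This is a toy model written for this post, not code from the paper or from Simics; the structure and names are hypothetical, and each simulated instruction is assumed to take one cycle.</p>

```c
#include <assert.h>
#include <stdint.h>

struct sim_cpu {
    uint64_t local_time;  /* virtual cycles this CPU has advanced */
    uint64_t work_left;   /* cycles of real work before it goes idle */
    uint64_t steps_run;   /* host-side cost: instructions actually simulated */
};

/* Run one CPU for one time slice (quantum) of virtual time. */
static void run_slice(struct sim_cpu *cpu, uint64_t quantum)
{
    uint64_t end = cpu->local_time + quantum;
    while (cpu->local_time < end) {
        if (cpu->work_left == 0) {
            /* Idle optimization: nothing to execute, so jump straight
             * to the end of the slice without simulating anything. */
            cpu->local_time = end;
            break;
        }
        cpu->work_left--;
        cpu->steps_run++;
        cpu->local_time++;
    }
}

/* Temporal decoupling: round-robin over the CPUs, one slice each,
 * until every CPU has advanced at least `until` cycles. */
static void simulate(struct sim_cpu cpus[], int n, uint64_t quantum,
                     uint64_t until)
{
    assert(quantum > 0 && n > 0);
    for (uint64_t t = 0; t < until; t += quantum)
        for (int i = 0; i < n; i++)
            run_slice(&cpus[i], quantum);
}
```

<p>A shorter quantum interleaves the processors more finely but means more switches between them; an idle processor costs almost nothing on the host regardless of the quantum.</p>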
<p>The computer in question is a Japanese System/360-compatible machine called the <a href="http://www.ipsj.or.jp/katsudou/museum/computer/0610_e.html">HITAC-8400</a>. The work was reported in 1969, but actually carried out in 1967.</p>
<p>There are some differences in scale and kind compared to today&#8217;s virtual platforms, but none that detract from the underlying principles. The 1967 system is host-on-host, so it is not the kind of cross-environment that is most common in today&#8217;s virtual platforms (Power Arch on x86, ARM on x86, etc.). The IO system is much easier to simulate since it is part of the instruction set of the processor rather than being a set of complex memory-mapped peripherals.</p>
<p>So the 1970 rule strikes again. Not the IBM rule this time, though: this was all done by Hitachi. There are traces of similar work at IBM in other papers, but I have not been able to locate actual copies of any publication.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/130/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Dekker&#8217;s Algorithm Does not Work, as Expected</title>
		<link>http://jakob.engbloms.se/archives/65?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/65#comments</comments>
		<pubDate>Mon, 07 Jan 2008 21:14:22 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[code example]]></category>
		<category><![CDATA[Dekker's Algorithm]]></category>
		<category><![CDATA[Embedded Systems Conference]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[race condition]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/65</guid>
		<description><![CDATA[Sometimes it is very reassuring that certain things do not work when tested in practice, especially when you have been telling people that for a long time. In my talks about Debugging Multicore Systems at the Embedded Systems Conference Silicon Valley in 2006 and 2007, I had a fairly long discussion about relaxed or weak [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes it is very reassuring that certain things do not work when tested in practice, especially when you have been telling people that for a long time. In my talks about <a href="http://www.engbloms.se/jakob_publications.html">Debugging Multicore Systems</a> at the <a href="http://www.esconline.com">Embedded Systems Conference Silicon Valley</a> in 2006 and 2007, I had a fairly long discussion about relaxed or <a href="http://en.wikipedia.org/wiki/Weak_consistency">weak</a> <a href="http://en.wikipedia.org/wiki/Consistency_model">memory consistency models</a> and their effect on parallel software when run on a truly concurrent machine. I used <a href="http://en.wikipedia.org/wiki/Dekker%27s_algorithm">Dekker&#8217;s Algorithm</a> as an example of code that works just fine on a single-processor machine with a multitasking operating system, but that fails to work on a dual-processor machine. Over Christmas, I finally did a practical test of just how easy it was to make it fail in reality, which turned out to showcase some interesting properties of various types and brands of hardware and software.</p>
<p><span id="more-65"></span>Now to the code.</p>
<p>The core of Dekker&#8217;s Algorithm is two symmetrical pieces of code that access a set of shared variables in a way that makes it impossible for both to enter their critical section at the same time, as long as memory is sequentially consistent. Here is my implementation, warts and all:</p>
<pre>static volatile int flag1 = 0;
static volatile int flag2 = 0;
static volatile int turn  = 1;
static volatile int gSharedCounter = 0;</pre>
<pre>void dekker1( ) {
        flag1 = 1;
        turn  = 2;
        while((flag2 ==  1) &amp;&amp; (turn == 2)) ;
        // Critical section
        gSharedCounter++;
        // Let the other task run
        flag1 = 0;
}

void dekker2(void) {
        flag2 = 1;
        turn = 1;
        while((flag1 ==  1) &amp;&amp; (turn == 1)) ;
        // critical section
        gSharedCounter++;
        // leave critical section
        flag2 = 0;
}</pre>
<p>This code can fail on a machine with weak memory consistency, since most memory systems place no constraint on the order in which the updates to &#8220;flag2&#8221;, &#8220;flag1&#8221;, and &#8220;turn&#8221; become visible to the other processor.  In particular, there is no guarantee that the read from &#8220;flag2&#8221; in dekker1 will happen after the writes to &#8220;flag1&#8221; and &#8220;turn&#8221; have propagated to the processor running dekker2. Doing this argument symmetrically, you get something like the following sketch:</p>
<p><a title="Dekker Bug" href="http://jakob.engbloms.se/wp-content/uploads/2008/01/dekkersbug.png"><img src="http://jakob.engbloms.se/wp-content/uploads/2008/01/dekkersbug.png" alt="Dekker Bug" /></a></p>
<p>From this basic faulty code, I then use pthreads to create a parallel program. In the program, I loop many millions of times in each thread trying to get into the critical section:</p>
<pre>int gLoopCount;
void *task1(void *arg) {
        int i;
        printf("Starting task1\n");
        for(i=gLoopCount;i&gt;0;i--) {
                dekker1();
        }
        return NULL;  // pthread start routines return a void *
}
void *task2(void *arg) {
        int i;
        printf("Starting task2\n");
        for(i=gLoopCount;i&gt;0;i--) {
                dekker2();
        }
        return NULL;  // pthread start routines return a void *
}</pre>
<p>If it happens that both enter the critical section at the same time, the construction of increasing a shared counter in dekker1 and dekker2 will likely result in a missed update to &#8220;gSharedCounter&#8221; as both threads read the same value, increment it, and then write the same value back (this is the kind of error mutual exclusion is supposed to protect against). Given this, by checking the value of gSharedCounter at the end of the program run, I can tell if any failures to lock happened. Note that this is likely conservative, since it is quite possible that one task manages to do the entire read-increment-write operation implicit in the ++ operation before the other task does. So the number of missed updates is actually a lower bound on the number of failed lockings.</p>
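<p>The check described above can be sketched roughly as follows. This is not the attached dekker.c: the helper name &#8220;count_missed_updates&#8221; is invented for illustration, and the lock and task functions from the post are repeated only so that the block compiles on its own. The helper starts both tasks, waits for them, and reports how far the counter fell short of the expected two increments per loop iteration.</p>

```c
#include <pthread.h>
#include <stddef.h>

/* Shared state and the unfenced lock, copied from the post. */
static volatile int flag1 = 0;
static volatile int flag2 = 0;
static volatile int turn  = 1;
static volatile int gSharedCounter = 0;
static int gLoopCount;

static void dekker1(void) {
        flag1 = 1;
        turn  = 2;
        while ((flag2 == 1) && (turn == 2)) ;
        gSharedCounter++;   /* critical section */
        flag1 = 0;
}

static void dekker2(void) {
        flag2 = 1;
        turn  = 1;
        while ((flag1 == 1) && (turn == 1)) ;
        gSharedCounter++;   /* critical section */
        flag2 = 0;
}

static void *task1(void *arg) {
        (void)arg;
        for (int i = gLoopCount; i > 0; i--)
                dekker1();
        return NULL;
}

static void *task2(void *arg) {
        (void)arg;
        for (int i = gLoopCount; i > 0; i--)
                dekker2();
        return NULL;
}

/* Hypothetical helper (not from dekker.c): runs both tasks for 'loops'
 * iterations each and returns the number of missed updates, or -1 if a
 * thread could not be created. */
int count_missed_updates(int loops) {
        pthread_t t1, t2;

        gLoopCount = loops;
        gSharedCounter = 0;
        if (pthread_create(&t1, NULL, task1, NULL) != 0)
                return -1;
        if (pthread_create(&t2, NULL, task2, NULL) != 0) {
                pthread_join(t1, NULL);
                return -1;
        }
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Each task increments once per iteration, so a run with no
         * lost updates ends with the counter at 2 * loops. */
        return 2 * loops - gSharedCounter;
}
```

<p>A result of zero does not prove that the lock held every time, only that no update was lost; any positive result means both tasks were inside the critical section at once on at least that many occasions.</p>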
<p>So what happened when I tried this on real machines?</p>
<p>I must admit that I did not expect to see very many instances of errors, since I kind of assumed that modern hardware is so fast in communicating between cores that the window of opportunity to catch a bug would be pretty small. But it turned out to be quite frequent, and very variable across machines.</p>
<ul>
<li>On my Core 2 Duo T7600 laptop, with Windows Vista, I got on average one error in every 1.5 million locking attempts.</li>
<li>On an older dual-processor Opteron 242 machine, I got on average one error every 15000 locking attempts. About 100 times more often than on the Core 2 Duo!</li>
<li>On a <a href="http://www.freescale.com">Freescale </a><a href="http://en.wikipedia.org/wiki/PowerPC_e600#MPC8641_.26_MPC8641D">MPC8641D </a>dual-core machine,  I got on average one error every 2000 locking attempts.</li>
<li>On a range of single-core machines, not a single error was observed.</li>
</ul>
<p>So the theory is validated. Always feels good to have proof in practice that I have been telling the truth at the ESC <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  What else can we tell from the numbers?  I think this actually demonstrates a few other theories in practice:</p>
<ul>
<li>Communication between cores on a single chip is much faster than between separate chips, and the longer latencies between chips make Dekker more likely to fail. This is shown by the difference between the dual-processor and dual-core x86 systems.</li>
<li>The PowerPC has a weaker memory consistency model by design than x86 systems, so the greater occurrence of locking failures there is also consistent with expectations.</li>
</ul>
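<p>For comparison, here is a minimal sketch of one way the reordering discussed earlier could be ruled out, assuming a compiler with C11 atomics. It is not taken from dekker.c, and the names &#8220;dekker1_sc&#8221;, &#8220;dekker2_sc&#8221;, &#8220;worker1&#8221;, &#8220;worker2&#8221; and &#8220;run_with_atomics&#8221; are made up for this example. The shared flags become atomic variables whose loads and stores default to sequentially consistent ordering, which is exactly the property the algorithm depends on.</p>

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Same roles as in the post, but declared as C11 atomics. atomic_store()
 * and atomic_load() default to memory_order_seq_cst, so both threads
 * observe these accesses in one agreed order. */
static atomic_int flag1 = 0;
static atomic_int flag2 = 0;
static atomic_int turn  = 1;
static long gSharedCounter = 0;  /* plain variable, guarded by the lock */

static void dekker1_sc(void) {
        atomic_store(&flag1, 1);
        atomic_store(&turn, 2);
        while (atomic_load(&flag2) == 1 && atomic_load(&turn) == 2)
                ;                  /* spin until the other side is done */
        gSharedCounter++;          /* critical section */
        atomic_store(&flag1, 0);   /* let the other task run */
}

static void dekker2_sc(void) {
        atomic_store(&flag2, 1);
        atomic_store(&turn, 1);
        while (atomic_load(&flag1) == 1 && atomic_load(&turn) == 1)
                ;                  /* spin until the other side is done */
        gSharedCounter++;          /* critical section */
        atomic_store(&flag2, 0);   /* leave critical section */
}

static void *worker1(void *arg) {
        long n = *(const long *)arg;
        for (long i = 0; i < n; i++)
                dekker1_sc();
        return NULL;
}

static void *worker2(void *arg) {
        long n = *(const long *)arg;
        for (long i = 0; i < n; i++)
                dekker2_sc();
        return NULL;
}

/* Hypothetical helper: runs both workers for 'loops' iterations each and
 * returns the final counter value, or -1 if a thread could not be created. */
long run_with_atomics(long loops) {
        pthread_t t1, t2;

        gSharedCounter = 0;
        if (pthread_create(&t1, NULL, worker1, &loops) != 0)
                return -1;
        if (pthread_create(&t2, NULL, worker2, &loops) != 0) {
                pthread_join(t1, NULL);
                return -1;
        }
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return gSharedCounter;
}
```

<p>With this variant the returned count should be exactly twice the loop count. The price is that the compiler has to emit whatever barrier instructions the target needs to enforce the ordering, so each locking attempt is likely to be slower than in the unfenced version.</p>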
<p>Now all I need to find are a few more machines to test on. It is always fun to do microbenchmarking of real machines, as evidenced in my <a href="http://www.engbloms.se/publications/engblom-rtas2003.pdf">RTAS 2003 paper</a> on branch predictors and their effect on WCET predictability.</p>
<p>If you want to try this yourself, the source code is attached to this post: <a title="dekker.c" href="http://jakob.engbloms.se/wp-content/uploads/2008/01/dekker.c">dekker.c</a>. Just compile it with &#8220;gcc -O2 -o dekker dekker.c -lpthread&#8221;. It has been tested on Windows with Cygwin as well as various Linuxes. Run it with the argument 10000000 (10 million), which appears to give a good indication of the prevalence of errors without running for too long.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/65/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

