<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; mpc8641d</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/mpc8641d/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>I Want One&#8230; Trillion Instructions&#8230;</title>
		<link>http://jakob.engbloms.se/archives/709?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/709#comments</comments>
		<pubDate>Sat, 28 Mar 2009 21:10:31 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[abstraction levels]]></category>
		<category><![CDATA[device driver]]></category>
		<category><![CDATA[Dr. Evil]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=709</guid>
		<description><![CDATA[There is an eternal debate going on in virtual platform land over what the right kind of abstraction is for each job. Depending on background, people favor different levels. For those with a hardware background, more details tend to be the comfort zone, while for those with a software background like myself, we are quite [...]]]></description>
			<content:encoded><![CDATA[<p>There is an eternal debate going on in virtual platform land over what the right kind of abstraction is for each job. Depending on background, people favor different levels. Those with a hardware background tend to find their comfort zone in more detail, while those with a software background, like myself, are quite comfortable with less detail. I <a href="http://www.virtutech.com/whitepapers/wp-system_arch_spec.html">recently did some experiments on using quite low levels of hardware modeling detail for early architecture exploration and system specification</a>.</p>
<p><span id="more-709"></span></p>
<p>It all comes down to a simple classic tradeoff that I usually illustrate like this (using more neutral ground than computer systems; and with credit to Peter Magnusson who had this slide already in place when I joined Virtutech back in 2002):</p>
<p><img class="aligncenter size-full wp-image-711" title="simulation-rule" src="http://jakob.engbloms.se/wp-content/uploads/2009/03/simulation-rule.png" alt="simulation-rule" width="457" height="341" /></p>
<p>What this is telling you is simple:</p>
<ul>
<li>You simulate something very large using large units, i.e., low level of detail; or</li>
<li>You simulate something quite small using small units, i.e., high level of detail.</li>
</ul>
<p>I wanted to test the idea that by using less detail, you can run larger test cases and therefore obtain better coverage of the overall landscape than by diving in and counting cycles in some small part of it. In the end, this made me cross the trillion-instruction line: each experiment took a few hundred billion target instructions to complete, so repeating and tweaking during the development work definitely added up to more than a trillion instructions.</p>
<p>And this is where I put my little finger close to my mouth and say:</p>
<p style="text-align: center;"><img class="size-full wp-image-710 aligncenter" style="margin-top: 10px; margin-bottom: 10px;" title="drevil_million_dollars" src="http://jakob.engbloms.se/wp-content/uploads/2009/03/drevil_million_dollars.jpg" alt="drevil_million_dollars" width="300" height="318" /></p>
<p>&#8216;I want one trillion instructions&#8217;</p>
<p>So what did I get from these trillion instructions?</p>
<p>An interesting study in how operating system overhead can have a big impact on the profitability of hardware accelerators. By running hundreds of test cases with different assigned computation latencies for a hardware accelerator, as well as different driver models for my hardware (all running under Linux on my favorite MPC8641D), a key diagram emerged:</p>
<p style="text-align: left;"><img class="aligncenter size-full wp-image-712" style="margin-top: 10px; margin-bottom: 10px;" title="hwsw" src="http://jakob.engbloms.se/wp-content/uploads/2009/03/hwsw.png" alt="hwsw" width="872" height="507" /><a href="http://www.virtutech.com/whitepapers/wp-system_arch_spec.html">Read the paper </a>for all the details, but the key thing to note is that with a poor driver architecture, making the hardware 100 times faster resulted in zero gain in system performance. Had this experiment been performed on a bare-bones platform without a full operating system in place, I am fairly certain that the faster hardware would have been considered much more worthwhile.</p>
<p style="text-align: left;">In the end, I resorted to a driver variant where I had user-level code directly access the device programming interface via an mmap()-mapped memory region. Not pretty (essentially, this was bare-metal programming wrapped inside a big cosy Linux package), but it sure was efficient compared to doing a kernel/user mode switch for each hardware operation. But even here, it turned out that making the hardware very very fast as opposed to just very fast had no benefit. It proves to me that the software has to be taken into account in full in order to properly evaluate an idea for a hardware design.</p>
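<p style="text-align: left;">To make the idea concrete, here is a minimal sketch of that kind of user-level access. It is not the driver from the experiments: the register offsets, the function names <tt>map_device</tt> and <tt>run_operation</tt>, and the idea that the device registers are reachable through some mappable file (a UIO node, say) are all assumptions made for illustration.</p>

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical register offsets (in 32-bit words) of an imaginary accelerator. */
enum { REG_OPERAND = 0, REG_START = 1, REG_RESULT = 2 };

/* Map `len` bytes of the file at `path` into this process.
 * `path` is assumed to expose the device registers. Returns NULL on error. */
volatile uint32_t *map_device(const char *path, size_t len) {
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* One "hardware operation": plain loads and stores, no system call involved. */
uint32_t run_operation(volatile uint32_t *regs, uint32_t operand) {
    assert(regs != NULL);
    regs[REG_OPERAND] = operand;     /* hand the operand to the device */
    regs[REG_START] = 1;             /* kick off the computation       */
    return regs[REG_RESULT];         /* read back the result register  */
}
```

<p style="text-align: left;">After the one-time <tt>mmap()</tt>, each operation costs a few loads and stores and never enters the kernel, which is where the saving over a mode switch per operation comes from.</p>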
<p style="text-align: left;">You could say that the poor results for acceleration here were due to my inept Linux driver programming skills, but that just underscores the key result: you have to take the software into account. If the conclusion is that a better Linux device driver programmer is needed, you have still decided that the key system bottleneck is not just the speed of the hardware, but how it is used. And that is exactly what system design needs to be about.</p>
<p style="text-align: left;">As an aside, playing around with a complete system like this, and automatically running large volumes of tests with varying parameters, was a really interesting experience. I must admit that getting to these trillions of instructions required a few hours of simulation time, but nothing that could not be solved by leaving a computer running over lunch or a long meeting. The machine was modeled using standard Simics &#8220;software timing&#8221;, i.e., without any particular cache, pipeline, or bus details, and it seems that this is usually all you need. Had I increased the level of detail and slowed things down by a factor of ten or a hundred, I would never have covered such a large set of test cases or been able to evaluate as many different variants of drivers and hardware speeds.</p>
<h2 style="text-align: left;">IBM did it before me</h2>
<p style="text-align: left;">Finally, I found it interesting that an analogous experience, building a complete software stack to test what looked like a very good hardware idea, was reported in an IBM paper from a few years ago: &#8220;<a href="http://researchweb.watson.ibm.com/journal/rd/502/peterson.html">Application of full-system simulation in exploratory system design and development</a>&#8221;, by Peterson et al., in the IBM Journal of Research and Development. Look at the section about the &#8220;MIP Morphing&#8221; feature, which is essentially cache locking. They do use a fairly detailed simulator for the final evaluation of performance, but the key message is that by running a full software stack, they realized that just managing the feature was too hard in a realistic software environment to make it worthwhile:</p>
<blockquote>
<p style="text-align: left;">Initially, the MIP morphing feature was well received by internal development and HPCS customers alike. The team was aware of the need to both manage this hardware feature at the OS level and provide portable abstractions to the programmer to exploit this feature in a productive way. &#8230;</p>
</blockquote>
<p style="text-align: left;">And then:</p>
<blockquote>
<p style="text-align: left;">The implementation effort was facilitated by Mambo, allowing the OS team to prototype the MIP morph idea in a controlled development environment. Taking the prototyping effort to this level of realism uncovered many complexities in supporting the MIP morph in a virtualized manner. &#8230;</p>
</blockquote>
<p style="text-align: left;">And finally:</p>
<blockquote>
<p style="text-align: left;">By prototyping the software support that was <em>needed at the OS level and exposing the usage issues at the application programmer&#8217;s level</em>, the magnitude of the problem was exposed at its fullest. Further, the improvement in performance did not show a sufficient payback for the immense effort that would be required at the software level to support the idea, and as a result it was dropped from further consideration.</p>
</blockquote>
<p style="text-align: left;">It seems that whatever you do, IBM did it first&#8230; and it validates the idea of full-system simulation and that software is king today.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/709/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Enea and Freescale Article on SMP OS</title>
		<link>http://jakob.engbloms.se/archives/664?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/664#comments</comments>
		<pubDate>Tue, 24 Feb 2009 09:43:16 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[AMP]]></category>
		<category><![CDATA[Enea]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[Jonas Svennebring]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[mpc8572e]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[OSE]]></category>
		<category><![CDATA[p4080]]></category>
		<category><![CDATA[Patrik Strömblad]]></category>
		<category><![CDATA[SMP]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=664</guid>
		<description><![CDATA[Elektronik i Norden just published a technical insight article about the SMP kernels of Enea OSE and Linux, by Patrik Strömblad and Jonas Svennebring. It has a nice discussion about AMP and SMP, and OS scheduling policies. It is particularly interesting to see how OSE tries to combine the two. Unfortunately, the article is in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.elinor.se">Elektronik i Norden </a>just published a <a href="http://www.webbkampanj.com/ein/0903/?page=51">technical insight article </a>about the <a href="http://www.enea.com/templates/Extension____24922.aspx?headline=http://cws.huginonline.com/E/1059/PR/200811/1267022.xml">SMP kernels </a>of <a href="http://www.enea.se">Enea </a>OSE and Linux, by Patrik Strömblad and Jonas Svennebring.</p>
<p><span id="more-664"></span>It has a nice discussion about AMP and SMP, and OS scheduling policies. It is particularly interesting to see how OSE tries to combine the two. Unfortunately, the article is in Swedish, but I would expect the CMP network that Elektronik i Norden is part of to publish an English version in EETimes or one of its other publications.</p>
<p>The article discusses some Freescale targets, such as my favorite, the MPC8641D, the dual-core MPC8572E, and the upcoming QorIQ P4080.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/664/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Three Cores make a Crowd &#8212; or a Problem</title>
		<link>http://jakob.engbloms.se/archives/633?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/633#comments</comments>
		<pubDate>Sat, 07 Feb 2009 21:12:38 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[device tree]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[OpenPIC]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=633</guid>
		<description><![CDATA[A common question from simulation users to us simulation providers is &#8220;can I simulate a machine with N cores&#8221;, where N is &#8220;large&#8221;. As if running lots of cores was a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-634" style="margin: 10px;" title="mpc8640d_pp" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/mpc8640d_pp.jpg" alt="mpc8640d_pp" width="130" height="130" />A common question from simulation users to us simulation providers is &#8220;can I simulate a machine with N cores&#8221;, where N is &#8220;large&#8221;. As if running lots of cores was a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform is easy. Creating a software stack for that arbitrary platform is a lot harder, since an SMP software stack needs to understand the cores and how they communicate.</p>
<p>Essentially, what you need is a hardware design that has addressing room for lots of cores, and a software stack that is capable of using lots of cores, even if such configurations do not exist in hardware. Unfortunately, since software is normally written to run on real existing machines, there tend to be unexpected limitations even where scalability should be feasible &#8220;in principle&#8221;.</p>
<p>Here is the story of how I convinced Linux to handle more than two cores in a virtual <a href="http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8641D&amp;nodeId=0162468rH3bTdG8653">MPC8641D </a>machine.</p>
<p><span id="more-633"></span>In principle, adding more cores to the MPC8641 is easy. The interrupt controller that connects the cores together is the eminently scalable OpenPIC design, which can handle at least 32 cores. At run time, this is the only addressing that really matters. The Linux SMP support seems sufficiently scalable using the OpenPIC driver as well (an aside here: OpenPIC appears to be a design originally created by AMD or Cyrix for x86 SMP that reached common use with the PowerPC CHRP reference design; however, Internet sources are murky on this).</p>
<p>But the interrupt controller is just the first hurdle. There is another limit in the MPC8641 hardware: the multicore controller module, MCM, has a register that, despite its strange name (Port Control Register, or PCR), is essentially what is used to enable and disable processors. PCR has room for only eight cores. Since the real MPC8641D only has two cores, there is actually a set of six &#8220;reserved&#8221; bits. The Linux board support package thankfully uses a generic scheme based on processor core numbers, so adding more cores just sets bits in the &#8220;reserved&#8221; field:</p>
<p><img class="aligncenter size-full wp-image-636" title="mpc8641d-mcm-room-for-extension1" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/mpc8641d-mcm-room-for-extension1.png" alt="mpc8641d-mcm-room-for-extension1" width="630" height="200" /></p>
<p>Thus, this processor scales to eight cores without recoding the Linux support package or having to modify the register layout of the hardware.</p>
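<p>As a rough illustration of what such a generic, core-number-based scheme looks like, here is a small sketch. The bit position is an assumption on my part (enable bit for core <tt>nr</tt> at bit <tt>nr + 24</tt>, counting from the least significant bit), and the function names are hypothetical; check the reference manual and the board support code before relying on any of it.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the enable bit for core `nr` is bit (nr + 24) of PCR, with
 * bit 0 being the least significant bit, so cores 0..7 use bits 24..31. */
#define PCR_CORE_SHIFT 24u
#define PCR_MAX_CORES   8u

/* Return the PCR value with core `nr` enabled on top of what is already set. */
uint32_t pcr_enable_core(uint32_t pcr, unsigned nr) {
    assert(nr < PCR_MAX_CORES);
    return pcr | (UINT32_C(1) << (nr + PCR_CORE_SHIFT));
}

/* Return the PCR value with cores 0..count-1 enabled. */
uint32_t pcr_enable_first(uint32_t pcr, unsigned count) {
    assert(count <= PCR_MAX_CORES);
    for (unsigned nr = 0; nr < count; nr++)
        pcr = pcr_enable_core(pcr, nr);
    return pcr;
}
```

<p>Because the bit is computed from the core number rather than hard-coded for cores zero and one, the same code covers the six extra cores with no changes, as long as the field really has eight bits.</p>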
<p>The next issue was then how to communicate the number of cores to the software stack. There is no standard probing available, so the core count has to be a parameter given to the kernel. In all modern Linux versions, the &#8220;powerpc&#8221; architecture uses an OpenFirmware device tree data structure to obtain the hardware setup: cores, devices, addresses, interrupt routing, and anything else that is not explicitly probed (like PCI or USB, for example).</p>
<p>Once I got a <a href="http://www.jdl.com/software/">device tree compiler </a>installed, this was surprisingly straightforward. Just add a few more cores to the description file, compile, and use the new binary blob (the representation used by the kernel is the dtb, or &#8220;device tree blob&#8221;) instead of the standard one. In a virtual setup, changing this is trivial: just load a different file to memory before booting the system.</p>
<p>However, this did not work. The boot froze after core 2 (the third core) was enabled. Figuring out why and how to fix it took some time, since it turned out not to be a kernel problem at all&#8230; I spent a lot of time tracing and debugging the Linux kernel boot, including reversing back and forth over a hung loop, forcing interrupts to be enabled just to see what would happen, and similar standard virtual platform tricks.</p>
<p>The problem turned out to be that the kernel was using processor numbers as a way to check which processors were coming online, and this processor number was read from the &#8220;PIR&#8221; special-purpose register (SPR) on the newly activated core. And this PIR value was set to one for all cores except core zero &#8212; some distance into the boot.</p>
<p>By single-stepping the first few instructions of the reset vector code I finally saw what was happening: code put in place by U-Boot (not the Linux kernel, really) was reading a magical MMU configuration register, and using the single bit it contained for determining the current processor as the processor ID. Thus, here was a piece of hardware with a single architected bit for IDs, and it is not even clear to me that this bit is supposed to be used in the way it is here. This was also a bit that could not be extended: putting data in neighboring (reserved, not used for other purposes) bits in that register just to see what would happen broke page table lookups with very high reliability.</p>
<p>In the end, the solution was just to remove the assembler instruction that wrote the PIR register. There was no other way around the problem. I guess this is &#8220;cheating&#8221;, but if changing a single line of code in the boot loader is what it takes to make Linux work with one to eight processor cores, I am fine with that. It is far less invasive than making changes to the Linux kernel, or creating a new system support package from scratch.</p>
<p>This has finally given me a machine I can provide to <a href="http://www.virtutech.com/products">Simics </a>users who need an easy-to-change embedded SMP machine for multicore studies. I have tested that it works with 2, 3, 4, 6, and 8 cores. Five and seven would be easy to add as well, as it is just a matter of replacing the device tree.</p>
<p>This exercise also told me that the device tree is an interesting data structure that has significant power once you understand how it works. Until now, I have just seen it as a daunting weird thing that you could not do much about&#8230; but that is not the right attitude.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/633/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Tying a Thread to a Processor in Linux</title>
		<link>http://jakob.engbloms.se/archives/625?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/625#comments</comments>
		<pubDate>Sun, 01 Feb 2009 19:24:09 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[fre]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[processor affinity]]></category>
		<category><![CDATA[sched_setaffinity]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=625</guid>
		<description><![CDATA[This is a small Linux SMP programming tip, which I had a hard time finding documented clearly anywhere on the web. I guess people won&#8217;t find it here either, but with some luck some search engine will pick up on this. Basically, the problem I faced was that the Linux scheduler (in the MPC8641D setup [...]]]></description>
			<content:encoded><![CDATA[<p>This is a small Linux SMP programming tip, which I had a hard time finding documented clearly anywhere on the web. I guess people won&#8217;t find it here either, but with some luck some search engine will pick up on this.</p>
<p><span id="more-625"></span>Basically, the problem I faced was that the Linux scheduler (in the MPC8641D Linux 2.6.23 setup that I have blogged about before, <a href="http://jakob.engbloms.se/archives/337">here </a>and <a href="http://jakob.engbloms.se/archives/330">here</a>) executed my test program on cpu zero or cpu one depending on its input parameters, in a way that made performance measurements give strange results. For some reason, certain input values almost always put the program on cpu one, and others on cpu zero. This was very consistent, and I cannot understand what in the difference between a &#8220;119&#8221; and a &#8220;120&#8221; on a command line makes the Linux scheduler make a different decision about the best processor on which to put a certain execution of my program.</p>
<p>The solution was to revisit the ability of Linux to tie processes to certain cores in the system, something that I was not actually sure existed. But it does, and apparently has for most of the 2.6 kernel series at least. One thing that threw me off for a short while was that the feature has to be accessed through the user-level calls in glibc, which meant that my scavenging through the kernel source was pretty useless.</p>
<p>Anyway, here is the code I came up with. Defining <tt>_GNU_SOURCE</tt> before the first include was necessary to allow the function and the <tt>CPU_*</tt> macros to be accessed, since this is Linux-specific.</p>
<pre>#define _GNU_SOURCE                   // must come before any #include
#include &lt;sched.h&gt;
#include &lt;stdio.h&gt;

void tie_program_to_cpu_0(void) {
  cpu_set_t my_affinity_set;
  CPU_ZERO(&amp;my_affinity_set);          // no CPUs set
  CPU_SET(0, &amp;my_affinity_set);        // set cpu0

  if (sched_setaffinity(0,            // 0=current process
                        sizeof(cpu_set_t),
                        &amp;my_affinity_set) != 0) {
    perror("sched_setaffinity");      // report why pinning failed
    return;
  }

  printf("  Tying program to run only on CPU0 using sched_setaffinity()\n");
}</pre>
<p>Tested with:</p>
<ul>
<li>Linux 2.6.23 for powerpc architecture (Freescale MPC8641D HPCN board support package from Freescale LTIB)</li>
<li>gcc version 4.1.2 (<a href="http://www.codesourcery.com/gnu_toolchains/power/portal/release231">Code Sourcery G++ Lite 4.1-78</a>), with its accompanying glibc</li>
</ul>
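<p>If you want to confirm that the pinning actually took effect, the affinity mask can be read back with <tt>sched_getaffinity()</tt>. The following is a sketch rather than part of my test program: the function names are made up for this example, and it pins to the lowest-numbered CPU the process is allowed to use instead of assuming that CPU 0 is available.</p>

```c
#define _GNU_SOURCE                 /* must come before any #include */
#include <assert.h>
#include <sched.h>

/* Pin the caller to the lowest-numbered CPU it is currently allowed to use.
 * Returns that CPU number, or -1 on failure. */
int pin_to_first_allowed_cpu(void) {
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t only;
            CPU_ZERO(&only);
            CPU_SET(cpu, &only);
            if (sched_setaffinity(0, sizeof(only), &only) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;                      /* empty mask: should not happen */
}

/* Returns 1 if the affinity mask now contains exactly `cpu`, 0 otherwise. */
int is_pinned_to(int cpu) {
    cpu_set_t now;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    if (sched_getaffinity(0, sizeof(now), &now) != 0)
        return 0;
    return CPU_COUNT(&now) == 1 && CPU_ISSET(cpu, &now);
}
```

<p>Calling <tt>is_pinned_to()</tt> with the value returned by <tt>pin_to_first_allowed_cpu()</tt> right before starting a measurement is a cheap way to rule out scheduler placement as a source of noise.</p>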
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/625/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dekker&#8217;s Algorithm Does not Work, as Expected</title>
		<link>http://jakob.engbloms.se/archives/65?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/65#comments</comments>
		<pubDate>Mon, 07 Jan 2008 21:14:22 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[code example]]></category>
		<category><![CDATA[Dekker's Algorithm]]></category>
		<category><![CDATA[Embedded Systems Conference]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[race condition]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/65</guid>
		<description><![CDATA[Sometimes it is very reassuring that certain things do not work when tested in practice, especially when you have been telling people that for a long time. In my talks about Debugging Multicore Systems at the Embedded Systems Conference Silicon Valley in 2006 and 2007, I had a fairly long discussion about relaxed or weak [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes it is very reassuring that certain things do not work when tested in practice, especially when you have been telling people that for a long time. In my talks about <a href="http://www.engbloms.se/jakob_publications.html">Debugging Multicore Systems</a> at the <a href="http://www.esconline.com">Embedded Systems Conference Silicon Valley</a> in 2006 and 2007, I had a fairly long discussion about relaxed or <a href="http://en.wikipedia.org/wiki/Weak_consistency">weak </a><a href="http://en.wikipedia.org/wiki/Consistency_model">memory consistency models</a> and their effect on parallel software when run on a truly concurrent machine. I used <a href="http://en.wikipedia.org/wiki/Dekker%27s_algorithm">Dekker&#8217;s Algorithm</a> as an example of code that works just fine on a single-processor machine with a multitasking operating system, but that fails to work on a dual-processor machine. Over Christmas, I finally did a practical test of just how easy it was to make it fail in reality, which turned out to showcase some interesting properties of various types and brands of hardware and software.</p>
<p><span id="more-65"></span>Now to the code.</p>
<p>The core of Dekker&#8217;s Algorithm is two symmetrical pieces of code that access a set of shared variables in a way that makes it impossible for both to enter their critical sections at the same time, as long as memory is sequentially consistent. (Strictly speaking, the flag-and-turn variant I use is the simplification usually credited to Peterson, but the memory-ordering argument is the same.) Here is my implementation, warts and all:</p>
<pre>static volatile int flag1 = 0;
static volatile int flag2 = 0;
static volatile int turn  = 1;
static volatile int gSharedCounter = 0;</pre>
<pre>void dekker1(void) {
        flag1 = 1;
        turn  = 2;
        while((flag2 ==  1) &amp;&amp; (turn == 2)) ;
        // Critical section
        gSharedCounter++;
        // Let the other task run
        flag1 = 0;
}

void dekker2(void) {
        flag2 = 1;
        turn = 1;
        while((flag1 ==  1) &amp;&amp; (turn == 1)) ;
        // critical section
        gSharedCounter++;
        // leave critical section
        flag2 = 0;
}</pre>
<p>This code can fail on a machine with weak memory consistency, since most memory systems place no constraint on the order in which the updates to &#8220;flag2&#8221;, &#8220;flag1&#8221;, and &#8220;turn&#8221; become visible to the other processor.  In particular, there is no guarantee that the read from &#8220;flag2&#8221; in dekker1 will happen after the writes to &#8220;flag1&#8221; and &#8220;turn&#8221; have propagated to dekker2. Making this argument symmetrically, you get something like the following sketch:</p>
<p><a title="Dekker Bug" href="http://jakob.engbloms.se/wp-content/uploads/2008/01/dekkersbug.png"><img src="http://jakob.engbloms.se/wp-content/uploads/2008/01/dekkersbug.png" alt="Dekker Bug" /></a></p>
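<p>One way to close this hole, as a minimal sketch and not the code measured in this post: assuming a C11 compiler with &lt;stdatomic.h&gt;, the shared variables can be declared atomic, so that every plain atomic_load and atomic_store on them is sequentially consistent by default. The names dekker1_fenced and dekker2_fenced are made up for this illustration.</p>

```c
#include <stdatomic.h>

/* Lock state as C11 atomics instead of volatile ints.
   atomic_load/atomic_store default to memory_order_seq_cst. */
static atomic_int flag1 = 0;
static atomic_int flag2 = 0;
static atomic_int turn  = 1;

/* Plain int: only touched while holding the lock. */
static int gSharedCounter = 0;

/* Hypothetical fenced variant of dekker1 from the post. */
void dekker1_fenced(void) {
    atomic_store(&flag1, 1);
    atomic_store(&turn, 2);
    /* With seq_cst operations, the load of flag2 cannot be ordered
       before the two stores above. */
    while (atomic_load(&flag2) == 1 && atomic_load(&turn) == 2)
        ;
    /* Critical section */
    gSharedCounter++;
    /* Leave critical section */
    atomic_store(&flag1, 0);
}

/* Hypothetical fenced variant of dekker2, symmetrical to the above. */
void dekker2_fenced(void) {
    atomic_store(&flag2, 1);
    atomic_store(&turn, 1);
    while (atomic_load(&flag1) == 1 && atomic_load(&turn) == 1)
        ;
    /* Critical section */
    gSharedCounter++;
    /* Leave critical section */
    atomic_store(&flag2, 0);
}
```

<p>The sequentially consistent accesses are not free: the compiler has to emit fences or locked instructions for them, so this version should be expected to run slower than the broken one.</p>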
<p>From this basic faulty code, I then use pthreads to create a parallel program. In the program, I loop many millions of times in each thread trying to get into the critical section:</p>
<pre>int gLoopCount;
void *task1(void *arg) {
        int i;
        printf("Starting task1\n");
        for(i=gLoopCount;i&gt;0;i--) {
                dekker1();
        }
        return NULL;
}
void *task2(void *arg) {
        int i;
        printf("Starting task2\n");
        for(i=gLoopCount;i&gt;0;i--) {
                dekker2();
        }
        return NULL;
}</pre>
<p>If it happens that both enter the critical section at the same time, the construction of increasing a shared counter in dekker1 and dekker2 will likely result in a missed update to &#8220;gSharedCounter&#8221; as both threads read the same value, increment it, and then write the same value back (this is the kind of error mutual exclusion is supposed to protect against). Given this, by checking the value of gSharedCounter at the end of the program run, I can tell if any failures to lock happened. Note that this is likely conservative, since it is quite possible that one task manages to do the entire read-increment-write operation implicit in the ++ operation before the other task does. So the number of missed updates is actually a lower bound on the number of failed lockings.</p>
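<p>The lost update and the accounting can be replayed deterministically in a single thread. This is only an illustrative sketch: the function names are invented here, and the interleavings are written out by hand rather than produced by real threads.</p>

```c
#include <stdio.h>

/* Both tasks read the counter before either writes it back.
   The second write stores the same value again: one update is lost. */
int overlapping_increments(int counter) {
    int seen_by_task1 = counter;   /* task1 reads */
    int seen_by_task2 = counter;   /* task2 reads the same value */
    counter = seen_by_task1 + 1;   /* task1 writes */
    counter = seen_by_task2 + 1;   /* task2 overwrites with the same value */
    return counter;
}

/* Both tasks are in the critical section, but task1 happens to finish
   its read-increment-write before task2 starts: nothing is lost, so
   this locking failure goes undetected. */
int back_to_back_increments(int counter) {
    int seen_by_task1 = counter;
    counter = seen_by_task1 + 1;
    int seen_by_task2 = counter;
    counter = seen_by_task2 + 1;
    return counter;
}

/* Each of the two tasks increments once per loop iteration, so a
   fault-free run ends at 2 * loops_per_task. The shortfall is a lower
   bound on the number of locking failures. */
long missed_updates(long loops_per_task, long final_counter) {
    return 2 * loops_per_task - final_counter;
}

/* Print the shortfall for one finished run. */
void report(long loops_per_task, long final_counter) {
    printf("missed updates: %ld\n",
           missed_updates(loops_per_task, final_counter));
}
```

<p>Starting from zero, overlapping_increments returns 1 while back_to_back_increments returns 2, which is why the counter can only undercount the failures.</p>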
<p>So what happened when I tried this on real machines?</p>
<p>I must admit that I did not expect to see very many instances of errors, since I kind of assumed that modern hardware is so fast in communicating between cores that the window of opportunity to catch a bug would be pretty small. But it turned out to be quite frequent, and very variable across machines.</p>
<ul>
<li>On my Core 2 Duo T7600 laptop, with Windows Vista, I got on average one error in every 1.5 million locking attempts.</li>
<li>On an older dual-processor Opteron 242 machine, I got on average one error every 15000 locking attempts, which is 100 times more often than on the Core 2 Duo!</li>
<li>On a <a href="http://www.freescale.com">Freescale </a><a href="http://en.wikipedia.org/wiki/PowerPC_e600#MPC8641_.26_MPC8641D">MPC8641D </a>dual-core machine,  I got on average one error every 2000 locking attempts.</li>
<li>On a range of single-core machines, not a single error was observed.</li>
</ul>
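<p>To put the three multiprocessor figures on a common scale, here is a small sketch of the arithmetic. The macro and function names are invented for this example; the numbers are the ones from the list above.</p>

```c
#include <stdio.h>

/* Average locking attempts per observed error, from the list above. */
#define CORE2_DUO_T7600 1500000.0
#define OPTERON_242       15000.0
#define MPC8641D           2000.0

/* How many times more often machine B fails than machine A. */
double failure_ratio(double attempts_per_error_a,
                     double attempts_per_error_b) {
    return attempts_per_error_a / attempts_per_error_b;
}

/* Expected number of errors for a given number of locking attempts. */
double expected_errors(double attempts, double attempts_per_error) {
    return attempts / attempts_per_error;
}

/* Print the comparison relative to the Core 2 Duo. */
void print_comparison(void) {
    printf("Opteron 242 vs Core 2 Duo: %.0fx\n",
           failure_ratio(CORE2_DUO_T7600, OPTERON_242));
    printf("MPC8641D vs Core 2 Duo:    %.0fx\n",
           failure_ratio(CORE2_DUO_T7600, MPC8641D));
    printf("MPC8641D errors per 10 million attempts: %.0f\n",
           expected_errors(1e7, MPC8641D));
}
```

<p>In other words, the Opteron failed about 100 times as often as the Core 2 Duo, and the MPC8641D about 750 times as often.</p>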
<p>So the theory is validated. Always feels good to have proof in practice that I have been telling the truth at the ESC <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  What else can we tell from the numbers?  I think this actually demonstrates a few other theories in practice:</p>
<ul>
<li>Communication between cores on a single chip is much faster than between separate chips, and the longer latencies between chips make Dekker more likely to fail. This is shown by the difference between the dual-processor and dual-core x86 systems.</li>
<li>The PowerPC has a weaker memory consistency model by design than x86 systems, so the greater occurrence of locking failures there is also consistent with expectations.</li>
</ul>
<p>Now all I need to find are a few more machines to test on. It is always fun to do microbenchmarking of real machines, as evidenced in my <a href="http://www.engbloms.se/publications/engblom-rtas2003.pdf">RTAS 2003 paper</a> on branch predictors and their effect on WCET predictability.</p>
<p>If you want to try this yourself, the source code is attached to this post: <a title="dekker.c" href="http://jakob.engbloms.se/wp-content/uploads/2008/01/dekker.c">dekker.c</a>. Just compile it with &#8220;gcc -O2 -o dekker dekker.c -lpthread&#8221;. It has been tested on Windows with Cygwin as well as various Linuxes. Run it with the argument 10000000 (10 million), which appears to give a good indication of the prevalence of errors without running for too long.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/65/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

