<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; multicore computer architecture</title>
	<atom:link href="http://jakob.engbloms.se/archives/category/parallel-computing/multicore-computer-architecture/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>Nvidia &#8220;Kal-El&#8221; Variable SMP</title>
		<link>http://jakob.engbloms.se/archives/1496?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1496#comments</comments>
		<pubDate>Fri, 23 Sep 2011 19:16:33 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1496</guid>
		<description><![CDATA[Nvidia recently announced that their already-known &#8220;Kal-El&#8221; quad-core ARM Cortex-A9 SoC actually contains five processor cores, not just four as a &#8220;normal&#8221; quad-core would. They call the architecture &#8220;Variable SMP&#8221;, and it is a pretty smart design. The one where you think, &#8220;I should have thought of that&#8221;, which is the best sign of something [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/09/nvidia-logo.jpg"><img class="alignleft size-full wp-image-1497" style="margin: 5px 10px;" title="nvidia logo" src="http://jakob.engbloms.se/wp-content/uploads/2011/09/nvidia-logo.jpg" alt="" width="48" height="48" /></a>Nvidia <a href="http://blogs.nvidia.com/2011/09/quad-core-kal-el’s-stealth-fifth-core-lets-it-save-on-energy/">recently announced</a> that their already-known &#8220;Kal-El&#8221; quad-core ARM Cortex-A9 SoC actually contains five processor cores, not just four as a &#8220;normal&#8221; quad-core would. They call the architecture &#8220;Variable SMP&#8221;, and it is a pretty smart design. It is the kind of design where you think, &#8220;I should have thought of that&#8221;, which is the best sign of something truly good.</p>
<p><span id="more-1496"></span>It is common practice in multicore computing today to dynamically change the clock frequency of a processor and turn cores on and off in order to adjust the compute power available to the current workload. Such operations tend to be limited in scope, as processors have minimum clock frequencies that make sense, and often the memory system requires all cores to be at the same frequency. Operating systems also tend to want to work with homogeneous sets of cores, as that makes scheduling reasonably straightforward. This is probably what has kept the idea of &#8220;small + large&#8221; cores of the same ISA out of the mainstream of SMP design, despite all its advantages in principle.</p>
<p>Now, Nvidia has managed to implement some of that idea in Kal-El.</p>
<p>The key observation is that if you can turn cores on and off, once you get down to a single active core, any system is by definition homogeneous across all cores regardless of what that core is. Changing the nature of this core should then be much easier, since there is only a single core to contend with.</p>
<p>What Nvidia does in Kal-El is to add a fifth low-power core to the main group of four high-performance cores. The fifth core is architecturally identical (ARM Cortex-A9), so that the system state can be moved between the high-performance cores and the low-power core without undue complexity. Indeed, this is all done in hardware, so the OS (typically, Android) thinks it is running on a homogeneous quad-core. When the system is lightly loaded and the OS decides to only have a single core on, the hardware can detect that the load is <em>really</em> light, and effectively change the nature of the active core to a low-power-optimized version.</p>
<p>Once more compute power is needed, the hardware invisibly slips back to the first high-power core, and then the OS can start increasing clocks and turning on cores as usual. It is effectively the same as a regular ARM Cortex-A9 quad-core setup, but with better low-power performance. The following graph from the Nvidia <a href="http://www.nvidia.com/content/PDF/tegra_white_papers/tegra-whitepaper-0911b.pdf">white paper</a> shows it pretty clearly (red text is my added comment):</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/09/tegra-1.png"><img class="aligncenter size-full wp-image-1498" title="tegra kal-el power curve" src="http://jakob.engbloms.se/wp-content/uploads/2011/09/tegra-1.png" alt="" width="655" height="446" /></a></p>
<p>Note the slope of the green line: that core is not a good one if you want high performance. It is optimized to scale within a range of low compute-power requirements, rather than provide the best performance per watt at the high end. Using Variable SMP, Nvidia lets us have both.</p>
<p>Neat.</p>
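<p>The switching rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Nvidia's actual logic: the function name <code>use_companion</code>, the two load thresholds, and the hysteresis between them are all assumptions made up for the example. The one thing taken from the description is that the swap is only considered once the OS has scaled down to a single active core.</p>

```python
# A minimal sketch of the "Variable SMP" switching idea.
# All names and numbers below are hypothetical, not taken from Nvidia.
ENTER_BELOW = 0.10  # assumed: move to the companion core under 10% load
EXIT_ABOVE = 0.25   # assumed: move back to a main core above 25% load


def use_companion(os_active_cores, load, on_companion):
    """Decide whether the hidden fifth core should be doing the work."""
    if os_active_cores > 1:
        # With several cores on, the OS sees a plain homogeneous quad-core.
        return False
    if on_companion:
        # Stay on the companion core until the load clearly rises.
        return not load > EXIT_ABOVE
    # On a main core: only swap when the load is really light.
    return ENTER_BELOW > load


on_companion = False
for cores, load in [(4, 0.90), (1, 0.30), (1, 0.05), (1, 0.20), (1, 0.60), (2, 0.70)]:
    on_companion = use_companion(cores, load, on_companion)
    print(cores, load, "companion" if on_companion else "main")
```

<p>The gap between the two thresholds is there to avoid bouncing back and forth between core types when the load hovers around a single limit. Whether the real hardware does anything like this is not something the white paper lets me confirm.</p>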
<p>More reading:</p>
<ul>
<li><a href="http://arstechnica.com/gadgets/news/2011/09/tegra-3-includes-5th-stealth-core-to-optimize-power-efficiency.ars">ArsTechnica</a> has a short summary</li>
<li>There does not seem to be much more right now; everyone is really just reiterating the points from the white paper.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1496/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Memory Models: x86 is TSO, TSO is Good</title>
		<link>http://jakob.engbloms.se/archives/1435?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1435#comments</comments>
		<pubDate>Wed, 22 Jun 2011 15:16:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Doug Lea]]></category>
		<category><![CDATA[Francesco Zappa Nardelli]]></category>
		<category><![CDATA[memory consistency]]></category>
		<category><![CDATA[power architecture]]></category>
		<category><![CDATA[SPARC]]></category>
		<category><![CDATA[UpMarc]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1435</guid>
		<description><![CDATA[By chance, I got to attend a day at the UPMARC Summer School with a very enjoyable talk by Francesco Zappa Nardelli from INRIA. He described his work (along with others) on understanding and modeling multiprocessor memory models. It is a very complex subject, but he managed to explain it very well. He showed a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif"><img class="size-full wp-image-1016 alignleft" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="" width="122" height="45" /></a>By chance, I got to attend a day at the <a href="http://www.it.uu.se/research/upmarc/events/SS2011/Programme.html">UPMARC Summer School</a> with a very enjoyable talk by <a href="http://moscova.inria.fr/~zappa/">Francesco Zappa Nardelli</a> from INRIA. He described his work (along with others) on <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">understanding and modeling multiprocessor memory models</a>. It is a very complex subject, but he managed to explain it very well.</p>
<p><span id="more-1435"></span>He showed a very interesting discussion from a few years ago on the x86 memory model and the implementation of spinlocks in the Linux kernel. Various experts went back and forth over whether the final MOV that sets a lock variable to 1 needed to be prefixed by LOCK or not. The discussion seemed to end when Linus Torvalds said &#8220;I know that it is needed&#8221;, only for an Intel architect to finally intervene and say &#8220;you know, really, it isn&#8217;t needed&#8221;. This was followed by a series of releases of Intel manuals documenting the x86 memory model, with increasing precision in each release. Intel also actually changed the published rules along the way, withdrawing some optimizations as they realized that they would break existing software.</p>
<p>Note that such a description of a memory model must both describe existing hardware, and serve as the guideline for future hardware. Therefore, there are optimizations that are not implemented today but which are possible given the rules. Such optimization opportunities can be removed from the rulebook as long as they have never been part of shipping hardware, so it is not as crazy as it might sound.</p>
<p>Anyway, Francesco&#8217;s aim was both to tell an interesting story from history and to make the point that describing and understanding memory models is hard. I certainly agree with that. I recall an ISCA many years ago when some computer architecture professors all agreed that very few people really understand consistency and weak memory models.</p>
<p>To make life easier for programmers, Francesco and Peter Sewell (in Cambridge) have defined their own set of rules for x86 memory consistency. This is not an architecture spec, but a rule set for regular programmers. It is found at <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">http://www.cl.cam.ac.uk/~pes20/weakmemory/</a>. Essentially, the conclusion is that x86 in practice implements the old SPARC TSO memory model.</p>
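<p>To get a feeling for what TSO allows, here is a small Python sketch of the classic store-buffering litmus test: each of two threads writes one variable and then reads the other. It is a toy model under stated assumptions, not the formal model from the Cambridge group: each thread has a FIFO store buffer that drains to memory at some arbitrary later point, and the names <code>run</code> and <code>outcomes</code> are made up for the example. The script enumerates every interleaving with and without the store buffers and compares the possible results.</p>

```python
from itertools import permutations

# Store-buffering litmus test, both variables start at 0:
#   thread 0: x = 1; r0 = y        thread 1: y = 1; r1 = x
PROGRAM = {0: ("x", "y"), 1: ("y", "x")}  # thread: (variable stored, variable loaded)


def run(schedule, buffered):
    """Execute one interleaving and return the loaded values (r0, r1)."""
    memory = {"x": 0, "y": 0}
    buffers = {0: [], 1: []}  # per-thread FIFO store buffers (the TSO part)
    regs = {}
    for thread, step in schedule:
        store_var, load_var = PROGRAM[thread]
        if step == "store":
            if buffered:
                buffers[thread].append((store_var, 1))  # not yet visible to others
            else:
                memory[store_var] = 1  # sequential consistency: visible at once
        elif step == "flush":
            for var, value in buffers[thread]:
                memory[var] = value
            buffers[thread].clear()
        else:  # "load": a thread sees its own buffered stores before memory
            own = [value for var, value in buffers[thread] if var == load_var]
            regs[thread] = own[-1] if own else memory[load_var]
    return regs[0], regs[1]


def outcomes(buffered):
    """Collect the results of all interleavings that respect program order."""
    steps = ["store", "load", "flush"] if buffered else ["store", "load"]
    events = [(t, s) for t in (0, 1) for s in steps]
    results = set()
    for schedule in permutations(events):
        # Each thread's store must come before its own load and its own flush.
        in_order = all(
            schedule.index((t, "store")) == min(schedule.index((t, s)) for s in steps)
            for t in (0, 1)
        )
        if in_order:
            results.add(run(schedule, buffered))
    return results


sc = outcomes(buffered=False)
tso = outcomes(buffered=True)
print("sequentially consistent:", sorted(sc))
print("with store buffers:     ", sorted(tso))
print("only with store buffers:", sorted(tso - sc))
```

<p>The single extra outcome with store buffers is both threads reading 0, which is the one relaxation a programmer has to keep in mind under TSO: a store can be delayed past a later load by the same thread. Everything else behaves as if memory were sequentially consistent, at least in this simplified model.</p>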
<p>They have also attempted to formalize the Power Architecture memory model. Both the actual memory model and their model of it can only be described as very complex. The programmer&#8217;s model is expressed in terms of store queues, speculative instruction execution, and commits of instructions. Not something you easily keep in your head. It is interesting to note that ARM MPCore essentially copied the Power Architecture.</p>
<p>He showed an interactive simulation of the Power memory model, and the way that you need to think about it in terms of propagating information between threads and committing them. It is possible to propagate values and then another propagation overrides a value before the thread commits&#8230; Fun. Or a headache.</p>
<p>The big take-away from the talk for me is that it confirms the observation made many times before that <a href="http://en.wikipedia.org/wiki/Memory_ordering">SPARC TSO</a> seems to be the optimal memory model. It is sufficiently understandable that programmers can write correct code without having barriers everywhere. It is sufficiently weak that you can build fast hardware implementations that can scale to big machines.</p>
<p>Maybe TSO does not theoretically scale in the same insane way as Power or Alpha does/did. But the cost of that theoretical scalability is that programmers might have to litter their code with sync operations just to get it to run correctly. With too many sync operations, the code will run very slowly, negating any advantage on the hardware level. Note that sync operations can be very expensive. <a href="http://g.oswego.edu/">Doug Lea</a>, in the audience, pointed out that a sync can cost up to 300 cycles on a POWER5.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1435/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SecurityNow on Randomness</title>
		<link>http://jakob.engbloms.se/archives/1424?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1424#comments</comments>
		<pubDate>Wed, 25 May 2011 20:20:23 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[random number generation]]></category>
		<category><![CDATA[SecurityNow]]></category>
		<category><![CDATA[Steve Gibson]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1424</guid>
		<description><![CDATA[Episodes 299 and 301 of the SecurityNow podcast deal with the problem of how to get randomness out of a computer. As usual, Steve Gibson does a good job of explaining things, but I felt that there was some more that needed to be said about computers and randomness, as well as the related ideas [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png"><img class="alignleft size-full wp-image-1371" title="dice" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png" alt="" width="86" height="88" /></a>Episodes <a href="http://twit.tv/sn299">299</a> and <a href="http://twit.tv/sn301">301</a> of the SecurityNow podcast deal with the problem of how to get randomness out of a computer. As usual, Steve Gibson does a good job of explaining things, but I felt that there was some more that needed to be said about computers and randomness, as well as the related ideas of predictability, observability, repeatability, and determinism. I have worked and wrangled with these concepts for almost 15 years now, from my research into timing prediction for embedded processors to my current work with the repeatable and reversible Simics simulator.</p>
<p><span id="more-1424"></span>Let&#8217;s start from the top.</p>
<p>When Steve said that computers are deterministic, I jumped. To me, a computer is anything but deterministic. The idea that rerunning a program does the same thing is an ideal state that you can rarely reach, and having an infrastructure like Simics that <a href="http://blogs.windriver.com/engblom/2010/09/deterministic-but-unpredictable.html">helps you achieve this</a> is a huge win for debugging.</p>
<p>Listening closely, what I think Steve <em>really </em>said is that an algorithm like a random number generator is deterministic. If you know its initial state, it will always compute the same result. That is indeed true for code that just converts an input into an output, and does no communication and is not dependent on time or timing. My experience in random and nondeterministic behavior comes from programs that feature multiple threads and often multiple processes, and plenty of asynchronous activity going on. So, same word, different contexts.</p>
<p>However, Steve also talks several times about computers as being deterministic, predictable machines. I think that characterizing today&#8217;s computers as deterministic is inaccurate. I would rather say that with multiple cores and multiple chips and timing variations all over the place, a computer has become fundamentally <em>nondeterministic</em> and non-repeatable, since there are so many little things going on where a nanosecond difference in time can cause behavior to diverge incredibly quickly. There is a nice paper from 2003 about the divergent behavior from minor differences, &#8220;<a href="http://portal.acm.org/citation.cfm?id=822813">Variability in Architectural Simulations of Multi-threaded Workloads</a>&#8221;, by Alaa R. Alameldeen and David A. Wood.</p>
<p>The <a href="http://jakob.engbloms.se/archives/1374">HAVEGE program I wrote about a while back</a> is essentially an attempt to harness the fundamental unpredictability of modern hardware timing. Nice idea, which at least in theory fulfills the more important property for security of being <em>unobservable</em>. Security doesn&#8217;t really need &#8220;real&#8221; randomness; all you need is something that an attacker cannot predict or observe. The classic <a href="http://www.cs.berkeley.edu/~daw/papers/ddj-netscape.html">Netscape SSL lack-of-randomness in the random seed</a> issue from 1996 is the best illustration of this. Certain things about a target can be inferred or observed, but the low-level hardware timing is not one of them, at least not for an x86 or high-end ARM class machine.</p>
<p>The solutions that Steve prefers are the Yarrow and <a href="http://en.wikipedia.org/wiki/Fortuna_%28PRNG%29">Fortuna</a> algorithms, which collect randomness from the environment of a computer and use it as a seed to a normal random number generator, creating lots of useful random data from a fairly small seed. This is the same idea as HAVEGE, but with a different entropy source. In both cases the basic idea seems sound and reasonable, but I kind of hoped that Steve would know of some way to evaluate the quality of the entropy pool generated from hardware events.</p>
<p><a href="http://www.grc.com/sn/sn-301.htm">Steve mentioned</a> the NIST randomness test that was used to test HAVEGE. It is certainly an aggressive test, but <a href="http://jakob.engbloms.se/archives/1374">as my testing showed</a>, it only demonstrates that a random number generator is random in the data produced. It does not show that it is unpredictable, and it does not measure the benefit gained from using unobservable local events in hardware as the source of entropy. You need something else, like comparing repeated collections of randomness over time from the same system, to build confidence in unobservable and unpredictable randomness.</p>
<p>With a computer, you do have such a thing as repeatable, deterministic, and thus predictable randomness. In a modern desktop or server computer, you also have tons of totally unpredictable non-repeatable non-usefully-observable randomness in the low-level hardware timing and concurrent behavior of independent hardware units. Too bad it seems hard to prove this by measurement.</p>
<p>For yet more randomness discussion, especially randomness in embedded systems, I recommend the <a href="http://secworks.se/2011/03/om-slumptal-och-entropikallan-haveged/">blog post</a> by Joachim Strömbergsson (it is in Swedish).</p>
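<p>The two halves of that idea can be shown in a short Python sketch: a generator that is completely repeatable once you know its seed, and a seed gathered from things an outsider should not be able to observe. This is only an illustration under loud assumptions. It is not Yarrow, Fortuna, or HAVEGE; the function names <code>stream</code> and <code>gather_seed</code> are invented for the example; and Python&#8217;s <code>random.Random</code> is a Mersenne Twister, which is not suitable for real security work.</p>

```python
import hashlib
import os
import random
import time


def stream(seed, count=8):
    """Same seed in, same 'random' bytes out: the generator is deterministic."""
    rng = random.Random(seed)  # Mersenne Twister, for illustration only
    return [rng.randrange(256) for _ in range(count)]


def gather_seed(samples=64):
    """Mix timing samples and OS-provided entropy into a 32-byte seed.

    Hashing the pool is a stand-in for the much more careful entropy
    accumulation that real designs perform.
    """
    pool = hashlib.sha256()
    for _ in range(samples):
        # The low bits of a high-resolution timer vary from run to run.
        pool.update(time.perf_counter_ns().to_bytes(8, "little"))
    pool.update(os.urandom(16))  # whatever the operating system has collected
    return pool.digest()


# Known seed: anyone who learns it can reproduce the whole sequence.
print(stream(1234) == stream(1234))

# Gathered seed: still a deterministic generator, but the starting point
# is meant to be something an attacker cannot predict or observe.
print(stream(gather_seed()))
```

<p>Note that nothing in the sketch measures how much real entropy the timing samples contribute, which is exactly the evaluation problem discussed here.</p>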
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1424/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: True Concurrency is Different</title>
		<link>http://jakob.engbloms.se/archives/1151?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1151#comments</comments>
		<pubDate>Fri, 18 Jun 2010 20:24:04 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1151</guid>
		<description><![CDATA[I have another blog post up at Wind River. This one is about multicore bugs that cannot happen on multithreaded systems, and is called True Concurrency is Truly Different (Again). It bounces from a recent interesting Windows security flaw into how Simics works with multicore systems.]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="button-quicklink-blogs" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>I have another blog post up at Wind River. This one is about multicore bugs that cannot happen on multithreaded systems, and is called <a href="http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html#more">True Concurrency is Truly Different (Again)</a>. It bounces from a recent interesting Windows security flaw into how Simics works with multicore systems.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1151/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How (Not) To Present Parallel Programming Results</title>
		<link>http://jakob.engbloms.se/archives/946?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/946#comments</comments>
		<pubDate>Mon, 05 Oct 2009 13:06:42 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[DAC]]></category>
		<category><![CDATA[DAC 2009]]></category>
		<category><![CDATA[parallelized software]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=946</guid>
		<description><![CDATA[SCDSource ran a short but good article summarizing a few DAC talks that I would have liked to attend. It is mostly about the experience of long-time parallel programming researcher David Bailey in presenting results in the field&#8230; Or more importantly: how not to present results, or how to mislead the audience as to the efficiency of [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-824" title="46daclogo" src="http://jakob.engbloms.se/wp-content/uploads/2009/07/46daclogo.gif" alt="46daclogo" width="81" height="73" /><a href="http://www.scdsource.com/article.php?id=360">SCDSource ran a short but good article</a> summarizing a few DAC talks that I would have liked to attend. It is mostly about the experience of long-time parallel programming researcher David Bailey in presenting results in the field&#8230;</p>
<p><span id="more-946"></span>Or more importantly: how not to present results, or how to mislead the audience as to the efficiency of your approach. <a href="http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf">His old 1991 paper </a>on how to do this is still worthy of a read, even if some things have changed (64-bit FP is pretty much as fast as 32-bit these days, for example). The fundamentals of parallelism are still pretty much the same.</p>
<p>The results were from a <a href="http://www.dac.com/events/eventdetails.aspx?id=95-32">DAC panel about multicore and EDA</a>. I wonder if that panel dealt with how to make EDA software itself parallel, or with how to help semiconductor companies help their end-user programmers harness the multicore hardware being designed using EDA tools. That does not seem clear to me.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/946/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Freescale P4080, in Physical Form</title>
		<link>http://jakob.engbloms.se/archives/933?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/933#comments</comments>
		<pubDate>Thu, 17 Sep 2009 10:16:37 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[DWF]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[Jonas Svennebring]]></category>
		<category><![CDATA[MPC5606]]></category>
		<category><![CDATA[p4080]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=933</guid>
		<description><![CDATA[Last Tuesday, I attended the Freescale Design With Freescale (DWF) one-day technology event in Kista, Stockholm. This is a small-scale version of the big Freescale Technology Forum, and featured four tracks of talks running from the morning into the afternoon. All very technical, aimed at design engineers. There were several topic areas, such as automotive, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/08/freescale-logo-icon.png"><img class="alignleft size-full wp-image-878" style="margin: 5px 10px;" title="freescale-logo-icon" src="http://jakob.engbloms.se/wp-content/uploads/2009/08/freescale-logo-icon.png" alt="freescale-logo-icon" width="80" height="80" /></a>Last Tuesday, I attended the Freescale Design With Freescale (DWF) one-day technology event in Kista, Stockholm. This is a small-scale version of the big Freescale Technology Forum, and featured four tracks of talks running from the morning into the afternoon. All very technical, aimed at design engineers.</p>
<p><span id="more-933"></span>There were several topic areas, such as automotive, consumer, and networking. Networking was mostly focused on the issues of multicore hardware and software.</p>
<p>Of particular interest to me was to see a <a href="http://www.freescale.com/webapp/sps/site/overview.jsp?nodeId=0162468rH3bTdG25E4">Freescale QorIQ P4080</a> 8-core networking/control-plane processor live for the first time. This chip was <a href="http://jakob.engbloms.se/archives/137">announced in the summer of 2008</a>, with a full ecosystem of software support thanks to <a href="http://www.virtutech.com/qoriq">Virtutech Simics</a>. Now that the silicon is here, software is indeed running on it thanks to the long head start that development got with the virtual platform. Note that several demos at the event used the Simics simulator to show the software support for the P4080, as there was only a single chip to go around.</p>
<p>I would have loved to have a meaningful picture of the first P4080 in Europe, but a chip is not really very photogenic &#8211; the P4080 processor was in an open computer case, but covered with a 10 cm-high heat sink which made it fairly hard to actually see. That&#8217;s the challenge with infrastructure things: they are not designed to be seen&#8230; just to do their job well. If you have a new consumer electronics processor, you can at least drive a screen quickly or something. But watching 28 Gbps of Ethernet traffic is not as easy <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Jonas Svennebring of Freescale gave a good talk about how the process of bringup on the P4080 had worked out. It was a total validation of the methodology of using virtual platforms, at different levels of abstraction, and slipping in a bit of hardware emulation as well.</p>
<p>Freescale started software development on the fast functional model, and when clock-cycle-level detailed models of subsystems became available, they started using them as well for performance validation of small pieces of code. Any discrepancies in behavior between the two models were then used to correct the models and documentation. Finally, as the RTL for the silicon began to become available, they used a few emulation setups to run parts of the actual RTL (the emulator could only handle a subset of the entire chip), and validate the performance numbers in the detailed model and the behavior of both models. In the end, when the first silicon became available, Linux was up in a very short time (I cannot give the exact number, but it was a matter of days rather than weeks).</p>
<p>This is the typical iterative process that all chip designers are implementing today: using virtual platforms you can get a head start on development of software, and then as more details become available, you tune models and update designs, models, and software, iterating towards a hardware/software combination that just works once the silicon realization of the hardware comes around.</p>
<p>So that was all cool.</p>
<p>Jonas also showed a die photo of the QorIQ, and that confirmed my opinion from the <a href="http://jakob.engbloms.se/archives/905">SiCS Multicore Day</a>: embedded multicore is not just about processor cores and cache, it is very much about accelerators to help offload repetitive work from the processing cores. More than half the chip was such acceleration logic! To me, this is a clear confirmation that heterogeneity is the future of hardware design, and a useful way to spend hundreds of millions of transistors to boost SoC performance.</p>
<p>The same was true for most other Freescale hardware showcased at the event. For example, there was the <a href="http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC560xS">MPC5606S dashboard processor</a>, running an LCD display with lots of dynamic graphics with 0.2% CPU load on a 60 MHz e200 Power Architecture processor. All the work was done by its display driver and accelerator. It is hard to argue with that kind of efficiency. That chip did not need a heatsink, either. It was just mounted on the back of an example board with no need for any external logic chips. Apparently, it could also have moved some physical gauges and blinked LEDs, but that demo was considered too distracting for this particular setting.</p>
<p>I also gave a talk at the DWF, about debugging software on multicore using virtual platforms. That was fun, as always. I need to get out more on the road and talk at conferences, I think <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/933/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GPGPU &#8211; a new type of DSP?</title>
		<link>http://jakob.engbloms.se/archives/930?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/930#comments</comments>
		<pubDate>Fri, 11 Sep 2009 14:35:18 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[DSP]]></category>
		<category><![CDATA[GPGPU]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=930</guid>
		<description><![CDATA[My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, nor do I see such processors appear in any customer applications. Still, [...]]]></description>
			<content:encoded><![CDATA[<p>My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, nor do I see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.</p>
<p><span id="more-930"></span>The initial key idea behind GPGPU was that a GPU offers very high performance, and does so in a part that &#8220;everyone has anyway&#8221; &#8212; i.e., something that is found on any PC. Outside of PCs, such powerful GPUs are pretty non-existent. Then, the GPU companies picked up on this idea and are making their GPUs more applicable to general purpose tasks.</p>
<p>But where does all this performance come from? To me, it all looks like the rebirth of the vector processor. If we compare a GPU and an Intel or AMD x86 main processor, it is clear that the GPU gets more FLOPs per chip. Mostly, this seems to be because the GPU has many times the number of processing units. Something like 1000s of them, rather than maybe 10 in a general purpose unit.</p>
<p>How can all of these be fit on a die that is similar in size to the general processor? As always when you see a disparity like this, it stems from optimization for different target uses leading to different architectures.</p>
<p>The reasons for GPU raw performance seem to be three-fold:</p>
<ul>
<li>Each processor is much simpler, with a simple instruction set and no out-of-order execution, speculation, or other complex logic. This makes it possible to fit more cores into the same area. The price is that programming is more complicated, as programs are run on groups of processors and with lots of little constraints.</li>
<li>There is far less cache on the die, which forces programs to rely on bandwidth and managing to stream data through the processor.</li>
<li>Processors are built to be good at repetitive math, and be very bad at anything else. This also makes it possible to optimize data flows and control handling to a far greater extent than on general-purpose processors.</li>
<li>And I guess you can add a fourth parameter: power consumption and heat are not really a big problem. Watercooling, huge fans, and 300W power draws are OK&#8230;</li>
</ul>
<p>What this all boils down to is that the GPGPU requires predictable algorithms that can effectively and efficiently prefetch data and stream it through the cores at a predictable rate. Data also needs to be wide to engage groups of cores at once (i.e., vector processing). Integer decision-making code is out (gcc, Simics, control-plane code, most database front ends), and data-intense is in (images, audio, video, graphics). SIMD is part of it, but not the most interesting part. The point is that you apply SIMD across large vectors of independent elements in parallel. And you are looking to solve one large problem at a time.</p>
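<p>To make the contrast concrete, here is a minimal Python sketch of the two kinds of code. The function names and the choice of examples are my own illustrations and are not taken from any GPU toolkit: the first function does the same arithmetic on every element independently (the shape of work that spreads across groups of simple cores), while the second is a chain of data-dependent decisions where each step needs the result of the previous one.</p>

```python
def brighten(pixels, gain, offset):
    """Data-parallel: each output depends only on the matching input."""
    # Every element could be computed by a different core at the same time.
    return [min(255, int(p * gain + offset)) for p in pixels]


def collatz_steps(n):
    """Decision-heavy: each iteration depends on the previous result."""
    steps = 0
    while n != 1:
        # The branch taken depends on the data, so there is nothing to
        # stream and nothing to spread across a vector of cores.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(brighten([0, 100, 200], 1.5, 10))
print(collatz_steps(6))
```

<p>The first function is trivially split across as many cores as there are pixels. The second cannot be split at all, no matter how many cores are available.</p>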
<p>If you compare this to the classic single-core DSP, you see a very different design. A DSP has specialized instructions in the instruction set, support for loops in very efficient ways, and is often SIMD. But DSPs very rarely operate like vector processors. They are also general enough to be able to run a rudimentary OS and operate semi-independently from the main processor. Also, DSPs tend to be used in large multicore clusters, but there each DSP operates on a different problem at a time. So rather than one vector of 1000 elements in a video compression, you might have 1000 independent video streams being processed, out of sync with each other. DSPs also tend to have much simpler programming models compared to GPGPUs &#8212; even if they can be painful compared to general-purpose processors.</p>
<p>So GPGPUs are quite different in practice from DSPs, built to solve different types of problems in different ways. In the end, it is not clear to me that a GPGPU is a winner in terms of performance per watt or performance per area. They are certainly hot in the desktop and server field, but I cannot see them replace general DSPs any day soon.</p>
<p>Note that something like the Tilera chip is another intermediate point between multicore DSP and a GPU. There seems to be a long continuum of core counts from around 4 to 8 for DSP to around 100 for Tilera to 1000 for GPUs&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/930/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>SiCS Multicore Day 2009</title>
		<link>http://jakob.engbloms.se/archives/905?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/905#comments</comments>
		<pubDate>Mon, 07 Sep 2009 19:26:27 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[Anders Landin]]></category>
		<category><![CDATA[CPP]]></category>
		<category><![CDATA[Ericsson]]></category>
		<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Hazim Shafi]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[Richard Kaufmann]]></category>
		<category><![CDATA[SiCS Multicore days]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[Visual Studio 2010]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=905</guid>
		<description><![CDATA[Last Friday, I attended this year&#8217;s edition of the SiCS Multicore Day. It was smaller in scale than last year, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from Hazim Shafi of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a [...]]]></description>
			<content:encoded><![CDATA[<p>Last Friday, I attended this year&#8217;s edition of the <a href="http://www.sics.se/node/4360">SiCS Multicore Day</a>. It was smaller in scale than <a href="http://jakob.engbloms.se/archives/283">last year</a>, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from <a href="http://blogs.msdn.com/hshafi/">Hazim Shafi </a>of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a mid-day three-track session with research and industry talks from the Swedish multicore community.<span id="more-905"></span></p>
<p>I think that for next year, the organizers need to find keynote speakers that are not from the general computing multicore world. The Microsoft talk this year was a step in that direction, as it rather came from multicore programming than multicore hardware. Richard and Anders gave very interesting and good talks, no doubt about it. But it would have been nice with someone from ARM or Freescale or Tensilica or TI or ST or Ericsson or Cisco talking about the kinds of multicore embedded hardware that is being developed and used today. For example, the &#8220;next new thing&#8221; touted by the keynotes this year was GPGPU. Interesting for HPC and desktops, certainly. But pretty irrelevant for most of the people that I know. GPUs are huge, expensive, and power hungry.</p>
<p>GPGPU was one part of the theme this year. It is definitely catching on as <em>the </em>way to do number crunching in the desktop, server, and HPC world. It is not the universal panacea for any kind of parallelism, however, as Hazim and I noted in the panel discussion that ended the day. There are applications (such as <a href="http://www.virtutech.com/whitepapers/accelerator.html">parallel Simics</a>&#8230;) that scale well on general-purpose cores, but that will never ever work on GPUs. In general, the class of problems that work on GPUs is pretty limited to massive data-parallel problems like image and video manipulation.</p>
<p>In the eternal homogeneous vs heterogeneous debate (follow <a href="http://jakob.engbloms.se/archives/tag/homogeneous">the tags </a>in my blog for more posts on this topic), GPGPU was grudgingly accepted as a good candidate for something that will not be homogenized with the main processors. Additionally, Richard Kaufmann gave some hints that Intel or AMD are coming out with new chips with more accelerators on board&#8230; I guess it will be security, as is already done by Sun and <a href="http://jakob.engbloms.se/archives/80">IBM</a>. When I brought up the topic of more accelerators like pattern matching, compression, and the other things we see in chips from Freescale, Cavium, and others, the response was very &#8220;can only be economical for very high volume applications&#8221;.</p>
<p>It is striking how the GPGPU idea is bringing the classic telecommunications DSP-data plane/CPU-control plane division into the desktop and server space. Without any recognition being paid or any experience being reused from the 40 years that that has been done in telecoms and consumer electronics&#8230; as Jack Ganssle often says, us embedded folks get no respect.</p>
<p>In terms of programming, this year was all about general programming languages. Hazim from Microsoft talked about (and demoed) the quite pervasive addition of parallelism to both native C/C++ and managed .net code in Visual Studio 2010. Microsoft is dead serious about parallel programming, and is bringing out a whole set of different libraries and support structures to allow <a href="http://blogs.msdn.com/pfxteam/archive/2009/08/12/9867246.aspx">easier expression of parallel code</a>. In the &#8220;LINQ&#8221; data query language subset of C#, you could add some easy modifiers to &#8220;foreach&#8221; statements to make them parallel, for example. Having a language that is your own and which you can extend at will certainly pays off in terms of innovation here. C++ moves far slower than C#, that is becoming clearer and clearer. C# and its cousins in the .net system seem to be sneaking in lots of powerful language design ideas from places like Python, and also results from Microsoft&#8217;s powerful group of language researchers.</p>
<p>When I tried to bring up the idea of using domain-specific languages to program parallel applications, Hazim had the wonderful comment that &#8220;that might be applicable in certain domains&#8230;&#8221; &#8212; yes, that is the idea. By being narrow in terms of target domains, you gain expressive power and semantic insight that helps move programming from &#8220;how&#8221; towards &#8220;what&#8221;. But it sounds like domain-specific is a foul word inside of Microsoft &#8212; when the audience asked whether LINQ was not exactly a domain-specific language for data access, Hazim was at pains to point out that it is Turing-complete and that someone had managed to write a raytracer using it&#8230; interesting. This feels more political than market-based. I guess Micro</p>
<p>Richard Kaufmann had some interesting notes on throughput vs TTC (time-to-completion) jobs in servers. In the &#8220;cloud computing&#8221; era, throughput is much easier to scale: just add more servers. Classic HPC is more oriented towards TTC, as you do want your results within a reasonable time. Quite often, you can recast most work into a throughput-oriented style by simply running lots of jobs in parallel rather than pushing through a series of jobs sequentially. Note however that we have the entire field of real-time control, real-time communications, etc., that do not work like this. But that is not the market that HP is building servers for, or that Intel and AMD are servicing.</p>
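<p>As a rough analogy in Python (this is not the Visual Studio 2010 or LINQ API, just my own sketch of the same idea using the standard <code>concurrent.futures</code> module), the appeal is that the parallel version of a loop looks almost identical to the sequential one. The function names here are invented for illustration.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def lengths_sequential(words):
    # Plain loop: one element after another.
    return [len(w) for w in words]


def lengths_parallel(words, workers=4):
    # Same per-element operation, handed to a pool of workers.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(len, words))


words = ["multicore", "Erlang", "Simics", "GPGPU"]
print(lengths_sequential(words) == lengths_parallel(words))
```

<p>The per-element work has to be independent for this to be safe, which is exactly the property the language or library is asking the programmer to promise.</p>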
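<p>A toy model shows the difference between the two metrics. The assumptions are mine and deliberately simplistic: all jobs are identical and independent, each server runs one job at a time, and there is no scheduling or communication overhead.</p>

```python
import math


def batch_model(num_jobs, job_seconds, servers):
    """Return (time per job, time for the whole batch, jobs per second)."""
    waves = math.ceil(num_jobs / servers)    # rounds needed to run all jobs
    batch_seconds = waves * job_seconds      # when the last job finishes
    throughput = num_jobs / batch_seconds    # completed jobs per second
    return job_seconds, batch_seconds, throughput


for servers in (1, 10, 100):
    print(servers, batch_model(100, 60, servers))
```

<p>Adding servers shrinks the batch time and raises throughput, but the first number never moves: a single job still takes its 60 seconds. That is why just adding servers does nothing for work where the completion time of one job is what matters.</p>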
<p>Outside the keynotes, Per Holmberg of Ericsson gave an interesting presentation on the adoption of multicore in the control plane of the <a href="http://www.ericsson.com/ericsson/corpinfo/publications/review/2002_02/161.shtml">Ericsson CPP </a>platform. The core of his talk was the observation that in these kinds of systems, multicore is not such a big revolution.</p>
<p>They have been distributed since the beginning. Thus, scaling by adding more processors (with local memories) is easy and multicore is only a packaging change from that. Also, most performance-intense operations are already offloaded onto DSP groups, network processors, ASICs, or FPGAs. There is not much parallelism left for the control plane to exploit. Essentially, only functions that unexpectedly become performance bottlenecks due to changes in traffic patterns are likely candidates for parallelization. Interesting point, and might be <a href="http://jakob.engbloms.se/archives/703">why the EETimes noted that multicore is slow to catch on in communications </a>(the article is a bit flawed).</p>
<p>Patrik Nyblom from Ericsson held a talk about how the <a href="http://www.erlang.org">Erlang </a>runtime engine was parallelized. From a practical perspective, the most interesting aspect was that this made applications parallel without changing a single line of code in the applications. Of course, applications had to be threaded to start with, but that is the most natural way in Erlang. He mentioned systems containing up to a quarter of a million threads &#8212; hard to do that in anything except Erlang.</p>
<p>He described how they had evolved from a simple implementation that worked well on synthetic benchmarks to a truly industrial-strength implementation. The difference was quite radical, as real codes feature more complex communications patterns, and make heavy use of device drivers and network stacks. This process forced the use of more and finer locks, and rethinking the balance between shared and separate heaps for threads.</p>
<p>They also had the opportunity to test their solution on a Tilera 64-core machine. This mercilessly exposed any scalability limitations in their system, and proved the conventional wisdom that going beyond 10 cores is quite different from scaling from 1 to 8&#8230; The two key lessons they learned were that <em>no shared lock goes unpunished, </em>and <em>data has to be distributed as well as code.</em> Very interesting to hear this story from real software developers solving real problems.</p>
<p>The next multicore event taking place around here is the Second <a href="http://www.it.uu.se/research/upmarc/MCC09">Swedish Workshop on Multicore Computing </a>(MCC 2009), in Uppsala, November 26-27.</p>
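<p>The two lessons can be sketched in a few lines of Python. This is my own illustration and has nothing to do with how the Erlang runtime is actually implemented. Both functions compute the same total. The first funnels every update through one shared lock, while the second gives each worker its own private data and combines the results once at the end. Python's global interpreter lock means this only shows the structure and not a real speedup.</p>

```python
import threading


def count_shared(num_workers, increments):
    """Every update takes the same lock: all workers queue up behind it."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_distributed(num_workers, increments):
    """Each worker owns its data; results are merged once at the end."""
    partial = [0] * num_workers

    def worker(slot):
        local = 0
        for _ in range(increments):
            local += 1
        partial[slot] = local  # a single write per worker, to its own slot

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


print(count_shared(4, 1000), count_distributed(4, 1000))
```

<p>On a handful of cores the shared lock is hardly noticeable. With 64 cores contending for it, it becomes the whole story.</p>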
<p>Update: note that the presentations from the event are available via <a href="http://www.multicore.se/">http://www.multicore.se/</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/905/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Downloadable Book about Embedded Multicore</title>
		<link>http://jakob.engbloms.se/archives/877?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/877#comments</comments>
		<pubDate>Sat, 08 Aug 2009 19:27:08 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[John Logan]]></category>
		<category><![CDATA[Jonas Svennebring]]></category>
		<category><![CDATA[Patrik Strömblad]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=877</guid>
		<description><![CDATA[Freescale has now released the collected, updated, and restyled book version of the article series on embedded multicore that I wrote last year together with Patrik Strömblad of Enea, and Jonas Svennebring, and John Logan of Freescale. The book covers the basics of multicore software and hardware, as well as operating systems issues and virtual [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.freescale.com"><img class="alignleft size-full wp-image-878" style="margin-left: 5px; margin-right: 5px;" title="freescale-logo-icon" src="http://jakob.engbloms.se/wp-content/uploads/2009/08/freescale-logo-icon.png" alt="freescale-logo-icon" width="80" height="80" /></a>Freescale has now released the collected, updated, and restyled <a href="http://www.freescale.com/files/32bit/doc/ref_manual/EMBMCRM.pdf">book version </a>of the article series on embedded multicore that I <a href="http://jakob.engbloms.se/archives/423">wrote last year </a>together with Patrik Strömblad of <a href="http://www.enea.com">Enea</a>, and Jonas Svennebring, and John Logan of <a href="http://www.freescale.com">Freescale</a>. The book covers the basics of multicore software and hardware, as well as operating systems issues and virtual platforms. Obviously, the virtual platform part was my contribution.</p>
<p><span id="more-877"></span></p>
<p>It is one of the more comprehensive introductions to how to think about and use multicore architectures in the high-end embedded space. It is free to download and print, but if you want a printed copy, one can be ordered at a price of (I am told) 15 USD (I did not try it myself).</p>
<p>The PDF is at <a href="http://www.freescale.com/files/32bit/doc/ref_manual/EMBMCRM.pdf">http://www.freescale.com/files/32bit/doc/ref_manual/EMBMCRM.pdf </a>.</p>
<p>It will also be linked from the &#8220;Documentation&#8221; section for most Freescale multicore chips&#8217; information pages.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/877/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Coding Horror on Big Iron Hardware</title>
		<link>http://jakob.engbloms.se/archives/841?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/841#comments</comments>
		<pubDate>Wed, 15 Jul 2009 19:41:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[CodingHorror]]></category>
		<category><![CDATA[HP]]></category>
		<category><![CDATA[Jeff Atwood]]></category>
		<category><![CDATA[server]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=841</guid>
		<description><![CDATA[In a post from late June, Jeff Atwood at Coding Horror discusses the horrible cost of a large HP server (scaling up to 32 processor cores in eight AMD x86 sockets), compared to a bunch of simple single-socket basic servers. There are some interesting notes on relative costs of small-and-simple servers, including things like administration [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-654" title="opinion" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/opinion.png" alt="opinion" width="91" height="69" />In a post from late June, Jeff Atwood at Coding Horror <a href="http://www.codinghorror.com/blog/archives/001279.html">discusses the horrible cost of a large HP server </a>(scaling up to 32 processor cores in eight AMD x86 sockets), compared to a bunch of simple single-socket basic servers. There are some interesting notes on relative costs of small-and-simple servers, including things like administration and power. There is an undercurrent to the post and the comments that the big HP machine is &#8220;overpriced&#8221;. I don&#8217;t think it is. If you have ever had <a href="http://user.it.uu.se/~eh/">Erik Hagersten </a>as a teacher in computer architecture, you will know why.</p>
<p><span id="more-841"></span></p>
<p>Essentially, the cost of connecting a bunch of processors goes up exponentially as the number of processors increases. I think this is just as true for Hypertransport-connected AMD 4-way chips as it was for Sun 10000 servers ten years ago. The backplane takes over as the cost driver, from the processors and memories and other obviously useful stuff. Scaling up beyond the commodity space (which is a moving target over time, certainly) requires a lot of engineering and custom hardware design. This makes the cost exponentially higher, but for a good reason.</p>
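<p>One simple way to see why the interconnect grows faster than the processor count is to count links under the most naive assumption, which is mine and not a description of any real HP, AMD, or Sun design: every socket gets a direct link to every other socket. Real machines use cleverer topologies precisely to avoid this, and pay for it in engineering instead.</p>

```python
def full_mesh_links(sockets):
    """Direct point-to-point links needed to connect every pair of sockets."""
    return sockets * (sockets - 1) // 2


for sockets in (2, 4, 8, 32):
    print(sockets, full_mesh_links(sockets))
```

<p>Going from 4 to 8 sockets doubles the processors but takes the link count from 6 to 28. Under this assumption the wiring is already growing much faster than the useful compute.</p>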
<p>Note that this is one of the reasons that the Sun Niagara/UltraSparc T-line machines are compelling: with 32 or 64 threads per socket, getting to 100+ hardware threads is way cheaper using that architecture than anything else in the server space (in deep embedded, 100+ cores is a yawn).</p>
<p>Just a small rant, while on vacation.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/841/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cavium Octeon II: Short Notes</title>
		<link>http://jakob.engbloms.se/archives/811?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/811#comments</comments>
		<pubDate>Sat, 13 Jun 2009 19:40:41 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[Cavium]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[Octeon]]></category>
		<category><![CDATA[Octeon II]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=811</guid>
		<description><![CDATA[About two months ago, Cavium Networks launched their second generation of Octeon chips, the Octeon II. The most obvious difference to the previous generation (Octeon, Octeon Plus) is a new MIPS64 core with much better support for hypervisors and virtualization. There are some other interesting aspects to this chip, though. First, they launch with 2 [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-812" title="octeon-ii" src="http://jakob.engbloms.se/wp-content/uploads/2009/06/octeon-ii.jpg" alt="octeon-ii" width="76" height="78" />About two months ago, <a href="http://www.caviumnetworks.com">Cavium Networks </a>launched their second generation of Octeon chips, the <a href="http://www.caviumnetworks.com/OCTEON_II_MIPS64.html">Octeon II. </a>The most obvious difference to the previous generation (Octeon, Octeon Plus) is a new MIPS64 core with much better support for hypervisors and virtualization. There are some other interesting aspects to this chip, though.</p>
<p><span id="more-811"></span>First, they launch with 2 to 6 cores in typical chips, far short of the 32-core maximum. That probably indicates that system builders currently have a hard time adopting and getting good use from manycore architectures.</p>
<p>It is also a system that is full of accelerator units! In a 6-core chip, you find some 75 accelerator units according to Cavium. That is more than ten times as many accelerators as main cores, indicating where a large part of the work is actually being performed. To me, this validates that heterogeneous architectures and accelerators are still useful and valuable for networking applications, and that the idea of a homogeneous sea of identical processor cores with no specialization and no fixed-function hardware accelerators is still distant (I think it will never happen, but you never know).</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/811/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallelism in Action</title>
		<link>http://jakob.engbloms.se/archives/793?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/793#comments</comments>
		<pubDate>Sun, 24 May 2009 12:53:27 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Embarrassingly Parallel]]></category>
		<category><![CDATA[iPod]]></category>
		<category><![CDATA[Nero]]></category>
		<category><![CDATA[parallelized software]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=793</guid>
		<description><![CDATA[Last year in a blog post on video encoding for the iPod Nano, I complained about the lack of performance on my old Athlon. A bit later, I noted that (obviously) video encoding is a good example of an application that can take advantage of parallelism. Yesterday I put these two topics together in a [...]]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-125 alignleft" style="margin: 5px;" title="coreshrink1" src="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png" alt="Shrinking cores" width="100" height="100" /></p>
<p>Last year in a blog post on <a href="http://jakob.engbloms.se/archives/28">video encoding for the iPod Nano</a>, I complained about the lack of performance on my old Athlon. A bit later, I noted that (obviously) <a href="http://jakob.engbloms.se/archives/31">video encoding is a good example of an application that can take advantage of parallelism</a>. Yesterday I put these two topics together in a practical test. And it worked nicely enough.</p>
<p><span id="more-793"></span></p>
<p>My new Core i7 920-based machine was very well utilized by the Nero 8 suite&#8217;s Nero Recode 3 application when converting some children&#8217;s movies for use on my Nano. Here is a screenshot of the CPU load at one point in the computation:</p>
<p><img class="aligncenter size-full wp-image-794" title="skarmklipp" src="http://jakob.engbloms.se/wp-content/uploads/2009/05/skarmklipp.png" alt="skarmklipp" width="162" height="139" />It was much higher than this at times, but capturing that using the <a href="http://jakob.engbloms.se/archives/580">snipping tool </a>was harder than expected.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/793/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>When does Hardware Acceleration make Sense in Networking?</title>
		<link>http://jakob.engbloms.se/archives/770?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/770#comments</comments>
		<pubDate>Sat, 16 May 2009 06:45:47 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[accelerators]]></category>
		<category><![CDATA[ethernet]]></category>
		<category><![CDATA[hardware-software interface]]></category>
		<category><![CDATA[Mike Odell]]></category>
		<category><![CDATA[networking]]></category>
		<category><![CDATA[tcp]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=770</guid>
		<description><![CDATA[Yes, when does hardware acceleration make sense in networking? Hardware acceleration in the common sense of &#8220;TCP offload&#8221;. This question was answered by a very nicely reasoned &#8220;no&#8221; in an article by Mike Odell in ACM Queue called &#8220;Network Front-End Processors, Yet Again&#8221;. The article is highly recommended for its long historical look at network [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-771" style="margin: 0px 15px;" title="q_stamp" src="http://jakob.engbloms.se/wp-content/uploads/2009/05/q_stamp.gif" alt="q_stamp" width="38" height="65" />Yes, when does hardware acceleration make sense in networking? Hardware acceleration in the common sense of &#8220;TCP offload&#8221;. This question was answered by a very nicely reasoned &#8220;no&#8221; in an article by Mike Odell in <a href="http://queue.acm.org/">ACM Queue</a> called &#8220;<a href="http://queue.acm.org/detail.cfm?id=1530828">Network Front-End Processors, Yet Again</a>&#8221;.</p>
<p><span id="more-770"></span></p>
<p>The article is highly recommended for its long historical look at network processing and network processing offload. As the balance between speeds of networks, processors, memory, and interconnects between network cards and the rest of the system has changed over the years, it is an idea that occasionally (four or five times since the 1970s) has made sense. However, in the end, Mike thinks that it usually does not, and for a machine with multiple cores and a modern fast interconnect, it is hard to see how a hardware accelerator can actually help speed things up much when the coordination between the hardware and the software is accounted for. Even if there would appear to be a big bottleneck somewhere today, we can be sure that it will be removed in the next generation of hardware, rendering the market window for an accelerator quite short.</p>
<p>I read this article as another great motivation for the need to carefully consider the functional design of the hardware-software interface for acceleration devices. For simple data-pumping or media-processing units, this looks easy. For something as complex as TCP/IP processing, it is not. I think the key is that for TCP, we have something that is much more like control-plane processing than data-plane processing, and that is harder to efficiently integrate between hardware and software. Also, there is not really that much work left to offload once data copies have been architected in the right way (and I read Mike&#8217;s article to say that we now know how to do this with sufficiently few copies that software is close to optimal in architecture).</p>
<p>From a market perspective, it would also indicate that the acceleration circuits that are in common use today are by definition those that make sense. Having hardware-accelerated graphics and video decoders does seem to help build more efficient and attractive computer systems, as do cryptography accelerators. With this view, it will be interesting to see which of all the accelerators found in modern networking SoCs like those from Freescale and Cavium will survive the test of time. I am willing to put a small bet that pattern-matching engines for traffic inspection are among them. Apart from that, hard to say.</p>
<p>So go read that article before you start designing your next brilliant accelerator for a common expensive operation.</p>
<p>It also reminds me of a <a href="http://www.virtutech.com/whitepapers/wp-system_arch_spec.html">whitepaper I wrote early this year </a>on how to evaluate performance of a hardware accelerator in the context of a full system with a full software stack, considering the details of the hardware-software interface.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/770/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>EETimes.com &#8211; Multicore CPUs face slow road in comms</title>
		<link>http://jakob.engbloms.se/archives/703?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/703#comments</comments>
		<pubDate>Sun, 22 Mar 2009 21:16:36 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Cavium]]></category>
		<category><![CDATA[Communications market]]></category>
		<category><![CDATA[EETimes]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[Linley Gwennap]]></category>
		<category><![CDATA[Multicore Expo]]></category>
		<category><![CDATA[Octeon]]></category>
		<category><![CDATA[p4080]]></category>
		<category><![CDATA[PowerQUICC]]></category>
		<category><![CDATA[qoriq]]></category>
		<category><![CDATA[Rick Merritt]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=703</guid>
		<description><![CDATA[The  EETimes article Multicore CPUs face slow road in comms piqued my interest. There is an interesting chart in there about just how slow more-than-one-core processors will be in penetrating a vaguely defined &#8220;comms&#8221; market place. I can believe that, but I think their comments on the PowerQUICC series require some commentary&#8230; Essentially, the article [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=215901460"><img class="alignleft size-full wp-image-155" title="eetimes logo" src="http://jakob.engbloms.se/wp-content/uploads/2008/07/eetimes.png" alt="eetimes logo" width="127" height="56" /></a>The EETimes article <a href="http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=215901460">Multicore CPUs face slow road in comms</a> piqued my interest. There is an interesting chart in there about just how slow more-than-one-core processors will be in penetrating a vaguely defined &#8220;comms&#8221; marketplace. I can believe that, but I think their comments on the PowerQUICC series require some commentary&#8230;</p>
<p><span id="more-703"></span>Essentially, the article is a report on a talk by Linley Gwennap at the Multicore Expo last week. The most interesting point is that simple single-core processors are taking over from the traditional heterogeneous processor + big accelerator pattern exemplified by the Freescale PowerQUICC series. And that even Freescale themselves are &#8220;replacing&#8221; PowerQUICC chips based on the venerable CPM with &#8220;simpler dual-core chips&#8221;, which has to mean the MPC8572 currently and probably the QorIQ P2000-series chips later on.</p>
<p>The main point is that people are moving away from the &#8220;complexities&#8221; of the CPM-style heterogeneous hardware design, to symmetric multiprocessing designs that are simpler in one way. But harder to program when you want to have a regular old program use more than one core, as we all well know. It is a good question whether this is actually the case: I am not too sure that correctly writing a parallel threaded program for a shared-memory multiprocessor is easier than calling a hardware accelerator API or using a heterogeneous architecture&#8230; more familiar to general-purpose programmers, sure. But easier? Not necessarily.</p>
<p>What I do take some issue with is the implication that the quad-core and dual-core processors expanding into the market, in Gwennap&#8217;s opinion, do <em>not </em>have these &#8220;complex hardware accelerator APIs&#8221;&#8230; all the hardware I have seen for the comms field certainly features very powerful offload and acceleration engines for tasks like network interface, TCP/IP processing, regular expression matching, security computations, etc.</p>
<p>Look at the feature sets of chips like the Freescale QorIQ P4080 or the Cavium Octeon Plus CN58xx family: their core acceleration engines look every bit as complex as the old CPM to me. The programming might be a bit different, their presence spun as accelerators rather than as a processor in the marketing talk, but still they are complex acceleration blocks that definitely have a lot of power. They also seem quite intent on staying and proliferating, and not going away. I see no sign that the future of computing is anything but <a href="http://jakob.engbloms.se/archives/44">lots of programmable cores augmented by lots of accelerators. </a>The benefits of heterogeneous architectures in terms of power, throughput, and chip size are simply too compelling.</p>
<p>Also interesting in the article is the claim that the poor state of software is slowing the adoption of multicore, which in turn means that software stacks get more time than could be expected to adapt to multicore and truly parallel hardware.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/703/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enea and Freescale Article on SMP OS</title>
		<link>http://jakob.engbloms.se/archives/664?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/664#comments</comments>
		<pubDate>Tue, 24 Feb 2009 09:43:16 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[AMP]]></category>
		<category><![CDATA[Enea]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[Jonas Svennebring]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[mpc8572e]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[OSE]]></category>
		<category><![CDATA[p4080]]></category>
		<category><![CDATA[Patrik Strömblad]]></category>
		<category><![CDATA[SMP]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=664</guid>
		<description><![CDATA[Elektronik i Norden just published a technical insight article about the SMP kernels of Enea OSE and Linux, by Patrik Strömblad and Jonas Svennebring. It has a nice discussion about AMP and SMP, and OS scheduling policies. It is particularly interesting to see how OSE tries to combine the two. Unfortunately, the article is in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.elinor.se">Elektronik i Norden </a>just published a <a href="http://www.webbkampanj.com/ein/0903/?page=51">technical insight article </a>about the <a href="http://www.enea.com/templates/Extension____24922.aspx?headline=http://cws.huginonline.com/E/1059/PR/200811/1267022.xml">SMP kernels </a>of <a href="http://www.enea.se">Enea </a>OSE and Linux, by Patrik Strömblad and Jonas Svennebring.</p>
<p><span id="more-664"></span>It has a nice discussion about AMP and SMP, and OS scheduling policies. It is particularly interesting to see how OSE tries to combine the two. Unfortunately, the article is in Swedish, but I would expect the CMP network that Elektronik i Norden is part of will place this article in English into EETimes or some other publication of theirs.</p>
<p>The article discusses some Freescale targets, such as my favorite the MPC8641D, the MPC8572E dual-core, and the upcoming QorIQ P4080.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/664/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Three Cores make a Crowd &#8212; or a Problem</title>
		<link>http://jakob.engbloms.se/archives/633?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/633#comments</comments>
		<pubDate>Sat, 07 Feb 2009 21:12:38 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[device tree]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[OpenPIC]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=633</guid>
		<description><![CDATA[A common question from simulation users to us simulation providers is &#8220;can I simulate a machine with N cores&#8221;, where N is &#8220;large&#8221;. As if running lots of cores was a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-634" style="margin: 10px;" title="mpc8640d_pp" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/mpc8640d_pp.jpg" alt="mpc8640d_pp" width="130" height="130" />A common question from simulation users to us simulation providers is &#8220;can I simulate a machine with N cores&#8221;, where N is &#8220;large&#8221;. As if running lots of cores was a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform is easy. Creating a software stack for that arbitrary platform is a lot harder, since an SMP software stack needs to understand the cores and how they communicate.</p>
<p>Essentially, what you need is a hardware design that has addressing room for lots of cores, and a software stack that is capable of using lots of cores &#8212; even if such configurations do not exist in hardware. Unfortunately, since software is normally written to run on real existing machines, there tend to be unexpected limitations even where scalability should be feasible &#8220;in principle&#8221;.</p>
<p>Here is the story of how I convinced Linux to handle more than two cores in a virtual <a href="http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8641D&amp;nodeId=0162468rH3bTdG8653">MPC8641D </a>machine.</p>
<p><span id="more-633"></span>In principle, adding more cores to the MPC8641 is easy. The interrupt controller that connects the cores together is the eminently scalable OpenPIC design, which can do at least 32 cores. At run time, this is the only addressing that really matters. The Linux SMP support seems sufficiently scalable using the OpenPIC driver as well (and aside here: OpenPIC appears to be a design originally created by AMD or Cyrix for x86-SMP, but that reached common use with the PowerPC CHRP reference design &#8212; however, Internet sources are murky on this).</p>
<p>But the interrupt controller is just the first hurdle. There is another limit in the MPC8641 hardware: the multicore controller module, MCM, has a register that despite a strange name (Port Control Register, or PCR) is essentially what is used to enable and disable processors. PCR has room for only eight cores. Since the real MPC8641D only has two cores, there is actually a set of six &#8220;reserved&#8221; bits. The Linux board support package thankfully uses a generic scheme based on processor core numbers, so adding more cores just sets bits in the &#8220;reserved&#8221; field:</p>
<p><img class="aligncenter size-full wp-image-636" title="mpc8641d-mcm-room-for-extension1" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/mpc8641d-mcm-room-for-extension1.png" alt="mpc8641d-mcm-room-for-extension1" width="630" height="200" /></p>
<p>Thus, this processor scales to eight cores without recoding the Linux support package or having to modify the register layout of the hardware.</p>
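<p>As a rough illustration of the enable-bit scheme described above, here is a minimal Python sketch. It assumes one enable bit per core in an eight-bit field, with core n at bit n of the field; the actual bit positions in the PCR are not reproduced here, and the function names are made up for this example.</p>

```python
# Hypothetical model of a per-core enable field like the one in the MCM's PCR.
# Assumption: one enable bit per core, eight bits in total, core n at bit n of
# the field. The real register's bit positions are not reproduced here.

FIELD_WIDTH = 8  # room for eight cores, as described in the text


def core_enable_mask(core_ids):
    """Return the enable-field value with one bit set per listed core."""
    mask = 0
    for core in core_ids:
        if not 0 <= core < FIELD_WIDTH:
            raise ValueError(f"core {core} does not fit in the enable field")
        mask |= 1 << core
    return mask


def describe(mask, real_cores=2):
    """Label each set bit as a real core or a formerly reserved bit."""
    labels = []
    for core in range(FIELD_WIDTH):
        if mask & (1 << core):
            kind = "real" if core < real_cores else "reserved"
            labels.append((core, kind))
    return labels
```

<p>With this model, enabling cores 0 to 2 sets two bits that exist on the real chip and one bit in the reserved area, and asking for a ninth core fails because the field has run out of room.</p>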
<p>The next issue was then how to communicate the number of cores to the software stack. There is no standard probing available, so the core count has to be a parameter given to the kernel. In all modern Linux versions, the &#8220;powerpc&#8221; architecture uses an OpenFirmware device tree data structure to obtain the hardware setup: cores, devices, addresses, interrupt routing, and anything else that is not explicitly probed (like PCI or USB, for example).</p>
<p>Once I got a <a href="http://www.jdl.com/software/">device tree compiler </a>installed, this was surprisingly straightforward. Just add a few more cores to the description file, compile, and use the new binary blob (the representation used by the kernel is the dtb, or &#8220;device tree blob&#8221;) instead of the standard one. In a virtual setup, changing this is trivial: just load a different file to memory before booting the system.</p>
<p>However, this did not work. The boot froze after core 2 (the third core) was enabled. Figuring out why and how to fix it took some time, since it turned out not to be a kernel problem at all&#8230; I spent a lot of time tracing and debugging the Linux kernel boot, including reversing back and forth over a hung loop, forcing interrupts to be enabled just to see what would happen, and similar standard virtual platform tricks.</p>
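<p>To give an idea of what adding cores to the description file amounts to, here is a small Python sketch that generates the cpus part of a device tree source file. It is heavily abbreviated: real cpu nodes carry more properties (cache sizes, clock frequencies, and so on), and the node name used here is an assumption rather than a copy of the actual file.</p>

```python
# Sketch: generate an abbreviated "cpus" node for a device tree source file.
# Assumptions: the node name "PowerPC,8641" and the short property list are
# illustrative; a real description file has more properties per cpu node.


def cpu_nodes(core_count, cpu_name="PowerPC,8641"):
    """Return device tree source text with one cpu node per core."""
    lines = ["cpus {", "\t#address-cells = <1>;", "\t#size-cells = <0>;"]
    for n in range(core_count):
        lines += [
            f"\t{cpu_name}@{n} {{",
            '\t\tdevice_type = "cpu";',
            f"\t\treg = <{n}>;",  # the core number the kernel will use
            "\t};",
        ]
    lines.append("};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(cpu_nodes(4))
```

<p>The generated text would then be merged into the full description file and run through the device tree compiler to get the blob that is loaded into memory.</p>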
<p>The problem turned out to be that the kernel was using processor numbers as a way to check which processors were coming online, and this processor number was read from the &#8220;PIR&#8221; special-purpose register (SPR) on the newly activated core. And this PIR value was set to one for all cores except core zero &#8212; some distance into the boot.</p>
<p>By single-stepping the first few instructions of the reset vector code I finally saw what was happening: code put in place by U-Boot (not the Linux kernel, really) was reading a magical MMU configuration register, and using the single bit it contained for determining the current processor as the processor ID. Thus, here was a piece of hardware with a single architected bit for IDs, and it is not even clear to me that this bit is supposed to be used in the way it is here. This was also a bit that could not be extended: putting data in neighboring (reserved, not used for other purposes) bits in that register just to see what would happen broke page table lookups with very high reliability.</p>
<p>In the end, the solution was just to remove the assembler instruction that wrote the PIR register. There was no other way around the problem. I guess this is &#8220;cheating&#8221;, but if changing a single line of code in the boot loader is what it takes to make Linux work with one to eight processor cores, I am fine with that. It is far less invasive than making changes to the Linux kernel, or creating a new system support package from scratch.</p>
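<p>The failure and the fix can be modeled in a few lines of Python. This is a toy model, not the actual boot code: it assumes that each core's PIR starts out holding the core's own number in the virtual platform, and that the boot code overwrites it with a single ID bit that is zero on core zero and one everywhere else.</p>

```python
# Toy model of the processor ID problem. Assumptions: PIR resets to the core's
# own number in the virtual platform, and the boot code (when left in place)
# overwrites it with a one-bit ID that is 0 on core zero and 1 on all others.


def boot_ids(core_count, overwrite_pir):
    """Return the processor ID each core reports after the reset code runs."""
    ids = []
    for core in range(core_count):
        pir = core  # assumed reset value: the core's own number
        if overwrite_pir:
            id_bit = 0 if core == 0 else 1  # single-bit ID, as observed
            pir = id_bit
        ids.append(pir)
    return ids


def cores_never_seen(core_count, overwrite_pir):
    """Core numbers the kernel would wait for but never see come online."""
    seen = set(boot_ids(core_count, overwrite_pir))
    return [core for core in range(core_count) if core not in seen]
```

<p>In this model a dual-core machine boots fine either way, which is why the real hardware never showed the problem. With four cores and the overwrite in place, cores 2 and 3 both report ID 1, so the kernel waits forever for core 2. Without the overwrite, every core reports its own number.</p>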
<p>This has finally given me a machine I can provide to <a href="http://www.virtutech.com/products">Simics </a>users that need an easy-to-change embedded SMP machine for multicore studies. I have tested that it works with 2, 3, 4, 6, and 8 cores. Five and seven would be easy to add as well, as it is just a matter of replacing the device tree.</p>
<p>This exercise also told me that the device tree is an interesting data structure that has significant power once you understand how it works. Until now, I have just seen it as a daunting weird thing that you could not do much about&#8230; but that is not the right attitude.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/633/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hardware-Software Race Condition in Interrupt Controller</title>
		<link>http://jakob.engbloms.se/archives/588?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/588#comments</comments>
		<pubDate>Sat, 17 Jan 2009 21:16:14 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[interrupt controller]]></category>
		<category><![CDATA[learning by doing]]></category>
		<category><![CDATA[OpenPIC]]></category>
		<category><![CDATA[operating systems]]></category>
		<category><![CDATA[race condition]]></category>
		<category><![CDATA[teaching setup]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=588</guid>
		<description><![CDATA[The best way to learn something is to try, fail, and then try again. That is how I just learned the basics of multiprocessor interrupt management. For an educational setup, I have been creating a purely virtual virtual platform from scratch. This setup contains a large number of processors with local memory, and then a [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-589" style="margin: 5px 10px;" title="racecondition" src="http://jakob.engbloms.se/wp-content/uploads/2008/01/racecondition.png" alt="racecondition" width="99" height="78" />The best way to learn something is to try, fail, and then try again. That is how I just learned the basics of multiprocessor interrupt management. For an educational setup, I have been creating a purely virtual virtual platform from scratch. This setup contains a large number of processors with local memory, and then a global shared memory, as well as a means for the processors to interrupt each other in order to notify about the presence of a message or synchronize in general. Getting this really right turned out to be not so easy.</p>
<p><span id="more-588"></span></p>
<p>I started out with a simple model where each processor had an interrupt location mapped in global memory, and writing to this location would interrupt the processor. As a bonus, the written value was communicated to the receiving processor. Then, the processor being interrupted would acknowledge the interrupt to its local interrupt controller by writing into a local address. It worked like a charm in simple tests.</p>
<p>It broke completely when I started sending messages from multiple nodes to the same node&#8230; if an interrupt from node B reached node A when A was busy processing an interrupt from C, the interrupt from B would simply be ignored. There was no queuing, no fairness, no arbitration. The software could not solve this, since in order to create a lock around the global interrupt location for a processor, it needs some kind of global signaling mechanism. Which was what this interrupt system was supposed to provide.</p>
<p>I must have had some suspicion that something was not quite right, as I had equipped the interrupt controller with a counter of interrupts raised versus interrupts cleared. The difference increased monotonically, indicating an accumulation of interrupt attempts that went unnoticed.</p>
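<p>The failure is easy to reproduce in a few lines. What follows is only a minimal sketch of the naive design described above, not the code of my actual model: the type <code>naive_intc</code> and the functions <code>naive_raise</code>, <code>naive_ack</code> and <code>naive_lost</code> are names made up for this illustration, and the model is purely sequential.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the naive controller: one interrupt slot per CPU,
   no queue and no arbitration. */
typedef struct {
    bool     pending;   /* an interrupt is raised and not yet acknowledged */
    uint32_t value;     /* value written by the sender that got through */
    unsigned raised;    /* number of writes to the interrupt location */
    unsigned cleared;   /* number of acknowledgements by the receiver */
} naive_intc;

/* A sender writes the interrupt location. If the slot is already busy,
   the write is silently dropped: this is the bug. */
static void naive_raise(naive_intc *c, uint32_t value) {
    c->raised++;
    if (!c->pending) {
        c->pending = true;
        c->value = value;
    }
}

/* The receiver acknowledges the interrupt it is currently handling. */
static uint32_t naive_ack(naive_intc *c) {
    assert(c->pending);
    c->pending = false;
    c->cleared++;
    return c->value;
}

/* Interrupt attempts that were raised but never reached the receiver. */
static unsigned naive_lost(const naive_intc *c) {
    return c->raised - c->cleared - (c->pending ? 1u : 0u);
}
```

<p>If node C raises an interrupt on A and node B then raises another one before A has acknowledged the first, <code>naive_lost</code> goes from zero to one and never comes back down, which is the kind of drift the counter showed.</p>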
<p>One obvious solution that did not work either was to provide a way to check that an interrupt was successfully sent. Since the interrupt send register for a processor was put in a shared global memory space, a processor that wrote the interrupt send register and then read the status register would have no way to guarantee that the status it read actually dealt with the interrupt it had tried to send. It would be very likely to read the status resulting from some other processor&#8217;s interrupt attempt. Basically, it would be doing non-protected access to a shared mutable area&#8230; known not to be a good idea.</p>
<p>Another solution would be to use an atomic load-and-store operation that would store a value in a register and then return a value to the processor as well. However, I have never seen this supported for device space, even if atomic operations of this type are available on most machines for regular memory.</p>
<p>So it was back to the drawing board. It is clear that in order to do interrupts in a multiprocessor, it must be possible for any processor to interrupt any other processor without the message getting lost due to simultaneous actions in other processors. How to solve this?</p>
<p>And why did I not just copy an existing design or read a book to tell me how to do this? The problem is that I have not managed to find any good readable text on this kind of subject: how does a multiprocessor (shared-memory or local-memory, it does not really matter) handle interrupts and coordinate the code that is running locally on each individual processor with the code running on other processors, at the lowest level? A description of the hardware-software interaction design needed to make this work must exist somewhere, but I have not managed to find it, and I suspect that in many cases this is just passed down as lore from one generation of system designers to the next. If someone knows a good text on this subject, please do point it out to me!</p>
<p>My first design was to use N x N registers for an N-processor machine. Essentially, each processor would have a bank of registers with one register for each other processor, indicating the sending processor. Thus, if processors A and B decide to interrupt C simultaneously, they would write into two different locations, and C could scan its register array to tell that both A and B were calling. However, this eats memory space pretty quickly, since it requires 2 times N squared registers:</p>
<ul>
<li>N registers local to a processor, to read out the message sent in.</li>
<li>N registers for each processor, to write messages to. This can be either a local set for each processor, or put in global memory.</li>
</ul>
<p>In essence, this is the design of the OpenPIC controller common in PowerPC land. It codes the processors using bits rather than full registers, but it works with a local set of data for each processor where it can set bits to interrupt any other processor.</p>
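<p>The bit-per-sender idea can be sketched as follows. This is an illustration of the principle only, not the actual OpenPIC register layout: <code>NUM_CPUS</code>, <code>pending</code>, <code>ipi_send</code> and <code>ipi_take</code> are names assumed for the example, and the model runs sequentially, whereas in real hardware the controller itself would set and clear the bits.</p>

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CPUS 8   /* assumed system size; must not exceed 32 in this sketch */

/* pending[r] has bit s set when CPU s has interrupted CPU r. */
static uint32_t pending[NUM_CPUS];

/* Sender side: every sender owns a distinct bit in the receiver's word,
   so simultaneous senders never overwrite each other. */
static void ipi_send(int sender, int receiver) {
    assert(sender >= 0 && sender < NUM_CPUS);
    assert(receiver >= 0 && receiver < NUM_CPUS);
    pending[receiver] |= UINT32_C(1) << sender;
}

/* Receiver side: return the lowest-numbered pending sender and clear its
   bit, or -1 if nobody is calling. */
static int ipi_take(int receiver) {
    assert(receiver >= 0 && receiver < NUM_CPUS);
    for (int s = 0; s < NUM_CPUS; s++) {
        uint32_t bit = UINT32_C(1) << s;
        if (pending[receiver] & bit) {
            pending[receiver] &= ~bit;
            return s;
        }
    }
    return -1;
}
```

<p>With this layout, A and B interrupting C at the same time leave two bits set in C's word, and C finds both callers when it scans. The price is one bit per sender for every receiver, which is the N squared growth discussed above.</p>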
<p>A colleague of mine pointed out that the SPARC systems do things a bit differently. There, you have a single register into which you write the number of the receiving processor, and a status flag to tell you if you were successful in sending. The sending software is thus responsible for retrying if the remote side is busy. This scales nicely to quite large systems, since there is no need to represent or manage interrupt registers many hundreds of bits wide, the vast majority of which would not be used anyway at any particular point in time. What you lose is the ability of a single processor to do arbitrary multicast interrupting, which I don&#8217;t think is that commonly needed (though it might well be, this is a bit of a dark art).</p>
<p>Since both these controller registers are present in memory that is local to a processor, there is no need to worry about races between different processors interrupting the same target processor simultaneously. The hardware interrupt bus will work things out so that only one wins, and the software on only one processor will see a successful flag status and continue. The others will spin, or do more sophisticated waits if needed.</p>
<p>In the end, the code for sending an interrupt that I used was this:</p>
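<pre>/* Memory-mapped registers of the local interrupt controller. They are
   declared volatile so that the compiler keeps every access; where they
   point is assumed to be set up elsewhere. */
extern volatile int *my_intr_dest, *my_intr_send_data, *my_intr_send_status;

void interrupt_cpu(int cpu_num, int message) {
  *my_intr_dest = cpu_num;        /* select the receiving processor */
  *my_intr_send_data = message;   /* writing the message triggers the send */
  while(*my_intr_send_status == 0) {
    *my_intr_send_data = message; /* target was busy: try again */
  }
}</pre>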
<p>Note that I still send a 32-bit message, mostly since that is handy in educational and demo setups that are not completely limited by what current hardware does. In this design, writing to the message register is what triggers the interrupt (or an attempt to send an interrupt, rather) on the other processor. The hardware (or in my case, the virtual hardware model) does the rest, in a way that is guaranteed to deliver all interrupts safely to their end point, eventually. But it does so without any complex buffering in the hardware itself; that is best handled in software, which has an easier time managing state. This also lets the software use other strategies, such as possibly using a busy interrupt as a signal to try some other processor that is less busy.</p>
<p>Anyway, it was an interesting experience to try this and to see how hardware devices and software interact in a concurrent machine to create races. Not just software, but also hardware, must be designed right to prevent races from occurring. And races caused by hardware are at times quite impossible to work around in software.</p>
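<p>For the curious, the hardware side of this handshake can be modelled roughly like this. It is a simplified sketch rather than my actual device model: <code>MODEL_CPUS</code>, <code>recv_slot</code>, <code>model_send</code> and <code>model_ack</code> are names invented for the illustration. The return value of <code>model_send</code> stands in for what the sender would read back from its send status register.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CPUS 4   /* assumed number of processors in the model */

/* Receiving side of each processor: a single slot, no queue. */
typedef struct {
    bool     busy;      /* an interrupt is delivered but not yet acknowledged */
    uint32_t message;   /* message from the sender that won */
    int      sender;    /* which processor sent it */
} recv_slot;

static recv_slot slots[MODEL_CPUS];

/* Effect of a write to the sender's message register: try to deliver.
   Returns the send status: 1 = delivered, 0 = target busy (retry). */
static int model_send(int sender, int dest, uint32_t message) {
    assert(dest >= 0 && dest < MODEL_CPUS);
    if (slots[dest].busy) {
        return 0;               /* nothing is overwritten, nothing is lost */
    }
    slots[dest].busy = true;
    slots[dest].message = message;
    slots[dest].sender = sender;
    return 1;
}

/* The receiver acknowledges, which frees the slot for the next sender. */
static uint32_t model_ack(int cpu) {
    assert(cpu >= 0 && cpu < MODEL_CPUS);
    assert(slots[cpu].busy);
    slots[cpu].busy = false;
    return slots[cpu].message;
}
```

<p>The important property is that a failed attempt changes nothing on the receiving side and is reported to the one sender that made it, so no other processor's status can be confused with it.</p>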
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/588/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Nulticore Effect&#8221;</title>
		<link>http://jakob.engbloms.se/archives/447?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/447#comments</comments>
		<pubDate>Tue, 09 Dec 2008 19:50:08 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Embarrassingly Parallel]]></category>
		<category><![CDATA[IEEE Spectrum]]></category>
		<category><![CDATA[Jack Ganssle]]></category>
		<category><![CDATA[manycore]]></category>
		<category><![CDATA[memory bandwidth]]></category>
		<category><![CDATA[Sandia Labs]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=447</guid>
		<description><![CDATA[Jack Ganssle wrote a column about the failure of multicore to scale, based on an article in IEEE Spectrum. He makes the following claim: Now a study in IEEE Spectrum shows that even for the classic embarrassingly parallel problems like weather simulations multicore offers little benefit. The curve in that article is priceless. As the [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-270" title="onoff" src="http://jakob.engbloms.se/wp-content/uploads/2008/09/onoff.png" alt="" width="72" height="70" />Jack Ganssle <a href="http://www.embedded.com/columns/breakpoint/212300032">wrote a column about the failure of multicore to scale</a>, based on an <a href="http://www.spectrum.ieee.org/nov08/6912">article in IEEE Spectrum</a>. He makes the following claim:</p>
<blockquote><p>Now a <a style="font-weight: bold;" href="http://www.spectrum.ieee.org/nov08/6912">study in IEEE Spectrum</a> shows that even for the classic embarrassingly parallel problems like weather simulations multicore offers little benefit. The curve in that article is priceless. As the number of cores grow from two to 64 performance plummets by a factor of five. Additional processors nullify each other.</p>
<p>Call it the <span style="font-weight: bold;">Nulticore Effect.</span></p>
<p><span id="more-447"></span></p></blockquote>
<p>I think that Jack misunderstood some of the article. What it really says, as far as I can tell, is that certain types of applications will have problems with the lower external memory bandwidth per core afforded by a 16-way or 32-way multicore based on traditional processor architectures.</p>
<p>As I read it, regular classic &#8220;embarrassingly parallel&#8221; (or as Grant Martin would say, &#8220;proudly parallel&#8221;) problems can be handled by managing data location and computation location carefully to colocate data and code, which lends itself to on-chip caching and probably also local-memory architectures.</p>
<p>Other problems that are less regular, however, are going to run into the memory bandwidth wall:</p>
<blockquote><p>But an increasing number of important science and engineering problems—not to mention national security problems—are of a different sort. These fall under the general category of informatics and include calculating what happens to a transportation network during a natural disaster and searching for patterns that predict terrorist attacks or nuclear proliferation failures. These operations often require sifting through enormous databases of information.</p></blockquote>
<p>So while I think the Sandia people have a very good point to make, it is not the end of the usefulness of multicore. It is only the case for bandwidth-intense irregular algorithms, while many systems today make good use of hundreds of cores without a problem. Also, the research in IEEE Spectrum proposes a solution in the form of stacked memory &#8212; so what we really have is a bit of PR for a particular kind of architecture&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/447/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Few Parallel EDA Tools</title>
		<link>http://jakob.engbloms.se/archives/324?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/324#comments</comments>
		<pubDate>Wed, 29 Oct 2008 12:48:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallelized software]]></category>
		<category><![CDATA[SPICE]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=324</guid>
		<description><![CDATA[I keep looking out for interesting examples of parallel software, and there is a constant trickle of these. This past week I spotted a couple of new ones in the EDA field: SPICE simulation and chip timing analysis. Mentor Graphics Olympus-SoC Richard Goering at SCDSource has a good write-up of a recent announcement from Mentor Graphics [...]]]></description>
			<content:encoded><![CDATA[<p>I keep looking out for interesting examples of parallel software, and there is a constant trickle of these. This past week I spotted a couple of new ones in the EDA field: SPICE simulation and chip timing analysis.</p>
<p><span id="more-324"></span></p>
<h2>Mentor Graphics Olympus-SoC</h2>
<p>Richard Goering at SCDSource has <a href="http://www.scdsource.com/article.php?id=315">a good write-up of a recent announcement from Mentor Graphics</a> on a parallelized version of the Olympus-SoC tool suite for timing analysis. The best bit is the description of how they found parallelism in what used to be a serial program: they went down to very small components of the overall computation, and did a data-flow analysis to find independent atomic units to compute on in parallel. Here, fine-grained is the key to finding lots of parallelism, while using larger units does not work as well.</p>
<p>Quoting the article:</p>
<blockquote>
<div>“If you don’t work at the atomic level, it is very difficult to come up with tasks that are not dependent on each other,” Srinivas said. “We collect a lot of tasks, and we just keep all the cores busy all the time.” The goal, he said, is “minimal starvation” so that individual CPUs are not starved for tasks.</div>
<div>A key technology that makes this possible is what Mentor calls “pin levelization.” With this approach, each node is assigned a level number. If another node has a higher number, there is a possible dependency. Pins at the same level, however, are independent, and their tasks can be collected together into one heterogeneous chunk.</div>
</blockquote>
<p>Go read the rest of it for nice illustrations and more background.</p>
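<p>To make the levelization idea concrete, here is a small sketch of how level numbers can be assigned in a dependency graph. This is my own illustration of the general technique, not Mentor's implementation, which the article does not describe in detail; the <code>edge</code> type and the <code>levelize</code> function are hypothetical names, and the graph is assumed to be acyclic.</p>

```c
#include <assert.h>
#include <stddef.h>

/* A dependency: node "to" needs the result of node "from". */
typedef struct { int from, to; } edge;

/* Give every node the length of the longest dependency chain leading to it.
   A node that depends on another always gets a strictly higher level, so
   nodes on the same level are independent and can be computed in parallel.
   Assumes the graph has no cycles. */
static void levelize(int n_nodes, const edge *edges, size_t n_edges, int *level) {
    for (int i = 0; i < n_nodes; i++) {
        level[i] = 0;
    }
    /* Relax all edges until nothing changes; an acyclic graph settles
       within n_nodes passes. */
    for (int pass = 0; pass < n_nodes; pass++) {
        int changed = 0;
        for (size_t e = 0; e < n_edges; e++) {
            assert(edges[e].from >= 0 && edges[e].from < n_nodes);
            assert(edges[e].to >= 0 && edges[e].to < n_nodes);
            int want = level[edges[e].from] + 1;
            if (level[edges[e].to] < want) {
                level[edges[e].to] = want;
                changed = 1;
            }
        }
        if (!changed) {
            break;
        }
    }
}
```

<p>All nodes that end up with the same level number can then be handed out to the cores as one batch of tasks, one level after the other.</p>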
<h2>Gemini SPICE Simulator</h2>
<p><a href="http://www.chipdesignmag.com/payne/">Daniel Payne at Chip Design writes about another fast SPICE simulator.</a> Not as much detail here, but very nice graphs from the Gemini marketing folks, even if they could have been done in 2D with better information density. SPICE simulation would seem to be fairly parallelizable, which is not too surprising, considering the inherent parallelism of the domain. But as always, implementing a program to take advantage of such domain parallelism can be harder than expected if you did not do it from scratch. Which is what the Gemini people did, apparently.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/324/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SiCS Multicore Days: The Debate Points</title>
		<link>http://jakob.engbloms.se/archives/283?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/283#comments</comments>
		<pubDate>Fri, 19 Sep 2008 20:14:24 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[memory bandwidth]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[panel discussion]]></category>
		<category><![CDATA[SiCS Multicore days]]></category>
		<category><![CDATA[software tools]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=283</guid>
		<description><![CDATA[It is a week ago now, and sometimes it is good to let impressions sink in and get processed a bit before writing about an event like the SiCS Multicore Days. Overall, the event was serious fun, and I found the speakers very insightful and the panel discussion and audience questions added even more information. [...]]]></description>
			<content:encoded><![CDATA[<p>It is a week ago now, and sometimes it is good to let impressions sink in and get processed a bit before writing about an event like the SiCS Multicore Days. Overall, the event was serious fun, and I found the speakers very insightful and the panel discussion and audience questions added even more information.</p>
<p><span id="more-283"></span></p>
<p>What was quite striking this year was the greater difference of opinion between the speakers. I guess that in 2007, most of the discussion was on the level of &#8220;ouch, here comes multicore and what are we going to do about it&#8221;. This year, we got a bit deeper, and with one more year of experience and massive research work, the collective world of multicore has made some progress and gained insights. And that&#8217;s when the differences start to show up; the fact that we have differences of opinion tells us that we are starting to dig into details and turning up different answers due to different viewpoints and user experiences.</p>
<p>So where were the differences this time?</p>
<ul>
<li>Heterogeneous vs homogeneous cores (on a single chip). Kunle Olukotun clearly supported the heterogeneous style (which is what you get with Sun&#8217;s Niagara that he designed the basis for). Erik Hagersten was more interested in the difference between thin and fat cores of the same basic ISA, and Anant Agarwal was strongly in favor of completely homogeneous systems (which is what they build at Tilera). In my biased view, I think the argument for heterogeneous in pure energy efficiency is always going to prevail. See some of my previous blog posts on this topic, for some background:
<ul>
<li><a href="http://jakob.engbloms.se/archives/222">DNS Hardware Acceleration</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/157">Interview with Kunle Olukotun at the Register</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/44">Homogeneous vs heterogenous</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/90">Homogeneous vs heterogeneous, continued</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/80">IBM Z6 accelerators</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/77">Montalvo and heterogeneous x86</a>.</li>
</ul>
</li>
<li>Domain-specific vs general-purpose programming languages. The same sides here, with Kunle advocating domain-specific languages, and Anant and David Padua more in the general-purpose camp. I like domain-specific better, as it seems to rhyme more with what I see people actually doing today to increase programming productivity overall.</li>
<li>Memory bottleneck or not? The most interesting discussion came when memory bandwidth and cache sizes were discussed. One quite common school of thought over the past few years teaches that caches per core will shrink, and that the bandwidth to get data into and out of a chip is going to be a severe restriction on what can be done. Not all on the panel agreed with this; there was the idea (mostly from Kunle) that the massive bandwidths and low latencies achievable within a chip (compared to between chips in a classic discrete-processor multiprocessor) could somehow make this less of a problem. Personally, I think this is going to be some kind of problem, but maybe not as big as feared, since passing data around faster might reduce the need to store it temporarily. Despite the need for more bandwidth, nobody really agreed with Erik&#8217;s thought that maybe it makes sense to build chips that do not max out on the number of cores they contain, but rather try to balance core count with achievable IO bandwidth. That idea has some merit.</li>
<li>Core counts. Moore&#8217;s law tells us there are going to be thousands of cores on a chip fairly soon&#8230; but if we do not manage to make good use of them, maybe the growth in core counts will slow soon. Putting four or six or eight cores into a general-purpose system makes sense today, but more than that might turn out to be a waste for the vast majority of users, who do not have problems to solve and programs to run that can make use of more than that. In the same sense, maybe it is better to have slightly fewer, more powerful cores than a maximum number of minimalistic cores, considering the state of software available today. So it sounds like a fairly divergent future here.</li>
<li>Shared memory or local memories? Most of the panel seemed to be in the camp proposing that shared memory is too convenient not to have, even when it really is bad for you. There were several bad jokes comparing shared memory to alcohol, with the moderator of the panel suggesting that a good way to avoid the hangover of shared memory is to stay drunk&#8230; whatever that means in practice.</li>
</ul>
<p>Some things were generally agreed upon, though.</p>
<ul>
<li>Programming is an issue, shared-memory or local-memory or whatever. The idea for the solution varied, however, as discussed above.</li>
<li>Cores will be plentiful, and operating systems that focus on sharing time on a single very valuable core are an idea of the past. The keyword for the future is spatial sharing and reducing the overhead of management (I have some previous blog posts on this topic, especially on the <a href="http://jakob.engbloms.se/archives/58">subject of IMA</a> and <a href="http://jakob.engbloms.se/archives/123">real-time control when cores are free</a>).</li>
<li>Virtualization and isolating partitions of a multicore chip from each other are necessary mechanisms. Running multiple different operating systems on a single chip will be quite normal, probably under the control of some global hypervisor.</li>
</ul>
<p>Any comments on this from my small audience? I think the topics under discussion are quite fascinating and the kind of issues on which the success of major chip design projects will be decided. A good architecture with a good programming model has a great chance of success (as long as it looks like a continuation of something existing <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> ).</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/283/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

