<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; parallel computing</title>
	<atom:link href="http://jakob.engbloms.se/archives/category/parallel-computing/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>GPGPU for Instruction-Set Simulation &#8211; Maybe, Maybe not</title>
		<link>http://jakob.engbloms.se/archives/1506?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1506#comments</comments>
		<pubDate>Sat, 08 Oct 2011 19:17:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[CCGrid]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[simulation]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1506</guid>
		<description><![CDATA[I just read a quite interesting article by Christian Pinto et al, &#8220;GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms&#8220;, published at the CCGRID 2011 conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed the simulation. Intriguing concept, but the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png"><img class="alignleft size-full wp-image-125" style="margin: 5px 10px;" title="coreshrink1" src="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png" alt="" width="100" height="100" /></a>I just read a quite interesting article by Christian Pinto et al, &#8220;<a href="http://infoscience.epfl.ch/record/164471">GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms</a>&#8220;, published at the <a href="http://www.ics.uci.edu/~ccgrid11/">CCGRID 2011 </a>conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed the simulation. Intriguing concept, but the execution is not without its flaws and it is unclear at least from the paper just how well this generalizes, scales, or compares to parallel simulation on a general-purpose multicore machine.</p>
<p><span id="more-1506"></span>The paper describes a simulation for a network-on-chip based homogeneous system containing a &#8220;ARM-subset&#8221; ISS instances with local instruction and data caches, some local RAM, and also some shared RAM. Each core runs its own local software load, there is no SMP operating system. All communication between cores is over shared memory, using explicit operations across the NoC. All cores run a single cycle before they check communications from their neighbors.</p>
<p>This last point is crucial to understanding why this is feasible at all &#8211; in general, simulating a general shared-memory multiprocessor machine on a shared-memory multiprocessor falls down on the synchronization overhead. If your simulation semantics dictate that you synchronize every cycle anyway, and you do not try to optimize each core simulator, there is clearly decent room for parallel execution. By including the cache, they increase scalability, since there is more work per target cycle that can be run in isolation.</p>
<p>After reading the article, I am impressed by their work &#8211; just getting this to work is pretty good work. But there are quite a few questions which are not really answered in the article and which are crucial to understanding just how well GPGPUs could be used for this kind of ISS work.</p>
<ul>
<li>The targeted level of abstraction is a bit confusing. The authors claim it is &#8220;instruction accurate and not cycle accurate&#8221;, but still simulate caches and cycle-based communications across the NoC. If I read the paper right, communications will take a varying number of cycles depending on the distance for messages to travel. This is more detailed than a typical &#8220;instruction accurate&#8221; simulator.</li>
<li>The target system does not run an OS &#8211; that might (but I do not know) be an advantage for their approach, since it probably implies less variation in the instruction flow in cores, potentially enhancing the amount of time that all ISSes in a thread group in the GPU can execute the same instruction. This would seem crucial, as if each ISS was running a totally different program, the instruction execution part of the code would be running serialized.</li>
<li>They should really try to run the same kind of simulation on a high-end x86 CPU like an Intel Sandy Bridge with 8 or more hardware threads. I wonder if their scaling might not work just as well there &#8211; and with a much faster serial execution engine. This should give  a much more relevant point of comparison for GPU vs CPU execution of the simulator than&#8230;</li>
<li>the comparison object they use right now, a JIT-accelerated multicore simulation using OVP seems pretty irrelevant since it is not doing the same thing at all. That simulator does not simulate the caches or NoC, just a large number of isolated processors. They also do not run a parallel program on OVP, but rather a large number of single-core fibonacci and dhrystone programs. Thus, the fact that OVP uses a large temporal decoupling time slice does not matter for semantics. It just does not seem like a very relevant comparison point. OVP and their simulator try to solve different problems &#8211; fast execution of general code vs. performance profiling of massively parallel machines.</li>
<li>As I understand it, the given &#8220;S-MIPS&#8221; numbers in the evaluation tell us the total number of MIPS that we get out across all target cores. That seems to peak around 2000 &#8211; which isn&#8217;t necessarily that fantastic if we compare to high-performance ISS work in general where a few GIPS is definitely achievable. It is pretty good considering the level of detail here, though, where i would expect a normal ISS + cache simulator to produce at most a few MIPS. Once again, the authors need to be a bit more precise as to what they compare to what.</li>
<li>Not having an MMU and not implementing any interrupts or exceptions in the target machines avoids a large part of the complexity of a real ISS. That complexity might well be too much for the quite rigid execution environment of a GPGPU.</li>
<li>They missed that Simics, unique among instruction-accurate mainstream simulators, is <a href="http://jakob.engbloms.se/archives/128">parallel </a>since version 4.0.</li>
</ul>
<p>So, overall, this paper does not really tell us much whether a GPGPU can be used for instruction-set simulation in general. It does tell us that it might be doable, but there are many crucial complications which are not addressed.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1506"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1506" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1506" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1506/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nvidia &#8220;Kal-El&#8221; Variable SMP</title>
		<link>http://jakob.engbloms.se/archives/1496?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1496#comments</comments>
		<pubDate>Fri, 23 Sep 2011 19:16:33 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1496</guid>
		<description><![CDATA[Nvidia recently announced that their already-known &#8220;Kal-El&#8221; quad-core ARM Cortex-A9 SoC actually contains five processor cores, not just four as a &#8220;normal&#8221; quad-core would. They call the architecture &#8220;Variable SMP&#8221;, and it is a pretty smart design. The one where you think, &#8220;I should have thought of that&#8221;, which is the best sign of something [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/09/nvidia-logo.jpg"><img class="alignleft size-full wp-image-1497" style="margin: 5px 10px;" title="nvidia logo" src="http://jakob.engbloms.se/wp-content/uploads/2011/09/nvidia-logo.jpg" alt="" width="48" height="48" /></a>Nvidia <a href="http://blogs.nvidia.com/2011/09/quad-core-kal-el’s-stealth-fifth-core-lets-it-save-on-energy/">recently announced </a>that their already-known &#8220;Kal-El&#8221; quad-core ARM Cortex-A9 SoC actually contains five processor cores, not just four as a &#8220;normal&#8221; quad-core would. They call the architecture &#8220;Variable SMP&#8221;, and it is a pretty smart design. The one where you think, &#8220;I should have thought of that&#8221;, which is the best sign of something truly good.</p>
<p><span id="more-1496"></span>It is common practice in multicore computing today to dynamically change the clock frequency of a processor and turn cores on and off in order to adjust the compute power available to the current workload. Such operations tend to be limited in scope, as processors have minimum clock frequencies that make sense, and often the memory system requires all cores to be at the same frequency. Operating systems also tend to want to work with homogeneous sets of cores, as that makes scheduling reasonably straight-forward. This is probably what has kept the idea of &#8220;small + large&#8221; cores of the same ISA out of the mainstream of SMP design, despite all its advantages in principle.</p>
<p>Now, Nvidia has managed to implement some of that idea in Kal-El.</p>
<p>The key observation is that if you can turn cores on and off, once you get down to a single active core, any system is by definition homogeneous across all cores regardless of what that core is. Changing the nature of this core should then be much easier, since there is only a single core to contend with.</p>
<p>What Nvidia does in Kal-El is to add a fifth low-power core to the main group of four high-performance cores. The fifth core is architecturally identical (ARM Cortex-A9), so that the system state can be moved from the high-performance to the low-performance cores without undue complexities. Indeed, this is all done in hardware, so the OS (typically, Android) thinks it is running on a homogeneous quad-core. When the system is lightly loaded and the OS decides to only have a single core on, the hardware can detect the load is <em>really</em> light, and effectively change the nature of the active core to a low-power-optimized version.</p>
<p>Once more compute power is needed, the hardware invisible slips back to the first high-power core, and then the OS can start increasing clocks and turning on cores as usual. It is effectively the same as a regular ARM Cortex-A9 quad-core setup, but with better low-power performance. The following graph from the Nvidia <a href="http://www.nvidia.com/content/PDF/tegra_white_papers/tegra-whitepaper-0911b.pdf">white paper </a>shows it pretty clearly (red text is my added comment):</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/09/tegra-1.png"><img class="aligncenter size-full wp-image-1498" title="tegra kal-el power curve" src="http://jakob.engbloms.se/wp-content/uploads/2011/09/tegra-1.png" alt="" width="655" height="446" /></a></p>
<p>Note the slope of the green line: that core is not a good one if you want high performance. It is optimized to scale within a range of low compute-power requirements, rather than provide the best performance per watt at the high end. Using Variable SMP, Nvidia lets us have both.</p>
<p>Neat.</p>
<p>More reading:</p>
<ul>
<li><a href="http://arstechnica.com/gadgets/news/2011/09/tegra-3-includes-5th-stealth-core-to-optimize-power-efficiency.ars">ArsTechnica</a> has a short summary</li>
<li>There does not seem to be much more right now, everyone is really just reiterating the points from the white paper.</li>
</ul>
<p>&nbsp;</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1496"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1496" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1496" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1496/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Memory Models: x86 is TSO, TSO is Good</title>
		<link>http://jakob.engbloms.se/archives/1435?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1435#comments</comments>
		<pubDate>Wed, 22 Jun 2011 15:16:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Doug Lea]]></category>
		<category><![CDATA[Francesco Zappa Nardelli]]></category>
		<category><![CDATA[memory consistency]]></category>
		<category><![CDATA[power architecture]]></category>
		<category><![CDATA[SPARC]]></category>
		<category><![CDATA[UpMarc]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1435</guid>
		<description><![CDATA[By chance, I got to attend a day at the UPMARC Summer School with a very enjoyable talk by Francesco Zappa Nardelli from INRIA. He described his work (along with others) on understanding and modeling multiprocessor memory models. It is a very complex subject, but he managed to explain it very well. He showed a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif"><img class="size-full wp-image-1016 alignleft" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="" width="122" height="45" /></a>By chance, I got to attend a day at the <a href="http://www.it.uu.se/research/upmarc/events/SS2011/Programme.html">UPMARC Summer School</a> with a very enjoyable talk by <a href="http://moscova.inria.fr/~zappa/">Francesco Zappa Nardelli </a>from INRIA. He described his work (along with others) on <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">understanding and modeling multiprocessor memory models</a>. It is a very complex subject, but he managed to explain it very well.</p>
<p><span id="more-1435"></span>He showed a very interesting discussion from a few years ago on the x86 memory model and the implementation of spinlocks in the Linux kernel. Various experts went back and forth over whether the final MOV that sets a lock variable to 1 needed to be prefixed by LOCK or not. The discussion ended when Linus Torvalds said &#8220;I know that it is needed&#8221;. Only to see an Intel architect finally intervene and say &#8220;you know, really, it isn&#8217;t needed&#8221;. This was followed by a series of releases of Intel manuals documenting the x86 memory model, with increasing precision in each release. Intel also actually changed the published rules along the road, withdrawing some optimizations as they realized that they would break existing software.</p>
<p>Note that such a description of a memory model must both describe existing hardware, and serve as the guideline for future hardware. Therefore, there are optimizations that are not implemented today but which are possible given the rules. Such optimization opportunities can be removed from the rulebook as long as they have never been part of shipping hardware, so it is not as crazy as it might sound.</p>
<p>Anyway, the point that Francesco made was both to tell an interesting story from history, and making the point that describing and understanding memory models is hard. I certainly agree with that. I recall an ISCA many years ago when some computer architecture professors all agreed that very few people really understand consistency and weak memory models.</p>
<p>To make life easier for programmers, Francesco and Peter Sewell (in Cambridge) has defined their own set of rules for x86 memory consistency. This is not an architecture spec, but a rule set for regular programmers. It is found at <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">http://www.cl.cam.ac.uk/~pes20/weakmemory/</a>. Essentially, the conclusion is that x86 in practice implements the old SPARC TSO memory model.</p>
<p>They have also attempted to formalize the Power Architecture memory model. Both the actual memory model and their model of it can only be described as very complex. The programmer&#8217;s model is expressed in terms of store queues, speculative instruction execution, and commits of instructions. Not something you easily keep in your head. It is interesting to note that ARM MPCore essentially copied the Power Architecture.</p>
<p>He showed an interactive simulation of the Power memory model, and the way that you need to think about it in terms of propagating information between threads and committing them. It is possible to propagate values and then another propagation overrides a value before the thread commits&#8230; Fun. Or a headache.</p>
<p>The big take-away from the talk for me is that it confirms the observation made may times before that <a href="http://en.wikipedia.org/wiki/Memory_ordering">SPARC TSO </a>seems to be the optimal memory model. It is sufficiently understandable that programmers can write correct code without having barriers everywhere. It is sufficiently weak that you can build fast hardware implementation that can scale to big machines.</p>
<p>Maybe TSO does not theoretically scale in the same insane way as Power or Alpha does/did. But the cost of that theoretical scalability is that programmers might have to litter their code with sync operations just to get it to run correctly. With too many sync operations, the code will run very slowly negating any advantage on the hardware level. Note that sync operations can be very expensive. <a href="http://g.oswego.edu/">Doug Lea</a>, in the audience, pointed out that a sync can cost up to 300 cycles on a POWER5.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1435"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1435" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1435" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1435/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SecurityNow on Randomness</title>
		<link>http://jakob.engbloms.se/archives/1424?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1424#comments</comments>
		<pubDate>Wed, 25 May 2011 20:20:23 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[random number generation]]></category>
		<category><![CDATA[SecurityNow]]></category>
		<category><![CDATA[Steve Gibson]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1424</guid>
		<description><![CDATA[Episodes 299 and 301 of the SecurityNow podcast deal with the problem of how to get randomness out of a computer. As usual, Steve Gibson does a good job of explaining things, but I felt that there was some more that needed to be said about computers and randomness, as well as the related ideas [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png"><img class="alignleft size-full wp-image-1371" title="dice" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png" alt="" width="86" height="88" /></a>Episodes <a href="http://twit.tv/sn299">299 </a>and <a href="http://twit.tv/sn301">301 </a>of the SecurityNow podcast deal with the problem of how to get randomness out of a computer. As usual, Steve Gibson does a good job of explaining things, but I felt that there was some more that needed to be said about computers and randomness, as well as the related ideas of predictability, observability, repeatability, and determinism. I have worked and wrangled with these concepts for almost 15 years now, from my research into timing prediction for embedded processors to my current work with the repeatable and reversible Simics simulator.</p>
<p><span id="more-1424"></span>Let&#8217;s start from the top.</p>
<p>When Steve said that computers are deterministic, I jumped. To me, a computer is anything but deterministic. The idea that rerunning a program does the same thing is an ideal state that you can rarely reach, and having an infrastructure like Simics that <a href="http://blogs.windriver.com/engblom/2010/09/deterministic-but-unpredictable.html">helps you achieve this </a>is huge win for debugging.</p>
<p>Listening closely, what I think Steve <em>really </em>said is that an algorithm like a random number generator is deterministic. If you know its initial state, it will always compute the same result. That is indeed true for code that just converts an input into an output, and does no communication and is not dependent on time or timing. My experience in random and nondeterministic behavior comes from programs that feature multiple threads and often multiple processes, and plenty of asynchronous activity going on. So, same word, different contexts.</p>
<p>However, Steve also several times talk about computers as being deterministic predictable machines. I think that characterizing today&#8217;s computers as being deterministic is untrue. I would rather say that with multiple cores and multiple chips and timing variations all over the place, a computer has become fundamentally <em>nondeterministic </em>and non-repeatable, since there are so many little things going on where a nanosecond difference in time can cause behavior to diverge incredibly quickly. There is a nice paper from 2003 about the divergent behavior from minor differences, &#8220;<a href="http://portal.acm.org/citation.cfm?id=822813">Variability in Architectural Simulations of Multi-threaded Workloads</a>&#8220;, by Alaa R. Alameldeen and David A. Wood.</p>
<p>The <a href="http://jakob.engbloms.se/archives/1374">HAVEGE program I wrote about a while back </a>is essentially an attempt to harness the fundamental unpredictability of modern hardware timing. Nice idea, which at least in theory fulfills the more important property for security of being <em>unobservable</em>. Security doesn&#8217;t really need &#8220;real&#8221; randomness, all you need is something that an attacker cannot predict or observe. The classic <a href="http://www.cs.berkeley.edu/~daw/papers/ddj-netscape.html">Netscape SSL lack-of-randomness in the random seed</a> issue from 1996 is the best illustration of this. Certain things about a target can be inferred or observed, but the low-level hardware timing is not one of them, at least not for an x86 or high-end ARM class machine.</p>
<p>The solution that Steve prefers are the Yarrow and <a href="http://en.wikipedia.org/wiki/Fortuna_%28PRNG%29">Fortuna </a>algorithms that collect randomness from the environment of a computer and uses that as a seed to a normal random number generator, creating lots of useful random data from a fairly small seed. This is the same idea as HAVEGE, but with a different entropy source. In both cases the basic idea seems sound and reasonable, but I kind of hoped that Steve would know of some way to evaluate the quality of the entropy pool generated from hardware events.</p>
<p><a href="http://www.grc.com/sn/sn-301.htm">Steve mentioned </a>the NIST randomness test that was used to test HAVEGE. It is certainly an aggressive test, but <a href="http://jakob.engbloms.se/archives/1374">as my testing showed</a>,  it only demonstrates that a random number generator is random in the data  produced. It does not show that it is unpredictable, and it does not measure the benefit gained from using  unobservable local events in hardware as the source of entropy. You need something  else, like comparing repeated collections of randomness over time from  the same system, to build confidence in unobservable and unpredictable  randomness.</p>
<p>With a computer, you do have such a thing as repeatable,  deterministic, and thus predictable randomness. In a modern desktop or server computer, you also have tons of totally unpredictable non-repeatable non-usefully-observable randomness in the low-level hardware timing and concurrent behavior of independent hardware units. Too bad it seems hard to prove this by measurement.</p>
<p>For yet more randomness discussion, especially randomness in embedded systems, I recommend the <a href="http://secworks.se/2011/03/om-slumptal-och-entropikallan-haveged/">blog post </a>by Joachim Strömbergsson. (it is in Swedish).</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1424"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1424" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1424" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1424/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Photoshop Scalability and &#8220;-10% overhead&#8221;</title>
		<link>http://jakob.engbloms.se/archives/1311?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1311#comments</comments>
		<pubDate>Mon, 01 Nov 2010 11:45:51 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Cary Millsap]]></category>
		<category><![CDATA[Clem Cole]]></category>
		<category><![CDATA[Communications of the ACM]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[performance optimization]]></category>
		<category><![CDATA[Photoshop]]></category>
		<category><![CDATA[Russell Williams]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1311</guid>
		<description><![CDATA[I just finished reading the October 2010 issue of Communications of the ACM. It contained some very good articles on performance and parallel computing. In particular, I found the ACM Case Study on the parallelism of Photoshop a fascinating read. There was also the second part of Cary Millsap&#8217;s articles about &#8220;Thinking Clearly about Performance&#8221;. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/10/cacm-10-20101.jpg"><img class="alignleft size-full wp-image-1313" style="margin: 10px 5px;" title="cacm 10 2010" src="http://jakob.engbloms.se/wp-content/uploads/2010/10/cacm-10-20101.jpg" alt="" width="62" height="80" /></a>I just finished reading the <a href="http://cacm.acm.org/magazines/2010/10">October 2010 </a>issue of <a href="http://cacm.acm.org/">Communications of the ACM</a>. It contained some very good articles on performance and parallel computing. In particular, I found the ACM Case Study on the parallelism of Photoshop a fascinating read. There was also the second part of Cary Millsap&#8217;s articles about &#8220;Thinking Clearly about Performance&#8221;.</p>
<p><span id="more-1311"></span>Cary&#8217;s articles deal mostly with database tuning in the Oracle ecosystem, but most of his observations apply to any kind of programming with a performance requirement. It is worth a read. It was good to see him dissect performance, including obvious &#8211; but not really obvious &#8211; concepts like the difference in usefulness between average and worst-case response times from a user perspective.  In essence, you need to watch the spread of response times, and try to keep the worst times from getting too bad, rather than just look at an average that might conceal extremes that frustrate users.</p>
<p>Cary also made the comment noted in the title of this post. In his opinion, the performance instrumentation built into Oracle has an overhead of -10% &#8211; or even -20% or -30%, since it enables optimizations that would otherwise have been impossible to do. This is something worth noting in general &#8211; overhead that looks bad when considered as a local cost might be a net benefit in the grand scale of things, by enabling measurements and insight that let a program run much faster.</p>
<p>The ACM case study on Photoshop can be found online as a <a href="http://queue.acm.org/detail.cfm?id=1858330">resource at the ACM Queue</a>, with what seems to be mostly the same content. It was written by Clem Cole, at Intel, who interviews Russell Williams of the Photoshop team. It is very instructive to see how the Photoshop team has built an application that works well with 2 to 4 and maybe 8 cores, but that really needs to reconsider parts of its architecture to scale beyond 8.</p>
<p>Clem from Intel pushes Russell by bringing up various examples of next-generation architectures, in particular the fact that clusters-on-a-chip and NUMA memories look inevitable. The Photoshop people seem to take a wait-and-see approach to this: they first want to see some architecture have real traction in the market before they commit and rearchitect their software to make use of it.</p>
<p>The problems of debugging parallel software are also brought up. There used to be a simple bug in the asynchronous I/O system in Photoshop that took ten years to uncover!  Essentially, the programmers had not considered atomicity properly in the presence of multiple threads. With that kind of example, it is not surprising that the Photoshop programmers are very careful when planning and performing parallelizations.</p>
<p>The target domain of Photoshop is to some extent naturally parallel, but not as much as I would have thought. Since a user might operate on any part of an image, large or small, and maybe start and then abort an operation, it is not just a matter of splitting a image evenly across threads or cores. There is a significant amount of variation in just how parallel things can be in Photoshop.</p>
<p>Photoshop has had an easy-to-use parallelization system in place since  around 1994, which lets programmers write simple serial computational  kernels which are automatically applied to parts of an image in  parallel. The Photoshop program itself takes care of the synchronization  between kernels, and the kernels can be simple and robust and without  any parallel code inside. This is a <a href="http://jakob.engbloms.se/archives/209 ">pattern that has been seen before</a>,  and which does make a lot of sense &#8211; if it can be applied successfully.  Apparently, this is not necessarily the easiest thing to scale beyond  four cores.</p>
<p>The main performance limitation for Photoshop performance keeps being memory bandwidth, rather than raw compute performance. This also limits the need to aggressively scale to higher levels of parallelism: as long as multiple threads do not give more bandwidth, it has proven hard to use more than two or three threads on any multicore processor as that is sufficient to saturate the memory system. Apparently, this is different on the Nehalem (Core i7/i5/i3) generation of Intel multicore processors, where each core has a dedicated non-stealable slice of the memory bandwidth.</p>
<p>For the near future, it seems that the big step for Photoshop is going the route of using GPUs for acceleration, rather than 10+ core main processors.</p>
<div id="_mcePaste" style="position: absolute; left: -10000px; top: 25px; width: 1px; height: 1px; overflow: hidden;">http://queue.acm.org/detail.cfm?id=1858330</div>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1311"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1311" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1311" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1311/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: &#8220;IMA on Simics&#8221;</title>
		<link>http://jakob.engbloms.se/archives/1304?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1304#comments</comments>
		<pubDate>Tue, 26 Oct 2010 13:00:25 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[Integrated Modular Avionics]]></category>
		<category><![CDATA[real-time]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[Tennessee Carmel-Veilleux]]></category>
		<category><![CDATA[wcet]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1304</guid>
		<description><![CDATA[I have a fairly lengthy new blog post at my Wind River blog. This time, I interview Tennessee Carmel-Veilleux, a Canadian MSc student who have done some very smart things with Simics. His research is in IMA, Integrated Modular Avionics, and how to make that work on multicore. I made some cocky comments about just [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 10px 5px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>I have a <a href="http://blogs.windriver.com/engblom/2010/10/interview-with-tennessee-carmel-veilleux.html">fairly lengthy new blog post </a>at my Wind River blog. This time, I interview <a href="http://www.tentech.ca/">Tennessee Carmel-Veilleux</a>, a Canadian MSc student who have done some very smart things with Simics. His research is in IMA, Integrated Modular Avionics, and how to make that work on multicore.</p>
<p><span id="more-1304"></span>I made some <a href="http://jakob.engbloms.se/archives/58">cocky comments about just how stupid the current implementations </a>of this idea are a few years ago, but in my discussions with Tennessee I have realized that things are not that simple. Essentially, you are caught between the structures and strictures of the certification agencies, and the complexity of the hardware with its many shared resources making predictability and programming very difficult.</p>
<p>It is a field which touches to my old work on WCET, but with a target system that is much less amenable to analysis. I still <a href="http://jakob.engbloms.se/archives/123">think it is a good idea to try to use seas of simple cores and separate work physically </a>rather than virtually, but that might be a battle that will never be won. If nothing else, even on a massive spatially-divided multicore device, there will be some shared resources that make life difficult when very low levels of jitter are tolerated.</p>
<p>Read the interview, and read Tennessee&#8217;s own blog &#8211; he has done some pretty cool things both in hardware, software, and virtual hardware.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1304"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1304" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1304" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1304/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>S4D 2010</title>
		<link>http://jakob.engbloms.se/archives/1251?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1251#comments</comments>
		<pubDate>Wed, 15 Sep 2010 08:02:42 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Debug]]></category>
		<category><![CDATA[ESCUG]]></category>
		<category><![CDATA[FDL]]></category>
		<category><![CDATA[Infineon]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[John Aynsley]]></category>
		<category><![CDATA[Pat Brouillette]]></category>
		<category><![CDATA[S4D]]></category>
		<category><![CDATA[Simon Davidmann]]></category>
		<category><![CDATA[Southampton]]></category>
		<category><![CDATA[ST]]></category>
		<category><![CDATA[SystemC]]></category>
		<category><![CDATA[Thorsten Grötker]]></category>
		<category><![CDATA[TrustZone]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1251</guid>
		<description><![CDATA[Looks like S4D (and the co-located FDL) is becoming my most regular conference. S4D is a very interactive event. With some 20 to 30 people in the room, many of them also presenting papers at the conference, it turns into a workshop at its best. There were plenty of discussion going on during sessions and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/09/S4D1.jpg"><img class="alignleft size-full wp-image-941" title="S4D" src="http://jakob.engbloms.se/wp-content/uploads/2009/09/S4D1.jpg" alt="" width="143" height="62" /></a>Looks like S4D (and the co-located FDL) is becoming my most regular conference. S4D is a very interactive event. With some 20 to 30 people in the room, many of them also presenting papers at the conference, it turns into a workshop at its best. There were plenty of discussion going on during sessions and the breaks, and I think we all got new insights and ideas.</p>
<p><span id="more-1251"></span></p>
<h2><a href="../wp-content/uploads/2010/09/P1140077.jpg"><img class="aligncenter size-full wp-image-1276" title="P1140077" src="../wp-content/uploads/2010/09/P1140077.jpg" alt="" width="400" height="258" /></a></h2>
<h2>S4D Talks, Themes, and Topics</h2>
<p>More is available in &#8220;<a href="http://jakob.engbloms.se/archives/1280">S4D part 2</a>&#8220;.</p>
<h3>Tracing and Instrumentation</h3>
<p>The papers presented covered a wide variety of topics from a variety of angles. Still, everybody felt that two topics kept coming back in various forms in a majority of the papers and discussions: <em>tracing</em> and <em>instrumentation</em>.</p>
<p>Code instrumentation is not a dirty word anymore. The traditional judgment that inserting probes into your software is plain bad does not apply anymore, at least not in the minds of the people at S4D. Instrumentation was applied to drivers, OS kernels, and regular user-level software. I think the key insight is that there is clear value in having the developers that write a piece of software also mark points of interest in the code. When analyzing a trace of an execution, that means that the information in the trace becomes meaningful to the software developers, as it is on the right level of abstraction. Instrumentation naturally produces traces, which can be fed out using  shared memory, networks, special-purpose hardware, and more.</p>
<p>One of the instrumentation trace solutions presented (the SVEN system from Intel Digital Home presented by Pat Brouillette), actually leaves the instrumentation in place in the shipping customer systems. In this way, you cannot really claim that instrumentation is intrusive &#8211; it is just part of the software, always. Customers can even activate the tracing in deployed systems, and ship the traces back to the developers for analysis of bugs found in the field. It is another approach to <a href="http://jakob.engbloms.se/archives/1231">record and replay</a> that touches on my paper on transporting bugs with checkpoints.</p>
<p>The increased interest in instrumentation probably has something to do with the nature of the systems that are being addressed. For systems using shared memory multicore hardware and general-purpose operating systems, the cost of instrumentation is easier to take than for very small constrained embedded systems. Essentially, as systems get more complex, instrumentation becomes more tractable.</p>
<p>Instrumentation can interact with hardware trace and debug functions is a neat way to build a system which is more powerful than a hardware or software system would be on its own. Especially for software stacks involving hypervisors and multiple complex operating systems, that is likely necessary.</p>
<p>Once we have a trace, just <a href="../archives/942">like last year</a>, we need to have tools for analyzing the tons of data you get from tracing a modern system. ST talked about a tracing system that generated 100s of gigabytes of data.</p>
<p>One trace aspect that kept coming up was the need for <em>time stamps </em>on trace data. To reconcile multiple traces and understand how different concurrent units talk to each other, a global time stamping mechanism is crucial. There seems to be work on hardware to support this.</p>
<h3>Security, Secrecy, and Debug</h3>
<p>I moderated a panel on hardware support for debug, and posed the question on how to balance security and the need to debug. This generated a number of interesting answers from the panel and the audience.</p>
<p>The conflict between debuggability and secrecy is there. Even from the same customer you first get &#8220;you have to make the internal state of the controller inaccessible and hidden to avoid customers modifying their engines&#8221;&#8230; and then when a problem appears in the field, they ask for a way to analyze and trace that very same system. Hard to support both requirements in a reasonable way.</p>
<p>A sophisticated solution to debug security from companies like ARM, Infineon, and ST is debug that can be enabled using key exchange. The chips are built with a &#8220;locked door&#8221; in place, but the keys to the door are kept well-guarded. In this way the same chip can be used in development and in the field.</p>
<p>To support debug of systems involving secure modes like ARM TrustZone, ARM has defined several levels of access in their CoreSight hardware modules. This makes it possible for a debugger to be restricted to just debugging user-level code, just OS and user-level code, or all of the software stack. To me, this sounds like it could allow mobile phone manufacturers to &#8220;securely&#8221; let their application developers use hardware-based debug, without compromising operating systems or secure boot modes.</p>
<p>The classic technique of using fuses to turn off functions is also relevant, at least for systems with moderate levels of security. This can certainly be overcome using special tools to peel off the top of chips and reconnect the fuses, but the panel seemed to think that that level of attack was in general not worth protecting against. However, the audience pointed out that  this was actually being done to automotive engine controllers and there are people making a good living from such antics.</p>
<h3>ESCUG Meeting</h3>
<p>The ESCUG meeting was a mix of fairly slick commercial presentations from OVP/IMperas chief Simon Davidmann and SystemC guru John Aynsley, and research presentations of varying quality.</p>
<p>One thing that struck me was that the academics spent a significant time in all presentations about how their approaches were compatible with the existing SystemC structure, where they host their open-source efforts, etc. I guess that is good in that they show a certain concern for reality &#8211; but it is also a bit sad that they did not get time to actually talk that much about the core ideas they were bringing forward. I am personally much more interested in new ideas than infrastructure and project management. It does not bode well for European research if this is what people are forced to produce, in lieu of real innovation.</p>
<h3>Thorsten Grötker&#8217;s Keynote</h3>
<p>On Wednesday morning, Thorsten from Synopsys did a look back over the history of SystemC, free from product pitching. He only mentioned Synopsys in his introduction, where the high-level message was that the embedded software is really the key problem for industry today. I cannot disagree with that.</p>
<p>During the SystemC parts of his talk he did say a few things that I did not quite agree with&#8230; in particular that TLM was unknown prior to 1999. It was not called that, but it certainly existed in the field of full-system simulation. The main problem is that Thorsten only sees the EDA history of modeling, not the computer architecture and software-driven work that did simulations as far back as 1950 (the famous Gill paper), and fast simulation since at <a href="http://jakob.engbloms.se/archives/130">least 1967</a>.</p>
<p>He also claims that with SystemC you have a single language for both detailed and TLM models. That is true&#8230; but you still need multiple models, one at each level of abstraction. So yes, one language, multiple models. However, that gluability really comes with a performance and complexity cost. It makes it too easy to slip into bad modeling even in TLM.</p>
<p>An interesting theme that Thorsten picked up from John&#8217;s talk at ESCUG is the use of SystemC to model software and RTOS, using the upcoming process control extensions. If you stretch that into the area of software synthesis, it means that SystemC is going to collide with the field of model-driven software development. Will you use SystemC, coming from the hardware world, or UML/MATLAB/Domain-specific languages coming from the software world?  Thorsten makes the interesting point that in order to integrate with that world, SystemC will require some concepts from that world (like pins and clocks enable interaction with RTL). I am not sure that is true, necessarily, I think you can just as well create point adaptors to the same effect.</p>
<h2>Getting to Southampton</h2>
<p>The <a href="http://www.soton.ac.uk/">University of Southampton </a>hosted the event, and it took place in the university lecture halls.  That means that we got free very fast WiFi (unlike any commercial conference venue I have ever seen).  The university campus was full of services (unlike the desolate place that last year&#8217;s FDL/S4D choose).  Housing in the <a href="http://www.soton.ac.uk/accommodation/halls/gleneyre/index.html">Glen Eyre residential halls </a>was a bit spartan but functional. Felt like being back in my days as a student living in student housing.</p>
<p>The instructions from the conference about how to get to the conference was a bit confusing and incomplete. In practice, it is very easy to get to Southampton from both Gatwick (direct train) and Heathrow (NationalExpess bus 203).  At Heathrow, I had a bit of luck with the bus to Southampton. The instructions from the NationalExpress website had me believe that I had to get from Terminal 5 where we landed to the central bus station and then catch the bus at 15.00. As we landed 40 minutes late (14.40), this looked very hopeless&#8230; until I found the NationalExpress counter in the arrivals hall at Terminal 5 and they told me the bus would leave at 15.30. Nice, no stress. The bus to Southampton even had free Wifi on board!</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/09/P1140062.jpg"></a><a href="http://jakob.engbloms.se/wp-content/uploads/2010/09/P1140062-1.jpg"><img class="aligncenter size-full wp-image-1275" title="P1140062-1" src="http://jakob.engbloms.se/wp-content/uploads/2010/09/P1140062-1.jpg" alt="" width="400" height="246" /></a></p>
<p>Once in Southampton, you then had to take the bus U1A out to the university campus, and finding a bus stop for that was the most difficult part of the journey, actually. Some of the buses from Heathrow stop at Southampton university.</p>
<p>See also &#8220;<a href="http://jakob.engbloms.se/archives/1280">S4D Part 2</a>&#8221; for a few more tidbits from S4D.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1251"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1251" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1251" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1251/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>VirtualBox SMP</title>
		<link>http://jakob.engbloms.se/archives/1212?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1212#comments</comments>
		<pubDate>Fri, 20 Aug 2010 18:04:40 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[CECsim]]></category>
		<category><![CDATA[SMP]]></category>
		<category><![CDATA[VirtualBox]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1212</guid>
		<description><![CDATA[I listened to an interesting FLOSS Weekly interview with Adam Hall and Achim Hasenmuller of VirtualBox. For someone interested in virtual machines and hardware simulation, the interview was full of interested tidbits. I think the best part was the discussion on multiprocessing in Virtualbox. VirtualBox is able to give a guest OS up to 32 [...]]]></description>
			<content:encoded><![CDATA[<p>I listened to an interesting <a href="http://twit.tv/floss130">FLOSS Weekly interview </a>with Adam Hall and Achim Hasenmuller of <a href="http://www.virtualbox.org/">VirtualBox</a>. For someone interested in virtual machines and hardware simulation, the interview was full of interested tidbits. I think the best part was the discussion on multiprocessing in Virtualbox.</p>
<p><span id="more-1212"></span>VirtualBox is able to give a guest OS up to 32 virtual processors for its use. The system virtualizes cores so that you can allocate more cores than you have on your host. You don&#8217;t want more active cores than you have physically, or performance might suffer badly (there is a <a href="http://blogs.sun.com/jsavit/entry/virtual_smp_in_virtualbox_3">good blog post hosted by blogs.sun.com about VirtualBox 3 and its SMP support</a>, read it before Oracle decides to kill off all the old content&#8230;).</p>
<p>Second, the development of that SMP support only took some six months. But getting it to work right took 18 months. Sounds like a familiar story for parallelizing software of this kind. The interviewees made it sound like the code for this was utterly complex, and I can believe that too.</p>
<p>VirtualBox makes extensive use of hardware virtualization support on x86  hosts to enhance performance (Intel VT-X and AMD AMD-V). As they see  it, all other alternatives are inferior, including the VmWare tradition  of binary translation and patching. They claimed that VmWare actually  interpreted all of the code in a guest, but I find that a bit hard to  believe.</p>
<p>It is interesting to compare their approach to something like Simics, which was made parallel in the Spring of 2008 with Simics 4.0 Acclerator. The advantage an IT run-time virtual machine like VirtualBox has over a virtual platform like Simics in creating a parallel system is that they can make use of the cache coherence on the host to propagate information between the cores. A virtual platform cannot assume that the host and target are of the same type, which is necessary for this to make sense. It also makes VirtualBox entirely nondeterministic, but that is just plain normal for a physical computer system. One interesting intermediate form here is the IBM CECsim system, where a z-series mainframe is used to simulate a slightly different z-series mainframe, including running multiple simulated processors in parallel. CECsim also makes use of the hardware cache coherence and is nondeterministic.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1212"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1212" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1212" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1212/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multicore is not That Bad</title>
		<link>http://jakob.engbloms.se/archives/1207?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1207#comments</comments>
		<pubDate>Tue, 10 Aug 2010 18:24:15 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Chris Nicols]]></category>
		<category><![CDATA[David Patterson]]></category>
		<category><![CDATA[hypervisor]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1207</guid>
		<description><![CDATA[I recently read a couple of articles on multicore that felt a bit like jumping back in time. In IEEE Spectrum, David Patterson at Berkeley&#8217;s parallel computing lab brings up the issue of just how hard it is to program in parallel and that this makes the wholesale move to multicore into something like a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/02/opinion.png"><img class="alignleft size-full wp-image-654" title="opinion" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/opinion.png" alt="" width="91" height="69" /></a>I recently read a couple of articles on multicore that felt a bit like jumping back in time. In <a href="http://spectrum.ieee.org/computing/software/the-trouble-with-multicore/0">IEEE Spectrum</a>, David Patterson at <a href="http://parlab.eecs.berkeley.edu/">Berkeley&#8217;s parallel computing lab</a> brings up the issue of just how hard it is to program in parallel and  that this makes the wholesale move to multicore into something like a &#8220;<a href="http://en.wikipedia.org/wiki/Hail_Mary_pass">hail Mary pass</a>&#8221; for the computer industry. In <a href="http://www.computerworld.com.au/article/354261/what_will_do_100_cores_/">Computer World</a>, Chris Nicols at <a href="http://www.nicta.com.au/">NICTA in Australia </a>asks what <strong>you </strong>will  do with a hundred cores &#8211; implying that there is not much you can do  today. While both articles make some good points, I also think they  should be taken with a grain of salt. Things are better than they make  them seem. <span id="more-1207"></span></p>
<form></form>
<p>David Patterson&#8217;s article is very similar in its message to what we  used to hear five years ago as everyone woke up and panicked when  single-core computing ran out of steam. Chris Nicols is extrapolating  from the current state in desktop PCs, and asking how the programs we  run today will work when you have scores of cores rather than two or  four.</p>
<p>The main message in both articles is that software  needs to adapt to multicore, and that this is not happening as quickly  as it needs to. It is clear that automatic parallelization of existing  code is a no-starter, and the Computer World article proposes the use of  domain-specific languages (which I agree is a <a href="../archives/747">very </a><a href="../archives/157">good </a><a href="../archives/264">way </a><a href="../archives/905">to </a>go).  David Patterson is less clear on what he thinks is a good programming  model for multicore, leaving that as an open problem. Patterson does  point out that we have quite a few success stories in parallel  computing. I think it is important to not underestimate the set of  problems which  are amenable to parallelization. In particular, in the  embedded field, we have many naturally  parallel problem domains. For example, networking offers abundant  parallelism as many clients, servers, and data streams are active.</p>
<p>However, both articles also miss some of the things which are  happening to make multicore easier to use, especially in embedded. My  favorite example of a technology that nobody seems to talk about outside  of the embedded field is the hypervisor. With an hypervisor (such as  the <a href="http://www.windriver.com/products/hypervisor/">Wind River Hypervisor</a>), you can take existing distributed systems and consolidate them onto a single multicore device, continuing a <a href="../archives/905">long tradition of multiple-processor programming </a>in  embedded. Also, the hardware-based debug tools which are available for  embedded systems &#8211; including multicore &#8211; do not seem to register at all  with the mainstream researchers. It is really a shame, and hopefully we  can start to change that with conferences like the <a href="http://www.ecsi.me/s4d">S4D</a> where we start to bring  industrial debugging experience to the academic community.</p>
<p>Another aspect which is missing from both articles is any discussion  about the debug and test of parallel software. Coding a parallel program  is one thing, making sure it works is quite another. I think this is an  interesting problem in its own right, as exposing parallel bugs is  often just as hard as writing the buggy code to begin with. <a href="http://www.windriver.com/products/simics/">Wind River Simics </a>is  a tool that can be used to really stress multicore software, thanks to  its ability to vary configuration parameters and inject some extra  delays into the target system. Simics is also a very good tool for  debugging multicore software thanks to its controlled deterministic  execution environment. I have already discussed this in a <a href="http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html">Wind River blog post</a>, and I will not repeat the argument here.</p>
<h3>Post scriptum: setting some facts straight</h3>
<p>I  want to quickly correct some factual mistakes in the Computer World  article: the  processor power wall at 130W that he discusses is often much lower in  embedded systems &#8211; a mobile phone drawing 130W would not be very  popular, and in networking infrastructure, the magic limit today seems  to be around 30W. So it all depends, and the Itanium example is totally  irrelevant. In certain mainframe computers, the limit is higher than  that. The same is true for the claimed  upper possible clock speed limit at 4 GHz. This has been surpassed by  several companies, most notably by IBM who clocked the Power6 to 4.7 GHz  a few years ago. However, he is right that we will see cores  multiplying in almost all systems as single cores run out of speed  increases.</p>
<p>As Jack Ganssle said, us embedded folks <a href="http://www.eetimes.com/discussion/embedded-pulse/4023943/Can-t-Get-No-Respect">get no respect</a>.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1207"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1207" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1207" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1207/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: True Concurrency is Different</title>
		<link>http://jakob.engbloms.se/archives/1151?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1151#comments</comments>
		<pubDate>Fri, 18 Jun 2010 20:24:04 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1151</guid>
		<description><![CDATA[I have another blog up at Wind River. This one is about multicore bugs that cannot happen on multithreaded systems, and is called True Concurrency is Truly Different (Again). It bounces from a recent interesting Windows security flaw into how Simics works with multicore systems. Tweet]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="button-quicklink-blogs" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>I have another blog up at Wind River. This one is about multicore bugs that cannot happen on multithreaded systems, and is called <a href="http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html#more">True Concurrency is Truly Different (Again). </a>It bounces from a recent interesting Windows security flaw into how Simics works with multicore systems.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1151"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1151" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1151" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1151/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: Simics Analyzer</title>
		<link>http://jakob.engbloms.se/archives/1137?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1137#comments</comments>
		<pubDate>Wed, 26 May 2010 19:40:42 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Wind River Blog]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1137</guid>
		<description><![CDATA[I have a new blog post up at the Wind River blog network, about the new target analysis tools in Simics 4.4. It is a very fun piece of technology to play with, and you learn a lot just by poking around at existing software systems&#8230; Tweet]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="button-quicklink-blogs" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>I have a <a href="http://blogs.windriver.com/engblom/2010/05/analyzed.html">new blog post </a>up at the Wind River blog network, about the new target analysis tools in Simics 4.4. It is a very fun piece of technology to play with, and you learn a lot just by poking around at existing software systems&#8230;</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1137"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1137" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1137" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1137/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Concurrency in Lego Mindstorms NXT</title>
		<link>http://jakob.engbloms.se/archives/1058?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1058#comments</comments>
		<pubDate>Fri, 08 Jan 2010 21:19:54 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[AVR]]></category>
		<category><![CDATA[Domain-specific languages]]></category>
		<category><![CDATA[LabView]]></category>
		<category><![CDATA[lego]]></category>
		<category><![CDATA[Mindstorms]]></category>
		<category><![CDATA[NXT]]></category>
		<category><![CDATA[parallelized software]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1058</guid>
		<description><![CDATA[For my parental leave, I have just bought myself a Lego Mindstorm NXT 2.0 kit. It is not much fun for our youngest, who mostly gets a bit scared by a piece of Lego driving around making noises, but I hope to be able to use it to teach my older child (almost five) to [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1057" style="margin: 5px 10px;" title="lego mindstorms nxt2" src="http://jakob.engbloms.se/wp-content/uploads/2010/01/lego-mindstorms-nxt2.png" alt="lego mindstorms nxt2" width="146" height="126" /></p>
<p>For my parental leave, I have just bought myself a <a href="http://www.mindstorms.com">Lego Mindstorm NXT 2.0 kit</a>. It is not much fun for our youngest, who mostly gets a bit scared by a piece of Lego driving around making noises, but I hope to be able to use it to teach my older child (almost five) to program. Let&#8217;s see how that turns out. It looks hard to make the NXT environment provide the kind of <a href="http://www.boardgamegeek.com/boardgame/18/roborally">Roborally</a>-style programming blocks that I had hoped to create, as I cannot for some reason get a sufficiently custom icon onto custom blocks.</p>
<p>It also presented me with an opportunity to try some domain-specific high-level graphical programming. The programming environment provided for the NXT series of Mindstorms kits is based on LabView from National Instruments, and it really does seem to work. It even features parallel tasks, which I tried to use&#8230;</p>
<p><span id="more-1058"></span>It turned out that it is not that easy to get it right. I was trying to have a few different tasks look at different sensors and steer the robot in different directions based on the sensor readings. However, this quickly turned into a literal deadlock: the robot just sat there, doing nothing, after the first time that my ultrasound sensor task tried to steer it. Failure.</p>
<p>The manual is quite silent on the tasking semantics, and there are no (or I have not yet found) shared variables, locks, or message-passing mechanisms to synchronize the tasks.</p>
<p>Looking for an answer on the web, I came across a nice tutorial on tasking:</p>
<ul>
<li><a href="http://www.ortop.org/NXT_Tutorial/tasks.html">http://www.ortop.org/NXT_Tutorial/tasks.html</a></li>
</ul>
<p>It is a video, and it ends with the note that the only reliable way to multitask in the NXT environment is to NOT try to control the same thing from multiple tasks. If I want to listen to multiple sensors, the only way is to create a loop with switching on inputs. I.e., manual polling. So much for using the natural language of tasks to handle naturally concurrent activities. At some point, I might figure out if I can construct it from tasks and data value transfers, but for now, I will just go with the flow and write (or rather draw) some polling loops.</p>
<p>Here is an example from the manual. Note that the top &#8220;beam&#8221; controls motor A, and the bottom controls motors B and C:</p>
<p><img class="aligncenter size-full wp-image-1059" title="mindstorms manual excerpt" src="http://jakob.engbloms.se/wp-content/uploads/2010/01/mindstorms-manual-excerpt.png" alt="mindstorms manual excerpt" width="691" height="448" /></p>
<p>That last part about data wires is intriguing&#8230; but I will need to find some more reference materials.</p>
<p>Anyway, the way to express parallel tasks here is really quite neat. At least as long as they are dealing with completely separate aspects of the control of the robot.</p>
<p>If this had been a more hard-core environment, it would have been fun to put different tasks on the ARM7 and AVR processors that are inside the NXT 2.0 brick. Yes, it is a true multiprocessor, if very limited in capabilities. At <a href="http://mindstorms.lego.com/en-us/whatisnxt/default.aspx">http://mindstorms.lego.com/en-us/whatisnxt/default.aspx </a>you have the specs. ARM7, 256 kB of FLASH, 64 kB of RAM, and the AVR has 512 bytes of RAM and 4 kB of FLASH. A nice real embedded machine!</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1058"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1058" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1058" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1058/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MCC 2009 Presentations Online</title>
		<link>http://jakob.engbloms.se/archives/1023?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1023#comments</comments>
		<pubDate>Thu, 03 Dec 2009 08:29:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Andras Vajda]]></category>
		<category><![CDATA[Domain-specific languages]]></category>
		<category><![CDATA[Ericsson]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[keynote]]></category>
		<category><![CDATA[LTE]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[UpMarc]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1023</guid>
		<description><![CDATA[The presentations from the 2009 Swedish Workshop on Multicore Computing (MCC 2009) are now online at the program page for the workshop. Let me add some comments on the workshop per se. This was the first multicore event that I have been to where we did not have a keynote speaker or technical paper from [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1016" style="margin-top: 5px; margin-bottom: 5px;" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="UPMARC_700x150" width="122" height="45" />The presentations from the 2009 Swedish Workshop on Multicore Computing (MCC 2009) are now online at the <a href="http://www.it.uu.se/research/upmarc/MCC09/prog">program page for the workshop</a>. Let me add some comments on the workshop per se.</p>
<p><span id="more-1023"></span>This was the first multicore event that I have been to where we did not have a keynote speaker or technical paper from a hardware company. So there was really nothing here directly about how to build multicore chips. Rather, the workshop tended to be about how to program, use, measure performance on, verify software for, and generally work with multicore chips. From the perspective of software people, rather than hardware designers.</p>
<p>Obviously, hardware aspects enter into such talks, but it is the perspective of a user, not a designer. For example, a hardware designer could explain how an atomic compare-and-swap is optimized in a multicore device. But here, we saw measurements on the actual operation latencies observed on real machines using such operations. Quite refreshing, and closer to my personal interests.</p>
<p>The keynote by <a href="http://a-vajda.eu/blog/">Andras Vajda</a> of Ericsson was quite interesting. The slides are not online, but the main points that I picked up and that I might not have considered before:</p>
<ul>
<li>Software development costs can mean that the cheapest, fastest, most efficient hardware is not necessarily the most economic. Too hard to code for means the software development time and effort removes the advantage. Obvious, but worth reiterating. Software is king.</li>
<li>The workload on a cellular basestation can sometimes be highly linear and single-threaded. For example, serving a single terminal with a very high bandwidth LTE connection. And suddenly shift to a massively parallel workload as a crowd of a thousand all suddenly appear and start doing data downloads. And then go back to serial again. This means that the age-old argument that signal processing naturally &#8220;<a href="http://www.edn.com/blog/980000298/post/50023005.html">conveniently concurrent</a>&#8221; (<a href="http://www.scdsource.com/article.php?id=87">and here</a>) is not always true. Nice point!</li>
<li>Thus, we need adaptable architectures that can trade serial and parallel performance over time, and rebalance quite quickly. In the same chip.</li>
<li>He is a firm believer that homogeneous systems will win out in the end, I still hold on to a belief in accelerators and offload engines and DSPs. This is partially because of an admitted focus on servers and services processors, and not on the baseband and signalling side. Makes sense.</li>
<li>Domain-specific languages (DSL) are the future of efficient programming. Agree.</li>
</ul>
<p>On the topic of DSLs, there was a question about the cost to support them. To me, that is a non-issue. In the organizations that I have worked, it seems that maintaining a useful DSL requires at most one engineer. Developing one, a few good computer scientists for a fairly limited time. In any case, they tend to appear organically when good programmers <a href="http://jakob.engbloms.se/archives/747">generalize repeated tasks</a>.</p>
<p>I gave a keynote about how multicore has impacted virtual platforms (in particular, <a href="http://www.virtutech.com/products/simics">Virtutech Simics</a>) with the following main points:</p>
<ul>
<li>Multicore targets increase the performance pressure on a virtual platform, as more processors will have to be simulated.</li>
<li>Multicore hosts means that sequential performance of the host is going down compared to the aggregate parallel performance demands from the targets.</li>
<li>To handle large target systems, the virtual platform itself has to run multithreaded on a multicore host. Getting this in place is a major, interesting, and sometimes painful process.</li>
<li>Once you have a parallel virtual platform, multicore hosts provide a very nice boost in scalability and the manageable system sizes. A single multithreaded virtual platform process is also a bit easier to manage from a user perspective.</li>
<li>All features in the virtual platform have to be multicore and multimachine-aware&#8230; meaning that they often get a bit harder to use initially, as there is no &#8220;default processor&#8221; you can fall back to for debugging setups etc. Everything has to be explicitly targeted.</li>
<li>Multicore targets have proven to  be a great sales driver for virtual platforms, as debugging software on a physical multicore, multichip, multiboard system is just too painful.</li>
</ul>
<p>Overall, this was a fun event, looking forward to next year at Chalmers!</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1023"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1023" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1023" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1023/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MCC 2009: 2D Stream Processing for Manycore</title>
		<link>http://jakob.engbloms.se/archives/1015?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1015#comments</comments>
		<pubDate>Thu, 26 Nov 2009 15:03:40 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[David Black-Schaffer]]></category>
		<category><![CDATA[efficiency]]></category>
		<category><![CDATA[manycore]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[Stanford]]></category>
		<category><![CDATA[Stream programming]]></category>
		<category><![CDATA[UpMarc]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1015</guid>
		<description><![CDATA[Today here at the MCC 2009 workshop, I heard an interesting talk by David Black-Schaffer of Stanford university.  His work is on stream programming for image processing (&#8220;2D streams&#8221;). Pretty simple basic idea, to use 2D blobs of pixels as kernel inputs rather than single values or vectors. Makes eminent sense for image processing. What [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.it.uu.se/research/upmarc"><img class="alignleft size-full wp-image-1016" style="margin-top: 10px; margin-bottom: 10px;" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="UPMARC_700x150" width="122" height="45" /></a>Today here at the <a href="http://www.it.uu.se/research/upmarc/MCC09/prog">MCC 2009 workshop</a>, I heard an interesting talk by <a href="http://cva.stanford.edu/people/davidbbs/">David Black-Schaffer </a>of Stanford university.  His work is on stream programming for image processing (&#8220;2D streams&#8221;). Pretty simple basic idea, to use 2D blobs of pixels as kernel inputs rather than single values or vectors. Makes eminent sense for image processing.</p>
<p><span id="more-1015"></span>What was striking was his basic attitude to the target machines: he assumed &#8220;manycore&#8221; like 100-way Tilera, 320-way ATI/AMD graphics processors, and similar devices. With these many cores, the goal is to implement an algorithm using a few cores as possible to save power. He assumes that there are always cores available, and wastes no time reducing the core count. One kernel on each core, including replicated kernels and synthesized buffering kernels.</p>
<p>This was the best example I have seen so far on an actual implementation of idea that <strong>cores are free</strong>. For previous writing on this topic, see &#8220;<a href="http://jakob.engbloms.se/archives/269">what is efficiency when cores are free</a>&#8221; and &#8220;<a href="http://jakob.engbloms.se/archives/123">real-time control when cores are free</a>&#8220;.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1015"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1015" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1015" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1015/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finally, a Bug!</title>
		<link>http://jakob.engbloms.se/archives/975?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/975#comments</comments>
		<pubDate>Sun, 25 Oct 2009 20:41:20 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Checkpointing]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[demo]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=975</guid>
		<description><![CDATA[Part of my daily work at Virtutech is building demos. One particularly interesting and frustrating aspect of demo-building is getting good raw material. I might have an idea like &#8220;let&#8217;s show how we unravel a randomly occurring hard-to-reproduce bug using Simics&#8220;. This then turns into a hard hunt for a program with a suitable bug [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/butterfly.png"><img class="alignleft size-full wp-image-982" title="butterfly" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/butterfly.png" alt="butterfly" width="90" height="91" /></a>Part of my daily work at Virtutech is building demos. One particularly interesting and frustrating aspect of demo-building is getting good raw material. I might have an idea like &#8220;let&#8217;s show how we unravel a randomly occurring hard-to-reproduce bug using <a href="http://www.virtutech.com/products/simics_hindsight.html">Simics</a>&#8220;. This then turns into a hard hunt for a program with a suitable bug in it&#8230; not the Simics tooling to resolve the bug. For some reason, when I best need bugs, I have hard time getting them into my code.</p>
<p>I guess it is Murphy&#8217;s law &#8212; if you really set out to want a bug to show up in your code,  your code will stubbornly be perfect and refuse to break. If you set out to build a perfect piece of software, it will never work&#8230;</p>
<p>So I was actually quite happy a few weeks ago when I started to get random freezes in a test program I wrote to show multicore scaling. It was the perfect bug! It broke some demos that I wanted to have working, but fixing the code to make the other demos work was a very instructive lesson in multicore debug that would make for a nice demo in its own right. In the end, it managed to nicely illustrate some common wisdom about multicore software. It was not a trivial problem, fortunately.</p>
<p><span id="more-975"></span>First, some notes about the program. It is a producer-consumer system using pthreads, with a single producer thread feeding a variable number of compute threads with data, over a shared queue structure (a simple one that uses a single lock to protect it, making it not very scalable for small data messages and lots of workers).</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure-2.png"><img class="aligncenter size-full wp-image-980" title="program structure 2" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure-2.png" alt="program structure 2" width="411" height="237" /></a></p>
<p>The queue contains a circular buffer, managed using a standard set of full/empty/tail/head kinds of variables. There is also a flag &#8220;done&#8221; which is set once we are out of data, to tell the compute threads to shut down and terminate the program. As this program is used to demonstrate and test scaling, it is actually something that terminates. The main program spawns off all the threads, and then waits for all threads to finish before it terminates itself.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure.png"><img class="aligncenter size-full wp-image-981" title="program structure" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure.png" alt="program structure" width="300" height="458" /></a></p>
<p>This program and the queue subsystem had worked perfectly for a long time for me, running on an MPC8641 machine with a Linux 2.6.23 kernel, with 1 to 8 cores and 1 to 16 threads. Regardless of settings like thread counts, data sizes, number of packets to compute, it always ran smoothly and terminated.</p>
<p>However, the other week, I moved the program, the exact same binary even, over to a new software stack built on a Linux 2.6.27 kernel. Still on the same MPC8641 machine. Suddenly, I started to see occasional freezes where the program would never terminate. I added some more diagnostic printouts to the program, and saw that the main program would simply freeze waiting for the other threads to terminate and report in. The freezes had no real relationship to input variables. Maybe they were a bit more common with short packets, but no real pattern emerged. They also happened randomly, running the program with the same parameters for a few times in a row would sometimes result in a freeze. Using control-C to quit it and restart would keep the new instance of program running well. Doing some other demo work, I found the same effect on a P4080 machine with 8 cores and a 2.6.30 Linux kernel.</p>
<p>This is a common pattern for parallelism bugs: they only manifest themselves as actual visible crashes or freezes or bad computation results once something in the software stack has changed, even though the fundamental issues have been there all the time. In this case, I think it was the Linux scheduler, but it is really hard to tell. Just because a program runs fine today it does not have to run fine tomorrow.</p>
<p>After deciding to finally sit down and turn this lemon into lemonade, I had to reproduce the error. Thankfully, that is easy when you have a simulator. The first few times I had to run the target program 20 times or so before hitting the issue, but with some parameter and timing variations I managed to create a script that would open a <a href="http://jakob.engbloms.se/archives/714">checkpoint</a>, and run the program a few times under script control, triggering the bug on the fourth run (every time, thanks to determinism).</p>
<p>To diagnose the problem I wrote some Simics script code that I actually felt was fairly cool. I guessed that the problem had something to do with the queue and its handling of &#8220;done&#8221;, since that is what told the threads to terminate.</p>
<p>The first problem was that the queue was not a global variable. Instead, it was dynamically allocated on the heap by a function, and a pointer passed around, but never stored in a global variable (a good computer science graduate never uses a global variable other than as the means of last resort). Finally, my script set a breakpoint on the line in the setup function that came after the allocation. With the program stopped at that point, I could read the local variable pointing to the queue, and find and store the addresses of all the interesting members of the structure.</p>
<p>The code looked like this (Simics CLI), for the record:</p>
<pre> $mbp = ($ctx.break ($st.pos (rule30_threaded.c:222)))
 $cpu = (wait-for-breakpoint $mbp)
 $pq_addr  = ($cpu.sym "pq")
 $pq_tail  = ($cpu.sym "&amp;(pq-&gt;tail)")
 $pq_empty = ($cpu.sym "&amp;(pq-&gt;empty)")
 $pq_full  = ($cpu.sym "&amp;(pq-&gt;full)")
 $pq_head  = ($cpu.sym "&amp;(pq-&gt;head)")
 $pq_done  = ($cpu.sym "&amp;(pq-&gt;done)")</pre>
<p>Next, I set breakpoints on all writes to empty, full, and done. This was the most expedient route to catch actual puts and gets to the queue. Breakpoints on the queue_put() and queue_get() functions are not really showing the true flow, as these functions start by contending for the lock. Looking at writes to the actual queue members gave me the point where the tasks had grabbed the lock.</p>
<p>The script that caught all writes to done, full, and empty, and on each write, it dumped the state of the queue including computing out the number of elements in the circular buffer (without having to run any code on the target). To get an idea for who was active, it also used OS awareness to find the currently executing thread ID, and scripted debugging to convert the current program counter into a position in the program source code (actually, the important issue was the name of the function we were executing in).</p>
<p>This trace of activity showed quite an interesting pair of patterns. When the program ran well, the queue was mostly full, and it looked like the producer task always got some kind of priority to fill it before consumers could get in and drain it. When the program froze, the queue was seldom more than a few elements deep. This was the same program, on the same kernel, just run a few milliseconds later.</p>
<p>Clearly, the Linux kernel can exhibit quite variable behavior even for a program this simple. I guess that&#8217;s why this is called &#8220;soft real time&#8221;&#8230; Another parallelism lesson here: the scheduler is very important, and a smart adaptive scheduler can wreak havoc with software that was accidentally tuned for a different scheduler.</p>
<p>In the end, the crucial hint was that whenever the program froze, the &#8220;done&#8221; flag was set with a queue that was empty or contained just a few elements. I was sure that I had handled this case in my code, checking specifically for that and making sure to wake up the other threads with a signal that &#8220;the queue is not empty any more, please come check for more work&#8221;&#8230; but looking closely at the code, it turned out the code only woke up a single thread. Thus, the froze resulted from the producer setting &#8220;done&#8221; with an empty queue, waking up a single compute thread, and then having the other threads wait forever for more data to be put into the queue. The fix was easy: use a broadcast signal rather than a single signal.</p>
<p>In retrospect, it seems really strange that this ever worked reliably&#8230; it almost that I suspect the old Linux kernel of having a flawed pthreads implementation where signals always wake up all waiting threads, and not just a single one like the documentation says. But that will wait for another day to be investigated.</p>
<p>Here is the code, for reference:</p>
<pre>void rule30_packet_queue_signal_done(rule30_packet_queue_t *q) {
 //
 // Grab lock, set the done signal atomically
 //
 pthread_mutex_lock (&amp;(q-&gt;mutex));
 q-&gt;done = 1;
 pthread_mutex_unlock (&amp;(q-&gt;mutex));
 // Signal any threads waiting for data to wake up
 // and discover that we are indeed done
 //
 // This is the bug:
 // - It only wakes up one thread...
 pthread_cond_signal (&amp;(q-&gt;notEmpty));
 // To be correct:
 // pthread_cond_broadcast (&amp;(q-&gt;notEmpty));
}</pre>
<p><em>Updated analysis:</em></p>
<p>My initial analysis was that when things worked, the &#8220;done&#8221; flag was set with enough data left in the queue that all threads had a chance to pull in data and come in and see the done flag being set.</p>
<p>However, today I went back and wrote a deeper analysis script that also checked for reads from the done flag (turning this check on only after the write to &#8216;done&#8217; to reduce the noise). I expected there to be a single reader when the freeze happened&#8230; but that was not the case. In my current test case, three out of five threads actually got in to read the done flag and terminate.  The crucial code for the compute threads looks like this:</p>
<pre> // Grab mutex,
 //   Check if the queue is empty, if so wait for someone
 //   to push something onto the queue, or signal done.
 //   both of which are done by setting the not_empty conditional variable
 pthread_mutex_lock (&amp;(queue-&gt;mutex));
 while ((queue-&gt;empty) &amp;&amp; !(queue-&gt;done)) {
   pthread_cond_wait (&amp;(queue-&gt;notEmpty), &amp;(queue-&gt;mutex));
 }</pre>
<p>To freeze, a thread actually has to be doing the conditional wait here. There are plenty of other places threads can be as the program is finishing. For example, they can be waiting to grab the initial mutex lock, or actually doing compute work. That explains why some threads actually still terminate even with the buggy version. It certainly also illustrates just how chaotic concurrent programs can be. More so that you can ever imagine, really.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/975"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/975" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/975" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/975/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Ericsson Blog Post about DSL</title>
		<link>http://jakob.engbloms.se/archives/976?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/976#comments</comments>
		<pubDate>Sun, 25 Oct 2009 19:29:19 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Andras Vajda]]></category>
		<category><![CDATA[Domain-specific languages]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=976</guid>
		<description><![CDATA[Andras Vajda of Ericsson wrote an interesting blog post on domain-specific languages (DSLs). Thanks for some success stories and support in what sometimes feels like an uphill battle trying to make people accept that DSLs are a large part of the future of programming. In particular for parallel computing, as they let you hide the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/ericsson_logo.gif"><img class="alignleft size-full wp-image-977" title="ericsson_logo" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/ericsson_logo.gif" alt="ericsson_logo" width="127" height="62" /></a><a href="http://a-vajda.eu/blog/?p=184">Andras Vajda of Ericsson wrote an interesting blog post on domain-specific languages (DSLs).</a> Thanks for some success stories and support in what sometimes feels like an uphill battle trying to make people accept that DSLs are a large part of the future of programming. In particular for parallel computing, as they let you hide the complexities of parallel programming.</p>
<p><span id="more-976"></span></p>
<p>Quotes from the blog post:</p>
<p>Sequential languages hiding parallelism:</p>
<blockquote><p>Sequential imperative languages are notoriously good at hiding or at least obfuscating inherent parallelism in the applications – which may be good news for compiler providers who spend – and charge – big bucks for building reverse engineering technologies that can detect parallelism from sequential code; as well as for highly skilled programmers capable of building parallel code using sequential tools; however, it weights heavily on the total cost of developing software, not to mention the recurring cost of porting to newer HW with e.g. more cores.</p></blockquote>
<p>Applying DSLs to signal processing:</p>
<blockquote><p>During the work on the domain-specific language for DSP programming we have seen some interesting results; for example, several pages of optimized C code were re-written to one PowerPoint slide worth of DSL code, while the DSL to C compiler was able to output efficient code comparable in size to the original code.</p></blockquote>
<p>I agree fully with the final note:</p>
<blockquote><p>The big question is not anymore if DSLs will take off – it’s more if your pain level is higher than the cost of accepting an alternative and different approach; as the cost for taking in a new language tends to stay constant while the complexity of developing software is increasing, odds are that there will be an uptake of domain-specific languages.</p></blockquote>
<p>Go and read the full article!</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/976"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/976" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/976" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/976/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How (Not) To Present Parallel Programming Results</title>
		<link>http://jakob.engbloms.se/archives/946?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/946#comments</comments>
		<pubDate>Mon, 05 Oct 2009 13:06:42 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[DAC]]></category>
		<category><![CDATA[DAC 2009]]></category>
		<category><![CDATA[parallelized software]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=946</guid>
		<description><![CDATA[SCDSource ran a short but good article summarizing a few DAC talks that I would liked to attend. it mostly about the experience of long-term parallel programming research David Bailey in presenting results in the field&#8230; Or more importantly: how not to present results, or how to mislead the audience as to the efficiency of [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-824" title="46daclogo" src="http://jakob.engbloms.se/wp-content/uploads/2009/07/46daclogo.gif" alt="46daclogo" width="81" height="73" /><a href="http://www.scdsource.com/article.php?id=360">SCDSource ran a short but good article</a> summarizing a few DAC talks that I would liked to attend. it mostly about the experience of long-term parallel programming research David Bailey in presenting results in the field&#8230;</p>
<p><span id="more-946"></span>Or more importantly: how not to present results, or how to mislead the audience as to the efficiency of your approach. <a href="http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf">His old 1991 paper </a>on how to do this is still worthy of a read, even if some things have changed (64-bit FP is pretty much as fast as 32-bit these days, for example). The fundamentals of parallelism are still pretty much the same.</p>
<p>The results were from a <a href="http://www.dac.com/events/eventdetails.aspx?id=95-32">DAC panel about multicore and EDA</a>, I wonder if that panel dealt with how to make EDA software itself parallel, or about how to help semiconductor companies help their end user programmers harness the multicore hardware being designed using EDA tools. That does not seems clear to me.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/946"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/946" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/946" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/946/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Freescale P4080, in Physical Form</title>
		<link>http://jakob.engbloms.se/archives/933?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/933#comments</comments>
		<pubDate>Thu, 17 Sep 2009 10:16:37 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[DWF]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[Jonas Svennebring]]></category>
		<category><![CDATA[MPC5606]]></category>
		<category><![CDATA[p4080]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=933</guid>
		<description><![CDATA[Past Tuesday, I attended the Freescale Design With Freescale (DWF) one-day technology event in Kista, Stockholm. This is a small-scale version of the big Freescale Technology Forum, and featured four tracks of talks running from the morning into the afternoon. All very technical, aimed at designing engineers. There were several topic areas, such as automotive, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/08/freescale-logo-icon.png"><img class="alignleft size-full wp-image-878" style="margin: 5px 10px;" title="freescale-logo-icon" src="http://jakob.engbloms.se/wp-content/uploads/2009/08/freescale-logo-icon.png" alt="freescale-logo-icon" width="80" height="80" /></a>Past Tuesday, I attended the Freescale Design With Freescale (DWF) one-day technology event in Kista, Stockholm. This is a small-scale version of the big Freescale Technology Forum, and featured four tracks of talks running from the morning into the afternoon. All very technical, aimed at designing engineers.</p>
<p><span id="more-933"></span>There were several topic areas, such as automotive, consumer, and networking. Networking was mostly focused on the issues of multicore hardware and software.</p>
<p>Of particular interest to me was to see a <a href="http://www.freescale.com/webapp/sps/site/overview.jsp?nodeId=0162468rH3bTdG25E4">Freescale QorIQ P4080 </a>8-core networking/control-plane processor live for the first time. This chip was <a href="http://jakob.engbloms.se/archives/137">announced in the Summer of 2008</a>, with a full ecosystem of software support thanks to <a href="http://www.virtutech.com/qoriq">Virtutech Simics</a>. Now, when the silicon is here, software is indeed running on it thanks to the long headstart development got with the virtual platform. Note that several demos at the event used the Simics simulator to show the software support for the P4080, as there was only a single chip to go around.</p>
<p>I would have loved to have a meaningful picture of the first P4080 in Europe, but  a chip is not really very photogenic &#8211; the P4080 processor was in an open computer case, but covered with a 10 cm-high heat sink which made it fairly hard to actually see. That&#8217;s the challenge with infrastructure things: they are not designed to be seen&#8230; just to do their job well. If you have a new consumer electronics processor, you can at least drive a screen quickly or something. But watching 28 Gbps of Ethernet traffic is not as easy <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Jonas Svennebring of Freescale gave a good talk about how the process of bringup on the P4080 had worked out. It was a total validation of the methodology of using virtual platforms, at different levels of abstraction, and slipping in a bit of hardware emulation as well.</p>
<p>Freescale started software development on the functional fast model, and when clock-cycle-level detailed models of subsystems became available, they started using them as well for performance validation for small pieces of code. Any discrepancies in behavior between the two models was then used to correct the models and documentation. Finally, as the RTL for the silicon began to become available, they used a few emulation setups to run parts of the actual RTL (the emulator could only handle a subset of the entire chip), and validate the performance numbers in the detailed model and the behavior of both models. In the end, when the first silicon became available, Linux was up in a very short time (I cannot give the exact number, but it was a matter of days rather than weeks).</p>
<p>This is the typical iterative process that all chip designers are implementing today: using virtual platforms you can get a head start on development of software, and then as more details become available, you tune models and update both designs, models, and software, iterating towards a hardware/software combination that just works once the silicon realization of the hardware comes around.</p>
<p>So that was all cool.</p>
<p>Jonas also showed a die photo of the QorIQ, and that confirmed by opinion from the <a href="http://jakob.engbloms.se/archives/905">SiCS Multicore Day</a>: embedded multicore is not just about processor cores and cache, it is very much about accelerators to help offload repetitive work from the processing cores. More than half the chip was such acceleration logic! To me, this is a clear confirmation that heterogeneity is the future of hardware design, and a useful way to spend hundreds of millions of transistors to boost SoC performance.</p>
<p>The same was true for most other Freescale hardware showcased at the event. For example, there was the <a href="http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC560xS">MPC5606S dashboard processor</a>, running an LCD display with lots of dynamic graphics with 0.2% CPU load on a 60 MHz e200 Power Architecture processor. All the work was done by its display driver and accelerator. It is hard to argue with that kind of efficiency. That chip did not need a heatsink, either. It was just mounted on the back of an example board with no need for any external logic chips. Apparently, it could also have moved some physical gauges and blinked LEDs, but that demo was considered too distracting for this particular setting.</p>
<p>I also gave a talk at the DWF, about debugging software on multicore using virtual platforms. That was fun, as always. Need to get out more on the road and talk in conferences, I think <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/933"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/933" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/933" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/933/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GPGPU &#8211; a new type of DSP?</title>
		<link>http://jakob.engbloms.se/archives/930?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/930#comments</comments>
		<pubDate>Fri, 11 Sep 2009 14:35:18 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[DSP]]></category>
		<category><![CDATA[GPGPU]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=930</guid>
		<description><![CDATA[My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to neither my job of running Simics, nor do I see such processors appear in any customer applications. Still, [...]]]></description>
			<content:encoded><![CDATA[<p>My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to neither my job of running Simics, nor do I see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.</p>
<p><span id="more-930"></span>The initial key idea behind GPGPU was that a GPU offers very high performance, and does so in a part that &#8220;everyone has anyway&#8221; &#8212; i.e., something that is found on any PC. Outside of PCs, such powerful GPUs are pretty non-existent. Then, the GPU companies picked up on this idea and are making their GPUs more applicable to general purpose tasks.</p>
<p>But where does all this performance come from? To me, it all looks like the rebirth of the vector processor. If we compare a GPU and an Intel or AMD x86 main processor, it is clear that the GPU gets more FLOPs per chip. Mostly, this seems to be because the GPU has many times the number of processing units. Something like 1000s of them, rather than maybe 10 in a general purpose unit.</p>
<p>How can all of these be fit on a die that is similar in size to the general processor? As always when you see disparity like this, it stems from optimization for different target uses leading to different architecture.</p>
<p>The reasons for GPU raw performance seems to be three-fold:</p>
<ul>
<li>Each processor is much simpler, with a simple instruction set and no out-of-order, speculation, or other complex logic. Programming is more complicated, as programs are run on groups of processors and with lots of little constraints. This makes it possible to fit more cores into the same area.</li>
<li>There is far less cache on the die, which forces programs to rely on bandwidth and managing to stream data through the processor.</li>
<li>Processors are built to be good at repetitive math, and be very bad at anything else. This also makes it possible to optimize data flows and control handling to a far greater extent than on general-purpose processors.</li>
<li>And I guess you can add a forth parameter: power consumption and heat is not really a big problem. Watercooling, huge fans,  and 300W power draws are OK&#8230;</li>
</ul>
<p>What this all boils down to is that the GPGPU requires predictable algorithms that can effectively and efficiently prefetch data and stream it through the cores at a predictable rate. Data also needs to be wide to engage groups of cores at once (i.e., vector processing). Integer decision-making code is out (gcc, Simics, control-plane code, most database front ends), and data-intense is in (images, audio, video, graphics). SIMD is part of it, but not the most interesting part. The point is that you apply SIMD across large vectors of independent elements in parallel. And you are looking to solve one large problem at a time.</p>
<p>If you compare this to the classic single-core DSP, you see a very different design. A DSP has specialized instructions in the instruction set, support for loops in very efficient ways, and is often SIMD. But they very rarely operate like vector processors. They are also general enough to be able to run a rudimentary OS and operate semi-independently from the main processor. Also, DSPs tends to be used in large multicore clusters, but there each DSP operates on a different problem at a time. So rather than one vector of 1000 elements in a video compression, you might have 1000 independent video streams being processed, out of synch with each other. DSPs also tend to have much simpler programming models compared to GPGPUs &#8212; even if they can be painful compared to general-purpose processors.</p>
<p>So GPGPUs are qiute different in practice from DSPs, built to solve different types of problems in different ways. In the end, it is not clear to me that a GPGPU is a winner in terms of performance per watt or performance per area. They are certainly hot in the desktop and server field, but I cannot see them replace general DSPs any day soon.</p>
<p>Note that something like the Tilera chip is another intermediate point between multicore DSP and a GPU. There seems to be a long continuum of core counts from around 4 to 8 for DSP to around 100 for Tilera to 1000 for GPUs&#8230;</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/930"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/930" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/930" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/930/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>SiCS Multicore Day 2009</title>
		<link>http://jakob.engbloms.se/archives/905?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/905#comments</comments>
		<pubDate>Mon, 07 Sep 2009 19:26:27 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[Anders Landin]]></category>
		<category><![CDATA[CPP]]></category>
		<category><![CDATA[Ericsson]]></category>
		<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Hazim Shafi]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[Richard Kaufmann]]></category>
		<category><![CDATA[SiCS Multicore days]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[Visual Studio 2010]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=905</guid>
		<description><![CDATA[Last Friday, I attended this year&#8217;s edition of the SiCS Multicore Day. It was smaller in scale than last year, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from Hazim Shafi of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a [...]]]></description>
			<content:encoded><![CDATA[<p>Last Friday, I attended this year&#8217;s edition of the <a href="http://www.sics.se/node/4360">SiCS Multicore Day</a>. It was smaller in scale than <a href="http://jakob.engbloms.se/archives/283">last year</a>, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from <a href="http://blogs.msdn.com/hshafi/">Hazim Shafi </a>of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a mid-day three-track session with research and industry talks from the Swedish multicore community.<span id="more-905"></span></p>
<p>I think that for next year, the organizers need to find keynote speakers that are not from the general computing multicore world. The Microsoft talk this year was a step in that direction, as it rather came from multicore programming than multicore hardware. Richard and Anders gave very interesting and good talks, no doubt about it. But it would have been nice with someone from ARM or Freescale or Tensilica or TI or ST or Ericsson or Cisco talking about the kinds of multicore embedded hardware that is being developed and used today. For example, the &#8220;next new thing&#8221; touted by the keynotes this year was GPGPU. Interesting for HPC and desktops, certainly. But pretty irrelevant for most of the people that I know. GPUs are huge, expensive, and power hungry.</p>
<p>GPGPU was one part of the theme this year. It is definitely catching on as <em>the </em>way to do number crunching in the desktop, server, and HPC world. It is not the universal panacea for any kind of parallelism, however, as Hazim and I noted in the panel discussion that ended the day. There are applications (such as <a href="http://www.virtutech.com/whitepapers/accelerator.html">parallel Simics</a>&#8230;) that scale well on general-purpose cores, but that will never ever work on GPUs. In general, the class of problems that work on GPUs is pretty limited to massive data-parallel problems like image and video manipulation.</p>
<p>In the eternal homogeneous vs heterogeneous debate (follow <a href="http://jakob.engbloms.se/archives/tag/homogeneous">the tags </a>in my blog for more posts on this topic), GPGPU was grudingly accepted as a good candidate for something that will not be homogeneized with the main processors. Additionally, Richard Kaufmann gave some hints that Intel or AMD are coming out with new chips with more accelerators on board&#8230; I guess it will be security, as is already done by Sun and <a href="http://jakob.engbloms.se/archives/80">IBM</a>. When I brought up the topic of more accelerators like pattern matching, compression, and the other things we see in chips from Freescale, Cavium, and others, the response was very &#8220;can only be economical for very high volume applications&#8221;.</p>
<p>It is striking how the GPGPU idea is bringing the classic telecommunications DSP-data plane/CPU-control plane division into the desktop and server space. Without any recognition being paid or any experience being reused from the 40 years that that has been done in telecoms and consumer electronics&#8230; as Jack Ganssle often says, us embedded folks get no respect.</p>
<p>In terms of programming, this year was all about general programming languages. Hazim from Microsoft talked about (and demoed) the quite pervasive addition of parallelism to both native C/C++ and managed .net code in Visual Studio 2010. Microsoft is dead serious about parallel programming, and are bringing out a whole set of different libraries and support structures to allow <a href="http://blogs.msdn.com/pfxteam/archive/2009/08/12/9867246.aspx">easier expression of parallel code</a>. In the &#8220;LINQ&#8221; data query language subset of C#, you could add some easy modifiers to &#8220;foreach&#8221; statements to make them parallel, for example. Having a language that is your own and which you can extend at will certainly pays off in terms of innovation here. C++ moves far slower than C#, that is becoming clearer and clearer. C# and its cousins in the .net system seem to be sneaking in lots of powerful language design ideas from places like Python, and also results from Microsoft&#8217;s powerful group of language researchers.</p>
<p>When I tried to bring up the idea of using domain-specific languages to program parallel applications, Hazim had the wonderful comment that &#8220;that might be applicable in certain domains&#8230;&#8221; &#8212; yes, that is the idea. By being narrow in terms of target domains, you gain expressive power and semantic insight that helps move programming from &#8220;how&#8221; towards &#8220;what&#8221;. But it sounds like domain-specific is a foul word inside of Microsoft &#8212; when the audience asked whether LINQ was not a exactly a domain-specific language for data access, Hazim was a pains to point out that it is Turing-complete and that someone had managed to write a Raytracer using it&#8230; interesting. This feels more political than market-based. I guess Micro</p>
<p>Richard Kaufmann had some interesting notes on throughput vs TTC (time-to-completion) jobs in servers. In the &#8220;cloud computing&#8221; era, throughput is much easier to scale: just add more servers. Classic HPC is more oriented towards TTC, as you do want your results within a reasonable time. Quite often, you can most work into a throughput-oriented style by simply running lots of jobs in parallel rather than pushing through a series of jobs sequentially. Note however that we have the entire field of real-time control, real-time communications, etc., that do not work like this. But that is not the market that HP is building servers for, or that Intel and AMD are servicing.</p>
<p>Outside the keynotes, Per Holmberg of Ericsson gave an interesting presentation on the adoption of multicore in the control plane of the <a href="http://www.ericsson.com/ericsson/corpinfo/publications/review/2002_02/161.shtml">Ericsson CPP </a>platform. The core of his talk was the observation that in these kinds of systems, multicore is not such a big revolution.</p>
<p>They have been distributed since the beginning. Thus, scaling by adding more processors (with local memories) is easy and multicore is only a packaging change from that. Also, most performance-intense operations are already offloaded onto DSP groups, network processors, ASICs, or FPGAs. There is not much parallelism left for the control plane to exploit. Essentially, only functions that unexpectedly become performance bottlenecks due to changes in traffic patterns are likely candidates for parallellization. Interesting point, and might be <a href="http://jakob.engbloms.se/archives/703">why the EETimes noted that multicore is slow to catch on in communications </a>(the article is a bit flawed).</p>
<p>Patrik Nyblom from Ericsson held a talk about how the <a href="http://www.erlang.org">Erlang </a>runtime engine was parallelized. From a practical perspective, the most interesting aspect was that this made applications parallel without changing a single line of code in the applications. Of course, applications had to be threaded to start with, but that is the most natural way in Erlang. He mentioned systems containing up to a quarter of a million threads &#8212; hard to do that in anything except Erlang.</p>
<p>He described how they had evolved from a simple implementation that worked well on synthetic benchmarks to a truly industrial-strength implementation. The difference was quite radical, as real codes feature more complex communications patterns, and make heavy use of device drivers and network stacks. This process forced the use of more and finer locks, and rethinking the balance between shared and separate heaps for threads.</p>
<p>They also had the opportunity to test their solution on a Tilera 64-core machines. This mercilessly exposed any scalability limitations in their system, and proved the conventional wisdom that going beyond 10+ cores is quite different from scaling from 1 to 8&#8230; The two key lessons they learned was that <em>no shared lock goes unpunished, </em>and <em>data has to be distributed as well as code.</em> Very interesting to hear this story from real software developers solving real problems.</p>
<p>The next multicore event taking place around here is the Second <a href="http://www.it.uu.se/research/upmarc/MCC09">Swedish WOrkshop on Multicore Computing </a>(MCC 2009), in Uppsala, November 26-27.</p>
<p>Update: note that the presentations from the event are available via <a href="http://www.multicore.se/">http://www.multicore.se/</a>.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/905"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/905" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/905" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/905/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

