<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; multicore software</title>
	<atom:link href="http://jakob.engbloms.se/archives/category/parallel-computing/multicore-software/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>GPGPU for Instruction-Set Simulation &#8211; Maybe, Maybe not</title>
		<link>http://jakob.engbloms.se/archives/1506?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1506#comments</comments>
		<pubDate>Sat, 08 Oct 2011 19:17:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[CCGrid]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[simulation]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1506</guid>
		<description><![CDATA[I just read a quite interesting article by Christian Pinto et al., &#8220;GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms&#8221;, published at the CCGRID 2011 conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed up the simulation. Intriguing concept, but the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png"><img class="alignleft size-full wp-image-125" style="margin: 5px 10px;" title="coreshrink1" src="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png" alt="" width="100" height="100" /></a>I just read a quite interesting article by Christian Pinto et al., &#8220;<a href="http://infoscience.epfl.ch/record/164471">GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms</a>&#8221;, published at the <a href="http://www.ics.uci.edu/~ccgrid11/">CCGRID 2011 </a>conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed up the simulation. Intriguing concept, but the execution is not without its flaws, and it is unclear, at least from the paper, just how well this generalizes, scales, or compares to parallel simulation on a general-purpose multicore machine.</p>
<p><span id="more-1506"></span>The paper describes a simulation of a network-on-chip based homogeneous system containing &#8220;ARM-subset&#8221; ISS instances with local instruction and data caches, some local RAM, and also some shared RAM. Each core runs its own local software load; there is no SMP operating system. All communication between cores is over shared memory, using explicit operations across the NoC. All cores run a single cycle before they check communications from their neighbors.</p>
<p>This last point is crucial to understanding why this is feasible at all &#8211; in general, simulating a general shared-memory multiprocessor machine on a shared-memory multiprocessor falls down on the synchronization overhead. If your simulation semantics dictate that you synchronize every cycle anyway, and you do not try to optimize each core simulator, there is clearly decent room for parallel execution. By including the cache, they increase scalability, since there is more work per target cycle that can be run in isolation.</p>
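<p>To make the lock-step scheme concrete, here is a minimal sketch of my own, not code from the paper. It assumes a ring of toy cores whose only state is a counter; the names Core, step, and run are hypothetical. Each core advances one cycle using only local state, and messages are delivered at a barrier so that they become visible in the next cycle.</p>

```python
# Lock-step simulation sketch: every core advances exactly one target cycle,
# then all cores exchange messages at a barrier. All names are hypothetical.

class Core:
    def __init__(self, core_id, n_cores):
        self.core_id = core_id
        self.n_cores = n_cores
        self.counter = 0      # stand-in for local architectural state
        self.inbox = []       # messages delivered at the last barrier

    def step(self):
        """Run one cycle using only local state; return outgoing messages."""
        self.counter += 1 + sum(self.inbox)
        self.inbox = []
        neighbor = (self.core_id + 1) % self.n_cores
        return [(neighbor, self.core_id)]   # (destination, payload)


def run(n_cores, n_cycles):
    cores = [Core(i, n_cores) for i in range(n_cores)]
    for _ in range(n_cycles):
        # Phase 1: independent work, one cycle per core. On a GPU or a
        # multicore host this loop is the part that could run in parallel.
        outgoing = [core.step() for core in cores]
        # Phase 2: the barrier. Messages become visible only in the next cycle.
        for messages in outgoing:
            for dest, payload in messages:
                cores[dest].inbox.append(payload)
    return [core.counter for core in cores]


print(run(4, 3))
```

<p>The more work each call to step does (for example, a cache lookup), the smaller the relative cost of the barrier, which is the scalability argument made above.</p>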
<p>After reading the article, I am impressed by their work &#8211; just getting this to work is an achievement. But there are quite a few questions which are not really answered in the article and which are crucial to understanding just how well GPGPUs could be used for this kind of ISS work.</p>
<ul>
<li>The targeted level of abstraction is a bit confusing. The authors claim it is &#8220;instruction accurate and not cycle accurate&#8221;, but still simulate caches and cycle-based communications across the NoC. If I read the paper right, communications will take a varying number of cycles depending on the distance for messages to travel. This is more detailed than a typical &#8220;instruction accurate&#8221; simulator.</li>
<li>The target system does not run an OS &#8211; that might (but I do not know) be an advantage for their approach, since it probably implies less variation in the instruction flow in cores, potentially enhancing the amount of time that all ISSes in a thread group in the GPU can execute the same instruction. This would seem crucial, as if each ISS was running a totally different program, the instruction execution part of the code would be running serialized.</li>
<li>They should really try to run the same kind of simulation on a high-end x86 CPU like an Intel Sandy Bridge with 8 or more hardware threads. I wonder if their scaling might not work just as well there &#8211; and with a much faster serial execution engine. This should give a much more relevant point of comparison for GPU vs CPU execution of the simulator than&#8230;</li>
<li>the comparison object they use right now, a JIT-accelerated multicore simulation using OVP, seems pretty irrelevant since it is not doing the same thing at all. That simulator does not simulate the caches or NoC, just a large number of isolated processors. They also do not run a parallel program on OVP, but rather a large number of single-core Fibonacci and Dhrystone programs. Thus, the fact that OVP uses a large temporal decoupling time slice does not matter for semantics. It just does not seem like a very relevant comparison point. OVP and their simulator try to solve different problems &#8211; fast execution of general code vs. performance profiling of massively parallel machines.</li>
<li>As I understand it, the given &#8220;S-MIPS&#8221; numbers in the evaluation tell us the total number of MIPS that we get out across all target cores. That seems to peak around 2000 &#8211; which isn&#8217;t necessarily that fantastic if we compare to high-performance ISS work in general where a few GIPS is definitely achievable. It is pretty good considering the level of detail here, though, where I would expect a normal ISS + cache simulator to produce at most a few MIPS. Once again, the authors need to be a bit more precise as to what they compare to what.</li>
<li>Not having an MMU and not implementing any interrupts or exceptions in the target machines avoids a large part of the complexity of a real ISS. That complexity might well be too much for the quite rigid execution environment of a GPGPU.</li>
<li>They missed that Simics, unique among instruction-accurate mainstream simulators, is <a href="http://jakob.engbloms.se/archives/128">parallel </a>since version 4.0.</li>
</ul>
<p>So, overall, this paper does not really tell us much about whether a GPGPU can be used for instruction-set simulation in general. It does tell us that it might be doable, but there are many crucial complications which are not addressed.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1506/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Memory Models: x86 is TSO, TSO is Good</title>
		<link>http://jakob.engbloms.se/archives/1435?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1435#comments</comments>
		<pubDate>Wed, 22 Jun 2011 15:16:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Doug Lea]]></category>
		<category><![CDATA[Francesco Zappa Nardelli]]></category>
		<category><![CDATA[memory consistency]]></category>
		<category><![CDATA[power architecture]]></category>
		<category><![CDATA[SPARC]]></category>
		<category><![CDATA[UpMarc]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1435</guid>
		<description><![CDATA[By chance, I got to attend a day at the UPMARC Summer School with a very enjoyable talk by Francesco Zappa Nardelli from INRIA. He described his work (along with others) on understanding and modeling multiprocessor memory models. It is a very complex subject, but he managed to explain it very well. He showed a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif"><img class="size-full wp-image-1016 alignleft" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="" width="122" height="45" /></a>By chance, I got to attend a day at the <a href="http://www.it.uu.se/research/upmarc/events/SS2011/Programme.html">UPMARC Summer School</a> with a very enjoyable talk by <a href="http://moscova.inria.fr/~zappa/">Francesco Zappa Nardelli </a>from INRIA. He described his work (along with others) on <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">understanding and modeling multiprocessor memory models</a>. It is a very complex subject, but he managed to explain it very well.</p>
<p><span id="more-1435"></span>He showed a very interesting discussion from a few years ago on the x86 memory model and the implementation of spinlocks in the Linux kernel. Various experts went back and forth over whether the final MOV that sets a lock variable to 1 needed to be prefixed by LOCK or not. The discussion ended when Linus Torvalds said &#8220;I know that it is needed&#8221;, only for an Intel architect to finally intervene and say &#8220;you know, really, it isn&#8217;t needed&#8221;. This was followed by a series of releases of Intel manuals documenting the x86 memory model, with increasing precision in each release. Intel also actually changed the published rules along the way, withdrawing some optimizations as they realized that they would break existing software.</p>
<p>Note that such a description of a memory model must both describe existing hardware, and serve as the guideline for future hardware. Therefore, there are optimizations that are not implemented today but which are possible given the rules. Such optimization opportunities can be removed from the rulebook as long as they have never been part of shipping hardware, so it is not as crazy as it might sound.</p>
<p>Anyway, Francesco&#8217;s point was both to tell an interesting story from history and to show that describing and understanding memory models is hard. I certainly agree with that. I recall an ISCA many years ago when some computer architecture professors all agreed that very few people really understand consistency and weak memory models.</p>
<p>To make life easier for programmers, Francesco and Peter Sewell (in Cambridge) have defined their own set of rules for x86 memory consistency. This is not an architecture spec, but a rule set for regular programmers. It is found at <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">http://www.cl.cam.ac.uk/~pes20/weakmemory/</a>. Essentially, the conclusion is that x86 in practice implements the old SPARC TSO memory model.</p>
<p>They have also attempted to formalize the Power Architecture memory model. Both the actual memory model and their model of it can only be described as very complex. The programmer&#8217;s model is expressed in terms of store queues, speculative instruction execution, and commits of instructions. Not something you easily keep in your head. It is interesting to note that ARM MPCore essentially copied the Power Architecture.</p>
<p>He showed an interactive simulation of the Power memory model, and the way that you need to think about it in terms of propagating information between threads and committing them. It is possible to propagate values and then another propagation overrides a value before the thread commits&#8230; Fun. Or a headache.</p>
<p>The big take-away from the talk for me is that it confirms the observation made many times before that <a href="http://en.wikipedia.org/wiki/Memory_ordering">SPARC TSO </a>seems to be the optimal memory model. It is sufficiently understandable that programmers can write correct code without having barriers everywhere. It is sufficiently weak that you can build fast hardware implementations that can scale to big machines.</p>
<p>Maybe TSO does not theoretically scale in the same insane way as Power or Alpha does/did. But the cost of that theoretical scalability is that programmers might have to litter their code with sync operations just to get it to run correctly. With too many sync operations, the code will run very slowly, negating any advantage on the hardware level. Note that sync operations can be very expensive. <a href="http://g.oswego.edu/">Doug Lea</a>, in the audience, pointed out that a sync can cost up to 300 cycles on a POWER5.</p>
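<p>For readers who want to see what TSO allows that sequential consistency does not, here is a toy sketch of my own of the classic store-buffering litmus test. It is not the formal x86-TSO model from the Cambridge work, only an illustration under simple assumptions: each thread has a FIFO store buffer, every store writes the value 1, and the function name outcomes is hypothetical. The search enumerates every interleaving and collects the final register values.</p>

```python
# Store-buffering litmus test under a toy TSO model (hypothetical sketch).
# Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x

PROGRAMS = (
    (("store", "x"), ("load", "y")),
    (("store", "y"), ("load", "x")),
)


def outcomes(use_store_buffers):
    """Return every reachable (r0, r1) pair by exhaustive search."""
    results = set()
    start = ((0, 0), ((), ()), (("x", 0), ("y", 0)), (None, None))
    stack, seen = [start], {start}

    def push(state):
        if state not in seen:
            seen.add(state)
            stack.append(state)

    while stack:
        pcs, buffers, mem_items, regs = stack.pop()
        mem = dict(mem_items)
        if pcs == (2, 2) and buffers == ((), ()):
            results.add(regs)
            continue
        for t in (0, 1):
            # Option A: drain the oldest buffered store of thread t to memory.
            if buffers[t]:
                oldest = buffers[t][0]
                new_mem = dict(mem)
                new_mem[oldest] = 1
                new_buffers = list(buffers)
                new_buffers[t] = buffers[t][1:]
                push((pcs, tuple(new_buffers),
                      tuple(sorted(new_mem.items())), regs))
            # Option B: execute the next instruction of thread t.
            if pcs[t] == 2:
                continue
            program = PROGRAMS[t]
            op, var = program[pcs[t]]
            new_pcs = list(pcs)
            new_pcs[t] += 1
            new_buffers, new_mem, new_regs = list(buffers), dict(mem), list(regs)
            if op == "store" and use_store_buffers:
                new_buffers[t] = buffers[t] + (var,)   # store waits in the buffer
            elif op == "store":
                new_mem[var] = 1                       # store hits memory at once
            else:
                # A load sees its own buffered store first, else memory.
                new_regs[t] = 1 if var in buffers[t] else mem[var]
            push((tuple(new_pcs), tuple(new_buffers),
                  tuple(sorted(new_mem.items())), tuple(new_regs)))
    return results


print("sequentially consistent:", sorted(outcomes(False)))
print("TSO with store buffers: ", sorted(outcomes(True)))
```

<p>With store buffers enabled, the outcome where both threads read 0 becomes reachable, which is exactly the case where a programmer would need a fence or a locked instruction. Everything else behaves as a programmer would naively expect, which is why TSO is comparatively easy to reason about.</p>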
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1435/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Photoshop Scalability and &#8220;-10% overhead&#8221;</title>
		<link>http://jakob.engbloms.se/archives/1311?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1311#comments</comments>
		<pubDate>Mon, 01 Nov 2010 11:45:51 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Cary Millsap]]></category>
		<category><![CDATA[Clem Cole]]></category>
		<category><![CDATA[Communications of the ACM]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[performance optimization]]></category>
		<category><![CDATA[Photoshop]]></category>
		<category><![CDATA[Russell Williams]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1311</guid>
		<description><![CDATA[I just finished reading the October 2010 issue of Communications of the ACM. It contained some very good articles on performance and parallel computing. In particular, I found the ACM Case Study on the parallelism of Photoshop a fascinating read. There was also the second part of Cary Millsap&#8217;s articles about &#8220;Thinking Clearly about Performance&#8221;. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/10/cacm-10-20101.jpg"><img class="alignleft size-full wp-image-1313" style="margin: 10px 5px;" title="cacm 10 2010" src="http://jakob.engbloms.se/wp-content/uploads/2010/10/cacm-10-20101.jpg" alt="" width="62" height="80" /></a>I just finished reading the <a href="http://cacm.acm.org/magazines/2010/10">October 2010 </a>issue of <a href="http://cacm.acm.org/">Communications of the ACM</a>. It contained some very good articles on performance and parallel computing. In particular, I found the ACM Case Study on the parallelism of Photoshop a fascinating read. There was also the second part of Cary Millsap&#8217;s articles about &#8220;Thinking Clearly about Performance&#8221;.</p>
<p><span id="more-1311"></span>Cary&#8217;s articles deal mostly with database tuning in the Oracle ecosystem, but most of his observations apply to any kind of programming with a performance requirement. It is worth a read. It was good to see him dissect performance, including obvious &#8211; but not really obvious &#8211; concepts like the difference in usefulness between average and worst-case response times from a user perspective.  In essence, you need to watch the spread of response times, and try to keep the worst times from getting too bad, rather than just look at an average that might conceal extremes that frustrate users.</p>
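<p>A small made-up example shows why the average alone can mislead. The numbers below are invented for illustration and are not from Cary&#8217;s articles; summarize is a hypothetical helper. Both sample sets have the same mean response time, but only one of them would frustrate users.</p>

```python
import statistics

# Hypothetical response times in seconds; both lists have the same mean.
steady = [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.0]
spiky = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 8.2]


def summarize(samples):
    """Return (mean, worst case), with the mean rounded to two decimals."""
    return round(statistics.mean(samples), 2), max(samples)


for name, samples in (("steady", steady), ("spiky", spiky)):
    mean, worst = summarize(samples)
    print(name, "mean:", mean, "worst:", worst)
```

<p>Both sets average one second, but one request in ten in the second set takes more than eight seconds, and that is the one users will remember.</p>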
<p>Cary also made the comment noted in the title of this post. In his opinion, the performance instrumentation built into Oracle has an overhead of -10% &#8211; or even -20% or -30%, since it enables optimizations that would otherwise have been impossible to do. This is something worth noting in general &#8211; overhead that looks bad when considered as a local cost might be a net benefit in the grand scale of things, by enabling measurements and insight that let a program run much faster.</p>
<p>The ACM case study on Photoshop can be found online as a <a href="http://queue.acm.org/detail.cfm?id=1858330">resource at the ACM Queue</a>, with what seems to be mostly the same content. It was written by Clem Cole, at Intel, who interviews Russell Williams of the Photoshop team. It is very instructive to see how the Photoshop team has built an application that works well with 2 to 4 and maybe 8 cores, but that really needs to reconsider parts of its architecture to scale beyond 8.</p>
<p>Clem from Intel pushes Russell by bringing up various examples of next-generation architectures, in particular the fact that clusters-on-a-chip and NUMA memories look inevitable. The Photoshop people seem to take a wait-and-see approach to this: they first want to see some architecture have real traction in the market before they commit and rearchitect their software to make use of it.</p>
<p>The problems of debugging parallel software are also brought up. There used to be a simple bug in the asynchronous I/O system in Photoshop that took ten years to uncover!  Essentially, the programmers had not considered atomicity properly in the presence of multiple threads. With that kind of example, it is not surprising that the Photoshop programmers are very careful when planning and performing parallelizations.</p>
<p>The target domain of Photoshop is to some extent naturally parallel, but not as much as I would have thought. Since a user might operate on any part of an image, large or small, and maybe start and then abort an operation, it is not just a matter of splitting an image evenly across threads or cores. There is a significant amount of variation in just how parallel things can be in Photoshop.</p>
<p>Photoshop has had an easy-to-use parallelization system in place since around 1994, which lets programmers write simple serial computational kernels which are automatically applied to parts of an image in parallel. The Photoshop program itself takes care of the synchronization between kernels, and the kernels can be simple and robust and without any parallel code inside. This is a <a href="http://jakob.engbloms.se/archives/209">pattern that has been seen before</a>, and which does make a lot of sense &#8211; if it can be applied successfully. Apparently, this is not necessarily the easiest thing to scale beyond four cores.</p>
<p>The main limitation on Photoshop performance remains memory bandwidth, rather than raw compute performance. This also limits the need to aggressively scale to higher levels of parallelism: as long as multiple threads do not give more bandwidth, it has proven hard to use more than two or three threads on any multicore processor as that is sufficient to saturate the memory system. Apparently, this is different on the Nehalem (Core i7/i5/i3) generation of Intel multicore processors, where each core has a dedicated non-stealable slice of the memory bandwidth.</p>
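<p>The general shape of this pattern can be sketched in a few lines. This is my own illustration, not Photoshop&#8217;s actual framework: the names invert_kernel and apply_in_parallel are hypothetical, the image is a plain list of rows, and the tiles are simple bands of rows. The kernel is ordinary serial code, and all threading lives in the framework function.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def invert_kernel(tile):
    """Serial kernel: knows nothing about threads, only about its own tile."""
    return [[255 - pixel for pixel in row] for row in tile]


def apply_in_parallel(image, kernel, rows_per_tile=2, workers=4):
    """Framework side: split into bands of rows, run the kernel, reassemble."""
    tiles = [image[i:i + rows_per_tile]
             for i in range(0, len(image), rows_per_tile)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = list(pool.map(kernel, tiles))   # map preserves tile order
    return [row for tile in processed for row in tile]


image = [[x + 10 * y for x in range(4)] for y in range(5)]
result = apply_in_parallel(image, invert_kernel)
print(result == invert_kernel(image))
```

<p>In CPython the threads here will not actually speed up pure Python code, so treat this as a sketch of the structure only. The point is the division of labor: kernel authors never touch synchronization.</p>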
<p>For the near future, it seems that the big step for Photoshop is going the route of using GPUs for acceleration, rather than 10+ core main processors.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1311/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: &#8220;IMA on Simics&#8221;</title>
		<link>http://jakob.engbloms.se/archives/1304?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1304#comments</comments>
		<pubDate>Tue, 26 Oct 2010 13:00:25 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[Integrated Modular Avionics]]></category>
		<category><![CDATA[real-time]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[Tennessee Carmel-Veilleux]]></category>
		<category><![CDATA[wcet]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1304</guid>
		<description><![CDATA[I have a fairly lengthy new blog post at my Wind River blog. This time, I interview Tennessee Carmel-Veilleux, a Canadian MSc student who has done some very smart things with Simics. His research is in IMA, Integrated Modular Avionics, and how to make that work on multicore. I made some cocky comments about just [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 10px 5px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>I have a <a href="http://blogs.windriver.com/engblom/2010/10/interview-with-tennessee-carmel-veilleux.html">fairly lengthy new blog post </a>at my Wind River blog. This time, I interview <a href="http://www.tentech.ca/">Tennessee Carmel-Veilleux</a>, a Canadian MSc student who has done some very smart things with Simics. His research is in IMA, Integrated Modular Avionics, and how to make that work on multicore.</p>
<p><span id="more-1304"></span>A few years ago, I made some <a href="http://jakob.engbloms.se/archives/58">cocky comments about just how stupid the current implementations </a>of this idea are, but in my discussions with Tennessee I have realized that things are not that simple. Essentially, you are caught between the structures and strictures of the certification agencies, and the complexity of the hardware with its many shared resources making predictability and programming very difficult.</p>
<p>It is a field which touches on my old work on WCET, but with a target system that is much less amenable to analysis. I still <a href="http://jakob.engbloms.se/archives/123">think it is a good idea to try to use seas of simple cores and separate work physically </a>rather than virtually, but that might be a battle that will never be won. If nothing else, even on a massive spatially-divided multicore device, there will be some shared resources that make life difficult when only very low levels of jitter are tolerated.</p>
<p>Read the interview, and read Tennessee&#8217;s own blog &#8211; he has done some pretty cool things in hardware, software, and virtual hardware.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1304/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>VirtualBox SMP</title>
		<link>http://jakob.engbloms.se/archives/1212?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1212#comments</comments>
		<pubDate>Fri, 20 Aug 2010 18:04:40 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[CECsim]]></category>
		<category><![CDATA[SMP]]></category>
		<category><![CDATA[VirtualBox]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1212</guid>
		<description><![CDATA[I listened to an interesting FLOSS Weekly interview with Adam Hall and Achim Hasenmuller of VirtualBox. For someone interested in virtual machines and hardware simulation, the interview was full of interesting tidbits. I think the best part was the discussion on multiprocessing in VirtualBox. VirtualBox is able to give a guest OS up to 32 [...]]]></description>
			<content:encoded><![CDATA[<p>I listened to an interesting <a href="http://twit.tv/floss130">FLOSS Weekly interview </a>with Adam Hall and Achim Hasenmuller of <a href="http://www.virtualbox.org/">VirtualBox</a>. For someone interested in virtual machines and hardware simulation, the interview was full of interesting tidbits. I think the best part was the discussion on multiprocessing in VirtualBox.</p>
<p><span id="more-1212"></span>VirtualBox is able to give a guest OS up to 32 virtual processors for its use. The system virtualizes cores so that you can allocate more cores than you have on your host. You don&#8217;t want more active cores than you have physically, or performance might suffer badly (there is a <a href="http://blogs.sun.com/jsavit/entry/virtual_smp_in_virtualbox_3">good blog post hosted by blogs.sun.com about VirtualBox 3 and its SMP support</a>, read it before Oracle decides to kill off all the old content&#8230;).</p>
<p>The development of that SMP support took only some six months, but getting it to work right took 18 months. Sounds like a familiar story for parallelizing software of this kind. The interviewees made it sound like the code for this was utterly complex, and I can believe that too.</p>
<p>VirtualBox makes extensive use of hardware virtualization support on x86 hosts to enhance performance (Intel VT-x and AMD-V). As they see it, all other alternatives are inferior, including the VMware tradition of binary translation and patching. They claimed that VMware actually interpreted all of the code in a guest, but I find that a bit hard to believe.</p>
<p>It is interesting to compare their approach to something like Simics, which was made parallel in the spring of 2008 with Simics 4.0 Accelerator. The advantage an IT run-time virtual machine like VirtualBox has over a virtual platform like Simics in creating a parallel system is that it can make use of the cache coherence on the host to propagate information between the cores. A virtual platform cannot assume that the host and target are of the same type, which is necessary for this to make sense. It also makes VirtualBox entirely nondeterministic, but that is just plain normal for a physical computer system. One interesting intermediate form here is the IBM CECsim system, where a z-series mainframe is used to simulate a slightly different z-series mainframe, including running multiple simulated processors in parallel. CECsim also makes use of the hardware cache coherence and is nondeterministic.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1212/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multicore is not That Bad</title>
		<link>http://jakob.engbloms.se/archives/1207?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1207#comments</comments>
		<pubDate>Tue, 10 Aug 2010 18:24:15 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Chris Nicols]]></category>
		<category><![CDATA[David Patterson]]></category>
		<category><![CDATA[hypervisor]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1207</guid>
		<description><![CDATA[I recently read a couple of articles on multicore that felt a bit like jumping back in time. In IEEE Spectrum, David Patterson at Berkeley&#8217;s parallel computing lab brings up the issue of just how hard it is to program in parallel and that this makes the wholesale move to multicore into something like a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/02/opinion.png"><img class="alignleft size-full wp-image-654" title="opinion" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/opinion.png" alt="" width="91" height="69" /></a>I recently read a couple of articles on multicore that felt a bit like jumping back in time. In <a href="http://spectrum.ieee.org/computing/software/the-trouble-with-multicore/0">IEEE Spectrum</a>, David Patterson at <a href="http://parlab.eecs.berkeley.edu/">Berkeley&#8217;s parallel computing lab</a> brings up the issue of just how hard it is to program in parallel and  that this makes the wholesale move to multicore into something like a &#8220;<a href="http://en.wikipedia.org/wiki/Hail_Mary_pass">hail Mary pass</a>&#8221; for the computer industry. In <a href="http://www.computerworld.com.au/article/354261/what_will_do_100_cores_/">Computer World</a>, Chris Nicols at <a href="http://www.nicta.com.au/">NICTA in Australia </a>asks what <strong>you </strong>will  do with a hundred cores &#8211; implying that there is not much you can do  today. While both articles make some good points, I also think they  should be taken with a grain of salt. Things are better than they make  them seem. <span id="more-1207"></span></p>
<p>David Patterson&#8217;s article is very similar in its message to what we  used to hear five years ago as everyone woke up and panicked when  single-core computing ran out of steam. Chris Nicols is extrapolating  from the current state in desktop PCs, and asking how the programs we  run today will work when you have scores of cores rather than two or  four.</p>
<p>The main message in both articles is that software needs to adapt to multicore, and that this is not happening as quickly as it needs to. It is clear that automatic parallelization of existing code is a non-starter, and the Computer World article proposes the use of domain-specific languages (which I agree is a <a href="../archives/747">very </a><a href="../archives/157">good </a><a href="../archives/264">way </a><a href="../archives/905">to </a>go). David Patterson is less clear on what he thinks is a good programming model for multicore, leaving that as an open problem. Patterson does point out that we have quite a few success stories in parallel computing. I think it is important to not underestimate the set of problems which are amenable to parallelization. In particular, in the embedded field, we have many naturally parallel problem domains. For example, networking offers abundant parallelism as many clients, servers, and data streams are active.</p>
<p>However, both articles also miss some of the things which are happening to make multicore easier to use, especially in embedded. My favorite example of a technology that nobody seems to talk about outside of the embedded field is the hypervisor. With a hypervisor (such as the <a href="http://www.windriver.com/products/hypervisor/">Wind River Hypervisor</a>), you can take existing distributed systems and consolidate them onto a single multicore device, continuing a <a href="../archives/905">long tradition of multiple-processor programming </a>in embedded. Also, the hardware-based debug tools which are available for embedded systems &#8211; including multicore &#8211; do not seem to register at all with the mainstream researchers. It is really a shame, and hopefully we can start to change that with conferences like <a href="http://www.ecsi.me/s4d">S4D</a>, where we start to bring industrial debugging experience to the academic community.</p>
<p>Another aspect which is missing from both articles is any discussion  about the debug and test of parallel software. Coding a parallel program  is one thing, making sure it works is quite another. I think this is an  interesting problem in its own right, as exposing parallel bugs is  often just as hard as writing the buggy code to begin with. <a href="http://www.windriver.com/products/simics/">Wind River Simics </a>is  a tool that can be used to really stress multicore software, thanks to  its ability to vary configuration parameters and inject some extra  delays into the target system. Simics is also a very good tool for  debugging multicore software thanks to its controlled deterministic  execution environment. I have already discussed this in a <a href="http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html">Wind River blog post</a>, and I will not repeat the argument here.</p>
<h3>Post scriptum: setting some facts straight</h3>
<p>I  want to quickly correct some factual mistakes in the Computer World  article: the  processor power wall at 130W that he discusses is often much lower in  embedded systems &#8211; a mobile phone drawing 130W would not be very  popular, and in networking infrastructure, the magic limit today seems  to be around 30W. So it all depends, and the Itanium example is totally  irrelevant. In certain mainframe computers, the limit is higher than  that. The same is true for the claimed  upper possible clock speed limit at 4 GHz. This has been surpassed by  several companies, most notably by IBM who clocked the Power6 to 4.7 GHz  a few years ago. However, he is right that we will see cores  multiplying in almost all systems as single cores run out of speed  increases.</p>
<p>As Jack Ganssle said, us embedded folks <a href="http://www.eetimes.com/discussion/embedded-pulse/4023943/Can-t-Get-No-Respect">get no respect</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1207/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: True Concurrency is Different</title>
		<link>http://jakob.engbloms.se/archives/1151?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1151#comments</comments>
		<pubDate>Fri, 18 Jun 2010 20:24:04 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1151</guid>
		<description><![CDATA[I have another blog up at Wind River. This one is about multicore bugs that cannot happen on multithreaded systems, and is called True Concurrency is Truly Different (Again). It bounces from a recent interesting Windows security flaw into how Simics works with multicore systems.]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="button-quicklink-blogs" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>I have another blog up at Wind River. This one is about multicore bugs that cannot happen on multithreaded systems, and is called <a href="http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html#more">True Concurrency is Truly Different (Again). </a>It bounces from a recent interesting Windows security flaw into how Simics works with multicore systems.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1151/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MCC 2009 Presentations Online</title>
		<link>http://jakob.engbloms.se/archives/1023?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1023#comments</comments>
		<pubDate>Thu, 03 Dec 2009 08:29:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Andras Vajda]]></category>
		<category><![CDATA[Domain-specific languages]]></category>
		<category><![CDATA[Ericsson]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[keynote]]></category>
		<category><![CDATA[LTE]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[UpMarc]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1023</guid>
		<description><![CDATA[The presentations from the 2009 Swedish Workshop on Multicore Computing (MCC 2009) are now online at the program page for the workshop. Let me add some comments on the workshop per se. This was the first multicore event that I have been to where we did not have a keynote speaker or technical paper from [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1016" style="margin-top: 5px; margin-bottom: 5px;" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="UPMARC_700x150" width="122" height="45" />The presentations from the 2009 Swedish Workshop on Multicore Computing (MCC 2009) are now online at the <a href="http://www.it.uu.se/research/upmarc/MCC09/prog">program page for the workshop</a>. Let me add some comments on the workshop per se.</p>
<p><span id="more-1023"></span>This was the first multicore event that I have been to where we did not have a keynote speaker or technical paper from a hardware company. So there was really nothing here directly about how to build multicore chips. Rather, the workshop tended to be about how to program, use, measure performance on, verify software for, and generally work with multicore chips. From the perspective of software people, rather than hardware designers.</p>
<p>Obviously, hardware aspects enter into such talks, but it is the perspective of a user, not a designer. For example, a hardware designer could explain how an atomic compare-and-swap is optimized in a multicore device. But here, we saw measurements on the actual operation latencies observed on real machines using such operations. Quite refreshing, and closer to my personal interests.</p>
<p>The keynote by <a href="http://a-vajda.eu/blog/">Andras Vajda</a> of Ericsson was quite interesting. The slides are not online, but here are the main points that I picked up and that I might not have considered before:</p>
<ul>
<li>Software development costs can mean that the cheapest, fastest, most efficient hardware is not necessarily the most economic. Too hard to code for means the software development time and effort removes the advantage. Obvious, but worth reiterating. Software is king.</li>
<li>The workload on a cellular basestation can sometimes be highly linear and single-threaded. For example, serving a single terminal with a very high bandwidth LTE connection. And suddenly shift to a massively parallel workload as a crowd of a thousand all suddenly appear and start doing data downloads. And then go back to serial again. This means that the age-old argument that signal processing is naturally &#8220;<a href="http://www.edn.com/blog/980000298/post/50023005.html">conveniently concurrent</a>&#8221; (<a href="http://www.scdsource.com/article.php?id=87">and here</a>) is not always true. Nice point!</li>
<li>Thus, we need adaptable architectures that can trade serial and parallel performance over time, and rebalance quite quickly. In the same chip.</li>
<li>He is a firm believer that homogeneous systems will win out in the end; I still hold on to a belief in accelerators and offload engines and DSPs. This is partially because of an admitted focus on servers and services processors, and not on the baseband and signalling side. Makes sense.</li>
<li>Domain-specific languages (DSL) are the future of efficient programming. Agree.</li>
</ul>
<p>On the topic of DSLs, there was a question about the cost to support them. To me, that is a non-issue. In the organizations where I have worked, it seems that maintaining a useful DSL requires at most one engineer. Developing one takes a few good computer scientists for a fairly limited time. In any case, they tend to appear organically when good programmers <a href="http://jakob.engbloms.se/archives/747">generalize repeated tasks</a>.</p>
<p>I gave a keynote about how multicore has impacted virtual platforms (in particular, <a href="http://www.virtutech.com/products/simics">Virtutech Simics</a>) with the following main points:</p>
<ul>
<li>Multicore targets increase the performance pressure on a virtual platform, as more processors will have to be simulated.</li>
<li>Multicore hosts mean that sequential performance of the host is going down compared to the aggregate parallel performance demands from the targets.</li>
<li>To handle large target systems, the virtual platform itself has to run multithreaded on a multicore host. Getting this in place is a major, interesting, and sometimes painful process.</li>
<li>Once you have a parallel virtual platform, multicore hosts provide a very nice boost in scalability and the manageable system sizes. A single multithreaded virtual platform process is also a bit easier to manage from a user perspective.</li>
<li>All features in the virtual platform have to be multicore and multimachine-aware&#8230; meaning that they often get a bit harder to use initially, as there is no &#8220;default processor&#8221; you can fall back to for debugging setups etc. Everything has to be explicitly targeted.</li>
<li>Multicore targets have proven to  be a great sales driver for virtual platforms, as debugging software on a physical multicore, multichip, multiboard system is just too painful.</li>
</ul>
<p>Overall, this was a fun event, looking forward to next year at Chalmers!</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1023/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MCC 2009: 2D Stream Processing for Manycore</title>
		<link>http://jakob.engbloms.se/archives/1015?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1015#comments</comments>
		<pubDate>Thu, 26 Nov 2009 15:03:40 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[David Black-Schaffer]]></category>
		<category><![CDATA[efficiency]]></category>
		<category><![CDATA[manycore]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[Stanford]]></category>
		<category><![CDATA[Stream programming]]></category>
		<category><![CDATA[UpMarc]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1015</guid>
		<description><![CDATA[Today here at the MCC 2009 workshop, I heard an interesting talk by David Black-Schaffer of Stanford University. His work is on stream programming for image processing (&#8220;2D streams&#8221;). Pretty simple basic idea, to use 2D blobs of pixels as kernel inputs rather than single values or vectors. Makes eminent sense for image processing. What [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.it.uu.se/research/upmarc"><img class="alignleft size-full wp-image-1016" style="margin-top: 10px; margin-bottom: 10px;" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="UPMARC_700x150" width="122" height="45" /></a>Today here at the <a href="http://www.it.uu.se/research/upmarc/MCC09/prog">MCC 2009 workshop</a>, I heard an interesting talk by <a href="http://cva.stanford.edu/people/davidbbs/">David Black-Schaffer </a>of Stanford University. His work is on stream programming for image processing (&#8220;2D streams&#8221;). Pretty simple basic idea, to use 2D blobs of pixels as kernel inputs rather than single values or vectors. Makes eminent sense for image processing.</p>
<p><span id="more-1015"></span>What was striking was his basic attitude to the target machines: he assumed &#8220;manycore&#8221; like 100-way Tilera, 320-way ATI/AMD graphics processors, and similar devices. With this many cores, the goal is to implement an algorithm using as few cores as possible to save power. He assumes that there are always cores available, and wastes no time reducing the core count. One kernel on each core, including replicated kernels and synthesized buffering kernels.</p>
<p>This was the best example I have seen so far of an actual implementation of the idea that <strong>cores are free</strong>. For previous writing on this topic, see &#8220;<a href="http://jakob.engbloms.se/archives/269">what is efficiency when cores are free</a>&#8221; and &#8220;<a href="http://jakob.engbloms.se/archives/123">real-time control when cores are free</a>&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1015/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finally, a Bug!</title>
		<link>http://jakob.engbloms.se/archives/975?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/975#comments</comments>
		<pubDate>Sun, 25 Oct 2009 20:41:20 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Checkpointing]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[demo]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=975</guid>
		<description><![CDATA[Part of my daily work at Virtutech is building demos. One particularly interesting and frustrating aspect of demo-building is getting good raw material. I might have an idea like &#8220;let&#8217;s show how we unravel a randomly occurring hard-to-reproduce bug using Simics&#8221;. This then turns into a hard hunt for a program with a suitable bug [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/butterfly.png"><img class="alignleft size-full wp-image-982" title="butterfly" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/butterfly.png" alt="butterfly" width="90" height="91" /></a>Part of my daily work at Virtutech is building demos. One particularly interesting and frustrating aspect of demo-building is getting good raw material. I might have an idea like &#8220;let&#8217;s show how we unravel a randomly occurring hard-to-reproduce bug using <a href="http://www.virtutech.com/products/simics_hindsight.html">Simics</a>&#8221;. This then turns into a hard hunt for a program with a suitable bug in it&#8230; not for the Simics tooling to resolve the bug. For some reason, when I most need bugs, I have a hard time getting them into my code.</p>
<p>I guess it is Murphy&#8217;s law &#8212; if you really set out to want a bug to show up in your code,  your code will stubbornly be perfect and refuse to break. If you set out to build a perfect piece of software, it will never work&#8230;</p>
<p>So I was actually quite happy a few weeks ago when I started to get random freezes in a test program I wrote to show multicore scaling. It was the perfect bug! It broke some demos that I wanted to have working, but fixing the code to make the other demos work was a very instructive lesson in multicore debug that would make for a nice demo in its own right. In the end, it managed to nicely illustrate some common wisdom about multicore software. It was not a trivial problem, fortunately.</p>
<p><span id="more-975"></span>First, some notes about the program. It is a producer-consumer system using pthreads, with a single producer thread feeding a variable number of compute threads with data, over a shared queue structure (a simple one that uses a single lock to protect it, making it not very scalable for small data messages and lots of workers).</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure-2.png"><img class="aligncenter size-full wp-image-980" title="program structure 2" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure-2.png" alt="program structure 2" width="411" height="237" /></a></p>
<p>The queue contains a circular buffer, managed using a standard set of full/empty/tail/head kinds of variables. There is also a flag &#8220;done&#8221; which is set once we are out of data, to tell the compute threads to shut down and terminate the program. As this program is used to demonstrate and test scaling, it is actually something that terminates. The main program spawns off all the threads, and then waits for all threads to finish before it terminates itself.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure.png"><img class="aligncenter size-full wp-image-981" title="program structure" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/program-structure.png" alt="program structure" width="300" height="458" /></a></p>
<p>This program and the queue subsystem had worked perfectly for a long time for me, running on an MPC8641 machine with a Linux 2.6.23 kernel, with 1 to 8 cores and 1 to 16 threads. Regardless of settings like thread counts, data sizes, number of packets to compute, it always ran smoothly and terminated.</p>
<p>However, the other week, I moved the program, the exact same binary even, over to a new software stack built on a Linux 2.6.27 kernel. Still on the same MPC8641 machine. Suddenly, I started to see occasional freezes where the program would never terminate. I added some more diagnostic printouts to the program, and saw that the main program would simply freeze waiting for the other threads to terminate and report in. The freezes had no real relationship to input variables. Maybe they were a bit more common with short packets, but no real pattern emerged. They also happened randomly: running the program with the same parameters a few times in a row would sometimes result in a freeze. Using control-C to quit it and restart would keep the new instance of the program running well. Doing some other demo work, I found the same effect on a P4080 machine with 8 cores and a 2.6.30 Linux kernel.</p>
<p>This is a common pattern for parallelism bugs: they only manifest themselves as actual visible crashes or freezes or bad computation results once something in the software stack has changed, even though the fundamental issues have been there all the time. In this case, I think it was the Linux scheduler, but it is really hard to tell. Just because a program runs fine today does not mean it will run fine tomorrow.</p>
<p>After deciding to finally sit down and turn this lemon into lemonade, I had to reproduce the error. Thankfully, that is easy when you have a simulator. The first few times I had to run the target program 20 times or so before hitting the issue, but with some parameter and timing variations I managed to create a script that would open a <a href="http://jakob.engbloms.se/archives/714">checkpoint</a>, and run the program a few times under script control, triggering the bug on the fourth run (every time, thanks to determinism).</p>
<p>To diagnose the problem I wrote some Simics script code that I actually felt was fairly cool. I guessed that the problem had something to do with the queue and its handling of &#8220;done&#8221;, since that is what told the threads to terminate.</p>
<p>The first problem was that the queue was not a global variable. Instead, it was dynamically allocated on the heap by a function, and a pointer passed around, but never stored in a global variable (a good computer science graduate never uses a global variable other than as the means of last resort). Finally, my script set a breakpoint on the line in the setup function that came after the allocation. With the program stopped at that point, I could read the local variable pointing to the queue, and find and store the addresses of all the interesting members of the structure.</p>
<p>The code looked like this (Simics CLI), for the record:</p>
<pre> $mbp = ($ctx.break ($st.pos (rule30_threaded.c:222)))
 $cpu = (wait-for-breakpoint $mbp)
 $pq_addr  = ($cpu.sym "pq")
 $pq_tail  = ($cpu.sym "&amp;(pq-&gt;tail)")
 $pq_empty = ($cpu.sym "&amp;(pq-&gt;empty)")
 $pq_full  = ($cpu.sym "&amp;(pq-&gt;full)")
 $pq_head  = ($cpu.sym "&amp;(pq-&gt;head)")
 $pq_done  = ($cpu.sym "&amp;(pq-&gt;done)")</pre>
<p>Next, I set breakpoints on all writes to empty, full, and done. This was the most expedient route to catch actual puts and gets to the queue. Breakpoints on the queue_put() and queue_get() functions are not really showing the true flow, as these functions start by contending for the lock. Looking at writes to the actual queue members gave me the point where the tasks had grabbed the lock.</p>
<p>The script caught all writes to done, full, and empty, and on each write it dumped the state of the queue, including the number of elements in the circular buffer (computed by the script, without having to run any code on the target). To get an idea of who was active, it also used OS awareness to find the currently executing thread ID, and scripted debugging to convert the current program counter into a position in the program source code (actually, the important issue was the name of the function we were executing in).</p>
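<p>For illustration, the depth computation the script performed can be sketched in C. This is not the Simics script itself: the <code>queue_snapshot_t</code> type and the <code>queue_depth</code> function are hypothetical names, and the sketch assumes that <code>tail</code> is the next slot to fill and <code>head</code> the next slot to drain.</p>

```c
#include <assert.h>

/* Hypothetical snapshot of the queue's bookkeeping fields, as a debugger
 * script might read them from target memory. Field names mirror the
 * members mentioned in the post; their exact meaning is an assumption. */
typedef struct {
    long head;   /* index of the next element to remove (assumed) */
    long tail;   /* index of the next free slot to fill (assumed) */
    int  full;   /* nonzero when every slot is occupied */
    int  empty;  /* nonzero when no slot is occupied */
} queue_snapshot_t;

/* Number of elements currently stored in a circular buffer with
 * `capacity` slots. head == tail is ambiguous on its own, which is why
 * the full and empty flags are consulted first. */
long queue_depth(const queue_snapshot_t *s, long capacity) {
    assert(capacity > 0);
    if (s->full)  return capacity;
    if (s->empty) return 0;
    /* Add capacity before the modulo so a wrapped tail stays non-negative. */
    return (s->tail - s->head + capacity) % capacity;
}
```

<p>Because the computation only needs the four values read from memory, it can run entirely on the host side of the debugger, leaving the target undisturbed.</p>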
<p>This trace of activity showed quite an interesting pair of patterns. When the program ran well, the queue was mostly full, and it looked like the producer task always got some kind of priority to fill it before consumers could get in and drain it. When the program froze, the queue was seldom more than a few elements deep. This was the same program, on the same kernel, just run a few milliseconds later.</p>
<p>Clearly, the Linux kernel can exhibit quite variable behavior even for a program this simple. I guess that&#8217;s why this is called &#8220;soft real time&#8221;&#8230; Another parallelism lesson here: the scheduler is very important, and a smart adaptive scheduler can wreak havoc with software that was accidentally tuned for a different scheduler.</p>
<p>In the end, the crucial hint was that whenever the program froze, the &#8220;done&#8221; flag was set with a queue that was empty or contained just a few elements. I was sure that I had handled this case in my code, checking specifically for that and making sure to wake up the other threads with a signal that &#8220;the queue is not empty any more, please come check for more work&#8221;&#8230; but looking closely at the code, it turned out the code only woke up a single thread. Thus, the freeze resulted from the producer setting &#8220;done&#8221; with an empty queue, waking up a single compute thread, and then having the other threads wait forever for more data to be put into the queue. The fix was easy: use a broadcast signal rather than a single signal.</p>
<p>In retrospect, it seems really strange that this ever worked reliably&#8230; I almost suspect the old Linux kernel of having a flawed pthreads implementation where signals always wake up all waiting threads, and not just a single one like the documentation says. But that will wait for another day to be investigated.</p>
<p>Here is the code, for reference:</p>
<pre>void rule30_packet_queue_signal_done(rule30_packet_queue_t *q) {
 //
 // Grab lock, set the done signal atomically
 //
 pthread_mutex_lock (&amp;(q-&gt;mutex));
 q-&gt;done = 1;
 pthread_mutex_unlock (&amp;(q-&gt;mutex));
 // Signal any threads waiting for data to wake up
 // and discover that we are indeed done
 //
 // This is the bug:
 // - It only wakes up one thread...
 pthread_cond_signal (&amp;(q-&gt;notEmpty));
 // To be correct:
 // pthread_cond_broadcast (&amp;(q-&gt;notEmpty));
}</pre>
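<p>To see the corrected shutdown in isolation, here is a minimal self-contained sketch. It is not the rule30 program: <code>demo_queue_t</code>, <code>worker</code>, <code>signal_done</code>, and <code>run_shutdown_demo</code> are hypothetical names, and the queue never holds any work, so every worker can only leave by observing the done flag.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16

/* Stripped-down stand-in for the packet queue: only the fields that
 * matter for shutdown are kept. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  not_empty;
    int empty;   /* 1 while no work is queued (always 1 in this demo) */
    int done;    /* set once by the producer */
    int exited;  /* number of workers that observed done and left */
} demo_queue_t;

static void *worker(void *arg) {
    demo_queue_t *q = arg;
    pthread_mutex_lock(&q->mutex);
    /* Same wait loop shape as the compute threads in the post. */
    while (q->empty && !q->done) {
        pthread_cond_wait(&q->not_empty, &q->mutex);
    }
    q->exited++;
    pthread_mutex_unlock(&q->mutex);
    return NULL;
}

static void signal_done(demo_queue_t *q) {
    pthread_mutex_lock(&q->mutex);
    q->done = 1;
    pthread_mutex_unlock(&q->mutex);
    /* Broadcast so that every waiter wakes up, not just one. */
    pthread_cond_broadcast(&q->not_empty);
}

/* Starts nworkers threads, signals done, and returns how many exited. */
int run_shutdown_demo(int nworkers) {
    assert(nworkers > 0 && nworkers <= MAX_WORKERS);
    demo_queue_t q;
    pthread_t tid[MAX_WORKERS];
    pthread_mutex_init(&q.mutex, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    q.empty = 1;
    q.done = 0;
    q.exited = 0;
    for (int i = 0; i < nworkers; i++) {
        pthread_create(&tid[i], NULL, worker, &q);
    }
    signal_done(&q);
    for (int i = 0; i < nworkers; i++) {
        pthread_join(tid[i], NULL);
    }
    int exited = q.exited;
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.mutex);
    return exited;
}
```

<p>With the broadcast in place, <code>run_shutdown_demo(5)</code> returns 5 regardless of how many workers had reached the conditional wait when done was set. Swapping in <code>pthread_cond_signal</code> would be expected to leave some of the joins hanging whenever more than one worker is already waiting.</p>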
<p><em>Updated analysis:</em></p>
<p>My initial analysis was that when things worked, the &#8220;done&#8221; flag was set with enough data left in the queue that all threads had a chance to pull in data and come in and see the done flag being set.</p>
<p>However, today I went back and wrote a deeper analysis script that also checked for reads from the done flag (turning this check on only after the write to &#8216;done&#8217; to reduce the noise). I expected there to be a single reader when the freeze happened&#8230; but that was not the case. In my current test case, three out of five threads actually got in to read the done flag and terminate.  The crucial code for the compute threads looks like this:</p>
<pre> // Grab mutex,
 //   Check if the queue is empty, if so wait for someone
 //   to push something onto the queue, or signal done.
 //   both of which are done by signaling the notEmpty condition variable
 pthread_mutex_lock (&amp;(queue-&gt;mutex));
 while ((queue-&gt;empty) &amp;&amp; !(queue-&gt;done)) {
   pthread_cond_wait (&amp;(queue-&gt;notEmpty), &amp;(queue-&gt;mutex));
 }</pre>
<p>To freeze, a thread actually has to be doing the conditional wait here. There are plenty of other places threads can be as the program is finishing. For example, they can be waiting to grab the initial mutex lock, or actually doing compute work. That explains why some threads actually still terminate even with the buggy version. It certainly also illustrates just how chaotic concurrent programs can be. More so than you can ever imagine, really.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/975/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Ericsson Blog Post about DSL</title>
		<link>http://jakob.engbloms.se/archives/976?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/976#comments</comments>
		<pubDate>Sun, 25 Oct 2009 19:29:19 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Andras Vajda]]></category>
		<category><![CDATA[Domain-specific languages]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=976</guid>
		<description><![CDATA[Andras Vajda of Ericsson wrote an interesting blog post on domain-specific languages (DSLs). Thanks for some success stories and support in what sometimes feels like an uphill battle trying to make people accept that DSLs are a large part of the future of programming. In particular for parallel computing, as they let you hide the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/10/ericsson_logo.gif"><img class="alignleft size-full wp-image-977" title="ericsson_logo" src="http://jakob.engbloms.se/wp-content/uploads/2009/10/ericsson_logo.gif" alt="ericsson_logo" width="127" height="62" /></a><a href="http://a-vajda.eu/blog/?p=184">Andras Vajda of Ericsson wrote an interesting blog post on domain-specific languages (DSLs).</a> Thanks for some success stories and support in what sometimes feels like an uphill battle trying to make people accept that DSLs are a large part of the future of programming. In particular for parallel computing, as they let you hide the complexities of parallel programming.</p>
<p><span id="more-976"></span></p>
<p>Quotes from the blog post:</p>
<p>Sequential languages hiding parallelism:</p>
<blockquote><p>Sequential imperative languages are notoriously good at hiding or at least obfuscating inherent parallelism in the applications – which may be good news for compiler providers who spend – and charge – big bucks for building reverse engineering technologies that can detect parallelism from sequential code; as well as for highly skilled programmers capable of building parallel code using sequential tools; however, it weights heavily on the total cost of developing software, not to mention the recurring cost of porting to newer HW with e.g. more cores.</p></blockquote>
<p>Applying DSLs to signal processing:</p>
<blockquote><p>During the work on the domain-specific language for DSP programming we have seen some interesting results; for example, several pages of optimized C code were re-written to one PowerPoint slide worth of DSL code, while the DSL to C compiler was able to output efficient code comparable in size to the original code.</p></blockquote>
<p>I agree fully with the final note:</p>
<blockquote><p>The big question is not anymore if DSLs will take off – it’s more if your pain level is higher than the cost of accepting an alternative and different approach; as the cost for taking in a new language tends to stay constant while the complexity of developing software is increasing, odds are that there will be an uptake of domain-specific languages.</p></blockquote>
<p>Go and read the full article!</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/976/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SiCS Multicore Day 2009</title>
		<link>http://jakob.engbloms.se/archives/905?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/905#comments</comments>
		<pubDate>Mon, 07 Sep 2009 19:26:27 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[Anders Landin]]></category>
		<category><![CDATA[CPP]]></category>
		<category><![CDATA[Ericsson]]></category>
		<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Hazim Shafi]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[Richard Kaufmann]]></category>
		<category><![CDATA[SiCS Multicore days]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[Visual Studio 2010]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=905</guid>
		<description><![CDATA[Last Friday, I attended this year&#8217;s edition of the SiCS Multicore Day. It was smaller in scale than last year, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from Hazim Shafi of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a [...]]]></description>
			<content:encoded><![CDATA[<p>Last Friday, I attended this year&#8217;s edition of the <a href="http://www.sics.se/node/4360">SiCS Multicore Day</a>. It was smaller in scale than <a href="http://jakob.engbloms.se/archives/283">last year</a>, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from <a href="http://blogs.msdn.com/hshafi/">Hazim Shafi </a>of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a mid-day three-track session with research and industry talks from the Swedish multicore community.<span id="more-905"></span></p>
<p>I think that for next year, the organizers need to find keynote speakers that are not from the general computing multicore world. The Microsoft talk this year was a step in that direction, as it came from multicore programming rather than multicore hardware. Richard and Anders gave very interesting and good talks, no doubt about it. But it would have been nice to have someone from ARM or Freescale or Tensilica or TI or ST or Ericsson or Cisco talking about the kinds of multicore embedded hardware that are being developed and used today. For example, the &#8220;next new thing&#8221; touted by the keynotes this year was GPGPU. Interesting for HPC and desktops, certainly. But pretty irrelevant for most of the people that I know. GPUs are huge, expensive, and power hungry.</p>
<p>GPGPU was one part of the theme this year. It is definitely catching on as <em>the </em>way to do number crunching in the desktop, server, and HPC world. It is not the universal panacea for any kind of parallelism, however, as Hazim and I noted in the panel discussion that ended the day. There are applications (such as <a href="http://www.virtutech.com/whitepapers/accelerator.html">parallel Simics</a>&#8230;) that scale well on general-purpose cores, but that will never ever work on GPUs. In general, the class of problems that work on GPUs is pretty limited to massive data-parallel problems like image and video manipulation.</p>
<p>In the eternal homogeneous vs heterogeneous debate (follow <a href="http://jakob.engbloms.se/archives/tag/homogeneous">the tags </a>in my blog for more posts on this topic), GPGPU was grudgingly accepted as a good candidate for something that will not be homogenized with the main processors. Additionally, Richard Kaufmann gave some hints that Intel or AMD are coming out with new chips with more accelerators on board&#8230; I guess it will be security, as is already done by Sun and <a href="http://jakob.engbloms.se/archives/80">IBM</a>. When I brought up the topic of more accelerators like pattern matching, compression, and the other things we see in chips from Freescale, Cavium, and others, the response was very &#8220;can only be economical for very high volume applications&#8221;.</p>
<p>It is striking how the GPGPU idea is bringing the classic telecommunications DSP-data plane/CPU-control plane division into the desktop and server space. Without any recognition being paid or any experience being reused from the 40 years that that has been done in telecoms and consumer electronics&#8230; as Jack Ganssle often says, us embedded folks get no respect.</p>
<p>In terms of programming, this year was all about general programming languages. Hazim from Microsoft talked about (and demoed) the quite pervasive addition of parallelism to both native C/C++ and managed .net code in Visual Studio 2010. Microsoft is dead serious about parallel programming, and is bringing out a whole set of different libraries and support structures to allow <a href="http://blogs.msdn.com/pfxteam/archive/2009/08/12/9867246.aspx">easier expression of parallel code</a>. In the &#8220;LINQ&#8221; data query language subset of C#, you could add some easy modifiers to &#8220;foreach&#8221; statements to make them parallel, for example. Having a language that is your own and which you can extend at will certainly pays off in terms of innovation here. C++ moves far slower than C#; that is becoming clearer and clearer. C# and its cousins in the .net system seem to be sneaking in lots of powerful language design ideas from places like Python, and also results from Microsoft&#8217;s powerful group of language researchers.</p>
<p>When I tried to bring up the idea of using domain-specific languages to program parallel applications, Hazim had the wonderful comment that &#8220;that might be applicable in certain domains&#8230;&#8221; &#8212; yes, that is the idea. By being narrow in terms of target domains, you gain expressive power and semantic insight that helps move programming from &#8220;how&#8221; towards &#8220;what&#8221;. But it sounds like domain-specific is a foul word inside of Microsoft &#8212; when the audience asked whether LINQ was not exactly a domain-specific language for data access, Hazim was at pains to point out that it is Turing-complete and that someone had managed to write a raytracer using it&#8230; interesting. This feels more political than market-based. I guess Micro</p>
<p>Richard Kaufmann had some interesting notes on throughput vs TTC (time-to-completion) jobs in servers. In the &#8220;cloud computing&#8221; era, throughput is much easier to scale: just add more servers. Classic HPC is more oriented towards TTC, as you do want your results within a reasonable time. Quite often, you can turn most work into a throughput-oriented style by simply running lots of jobs in parallel rather than pushing through a series of jobs sequentially. Note however that we have the entire field of real-time control, real-time communications, etc., that do not work like this. But that is not the market that HP is building servers for, or that Intel and AMD are servicing.</p>
<p>Outside the keynotes, Per Holmberg of Ericsson gave an interesting presentation on the adoption of multicore in the control plane of the <a href="http://www.ericsson.com/ericsson/corpinfo/publications/review/2002_02/161.shtml">Ericsson CPP </a>platform. The core of his talk was the observation that in these kinds of systems, multicore is not such a big revolution.</p>
<p>They have been distributed since the beginning. Thus, scaling by adding more processors (with local memories) is easy and multicore is only a packaging change from that. Also, most performance-intense operations are already offloaded onto DSP groups, network processors, ASICs, or FPGAs. There is not much parallelism left for the control plane to exploit. Essentially, only functions that unexpectedly become performance bottlenecks due to changes in traffic patterns are likely candidates for parallelization. Interesting point, and might be <a href="http://jakob.engbloms.se/archives/703">why the EETimes noted that multicore is slow to catch on in communications </a>(the article is a bit flawed).</p>
<p>Patrik Nyblom from Ericsson held a talk about how the <a href="http://www.erlang.org">Erlang </a>runtime engine was parallelized. From a practical perspective, the most interesting aspect was that this made applications parallel without changing a single line of code in the applications. Of course, applications had to be threaded to start with, but that is the most natural way in Erlang. He mentioned systems containing up to a quarter of a million threads &#8212; hard to do that in anything except Erlang.</p>
<p>He described how they had evolved from a simple implementation that worked well on synthetic benchmarks to a truly industrial-strength implementation. The difference was quite radical, as real codes feature more complex communications patterns, and make heavy use of device drivers and network stacks. This process forced the use of more and finer locks, and rethinking the balance between shared and separate heaps for threads.</p>
<p>They also had the opportunity to test their solution on a Tilera 64-core machine. This mercilessly exposed any scalability limitations in their system, and proved the conventional wisdom that going beyond 10+ cores is quite different from scaling from 1 to 8&#8230; The two key lessons they learned were that <em>no shared lock goes unpunished, </em>and <em>data has to be distributed as well as code.</em> Very interesting to hear this story from real software developers solving real problems.</p>
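<p>The two lessons can be illustrated outside Erlang with a small pthreads sketch. It has nothing to do with the actual Erlang runtime: <code>slice_t</code>, <code>sum_slice</code>, and <code>parallel_sum</code> are hypothetical names, and summing an array merely stands in for real work. Each thread owns a slice of the data and a private accumulator, so no lock is shared while the work runs, and the partial results are merged only after the joins.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Per-thread work description: each thread gets its own slice of the
 * input and its own result slot, so nothing is contended while running. */
typedef struct {
    const int *data;
    size_t begin;
    size_t end;
    long partial;
} slice_t;

static void *sum_slice(void *arg) {
    slice_t *s = arg;
    long local = 0;  /* accumulate privately, publish once at the end */
    for (size_t i = s->begin; i < s->end; i++) {
        local += s->data[i];
    }
    s->partial = local;
    return NULL;
}

/* Sums n integers using nthreads threads without any shared lock. */
long parallel_sum(const int *data, size_t n, int nthreads) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];
    for (int t = 0; t < nthreads; t++) {
        slice[t].data = data;
        slice[t].begin = n * (size_t)t / (size_t)nthreads;
        slice[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        slice[t].partial = 0;
        pthread_create(&tid[t], NULL, sum_slice, &slice[t]);
    }
    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);  /* the join orders the read of partial */
        total += slice[t].partial;
    }
    return total;
}
```

<p>The alternative, a single total protected by one mutex that every thread takes on every element, gives the same answer but serializes the threads on that lock, which is presumably the kind of limitation a 64-core machine exposes.</p>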
<p>The next multicore event taking place around here is the Second <a href="http://www.it.uu.se/research/upmarc/MCC09">Swedish Workshop on Multicore Computing </a>(MCC 2009), in Uppsala, November 26-27.</p>
<p>Update: note that the presentations from the event are available via <a href="http://www.multicore.se/">http://www.multicore.se/</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/905/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Øredev 2009: Meanwhile, Parallel</title>
		<link>http://jakob.engbloms.se/archives/913?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/913#comments</comments>
		<pubDate>Mon, 07 Sep 2009 06:52:21 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Öredev]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=913</guid>
		<description><![CDATA[Øredev is the premier software development conference in Sweden and Europe (they claim). I gave some presentations there in 2006 and 2007, but since then they have dropped the general embedded software development track and just focused on programming for mobile phones. Most of the material is &#8220;general IT&#8221;. If you are doing software development [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://oredev.org/"><img class="size-full wp-image-50 alignleft" style="margin: 5px 10px;" title="Öredev logo" src="http://jakob.engbloms.se/wp-content/uploads/2007/11/oredev.png" alt="Öredev logo" width="256" height="49" />Øredev </a>is the premier software development conference in Sweden and Europe (they claim). I gave some presentations there in 2006 and 2007, but since then they have dropped the general embedded software development track and just focused on programming for mobile phones. Most of the material is &#8220;general IT&#8221;. If you are doing software development on the desktop or for servers, it is a good place to go to learn new things from the general world of computing.</p>
<p><span id="more-913"></span></p>
<p>Anyhow, I looked at this year&#8217;s program to see how they developed it. In particular, it was striking that their main set of topics (<a href="http://oredev.org/">see front page</a>) did <em>not</em> include parallel programming or multicore.</p>
<p>However, there is a session called <a href="http://518.nu/Prod/Oredev/site.nsf/docsbycodename/tracks!OpenDocument&amp;day=5&amp;track=2556B90C592E1E23C12575A500499CC6"><em>Meanwhile</em></a>, which includes five talks on parallel programming.</p>
<p>Interesting. I guess this is indicative of how little impact parallel programming has yet had on the general world of IT programming.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/913/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Downloadable Book about Embedded Multicore</title>
		<link>http://jakob.engbloms.se/archives/877?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/877#comments</comments>
		<pubDate>Sat, 08 Aug 2009 19:27:08 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[John Logan]]></category>
		<category><![CDATA[Jonas Svennebring]]></category>
		<category><![CDATA[Patrik Strömblad]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=877</guid>
		<description><![CDATA[Freescale has now released the collected, updated, and restyled book version of the article series on embedded multicore that I wrote last year together with Patrik Strömblad of Enea, and Jonas Svennebring, and John Logan of Freescale. The book covers the basics of multicore software and hardware, as well as operating systems issues and virtual [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.freescale.com"><img class="alignleft size-full wp-image-878" style="margin-left: 5px; margin-right: 5px;" title="freescale-logo-icon" src="http://jakob.engbloms.se/wp-content/uploads/2009/08/freescale-logo-icon.png" alt="freescale-logo-icon" width="80" height="80" /></a>Freescale has now released the collected, updated, and restyled <a href="http://www.freescale.com/files/32bit/doc/ref_manual/EMBMCRM.pdf">book version </a>of the article series on embedded multicore that I <a href="http://jakob.engbloms.se/archives/423">wrote last year </a>together with Patrik Strömblad of <a href="http://www.enea.com">Enea</a>, and Jonas Svennebring, and John Logan of <a href="http://www.freescale.com">Freescale</a>. The book covers the basics of multicore software and hardware, as well as operating systems issues and virtual platforms. Obviously, the virtual platform part was my contribution.</p>
<p><span id="more-877"></span></p>
<p>It is one of the more comprehensive introductions to how to think about and use multicore architectures in the high-end embedded space. The PDF is free to download and print, and if you want a printed copy, one can be ordered for (I am told) 15 USD (I did not try it myself).</p>
<p>The PDF is at <a href="http://www.freescale.com/files/32bit/doc/ref_manual/EMBMCRM.pdf">http://www.freescale.com/files/32bit/doc/ref_manual/EMBMCRM.pdf </a>.</p>
<p>It will also be linked from the &#8220;Documentation&#8221; section for most Freescale multicore chips&#8217; information pages.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/877/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>StackOverflow interviews CouchDB</title>
		<link>http://jakob.engbloms.se/archives/830?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/830#comments</comments>
		<pubDate>Tue, 07 Jul 2009 18:29:55 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[desktop software]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[couchDB]]></category>
		<category><![CDATA[Damien Katz]]></category>
		<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Jan Lehnard]]></category>
		<category><![CDATA[Jeff Atwood]]></category>
		<category><![CDATA[Joel Spolsky]]></category>
		<category><![CDATA[parallelized software]]></category>
		<category><![CDATA[stackoverflow.com]]></category>
		<category><![CDATA[transactions]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=830</guid>
		<description><![CDATA[Last year, FLOSS Weekly interviewed Jan Lehnard of the CouchDB project. I put up a blog post on this, noting that it was interesting with a scalable parallel program written in Erlang, a true concurrent language. The interview was interesting,  but not very deeply technical. Now, almost a year later, the StackOverflow podcast, number 59, [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-237" style="margin: 5px 10px;" title="couchdb" src="http://jakob.engbloms.se/wp-content/uploads/2008/08/couchdb.png" alt="couchdb" width="158" height="96" />Last year, <a href="http://www.twit.tv/floss">FLOSS Weekly </a>interviewed Jan Lehnard of the CouchDB project. I put up a <a href="http://jakob.engbloms.se/archives/236">blog post </a>on this, noting that it was interesting to see a scalable parallel program written in Erlang, a truly concurrent language. The interview was interesting, but not very deeply technical. Now, almost a year later, <a href="http://blog.stackoverflow.com/category/podcasts/">the StackOverflow podcast</a>, <a href="http://blog.stackoverflow.com/2009/06/podcast-59/">number 59,</a> interviewed the founder of the project, Damien Katz. This interview goes a bit more into the technical details and what CouchDB is good for and what not, as well as some details on the use and performance of Erlang.</p>
<p><span id="more-830"></span>An interesting point made is that the lightweight user-level threading of the Erlang virtual machine is optimized for massively threaded workloads. The key property is that the context of each thread is very small compared to that of an OS-level application thread (such as a pthread), which makes a context switch dramatically cheaper, since less cache and TLB content needs to be swapped in and out. Thus, with lots of threads, Erlang tends to get more work done per unit of time, as less execution time is lost to friction in the memory system. I am not sure you can emulate this in C using a user-level package. The very small initial stack and heap sizes in the Erlang VM are possible partly because a VM has more insight into and control over when memory allocation happens, and can thus more easily grow the stack and heap in small units.</p>
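<p>To make the idea of a very small per-thread context concrete, here is a minimal sketch in Python rather than Erlang. It is an analogy only: the tasks are generators scheduled cooperatively, whereas the Erlang VM preempts its processes, and the names <code>worker</code> and <code>run_round_robin</code> are made up for this illustration.</p>

```python
from collections import deque


def worker(name, steps):
    """A lightweight task: its whole saved context is one generator frame."""
    for i in range(steps):
        # Yielding hands control back to the scheduler (cooperative, not preemptive).
        yield f"{name}:{i}"


def run_round_robin(tasks):
    """User-level scheduler: a 'context switch' is just a call to next()."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))
            ready.append(task)  # still has work, so requeue it
        except StopIteration:
            pass  # task finished, drop it
    return trace


# Two tasks interleave one step at a time.
trace = run_round_robin([worker("a", 2), worker("b", 3)])

# Ten thousand tasks are no problem, since each one is only a small heap object.
many = run_round_robin([worker(str(n), 1) for n in range(10000)])
```

<p>No kernel thread or full-size stack is allocated per task here, which is the property the paragraph above credits for the cheap context switches.</p>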
<p>Another interesting aspect of Erlang as opposed to C/C++ brought out in the interview is error handling. In Erlang, this is part of the language, while in C/C++, writing code to handle all cases (and handle them correctly) quickly gets painful and overwhelming. In Erlang, the system policy is instead to kill any thread that does something bad and restart it. With that simple strategy imposed on you, the code gets much simpler.</p>
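<p>The kill-and-restart policy can be sketched roughly as follows, again in Python and purely as an illustration. The function <code>supervise</code>, the <code>max_restarts</code> limit and the toy parser are assumptions made for this example, not Erlang or OTP APIs.</p>

```python
def supervise(make_worker, jobs, max_restarts=3):
    """Run jobs through a worker; on any failure, discard it and start a fresh one."""
    worker = make_worker()
    restarts = 0
    results = []
    for job in jobs:
        try:
            results.append(worker(job))
        except Exception:
            # No local recovery logic: throw away the failed worker and its state.
            restarts += 1
            if restarts > max_restarts:
                raise  # give up once the (assumed) restart budget is used
            worker = make_worker()
    return results, restarts


def make_parser():
    # Hypothetical worker that only handles the happy path, with no defensive code.
    return lambda text: int(text)


# The bad input kills one worker; a fresh one carries on with the next job.
results, restarts = supervise(make_parser, ["1", "oops", "3"])
```

<p>The worker itself contains no error handling at all, which is where the simplification comes from.</p>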
<p><img class="alignright size-full wp-image-300" title="stackoverflowlogo250hq2" src="http://jakob.engbloms.se/wp-content/uploads/2008/10/stackoverflowlogo250hq2.png" alt="stackoverflowlogo250hq2" width="47" height="61" />The podcast also brought up <a href="http://stackoverflow.com/questions/299723/can-i-do-transactions-and-locks-in-couchdb">a StackOverflow question about CouchDB </a>that resulted in a good explanation of the concurrency model (optimistic concurrency on entire documents, and nothing smaller or larger than that). Damien Katz came in with some more insights on transactions and CouchDB, in a discussion on how to solve the classic bank account problem: moving money from one account to another. The &#8220;ACID&#8221; solution is to make sure that changes to two accounts are always both done or none done. The CouchDB solution is to put in a record of the account-to-account money transfer (I won&#8217;t use the word &#8220;transaction&#8221; as that is overloaded in this context) in the database, and just go through all records pertaining to a particular account to arrive at its current balance. That does feel more like proper bookkeeping practice, rather than having a single unauditable balance in an account record&#8230;</p>
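<p>The transfer-record approach can be sketched in a few lines of Python. This is a sketch under assumptions: the <code>Transfer</code> fields and the <code>balance</code> function are invented for illustration, and a plain list stands in for the database. In CouchDB itself the summing would presumably be done inside the database rather than in client code.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One immutable transfer document; the field names are illustrative."""
    source: str
    target: str
    amount: int  # smallest currency unit, to avoid float rounding


def balance(account, transfers):
    """Derive a balance by replaying every transfer that touches the account."""
    total = 0
    for t in transfers:
        if t.target == account:
            total += t.amount
        if t.source == account:
            total -= t.amount
    return total


# Moving money means adding exactly one document, so no two-record update is needed.
ledger = [
    Transfer("bank", "alice", 100),
    Transfer("alice", "bob", 30),
]

alice_balance = balance("alice", ledger)
bob_balance = balance("bob", ledger)
```

<p>Because every transfer is a single document, the write is atomic at the only granularity the database offers, and the full history stays available for auditing.</p>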
<p>Overall, well worth the time to listen to.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/830/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Article in ECNmag about Multicore and Virtual Platforms</title>
		<link>http://jakob.engbloms.se/archives/807?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/807#comments</comments>
		<pubDate>Tue, 09 Jun 2009 06:46:49 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[ECNmag]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=807</guid>
		<description><![CDATA[I have a short article on multicore systems development and virtual platforms in the May 2009 issue of ECN magazine, over at www.ecnmag.com.]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-808" style="margin: 5px;" title="ecn_logos" src="http://jakob.engbloms.se/wp-content/uploads/2009/06/ecn_logos.gif" alt="ecn_logos" width="84" height="52" />I have a short article on <a href="http://www.ecnmag.com/article-cover-story-Virtual-Platforms-051509.aspx">multicore systems development and virtual platforms </a>in the May 2009 issue of ECN magazine, over at <a href="http://www.ecnmag.com">www.ecnmag.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/807/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallelism in Action</title>
		<link>http://jakob.engbloms.se/archives/793?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/793#comments</comments>
		<pubDate>Sun, 24 May 2009 12:53:27 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Embarrassingly Parallel]]></category>
		<category><![CDATA[iPod]]></category>
		<category><![CDATA[Nero]]></category>
		<category><![CDATA[parallelized software]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=793</guid>
		<description><![CDATA[Last year in a blog post on video encoding for the iPod Nano, I complained about the lack of performance on my old Athlon. A bit later, I noted that (obviously) video encoding is a good example of an application that can take advantage of parallelism. Yesterday I put these two topics together in a [...]]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-125 alignleft" style="margin: 5px;" title="coreshrink1" src="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png" alt="Shrinking cores" width="100" height="100" /></p>
<p>Last year in a blog post on <a href="http://jakob.engbloms.se/archives/28">video encoding for the iPod Nano</a>, I complained about the lack of performance on my old Athlon. A bit later, I noted that (obviously) <a href="http://jakob.engbloms.se/archives/31">video encoding is a good example of an application that can take advantage of parallelism</a>. Yesterday I put these two topics together in a practical test. And it worked nicely enough.</p>
<p><span id="more-793"></span></p>
<p>My new Core i7 920-based machine was very well utilized by the Nero 8 suite&#8217;s Nero Recode 3 application when converting some children&#8217;s movies for use on my Nano. Here is a screenshot of the CPU load at one point in the computation:</p>
<p><img class="aligncenter size-full wp-image-794" title="skarmklipp" src="http://jakob.engbloms.se/wp-content/uploads/2009/05/skarmklipp.png" alt="skarmklipp" width="162" height="139" />It was much higher than this at times, but capturing that using the <a href="http://jakob.engbloms.se/archives/580">snipping tool </a>was harder than expected.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/793/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>EETimes.com &#8211; Multicore CPUs face slow road in comms</title>
		<link>http://jakob.engbloms.se/archives/703?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/703#comments</comments>
		<pubDate>Sun, 22 Mar 2009 21:16:36 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Cavium]]></category>
		<category><![CDATA[Communications market]]></category>
		<category><![CDATA[EETimes]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[Linley Gwennap]]></category>
		<category><![CDATA[Multicore Expo]]></category>
		<category><![CDATA[Octeon]]></category>
		<category><![CDATA[p4080]]></category>
		<category><![CDATA[PowerQUICC]]></category>
		<category><![CDATA[qoriq]]></category>
		<category><![CDATA[Rick Merritt]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=703</guid>
		<description><![CDATA[The EETimes article Multicore CPUs face slow road in comms piqued my interest. There is an interesting chart in there about just how slow more-than-one-core processors will be in penetrating a vaguely defined &#8220;comms&#8221; marketplace. I can believe that, but I think their comments on the PowerQUICC series require some commentary&#8230; Essentially, the article [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=215901460"><img class="alignleft size-full wp-image-155" title="eetimes logo" src="http://jakob.engbloms.se/wp-content/uploads/2008/07/eetimes.png" alt="eetimes logo" width="127" height="56" /></a>The EETimes article <a href="http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=215901460">Multicore CPUs face slow road in comms</a> piqued my interest. There is an interesting chart in there about just how slow more-than-one-core processors will be in penetrating a vaguely defined &#8220;comms&#8221; marketplace. I can believe that, but I think their comments on the PowerQUICC series require some commentary&#8230;</p>
<p><span id="more-703"></span>Essentially, the article is a report on a talk by Linley Gwennap at the Multicore Expo last week. The most interesting point is that simple single-core processors are taking over from the traditional heterogeneous processor + big accelerator pattern exemplified by the Freescale PowerQUICC series, and that even Freescale themselves are &#8220;replacing&#8221; PowerQUICC chips based on the venerable CPM with &#8220;simpler dual-core chips&#8221;, which has to mean the MPC8572 currently and probably the QorIQ P2000-series chips later on.</p>
<p>The main point is that people are moving away from the &#8220;complexities&#8221; of the CPM-style heterogeneous hardware design, to symmetric multiprocessing designs that are simpler in one way. But harder to program when you want to have a regular old program use more than one core, as we all well know. It is a good question whether this is actually the case: I am not too sure that correctly writing a parallel threaded program for a shared-memory multiprocessor is easier than calling a hardware accelerator API or using a heterogeneous architecture&#8230; more familiar to general-purpose programmers, sure. But easier? Not necessarily.</p>
<p>What I do take some issue with is the implication that the quad-core and dual-core processors expanding into the market, in Gwennap&#8217;s opinion, do <em>not </em>have these &#8220;complex hardware accelerator APIs&#8221;&#8230; all the hardware I have seen for the comms field certainly features very powerful offload and acceleration engines for tasks like network interfacing, TCP/IP processing, regular expression matching, security computations, etc.</p>
<p>Look at the feature sets of chips like the Freescale QorIQ P4080 or the Cavium Octeon Plus CN58xx family: their core acceleration engines look every bit as complex as the old CPM to me. The programming might be a bit different, their presence spun as accelerators rather than as a processor in the marketing talk, but still they are complex acceleration blocks that definitely have a lot of power. They also seem quite intent on staying and proliferating, and not going away. I see no sign that the future of computing is anything but <a href="http://jakob.engbloms.se/archives/44">lots of programmable cores augmented by lots of accelerators. </a>The benefits of heterogeneous architectures in terms of power, throughput, and chip size are simply too compelling.</p>
<p>Also interesting in the article is the claim that the poor state of software is slowing the adoption of multicore, which in turn means that software stacks get more time than could be expected to adapt to multicore and truly parallel hardware.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/703/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enea and Freescale Article on SMP OS</title>
		<link>http://jakob.engbloms.se/archives/664?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/664#comments</comments>
		<pubDate>Tue, 24 Feb 2009 09:43:16 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[AMP]]></category>
		<category><![CDATA[Enea]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[Jonas Svennebring]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[mpc8572e]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[OSE]]></category>
		<category><![CDATA[p4080]]></category>
		<category><![CDATA[Patrik Strömblad]]></category>
		<category><![CDATA[SMP]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=664</guid>
		<description><![CDATA[Elektronik i Norden just published a technical insight article about the SMP kernels of Enea OSE and Linux, by Patrik Strömblad and Jonas Svennebring. It has a nice discussion about AMP and SMP, and OS scheduling policies. It is particularly interesting to see how OSE tries to combine the two. Unfortunately, the article is in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.elinor.se">Elektronik i Norden </a>just published a <a href="http://www.webbkampanj.com/ein/0903/?page=51">technical insight article </a>about the <a href="http://www.enea.com/templates/Extension____24922.aspx?headline=http://cws.huginonline.com/E/1059/PR/200811/1267022.xml">SMP kernels </a>of <a href="http://www.enea.se">Enea </a>OSE and Linux, by Patrik Strömblad and Jonas Svennebring.</p>
<p><span id="more-664"></span>It has a nice discussion about AMP and SMP, and OS scheduling policies. It is particularly interesting to see how OSE tries to combine the two. Unfortunately, the article is in Swedish, but I would expect that the CMP network that Elektronik i Norden is part of will publish this article in English in EETimes or some other publication of theirs.</p>
<p>The article discusses some Freescale targets, such as my favorite, the MPC8641D, as well as the dual-core MPC8572E and the upcoming QorIQ P4080.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/664/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Adding to Schirrmeister&#8217;s Virtual Platform Myth Busting</title>
		<link>http://jakob.engbloms.se/archives/651?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/651#comments</comments>
		<pubDate>Wed, 18 Feb 2009 12:22:43 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[clock-cycle models]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[Eve]]></category>
		<category><![CDATA[Frank Schirrmeister]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[Grant Martin]]></category>
		<category><![CDATA[Lauro Ritazzi]]></category>
		<category><![CDATA[p4080]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[software tools]]></category>
		<category><![CDATA[Synopsys]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=651</guid>
		<description><![CDATA[Frank Schirrmeister of Synopsys recently published a blog post called &#8220;Busting Virtual Platform Myths – Part 1: “Virtual Platforms are for application software only”. In it, he is refuting a claim by Eve that virtual platforms are for application-level software-development only, basically claiming that they are mostly for driver and OS development and citing some [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-654" style="margin: 10px;" title="opinion" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/opinion.png" alt="opinion" width="91" height="69" />Frank Schirrmeister of Synopsys recently published a blog post called <a href="http://www.synopsysoc.org/viewfromtop/?p=64#comment-1008">&#8220;Busting Virtual Platform Myths – Part 1: “Virtual Platforms are for application software only”</a>. In it, he refutes a claim by Eve that virtual platforms are for application-level software development only, basically claiming that they are mostly for driver and OS development and citing some Synopsys-Virtio Innovator examples of such uses. In his view, most application software is being developed using host-compiled techniques. I want to add to this rebuttal by pointing out that application software is surely a very important, and large, use case for virtual platforms.</p>
<p><span id="more-651"></span>The beginning of the argument was found in an <a href="http://www.edadesignline.com/howto/212200519">EDA Design Line article titled &#8220;Unified Verification for Hardware and Embedded Software Developers&#8221; </a>by Lauro Ritazzi of Eve USA. In it, he makes the following claim:</p>
<blockquote><p>While some may have achieved the scope of jump-starting software development, they only address application programs that do not require an accurate representation of the underling hardware design. They fall short when testing the interaction of the embedded software with hardware, such as firmware, device drivers, operating systems and diagnostics. For this testing, embedded software developers need an accurate model of the hardware to validate their code, while hardware designers need fairly complete software to fully validate their application specific integrated circuit (ASIC) or SoC.</p></blockquote>
<p>The interesting part here is really the claim that the jump-start only works for applications, and that OS and drivers require more detail than a fast virtual platform can supply. I do not quite agree with this. But let&#8217;s first see what Frank Schirrmeister said:</p>
<blockquote><p>the majority of the software development on virtual platforms is spent on firmware, device drivers, operating system porting and diagnostics. And that is not &#8211; as one could assume &#8211; on cycle accurate models, but on functionally accurate models with only essential timing, the type of models called loosely timed (LT) in SystemC.</p></blockquote>
<p>I totally agree with this. As is evident from many different <a href="http://www.virtutech.com/casestudies">public use cases</a>, OS, BSP, and driver development is a big use of virtual platforms. For example, last summer, <a href="http://jakob.engbloms.se/archives/137">Freescale announced the QorIQ P4080 with pretty good software support </a>in terms of Linux and VxWorks operating systems, as well as some middleware stacks. All developed on Simics using an even more timing-abstracted model of the hardware.</p>
<p>However, Frank then makes the following claim that I have a harder time with:</p>
<blockquote><p>In contrast, application software is developed more often than not using completely hardware independent techniques, including cross compilation from the host development machine using development kits like Apple’s iPhone development kit.</p></blockquote>
<p>This is to some extent true, but as time goes on, I think this type of development environment is going to be less useful. Traditionally, OS vendors have had tools like VxSim and OSE SoftKernel in place to help customers &#8220;run code on their desktop&#8221;, while using the API of the operating system of choice. However, such solutions have lots of problems in how close they can get to the target.</p>
<ul>
<li>If you have any kind of third-party binary-only application, or want to use an existing binary component without lots of complex recompilation, you need a virtual platform running the underlying OS. You cannot squeeze that into a host-compiled API simulator.</li>
<li>You are not using the same compiler and code-generation settings and build settings as you are for your actual target, and this can (read: will) introduce nasty compiler version issues.</li>
<li>It forces you to maintain an additional build variant for your code, which can be pretty expensive for a complex build.</li>
<li>You are not using the real OS scheduler, device drivers, and interrupt structure found on the target system. This can have a huge impact, especially for multithreaded multiprocessor systems.</li>
<li>The API simulator needs to be kept in sync with the real software stack, and customized in the same way for any particular target. This is hard to get right (even though it has been done).</li>
<li>The API simulator does not handle heterogeneous systems very well, such as chips or boards or racks mixing two or more different OS kernels in the same system (like a DSP and a main processor OS).</li>
<li>API simulation completely falls apart when the OS is no longer the lowest level of the software stack, but you also have a hypervisor layer underneath the OSes on your target system. An API simulator simply cannot represent this kind of case.</li>
<li>Using a virtual platform and the real target binaries also fits with the very important &#8220;fly what you test, test what you fly&#8221; principle of embedded software development.</li>
</ul>
<p>For various subsets of these reasons, I see many users picking up virtual platforms as a way to streamline application development. For example, <a href="http://www.virtutech.com/news_events/pr/pr2009-02-11-595.html">NASA recently selected a virtual platform based on Simics </a>to develop the software for the new Orion spacecraft. That is going to be a complete software stack, not just OS and drivers, which tend to be fairly off-the-shelf components for these kinds of systems. Most of the effort is on the application level, and the platform used is a virtual platform.</p>
<p>However, note that there are cases where a fast virtual platform like we are discussing here is not sufficient to validate all aspects of the code. I think the main reason we see different viewpoints on this is that we are looking at very different types of software-hardware integration.</p>
<p>In a <a href="http://jakob.engbloms.se/archives/153">blog post I wrote last year on the dead-ness of cycle-accurate simulation</a>, Grant Martin of Tensilica pointed out that <a href="http://jakob.engbloms.se/archives/153#comment-1652">some software desperately needs cycle-accuracy </a>as it is intimately dependent on the timing of the hardware. This is certainly true for some aspects of drivers, and more so for the really early boot code.</p>
<p>Here, FPGA-based hardware-accelerated simulation of the actual design in VHDL or Verilog makes eminent sense as a way to get the details perfectly right. But that is only one part of a much greater system development puzzle, and it really only applies to very small subsystems, as it is kind of hard to fit much more than a single chip inside a hardware acceleration unit. Just as Frank Schirrmeister says, hardware-accelerated simulation is very important. The nice article on the <a href="http://jakob.engbloms.se/archives/639">IBM z10 development </a>that I blogged about earlier says exactly that: for some parts of the validation, there is no way around using the actual hardware RTL design.</p>
<p>And in the end, you have to test the timing and analogue aspects of a design on physical hardware anyway. There should not be too many surprises at this stage, if you have used all of the cool current tools right. But there surely will be some: even a VHDL simulation is a simulation, and not reality, after all.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/651/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

