<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; UpMarc</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/upmarc/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>Memory Models: x86 is TSO, TSO is Good</title>
		<link>http://jakob.engbloms.se/archives/1435?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1435#comments</comments>
		<pubDate>Wed, 22 Jun 2011 15:16:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Doug Lea]]></category>
		<category><![CDATA[Francesco Zappa Nardelli]]></category>
		<category><![CDATA[memory consistency]]></category>
		<category><![CDATA[power architecture]]></category>
		<category><![CDATA[SPARC]]></category>
		<category><![CDATA[UpMarc]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1435</guid>
		<description><![CDATA[By chance, I got to attend a day at the UPMARC Summer School with a very enjoyable talk by Francesco Zappa Nardelli from INRIA. He described his work (along with others) on understanding and modeling multiprocessor memory models. It is a very complex subject, but he managed to explain it very well. He showed a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif"><img class="size-full wp-image-1016 alignleft" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="" width="122" height="45" /></a>By chance, I got to attend a day at the <a href="http://www.it.uu.se/research/upmarc/events/SS2011/Programme.html">UPMARC Summer School</a> with a very enjoyable talk by <a href="http://moscova.inria.fr/~zappa/">Francesco Zappa Nardelli </a>from INRIA. He described his work (along with others) on <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">understanding and modeling multiprocessor memory models</a>. It is a very complex subject, but he managed to explain it very well.</p>
<p><span id="more-1435"></span>He showed a very interesting discussion from a few years ago on the x86 memory model and the implementation of spinlocks in the Linux kernel. Various experts went back and forth over whether the final MOV that sets a lock variable to 1 needed to be prefixed by LOCK or not. The discussion ended when Linus Torvalds said &#8220;I know that it is needed&#8221;. Only to see an Intel architect finally intervene and say &#8220;you know, really, it isn&#8217;t needed&#8221;. This was followed by a series of releases of Intel manuals documenting the x86 memory model, with increasing precision in each release. Intel also actually changed the published rules along the road, withdrawing some optimizations as they realized that they would break existing software.</p>
<p>Note that such a description of a memory model must both describe existing hardware, and serve as the guideline for future hardware. Therefore, there are optimizations that are not implemented today but which are possible given the rules. Such optimization opportunities can be removed from the rulebook as long as they have never been part of shipping hardware, so it is not as crazy as it might sound.</p>
<p>Anyway, the point that Francesco made was both to tell an interesting story from history, and making the point that describing and understanding memory models is hard. I certainly agree with that. I recall an ISCA many years ago when some computer architecture professors all agreed that very few people really understand consistency and weak memory models.</p>
<p>To make life easier for programmers, Francesco and Peter Sewell (in Cambridge) has defined their own set of rules for x86 memory consistency. This is not an architecture spec, but a rule set for regular programmers. It is found at <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">http://www.cl.cam.ac.uk/~pes20/weakmemory/</a>. Essentially, the conclusion is that x86 in practice implements the old SPARC TSO memory model.</p>
<p>They have also attempted to formalize the Power Architecture memory model. Both the actual memory model and their model of it can only be described as very complex. The programmer&#8217;s model is expressed in terms of store queues, speculative instruction execution, and commits of instructions. Not something you easily keep in your head. It is interesting to note that ARM MPCore essentially copied the Power Architecture.</p>
<p>He showed an interactive simulation of the Power memory model, and the way that you need to think about it in terms of propagating information between threads and committing them. It is possible to propagate values and then another propagation overrides a value before the thread commits&#8230; Fun. Or a headache.</p>
<p>The big take-away from the talk for me is that it confirms the observation made may times before that <a href="http://en.wikipedia.org/wiki/Memory_ordering">SPARC TSO </a>seems to be the optimal memory model. It is sufficiently understandable that programmers can write correct code without having barriers everywhere. It is sufficiently weak that you can build fast hardware implementation that can scale to big machines.</p>
<p>Maybe TSO does not theoretically scale in the same insane way as Power or Alpha does/did. But the cost of that theoretical scalability is that programmers might have to litter their code with sync operations just to get it to run correctly. With too many sync operations, the code will run very slowly negating any advantage on the hardware level. Note that sync operations can be very expensive. <a href="http://g.oswego.edu/">Doug Lea</a>, in the audience, pointed out that a sync can cost up to 300 cycles on a POWER5.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1435"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1435" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1435" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1435/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MCC 2009 Presentations Online</title>
		<link>http://jakob.engbloms.se/archives/1023?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1023#comments</comments>
		<pubDate>Thu, 03 Dec 2009 08:29:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Andras Vajda]]></category>
		<category><![CDATA[Domain-specific languages]]></category>
		<category><![CDATA[Ericsson]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[keynote]]></category>
		<category><![CDATA[LTE]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[UpMarc]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1023</guid>
		<description><![CDATA[The presentations from the 2009 Swedish Workshop on Multicore Computing (MCC 2009) are now online at the program page for the workshop. Let me add some comments on the workshop per se. This was the first multicore event that I have been to where we did not have a keynote speaker or technical paper from [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1016" style="margin-top: 5px; margin-bottom: 5px;" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="UPMARC_700x150" width="122" height="45" />The presentations from the 2009 Swedish Workshop on Multicore Computing (MCC 2009) are now online at the <a href="http://www.it.uu.se/research/upmarc/MCC09/prog">program page for the workshop</a>. Let me add some comments on the workshop per se.</p>
<p><span id="more-1023"></span>This was the first multicore event that I have been to where we did not have a keynote speaker or technical paper from a hardware company. So there was really nothing here directly about how to build multicore chips. Rather, the workshop tended to be about how to program, use, measure performance on, verify software for, and generally work with multicore chips. From the perspective of software people, rather than hardware designers.</p>
<p>Obviously, hardware aspects enter into such talks, but it is the perspective of a user, not a designer. For example, a hardware designer could explain how an atomic compare-and-swap is optimized in a multicore device. But here, we saw measurements on the actual operation latencies observed on real machines using such operations. Quite refreshing, and closer to my personal interests.</p>
<p>The keynote by <a href="http://a-vajda.eu/blog/">Andras Vajda</a> of Ericsson was quite interesting. The slides are not online, but the main points that I picked up and that I might not have considered before:</p>
<ul>
<li>Software development costs can mean that the cheapest, fastest, most efficient hardware is not necessarily the most economic. Too hard to code for means the software development time and effort removes the advantage. Obvious, but worth reiterating. Software is king.</li>
<li>The workload on a cellular basestation can sometimes be highly linear and single-threaded. For example, serving a single terminal with a very high bandwidth LTE connection. And suddenly shift to a massively parallel workload as a crowd of a thousand all suddenly appear and start doing data downloads. And then go back to serial again. This means that the age-old argument that signal processing naturally &#8220;<a href="http://www.edn.com/blog/980000298/post/50023005.html">conveniently concurrent</a>&#8221; (<a href="http://www.scdsource.com/article.php?id=87">and here</a>) is not always true. Nice point!</li>
<li>Thus, we need adaptable architectures that can trade serial and parallel performance over time, and rebalance quite quickly. In the same chip.</li>
<li>He is a firm believer that homogeneous systems will win out in the end, I still hold on to a belief in accelerators and offload engines and DSPs. This is partially because of an admitted focus on servers and services processors, and not on the baseband and signalling side. Makes sense.</li>
<li>Domain-specific languages (DSL) are the future of efficient programming. Agree.</li>
</ul>
<p>On the topic of DSLs, there was a question about the cost to support them. To me, that is a non-issue. In the organizations that I have worked, it seems that maintaining a useful DSL requires at most one engineer. Developing one, a few good computer scientists for a fairly limited time. In any case, they tend to appear organically when good programmers <a href="http://jakob.engbloms.se/archives/747">generalize repeated tasks</a>.</p>
<p>I gave a keynote about how multicore has impacted virtual platforms (in particular, <a href="http://www.virtutech.com/products/simics">Virtutech Simics</a>) with the following main points:</p>
<ul>
<li>Multicore targets increase the performance pressure on a virtual platform, as more processors will have to be simulated.</li>
<li>Multicore hosts means that sequential performance of the host is going down compared to the aggregate parallel performance demands from the targets.</li>
<li>To handle large target systems, the virtual platform itself has to run multithreaded on a multicore host. Getting this in place is a major, interesting, and sometimes painful process.</li>
<li>Once you have a parallel virtual platform, multicore hosts provide a very nice boost in scalability and the manageable system sizes. A single multithreaded virtual platform process is also a bit easier to manage from a user perspective.</li>
<li>All features in the virtual platform have to be multicore and multimachine-aware&#8230; meaning that they often get a bit harder to use initially, as there is no &#8220;default processor&#8221; you can fall back to for debugging setups etc. Everything has to be explicitly targeted.</li>
<li>Multicore targets have proven to  be a great sales driver for virtual platforms, as debugging software on a physical multicore, multichip, multiboard system is just too painful.</li>
</ul>
<p>Overall, this was a fun event, looking forward to next year at Chalmers!</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1023"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1023" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1023" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1023/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MCC 2009: 2D Stream Processing for Manycore</title>
		<link>http://jakob.engbloms.se/archives/1015?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1015#comments</comments>
		<pubDate>Thu, 26 Nov 2009 15:03:40 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[David Black-Schaffer]]></category>
		<category><![CDATA[efficiency]]></category>
		<category><![CDATA[manycore]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[Stanford]]></category>
		<category><![CDATA[Stream programming]]></category>
		<category><![CDATA[UpMarc]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1015</guid>
		<description><![CDATA[Today here at the MCC 2009 workshop, I heard an interesting talk by David Black-Schaffer of Stanford university.  His work is on stream programming for image processing (&#8220;2D streams&#8221;). Pretty simple basic idea, to use 2D blobs of pixels as kernel inputs rather than single values or vectors. Makes eminent sense for image processing. What [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.it.uu.se/research/upmarc"><img class="alignleft size-full wp-image-1016" style="margin-top: 10px; margin-bottom: 10px;" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="UPMARC_700x150" width="122" height="45" /></a>Today here at the <a href="http://www.it.uu.se/research/upmarc/MCC09/prog">MCC 2009 workshop</a>, I heard an interesting talk by <a href="http://cva.stanford.edu/people/davidbbs/">David Black-Schaffer </a>of Stanford university.  His work is on stream programming for image processing (&#8220;2D streams&#8221;). Pretty simple basic idea, to use 2D blobs of pixels as kernel inputs rather than single values or vectors. Makes eminent sense for image processing.</p>
<p><span id="more-1015"></span>What was striking was his basic attitude to the target machines: he assumed &#8220;manycore&#8221; like 100-way Tilera, 320-way ATI/AMD graphics processors, and similar devices. With these many cores, the goal is to implement an algorithm using a few cores as possible to save power. He assumes that there are always cores available, and wastes no time reducing the core count. One kernel on each core, including replicated kernels and synthesized buffering kernels.</p>
<p>This was the best example I have seen so far on an actual implementation of idea that <strong>cores are free</strong>. For previous writing on this topic, see &#8220;<a href="http://jakob.engbloms.se/archives/269">what is efficiency when cores are free</a>&#8221; and &#8220;<a href="http://jakob.engbloms.se/archives/123">real-time control when cores are free</a>&#8220;.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1015"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1015" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1015" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1015/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

