<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; computer simulation technology</title>
	<atom:link href="http://jakob.engbloms.se/archives/category/virtual/simulation-tech/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>Wind River Blog: Why Simics will not run Super Mario</title>
		<link>http://jakob.engbloms.se/archives/1510?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1510#comments</comments>
		<pubDate>Fri, 14 Oct 2011 09:27:46 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[Wind River Blog]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1510</guid>
		<description><![CDATA[On my Wind River blog, I just posted a fairly long post about simulation abstraction levels. It was inspired by a cool article in ArsTechnica about Nintendo emulators, and the costs and benefits of being ever more faithful to the hardware. Tweet]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" />On my Wind River blog, I just posted a <a href="http://blogs.windriver.com/wind_river_blog/2011/10/why-simics-wont-run-super-mario-.html">fairly long post about simulation abstraction levels</a>. It was inspired by a cool article in <a href="http://arstechnica.com/gaming/news/2011/08/accuracy-takes-power-one-mans-3ghz-quest-to-build-a-perfect-snes-emulator.ars">ArsTechnica</a> about Nintendo emulators, and the costs and benefits of being ever more faithful to the hardware.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1510"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1510" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1510" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1510/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GPGPU for Instruction-Set Simulation &#8211; Maybe, Maybe not</title>
		<link>http://jakob.engbloms.se/archives/1506?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1506#comments</comments>
		<pubDate>Sat, 08 Oct 2011 19:17:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[CCGrid]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[simulation]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1506</guid>
		<description><![CDATA[I just read a quite interesting article by Christian Pinto et al, &#8220;GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms&#8220;, published at the CCGRID 2011 conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed the simulation. Intriguing concept, but the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png"><img class="alignleft size-full wp-image-125" style="margin: 5px 10px;" title="coreshrink1" src="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png" alt="" width="100" height="100" /></a>I just read a quite interesting article by Christian Pinto et al, &#8220;<a href="http://infoscience.epfl.ch/record/164471">GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms</a>&#8220;, published at the <a href="http://www.ics.uci.edu/~ccgrid11/">CCGRID 2011 </a>conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed the simulation. Intriguing concept, but the execution is not without its flaws and it is unclear at least from the paper just how well this generalizes, scales, or compares to parallel simulation on a general-purpose multicore machine.</p>
<p><span id="more-1506"></span>The paper describes a simulation for a network-on-chip based homogeneous system containing a &#8220;ARM-subset&#8221; ISS instances with local instruction and data caches, some local RAM, and also some shared RAM. Each core runs its own local software load, there is no SMP operating system. All communication between cores is over shared memory, using explicit operations across the NoC. All cores run a single cycle before they check communications from their neighbors.</p>
<p>This last point is crucial to understanding why this is feasible at all &#8211; in general, simulating a general shared-memory multiprocessor machine on a shared-memory multiprocessor falls down on the synchronization overhead. If your simulation semantics dictate that you synchronize every cycle anyway, and you do not try to optimize each core simulator, there is clearly decent room for parallel execution. By including the cache, they increase scalability, since there is more work per target cycle that can be run in isolation.</p>
<p>After reading the article, I am impressed by their work &#8211; just getting this to work is pretty good work. But there are quite a few questions which are not really answered in the article and which are crucial to understanding just how well GPGPUs could be used for this kind of ISS work.</p>
<ul>
<li>The targeted level of abstraction is a bit confusing. The authors claim it is &#8220;instruction accurate and not cycle accurate&#8221;, but still simulate caches and cycle-based communications across the NoC. If I read the paper right, communications will take a varying number of cycles depending on the distance for messages to travel. This is more detailed than a typical &#8220;instruction accurate&#8221; simulator.</li>
<li>The target system does not run an OS &#8211; that might (but I do not know) be an advantage for their approach, since it probably implies less variation in the instruction flow in cores, potentially enhancing the amount of time that all ISSes in a thread group in the GPU can execute the same instruction. This would seem crucial, as if each ISS was running a totally different program, the instruction execution part of the code would be running serialized.</li>
<li>They should really try to run the same kind of simulation on a high-end x86 CPU like an Intel Sandy Bridge with 8 or more hardware threads. I wonder if their scaling might not work just as well there &#8211; and with a much faster serial execution engine. This should give  a much more relevant point of comparison for GPU vs CPU execution of the simulator than&#8230;</li>
<li>the comparison object they use right now, a JIT-accelerated multicore simulation using OVP seems pretty irrelevant since it is not doing the same thing at all. That simulator does not simulate the caches or NoC, just a large number of isolated processors. They also do not run a parallel program on OVP, but rather a large number of single-core fibonacci and dhrystone programs. Thus, the fact that OVP uses a large temporal decoupling time slice does not matter for semantics. It just does not seem like a very relevant comparison point. OVP and their simulator try to solve different problems &#8211; fast execution of general code vs. performance profiling of massively parallel machines.</li>
<li>As I understand it, the given &#8220;S-MIPS&#8221; numbers in the evaluation tell us the total number of MIPS that we get out across all target cores. That seems to peak around 2000 &#8211; which isn&#8217;t necessarily that fantastic if we compare to high-performance ISS work in general where a few GIPS is definitely achievable. It is pretty good considering the level of detail here, though, where i would expect a normal ISS + cache simulator to produce at most a few MIPS. Once again, the authors need to be a bit more precise as to what they compare to what.</li>
<li>Not having an MMU and not implementing any interrupts or exceptions in the target machines avoids a large part of the complexity of a real ISS. That complexity might well be too much for the quite rigid execution environment of a GPGPU.</li>
<li>They missed that Simics, unique among instruction-accurate mainstream simulators, is <a href="http://jakob.engbloms.se/archives/128">parallel </a>since version 4.0.</li>
</ul>
<p>So, overall, this paper does not really tell us much whether a GPGPU can be used for instruction-set simulation in general. It does tell us that it might be doable, but there are many crucial complications which are not addressed.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1506"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1506" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1506" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1506/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: A Virtual Year</title>
		<link>http://jakob.engbloms.se/archives/1483?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1483#comments</comments>
		<pubDate>Sat, 20 Aug 2011 07:25:59 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[hypersimulation]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1483</guid>
		<description><![CDATA[There is a new post at my Wind River blog, about hypersimulation in virtual platforms and how it lets virtual time fly much faster than real time. It was the result of simple mistake of leaving Simics running in the background as I did other work on  my machine. Tweet]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" />There is a new post at my Wind River blog, about <a href="http://blogs.windriver.com/engblom/2011/08/a-virtual-year.html">hypersimulation in virtual platforms and how it lets virtual time fly much faster than real time</a>. It was the result of simple mistake of leaving Simics running in the background as I did other work on  my machine.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1483"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1483" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1483" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1483/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Steve Furber: Emulated BBC Micro on Archimedes on PC</title>
		<link>http://jakob.engbloms.se/archives/1476?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1476#comments</comments>
		<pubDate>Sat, 13 Aug 2011 09:35:43 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[ACORN Archimedes]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[BBC Micro]]></category>
		<category><![CDATA[Communications of the ACM]]></category>
		<category><![CDATA[Steve Furber]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1476</guid>
		<description><![CDATA[I just read an interview with Steve Furber, the original ARM designer, in the May 2011 issue of the Communications of the ACM. It is a good read about the early days of the home computing revolution in the UK. He not only designed the ARM processor, but also the BBC Micro and some other [...]]]></description>
			<content:encoded><![CDATA[<p>I just read an <a href="http://cacm.acm.org/magazines/2011/5/107684-an-interview-with-steve-furber/fulltext">interview with Steve Furber</a>, the original ARM designer, in the May 2011 issue of the Communications of the ACM. It is a good read about the early days of the home computing revolution in the UK. He not only designed the ARM processor, but also the <a href="http://en.wikipedia.org/wiki/BBC_Micro">BBC Micro </a>and some other early machines.</p>
<p><span id="more-1476"></span>The title of this blog post is based on one fun little tidbit in the article. Steve actually keeps a functioning BBC Micro system around in the shape of an emulator.  But the emulator is not running on his current PC directly, but on top an emulated Acorn Archimedes machine.  A machine from 1981 emulated on top of a machine from 1987 that is in turn emulated on top of a machine from this year.  A real-life example of how nested emulation can help save our digital heritage (an interesting topic I <a href="http://jakob.engbloms.se/archives/180">blogged about before</a>).</p>
<p>He also has a physical BBC Micro around, and it is impressive just how high-quality construction could be back then. I still believe the old IBM keyboards are the best ever made, while the cheap laptop I am writing this on probably costs less to manufacture than that keyboard did&#8230; obviously at the expense of keyboard feeling.  Grumble grumble kurmudgeon me, right? Steve loves the way the BBC Micro keyboard was designed, and claims that it could hold up to 10 years of abuse by children. In contrast, my old ZX Spectrum had some 4 keyboard membranes worn out in the six years or so I used it. But that is what cheap gives you.</p>
<p>He also touches on the topic of just how inaccessible computing is getting in terms of getting to grips with the machine and the hardware level. Back in the 1980s, you could pretty easily build things and attach them to your home computer.  Slow buses and low-level access and no stupid operating system getting in your way facilitated that.  He proposes people using a PIC development kit, but maybe an Arduino is a better choice.  I like LEGO Mindstorms, but that is indeed a very high-level view of hardware and robotics.  Not at all like the direct programming experience I had with <a href="http://jakob.engbloms.se/archives/758">my first ZX Spectrum back in 1984</a>.</p>
<p>It is also funny to read how 1983-era 16-bit processors were not keeping up with memory &#8211; too complex instructions made the machine use memory too rarely. How the world had changed!</p>
<p>Recommended reading!</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1476"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1476" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1476" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1476/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: How to Get Virtual</title>
		<link>http://jakob.engbloms.se/archives/1474?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1474#comments</comments>
		<pubDate>Tue, 02 Aug 2011 12:10:57 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[development methodology]]></category>
		<category><![CDATA[Functional models]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1474</guid>
		<description><![CDATA[There is a new post at my Wind River blog, about how you build virtual platforms with Simics. The post is more about the methodology than the nature of models, cycle accuracy, endianness, and all the other details of virtual platform modeling. I have written about modeling methodology on this blog too, and in particular I [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" />There is a new post at my Wind River blog, about how you <a href="http://blogs.windriver.com/engblom/2011/08/how-to-get-virtual-.html">build virtual platforms with Simics</a>. The post is more about the methodology than the nature of models, cycle accuracy, endianness, and all the other details of virtual platform modeling. I have written about modeling methodology on this blog too, and in particular I would recommend looking at &#8220;<a href="../archives/1317">Two perspectives on modeling</a>&#8220;.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1474"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1474" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1474" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1474/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Memory Models: x86 is TSO, TSO is Good</title>
		<link>http://jakob.engbloms.se/archives/1435?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1435#comments</comments>
		<pubDate>Wed, 22 Jun 2011 15:16:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Doug Lea]]></category>
		<category><![CDATA[Francesco Zappa Nardelli]]></category>
		<category><![CDATA[memory consistency]]></category>
		<category><![CDATA[power architecture]]></category>
		<category><![CDATA[SPARC]]></category>
		<category><![CDATA[UpMarc]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1435</guid>
		<description><![CDATA[By chance, I got to attend a day at the UPMARC Summer School with a very enjoyable talk by Francesco Zappa Nardelli from INRIA. He described his work (along with others) on understanding and modeling multiprocessor memory models. It is a very complex subject, but he managed to explain it very well. He showed a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif"><img class="size-full wp-image-1016 alignleft" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="" width="122" height="45" /></a>By chance, I got to attend a day at the <a href="http://www.it.uu.se/research/upmarc/events/SS2011/Programme.html">UPMARC Summer School</a> with a very enjoyable talk by <a href="http://moscova.inria.fr/~zappa/">Francesco Zappa Nardelli </a>from INRIA. He described his work (along with others) on <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">understanding and modeling multiprocessor memory models</a>. It is a very complex subject, but he managed to explain it very well.</p>
<p><span id="more-1435"></span>He showed a very interesting discussion from a few years ago on the x86 memory model and the implementation of spinlocks in the Linux kernel. Various experts went back and forth over whether the final MOV that sets a lock variable to 1 needed to be prefixed by LOCK or not. The discussion ended when Linus Torvalds said &#8220;I know that it is needed&#8221;. Only to see an Intel architect finally intervene and say &#8220;you know, really, it isn&#8217;t needed&#8221;. This was followed by a series of releases of Intel manuals documenting the x86 memory model, with increasing precision in each release. Intel also actually changed the published rules along the road, withdrawing some optimizations as they realized that they would break existing software.</p>
<p>Note that such a description of a memory model must both describe existing hardware, and serve as the guideline for future hardware. Therefore, there are optimizations that are not implemented today but which are possible given the rules. Such optimization opportunities can be removed from the rulebook as long as they have never been part of shipping hardware, so it is not as crazy as it might sound.</p>
<p>Anyway, the point that Francesco made was both to tell an interesting story from history, and making the point that describing and understanding memory models is hard. I certainly agree with that. I recall an ISCA many years ago when some computer architecture professors all agreed that very few people really understand consistency and weak memory models.</p>
<p>To make life easier for programmers, Francesco and Peter Sewell (in Cambridge) has defined their own set of rules for x86 memory consistency. This is not an architecture spec, but a rule set for regular programmers. It is found at <a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/">http://www.cl.cam.ac.uk/~pes20/weakmemory/</a>. Essentially, the conclusion is that x86 in practice implements the old SPARC TSO memory model.</p>
<p>They have also attempted to formalize the Power Architecture memory model. Both the actual memory model and their model of it can only be described as very complex. The programmer&#8217;s model is expressed in terms of store queues, speculative instruction execution, and commits of instructions. Not something you easily keep in your head. It is interesting to note that ARM MPCore essentially copied the Power Architecture.</p>
<p>He showed an interactive simulation of the Power memory model, and the way that you need to think about it in terms of propagating information between threads and committing them. It is possible to propagate values and then another propagation overrides a value before the thread commits&#8230; Fun. Or a headache.</p>
<p>The big take-away from the talk for me is that it confirms the observation made may times before that <a href="http://en.wikipedia.org/wiki/Memory_ordering">SPARC TSO </a>seems to be the optimal memory model. It is sufficiently understandable that programmers can write correct code without having barriers everywhere. It is sufficiently weak that you can build fast hardware implementation that can scale to big machines.</p>
<p>Maybe TSO does not theoretically scale in the same insane way as Power or Alpha does/did. But the cost of that theoretical scalability is that programmers might have to litter their code with sync operations just to get it to run correctly. With too many sync operations, the code will run very slowly negating any advantage on the hardware level. Note that sync operations can be very expensive. <a href="http://g.oswego.edu/">Doug Lea</a>, in the audience, pointed out that a sync can cost up to 300 cycles on a POWER5.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1435"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1435" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1435" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1435/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Disappointing SystemC Debugger Integration Paper</title>
		<link>http://jakob.engbloms.se/archives/1419?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1419#comments</comments>
		<pubDate>Wed, 25 May 2011 19:35:21 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[SystemC]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1419</guid>
		<description><![CDATA[Since I have a certain interest in debugging, I was happy find the article &#8220;Guidelines for SystemC &#8211; Debugger Integration&#8221; at the usually interesting Design and Reuse website. However, I must say that it was pretty disappointing. The key idea of the article is to put the debug service in a thread and the debugged [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/05/debug-small.png"><img class="alignleft size-full wp-image-1421" title="debug small" src="http://jakob.engbloms.se/wp-content/uploads/2011/05/debug-small.png" alt="" width="81" height="73" /></a>Since I have a certain interest in debugging, I was happy find the article <a href="http://www.design-reuse.com/articles/26457/guidelines-for-systemc-debugger-integration.html">&#8220;Guidelines for SystemC &#8211; Debugger Integration&#8221;</a> at the usually interesting Design and Reuse website. However, I must say that it was pretty disappointing.</p>
<p><span id="more-1419"></span>The key idea of the article is to put the debug service in a thread and the debugged SystemC system in another thread, and stop SystemC using a mutex. Yes, you have to do that.</p>
<p>But the really interesting part is how to connect the debugger into the virtual platform, and what that requires from the models and processors and the infrastructure. Unfortunately, the article is pretty silent on that. There is some talk of breakpoint handling required in the ISS, and how to update target memory that mostly corresponds to the debug interface of SystemC TLM-2.0 in scope.</p>
<p>Also, nothing about multicore debug and how to deal with temporal decoupling and debugging, or the need for repeatability across runs. Or breakpoints on things like hardware accesses and internal actions in the simulator.</p>
<p>Too bad.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1419"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1419" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1419" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1419/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: 20, 30, 60 years ago</title>
		<link>http://jakob.engbloms.se/archives/1408?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1408#comments</comments>
		<pubDate>Fri, 06 May 2011 12:27:37 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[Wind River]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1408</guid>
		<description><![CDATA[There is a new post at my Wind River blog, about some computing history. Wind River turns thirty this year, Simics twenty, and simulation for debug (and probably debug in general) turns sixty. Computing has come a long way. Tweet]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" />There is a new post at my Wind River blog, about <a href="http://blogs.windriver.com/engblom/2011/05/twenty-thirty-and-sixty-years-ago.html">some computing history</a>. Wind River turns thirty this year, Simics twenty, and simulation for debug (and probably debug in general) turns sixty. Computing has come a long way.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1408"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1408" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1408" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1408/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: Testing Integrated Software in Simulation</title>
		<link>http://jakob.engbloms.se/archives/1391?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1391#comments</comments>
		<pubDate>Sun, 13 Mar 2011 20:36:30 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[control software]]></category>
		<category><![CDATA[NASA]]></category>
		<category><![CDATA[Toyota]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1391</guid>
		<description><![CDATA[There is a new post at my Wind River blog, about the testing on an integrated software stack in simulation. I base the discussion on the very interesting report about the Toyota &#8220;unintended acceleration&#8221; problems and the deep investigation into the control software of the affected vehicles performed by a NASA team (!). The report [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" />There is a new post at my Wind River blog, about the <a href="http://blogs.windriver.com/wind_river_blog/2011/03/integration-and-testing-the-integration.html">testing on an integrated software stack in simulation</a>. I base the discussion on the very interesting report about the Toyota &#8220;unintended acceleration&#8221; problems and the deep investigation into the control software of the affected vehicles performed by a NASA team (!). The report covers a lot of different tools, but also notes that about the only thing not done was to integrate the complete software stack in simulation.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1391"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1391" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1391" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1391/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>EETimes: James Aldis on Performance Modeling</title>
		<link>http://jakob.engbloms.se/archives/1387?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1387#comments</comments>
		<pubDate>Thu, 03 Mar 2011 20:13:03 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[hardware design]]></category>
		<category><![CDATA[hardware modeling]]></category>
		<category><![CDATA[James Aldis]]></category>
		<category><![CDATA[OMAP]]></category>
		<category><![CDATA[performance optimization]]></category>
		<category><![CDATA[TI]]></category>
		<category><![CDATA[Virtio]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1387</guid>
		<description><![CDATA[James Aldis of TI has published an article in the EEtimes about how Texas Instruments uses SystemC in the modeling of their OMAP2 platform. SystemC is used for early architecture modeling and performance analysis, but not really for a virtual platform that can actually run software. The article offers a good insight into the virtual [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/03/TI-logo.png"><img class="alignleft size-full wp-image-1388" style="margin: 5px 10px;" title="TI logo" src="http://jakob.engbloms.se/wp-content/uploads/2011/03/TI-logo.png" alt="" width="80" height="76" /></a>James Aldis of TI has published an article in the <a href="http://www.eetimes.com">EEtimes</a> about how <a href="http://www.eetimes.com/General/DisplayPrintViewContent?contentItemId=4212778">Texas Instruments uses SystemC in the modeling of their OMAP2 platform</a>. SystemC is used for early architecture modeling and performance analysis, but not really for a virtual platform that can actually run software. The article offers a good insight into the virtual platform use of hardware designers, which is significantly different from the virtual platform use of software designers.<br />
<span id="more-1387"></span>For a software person like myself, this article offers a well-written  insight into the world of hardware design and bus optimization for SoCs.</p>
<p>TI deploys two totally different platforms for hardware and software development, which makes perfect sense.  The goals are so different between a high-speed software development platform and performance-accurate hardware design platform that trying to force them together would likely just create a bad compromise that is bad for everybody.</p>
<p>Additionally, FPGAs are used to create timing-dependent low-level code, where you need both timing accuracy and decent speed.  It is worth noting that the performance model is mostly &#8220;dataless&#8221; &#8211; it models the timing of actions and their dependencies, but not their values and computations.</p>
<blockquote><p>The different models serve different purposes, require different levels of effort to use, and become available at different times during the project. The SystemC performance model is always available first and is always the simplest to create and use. The virtual platform is the next to become available. It is used for software development and has very little timing accuracy.  TI uses Virtio technology to create this model rather than SystemC.</p></blockquote>
<p>Given the number of ultimately failed attempts I have seen at making timing and function available in the same model but as orthogonal concerns, this observation in the article is very insightful:</p>
<blockquote><p>It would appear the choice of two different technologies for the virtual platform and the performance model is inefficient, wasting potential code reuse. However, the two have completely different (almost fully orthogonal) requirements, and at module level almost no code reuse is possible.</p></blockquote>
<p>Maybe this is an impossible dream in the general case.</p>
<p>One somewhat surprising statement in the article is that there is no real software available to use in the SoC design phase. Often, virtual platforms are sold as being able to use &#8220;the real software&#8221; when designing hardware. But in the case of TI, the software is mostly written by their customers, with little available for TI to use. Thus, they are forced to design their own test cases to drive the hardware design process.</p>
<blockquote><p>The requirements on the simulation technology are first and foremost ease in creating test cases and models and credibility of results. The emphasis on test-case creation is a consequence of the complexity of the devices and of the way in which an SoC platform such as OMAP-2 is used: because the whole motivation is to be able to move from marketing requirements to RTL freeze and tape-out in a very short time; and because in many cases large parts of the software will be written by the end customer and not by the SoC provider (Texas Instruments, in this article), the performance-area-power tradeoff of a proposed new SoC must be achieved without the aid of &#8220;the software.&#8221;</p></blockquote>
<p>The platform they built is all based on clock-cycle-level interfaces (CC), which is very natural when the primary use case is hardware design.</p>
<p>The primary component optimized in the TI design process is the on-chip interconnect structure, called the &#8220;NoC&#8221; in the article. Each SoC variant is built from a set of (usually already existing) devices and processor cores. The main work of the integration is designing an appropriate NoC for the SoC. The NoC design is crucial to the actual performance level the final SoC product will have.</p>
<p>By playing with the topology, the level of concurrency, and the level of pipelining in the NOC, it&#8217;s possible to create SoCs from the same basic modules with quite different capabilities.</p>
<p>The only real instruction-set simulators used are CC-level models of DSPs, used for software optimization taking but contention into account. No models of the ARM control cores are used. Mostly, processors are represented by stochastic or trace-driven traffic generators that put transactions on buses but do not actually run any real code.</p>
<p>The stochastic processor models are very powerful and provide traffic that is very similar to a real processor.  A very elegant property of such models is that it is very easy to change the parameters of the model to model quite different software/processor scenarios. Compared to writing real test programs for a full ISS, this is much faster and allows for the exploration of more alternatives.</p>
<p>The stochastic models are used along side function-graph breakdowns of software, essentially models that say that an application does A, then B, then C, and that maybe D can happen in parallel. This model of an application is connected to the hardware simulation and can control when things happen and what goes on in parallel. It amounts to a simple model of what an RTOS would do, to some extent.</p>
<p>Configurability is a key theme throughout the OMAP architecture exploration platform. SystemC being what it is, it is limited to configuration at start-up time, but that is perfectly sensible for an architecture exploration use case where you want to setup and platform and test its performance. Dynamic reconfiguration during a run is not that important.  TI has spent a great deal of effort in making the system easy to configure using parameter files.</p>
<p>The article goes into many more fascinating details on the models used.  I can only say one thing: read it, if you have any interest in these kinds of issues.</p>
<p>Good work, James!</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1387"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1387" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1387" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1387/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Interrupts and Temporal Decoupling</title>
		<link>http://jakob.engbloms.se/archives/1384?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1384#comments</comments>
		<pubDate>Sun, 27 Feb 2011 21:09:17 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Grant Martin]]></category>
		<category><![CDATA[interrupt]]></category>
		<category><![CDATA[Temporal decoupling]]></category>
		<category><![CDATA[Tensilica]]></category>
		<category><![CDATA[virtual]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1384</guid>
		<description><![CDATA[I am just finishing off reading the chapters of the Processor and System-on-Chip Simulation book (where I was part of contributing a chapter), and just read through the chapter about the Tensilica instruction-set simulator (ISS) solutions written by Grant Martin, Nenad Nedeljkovic and David Heine. They have a slightly different architecture from most other ISS [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/04/gears1.png"><img class="alignleft size-full wp-image-737" style="margin: 5px;" title="gears1" src="http://jakob.engbloms.se/wp-content/uploads/2009/04/gears1.png" alt="" width="56" height="57" /></a>I am just finishing off reading the chapters of the <a href="http://www.springer.com/engineering/circuits+%26+systems/book/978-1-4419-6174-7" target="_self"><em>Processor and System-on-Chip Simulation </em></a>book (where <a href="http://blogs.windriver.com/engblom/2011/01/processor-and-soc-simulation-book.html">I was part of contributing a chapter</a>), and just read through the chapter about the <a href="http://www.tensilica.com">Tensilica </a>instruction-set simulator (ISS) solutions written by <a href="http://www.chipdesignmag.com/martins/">Grant Martin</a>, Nenad Nedeljkovic and David Heine. They have a slightly different architecture from most other ISS solutions, since that they have an inherently variable target in the configurable and extensible Tensilica cores. However, the more interesting part of the chapter was the discussion on system modeling beyond the core. In particular, how they deal with interrupts to the core in the context of a <a href="http://jakob.engbloms.se/?s=temporal+decoupling">temporally decoupled </a>simulation.</p>
<p><span id="more-1384"></span>This is a small detail, but one where I have always had a feeling that some fundamental assumption was missing in my discussions with various people from the hardware design community. It always seemed that hardware designers assumed a different basic design &#8211; and Grant Martin explained it very well just what that was. They only check for interrupts at the beginning of a time slice. Which makes interrupts less precise  versus the code, but also makes the core interpreter fairly simple since all it has to do is to churn through instructions.</p>
<p>There is another solution, which is employed in Simics, where the processor can take an interrupt at any point in a time quantum. To do this, the processor needs to be aware of what is going to happen. The essentials of the solution is to have devices call the processor and tell it that they intend to interrupt it at some point T in time. The processor simulator then makes sure to stop and give the device model a chance to act at that exact point in time. It is obvious that this solution is easily generalized to cover all time callbacks needed to drive device work. A significant part of the responsibility for running the event-driven simulation is moved into the processor core.</p>
<p>Making the event queue visible to the processor also gives the processor a chance to hypersimulate, or skip idle  time. Since it knows the next point in time that something will happen  (either the end of a time quantum or an event posted by a device), it  can very easily, safely, and <a href="http://blogs.windriver.com/engblom/2010/09/deterministic-but-unpredictable.html">repeatably </a>jump forward in time without any  impact on simulation semantics.</p>
<p>When dealing with multiple processors, this means that each processor will have precise interrupts from the devices that are close to it. Timers and IO interrupts tend to work closely with a certain processor for a prolonged period of time. Interrupts between processors suffer a time-quantum delay sometimes, but that is no worse than the solution of checking all interrupts at time-quantum boundaries.</p>
<p>Qemu uses a solution which is a mix of the two. <a href="http://www.usenix.org/event/usenix05/tech/freenix/full_papers/bellard/bellard.pdf">According to the 2005 Usenix paper</a>, devices do call into the processor to announce an interrupt, but this is handled by &#8220;soon&#8221; returning to the processor main loop. Processors are not responsible for keeping track of interrupts, making it very imprecise and not very repeatable when interrupts will happen.</p>
<p>Thus, we can see that there are a few different ways to implement interrupts in virtual platforms. Each approach comes from a different tradition and features different trade-offs.</p>
<p>I was a bit surprised by the comment in the Tensilica chapter that only  correctly synchronized programs will work on a temporally decoupled  simulation. In my experience, temporal decoupling is transparent to software functionality &#8211; all software runs. The perceived timing of operations can be different, and some tightly-coupled code might behave in suboptimal ways, but it certainly runs and works. And lets you <a href="http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html">observe  parallel code errors</a>.</p>
<p>Temporal decoupling is necessary in any fast platform, and its effect on semantics are really minor. With the simple tweak of having a processor know when interrupts might happen, it will also not affect the device-processor interface very much, maintaining very tight synchronization between processors and their controlled hardware.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1384"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1384" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1384" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1384/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Evaluating HAVEGE Randomness</title>
		<link>http://jakob.engbloms.se/archives/1374?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1374#comments</comments>
		<pubDate>Thu, 17 Feb 2011 21:33:14 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[evaluation methodology]]></category>
		<category><![CDATA[HAVEGE]]></category>
		<category><![CDATA[random number generation]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1374</guid>
		<description><![CDATA[I previously blogged about the HAVEGE algorithm that is billed as extracting randomness from microarchitectural variations in modern processors. Since it was supposed to rely on hardware timing variations, I wondered what would happen if I ran it on Simics that does not model the processor pipeline, caches, and branch predictor. Wouldn&#8217;t that make the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png"><img class="alignleft size-full wp-image-1371" title="dice" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png" alt="" width="86" height="88" /></a>I previously blogged about the <a href="http://jakob.engbloms.se/archives/1370">HAVEGE algorithm </a>that is billed as extracting randomness from microarchitectural variations in modern processors. Since it was supposed to rely on hardware timing variations, I wondered what would happen if I ran it on Simics that does not model the processor pipeline, caches, and branch predictor. Wouldn&#8217;t that make the randomness of HAVEGE go away?</p>
<p><span id="more-1374"></span>I got HAVEGE up on a Simics x86 target model with Linux pretty quickly, and ran the two provided tests. <em>Ent</em>, which is a quick entropy test, and <em>nist</em> which supposedly much more thorough.</p>
<p>To my surprise, they both said the randomness we got was totally acceptable. This would seem to invalidate the fundamental assumption of HAVEGE &#8211; that it needs to collect randomness from hardware in order to produce good-quality randomness. To try to understand a bit more of what was going on, I took at look at the execution using <a href="http://blogs.windriver.com/engblom/2010/05/analyzed.html">Simics Analyzer</a> (the dredd.motherboard.processor lines are the processors, and the orange part is the HAVEGE program, yellow is the kernel):</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/OS-scheduler-messing-with-haveged.png"><img class="aligncenter size-medium wp-image-1377" title="OS scheduler messing with haveged" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/OS-scheduler-messing-with-haveged-300x128.png" alt="" width="300" height="128" /></a></p>
<p>Zooming in a bit:</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/OS-scheduler-messing-with-haveged-closer-look.png"><img class="aligncenter size-medium wp-image-1378" title="OS scheduler messing with haveged closer look" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/OS-scheduler-messing-with-haveged-closer-look-300x128.png" alt="" width="300" height="128" /></a>We can see that the program is regularly interrupted by the OS, which could be  a reason for random timing variations. The instructions run by the OS should vary in count, which would disturb the time stamp counter values read by the HAVEGE program. That could be sufficient to cause random variations, essentially showing that HAVEGE really works well just from OS noise &#8211; even in an otherwise idle machine.</p>
<p>However, at this point I started to have my doubts. Something did not feel right.</p>
<p>So I tried to remove all variations from the HAVEGE program. I replaced the &#8220;HARDTICKS&#8221; macro in HAVEGE with the constant 0 (zero) rather than reading the time stamp counter of the processor. This immediately failed the randomness test.</p>
<p>However, when I used the constant 1 (one) instead, the <em>ent </em>test passed. And even <em>nist </em>almost passed with only a single missed test out of the 426 tests executed.</p>
<p>Thus, the conclusion is that we do not know how well HAVEGE &#8216;s collection of hardware randomness works, since the evaluation software is too weak. In essence, we do not know if the collection of hardware randomness matters or not, as the proposed measurement hides the randomness behind a pretty good PRNG algorithm.</p>
<p>Ideally, we would need a measurement that would evaluate the predictability of the randomness generated. Or at least one that can correctly estimate the impact of the variation of low-level hardware timing on the quality of the final random numbers. Unfortunately, that is not the case here, throwing the entire idea into doubt.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1374"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1374" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1374" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1374/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Simple Machine, Hard to Simulate</title>
		<link>http://jakob.engbloms.se/archives/1360?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1360#comments</comments>
		<pubDate>Sat, 01 Jan 2011 20:29:29 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Communications of the ACM]]></category>
		<category><![CDATA[George Phillips]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1360</guid>
		<description><![CDATA[In the June 2010 issue of Communications of the ACM, as well as the April 2010 edition of the ACM Queue magazine, George Phillips discusses the development of a simulator for the graphics system of the 1977 Tandy-RadioShack TRS-80 home computer.  It is a very interesting read for all interested in simulation, as well as [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/01/trs80-i-name.jpg"><img class="alignleft size-full wp-image-1361" style="margin: 5px 10px;" title="trs80-i-name" src="http://jakob.engbloms.se/wp-content/uploads/2011/01/trs80-i-name.jpg" alt="" width="76" height="100" /></a>In the <a href="http://mags.acm.org/communications/201006/?folio=52&amp;CFID=4598775&amp;CFTOKEN=51596944#pg54">June 2010 issue of Communications of the ACM</a>, as well as the <a href="http://queue.acm.org/detail.cfm?id=1755886">April 2010 edition of the ACM Queue magazine</a>, George Phillips discusses the development of a simulator for the graphics system of the 1977 Tandy-RadioShack TRS-80 home computer.  It is a very interesting read for all interested in simulation, as well as a good example of just why this kind of old hardware is much harder to simulate than more recent machines.</p>
<p><span id="more-1360"></span>You really should read the article to get the full story. The short summary is that while the basic principle of the graphics display is very simple to simulate, the effect of rewriting the display contents as it is being drawn on the CRT is quite difficult to get right.</p>
<p>I found this picture of the system online, for reference of what the graphics might look like:</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/01/trs80-i.jpg"><img class="aligncenter size-full wp-image-1363" title="trs80-i" src="http://jakob.engbloms.se/wp-content/uploads/2011/01/trs80-i.jpg" alt="" width="570" height="375" /></a></p>
<p>The hardware of the TRS-80 works pretty much like my old ZX Spectrum did: a bit of memory is used to hold image data (single buffer), and then the video hardware simply reads this memory as the CRT scan goes by to display the right picture. There is no locking of the display memory during this time, so the processor can race the video hardware and modify the contents of a location in memory between scan lines. This is indeed done, and that&#8217;s what makes simulating the machine much much harder.</p>
<p>On the TRS-80, racing the scan makes it possible to (for example) increase the apparent vertical resolution of the display (since each graphics &#8220;pixel&#8221; actually consists of four vertical pixels that could not be individually addressed). On the ZX Spectrum, you could increase the <a href="http://en.wikipedia.org/wiki/ZX_Spectrum_graphic_modes">apparent vertical color resolution</a>. On the Commodore 64, I think you change the graphics palette during redraw to allow more colors to be displayed simultaneously.</p>
<p>For a simulator, this is pure pain. The result of graphics code written for those machines essentially depends on making a very precise simulation of the timing of all processor instructions, as well the behavior of the video hardware. As noted in the article, you need to model the memory access contention resulting from the processor writing as the display memory simultaneously with the video hardware reading it. You need to know the cycle count of each pixel, and the setup time between each row of pixels. To account for effects like dithering on a modern perfectly stable LCD display, you have to apply filters to the basic bitmap, simulating the fuzziness of the old cheap TVs that these computers used to drive. If you want to simulate the horrible tricks used to make music on the ZX Spectrum&#8217;s 1-bit sound output (as I recall it, you made noise by flipping a bit in an OUT instruction on the Z-80 CPU), you probably need to do a bit of analog waveform simulation.</p>
<p>Indeed, it seems to me that one enabler for today&#8217;s virtual platforms is that we have hardware that is not entangled in low-level timing like this. Instead, thanks to the variability of execution time in modern processors, hardware is asynchronous and tends to use interrupts or status bits in registers to report when a requested operation is complete. You do not see code of the type &#8220;do X, wait precisely Y cycles, and then do Z&#8221; anymore, and that really helps in changing the target system timing to enable fast simulation &#8211; which is what any fast virtual platform has to do, replacing a real processor with variable timing with an ISS with pretty simple instruction timing. It also means that hardware models can be simpler too, since they do not model how the hardware achieves its work, only what that work is. To loop back to the article prompting this blog post, in a modern virtual platform, all you would simulate is the specified effect of setting graphics bytes in memory. Not the incidental effect of how it is drawn onto a TV, scan-line by scan-line.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1360"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1360" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1360" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1360/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Modeling Endianness</title>
		<link>http://jakob.engbloms.se/archives/1336?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1336#comments</comments>
		<pubDate>Sun, 26 Dec 2010 15:58:19 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[big-endian]]></category>
		<category><![CDATA[endianess]]></category>
		<category><![CDATA[hardware modeling]]></category>
		<category><![CDATA[little-endian]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1336</guid>
		<description><![CDATA[Endianness is a topic in computer architecture that can give anyone a headache trying to understand exactly what is happening and why. In the field of computer simulation, it is a pervasive problem that takes some thinking to solve in an efficient, composable, and portable way. This blog post describes how I am used to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/12/egg.png"><img class="alignleft size-full wp-image-1337" style="margin: 5px 10px;" title="egg" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/egg.png" alt="" width="74" height="66" /></a><a href="http://en.wikipedia.org/wiki/Endianness">Endianness </a>is a topic in computer architecture that can give anyone a headache trying to understand exactly what is happening and why. In the field of computer simulation, it is a pervasive problem that takes some thinking to solve in an efficient, composable, and portable way.</p>
<p>This blog post describes how I am used to working with endianness in virtual platforms, and why this approach makes sense to me. There are other ways of dealing with endianness, with different trade-offs and overriding goals.</p>
<h2><span id="more-1336"></span>Fundamentals</h2>
<p>What is endianness? In my way of looking at it, it is the arbitrary solution to the problem you get when a large unit of information (say, a 32-bit word) needs to be stored as a set of smaller units (say, 8-bit bytes). When this happens, you need to split the large unit into smaller units, and decide on how to order the smaller units. There is no objectively better or worse way to do this &#8211; as long as the result is unambiguous and based on positional numerics (i.e., no roman numerals, please), it is hard to claim that one order is better than another.</p>
<p>We use &#8220;endianness&#8221; all time without really thinking about it, when we write regular decimal numbers. In our <a href="http://en.wikipedia.org/wiki/Hindu_numerals">standard </a>base-10 decimal writing system, any value &gt;9 has to be written down using multiple digits. The order we use is a big endian representation: the most significant numbers come first in our reading order (hundreds before tens before single digits, etc.).</p>
<p>In computer architecture, we have three main schools of endianness:</p>
<ul>
<li>No endian, where we never break things down to bytes but always operate on equal-size words (not very common in practice, but certain machines like the Microchip PIC have instruction ROMs as wide as the instructions, and no way to address components of the intructions)</li>
<li>Big endian, BE, where the most significant bytes are put first in order of ascending addresses. I.e., the &#8220;big end&#8221; comes first.</li>
<li>Little endian, LE, where the least significant bytes are put first</li>
<li>&#8220;Middle endian&#8221;, where the ordering differs for different sizes of data (<a href="http://en.wikipedia.org/wiki/Endianness">Wikipedia </a>mentions this, but I have never seen an example). I have heard stories about chips that also used different endianness to store data by different instructions (by misdesign, I am not referring to the Power Architecture load/store byte-reversed instructions).</li>
</ul>
<p>BE is the traditional choice of IBM and the major early RISC chips, with Power Architecture, MIPS, SPARC, and the zSeries as the most important representatives. LE is the choice of x86, and more recently ARM. MIPS also seems to be gravitating towards LE, probably as a way to make x86 software slightly easier to port. Note that even though some processor cores are described as endianness-neutral, that really means that they can run as either LE or BE. In practice, particular chip designs incorporating such cores tend to lean heavily towards one endianness, since devices are designed for a particular endianness.</p>
<h2>The Software View</h2>
<p>For me, the most important view of endianness is how the software sees it. When a program is running on any current architecture, it logically sees memory as an array of bytes. Inside the memory chips, we have a very different physical layout, usually with words much wider than a byte, as well as an addressing scheme that is not one-dimensional. The interconnect (&#8220;bus&#8221;) moving data from a processor to memory and back is a complex system containing caches, buses of different widths (usually 64 bits or more), memory controllers, cache controllers, bus bridges, and other devices. All of this is usually completely invisible to software, as illustrated below:</p>
<p style="text-align: center;"><img class="aligncenter" style="margin-top: 5px; margin-bottom: 5px;" title="endianness 1" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-1.png" alt="" width="504" height="389" /></p>
<p style="text-align: left;">Basically, the bus system is invisible. The important endianness property as far as software is concerned is the order in which bytes are put into memory, and memory is considered as an array of bytes (since a byte is the smallest unit of addressing). If you look at the memory of a computer system using a debugger, this is the view you will get &#8211; both for on-target and off-target debuggers like ICE units and JTAG debuggers. Each memory access (store or load) will logically pass a small array of bytes into some position in the very large array that is memory.</p>
<h2 style="text-align: left;">The Modeling View</h2>
<p style="text-align: left;">Modeling endianness is not optional when building a virtual platform. The software will at some point assume a certain relationship between word layouts and byte addresses in memory (such as overlaying a byte array on an integer in a C union), or when interpreting network packets (which are defined to use BE byte order, and therefore network code has to convert values to native endian to process them).</p>
<p style="text-align: left;">If you start from the software view of endianness and memory, the obvious simulation model for memory operations is to maintain the array of bytes view of memory matching the physical target.</p>
<ul>
<li>Each memory access from a simulated processor gets turned into a transaction in the simulator.</li>
<li>The transaction has variable size, matching the size of the memory access operation issued by the processor.</li>
<li>The transaction contains a sequence of bytes, in the same order as they would end up in target memory on a physical machine. I.e., the order reflects the endianness of the processor.</li>
<li>The transaction has a starting address (byte-based) matching the memory access the processor issues.</li>
<li>The contents of the memory model in the simulation is an array of bytes, and its content matches what you would find on the physical target &#8211; the logical software view of the target.</li>
<li>The bus system connecting the processor to the memory is basically considered as a black box that just moves the transaction to memory.</li>
</ul>
<p>The above is very easy to implement, and actually a very convenient implementation for someone used to the software view of hardware. The only thing that remains to be considered is how a processor simulator is implemented in practice.</p>
<p>In a typical processor simulator, you represent the target system registers using words of the same size as the target processor uses. I.e., for a 32-bit processor, you use 32-bit words on the host to represent the contents of a register. As the processor model is running, the contents of the register might have to stored in data structures internal to the processor (such as an array of words representing the register file). Naturally, such data structures are kept in host endianness since they are just plain compiled C code. As the processor model runs, arithmetic is carried out using host endianness.</p>
<p>Actually, usually no endianness is involved as the values are considered as words. Remember that a word does not have endianness until it is broken down into bytes and someone actually looks at the bytes. In particular, an operation like</p>
<pre>uint8  a;
uint32 b;
a = (b &amp; 0xff)</pre>
<p>will pick up the 8 lowest bits of a word on any processor. The code is logically working inside of registers and is perfectly portable. However, the result of</p>
<pre>uint32 *c;
*c = b;
a = *((uint8 *)c);</pre>
<p>will pick up the first (at the lowest address) byte stored in memory when b was written &#8211; which is the same as the above on an LE processor, but different on a BE processor. The crucial observation here is that the latter variant contains an explicit store of a word, and an explicit load of a byte. Thus, endianness enters as we store the word (the byte load has no endianness, as it is loading the smallest unit of addressability).</p>
<p>What this means is that a processor simulator will have to do an explicit ordering of bytes as it is writing out values to memory. The simulator will need to take a word it has represented in &#8220;host order&#8221; (as it is within the simulator itself) and convert it to the byte order of the target processor. If the two match, such as simulating a little-endian ARM target on a (always little-endian) x86 host, nothing needs to be done. If they do not match, such as simulating a big-endian PPC target on an x86 host, the bytes have to be swapped before being sent to simulated memory.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-2.png"><img class="aligncenter size-full wp-image-1340" title="endianness 2" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-2.png" alt="" width="422" height="368" /></a>When the processor does a load, it similarly has to swap the bytes being read from memory (if using different target and host endianness).</p>
<p>As soon as we leave the processor simulator, the order of bytes in transactions and simulated memory has to defined and managed in a host-independent way. This is crucial to enable  snapshots of memory to be <a href="http://blogs.windriver.com/engblom/2010/08/transporting-bugs-with-checkpoints.html#more">shared across hosts, time, and space</a>, and simply to allow the simulation to work correctly. The semantics of the simulation must be defined by the simulator, not by the nature of the host.</p>
<p>Note that as an optimization, quite often we do not create an explicit transaction, but rather use the optimization of letting the processor simulator write directly to the representation of the target memory in the memory simulator. In this design, the target memory representation is just an array of bytes mirroring the contents that the processor would see on a physical target.</p>
<p>Let&#8217;s go through this with a simple example. We assume we are on an x86 host. Our processor simulator contains a 32-bit register with the value 0&#215;01020304. This value is endianless until we have to send it to simulated memory, it is just a value of 32 bits. We write it to target memory at address 0&#215;100</p>
<p>On a simulated LE target, the memory write will result in a transaction containing the byte sequence (0&#215;04, 0&#215;03, 0&#215;02, 0&#215;01) &#8211; lowest byte comes first. The memory model will store this with 0&#215;04 at address 0&#215;100, 0&#215;03 at 0&#215;101, etc. The processor model can achieve this effect by simply doing a host-native word store to the memory array.</p>
<p>On a simulate BE target, the memory write will result in a transaction containing (0&#215;01, 0&#215;02, 0&#215;03, 0&#215;04). In memory, 0&#215;01 will be stored at address 0&#215;100, 0&#215;02 at 0&#215;101, etc. To store this word correctly, the processor model will have to do a byte swap operation on the word before writing it out to memory. Such a byte swap operation might seem expensive, but the evidence does not indicate that it matters. All the fastest instruction-set simulators use this method internally as far as I know (Wind River Simics, Imperas OVP, Qemu, IBM Mambo), which to me indicates that the design works well on a simulation system level.</p>
<h2>Device Models</h2>
<p>Device models are the main part of a functional simulator for a computer system. They also have endianness, as they expose memory-mapped interfaces to software. To deal with devices in a consistent manner, they will interpret inbound memory transactions using their local register endianness. This makes it simple and reliable to simulate systems where the processor and the devices have different endianness.</p>
<p>Systems with mixed device endianness is very common, mostly thanks to PCI. PCI is defined to use little-endian byte ordering in all memory accesses, as it originated in the x86 world. PCI is still being used in almost all computer systems, and thus LE PCI devices are being connected to BE processors.</p>
<p>Internally, a device model will also use words to represent data. When data is written to a device, it will interpret the bytes in the write transaction using its local order. When data is read from a device, it will fill in the data in the read transaction using its local order.This makes device drivers that byte-swap incoming data from an LE PCI device on a BE processor work just like they do on physical hardware.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-3.png"><img class="aligncenter size-full wp-image-1341" title="endianness 3" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-3.png" alt="" width="473" height="414" /></a>This makes endianness a local property of the device. The same device model can be used without change in both an LE and a BE target system. This mirrors reality: PCI devices are used in all kinds of systems, and the devices do not change, and neither do the models have to.</p>
<p>In some systems, the designers try to hide the RISC-processor-to-PCI endianness mismatch by making the hardware swap bytes around as they move from the memory bus into the PCI subsystem. If this is the case in a target system, the simplest simulation method is to insert an byte-swapping intermediary on the path from the processor to the devices. This will do an extra byte swap on all transactions passing by, and things will work correctly (note that this byte swap has to be defined to work on a certain word length, and if transactions are bigger than this length, you will also have to order the words).</p>
<p>Note that as long as all units involved on the path from a device to a processor use the same word length, you can replace all the byte swapping operations with a simple flag. This flag will indicate if a transaction has been swapped or not. For example, when we have a BE processor talking to a BE device, on an LE host. The BE processor will flag the transaction as &#8220;wrong-endian&#8221; as it sends it out but actually store the bytes in LE order in the transaction. The BE device will check the flag and realize that it is wrong-endian too. And since two wrongs make a right, it does not have  to swap the bytes either but can copy the transaction contents directly into its internal registers.</p>
<h2>Dealing with Data</h2>
<p>There are other things you want to do with a memory image in a virtual platform apart from reading and writing it from a processor. One particular task is to move data into and out of memory model in order to load code and data, as well as to save the state of the system. The representation of a memory as an array of bytes works very well for this approach, since it corresponds naturally to how software files are created on the host. Since most software files are intended to be loaded by the target into target memory, they are prepared in target byte order. Another advantage of using a byte-based memory representation is that file formats like ELF can be loaded straight into virtual memory without having to convert addresses.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-5.png"><img class="aligncenter size-full wp-image-1344" title="endianness 5" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-5.png" alt="" width="495" height="395" /></a>The representation is also host-independent, which facilitates moving memory images from one host to another, a key part of <a href="http://jakob.engbloms.se/archives/1235">using virtual platforms as a communications mechanism</a>. Another benefit of viewing memory as an array of bytes as accessed from a processor is that debuggers can look at memory in the same way as they would when running on the same host.</p>
<h2>Summary</h2>
<p>This long post (WordPress tells me it is more than 2500 words) really only starts to scratch the surface of this fascinating topic. It has described one approach to endianness modeling, and some of the subtleties involved. There are many more subtleties that we could go into.</p>
<h2>Footnote: SystemC TLM-2.0</h2>
<p>There are other ways to model endianness. In particular, the approach described here is not used in the SystemC TLM-2.0 standard. In TLM-2.0, all data is stored in a transaction in <em>host</em> order, not target order. To model the target endianness, you instead change a descriptor array that tells the simulator about how to interpret the bytes when viewed from the target.</p>
<p>As I see it, this means that TLM-2.0 is better suited for modeling the ins and outs of a bus system, including discovering how data ends up at a target from the actions of the various components of the bus system. It models byte lanes and the width of buses, and uses host byte order for all transfers of data. In contrast, the approach described in this blog post works by modeling the documented (or intended) effect of the hardware at the software level.</p>
<p>Overall, I would say that TLM-2.0 is slightly more geared towards the &#8220;<a href="http://jakob.engbloms.se/archives/1083">design&#8221; use of modeling, rather than &#8220;describe</a>&#8220;. By modeling bus widths, actual byte lanes, and other concepts, the simulator will discover the shape and endianness of data as it arrives at a target memory or device.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1336"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1336" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1336" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1336/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallel SystemC Simulation</title>
		<link>http://jakob.engbloms.se/archives/1327?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1327#comments</comments>
		<pubDate>Fri, 26 Nov 2010 19:08:47 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Christopher Schumacher]]></category>
		<category><![CDATA[CODES]]></category>
		<category><![CDATA[ISSS]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[parallelized software]]></category>
		<category><![CDATA[Rainer Leupers]]></category>
		<category><![CDATA[SystemC]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1327</guid>
		<description><![CDATA[I just found a recent paper on the topic of parallel simulation of computer  systems. Christopher Schumacher et al., published an articles at CODES+ISSS in October of 2010 talking about &#8220;parSC: Synchronous Parallel SystemC Simulation on Multicore Architectures&#8220;. Essentially, parallel SystemC. This is very much a hot topic: for the past few years, everyone has [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/04/gears.png"><img class="alignleft size-full wp-image-735" style="margin: 5px 10px;" title="gears" src="http://jakob.engbloms.se/wp-content/uploads/2009/04/gears.png" alt="" width="56" height="57" /></a>I just found a recent paper on the topic of parallel simulation of computer  systems. Christopher Schumacher et al., published an articles at <a href="http://www.public.asu.edu/~ashriva6/esweek2010/codesisss2010/">CODES+ISSS in October of 2010 </a>talking about &#8220;<a href="http://doi.acm.org/10.1145/1878961.1879005">parSC: Synchronous Parallel SystemC Simulation on Multicore Architectures</a>&#8220;. Essentially, parallel SystemC.</p>
<p><span id="more-1327"></span></p>
<p>This is very much a hot topic: for the past few years, everyone has been looking for ways to run various forms of simulators in parallel. We had some good discussions on this only last Wednesday at a seminar at KTH where I was presenting about Simics.</p>
<p>The approach taken in this paper is different from what you find being done in tools like Simics (as I briefly discussed at <a href="http://jakob.engbloms.se/archives/1023">MCC 2009 </a>and <a href="http://jakob.engbloms.se/archives/246">SiCS Multicore Days 2008</a>). They do not exploit <a href="http://jakob.engbloms.se/archives/97">temporal decoupling</a> or islands with different local time. Instead, they have a single global clock in the entire simulation, and just parallelize the work that is done during each cycle.</p>
<p>The key for this to be beneficial and practical is that the work done per cycle is far greater than the cost to drive the simulation forward &#8220;between&#8221; cycles. In a high-level TLM model where the work per cycle might be as small as a single host instruction (JIT translation of a simple integer instruction from target to host), it is obvious that this approach would not work at all. However, this work explicitly targets clock-cycle-level simulations, where the work per cycle per hardware unit can be very large. The paper discusses actions that take 1000 to 2000 host cycles per step, and at that level of effort, there is definitely some potential for parallel gain.</p>
<p>What is nice with the approach is that they do peg semantics to a sequential reference, which does aid debugging. Due to the very tight synchronization, it would seem to be deterministic, at least on the same host (the SystemC kernel can theoretically behave differently on different hosts).</p>
<p>They do have one example that is simulating a shared-memory multiprocessor using temporally decoupled CPU models (100 target cycles per invocation, probably 1000 to 10000 host cycles). This achieves fairly neat speedups on a very symmetric case. However, this comes at the cost of making the simulation nondeterministic &#8211; even for the single-threaded case which is pretty scary.</p>
<p>Overall, an interesting paper showing that there is more to be discovered in parallel simulation.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1327"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1327" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1327" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1327/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: &#8220;IMA on Simics&#8221;</title>
		<link>http://jakob.engbloms.se/archives/1304?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1304#comments</comments>
		<pubDate>Tue, 26 Oct 2010 13:00:25 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[Integrated Modular Avionics]]></category>
		<category><![CDATA[real-time]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[Tennessee Carmel-Veilleux]]></category>
		<category><![CDATA[wcet]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1304</guid>
		<description><![CDATA[I have a fairly lengthy new blog post at my Wind River blog. This time, I interview Tennessee Carmel-Veilleux, a Canadian MSc student who have done some very smart things with Simics. His research is in IMA, Integrated Modular Avionics, and how to make that work on multicore. I made some cocky comments about just [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 10px 5px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>I have a <a href="http://blogs.windriver.com/engblom/2010/10/interview-with-tennessee-carmel-veilleux.html">fairly lengthy new blog post </a>at my Wind River blog. This time, I interview <a href="http://www.tentech.ca/">Tennessee Carmel-Veilleux</a>, a Canadian MSc student who have done some very smart things with Simics. His research is in IMA, Integrated Modular Avionics, and how to make that work on multicore.</p>
<p><span id="more-1304"></span>I made some <a href="http://jakob.engbloms.se/archives/58">cocky comments about just how stupid the current implementations </a>of this idea are a few years ago, but in my discussions with Tennessee I have realized that things are not that simple. Essentially, you are caught between the structures and strictures of the certification agencies, and the complexity of the hardware with its many shared resources making predictability and programming very difficult.</p>
<p>It is a field which touches to my old work on WCET, but with a target system that is much less amenable to analysis. I still <a href="http://jakob.engbloms.se/archives/123">think it is a good idea to try to use seas of simple cores and separate work physically </a>rather than virtually, but that might be a battle that will never be won. If nothing else, even on a massive spatially-divided multicore device, there will be some shared resources that make life difficult when very low levels of jitter are tolerated.</p>
<p>Read the interview, and read Tennessee&#8217;s own blog &#8211; he has done some pretty cool things both in hardware, software, and virtual hardware.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1304"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1304" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1304" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1304/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>S4D 2010</title>
		<link>http://jakob.engbloms.se/archives/1251?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1251#comments</comments>
		<pubDate>Wed, 15 Sep 2010 08:02:42 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Debug]]></category>
		<category><![CDATA[ESCUG]]></category>
		<category><![CDATA[FDL]]></category>
		<category><![CDATA[Infineon]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[John Aynsley]]></category>
		<category><![CDATA[Pat Brouillette]]></category>
		<category><![CDATA[S4D]]></category>
		<category><![CDATA[Simon Davidmann]]></category>
		<category><![CDATA[Southampton]]></category>
		<category><![CDATA[ST]]></category>
		<category><![CDATA[SystemC]]></category>
		<category><![CDATA[Thorsten Grötker]]></category>
		<category><![CDATA[TrustZone]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1251</guid>
		<description><![CDATA[Looks like S4D (and the co-located FDL) is becoming my most regular conference. S4D is a very interactive event. With some 20 to 30 people in the room, many of them also presenting papers at the conference, it turns into a workshop at its best. There were plenty of discussion going on during sessions and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/09/S4D1.jpg"><img class="alignleft size-full wp-image-941" title="S4D" src="http://jakob.engbloms.se/wp-content/uploads/2009/09/S4D1.jpg" alt="" width="143" height="62" /></a>Looks like S4D (and the co-located FDL) is becoming my most regular conference. S4D is a very interactive event. With some 20 to 30 people in the room, many of them also presenting papers at the conference, it turns into a workshop at its best. There were plenty of discussion going on during sessions and the breaks, and I think we all got new insights and ideas.</p>
<p><span id="more-1251"></span></p>
<h2><a href="../wp-content/uploads/2010/09/P1140077.jpg"><img class="aligncenter size-full wp-image-1276" title="P1140077" src="../wp-content/uploads/2010/09/P1140077.jpg" alt="" width="400" height="258" /></a></h2>
<h2>S4D Talks, Themes, and Topics</h2>
<p>More is available in &#8220;<a href="http://jakob.engbloms.se/archives/1280">S4D part 2</a>&#8220;.</p>
<h3>Tracing and Instrumentation</h3>
<p>The papers presented covered a wide variety of topics from a variety of angles. Still, everybody felt that two topics kept coming back in various forms in a majority of the papers and discussions: <em>tracing</em> and <em>instrumentation</em>.</p>
<p>Code instrumentation is not a dirty word anymore. The traditional judgment that inserting probes into your software is plain bad does not apply anymore, at least not in the minds of the people at S4D. Instrumentation was applied to drivers, OS kernels, and regular user-level software. I think the key insight is that there is clear value in having the developers that write a piece of software also mark points of interest in the code. When analyzing a trace of an execution, that means that the information in the trace becomes meaningful to the software developers, as it is on the right level of abstraction. Instrumentation naturally produces traces, which can be fed out using  shared memory, networks, special-purpose hardware, and more.</p>
<p>One of the instrumentation trace solutions presented (the SVEN system from Intel Digital Home presented by Pat Brouillette), actually leaves the instrumentation in place in the shipping customer systems. In this way, you cannot really claim that instrumentation is intrusive &#8211; it is just part of the software, always. Customers can even activate the tracing in deployed systems, and ship the traces back to the developers for analysis of bugs found in the field. It is another approach to <a href="http://jakob.engbloms.se/archives/1231">record and replay</a> that touches on my paper on transporting bugs with checkpoints.</p>
<p>The increased interest in instrumentation probably has something to do with the nature of the systems that are being addressed. For systems using shared memory multicore hardware and general-purpose operating systems, the cost of instrumentation is easier to take than for very small constrained embedded systems. Essentially, as systems get more complex, instrumentation becomes more tractable.</p>
<p>Instrumentation can interact with hardware trace and debug functions is a neat way to build a system which is more powerful than a hardware or software system would be on its own. Especially for software stacks involving hypervisors and multiple complex operating systems, that is likely necessary.</p>
<p>Once we have a trace, just <a href="../archives/942">like last year</a>, we need to have tools for analyzing the tons of data you get from tracing a modern system. ST talked about a tracing system that generated 100s of gigabytes of data.</p>
<p>One trace aspect that kept coming up was the need for <em>time stamps </em>on trace data. To reconcile multiple traces and understand how different concurrent units talk to each other, a global time stamping mechanism is crucial. There seems to be work on hardware to support this.</p>
<h3>Security, Secrecy, and Debug</h3>
<p>I moderated a panel on hardware support for debug, and posed the question on how to balance security and the need to debug. This generated a number of interesting answers from the panel and the audience.</p>
<p>The conflict between debuggability and secrecy is there. Even from the same customer you first get &#8220;you have to make the internal state of the controller inaccessible and hidden to avoid customers modifying their engines&#8221;&#8230; and then when a problem appears in the field, they ask for a way to analyze and trace that very same system. Hard to support both requirements in a reasonable way.</p>
<p>A sophisticated solution to debug security from companies like ARM, Infineon, and ST is debug that can be enabled using key exchange. The chips are built with a &#8220;locked door&#8221; in place, but the keys to the door are kept well-guarded. In this way the same chip can be used in development and in the field.</p>
<p>To support debug of systems involving secure modes like ARM TrustZone, ARM has defined several levels of access in their CoreSight hardware modules. This makes it possible for a debugger to be restricted to just debugging user-level code, just OS and user-level code, or all of the software stack. To me, this sounds like it could allow mobile phone manufacturers to &#8220;securely&#8221; let their application developers use hardware-based debug, without compromising operating systems or secure boot modes.</p>
<p>The classic technique of using fuses to turn off functions is also relevant, at least for systems with moderate levels of security. This can certainly be overcome using special tools to peel off the top of chips and reconnect the fuses, but the panel seemed to think that that level of attack was in general not worth protecting against. However, the audience pointed out that  this was actually being done to automotive engine controllers and there are people making a good living from such antics.</p>
<h3>ESCUG Meeting</h3>
<p>The ESCUG meeting was a mix of fairly slick commercial presentations from OVP/IMperas chief Simon Davidmann and SystemC guru John Aynsley, and research presentations of varying quality.</p>
<p>One thing that struck me was that the academics spent a significant time in all presentations about how their approaches were compatible with the existing SystemC structure, where they host their open-source efforts, etc. I guess that is good in that they show a certain concern for reality &#8211; but it is also a bit sad that they did not get time to actually talk that much about the core ideas they were bringing forward. I am personally much more interested in new ideas than infrastructure and project management. It does not bode well for European research if this is what people are forced to produce, in lieu of real innovation.</p>
<h3>Thorsten Grötker&#8217;s Keynote</h3>
<p>On Wednesday morning, Thorsten from Synopsys did a look back over the history of SystemC, free from product pitching. He only mentioned Synopsys in his introduction, where the high-level message was that the embedded software is really the key problem for industry today. I cannot disagree with that.</p>
<p>During the SystemC parts of his talk he did say a few things that I did not quite agree with&#8230; in particular that TLM was unknown prior to 1999. It was not called that, but it certainly existed in the field of full-system simulation. The main problem is that Thorsten only sees the EDA history of modeling, not the computer architecture and software-driven work that did simulations as far back as 1950 (the famous Gill paper), and fast simulation since at <a href="http://jakob.engbloms.se/archives/130">least 1967</a>.</p>
<p>He also claims that with SystemC you have a single language for both detailed and TLM models. That is true&#8230; but you still need multiple models, one at each level of abstraction. So yes, one language, multiple models. However, that gluability really comes with a performance and complexity cost. It makes it too easy to slip into bad modeling even in TLM.</p>
<p>An interesting theme that Thorsten picked up from John&#8217;s talk at ESCUG is the use of SystemC to model software and RTOS, using the upcoming process control extensions. If you stretch that into the area of software synthesis, it means that SystemC is going to collide with the field of model-driven software development. Will you use SystemC, coming from the hardware world, or UML/MATLAB/Domain-specific languages coming from the software world?  Thorsten makes the interesting point that in order to integrate with that world, SystemC will require some concepts from that world (like pins and clocks enable interaction with RTL). I am not sure that is true, necessarily, I think you can just as well create point adaptors to the same effect.</p>
<h2>Getting to Southampton</h2>
<p>The <a href="http://www.soton.ac.uk/">University of Southampton </a>hosted the event, and it took place in the university lecture halls.  That means that we got free very fast WiFi (unlike any commercial conference venue I have ever seen).  The university campus was full of services (unlike the desolate place that last year&#8217;s FDL/S4D choose).  Housing in the <a href="http://www.soton.ac.uk/accommodation/halls/gleneyre/index.html">Glen Eyre residential halls </a>was a bit spartan but functional. Felt like being back in my days as a student living in student housing.</p>
<p>The instructions from the conference about how to get to the conference was a bit confusing and incomplete. In practice, it is very easy to get to Southampton from both Gatwick (direct train) and Heathrow (NationalExpess bus 203).  At Heathrow, I had a bit of luck with the bus to Southampton. The instructions from the NationalExpress website had me believe that I had to get from Terminal 5 where we landed to the central bus station and then catch the bus at 15.00. As we landed 40 minutes late (14.40), this looked very hopeless&#8230; until I found the NationalExpress counter in the arrivals hall at Terminal 5 and they told me the bus would leave at 15.30. Nice, no stress. The bus to Southampton even had free Wifi on board!</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/09/P1140062.jpg"></a><a href="http://jakob.engbloms.se/wp-content/uploads/2010/09/P1140062-1.jpg"><img class="aligncenter size-full wp-image-1275" title="P1140062-1" src="http://jakob.engbloms.se/wp-content/uploads/2010/09/P1140062-1.jpg" alt="" width="400" height="246" /></a></p>
<p>Once in Southampton, you then had to take the bus U1A out to the university campus, and finding a bus stop for that was the most difficult part of the journey, actually. Some of the buses from Heathrow stop at Southampton university.</p>
<p>See also &#8220;<a href="http://jakob.engbloms.se/archives/1280">S4D Part 2</a>&#8221; for a few more tidbits from S4D.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1251"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1251" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1251" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1251/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Additional Notes on Transporting Bugs with Checkpoints</title>
		<link>http://jakob.engbloms.se/archives/1231?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1231#comments</comments>
		<pubDate>Wed, 15 Sep 2010 05:38:42 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[articles]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Checkpointing]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[S4D]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1231</guid>
		<description><![CDATA[This post features some additional notes on the topic of transporting bugs with checkpoints, which is the subject of a paper at the S4D 2010 conference. The idea of transporting bugs with checkpoints is some ways obvious. If you have a checkpoint of a state, of course you move it. Right? However, changing how you [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/09/S4D1.jpg"><img class="alignleft size-full wp-image-941" style="margin: 5px 10px;" title="S4D" src="http://jakob.engbloms.se/wp-content/uploads/2009/09/S4D1.jpg" alt="" width="143" height="62" /></a>This post features some additional notes on the topic of transporting bugs with checkpoints, which is the subject of a paper at the <a href="http://www.ecsi.me/s4d">S4D </a>2010 conference.</p>
<p>The idea of transporting bugs with checkpoints is some ways obvious. If you have a checkpoint of a state, of course you move it. Right? However, changing how you think about reporting bugs takes time. There are also some practical issues to be resolved. The S4D paper goes into some of the aspects of making checkpointing practical.</p>
<p><span id="more-1231"></span>In particular, we need the checkpoints to be:</p>
<ul>
<li>Portable &#8211; so that checkpoints can be copied around between computers</li>
<li>Deterministic- so that everyone opening a checkpoint sees the same behavior</li>
<li>Compact &#8211; so that they can actually be moved around without incurring undue pain</li>
<li>Differential &#8211; so that a checkpoint can build on previous state and just contain a set of changes, not the entire state of the target system</li>
</ul>
<p>Most of my paper is spent on how to make checkpoints small enough to be easily transported, and how it fits with development workflows. The requirements above would seem to be common sense, but there are checkpointing systems out there that do not fulfill them. In particular, the portability aspect is hard to get right.</p>
<p>There are other ways to achieve transportation of bugs, and this blog post will fill in on some related work that I could not fit into the paper or which I discovered only after the final version of the paper was submitted.</p>
<h3>Record-Replay Systems</h3>
<p>There seem to be boundless creativity in creating methods to record live systems and replay their inputs/outputs/internal behavior/other interesting behavior on another system to support debug or analysis or other tasks. It shows just how important the replication of bugs is to the development of systems, and just how hard it is to accurately capture a bug in practice.</p>
<p>The company called <strong>Zealcore </strong>was doing some interesting work in software-based recording of &#8220;only the relevant events&#8221;, and then replaying this on a lab machine. Their angle on the problem was to have software record a minimal trace of important events on a live system, and then control the runtime system in a lab to replicate the event trace. Making this efficient and precise was the subject of a sequence of research papers in the early 2000s. Zealcore was acquired by Enea in 2008, and I have not seen much from them since. From what I can tell, the Zealcore fundamental technology for recording on a live system (or at least the ideas) have been continued into a new company called <strong><a href="http://www.percepio.se/">Percepio</a></strong>.</p>
<p>Aa fundamental difference between these recording systems and checkpointing systems is that they do not capture the complete target system state in the way a checkpoint does. The recording is much more compact, but it does not really solve the same problem. It is not based on running the target inside a simulator (other than at the replay end). What the relative success of such recording system indicates, however, is that in many systems, there are &#8220;important&#8221; and &#8220;irrelevant&#8221; aspects of inputs and events and behaviors, and that recording and replaying only &#8220;important&#8221; aspects is often sufficient to trigger bugs.</p>
<p>You can also throw hardware at the problem.</p>
<p>Completely unexpectedly, I also found a reference to a hardware-based record/replay system in a <a href="http://cacm.acm.org/magazines/2010/8/96632-an-interview-with-edsger-w-dijkstra/fulltext">Communications of the ACM interview with Edsger Dijkstra</a> (a rerun of an <a href="http://www.cbi.umn.edu/oh/pdf.phtml?id=296">interview from 2002</a>). Apparently, during the early programming of the <strong>IBM 360</strong>, IBM realized that debugging interrupts was hard. The solution was to create a piece of special hardware which would record interrupts, and later replay them with precise timing. In this way, you achieved repeatable executions of the most difficult code there was. I must quote what Dijkstra says on this &#8220;throw money at the problem&#8221; approach:</p>
<blockquote><p>When IBM had to develop the software for the 360, they built one or two machines especially equipped with a monitor. That is an extra piece of machinery that would exactly record when interrupts took place and from where to where. And if something went wrong, it could replay it again and use the recorded history to control when interrupts would occur. So they made it reproducible, yes, but at the expense of much more hardware than we could afford. Needless to say, they never got the OS/360 right.</p></blockquote>
<p>The final comment is typical for Dijkstra&#8217;s thinking that debugging is just an indication that you did not get the program and design right from the start. That&#8217;s certainly true, and he would likely have considered my little S4D paper as an unnecessarily complicated solution to a problem that should not have existed in the first place.</p>
<p>I, however, find the idea of the monitor interesting. I think that building something like that today would be much more difficult, as chips are very highly integrated and the support for replaying interrupts would have to go right into the heart of an SoC. But it would be interesting if it could be done.</p>
<p>There is also a <a href="http://jakob.engbloms.se/archives/130">paper from 1969 that I wrote about a few years ago </a>that does include the idea of recording and replaying asynchronous external inputs to a simulator.</p>
<h3>Other Checkpointing Systems</h3>
<p>There might be some related use of checkpoints (or snapshots as they are more commonly known) in the development of game emulators.  There is clearly the ability to save game state in a portable way in emulators like MAME.  Such states can be useful to help debug the emulator, but in a different way from the approach that I presented.  In the emulator case, the state is really the state of the emulated target.  It is not the state of the emulator program itself. If game emulator snapshots were used to debug the game code, it would be the same situation as what I describe in the S4D paper.</p>
<p>As I understand it, this is more like a attaching an example document that makes a program crash to a bug report, rather than transporting the state of the emulator itself.</p>
<p>Going down in the level of abstraction, I have also been told that RTL simulators offers a similar ability and that they have used in a similar way. Since I am not at all familiar with that field, I would not comment on this in the paper.</p>
<p>Transporting RTL bugs using checkpoints makes perfect sense. In an RTL simulator, the target state is very clearly described in an unambiguous way with no  relationship to host state. Checkpointing should be easy to implement and checkpoints should be portable, anything else would be a poor implementation.  The simulation is also deterministic, assuming a reasonable implementation of the simulator. The simulated world is also encapsulated with a set of test cases, RTL simulations are too slow to be interfaced to the real world. If an RTL simulator is interfaced to something else, recording the incoming signals should be straightforward since they are at a very low level (bits, clocks, pin states).</p>
<p>The use of checkpointing with RTL also fits with a conversation I had in 2005 when Virtutech introduced reverse execution in Simics. At one of the tradeshows where we showed the technology, an older gentleman approach me and told me that he had done similar things with hardware simulators back in the 1980s. He immediately understood the implementation idea (checkpoints with deterministic replay), and sounded like he felt it was nothing much new.</p>
<p>Finally, at some other event last year, I saw an demonstration of an RTL-level tool where the trace of the execution was generated on one machine, but inspected on a different machine. That amounts to a portable trace, even if the data volumes were rather large (many GB) and essentially required the RTL simulator (or hardware accelerated emulator) to be sharing disks with the investigation machine. Still, nothing prevents such a solution from being remotely used. The main difference from what I describe is that here only the result of the execution (trace of signals) is transported, not an actual state snapshot that can be brought up to continue the execution in a different place.</p>
<p>If you have any other notes on this, please comment!</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1231"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1231" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1231" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1231/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog Post: Determinism vs Variability</title>
		<link>http://jakob.engbloms.se/archives/1247?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1247#comments</comments>
		<pubDate>Thu, 09 Sep 2010 13:02:27 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[repeatability]]></category>
		<category><![CDATA[reverse execution]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1247</guid>
		<description><![CDATA[I have a new post at my Wind River blog, about variability and determinism and how these two concepts interact. In short, even a deterministic simulator can expose great variability in a software workload and target system behavior. Tweet]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>I have a new post at my Wind River blog, about <a href="http://blogs.windriver.com/engblom/2010/09/deterministic-but-unpredictable.html">variability and determinism </a>and how these two concepts interact. In short, even a deterministic simulator can expose great variability in a software workload and target system behavior.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1247"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1247" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1247" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1247/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>S4D Paper on Transporting Bugs with Checkpoints</title>
		<link>http://jakob.engbloms.se/archives/1235?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1235#comments</comments>
		<pubDate>Tue, 31 Aug 2010 18:40:55 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[Wind River Blog]]></category>
		<category><![CDATA[Checkpointing]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[S4D]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1235</guid>
		<description><![CDATA[I have a paper about &#8220;Transporting Bugs with Checkpoints&#8221; to be presented at the S4D (System, Software, SoC and Silicon Debug) conference in Southampton, UK, on September 15 and 16, 2010. The core concept presented is to leverage Simics checkpointing to capture and move a bug from the bug reporter to the responsible developer. It [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/09/S4D1.jpg"><img class="alignleft size-full wp-image-941" style="margin: 5px 10px;" title="S4D" src="http://jakob.engbloms.se/wp-content/uploads/2009/09/S4D1.jpg" alt="" width="143" height="62" /></a>I have a paper about &#8220;Transporting Bugs with Checkpoints&#8221; to be presented at the <a href="http://www.ecsi.me/s4d">S4D (System, Software, SoC and Silicon Debug) conference </a>in Southampton, UK, on September 15 and 16, 2010. The core concept presented is to leverage <a href="http://www.windriver.com/products/simics/">Simics </a>checkpointing  to capture and move a bug from the bug reporter to the responsible  developer. It is a fairly simple idea, but getting it to work  efficiently does require that some things are done right. See the longer <a href="http://blogs.windriver.com/engblom/2010/08/transporting-bugs-with-checkpoints.html#more">Wind River blog posting </a>about this topic for a few more details.</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1235"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1235" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1235" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1235/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

