<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; CoWare</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/coware/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>CoWare SystemC Checkpointing</title>
		<link>http://jakob.engbloms.se/archives/1039?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1039#comments</comments>
		<pubDate>Tue, 29 Dec 2009 12:02:24 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Checkpointing]]></category>
		<category><![CDATA[CoWare]]></category>
		<category><![CDATA[Mambo]]></category>
		<category><![CDATA[Reiner Leupers]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[SimOS]]></category>
		<category><![CDATA[Stefan Kraemer]]></category>
		<category><![CDATA[SystemC]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1039</guid>
		<description><![CDATA[Continuing on my series of posts about checkpointing in virtual platforms (see previous posts Simics, Cadence, our FDL paper), I have finally found a decent description of how CoWare does things for SystemC. It is pretty much the same approach as that taken by Cadence, in that it uses full stores a complete process state [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-735" style="margin: 5px 10px;" title="gears" src="http://jakob.engbloms.se/wp-content/uploads/2009/04/gears.png" alt="gears" width="56" height="57" />Continuing on my series of posts about checkpointing in virtual platforms (see previous posts <a href="http://jakob.engbloms.se/archives/714">Simics</a>, <a href="http://jakob.engbloms.se/archives/817">Cadence</a>, our <a href="http://jakob.engbloms.se/archives/880">FDL paper</a>), I have finally found a decent description of how <a href="http://www.coware.com">CoWare </a>does things for <a href="http://www.systemc.org">SystemC</a>. It is pretty much the same approach as that taken by Cadence, in that it uses full stores a complete process state to disk, and uses special callbacks to handle the connection to open files and similar local resources on a system. The approach is described in a paper called  &#8220;A Checkpoint/Restore Framework for SystemC-Based Virtual Platforms&#8221;, by Stefan Kraemer and Reiner Leupers of RWTH Aachen, and Dietmar Petras, and Thomas Philipp of CoWare, published at the <a href="http://soc.cs.tut.fi/2009/">International Symposium on System-on-Chip, in Tampere, Finland, in October of 2009.</a></p>
<p><span id="more-1039"></span></p>
<p>The approach taken for their checkpointing system is to save the entire state of  the running simulation program (i.e., an entire host operating system process), and later recreate the process in the same state. This gets around the need program all simulation models to explicitly support checkpointing, but it also limits the applicability of checkpointing severely. Of the <a href="http://jakob.engbloms.se/archives/714">checkpointing operations described in my previous post</a>, it only supports the bringing up of a checkpoint into the same model, on the same machine (or a machine that runs a completely identical software stack down to the exact versions of all libraries, drivers, etc., which is not very likely to happen). This is honestly admitted in the paper, I appreciate that.</p>
<p>Process-based checkpointing in this way is also used by Cadence, and just like Cadence&#8217;s solution, the problem that appears when implementing it in practice is how to handle all the OS resources opened by a process. The process itself is not enough. The resources are typically files open for input and output, connections to debuggers, and other things that reach out of the virtual world of the simulation into the real world of the host machine. The solution is just like Cadence&#8217;s solution: provide callbacks for the SystemC side of such connections that can save the state of the connection in some way, and restart the connection when a checkpoint is opened. Doing this right does require some care in what to do in which order, and the paper nicely explains this.</p>
<p>For example, you need to close down all connections before taking a checkpoint, and then restore them after taking it. This is a fairly destructive operation, compared to the Simics-style checkpointing where all you do is interrogate the state of all objects and save that. I had not thought of that before, but it makes perfect sense.</p>
<p>To me, this also points to an important architectural issue in SystemC simulations in general: what are these simulation modules doing opening files on their own anyway? In my opinion, a simulation model should never do such low-level thing, it should operate using only the defined simulation system API and simulation-level connections to other simulation modules. If you need to load test data, do that by putting it into some storage system managed by the simulator itself. In all my professional life, I have considered file I/O to be a fundamental service provided by the application framework I use, nothing that my actual payload code should have to deal with. For example, in Simics, we tend to solve the loading and checkpointing of test data vectors using a memory &#8220;image&#8221;, which supports checkpointing in itself. Then a script just loads a file into the memory image, and a device reads the memory and moves transactions into the simulated system. This means that file I/O is totally absent from the model, and also that the test data can be managed and inspected by existing infrastructure.</p>
<p>The cost of taking a checkpoint is quantified in the paper, and it is not as bad as one could fear. The size of  a checkpoint easily reaches 100s of MB for even small target systems, but saving and loading that today does not take all that long. Still, going to a system with only 16 target processors (8 ARM, 8 DSP, and 8 times 64 MB of target memory = 512 MB) generates a checkpoint taking 1.7 GB to store and 30 seconds to save.</p>
<p>Once again, it is interesting to compare this to the Simics-style approach, where only differences are saved for target memories, and only target memory that is in use needs to be saved at all. This makes for usually far more compact checkpoints, which save and open in a few seconds in most cases. A checkpoint taking 30 seconds to open in Simics is rare indeed, and basically requires a target containing many gigabytes of target memory that is all in use (not host memory).</p>
<p>The paper does raise a novel point for why checkpointing is good: it saves you the time to setup a debugger. It is certainly true that setting up a debug environment for a session does take time, but it is not made clear just how it is recreated. My impression is that really what is going on here is that the debugger remains alive and loaded, and the simulation goes back and forth in time using checkpoints. Which is perfectly valid and nice.</p>
<p>In Simics, we solve this problem in two ways: when it comes to setting up the internal debugger when opening a checkpoint, the solution is to use a script to configure it. On purpose and by design, Simics checkpoints do not store the simulation session state, only the target system state, as that is the only way to be portable over time and across widely different machines. it would be very strange to open a checkpoint a few years after it was created, by some random user, start a simulation, and have it stop just because there was some breakpoint still in place for some long-forgotten reason. In any case, the simulator-side of the debugger has to be adopted in all systems to accept checkpointing.</p>
<p>Another strange point in the paper that I would have liked to ask the authors about is the idea that you can always change the target software in the middle of a session. But what is the value of checkpointing then? Since the idea is saving the cost of booting a machine and setting up a debug session, it is hard to see what is gained by booting a machine, keeping the hardware state, but replacing the software state. The expensive operation is presumably setting up the software state? At least it is in my experience.</p>
<p>I commend the paper for actually running something more than the archetypal &#8220;ARM+DSP&#8221;, by running eight copies of an ARM+DSP subsystem. That is at least starting to look like something interesting to simulate.</p>
<h2>Reviewer Notes for the Paper</h2>
<p>Finally, I have some small critiques on the academic paper itself, of the kind that I tend to provide to paper authors when active as a reviewer for conferences.</p>
<p>The SoC conference paper itself is well-written, but I have point out that the authors for some reason ignore the history of checkpointing in full-system simulation. The seminal work here was the checkpointing system used for changing the level of abstraction from fast functional simulation to cycle-accurate simulation in the SimOS system developed at Stanford <a href="ftp://db.stanford.edu/pub/cstr/reports/csl/tr/94/631/CSL-TR-94-631.pdf">already in the early 1990s</a>. This has later been continued in both IBM Mambo and Virtutech Simics, and remains the most powerful way of doing checkpointing until this day. I don&#8217;t think the authors were completely ignorant of this work, even though finding good references can be a bit difficult. Even so, here are some to add to future versions of that paper:</p>
<ul>
<li><a href="http://portal.acm.org/citation.cfm?id=891451">SimOS: A Fast Operating System Simulation Environment</a>, Stanford CS Technical Report CSL-TR-94-631, 1994. The earliest mention of using checkpointing in a full-system simulator (for switching from fast to detailed simulation).</li>
<li>&#8220;<a href="http://portal.acm.org/citation.cfm?id=1014602">Design and validation of a performance and power simulator for PowerPC systems</a>&#8220;, from the IBM Journal of Research and Development in 2003. Mentions that IBM Mambo does checkpointing like SimOS.</li>
<li><a href="http://parsa.epfl.ch/simflex/publ/per2004.pdf">SIMFLEX: A Fast, Accurate, Flexible Full-System Simulation Framework for Performance Evaluation of Server Architecture</a>, ACM SIGMETRICS Performance Evaluation Review, 2004. Also see the <a href="http://parsa.epfl.ch/simflex/">Simflex </a>homepage. Shows an ambitious use of checkpointing in computer architecture simulations.</li>
</ul>
<p>It is also a bit disingenuous to dismiss the issue of just how the process checkpointing is performed by a reference to an early requirements paper from the <a href="https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml">BLCR project</a>.  All that says is that there are lots of possible variants of implementation, nothing on what was done in this particular instance. It would have been nice to have had some more details: does the approach use a kernel module or not?</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/1039"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/1039" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/1039" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1039/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Is Cycle Accuracy a bad Idea?</title>
		<link>http://jakob.engbloms.se/archives/153?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/153#comments</comments>
		<pubDate>Fri, 11 Jul 2008 20:45:02 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Axys]]></category>
		<category><![CDATA[Carbon Technology]]></category>
		<category><![CDATA[clock-cycle models]]></category>
		<category><![CDATA[CoWare]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[DEC]]></category>
		<category><![CDATA[Grant Martin]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[Infineon]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[rtl]]></category>
		<category><![CDATA[scdsource]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=153</guid>
		<description><![CDATA[In a funny coincidence, I published an article at SCDSource.com about the need for cycle-accurate models for virtual platforms on the same day that ARM announced that they were selling their cycle-accurate simulators and associated tool chain to Carbon Technology. That makes one wonder where cycle-accuracy is going, or whether it is a valid idea [...]]]></description>
			<content:encoded><![CDATA[<p>In a funny coincidence, I published an article at SCDSource.com about the need for cycle-accurate models for virtual platforms on the same day that ARM announced that they were selling their cycle-accurate simulators and associated tool chain to Carbon Technology. That makes one wonder where cycle-accuracy is going, or whether it is a valid idea at all&#8230; is ARM right or am I right, or are we both right since we are talking about different things?</p>
<p>Let&#8217;s look at this in more detail.</p>
<p><span id="more-153"></span></p>
<h2>Definitions</h2>
<p>A clock-cycle (CC) model in this discussion is something that attempts to provide a cycle-by-cycle depiction of the behavior of a computer system. Usually, such models are driven by a cycle-by-cycle clock, as that is the easiest way to write and structure them.<br />
A cycle-accurate (CA) model is a CC model where the depiction is &#8220;the same&#8221; as what would happen in the real system provided they both started from the same state. </p>
<h2>What is ARM Doing?</h2>
<p>ARM seems to be passing on the tools and technologies they acquired when they bought Axys back in 2004. These tools are CC-oriented, and are aimed at hardware architects (and some really-low-level software work). They make it possible to evolve a target design cycle by cycle in the simulator to get a very accurate picture of the target behavior. I think this fits very well for Carbon, as they generate cycle-driven very accurate models by essentially compiling the actual RTL implementation of a piece of logic, processor, or device into something a bit faster than plain HDL simulation. Carbon models are a natural fit for the Axys tools.</p>
<p>Basically, it sounds as if ARM decided that manually creating CC level CA models for their latest processors for use in the Axys tools (SoC Designer) was too much work and too hard to validate. Thus, they pass the whole thing on to Carbon and seem to expect Carbon to generate CA models for use with SoC designer straight from the actual ARM implementation RTL. Carbon will have the old CC/CA models written by Axys (and later ARM), and then generate new models for new generations of ARM chips like the Cortex A9. I quote:</p>
<blockquote><p>&#8220;The model generation flow will be optimized and validated using the RTL code, ensuring speed and accuracy. The processor models will also leverage the Carbon model application programming interface (API) to offer a direct connection to the ARM RealView(R) Debugger. Carbon-generated models of ARM IP will offer our customers the fastest, most-accurate path for firmware development and architectural exploration.&#8221; (<a href="http://www.carbondesignsystems.com/Press/20080707%20Carbon%20Press%20Release.pdf">press release</a>)</p></blockquote>
<p>And:</p>
<blockquote><p>ARM made this decision, Cornish said, because it&#8217;s become increasingly difficult and time-consuming to develop cycle-accurate models. &#8220;We recognized it would make more sense to work with a specialist like Carbon that has technology for generating models directly from RTL,&#8221; he said. (<a href="http://www.scdsource.com/article.php?id=264">SCDSource News Piece on the deal</a>)</p></blockquote>
<h2>Feasibility of Construction</h2>
<p>The core argument here is really how easy or feasible it is to build CA models of a processor core (or any other really complex piece of logic). There are several interesting views to consider.</p>
<ol>
<li>The ARM statement is basically saying that building CA models of a processor core is very hard. It is hard to get right, hard to validate, and hard to maintain. So why even try? Better to generate it from the RTL and let experts at doing that do the work.</li>
<li>In my PhD thesis from 2002, I concluded that building an accurate model of a processor from public information and reverse engineering is very very difficult, and cited a number of computer architecture and real-time systems attempts to build models that all turned out to have accuracy issues. I did not know much about EDA then &#8212; and ESL did not really exist. But I think that still holds water: constructing a model of a processor is hard.</li>
<li>In the SCDSource article, I make the statement that &#8220;Building cycle-accurate (CA) models is very difficult, as you need to understand and describe the implementation details of complicated hardware units. &#8230; It is quite easy to end up with something that is essentially an alternative implementation to the actual chip RTL. It is especially difficult for third parties, as it requires access to the device and processor core designers to explain the design.&#8221; Which is essentially saying that you need to get inside the processor design group to get the information.</li>
<li>The common knowledge that all great processor design teams, from the DEC Alpha to Intel x86s to AMD Opterons to IBM Power to Freescale Power to Infineon TriCore to Sun Niagara use internal cycle-detailed simulators as their main design tools to prototype and decide how to design pipelines, memory systems, and system platforms. In this case, the simulator comes before the processor, not the other way around.</li>
<li>Tensilica has, as Grant Martin points out in comments at <a href="http://www.scdsource.com/article.php?id=266">SCDSource</a>, tools that generate both the processor and an accurate model at the same time from the same information base.</li>
<li>CoWare&#8217;s LisaTek tools for describing and generating application-specific processors also claim to generate accurate models from the LISA source files in a way similar to Tensilica but based on a user describing a completely custom design in a third-party tool. In the case of Tensilica, the tool and the design come from the same company.</li>
</ol>
<p>So where does this leave us? It makes it clear that in order to build a good cycle-accurate model you need access to internal information and the processor design/processor design team. The CA model can be built either:</p>
<ol>
<li>By synthesizing from the RTL, Carbon-style.</li>
<li>By synthesizing from some more abstract design description, Tensilica or LisaTek-style.</li>
<li>By the design team as part of the design process.</li>
<li>By some poor guy working after the fact from specs and test cases.</li>
</ol>
<p>I think the ARM-Carbon deal (and all practical experience as well) invalidates the fourth variant. Essentially, that is what Axys had to do: build models after the fact, separate from the CPU design flow. This is a property of how ARM design processors and the fact that Axys began life outside of ARM (my guess, nota bene). It is what computer architecture researchers often want to do but fall down on over and over again. In fact, a common question from computer architecture newbies is if Virtutech Simics has correct models of processors like the Intel Pentium4 or Core 2 available to use as starting points in research. It would be nice, but sorry, we do not.</p>
<p>But the other three variants do make sense, and will all result in some kind of decent model. Which one you end up doing depends on the style of your design and quite likely the complexity of the processor and system design. In the end, any truly revolutionary design (think Sun Rock, for example) will need to write a custom simulator as tools will not have the concepts in them to model all ideas. It seems that simple &#8220;standard&#8221; designs that fit in the categories of &#8220;custom RISC&#8221; or &#8220;custom DSP&#8221; and that do not break new ground in computer architecture can probably be designed using tools that allow processor and simulator generation. I think that most heavy-duty general-purpose processor cores will have to do either the design-model or RTL-generation path, while more accelerator-style cores can use the tools approach.</p>
<p>As a final note, there could really be two different problems being addressed here regarding &#8220;cycle accuracy&#8221;, and that this might contribute to different levels of feasibility:</p>
<ul>
<li>Using the simulator to validate and optimize software performance can tolerate some errors in details as long as errors do not accumulate (see for example the &#8220;<a href="http://moss.csc.ncsu.edu/~mueller/wcet06/accepted/5.html">timing anomalies</a>&#8221; or &#8220;<a href="http://www-emsoft02.imag.fr/Programme/Engblom.pdf">unbounded long timing effects</a>&#8221; found in WCET research). It is about understanding the software behavior versus the processor design (or complex accelerator design versus input data), in small focused spots of execution.</li>
<li>Using the simulator to validate a chip design including buses and other devices that can be bus masters. This ought to require a higher level of accuracy, as the penalty for errors would potentially seem greater. And this is also where ARM&#8217;s SoC designer fit in, rather than as a tool to understand the software behavior. The scope here is larger and there is usually no idea of zooming in on detail at particular points in time.</li>
</ul>
<p>So where does this land us?</p>
<p>I guess that CC/CA models can be built if you have a nice inside track to the design team, and that the only sensible way to use them is as a zoom device for the places in your code where you absolutely need the details. Most of the time (say 90-95-99%) software does not need CC models, but rather something that is functionally accurate and that runs really really fast so that all software can at least be executed. That is something a CC model will never be able to do, at least not for systems using non-trivial operating systems requiring a few billion instructions just to boot&#8230;</p>
<div class="simple_likebuttons_container_small">
      <div class="simple_likebuttons_googleplus">
        <g:plusone size="medium" count="false" href="http://jakob.engbloms.se/archives/153"></g:plusone>
      </div>
    
      <div class="simple_likebuttons_twitter simple_likebuttons_twitter_s">
        <a href="https://twitter.com/share" class="twitter-share-button" data-count="none" data-url="http://jakob.engbloms.se/archives/153" data-lang="en">Tweet</a>
      </div>
    
      <div class="simple_likebuttons_facebook">
        <div id="fb-root"></div>
        <script>(function(d, s, id) {
          var js, fjs = d.getElementsByTagName(s)[0];
          if (d.getElementById(id)) {return;}
          js = d.createElement(s); js.id = id;
          js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
          fjs.parentNode.insertBefore(js, fjs);
        }(document, "script", "facebook-jssdk"));</script>
        <div class="fb-like" data-href="http://jakob.engbloms.se/archives/153" data-send="false" data-layout="button_count" data-show-faces="false" data-width="90"></div>
      </div>
    </div>]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/153/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

