<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; computer architecture</title>
	<atom:link href="http://jakob.engbloms.se/archives/category/computer-architecture/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>Fujitsu Server Fault Injection Robot</title>
		<link>http://jakob.engbloms.se/archives/1530?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1530#comments</comments>
		<pubDate>Sun, 11 Dec 2011 20:53:25 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[fault injection]]></category>
		<category><![CDATA[fujitsu]]></category>
		<category><![CDATA[Masafumi Matsuo]]></category>
		<category><![CDATA[server]]></category>
		<category><![CDATA[Yuichi Kurita]]></category>
		<category><![CDATA[Yuji Uchiyama]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1530</guid>
		<description><![CDATA[Fault Injection is a topic that has fascinated me for a long time. Not just the area of software-to-software fault injection, but more so how you inject faults into hardware using hardware (and how to conveniently approximate this using a simulator). I just stumbled on a short interesting note about such hardware-actuated fault injection in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/12/fujitsulogga.png"><img class="alignleft size-full wp-image-1531" style="margin: 10px 5px;" title="fujitsulogga" src="http://jakob.engbloms.se/wp-content/uploads/2011/12/fujitsulogga.png" alt="" width="57" height="47" /></a>Fault Injection is a topic that has fascinated me for a long time. Not just the area of <a href="http://en.wikipedia.org/wiki/Fault_injection">software-to-software fault injection</a>, but more so how you inject faults into hardware using hardware (and how to conveniently approximate this using a <a href="http://blogs.windriver.com/engblom/2010/10/the-virtual-basil-fawlty.html">simulator</a>). I just stumbled on a short interesting note about such hardware-actuated fault injection in a Fujitsu article.</p>
<p><span id="more-1530"></span>The <a href="http://www.fujitsu.com/global/news/publications/periodicals/fstj/">Fujitsu Scientific and Technical Journal </a>is the Fujitsu equivalent of IBM&#8217;s Journal of Research and Development. Thankfully, the FSTJ is still free while IBM erected a paywall around their articles. <a href="http://www.fujitsu.com/global/news/publications/periodicals/fstj/archives/vol47-2.html">Number 2 of 2011 </a>had the theme of servers, and there is an article about <a href="http://www.fujitsu.com/downloads/MAG/vol47-2/paper07.pdf">Quality Assurance for  Stable Server Operation </a>by Masafumi Matsuo, Yuji Uchiyama, and Yuichi Kurita.</p>
<p>The article describes the process of ensuring that the final servers that are shipped to customers (from what seems to be the Sparc-based line of Fujitsu computers, even though it might actually apply equally to their mainframes and x86-based servers) are as stable as possible. Apart from designing things right, this also requires testing the fault handling and recovery operations.</p>
<p>I found it noteworthy that they do a lot of configuration testing, where various hardware and software configurations are played off against each other. In this way, corner cases are explored and coverage of the actual configurations that customers will be using becomes more likely (it is always dangerous to only test on one or a few configurations). They push memory system and processor loads to very high levels to ensure continued operation even in extreme cases, and also try to push the actual chips to make sure they will operate reliably in a wide range of environmental conditions. Indeed, a large focus is placed on pure physical reliability, as that is the basis for system reliability.</p>
<p>The best part, however, is on page four of the article, where they show the physical fault injection robot that is applied to the chips mounted on boards. This robot shorts out individual pins on chips, clamping them to zero volts. It goes over all pins, and the test system checks what happens to the system in each case. That amounts to a kind of exhaustive testing.</p>
<p>Neat. I have heard other stories of physical fault injection, including complex mechanisms like passing computer boards through irradiation chambers all the way to brutally simple tasks like putting an axe into a board to break it or pulling boards out of racks to simulate a sudden catastrophic failure. I would like to see more of just how these things are done in the real world. I suspect there are quite a few interesting robotics setups out there that do fault injection.</p>
<p>In any case, the article offered an interesting glimpse of many of the techniques used to make computer systems robust and reliable. Recommended.</p>
<p>It ends by noting that deeply consolidated SoC designs and aggressive dynamic power management are challenging from a testing and observation perspective, as well as creating more single points of failure. If a single system on chip fails, that system is all gone&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1530/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GPGPU for Instruction-Set Simulation &#8211; Maybe, Maybe not</title>
		<link>http://jakob.engbloms.se/archives/1506?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1506#comments</comments>
		<pubDate>Sat, 08 Oct 2011 19:17:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[CCGrid]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[simulation]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1506</guid>
		<description><![CDATA[I just read a quite interesting article by Christian Pinto et al., &#8220;GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms&#8221;, published at the CCGRID 2011 conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed the simulation. Intriguing concept, but the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png"><img class="alignleft size-full wp-image-125" style="margin: 5px 10px;" title="coreshrink1" src="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png" alt="" width="100" height="100" /></a>I just read a quite interesting article by Christian Pinto et al., &#8220;<a href="http://infoscience.epfl.ch/record/164471">GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms</a>&#8221;, published at the <a href="http://www.ics.uci.edu/~ccgrid11/">CCGRID 2011 </a>conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed up the simulation. Intriguing concept, but the execution is not without its flaws, and it is unclear, at least from the paper, just how well this generalizes, scales, or compares to parallel simulation on a general-purpose multicore machine.</p>
<p><span id="more-1506"></span>The paper describes a simulation for a network-on-chip based homogeneous system containing &#8220;ARM-subset&#8221; ISS instances with local instruction and data caches, some local RAM, and also some shared RAM. Each core runs its own local software load; there is no SMP operating system. All communication between cores is over shared memory, using explicit operations across the NoC. All cores run a single cycle before they check communications from their neighbors.</p>
<p>This last point is crucial to understanding why this is feasible at all &#8211; in general, simulating a general shared-memory multiprocessor machine on a shared-memory multiprocessor falls down on the synchronization overhead. If your simulation semantics dictate that you synchronize every cycle anyway, and you do not try to optimize each core simulator, there is clearly decent room for parallel execution. By including the cache, they increase scalability, since there is more work per target cycle that can be run in isolation.</p>
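<p>To make the per-cycle synchronization concrete, here is a minimal sketch of a lockstep simulation loop. It is not the code from the paper: the <code>Core</code> class, the ring topology, and the toy behaviour of core 0 are all assumptions made up for illustration. The point is only the structure: every core steps one cycle in isolation, and messages are exchanged at the cycle boundary.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical toy core: counts cycles and consumes incoming messages."""
    core_id: int
    cycles: int = 0
    inbox: list = field(default_factory=list)
    received: list = field(default_factory=list)

    def step(self):
        """Run exactly one target cycle in isolation; return a message or None."""
        self.cycles += 1
        # Consume whatever arrived at the previous cycle boundary.
        self.received.extend(self.inbox)
        self.inbox.clear()
        # Toy behaviour (assumption): core 0 sends its cycle count every cycle.
        if self.core_id == 0:
            return self.cycles
        return None


def run_lockstep(cores, num_cycles):
    """Advance all cores one cycle at a time, syncing at every cycle boundary."""
    for _ in range(num_cycles):
        # Phase 1: each core steps independently; this part could run in parallel.
        outgoing = [core.step() for core in cores]
        # Phase 2: synchronization point; deliver messages to the next core in a ring.
        for i, message in enumerate(outgoing):
            if message is not None:
                cores[(i + 1) % len(cores)].inbox.append(message)
    return cores
```

<p>Calling <code>run_lockstep([Core(0), Core(1), Core(2)], 3)</code> runs three target cycles. Since phase 1 touches only core-local state, the more work each <code>step</code> does (for example, a cache lookup), the better the ratio of parallel work to synchronization.</p>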
<p>After reading the article, I am impressed by their work &#8211; just getting this to work is pretty good work. But there are quite a few questions which are not really answered in the article and which are crucial to understanding just how well GPGPUs could be used for this kind of ISS work.</p>
<ul>
<li>The targeted level of abstraction is a bit confusing. The authors claim it is &#8220;instruction accurate and not cycle accurate&#8221;, but still simulate caches and cycle-based communications across the NoC. If I read the paper right, communications will take a varying number of cycles depending on the distance for messages to travel. This is more detailed than a typical &#8220;instruction accurate&#8221; simulator.</li>
<li>The target system does not run an OS &#8211; that might (but I do not know) be an advantage for their approach, since it probably implies less variation in the instruction flow in cores, potentially enhancing the amount of time that all ISSes in a thread group in the GPU can execute the same instruction. This would seem crucial, as if each ISS was running a totally different program, the instruction execution part of the code would be running serialized.</li>
<li>They should really try to run the same kind of simulation on a high-end x86 CPU like an Intel Sandy Bridge with 8 or more hardware threads. I wonder if their scaling might not work just as well there &#8211; and with a much faster serial execution engine. This should give  a much more relevant point of comparison for GPU vs CPU execution of the simulator than&#8230;</li>
<li>the comparison object they use right now, a JIT-accelerated multicore simulation using OVP, seems pretty irrelevant since it is not doing the same thing at all. That simulator does not simulate the caches or NoC, just a large number of isolated processors. They also do not run a parallel program on OVP, but rather a large number of single-core Fibonacci and Dhrystone programs. Thus, the fact that OVP uses a large temporal decoupling time slice does not matter for semantics. It just does not seem like a very relevant comparison point. OVP and their simulator try to solve different problems &#8211; fast execution of general code vs. performance profiling of massively parallel machines.</li>
<li>As I understand it, the given &#8220;S-MIPS&#8221; numbers in the evaluation tell us the total number of MIPS that we get out across all target cores. That seems to peak around 2000 &#8211; which isn&#8217;t necessarily that fantastic if we compare to high-performance ISS work in general where a few GIPS is definitely achievable. It is pretty good considering the level of detail here, though, where I would expect a normal ISS + cache simulator to produce at most a few MIPS. Once again, the authors need to be a bit more precise as to what they compare to what.</li>
<li>Not having an MMU and not implementing any interrupts or exceptions in the target machines avoids a large part of the complexity of a real ISS. That complexity might well be too much for the quite rigid execution environment of a GPGPU.</li>
<li>They missed that Simics, unique among instruction-accurate mainstream simulators, is <a href="http://jakob.engbloms.se/archives/128">parallel </a>since version 4.0.</li>
</ul>
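<p>The concern about instruction-flow variation in the list above can be illustrated with a deliberately simplified model of a GPU thread group. The assumption here (mine, not the paper's) is that lanes executing the same opcode run together in one pass, and each distinct opcode in the group costs one extra serialized pass. The function names are hypothetical.</p>

```python
def serialized_passes(opcodes):
    """Passes needed for one thread group: one per distinct opcode (simplified model)."""
    return len(set(opcodes))


def lane_utilization(opcodes):
    """Fraction of lane slots doing useful work across all passes."""
    passes = serialized_passes(opcodes)
    if passes == 0:
        return 0.0
    # Each lane is active in exactly one of the passes.
    return 1.0 / passes
```

<p>With every simulated core at the same instruction, the group needs a single pass; with four cores each executing a different instruction, it needs four passes and only a quarter of the lane slots do useful work.</p>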
<p>So, overall, this paper does not really tell us much about whether a GPGPU can be used for instruction-set simulation in general. It does tell us that it might be doable, but there are many crucial complications which are not addressed.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1506/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nvidia &#8220;Kal-El&#8221; Variable SMP</title>
		<link>http://jakob.engbloms.se/archives/1496?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1496#comments</comments>
		<pubDate>Fri, 23 Sep 2011 19:16:33 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1496</guid>
		<description><![CDATA[Nvidia recently announced that their already-known &#8220;Kal-El&#8221; quad-core ARM Cortex-A9 SoC actually contains five processor cores, not just four as a &#8220;normal&#8221; quad-core would. They call the architecture &#8220;Variable SMP&#8221;, and it is a pretty smart design. The one where you think, &#8220;I should have thought of that&#8221;, which is the best sign of something [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/09/nvidia-logo.jpg"><img class="alignleft size-full wp-image-1497" style="margin: 5px 10px;" title="nvidia logo" src="http://jakob.engbloms.se/wp-content/uploads/2011/09/nvidia-logo.jpg" alt="" width="48" height="48" /></a>Nvidia <a href="http://blogs.nvidia.com/2011/09/quad-core-kal-el’s-stealth-fifth-core-lets-it-save-on-energy/">recently announced </a>that their already-known &#8220;Kal-El&#8221; quad-core ARM Cortex-A9 SoC actually contains five processor cores, not just four as a &#8220;normal&#8221; quad-core would. They call the architecture &#8220;Variable SMP&#8221;, and it is a pretty smart design. The one where you think, &#8220;I should have thought of that&#8221;, which is the best sign of something truly good.</p>
<p><span id="more-1496"></span>It is common practice in multicore computing today to dynamically change the clock frequency of a processor and turn cores on and off in order to adjust the compute power available to the current workload. Such operations tend to be limited in scope, as processors have minimum clock frequencies that make sense, and often the memory system requires all cores to be at the same frequency. Operating systems also tend to want to work with homogeneous sets of cores, as that makes scheduling reasonably straight-forward. This is probably what has kept the idea of &#8220;small + large&#8221; cores of the same ISA out of the mainstream of SMP design, despite all its advantages in principle.</p>
<p>Now, Nvidia has managed to implement some of that idea in Kal-El.</p>
<p>The key observation is that if you can turn cores on and off, once you get down to a single active core, any system is by definition homogeneous across all cores regardless of what that core is. Changing the nature of this core should then be much easier, since there is only a single core to contend with.</p>
<p>What Nvidia does in Kal-El is to add a fifth low-power core to the main group of four high-performance cores. The fifth core is architecturally identical (ARM Cortex-A9), so that the system state can be moved from the high-performance to the low-performance cores without undue complexities. Indeed, this is all done in hardware, so the OS (typically, Android) thinks it is running on a homogeneous quad-core. When the system is lightly loaded and the OS decides to only have a single core on, the hardware can detect the load is <em>really</em> light, and effectively change the nature of the active core to a low-power-optimized version.</p>
<p>Once more compute power is needed, the hardware invisibly slips back to the first high-power core, and then the OS can start increasing clocks and turning on cores as usual. It is effectively the same as a regular ARM Cortex-A9 quad-core setup, but with better low-power performance. The following graph from the Nvidia <a href="http://www.nvidia.com/content/PDF/tegra_white_papers/tegra-whitepaper-0911b.pdf">white paper </a>shows it pretty clearly (red text is my added comment):</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/09/tegra-1.png"><img class="aligncenter size-full wp-image-1498" title="tegra kal-el power curve" src="http://jakob.engbloms.se/wp-content/uploads/2011/09/tegra-1.png" alt="" width="655" height="446" /></a></p>
<p>Note the slope of the green line: that core is not a good one if you want high performance. It is optimized to scale within a range of low compute-power requirements, rather than provide the best performance per watt at the high end. Using Variable SMP, Nvidia lets us have both.</p>
<p>Neat.</p>
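<p>As a rough sketch of the switching rule described above: the companion core can only stand in when the OS has a single core active and the load is very light. The threshold value and all names below are hypothetical; the actual hardware policy is not spelled out here.</p>

```python
# Assumed load ceiling (fraction of one core) for the low-power companion core.
COMPANION_MAX_LOAD = 0.2


def select_physical_core(os_active_cores, load):
    """Pick the physical core backing OS-visible core 0: 'companion' or 'main'."""
    # With one active core the system is homogeneous by definition,
    # so the core behind it can be swapped without the OS noticing.
    if os_active_cores == 1 and COMPANION_MAX_LOAD >= load:
        return "companion"
    # Otherwise stay on (or slip back to) the high-performance cores.
    return "main"
```

<p>For example, <code>select_physical_core(1, 0.05)</code> picks the companion core, while any call with more than one active core stays on the main cores regardless of load.</p>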
<p>More reading:</p>
<ul>
<li><a href="http://arstechnica.com/gadgets/news/2011/09/tegra-3-includes-5th-stealth-core-to-optimize-power-efficiency.ars">ArsTechnica</a> has a short summary</li>
<li>There does not seem to be much more right now, everyone is really just reiterating the points from the white paper.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1496/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>IBM i &#8211; I&#8217;m Impressed</title>
		<link>http://jakob.engbloms.se/archives/1479?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1479#comments</comments>
		<pubDate>Sun, 14 Aug 2011 16:15:38 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[IBM i]]></category>
		<category><![CDATA[Software Engineering Radio]]></category>
		<category><![CDATA[Steve Will]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1479</guid>
		<description><![CDATA[From what little I had heard and read, the IBM AS/400 (later known as iSeries, and now known as simply IBM i) sounded like a fascinating system. I knew that it had a rich OS stack that contained most of the services a program needs, and a JVM-style byte code format for applications that let [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/08/ibm-i-for-business.png"><img class="alignleft size-full wp-image-1480" title="ibm i for business" src="http://jakob.engbloms.se/wp-content/uploads/2011/08/ibm-i-for-business.png" alt="" width="105" height="104" /></a>From what little I had heard and read, the IBM AS/400 (later known as iSeries, and now known as simply <a href="http://www-03.ibm.com/systems/power/software/i/">IBM i</a>) sounded like a fascinating system. I knew that it had a rich OS stack that contained most of the services a program needs, and a JVM-style byte code format for applications that let it change from custom processors to Power Architecture without impacting users at all.  It was supposedly business-critical and IBM-quality rock solid. But that was about it.</p>
<p>So when <a href="http://www.se-radio.net/">Software Engineering Radio </a>episode 177 <a href="http://www.se-radio.net/2011/07/episode-177-ibm-i-os400-operating-system-with-steve-will/">interviewed</a> the i chief architect <a href="http://ibmsystemsmag.blogs.com/you_and_i/">Steve Will, </a>I was hooked. It turned out that IBM i was cooler than I imagined. Here are my notes on why I think that IBM i is one of the most interesting systems out there in real use.</p>
<h2><span id="more-1479"></span>Independence of Processor</h2>
<p>IBM i uses a byte code format for application programs. This byte code (known as technology-independent machine interface, or TIMI) is quite unlike what we have in the JVM or CLR. First of all, it predates the JVM by about 15 years. The first generation of systems, the <a href="http://www-03.ibm.com/ibm/history/exhibits/rochester/rochester_4009.html">IBM Series 38</a>, came out in 1980. Second, the TIMI code contains many higher-level operations like database accesses, making it possible to generate far better executable code than if it was just plain API calls. Third, it is compiled before execution, and not just-in-time.</p>
<p>The TIMI was designed as the designers even in 1972 realized that processors will come and go, but software will remain. I would guess that the IBM experience with the System/360 and migrating software to it from older systems had something to do with this.</p>
<p>Over its lifetime, IBM i has gone from the original System 38 CPU to the AS/400 CPU to a customized 64-bit POWER CPU (still called AS/400), and now to a completely standard Power7 processor. Indeed, I did not realize until now that IBM merged the pSeries and iSeries hardware in 2008. Today, IBM i is just a software stack running on a hardware platform that can just as well run AIX or Linux! That is quite a journey for a system over 30 years, and proves that the original design was amazingly sound.</p>
<p>There seems to have been at least one slight imperfection in continuity. The original design used 48-bit pointers, which was very far-sighted for a design team in 1972, when the biggest machines around used 24-bit pointers (like designing a 128-bit pointer into a system of today as the default). Still, this did become too small, and a change was made to 64 bits in 1995 (when the almost-PowerPC RS64 became the processor for the AS/400). Apparently, this required some side-band information about a program to allow patching even the TIMI code in the right way.</p>
<h2>Integrated Database</h2>
<p>The i OS (I guess Apple just beat IBM to the trademark, since iOS would seem the natural name for the OS, right?) is an integrated environment that tries to do a lot of things for the user that would normally require third-party software. In particular, it has a database integrated, which can be accessed both from i APIs and, lately, over SQL. It is branded as &#8220;DB2&#8221; and DB2-targeting programs see no difference between it and DB2 running on AIX. But according to Steve, the core is not DB2 but the database that was built into IBM i from the start.</p>
<h2>Integral Security</h2>
<p>Where IBM i really stands out is in the decision to forgo the traditional concept of a file system and instead rely on an <a href="http://en.wikipedia.org/wiki/AS/400_object">object storage concept</a>. This has tremendous advantages for security, both because access rights are powerful and attached to objects, and because it avoids all the dangers of a typical file system. For example, there is no way to make a document executable. Programs are programs, data objects are data objects, and you cannot make a Windows .exe masquerade as a .jpg. All users are associated with a user profile indicating what they can do and work with.</p>
<p>This does require some special treatment for users like programmers. Programmers are always a problem, since they need to create new programs. Same thing with the no-execute protection in recent Windows operating systems and just-in-time compilers. The i solution is to have a special programmer role with special permissions.</p>
<h2>Importing APIs</h2>
<p>Just like IBM zEnterprise (the latest name for  the heritage from System/360), the IBM i system has been modernized in  recent years by adding support for many standard APIs and concepts from  mainstream computing. They can run Java and use JDBC, for example. IBM  does not seem to hesitate to help programmers reuse code written for  other platforms on their heavy-duty machines.</p>
<p>A funny part for i is that they had to add a virtual file system in order to make Java happy. Apparently, a JVM cannot work or run most programs without accessing a file system. So, Java pretty much assumes the machine has a file system. Typically a true assumption &#8211; except on IBM i. Thus, the IBM i Java machine simulates your average hierarchical file system on top of its real data storage mechanisms!</p>
<p>It is also interesting to note the choice in programming languages added to the platform. Java is a given, but IBM has made a big splash around PHP! Turns out that many business applications are migrating towards that kind of web-based platform. PHP replacing COBOL? Not sure that is an improvement&#8230;</p>
<h2>Internal Tuning</h2>
<p>One design goal of IBM i from the outset was to create a system that would be easy to use. In particular, the need for system administrators should be minimal. I don&#8217;t know how well this works when it comes to dealing with adding users and things like that, but I guess that if you use roles appropriately, it will be hard to mess that up.</p>
<p>What is more interesting and subtle is the extent to which IBM i tries to avoid needing system administrators around to tune the machine performance. Normally, if you have a large database, you will need to manually tune and tweak the system for maximum performance. In IBM i, the idea is that the system takes care of that for you. There must be a lot of interesting algorithms at work in the core of the system for this to work, but apparently it does work.</p>
<p>For example, the handling of the storage hierarchy is transparent to programs. A program allocates an object, but has no idea if it lives in RAM or on disk. The system moves things around as needed to reach the performance needed (you set goals for each subsystem). When solid-state drives were added to i a few years ago, that just introduced yet another level of the storage hierarchy, and the OS core took care of managing it. User programs did not need to change at all. That is pretty cool!</p>
<p>My gut feeling is that this is thanks to the higher-level APIs compared to many other systems, which give the system a clearer view of what a program is trying to achieve. Working on system-defined objects with known types sure beats trying to make sense of uninterpreted strings of bytes coming out of your typical file-system-oriented program.</p>
<p>Overall, IBM i impresses me by implementing a series of unique and innovative technologies that are largely different from the more well-known UNIX style of OS design that rose in parallel to the development of IBM i. It demonstrates that there are technical alternatives to the mainstream, and that doing things differently can indeed be a very good idea. Refreshing, in a world where too many things are me-too designs that just follow the majority herd of thinking.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1479/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SecurityNow on Randomness</title>
		<link>http://jakob.engbloms.se/archives/1424?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1424#comments</comments>
		<pubDate>Wed, 25 May 2011 20:20:23 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[random number generation]]></category>
		<category><![CDATA[SecurityNow]]></category>
		<category><![CDATA[Steve Gibson]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1424</guid>
		<description><![CDATA[Episodes 299 and 301 of the SecurityNow podcast deal with the problem of how to get randomness out of a computer. As usual, Steve Gibson does a good job of explaining things, but I felt that there was some more that needed to be said about computers and randomness, as well as the related ideas [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png"><img class="alignleft size-full wp-image-1371" title="dice" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png" alt="" width="86" height="88" /></a>Episodes <a href="http://twit.tv/sn299">299 </a>and <a href="http://twit.tv/sn301">301 </a>of the SecurityNow podcast deal with the problem of how to get randomness out of a computer. As usual, Steve Gibson does a good job of explaining things, but I felt that there was some more that needed to be said about computers and randomness, as well as the related ideas of predictability, observability, repeatability, and determinism. I have worked and wrangled with these concepts for almost 15 years now, from my research into timing prediction for embedded processors to my current work with the repeatable and reversible Simics simulator.</p>
<p><span id="more-1424"></span>Let&#8217;s start from the top.</p>
<p>When Steve said that computers are deterministic, I jumped. To me, a computer is anything but deterministic. The idea that rerunning a program does the same thing is an ideal state that you can rarely reach, and having an infrastructure like Simics that <a href="http://blogs.windriver.com/engblom/2010/09/deterministic-but-unpredictable.html">helps you achieve this </a>is a huge win for debugging.</p>
<p>Listening closely, what I think Steve <em>really </em>said is that an algorithm like a random number generator is deterministic. If you know its initial state, it will always compute the same result. That is indeed true for code that just converts an input into an output, and does no communication and is not dependent on time or timing. My experience in random and nondeterministic behavior comes from programs that feature multiple threads and often multiple processes, and plenty of asynchronous activity going on. So, same word, different contexts.</p>
<p>However, Steve also talks several times about computers as being deterministic, predictable machines. I think that characterizing today&#8217;s computers as deterministic is untrue. I would rather say that with multiple cores and multiple chips and timing variations all over the place, a computer has become fundamentally <em>nondeterministic </em>and non-repeatable, since there are so many little things going on where a nanosecond difference in time can cause behavior to diverge incredibly quickly. There is a nice paper from 2003 about the divergent behavior from minor differences, &#8220;<a href="http://portal.acm.org/citation.cfm?id=822813">Variability in Architectural Simulations of Multi-threaded Workloads</a>&#8221;, by Alaa R. Alameldeen and David A. Wood.</p>
<p>The <a href="http://jakob.engbloms.se/archives/1374">HAVEGE program I wrote about a while back </a>is essentially an attempt to harness the fundamental unpredictability of modern hardware timing. Nice idea, which at least in theory fulfills the more important property for security of being <em>unobservable</em>. Security doesn&#8217;t really need &#8220;real&#8221; randomness, all you need is something that an attacker cannot predict or observe. The classic <a href="http://www.cs.berkeley.edu/~daw/papers/ddj-netscape.html">Netscape SSL lack-of-randomness in the random seed</a> issue from 1996 is the best illustration of this. Certain things about a target can be inferred or observed, but the low-level hardware timing is not one of them, at least not for an x86 or high-end ARM class machine.</p>
<p>The solutions that Steve prefers are the Yarrow and <a href="http://en.wikipedia.org/wiki/Fortuna_%28PRNG%29">Fortuna </a>algorithms, which collect randomness from the environment of a computer and use that as a seed to a normal random number generator, creating lots of useful random data from a fairly small seed. This is the same idea as HAVEGE, but with a different entropy source. In both cases the basic idea seems sound and reasonable, but I kind of hoped that Steve would know of some way to evaluate the quality of the entropy pool generated from hardware events.</p>
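<p>The collect-then-expand structure can be sketched in a few lines of C. This is only a toy under stated assumptions: an FNV-1a hash stands in for the entropy pool and xorshift64 for the generator. Neither is what Yarrow, Fortuna, or HAVEGE actually use, and neither is cryptographically secure. The names <code>pool_absorb</code>, <code>prng_seed</code>, <code>prng_next</code>, and <code>first_output</code> are made up for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy entropy pool (assumption): a 64-bit FNV-1a hash over all samples. */
typedef struct { uint64_t h; } pool_t;

static void pool_init(pool_t *p) { p->h = 0xcbf29ce484222325ULL; }

/* Absorb one 64-bit event sample (for example a timestamp), byte by byte. */
static void pool_absorb(pool_t *p, uint64_t sample) {
    for (int i = 0; i < 8; i++) {
        p->h ^= (sample >> (8 * i)) & 0xff;
        p->h *= 0x100000001b3ULL;
    }
}

/* Toy generator (assumption): xorshift64, fast but not secure. */
typedef struct { uint64_t s; } prng_t;

static void prng_seed(prng_t *g, const pool_t *p) {
    g->s = p->h ? p->h : 1;   /* xorshift64 must not start from zero */
}

static uint64_t prng_next(prng_t *g) {
    g->s ^= g->s << 13;
    g->s ^= g->s >> 7;
    g->s ^= g->s << 17;
    return g->s;
}

/* Pool n samples, seed the generator, and return its first output. */
static uint64_t first_output(const uint64_t *samples, size_t n) {
    assert(samples != NULL || n == 0);
    pool_t p;
    prng_t g;
    pool_init(&p);
    for (size_t i = 0; i < n; i++)
        pool_absorb(&p, samples[i]);
    prng_seed(&g, &p);
    return prng_next(&g);
}
```

<p>Feeding in the same samples twice yields the same stream. The security therefore rests on an attacker not being able to observe or guess the samples, not on the statistics of the output.</p>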
<p><a href="http://www.grc.com/sn/sn-301.htm">Steve mentioned </a>the NIST randomness test that was used to test HAVEGE. It is certainly an aggressive test, but <a href="http://jakob.engbloms.se/archives/1374">as my testing showed</a>,  it only demonstrates that a random number generator is random in the data  produced. It does not show that it is unpredictable, and it does not measure the benefit gained from using  unobservable local events in hardware as the source of entropy. You need something  else, like comparing repeated collections of randomness over time from  the same system, to build confidence in unobservable and unpredictable  randomness.</p>
<p>With a computer, you do have such a thing as repeatable,  deterministic, and thus predictable randomness. In a modern desktop or server computer, you also have tons of totally unpredictable non-repeatable non-usefully-observable randomness in the low-level hardware timing and concurrent behavior of independent hardware units. Too bad it seems hard to prove this by measurement.</p>
<p>For yet more randomness discussion, especially randomness in embedded systems, I recommend the <a href="http://secworks.se/2011/03/om-slumptal-och-entropikallan-haveged/">blog post </a>by Joachim Strömbergsson (it is in Swedish).</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1424/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Evaluating HAVEGE Randomness</title>
		<link>http://jakob.engbloms.se/archives/1374?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1374#comments</comments>
		<pubDate>Thu, 17 Feb 2011 21:33:14 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[evaluation methodology]]></category>
		<category><![CDATA[HAVEGE]]></category>
		<category><![CDATA[random number generation]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1374</guid>
		<description><![CDATA[I previously blogged about the HAVEGE algorithm that is billed as extracting randomness from microarchitectural variations in modern processors. Since it was supposed to rely on hardware timing variations, I wondered what would happen if I ran it on Simics that does not model the processor pipeline, caches, and branch predictor. Wouldn&#8217;t that make the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png"><img class="alignleft size-full wp-image-1371" title="dice" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png" alt="" width="86" height="88" /></a>I previously blogged about the <a href="http://jakob.engbloms.se/archives/1370">HAVEGE algorithm </a>that is billed as extracting randomness from microarchitectural variations in modern processors. Since it was supposed to rely on hardware timing variations, I wondered what would happen if I ran it on Simics, which does not model the processor pipeline, caches, and branch predictor. Wouldn&#8217;t that make the randomness of HAVEGE go away?</p>
<p><span id="more-1374"></span>I got HAVEGE up on a Simics x86 target model with Linux pretty quickly, and ran the two provided tests: <em>ent</em>, which is a quick entropy test, and <em>nist</em>, which is supposedly much more thorough.</p>
<p>To my surprise, they both said the randomness we got was totally acceptable. This would seem to invalidate the fundamental assumption of HAVEGE &#8211; that it needs to collect randomness from hardware in order to produce good-quality randomness. To try to understand a bit more of what was going on, I took a look at the execution using <a href="http://blogs.windriver.com/engblom/2010/05/analyzed.html">Simics Analyzer</a> (the dredd.motherboard.processor lines are the processors, and the orange part is the HAVEGE program, yellow is the kernel):</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/OS-scheduler-messing-with-haveged.png"><img class="aligncenter size-medium wp-image-1377" title="OS scheduler messing with haveged" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/OS-scheduler-messing-with-haveged-300x128.png" alt="" width="300" height="128" /></a></p>
<p>Zooming in a bit:</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/OS-scheduler-messing-with-haveged-closer-look.png"><img class="aligncenter size-medium wp-image-1378" title="OS scheduler messing with haveged closer look" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/OS-scheduler-messing-with-haveged-closer-look-300x128.png" alt="" width="300" height="128" /></a>We can see that the program is regularly interrupted by the OS, which could be  a reason for random timing variations. The instructions run by the OS should vary in count, which would disturb the time stamp counter values read by the HAVEGE program. That could be sufficient to cause random variations, essentially showing that HAVEGE really works well just from OS noise &#8211; even in an otherwise idle machine.</p>
<p>However, at this point I started to have my doubts. Something did not feel right.</p>
<p>So I tried to remove all variations from the HAVEGE program. I replaced the &#8220;HARDTICKS&#8221; macro in HAVEGE with the constant 0 (zero) rather than reading the time stamp counter of the processor. This immediately failed the randomness test.</p>
<p>However, when I used the constant 1 (one) instead, the <em>ent </em>test passed. And even <em>nist </em>almost passed with only a single missed test out of the 426 tests executed.</p>
<p>Thus, the conclusion is that we do not know how well HAVEGE&#8217;s collection of hardware randomness works, since the evaluation software is too weak. In essence, we do not know if the collection of hardware randomness matters or not, as the proposed measurement hides the randomness behind a pretty good PRNG algorithm.</p>
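<p>The masking effect is easy to reproduce with a toy that has nothing to do with the actual HAVEGE code. In the sketch below, which is purely an assumption for illustration, a state advances by a <code>tick</code> value that plays the role of the timer reading, and each output is the state run through the bit mixer from the SplitMix64 generator. The function <code>ones_fraction</code> is a crude monobit check, far weaker than <em>ent</em> or <em>nist</em>.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Finalizer from the SplitMix64 generator: a strong 64-bit bit mixer. */
static uint64_t mix64(uint64_t z) {
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Count set bits portably. */
static int popcount64(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Toy generator (hypothetical, not HAVEGE): the state advances by "tick",
 * and each output is the mixed state. Returns the fraction of one-bits
 * seen in n outputs.
 */
static double ones_fraction(uint64_t tick, int n) {
    assert(n > 0);
    uint64_t state = 0;
    long ones = 0;
    for (int i = 0; i < n; i++) {
        state += tick;
        ones += popcount64(mix64(state));
    }
    return (double)ones / (64.0 * n);
}
```

<p>With a tick of 0 the state never moves and every output is zero, so the check fails outright. With a constant tick of 1 the mixer is fed a plain counter, and the fraction of one-bits should still come out close to one half, even though no hardware randomness is involved at all.</p>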
<p>Ideally, we would need a measurement that would evaluate the predictability of the randomness generated. Or at least one that can correctly estimate the impact of the variation of low-level hardware timing on the quality of the final random numbers. Unfortunately, that is not the case here, throwing the entire idea into doubt.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1374/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Execution Time is Random, How Useful</title>
		<link>http://jakob.engbloms.se/archives/1370?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1370#comments</comments>
		<pubDate>Sun, 13 Feb 2011 21:49:18 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[Andre Szenec]]></category>
		<category><![CDATA[HAVEG]]></category>
		<category><![CDATA[HAVEGE]]></category>
		<category><![CDATA[random number generation]]></category>
		<category><![CDATA[wcet]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1370</guid>
		<description><![CDATA[When I was working on my PhD in WCET &#8211; Worst-Case Execution Time analysis - our goal was to utterly precisely predict the precise number of cycles that a processor would take to execute a certain piece of code.  We and other groups designed analyses for caches, pipelines, even branch predictors, and ways to take [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png"><img class="alignleft size-full wp-image-1371" title="dice" src="http://jakob.engbloms.se/wp-content/uploads/2011/02/dice.png" alt="" width="86" height="88" /></a>When I was working on my <a href="http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-1832">PhD in WCET &#8211; Worst-Case Execution Time analysis </a>&#8211; our goal was to predict precisely the number of cycles that a processor would take to execute a certain piece of code.  We and other groups designed analyses for caches, pipelines, even branch predictors, and ways to take into account information about program flow and variable values.</p>
<p>The complexity of modern processors &#8211; even a decade ago &#8211; was such that predictability was very difficult to achieve in practice. We used to joke that a complex enough processor would be like a random number generator.</p>
<p>Funnily enough, it turns out that someone has been using processors just like that.  Guess that proves the point, in some way.</p>
<p><span id="more-1370"></span>I was recently introduced to the concept of the <a href="http://www.irisa.fr/caps/projects/hipsor/">HAVEGE project &#8211; HArdware Volatile Entropy Gathering and Expansion</a>, run at IRISA in Rennes in France from what seems to be around 2002 to 2006.  The main author, Andre Seznec, has also published in the WCET field. Today, the same idea is found nicely packaged in the HAVEGED code base for Linux, found at <a href="http://www.issihosts.com/haveged">http://www.issihosts.com/haveged</a>.</p>
<p>The idea behind HAVEGE is to run a piece of code that is designed to incur cache misses, confuse branch predictors, and generally strain the prediction mechanisms of a processor. In this way, the timing of the code will fluctuate even though it is basically straight-line code with no decision-making. These timing variations can be captured by reading a high-resolution timer such as the x86 processor&#8217;s <a href="http://en.wikipedia.org/wiki/Rdtsc">TSC (Time Stamp Counter), </a>or some other source that can report the execution time of a piece of code.</p>
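<p>A minimal sketch of that measurement loop, under several assumptions: it uses the C11 <code>timespec_get</code> wall clock as a stand-in for the TSC (so the resolution is coarser and depends on the platform), the workload is an arbitrary strided walk through a small buffer rather than the carefully designed HAVEGE code, and the names <code>now_ns</code> and <code>collect_timing_deltas</code> are made up for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Current time in nanoseconds, using C11 timespec_get. */
static uint64_t now_ns(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Time n runs of a fixed workload and store each duration in deltas[].
 * Returns how many consecutive runs took a different time than the
 * run before them.
 */
static size_t collect_timing_deltas(uint64_t *deltas, size_t n) {
    assert(deltas != NULL && n > 0);
    /* volatile keeps the compiler from optimizing the workload away */
    static volatile uint32_t buf[4096];
    size_t changes = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = now_ns();
        for (size_t j = 0; j < 4096; j++)
            buf[(j * 67) % 4096] += (uint32_t)j;   /* strided walk */
        deltas[i] = now_ns() - t0;
        if (i > 0 && deltas[i] != deltas[i - 1])
            changes++;
    }
    return changes;
}
```

<p>On a real machine the durations will typically differ from run to run even though the code path is identical each time. How much of that variation is genuinely unpredictable is a separate question that this sketch does not answer.</p>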
<p>The key advantage of such a source of randomness is that it is easy to quickly acquire lots of randomness (or <a href="http://en.wikipedia.org/wiki/Entropy_%28computing%29">entropy in crypto language</a>), and it is also impossible to predict the results. For cryptographic applications, this unpredictability from the perspective of an outside observer is very important, as it makes random numbers generated based on this much stronger in the face of an attack.</p>
<p>I think HAVEGE offers a good example of how to make lemonade from lemons.  If we conclude that processor timing cannot be predicted, consider that fact as a feature for cryptography rather than as a problem for WCET.</p>
<p>The first paper on HAVEGE is called &#8220;<a href="http://www.irisa.fr/caps/projects/hipsor/publications/havege-rr.pdf ">Hardware Volatile Entropy Gathering and Expansion: Generating unpredictable random numbers at user level</a>&#8221;, IRISA internal report 1492, October 2002. It presents the core idea a little differently from later papers.  In it, they measure the cache and TLB effects on randomness, assuming that the key to randomness is the effect of interrupts, where OS code affects the cache and TLB entries used by the program.  An underlying assumption is that if you just run a program in isolation, the caching and speculation mechanisms will converge to a good state for the program, with little or no timing variation as a result.</p>
<p>I wonder if that is still true on a modern machine. Their measurements were performed on a mid-1990s UltraSPARC II, which is in-order and much simpler than current Intel Core processors. Even an ARM Cortex-class processor is much more complex.  I would really like to see measurements about the inherent randomness in today&#8217;s processors, without any recourse to interrupts and hardware actions to disturb the picture.  I wonder if you would still see variations in the execution time of a body of code due to the different periods of various hardware mechanisms, or if it all converges to maximum throughput and minimal hardware latencies for all parts of the pipeline. For some reason, I have my doubts that the hardware would be that ideal in practice.</p>
<p>What makes the randomness of the actual hardware hard to evaluate is that the available codebase is the HAVEGE code, which is an &#8220;expansion&#8221; of the basic HAVEG idea. The expansion couples a PRNG to the collection of entropy from the hardware, in order to produce much more random noise (in terms of random bits per second) than the hardware alone would provide. While very practical, this also serves to obscure the fundamental randomness of the hardware from direct measurement.</p>
<p>Essentially, HAVEGE generates a ton of random data that appears to be of high quality in the tests provided.  But that data mixes three factors into a single measurement:</p>
<ul>
<li>Hardware low-level random fluctuations (cache, pipeline, branch predictor)</li>
<li>Hardware coarse-grained variation (interrupt timing, the time taken to perform OS actions in response to interrupts)</li>
<li>The effectiveness of the PRNG code</li>
</ul>
<p>Picking these three apart would be interesting, and it is a shame that there seems to be no recent evaluation of HAVEGE.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1370/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Modeling Endianness</title>
		<link>http://jakob.engbloms.se/archives/1336?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1336#comments</comments>
		<pubDate>Sun, 26 Dec 2010 15:58:19 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[big-endian]]></category>
		<category><![CDATA[endianess]]></category>
		<category><![CDATA[hardware modeling]]></category>
		<category><![CDATA[little-endian]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1336</guid>
		<description><![CDATA[Endianness is a topic in computer architecture that can give anyone a headache trying to understand exactly what is happening and why. In the field of computer simulation, it is a pervasive problem that takes some thinking to solve in an efficient, composable, and portable way. This blog post describes how I am used to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/12/egg.png"><img class="alignleft size-full wp-image-1337" style="margin: 5px 10px;" title="egg" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/egg.png" alt="" width="74" height="66" /></a><a href="http://en.wikipedia.org/wiki/Endianness">Endianness </a>is a topic in computer architecture that can give anyone a headache trying to understand exactly what is happening and why. In the field of computer simulation, it is a pervasive problem that takes some thinking to solve in an efficient, composable, and portable way.</p>
<p>This blog post describes how I am used to working with endianness in virtual platforms, and why this approach makes sense to me. There are other ways of dealing with endianness, with different trade-offs and overriding goals.</p>
<h2><span id="more-1336"></span>Fundamentals</h2>
<p>What is endianness? In my way of looking at it, it is the arbitrary solution to the problem you get when a large unit of information (say, a 32-bit word) needs to be stored as a set of smaller units (say, 8-bit bytes). When this happens, you need to split the large unit into smaller units, and decide on how to order the smaller units. There is no objectively better or worse way to do this &#8211; as long as the result is unambiguous and based on positional numerics (i.e., no roman numerals, please), it is hard to claim that one order is better than another.</p>
<p>We use &#8220;endianness&#8221; all the time without really thinking about it, when we write regular decimal numbers. In our <a href="http://en.wikipedia.org/wiki/Hindu_numerals">standard </a>base-10 decimal writing system, any value &gt;9 has to be written down using multiple digits. The order we use is a big endian representation: the most significant digits come first in our reading order (hundreds before tens before single digits, etc.).</p>
<p>In computer architecture, we have three main schools of endianness, plus a rare fourth:</p>
<ul>
<li>No endian, where we never break things down to bytes but always operate on equal-size words (not very common in practice, but certain machines like the Microchip PIC have instruction ROMs as wide as the instructions, and no way to address components of the instructions)</li>
<li>Big endian, BE, where the most significant bytes are put first in order of ascending addresses. I.e., the &#8220;big end&#8221; comes first.</li>
<li>Little endian, LE, where the least significant bytes are put first</li>
<li>&#8220;Middle endian&#8221;, where the ordering differs for different sizes of data (<a href="http://en.wikipedia.org/wiki/Endianness">Wikipedia </a>mentions this, but I have never seen an example). I have heard stories about chips that also used different endianness to store data by different instructions (by misdesign, I am not referring to the Power Architecture load/store byte-reversed instructions).</li>
</ul>
<p>BE is the traditional choice of IBM and the major early RISC chips, with Power Architecture, MIPS, SPARC, and the zSeries as the most important representatives. LE is the choice of x86, and more recently ARM. MIPS also seems to be gravitating towards LE, probably as a way to make x86 software slightly easier to port. Note that even though some processor cores are described as endianness-neutral, that really means that they can run as either LE or BE. In practice, particular chip designs incorporating such cores tend to lean heavily towards one endianness, since devices are designed for a particular endianness.</p>
<h2>The Software View</h2>
<p>For me, the most important view of endianness is how the software sees it. When a program is running on any current architecture, it logically sees memory as an array of bytes. Inside the memory chips, we have a very different physical layout, usually with words much wider than a byte, as well as an addressing scheme that is not one-dimensional. The interconnect (&#8220;bus&#8221;) moving data from a processor to memory and back is a complex system containing caches, buses of different widths (usually 64 bits or more), memory controllers, cache controllers, bus bridges, and other devices. All of this is usually completely invisible to software, as illustrated below:</p>
<p style="text-align: center;"><img class="aligncenter" style="margin-top: 5px; margin-bottom: 5px;" title="endianness 1" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-1.png" alt="" width="504" height="389" /></p>
<p style="text-align: left;">Basically, the bus system is invisible. The important endianness property as far as software is concerned is the order in which bytes are put into memory, and memory is considered as an array of bytes (since a byte is the smallest unit of addressing). If you look at the memory of a computer system using a debugger, this is the view you will get &#8211; both for on-target and off-target debuggers like ICE units and JTAG debuggers. Each memory access (store or load) will logically pass a small array of bytes into some position in the very large array that is memory.</p>
<h2 style="text-align: left;">The Modeling View</h2>
<p style="text-align: left;">Modeling endianness is not optional when building a virtual platform. The software will at some point assume a certain relationship between word layouts and byte addresses in memory, for example when overlaying a byte array on an integer in a C union, or when interpreting network packets (which are defined to use BE byte order, so network code has to convert values to native endian to process them).</p>
<p style="text-align: left;">If you start from the software view of endianness and memory, the obvious simulation model for memory operations is to maintain the array of bytes view of memory matching the physical target.</p>
<ul>
<li>Each memory access from a simulated processor gets turned into a transaction in the simulator.</li>
<li>The transaction has variable size, matching the size of the memory access operation issued by the processor.</li>
<li>The transaction contains a sequence of bytes, in the same order as they would end up in target memory on a physical machine. I.e., the order reflects the endianness of the processor.</li>
<li>The transaction has a starting address (byte-based) matching the memory access the processor issues.</li>
<li>The memory model in the simulation holds an array of bytes, and its content matches what you would find on the physical target &#8211; the logical software view of the target.</li>
<li>The bus system connecting the processor to the memory is basically considered as a black box that just moves the transaction to memory.</li>
</ul>
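<p>The points in the list above can be sketched as a pair of C data structures. This is a minimal illustration rather than how any particular simulator lays things out: the names <code>transaction_t</code>, <code>memory_t</code>, <code>mem_write</code>, and <code>mem_read</code>, as well as the 4 KiB memory size and the 8-byte maximum access size, are all assumptions.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE   4096u   /* size of the toy target memory (assumption) */
#define MAX_ACCESS 8u      /* largest single access in bytes (assumption) */

/* One memory access as it travels from processor model to memory model. */
typedef struct {
    uint64_t addr;               /* byte address of the first byte */
    uint8_t  size;               /* number of valid bytes in data[] */
    uint8_t  data[MAX_ACCESS];   /* bytes in target memory order */
} transaction_t;

/* Target memory: a plain array of bytes, as the software sees it. */
typedef struct {
    uint8_t bytes[MEM_SIZE];
} memory_t;

/* Store: copy the transaction bytes into memory. Returns 0 on success,
 * -1 if the access is too large or falls outside the memory. */
static int mem_write(memory_t *m, const transaction_t *t) {
    assert(m != NULL && t != NULL);
    if (t->size > MAX_ACCESS || t->addr > MEM_SIZE - t->size)
        return -1;
    memcpy(&m->bytes[t->addr], t->data, t->size);
    return 0;
}

/* Load: fill the transaction with bytes from memory. Same return codes. */
static int mem_read(const memory_t *m, transaction_t *t) {
    assert(m != NULL && t != NULL);
    if (t->size > MAX_ACCESS || t->addr > MEM_SIZE - t->size)
        return -1;
    memcpy(t->data, &m->bytes[t->addr], t->size);
    return 0;
}
```

<p>Note that nothing in the memory model knows about endianness. The byte order is decided entirely by whoever fills in <code>data[]</code>, and the bus in between is just a function call.</p>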
<p>The above is very easy to implement, and actually a very convenient implementation for someone used to the software view of hardware. The only thing that remains to be considered is how a processor simulator is implemented in practice.</p>
<p>In a typical processor simulator, you represent the target system registers using words of the same size as the target processor uses. I.e., for a 32-bit processor, you use 32-bit words on the host to represent the contents of a register. As the processor model is running, the contents of the register might have to be stored in data structures internal to the processor (such as an array of words representing the register file). Naturally, such data structures are kept in host endianness since they are just plain compiled C code. As the processor model runs, arithmetic is carried out using host endianness.</p>
<p>Actually, usually no endianness is involved as the values are considered as words. Remember that a word does not have endianness until it is broken down into bytes and someone actually looks at the bytes. In particular, an operation like</p>
<pre>uint8  a;
uint32 b;
a = (b &amp; 0xff);</pre>
<p>will pick up the 8 lowest bits of a word on any processor. The code is logically working inside of registers and is perfectly portable. However, the result of</p>
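<pre>uint32 mem;          /* a word of memory to store into */
uint32 *c = &amp;mem;
*c = b;              /* explicit store of a word */
a = *((uint8 *)c);   /* explicit load of the byte at the lowest address */</pre>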
<p>will pick up the first (at the lowest address) byte stored in memory when b was written &#8211; which is the same as the above on an LE processor, but different on a BE processor. The crucial observation here is that the latter variant contains an explicit store of a word, and an explicit load of a byte. Thus, endianness enters as we store the word (the byte load has no endianness, as it is loading the smallest unit of addressability).</p>
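<p>The difference between the two views can be checked at run time. The sketch below uses the standard <code>uint8_t</code> and <code>uint32_t</code> types and <code>memcpy</code> to look at the stored bytes. The function names are made up for illustration, and the self-check assumes the host is either plain little-endian or plain big-endian.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Register view: the 8 lowest bits of the word, identical on every host. */
static uint8_t low_byte(uint32_t b) {
    return (uint8_t)(b & 0xff);
}

/* Memory view: the byte at the lowest address once the word is stored. */
static uint8_t first_byte_in_memory(uint32_t b) {
    uint8_t bytes[sizeof b];
    memcpy(bytes, &b, sizeof b);   /* the "store" of the whole word */
    return bytes[0];               /* the "load" of a single byte */
}

/* A host is little-endian if the least significant byte is stored first. */
static int host_is_little_endian(void) {
    return first_byte_in_memory(1u) == 1;
}

/* Self-check tying the two views together for a sample value. */
static void check_views(void) {
    uint32_t b = 0x01020304u;
    assert(low_byte(b) == 0x04);
    assert(first_byte_in_memory(b) == (host_is_little_endian() ? 0x04 : 0x01));
}
```

<p>Only <code>first_byte_in_memory</code> gives a host-dependent answer, because it is the only function where a word is turned into bytes.</p>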
<p>What this means is that a processor simulator will have to do an explicit ordering of bytes as it is writing out values to memory. The simulator will need to take a word it has represented in &#8220;host order&#8221; (as it is within the simulator itself) and convert it to the byte order of the target processor. If the two match, such as simulating a little-endian ARM target on an (always little-endian) x86 host, nothing needs to be done. If they do not match, such as simulating a big-endian PPC target on an x86 host, the bytes have to be swapped before being sent to simulated memory.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-2.png"><img class="aligncenter size-full wp-image-1340" title="endianness 2" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-2.png" alt="" width="422" height="368" /></a>When the processor does a load, it similarly has to swap the bytes being read from memory (if using different target and host endianness).</p>
<p>As soon as we leave the processor simulator, the order of bytes in transactions and simulated memory has to be defined and managed in a host-independent way. This is crucial to enable snapshots of memory to be <a href="http://blogs.windriver.com/engblom/2010/08/transporting-bugs-with-checkpoints.html#more">shared across hosts, time, and space</a>, and simply to allow the simulation to work correctly. The semantics of the simulation must be defined by the simulator, not by the nature of the host.</p>
<p>Note that quite often we do not create an explicit transaction at all. Instead, as an optimization, the processor simulator writes directly to the representation of the target memory in the memory simulator. In this design, the target memory representation is just an array of bytes mirroring the contents that the processor would see on a physical target.</p>
<p>Let&#8217;s go through this with a simple example. We assume we are on an x86 host. Our processor simulator contains a 32-bit register with the value 0x01020304. This value is endianless until we have to send it to simulated memory; it is just a value of 32 bits. We write it to target memory at address 0x100.</p>
<p>On a simulated LE target, the memory write will result in a transaction containing the byte sequence (0x04, 0x03, 0x02, 0x01) &#8211; lowest byte comes first. The memory model will store this with 0x04 at address 0x100, 0x03 at 0x101, etc. The processor model can achieve this effect by simply doing a host-native word store to the memory array.</p>
<p>On a simulated BE target, the memory write will result in a transaction containing (0x01, 0x02, 0x03, 0x04). In memory, 0x01 will be stored at address 0x100, 0x02 at 0x101, etc. To store this word correctly, the processor model will have to do a byte swap operation on the word before writing it out to memory. Such a byte swap operation might seem expensive, but the evidence does not indicate that it matters. All the fastest instruction-set simulators use this method internally as far as I know (Wind River Simics, Imperas OVP, Qemu, IBM Mambo), which to me indicates that the design works well on a simulation system level.</p>
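<p>The example above can be sketched in a few lines of C. This is only an illustration under my own assumptions: the names <tt>mem_store32</tt>, <tt>mem_load32</tt> and <tt>target_endian_t</tt> are made up for this sketch and do not come from any particular simulator. It also uses shifts on the value instead of the host-native store plus byte swap described above; the bytes that end up in the simulated memory are the same, and the shift version does not depend on the host at all.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, chosen for this sketch only. */
typedef enum { TARGET_LE, TARGET_BE } target_endian_t;

/* Store a 32-bit register value into simulated memory (an array of bytes)
   in the byte order of the target. The shifts work on the value itself,
   not on its layout in host memory, so the result is host-independent. */
void mem_store32(uint8_t *mem, size_t addr, uint32_t value,
                 target_endian_t endian)
{
    for (int i = 0; i < 4; i++) {
        int shift = (endian == TARGET_LE) ? 8 * i : 8 * (3 - i);
        mem[addr + i] = (uint8_t)(value >> shift);
    }
}

/* Load a 32-bit value from simulated memory, assembling the bytes
   according to the byte order of the target. */
uint32_t mem_load32(const uint8_t *mem, size_t addr, target_endian_t endian)
{
    uint32_t value = 0;
    for (int i = 0; i < 4; i++) {
        int shift = (endian == TARGET_LE) ? 8 * i : 8 * (3 - i);
        value |= (uint32_t)mem[addr + i] << shift;
    }
    return value;
}
```

<p>Storing 0x01020304 at address 0x100 with <tt>TARGET_LE</tt> puts 0x04 at 0x100, while <tt>TARGET_BE</tt> puts 0x01 there. Loading with the same setting as the store always gives back the original word.</p>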
<h2>Device Models</h2>
<p>Device models are the main part of a functional simulator for a computer system. They also have endianness, as they expose memory-mapped interfaces to software. To deal with devices in a consistent manner, they will interpret inbound memory transactions using their local register endianness. This makes it simple and reliable to simulate systems where the processor and the devices have different endianness.</p>
<p>Systems with mixed device endianness are very common, mostly thanks to PCI. PCI is defined to use little-endian byte ordering in all memory accesses, as it originated in the x86 world. PCI is still being used in almost all computer systems, and thus LE PCI devices are being connected to BE processors.</p>
<p>Internally, a device model will also use words to represent data. When data is written to a device, it will interpret the bytes in the write transaction using its local order. When data is read from a device, it will fill in the data in the read transaction using its local order. This makes device drivers that byte-swap incoming data from an LE PCI device on a BE processor work just like they do on physical hardware.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-3.png"><img class="aligncenter size-full wp-image-1341" title="endianness 3" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-3.png" alt="" width="473" height="414" /></a>This makes endianness a local property of the device. The same device model can be used without change in both an LE and a BE target system. This mirrors reality: PCI devices are used in all kinds of systems, and the devices do not change, and neither do the models have to.</p>
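<p>As a rough sketch of what &#8220;local order&#8221; means in code, here is a made-up little-endian device with a single 32-bit register. The type <tt>le_device_t</tt> and the two functions are hypothetical names for this illustration, not an API from any simulator. The device never asks who sent the transaction; it only applies its own byte order to the bytes it receives or returns.</p>

```c
#include <stdint.h>

/* Hypothetical little-endian device with one 32-bit control register. */
typedef struct {
    uint32_t control;   /* held as a plain word inside the model */
} le_device_t;

/* Write access: interpret the transaction bytes in the device's own
   (little-endian) order, regardless of the sender's endianness. */
void le_device_write32(le_device_t *dev, const uint8_t bytes[4])
{
    dev->control = (uint32_t)bytes[0]
                 | ((uint32_t)bytes[1] << 8)
                 | ((uint32_t)bytes[2] << 16)
                 | ((uint32_t)bytes[3] << 24);
}

/* Read access: fill in the transaction bytes in the device's own
   (little-endian) order. */
void le_device_read32(const le_device_t *dev, uint8_t bytes[4])
{
    bytes[0] = (uint8_t)(dev->control);
    bytes[1] = (uint8_t)(dev->control >> 8);
    bytes[2] = (uint8_t)(dev->control >> 16);
    bytes[3] = (uint8_t)(dev->control >> 24);
}
```

<p>If a BE processor stores the word 0x01020304 without swapping, the transaction carries (0x01, 0x02, 0x03, 0x04) and this device sees 0x04030201. That is exactly why the driver on the BE processor has to byte-swap, on the simulator as on real hardware.</p>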
<p>In some systems, the designers try to hide the RISC-processor-to-PCI endianness mismatch by making the hardware swap bytes around as they move from the memory bus into the PCI subsystem. If this is the case in a target system, the simplest simulation method is to insert a byte-swapping intermediary on the path from the processor to the devices. This will do an extra byte swap on all transactions passing by, and things will work correctly (note that this byte swap has to be defined to work on a certain word length, and if transactions are bigger than this length, you will also have to order the words).</p>
<p>Note that as long as all units involved on the path from a device to a processor use the same word length, you can replace all the byte swapping operations with a simple flag. This flag will indicate if a transaction has been swapped or not. For example, take a BE processor talking to a BE device on an LE host. The BE processor will flag the transaction as &#8220;wrong-endian&#8221; as it sends it out, but actually store the bytes in LE order in the transaction. The BE device will check the flag and realize that it is wrong-endian too. And since two wrongs make a right, it does not have to swap the bytes either but can copy the transaction contents directly into its internal registers.</p>
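<p>A byte-swapping intermediary of this kind could look roughly like the sketch below. The function name <tt>bridge_swap_bytes</tt> is my own invention for the illustration. It reverses the bytes inside each word of a given length and leaves the order of the words unchanged; whether the words themselves also need reordering depends on the hardware being modeled, so that part is deliberately left out.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bridge model: reverse the bytes within each word of
   word_len bytes in a transaction of len bytes. The order of the words
   is left as-is. Returns 0 on success, or -1 if len is not a whole
   number of words. */
int bridge_swap_bytes(uint8_t *data, size_t len, size_t word_len)
{
    if (word_len == 0 || len % word_len != 0)
        return -1;

    for (size_t base = 0; base < len; base += word_len) {
        for (size_t i = 0; i < word_len / 2; i++) {
            uint8_t tmp = data[base + i];
            data[base + i] = data[base + word_len - 1 - i];
            data[base + word_len - 1 - i] = tmp;
        }
    }
    return 0;
}
```

<p>With a word length of 4, an 8-byte transaction (0x01 to 0x08) comes out as (0x04, 0x03, 0x02, 0x01, 0x08, 0x07, 0x06, 0x05).</p>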
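<p>The flag trick can be sketched as follows, assuming that every unit on the path uses 32-bit words. All names here (<tt>txn32_t</tt>, <tt>cpu_send</tt>, <tt>device_receive</tt>, <tt>swap32_bytes</tt>) are hypothetical. The payload always stays in host order, and the flag records whether the sender's byte order differs from that of the host.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transaction: the word is kept in host order, and the flag
   says whether the sender's byte order differs from the host's. */
typedef struct {
    uint32_t payload;
    bool     wrong_endian;
} txn32_t;

/* Reverse the four bytes of a 32-bit word. */
uint32_t swap32_bytes(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* The processor model sends its register value unswapped and only
   sets the flag. */
txn32_t cpu_send(uint32_t reg_value, bool cpu_differs_from_host)
{
    txn32_t t = { reg_value, cpu_differs_from_host };
    return t;
}

/* The device swaps only if exactly one of the two ends is "wrong-endian";
   two wrongs (or two rights) need no swap at all. */
uint32_t device_receive(txn32_t t, bool device_differs_from_host)
{
    return (t.wrong_endian == device_differs_from_host)
               ? t.payload
               : swap32_bytes(t.payload);
}
```

<p>In the BE-processor-to-BE-device case on an LE host, both sides pass <tt>true</tt> and the word is copied straight through. An LE device receiving the same transaction would see the byte-swapped word, just as it would on hardware.</p>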
<h2>Dealing with Data</h2>
<p>There are other things you want to do with a memory image in a virtual platform apart from reading and writing it from a processor. One particular task is to move data into and out of the memory model in order to load code and data, as well as to save the state of the system. The representation of a memory as an array of bytes works very well for this, since it corresponds naturally to how software files are created on the host. Since most software files are intended to be loaded by the target into target memory, they are prepared in target byte order. Another advantage of using a byte-based memory representation is that file formats like ELF can be loaded straight into virtual memory without having to convert addresses.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-5.png"><img class="aligncenter size-full wp-image-1344" title="endianness 5" src="http://jakob.engbloms.se/wp-content/uploads/2010/12/endianness-5.png" alt="" width="495" height="395" /></a>The representation is also host-independent, which facilitates moving memory images from one host to another, a key part of <a href="http://jakob.engbloms.se/archives/1235">using virtual platforms as a communications mechanism</a>. Another benefit of viewing memory as an array of bytes as accessed from a processor is that debuggers can look at memory in the same way as they would when running on the same host.</p>
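<p>To show how little work loading takes with a byte-array memory, here is a minimal sketch. The function <tt>load_image</tt> is a hypothetical helper, and the image is assumed to be a flat blob of bytes already in target byte order (a real ELF loader would do this once per loadable segment).</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a flat binary image, already in target byte
   order, into simulated memory at the given target address. No byte
   swapping is needed on any host. Returns 0 on success, or -1 if the
   image does not fit in the memory array. */
int load_image(uint8_t *mem, size_t mem_size, size_t addr,
               const uint8_t *image, size_t image_len)
{
    if (addr > mem_size || image_len > mem_size - addr)
        return -1;
    memcpy(mem + addr, image, image_len);
    return 0;
}
```

<p>Saving the state is the same copy in the other direction, and the saved bytes mean the same thing on any host.</p>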
<h2>Summary</h2>
<p>This long post (WordPress tells me it is more than 2500 words) really only starts to scratch the surface of this fascinating topic. It has described one approach to endianness modeling, and some of the subtleties involved. There are many more subtleties that we could go into.</p>
<h2>Footnote: SystemC TLM-2.0</h2>
<p>There are other ways to model endianness. In particular, the approach described here is not used in the SystemC TLM-2.0 standard. In TLM-2.0, all data is stored in a transaction in <em>host</em> order, not target order. To model the target endianness, you instead change a descriptor array that tells the simulator about how to interpret the bytes when viewed from the target.</p>
<p>As I see it, this means that TLM-2.0 is better suited for modeling the ins and outs of a bus system, including discovering how data ends up at a target from the actions of the various components of the bus system. It models byte lanes and the width of buses, and uses host byte order for all transfers of data. In contrast, the approach described in this blog post works by modeling the documented (or intended) effect of the hardware at the software level.</p>
<p>Overall, I would say that TLM-2.0 is slightly more geared towards the &#8220;<a href="http://jakob.engbloms.se/archives/1083">design&#8221; use of modeling, rather than &#8220;describe</a>&#8221;. By modeling bus widths, actual byte lanes, and other concepts, the simulator will discover the shape and endianness of data as it arrives at a target memory or device.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1336/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wind River Blog: Interview with a Virtualization Researcher</title>
		<link>http://jakob.engbloms.se/archives/1223?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1223#comments</comments>
		<pubDate>Sun, 29 Aug 2010 07:44:15 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[Wind River Blog]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1223</guid>
		<description><![CDATA[Last Friday, I posted a new blog post in my Wind River blog. It is an interview with the PhD student Girish Venkatasubramanian from the University of Florida. He is doing research on virtual machines/hypervisors and how they can be implemented more efficiently by making fairly small changes to the architecture of memory management units. The [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png"><img class="alignleft size-full wp-image-1122" style="margin: 5px 10px;" title="Wind River Logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/04/button-quicklink-blogs.png" alt="" width="46" height="46" /></a>Last Friday, I posted a new blog post in my Wind River blog. It is an <a href="http://blogs.windriver.com/engblom/2010/08/interview-with-girish-venkatasubramanian.html">interview with the PhD student Girish Venkatasubramanian </a>from the University of Florida. He is doing research on virtual machines/hypervisors and how they can be implemented more efficiently by making fairly small changes to the architecture of memory management units.</p>
<p><span id="more-1223"></span></p>
<p>The area of virtualization is one that I would definitely have looked at as an opportunity had I started out as a PhD student today. The work of his group is a good example of how Simics is being used for <a href="http://blogs.windriver.com/engblom/2010/07/academic-simics.html">research and teaching in universities</a> around the world.</p>
<p>Going one level up in abstraction, I note that this is probably the first time I have published an actual interview. I have been active in writing things since high school, but it has pretty much always been direct writing, not interviewing. However, I really hope that this is not the last. Having a series of user interviews on the Wind River blog could be really neat, as a way to dive deeply into some particular areas of technology. It will be interesting to see if any other university user is interested in being featured.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1223/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Microsoft + ARM = ARM64?</title>
		<link>http://jakob.engbloms.se/archives/1204?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1204#comments</comments>
		<pubDate>Tue, 27 Jul 2010 19:57:05 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[business issues]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[apple]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1204</guid>
		<description><![CDATA[The recent news that Microsoft has taken out an ARM architectural license has caused a lot of speculation about just what this might mean. There are several quite well reasoned ideas around the web, and I have one idea of my own: sixty-four bits. Here is a list of some of the ideas I have [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/07/windows-phone-logo.png"><img class="alignleft size-full wp-image-1205" title="windows phone logo" src="http://jakob.engbloms.se/wp-content/uploads/2010/07/windows-phone-logo.png" alt="" width="66" height="58" /></a>The recent news that Microsoft has taken out an ARM architectural license has caused a lot of speculation about just what this might mean. There are several quite well reasoned ideas around the web, and I have one idea of my own: sixty-four bits.</p>
<p><span id="more-1204"></span></p>
<p>Here is a list of some of the ideas I have seen floating around:</p>
<ul>
<li>They are going to put <a href="http://armdevices.net/2010/07/23/the-secrets-behind-microsofts-new-arm-license/">Windows 7 on ARM </a>to <a href="http://gigaom.com/2010/07/23/4-actions-microsoft-can-take-with-an-arm-license/">defend against Linux in the sub-netbook segment</a>. Certainly a useful move &#8211; but do you need an architecture license for that? It would not make much sense to put Windows on ARM in a way that requires you to have a Microsoft-designed processor core to run it.</li>
<li>It is a case of Apple &#8220;A4&#8221; envy &#8211; if Apple can build their own custom ARM chip for consumer devices in the tablet space, <a href="http://gigaom.com/2010/07/23/4-actions-microsoft-can-take-with-an-arm-license/">so can Microsoft</a>. I think this makes a lot of sense. But once again, what use is a license to design your own ARM core and ARM instructions? <a href="http://unplugged.rcrwireless.com/index.php/20100723/news/2226/why-microsoft-acquired-an-arm-license/">Apple does not seem to have created a custom processor core</a>, instead using a standard ARM Cortex-A8 core in the A4 chip as far as I understand.</li>
<li><a href="http://arstechnica.com/microsoft/news/2010/07/microsoft-should-cut-out-the-middle-men-and-build-its-own-phones.ars/2">Handheld gaming, building their own phone</a>. It is essentially the A4 envy argument, and I cannot see how an architecture license will help.</li>
<li><a href="http://www.computerweekly.com/Articles/2010/07/23/242079/Microsoft-licenses-ARM-tech-in-bid-to-own-39internet-of.htm">Deeply embedded devices</a>. Microsoft wants to ride the ARM wave into things like power meters and other truly embedded systems. Once again, that might certainly be a good idea &#8211; but why an architecture license?</li>
<li>As a defensive move in the Microsoft-Intel relationship. If Intel is toying with Linux, Microsoft can be toying with ARM. In that context, a few million dollars to acquire a license to do nothing much with it might make sense as negotiation leverage.</li>
</ul>
<p>The best idea I think is the <a href="http://gigaom.com/2010/07/23/4-actions-microsoft-can-take-with-an-arm-license/">server angle</a>. ARM is making inroads into servers where single-thread processor speed is not that important. The power consumption advantage over x86 is significant, as long as you do not need utter speed or very many cores sharing memory.</p>
<p>Power-efficient servers are an area where Microsoft can certainly see the potential for truly revolutionary changes in the IT field. There is no reason why x86 would be the best architecture there, and the fact that x86 is controlled by Intel and AMD makes it very hard for Microsoft to really take part in hardware-software innovation. Essentially, Intel and AMD design the processors, and the software has to adapt. This model is not optimal if you want to really see what you can do with the hardware-software boundary.</p>
<p>Thus, what I think Microsoft might do with the ARM license is to start tinkering with the OS interface of the hardware. The beneficiaries would be both Microsoft and ARM, since Microsoft innovations for ARM might well become standard in ARM land. For example, an OS might benefit tremendously from small changes in the processor in areas like:</p>
<ul>
<li>Memory management &#8211; a hardware scheme tailored to what Microsoft sees being done in their operating systems and their third-party server software could certainly be very beneficial to efficiency.</li>
<li>Multicore and multithreading &#8211; ARM has some special support in their MPCore designs for communicating between cores in a multicore system. Such support could be extended to help the OS manage both threads and processors across multicore designs. Also, ARM might want some help from Microsoft to design a scalable OS interface that works for tens or hundreds of tightly-coupled cores (rather than the current limit of 4 in ARM Cortex-A9 MP).</li>
<li>More bits &#8211; the biggest architectural problem I see with ARM in servers is that ARM is currently a 32-bit architecture. 32 bits are not enough for even medium servers today. Having Microsoft help ARM design a 64-bit version of the ARM architecture sounds far-fetched, but it is certainly something where a license to change the ARM instruction set would help.</li>
<li>ARM instruction set additions to help build an emulator for x86 instructions on ARM, to allow current x86-Windows applications to run on an ARM-based device. This is an idea which has been tried in the past and which has never been very successful in the market. ARM currently has a set of instructions which help accelerate virtual machines such as .NET and JVM, and I think that is as far as this idea has been proven useful.</li>
</ul>
<p>In summary, the only reasonable use I can see that Microsoft would make of a license to build custom ARM processor cores and tweak instruction sets would be to build better servers, by improving multicore communications mechanisms, upping ARM to 64 bits, tweaking the MMU design, and possibly creating some kind of x86 emulation support.</p>
<p>That said, I think the most likely result is a custom ARM-based chip to power a phone or tablet or other consumer electronics device, running a Microsoft software stack (derived from Win CE/Windows Phone) on Microsoft hardware, all sold as a Microsoft product. Essentially, doing the <a href="http://arstechnica.com/microsoft/news/2010/07/microsoft-should-cut-out-the-middle-men-and-build-its-own-phones.ars/2">Apple all-integrated solution.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1204/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Pipeline Performance Simulator Anno 1960</title>
		<link>http://jakob.engbloms.se/archives/1126?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1126#comments</comments>
		<pubDate>Mon, 03 May 2010 19:56:50 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[clock-cycle models]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[Frederick Brooks]]></category>
		<category><![CDATA[Harwood Kolsky]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[IBM 7030]]></category>
		<category><![CDATA[ISCA]]></category>
		<category><![CDATA[pipeline]]></category>
		<category><![CDATA[Tensilica]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1126</guid>
		<description><![CDATA[I have just found what almost has to be the first cycle-accurate computer simulator in history. According to the article &#8220;Stretch-ing is Great Exercise &#8212; It Gets You in Shape to Win&#8221; by Frederick Brooks (the man behind the Mythical Man-Month) in the January-March 2010 issue of IEEE Annals of the History of Computing, IBM [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/05/4506VV3073.jpg"><img class="alignleft size-full wp-image-1128" style="margin: 5px 10px;" title="IBM Stretch panel" src="http://jakob.engbloms.se/wp-content/uploads/2010/05/4506VV3073.jpg" alt="" width="83" height="79" /></a>I have just found what almost has to be the first cycle-accurate computer simulator in history. According to the article &#8220;<a href="http://dx.doi.org/10.1109/MAHC.2010.26">Stretch-ing is Great Exercise &#8212; It Gets You in Shape to Win</a>&#8221; by Frederick Brooks (the man behind <a href="http://en.wikipedia.org/wiki/Mythical_man_month">the Mythical Man-Month</a>) in the January-March 2010 issue of IEEE Annals of the History of Computing, IBM created a simulator of the pipeline for the <a href="http://en.wikipedia.org/wiki/IBM_Stretch">IBM 7030 &#8220;Stretch&#8221; computer </a>developed from 1956 to 1961 (<a href="http://www-03.ibm.com/ibm/history/exhibits/vintage/vintage_4506VV3073.html">photo from IBM.com</a>).</p>
<p><span id="more-1126"></span></p>
<p>For those unfamiliar with the Stretch machine, it was a supercomputer developed by IBM which introduced many of the performance techniques and basic computer technologies that we all use today (most of them handed down to us via the IBM System/360). For example, it was the first to use 8-bit bytes and 64-bit floating point. It also introduced memory protection, memory interleaving, and instruction prefetching.</p>
<p>More relevant for my blog is the fact that the Stretch used the world&#8217;s first pipelined main processor, complete with interlocks to maintain program-order semantics. When developing this pipeline, Frederick Brooks claims that IBM developed a program to simulate the pipeline. This simulator was used to test the performance of the pipeline design on various test programs (this was before they were called benchmarks), and tune the design accordingly. The simulator was created by <a href="http://archive.computerhistory.org/resources/text/FindingAids/102658131.Kolsky.pdf">Harwood Kolsky</a>. There is no firm date for the pipeline simulator, but based on the development time of the Stretch, it can be dated somewhere around 1960.</p>
<p>Thus, the simulation-driven approach to computer architecture is about 50 years old by now. Should have gone to ISCA and used this as an excuse for a party I guess&#8230;</p>
<p>It is also interesting to note that the Stretch computer acquired a co-processor in 1962, to do cryptology work. This machine was the one-off <a href="http://en.wikipedia.org/wiki/IBM_7950">IBM 7950 &#8220;Harvest&#8221; </a>and was tailored for the needs of the NSA in the US. It was a seriously special-purpose hardware unit adding a few instructions to the Stretch machine, and beating any other machine at the time by a factor of about 50 to 200 on the particular NSA workloads. Sounds like the kind of performance claims made for Tensilica and other application-customized processors. 50 years ago.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1126/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Power Architecture Rip Van Winkle</title>
		<link>http://jakob.engbloms.se/archives/1026?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1026#comments</comments>
		<pubDate>Sun, 06 Dec 2009 20:07:23 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[power architecture]]></category>
		<category><![CDATA[Rip van Winkle]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1026</guid>
		<description><![CDATA[For some reason (I guess it is the job&#8230;) I was browsing through the Power ISA version 2.06 specification last week and hit the following gem of an instruction: &#8220;rvwinkle&#8221;. It is named after a short story I had never heard about, but which apparently is sufficiently well-known in the US literary canon to warrant [...]]]></description>
			<content:encoded><![CDATA[<p>For some reason (I guess it is the job&#8230;) I was browsing through the <a href="http://www.power.org/resources/downloads/PowerISA_V2.06_PUBLIC.pdf">Power ISA version 2.06 specification </a>last week and hit the following gem of an instruction: &#8220;<tt>rvwinkle</tt>&#8221;. It is named after a <a href="http://en.wikipedia.org/wiki/Rip_Van_Winkle">short story I had never heard about</a>, but which apparently is sufficiently well-known in the US literary canon to warrant a sleep mode being named after it.<br />
<span id="more-1026"></span></p>
<p>Anyway, here is a screenshot of the manual:</p>
<p><img class="aligncenter size-full wp-image-1027" title="ripvanwinkle ppc mode" src="http://jakob.engbloms.se/wp-content/uploads/2009/12/ripvanwinkle-ppc-mode.png" alt="ripvanwinkle ppc mode" width="568" height="848" /></p>
<p>It is one of four thread-sleep-state control instructions in the 64-bit server variant of the Power ISA. Essentially, it is an IBM extension for their POWER series machines, as well as the Cell and Xenon CPUs I guess. See the <a href="http://www.power.org/devcon/07/Session_Downloads/PADC07_Frey_PowerISA.pdf">Power ISA tutorial from the Power Architecture Developer&#8217;s Conference 2007</a> for some more on this.</p>
<p>I like this kind of whimsicalness in technical systems. It makes them human and more approachable. Sometimes, big companies (and small companies) once they are mature end up trying a bit too hard to sound business-wise and &#8220;professional&#8221;&#8230; ending up being plain boring and stone-faced and cold. There is no contradiction between a chuckle and a professional system for most people.</p>
<p>Some people would put the Power Architecture &#8220;eieio&#8221; instruction in the same category of slightly funny. However, the limit for all assembly languages I have ever encountered seems to be the natural name for an instruction to Sign EXtend something. It is never called what it &#8220;should&#8221; be.</p>
<p>Note that this instruction is not new; it has been around since 2005 at least, probably longer. There are no history notes in the manual, and I have no intention of reading through lots of old manuals to find the first one in which this instruction did <em><span style="text-decoration: underline;">not </span></em>appear.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1026/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MCC 2009 Presentations Online</title>
		<link>http://jakob.engbloms.se/archives/1023?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1023#comments</comments>
		<pubDate>Thu, 03 Dec 2009 08:29:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[Andras Vajda]]></category>
		<category><![CDATA[Domain-specific languages]]></category>
		<category><![CDATA[Ericsson]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[keynote]]></category>
		<category><![CDATA[LTE]]></category>
		<category><![CDATA[MCC]]></category>
		<category><![CDATA[UpMarc]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1023</guid>
		<description><![CDATA[The presentations from the 2009 Swedish Workshop on Multicore Computing (MCC 2009) are now online at the program page for the workshop. Let me add some comments on the workshop per se. This was the first multicore event that I have been to where we did not have a keynote speaker or technical paper from [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1016" style="margin-top: 5px; margin-bottom: 5px;" title="UPMARC_700x150" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/UPMARC_700x150.gif" alt="UPMARC_700x150" width="122" height="45" />The presentations from the 2009 Swedish Workshop on Multicore Computing (MCC 2009) are now online at the <a href="http://www.it.uu.se/research/upmarc/MCC09/prog">program page for the workshop</a>. Let me add some comments on the workshop per se.</p>
<p><span id="more-1023"></span>This was the first multicore event that I have been to where we did not have a keynote speaker or technical paper from a hardware company. So there was really nothing here directly about how to build multicore chips. Rather, the workshop tended to be about how to program, use, measure performance on, verify software for, and generally work with multicore chips. From the perspective of software people, rather than hardware designers.</p>
<p>Obviously, hardware aspects enter into such talks, but it is the perspective of a user, not a designer. For example, a hardware designer could explain how an atomic compare-and-swap is optimized in a multicore device. But here, we saw measurements on the actual operation latencies observed on real machines using such operations. Quite refreshing, and closer to my personal interests.</p>
<p>The keynote by <a href="http://a-vajda.eu/blog/">Andras Vajda</a> of Ericsson was quite interesting. The slides are not online, but these are the main points that I picked up and that I might not have considered before:</p>
<ul>
<li>Software development costs can mean that the cheapest, fastest, most efficient hardware is not necessarily the most economical. If it is too hard to code for, the software development time and effort remove the advantage. Obvious, but worth reiterating. Software is king.</li>
<li>The workload on a cellular basestation can sometimes be highly linear and single-threaded, for example when serving a single terminal with a very high bandwidth LTE connection. It can then suddenly shift to a massively parallel workload as a crowd of a thousand all suddenly appear and start doing data downloads, and then go back to serial again. This means that the age-old argument that signal processing is naturally &#8220;<a href="http://www.edn.com/blog/980000298/post/50023005.html">conveniently concurrent</a>&#8221; (<a href="http://www.scdsource.com/article.php?id=87">and here</a>) is not always true. Nice point!</li>
<li>Thus, we need adaptable architectures that can trade serial and parallel performance over time, and rebalance quite quickly. In the same chip.</li>
<li>He is a firm believer that homogeneous systems will win out in the end, while I still hold on to a belief in accelerators, offload engines, and DSPs. His view is partially due to an admitted focus on servers and services processors, and not on the baseband and signalling side. Makes sense.</li>
<li>Domain-specific languages (DSL) are the future of efficient programming. Agree.</li>
</ul>
<p>On the topic of DSLs, there was a question about the cost to support them. To me, that is a non-issue. In the organizations where I have worked, it seems that maintaining a useful DSL requires at most one engineer. Developing one takes a few good computer scientists for a fairly limited time. In any case, they tend to appear organically when good programmers <a href="http://jakob.engbloms.se/archives/747">generalize repeated tasks</a>.</p>
<p>I gave a keynote about how multicore has impacted virtual platforms (in particular, <a href="http://www.virtutech.com/products/simics">Virtutech Simics</a>) with the following main points:</p>
<ul>
<li>Multicore targets increase the performance pressure on a virtual platform, as more processors will have to be simulated.</li>
<li>Multicore hosts mean that the sequential performance of the host is going down relative to the aggregate parallel performance demanded by the targets.</li>
<li>To handle large target systems, the virtual platform itself has to run multithreaded on a multicore host. Getting this in place is a major, interesting, and sometimes painful process.</li>
<li>Once you have a parallel virtual platform, multicore hosts provide a very nice boost in scalability and in the system sizes that are manageable. A single multithreaded virtual platform process is also a bit easier to manage from a user perspective.</li>
<li>All features in the virtual platform have to be multicore and multimachine-aware&#8230; meaning that they often get a bit harder to use initially, as there is no &#8220;default processor&#8221; you can fall back to for debugging setups etc. Everything has to be explicitly targeted.</li>
<li>Multicore targets have proven to be a great sales driver for virtual platforms, as debugging software on a physical multicore, multichip, multiboard system is just too painful.</li>
</ul>
<p>Overall, this was a fun event, and I am looking forward to next year at Chalmers!</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1023/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Freescale Online Fault-Tolerance &#8220;Demo&#8221;</title>
		<link>http://jakob.engbloms.se/archives/1011?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1011#comments</comments>
		<pubDate>Wed, 18 Nov 2009 12:21:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[fault injection]]></category>
		<category><![CDATA[fault tolerance]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[MPC564XL]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1011</guid>
		<description><![CDATA[I just spotted a fun little application on Freescale&#8217;s homepage: an interactive demo of the fault tolerance functions of the MPC564XL dual-core microcontroller. The demo is really just an interactive presentation, it runs no code or anything, but it is still a very nice way to educate about how the chip deals with errors.]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/08/freescale-logo-icon.png"><img class="alignleft size-full wp-image-878" style="margin: 10px 5px;" title="freescale-logo-icon" src="http://jakob.engbloms.se/wp-content/uploads/2009/08/freescale-logo-icon.png" alt="freescale-logo-icon" width="80" height="80" /></a>I just spotted a fun little application on Freescale&#8217;s homepage: an <a href="http://www.freescale.com/files/graphic/flash/training/32bit/MPC564XL_Safety_Demo.html">interactive demo of the fault tolerance functions of the MPC564XL dual-core </a>microcontroller.</p>
<p><span id="more-1011"></span>The demo is really just an interactive presentation; it runs no code or anything, but it is still a very nice way to teach how the chip deals with errors.</p>
<p><a href="http://jakob.engbloms.se/wp-content/uploads/2009/11/FSL-fault-injection-demo-app.png"><img class="aligncenter size-full wp-image-1012" title="FSL fault injection demo app" src="http://jakob.engbloms.se/wp-content/uploads/2009/11/FSL-fault-injection-demo-app.png" alt="FSL fault injection demo app" width="482" height="339" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1011/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GPGPU &#8211; a new type of DSP?</title>
		<link>http://jakob.engbloms.se/archives/930?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/930#comments</comments>
		<pubDate>Fri, 11 Sep 2009 14:35:18 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[DSP]]></category>
		<category><![CDATA[GPGPU]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=930</guid>
		<description><![CDATA[My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, and I do not see such processors appear in any customer applications. Still, [...]]]></description>
			<content:encoded><![CDATA[<p>My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, and I do not see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.</p>
<p><span id="more-930"></span>The initial key idea behind GPGPU was that a GPU offers very high performance, and does so in a part that &#8220;everyone has anyway&#8221; &#8212; i.e., something that is found on any PC. Outside of PCs, such powerful GPUs are pretty non-existent. Then, the GPU companies picked up on this idea and are making their GPUs more applicable to general purpose tasks.</p>
<p>But where does all this performance come from? To me, it all looks like the rebirth of the vector processor. If we compare a GPU and an Intel or AMD x86 main processor, it is clear that the GPU gets more FLOPs per chip. Mostly, this seems to be because the GPU has many times the number of processing units. Something like 1000s of them, rather than maybe 10 in a general purpose unit.</p>
<p>How can all of these be fit on a die that is similar in size to the general processor? As always when you see a disparity like this, it stems from optimization for different target uses leading to different architectures.</p>
<p>The reasons for the GPU raw performance seem to be the following:</p>
<ul>
<li>Each processor is much simpler, with a simple instruction set and no out-of-order, speculation, or other complex logic. This makes it possible to fit more cores into the same area. The price is that programming is more complicated, as programs are run on groups of processors and with lots of little constraints.</li>
<li>There is far less cache on the die, which forces programs to rely on bandwidth and managing to stream data through the processor.</li>
<li>Processors are built to be good at repetitive math, and be very bad at anything else. This also makes it possible to optimize data flows and control handling to a far greater extent than on general-purpose processors.</li>
<li>And I guess you can add a fourth parameter: power consumption and heat are not really a big problem. Watercooling, huge fans, and 300W power draws are OK&#8230;</li>
</ul>
<p>What this all boils down to is that the GPGPU requires predictable algorithms that can effectively and efficiently prefetch data and stream it through the cores at a predictable rate. Data also needs to be wide to engage groups of cores at once (i.e., vector processing). Integer decision-making code is out (gcc, Simics, control-plane code, most database front ends), and data-intense is in (images, audio, video, graphics). SIMD is part of it, but not the most interesting part. The point is that you apply SIMD across large vectors of independent elements in parallel. And you are looking to solve one large problem at a time.</p>
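<p>To make the point about decision-making code a bit more concrete, here is a minimal Python sketch of a group of lanes that run in lockstep. It is a toy model under my own assumptions (one instruction stream per group, both sides of a branch costing the same, and made-up names like <tt>lockstep_passes</tt>), not a description of any particular GPU.</p>

```python
# Toy model of lockstep (SIMD-style) execution. Hypothetical and simplified:
# one instruction stream per group, equal cost for both sides of a branch.

def lockstep_passes(take_branch):
    """Count the passes a lockstep group needs for an if/else.

    take_branch holds one boolean per lane. If all lanes agree, one pass
    is enough. If they disagree, the group runs the 'then' side with some
    lanes masked off, and then the 'else' side with the others masked off.
    """
    sides_needed = set(take_branch)   # {True}, {False}, or {True, False}
    return len(sides_needed)


def lane_utilization(take_branch):
    """Fraction of lane slots that do useful work across all passes."""
    passes = lockstep_passes(take_branch)
    useful = len(take_branch)         # each lane is active in exactly one pass
    return useful / (passes * len(take_branch))


# Data-intense code: every element takes the same path.
uniform = [True] * 32
# Decision-making code: neighbouring elements want different paths.
divergent = [i % 2 == 0 for i in range(32)]

print(lane_utilization(uniform))    # 1.0
print(lane_utilization(divergent))  # 0.5
```

<p>In this model, a single if/else that splits the lanes halves the useful work per pass, and nested decisions would compound that. Streaming the same operation over a wide vector of independent elements keeps every lane busy.</p>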
<p>If you compare this to the classic single-core DSP, you see a very different design. A DSP has specialized instructions in the instruction set, support for loops in very efficient ways, and is often SIMD. But DSPs very rarely operate like vector processors. They are also general enough to be able to run a rudimentary OS and operate semi-independently from the main processor. Also, DSPs tend to be used in large multicore clusters, but there each DSP operates on a different problem at a time. So rather than one vector of 1000 elements in a video compression, you might have 1000 independent video streams being processed, out of sync with each other. DSPs also tend to have much simpler programming models compared to GPGPUs, even if they can be painful compared to general-purpose processors.</p>
<p>So GPGPUs are quite different in practice from DSPs, built to solve different types of problems in different ways. In the end, it is not clear to me that a GPGPU is a winner in terms of performance per watt or performance per area. They are certainly hot in the desktop and server field, but I cannot see them replacing general DSPs any time soon.</p>
<p>Note that something like the Tilera chip is another intermediate point between multicore DSP and a GPU. There seems to be a long continuum of core counts from around 4 to 8 for DSP to around 100 for Tilera to 1000 for GPUs&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/930/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Cavium Octeon II: Short Notes</title>
		<link>http://jakob.engbloms.se/archives/811?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/811#comments</comments>
		<pubDate>Sat, 13 Jun 2009 19:40:41 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[Cavium]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[Octeon]]></category>
		<category><![CDATA[Octeon II]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=811</guid>
		<description><![CDATA[About two months ago, Cavium Networks launched their second generation of Octeon chips, the Octeon II. The most obvious difference to the previous generation (Octeon, Octeon Plus) is a new MIPS64 core with much better support for hypervisors and virtualization. There are some other interesting aspects to this chip, though. First, they launch with 2 [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-812" title="octeon-ii" src="http://jakob.engbloms.se/wp-content/uploads/2009/06/octeon-ii.jpg" alt="octeon-ii" width="76" height="78" />About two months ago, <a href="http://www.caviumnetworks.com">Cavium Networks</a> launched their second generation of Octeon chips, the <a href="http://www.caviumnetworks.com/OCTEON_II_MIPS64.html">Octeon II</a>. The most obvious difference from the previous generation (Octeon, Octeon Plus) is a new MIPS64 core with much better support for hypervisors and virtualization. There are some other interesting aspects to this chip, though.</p>
<p><span id="more-811"></span>First, they launch with 2 to 6 cores in typical chips, far short of the 32-core maximum. That probably indicates that system builders currently have a hard time adopting and getting good use from manycore architectures.</p>
<p>It is also a system that is full of accelerator units! In a 6-core chip, you find some 75 accelerator units according to Cavium. That is more than ten times as many accelerators as main cores, indicating where a large part of the work is actually being performed. To me, this validates that heterogeneous architectures and accelerators are still useful and valuable for networking applications, and that the idea of a homogeneous sea of identical processor cores with no specialization and no fixed-function hardware accelerators is still distant (I think it will never happen, but you never know).</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/811/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Hardware-Software Interface is where the Action Is</title>
		<link>http://jakob.engbloms.se/archives/799?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/799#comments</comments>
		<pubDate>Sun, 07 Jun 2009 19:52:47 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[Brian Cantrill]]></category>
		<category><![CDATA[Gary Stringham]]></category>
		<category><![CDATA[hardware design]]></category>
		<category><![CDATA[hardware-software interface]]></category>
		<category><![CDATA[Keith Adams]]></category>
		<category><![CDATA[Steve Gibson]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=799</guid>
		<description><![CDATA[When I started out doing computer science &#8220;for real&#8221; way back, the emphasis and a lot of the fun was in the basics of algorithms, optimizing code, getting complex trees and sorts and hashes right and efficient. It was very much about computing defined as processor and memory (with maybe a bit of disk or [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-800" title="pn4_quad-gigaswift-utp-adapter" src="http://jakob.engbloms.se/wp-content/uploads/2009/06/pn4_quad-gigaswift-utp-adapter.gif" alt="pn4_quad-gigaswift-utp-adapter" width="100" height="73" />When I started out doing computer science &#8220;for real&#8221; way back, the emphasis and a lot of the fun was in the basics of algorithms, optimizing code, getting complex trees and sorts and hashes right and efficient. It was very much about computing defined as processor and memory (with maybe a bit of disk or printing or user interface accessed at a very high level, and providing the data for the interesting stuff). However, as time has gone on, I have come to feel that this is almost too clean, too easy to abstract&#8230; and I have gone back to where I started on my first home computer, programming close to the metal.</p>
<p><span id="more-799"></span>As I dig deeper into operating systems and the hardware-software interface layer (mostly with the help of virtual platforms), I have come to appreciate just how hard and interesting that part of the computing stack is. I guess it is partially because that is the level where most of the nice thick layers of middleware and API software we use these days (and which to be frank I find fairly boring) just break down and have to start dealing with the real world. For some reason, web servers and their programming feel barren and boring compared to dealing with interrupts, memory maps, and bit twiddling.</p>
<p>Several things I have read and heard about recently touch on this subject in various ways. All of them point to the fact that hardware-software interface design is important, and that there are many right and wrong ways of doing it&#8230; which are rarely taught in universities and rarely approached in computing literature.</p>
<p>First, <a href="http://blogs.sun.com/bmc/entry/concurrency_s_shysters">Brian Cantrill of Sun wrote a blog post blasting transactional memory</a> in November of 2008, which I recently reread, getting a bit of an epiphany from this paragraph:</p>
<blockquote><p>&#8230; Even if one assumes that writing a transaction is conceptually easier than acquiring a lock, and even if one further assumes that transaction-based pathologies like livelock are easier on the brain than lock-based pathologies like deadlock, there remains a fatal flaw with transactional memory: much system software can never be in a transaction <strong>because it does not merely operate on memory</strong>. That is, system software frequently takes action outside of its own memory, requesting services from software or hardware operating on a disjoint memory (the operating system kernel, an I/O device, a hypervisor, firmware, another process &#8212; or any of these on a remote machine). In much system software, the in-memory state that corresponds to these services is protected by a lock &#8212; and the manipulation of such state will never be representable in a transaction. So for me at least, transactional memory is an unacceptable solution to a non-problem.</p></blockquote>
<p>In the same style, <a href="http://x86vmm.blogspot.com/2008/11/cantrill-and-bonwick-get-all-concurrent.html">Keith Adams at VMware</a> picked up on the above and applied it to the microkernel idea:</p>
<blockquote><p>It&#8217;s interesting to me that, as with microkernels, one of the principle reasons TM will fail is the messy, messy reality of peripheral devices. One of the claims made by microkernel proponents is that, since microkernel drivers are &#8220;just user-level processes&#8221;, they&#8217;ll survive driver failures. And this is almost true, for some definition of &#8220;survive.&#8221; Suppose you&#8217;re a microkernel, and you restart a failed user-level driver; the new driver instance has no way of knowing what state the borked-out driver left the actual, physical hardware in. Sometimes, a blind reset procedure can safely be carried out, but sometimes it can&#8217;t. Also, the devices being driven are DMA masters, so they might very well have done something horrible to the kernel even though the buggy driver was &#8220;just a user-level app.&#8221; And if there were I/Os in flight at failure time, have they happened, or not? Remember, they might not be idempotent&#8230; I&#8217;m not saying that some best-effort way of dealing with many of these problems is impossible, just that it&#8217;s unclear that moving the driver into userspace has helped the situation at all.</p></blockquote>
<p>So what this shows is that the hardware-software interface is where the really hard and interesting problems start to pop up. I am a big fan of abstraction and layers of indirection as programming methodologies; I am not a <a href="http://www.grc.com">Steve Gibson</a> who feels that programs are best written in assembly&#8230; but the abstractions do have to allow for the truth that is underneath the system. Bad abstractions or too simple abstractions make things more complex, rather than less.</p>
<p>Moving on from the software side of things to the hardware design side, <a href="http://www.garystringham.com/newsletter.shtml">Gary Stringham is running a nice series of tips for hardware design</a>. Here, too, there are lots of interesting issues to confront in order to make hardware easy or worthwhile to use. He recently ran a link to a <a href="http://www.microsoft.com/whdc/resources/MVP/xtremeMVP_hw.mspx">2004 Microsoft article on how hardware should be designed</a>, based on the experience of the Windows driver team at Microsoft.</p>
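<p>Cantrill&#8217;s point about system software not merely operating on memory can be sketched in a few lines of Python. This is a toy under my own assumptions: the <tt>Transaction</tt> and <tt>Device</tt> classes are hypothetical stand-ins and do not model any real transactional memory implementation.</p>

```python
class Device:
    """Stand-in for an I/O device: every command takes effect immediately."""

    def __init__(self):
        self.packets_sent = []

    def send(self, payload):
        self.packets_sent.append(payload)   # cannot be un-sent


class Transaction:
    """Toy transaction: memory writes are buffered until commit."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = {}

    def write(self, key, value):
        self.pending[key] = value

    def commit(self):
        self.memory.update(self.pending)
        self.pending = {}

    def abort(self):
        self.pending = {}   # memory writes vanish, device effects do not


memory = {"tx_count": 0}
nic = Device()

txn = Transaction(memory)
txn.write("tx_count", 1)
nic.send("hello")   # side effect outside the transaction's own memory
txn.abort()         # for example because a conflict was detected

print(memory["tx_count"])   # 0: the memory update was rolled back
print(nic.packets_sent)     # ['hello']: the packet went out anyway
```

<p>After the abort, the in-memory bookkeeping and the device disagree about what happened, which is exactly the kind of state that a lock around both would have kept consistent.</p>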
<blockquote><p>If every hardware engineer just understood that write-only registers make debugging almost impossible, our job would be a lot easier. Many products are designed with registers that can be written, but not read. This makes the hardware design easier, but it means there is no way to snapshot the current state of the hardware, or do a debug dump of the registers, or do read-modify-write operations. Now that virtually all hardware design is done in Verilog or VHDL, it takes only a tiny bit of additional effort to make the registers readable.</p>
<p>Another typical hardware trick is registers that automatically clear themselves when written. Although this is sometimes useful, it also makes debugging difficult when overused.</p></blockquote>
<p>I guess it is kind of sad that even five years later, these same issues still seem to crop up in new products and merit volumes of venom from driver developers&#8230; On the other hand, some companies do seem to be getting it. To me, the Freescale designs of recent years seem fairly easy to configure and debug, and do not feature write-only bits in any large number.</p>
<p>The article about <a href="http://jakob.engbloms.se/archives/770">hardware acceleration for TCP/IP by Mike Odell</a> that I discussed in a previous blog post is also relevant: when does the complexity of hardware interfacing negate any performance benefit from an accelerator?</p>
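<p>For readers who have not been bitten by write-only registers, here is a small Python model of the problem described in the Microsoft quote. The register classes and the 8-bit width are hypothetical and chosen only for illustration; they do not describe any real device.</p>

```python
class WriteOnlyRegister:
    """Hypothetical 8-bit control register that cannot be read back."""

    def __init__(self):
        self._bits = 0               # hardware state, invisible to software

    def write(self, value):
        self._bits = value % 256     # keep the low 8 bits

    def read(self):
        raise IOError("register is write-only")


class ReadableRegister(WriteOnlyRegister):
    """The same register with the read path implemented."""

    def read(self):
        return self._bits


def set_bit_rmw(reg, bit):
    """Read-modify-write: only possible when the register can be read."""
    reg.write(reg.read() | 2 ** bit)


class ShadowedDriver:
    """Driver workaround: keep a software copy of what was last written."""

    def __init__(self, reg):
        self.reg = reg
        self.shadow = 0

    def set_bit(self, bit):
        self.shadow |= 2 ** bit
        self.reg.write(self.shadow)


readable = ReadableRegister()
set_bit_rmw(readable, 0)
set_bit_rmw(readable, 3)
print(readable.read())     # 9

driver = ShadowedDriver(WriteOnlyRegister())
driver.set_bit(0)
driver.set_bit(3)
print(driver.shadow)       # 9, but only as long as the shadow copy is right
```

<p>The shadow copy works until the hardware changes the register on its own, for example through a reset or a self-clearing bit. After that the driver&#8217;s idea of the state is silently wrong, and there is no way to take a debug dump to find out.</p>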
<p>(<em>for some reason, the initial posting of this post had an incomplete last paragraph, something weird in WordPress updates happened</em>)</p>
<p>To sum up, I think the interaction of hardware and software in the context of full operating systems and device driver stacks is a really interesting topic that does not seem to have gotten very much academic coverage. I hope to be able to help remedy some of this, once I get the Simics setup used in my <a href="http://jakob.engbloms.se/archives/709">experiments with hardware accelerators</a> packaged and available for academia. Full-system virtual platforms make for a very good experimental system, especially those where you use some third-party or standard operating system rather than just your own controlled code.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/799/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>When does Hardware Acceleration make Sense in Networking?</title>
		<link>http://jakob.engbloms.se/archives/770?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/770#comments</comments>
		<pubDate>Sat, 16 May 2009 06:45:47 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[history of computing]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[accelerators]]></category>
		<category><![CDATA[ethernet]]></category>
		<category><![CDATA[hardware-software interface]]></category>
		<category><![CDATA[Mike Odell]]></category>
		<category><![CDATA[networking]]></category>
		<category><![CDATA[tcp]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=770</guid>
		<description><![CDATA[Yes, when does hardware acceleration make sense in networking? Hardware acceleration in the common sense of &#8220;TCP offload&#8221;. This question was answered by a very nicely reasoned &#8220;no&#8221; in an article by Mike Odell in ACM Queue called &#8220;Network Front-End Processors, Yet Again&#8221;. The article is highly recommended for its long historical look at network [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-771" style="margin: 0px 15px;" title="q_stamp" src="http://jakob.engbloms.se/wp-content/uploads/2009/05/q_stamp.gif" alt="q_stamp" width="38" height="65" />Yes, when does hardware acceleration make sense in networking? Hardware acceleration in the common sense of &#8220;TCP offload&#8221;. This question was answered by a very nicely reasoned &#8220;no&#8221; in an article by Mike Odell in <a href="http://queue.acm.org/">ACM Queue</a> called &#8220;<a href="http://queue.acm.org/detail.cfm?id=1530828">Network Front-End Processors, Yet Again</a>&#8221;.</p>
<p><span id="more-770"></span></p>
<p>The article is highly recommended for its long historical look at network processing and network processing offload. As the balance between speeds of networks, processors, memory, and interconnects between network cards and the rest of the system has changed over the years, offload is an idea that occasionally (four or five times since the 1970s) has made sense. However, in the end, Mike thinks that it usually does not, and for a machine with multiple cores and a modern fast interconnect, it is hard to see how a hardware accelerator can actually help speed things up much when the coordination between the hardware and the software is accounted for. Even if there would appear to be a big bottleneck somewhere today, we can be sure that it will be removed in the next generation of hardware, rendering the market window for an accelerator quite short.</p>
<p>I read this article as another great motivation for the need to carefully consider the functional design of the hardware-software interface for acceleration devices. For simple data-pumping or media-processing units, this looks easy. For something as complex as TCP/IP processing, it is not. I think the key is that for TCP, we have something that is much more like control-plane processing than data-plane processing, and that is harder to efficiently integrate between hardware and software. Also, there is not really that much work left to offload once data copies have been architected in the right way (and I read Mike&#8217;s article to say that we now know how to do this with few enough copies that software is close to optimal in architecture).</p>
<p>From a market perspective, it would also indicate that the acceleration circuits that are in common use today are by definition those that make sense. Having hardware-accelerated graphics and video decoders does seem to help build more efficient and attractive computer systems, as do cryptography accelerators. With this view, it will be interesting to see which of all the accelerators found in modern networking SoCs like those from Freescale and Cavium will survive the test of time. I am willing to place a small bet that pattern-matching engines for traffic inspection will be among them. Apart from that, it is hard to say.</p>
<p>So go read that article before you start designing your next brilliant accelerator for a common expensive operation.</p>
<p>It also reminds me of a <a href="http://www.virtutech.com/whitepapers/wp-system_arch_spec.html">whitepaper I wrote earlier this year</a> on how to evaluate the performance of a hardware accelerator in the context of a full system with a full software stack, considering the details of the hardware-software interface.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/770/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Cool Obscure Hardware: Sun SCC and Software License Protection</title>
		<link>http://jakob.engbloms.se/archives/619?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/619#comments</comments>
		<pubDate>Wed, 28 Jan 2009 20:12:27 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[business issues]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[SCC]]></category>
		<category><![CDATA[smart card]]></category>
		<category><![CDATA[software licensing]]></category>
		<category><![CDATA[Sun]]></category>
		<category><![CDATA[System Configuration Card]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=619</guid>
		<description><![CDATA[In a very roundabout way, I recently got to hear about a cool Sun server feature introduced sometime back in 2003 or 2004: the SCC System Configuration Card. This is a smart card that stores the system hostid and Ethernet MACs, along with other info, and which can be transferred from one server to another. [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-620" style="margin: 5px;" title="sunlogo" src="http://jakob.engbloms.se/wp-content/uploads/2009/01/sunlogo.png" alt="sunlogo" width="97" height="60" />In a very roundabout way, I recently got to hear about a cool Sun server feature introduced sometime back in 2003 or 2004: the SCC System Configuration Card. This is a smart card that stores the system hostid and Ethernet MACs, along with other info, and which can be transferred from one server to another.</p>
<p><span id="more-619"></span></p>
<p>Finding information on this card was very hard, and here is the best that I could find:</p>
<blockquote><p>With front and back LEDs and a removable system configuration card, the Sun Fire V120 server maximizes system availability by allowing system administrators to concentrate on scheduled service through easy installation and management. The removable system configuration card allows you to store a system&#8217;s host ID, MAC address, and NVRAM settings to another server while you perform routine maintenance. As a result, system downtime is minimized.</p></blockquote>
<p>I find this interesting because it is also a nod to commercial software companies relying on hostids for licensing. In this way, you can maintain the same hostid even when a server has issues, and without compromising the integrity of licensing. Sun&#8217;s hostids are unusually safe and reliable, unlike the common x86 anchors like Ethernet MAC addresses (which are easy to change) and disk IDs (which are typically not available on Linux).</p>
<p>Making the ID physical in this way is usually the best way to handle identity in general. A GSM/UMTS SIM card is another example of a physically represented identity, which is far preferable to virtual identities that are just software. It is much easier to handle, and safer for all involved.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/619/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DNS: Hardware Accelerator Time!</title>
		<link>http://jakob.engbloms.se/archives/222?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/222#comments</comments>
		<pubDate>Sat, 16 Aug 2008 21:21:50 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[podcast commentary]]></category>
		<category><![CDATA[SecurityNow]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=222</guid>
		<description><![CDATA[In Episode 157 of Security Now, Steve Gibson and Leo Laporte discuss the recently discovered security issues with DNS. In particular, the cost of making a good fix in terms of bandwidth and computation capacity. Fundamentally, according to Steve, today&#8217;s DNS servers are running at a fairly high load, and there is no room to improve [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.twit.tv/sn157"><img class="size-medium wp-image-225 alignleft" style="margin: 5px 10px;" title="Security Now smaller" src="http://jakob.engbloms.se/wp-content/uploads/2008/08/podcast_2_31.jpg" alt="" width="70" height="70" /></a> In <a href="http://www.twit.tv/sn157">Episode 157 of Security Now</a>, Steve Gibson and Leo Laporte discuss the recently discovered security issues with DNS. In particular, the cost of making a good fix in terms of bandwidth and computation capacity. Fundamentally, according to Steve, today&#8217;s DNS servers are running at a fairly high load, and there is no room to improve the security of DNS updates by, for example, sending extra UDP packets or switching to TCP/IP. As this theoretically means a doubling or tripling of the number of packets per query, I can believe that. The &#8220;real solutions&#8221; to DNS problems should lie in the adoption of a truly secured protocol like <a href="http://en.wikipedia.org/wiki/DNSSEC">DNSSEC</a>. As this uses public key crypto (PKC), it would add a processing load to the servers that would kill the DNS servers on the CPU side instead&#8230;</p>
<p><span id="more-222"></span></p>
<p>Since Steve is a general PC guy, he seems to have a hard time acknowledging that you need anything but an x86 processor (or a few). However, in this episode he did note that this would greatly benefit from special-purpose acceleration hardware for PKC. So here is a clear-cut case where the addition of specialized accelerators makes sense even in what is considered &#8220;general&#8221; computing. This is a favorite theme of mine, see previous blog posts like the <a href="http://jakob.engbloms.se/archives/157">Kunle Olukotun Interview</a>, <a href="http://jakob.engbloms.se/archives/80">IBM z10 accelerators</a>, and my <a href="http://jakob.engbloms.se/archives/44">Niagara 2 writeup</a>.</p>
<p>So here we have it: special-purpose acceleration will save the Internet, and the only architecture missing processors with good crypto accelerators seems to be x86. SPARC, Power Arch, and zSeries all have chips with accelerators on them. One would presume that either AMD or Intel &#8212; maybe more likely AMD who are now working hard on integrating things like GPUs on their chips &#8212; will soon release an x86 with this kind of support. It is also a case where general multicore use does not really make much sense, as using an additional general-purpose core is going to have much worse performance per energy or per area than a dedicated accelerator.</p>
<p>I still believe that the future is heterogeneous and full of accelerators.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/222/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

