<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; AMD</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/amd/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>Coding Horror on Big Iron Hardware</title>
		<link>http://jakob.engbloms.se/archives/841?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/841#comments</comments>
		<pubDate>Wed, 15 Jul 2009 19:41:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[CodingHorror]]></category>
		<category><![CDATA[HP]]></category>
		<category><![CDATA[Jeff Atwood]]></category>
		<category><![CDATA[server]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=841</guid>
		<description><![CDATA[In a post from late June, Jeff Atwood at Coding Horror discusses the horrible cost of a large HP server (scaling up to 32 processor cores in eight AMD x86 sockets), compared to a bunch of simple single-socket basic servers. There are some interesting notes on relative costs of small-and-simple servers, including things like administration [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-654" title="opinion" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/opinion.png" alt="opinion" width="91" height="69" />In a post from late June, Jeff Atwood at Coding Horror <a href="http://www.codinghorror.com/blog/archives/001279.html">discusses the horrible cost of a large HP server </a>(scaling up to 32 processor cores in eight AMD x86 sockets), compared to a bunch of simple single-socket basic servers. There are some interesting notes on relative costs of small-and-simple servers, including things like administration and power. There is an undercurrent to the post and the comments that the big HP machine is &#8220;overpriced&#8221;. I don&#8217;t think it is. If you have ever had <a href="http://user.it.uu.se/~eh/">Erik Hagersten </a>as a teacher in computer architecture, you will know why.</p>
<p><span id="more-841"></span></p>
<p>Essentially, the cost of connecting a bunch of processors goes up exponentially as the number of processors increases. I think this is just as true for Hypertransport-connected AMD 4-way chips as it was for Sun 10000 servers ten years ago. The backplane takes over as the cost driver, from the processors and memories and other obviously useful stuff. Scaling up beyond the commodity space (which is a moving target over time, certainly) requires a lot of engineering and custom hardware design. This makes the cost exponentially higher, but for a good reason.</p>
<p>Note that this is one of the reasons that the Sun Niagara/UltraSparc T-line machines are compelling: with 32 or 64 threads per socket, getting to 100+ hardware threads is way cheaper using that architecture than anything else in the server space (in deep embedded, 100+ cores is a yawn).</p>
<p>Just a small rant, while on vacation.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/841/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Three Cores make a Crowd &#8212; or a Problem</title>
		<link>http://jakob.engbloms.se/archives/633?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/633#comments</comments>
		<pubDate>Sat, 07 Feb 2009 21:12:38 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[device tree]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[OpenPIC]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=633</guid>
		<description><![CDATA[A common question from simulation users to us simulation providers is &#8220;can I simulate a machine with N cores&#8221;, where N is &#8220;large&#8221;. As if running lots of cores was a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-634" style="margin: 10px;" title="mpc8640d_pp" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/mpc8640d_pp.jpg" alt="mpc8640d_pp" width="130" height="130" />A common question from simulation users to us simulation providers is &#8220;can I simulate a machine with N cores&#8221;, where N is &#8220;large&#8221;. As if running lots of cores were a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform is easy. Creating a software stack for that arbitrary platform is a lot harder, since an SMP software stack needs to understand the cores and how they communicate.</p>
<p>Essentially, what you need is a hardware design that has addressing room for lots of cores, and a software stack that is capable of using lots of cores &#8212; even if such configurations do not exist in hardware. Unfortunately, since software is normally written to run on real existing machines, there tend to be unexpected limitations even where scalability should be feasible &#8220;in principle&#8221;.</p>
<p>Here is the story of how I convinced Linux to handle more than two cores in a virtual <a href="http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8641D&amp;nodeId=0162468rH3bTdG8653">MPC8641D </a>machine.</p>
<p><span id="more-633"></span>In principle, adding more cores to the MPC8641 is easy. The interrupt controller that connects the cores together is the eminently scalable OpenPIC design, which can do at least 32 cores. At run time, this is the only addressing that really matters. The Linux SMP support seems sufficiently scalable using the OpenPIC driver as well (an aside here: OpenPIC appears to be a design originally created by AMD or Cyrix for x86-SMP, but that reached common use with the PowerPC CHRP reference design &#8212; however, Internet sources are murky on this).</p>
<p>But the interrupt controller is just the first hurdle. There is another limit in the MPC8641 hardware: the multicore controller module, MCM, has a register that despite a strange name (Port Control Register, or PCR) is essentially what is used to enable and disable processors. PCR has room for only eight cores. Since the real MPC8641D only has two cores, there is actually a set of six &#8220;reserved&#8221; bits. The Linux board support package thankfully uses a generic scheme based on processor core numbers. So adding in more cores just sets bits in the &#8220;reserved&#8221; field:</p>
<p><img class="aligncenter size-full wp-image-636" title="mpc8641d-mcm-room-for-extension1" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/mpc8641d-mcm-room-for-extension1.png" alt="mpc8641d-mcm-room-for-extension1" width="630" height="200" /></p>
<p>Thus, this processor scales to eight cores without recoding the Linux support package or having to modify the register layout of the hardware.</p>
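<p>As a rough sketch of such a generic scheme (this is not the actual Linux code), enabling a core amounts to setting one bit per core number in the enable field. The field position <code>CORE_ENABLE_SHIFT</code> and the helper names are assumptions made for illustration; the real bit layout has to come from the MPC8641D reference manual.</p>

```python
# Hypothetical model of a PCR-style core-enable field.
# Assumption: core n is enabled by bit (CORE_ENABLE_SHIFT + n); the real
# bit positions must be taken from the MPC8641D reference manual.
CORE_ENABLE_SHIFT = 24   # assumed position of the core 0 enable bit
MAX_CORES = 8            # the field has room for eight cores


def enable_core(pcr: int, core: int) -> int:
    """Return the PCR value with the enable bit for `core` set."""
    if not 0 <= core < MAX_CORES:
        raise ValueError(f"core {core} does not fit in the enable field")
    return pcr | (1 << (CORE_ENABLE_SHIFT + core))


def enabled_cores(pcr: int) -> list[int]:
    """List the core numbers whose enable bits are set in `pcr`."""
    field = (pcr >> CORE_ENABLE_SHIFT) & ((1 << MAX_CORES) - 1)
    return [n for n in range(MAX_CORES) if field & (1 << n)]


# Start with core 0 enabled, then bring up cores 1 and 2 (the third core).
pcr = enable_core(0, 0)
for n in (1, 2):
    pcr = enable_core(pcr, n)
```

<p>Because the bit is derived from the core number rather than from a fixed two-entry table, cores 2 through 7 simply land in the bits that the real two-core part marks as reserved.</p>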
<p>The next issue was then how to communicate the number of cores to the software stack. There is no standard probing available, so the core count has to be a parameter given to the kernel. In all modern Linux versions, the &#8220;powerpc&#8221; architecture uses an OpenFirmware device tree data structure to obtain the hardware setup: cores, devices, addresses, interrupt routing, and anything else that is not explicitly probed (like PCI or USB, for example).</p>
<p>Once I got a <a href="http://www.jdl.com/software/">device tree compiler </a>installed this was surprisingly straightforward. Just add a few more cores to the description file, compile, and use the new binary blob (the representation used by the kernel is the dtb, or &#8220;device tree blob&#8221;) instead of the standard one. In a virtual setup, changing this is trivial: just load a different file into memory before booting the system.</p>
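<p>To give a feel for what &#8220;adding a few more cores&#8221; means, here is a minimal sketch that generates the cpus part of a device tree source for a given core count. The node name and the property set are simplified assumptions; a real MPC8641D description carries more per-core properties, and the function names are made up for this example.</p>

```python
# Sketch: generate the "cpus" part of a device tree source with N cores.
# The node name and property set are simplified assumptions; a real
# MPC8641D tree carries more per-cpu properties (caches, clocks, ...).


def cpu_node(index: int) -> str:
    """Return one simplified cpu node for core `index`."""
    return (
        f"\t\tPowerPC,8641@{index} {{\n"
        f"\t\t\tdevice_type = \"cpu\";\n"
        f"\t\t\treg = <{index}>;\n"
        f"\t\t}};\n"
    )


def cpus_section(n_cores: int) -> str:
    """Return a cpus node containing `n_cores` cpu nodes."""
    header = (
        "\tcpus {\n"
        "\t\t#address-cells = <1>;\n"
        "\t\t#size-cells = <0>;\n"
    )
    body = "".join(cpu_node(i) for i in range(n_cores))
    return header + body + "\t};\n"


# Text for a three-core variant, to be pasted into the full source file.
three_core_cpus = cpus_section(3)
```

<p>The resulting source file would then be compiled to a blob with the device tree compiler, along the lines of <code>dtc -I dts -O dtb -o board.dtb board.dts</code> (the file names are placeholders).</p>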
<p>However, this did not work. The boot froze after core 2 (the third core) was enabled. Figuring out why and how to fix it took some time, since it turned out not to be a kernel problem at all&#8230; I spent a lot of time tracing and debugging the Linux kernel boot, including reversing back and forth over a hung loop, forcing interrupts to be enabled just to see what would happen, and similar standard virtual platform tricks.</p>
<p>The problem turned out to be that the kernel was using processor numbers as a way to check which processors were coming online, and this processor number was read from the &#8220;PIR&#8221; special-purpose register (SPR) on the newly activated core. And this PIR value was set to one for all cores except core zero &#8212; some distance into the boot.</p>
<p>By single-stepping the first few instructions of the reset vector code I finally saw what was happening: code put in place by U-Boot (not the Linux kernel, really) was reading a magical MMU configuration register, and using the single bit it contained for determining the current processor as the processor ID. Thus, here was a piece of hardware with a single architected bit for IDs, and it is not even clear to me that this bit is supposed to be used in the way it is here. This was also a bit that could not be extended: putting data in neighboring (reserved, not used for other purposes) bits in that register just to see what would happen broke page table lookups with very high reliability.</p>
<p>In the end, the solution was just to remove the assembler instruction that wrote the PIR register. There was no other way around the problem. I guess this is &#8220;cheating&#8221;, but if changing a single line of code in the boot loader is what it takes to make Linux work with one to eight processor cores, I am fine with that. It is far less invasive than making changes to the Linux kernel, or creating a new system support package from scratch.</p>
<p>This has finally given me a machine I can provide to <a href="http://www.virtutech.com/products">Simics </a>users who need an easy-to-change embedded SMP machine for multicore studies. I have tested that it works with 2, 3, 4, 6, and 8 cores. Five and seven would be easy to add as well, as it is just a matter of replacing the device tree.</p>
<p>This exercise also told me that the device tree is an interesting data structure that has significant power once you understand how it works. Until now, I have just seen it as a daunting weird thing that you could not do much about&#8230; but that is not the right attitude.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/633/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Is Cycle Accuracy a bad Idea?</title>
		<link>http://jakob.engbloms.se/archives/153?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/153#comments</comments>
		<pubDate>Fri, 11 Jul 2008 20:45:02 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Axys]]></category>
		<category><![CDATA[Carbon Technology]]></category>
		<category><![CDATA[clock-cycle models]]></category>
		<category><![CDATA[CoWare]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[DEC]]></category>
		<category><![CDATA[Grant Martin]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[Infineon]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[Modeling]]></category>
		<category><![CDATA[rtl]]></category>
		<category><![CDATA[scdsource]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=153</guid>
		<description><![CDATA[In a funny coincidence, I published an article at SCDSource.com about the need for cycle-accurate models for virtual platforms on the same day that ARM announced that they were selling their cycle-accurate simulators and associated tool chain to Carbon Technology. That makes one wonder where cycle-accuracy is going, or whether it is a valid idea [...]]]></description>
			<content:encoded><![CDATA[<p>In a funny coincidence, I published an article at SCDSource.com about the need for cycle-accurate models for virtual platforms on the same day that ARM announced that they were selling their cycle-accurate simulators and associated tool chain to Carbon Technology. That makes one wonder where cycle-accuracy is going, or whether it is a valid idea at all&#8230; is ARM right or am I right, or are we both right since we are talking about different things?</p>
<p>Let&#8217;s look at this in more detail.</p>
<p><span id="more-153"></span></p>
<h2>Definitions</h2>
<p>A clock-cycle (CC) model in this discussion is something that attempts to provide a cycle-by-cycle depiction of the behavior of a computer system. Usually, such models are driven by a cycle-by-cycle clock, as that is the easiest way to write and structure them.<br />
A cycle-accurate (CA) model is a CC model where the depiction is &#8220;the same&#8221; as what would happen in the real system provided they both started from the same state. </p>
<h2>What is ARM Doing?</h2>
<p>ARM seems to be passing on the tools and technologies they acquired when they bought Axys back in 2004. These tools are CC-oriented, and are aimed at hardware architects (and some really-low-level software work). They make it possible to evolve a target design cycle by cycle in the simulator to get a very accurate picture of the target behavior. I think this fits very well for Carbon, as they generate cycle-driven very accurate models by essentially compiling the actual RTL implementation of a piece of logic, processor, or device into something a bit faster than plain HDL simulation. Carbon models are a natural fit for the Axys tools.</p>
<p>Basically, it sounds as if ARM decided that manually creating CC level CA models for their latest processors for use in the Axys tools (SoC Designer) was too much work and too hard to validate. Thus, they pass the whole thing on to Carbon and seem to expect Carbon to generate CA models for use with SoC Designer straight from the actual ARM implementation RTL. Carbon will have the old CC/CA models written by Axys (and later ARM), and then generate new models for new generations of ARM chips like the Cortex A9. I quote:</p>
<blockquote><p>&#8220;The model generation flow will be optimized and validated using the RTL code, ensuring speed and accuracy. The processor models will also leverage the Carbon model application programming interface (API) to offer a direct connection to the ARM RealView(R) Debugger. Carbon-generated models of ARM IP will offer our customers the fastest, most-accurate path for firmware development and architectural exploration.&#8221; (<a href="http://www.carbondesignsystems.com/Press/20080707%20Carbon%20Press%20Release.pdf">press release</a>)</p></blockquote>
<p>And:</p>
<blockquote><p>ARM made this decision, Cornish said, because it&#8217;s become increasingly difficult and time-consuming to develop cycle-accurate models. &#8220;We recognized it would make more sense to work with a specialist like Carbon that has technology for generating models directly from RTL,&#8221; he said. (<a href="http://www.scdsource.com/article.php?id=264">SCDSource News Piece on the deal</a>)</p></blockquote>
<h2>Feasibility of Construction</h2>
<p>The core argument here is really how easy or feasible it is to build CA models of a processor core (or any other really complex piece of logic). There are several interesting views to consider.</p>
<ol>
<li>The ARM statement is basically saying that building CA models of a processor core is very hard. It is hard to get right, hard to validate, and hard to maintain. So why even try? Better to generate it from the RTL and let experts at doing that do the work.</li>
<li>In my PhD thesis from 2002, I concluded that building an accurate model of a processor from public information and reverse engineering is very very difficult, and cited a number of computer architecture and real-time systems attempts to build models that all turned out to have accuracy issues. I did not know much about EDA then &#8212; and ESL did not really exist. But I think that still holds water: constructing a model of a processor is hard.</li>
<li>In the SCDSource article, I make the statement that &#8220;Building cycle-accurate (CA) models is very difficult, as you need to understand and describe the implementation details of complicated hardware units. &#8230; It is quite easy to end up with something that is essentially an alternative implementation to the actual chip RTL. It is especially difficult for third parties, as it requires access to the device and processor core designers to explain the design.&#8221; Which is essentially saying that you need to get inside the processor design group to get the information.</li>
<li>It is common knowledge that all great processor design teams, from the DEC Alpha to Intel x86s to AMD Opterons to IBM Power to Freescale Power to Infineon TriCore to Sun Niagara, use internal cycle-detailed simulators as their main design tools to prototype and decide how to design pipelines, memory systems, and system platforms. In this case, the simulator comes before the processor, not the other way around.</li>
<li>Tensilica has, as Grant Martin points out in comments at <a href="http://www.scdsource.com/article.php?id=266">SCDSource</a>, tools that generate both the processor and an accurate model at the same time from the same information base.</li>
<li>CoWare&#8217;s LisaTek tools for describing and generating application-specific processors also claim to generate accurate models from the LISA source files in a way similar to Tensilica but based on a user describing a completely custom design in a third-party tool. In the case of Tensilica, the tool and the design come from the same company.</li>
</ol>
<p>So where does this leave us? It makes it clear that in order to build a good cycle-accurate model you need access to internal information and the processor design/processor design team. The CA model can be built either:</p>
<ol>
<li>By synthesizing from the RTL, Carbon-style.</li>
<li>By synthesizing from some more abstract design description, Tensilica or LisaTek-style.</li>
<li>By the design team as part of the design process.</li>
<li>By some poor guy working after the fact from specs and test cases.</li>
</ol>
<p>I think the ARM-Carbon deal (and all practical experience as well) invalidates the fourth variant. Essentially, that is what Axys had to do: build models after the fact, separate from the CPU design flow. This is a property of how ARM designs processors and the fact that Axys began life outside of ARM (my guess, nota bene). It is what computer architecture researchers often want to do but fall down on over and over again. In fact, a common question from computer architecture newbies is if Virtutech Simics has correct models of processors like the Intel Pentium4 or Core 2 available to use as starting points in research. It would be nice, but sorry, we do not.</p>
<p>But the other three variants do make sense, and will all result in some kind of decent model. Which one you end up doing depends on the style of your design and quite likely the complexity of the processor and system design. In the end, any truly revolutionary design (think Sun Rock, for example) will need to write a custom simulator as tools will not have the concepts in them to model all ideas. It seems that simple &#8220;standard&#8221; designs that fit in the categories of &#8220;custom RISC&#8221; or &#8220;custom DSP&#8221; and that do not break new ground in computer architecture can probably be designed using tools that allow processor and simulator generation. I think that most heavy-duty general-purpose processor cores will have to do either the design-model or RTL-generation path, while more accelerator-style cores can use the tools approach.</p>
<p>As a final note, there could really be two different problems being addressed here regarding &#8220;cycle accuracy&#8221;, and this might contribute to different levels of feasibility:</p>
<ul>
<li>Using the simulator to validate and optimize software performance can tolerate some errors in details as long as errors do not accumulate (see for example the &#8220;<a href="http://moss.csc.ncsu.edu/~mueller/wcet06/accepted/5.html">timing anomalies</a>&#8221; or &#8220;<a href="http://www-emsoft02.imag.fr/Programme/Engblom.pdf">unbounded long timing effects</a>&#8221; found in WCET research). It is about understanding the software behavior versus the processor design (or complex accelerator design versus input data), in small focused spots of execution.</li>
<li>Using the simulator to validate a chip design including buses and other devices that can be bus masters. This ought to require a higher level of accuracy, as the penalty for errors would potentially seem greater. And this is also where ARM&#8217;s SoC Designer fits in, rather than as a tool to understand the software behavior. The scope here is larger and there is usually no idea of zooming in on detail at particular points in time.</li>
</ul>
<p>So where does this land us?</p>
<p>I guess that CC/CA models can be built if you have a nice inside track to the design team, and that the only sensible way to use them is as a zoom device for the places in your code where you absolutely need the details. Most of the time (say 90-95-99%) software does not need CC models, but rather something that is functionally accurate and that runs really really fast so that all software can at least be executed. That is something a CC model will never be able to do, at least not for systems using non-trivial operating systems requiring a few billion instructions just to boot&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/153/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>More Odd Targets</title>
		<link>http://jakob.engbloms.se/archives/91?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/91#comments</comments>
		<pubDate>Thu, 27 Mar 2008 14:21:39 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[odd core count]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/91</guid>
		<description><![CDATA[PC Perspective writes in its IDF writeup about Intel&#8217;s upcoming Dunnington device with six cores on a die. Another break from the 2-4-8 powers-of-two progression. Feels oddly refreshing, even if there is really nothing very strange about it. Just like AMD&#8217;s triple-core.]]></description>
			<content:encoded><![CDATA[<p><a title="Intel Core 2 Quad" href="http://jakob.engbloms.se/wp-content/uploads/2008/03/2q_62.gif"><img src="http://jakob.engbloms.se/wp-content/uploads/2008/03/2q_62.thumbnail.gif" alt="Intel Core 2 Quad" hspace="10" vspace="10" width="53" height="67" align="left" /></a><a href="http://www.pcper.com/article.php?aid=534">PC Perspective writes in its IDF writeup</a> about Intel&#8217;s upcoming <em>Dunnington </em>device with six cores on a die. Another break from the 2-4-8 powers-of-two progression. Feels oddly refreshing, even if there is really nothing very strange about it. Just like <a href="http://jakob.engbloms.se/archives/27">AMD&#8217;s triple-core</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/91/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>AMD three-core Phenom</title>
		<link>http://jakob.engbloms.se/archives/27?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/27#comments</comments>
		<pubDate>Wed, 19 Sep 2007 18:00:22 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[EETimes]]></category>
		<category><![CDATA[odd core count]]></category>
		<category><![CDATA[Phenom]]></category>
		<category><![CDATA[Xenon]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/27</guid>
		<description><![CDATA[Spotted at EETimes.com &#8211; Odd move: AMD plans three-core CPU . Interesting that someone in the mainstream finally breaks the 1-2-4-8 progression that seems to be the norm. The reason for using three cores is an interesting one &#8212; AMD claims that that is fairly optimal for users using many desktop applications at once. Four [...]]]></description>
			<content:encoded><![CDATA[<p>Spotted at <a href="http://www.eetimes.com/news/latest/showArticle.jhtml;jsessionid=U4WPZN02QG224QSNDLSCKHA?articleID=201806702">EETimes.com &#8211; Odd move: AMD plans three-core CPU </a>. Interesting that someone in the mainstream finally breaks the 1-2-4-8 progression that seems to be the norm.</p>
<p><span id="more-27"></span>The reason for using three cores is an interesting one &#8212; AMD claims that this is fairly optimal for users running many desktop applications at once. Four cores apparently bring a diminishing benefit currently, and two cores can be fully utilized. So at the moment, three cores might represent a sweet-spot for many normal users. I am thinking of a future write-up on capacity computing vs capability computing, and this design does have something to say about capacity computing.</p>
<p>I don&#8217;t know of any technical or engineering reason that you have to have cores in numbers that are powers of two. For shared-memory machines, this does seem to be the norm, even for optimized embedded designs. The only exception I can think of immediately is the Xenon 3 core design used in the Xbox 360. Most odd-core-count chips that I have seen are asymmetric or local-memory designs with several independent ARM cores or combinations of ARM and DSPs.</p>
<p>I would venture a guess that the three-core parts are actually four-core chips where one core has been disabled or went bad in the manufacturing process (this is also the opinion of <a href="http://arstechnica.com/news.ars/post/20070917-amds-triple-threat-the-tri-core-phenom.html">ArsTechnica</a>). That is common practice for embedded SoCs, where the same die is used for a range of products by simply disabling various functions and devices. In this way, you get a number of different products at different cost points, while still obtaining economies of scale in manufacturing and avoiding the verification cost of a variant die.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/27/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SICS Multicore Day August 31</title>
		<link>http://jakob.engbloms.se/archives/17?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/17#comments</comments>
		<pubDate>Sun, 02 Sep 2007 20:13:50 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[embedded systeme]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[uncategorized]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Hardware debug support]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[Joe Armstrong]]></category>
		<category><![CDATA[Niagara]]></category>
		<category><![CDATA[QuviQ]]></category>
		<category><![CDATA[SiCS Multicore days]]></category>
		<category><![CDATA[Sun]]></category>
		<category><![CDATA[transactional memory]]></category>
		<category><![CDATA[UltraSPARC]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/17</guid>
		<description><![CDATA[The SICS Multicore Day August 31 was a really great event! We had some fantastic speakers presenting the latest industry research view on multicores and how to program them. Marc Tremblay did the first presentation in Europe of Sun&#8217;s upcoming Rock processor. Tim Mattson from Intel tried hard to provoke the crowd, and Vijay Saraswat [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.sics.se/node/1854">SICS Multicore Day August 31</a> was a really great event! We had some fantastic speakers presenting the latest industry research view on multicores and how to program them. Marc Tremblay did the first presentation in Europe of Sun&#8217;s upcoming Rock processor. Tim Mattson from Intel tried hard to provoke the crowd, and Vijay Saraswat of IBM presented their X10 language. Erik Hagersten from Uppsala University provided a short scene-setting talk about how multicore is becoming the norm.</p>
<p><span id="more-17"></span><br />
The Rock is a very interesting piece of work. It tries to be both a throughput-oriented design like the Niagara/UltraSPARC T machines and a single-thread high-performance design, even though on balance it is skewed more towards throughput computing. What is very cool is how they use additional threads to help boost the performance of a main thread using &#8220;scout threads&#8221; (a concept I saw presented back at ISCA 2004). This makes it possible to use threads to either boost single-thread performance or do throughput work, creating a more flexible design than is usually the case. It is also the first commercial implementation of <a href="http://research.sun.com/spotlight/2007/2007-08-13_transactional_memory.html">transactional memory</a>. And it is 16-way, and due next year.</p>
<p>So far, Rock seems like a very successful and very visionary project that is trying in yet another way to gain momentum by pure hardware innovation. Just as with the UltraSPARC T line, Sun is trying to out-invent IBM and Intel/AMD, who seem to be mostly progressing by just piling on more of the same old features. I really hope this play goes well; if we were down to just IBM/PPC &amp; System Z and Intel-AMD/x86-64 on the server and desktop side, the world would just be too boring.</p>
<p>The Intel and IBM talks on programming were both grounded in the idea that to make people accept a new programming language/API, it has to be an evolution of what the programmers already know, which pretty much ties us down to C/C++/Java/C# with extensions and modified semantics.</p>
<p>X10 is basically Java with some nicely considered features to support local and global memories and programs that can scale to BlueGene-style massively clustered machines. Tim basically told everyone to stop inventing new languages and focus on improving existing frameworks like MPI and OpenMP in collaboration with industry. His talk was very funny; Tim is a great presenter, and he tried hard to get the audience to react. In this crowd, most people agreed, except the Erlang people, who feel that they do have a better solution to multithreading and multicore than any patched-up language in the C-Java family. I must agree with them, and I do feel that <a href="http://www.erlang.org/">Erlang</a> today is mature enough to serve that purpose.</p>
<p>The panel session at the end was very entertaining, with some people (including myself and <a href="http://armstrongonsoftware.blogspot.com/">Joe Armstrong</a>) trying to put tough questions to the keynote speakers (and Ulf Wiger of Ericsson). It was quite lively, and a rare chance to engage directly with some industry heavyweights who otherwise tend to sit on the other side of the Atlantic.</p>
<p>I think the prize for coolest tech of the day goes to <a href="http://www.quviq.com/">QuviQ</a>, a spin-off from <a href="http://www.chalmers.se/">Chalmers</a> doing automated testing tools that really work well for parallel and distributed systems. Their method of minimizing the trace of a failed test case is really interesting, and finds things that no human tester would ever find.</p>
<p>I also presented a talk on &#8220;Debugging Multicore Software using Virtual Hardware&#8221; in the breakout sessions. I guess our Tools track was the least visited of the three tracks, but the audience asked some good questions, and there were some good discussions afterwards.</p>
<p>However, to summarize the day, I am a bit disappointed that more is not being done on the hardware side to help people debug their multicore and multiprocessor parallel programs. Transactional memory is all nice and dandy and can help simplify low-level locking primitives for threaded programs. But I would like to see much more in terms of smart tracing, hardware breakpoints and triggers, massive synchronized stops, and similar features, as well as instructions and features that make parallel expressions simpler. Here, the embedded folks doing things like <a href="http://www.arm.com/products/solutions/CoreSight.html">ARM CoreSight</a> seem to have been much more successful than the server-class designers at Sun, Intel, and IBM. But even ARM do not spend more than 10-15% of the chip area on debug support.</p>
<p>I think it would be interesting to see what would happen if you could spend 25-30% of the chip on some seriously powerful debug features: full support for remote control of all cores at the same time, lots of bandwidth for debug data and commands, fat traces of all traffic on and off the chip, and performance and event counters everywhere. That would likely make the peak performance of the chip lower than that of a competing chip not spending as much space on debug support, but it would make achieving a high utilization much easier, and that might actually make the debug-intense chip more economical. It would be interesting to try. But I guess nobody would dare to buy such a design.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/17/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

