<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; multicore</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/multicore/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 05 Sep 2010 06:08:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>Simulation Determinism: Necessary or Evil?</title>
		<link>http://jakob.engbloms.se/archives/734?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/734#comments</comments>
		<pubDate>Sun, 19 Apr 2009 20:36:02 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[determinism]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[repeatability]]></category>
		<category><![CDATA[reverse execution]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[VMWare]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=734</guid>
		<description><![CDATA[In my series (well, I have one previous post about checkpointing) about misunderstood simulation technology items, the turn has come to the most difficult of all it seems: determinism. Determinism is often misunderstood as meaning &#8220;unchanging&#8221; or &#8220;constant&#8221; behavior of the simulation. People tend to assume that a deterministic simulation will not reveal errors due [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-735" style="margin-left: 10px; margin-right: 10px;" title="gears" src="http://jakob.engbloms.se/wp-content/uploads/2009/04/gears.png" alt="gears" width="56" height="57" />In my series (well, I have one previous post about <a href="http://jakob.engbloms.se/archives/714"><em>checkpointing</em></a>) about misunderstood simulation technology items, the turn has come to the most difficult of all it seems: <em>determinism.</em> Determinism is often misunderstood as meaning &#8220;unchanging&#8221; or &#8220;constant&#8221; behavior of the simulation. People tend to assume that a deterministic simulation will not reveal errors due to nondeterministic behavior or races in the modeled system, which is a complete misunderstanding. Determinism is a necessary feature of any simulation system that wants to be really helpful to its users, not an evil that hides errors.</p>
<p><span id="more-734"></span></p>
<h2>What?</h2>
<p>Determinism really means this:</p>
<ul>
<li>Given a certain initial state</li>
<li>And a certain sequence of external inputs</li>
<li>The end result and state of the simulation will always be the same</li>
</ul>
<p>The key to note is that you need to require both the starting state and the sequence of external inputs to be the same in order to get the same result. If either of these changes, you can well get a different result. Implementing a deterministic simulator requires all internal events and activities in the simulator to be performed in the same order and at the same time in each simulation run. It means that the host computer environment state cannot be allowed to affect the simulator execution, and that in turn means that all sorting of internal events has to be done in a defined order in all instances.</p>
<p>I have a story about how hard that can be in practice. I once talked to some compiler developers who had the issue that when recompiling the same program with the same set of compiler options, the results might come out different, even on the same machine. The problem was that each run of the compiler was done in a different overall system state, and this might affect how the OS memory allocation functions allocated items in memory. It turned out that in some cases, the precise values of the <em>pointers</em> to the items in a complex data structure were used by standard libraries to handle iteration over nodes in the data structures. Thus, a different memory allocation pattern gave a different iteration order and a different traversal order of nodes, and in the end an almost arbitrarily different result. The correct solution they had to implement was to use a defined lexical ordering to traverse and iterate, not anything dependent on the state of the host machine. It is no different in a simulator: define the order of <em>everything</em>, in order to be deterministic.</p>
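<p>The difference between the two kinds of ordering can be sketched in a few lines of Python. This is only an illustration, not the compiler code from the story: the Node class and both helper functions are hypothetical names, and id() stands in for a pointer value (in CPython it is derived from the address of the object).</p>
<blockquote>
<pre>class Node:
    def __init__(self, name):
        self.name = name


def address_order(nodes):
    # Order depends on where the allocator happened to place each object,
    # so it may differ from one run of the program to the next.
    return [n.name for n in sorted(nodes, key=id)]


def defined_order(nodes):
    # Order depends only on the content of the model, so it is the same
    # in every run and on every host.
    return [n.name for n in sorted(nodes, key=lambda n: n.name)]


nodes = [Node("uart"), Node("timer"), Node("cpu0")]
print(address_order(nodes))   # host-dependent order
print(defined_order(nodes))   # ['cpu0', 'timer', 'uart']</pre>
</blockquote>
<p>Both functions visit the same nodes; only the second one gives a traversal order that can be relied on across runs.</p>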
<h2>Why?</h2>
<p>The crucial benefit that determinism brings to a simulation in general and a virtual platform in particular is <em>repeatable debugging</em>. With determinism and an appropriate recording mechanism (and most practically <a href="http://jakob.engbloms.se/archives/714">checkpointing</a>) you can rely on being able to repeat a run resulting in a bug any number of times with the precise same sequence of events in the simulation. In particular, events that are visible to and relevant for the software running on the virtual platform occur in the same sequence, at the same times, and at the same points relative to the instructions executed. Especially for multicore and parallel computing systems this is incredibly powerful, and something that just cannot be achieved on physical hardware (due to its inherent randomness and chaotic behavior, see my 2006 and 2007 ESC Silicon Valley talks for more on this, at my <a href="http://www.engbloms.se/jakob_publications.html">publications</a> and <a href="http://www.engbloms.se/jakob_presentations.html">presentations</a> pages).</p>
<p>If you assume stability of the simulation infrastructure and the simulation platform, determinism also makes debugging the simulation itself easier. Often, a bug in a simulation model is repeatable, and with determinism, it is easy to repeat the same external stimulus sequence to the module and debug it repeatably.</p>
<p>Determinism also makes it easy to detect change in the behavior of a simulation: if the same simulation setup results in a different result or final simulation state, you know something in the setup (model, model parameters, or software) changed. There is no randomness that causes changes without some fundamental parameter being changed. Such boring reliable behavior is generally exactly what you want when testing and debugging large, complex systems.</p>
<p>Obviously, once determinism becomes a requirement, missing determinism in a model is a bug in itself &#8212; and finding such bugs can certainly be interesting exercises.</p>
<h2>Why Not?</h2>
<p>Just like for checkpointing, one reason not to do determinism is that it is hard, as discussed above.</p>
<p>The most common reason that people claim to want to avoid determinism is that they want to explore alternatives within their simulation. Basically, there is a need for <em>variability </em>that would seem to be at odds with determinism. The typical argument is that &#8220;if my simulation model contains a non-deterministic choice, I want the simulation to expose that and not just make the same decision every time&#8221;. This is where determinism tends to be considered <em>evil</em>. However, this argument is not correct.</p>
<p>If we take the case that at some point P in a simulation run there are two different events <em>E</em> and <em>F</em> that can fire (since they are both posted to the same point in virtual time), a deterministic simulator will always select one and the same. This is necessary to reap the system-level benefits discussed above. However, nothing prevents us from programming a change from this behavior into our system explicitly, <em>introducing controlled and repeatable variation. </em>In such a setup, we will have a random decision being made in each simulation run, but one where the outcome in any particular run can be repeated by setting the same random seed parameter.</p>
<p>This brings the best of both worlds: variation to expose issues where there is potential non-determinism or lack of synchronization in the model, and perfect repeatability of the issues this poses in terms of target software and simulation system behavior. The reason that two events are ready at the same time can, in general, be considered to be a lack of synchronization in the model, and such a randomizer of behavior will expose that at several different levels. But uncontrolled randomness is not the answer.</p>
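<p>A minimal sketch of such controlled variation, assuming a simple priority-queue event scheduler (the run function and the event tuples are made up for this example and are not taken from any particular simulator): events posted to the same virtual time are ordered by a key drawn from a seeded random generator, so the choice between <em>E</em> and <em>F</em> varies with the seed but is repeated exactly when the seed is reused.</p>
<blockquote>
<pre>import heapq
import random


def run(seed, events):
    """Fire (virtual_time, name) events and return the firing order."""
    rng = random.Random(seed)       # all variation comes from this seed
    queue = []
    for seq, (time, name) in enumerate(events):
        # Sort by time first; ties are broken by a seeded random key.
        # seq keeps the ordering total even if two keys were equal.
        heapq.heappush(queue, (time, rng.random(), seq, name))
    order = []
    while queue:
        order.append(heapq.heappop(queue)[3])
    return order


events = [(10, "E"), (10, "F"), (5, "D")]
print(run(1, events))   # same seed: same order, every time
print(run(1, events))
print(run(2, events))   # another seed may pick the other order of E and F</pre>
</blockquote>
<p>Sweeping the seed over a range of values explores both orders, and any failing run can be reproduced by noting its seed.</p>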
<p>Another common misconception is that at a higher level, determinism in a virtual platform means that target software will always run in the same way. That is not true, and misses the importance of state in the deterministic behavior equation. If the initial state when a program starts is different, a different execution will result. If software is run on top of any non-trivial operating system, there is plenty of such variation. In one of our simplest Simics demos, we show this by running an intentionally buggy race-condition-ridden program. Each time it is run, it hits a different number of race conditions. But thanks to determinism (best demoed using reverse execution), we can repeat each run perfectly.</p>
<p>Thus, determinism is not equal to constant behavior or lack of variation.</p>
<h2>The reverse argument</h2>
<p>Finally, determinism is the simplest way to implement reverse execution: if you have recording, determinism, and checkpointing, you can easily virtually reverse the execution by going back to a checkpoint and replaying the execution from that point. If you stop one instruction before the current instruction, you have in essence stepped backwards one step in time. This is how both VMWare and Simics implement reverse execution and debugging. And it could not happen without determinism.</p>
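<p>The replay idea can be sketched with a toy target. Everything here is an assumption made for illustration: the Machine class has no external inputs (so no recording is needed), a checkpoint is simply a deep copy of the object, and this is not how Simics or any other product implements it internally.</p>
<blockquote>
<pre>import copy


class Machine:
    """Toy deterministic target: a step counter and an accumulator."""

    def __init__(self):
        self.steps = 0
        self.acc = 1

    def step(self):
        # The next state depends only on the current state.
        self.acc = (self.acc * 31 + self.steps) % 1000003
        self.steps += 1


def reverse_step(machine, checkpoint):
    """Return a machine one step before `machine`, rebuilt by replay."""
    if checkpoint.steps >= machine.steps:
        raise ValueError("checkpoint must be earlier than current position")
    target = machine.steps - 1
    replay = copy.deepcopy(checkpoint)
    while replay.steps != target:
        replay.step()               # determinism makes the replay exact
    return replay


m = Machine()
for _ in range(5):
    m.step()
cp = copy.deepcopy(m)               # checkpoint taken at step 5
history = []
for _ in range(10):
    history.append(m.acc)           # kept only to check the result
    m.step()
back = reverse_step(m, cp)
print(back.steps, back.acc == history[-1])</pre>
</blockquote>
<p>The replayed machine ends up in exactly the state the original had one step earlier, which is only guaranteed because step() is deterministic.</p>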
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/734/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Threading or Not as a Hardware Modeling Paradigm</title>
		<link>http://jakob.engbloms.se/archives/485?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/485#comments</comments>
		<pubDate>Thu, 01 Jan 2009 08:31:23 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[EDA]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[Erlang]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[Reactive programming]]></category>
		<category><![CDATA[sampalib]]></category>
		<category><![CDATA[Simics]]></category>
		<category><![CDATA[SystemC]]></category>
		<category><![CDATA[Threading]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=485</guid>
		<description><![CDATA[Traditional hardware design languages like Verilog were designed to model naturally concurrent behavior, and they naturally leaned on a concept of threads to express this. This idea of independent threads was brought over into the design of SystemC, where it was manifested as cooperative multitasking using a user-level threading package. While threads might at first [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-486" style="margin: 5px 10px;" title="gears-modeling" src="http://jakob.engbloms.se/wp-content/uploads/2008/12/gears-modeling.png" alt="gears-modeling" width="62" height="65" />Traditional hardware design languages like <a href="http://en.wikipedia.org/wiki/Verilog">Verilog</a> were designed to model naturally concurrent behavior, and they naturally leaned on a concept of threads to express this. This idea of independent threads was brought over into the design of <a href="http://www.systemc.org">SystemC</a>, where it was manifested as cooperative multitasking using a user-level threading package. While threads might at first glance look &#8220;natural&#8221; as a modeling paradigm for hardware simulations, they are really not a good choice for high-performance simulation.</p>
<p>In practice, threading as a paradigm for software models of hardware circuits connected to a programmable processor brings more problems than it provides benefits in terms of &#8220;natural&#8221; modeling.</p>
<p><span id="more-485"></span></p>
<p>As I see it, the main alternative modeling paradigm is to use a classic event-driven system, where all activity is triggered by events and the associated code runs to completion. This makes execution occur in a series of simulation steps in various parts of the system, rather than as a set of (pseudo) concurrent tasks.</p>
<h2>Threaded Problems</h2>
<p>The most common complaint with threading is <strong>performance</strong>. This has become very clear in the case of using SystemC for transaction-level modeling. All advice on how to do good and fast TLM coding tells us to use SC_METHODs, which are essentially callbacks that are not active objects in their own right. Note that SystemC models found in the wild are often built on SC_THREADs despite this advice, as that is the &#8220;easiest&#8221; way to do things. Some convenience systems that are part of the OSCI TLM-2.0 library also rely on threads to convert between AT-style asynchronous and LT-style synchronous function calls (which is pretty unavoidable, but not applicable in the realm of high-performance simulation for virtual platforms).</p>
<p>Furthermore, using threading as a paradigm (even cooperative single-active-thread threads like in SystemC or classic MacOS) brings with it the <strong>problems of concurrent programming</strong>, in that you suddenly need to care about protecting data structures against conflicting accesses, worry about deadlocks, and similar concurrent programming issues. Without threads, all such issues go away.</p>
<p>Note that using threading as a modeling paradigm with truly concurrent execution of models will make the execution have all the problems of parallel programs, especially non-deterministic execution and hard-to-find bugs. At least a cooperative multitasking system tends to be deterministic in the way it goes wrong.</p>
<p>Threading as a hardware model programming style therefore makes concurrent multithreaded simulation harder rather than easier to achieve, especially if the semantics of the simulation system specify an interleaved model of execution, which is the case for SystemC. In this case, there is no way to really make SystemC parallel without adding parallelism as some extra library.</p>
<p>However, one of the biggest practical problems with threading is the problem of <strong>inspecting, changing, and checkpointing simulation state</strong>. With threads, you end up having state stored in local variables on the stacks in the system, as well as in processor registers, the program counter, and other places that are hard to get to from the outside. This is not just me saying this; I found it well said in the <a href="http://www.sampalib.org/doc/papers/A%20Sampalib%20and%20SystemC%20comparison.pdf">sampalib white paper</a>:</p>
<blockquote><p>Using threads means that part of the simulation state is in stacks, which may limit the ability to persist the state of the simulation in checkpoints.</p>
<p>Using wait() implies context switch which are costly in terms of simulation speed, and thus often discouraged in guidelines for modeling SystemC™ models</p></blockquote>
<p>To drive this point further, all libraries for general program state serialization that I have seen (for C++ and Java, for example) also rely on explicit state stored in objects, and explicitly do not support the &#8220;transient&#8221; state held in local variables and the program counter. Essentially, only heap-allocated objects are handled in serialization solutions.</p>
<h2>Event-Driven Solutions</h2>
<p>An event-driven transaction-level hardware simulation is coded in a different way from a naive threaded implementation (but not that differently from a more sophisticated threaded program).</p>
<p>Each device model has to make its state explicit as a set of variables, and preferably also declare these for access for an external tool using something like <a href="http://www.greensocs.com/en/projects/GreenControl">GreenSocs GreenControl </a>or <a href="http://www.virtutech.com/whitepapers/modeling.html">Simics Attributes</a>. It also has to expose a set of functions to be called when events happen or other devices in the simulation system send a transaction into the device model.</p>
<p>Additionally, you should encapsulate all state in a model inside the model object and not expose it for direct access from the outside. A pure object-oriented style with accessor functions for everything is required for best modularity.</p>
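<p>As a rough sketch of what explicit, declared state buys you, here is a small Python device model. The TimerDevice class, its register offsets, and the ATTRIBUTES tuple are all invented for this example; they only mimic the general idea behind declared attributes and are not the GreenControl or Simics API.</p>
<blockquote>
<pre>import json


class TimerDevice:
    """Hypothetical device model: all state lives in declared attributes."""

    ATTRIBUTES = ("count", "reload", "enabled")   # what a tool may inspect

    def __init__(self):
        self.count = 0
        self.reload = 100
        self.enabled = False

    def write_register(self, offset, value):
        # Entry point for transactions; runs to completion.
        if offset == 0:
            self.reload = value
            self.count = value
        elif offset == 4:
            self.enabled = bool(value)

    def tick(self):
        # Entry point for timer events; counts down and wraps to reload.
        if self.enabled:
            self.count = self.count - 1 if self.count else self.reload

    def save(self):
        # A checkpoint is just the declared attributes, nothing on a stack.
        return json.dumps({n: getattr(self, n) for n in self.ATTRIBUTES})

    def restore(self, blob):
        for name, value in json.loads(blob).items():
            setattr(self, name, value)


timer = TimerDevice()
timer.write_register(0, 3)
timer.write_register(4, 1)
timer.tick()
blob = timer.save()

clone = TimerDevice()
clone.restore(blob)
print(clone.count, clone.enabled)</pre>
</blockquote>
<p>Because no state hides in a suspended thread, saving, restoring, and inspecting the device needs nothing more than reading and writing the declared attributes.</p>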
<p>The advantages of this model are clear:</p>
<ul>
<li>Concurrency problems are reduced, since each function call will run to completion before any other object or function is activated. There is no need to worry about shared data variables, as they should not exist.</li>
<li>Checkpointing and inspection are facilitated, since all state is now explicit and declared.</li>
<li>Performance is typically increased, since there is no need to do context switches between threads. Locality is also increased by having functions run to completion before returning.</li>
<li>True concurrency is easier to achieve, since each model can quite easily be considered a local-state, shared-nothing, explicit message-passing component similar to Erlang threads. This makes it possible for the simulation scheduler to run multiple models concurrently on multiple host threads. For more on this topic, see my <a href="http://jakob.engbloms.se/archives/246">SiCS Multicore Days 2008 </a><a href="http://www.engbloms.se/presentations/engblom-multicore-sics-2008.pdf">presentation on how Simics was threaded</a>.</li>
</ul>
<p>The downside is that some people consider the programming more complicated, which is really a matter of appearance over substance: event-driven programming tends to be more robust and easier to follow in the long run, since threaded programming makes things a bit too implicit.</p>
<p>Here is the basic example of a thread that does some periodic work.</p>
<p>Threaded style:</p>
<blockquote>
<pre>Thread_for_D():
  loop forever:
    do work...
    wait(some time)</pre>
</blockquote>
<p>Event-driven style, where we just repost an event each time we are called:</p>
<blockquote>
<pre>Time_callback():
  do work...
  post event(some time, Time_callback)</pre>
</blockquote>
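<p>For readers who want something they can run, the event-driven variant above can be fleshed out as follows. This is a minimal sketch in Python with a made-up Scheduler class and PeriodicDevice model, not the API of SystemC, Simics, or any other framework.</p>
<blockquote>
<pre>import heapq


class Scheduler:
    """Tiny event queue ordered by virtual time, then by posting order."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = 0       # posting counter: defined order for equal times

    def post(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run_until(self, end_time):
        while self._queue and end_time >= self._queue[0][0]:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()      # runs to completion before the next event


class PeriodicDevice:
    def __init__(self, scheduler, period):
        self.scheduler = scheduler
        self.period = period
        self.work_done = 0  # explicit state instead of a loop on a stack
        scheduler.post(period, self.time_callback)

    def time_callback(self):
        self.work_done += 1                                   # do work...
        self.scheduler.post(self.period, self.time_callback)  # repost


sched = Scheduler()
dev = PeriodicDevice(sched, period=10)
sched.run_until(35)
print(sched.now, dev.work_done)   # 30 3</pre>
</blockquote>
<p>The callback fires at virtual times 10, 20 and 30; the event for time 40 stays in the queue, where it could be saved as part of a checkpoint.</p>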
<p>Another advantage of event-driven models is that such a paradigm makes it clear that you need to be able to accept any call into the model at any time. This makes for more robust code, since it is quite easy to (intentionally or by mistake) encode an expectation on the sequence of activity in a threaded model that might not be what actually happens at run-time. In particular, the state of any protocol being acted on will need to be explicitly rather than implicitly represented.</p>
<p>There is much more to be said on how to code in this style, but there are long papers out there to read on this.</p>
<h2>High-Performance Event-Driven Simulation</h2>
<p>Note that in high-performance virtual platform-style simulation, processors will usually be a special case in both threaded and event-driven styles. That is because the flow of instructions that they execute constitutes very many very small actions that cannot afford a context switch between each. Here, the advantage of the event-driven model is even clearer, given some special-casing of processors. This is another long story that I will not reiterate here, but basically, most events as discussed above will be memory accesses from a processor to read and write device registers, and each such memory access can be handled in a single simulation step. There is no need to switch context or do anything but handle a simple function call. By not having a wait() call to deal with, this mechanism can be kept simple and cheap, which is essentially what using an SC_METHOD in SystemC gives you. But in the complete absence of SC_THREADs and their ilk, many other things can be optimized even better.</p>
<h2>The End</h2>
<p>What I wanted to provide in this almost-article-length post was an idea of the problems that I see threads cause as a modeling paradigm for hardware models, and the advantages offered by a reactive event-driven style. For some reason, this is misunderstood in the modeling community at large, probably because most operating systems and simulation systems in common use today present various forms of threads as the way to model concurrent behavior. However, threads as a prominent user-level programming model are known to be bad in many ways&#8230; and modeling is no exception to this rule.</p>
<p>Note that I realize that threads are needed at some level in order to take advantage of multicore hardware, but I think they are best hidden inside a simpler framework that presents simpler, more understandable semantics to the user.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/485/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SiCS Multicore Days: The Debate Points</title>
		<link>http://jakob.engbloms.se/archives/283?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/283#comments</comments>
		<pubDate>Fri, 19 Sep 2008 20:14:24 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[heterogeneous]]></category>
		<category><![CDATA[homogeneous]]></category>
		<category><![CDATA[memory bandwidth]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[panel discussion]]></category>
		<category><![CDATA[SiCS Multicore days]]></category>
		<category><![CDATA[software tools]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=283</guid>
		<description><![CDATA[It is a week ago now, and sometimes it is good to let impressions sink in and get processed a bit before writing about an event like the SiCS Multicore Days. Overall, the event was serious fun, and I found the speakers very insightful and the panel discussion and audience questions added even more information. [...]]]></description>
			<content:encoded><![CDATA[<p>It is a week ago now, and sometimes it is good to let impressions sink in and get processed a bit before writing about an event like the SiCS Multicore Days. Overall, the event was serious fun, and I found the speakers very insightful and the panel discussion and audience questions added even more information.</p>
<p><span id="more-283"></span></p>
<p>What was quite striking this year was the greater difference of opinion between the speakers. I guess that in 2007, most of the discussion was on the level of &#8220;ouch, here comes multicore and what are we going to do about it&#8221;. This year, we got a bit deeper, and with one more year of experience and massive research work, the collective world of multicore has made some progress and gained insights. And that&#8217;s when the differences start to show up; the fact that we have differences of opinion tells us that we are starting to dig into details and turning up different answers due to different viewpoints and user experiences.</p>
<p>So where were the differences this time?</p>
<ul>
<li>Heterogeneous vs homogeneous cores (on a single chip). Kunle Olukotun clearly supported the heterogeneous style (which is what you get with Sun&#8217;s Niagara, which he designed the basis for). Erik Hagersten was more interested in the difference between thin and fat cores of the same basic ISA, and Anant Agarwal was strongly in favor of completely homogeneous systems (which is what they build at Tilera). In my biased view, I think the pure energy-efficiency argument for heterogeneous designs is always going to prevail. See some of my previous blog posts on this topic for some background:
<ul>
<li><a href="http://jakob.engbloms.se/archives/222">DNS Hardware Acceleration</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/157">Interview with Kunle Olukotun at the Register</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/44">Homogeneous vs heterogenous</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/90">Homogeneous vs heterogeneous, continued</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/80">IBM Z6 accelerators</a>.</li>
<li><a href="http://jakob.engbloms.se/archives/77">Montalvo and heterogeneous x86</a>.</li>
</ul>
</li>
<li>Domain-specific vs general-purpose programming languages. The same sides here, with Kunle advocating domain-specific languages, and Anant and David Padua more in the general-purpose camp. I like domain-specific better, it seems to rhyme more with what I see people actually doing today to increase programming productivity overall.</li>
<li>Memory bottleneck or not? The most interesting discussion came when memory bandwidth and cache sizes were discussed. One quite common school of thought over the past few years teaches that caches per core will shrink, and that bandwidth to get data into and out of a chip is going to be a severe restriction on what can be done. Not all in the panel agreed with this; there was the idea (mostly from Kunle) that in some way the massive bandwidths and low latencies achievable within a chip (compared to between chips in a classic discrete-processor multiprocessor) could make this less of a problem. Personally, I think this is going to be some kind of problem, but maybe not as big a one as feared, since passing data around faster might reduce the need to store it temporarily. Despite the need for more bandwidth, nobody really agreed with Erik&#8217;s thought that maybe it makes sense to build chips that do not max out on the number of cores they contain, but rather try to balance core count with achievable IO bandwidth. That idea has some merit.</li>
<li>Core counts. Moore&#8217;s law tells us there are going to be thousands of cores on a chip fairly soon&#8230; but if we do not manage to make good use of them, maybe the growth in core counts will slow soon. Putting four or six or eight cores into a general-purpose system makes sense today, but more than that might turn out to be a waste for the vast majority of users, who do not have problems to solve and programs to run that can make use of more than that. In the same sense, maybe it is better to have slightly fewer, more powerful cores than a maximum number of minimalistic cores, considering the state of software available today. So it sounds like a fairly divergent future here.</li>
<li>Shared memory or local memories? Most of the panel seemed to be in the camp proposing that shared memory is too convenient not to have, even when it really is bad for you. There were several bad jokes comparing shared memory to alcohol, with the moderator of the panel suggesting that a good way to avoid the hangover of shared memory is to stay drunk&#8230; whatever that means in practice.</li>
</ul>
<p>Some things were generally agreed upon, though.</p>
<ul>
<li>Programming is an issue, shared-memory or local-memory or whatever. The idea for the solution varied, however, as discussed above.</li>
<li>Cores will be plentiful, and operating systems that focus on sharing time on a single very valuable core are an idea of the past. The keyword for the future is spatial sharing and reducing the overhead of management (I have some previous blog posts on this topic, especially on the <a href="http://jakob.engbloms.se/archives/58">subject of IMA</a> and <a href="http://jakob.engbloms.se/archives/123">real-time control when cores are free</a>).</li>
<li>Virtualization and isolating partitions of a multicore chip from each other are necessary mechanisms. Running multiple different operating systems on a single chip will be quite normal, probably under the control of some global hypervisor.</li>
</ul>
<p>Any comments on this from my small audience? I think the topics under discussion are quite fascinating and the kind of issues on which the success of major chip design projects will be decided. A good architecture with a good programming model has a great chance of success (as long as it looks like a continuation of something existing <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> ).</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/283/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The JVM as Universal Parallel Glue?</title>
		<link>http://jakob.engbloms.se/archives/264?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/264#comments</comments>
		<pubDate>Fri, 12 Sep 2008 20:45:08 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Domain-specific languages]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[jvm]]></category>
		<category><![CDATA[kunle olukotun]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[SiCS Multicore days]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=264</guid>
		<description><![CDATA[The two days of the SiCS Multicore Days is now over, and it was a really fun event this year too. I will be writing a few things inspired by the event, and here is the first. Kunle Olukotun&#8216;s presentation on the work of the Stanford Pervasive Parallelism lab included a diagram where they showed [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-265" style="margin: 5px 10px;" title="javalogo" src="http://jakob.engbloms.se/wp-content/uploads/2008/09/javalogo.png" alt="" width="40" height="74" /><em>The two days of the SiCS Multicore Days are now over, and it was a really fun event this year too. I will be writing a few things inspired by the event, and here is the first. </em></p>
<p><a href="http://ogun.stanford.edu/~kunle/">Kunle Olukotun</a>&#8216;s presentation on the work of the <a href="http://ppl.stanford.edu/">Stanford Pervasive Parallelism lab </a>included a diagram where they showed a range of domain-specific languages (DSL) being compiled to a universal implementation language. That language is currently Scala, and in the end all applications end up being compiled into Scala byte codes, which are then optimized and dynamically reoptimized and executed on a particular hardware system based on the properties of that system. Fundamentally, the problem of creating and compiling a DSL, and combining program segments written in different DSLs, is solved by interposing a layer of indirection.</p>
<p>But this idea got me thinking about what the best such intermediary might be for large-scale general deployment.</p>
<p><span id="more-264"></span></p>
<p>And my conclusion is that the Java Virtual Machine might be the best candidate. Not the JVM as it is today, though. Here is my idea:</p>
<ul>
<li>The Sun Java JDK and its optimized HotSpot VM are now open source, thanks to the <a href="http://www.openjdk.org/">OpenJDK</a>. This opens the door to new innovation based on solid technology.</li>
<li>HotSpot is a pretty good VM, and therefore other languages are starting to use it as a backend. For example, Python can be compiled to the JVM, as can <a href="http://en.wikibooks.org/wiki/Ada_Programming/Platform/VM/Java">Ada</a>, and I expect many other language environments to follow suit. The reason is that developing and optimizing a VM is hard work, and if there already is a good one in existence, targeting that is easier than building your own.</li>
<li>I think that long-term, this might well replace C as the universal language that you target when you do special-purpose code generators from custom languages&#8230; which are really DSLs.</li>
<li>Thanks to this foreseen ubiquity of the OpenJDK JVM as a universal byte-code execution machine, it will provide a single point of leverage across a large range of applications in a multitude of programming languages.</li>
<li>As demonstrated by the work of the PPL and the approach taken by RapidMind, the idea of using an abstract byte code for software delivery makes a lot of sense in a heterogeneous and networked environment. It also provides a good infrastructure for analysis and optimization. It simply is very sensible.</li>
</ul>
<p>However, the JVM as it stands today is not really suitable for this. It will need some extensions, which I am not the man to invent. With an open-source common JVM, such innovation will be easier to do. Thanks to Sun for opening up Java! For example:</p>
<ul>
<li>Support for dynamic languages like Ruby and Python: less dependence on Java-style static typing and Java types. They work well for Java, but less so for other languages. It would also be nice to have real built-in lists, and not just a library container.</li>
<li>Support for threads. Not OS threads, but the typical very light-weight threads used in environments like <a href="http://www.erlang.org/">Erlang</a> and <a href="http://www.mozart-oz.org/">Oz</a>. Or even lighter, like the serial units of computation in the kernels of RapidMind and CUDA and similar GPGPU efforts.</li>
<li>Support for SIMD operations, to express data-level parallelism which is often pretty easy to find on a source-code level.</li>
<li>Support for data blocking, locality, tiling of some kind, to control data locality. Maybe this already exists in X10 (which I heard about at last year&#8217;s <a href="http://jakob.engbloms.se/archives/17">Multicore Day</a>).</li>
<li>Support for communication using messages. I assume that the best model for expressing the threads is local data, share-nothing, and message passing, with a special case for sharing large data blocks.</li>
<li>Some kind of data sharing mechanism that is more structured and understandable for a runtime system than pure locks &amp; shared data.</li>
<li>And a system that takes such an advanced byte code and makes it run well on any particular machine, be it a Tilera Tile, an 8-core P4080, a 128000-core BlueGene, a GPGPU, or a plain middle-of-the-road UltraSparc T2.</li>
</ul>
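<p>To make the share-nothing, message-passing item a bit more concrete, here is a minimal Python sketch of the style I have in mind. It is only an illustration under loud assumptions: Python OS threads and <code>queue.Queue</code> stand in for the light-weight threads and mailboxes a future VM would provide, and the names <code>summing_worker</code> and <code>run_demo</code> are made up for this example.</p>
<pre>
```python
import queue
import threading


def summing_worker(inbox, outbox):
    """Share-nothing worker: all state is local, all I/O is messages."""
    total = 0                  # local data, invisible to every other thread
    while True:
        msg = inbox.get()      # block until the next message arrives
        if msg is None:        # None is this sketch's "stop" message
            break
        total += msg
    outbox.put(total)          # results leave the worker only as a message


def run_demo(values):
    """Send each value to the worker as a message and collect its reply."""
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=summing_worker, args=(inbox, outbox))
    worker.start()
    for value in values:
        inbox.put(value)
    inbox.put(None)
    worker.join()
    return outbox.get()


if __name__ == "__main__":
    print(run_demo([1, 2, 3, 4]))
```
</pre>
<p>Since nothing is shared except the two mailboxes, a runtime would be free to place such a worker on another core, or another machine, without changing the code.</p>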
<p>So there is some work to be done. But I really think this idea has some merit&#8230; if only I had research funding and some good students. Or a crazy VC. <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/264/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>FLOSS Weekly: Drizzle: Aggressive Push to Multicore</title>
		<link>http://jakob.engbloms.se/archives/213?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/213#comments</comments>
		<pubDate>Sun, 10 Aug 2008 19:07:45 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[open-source]]></category>
		<category><![CDATA[PC software]]></category>
		<category><![CDATA[podcast commentary]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=213</guid>
		<description><![CDATA[I listened to episode 35 of FLOSS Weekly that interviewed Brian Aker, creator of the Drizzle fork from MySQL. Like most recent episodes of FLOSS Weekly, it is pretty good technical material. What I found interesting was the technical vision behind Drizzle, and how they are aggressively going for quite wide multicore hosts. Drizzle is [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.twit.tv/"><img class="alignleft size-full wp-image-214" style="border: 0pt none; margin: 5px 10px;" title="flossweekly" src="http://jakob.engbloms.se/wp-content/uploads/2008/08/flossweekly.jpg" alt="" width="70" height="70" /></a>I listened to <a href="http://www.twit.tv/floss35">episode 35 of FLOSS Weekly</a> that interviewed Brian Aker, creator of the Drizzle fork from MySQL. Like most recent episodes of FLOSS Weekly, it is pretty good technical material. What I found interesting was the technical vision behind Drizzle, and how they are aggressively going for quite wide multicore hosts.</p>
<p><span id="more-213"></span></p>
<p>Drizzle is really aiming at what they see as the long-term future: 64 cores or more. MySQL now being owned by Sun, it sounds like they caught the Niagara bug&#8230; but of course they are also looking at x86 hosts. They are doing this in part by tackling a type of problem that is mostly read access, not intense on write activity. Drizzle also removes some heavy-duty DB features like stored procedures to focus on applications that make plain simple SQL queries. Only loads of them. Examples are typical web properties like Yahoo, Google, and eBay.</p>
<p>What was also interesting was the general attitude of keeping the code modern. All old 16-bit (x86 yuck) code is gone, as well as 32-bit support (any serious DB server is a 64-bit machine anyway). Also, C99 is used throughout. Quite a refreshing take on something that started from a decade-old code base.</p>
<p>The project is hosted at <a href="https://launchpad.net/drizzle">https://launchpad.net/drizzle</a>, where there is more information, code, etc. I have not really checked that part out.</p>
<p>But if you have time, do listen to the podcast. FLOSS Weekly is one of my regular listens.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/213/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>GPU Programming: a Good Pattern to Follow?</title>
		<link>http://jakob.engbloms.se/archives/209?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/209#comments</comments>
		<pubDate>Sun, 10 Aug 2008 18:40:12 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore]]></category>
		<category><![CDATA[PC software]]></category>
		<category><![CDATA[software tools]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=209</guid>
		<description><![CDATA[In the March/April 2008 issue of ACM Queue, there is an article on GPU Programming by Kayvon Fatahalian and Mike Houston of Stanford that I found a very interesting read. It presents and analyzes the programming model of modern GPUs, in the most coherent and understandable way that I have seen so far. The PC [...]]]></description>
			<content:encoded><![CDATA[<p>In the March/April 2008 issue of ACM Queue, there is an <a href="http://mags.acm.org/queue/20080304/?pg=20">article on GPU Programming</a> by Kayvon Fatahalian and Mike Houston of Stanford that I found a very interesting read. It presents and analyzes the programming model of modern GPUs, in the most coherent and understandable way that I have seen so far. The PC GPU has a model for programming parallel hardware that might be a good pattern for other areas of processing. Programmers do not have to write explicitly parallel code; the machinery and hardware take care of ensuring parallel behavior, as long as the code follows the assumptions made in the model.</p>
<p><span id="more-209"></span></p>
<p>The fundamentals of the GPU model are the following:</p>
<ul>
<li>It presents a fixed pipeline of stages through which data is streamed and transformed into final output.</li>
<li>Some stages are fixed-function (with some parameters), some are fully programmable.</li>
<li>The programmable stages are programmed in a local state, simple input-to-output transformation style with no access to global variables or any way to affect other computations. In this respect, the model is similar to DSP programming with its DMA-in, compute, DMA-out style. If a bit more automated.</li>
<li>There is shared global state &#8212; but it is read-only, for parameters (textures, etc.)</li>
<li>Parallelism is present in two dimensions: each stage operates on lots of data in parallel, and all stages execute concurrently.</li>
<li>The really tricky transformations of the input data stream that involve dependences between data items are encapsulated inside the fixed-function stages. In essence, this lets a few experts take care of the hard part of programming, and presents a streamlined, simple model to everyone else.</li>
<li>It is possible for users to destroy performance with badly written programs, but the typical use case and hardware design rest on users doing sensible things within a fairly narrow domain.</li>
<li>Code is compiled into byte codes, which are then translated and optimized for a particular GPU by the driver in the final PC running the application. This two-stage just-in-time compilation (or dynamic recompilation, or whatever we want to call it) technique is a known good way to combine performance with portability.</li>
</ul>
<p>This model is an interesting pattern as it has been extensively proven in practice. There are large numbers of programmers writing graphics programs, and they do not seem to have much trouble getting GPUs to run massively parallel computations. If you think about it, that is a pretty major success story! It also validates that the idea of &#8220;no shared state&#8221; that keeps being brought up does simplify programming. The model above is quite similar to what you have in <a href="http://www.erlang.org">Erlang/OTP</a>, for example.</p>
<p>The lesson that can be drawn from this for other domains is likely that you need to create a framework (both in concept and in implementation) for processing that makes the code users write simple, single-threaded, and straightforward. The framework then automatically runs lots of little sequential snippets in parallel, and takes care of resource scheduling and the data flow.</p>
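<p>As a rough illustration of that lesson, here is a small Python sketch of such a framework. It is a sketch under stated assumptions, not how any GPU driver works: <code>run_pipeline</code>, <code>scale</code>, and <code>offset</code> are hypothetical names, a thread pool stands in for the parallel hardware, and only the data parallelism within each stage is shown, not the concurrent execution of the stages themselves.</p>
<pre>
```python
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType


def scale(item, params):
    """User-written stage: sees only its own item and the read-only params."""
    return item * params["gain"]


def offset(item, params):
    """Another user-written stage, equally simple and single-threaded."""
    return item + params["bias"]


def run_pipeline(stages, data, params, workers=4):
    """Hypothetical framework: runs each stage over all items in parallel."""
    shared = MappingProxyType(dict(params))   # global state, but read-only
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for stage in stages:
            # map() preserves input order, so the result is deterministic
            data = list(pool.map(lambda item: stage(item, shared), data))
    return data


if __name__ == "__main__":
    print(run_pipeline([scale, offset], [1, 2, 3], {"gain": 2, "bias": 1}))
```
</pre>
<p>The user code contains no threads, locks, or shared mutable state; all scheduling decisions live in the framework, which is exactly the division of labor the GPU model relies on.</p>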
<p>Obviously, there are domains where this does seem harder to do than in others, but I think this is the pattern for the future. Unless most parallel applications are &#8220;easy&#8221; to create, we will not make much use of parallelism. And I think that they usually can be made easy.</p>
<p>In particular, I can see this kind of framework being quite possible for things like packet processing in various network transforms like firewalls, routing, switching, and virus scanning. A smart hardware manufacturer in the networking market should provide these frameworks, just like the graphics chip providers do today.</p>
<p>Finally, here is a nice-looking link to the article, as generated by ACM&#8217;s online magazine publishing system:</p>
<table style="margin: 10px 0pt;" border="1" cellspacing="0" cellpadding="0" align="center" bordercolor="#000000">
<tbody>
<tr>
<td>
<table border="0" cellspacing="0" cellpadding="0" width="100%" background="http://mags.acm.org/queue/20080304/include/icons/nav_bg.gif">
<tbody>
<tr height="35" valign="middle">
<td align="left"><a title="View March/April 2008" href="http://mags.acm.org/queue/20080304/" target="_blank"><img style="margin-left: 5px; margin-right: 5px;" src="http://mags.acm.org/queue/20080304/include/icons/navbar_logo.gif" border="0" alt="" height="28" /></a></td>
<td id="topBar" style="text-align: right;"><span style="font-size: xx-small; font-family: Comic Sans MS,Arial,Helvetica;">Link to article at ACM </span></td>
</tr>
</tbody>
</table>
<table border="0" cellspacing="0" cellpadding="0" width="240" align="center">
<tbody>
<tr id="snippetThumbs" align="center">
<td colspan="2" align="center"><a title="View Magazine" onclick="name='w'+Math.round(Math.random()*(1000));w=screen.width-10;h=screen.height-40;window.open('http://mags.acm.org/queue/20080304/?pg=20',name,'toolbar=no,menubar=no,resizable=yes,scrollbars=yes,left=0,top=0,width='+w+',height='+h);return false;" href="http://mags.acm.org/queue/20080304/?pg=20" target="_blank"><img src="http://mags.acm.org/tcprojects/acm/queue/inbox/49239/imgpages/tn/queue20080304_0020.gif" border="0" alt="" /></a></td>
</tr>
</tbody>
</table>
<table border="0" cellspacing="0" cellpadding="0" width="100%" background="http://mags.acm.org/queue/20080304/include/icons/nav_bg.gif">
<tbody>
<tr height="28" valign="middle">
<td id="bottomBar" style="text-align: center;"><span style="font-size: xx-small; font-family: Comic Sans MS,Arial,Helvetica;">March/April 2008 issue of Queue</span></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/209/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>EETimes Article on Multicore Debug</title>
		<link>http://jakob.engbloms.se/archives/154?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/154#comments</comments>
		<pubDate>Sun, 20 Jul 2008 08:35:05 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[embedded software]]></category>
		<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[EETimes]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[multicore]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=154</guid>
		<description><![CDATA[I have another short technical piece published about Multicore Debug at the EETimes (and their network of related publications, like Embedded.com). Pretty short piece, and they cut out some bits to make it fit their format. Nothing new to fans of virtual platforms for software development, basically we can use virtual platforms to reintroduce control [...]]]></description>
			<content:encoded><![CDATA[<p><img class="size-medium wp-image-155 alignleft" style="margin: 10px;" title="eetimes logo" src="http://jakob.engbloms.se/wp-content/uploads/2008/07/eetimes.png" alt="" width="127" height="56" />I have another short technical piece published about <a href="http://www.eetimes.com/news/design/showArticle.jhtml?articleID=209100262">Multicore Debug at the EETimes</a> (and their network of related publications, like <a href="http://www.embedded.com/design/209101250">Embedded.com</a>). Pretty short piece, and they cut out some bits to make it fit their format. Nothing new to fans of virtual platforms for software development: basically, we can use virtual platforms to reintroduce control over parallel and, for all practical purposes, chaotic hardware/software systems.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/154/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Simon Kågström, PhD</title>
		<link>http://jakob.engbloms.se/archives/119?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/119#comments</comments>
		<pubDate>Sat, 10 May 2008 07:08:19 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[blog commentary]]></category>
		<category><![CDATA[books]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[software tools]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=119</guid>
		<description><![CDATA[Yesterday, I had the honor of being the opponent at the PhD defense of Simon Kågström at Blekinge Tekniska Högskola (BTH, Blekinge University of Technology in English). His PhD thesis deals mainly with the multiprocessor port of an industrial in-house operating system, and a secondary theme was the design of the Cibyl C-programs-to-JVM translator. All [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-thumbnail wp-image-120" style="float: left;" title="bthsmall" src="http://jakob.engbloms.se/wp-content/uploads/2008/05/bthsmall-150x148.png" alt="BTH logo" width="150" height="148" />Yesterday, I had the honor of being the opponent at the PhD defense of <a href="http://www.ipd.bth.se/ska/">Simon Kågström</a> at <a href="http://www.bth.se">Blekinge Tekniska Högskola</a> (BTH, Blekinge University of Technology in English). His <a href="http://www.ipd.bth.se/ska/phd.html">PhD thesis</a> deals mainly with the multiprocessor port of an industrial in-house operating system, and a secondary theme was the design of the <a href="http://code.google.com/p/cibyl/">Cibyl </a>C-programs-to-JVM translator. All of his papers are very well-written and a joy to read, and the engineering work behind it is very solid.</p>
<p>The most important data in the PhD thesis is really just how much work it is to do an SMP port of an OS kernel. And how hard it is to get performance up to good levels even with several years of work. Really emphasizes the point that hard work and perseverance and just lots of calendar time is what it takes to create a good SMP OS. That&#8217;s why Solaris and AIX are still years ahead of Linux in this respect &#8212; you just need to hit the snags, fix them, retest, and hit the next snag. It takes time to polish, basically.</p>
<p>So, if you have any interest in multiprocessor operating systems, Simon&#8217;s work is well worth a read. Also check out his blog at <a href="http://simonkagstrom.livejournal.com/">http://simonkagstrom.livejournal.com/</a>. And by the way, he did pass.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/119/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Grant Martin on Manycore Multicore MPSoC AMP SMP Multi-X&#8230;</title>
		<link>http://jakob.engbloms.se/archives/114?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/114#comments</comments>
		<pubDate>Sat, 03 May 2008 19:23:45 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[blog commentary]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[multicore]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=114</guid>
		<description><![CDATA[Grant Martin is a nice fellow from Tensilica who has a blog at ChipDesignMag. In a recent post, he raises the question of nomenclature and taxonomy for multicore processor designs: &#8230;the discussion, and the need to constantly define our terms (and redefine them, and discuss them when people disagree) makes me wish that the world [...]]]></description>
			<content:encoded><![CDATA[<p>Grant Martin is a nice fellow from Tensilica who has a <a href="http://www.chipdesignmag.com/martins/">blog at ChipDesignMag</a>. In a <a href="http://www.chipdesignmag.com/martins/?p=5">recent post</a>, he raises the question of nomenclature and taxonomy for multicore processor designs:</p>
<blockquote><p>&#8230;the discussion, and the need to constantly define our terms (and redefine them, and discuss them when people disagree) makes me wish that the world of electronics, system and software design had some agreement on what the right terms are and what they mean&#8230;</p></blockquote>
<p>I think this is a good idea, but we need to keep the core count out of it&#8230;</p>
<p><span id="more-114"></span></p>
<p>The reason for the confusion of terms and the strong will to create new terms all the time is really that people feel that there is a real difference between a dual-core x86 processor used in a laptop and a highly integrated 100-core-or-more embedded design for traffic processing in a large switch. And for that reason, they want to define a term to define themselves out of the mainstream desktop/server space with a few large cores.</p>
<p>But the number of cores is probably the least useful parameter to use as a differentiator. If 4 cores is multicore and 32 cores manycore today, in a few years&#8217; time the decrease in feature size will have moved 32 cores into multi and 128 cores into many&#8230; etc. So that is really something that is bound to change over time.</p>
<p>I think that rather we need to look at other aspects of a chip design, in particular those that are not just straight multiplication of features. Those aspects that really matter to the kinds of programs the chip takes nicely to, and that architects have to think hard about.</p>
<p>Programming models are not the right answer to this. As Grant says, programming models need to be put in a taxonomy of their own:</p>
<blockquote><p>A kind of taxonomy of multicore related terms, together with a taxonomy of programming models (SMP, AMP, etc.) that everyone could be referred to when these discussions are held and that everyone could begin to build a consensus around would be of great value to all.</p></blockquote>
<p>If nothing else, we all know that any programming model can be put onto pretty much any piece of silicon, given a sufficiently thick layer of middleware. It might not be the most efficient way to program any particular hardware in terms of hardware resources used, but someone is going to do it anyway.</p>
<p>So what is left in the chip taxonomy?</p>
<p>I think we need to look at things like where memories are located (global, local to each core, shared by a small group), number of levels of memories, whether they are caches or program-controlled. How interrupts and IO are routed is another interesting aspect. Can any core do anything, or do we have master nodes that can do more things? Are all cores equal in terms of performance and computational ability, or do they differ?</p>
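<p>One way to picture such a taxonomy is as a record of structural properties with no core count in it. The Python sketch below is only that, a picture: the class and field names are my own invention for illustration, not an established classification, and the example values describe no real chip.</p>
<pre>
```python
from dataclasses import dataclass
from enum import Enum


class MemoryPlacement(Enum):
    GLOBAL = "global"
    PER_CORE = "local to each core"
    PER_GROUP = "shared by a small group"


@dataclass(frozen=True)
class ChipProfile:
    """Structural description of a chip; deliberately has no core count."""
    memory_placement: MemoryPlacement
    memory_levels: int
    program_controlled_memory: bool   # False means hardware-managed caches
    any_core_handles_io: bool         # False means dedicated master nodes
    homogeneous_cores: bool


def same_class(a, b):
    """Two chips are in the same class if every structural property matches."""
    return a == b


if __name__ == "__main__":
    # Hypothetical example values, not measurements of any real chip.
    small = ChipProfile(MemoryPlacement.GLOBAL, 3, False, True, True)
    large = ChipProfile(MemoryPlacement.GLOBAL, 3, False, True, True)
    print(same_class(small, large))
```
</pre>
<p>Under this view, scaling a design from 4 to 64 identical cores does not move it to a new class, while switching from caches to program-controlled local memories does.</p>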
<p>As Grant says, a great subject for academia to dig into.</p>
<p>The comments at the end of the post about some secret activities from the Multicore Association by Markus Levy make me agree with Grant: please get the ideas and drafts out into the open, and make sure to get the widest input possible!</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/114/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Sun buys Montalvo</title>
		<link>http://jakob.engbloms.se/archives/113?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/113#comments</comments>
		<pubDate>Mon, 28 Apr 2008 10:14:57 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[blog commentary]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=113</guid>
		<description><![CDATA[Sun just bought Montalvo whose hardware I blogged about some while ago. And just like the Apple acquisition of PA Semi, the question of &#8220;why&#8221; appears. Some analysts blame the simple fact that both Montalvo and PA Semi simply needed to be acquired, since their venture capitalists did not want to put in the next [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-thumbnail wp-image-78" style="float: left; margin: 10px;" src="http://jakob.engbloms.se/wp-content/uploads/2008/02/montalvo-fg.gif" alt="" width="150" height="94" /></p>
<p>Sun just bought Montalvo whose hardware I blogged about some while ago. And just like the Apple acquisition of PA Semi, the question of &#8220;why&#8221; appears. Some analysts blame the simple fact that both Montalvo and PA Semi simply needed to be acquired, since <a href="http://venturebeat.com/2008/03/20/montalvo-seeking-a-hoard-of-cash/">their venture capitalists did not want to put in the next 100 million USD needed to go to silicon (Montalvo)</a> or really expand on the opportunity already at hand (PA Semi). Here is my crazy guess.</p>
<p><span id="more-113"></span></p>
<p>Look at the following:</p>
<ul>
<li>Sun has seen great success with the UltraSparc T line of processors, which are basically &#8220;lots of simple cores on a single chip for thread-parallel applications&#8221;.</li>
<li>Sun is investing in Solaris for x86 and has great success with its x86-based servers (based on AMD processors).</li>
<li>Montalvo is building something quite similar to &#8220;lots of simple cores on a single chip&#8221; for x86. Which should run Solaris-x86 and most other x86 operating systems.</li>
<li>Sun has been buying companies and key components for a while now (AMD processors, Fujitsu processors, the company Afara that created the UltraSparc T line).</li>
</ul>
<p>So my guess is&#8230; based purely on technological similarities and no indirect approaches and conspiracy theories. It assumes that Sun does want to make use of Montalvo&#8217;s tech as it currently stands:</p>
<ul>
<li>Sun buys Montalvo to build x86-based UltraSparc T-style machines for throughput computing. Nice complement to the current high-single-thread-performance AMD-based x86 machines.</li>
</ul>
<p>Note 1: The indirect approach theory here is that Sun wants to use <a href="http://www.silobreaker.com/DocumentReader.aspx?Item=5_842234139">Montalvo to put cost pressure on AMD,</a> just like there is <a href="http://valleywag.com/382944/steve-jobs-buys-pa-semi-for-a-chip-++-a-bargaining-chip">speculation that Apple is going to use PA Semi to put cost pressure on Intel.</a></p>
<p>Note 2: This would also put Sun into direct chip competition with Intel Atom-based designs&#8230; which might be slightly less clever. Never mind, do it anyway <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Note 3: I have no inside information on any of this; it is based entirely on speculation from public tech facts.</p>
<p>Note 4: Similar ideas are bandied in a <a href="http://venturebeat.com/2008/04/03/sun-microsystems-could-use-montalvo-as-a-strategic-lever-against-intel/">rumor comment at VentureBeat from early April 2008:</a></p>
<blockquote><p>But Sun could also choose to avoid a fight with Intel, using the patents to protect itself and to employ the techniques for power savings in its own future SPARC microprocessor offerings. Sun’s most ambitious processors already employ many equal-sized cores on a single chip; the asymmetric architecture of Montalvo’s chips might add interesting capabilities to Sun’s SPARC line-up. In any case, Sun could be picking up the assets at a fire sale price and using them for strategic leverage.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/113/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multicore Expo US 2008</title>
		<link>http://jakob.engbloms.se/archives/89?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/89#comments</comments>
		<pubDate>Mon, 24 Mar 2008 18:43:30 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[trade shows]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/89</guid>
		<description><![CDATA[The Multicore Expo US 2008 is taking place next week (April 1-3) in Santa Clara, CA. I was originally slated to talk there, but since I am going to the Embedded Systems Conference a few weeks later it was too much travel in too short a time frame to do. I am happy that Ross Dickson, [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.multicore-expo.com/common/agenda.php?expo_seq=6">Multicore Expo US 2008</a> is taking place next week (April 1-3) in Santa Clara, CA. I was originally slated to talk there, but since I am going to the <a href="http://jakob.engbloms.se/archives/75">Embedded Systems Conference</a> a few weeks later it was too much travel in too short a time frame to do. I am happy that Ross Dickson, a senior technology specialist at Virtutech, could take my place. He will do just as good a job as I would, and he also has his own session to present at the Expo.</p>
<p><a href="http://www.multicore-expo.com/common/session.php?pres_seq=367">Our talk will be on how approximate you can be in simulating multicore computers</a>, and still get useful results out from the software running on the simulator. It is something that we at Virtutech have spent a lot of time working on, and we want to bring our results to a wider community. Really exciting to present, and it is a pity that I could not be there myself.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/89/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DATE 2008 Panel on Multicore Programming</title>
		<link>http://jakob.engbloms.se/archives/87?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/87#comments</comments>
		<pubDate>Sun, 16 Mar 2008 20:56:48 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[appearances]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[simulation]]></category>
		<category><![CDATA[software tools]]></category>
		<category><![CDATA[trade shows]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/87</guid>
		<description><![CDATA[I attended a DATE 2008 open exhibition panel discussion on multicore programming, organized by Gary Smith EDA. The panel was a few people short, and ended up with just Simon Davidmann of Imperas, Grant Martin of Tensilica, and Rudy Lauwereins of IMEC. A user representative from Ericsson was supposed to have been there but he [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://jakob.engbloms.se/wp-content/uploads/2008/03/date08log.thumbnail.GIF" alt="date2008" hspace="10" vspace="10" align="left" />I attended a <a href="http://www.date-conference.com/conference/2008/prog/progdetail_exhibition.php?dateID=11">DATE 2008</a> open exhibition panel discussion on multicore programming, organized by <a href="http://www.garysmitheda.com/">Gary Smith EDA</a>. The panel was a few people short, and ended up with just Simon Davidmann of Imperas, Grant Martin of Tensilica, and Rudy Lauwereins of IMEC. A user representative from Ericsson was supposed to have been there but he never arrived. Overall, the panel was geared towards data-plane processing-type thinking, and a bit short on internal dissonance.</p>
<p><span id="more-87"></span> In any case, the panel said the following, including feedback from the audience (including the author of this post):</p>
<ul>
<li>Gary Smith laid out some data indicating that software is overtaking hardware as the main effort even for the core SoC designers for multicore. Today, some 50% of the design cost is software; by 2012, it will be 75%.</li>
<li>Rudy lamented the difficulty of doing anything with regular unrestrained C code, and described some IMEC research where they restrict C to make it palatable. Probably related to their work on &#8220;2D&#8221; VLIW architectures. Their &#8220;clean C&#8221; can easily be analyzed to create statically scheduled code (my interpretation) that runs well without caches and cache coherency. The audience asked how you could actually write a program without using pointers&#8230; good point. I think it can work nicely for data crunching, but fails horribly for control-oriented and dynamic code.</li>
<li>Grant made the point that <em>there is no need to panic</em>. Today, we can see people actually building successful multicore systems with today&#8217;s tools. It is a bit of muddling through, but it does get through. There is a lot of truth to that, but it misses the issue of converting or containing all legacy code that has a hard time moving from a single to multiple processors.</li>
<li>Grant also said that the applications he had seen the most success with were those typically called &#8220;embarrassingly parallel&#8221;. But why should anyone be embarrassed that they have such nice problems? <em>There is nothing to be embarrassed about, rather you should be proud of having such a nice system/algorithm.</em></li>
<li>Simon Davidmann echoed my favorite theme that simulation is a key tool to develop software for multicore, as it gives you insight and control.</li>
<li>When asked about hardware debug support, Simon was downright negative and wanted it all in simulation. Grant said that the proper solution was a mix of hardware debug and software simulators, which I agree with (see http://jakob.engbloms.se/archives/17 for some more thinking on this topic).</li>
<li>Someone pointed out that hardware debug is sometimes taken out of volume chips and is only used in development versions &#8212; apparently, that is common in automotive, as the cost of each shipping chip is of utmost importance. Less practical for larger machines, though, where you cannot easily build a development version of your rack/router/switch/server&#8230;</li>
<li>When asked about how to handle billions of lines of legacy code in a mix of C, C++, Java, and other languages, Rudy sounded downright exasperated. He seemed to be most comfortable with expressing algorithms and mapping them down to hardware using tools, rather than trying to deal with managing a zoo of legacy code&#8230; that is what I mean by the data-plane/data-crunching mindset of this panel.</li>
</ul>
<p>Overall, an interesting panel, but a bit disappointing in the lack of large-scale software thinking. Would have been nice to mix in someone with an HPC, server, or large-scale control-plane embedded-systems background in the panel.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/87/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multicore Denial-of-Service Attack</title>
		<link>http://jakob.engbloms.se/archives/83?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/83#comments</comments>
		<pubDate>Tue, 04 Mar 2008 11:16:08 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/83</guid>
		<description><![CDATA[In a paper from USENIX 2007, Microsoft researchers Onur Mutlu and Thomas Moscibroda present a working &#8220;denial of service&#8221; attack for multicore processors. The idea is simple: since there is no fairness or security designed into current DRAM controllers, it is quite feasible for one program in a multicore system to hog almost all [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://research.microsoft.com/~onur/pub/mph_usenix_security07.pdf">paper from USENIX 2007</a>, Microsoft researchers Onur Mutlu and Thomas Moscibroda present a working <span style="font-weight: bold">&#8220;denial of service&#8221; attack for multicore processors</span>.  The idea is simple: since there is no fairness or security designed into current DRAM controllers, it is quite feasible for one program in a multicore system to hog almost all memory bandwidth and thus reduce or deny service to the others. There is no direct attack on software programs, just theft of the resources that they all need to share in order to work.<br />
<span id="more-83"></span>The attack is based on the following:</p>
<ul>
<li>Several cores share the memory controller(s) &#8212; quite likely, since there is not room for one controller per core in machines with more than two cores. The limitation arises both because memory controllers are large and complex beasts, and because the pins needed for each memory interface make it hard to have more than a few on a single chip. Meanwhile, the real estate for processors easily lets us put 4, 8, or 16 cores on a chip today.</li>
<li>Modern DRAM controllers are not strict FIFO queues, but attempt to optimize memory bandwidth by prioritizing accesses that are directed to the currently open rows in the banks of the available DRAMs.</li>
<li>The scheduling strategy used today (as they claim) can be easily monopolized by a thread with a high rate of memory accesses and good sequential locality.  There is no attempt to provide fairness between cores or programs.</li>
</ul>
<p>A simple stream benchmark doing a sequential read through a large array is a simple example of what they term an MPH &#8212; Memory Performance Hog.  In experiments on real hardware and in simulation they show how it can kill the performance of simultaneously executing programs with somewhat more random access patterns.</p>
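<p>To make the mechanism concrete, here is a toy model of the two scheduling policies. It is only a sketch under invented assumptions, not the simulator from the paper: a single DRAM bank, made-up costs for row hits and row misses, two threads that each keep a fixed number of requests in flight, and hypothetical names throughout (<code>simulate</code>, <code>pick_row_hit_first</code>, and so on). The &#8220;stream&#8221; thread walks through rows in order, while the &#8220;scattered&#8221; thread touches a new row on every access.</p>
<pre><code>from collections import Counter, namedtuple

Request = namedtuple("Request", ["thread", "row"])

ROW_HIT_COST = 1       # assumed time for an access to the open row
ROW_MISS_COST = 3      # assumed time when a different row must be opened
ACCESSES_PER_ROW = 16  # assumed: accesses the streaming thread makes per row
OUTSTANDING = 2        # assumed: requests each thread keeps in flight


def next_row(thread, sequence):
    """Row touched by request number `sequence` of the given thread."""
    if thread == "stream":
        return sequence // ACCESSES_PER_ROW  # walks through rows in order
    return 1000 + sequence                   # "scattered": a new row every time


def pick_fifo(queue, open_row):
    """Strict FIFO: always serve the oldest request."""
    return 0


def pick_row_hit_first(queue, open_row):
    """Serve the oldest request that hits the open row, else the oldest request."""
    for index, request in enumerate(queue):
        if request.row == open_row:
            return index
    return 0


def simulate(pick, services=1000):
    """Serve `services` requests; return (requests served per thread, elapsed time)."""
    issued = Counter()
    queue = []

    def issue(thread):
        queue.append(Request(thread, next_row(thread, issued[thread])))
        issued[thread] += 1

    for _ in range(OUTSTANDING):
        issue("stream")
        issue("scattered")

    served = Counter()
    open_row = None
    elapsed = 0
    for _ in range(services):
        request = queue.pop(pick(queue, open_row))
        elapsed += ROW_HIT_COST if request.row == open_row else ROW_MISS_COST
        open_row = request.row
        served[request.thread] += 1
        issue(request.thread)  # the thread immediately asks for its next item
    return dict(served), elapsed


for name, pick in [("FIFO", pick_fifo), ("row-hit-first", pick_row_hit_first)]:
    print(name, simulate(pick))
</code></pre>
<p>With these assumed numbers, strict FIFO splits 1000 served requests evenly between the two threads but needs 3000 time units, since every access is a row miss. The row-hit-first policy finishes in 1332 time units, yet the scattered thread gets only 110 of the 1000 requests. Better total throughput, bought at the expense of the thread with poor locality, and the imbalance grows with <code>ACCESSES_PER_ROW</code>.</p>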
<p>So what to make of this?</p>
<p>First of all, this is a real attack, in the sense that this sort of thing can and does happen on current hardware with current software out in the field.  How dangerous it is in practice is hard to tell, but it could be an issue for various cases where users are sharing a computer. A bit like the old &#8220;<a href="http://en.wikipedia.org/wiki/Fork_bomb">fork bomb</a>&#8221; on Unix systems. I remember being thrown out of shared Solaris machines a few times due to these (several times unintended by beginning Unix programmers making honest mistakes).</p>
<p>It is more interesting in the context of embedded systems and integrated modular avionics (IMA). As I stated in a few earlier blog posts (<a href="http://jakob.engbloms.se/archives/63">63</a> and <a href="http://jakob.engbloms.se/archives/58">58</a>), I think that the best way to host multiple different applications on a multicore processor is to partition applications spatially across cores.  This should be more efficient, simpler, and safer than sharing all the cores across partitions using time sharing.<br />
However, this attack does reflect critically on that idea: if it is this simple to hog the memory and thus kill performance of other cores and applications, it might not be particularly safe to have each core run an independent set of applications of different criticalities.  It does mean that in order to ensure performance isolation between applications, you will need additional hardware support of one form or the other.  Could be a better DRAM scheduler (as the paper proposes), or a static allocation of a DRAM controller to each core (which is likely infeasible due to pin constraints), or DRAM controllers that do a slightly inefficient but safe allocation of a portion of their bandwidth to each core.</p>
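<p>The last of these options, a fixed share of controller bandwidth per core, can be pictured as rotating time slots. The following is a minimal sketch of that idea only; the slot scheme and the function name <code>tdm_schedule</code> are assumptions for illustration and are not taken from the paper.</p>
<pre><code>from collections import deque


def tdm_schedule(queues, slots):
    """Serve per-core request queues using fixed, rotating time slots."""
    cores = sorted(queues)
    pending = {core: deque(queues[core]) for core in cores}
    timeline = []
    for slot in range(slots):
        owner = cores[slot % len(cores)]  # each slot belongs to exactly one core
        if pending[owner]:
            timeline.append((owner, pending[owner].popleft()))
        else:
            timeline.append((owner, None))  # slot wasted: inefficient but isolated
    return timeline


busy_and_idle = {"core0": ["a0", "a1", "a2", "a3"], "core1": ["b0"]}
for entry in tdm_schedule(busy_and_idle, 6):
    print(entry)
</code></pre>
<p>In this sketch, the single request from <code>core1</code> is served in the second slot no matter how many requests <code>core0</code> has queued, and the later <code>core1</code> slots go unused even though <code>core0</code> still has work waiting. That is exactly the trade of efficiency for isolation.</p>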
<p>In the meantime, maybe the ugly temporal sharing of the entire chip is the &#8220;best&#8221; way ahead, as it at least is proof against this kind of attack based on parallel execution of partitions.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/83/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Something to make use of all those cores: Raytracing</title>
		<link>http://jakob.engbloms.se/archives/79?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/79#comments</comments>
		<pubDate>Fri, 22 Feb 2008 12:47:40 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[games]]></category>
		<category><![CDATA[multicore]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/79</guid>
		<description><![CDATA[In the INQUIRER article &#8220;Intel pushes Raytracing again&#8221;, they have an example of an application with almost insatiable appetite for processor cycles and processor cores. Real-time raytracing. With 16 full-size x86 cores, they can match the framerate of a regular mid-range GPU &#8212; but with picture quality of raytracing rather than the approximations of rasterizers. [...]]]></description>
			<content:encoded><![CDATA[<p>In the INQUIRER article &#8220;<a href="http://www.theinquirer.net/gb/inquirer/news/2008/02/21/intel-pushes-raytracing-again">Intel pushes Raytracing again</a>&#8221;, they have an example of an application with almost insatiable appetite for processor cycles and processor cores. Real-time raytracing. With 16 full-size x86 cores, they can match the framerate of a regular mid-range GPU &#8212; but with picture quality of raytracing rather than the approximations of rasterizers. So, better quality, using something like 5 to 10 times as many transistors as the GPU would. This application can certainly use almost any amount of hardware, good for Intel <img src='http://jakob.engbloms.se/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/79/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Montalvo: Heterogeneous x86 Multicore</title>
		<link>http://jakob.engbloms.se/archives/77?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/77#comments</comments>
		<pubDate>Tue, 19 Feb 2008 15:35:05 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/77</guid>
		<description><![CDATA[CNET (of all places) have a short article on what Montalvo Systems are up to: Secret recipe inside Intel&#8217;s latest competitor &#124; CNET News.com. The article is a bit short on details, but it sounds like it is finally an example of a same-ISA, different-powered-cores heterogeneous multicore device in the mainstream. The idea has a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2008/02/montalvo-fg.gif" title="montalvo-fg.gif"><img src="http://jakob.engbloms.se/wp-content/uploads/2008/02/montalvo-fg.thumbnail.gif" alt="montalvo-fg.gif" align="left" vspace="10" /></a>CNET (of all places) have a short article on what <a href="http://www.montalvosystems.com">Montalvo Systems</a> are up to: <a href="http://www.news.com/The-secret-recipe-inside-Intels-latest-competitor/2100-1006_3-6230748.html">Secret recipe inside Intel&#8217;s latest competitor | CNET News.com</a>. The article is a bit short on details, but it sounds like it is finally an example of a same-ISA, different-powered-cores heterogeneous multicore device in the mainstream. The idea has a lot of merit, and it will be very interesting to see the final results once silicon ships. I really believe in heterogeneous designs.</p>
<p>To be critical, trying to compete with Intel might not be the best idea around&#8230; but it never hurts to try. Also, the name is not unique: there is already a montalvo.com that is not montalvosystems.com. I think the old name &#8220;Memorylogix&#8221; was more interesting and less prone to website name collisions (yes, it seems to be the same company that briefly surfaced with some stripped-down x86 processor back in 2002 &#8212; I have an <a href="http://mpronline.com">MPR</a> article to prove it).</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/77/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multithreading Game AI</title>
		<link>http://jakob.engbloms.se/archives/64?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/64#comments</comments>
		<pubDate>Tue, 01 Jan 2008 13:01:23 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[blog commentary]]></category>
		<category><![CDATA[games]]></category>
		<category><![CDATA[multicore]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/64</guid>
		<description><![CDATA[Over at an online publication called AI Game Dev, there is an elucidating post on how to do multithreading of game AI code (posted in June 2007). Basically, the conclusion is that most of the CPU time in an AI system is spent doing collision detection, path finding, and animation. This focus of time in [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://files.aigamedev.com/MASCOT.jpg" align="left" height="120" hspace="10" width="96" />Over at an online publication called AI Game Dev, there is an <a href="http://aigamedev.com/questions/multi-threading-strategies">elucidating post on how to do multithreading of game AI code</a> (posted in June 2007). Basically, the conclusion is that most of the CPU time in an AI system is spent doing collision detection, path finding, and animation. This focus of time in a few domain-given hot spots turns the problem of parallelizing the AI into one of parallelizing some core supporting algorithms, rather than trying to parallelize the actual decision making itself. The key to achieving this is to make the decision-making part able to work asynchronously with the other algorithms, which is not trivial but still much easier than threading the decision making itself. The threading of the most time-consuming parts turns into classic algorithm parallelization, which is more familiar and easier to do than threading general-purpose large code bases.  A good read, basically, that taught me some more about parallelization in the games world.</p>
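<p>The asynchronous structure can be sketched in a few lines. This is a minimal illustration under my own assumptions, not code from the AI Game Dev post: <code>find_path</code> is a hypothetical stand-in for an expensive path finder, <code>Agent.think</code> is a hypothetical per-frame decision step, and Python threads are used only to show the shape of the design (in CPython they will not speed up CPU-bound work). The point is that the decision code hands off a request and carries on, picking up the result on a later frame instead of blocking.</p>
<pre><code>import time
from concurrent.futures import ThreadPoolExecutor


def find_path(start, goal):
    """Hypothetical stand-in for an expensive path finder: walk straight to the goal."""
    x, y = start
    path = [(x, y)]
    while (x, y) != goal:
        x += max(-1, min(1, goal[0] - x))  # one step toward the goal in x
        y += max(-1, min(1, goal[1] - y))  # one step toward the goal in y
        path.append((x, y))
    return path


class Agent:
    def __init__(self, name, position, goal):
        self.name = name
        self.position = position
        self.goal = goal
        self.pending = None  # Future for a path request still in flight
        self.path = []

    def think(self, pool):
        """One decision step per frame; never blocks waiting for the path finder."""
        if self.path:
            self.position = self.path.pop(0)
            return "moving"
        if self.position == self.goal:
            return "arrived"
        if self.pending is None:
            self.pending = pool.submit(find_path, self.position, self.goal)
            return "requested path"
        if self.pending.done():
            self.path = self.pending.result()[1:]  # drop the current position
            self.pending = None
            return "got path"
        return "waiting"


def run(agents, frame_time=0.001):
    """Tick all agents until everyone has arrived; return the final positions."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        while True:
            states = [agent.think(pool) for agent in agents]
            if all(state == "arrived" for state in states):
                return [agent.position for agent in agents]
            time.sleep(frame_time)  # stand-in for the rest of the frame's work


print(run([Agent("a", (0, 0), (3, 1)), Agent("b", (5, 5), (2, 2))]))
</code></pre>
<p>The worker pool is where the classic algorithm parallelization would live. The decision logic itself stays single-threaded and only has to cope with answers arriving a few frames late.</p>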
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/64/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wayne Wolf on &#8220;The Good News and the Bad News&#8221; of Embedded Multiprocessing</title>
		<link>http://jakob.engbloms.se/archives/63?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/63#comments</comments>
		<pubDate>Thu, 27 Dec 2007 09:42:21 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[software tools]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/63</guid>
		<description><![CDATA[In a column called The Good News and the Bad News in IEEE Computer magazine (November 2007 issue), Prof. Wayne Wolf at Georgia Tech (and a regular columnist on embedded systems for Computer magazine) talks about the impact of multiprocessing systems (multicore, multichip) on embedded systems. In general, his tone is much more optimistic and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.computer.org"><img src="http://www.computer.org/portal/cms_docs_cs/csdl/images/dllogo_mco.gif" align="left" height="74" vspace="10" width="375" /></a>In a column called <a href="http://www.computer.org/portal/site/computer/menuitem.eb7d70008ce52e4b0ef1bd108bcd45f3/index.jsp?&amp;pName=computer_level1&amp;path=computer/homepage/Nov07&amp;file=embedded.xml&amp;xsl=article.xsl&amp;;jsessionid=Hzvv6WQZ0VTrPdDFQQW8GDJf4KPQQ8Ls8lpZQS4Pml4SnTGPb9gf!306574509">The Good News and the Bad News </a>in IEEE Computer magazine (November 2007 issue), <a href="http://www.ece.gatech.edu/faculty-staff/fac_profiles/bio.php?id=151">Prof. Wayne Wolf at Georgia Tech</a> (and a regular columnist on embedded systems for Computer magazine) talks about the impact of multiprocessing systems (multicore, multichip) on embedded systems. In general, his tone is much more optimistic and upbeat than most pundits.</p>
<p><span id="more-63"></span> Basically, he echoes my sentiment that multiprocessing systems often offer clear performance and efficiency advantages, especially heterogeneous systems. Programming such systems is a bit more complicated, admittedly, but the resulting system advantages are clear. Quote:</p>
<blockquote><p>A surprising number of embedded computing systems use multiple processors. There are several good reasons for doing this. Segregating different real-time tasks onto different CPUs makes it easier to determine whether they&#8217;ll meet their deadlines. Thanks to the yield characteristics of VLSI, using several smaller CPUs also can be cheaper than using one big CPU.</p></blockquote>
<p>And later:</p>
<blockquote><p>Like multiprocessing, heterogeneous multiprocessing has some downsides. If you think it&#8217;s hard to debug a program that runs on two identical CPUs, you&#8217;ll be totally mystified by a program that runs on a reduced-instruction-set computing processor and a digital signal processor. Programming heterogeneous systems generally requires using multiple sets of programming tools and being extra careful at the boundary between the two systems.</p>
<p>But heterogeneity builds on the advantages of multiprocessors to provide further benefits. Many design studies have shown that specializing the instruction set of a CPU to the task that it runs saves substantial energy and improves performance (see Chris Rowen&#8217;s &#8220;Reducing SoC Simulation and Development Time,&#8221; Computer, Dec. 2002, pp. 29-34). If you start with a uniform multiprocessor and replace CPUs with new processors that are specialized to tasks that actually run in that part of the system, you can gain performance without substantial changes to your software.</p></blockquote>
<p>This might be simplifying the problem a bit, but overall I think that he is right in his assessment.</p>
<p>Thanks for an optimistic and realistic piece!</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/63/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mark Nelson&#8217;s Multicore Non-Panic and Embedded Systems</title>
		<link>http://jakob.engbloms.se/archives/59?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/59#comments</comments>
		<pubDate>Fri, 07 Dec 2007 20:44:46 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[blog commentary]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[software tools]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/59</guid>
		<description><![CDATA[Via thinkingparallel.com I just found an interesting article from last Summer, about the actual non-imminence of the end of the computing world as we know it due to multicore. Written by Mark Nelson, the article makes some relevant and mostly correct claims, as long as we keep to the desktop land that he knows best. [...]]]></description>
			<content:encoded><![CDATA[<p>Via <a href="http://thinkingparallel.com">thinkingparallel.com</a> I just found an interesting article from last Summer, about the actual non-imminence of the end of the computing world as we know it due to multicore.  <a href="http://marknelson.us/2007/07/30/multicore-panic/">Written by Mark Nelson, the article makes some relevant and mostly correct claims</a>, as long as we keep to the desktop land that he knows best.  So here is a look at these claims in the context of embedded systems.<br />
<span id="more-59"></span><br />
1. On the desktop, current few-way multicore solutions do seem to give immediate benefit thanks to the vast number of threads executing as background tasks and similar in a modern Windows installation.</p>
<p>2. A few more cores will be gobbled up by eye-candy work as Linux, Windows, and OS X keep fighting over which OS looks the best. And this means lots of easily parallelized threads.</p>
<p>3. Long-term, things look bleaker. How do we make use of a 32-way or even 128-way general-purpose machine?</p>
<p>In the embedded systems that I know and love, claim 1 certainly holds up in many cases.  Control-plane applications in core network and telecom systems do feature piles of threads today, and can quite easily be scaled out onto a few cores using SMP.  ARM has been arguing that the same holds for most of the mobile phone workloads that today run on single ARM cores.  Using an ARM multicore will work fine up to four cores, since there are ample threads to go around inside a modern phone.  All you need is for the OS to be SMP capable, and that seems to be finally happening, with the last big RTOSes announcing SMP versions this fall.</p>
<p>Note that there is a different way of using initial multicores in the embedded world, by consolidating what used to be several processors onto a single chip.  Basically, using a dualcore processor as a natural replacement for two singlecore processors, pretty much running the same workload. This scenario uses two (or more) different operating systems in AMP mode (see <a href="http://jakob.engbloms.se/archives/22">http://jakob.engbloms.se/archives/22</a>).  In this way, it is quite likely that quite a few systems can take advantage of 2-, 3-, 4-, and maybe even 8-way systems without much work.</p>
<p>Claim 2 makes no sense in the embedded field.  At least not in the sense that &#8220;your platform software provider will add end-user benefits that eat up more CPU and that does not require you to update your own code&#8221;.  Maybe you could claim this for mobile phones, but mainstream mobile phone OSes like Symbian have not exactly been aggressive on this front.  People don&#8217;t seem to be looking for eye candy of that kind in phones &#8212; currently at least (update: see comment on this, the iPhone could be changing this tenet).</p>
<p>Claim 3 is applicable.  At least for control-oriented applications that run on general-purpose shared-memory machines.  Unless you count on a continuation of the consolidation trend: imagine a system where you combine more and more boards from a current rack onto a single chip, or add &#8220;more boards&#8221; by adding in more AMP operating system instances.  It makes sense, since in many cases the actual applications feature ample parallelism that today is exploited by using multiple boards or discrete processing units working in close cooperation to handle the volumes of work present.</p>
<p>For media and radio interface applications, it is really easy to use &#8220;any&#8221; amount of parallelism.  But that is more similar to the GPUs used in current PCs than to the main processor(s) being discussed here.</p>
<p>Long-term, PC/desktop/server computing and embedded computing do share the challenge of using many cores effectively.  But the advantage of embedded computing is that most application domains are effectively parallel by nature, and &#8220;all&#8221; you have to do is find a way to move that parallelism onto a single chip.</p>
<p>His final statement is that:</p>
<blockquote><p> Our industry press thrives on a good crisis. The switch to multicore processors has presented the brain trust with the opportunity to drum up a convincing one, and they haven’t let us down. Just try to take it with a grain of salt. The crises we’ve had in the past have mostly been resolved with boring, step-wise evolution, and this one will be no different. Maybe 15 or 20 years from now we’ll be writing code in some new transaction based language that spreads a program effortlessly across hundreds of cores. Or, more likely, we’ll still be writing code in C++, Java, and .Net, and we’ll have clever tools that accomplish the same result.</p></blockquote>
<p>I think he is right about this, and that the end result will be a set of fairly ugly domain-specific frameworks that make parallel programming reasonably easy. Just like GUI coding frameworks popped up when GUIs were new, relieving you of the tediousness of writing all the plumbing code. But it took a few years to nail down what was to go into a framework and their</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/59/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Register reporting from SC&#8217;07</title>
		<link>http://jakob.engbloms.se/archives/55?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/55#comments</comments>
		<pubDate>Tue, 20 Nov 2007 19:31:35 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[software tools]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/55</guid>
		<description><![CDATA[The Register has a pretty good report from the Supercomputing (SC) 2007 conference.  Quite knowledgeable, and mostly about the thorny issue of programming massively parallel fairly homogeneous machines like GPUs and floating-point accelerators. Of course, their commentary has to be commented on. Read on for more. The following quotes on programming for the Clearspeed chips [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://sc07.supercomputing.org/images/sc07_logo.jpg" align="left" height="94" hspace="10" width="88" />The Register has a <a href="http://www.theregister.co.uk/2007/11/20/accelerators_fpga_gpu_sc07/">pretty good report</a> from the <a href="http://sc07.supercomputing.org/">Supercomputing (SC) 2007 conference.</a>  Quite knowledgeable, and mostly about the thorny issue of programming massively parallel fairly homogeneous machines like GPUs and floating-point accelerators. Of course, their commentary has to be commented on. Read on for more.</p>
<p><span id="more-55"></span></p>
<p>The following quote on programming for the <a href="http://www.clearspeed.com">Clearspeed</a> chips describes a solution that I find very attractive: use higher-level programming languages that describe an actual problem or computation to be performed, and then let a compiler/code generator take care of generating a suitable parallel implementation:</p>
<blockquote><p><em>For one, it notes that a number of applications such as Matlab and Mathematica can run on the CSX600 chips without any changes to the underlying code thanks to work done by ClearSpeed and the software makers and the presence of friendly ClearSpeed libraries.</em></p></blockquote>
<p>Using libraries is one solution suitable for certain classes of problems, and can cover a fair amount of the supercomputing market where the number of kernel algorithms used tends to be fairly limited. I wouldn&#8217;t try using these tools for programming a telecom switch, but that is not what the hardware is designed for either. The CSX600 chip contains 96 fully floating-point coprocessors, and there are solutions out there using massive numbers of the chips (144, according to The Register, for a total of 13824 processors). Scalability like that is pretty cool.</p>
<p>So it seems that by targeting a selected set of applications, Clearspeed does manage to produce a decently programmable solution. Also, one has to presume that the value to the end users of the solution is great enough that the time spent programming is worth its while.</p>
<p>Another example of a domain-limited solution with great power is what <a href="http://www.acceleware.com/">Acceleware </a>is doing:  take the hardware and the software from Nvidia for using GPUs as accelerators, and code a solution applying it to a certain problem domain. This makes it very easy for the end-users to pick up, since they basically buy a package targeted to their problem, with all the hard parts already taken care of. In the case of Acceleware, the problem is electromagnetic simulations, and the customers are big companies like Nokia and Samsung. These end customers only need a month or so to incorporate the acceleration effect into their custom programs. Everybody wins, and the value of Acceleware is in letting a group of users for GPU acceleration share the cost of creating a platform to work from. Classic play in high-tech.</p>
<p>Finally, The Register looks at a few players working with FPGAs as their acceleration platform. In theory, this offers immense performance and performance/power, but also a much steeper learning curve for programmers.  In my favorite field of embedded, I would rather see FPGAs and on-chip FPGAs being used to create accelerated peripherals or simple algorithms embedded in hardware than as general-purpose math accelerators. The difference might not be that big in theory, but it does affect the style and target programs of the programming tools. And there is a big difference between floating-point math and the operations needed to decode video or do parallel table lookups.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/55/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Virtualization and Linux on a DSP Processor</title>
		<link>http://jakob.engbloms.se/archives/46?&amp;owa_from=feed&amp;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/46#comments</comments>
		<pubDate>Sun, 04 Nov 2007 10:40:43 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[uncategorized]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[software tools]]></category>
		<category><![CDATA[virtualization]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/archives/46</guid>
		<description><![CDATA[A small tidbit that I found interesting due to the targeted platform. LinuxDevices reports that the VirtualLogix VLX-NI virtualization layer that used to run only on x86 platforms now also runs on TI DSPs in the C64+ series. Basically, you put their virtualization layer on the DSP, and you can then on the same core [...]]]></description>
			<content:encoded><![CDATA[<p>A small tidbit that I found interesting due to the targeted platform. <a href="http://www.linuxdevices.com/news/NS3172173373.html">LinuxDevices reports</a> that the <a href="http://www.linuxdevices.com/news/NS3172173373.html">VirtualLogix</a> VLX-NI virtualization layer that used to run only on x86 platforms now also runs on TI DSPs in the C64+ series. Basically, you put their virtualization layer on the DSP, and you can then on the same core run both a Linux kernel and a DSP/BIOS kernel. Thus supporting traditional DSP development and Linux-style development on the same core.</p>
<p><span id="more-46"></span><br />
On x86, the virtualization uses the virtualization extensions in recent editions of Intel and AMD processors. There is no such support on the C64+ DSP series, so they fall back to paravirtualization, modifying the operating systems to play nice with the virtualization layer. Nothing particularly magical about this.</p>
<p>The interesting part is really that people are considering using Linux, which is usually very hard to tune for non-standard hardware platforms, to run code on a real high-power DSP. In general, that is considered a bad idea since what you want is a thin layer of software to let you run compute programs that make maximum use of the processor. Doing interrupts and other typical OS work on a DSP is a bad idea since it breaks program flow. Some instruction sequences on a C64-type DSP even require you to turn off interrupts in order to work correctly! So I guess the performance of Linux programs is going to be pretty poor&#8230; However, the idea here seems to be to move some control-plane functions onto the DSPs in the system, which is going to be poorly performing code anyway.  I guess it makes sense if you have spare cycles on the DSP to be able to do away with the additional control-plane processor (which is usually a small slow processor anyway on the kind of base station software being targeted here).</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/46/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
