<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; GPGPU</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/gpgpu/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>GPGPU for Instruction-Set Simulation &#8211; Maybe, Maybe not</title>
		<link>http://jakob.engbloms.se/archives/1506?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1506#comments</comments>
		<pubDate>Sat, 08 Oct 2011 19:17:58 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[articles]]></category>
		<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[computer simulation technology]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[parallel computing]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[CCGrid]]></category>
		<category><![CDATA[cycle accuracy]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[simulation]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1506</guid>
		<description><![CDATA[I just read a quite interesting article by Christian Pinto et al., &#8220;GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms&#8221;, published at the CCGRID 2011 conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed up the simulation. Intriguing concept, but the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png"><img class="alignleft size-full wp-image-125" style="margin: 5px 10px;" title="coreshrink1" src="http://jakob.engbloms.se/wp-content/uploads/2008/05/coreshrink1.png" alt="" width="100" height="100" /></a>I just read a quite interesting article by Christian Pinto et al., &#8220;<a href="http://infoscience.epfl.ch/record/164471">GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms</a>&#8221;, published at the <a href="http://www.ics.uci.edu/~ccgrid11/">CCGRID 2011</a> conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed up the simulation. Intriguing concept, but the execution is not without its flaws, and it is unclear, at least from the paper, just how well this generalizes, scales, or compares to parallel simulation on a general-purpose multicore machine.</p>
<p><span id="more-1506"></span>The paper describes a simulation for a network-on-chip based homogeneous system containing &#8220;ARM-subset&#8221; ISS instances with local instruction and data caches, some local RAM, and also some shared RAM. Each core runs its own local software load; there is no SMP operating system. All communication between cores is over shared memory, using explicit operations across the NoC. All cores run a single cycle before they check communications from their neighbors.</p>
<p>This last point is crucial to understanding why this is feasible at all &#8211; in general, simulating a general shared-memory multiprocessor machine on a shared-memory multiprocessor falls down on the synchronization overhead. If your simulation semantics dictate that you synchronize every cycle anyway, and you do not try to optimize each core simulator, there is clearly decent room for parallel execution. By including the cache, they increase scalability, since there is more work per target cycle that can be run in isolation.</p>
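<p>To make the synchronization scheme concrete, here is a minimal Python sketch of such a lockstep loop. It is not the code from the paper: the <code>Core</code> class, the ring topology, and the message format are all assumptions made for illustration. The point is only the structure, where every core runs one cycle in isolation and then all cores meet at a global exchange point.</p>
<pre>
```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical stand-in for one ISS instance plus its local state."""
    ident: int
    cycles: int = 0
    inbox: list = field(default_factory=list)
    received: list = field(default_factory=list)

    def step(self, num_cores):
        # Consume whatever the neighbors sent during the previous cycle.
        self.received.extend(self.inbox)
        self.inbox = []
        # Simulate exactly one target cycle in isolation.
        self.cycles += 1
        # Emit one message to the next core (ring topology is an assumption).
        return ((self.ident + 1) % num_cores, (self.ident, self.cycles))


def run_lockstep(num_cores, num_cycles):
    cores = [Core(i) for i in range(num_cores)]
    for _ in range(num_cycles):
        # Phase 1: every core runs one cycle; this is the parallelizable part.
        outgoing = [core.step(num_cores) for core in cores]
        # Phase 2: global synchronization point where messages are delivered.
        for dest, payload in outgoing:
            cores[dest].inbox.append(payload)
    return cores


cores = run_lockstep(num_cores=4, num_cycles=3)
print(cores[1].received)  # [(0, 1), (0, 2)]
```
</pre>
<p>With four cores and three cycles, core 1 has consumed the messages that core 0 sent in cycles 1 and 2, while the message from cycle 3 is still waiting in its inbox. The first phase is what a GPU (or a multicore host) could run in parallel; the second phase is the synchronization cost that is paid every single cycle.</p>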
<p>After reading the article, I am impressed by their work &#8211; just getting this to work is quite an achievement. But there are quite a few questions which are not really answered in the article and which are crucial to understanding just how well GPGPUs could be used for this kind of ISS work.</p>
<ul>
<li>The targeted level of abstraction is a bit confusing. The authors claim it is &#8220;instruction accurate and not cycle accurate&#8221;, but still simulate caches and cycle-based communications across the NoC. If I read the paper right, communications will take a varying number of cycles depending on the distance for messages to travel. This is more detailed than a typical &#8220;instruction accurate&#8221; simulator.</li>
<li>The target system does not run an OS &#8211; that might (but I do not know) be an advantage for their approach, since it probably implies less variation in the instruction flow in cores, potentially increasing the amount of time that all ISSes in a thread group on the GPU can execute the same instruction. This would seem crucial: if each ISS were running a totally different program, the instruction execution part of the code would run serialized.</li>
<li>They should really try to run the same kind of simulation on a high-end x86 CPU like an Intel Sandy Bridge with 8 or more hardware threads. I wonder if their scaling might not work just as well there &#8211; and with a much faster serial execution engine. This should give a much more relevant point of comparison for GPU vs CPU execution of the simulator than&#8230;</li>
<li>the comparison object they use right now, a JIT-accelerated multicore simulation using OVP, seems pretty irrelevant since it is not doing the same thing at all. That simulator does not simulate the caches or NoC, just a large number of isolated processors. They also do not run a parallel program on OVP, but rather a large number of single-core Fibonacci and Dhrystone programs. Thus, the fact that OVP uses a large temporal decoupling time slice does not matter for semantics. It just does not seem like a very relevant comparison point. OVP and their simulator try to solve different problems &#8211; fast execution of general code vs. performance profiling of massively parallel machines.</li>
<li>As I understand it, the given &#8220;S-MIPS&#8221; numbers in the evaluation tell us the total number of MIPS that we get out across all target cores. That seems to peak around 2000 &#8211; which isn&#8217;t necessarily that fantastic if we compare to high-performance ISS work in general, where a few GIPS is definitely achievable. It is pretty good considering the level of detail here, though, where I would expect a normal ISS + cache simulator to produce at most a few MIPS. Once again, the authors need to be a bit more precise as to what they compare to what.</li>
<li>Not having an MMU and not implementing any interrupts or exceptions in the target machines avoids a large part of the complexity of a real ISS. That complexity might well be too much for the quite rigid execution environment of a GPGPU.</li>
<li>They missed that Simics, unique among instruction-accurate mainstream simulators, has been <a href="http://jakob.engbloms.se/archives/128">parallel</a> since version 4.0.</li>
</ul>
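<p>The point above about ISSes in a thread group needing to execute the same instruction can be illustrated with a toy model in Python. This is a deliberately simplified view of SIMT execution and an assumption on my part, not something taken from the paper: lanes holding the same opcode run together, and lanes with different opcodes are handled one group after another. The function names and the ARM-like mnemonics are made up for the example.</p>
<pre>
```python
def serialized_passes(opcodes_per_lane):
    """Passes needed for one lockstep step of a thread group.

    Simplified SIMT model (an assumption): lanes that hold the same opcode
    run together, lanes with different opcodes run one group after another.
    """
    return len(set(opcodes_per_lane))


def utilization(opcodes_per_lane):
    """Fraction of lane slots doing useful work in this step."""
    lanes = len(opcodes_per_lane)
    return lanes / (lanes * serialized_passes(opcodes_per_lane))


# 32 simulated cores that all happen to be at the same kind of instruction.
same_program = ["add"] * 32
# 32 simulated cores spread over four different instruction kinds.
mixed_programs = ["add", "ldr", "str", "b"] * 8

print(serialized_passes(same_program), utilization(same_program))      # 1 1.0
print(serialized_passes(mixed_programs), utilization(mixed_programs))  # 4 0.25
```
</pre>
<p>In this model, a thread group where the simulated cores are spread over four different instructions needs four passes and uses only a quarter of the available lanes. Real hardware and a real ISS are more subtle, but the trend is the reason why homogeneous, OS-free target software may flatter the approach.</p>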
<p>So, overall, this paper does not really tell us much about whether a GPGPU can be used for instruction-set simulation in general. It does tell us that it might be doable, but there are many crucial complications which are not addressed.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1506/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Photoshop Scalability and &#8220;-10% overhead&#8221;</title>
		<link>http://jakob.engbloms.se/archives/1311?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/1311#comments</comments>
		<pubDate>Mon, 01 Nov 2010 11:45:51 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore debug]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Cary Millsap]]></category>
		<category><![CDATA[Clem Cole]]></category>
		<category><![CDATA[Communications of the ACM]]></category>
		<category><![CDATA[GPGPU]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[performance optimization]]></category>
		<category><![CDATA[Photoshop]]></category>
		<category><![CDATA[Russell Williams]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=1311</guid>
		<description><![CDATA[I just finished reading the October 2010 issue of Communications of the ACM. It contained some very good articles on performance and parallel computing. In particular, I found the ACM Case Study on the parallelism of Photoshop a fascinating read. There was also the second part of Cary Millsap&#8217;s articles about &#8220;Thinking Clearly about Performance&#8221;. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://jakob.engbloms.se/wp-content/uploads/2010/10/cacm-10-20101.jpg"><img class="alignleft size-full wp-image-1313" style="margin: 10px 5px;" title="cacm 10 2010" src="http://jakob.engbloms.se/wp-content/uploads/2010/10/cacm-10-20101.jpg" alt="" width="62" height="80" /></a>I just finished reading the <a href="http://cacm.acm.org/magazines/2010/10">October 2010</a> issue of <a href="http://cacm.acm.org/">Communications of the ACM</a>. It contained some very good articles on performance and parallel computing. In particular, I found the ACM Case Study on the parallelism of Photoshop a fascinating read. There was also the second part of Cary Millsap&#8217;s articles about &#8220;Thinking Clearly about Performance&#8221;.</p>
<p><span id="more-1311"></span>Cary&#8217;s articles deal mostly with database tuning in the Oracle ecosystem, but most of his observations apply to any kind of programming with a performance requirement. It is worth a read. It was good to see him dissect performance, including obvious &#8211; but not really obvious &#8211; concepts like the difference in usefulness between average and worst-case response times from a user perspective. In essence, you need to watch the spread of response times, and try to keep the worst times from getting too bad, rather than just look at an average that might conceal extremes that frustrate users.</p>
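<p>A small Python sketch shows how an average can hide the extremes. The response times below are invented for the example (they are not from the article), and <code>summarize</code> is just a helper name I picked.</p>
<pre>
```python
import statistics


def summarize(times):
    """Mean, worst case, and spread of a list of response times (seconds)."""
    return {
        "mean": statistics.mean(times),
        "worst": max(times),
        "spread": max(times) - min(times),
    }


# Hypothetical response times; both sets have the same mean.
steady = [1.0] * 10
spiky = [0.5] * 9 + [5.5]

print(summarize(steady))  # {'mean': 1.0, 'worst': 1.0, 'spread': 0.0}
print(summarize(spiky))   # {'mean': 1.0, 'worst': 5.5, 'spread': 5.0}
```
</pre>
<p>Both systems report a one-second average, but only one of them makes every tenth user wait five and a half seconds.</p>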
<p>Cary also made the comment noted in the title of this post. In his opinion, the performance instrumentation built into Oracle has an overhead of -10% &#8211; or even -20% or -30%, since it enables optimizations that would otherwise have been impossible to do. This is something worth noting in general &#8211; overhead that looks bad when considered as a local cost might be a net benefit in the grand scale of things, by enabling measurements and insight that let a program run much faster.</p>
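<p>The arithmetic behind a negative overhead is simple enough to write down. The following Python sketch uses numbers that are purely hypothetical (a 10% instrumentation cost and a 20% runtime reduction from the optimizations it enables); the function name is mine, not Cary&#8217;s.</p>
<pre>
```python
def net_runtime(baseline, instrumentation_cost, optimization_gain):
    """Runtime after adding instrumentation and applying what it revealed.

    instrumentation_cost: fractional slowdown from the probes (0.10 = 10%).
    optimization_gain: fraction of runtime removed by the optimizations
    that the measurements made possible (0.20 = 20%).
    """
    return baseline * (1 + instrumentation_cost) * (1 - optimization_gain)


# Hypothetical numbers, not taken from the article.
baseline = 100.0
after = net_runtime(baseline, 0.10, 0.20)
net_overhead = (after - baseline) / baseline
print(f"net overhead: {net_overhead:+.1%}")
```
</pre>
<p>With these made-up numbers the instrumented and tuned program ends up about 12% faster than the uninstrumented one, which is the sense in which the overhead is negative.</p>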
<p>The ACM case study on Photoshop can be found online as a <a href="http://queue.acm.org/detail.cfm?id=1858330">resource at the ACM Queue</a>, with what seems to be mostly the same content. It was written by Clem Cole, at Intel, who interviews Russell Williams of the Photoshop team. It is very instructive to see how the Photoshop team has built an application that works well with 2 to 4 and maybe 8 cores, but that really needs to reconsider parts of its architecture to scale beyond 8.</p>
<p>Clem from Intel pushes Russell by bringing up various examples of next-generation architectures, in particular the fact that clusters-on-a-chip and NUMA memories look inevitable. The Photoshop people seem to take a wait-and-see approach to this: they first want to see some architecture have real traction in the market before they commit and rearchitect their software to make use of it.</p>
<p>The problems of debugging parallel software are also brought up. There used to be a simple bug in the asynchronous I/O system in Photoshop that took ten years to uncover! Essentially, the programmers had not considered atomicity properly in the presence of multiple threads. With that kind of example, it is not surprising that the Photoshop programmers are very careful when planning and performing parallelizations.</p>
<p>The target domain of Photoshop is to some extent naturally parallel, but not as much as I would have thought. Since a user might operate on any part of an image, large or small, and maybe start and then abort an operation, it is not just a matter of splitting an image evenly across threads or cores. There is a significant amount of variation in just how parallel things can be in Photoshop.</p>
<p>Photoshop has had an easy-to-use parallelization system in place since around 1994, which lets programmers write simple serial computational kernels that are automatically applied to parts of an image in parallel. The Photoshop program itself takes care of the synchronization between kernels, and the kernels can be simple and robust, without any parallel code inside. This is a <a href="http://jakob.engbloms.se/archives/209">pattern that has been seen before</a>, and it does make a lot of sense &#8211; if it can be applied successfully. Apparently, this is not necessarily the easiest thing to scale beyond four cores.</p>
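<p>The general shape of this pattern can be sketched in a few lines of Python. This is in no way how Photoshop is implemented; the kernel, the row-band tiling, and the function names are assumptions chosen to show the division of labor between a serial kernel and a framework that owns all the parallelism.</p>
<pre>
```python
from concurrent.futures import ThreadPoolExecutor


def invert_kernel(rows):
    """Serial kernel: knows nothing about threads, only about its rows."""
    return [[255 - pixel for pixel in row] for row in rows]


def apply_in_tiles(image, kernel, tile_rows=2, workers=4):
    """Framework side: split the image into row bands, run the kernel on each."""
    bands = [image[i:i + tile_rows] for i in range(0, len(image), tile_rows)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so reassembly is trivial.
        processed = list(pool.map(kernel, bands))
    return [row for band in processed for row in band]


# A tiny 5x4 grayscale "image" with made-up pixel values.
image = [[x + 10 * y for x in range(4)] for y in range(5)]
result = apply_in_tiles(image, invert_kernel)
print(result[0])  # [255, 254, 253, 252]
```
</pre>
<p>The kernel author never sees a thread or a lock. Note that in CPython the threads here mostly illustrate the structure rather than deliver a real speedup for pure-Python pixel math.</p>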
<p>The main performance limitation for Photoshop remains memory bandwidth, rather than raw compute performance. This also limits the need to aggressively scale to higher levels of parallelism: as long as multiple threads do not give more bandwidth, it has proven hard to use more than two or three threads on any multicore processor, as that is sufficient to saturate the memory system. Apparently, this is different on the Nehalem (Core i7/i5/i3) generation of Intel multicore processors, where each core has a dedicated non-stealable slice of the memory bandwidth.</p>
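<p>A toy model makes the saturation effect easy to see. The Python sketch below assumes that each thread streams data at a fixed rate and that speedup is linear until the combined demand hits the memory bandwidth; both the model and the numbers (4 GB/s per thread, 10 GB/s for the memory system) are assumptions for illustration, not measurements.</p>
<pre>
```python
def effective_speedup(threads, per_thread_gbps, memory_gbps):
    """Speedup over one thread when each thread streams data at a fixed rate.

    Toy model (an assumption): threads scale linearly until their combined
    demand reaches the memory bandwidth, then the curve goes flat.
    """
    saturation_point = memory_gbps / per_thread_gbps
    return min(float(threads), saturation_point)


# Hypothetical numbers: 4 GB/s per thread against a 10 GB/s memory system.
for threads in (1, 2, 3, 4, 8):
    print(threads, effective_speedup(threads, 4.0, 10.0))
```
</pre>
<p>In this model the third thread already gets only half of what it asks for, and every thread after that adds nothing, which matches the &#8220;two or three threads&#8221; observation.</p>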
<p>For the near future, it seems that the big step for Photoshop is going the route of using GPUs for acceleration, rather than 10+ core main processors.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/1311/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GPGPU &#8211; a new type of DSP?</title>
		<link>http://jakob.engbloms.se/archives/930?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/930#comments</comments>
		<pubDate>Fri, 11 Sep 2009 14:35:18 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[computer architecture]]></category>
		<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[DSP]]></category>
		<category><![CDATA[GPGPU]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=930</guid>
		<description><![CDATA[My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, nor do I see such processors appear in any customer applications. Still, [...]]]></description>
			<content:encoded><![CDATA[<p>My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, nor do I see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.</p>
<p><span id="more-930"></span>The initial key idea behind GPGPU was that a GPU offers very high performance, and does so in a part that &#8220;everyone has anyway&#8221; &#8212; i.e., something that is found on any PC. Outside of PCs, such powerful GPUs are pretty non-existent. Then, the GPU companies picked up on this idea and are making their GPUs more applicable to general purpose tasks.</p>
<p>But where does all this performance come from? To me, it all looks like the rebirth of the vector processor. If we compare a GPU and an Intel or AMD x86 main processor, it is clear that the GPU gets more FLOPs per chip. Mostly, this seems to be because the GPU has many times the number of processing units. Something like 1000s of them, rather than maybe 10 in a general purpose unit.</p>
<p>How can all of these be fit on a die that is similar in size to the general processor? As always when you see disparity like this, it stems from optimization for different target uses leading to different architecture.</p>
<p>The reasons for GPU raw performance seem to be threefold:</p>
<ul>
<li>Each processor is much simpler, with a simple instruction set and no out-of-order, speculation, or other complex logic. Programming is more complicated, as programs are run on groups of processors and with lots of little constraints. This makes it possible to fit more cores into the same area.</li>
<li>There is far less cache on the die, which forces programs to rely on bandwidth and managing to stream data through the processor.</li>
<li>Processors are built to be good at repetitive math, and be very bad at anything else. This also makes it possible to optimize data flows and control handling to a far greater extent than on general-purpose processors.</li>
<li>And I guess you can add a fourth parameter: power consumption and heat are not really a big problem. Water cooling, huge fans, and 300W power draws are OK&#8230;</li>
</ul>
<p>What this all boils down to is that the GPGPU requires predictable algorithms that can effectively and efficiently prefetch data and stream it through the cores at a predictable rate. Data also needs to be wide to engage groups of cores at once (i.e., vector processing). Integer decision-making code is out (gcc, Simics, control-plane code, most database front ends), and data-intense is in (images, audio, video, graphics). SIMD is part of it, but not the most interesting part. The point is that you apply SIMD across large vectors of independent elements in parallel. And you are looking to solve one large problem at a time.</p>
<p>If you compare this to the classic single-core DSP, you see a very different design. A DSP has specialized instructions in the instruction set, support for loops in very efficient ways, and is often SIMD. But they very rarely operate like vector processors. They are also general enough to be able to run a rudimentary OS and operate semi-independently from the main processor. Also, DSPs tend to be used in large multicore clusters, but there each DSP operates on a different problem at a time. So rather than one vector of 1000 elements in a video compression, you might have 1000 independent video streams being processed, out of sync with each other. DSPs also tend to have much simpler programming models compared to GPGPUs &#8212; even if they can be painful compared to general-purpose processors.</p>
<p>So GPGPUs are quite different in practice from DSPs, built to solve different types of problems in different ways. In the end, it is not clear to me that a GPGPU is a winner in terms of performance per watt or performance per area. They are certainly hot in the desktop and server field, but I cannot see them replacing general DSPs any time soon.</p>
<p>Note that something like the Tilera chip is another intermediate point between multicore DSP and a GPU. There seems to be a long continuum of core counts from around 4 to 8 for DSP to around 100 for Tilera to 1000 for GPUs&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/930/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

