Observations from Uppsala Computer Simulation, Virtual Platforms, Embedded Programming, Multicore and More (by Jakob Engblom)

GPGPU – a new type of DSP?

2009 September 11 15:35 / 6 Comments / Jakob

My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to my job of running Simics, nor do I see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.

The initial key idea behind GPGPU was that a GPU offers very high performance, and does so in a part that “everyone has anyway” — i.e., something that is found on any PC. Outside of PCs, such powerful GPUs are pretty non-existent. Then, the GPU companies picked up on this idea and are making their GPUs more applicable to general purpose tasks.

But where does all this performance come from? To me, it all looks like the rebirth of the vector processor. If we compare a GPU and an Intel or AMD x86 main processor, it is clear that the GPU gets more FLOPs per chip. Mostly, this seems to be because the GPU has many times the number of processing units. Something like 1000s of them, rather than maybe 10 in a general purpose unit.

How can all of these be fit on a die that is similar in size to the general processor? As always when you see disparity like this, it stems from optimization for different target uses leading to different architecture.

The reasons for the GPU's raw performance seem to be three-fold:

  • Each processor is much simpler, with a simple instruction set and no out-of-order, speculation, or other complex logic. Programming is more complicated, as programs are run on groups of processors and with lots of little constraints. This makes it possible to fit more cores into the same area.
  • There is far less cache on the die, which forces programs to rely on bandwidth and managing to stream data through the processor.
  • Processors are built to be good at repetitive math, and be very bad at anything else. This also makes it possible to optimize data flows and control handling to a far greater extent than on general-purpose processors.
  • And I guess you can add a fourth parameter: power consumption and heat are not really a big problem. Watercooling, huge fans, and 300W power draws are OK…
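
To make the second point a bit more concrete: with little cache to lean on, code typically has to overlap fetching the next block of data with computing on the current one. Below is a minimal Python sketch of that pattern, assuming a simple two-stage pipeline. The names `load_batch` and `process_batch` are hypothetical stand-ins for a memory transfer and a compute kernel, not any real GPU API.

```python
from concurrent.futures import ThreadPoolExecutor


def load_batch(data, start, size):
    """Hypothetical stand-in for moving one block of data to the processor."""
    return data[start:start + size]


def process_batch(batch):
    """Hypothetical stand-in for the compute kernel (sum of squares here)."""
    return sum(v * v for v in batch)


def double_buffered_sum_of_squares(data, batch_size):
    """Process data in batches, loading batch k while computing on batch k-1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = 0
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = None  # future for the batch that has been requested but not yet processed
        for start in range(0, len(data), batch_size):
            # Kick off the next load before touching the previous batch.
            upcoming = loader.submit(load_batch, data, start, batch_size)
            if pending is not None:
                total += process_batch(pending.result())
            pending = upcoming
        # Drain the last outstanding batch.
        if pending is not None:
            total += process_batch(pending.result())
    return total
```

In a real system the load would be a DMA or bus transfer rather than a list slice, and whether the overlap pays off depends on how compute time compares to transfer time.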

What this all boils down to is that the GPGPU requires predictable algorithms that can effectively and efficiently prefetch data and stream it through the cores at a predictable rate. Data also needs to be wide to engage groups of cores at once (i.e., vector processing). Integer decision-making code is out (gcc, Simics, control-plane code, most database front ends), and data-intense is in (images, audio, video, graphics). SIMD is part of it, but not the most interesting part. The point is that you apply SIMD across large vectors of independent elements in parallel. And you are looking to solve one large problem at a time.
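
As a rough illustration of the difference, here is a small Python sketch of my own (not taken from any GPU toolkit). The first function does the same arithmetic on every element, and each element is independent, so the iterations could run in any order or all at once. The second is a toy pointer-chasing loop where every step depends on the previous one, which is the kind of code that gains nothing from thousands of simple cores. All function names are made up for this example.

```python
def saxpy_element(i, a, x, y, out):
    """Work for one element: depends only on index i, not on other elements."""
    out[i] = a * x[i] + y[i]


def run_data_parallel(a, x, y, reverse=False):
    """Apply the same operation to every element; the order does not matter."""
    out = [0.0] * len(x)
    indices = range(len(x))
    if reverse:
        indices = reversed(indices)
    for i in indices:
        saxpy_element(i, a, x, y, out)
    return out


def chase(next_index, start, steps):
    """Decision-style code: each step needs the result of the step before it."""
    pos = start
    for _ in range(steps):
        pos = next_index[pos]
    return pos
```

Running `run_data_parallel` forwards or backwards gives the same result, which is exactly the property a vector-style machine exploits. There is no such freedom in `chase`.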

If you compare this to the classic single-core DSP, you see a very different design. A DSP has specialized instructions in the instruction set, support for loops in very efficient ways, and is often SIMD. But DSPs very rarely operate like vector processors. They are also general enough to be able to run a rudimentary OS and operate semi-independently from the main processor. Also, DSPs tend to be used in large multicore clusters, but there each DSP operates on a different problem at a time. So rather than one vector of 1000 elements in a video compression, you might have 1000 independent video streams being processed, out of sync with each other. DSPs also tend to have much simpler programming models compared to GPGPUs, even if they can be painful compared to general-purpose processors.
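
The "many independent streams" style can be sketched like this in Python. This is only an illustration under my own assumptions: `StreamWorker` is a hypothetical stand-in for one DSP core, and a moving average stands in for whatever filter or codec the core would really run. The point is that each worker keeps private state and advances at its own pace, never looking at its neighbours.

```python
from collections import deque


class StreamWorker:
    """Hypothetical stand-in for one DSP core with its own private stream state."""

    def __init__(self, taps):
        # History starts zero-filled, as a simple filter would after reset.
        self.history = deque([0.0] * taps, maxlen=taps)

    def push(self, sample):
        """Consume one sample and return the moving average over the last taps samples."""
        self.history.append(sample)
        return sum(self.history) / len(self.history)


def process_streams(streams, taps=2):
    """Run one worker per stream; streams may have different lengths."""
    workers = [StreamWorker(taps) for _ in streams]
    outputs = [[] for _ in streams]
    longest = max((len(s) for s in streams), default=0)
    # Round-robin over the streams: workers finish at different times
    # and never read each other's state.
    for t in range(longest):
        for k, stream in enumerate(streams):
            if t < len(stream):
                outputs[k].append(workers[k].push(stream[t]))
    return outputs
```

Contrast this with the vector style, where all lanes work on pieces of the same problem in lockstep.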

So GPGPUs are quite different in practice from DSPs, built to solve different types of problems in different ways. In the end, it is not clear to me that a GPGPU is a winner in terms of performance per watt or performance per area. They are certainly hot in the desktop and server field, but I cannot see them replacing general DSPs any day soon.

Note that something like the Tilera chip is another intermediate point between multicore DSP and a GPU. There seems to be a long continuum of core counts from around 4 to 8 for DSP to around 100 for Tilera to 1000 for GPUs…

Posted in: computer architecture, multicore computer architecture / Tagged: DSP, GPGPU

6 Thoughts on “GPGPU – a new type of DSP?”

  1. JoachimS on 2009 September 16 at 06:25 said:

    Hmm I guess I’m partly responsible for this post… :-)

    I agree with you, at least to some extent.

    Yes, the GPU could be seen as the rebirth of the vector processor. Yes, the PEs are simple, scalar in-order-issue machines with less cache (but loads of registers and PE-shared mem – think SW-controlled cache). Yes they are currently mainly for big things connected to a good PSU and can therefore be power hogs.

    All those things are true. Right now.

    In about a decade the PEs have gone from specialized instances of datapaths with a few control plane adaptations to complete 32-bit RISC engines. This trend has largely been driven by MS and the development of DirectX as well as the OpenGL standard evolving with requirements on ever growing flexibility in the datapath.

    The Nvidia vs AMD/ATI arms race is still going on, building ever bigger and hungrier GPUs for all those gami^D^D^D^D scientific applications. But while that is happening and gets a lot of press and steam, these players as well as a lot of niche players have actively started to target the embedded/low power space. Today you can get devices as well as IP-cores that include dynamic (as in data dependent as well as control plane adjustable) power control. In terms of W/FLOP for general computing they are not a match for CPUs or DSPs, but I’d venture to say that they are getting closer. The devices we are talking about here do not have 1000+ PEs, but somewhere in the range of a few PEs to the low hundreds.

    I would not be surprised to see variants of GPUs in the near future where one or a few PEs are more equal than the others and are allowed to control the activity (power supply) of some of the others. Yes this would require a more complicated instruction fetch and memory interface, but it should be doable. A mobile phone platform having three ARM11 cores and a GPU with four cores is pretty close to a GPU with eight cores.

    Also the thing about streaming. Since the ratio of processing power to bandwidth is so big, especially for big GPUs, you are actually not talking about streaming things through but about batch processing with double buffering. If you look at the Nvidia Tesla T10, Nvidia themselves suggest that it is better to burn a few more cycles than to shuffle data. When I worked on a GPGPU project we gained performance by going to a cubic transform rather than a quadratic transform, simply because the amount of data needed per time unit was reduced and we could use some of that compute power sitting there idle.

    One key issue here is programmability. When DSPs are going multicore and are adding multiple MACs into the core datapath etc, how hard will they be to program efficiently? Compared to a GPU with CUDA or OpenCL?

    As a computer arch geek I love looking at new designs from Tilera, BOPs, Excellera and all other startups as well as research architecture proposals. But at the end of the day the application needs to be mapped onto the instruction set and the programming model of the device. How well that works out is down to two things:

    (1) How well do the instruction set and programming model match the application?

    (2) How easy is it to write the code implementing the application mapping?

    The first one is what you talk about, the difference between DSPs vs GPUs and how well they match a set of applications. The second one is all about tools.

    From my point of view as an embedded engineer, squeezing the performance out of DSPs (to get that MIPS/W) can be pretty hard, and quite a few engineers have neither the competence nor the time allowed project-wise to do it. My observation is that when the DSP complexity goes up this discrepancy increases.

    And, to finish off this long text, if DSPs in real-world usage are getting less and less efficient, how far off are GPUs if they can (more) easily be efficiently programmed?

  2. JoachimS on 2009 September 16 at 07:18 said:

    Aloha!

    And to clarify the section about Tilera, Excellera etc: When I have evaluated NPUs, GPUs, DSPs and other more or less app-area specific architectures I often reflect on how cool, powerful and interesting the architecture is… and how lousy the provided toolchain needed to program the machine is.

    A port of GCC is basically always provided (but sometimes not even that) and of course an assembler. But debuggers might be quite simplistic and things like data flow, optimization and hotspot analysis are not that common.

    The result is that yes, you can get your application onto the device, and it might even be pretty well debugged. But getting it really efficiently mapped onto the device might be hard even though your application might be in the application domain of the device.

    I guess this is where Simics comes in. ;-)

  3. Jakob on 2009 September 16 at 10:37 said:

    I think we are basically making the same argument, in terms of the end result being pretty much like today’s DSPs… and in particular the SPE engines of the Cell in terms of programming and use style. The Tesla use model sounds very much like the archetypal DSP DMA-in, process, DMA-out cycle.

    That GPGPUs are easier to program is an interesting point. Certainly, desktop use tends to drive better tools. But that is a point not entirely lost on DSP and Tilera and the other people either… and the solution seems to be moving to model-based composition of algorithm elements. Which is essentially what OpenCL is also about, as well as graphical programming in LabView or MatLab or textual domain-specific languages.

    In essence, the key advantage you see with the GPU is that it has a thicker software stack that hides a thornier architecture.

  4. Niklas on 2009 October 5 at 20:18 said:

    http://www.nvidia.com/fermi

  5. Jakob on 2009 October 5 at 20:23 said:

    Thanks! Also see http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932 for a readable analysis.

    It seems that Nvidia is making this more and more like a vector processor, actually moving away from discrete graphics as their main market. But that also partially defeats the charm of GPGPU — to leverage for other purposes the steady march of GPU improvements driven by graphics for gamers. It is still not as easy to use as your average DSP, for example.

  6. jimmy on 2009 December 6 at 17:16 said:

    Yes, Nvidia is really taking the HPC market seriously, especially with Fermi. Fermi’s on-chip shared memory is quadrupled and can also be split into an L1 cache if desired. There is also a shared L2 cache.

    This theoretically means that the GPU will be better at working on applications where one needs to do more random accesses to data.

    One strength of the current Nvidia CUDA architecture is that it forces you to handle memory in an efficient way if you want good results. You’re not really allowed to write dumb code, which is just as well…

    @Jakob, the Fermi lineup doesn’t necessarily mean that Nvidia is moving away from their main market. The GeForce series will probably not support ECC and other fancy HPC attributes and will still be very much focused on games. Let the gamers buy the GeForce series while HPC geeks buy the Quadro and Tesla series.

    Btw, both Larrabee and the IBM Cell are dead/cancelled. They both realised they aren’t going to be able to compete with Nvidia / AMD/ATI in this area.
