Observations from Uppsala Computer Simulation, Virtual Platforms, Embedded Programming, Multicore and More (by Jakob Engblom)

GPU Programming: a Good Pattern to Follow?

2008 August 10 20:40 / 2 Comments / Jakob

In the March/April 2008 issue of ACM Queue, there is an article on GPU programming by Kayvon Fatahalian and Mike Houston of Stanford that I found a very interesting read. It presents and analyzes the programming model of modern GPUs in the most coherent and understandable way that I have seen so far. The PC GPU has a model for programming parallel hardware that might be a good pattern for other areas of processing. Programmers do not have to write explicitly parallel code; the machinery and hardware take care of ensuring parallel behavior, as long as the code follows the assumptions made in the model.

The fundamentals of the GPU model are the following:

  • It presents a fixed pipeline of stages through which data is streamed and transformed into final output.
  • Some stages are fixed-function (with some parameters), some are fully programmable.
  • The programmable stages are programmed in a simple input-to-output transformation style using only local state, with no access to global variables or any way to affect other computations. In this respect, the model is similar to DSP programming with its DMA-in, compute, DMA-out style, if a bit more automated.
  • There is shared global state — but it is read-only, for parameters (textures, etc.)
  • Parallelism is present in two dimensions: each stage operates on lots of data in parallel, and all stages execute concurrently.
  • The really tricky transformations of the input data stream, the ones that involve dependences between data items, are encapsulated inside the fixed-function stages. In essence, this lets a few experts take care of the hard part of programming, and presents a streamlined, simple model to everyone else.
  • It is possible for users to destroy performance with badly written programs, but the typical use case and the hardware design rest on users doing sensible things within a fairly narrow domain.
  • Code is compiled into byte codes, which are then translated and optimized for a particular GPU by the driver in the final PC running the application. This two-stage just-in-time compilation (or dynamic recompilation, or whatever we want to call it) technique is a known good way to combine performance with portability.
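
As a rough illustration of the model described in the list above, here is a minimal Python sketch. It is not how a real GPU or graphics API works; the names `PARAMS`, `transform_stage`, `shade_stage`, `fixed_function_sum_pairs`, and `run_pipeline` are all made up for this example. The programmable stages only see one item plus read-only parameters, and the one stage where data items depend on each other is written once as a "fixed-function" stage.

```python
from types import MappingProxyType
from typing import Iterable, Iterator, List

# Read-only shared state, standing in for textures and other parameters.
# MappingProxyType rejects writes, so stages cannot modify it.
PARAMS = MappingProxyType({"scale": 2, "offset": 1})


def transform_stage(item: int) -> int:
    """Programmable stage: one item in, one item out, no side effects."""
    return item * PARAMS["scale"] + PARAMS["offset"]


def shade_stage(item: int) -> int:
    """A second programmable stage, equally local."""
    return item * item


def fixed_function_sum_pairs(items: Iterable[int]) -> Iterator[int]:
    """Fixed-function stage: the only place where items are combined.

    Neighbouring items are summed in pairs; a trailing odd item is
    paired with 0.
    """
    it = iter(items)
    for first in it:
        second = next(it, 0)
        yield first + second


def run_pipeline(data: Iterable[int]) -> List[int]:
    """Stream data through the fixed sequence of stages.

    This sketch runs sequentially; because the programmable stages are
    pure per-item functions, a framework would be free to run them on
    many items at once.
    """
    stage1 = map(transform_stage, data)
    stage2 = fixed_function_sum_pairs(stage1)
    stage3 = map(shade_stage, stage2)
    return list(stage3)
```

Calling `run_pipeline([1, 2, 3, 4])` transforms the items to 3, 5, 7, 9, sums them pairwise to 8 and 16, and squares those to give `[64, 256]`.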

This model is an interesting pattern as it has been extensively proven in practice. There are large numbers of programmers writing graphics programs, and they do not seem to have much trouble getting GPUs to run massively parallel computations. If you think about it, that is a pretty major success story! It also validates that the idea of “no shared state” that keeps being brought up does simplify programming. The model above is quite similar to what you have in Erlang/OTP, for example.

The lesson that can be drawn from this for other domains is likely that you need to create a framework (both in concept and in implementation) for processing that makes the code users write simple, single-threaded, and straightforward. The framework then automatically runs lots of little sequential snippets in parallel, and takes care of resource scheduling and the data flow.
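
To make the division of labour concrete, here is a small Python sketch of such a framework. The names `run_parallel` and `brighten` are hypothetical, and a thread pool from the standard library stands in for whatever scheduling a real framework would do. In standard CPython, threads will not speed up CPU-bound code like this, so the point is only who writes which part: the user writes a plain sequential function, and the framework decides how it gets applied to the data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_parallel(kernel: Callable[[T], R],
                 items: Sequence[T],
                 workers: int = 4) -> List[R]:
    """Framework side: apply a user-written kernel to every item.

    The kernel is assumed to be side-effect free, so the order in which
    items are processed does not matter. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, items))


def brighten(pixel: int) -> int:
    """User side: simple, single-threaded code for one data item."""
    return min(pixel + 40, 255)
```

For example, `run_parallel(brighten, [0, 100, 230])` returns `[40, 140, 255]`. The user code contains no threads, locks, or shared variables.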

Obviously, there are domains where this seems harder to do than in others, but I think this is the pattern for the future. Unless most parallel applications are “easy” to create, we will not make much use of parallelism. And I think that they usually can be made easy.

In particular, I can see this kind of framework being quite possible for things like packet processing in various network functions such as firewalls, routing, switching, and virus scanning. A smart hardware manufacturer in the networking market should provide these frameworks, just like the graphics chip providers do today.
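
A sketch of what the user-visible part of such a packet-processing framework could look like, again in Python and with entirely hypothetical names (`Packet`, `BLOCKED_PORTS`, `firewall_rule`, `process_packets`). It does not model any real networking product. The rule looks at one immutable packet and a read-only table, and returns a verdict; everything about how packets are distributed over the hardware is left to the framework.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Packet:
    """An immutable packet, so a rule cannot change it behind the framework's back."""
    src: str
    dst_port: int
    payload: bytes


# Read-only rule table shared by all rule invocations (illustrative values).
BLOCKED_PORTS = frozenset({23, 445})


def firewall_rule(packet: Packet) -> str:
    """User-written rule: one packet in, one verdict out, no shared mutable state."""
    if packet.dst_port in BLOCKED_PORTS:
        return "drop"
    return "forward"


def process_packets(packets: Sequence[Packet],
                    rule: Callable[[Packet], str],
                    workers: int = 4) -> List[str]:
    """Framework side: run the rule over all packets, verdicts in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(rule, packets))
```

With packets addressed to ports 80 and 23, `process_packets` would return `["forward", "drop"]`.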

Finally, here is a nice-looking link to the article, as generated by ACM’s online magazine publishing system:

Link to article at ACM
March/April 2008 issue of Queue

Posted in: multicore computer architecture, multicore software, programming / Tagged: GPU, Kayvon Fatahalian, Mike Houston

2 Thoughts on “GPU Programming: a Good Pattern to Follow?”

  1. Ricky Clarkson on 2008 August 11 at 12:53 said:

    Nice article, but the link to the ACM article is quite poor – it brings up a highly graphical page that blinks every 2 seconds (Firefox 3, Ubuntu).

  2. Jakob on 2008 August 11 at 14:34 said:

    That is how ACM has chosen to put out Queue magazine by default. It works on FF3 for me (on a Windows host), but I guess the best tip is to grab the downloadable PDF, which is what I prefer to read. The download page is at http://mags.acm.org/queue/20080304/templates/download_offline … there seems to be no good direct PDF download link. This trend towards “paper-like” web publishing is highly annoying; I would prefer to just have classic old HTML instead, where you can link into anything you want.

    /jakob
