Observations from Uppsala Computer Simulation, Virtual Platforms, Embedded Programming, Multicore and More (by Jakob Engblom)

SiCS Multicore Day 2012

2012 September 16 22:12 / 4 Comments / Jakob

The 2012 edition of the SiCS Multicore Day was fun, as it always has been. I missed it in 2010 and 2011, but could make it back this year. It was interesting to see that the points where the keynote speakers disagreed were similar to previous years, albeit with some new twists. There was also a trend in architecture, moving crypto operations into the core processor ISA, that indicates another angle on the hardware accelerator space.

Many-Core Missing

Five years have passed since the first SiCS Multicore Day in 2007 (making this the sixth event), and in his introduction Erik Hagersten looked back at some of the predictions made back then. One missed prediction stood out clearly: the idea that by now, 128 cores would be mainstream in personal computers. My theory of why this has not happened is simple: GPGPU. GPUs have eaten up the easy parallelism. Instead of running on massively multicore regular processors, heavy-duty personal computing has shifted onto GPUs. With these workloads gone, there has been little pressure on main processors to become more parallel, as there would not be much to gain performance-wise. GPUs have turned out to be perfect for massively data-parallel work in media and other areas (including tasks like cracking password hashes and mining for bitcoins), achieving performance orders of magnitude higher than what could be hoped for with a multicore main processor, while costing less and using comparatively little power.
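To put rough numbers on that argument, here is a minimal sketch using Amdahl's law. The law itself is standard, but the parallel fractions below are made-up illustrations and not figures from the talks. The idea is that once the highly data-parallel work has moved to the GPU, what remains on the main processor has a modest parallel fraction, and then 128 cores buy very little.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup when only part of the work scales with core count."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: 0.99 stands in for media-style data-parallel code,
    # 0.50 for the mixed code left on the CPU after the GPU took the easy part.
    for parallel_fraction in (0.99, 0.50):
        for cores in (4, 128):
            speedup = amdahl_speedup(parallel_fraction, cores)
            print(f"parallel={parallel_fraction:.2f} cores={cores:3d} "
                  f"speedup={speedup:.1f}x")
```

With the 0.50 workload, going from 4 to 128 cores moves the speedup from 1.6x to just under 2x, which is hardly worth the silicon.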

The prevalence of GPGPU on the desktop is not mirrored in the top supercomputers, however. According to Erik Hagersten, there is no real GPGPU machine in the TOP500 supercomputer list at the moment. Maybe 5% of the performance and 3% of the chips are GPUs. I suspect part of this might have to do with the kinds of tasks being done. HPC at the high end probably requires more flexibility and programmer control than GPUs can offer.

Programmability might be more important in architectural design for HPC, as HPC users tend to be programmers. Most regular computer users, on the other hand, just use software written by someone else. Thus, it is enough that a few people go through the hard work of coding in CUDA or OpenCL or similar toolkits, and the results of their work can be spread across a very large user base. GPGPUs are perfect to provide “performance for the rest of us”, for common tasks coded by a few expert programmers.

Homogeneous, Heterogeneous

The debate over GPGPU is part of a bigger debate about homogeneous vs heterogeneous compute systems (see previous blog posts like this, this, this, and this). The debate is still going on, with the same intensity as it always has. To me, that would seem to indicate that hardware accelerators are here to stay, even if some people do not really like them.

This year, the primary example of the drive to homogeneity was Intel’s recently announced “more than 50 x86 cores on a chip” Knights Corner (Xeon Phi). The argument for the chip is very much programmability: “just a large x86 box that runs Linux”. But I guess you do need special compilers or libraries to make use of the big, somewhat Cray-like vector unit (a 512-bit SIMD unit) that each core has been equipped with. At the very least, special optimization will be needed to make the best use of the chip, just like you always need to do when performance matters.

The UltraSparc T5 presented by Rich Hetherington from Oracle fell somewhere in between. It has 16 identical cores, but can tweak how it uses the SMT threading to make a core run a certain serial task faster than it otherwise would. This is a step towards the kind of heterogeneous performance in a single ISA that ARM is going after with their Cortex-A15/A7 big.LITTLE approach, but without the same span in performance, and also with less impact on the overall flexibility of the chip. The T5 also removed the special crypto accelerator hardware that used to be there, instead adding a few crypto instructions to the ISA.

The reason they moved crypto from an accelerator into the ISA was that it turned out to be costly to use a separate hardware unit for small pieces of data. There is OS overhead in invoking an accelerator, so it takes a decent-size buffer of data to make the call worthwhile. With instructions in the ISA, you can work on a single word and still get performance gains. User-level software also has a far easier time accessing it, as the instructions are just part of the regular instruction stream. Interestingly, ARM (as presented by Stephen Hill) had done the exact same thing for crypto, for the same reason. This is an important point for hardware accelerators in general: the driver overhead has to be managed, sometimes by mapping hardware straight to individual programs (I made a simple experiment a few years ago that showed this nicely). On the other hand, everything put into the ISA risks making the entire processor a bit slower and more power-hungry, so general ISA extensions have to be made with great care. A hardware accelerator can be removed from a particular SoC if it turns out not to be needed; that is not so easy with ISA components.
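The trade-off can be sketched as a simple cost model: the accelerator pays a fixed invocation cost and then processes data quickly, while ISA instructions have no setup cost but a higher per-byte cost. All the numbers below are hypothetical placeholders chosen only to show the shape of the curve; they are not measurements from Oracle or ARM.

```python
import math
from typing import Optional


def accelerator_time_ns(n_bytes: int, setup_ns: float, ns_per_byte: float) -> float:
    """Fixed OS/driver invocation cost plus streaming cost in the accelerator."""
    return setup_ns + n_bytes * ns_per_byte


def isa_time_ns(n_bytes: int, ns_per_byte: float) -> float:
    """Crypto instructions in the normal instruction stream: no setup cost."""
    return n_bytes * ns_per_byte


def break_even_bytes(setup_ns: float, accel_ns_per_byte: float,
                     isa_ns_per_byte: float) -> Optional[int]:
    """Smallest buffer size at which the accelerator is at least as fast.

    Returns None if the accelerator never catches up.
    """
    if isa_ns_per_byte <= accel_ns_per_byte:
        return None
    return math.ceil(setup_ns / (isa_ns_per_byte - accel_ns_per_byte))


if __name__ == "__main__":
    SETUP_NS = 2000.0   # assumed cost of a system call plus driver work
    ACCEL_NS_B = 0.25   # assumed accelerator throughput
    ISA_NS_B = 1.0      # assumed throughput using ISA crypto instructions
    for size in (8, 256, 4096, 65536):
        print(f"{size:6d} bytes: accelerator "
              f"{accelerator_time_ns(size, SETUP_NS, ACCEL_NS_B):9.1f} ns, "
              f"ISA {isa_time_ns(size, ISA_NS_B):9.1f} ns")
    print("break-even at", break_even_bytes(SETUP_NS, ACCEL_NS_B, ISA_NS_B), "bytes")
```

Under these assumed numbers the accelerator only wins for buffers of a few kilobytes or more, which is the effect described above: for single words and small packets, the fixed overhead dominates.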

Stephen Hill from ARM clearly believed in heterogeneity, with four types of processing on a typical chip:

  • Big core (ARM Cortex-A15 today)
  • Little core (ARM Cortex-A7), to create the kind of big.LITTLE setup that allows for a bigger span of power-performance settings.
  • GPU (from ARM, that means Mali T604 today) – they clearly see that GPGPU is moving into the mobile space very quickly, doing the same kind of work that it has done on the desktop, and with the same effect of reducing the need for general processor cores.
  • Special-purpose accelerators – except when merged into the ISA, as noted above.

In researching some of the material from James Larus’ talk, I also came across an interesting talk from Surge 2011 where Artur Bergman from fastly.com tells how they have optimized their content delivery network by relying only on plain processors and not using any network processing offload, router ASICs, etc. Such hardware is too hard to use, it is too easy to make errors and have the software crash, and “Xeons are simply faster”. Note that the word “energy” is never mentioned in his talk.

Software Needs to be 100x Better

The software perspective was presented by James Larus from Microsoft Research. His talk made many interesting points, but I think the main ones were:

We are not even trying to make efficient systems today, throwing away billions of clock cycles on pure overhead. Example: IBM had investigated the conversion of a SOAP (text) date to a Java date object in the IBM Trade benchmark. It took 268 function calls and 70 object allocations. There is great modularity and nothing obviously wrong in the code. About 20% of memory is used to hold actual data; the rest is hash tables, object management overhead, etc. In general, objects are small and waste is large. Great for programmers, bad for machines. We could and should find ways to do better in programming than this, and we need to find a way to make performance an abstraction we can work with.

Languages should be as efficient as they can be. Today, common runtimes like PHP, Python, and Ruby are very far from optimal. They work “well enough”, but it should be pretty easy to make them 10 to 100 times more efficient with known compiler techniques. This should be done in addition to parallelism and distribution; it is almost criminal to leave that much performance on the table when it is so easy to get. Positive example: the 100x performance improvement for JavaScript in recent years shows what can be done once it becomes important enough.

Note that Larus is not advocating going back to assembly language – there is far too much value in programmer productivity – but just that we remove unnecessary waste from our systems while advancing the state of the art in programming languages.
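The "objects are small and waste is large" point is easy to reproduce in miniature. The sketch below is my own illustration in Python, not the Java measurement from the IBM study: it compares the four bytes of information in a calendar date with a rough estimate of what a convenient dictionary representation occupies. The byte counts come from `sys.getsizeof` and are specific to CPython and its version; the helper name `approx_footprint` is made up for this example.

```python
import struct
import sys


def approx_footprint(date: dict) -> int:
    """Rough byte count for a dict and the key/value objects it refers to.

    CPython-specific, and an overestimate of marginal cost, since small
    integers and interned strings are shared between objects.
    """
    return sys.getsizeof(date) + sum(
        sys.getsizeof(key) + sys.getsizeof(value) for key, value in date.items()
    )


# The convenient representation: readable and flexible.
date_obj = {"year": 2012, "month": 9, "day": 16}

# The information content: 2 bytes for the year, 1 each for month and day.
packed = struct.pack("<HBB", 2012, 9, 16)

payload = len(packed)
total = approx_footprint(date_obj)
print(f"payload {payload} bytes, object graph roughly {total} bytes, "
      f"useful fraction {payload / total:.1%}")
```

The exact ratio does not matter much. What matters is that the useful fraction comes out far below even the 20% quoted above, and nothing in the convenient version looks wrong when you read it.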

Distributed systems are the new norm. Why don’t we teach it? All programmers need to understand how to build systems from many separate parts, and in particular the impact of IO and network traffic on software performance. Distribution is not free either.

For an example of how bad things can be, he brought up a nice introspective talk from Surge 2011, about the Etsy website:

  • The original talk on Youtube
  • ArsTechnica coverage
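As a back-of-the-envelope illustration of the point about IO and network traffic, here is a small model of a service that makes many small requests to another machine. The round-trip time and per-request work are assumed values picked for the example, not figures from any of the talks.

```python
def total_time_ms(requests: int, rtt_ms: float, work_ms_per_request: float,
                  batch_size: int = 1) -> float:
    """Total time when each batch of requests costs one network round trip."""
    round_trips = -(-requests // batch_size)  # ceiling division
    return round_trips * rtt_ms + requests * work_ms_per_request


if __name__ == "__main__":
    REQUESTS = 1000
    RTT_MS = 0.5    # assumed round trip inside a data center
    WORK_MS = 0.01  # assumed useful work per request
    for batch in (1, 10, 100):
        elapsed = total_time_ms(REQUESTS, RTT_MS, WORK_MS, batch)
        print(f"batch size {batch:3d}: {elapsed:6.1f} ms")
```

With these assumptions the unbatched version spends about 98% of its time waiting on the network, which is exactly the kind of cost that stays invisible if you only think about the code running on one machine.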

So, that’s my summary of an interesting day.

Posted in: computer architecture, conferences, embedded software, multicore computer architecture, multicore debug, multicore software, parallel computing, programming / Tagged: Erik Hagersten, heterogeneous, homogeneous, James Larus, Rich Hetherington, SiCS Multicore days, Stephen Hill

4 Thoughts on “SiCS Multicore Day 2012”

  1. zero_energy on 2012 September 17 at 10:13 said:

    Good presentations.
    Some comments:
    “Great for programmers, bad for machines.”. I thought the times when each and every bit was essential/precious are gone ( not unless that bit represents your alive status 0 dead 1 alive or viceversa ). DDR3 RAM is so cheap and abundant today and is mainly idle. After all you can only optimize so much and you can not be overly obsessed.
    C evolved from the writing notes aside columns of assembler code.
    New features do take up memory and processor cycles and this is how it will always be.
    You should buy new hardware for your new software now and then.

  2. Jakob on 2012 September 17 at 10:49 said:

    zero_energy :Good presentations.
    Some comments:
    “Great for programmers, bad for machines.”. I thought the times when each and every bit was essential/precious are gone [..]. DDR3 RAM is so cheap and abundant today and is mainly idle. After all you can only optimize so much and you can not be overly obsessed.

    Not quite – memory capacity might be pretty cheap (until you try to scale up like crazy and hit the limit of how much RAM a machine can take). But memory bandwidth is not, and it is a continuous bottleneck that slows down software unnecessarily. Machines are fast today, but not arbitrarily fast.

  3. zero_energy on 2012 September 17 at 16:01 said:

    “… and hit the limit of how much RAM a machine can take).”
    You can always use the huge amount of RAM as a virtual disk if the processor can not address all RAM.

  4. zero_energy on 2012 September 19 at 16:55 said:

    Chatting about memory, check this out:
    http://phys.org/news/2012-09-samsung-mass-industry-highest-density.html
    I can see memory capacity only increases. This will allow you to shoot many hours of FullHD crisp videos with your new digital camera.
    And the next year model will probably have double capacity for the same money.

© Copyright 2013 - Observations from Uppsala