Probably thanks to the yearly Mobile World Congress, there have been a slew of recent announcements of mobile application processors recently. Everything is ARM-based, but show quite some variety in the CPU core configurations used. Indeed, I think this variety has something to say on the general state of multicore.
Category Archives: Computer Architecture
Two Cores, Four Cores, Eight Cores – Mobile Variety
Does ISA Matter for Performance?
When I grew up with computers, the big RISC vs CISC debate was raging. At the time, in the late 1980s, it did indeed seem that RISC was inherently superior to CISC. SPARCs, MIPS, and Alpha all outpaced boring old x86, VAX and 68000 processors. This turned out to be a historical parenthesis, as the Pentium Pro from Intel showed how RISC-style performance could be mated to a CISC ISA. However, maybe ISAs still do matter.
“Eagle” Cycle-Accurate Simulator Anno 1979
I recently read the classic book The Soul of a New Machine by Tracy Kidder. Even though it describes the project to build a machine that was launched more than 30 years ago, the story is still fresh and familiar. Corporate intrigue, managing difficult people, clever engineering, high pressure, all familiar ingredients in computing today just as it was back then. With my interesting in computer history and simulation, I was delighted to actually find a simulator in the story too! It was a cycle-accurate simulator of the design, programmed in 1979.
Some Fun Cache Results from Carbon
Carbon Design Systems have been on a veritable blogging spree recently, pushing out a large number of posts around various topics. Maybe a bit brief for my taste in most cases (I have a tendency to throw out 1000+ word pseudo-articles when I take the time to write a blog), but sometimes very interesting nevertheless. I particularly liked a few posts on cache analysis, as they presented some good insight into not-quite-expected processor and cache behaviors.
IBM Mainframe: Parallelism as Patch
When IBM moved their mainframe systems (the S/360 family that is today called System Z) from BiCMOS to mainstream CMOS in 1994, the net result was a severe loss in clock frequency and thus single-processor performance. Still, the move had to be done, since CMOS would scale much better into the future. As a result, IBM introduced additional parallelism to the system in order to maintain performance parity. Parallelism as a patch, essentially.
Carbon “Swap ‘n’ Play” – A New Implementation of an Old Idea
Carbon Design Systems have been quite busy lately with a flurry of blog posts about various aspects of virtual prototype technology. Mostly good stuff, and I tend to agree with their push that a good approach is to mix fast timing-simplified models with RTL-derived cycle-accurate models. There are exceptions to this, in particular exploratoty architecture and design where AT-style models are needed. Recently, they posted about their new Swap ‘n’ Play technology, which is a old proven idea that has now been reimplemented using ARM fast simulators and Carbon-generated ARM processor models.
David Ungar: It is Good to be Wrong
I was recently pointed to a 2011 SPLASH presentation by David Ungar, an IBM researcher working on parallel programming for manycore systems. In particular, in a project called Renaissance, run together with the Vrije Universiteit Brussels in Belgium (VUB) and Portland State University in the US. The title of the presentation is “Everything You Know (about Parallel Programming) Is Wrong! A Wild Screed about the Future“, and it has provoked some discussion among people I know about just how wrong is wrong.
Fujitsu Server Fault Injection Robot
Fault Injection is a topic that has fascinated me for a long time. Not just the area of software-to-software fault injection, but more so how you inject faults into hardware using hardware (and how to conveniently approximate this using a simulator). I just stumbled on a short interesting note about such hardware-actuated fault injection in a Fujitsu article.
GPGPU for Instruction-Set Simulation – Maybe, Maybe not
I just read a quite interesting article by Christian Pinto et al, “GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms“, published at the CCGRID 2011 conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed the simulation. Intriguing concept, but the execution is not without its flaws and it is unclear at least from the paper just how well this generalizes, scales, or compares to parallel simulation on a general-purpose multicore machine.
Nvidia “Kal-El” Variable SMP
Nvidia recently announced that their already-known “Kal-El” quad-core ARM Cortex-A9 SoC actually contains five processor cores, not just four as a “normal” quad-core would. They call the architecture “Variable SMP”, and it is a pretty smart design. The one where you think, “I should have thought of that”, which is the best sign of something truly good.
IBM i – I’m Impressed
From what little I had heard and read, the IBM AS/400 (later known as iSeries, and now known as simply IBM i) sounded like a fascinating system. I knew that it had a rich OS stack that contained most of the services a program needs, and a JVM-style byte code format for applications that let it change from custom processors to Power Architecture without impacting users at all. It was supposedly business-critical and IBM-quality rock solid. But that was about it.
So when Software Engineering Radio episode 177 interviewed the i chief architect Steve Will, I was hooked. It turned out that IBM i was cooler than I imagined. Here are my notes on why I think that IBM i is one of the most interesting systems out there in real use.
Read More →
SecurityNow on Randomness
Episodes 299 and 301 of the SecurityNow podcast deal with the problem of how to get randomness out of a computer. As usual, Steve Gibson does a good job of explaining things, but I felt that there was some more that needed to be said about computers and randomness, as well as the related ideas of predictability, observability, repeatability, and determinism. I have worked and wrangled with these concepts for almost 15 years now, from my research into timing prediction for embedded processors to my current work with the repeatable and reversible Simics simulator.
Evaluating HAVEGE Randomness
I previously blogged about the HAVEGE algorithm that is billed as extracting randomness from microarchitectural variations in modern processors. Since it was supposed to rely on hardware timing variations, I wondered what would happen if I ran it on Simics that does not model the processor pipeline, caches, and branch predictor. Wouldn’t that make the randomness of HAVEGE go away?
Execution Time is Random, How Useful
When I was working on my PhD in WCET – Worst-Case Execution Time analysis - our goal was to utterly precisely predict the precise number of cycles that a processor would take to execute a certain piece of code. We and other groups designed analyses for caches, pipelines, even branch predictors, and ways to take into account information about program flow and variable values.
The complexity of modern processors – even a decade ago – was such that predictability was very difficult to achieve in practice. We used to joke that a complex enough processor would be like a random number generator.
Funnily enough, it turns out that someone has been using processors just like that. Guess that proves the point, in some way.
Modeling Endianness
Endianness is a topic in computer architecture that can give anyone a headache trying to understand exactly what is happening and why. In the field of computer simulation, it is a pervasive problem that takes some thinking to solve in an efficient, composable, and portable way.
This blog post describes how I am used to working with endianness in virtual platforms, and why this approach makes sense to me. There are other ways of dealing with endianness, with different trade-offs and overriding goals.
Read More →
Wind River Blog: Interview with a Virtualization Researcher
Past Friday, I posted a new blog post in my Wind River blog. It is an interview the PhD student Girish Venkatasubramanian from the University of Florida. He is doing research on virtual machines/hypervisors and how they can be implemented more efficiently by making fairly small changes to the architecture of memory management units.
Microsoft + ARM = ARM64?
Pipeline Performance Simulator Anno 1960
I have just found what almost has to be the first cycle-accurate computer simulator in history. According to the article “Stretch-ing is Great Exercise — It Gets You in Shape to Win” by Frederick Brooks (the man behind the Mythical Man-Month) in the January-March 2010 issue of IEEE Annals of the History of Computing, IBM created a simulator of the pipeline for the IBM 7030 “Stretch” computer developed from 1956 to 1961 (photo from IBM.com).
Power Architecture Rip Van Winkle
For some reason (I guess it is the job…) I was browsing through the Power ISA version 2.06 specification last week and hit the following gem of an instruction: “rvwinkle“. It is named after a short story I had never heard about, but which apparently is sufficiently well-known in the US literary canon to warrant a sleep mode being named after it.
Read More →
GPGPU – a new type of DSP?
My post on SiCS multicore, as well as the SiCS multicore day itself, put a renewed spotlight on the GPGPU phenomenon. I have been following this at a distance, since it does not feel very applicable to neither my job of running Simics, nor do I see such processors appear in any customer applications. Still, I think it is worth thinking about what a GPGPU really is, at a high level.
Cavium Octeon II: Short Notes
About two months ago, Cavium Networks launched their second generation of Octeon chips, the Octeon II. The most obvious difference to the previous generation (Octeon, Octeon Plus) is a new MIPS64 core with much better support for hypervisors and virtualization. There are some other interesting aspects to this chip, though.
When does Hardware Acceleration make Sense in Networking?
Yes, when does hardware acceleration make sense in networking? Hardware acceleration in the common sense of “TCP offload”. This question was answered by a very nicely reasoned “no” in an article by Mike Odell in ACM Queue called “Network Front-End Processors, Yet Again“.
Cool Obscure Hardware: Sun SCC and Software License Protection
In a very roundabout way, I recently got to hear about a cool Sun server feature introduced sometime back in 2003 or 2004: the SCC System Configuration Card. This is a smart card that stores the system hostid and Ethernet MACs, along with other info, and which can be transferred from one server to another.
Kunle Olukotun Interview: Heterogeneity, Domain-Specific Programming
The Radio Register has a nice interview with Kunle Olukotun, the man most known for the Afara/Sun Niagara/UltraSparc T1-2-etc. design. It is a long interview, lasting well over an hour, but it is worth a listen. A particular high point is the story on how Kunle worked on parallel processors in the mid-1990s when everyone else was still chasing single-thread performance. He really was a very early proponent of multicore, and saw it coming a bit before most other (general-purpose) computer architects did. Currently, he is working on how to program multiprocessors, at the Stanford Pervasive Parallelism Laboratory (PPL). In the interview, I see several themes that I have blogged about before being reinforced…
Once upon a time, all programming was bare metal programming. You coded to the processor core, you took care of memory, and no operating system got in your way. Over time, as computer programmers, users, and designers got more sophisticated and as more clock cycles and memory bytes became available, more and more layers were added between the programmer and the computer. However, I have recently spotted what might seem like a trend away from ever-thicker software stacks, in the interest of performance and, in particular, latency.
When I started out doing computer science “for real” way back, the emphasis and a lot of the fun was in the basics of algorithms, optimizing code, getting complex trees and sorts and hashes right an efficient. It was very much about computing defined as processor and memory (with maybe a bit of disk or printing or user interface accessed at a very high level, and providing the data for the interesting stuff). However, as time has gone on, I have come to feel that this is almost too clean, too easy to abstract… and gone back to where I started in my first home computer, programming close to the metal.
DNS: Hardware Accelerator Time!
Read More →