I once wrote a blog post about the use of computer architecture pipeline simulation in the IBM ”Stretch” project, which seems to be the first use of computer architecture simulation to design a processor. After the ”Stretch” machine, IBM released the S/360 family in 1964. Then, the Control Data Corporation showed up with their CDC 6600 supercomputer, and IBM started a number of projects to design a competitive high-end computer for the high-performance computing market. One of them, Project Y, became the IBM Advanced Computing Systems project (ACS). In the ACS project, simulation was used to document, evaluate, and validate the very aggressive design. There are some nuggets about the simulator strewn across historical articles about the ACS, as well as an actual technical report from 1966 that I found online describing the simulation technology! Thus, it is possible to take a bit of a deeper look at computer architecture simulation from the mid-1960s.
I have posted a two-part blog post to the public Intel Developer Zone blog, about the “Small Batches Principle” and how simulation helps us achieve it for complicated hardware-software systems. I found the idea of the “small batch” a very good way to frame my thinking about what it is that simulation really brings to system development. The key idea I want to get at is this:
[…] the small batches principle: it is better to do work in small batches than big leaps. Small batches permit us to deliver results faster, with higher quality and less stress.
I had many interesting conversations at the HiPEAC 2017 conference in Stockholm back in January 2017. One topic that came up several times was the GEM5 research simulator, and some cool tricks implemented in it in order to speed up the execution of computer architecture experiments. Later, I located some research papers explaining the “full speed ahead” technology in more detail. The mix of fast simulation using virtualization and clever tricks with cache warming is worth a blog post.
Intel CoFluent Technology is a simulation and modeling tool that can be used for a wide variety of different systems and different levels of scale – from the micro-architecture of a hardware accelerator, all the way up to clustered networked big data systems. On the Intel Evangelist blog on the Intel Developer Zone, I have a write-up on how CoFluent is being used to do model just that: Big Data systems. I found the topic rather fascinating, how you can actually make good predictions for systems at that scale – without delving into details. At some point, I guess systems become big enough that you can start to make accurate predictions thanks to how things kind of smooth out when they become large enough.
The SiCS Multicore Day took place last week, for the tenth year in a row! It is still a very good event to learn about multicore and computer architecture, and meet with a broad selection of industry and academic people interested in multicore in various ways. While multicore is not bright shiny new thing it once was, it is still an exciting area of research – even if much of the innovation is moving away from the traditional field of making a bunch of processor cores work together, towards system-level optimizations. For the past few years, SiCS has had to good taste to publish all the lectures online, so you can go to their Youtube playlist and see all the talks for free, right now!
Once upon a time, when multicore processors were novelties, multicore was motivated by the simple fact that it was impossible to keep raising the clock frequency of processors. More “clocks” simply would result in an overheated mess. Instead, by adding more cores, much more performance could be obtained without having to go to extreme frequencies and power budgets. The first multicore processors pretty much kept clock frequencies of the single-core processors preceding them, and that has remained the mainstream fact until today. Desktop and laptop processors tend to stay at 4 cores or less. But when you go beyond 4 cores, clock frequencies tend to start to go down in order to keep power consumption per package under control. A nice example of this can be found in Intel’s Xeon lineup.
Continue reading “Clocks or Cores? Choose One”
The Tech Report podcast did an interview with David Kanter earlier this year. David Kanter is an industry veteran who runs http://www.realworldtech.com/, as well as being a regular contributor to the Microprocessor Report. I have read the MPR since my PhD days, and it is still one of the best places for information on new chips. The subject of the podcast episode was an analysis of the somewhat mysterious Softmachines “VISC” processor architecture. However, it rather turned into a very good discussion on how you do performance predictions, performance projections of competing systems, and the nature of benchmarking and benchmark numbers.
IEEE Micro published an article called “Architectural Simulators Considered Harmful”, by Nowatski et al, in the November-December 2015 issue. It is a harsh critique of how computer architecture research is performed today, and its uninformed overreliance on architectural simulators. I have to say I mostly agree with what they say. The article follows in a good tradition of articles from the University of Wisconsin-Madison of critiquing how computer architecture research is performed, and I definitely applaud this type of critique.
I have read some recent IBM articles about the POWER8 processor and its hardware debug and trace facilities. They are very impressive, and quite interesting to compare to what is usually found in the embedded world. Instead of being designed to help with software debug, it seems the hardware mechanisms in the Power8 are rather focused on silicon bringup and performance analysis and verification in IBM’s own labs. As well as supporting virtual machines and JIT-based systems!
In a blog post at Wind River, I describe how the Wind River Helix Lab Cloud system can be used to communicate hardware design to software developers. The idea is that you upload a virtual platform to the cloud-based system, and then share it to the software developers. In this way, there is no need to install or build a virtual platform locally, and the sender has perfect control over access and updates. It is a realization of the hardware communication principles I presented in an earlier blog post on use cases for Lab Cloud.
But the past part is that the targets I talk about in the blog post and use in the video are available for anyone! Just register on Lab Cloud, and you can try your own threaded software and check how it scales on a simulated 8-core ARM!
In a dusty bookshelf at work I found an ancient tome of wisdom, long abandoned by its previous owner. I was pointed to it by a fellow explorer of the dark arts of computer system design as something that you really should read. The book was “Fortress Rochester”, written by Frank Soltis, and published in 2001.
I just read an article from IEEE Annals of Computing history about the COMIC Color-Matching Analog computer built and sold by Davidson and Hemmendinger, a US firm. It seems the computer is pretty well known inside the colorant industry, actually, and it provides an interesting example of how to do a good-enough solution to break open the market – while leaving the user in control of the process to build faith in the approach.
When mobile phones first appeared, they were powered by very simple cores like the venerable ARM7 and later the ARM9. Low clock frequencies, zero microarchitectural sophistication, sufficient for the job. In recent years, as smartphones have come into their own as the most important computing device for most people, the processor performance of mobile phones have increased tremendously. Today, cutting-edge phones and tablets contain four or eight cores, running at clock frequencies well above 2 gigahertz. The performance race for most of the market (more about that in a moment) was mostly about pushing higher clock frequencies and more cores, even while microarchitecture was left comparatively simple. Mobile meant “fairly simple”, and IPC was nowhere near what you would get with a typical Intel processor for a laptop or desktop.
Today, that seems to be changing, as the Nvidia Denver core and Apple’s Cyclone core both go the route of a few fat cores rather than many thin cores.
I just found and read an old text in the computer systems field, “Why Do Computers Fail and What Can Be Done About It?” , written by Jim Gray at Tandem Computers in 1985. It is a really nice overview of the issues that Tandem had encountered in their customer based, back in the early 1980s. The report is really a classic in the computer systems field, but I did not read it until now. Tandem was an early manufacturer of explicitly fault tolerant and highly reliable and available computers. In this technical report Jim Gray describes the basic principles of fault tolerance, and what kinds of faults happen in the field and that need to be tolerated.
At the ISCA 2014 conference (the biggest event in computer architecture research), a group of researchers from Microsoft Research presented a paper on their Catapult system. The full title of the paper is “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services“, and it is about using FPGAs to accelerate search engine queries at datacenter scale. It has 23 authors, which is probably the most I have ever seen on an interesting paper. There are many things to be learnt from and discussed about this paper, and here are my thoughts on it.
The Mill is a new general-purpose high-performance processor design from out-of-the-box computing (http://ootbcomp.com/). They claim to beat typical high-end out-of-order (OOO) designs like the Intel Haswell generation by crazy factors, such as being 2.3x faster while using 2.3x less power compared to a Haswell. All the while costing less. Ignoring the cost aspect, the power and performance numbers are truly impressive – especially for general code. How can they do something so much better than what we have today? For general code? That requires some serious innovation. With that perspective, I ask myself where the Mill is really significantly different from what we have seen before.
Apple just released their new iPhone 5s, where the biggest news is really the 64-bit processor core inside the new A7 SoC. Sixty four bits in a phone is a first, and it immediately raises the old question of just what 64 bits gives you. We saw this when AMD launched the Opteron and 64-bit x86 PC computing back in the early 2000’s, and in a less public market the same question was asked as 64-bit MIPS took huge chunks out of the networking processor market in the mid-2000s. It was never questioned in servers, however.
Via the EETimes, I found a very interesting talk by Bristol professor David May, presented at the 4th Annual Bristol Multicore Challenge, in June of 2013. The talk can be found as a Youtube movie here, and the slides are available here. The EETimes focused on the idea to cut down ARM to be really RISC, but I think the more interesting part is Professor May’s observations on multicore computing in general, and the case for and against heterogeneity in (parallel) computers.
It is quite interesting to see how Qualcomm has emerged as a major player in the “processor market” and is trying to build themselves into a serious consumer brand. I used to think of them as a company doing modems and other chips that made phones talk wirelessly, known to insiders in the business but not anything a user cared about. Today, however, they are working hard on building themselves into a brand to rival Intel and AMD. At the center of this is their own line of ARM-based application processors, the Snapdragon. I can see some thinking quite similar to the old “Intel Inside” classic, and I would not be surprised to see the box or even body of a phone carrying a Snapdragon logo at some point in the future. A part of this branding exercise is the Snapdragon Batteryguru, an application I recently stumbled on in the Google Play store.
Probably thanks to the yearly Mobile World Congress, there have been a slew of recent announcements of mobile application processors recently. Everything is ARM-based, but show quite some variety in the CPU core configurations used. Indeed, I think this variety has something to say on the general state of multicore.
When I grew up with computers, the big RISC vs CISC debate was raging. At the time, in the late 1980s, it did indeed seem that RISC was inherently superior to CISC. SPARCs, MIPS, and Alpha all outpaced boring old x86, VAX and 68000 processors. This turned out to be a historical parenthesis, as the Pentium Pro from Intel showed how RISC-style performance could be mated to a CISC ISA. However, maybe ISAs still do matter.
The 2012 edition of the SiCS Multicore Day was fun, like they have always been in the past. I missed it in 2010 and 2011, but could make it back this year. It was interesting to see that the points where keynote speakers disagreed was similar to previous years, albeit with some new twists. There was also a trend in architecture, moving crypto operations into the core processor ISA, that indicates another angle on the hardware accelerator space.
I recently read the classic book The Soul of a New Machine by Tracy Kidder. Even though it describes the project to build a machine that was launched more than 30 years ago, the story is still fresh and familiar. Corporate intrigue, managing difficult people, clever engineering, high pressure, all familiar ingredients in computing today just as it was back then. With my interesting in computer history and simulation, I was delighted to actually find a simulator in the story too! It was a cycle-accurate simulator of the design, programmed in 1979.
Carbon Design Systems have been on a veritable blogging spree recently, pushing out a large number of posts around various topics. Maybe a bit brief for my taste in most cases (I have a tendency to throw out 1000+ word pseudo-articles when I take the time to write a blog), but sometimes very interesting nevertheless. I particularly liked a few posts on cache analysis, as they presented some good insight into not-quite-expected processor and cache behaviors.
When IBM moved their mainframe systems (the S/360 family that is today called System Z) from BiCMOS to mainstream CMOS in 1994, the net result was a severe loss in clock frequency and thus single-processor performance. Still, the move had to be done, since CMOS would scale much better into the future. As a result, IBM introduced additional parallelism to the system in order to maintain performance parity. Parallelism as a patch, essentially.
Carbon Design Systems have been quite busy lately with a flurry of blog posts about various aspects of virtual prototype technology. Mostly good stuff, and I tend to agree with their push that a good approach is to mix fast timing-simplified models with RTL-derived cycle-accurate models. There are exceptions to this, in particular exploratoty architecture and design where AT-style models are needed. Recently, they posted about their new Swap ‘n’ Play technology, which is a old proven idea that has now been reimplemented using ARM fast simulators and Carbon-generated ARM processor models.
Once upon a time, all programming was bare metal programming. You coded to the processor core, you took care of memory, and no operating system got in your way. Over time, as computer programmers, users, and designers got more sophisticated and as more clock cycles and memory bytes became available, more and more layers were added between the programmer and the computer. However, I have recently spotted what might seem like a trend away from ever-thicker software stacks, in the interest of performance and, in particular, latency.
I was recently pointed to a 2011 SPLASH presentation by David Ungar, an IBM researcher working on parallel programming for manycore systems. In particular, in a project called Renaissance, run together with the Vrije Universiteit Brussels in Belgium (VUB) and Portland State University in the US. The title of the presentation is “Everything You Know (about Parallel Programming) Is Wrong! A Wild Screed about the Future“, and it has provoked some discussion among people I know about just how wrong is wrong.
Fault Injection is a topic that has fascinated me for a long time. Not just the area of software-to-software fault injection, but more so how you inject faults into hardware using hardware (and how to conveniently approximate this using a simulator). I just stumbled on a short interesting note about such hardware-actuated fault injection in a Fujitsu article.
I just read a quite interesting article by Christian Pinto et al, “GPGPU-Accelerated Parallel and Fast Simulation of Thousand-core Platforms“, published at the CCGRID 2011 conference. It discusses some work in using a GPGPU to run simulations of massively parallel computers, using the parallelism of the GPU to speed the simulation. Intriguing concept, but the execution is not without its flaws and it is unclear at least from the paper just how well this generalizes, scales, or compares to parallel simulation on a general-purpose multicore machine.