DRAMsys is a simulator for modern RAM systems, built by researchers at Fraunhofer IESE and the Technische Universität Kaiserslautern. Over the past few years, I have heard several talks about the tool and also had the luck to talk a bit to the team behind it. It is an interesting piece of simulation technology, in particular for how it manages to build a truly cycle-accurate model on top of the approximately-timed (AT) style defined SystemC TLM-2.0.Continue reading “DRAMsys – Cycle-Accurate Simulation using Transactions”
Back when I was a PhD student working on worst-case execution-time (WCET) analysis, one of the leading groups researching the topic was the “Saarbrücken gang” led by Professor Reinhard Wilhelm. Last year, Professor Wilhelm published a retrospective look on their work on WCET in the Communications of the ACM. It is a really interesting history write-up from the perspective of the Saarbrücken group.Continue reading “Professor Reinhard Wilhelm on the History of WCET Analysis”
Using USB-C to charge a laptop while simultaneously providing display and other IO traffic sounds a little bit too good to be true in practice. Maybe it would work for a set of devices from a single manufacturer (like a Thunderbolt-based USB-C-attached dock from the same vendor as a laptop). However, recently I was surprised (in a good way) when it turned out that I had accidentally got myself a USB-C-based single-cable-to-the-laptop setup. USB-C promises a lot, in this case it delivers perfectly, but what bothers me is the fact that there is really no way I could have figured this out ahead of time.Continue reading “USB-C Works, but how would you Know?”
The July 2020 edition of the Communications of the ACM (CACM) had a front-page theme of “Domains-Specific Hardware Accelerators”, or DSAs. It contained two articles about the subject, one about an academic genomics accelerator, and one about the Google TPU. Hardware accelerators dedicated to particular types of computation are basically everywhere today, and an accepted part of the evolution of computers. The CACM articles have some good tidbits and points about how accelerators are designed and used today. At the same time, I also found a youtube talk about the first hardware accelerator, the IBM Stretch HARVEST, showing both contrasts with today as well as a remarkable continuity in concept.Continue reading “CACM on DSAs”
I have been working with computer simulation and computer architecture for more than 20 years, and one thing that has been remarkably stable over time is the simulation slowdown inherent in “cycle accurate” computer simulation. Regardless of who I talked to or what they were modeling, the simulators ran at around 100 thousand times slower than the machine being modeled. It even holds true going back to the 1960s! However, there is a variant of simulation that aims to make useful performance predictions while running around 10x faster (or more) – mechanistic models (in particular, the Sniper simulator).Continue reading “Simulating Computer Architecture with “Mechanistic” Models – No more 100k Slowdown?”
A while ago, Ars Technica reviewed the Mega Sg, a modern clone of the old Sega Genesis gaming system. I stumbled on this review recently and realized that this is a fascinating piece of hardware. The Mega Sg is produced by a company called Analogue (https://www.analogue.co/), presumably named thus because they create analogues to old gaming consoles. The way this is done is different from most current “revive the old consoles” products that simply use software emulation to run old games. Instead, Analogue seems to have settled on using FPGA (Field-Programmable Gate Array) technology to basically build new hardware that is functionally equivalent to the old console hardware.Continue reading “Using FPGAs to Simulate old Game Consoles”
Earlier in July 2019, I had the honor of presenting one of the keynote talks at the 19th SAMOS (International Conference on Embedded Computer Systems: Architectures, MOdeling, and Simulation) conference, held on the island of Samos in Greece. When I got the invite, I had no real idea what to expect. I asked around a bit and people said it was a good conference with a rather special vibe. I think that is a very good description of the conference: a special vibe. In addition to the usual papers and sessions, there is a strong focus on community and social events, fostering discussion across academic disciplines and between industry and academia. There were many really great discussions in addition to the paper and keynote presentations, and overall it was one of the most interesting conferences I have been to in recent years.Continue reading “SAMOS 2019 – Insights, Mechanisms, Heterogeneity, and more”
I heard about the DOOM Game Engine Black Book by Fabien Sanglard on the Hanselminutes podcast episode 666, and immediately ordered the book. It was a riveting read – at least for someone who likes technology and computer history like I do. The book walks through how the ID Software classic DOOM game from 1993 works and the tricks and techniques used to get sufficient performance out of the hardware of 1993. As background to how the software was written, the book contains a great description of the hardware design of IBM-compatible PCs, gaming consoles, and NeXT machines circa 1992-1994. It covers software design, game design, marketing, and how ID Software worked.Continue reading “DOOM Black Book – This is Brilliant!”
There are some things in computing that seem “obviously” true and that “clearly” make it “impossible” to do some things. One example of this is the idea that you cannot go backwards in time from the current state of a program or computer system and recover previous state by just reversing the semantics of the instructions in the program. In particular, that you cannot take a core dump from a failed system and reverse-execute back from it – how could you? In order to do reverse debugging and reverse execution, you “have to” record the state at the first point in time that you want to be able to go back to, and then record all changes to the state. Turns out I was wrong, as shown by a recent Usenix OSDI paper.Continue reading “Microsoft REPT: You CAN Reverse from a Core Dump!”
There have been quite a few security exploits and covert channels based on timing measurements in recent years. Some examples include Spectre and Meltdown, Etienne Martineau’s technique from Def Con 23, the technique by Maurice et al from NDSS 2017, and attacks on crypto algorithms by observing the timing of execution. There are many more examples, and it is clear that measuring time, in particular in order to tell cache hits and cache misses apart, is a very useful primitive. Thus, it seems to make sense to make it harder for software to measure time, by reducing the precision of or adding jitter to timing sources. But it seems such attempts are rather useless in practice.
[Updated 2018-01-29 with a note on ARC SEM110-120 processors]
Continue reading “Timing Measurements and Security”
The introduction of non-volatile memory that is accessed and addressed like traditional RAM instead of using a special interface has some rather interesting effects on software. It blurs the traditional line between persistent long-term mass storage and volatile memory. On the surface, it sounds pretty simple: you can keep things living in RAM-like memory across reboots and shutdowns of a system. Suddenly, there is no need to reload things into RAM for execution following a reboot. Every piece of data and code can be kept immediately accessible in the memory that the processor uses. A computer could in principle just get rid of the whole disk/memory split and just get a single huge magic pool of storage that makes life easier. No file system, no complications, easy programmer life. Or is it that simple?
In a previous Intel blog post “Question: Does Software Actually Use New Instruction Sets?” I looked at the kinds of instructions used by few different Linux setups, and how each setup was affected by changing the type of the processor it was running on (comparing Nehalem to Skylake). As a follow-up to that post, I have now done the same for Microsoft* Windows* 10. In the blog post, I take a look at how Windows 10 behaves across processor generations, and how its behavior compares to Ubuntu* 16 (they are actually pretty similar in philosophy).
Over time, Intel and other processor core designers add more and more instructions to the cores in our machines. A good question is how quickly and easily new instructions added to an Instruction-Set Architecture (ISA) actually gets employed by software to improve performance and add new capabilities. Considering that our operating systems and programs are generally backwards-compatible, and run on all kind of hardware, can they actually take advantage of new instructions?
I once wrote a blog post about the use of computer architecture pipeline simulation in the IBM ”Stretch” project, which seems to be the first use of computer architecture simulation to design a processor. After the ”Stretch” machine, IBM released the S/360 family in 1964. Then, the Control Data Corporation showed up with their CDC 6600 supercomputer, and IBM started a number of projects to design a competitive high-end computer for the high-performance computing market. One of them, Project Y, became the IBM Advanced Computing Systems project (ACS). In the ACS project, simulation was used to document, evaluate, and validate the very aggressive design. There are some nuggets about the simulator strewn across historical articles about the ACS, as well as an actual technical report from 1966 that I found online describing the simulation technology! Thus, it is possible to take a bit of a deeper look at computer architecture simulation from the mid-1960s.
I have posted a two-part blog post to the public Intel Developer Zone blog, about the “Small Batches Principle” and how simulation helps us achieve it for complicated hardware-software systems. I found the idea of the “small batch” a very good way to frame my thinking about what it is that simulation really brings to system development. The key idea I want to get at is this:
[…] the small batches principle: it is better to do work in small batches than big leaps. Small batches permit us to deliver results faster, with higher quality and less stress.
I had many interesting conversations at the HiPEAC 2017 conference in Stockholm back in January 2017. One topic that came up several times was the GEM5 research simulator, and some cool tricks implemented in it in order to speed up the execution of computer architecture experiments. Later, I located some research papers explaining the “full speed ahead” technology in more detail. The mix of fast simulation using virtualization and clever tricks with cache warming is worth a blog post.
Intel CoFluent Technology is a simulation and modeling tool that can be used for a wide variety of different systems and different levels of scale – from the micro-architecture of a hardware accelerator, all the way up to clustered networked big data systems. On the Intel Evangelist blog on the Intel Developer Zone, I have a write-up on how CoFluent is being used to do model just that: Big Data systems. I found the topic rather fascinating, how you can actually make good predictions for systems at that scale – without delving into details. At some point, I guess systems become big enough that you can start to make accurate predictions thanks to how things kind of smooth out when they become large enough.
The SiCS Multicore Day took place last week, for the tenth year in a row! It is still a very good event to learn about multicore and computer architecture, and meet with a broad selection of industry and academic people interested in multicore in various ways. While multicore is not bright shiny new thing it once was, it is still an exciting area of research – even if much of the innovation is moving away from the traditional field of making a bunch of processor cores work together, towards system-level optimizations. For the past few years, SiCS has had to good taste to publish all the lectures online, so you can go to their Youtube playlist and see all the talks for free, right now!
Once upon a time, when multicore processors were novelties, multicore was motivated by the simple fact that it was impossible to keep raising the clock frequency of processors. More “clocks” simply would result in an overheated mess. Instead, by adding more cores, much more performance could be obtained without having to go to extreme frequencies and power budgets. The first multicore processors pretty much kept clock frequencies of the single-core processors preceding them, and that has remained the mainstream fact until today. Desktop and laptop processors tend to stay at 4 cores or less. But when you go beyond 4 cores, clock frequencies tend to start to go down in order to keep power consumption per package under control. A nice example of this can be found in Intel’s Xeon lineup.
Continue reading “Clocks or Cores? Choose One”
The Tech Report podcast did an interview with David Kanter earlier this year. David Kanter is an industry veteran who runs http://www.realworldtech.com/, as well as being a regular contributor to the Microprocessor Report. I have read the MPR since my PhD days, and it is still one of the best places for information on new chips. The subject of the podcast episode was an analysis of the somewhat mysterious Softmachines “VISC” processor architecture. However, it rather turned into a very good discussion on how you do performance predictions, performance projections of competing systems, and the nature of benchmarking and benchmark numbers.
IEEE Micro published an article called “Architectural Simulators Considered Harmful”, by Nowatski et al, in the November-December 2015 issue. It is a harsh critique of how computer architecture research is performed today, and its uninformed overreliance on architectural simulators. I have to say I mostly agree with what they say. The article follows in a good tradition of articles from the University of Wisconsin-Madison of critiquing how computer architecture research is performed, and I definitely applaud this type of critique.
I have read some recent IBM articles about the POWER8 processor and its hardware debug and trace facilities. They are very impressive, and quite interesting to compare to what is usually found in the embedded world. Instead of being designed to help with software debug, it seems the hardware mechanisms in the Power8 are rather focused on silicon bringup and performance analysis and verification in IBM’s own labs. As well as supporting virtual machines and JIT-based systems!
In a blog post at Wind River, I describe how the Wind River Helix Lab Cloud system can be used to communicate hardware design to software developers. The idea is that you upload a virtual platform to the cloud-based system, and then share it to the software developers. In this way, there is no need to install or build a virtual platform locally, and the sender has perfect control over access and updates. It is a realization of the hardware communication principles I presented in an earlier blog post on use cases for Lab Cloud.
But the past part is that the targets I talk about in the blog post and use in the video are available for anyone! Just register on Lab Cloud, and you can try your own threaded software and check how it scales on a simulated 8-core ARM!
In a dusty bookshelf at work I found an ancient tome of wisdom, long abandoned by its previous owner. I was pointed to it by a fellow explorer of the dark arts of computer system design as something that you really should read. The book was “Fortress Rochester”, written by Frank Soltis, and published in 2001.
I just read an article from IEEE Annals of Computing history about the COMIC Color-Matching Analog computer built and sold by Davidson and Hemmendinger, a US firm. It seems the computer is pretty well known inside the colorant industry, actually, and it provides an interesting example of how to do a good-enough solution to break open the market – while leaving the user in control of the process to build faith in the approach.
When mobile phones first appeared, they were powered by very simple cores like the venerable ARM7 and later the ARM9. Low clock frequencies, zero microarchitectural sophistication, sufficient for the job. In recent years, as smartphones have come into their own as the most important computing device for most people, the processor performance of mobile phones have increased tremendously. Today, cutting-edge phones and tablets contain four or eight cores, running at clock frequencies well above 2 gigahertz. The performance race for most of the market (more about that in a moment) was mostly about pushing higher clock frequencies and more cores, even while microarchitecture was left comparatively simple. Mobile meant “fairly simple”, and IPC was nowhere near what you would get with a typical Intel processor for a laptop or desktop.
Today, that seems to be changing, as the Nvidia Denver core and Apple’s Cyclone core both go the route of a few fat cores rather than many thin cores.
I just found and read an old text in the computer systems field, “Why Do Computers Fail and What Can Be Done About It?” , written by Jim Gray at Tandem Computers in 1985. It is a really nice overview of the issues that Tandem had encountered in their customer based, back in the early 1980s. The report is really a classic in the computer systems field, but I did not read it until now. Tandem was an early manufacturer of explicitly fault tolerant and highly reliable and available computers. In this technical report Jim Gray describes the basic principles of fault tolerance, and what kinds of faults happen in the field and that need to be tolerated.
At the ISCA 2014 conference (the biggest event in computer architecture research), a group of researchers from Microsoft Research presented a paper on their Catapult system. The full title of the paper is “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services“, and it is about using FPGAs to accelerate search engine queries at datacenter scale. It has 23 authors, which is probably the most I have ever seen on an interesting paper. There are many things to be learnt from and discussed about this paper, and here are my thoughts on it.
The Mill is a new general-purpose high-performance processor design from out-of-the-box computing (http://ootbcomp.com/). They claim to beat typical high-end out-of-order (OOO) designs like the Intel Haswell generation by crazy factors, such as being 2.3x faster while using 2.3x less power compared to a Haswell. All the while costing less. Ignoring the cost aspect, the power and performance numbers are truly impressive – especially for general code. How can they do something so much better than what we have today? For general code? That requires some serious innovation. With that perspective, I ask myself where the Mill is really significantly different from what we have seen before.
Apple just released their new iPhone 5s, where the biggest news is really the 64-bit processor core inside the new A7 SoC. Sixty four bits in a phone is a first, and it immediately raises the old question of just what 64 bits gives you. We saw this when AMD launched the Opteron and 64-bit x86 PC computing back in the early 2000’s, and in a less public market the same question was asked as 64-bit MIPS took huge chunks out of the networking processor market in the mid-2000s. It was never questioned in servers, however.