There are some things in computing that seem “obviously” true and that “clearly” make it “impossible” to do some things. One example of this is the idea that you cannot go backwards in time from the current state of a program or computer system and recover previous state by just reversing the semantics of the instructions in the program. In particular, that you cannot take a core dump from a failed system and reverse-execute back from it – how could you? In order to do reverse debugging and reverse execution, you “have to” record the state at the first point in time that you want to be able to go back to, and then record all changes to the state. Turns out I was wrong, as shown by a recent Usenix OSDI paper.Continue reading “Microsoft REPT: You CAN Reverse from a Core Dump!”
DVCon Europe took place in München, Bayern, Germany, on October 24 and 25, 2018. Here are some notes from the conference, including both general observations and some details on a few papers that were really quite interesting. This is not intended as an exhaustive replay, just my personal notes on what I found interesting.
I have a documented love for keyboards with RGB lighting. So I was rather annoyed when one of my Corsair K65 keyboards suddenly seemed to lose its entire red color component. The keyboard is supposed to default to all-red color scheme with the WASD and arrow keys highlighted in white when no user is logged in to the machine it is connected to – but all of a sudden, it went all dark except a light-blue color on the “white” keys. I guessed it was just a random misconfiguration, but it turned out to be worse than that.
Recently I stumbled on a nice piece called “Always Measure One Level Deeper” by John Ousterhout, from Communications of the ACM, July 2018. https://cacm.acm.org/magazines/2018/7/229031-always-measure-one-level-deeper/fulltext. The article is about performance analysis, and how important it is to not just look at the top-level numbers and easy-to-see aspects of a system, but to also go (at least) one level deeper to measure the components and subsystems that affect the overall system performance.
I recently asked myself the question of just how many Powerpoint files I had on my work laptop and on my home machines. It turns out that it was pretty easy to figure that out using Windows Powershell, with some commands I found on a random website.
Thanks to a tip from “Derek” on a previous blog post about a replay debugger from 1995, I was made aware of the reverse execution ability that was available in the Borland Turbo Debugger version 3.0 from 1992! This is the oldest commercial instance of “reverse” that I have found (so far), and definitely one of the oldest incarnations of the idea overall. Thanks to Google and the Internet, I managed to find a scanned copy of the manual of the product, which provided some additional information. Note that the debugger only does reverse execution, but not reverse debugging since you cannot run in reverse to stop at a breakpoint.
I just found a story about Undo software that was rather interesting from a strategic perspective. “Patient capital from CIC gives ‘time travelling’ company Undo space to pivot“, from the BusinessWeekly in the UK. The article describes a change from selling to individual developers, towards selling to enterprises. This is an important business change, but it also marks I think a technology thinking shift: from single-session debug to record-replay.
Injecting faults into systems and subjecting them to extreme situations at or beyond their nominal operating conditions is an important part of making sure they keep working even when things go bad. It was realized very early in the history of Simics (and the same observation had been made by other virtual platform and simulator providers) that using a virtual platform makes it much easier to provide cheap, reliable, and repeatable fault injection for software testing. In an Intel Developer Zone (IDZ) blog post, I describe some early cases of fault injection with Simics.
There have been quite a few security exploits and covert channels based on timing measurements in recent years. Some examples include Spectre and Meltdown, Etienne Martineau’s technique from Def Con 23, the technique by Maurice et al from NDSS 2017, and attacks on crypto algorithms by observing the timing of execution. There are many more examples, and it is clear that measuring time, in particular in order to tell cache hits and cache misses apart, is a very useful primitive. Thus, it seems to make sense to make it harder for software to measure time, by reducing the precision of or adding jitter to timing sources. But it seems such attempts are rather useless in practice.
[Updated 2018-01-29 with a note on ARC SEM110-120 processors]
Continue reading “Timing Measurements and Security”
The introduction of non-volatile memory that is accessed and addressed like traditional RAM instead of using a special interface has some rather interesting effects on software. It blurs the traditional line between persistent long-term mass storage and volatile memory. On the surface, it sounds pretty simple: you can keep things living in RAM-like memory across reboots and shutdowns of a system. Suddenly, there is no need to reload things into RAM for execution following a reboot. Every piece of data and code can be kept immediately accessible in the memory that the processor uses. A computer could in principle just get rid of the whole disk/memory split and just get a single huge magic pool of storage that makes life easier. No file system, no complications, easy programmer life. Or is it that simple?
A while ago, I visited my Intel colleagues in Costa Rica and ran a workshop for university teachers and researchers, showing how Simics could be used in academia. I worked with a very smart and talented intern, Jose Fernando Molina, and after a rather long process I have published an interview with him on my Intel blog: https://software.intel.com/en-us/blogs/2017/12/05/windriver-simics-to-inspire-teachers-costarica
Over time, Intel and other processor core designers add more and more instructions to the cores in our machines. A good question is how quickly and easily new instructions added to an Instruction-Set Architecture (ISA) actually gets employed by software to improve performance and add new capabilities. Considering that our operating systems and programs are generally backwards-compatible, and run on all kind of hardware, can they actually take advantage of new instructions?
A blog post from Undo Software informed me that Microsoft has rather quietly released a reverse debugger tool for Windows programs – WinDbg with Time Travel Debug. It is available in the latest preview of WinDbg, as available through the Windows Store, for the most recent Windows 10 versions (Anniversary update or later). According to a CPPcon talk about the tool (Youtube recording of the talk) the technology has a decade-long history internally at Microsoft, but is only now being released to the public after a few years of development. So it is a new old thing 🙂
Intel Software Guard Extensions (SGX) is a pretty cool piece of technology that aims to make it possible for user programs to hide secrets from other user programs and the operating system itself. It establishes enclaves in the system that hides the data being processed and the code processing it from all other software. The original application for SGX was to support client-machine features like DRM, to create a safe space on a client that a server can trust. Recently, the people behind the Signal messaging system have provided a really interesting example of an application that makes use of the of SGX “in reverse”, to make it possible for a client to trust a server.
Integration is hard, that is well-known. For computer chip and system-on-chip design, integration has to be done pre-silicon in order to find integration issues early so that designs can be updated without expensive silicon re-spins. Such integration involves a lot of pieces and many cross-connections, and in order to do integration pre-silicon, we need a virtual platform.
IEEE Spectrum ran a short interview with Thomas Knoll, the creator of Photoshop, who made a very interesting point about the move to subscription-based software rather than one-time buys plus upgrades. His point is that if you are building based sold using the “upgrade model”, developers have to create features that cause users to upgrade. In his opinion, that means you have to focus on flashy features that demo well and catch people’s attention – but that likely do not actually help users in the end.
I have just published a piece about the Intel Excite project on my Software Evangelist blog at the Intel Developer Zone. The Excite project is using a combination of of symbolic execution, fuzzing, and concrete testing to find vulnerabilities in UEFI code, in particular in SMM. By combining symbolic and concrete techniques plus fuzzing, Excite achieves better performance and effect than using either technique alone.
Human Resource Machine, from the Tomorrow Corporation, is a puzzle game that basically boils down to assembly-language programming. It has a very charming graphical style and plenty of entertaining sideshows, hiding the rather dry core game in a way that really works! It is really great fun and challenging – even for myself who has a reasonable amount of experience coding at this low level. It feels a bit like coding in the 1980s, except that all instructions take the same amount of time in the game.
Today, when developing embedded control systems, it is standard practice to test control algorithms against some kind of “world model”, “plant model” or “environment simulator”.
Using a simulated control system or a virtual platform running the actual control system code, connected to the world model lets you test the control system in a completely virtual and simulated environment (see for example my Trinity of Simulation blog post from a few years ago). This practice of simulating the environment for a control computer is long-standing in the aerospace field in particular, and I have found that it goes back at least to the Apollo program.
In the early 1990s, “PC graphics” was almost an oxymoron. If you wanted to do real graphics, you bought a “real machine”, most likely a Silicon Graphics workstation. At the PC price-point, fast hardware-accelerated 3D graphics wasn’t doable… until it suddenly was, thanks to Moore’s law. 3dfx was the first company to create fast 3D graphics for PC gamers. To get off the ground and get funded, 3dfx had to prove that their ideas were workable – and that proof came in the shape of a simulator. They used the simulator to demo their ideas, try out different design points, develop software pre-silicon, and validate the silicon once it arrived. Read the full story on my Intel blog, “How Simulation Started a Billion-Dollar Company”, found at the Intel Developer Zone blogs.
A new entry just showed up in the world of reverse debugging – Simulics, from German company Simulics. It does seem like the company and the tool are called the same. Simulics is a rather rare breed, the full-system-simulation-based reverse debugger. We have actually only seen a few these in history, with Simics being the primary example. Most reverse debuggers apply to user-level code and use various forms of OS call intercepts to create a reproducible run. Since the Simulics company clearly comes from the deeply embedded systems field, it makes sense to take the full-system approach since that makes it possible to debug code such as interrupt handlers.
I have also updated my history of commercial reverse debuggers to include Simulics.
Intel CoFluent Technology is a simulation and modeling tool that can be used for a wide variety of different systems and different levels of scale – from the micro-architecture of a hardware accelerator, all the way up to clustered networked big data systems. On the Intel Evangelist blog on the Intel Developer Zone, I have a write-up on how CoFluent is being used to do model just that: Big Data systems. I found the topic rather fascinating, how you can actually make good predictions for systems at that scale – without delving into details. At some point, I guess systems become big enough that you can start to make accurate predictions thanks to how things kind of smooth out when they become large enough.
Simics and other simulation solutions are a great way to add more variation to your software testing. I have just documented a nice case of this on my blog at the Intel Developer Zone (IDZ), where the Simics team found a bug in how Xen deals with MPX instructions when using VT-x. Thanks to running on Simics, where scenarios not available in current hardware are easy to set up.
UndoDB is an old player in the reverse debugging market, and have kept at it for ten years. Last year, they released the Live Recorder record-replay function. Most recently, they have showed an integration between the recorder function and Jenkins, where the idea is that you record failing runs in your CI system and replay them on the developer’s machine. Demo video is found on Youtube, see https://www.youtube.com/watch?v=ap8552P5vss.
Last year (2015), a paper called “Don’t Panic: Reverse Debugging of Kernel Drivers” was presented at the ESEC/FSE (European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering) conference. The paper was written by Pavel Dovgalyuk, Denis Dmitriev, and Vladimir Makarov from the Russian Academy of Sciences. It describes a rather interesting approach to Linux kernel device driver debug, using a deterministic variant of Qemu along with record/replay of hardware interactions. I think this is the first published instance of using reverse debugging in a simulator together with real hardware.
My first blog post as a software evangelist at Intel was published last week. In it, I tell the story of how our development teams used Simics to test the software behavior (UEFI, in particular) when a server is configured with several terabytes of RAM. Without having said server in physical form – just as a simulation. And running that simulation on a small host with just 256 GB of RAM. I.e., the host RAM is just a small fraction of the target. That’s the kind of things that you can do with Simics – the framework has a lot of smarts in it.
It was rather interesting to realize that just the OS page tables for this kind of system occupies gigabytes of RAM… but that just underscores just how gigantic six terabytes of memory really is.
How important is the documentation (manual, user guide, instruction booklet) for the actual quality and perceived quality of a product? Does it materially affect the user? I was recently confronted by this question is a very direct way. It turned out that the manual for our new car was not quite what you would expect…
This is just the first page, and as you can see if you know Swedish or German or both, it is a strange interleaving of sentences in the two languages.
A comment on my old blog post about the history of reverse execution gave me a pointer to a fairly early example of replay debugging. The comment pointed at a 2002 blog post which in turn pointed at a 1999 LWN.net text which almost in passing describes a seemingly working record-replay debugger from 1995. The author was a Michael Elizabeth Chastain, of whom I have not managed to find any later traces.
I love bug and debug stories in general. Bugs are a fun and interesting part of software engineering, programming, and systems development. Stories that involve running Simics on Simics to find bugs are a particular category that is fascinating, as it shows how to apply serious software technology to solve problems related to said serious software technology. On the Intel Software and Services blog, I just posted a story about just that: debugging a Linux kernel bug provoked by Simics, by running Simics on a small network of machines inside of Simics. See https://blogs.intel.com/evangelists/2016/05/30/finding-kernel-1-2-3-bug-running-wind-river-simics-simics/ for the full story.