When I was working on my PhD in WCET – Worst-Case Execution Time analysis – our goal was to predict, as precisely as possible, the number of cycles that a processor would take to execute a certain piece of code. We and other groups designed analyses for caches, pipelines, even branch predictors, and ways to take into account information about program flow and variable values.
The complexity of modern processors – even a decade ago – was such that predictability was very difficult to achieve in practice. We used to joke that a complex enough processor would be like a random number generator.
Funnily enough, it turns out that someone has been using processors just like that. Guess that proves the point, in some way.
I was recently introduced to the HAVEGE project – HArdware Volatile Entropy Gathering and Expansion – run at IRISA in Rennes, France, from what seems to be around 2002 to 2006. The main author, Andre Seznec, has also published in the WCET field. Today, the same idea is found nicely packaged in the HAVEGED code base for Linux, found at http://www.issihosts.com/haveged.
The idea behind HAVEGE is to run a piece of code that is designed to incur cache misses, confuse branch predictors, and generally strain the prediction mechanisms of a processor. In this way, the timing of the code will fluctuate even though it is basically straight-line code with no decision-making. These timing variations can be captured by reading a high-resolution timer such as the x86 processor’s TSC (Time Stamp Counter), or some other source that can report the execution time of a piece of code.
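To make that concrete, here is a minimal sketch of the measurement idea (my own illustration, not the actual HAVEGED source): time a short loop full of data-dependent branches and scattered memory accesses with the TSC, and look at the jitter between readings.

/* Minimal sketch (not the HAVEGED code): sample the timing jitter of a
 * loop that is deliberately hard on the branch predictor and the data
 * cache, using the x86 TSC. Build with gcc or clang on x86-64. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>              /* __rdtsc() */

#define TABLE_SIZE (1 << 16)
static volatile uint8_t table[TABLE_SIZE];

int main(void)
{
    uint32_t lcg = 12345;           /* cheap generator for irregular indices */
    for (int sample = 0; sample < 16; sample++) {
        uint64_t start = __rdtsc();
        for (int i = 0; i < 1000; i++) {
            lcg = lcg * 1103515245u + 12345u;
            if (lcg & 1)            /* data-dependent branch */
                table[lcg % TABLE_SIZE] += 1;
            else                    /* scattered accesses stress the cache */
                table[(lcg >> 7) % TABLE_SIZE] ^= 1;
        }
        uint64_t delta = __rdtsc() - start;
        /* the low-order bits of delta are where the harvested entropy lives */
        printf("sample %2d: %llu cycles\n", sample, (unsigned long long)delta);
    }
    return 0;
}

Run something like this a few times and the low-order bits of the cycle counts bounce around from sample to sample and from run to run; those bits are the raw material that HAVEGE collects.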
The key advantage of such a source of randomness is that it is easy to quickly acquire lots of randomness (or entropy in crypto language), and it is also impossible to predict the results. For cryptographic applications, this unpredictability from the perspective of an outside observer is very important, as it makes random numbers generated based on this much stronger in the face of an attack.
I think HAVEGE offers a good example of how to make lemonade from lemons. If we conclude that processor timing cannot be predicted, consider that fact as a feature for cryptography rather than as a problem for WCET.
The first paper on HAVEGE is called “Hardware Volatile Entropy Gathering and Expansion: Generating unpredictable random numbers at user level”, IRISA internal report 1492, October 2002. It presents the core idea a little differently from later papers. In it, they measure the effect of caches and TLBs on randomness, with the assumption that the key to the randomness is interrupts: OS code running in response to an interrupt disturbs the cache and TLB entries used by the program. An underlying assumption is that if you just run a program in isolation, the caching and speculation mechanisms will converge to a good state for the program, with little or no timing variation as a result.
I wonder if that is still true on a modern machine. Their measurements were performed on a mid-1990s UltraSPARC II, which is in-order and much simpler than current Intel Core processors. Even an ARM Cortex-class processor is much more complex. I would really like to see measurements of the inherent randomness in today’s processors, without any interrupts or other outside disturbances muddying the picture. I wonder if you would still see variations in the execution time of a body of code due to the different periods of various hardware mechanisms, or if it all converges to maximum throughput and minimal hardware latencies for all parts of the pipeline. For some reason, I have my doubts that the hardware would be that ideal in practice.
What makes the randomness of the actual hardware hard to evaluate is that the available codebase is the HAVEGE code, which is an “expansion” of the basic HAVEG idea. The expansion couples a PRNG to the collection of entropy from the hardware, in order to produce much more random output (in terms of random bits per second) than the hardware alone would provide. While very practical, this also serves to obscure the fundamental randomness of the hardware from direct measurement.
Essentially, HAVEGE generates a ton of random data that appears to be of high quality in the tests provided. But that data mixes three factors into a single measurement:
- Hardware low-level random fluctuations (cache, pipeline, branch predictor)
- Hardware coarse-grained variation (interrupt timing, the time taken to perform OS actions in response to interrupts)
- The effectiveness of the PRNG code
Picking these three apart would be interesting, and it is a shame that there seems to be no recent evaluation of HAVEGE.
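To see why they are hard to pick apart, here is a rough sketch of what an expansion loop might look like (my own simplification, not the HAVEGED source): timing readings perturb a walk over a large state table, and the table feeds the output words.

/* Rough sketch of the "expansion" idea (my simplification, not HAVEGED):
 * TSC readings perturb a data-dependent walk over a state table that is
 * deliberately larger than the L1 data cache, and the table contents are
 * mixed into the output words. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>              /* __rdtsc() */

#define STATE_WORDS (1 << 15)       /* 128 KB of state, bigger than a typical L1 */
static uint32_t state[STATE_WORDS];
static uint32_t walk;

static uint32_t expand_one_word(void)
{
    uint64_t t = __rdtsc();
    /* fold the timing reading into the walk position; the walk itself
     * causes cache misses, which feed back into future timings */
    walk = (walk ^ (uint32_t)t ^ (uint32_t)(t >> 32)) % STATE_WORDS;
    state[walk] = state[walk] * 1103515245u + (uint32_t)t;
    return state[walk] ^ state[(walk + STATE_WORDS / 2) % STATE_WORDS];
}

int main(void)
{
    for (int i = 0; i < 8; i++)
        printf("%08x\n", expand_one_word());
    return 0;
}

In a structure like this, the table walk and the arithmetic mixing dominate the statistics of the output, so even weak hardware jitter can produce output that passes the usual randomness tests, which is exactly why those tests say little about the underlying hardware randomness.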
This is a very interesting concept. The PROARTIS EU project (http://www.proartis-project.eu/) that started in 2009 pushes the idea much further. The aim is to design processors that have randomised timing behaviour *by design* so that only probabilistic analysis is possible.
For example, use random cache replacement policies. This makes static prediction of which memory accesses are cache hits or misses impossible, but the behaviour can be analysed probabilistically to show that a particular execution of the program exceeds X cycles only with a probability lower than 10E-12 (or any arbitrarily low number).
The idea is to introduce this sort of mechanism across the whole architecture: CPU, caches, memory latencies, bus arbitration protocols, RTOS and applications. The benefit is that even though pathological worst-case scenarios can still happen, it is provable (in the mathematical sense) that the probability of them all happening together is negligible.
Guillem Bernat
That is yet another angle on randomness: as a way to smooth out large-scale variations by encouraging small-scale variations. Making hardware locally variable but globally predictable. Do you take cryptographic aspects into account in the architecture?
The key issue is not only to smooth large scale variations, but to break the systematic nature of some pathological cases.
Consider the following code:

for (i = 0; i < N; i++) {
    a();
    b();
    c();
    d();
    e();
}
In a processor with cache associativity of 4, you can have 2 very different behaviours.
1) If the linker happens to spread the code out across the cache, then the typical scenario applies: in the first iteration all functions take instruction cache misses, and in any following iteration of the loop all functions hit in the instruction cache.
2) however, there is a pathological case…
(Note to the reader: stop here, and try to think how it could be that (1) does not apply!!!)
The problem is that if all these functions map to the same cache sets, then every execution of the loop results in cache misses. The difference in performance is non-linear (and can actually be quite brutal, 5X or 10X depending on the memory access latency). This happens for some particular cache replacement policies, e.g. LRU. And similar scenarios can be built for other cache policies.
The problem here is therefore that a "small" change in the code layout can lead to a tremendous change in the execution time of the program. The second key issue is that this behaviour is systematic: every iteration will have the same behaviour, and therefore the pathological case (e.g. all accesses miss the cache) will happen every time.
By adding randomisation in the cache replacement, for example, you may lose some average-case performance, but the pathological case (e.g. every instruction access being a cache miss) can no longer happen every time. It happens only with exponentially decreasing probability.
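To make that concrete, here is a toy simulation sketch (an illustration only, not PROARTIS code) of the loop above, modelling each function as a single line that maps to the same 4-way set, comparing LRU with random replacement:

/* Toy model (illustration only): 5 blocks a..e accessed cyclically,
 * all mapping to one 4-way set. Compare LRU with random replacement. */
#include <stdio.h>
#include <stdlib.h>

#define WAYS 4
#define BLOCKS 5
#define ITERS 100000

static int simulate(int use_random)
{
    int set[WAYS], age[WAYS], misses = 0;
    for (int w = 0; w < WAYS; w++) { set[w] = -1; age[w] = 0; }

    for (int i = 0; i < ITERS; i++) {
        for (int b = 0; b < BLOCKS; b++) {
            int hit = -1;
            for (int w = 0; w < WAYS; w++) {
                age[w]++;
                if (set[w] == b) hit = w;
            }
            if (hit >= 0) {
                age[hit] = 0;                    /* refresh LRU age on a hit */
            } else {
                misses++;
                int victim = 0;
                if (use_random) {
                    victim = rand() % WAYS;      /* random replacement */
                } else {
                    for (int w = 1; w < WAYS; w++)   /* LRU: evict oldest line */
                        if (age[w] > age[victim]) victim = w;
                }
                set[victim] = b;
                age[victim] = 0;
            }
        }
    }
    return misses;
}

int main(void)
{
    printf("LRU    misses: %d of %d accesses\n", simulate(0), ITERS * BLOCKS);
    printf("random misses: %d of %d accesses\n", simulate(1), ITERS * BLOCKS);
    return 0;
}

With LRU the loop misses on every access once it is warm, which is exactly the systematic case described above; with random replacement misses still occur, but only a fraction of the time, and long runs of all-miss iterations become exponentially unlikely.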
—
Guillem Bernat
@Guillem Bernat
I have to apologize – this reply got stuck in the spam filter, and I did not see it until today. Bad of me. Thanks for the insightful reply, Bernat!
Do you also randomize cache placement in some way? That could be interesting, implementing a fully associative cache with random replacement. For a case like this, after some bad luck, eventually you would expect the cache to settle down with all the code mapped, since that is the only scenario that does not change the cache contents 🙂