This is a short story from the world of virtual platforms. It is about how hard – or easy – it is to model a simple and well-defined hardware behavior that turns out to mercilessly expose the limitations of simulation kernels.
Which Hardware?
The CLINT interrupt controller used by many RISC-V designs has a built-in timer that generates timer events, typically for operating system timer interrupts. The system is simple – there is an mtime register that counts upwards, and a comparison register mtimecmp (actually, one such register per core served by the CLINT). The registers are all 64 bits.
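As a rough sketch (the struct and the names in it are mine, purely for illustration, not taken from any particular model or spec header), the whole state boils down to a shared counter plus one compare value per core:

    // Hypothetical sketch of the CLINT timer state; illustrative names only.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct ClintTimerState {
        uint64_t mtime = 0;                 // shared counter, counts upwards
        std::vector<uint64_t> mtimecmp;     // one 64-bit compare register per core
        explicit ClintTimerState(std::size_t cores)
            : mtimecmp(cores, ~uint64_t{0}) {}  // all ones here just for illustration
    };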
When software does not want to get regular timer events, or as a safeguard when changing the value using two 32-bit writes, the convention is apparently to program the timer compare register with all ones, i.e., a very large value that the timer will never reach. Checking the math, if we assume a 10 MHz timer, (2**64)-1 ticks of the timer is something like 58000 years. Not quite the end of the universe, but not a very likely uptime for a computer system either.
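For the curious, here is the back-of-the-envelope check (the 10 MHz frequency is just an assumed example value):

    // Rough check of the "58000 years" figure, assuming a 10 MHz timer.
    #include <cstdio>

    int main() {
        const double ticks = 18446744073709551615.0;   // (2**64)-1
        const double freq_hz = 10.0e6;                 // assumed timer frequency
        const double seconds = ticks / freq_hz;
        const double years = seconds / (365.25 * 24 * 3600);
        std::printf("%.0f years\n", years);            // on the order of 58000 years
    }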
Limited Simulation Time
How would you model that in a virtual platform? The standard transaction-level programming model is to react to a write to the compare register by posting a simulator event. The event is placed at the point in virtual time where the counter will equal the programmed compare value. That way, no simulation work is wasted on incrementing the timer register tick by tick.
Easy enough. Compute the point in time when the timer will trigger (i.e., 58000 years into the future), and post an event at that point in time.
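A minimal sketch of that scheme, assuming some event-posting API exists; every function name below (post_event_after_seconds and friends) is a placeholder, not any particular framework's actual call:

    // Hypothetical compare-register write handler in a transaction-level timer
    // model. The three declarations stand in for whatever the simulation kernel
    // and the rest of the device model actually provide.
    #include <cstdint>

    void cancel_pending_timer_event();                            // placeholder
    void raise_timer_interrupt();                                 // placeholder
    void post_event_after_seconds(double delay, void (*cb)());    // placeholder

    void on_mtimecmp_write(uint64_t new_cmp, uint64_t current_mtime,
                           uint64_t timer_freq_hz) {
        cancel_pending_timer_event();
        if (new_cmp <= current_mtime) {
            raise_timer_interrupt();          // compare value already passed
            return;
        }
        // Convert the remaining ticks into simulated time and post one event
        // at the exact point where mtime will equal mtimecmp.
        uint64_t delta_ticks = new_cmp - current_mtime;
        double delay = static_cast<double>(delta_ticks)
                     / static_cast<double>(timer_freq_hz);
        post_event_after_seconds(delay, raise_timer_interrupt);
    }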
The event queue is typically implemented using a very small time base, in order to correctly handle events posted at very small differences in time. Something like picoseconds is a typical level of resolution. And picoseconds are very small: if you use an unsigned 64-bit counter for picoseconds, time ends (the counter overflows) after about 200 days. Note that times on the event queue should be relative to the current point in time, so this is a rolling 200-day horizon.
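The same kind of back-of-the-envelope exercise for the queue horizon:

    // How far a 64-bit picosecond counter reaches.
    #include <cstdio>

    int main() {
        const double max_ps = 18446744073709551615.0;   // (2**64)-1 picoseconds
        const double seconds = max_ps * 1.0e-12;
        std::printf("%.0f days\n", seconds / 86400.0);  // a bit over 213 days
    }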
Such a time queue would have no problem with the expected common cases, when the timer expiry time is not all that far away. Typically, a timer is programmed to expire within a fraction of a second. When a timer is not in use, it would be disabled and no event would be posted.
Not so in this case. Your average 64-bit picoseconds timed event queue has no way to represent an event 58000 years into the future. What to do?
Solutions?
200 days is a bit short of 58000 years, but for all practical purposes, putting the event at the largest representable time provides a perfectly functional solution: the interrupt that should trigger 58000 years into the future gets modeled by an event posted some 200 days into the future.
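In sketch form, the clamping is just one std::min against whatever the kernel's real limit happens to be (max_postable_delta_ps and post_event_after_ps are made-up names):

    // Clamp the requested expiry to the farthest future time the event queue
    // can represent. Hypothetical helper, not any specific kernel's API.
    #include <algorithm>
    #include <cstdint>

    void post_event_after_ps(uint64_t delta_ps);        // placeholder kernel call

    void post_timer_event_clamped(uint64_t delta_ps, uint64_t max_postable_delta_ps) {
        post_event_after_ps(std::min(delta_ps, max_postable_delta_ps));
    }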
The error is a factor of 100 thousand (a 10 MHz clock compared to a 1 THz one) – but it is not visible to the target system, since the event will never trigger anyway.
No software will ever notice. In theory, some kind of bizarro test that programs the compare register to all ones and then waits for a very long time to check that no timer interrupt happens might be able to discern the approximation. But what kind of hardware test would be running for most of a year to make sure something does NOT happen?
But based on that observation, another solution would be to not post any event at all. If it is OK to post the event at a very wrong time because it will never trigger, it is just as correct to not post one at all. It feels intuitively strange, but really it is just another way to handle the corner case as a special case.
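In sketch form, the only difference from the clamping variant above is an early return (same made-up names as before):

    // Skip posting entirely when the expiry lies beyond the queue's horizon;
    // within any realistic simulation run, the interrupt never fires anyway.
    #include <cstdint>

    void post_event_after_ps(uint64_t delta_ps);        // placeholder kernel call

    void post_timer_event_or_skip(uint64_t delta_ps, uint64_t max_postable_delta_ps) {
        if (delta_ps > max_postable_delta_ps) {
            return;                    // unreachable expiry: post nothing at all
        }
        post_event_after_ps(delta_ps);
    }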
Why Care?
I just found this case fascinating since it illustrates the nature of simulator approximations and limitations. You could argue that it is a problem that a virtual platform cannot accurately represent the behavior of such a simple and clear hardware specification. Is that argument sufficient to motivate changing the implementation of the event queue to use 128-bit integers instead of 64-bit ones?
Moving to 128 bits would certainly make it possible to model this case “precisely”; but it would incur a performance penalty for every single event, and the benefit would be to better handle an event that will never trigger. Does not sound like a good trade-off to me.
Sometimes, good enough is simply good enough.