Carbon Design Systems has been on a veritable blogging spree recently, pushing out a large number of posts on various topics. Maybe a bit brief for my taste in most cases (I have a tendency to throw out 1000+ word pseudo-articles when I take the time to write a blog post), but sometimes very interesting nevertheless. I particularly liked a few posts on cache analysis, as they offered some good insight into not-quite-expected processor and cache behaviors.
The posts in question are Cortex-A9 Cache Optimization part 2 and Cortex-A9 Cache Optimization part 3. They are not really that much about the Cortex-A9, as the most interesting results seem to come from old ARM11 cores… but the Cortex-A9 certainly shines in comparison. What I like about these kinds of small experiments and write-ups is when you discover something unexpected that you did not foresee at a higher level of abstraction.
In one case, we see behavior at the hardware level (RTL) that you would not normally expect when looking at a cache system from a designer's perspective. In the part 2 blog, the ARM11 core suffers extra cache misses when an L2 cache is attached to it – but not activated. The attached L2 changes some subtle core timing, causing other core mechanisms to change their observed behavior, leading to more cache misses. Normally, you would assume the result would be totally neutral. On the other hand, why build in an L2 cache unless you intend to use it? Still, an interesting observation.
It was also surprising to me that with the same size cache, the Cortex-A9 has fewer cache misses than the ARM11 on the same code. I wonder what lies behind that? Some deeper analysis would have been nice; my guess is that it has something to do with the replacement policy in use, or some little tweak like a prefetcher or a victim cache.
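The replacement-policy guess is easy to play with in a toy model. Here is a purely hypothetical sketch (my own illustration, nothing from the actual posts): a tiny set-associative cache simulator where only the eviction policy differs, run on an access pattern that cycles through one more line than a set can hold. Under LRU that pattern misses on every single access, while random replacement salvages some hits – the kind of policy-only difference that could separate two cores with identically sized caches.

```python
import random

def simulate(accesses, num_sets=16, ways=4, line=32, policy="lru", seed=0):
    """Count misses for a simple set-associative cache on a byte-address trace.

    policy: 'lru' evicts the least-recently-used way, 'rand' a random one.
    """
    rng = random.Random(seed)
    sets = [[] for _ in range(num_sets)]  # each set: list of tags, MRU last
    misses = 0
    for addr in accesses:
        tag, idx = divmod(addr // line, num_sets)
        s = sets[idx]
        if tag in s:
            s.remove(tag)          # hit: move tag to the MRU position
            s.append(tag)
        else:
            misses += 1
            if len(s) == ways:
                # LRU victim is at index 0; random picks any way
                victim = 0 if policy == "lru" else rng.randrange(ways)
                s.pop(victim)
            s.append(tag)
    return misses

# Cycle through 5 lines that all map to the same set of a 4-way cache:
# the classic pathological case where LRU misses on every access.
stride = 16 * 32                          # one full set-index cycle
accesses = [w * stride for w in range(5)] * 200
lru = simulate(accesses, policy="lru")    # misses on all 1000 accesses
rnd = simulate(accesses, policy="rand")   # random replacement gets some hits
```

Real hardware policies (pseudo-LRU, round-robin) sit somewhere between these two extremes, which is exactly why miss counts can differ between cores even with identical cache geometry.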
Part 3 shows the well-known fact that an L2 cache is necessary to filter the access stream to main memory, and also indicates that with an L2 in place, the L1 configuration does not affect the main-memory accesses much. The interaction between the L2 and branch prediction was surprising, though. Once again, more explanation would have been in order.
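That filtering effect is also easy to reproduce in a toy model. A minimal sketch (again my own illustration, not the Carbon models): two fully associative LRU levels, where doubling the L1 changes its own miss count dramatically but leaves the number of main-memory accesses – the L2 miss count – completely untouched, because the L2 already captures the working set.

```python
from collections import OrderedDict

class Cache:
    """Fully associative LRU cache holding `lines` cache lines."""
    def __init__(self, lines, next_level=None):
        self.lines = lines
        self.next = next_level      # next cache level, or None for main memory
        self.data = OrderedDict()   # line address -> present; LRU order
        self.misses = 0

    def access(self, line_addr):
        if line_addr in self.data:
            self.data.move_to_end(line_addr)   # hit: refresh LRU position
            return
        self.misses += 1
        if self.next:
            self.next.access(line_addr)        # miss propagates downward
        if len(self.data) == self.lines:
            self.data.popitem(last=False)      # evict LRU line
        self.data[line_addr] = True

# Working set of 24 lines, swept cyclically 100 times.
trace = list(range(24)) * 100

l2_a = Cache(128)                       # L2 easily holds the working set
l1_small = Cache(16, next_level=l2_a)   # too small: thrashes on the sweep
for a in trace:
    l1_small.access(a)

l2_b = Cache(128)
l1_big = Cache(32, next_level=l2_b)     # big enough: only compulsory misses
for a in trace:
    l1_big.access(a)

# l1_small misses on every access, l1_big only 24 times - yet both
# configurations generate the same 24 accesses to main memory, because
# the L2 absorbs everything after the first sweep.
```

The branch-prediction interaction from part 3 is harder to capture in a model this simple, which is probably why it surprised the authors too.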
But I can see the fun in playing around with these models and observing behaviors that would otherwise go unnoticed simply due to the difficulty of measuring such information on hardware.
Having the ultimate powerful design that will scale up to N cores and well into petascale computing is not a trivial task. Imagine a mobile phone with teraflops of computing power in your hand. What kind of apps would use all that power? How about a petaflops mobile device? Again, what could such a device be used for?
And once this handheld petascale computer is available… somebody will drop it on the pavement and destroy it. Somebody will hit it with a stone and shatter it.