“Processor Performance Insights and Optimization” – Computer and System Architecture Unraveled Event Four

Finally, the fourth CaSA (Computer and System Architecture Unraveled) meetup happened on November 6. It took far too long to get it organized, but we finally did it. The theme was processor performance analysis and efficient processor implementation, with two talks from very different perspectives. The location was almost the same as before, on the 19th floor of the Kista Science Tower building. Once more, thanks to Vasakronan and Kista Science City for their sponsorship.

Intel Processor Trace

The first talk was “The Intel Processor Trace (Intel PT) and its use cases”, presented by Bogdan Tanasa. Bogdan provided an overview of how the Intel PT hardware works and how it is used by software on Linux machines, as well as some examples of how he has used it to extract timing and program-flow information from applications.

The current implementations of Intel PT can do quite a few things. The base functionality is classic processor tracing, where the processor cores emit information about their execution, based on a configuration from the user. A later addition is the PTWRITE instruction, which provides a way for software to emit its own information into the Intel PT data stream – basically, low-overhead instrumentation.

For software timing measurements, executing a PTWRITE can cost as little as a few cycles, an order of magnitude less than grabbing the current time using RDTSC and saving it somewhere. The tradeoff is that the trace has to be decoded afterwards and the information mapped back to the software being instrumented.
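As a rough illustration (mine, not from the talk), the two instrumentation styles look something like this in C with compiler intrinsics. Note that _ptwrite64 requires a processor with PTWRITE support and compiling with -mptwrite on GCC/Clang; the marker values and function names are just made up for the sketch:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), _ptwrite64() */

/* Conventional timestamping: read the TSC and store the value.
 * Costs tens of cycles plus the memory traffic for the store,
 * but the data is immediately available to the program. */
static inline void mark_rdtsc(uint64_t *slot)
{
    *slot = __rdtsc();
}

/* Intel PT instrumentation: emit a marker into the PT packet
 * stream. Just a few cycles on the instrumented core -- but the
 * value is only recoverable later, by decoding the trace.
 * Requires PTWRITE hardware support and -mptwrite (GCC/Clang). */
static inline void mark_ptwrite(uint64_t marker)
{
    _ptwrite64(marker);
}
```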

The talk went into quite some detail on the interactions between the hardware behavior, driver interfaces, and instrumented software – for example, how to reconstruct program flow from branch-information packets. This requires using objdump to extract information from the binary, parsing the x86 instructions using XED, and rebuilding the full picture from the quite compressed Intel PT packets stored in memory.
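On the decoding side, Intel's reference library libipt gives a feel for what this involves. A minimal sketch, assuming a raw PT buffer has already been captured into memory, of walking the packet stream and picking out PTWRITE payloads (error handling mostly elided):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <intel-pt.h>   /* libipt, Intel's reference PT decoder */

/* Walk the raw packets in a captured Intel PT buffer and print
 * any PTWRITE payloads found. 'buf' and 'len' are assumed to
 * hold trace data captured elsewhere. */
static void dump_ptwrites(uint8_t *buf, size_t len)
{
    struct pt_config config;

    memset(&config, 0, sizeof(config));
    config.size  = sizeof(config);
    config.begin = buf;
    config.end   = buf + len;

    struct pt_packet_decoder *decoder = pt_pkt_alloc_decoder(&config);
    if (!decoder)
        return;

    /* Synchronize onto the first PSB packet in the stream. */
    if (pt_pkt_sync_forward(decoder) >= 0) {
        struct pt_packet packet;

        while (pt_pkt_next(decoder, &packet, sizeof(packet)) >= 0) {
            if (packet.type == ppt_ptw)   /* a PTWRITE packet */
                printf("ptwrite payload: %llx\n",
                       (unsigned long long)packet.payload.ptw.payload);
        }
    }

    pt_pkt_free_decoder(decoder);
}
```

Reconstructing actual control flow is a step up from this: it needs the instruction-flow decoder layer plus the program binary, which is where objdump and XED come in.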

The setups that Bogdan used were all based on tracing software on one core, with a decoder running on another core in the same system. Basically, the traced core puts trace data into shared memory that is consumed by the decoder. The decoder has to keep up, as the memory buffer is of limited size.

The best source for information about Intel PT is the Intel Software Developer's Manual (SDM), Chapter 33. Bogdan referred to it repeatedly in his presentation.

Microcoded Processors

The second talk was “Why Moore’s law makes microcode a good idea”, by Stefan Blixt of Telesis Innovation. Stefan has been involved in processor design for a long time, and started his talk with a true look back to the times of TTL logic in the 1980s.

He then went through the history of the processor design that his current company is working on. Basically, the idea has always been a microcoded processor where the microcode is used to implement complex but efficient instruction sets for the user code stored in RAM.

This was coupled with the use of microcode routines to implement I/O and acceleration functions instead of additional specialized hardware. This design pattern has worked well for a range of applications going back to the 1980s and a company called Lynx Datorteknik. Lynx provided technology to Versal, a company that produced “data terminals” in the days before “PC” was a thing, as well as to Facit.

The resulting performance was excellent for the time, beating standard solutions.

Unfortunately, in what became a running joke during the talk, Versal and Facit went under in the 1992 crisis, bringing Lynx down with them. The core design continued under the name Imsys, rising to fame as a “Java processor” that I used to use as an example of an interesting computer architecture in lectures from the late 1990s… The processor used microcode to implement a JVM, providing excellent performance for the given chip size, memory size, and power consumption. They called it the NISC, the No Instruction-Set Computer.

By 2000, they had two strong customers again: Ericsson, looking to use the processor in phones, and Array Printers, using it as a cost-effective printer controller. Unfortunately, Ericsson then went with ARM for the phones and Array went under in the dotcom bust and telecom crisis. Recurring theme, as I said.

In the next crisis, in 2009, Imsys split off Qulsar – which used the Imsys processor to build time-synchronization equipment. Hans Brandberg, who presented at our CaSA event #2, was part of the team that moved to Qulsar, so we can say that this is the second time the Lynx Datorteknik legacy has presented at the CaSA series!

In 2024, the technology is still around and relevant. Imsys built an AI accelerator using the same principles, but unfortunately went bankrupt earlier in 2024. The new company Telesis Innovation is taking over the technology and bringing it forward.

One of the new ideas is to use the LLVM compiler’s intermediate representation as the user-facing instruction set, known as ISAL (ISA for LLVM, I guess). This is in addition to the already-mentioned JVM mode and an older C-compiler-facing ISA known as ISAC (i.e., ISA for C). Essentially, this would provide the benefits of a modern compiler frontend without requiring the machine to support a full (and arguably inefficient) standard instruction set.

Discussions

There was a Q&A session and long discussions while mingling, as always.

The crowd listening to the talks

One interesting discussion that came up was just what kind of instruction set architecture modern ARM is, anyway. Everyone agreed that with extensions like vector and matrix operations you cannot really talk about a “Reduced Instruction Set” anymore. Still, the fact that AArch64 instructions are uniformly 32 bits wide does mean they are a little more reduced in their complexity than something like x86.

Another thread was about out-of-order execution and Intel PT. Intel PT operations are part of the regular flow of instructions, and as such might require the use of fences to ensure that PT packets properly delineate the piece of code being measured. This is annoying, but it makes sense: giving PTWRITE an implicit fence would make it, and the instrumented program, much slower in all the cases where the precise retirement order of instructions does not matter.
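As a hedged sketch of what that can look like in practice (my illustration, not from the discussion; whether LFENCE is the right barrier depends on exactly what you need to order against):

```c
#include <stdint.h>
#include <x86intrin.h>   /* _ptwrite64(), _mm_lfence() */

/* Bracket a measured region with PTWRITE markers. The LFENCEs are
 * an illustrative assumption: on Intel cores, LFENCE does not let
 * later instructions begin until earlier ones have completed, which
 * keeps the markers from drifting into or out of the region under
 * out-of-order execution. The marker values are arbitrary. */
static inline void measure(void (*work)(void))
{
    _ptwrite64(0x1);   /* begin marker */
    _mm_lfence();      /* keep work() from starting before the marker */

    work();

    _mm_lfence();      /* keep the end marker after work() completes */
    _ptwrite64(0x2);   /* end marker */
}
```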

Still on Intel PT, a question came up about whether something similar is available on other architectures. Not quite, at least not today. ARM CoreSight debug blocks and TARMAC functionality can do extensive, precise tracing – but it turns out to be basically impossible for a regular mainstream system user to get hold of hardware that exposes it. A cloud-based ARM VM will not give you a “JTAG” connector, and neither will consumer-facing ARM-based hardware like Windows-on-ARM PCs or Apple Macs. ARM does have a statistical sampling solution for user code (the Statistical Profiling Extension), but it is not really the same thing, as it only targets performance statistics, not precise control-flow reconstruction or instrumented code.

The Imsys processors do have support for tracing – but only for the microcode, and that is only facing the microcode developers. They do complete trace recordings, allowing single-stepping back and forth in a way quite reminiscent of the old Green Hills Time Machine debugger feature. Of course, given the flexible nature of a microcoded processor, some kind of tracing capability can be added to the user-facing software interface.

There were also discussions around target markets and their requirements. For example, is binary compatibility relevant or not? It definitely is for something like a server, laptop, or phone, where “standard” processors make sense. But for specialized embedded systems, not so much. That is really where the Imsys/Telesis Innovation chips make sense.
