“Unusual Perspectives on AI” – Computer and System Architecture Unraveled Event Five

The fifth CaSA, Computer and System Architecture Unraveled, meetup took place on January 30. We finally gave in and joined the AI hype train, resulting in an event with a somewhat different audience and different discussions. More society and applications, less computer architecture. Our two presenters were Håkan Zeffer from SambaNova Systems and Björn Forsberg from RI.SE (doing his second CaSA presentation!). Håkan talked about the architecture of the SambaNova AI processors, and Björn about AI compilers.

The location was the same as before, on the 19th floor of the Kista Science Tower building. Once more thanks to the sponsorship from Vasakronan and Kista Science City.

We did get an upgrade to the equipment, even if the nice large touch-screen TV had an annoying tendency to go into power-save mode while we were trying to get things set up. It is always amusing to see a group of people with PhDs and decades of experience in the computer industry being totally stumped by a simple piece of technology. Unless you are one of the PhDs being stumped, of course.

The SambaNova SN40L Reconfigurable Dataflow Unit

Håkan Zeffer, Senior Director Hardware at SambaNova Systems, presented the architecture and performance of the AI processors built by SambaNova. In particular, the current SN40L “Cerulean” RDU (Reconfigurable Dataflow Unit).

The RDU is a more specialized compute engine than a GPU, tailored to the needs of AI and machine learning (ML).

Slide from the Hot Chips 2024 presentation about the SambaNova RDU

It has a pretty large on-chip SRAM (520 MB per two-die socket) along with both HBM and DDR5 DRAM. This is in contrast to the Cerebras design that only uses SRAM. The idea is to offer both a lot of fast memory to run models, and enough bulk memory to allow quick switching between different models (without involving disks). The memory is large enough that a single rack can run a 405B Llama model, where competitors like Groq require several racks to run even a 70B Llama model.
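As a rough back-of-envelope check (my own numbers and assumptions, not from the talk), the weights alone of a 405B-parameter model land close to a terabyte at 16-bit precision, which is why bulk memory capacity matters so much:

```python
# Back-of-envelope estimate (assumes 16-bit weights; ignores the KV cache,
# activations, and any quantization) of the memory needed just to hold the
# weights of a 405B-parameter model.
params = 405e9
bytes_per_param = 2  # bf16/fp16
weight_bytes = params * bytes_per_param
print(f"405B model weights: ~{weight_bytes / 1e9:.0f} GB")  # ~810 GB
```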

Each chip contains a network of PCUs (compute units) and PMUs (memory units). The precise configuration of each unit and how data flows between them is programmed into the hardware in a way that is FPGA-like. It is essentially coding the dataflow into the hardware, and the compiler does place-and-route to find the best way to map a model onto the hardware. It is all in the name – a reconfigurable dataflow unit. This makes it good at structured compute, but bad at decision making. It is not a general-purpose machine, and was never intended to be.
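As a heavily simplified sketch of what “mapping a model onto the hardware” could mean (a toy of my own, not SambaNova’s toolchain), consider a model expressed as a dataflow graph being greedily placed onto a fixed pool of compute units:

```python
# Toy placement sketch (hypothetical, not SambaNova's compiler): a model is a
# graph of operations with dataflow edges, and each op gets assigned to a
# compute-unit slot. A real place-and-route pass also routes the edges and
# optimizes for bandwidth, latency, and on-chip memory (the PMUs).
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                                   # e.g. "matmul", "softmax"
    inputs: list = field(default_factory=list)  # producer ops (dataflow edges)

# A tiny attention-like fragment as a dataflow graph.
graph = [
    Op("q_proj"), Op("k_proj"), Op("v_proj"),
    Op("scores", ["q_proj", "k_proj"]),
    Op("softmax", ["scores"]),
    Op("context", ["softmax", "v_proj"]),
]

NUM_PCUS = 4  # pretend there are four compute units per tile

def place(graph, num_units):
    """Greedy placement: ops go onto units round-robin, spilling to the next
    tile when the current one is full."""
    return {op.name: divmod(i, num_units) for i, op in enumerate(graph)}

for name, (tile, unit) in place(graph, NUM_PCUS).items():
    print(f"{name:8s} -> tile {tile}, PCU {unit}")
```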

A key part of the performance of the RDU is the way in which multiple stages of an LLM operation like a transformer can be handled as one single invocation of the accelerator, instead of the multiple kernel calls required on current GPUs. SambaNova illustrates what it looks like on a GPU like this:

Slide from the Hot Chips 2024 presentation about the SambaNova RDU

The RDU, in contrast, can run the entire set of operations as one single “kernel”. For more details, I recommend a ServeTheHome article from 2024 that goes through the entire Hot Chips presentation with annotations.
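To make the contrast concrete, here is a small NumPy sketch of my own (not SambaNova code). On a GPU, each step of an attention computation roughly corresponds to a separate kernel launch with intermediates going through memory; on a dataflow machine, the same chain is configured onto the chip once and data streams through it:

```python
import numpy as np

def attention_gpu_style(q, k, v):
    """Each step stands in for a separate GPU kernel launch, with the
    intermediate results round-tripping through memory in between."""
    scores = q @ k.T                                     # "kernel" 1: matmul
    scores = scores / np.sqrt(q.shape[-1])               # "kernel" 2: scale
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)   # "kernels" 3-4: softmax
    return weights @ v                                   # "kernel" 5: matmul

def attention_dataflow_style(q, k, v):
    """Same math, but conceptually one invocation: the whole chain is mapped
    onto the hardware and the intermediates stay on-chip."""
    return attention_gpu_style(q, k, v)

q = k = v = np.random.rand(8, 16).astype(np.float32)
assert np.allclose(attention_gpu_style(q, k, v), attention_dataflow_style(q, k, v))
```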

A key selling point and raison d’être for SambaNova machines is their efficiency. The current generation of SambaNova machines has a significant advantage in performance and efficiency over Nvidia GPUs (which are the most relevant competition). In terms of performance, SambaNova claims that they can produce output 10x faster than an Nvidia H100. It should be noted that this level of absolute performance is also found in other more specialized accelerators like Cerebras and Groq.

In addition, SambaNova can provide that performance at 1/10 the power of an Nvidia setup. Since energy is power multiplied by time, finishing the same work ten times faster at one tenth the power works out to roughly 1/100 the energy cost. Very impressive, and not entirely unreasonable seeing how Nvidia chases performance by pushing power to insane levels. Also, for all its optimization, a GPU is still an instruction-processing machine, which comes with unavoidable overheads. By skipping instruction processing entirely, the RDU should by all rights be more efficient.

The SambaNova system is mostly intended for inference workloads, not training. It can do training, but it is not quite as performant on such workloads. Still, unlike some pure-inference accelerators, it has the memory to tackle training at all.

There was a lot more architecture information in the talk, but I am trying to keep things brief. Suffice it to say that it was really interesting to get a deep dive into this rather unusual architecture.

I can personally attest to the RDU being fast. I used their free cloud service to run some AI models in my experiments with code analysis and creation, and the output from the model definitely flows faster than from any of the other services I tried. The SambaNova hardware appears to offer lower latency and faster token production.

Finally, Håkan talked about what the SambaNova team in Stockholm is doing. There is a compiler team and a simulator and architecture team.

The simulator team maintains several products, it seems: a fast functional simulator for pre-silicon software development, and a slower timing-accurate model used for performance analysis and as a performance debug tool. The RDU is not easy to program, and having a simulator that can provide insight into what is going on in the system is invaluable. A simulator beats hardware here: it can stop at any point, actions can be scripted, and it is easy to pull out performance information to show in a GUI.

AI Compilers in the Golden Age

Björn Forsberg, researcher in computer science at RI.SE, talked about compilers for AI applications and ongoing European research projects in the area.

This is the second time that Björn has presented at the CaSA seminars – the last time was a little less than a year ago, when he was part of our RISC-V event. This talk can be seen as a complement to the previous one – where that was mostly about a hardware design that enabled parallel compute, this talk was about the compilers that enable us to actually get programs to run on the hardware.

The title of the talk is a play on the 2019 Turing Lecture by Hennessy and Patterson. As Björn put it in his abstract, “In their Turing Award lecture 2019, Hennessy and Patterson outlined a new golden age for computer architecture, enabled by the emergence of open standards, domain-specific architectures and languages, and more. The rapid development of AI embodies this shift — and creates a need for a new generation of compilers that codify generalizable techniques to bring AI to the diversifying architectures.”

AI provides a very good environment for compilers. The inputs are unusually rich in semantic information: PyTorch, TensorFlow, ONNX, and others all tend towards defining what should be computed rather than how (unlike classic programming languages, which are basically only about the how). The distance from the high-level inputs to the actual hardware is also quite long, providing both opportunities and challenges for the compilers.
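A small PyTorch example illustrates the point (assuming torch with its built-in ONNX exporter is installed): the programmer declares a graph of high-level operators, and everything about how that graph actually executes is left to the compiler and runtime:

```python
# The model is described as a graph of high-level ops (linear, relu): the
# "what". Nothing here says how to tile, fuse, or schedule the computation;
# that is the compiler's and runtime's job.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

example_input = torch.randn(1, 128)

# Export the operator graph (plus shapes and weights) to ONNX, handing it over
# to whatever backend compiler comes next.
torch.onnx.export(model, example_input, "tiny_model.onnx")
```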

There is a rather large set of related technologies used with AI compilers, as illustrated in one of the slides Björn showed.

A key technology is MLIR, “Multi-Level Intermediate Representation”. It was originally developed at Google, where the name probably started out as “Machine Learning IR”. The key creator is Christopher Lattner (who previously created LLVM and Clang); he is currently at Modular, a company working on AI compilers. MLIR is being used in most AI stacks, according to Björn. It is developed under the principle of moving fast and breaking backwards compatibility.

Björn is active in a research project involving Swedish industry and academia targeting the use of “embedded AI” – including using AI for real-time processing in applications like telecoms and avionics. This is a very different type of problem compared to compilation for a data center. There are resource and timing constraints on the output, while the input must still be “standard” AI to provide the expected user interface for domain experts. It makes sense to prototype solutions on standard hardware and then deploy to more constrained environments, without requiring changes to the input model. It reminds me of the keynote by Michaela Blott from AMD on how to compile AI to FPGAs, given at DVCon Europe in 2023.

The talk was wrapped up with a quick overview of some of the ongoing EU projects and initiatives to strengthen AI in Europe. In particular, the DARE (Digital Autonomy with RISC-V in Europe) project that mixes AI, high-performance computing, and RISC-V. I am not sure the RISC-V angle on AI acceleration makes much sense at all, but it is clearly an area where the EU is putting a lot of resources.

Discussions

In the discussions, the event lived up to its name about unusual perspectives on AI. As mentioned, the audience was different and asked more questions about applications of AI and fewer about the technological nuts and bolts. There were some nuts and bolts, but not as many as we usually get at a CaSA event.

Securing AI

One question that came up is how we can process sensitive information using AI systems. It is obvious that sending things up to public cloud services is a bad idea. But what is a good idea?

One aspect is keeping data within some part of the world – like keeping EU data in the EU, for processing on hardware located in the EU. Even if it is in a remote data center that you do not own yourself (i.e., “cloud”). This would be combined with isolation between tenants in the same system.

Another way is to run your AI workloads locally in your own datacenter. For all the talk of cloud, selling ready-to-roll racks is part of the business of SambaNova, Groq, Nvidia, and others. SambaNova provides a mix of hardware and software – to have something to run on the on-premises system, they offer a service that takes open-source models and optimizes, specializes, and tunes them.

Artificially Intelligent?

A key issue that we circled around was whether we can consider current AI/GenAI/LLM technology “intelligent”. Jonas Svennebring made the case that “intelligence is in the eye of the viewer”, something he learned a long time ago. It is very easy to ascribe real intelligence to an interaction with a current LLM, even though we generally agree that it is not truly intelligent. The LLM output is just very easy to interpret as intelligent. But think about the Eliza Effect.

Current AI systems/LLMs are definitely useful, but we have to be careful how and where we apply them, and how we use their output. The problem is that they often manage to sound very convincing and convinced of the correctness of the generated answers, while actually having no idea what they are talking about.

Håkan made the point that we are currently “throwing AI at any problem to see what sticks”. Totally agree.

Where are LLMs Going?

We had quite a debate about the nature of LLMs, how they might develop, and whether you can trust their output. In general, everyone agreed that LLMs definitely have a tendency to produce incorrect answers – but do so without giving any indication of uncertainty.

I would argue that LLMs exhibit software-like failure behavior, in the sense that small changes in input can result in the output going from good to totally incorrect. Such highly input-sensitive, chaotic behavior is typical of software. Unfortunately, my sense is that human minds are pretty poor at handling such unexpected breakages – in the physical world, small changes in input or applied force result in small changes to the state. Software is not like that. I have some good examples in my exploration of AI models explaining code.

Sustainability

Another discussion was around the sustainability of AI. LLMs as currently implemented use ridiculous amounts of energy. Where is this heading?

On one hand, we see ever more power-hungry systems to run and train bigger models faster. On the other hand, there is architectural innovation going on that improves the efficiency of the hardware and the software – for example, the DeepSeek models and the SambaNova hardware (as discussed above). On the third hand, there is always Jevons’ paradox.

Eventually, it should come down to economics. As users get exposed to the actual cost of building and running the compute needed for AI (today it is clearly being subsidized), demand should reorganize itself around the areas where LLMs have a unique advantage over other ways of solving problems.

DeepSeek and Architectural Innovation

We could not avoid a discussion of the recent announcements from DeepSeek. With some real experts in the room, I at least learnt something new. There are some real innovations in the DeepSeek models, as they have been presented.

First of all, it uses an unusually wide mixture-of-experts model – with 256 different experts/submodels instead of a more typical 8 or so. Each expert can be quite specialized to a certain type of task or knowledge. This is combined with a good “router” in the model, allowing just a fraction of the overall model to be activated for each query.
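A toy sketch of such routing (my own illustration, not DeepSeek’s actual code): a router scores all the experts, only the top few are evaluated, and the rest of the parameters stay untouched for that token:

```python
import numpy as np

NUM_EXPERTS, TOP_K, DIM = 256, 8, 64   # made-up sizes for illustration
rng = np.random.default_rng(0)

# Each "expert" is just a small matrix here; in a real model it is an MLP block.
experts = [rng.standard_normal((DIM, DIM)) * 0.02 for _ in range(NUM_EXPERTS)]
router_w = rng.standard_normal((DIM, NUM_EXPERTS)) * 0.02

def moe_forward(x):
    logits = x @ router_w                  # router: one score per expert
    top = np.argsort(logits)[-TOP_K:]      # keep only the best-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()               # softmax over the chosen experts
    # Only TOP_K of the NUM_EXPERTS expert matrices are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(DIM)
print(f"activated {TOP_K}/{NUM_EXPERTS} experts, output shape {moe_forward(token).shape}")
```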

Second, the architecture of the attention block was changed to make it more efficient, by compressing its inputs based on assumptions about what is important. Reducing the amount of data needed for each operation makes it possible to support longer input windows.
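DeepSeek’s actual mechanism (multi-head latent attention) is more involved than I can do justice to here, but the core idea of keeping a compressed representation around instead of full-size keys and values can be sketched roughly like this (a toy with made-up dimensions, not the real thing):

```python
import numpy as np

SEQ_LEN, D_MODEL, D_LATENT = 1024, 512, 64   # made-up sizes
rng = np.random.default_rng(1)

hidden = rng.standard_normal((SEQ_LEN, D_MODEL))
# These projections are learned in a real model; random here for illustration.
compress = rng.standard_normal((D_MODEL, D_LATENT)) * 0.02
expand_k = rng.standard_normal((D_LATENT, D_MODEL)) * 0.02
expand_v = rng.standard_normal((D_LATENT, D_MODEL)) * 0.02

latent_cache = hidden @ compress    # the small per-token state that is stored
keys = latent_cache @ expand_k      # reconstructed only when attention runs
values = latent_cache @ expand_v

full_cache = 2 * SEQ_LEN * D_MODEL  # floats needed for a plain K and V cache
small_cache = SEQ_LEN * D_LATENT    # floats needed for the latent cache
print(f"cache: {small_cache} vs {full_cache} floats "
      f"({full_cache / small_cache:.0f}x smaller)")
```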

There are several other innovations in DeepSeek that are too deep for me to explain.

If we abstract it a bit, I would argue that DeepSeek is representative of software innovation in AI: making LLMs better through classic software optimization and algorithmic innovation.

Another example of the room for innovation is Google’s work on Titans, replacing classic transformers. Titans scale better to large context windows (classic transformers are O(n²) in the context length, while Titans are closer to linear).

Innovations like these show why it is too early to design totally specialized accelerators for specific parts of today’s LLM pipelines. For example, if you had a “transformer” block in hardware it would be worthless for these new algorithms.

It shows that AI is just like classic computer architecture in a deep sense – too much hardware specialization limits applicability. You need the right balance between specialization and generality.
