
A month ago, I participated in a seminar at Schloss Dagstuhl in Germany, about “Discrete Algorithms on Modern and Emerging Compute Infrastructure”. Not my usual cup of tea, but it was very interesting and insightful nevertheless. I have attended a Dagstuhl seminar once before, back in 2003.
Schloss Dagstuhl
Dagstuhl is a very special institution in the computer science world. The Schloss is run by an organization with shareholders from a range of German universities, as well as some French and Dutch universities. The goal is to provide an affordable venue for computer science seminars and workshops, and staying for a week with full board is very cheap compared to staying at a regular hotel.

“It is the mission of Schloss Dagstuhl – Leibniz Center for Informatics to further world-class research in computer science by facilitating such communication and interaction between researchers.”

The location is an old Schloss (which translates roughly to “country house”; “castle” is not technically correct) with some modern additions. The site is intentionally a bit remote, and participants are expected to be present for the whole week, from Monday to Friday morning. Even so, it is a bit annoying that you have to take a taxi from one of the closest railroad stations to get to the castle.

There is a garden with outdoor seating, and seminars often set aside time in the schedule for a hike or walk in the surrounding countryside. Up on the hill behind the current Schloss there is the ruin of an older, actual defensive castle (Burg Dagstuhl).
A Shrine to Computer Science
Dagstuhl is something of a shrine to computer science. Even its official address is on Konrad-Zuse-Straße.

The library is fantastic in case you are looking for paper books, and the librarians make sure to collect books written by the people participating in a seminar and display them. I was absolutely flabbergasted that they managed to find the book I wrote together with Daniel Aarno in 2014 and even procured a paper copy in time for the seminar!

There is also art with computer science themes. For example, this weaving was donated by Donald Knuth and his wife Jill.

Social
Dagstuhl seminars are designed to foster collaboration and discussions. The schedule tends to be quite light to allow lots of time for impromptu discussions. Indeed, one of the key performance indicators of Schloss Dagstuhl is that people who attend seminars go on to form new research collaborations and write papers together.

There is a tradition and expectation that seminar attendees socialize in the evenings, sometimes late into the night. Beer, water, sodas, and wine are all available basically at cost and paid for at the end of the seminar on an honor system where you note down what you consumed. I spent several evenings talking computer architecture and PC building… or enjoying stories from the other attendees.
The Seminar
What about the seminar itself?
The title was “Discrete Algorithms on Modern and Emerging Compute Infrastructure”, organized by Kathrin Hanauer (Universität Wien), Uwe Naumann (RWTH Aachen), Alex Pothen (Purdue), and Robert Schreiber (Cerebras). It was very wide-ranging, but the main themes as I see them were:
- “High-performance computing” – i.e., large-scale computing for supercomputers and datacenters, not clients.
- Specifically, HPC for sparse data structures or graphs – i.e., not necessarily “dense” math like classic HPC.
- How modern hardware trends affect algorithms and performance in that area.
- Where hardware is going – including trends in GPU and CPUs as well as novel architectures like the Cerebras wafer-scale computer, and finally Quantum computing.
The participants came from both industry and academia, from both sides of the Atlantic. There were some people from traditional supercomputing national labs, as well as people working in the financial industry (a big user of compute). Hardware companies were represented by Intel, ARM, Nvidia, and Cerebras.
Observations and Learnings
As an Intel person, it is interesting to note that AVX512 comes up all the time in discussions about HPC on CPUs. The HPC crowd definitely uses it but are uncertain where Intel is going with the instruction sets – hopefully AVX10 will straighten that out once and for all. The fact that using AVX2 or AVX512 would lower clocks in early designs is well-remembered and still makes people wonder if AVX512 actually adds performance in practice.
The Intel Math Kernel Library (MKL) is a very successful library in HPC. It is still considered the gold standard: more stable, more features, and higher performance than the competition, at least right now.
The Knights family (Xeon Phi) is remembered fondly by many users in HPC. It was a nice machine that gave many people hundreds of cores, with powerful SIMD units on each core, when nothing else could. But that was then; today the landscape is quite different.
Nvidia is the high-performance computing company today, since most compute is done on GPUs, and Nvidia has the biggest market share. In addition, the CUDA GPU programming kit has huge market share and mind-share. The default solution to “program something for a GPU” is to use CUDA as it gives you the best performance and it works in more places than the alternatives. It is a good example of technical lock-in achieved in a space where users typically try hard to avoid lock-in.
The key architectural issue facing sparse algorithms is the imbalance between compute and memory bandwidth and communication. For graphs, you usually have a little bit of work to do for each node or edge, so the bottleneck is getting the information from memory to the cores (in general). When you scale up problem sizes, you need to use many compute nodes, and then communication becomes an added issue. The ideal machine is one with a very large very fast memory that can be randomly accessed from any processor, which obviously does not exist. The Cerebras machine is a reaction to this, prioritizing memory bandwidth and raising the bandwidth-per-compute ratio. To achieve this, it makes a number of other compromises in terms of size of memory per node and a specific communications scheme that makes it a dedicated accelerator rather than a general-purpose machine.
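To make the imbalance concrete, here is a minimal sketch (my own illustration, not from the seminar) of a sparse matrix-vector multiply in CSR form. Each stored nonzero costs roughly two floating-point operations but requires moving a value, a column index, and an irregularly accessed vector entry from memory – which is why such kernels are bandwidth-bound almost by construction.

```python
# Minimal CSR sparse matrix-vector multiply (y = A @ x).
# Per nonzero: 1 multiply + 1 add, but ~12+ bytes of memory traffic
# (8-byte value, 4-byte column index) plus an irregular read of x.
# That ratio of a few flops to many bytes is the memory bottleneck.

def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # random access into x
        y[i] = acc
    return y

# 2x2 example: A = [[1, 2], [0, 3]] stored in CSR form
values = [1.0, 2.0, 3.0]
col_idx = [0, 1, 1]
row_ptr = [0, 2, 3]
print(spmv_csr(values, col_idx, row_ptr, [10.0, 20.0]))  # [50.0, 60.0]
```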
Adding memory bandwidth is a key performance driver in current high-performance machines. It is honestly a bit surprising just how effective it is to add a few High-Bandwidth Memory (HBM) stacks to an existing design and see the performance increase (for problems that are memory bound).
The HPC community feels a little bit left behind as hardware companies chase the AI/ML world. Today, HPC users have to use whatever comes onto the market designed for other purposes. That said, Nvidia claims that graph algorithms and “sparse” algorithms benefit from the current AI boom: AI-optimized hardware is good at moving data around, which benefits these algorithms as well.
Notes: Eras in Graph Algorithms
John Gilbert, Professor Emeritus at UCSB, gave a good talk with an overview of the “eras of graph algorithms”.
- 1970s and 1980s – sparse direct methods, classic graph search, single-core machines.
- 1990s and 2000s – parallel scientific compute, graphs used to manage other HPC computations, but also the beginning of computations on graphs by parallel machines. Distributed memory was a challenge.
- 2010s and 2020s – the graph is the focus of the problem space. Social networks. Biology. Infrastructure. Security. Finance. Frameworks like NetworkX, GraphBLAS, Pregel.
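The classic graph search of that first era remains the archetype of a low-work-per-edge computation. A minimal sketch (my own illustration, not from the talk) of breadth-first search shows the pattern: O(V + E) total work, with only a few operations per edge.

```python
from collections import deque

# Classic breadth-first search: computes hop distances from a source.
# Only a handful of operations per edge, so on large graphs the cost
# is dominated by fetching the adjacency data, not by arithmetic.

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first visit = shortest hop count
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Small diamond-shaped graph: 0 -> {1, 2} -> 3
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_distances(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```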
The problem today: the memory wall. Just as noted above, memory is the problem. John put it very succinctly: “You can buy bandwidth, but you cannot buy latency.”
What is in store for the next ten years?
- Data – what does the graph actually represent? Labels on edges and nodes can be complicated, not just simple numbers. The data might have to be retrieved from other systems and not be co-located with the graph as such.
- Data – how can researchers get data to train algorithms? Not very easy to do, and data is getting more and more valuable.
  - Note that NASA requires that all data produced under their contracts be in the public domain.
- Architecture: the end of Moore’s law gives a chance to define hardware that is novel and not immediately outdated by the progress of standard processors.
  - CGRA, Coarse-Grained Reconfigurable Architecture. More high-level than FPGA.
  - Quantum?
  - Neuromorphic?
- AI:
  - Use AI to guide algorithms, as part of the algorithms.
  - AI is also obviously a very large application domain for graphs.
  - Can AI copilots be used to help programmers, including with the very difficult task of performance tuning?
Notes: Cerebras
On the final day, Robert Schreiber from Cerebras presented their wafer-scale computer. It is a very different architecture and not easy to program, but it has real potential for certain types of problems. Robert had some key points on why the system looks the way it does:
- “We cannot make significant progress by moving transistors around on the chip”.
- We have to “rethink parallel compute”
- Cerebras has the first commercial wafer-scale system; the idea was tried once before in the 1980s.
CS-3 is the third Cerebras machine, with the process node going from 16nm to 7nm to 5nm across generations. They sell packaged systems, not wafers. Each system comes with a 20kW power supply and liquid cooling, with a cold plate on one side of the wafer and IO on the other side.
- Start with a 300mm wafer, cut out a 200mm x 200mm square from the middle.
- They still step a reticle across the wafer, so you get a number of functional squares separated by a little bit of dead space; they just do not cut the silicon apart.
- Each die/reticle is 17×30 mm, which is rather conservative. For comparison, the Intel Xeon 4th gen MCC die with 32 cores is 30×25 mm. The largest reticle today is supposedly something like 33×26 mm.
- 46,225 mm² of total chip area.
- There are 900k cores or tiles:
- Each tile now has an 8-wide (up from 4-wide) SIMD unit that can do FP16 math natively. FP32 is also available, but FP64 is not. AI training is done using FP16 inputs, accumulating into FP32.
- The memory at each tile is 8 banks of 6kB of SRAM each – 48kB in total, the same amount of RAM my first home computer had back in 1983!
- Each tile runs its own program with its own program counter, even though applications tend to replicate the same binary over and over for obvious reasons.
- They assume some tiles will be dead, and there is hardware support for deactivating and routing around dead tiles. There are 45-degree physical backup wires that allow a single tile to be routed around, and there are also pass-through wires running through tiles in one direction.
- 44 GB of on-chip memory – about half the area of the chip is memory. No cache hierarchy, no virtual memory, no DRAM; it is all just direct SRAM local to each tile. My impression is that this was a big selling point for the initial versions of the Cerebras design – you had more fast memory than anything else. I am not sure how that holds up now that GPUs sport 96GB of RAM.
- 21 PBps of memory bandwidth to local memory – no memory bottleneck there.
- 214 Pbps network/fabric bandwidth – due to the very large number of connections to each of the 900k processing units.
- Communication is part of the core instruction set, and programs are basically written as a combination of local computations and data movement instructions.
- The system is designed to be programmed using a data flow paradigm. Programs are stationary on the tiles, and data is streamed in and through the machine to be processed. There is a rather interesting color-based routing scheme.
- IO: about a terabit per second using standard Ethernet – 12 × 100G Ethernet links to the outside.
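As a back-of-the-envelope sanity check on the numbers above (my own arithmetic, not Cerebras figures; kB/GB treated as decimal units), the per-tile SRAM adds up to roughly the quoted on-chip total:

```python
# Cross-checking the quoted CS-3 figures: 900k tiles, each with
# 8 banks of 6 kB SRAM, against the 44 GB on-chip memory headline.
tiles = 900_000
banks_per_tile = 8
kb_per_bank = 6

kb_per_tile = banks_per_tile * kb_per_bank        # 48 kB per tile
total_gb = tiles * kb_per_tile * 1_000 / 1e9      # kB -> bytes -> GB
print(kb_per_tile, total_gb)  # 48 43.2 (close to the quoted 44 GB)
```

The small gap to the quoted 44 GB is presumably down to rounding in the marketing numbers.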
There are a few programming environments available:
- Assembly,
- C (apparently discontinued),
- CSL, with a higher-level generator. I talked to some researchers from the Simula lab in Norway who had used CSL to program the machine.
- PyTorch and TensorFlow are available as the “AI entrypoint”, just like on any other hardware targeting the AI space.
In terms of applications, the machine is optimized for AI or sparse linear algebra. A machine-learning matrix can be maybe 1000 × 1000, with the matrix size given by the longest chain fed into a Transformer node. They have customers outside of AI, especially in US national labs, and these customers code at a lower level than the AI people. This is using Cerebras for more classic HPC (but only in FP32). There are some interesting results where the time per iteration for some simulations is way lower than on traditional clusters.
Getting Home was Almost Exciting
The way back home from the seminar almost got a little bit too exciting. A massive rainstorm hit, causing flooding in several towns close to Dagstuhl. We got lucky in that we caught the train from a station that worked – the station before Türkismühle was flooded enough that trains did not stop there (but they did make it through). The train we took out of there was the one that was supposed to have left almost an hour earlier but was severely delayed. As the journey continued northwards, we kind of expected the train to be stopped at any point, especially as we could see swollen creeks and rivers outside.
The train did not go all the way to Frankfurt Airport as the last few stops were cancelled, but we found another train to change to. In the end, we got to the airport about when we had planned to anyway. But it was a bit of a scare.