On the evening of the last Wednesday in September, we had our first CaSA, Computer and System Architecture Unraveled, event. CaSA is a meetup in Kista (Sweden) for people interested in computer architecture, system architecture, and how software and hardware interact down towards the lower levels of the stack. The topic for the inaugural event was “Core Count Explosion: A Challenge for Hardware and Software”, and it was great in so many ways!
Meetup?
Getting this going has taken our small group of organizers quite a few months. The motivation for setting up the meetup series is that there are many people in the Stockholm area who work with (or are interested in) computer architecture and system architecture, but there is no real meeting place outside of work and formal conferences. What we wanted was an informal event where people meet up, after work, and just discuss deeply technical things with other deeply technical people. Totally nerdy. I would say we succeeded in that. With a sample size of one event.
Location Location Location
Indeed, getting hold of a suitable place was the biggest issue we faced. Borrowing an office floor in the tower for an evening was just perfect. The view from up there is nice: you can look out over the offices and residential buildings in Kista. And we got to see a sunset (a bit cloudy but still good).
Ola Liljedahl was first, presenting “How to utilize lock-free programming for more scalable programs.” As Ola presented it, lock-free programming is about using Compare-and-Swap (CAS) instructions to atomically update data without explicit locks. Basically, pushing the problem of atomic updates from the software (no explicit locks) into the hardware. Lock-free programming offers significant benefits to software scalability, but there are several pitfalls as well.
Jonas Svennebring had the second presentation of the evening, talking about “Hardware complexity and how it impacts software.” He looked at the problem of synchronization, atomic operations, and software scalability from the memory-system angle. Classic cache coherency has issues scaling when many cores are contending for the same lock using atomic instructions. Instead, you can add “remote atomics” to the hardware, instructions that perform atomic operations out in the memory system. This provides much better scaling than bouncing cache lines around.
But presentations are just presentations. You can attend a presentation remotely while sitting all alone at home. The point of the meetup is to meet up. And that worked really well. We had lively discussions all night long: both a structured question-and-answer session, and small spontaneous group discussions over beer, wine, and cheese.
The discussion between the presenters and the audience in the Q&A session revolved, I would say, around how to write software that makes good use of large numbers of general-purpose cores to run a single program or application efficiently and with good performance scaling. The current fast growth (or explosion) in the core counts of general-purpose processors has not been entirely free. As core counts go up, more of the hardware aspects start to leak through to software.
In particular, core-to-core and memory-to-core latencies are non-uniform enough that they start to affect software performance and real-time behavior (I recommend reading some of the articles on ChipsAndCheese.com where they painstakingly plot out the core-to-core latencies to get a sense for this). One issue in particular is that with memory controllers attached at particular points of the on-chip or multi-die interconnect, memory allocation affects performance and latencies. Ideally, you want to allocate the memory used by code on a certain processor to an address range handled by a memory controller close to that core.
Current software abstractions and APIs do not appear to suffice to deal with these issues. Pinning threads to particular cores offers a possible workaround, physically grouping related threads to keep latencies between them down and under control.
Another question was whether (and how) software should be rearchitected to better handle the realities of current hardware. The answer was “probably yes, but rather not”. Or is it possible to solve the issues with another layer of software? Can runtime systems and schedulers help handle the hardware effects? There is a trend towards providing more and more information from the hardware to help the operating system make good decisions, such as the Intel Thread Director and ARM’s features that report hardware performance to software for the same purpose.
It should be noted that both lock-free programming and using remote atomics require some changes to the software. Some of those might be possible to hide as implementations of operating-system or library primitives. But is it possible to reliably encapsulate techniques like lock-free programming inside of libraries, like we do with numerical instruction set variations? I don’t know. But it is fun to discuss this with experts.