Earlier in July 2019, I had the honor of presenting one of the keynote talks at the 19th SAMOS (International Conference on Embedded Computer Systems: Architectures, MOdeling, and Simulation) conference, held on the island of Samos in Greece. When I got the invite, I had no real idea what to expect. I asked around a bit and people said it was a good conference with a rather special vibe. I think that is a very good description of the conference: a special vibe. In addition to the usual papers and sessions, there is a strong focus on community and social events, fostering discussion across academic disciplines and between industry and academia. There were many really great discussions in addition to the paper and keynote presentations, and overall it was one of the most interesting conferences I have been to in recent years.
SAMOS on Samos
The conference was founded in 2001 by Stamatis Vassiliadis, who hailed from Samos. For a long time, the conference was actually held in his old hometown on the north of the island, far from everything else. In recent years, it has moved to the south side of Samos, to the main city of Pythagorion close to the airport.
The conference dinner was still held in the same restaurant on the north side – but the bus ride up there clearly showed how the north side of Samos has really suffered in recent times. The hotels are closed, there are plenty of abandoned housing development projects, and only a few restaurants survive. In the south, meanwhile, there are many lively beach resorts catering to sun-and-beach tourists from Northern Europe. The conference makes good use of the location, with a lot of time spent outdoors. It was so warm and relaxed that I actually went to the conference sessions dressed in shorts and sandals – which has never really happened before. The use of the outdoors included two social events – one boat trip to a remote beach for lunch and a beachnote, and one bus trip to a beach for lunch and discussions. This provided many hours to interact with the other conference attendees for some really useful and insightful discussions.
I delivered my keynote on the topic of virtual platforms in industry, talking about the many different use cases we have seen over the years. From shifting software left and other pre-silicon verification activities, to topics like debugging and supporting legacy and out-of-production hardware. The slides should come up on the conference website at some point – and I will post them on my own website.
Yale Patt on Insights and Mechanisms
Yale Patt, the computer architecture legend, delivered a “beachnote” like he has apparently done every year since 2008. Yale is now eighty years old, but still very sharp and active. By the way, at the time of writing this blog, the photo of Yale on his Wikipedia entry actually shows him wearing a SAMOS conference polo shirt!
The beachnote is an interesting concept – rather than a talk in the dark auditorium of the conference center, it is a lecture performed as part of the first conference excursion, outside, in the spirit of the old Greek philosophers who would lecture outside without projectors and PowerPoint. The purpose of the beachnote is to provoke thought and foster discussion, rather than deliver specific information on a specific topic.
This year, the beachnote addressed the question of what is actually being accepted for publication at computer architecture conferences. As a researcher, you need to get published in order to get credit for your work, and thus the norms and unwritten rules that guide reviewing in a particular field are critical in driving the overall direction and style of research.
With a bit of simplification and abstraction, we can consider research to produce insights and mechanisms as its results. Insights are realizations about how things work, and are usually key to big advances in computer architecture. Mechanisms are what we build in order to actually take advantage of insights. To evaluate the mechanisms, it is also necessary to have some kind of baseline. Yale was very insistent that work should be compared to the best known results, not just some randomly selected baseline that makes the results look good. Often, there is a trivial baseline that is easy to use, but which is far from the best available.
Yale brought up branch prediction as an example of how insight drives mechanism. We started out with simple one-level predictors that looked at each branch by itself, and which, while better than nothing, still missed a lot of branches. A key insight was that the outcome of a branch did not so much depend on the branch itself, but rather on the preceding history of branches. Thus, the two-level pattern-based predictors were created, with much better accuracy. It is difficult to see how someone could have invented two-level predictors without first having the insight.
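To make the insight concrete, here is a minimal sketch comparing a one-level predictor (one saturating counter per branch) with a two-level predictor that indexes counters by global branch history. The classes, table sizes, and the synthetic branch trace are all invented for illustration; real predictors use hashed indices, fixed-size tables, and many refinements.

```python
# Sketch: one-level vs. two-level branch prediction on a correlated branch.
# All names and sizes here are illustrative, not any real hardware design.

class OneLevel:
    """One 2-bit saturating counter per branch address."""
    def __init__(self):
        self.counters = {}
    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2       # default: weakly not-taken
    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

class TwoLevel:
    """Global branch history selects among pattern-indexed 2-bit counters."""
    def __init__(self, history_bits=4):
        self.history = 0
        self.mask = (1 << history_bits) - 1
        self.counters = {}
    def predict(self, pc):
        return self.counters.get((pc, self.history), 1) >= 2
    def update(self, pc, taken):
        key = (pc, self.history)
        c = self.counters.get(key, 1)
        self.counters[key] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | taken) & self.mask

def accuracy(predictor, trace):
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)

# One branch whose outcome follows a repeating taken-taken-not-taken pattern:
# perfectly predictable from history, hopeless for a per-branch counter.
trace = [(0x400, [1, 1, 0][i % 3]) for i in range(3000)]
print(f"one-level: {accuracy(OneLevel(), trace):.2f}")   # stuck near 2/3
print(f"two-level: {accuracy(TwoLevel(), trace):.2f}")   # near-perfect
```

On this trace the one-level predictor saturates toward "taken" and misses every third branch, while the history-indexed predictor learns the pattern almost perfectly – the mechanism only makes sense once you have the insight that history matters.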
Given a particular insight, it is also clear that there is room for a lot of improvement over time in how it is exploited. Just look at how branch predictors have evolved and keep evolving! And when they evolve, new papers should compare to the previously best published results, not something like “no branch prediction at all”. To foster progress, we need to build on and go beyond what has been done before.
However, today, it would likely be very hard to publish just the basic fundamental insight on its own without an accompanying mechanism. That mechanism then had better be good enough to improve overall performance significantly, like 10-20% on SPEC benchmarks simulated on top of gem5 or something like that. Only then will reviewers tend to rate your paper as “accept”. If your paper does not feature bar graphs showing performance differences, you likely will not get into the best computer architecture conferences. On the other hand, if you just randomly try stuff and get that 10-20% improvement without really understanding why, you are very likely to get published (machine learning was mentioned as a field where this tends to happen).
This sounds rather broken, and it is not clear just what we can do about it. It seems that computer architecture in particular has become a field where you just have to do certain things to get published, even if those things are mostly irrelevant and beside the point for what you have discovered and want to say. We had a lot of discussions around this at the conference, but no real solutions.
I definitely want to see insights be published – insights help explain how the world works and can have very long lifetimes, while mechanisms are likely to be eclipsed by better mechanisms as time goes by.
Another problem is that results showing that some mechanisms do not work very well tend not to get published. If you have what sounds like a great idea for a mechanism that turns out to be not-so-great in the end, you might well have a very useful insight to share… but it is very unlikely that reviewers will accept a paper that says “we did all of this and in the end performance decreased by 5%”.
Totally independently, I made a similar point in my own keynote (though limited to a much narrower question). A few years ago, I wrote a blog post about why architectural simulation at the cycle level is often a bad idea in research, and ten years ago I opined that trying to cook up “cycle accurate” models post-hoc is likely not to work particularly well. In my keynote, I showed a slide complaining about how papers are written in order to make reviewers happy.
Homogeneous vs Heterogeneous with Soner Önder
An ongoing discussion during the conference was the question of heterogeneous vs homogeneous architectures. This really started out with Soner Önder’s keynote on the first day where he made the case for homogeneity in computer architecture – while everything we build today is basically heterogeneous compute. However, homogeneity and heterogeneity can be considered at a lot of levels and from a lot of angles.
What I think Soner said was that he had found a clever way to design an instruction set or hardware-software interface that makes it possible to capture at least the instruction-level and fine-grained threading parallelism in a uniform way. It is the Future-gated Single Assignment (FSGA) form, presented at CGO 2014. He is designing a machine to actually execute such forms directly, which has the potential to really expose inherent algorithm parallelism to the hardware for execution.
However, underlying this mechanism, the deeper design philosophy and core insight is to reconnect control-flow and data-flow dependencies. As he points out, classic von Neumann architectures separate control and data, leading to the complex pipelines of current designs where the processors work frantically to make sure that data dependencies are honored while running ahead in the control flow to issue as many instructions as possible in parallel.
What he is exploring with FSGA is how to express programs in a way that captures control and data dependencies in a unified way – like classic data-flow architectures, but different enough to make it possible to compile from C-style code and not just from functional languages.
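To get a rough feel for the general flavor of gated single-assignment forms (and this is my own toy illustration, not the actual FSGA form), here is a sketch in Python: instead of a control-flow branch deciding which assignment executes, a gating function selects among already-computed values, so the control decision is carried as a data dependence. The function names are invented for this example.

```python
# Toy flavor of gated single assignment, purely illustrative.
# In a conventional program, control flow decides which assignment runs.
# In a gated form, all assignments are single and unconditional, and a
# gating function folds the control decision into the data flow.

def gate(pred, val_true, val_false):
    """A gating (phi/psi-like) selector: control carried as data."""
    return val_true if pred else val_false

def abs_cfg(x):
    # Conventional control flow: the branch decides which code executes.
    if x < 0:
        y = -x
    else:
        y = x
    return y

def abs_gsa(x):
    # Gated form: both candidate values exist as single assignments,
    # and the gate selects. Independent assignments like y_neg and y_pos
    # can in principle be evaluated in parallel.
    p = x < 0
    y_neg = -x
    y_pos = x
    return gate(p, y_neg, y_pos)

assert abs_cfg(-5) == abs_gsa(-5) == 5
```

The interesting part, as I understand it, is that once control decisions are expressed this way, the hardware no longer has to speculate past branches to find the parallelism – the dependencies are explicit in the representation.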
He did note that publishing these kinds of results is hard – reviewers want bar graphs and numbers. They even insist on numbers for things like power consumption and performance estimates compared to current machines… which is honestly pointless at this level of novelty. Research is about exploring ideas and the unproven, not about building a product on a given schedule with a given performance goal.
What about heterogeneity? At a high level, pretty much everyone agreed that computer systems are and will be heterogeneous, with specialized processing elements and accelerators like GPUs in addition to the general-purpose cores. However, it might still make sense to try to make them more homogeneous in programming – portability like that enabled by Java is nice. I personally would like to see more domain-specific languages, but that is not necessarily the focus of researchers working on rather general-purpose machines.
Memory and Compute with Onur Mutlu
Unfortunately, the last scheduled keynote speaker of the conference had to cancel at the last moment. Onur Mutlu from ETH Zürich and Carnegie Mellon filled the gap with an impromptu talk about the problem with memory – in particular DRAM. In current designs, memory accesses are a main limiter to performance. Memory uses a lot of energy, and we could consider the loading of data from DRAM into the on-chip cache hierarchy a waste of power.
The main problem is that DRAM latency is hardly improving at all. From 1999 to 2017, DRAM capacity has increased by 128x, bandwidth by 20x, but latency only by 1.3x! This means that more and more effort has to be spent tolerating memory latency. But what could be done to actually improve memory latency?
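Converting the cited 1999-2017 numbers into compound annual improvement rates makes the latency stagnation stand out even more starkly. This little calculation just reworks the figures quoted above:

```python
# Compound annual improvement implied by the 1999-2017 DRAM scaling
# numbers cited in the talk (128x capacity, 20x bandwidth, 1.3x latency).

years = 2017 - 1999
for metric, factor in [("capacity", 128), ("bandwidth", 20), ("latency", 1.3)]:
    annual = factor ** (1 / years) - 1          # per-year compound rate
    print(f"{metric:>9}: {factor:>5}x total = {annual:5.1%} per year")
```

Capacity improved by roughly 31% per year and bandwidth by roughly 18% per year, while latency improved by only about 1.5% per year – essentially flat.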
One interesting piece of research that Onur’s team has done is to look at what DRAM latencies actually look like, by measuring the actual latencies they get for data access in real DRAM chips. It turns out that the JEDEC DRAM standards use very conservative latency specifications. In practice, the vast majority of DRAM chips can operate at much lower latencies most of the time – but current standards expect and force memory controllers to treat them all as worst-case chips at maximum temperature. You could “easily” get a 30% latency improvement by having DRAM chips provide a bit more precise information to the memory controller about actual latencies and current temperatures.
It sounds very similar to the power and thermal management and binning that is applied to processors today – some will run at higher frequencies than others since they got lucky in manufacturing. And operating speeds are adjusted on the fly to account for current power consumption and temperatures. It is actually a very natural idea. They have many papers about this, but a good one is the SIGMETRICS 2016 paper “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization”.
Another concept to truly break through the memory wall is to move the compute to the memory. Basically, why not put the compute operations in memory? One way is to use something like High-Bandwidth Memory (HBM) and shorten the distance to memory by stacking logic and memory.
Another rather cool (but also somewhat limited) approach is to actually use the DRAM cells themselves as a compute engine. It turns out that you can do copy, clear, and even logic ops on memory rows by using the existing way that DRAMs are built and adding a tiny amount of extra logic. This means that operations can be performed without data ever having to move to the processor chip – with very high speed and low power. It should be noted that copy and clear are rather common operations – measurements by Google that Onur cited said that memcpy and memset make up some 5% of all cycles in their workloads! This type of analog compute in the DRAM array is kind of like making the Rowhammer problem work for you instead of against you.
However, there are some limitations to the “compute in the DRAM chip” approach. It only works within a single DRAM chip or maybe a single DIMM – beyond that, you have to move data over the standard DRAM bus and you cannot take shortcuts within the memory itself. It also operates on rather large chunks of data, like 512 bytes or more, and memory allocations in operating systems would need to adapt to make sure that different pieces of data do not share the same row in memory – or take some kind of additional action to “save” unrelated data… which gets back to the memory bus. With a bit more sophistication I guess you could start to add mask registers to the DRAMs to somehow limit the scope of operations, but then the nice analog compute is mostly gone.
Still, a very intriguing idea.
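The row-copy trick described above can be sketched in software. This toy model is my own illustration in the spirit of that research (the class, sizes, and method names are all invented): activating a source row latches it into the row buffer, and writing the buffer into a destination row completes the copy without any byte crossing the external memory bus.

```python
# Toy model of in-DRAM row copy. Purely illustrative: real DRAM does this
# with back-to-back activate commands and analog charge sharing, not code.

class DramArray:
    def __init__(self, rows=8, row_bytes=512):
        self.rows = [bytearray(row_bytes) for _ in range(rows)]
        self.row_buffer = bytearray(row_bytes)
        self.bus_transfers = 0          # bytes moved over the external bus

    def activate(self, row):
        """Latch a row into the row buffer (destructive read + restore)."""
        self.row_buffer[:] = self.rows[row]

    def copy_row(self, src, dst):
        """In-DRAM copy: row buffer contents land in dst, no bus traffic."""
        self.activate(src)
        self.rows[dst][:] = self.row_buffer

    def cpu_copy_row(self, src, dst):
        """Conventional copy: every byte crosses the bus twice."""
        data = bytes(self.rows[src])    # read into the processor
        self.rows[dst][:] = data        # write back out to memory
        self.bus_transfers += 2 * len(data)

mem = DramArray()
mem.rows[0][:4] = b"data"
mem.copy_row(0, 1)                      # in-DRAM copy
print(bytes(mem.rows[1][:4]), mem.bus_transfers)    # b'data' 0
mem.cpu_copy_row(0, 2)                  # conventional copy of one 512-byte row
print(mem.bus_transfers)                # 1024
```

The model also makes the granularity limitation obvious: the unit of copying is a whole row, which is exactly why operating-system memory allocation would have to cooperate.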
Doing Real Hardware
I have not attended an academic computer architecture conference in a decade (or so), and one thing that was striking was just how many research teams were working with real RTL and real hardware. Several of the regular papers at the conference did evaluations by implementing actual processor cores, interconnects, accelerators, and other mechanisms in RTL and running the systems on FPGAs. The wide availability and relative ease-of-use of FPGAs has started to really make an impact, along with free cores available as RTL (including both RISC-V and ARM Cortex-M).
Another “actual RTL” aspect was the evaluation of designs for low-power by taking RTL-level designs and actually synthesizing them to an ASIC target, using libraries from real fab processes and doing power estimation using industrial tools from the EDA vendors. This demonstrates how useful it can be to provide access to real commercial tools to researchers – not just as a way for vendors to build mindshare and a pipeline of future talent, but also helping the community overall gain new insight and invent new things.
Some of the researchers were even going further, towards actually manufacturing chips using the various cheap options open to research projects these days. That is highly impressive! My only advice here is to remember to put some real hardware debug features into the chips before sending them off to manufacturing… otherwise it will be not-fun to figure out why they do not run right once they come back…
The conference had a dedicated session on “negative results”. This is something I really like seeing, as there is a lot of insight to be had from negative results. It is also clear that doing a good session on negative results is hard – only a certain type of negative results qualify. It is necessary to have a well-explained and well-analyzed failure, from something you had reason to believe might succeed. It is easy enough to fail in a trivial way, and it is not sufficient to invent a mechanism, simulate it, and get poor results on benchmarks. Thankfully, the papers presented were all “good” failures where some useful lessons could be learnt.
The first paper was about accelerating video encoding on GPUs, in particular motion estimation. It turned out that previous publications had overstated what could be achieved, by selecting a rather poor baseline. A GPU can be faster than a CPU on this, but it does so while expending an order of magnitude more work, since current algorithms are a rather poor fit for GPUs. The recommendation is to look for new algorithms that better fit GPUs.
The second paper dealt with the mapping of task graphs to processing nodes in a network-on-chip, and showed that an intuitive “nice” mapping that should help multiple algorithms share a chip is a lot less efficient than a “messy” mapping.
The third paper was about why the CAPH dataflow-based high-level synthesis language has failed to catch on beyond a small circle of academics. Interesting case study, if a bit unclear on what exactly should have been done differently. There is a tension here between designing an interesting language with interesting ideas that might themselves be picked up by other languages in the future – and designing a language to solve enough real-world pain points to get a growing user base. Sometimes you are lucky and get both at once, but more often than not you end up designing toward one of these two points.
The Law of 100k Slowdown
A final observation from SAMOS was that my decade-old observation seems to hold up: cycle-level simulations and cycle-accurate transaction-level models all seem to end up at a slowdown of around 100,000x. I talked to quite a few research groups at SAMOS, and no matter who builds the simulator or how, once a cycle-level simulator is under load it ends up at around 100k (give or take 2x or so). It is a bit curious to see this hold up so well – even going back 50 years.
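To see why that number matters in practice, it is worth doing the arithmetic. At a 100,000x slowdown, simulating even modest amounts of target time becomes painful (the target times chosen here are just examples):

```python
# What a 100,000x slowdown means in host time, for a few example
# amounts of simulated target time.

slowdown = 100_000
for target_seconds in (0.001, 1, 60):
    host_seconds = target_seconds * slowdown
    print(f"{target_seconds:>8} s of target time -> "
          f"{host_seconds / 3600:8.2f} h of host time")
```

One millisecond of target time takes under two minutes on the host, but a single second of target time takes more than a day, and a minute of target time takes over two months. This is why cycle-level simulation is limited to short benchmark slices, and why booting an operating system at this level of detail is essentially out of the question.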
A lot more could be said, and I could have spent another thousand words trying to summarize all the other papers and discussions from the conference.