A month ago, I participated in a seminar at Schloss Dagstuhl in Germany, about “Discrete Algorithms on Modern and Emerging Compute Infrastructure”. Not my usual cup of tea, but it was very interesting and insightful nevertheless. I have attended a Dagstuhl seminar once before, back in 2003.
Continue reading “Schloss Dagstuhl (and a Seminar and Cerebras)”Tag: HPC
When is Redundancy Cheaper?
I find the subject of fault tolerance and resiliency in computers quite interesting. It also very interesting to look into what kinds of faults actually do happen in the real world, and what impact they have. I recently found a couple of good sources on this. First of all, a paper from Super Computing 2012 by Fiala et al, called “Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing” (ACM Digital Library). One of its references was to a 2011 talk by Al Geist, “What is the Monster in the Closet”, which provided some more data on how common faults are.