I find the subject of fault tolerance and resiliency in computers quite interesting. It also very interesting to look into what kinds of faults actually do happen in the real world, and what impact they have. I recently found a couple of good sources on this. First of all, a paper from Super Computing 2012 by Fiala et al, called “Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing” (ACM Digital Library). One of its references was to a 2011 talk by Al Geist, “What is the Monster in the Closet”, which provided some more data on how common faults are.