Memory Hardware Failure Research Project

Overview

Our research focuses on the characteristics of memory hardware errors and their implications for software systems. There is a large body of research on memory fault tolerance. Researchers often collect data through accelerated tests in controlled environments. We have taken a different approach: we conduct field tests that monitor computers in real time and record the actual errors that occur on them.

Memory errors can be categorized as soft errors and hard errors. Soft errors, also known as transient errors, are caused by environmental factors only and have no lasting effect on the memory hardware. Hard errors (or permanent errors) are due to real hardware defects. We discovered that in each of our measured environments, the soft error rates are orders of magnitude lower than previously reported on a per-Mbit basis. In the server farm at Ask.com, we found quite a few hard errors. We analyzed error rates and patterns in our previously published papers at USENIX'07 and HotDep'07. We are currently using the collected data to predict the failure rate and failure pattern of the whole memory system under different ECC schemes and maintenance strategies.

Publications

Project Members

Xin Li, PhD Candidate @ U of Rochester
Kai Shen @ U of Rochester
Michael Huang @ U of Rochester
Lingkun Chu @ Ask.com

We would also like to thank Tao Yang and Alex Wang at Ask.com, who made this research and the data release possible.

Contact

Xin Li (xin.li@rochester.edu)

Link

We participate in the USENIX computer failure data repository. You can find more failure data in addition to ours by following this link.

Data download and format description

Our raw data file can be found here. The file is in CSV format, which all major data analysis software can read.

We give a brief description of the fields. For more information, please contact me.
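
As a starting point for working with the raw file, here is a minimal Python sketch that reads CSV records using only the standard library. It assumes the first row of the file is a header row, which is not confirmed here. The function name load_error_records and the sample column names field_a and field_b are placeholders invented for illustration; they are not the real field names of the data set.

```python
import csv
import io


def load_error_records(text_stream):
    """Read CSV records from an open text stream.

    Returns (field_names, rows). Each row is a dict keyed by the header
    names found in the file itself, so no field names are hard-coded.
    Assumes the first row of the stream is a header row.
    """
    reader = csv.DictReader(text_stream)
    rows = list(reader)
    return list(reader.fieldnames or []), rows


# Placeholder content standing in for the real file; "field_a" and
# "field_b" are NOT the actual field names of the data set.
sample = io.StringIO("field_a,field_b\n1,x\n2,y\n")
field_names, rows = load_error_records(sample)
print(field_names)
print(len(rows), "records")
```

To apply this to the downloaded file, open it with open(path, newline="") and pass the resulting file object to the function in place of the in-memory sample.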

Method

Some details you should be aware of:

  • Our monitoring started on Nov 30, 2006.
  • The memory hardware scrubs the memory continuously and records the errors it discovers in special-purpose registers. At any time, the registers can hold information about only two errors; until a user-level program clears that information, further error information is discarded. Therefore, in any period of time, there can be many more errors present in the system than the registers have recorded.
  • We poll the memory controller regularly (typically 3 times a day) for error information. After collecting the information, we clear the registers so that the hardware can continue recording new errors.
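
To illustrate why the recorded error count is only a lower bound, here is a small Python sketch of the recording scheme described above. It is a simplified model, not our monitoring code: the function name count_recorded, the use of hours as the time unit, and the evenly spaced 8-hour polls are assumptions made for the example. The only facts taken from the description are that the registers hold information about two errors and that each poll clears them.

```python
from bisect import bisect_right

REGISTER_SLOTS = 2  # the registers hold information about two errors


def count_recorded(error_times, poll_times, slots=REGISTER_SLOTS):
    """Return how many errors a polling program would observe.

    error_times: times at which the hardware discovers an error
    poll_times:  times at which the registers are read and cleared
    Between two polls, only the first `slots` errors are kept; the rest
    are discarded. Errors after the last poll are not yet observed.
    """
    errors = sorted(error_times)
    recorded = 0
    start = 0  # index of the first error not yet covered by a poll
    for poll in sorted(poll_times):
        end = bisect_right(errors, poll)  # errors discovered up to this poll
        recorded += min(end - start, slots)
        start = end
    return recorded


# Hypothetical day, in hours: a burst of four errors before the first
# poll, then one error in each of the two later intervals.
errors = [1, 2, 3, 4, 9, 20]
polls = [8, 16, 24]  # assumed spacing for "3 times a day"
print(len(errors), "errors occurred;", count_recorded(errors, polls), "recorded")
```

In this made-up example, two of the four errors in the first interval are lost because both register slots are already full, so the poller sees four of the six errors.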