Speaker: Fred Douglis, CTO office of the EMC Core Technologies Division
Title: RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures
Authors: Ao Ma, Rachel Traylor*, Fred Douglis, Mark Chamness, Guanlin Lu, and Darren Sawyer, EMC Corporation; Surendar Chandra and Windsor Hsu, Datrium, Inc.
* work done during internship; student at U. Texas at Arlington
Host: Zhong Shao
Abstract: Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from 6 disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures.
With these findings, we designed RAIDShield, which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which account for 80% of all RAID failures. Second, we have designed and simulated a method that uses the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures; this method can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors.
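To illustrate the RAID-level idea described above, here is a minimal sketch in Python. It is not the method from the talk: it assumes each disk already has an estimated failure probability (hypothetical inputs here) and that disks fail independently, then computes the chance that a RAID-6 group suffers three or more failures, which is one more than RAID-6 can tolerate. The function names and the thresholds are illustrative assumptions.

```python
def prob_at_least_k_failures(probs, k):
    """Probability that at least k disks fail, assuming independent failures."""
    # dist[j] holds the probability that exactly j of the disks seen so far fail.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this disk survives
            nxt[j + 1] += q * p       # this disk fails
        dist = nxt
    return sum(dist[k:])


def raid6_at_risk(probs, group_threshold):
    """Flag a RAID-6 group whose chance of 3+ failures meets the threshold."""
    return prob_at_least_k_failures(probs, 3) >= group_threshold


# Hypothetical group: 14 disks, each individually below a per-disk
# flagging threshold of 0.5, yet collectively risky.
group = [0.2] * 14
no_disk_flagged = all(p < 0.5 for p in group)
group_flagged = raid6_at_risk(group, group_threshold=0.5)
```

In this hypothetical group, no single disk would be flagged on its own, but the group as a whole crosses the threshold, which is the situation the joint analysis is meant to catch.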
We conclude with discussions of operational considerations in deploying RAIDShield more broadly and new directions in the analysis of disk errors. One interesting approach is to combine multiple metrics, allowing the values of different indicators to be used for predictions. Using newer field data that reports an additional metric, medium errors, we find that the relative efficacy of reallocated sectors and medium errors varies across disk models, offering an additional way to predict failures.
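One simple way to picture combining metrics is sketched below in Python. This is only an assumption about how such a combination might look, not the approach from the talk: a disk is flagged when either its reallocated sector count or its medium error count reaches a limit chosen per disk model. The model names and threshold values are hypothetical placeholders, not field data.

```python
from dataclasses import dataclass


@dataclass
class DiskHealth:
    model: str
    reallocated_sectors: int
    medium_errors: int


# Hypothetical per-model limits; real values would come from field data.
THRESHOLDS = {
    "model-A": {"reallocated_sectors": 200, "medium_errors": 50},
    "model-B": {"reallocated_sectors": 500, "medium_errors": 10},
}


def predicted_to_fail(disk, thresholds=THRESHOLDS):
    """Flag a disk if either metric reaches its model-specific limit."""
    limits = thresholds[disk.model]
    return (
        disk.reallocated_sectors >= limits["reallocated_sectors"]
        or disk.medium_errors >= limits["medium_errors"]
    )
```

Keeping separate limits per model reflects the observation that the relative usefulness of the two metrics differs from one disk model to another.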
This work was presented at FAST 2015 and recently extended with additional authors for publication in ACM Transactions on Storage.
Bio: Fred Douglis is in the Advanced Development group of EMC Core Technologies Division, in the office of the CTD CTO. He works on systems and storage technologies, including deduplication, compression, and load balancing. He holds M.S. and Ph.D. degrees in computer science from U.C. Berkeley and a B.S. in computer science from Yale.
He has worked in industrial applied research throughout his career, including at Matsushita, AT&T (Bell) Labs, and IBM Research, before joining EMC in 2009. He has also been a visiting professor at VU Amsterdam and Princeton University. He served as editor in chief of IEEE Internet Computing from 2007 to 2010 and has been on its editorial board since 1999.