Saturday, March 15, 2014
CERN on petabyte errors
- Disk errors. They wrote a special 2 GB file to more than 3,000 nodes every 2 hours and read it back, checking for errors, for 5 weeks. They found 500 errors on 100 nodes.
  - Single-bit errors. 10% of disk errors.
  - Sector-sized (512-byte) errors. 10% of disk errors.
  - 64 KB regions. 80% of disk errors. This one turned out to be a bug in WD disk firmware interacting with 3Ware controller cards, which CERN fixed by updating the firmware in 3,000 drives.
- RAID errors. They ran the verify command on 492 RAID systems each week for 4 weeks. The disks are spec’d at a bit error rate of one error per 10^14 bits read/written. The good news is that the observed BER was only about a third of the spec’d rate. The bad news is that in reading/writing 2.4 petabytes of data there were some 300 errors.
- Memory errors. Good news: only 3 double-bit errors in 3 months on 1,300 nodes. Bad news: according to the spec there shouldn’t have been any. Single-bit errors get corrected by ECC; only double-bit errors can’t be.
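
The disk check described above (write a known file, read it back, compare) can be sketched in a few lines of Python. This is a minimal illustration, not CERN's actual tool: the function names, the SHAKE-256 pattern generator, and the classification thresholds are all assumptions made for the example, with the three buckets simply mirroring the error sizes listed in the post.

```python
import hashlib
import os

SECTOR = 512          # bytes, the sector size quoted in the post
REGION = 64 * 1024    # bytes, the 64 KB region size quoted in the post


def make_pattern(size: int, seed: int = 0) -> bytes:
    """Deterministic pseudo-random bytes, so the expected content can be regenerated."""
    return hashlib.shake_256(str(seed).encode()).digest(size)


def write_probe(path: str, size: int, seed: int = 0) -> None:
    """Write the pattern file and ask the OS to push it to the device."""
    with open(path, "wb") as f:
        f.write(make_pattern(size, seed))
        f.flush()
        os.fsync(f.fileno())


def classify(expected: bytes, actual: bytes):
    """Return (offset, length, kind) for each run of adjacent corrupted sectors.

    The kinds are hypothetical labels chosen for this sketch.
    """
    assert len(expected) == len(actual)
    # Offsets of every sector whose content differs from the pattern.
    bad = [off for off in range(0, len(expected), SECTOR)
           if expected[off:off + SECTOR] != actual[off:off + SECTOR]]
    # Merge consecutive bad sectors into [start, end) runs.
    runs = []
    for off in bad:
        if runs and runs[-1][1] == off:
            runs[-1][1] = off + SECTOR
        else:
            runs.append([off, off + SECTOR])
    results = []
    for start, end in runs:
        end = min(end, len(expected))
        # Count flipped bits across the whole run.
        flipped = sum(bin(a ^ b).count("1")
                      for a, b in zip(expected[start:end], actual[start:end]))
        if flipped == 1:
            kind = "single-bit"
        elif end - start <= SECTOR:
            kind = "sector"
        elif end - start <= REGION:
            kind = "64KB-region"
        else:
            kind = "larger"
        results.append((start, end - start, kind))
    return results


def check_probe(path: str, size: int, seed: int = 0):
    """Read the file back and compare it with the regenerated pattern."""
    with open(path, "rb") as f:
        actual = f.read()
    return classify(make_pattern(size, seed), actual)
```

One caveat for anyone trying this: a read issued right after the write is usually served from the operating system's page cache rather than the disk, so a real check needs to bypass or flush the cache (or read back much later) before the comparison says anything about the drive.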
99% BAD HARDWARE WEEK: