On the longevity of hard drives

By now, the fact that disk drives fail far more often than the vendors say they should, and for different reasons than we used to think, should be old news. However, it’s been on my mind a lot lately, as in the last three months I’ve lost two drives, and a third is starting to fail. Coincidentally, all three are Maxtor DiamondMax 10 drives, one 150 GB and two 300 GB, all SATA150. They are all well within their design life (and warranty), and they have all operated well within their environmental limits. There is no obvious reason why three of my six Maxtor drives should fail in such rapid succession while all my Western Digital drives, some of them twice as old, are fine. In fact, I’ve never lost a Western Digital drive. On the other hand, all the IBM drives I’ve had are toast, as is the only Seagate I ever bought, a Barracuda that was pretty much DOA, though I misidentified the problem and let the disk lie on a shelf while the warranty ran out.

Dead Disk Update

In the end, I only lost two sectors: one in the middle of an ISO file in my home directory, another somewhere in my DocumentRoot. Both files were easily recoverable. The affected file systems are now safely parked on a mirror while I get the new array up and running.

ZFS proved uncooperative at first: I had trouble getting a consistent and up-to-date set of patches, and every time I tried to create a file system and copy data over, the kernel would panic. Pawel and I tracked the panic down to zfs_reclaim(), which was calling vdropl() directly instead of vdrop(). The catch is that vdropl() is private to vfs_subr.c and declared static; the code only happened to build and work because ZFS was being built with most warnings turned off, and most testers didn’t set CPUTYPE. Giving vdropl() external linkage and a prototype in vnode.h put an end to the kernel panics.

Oh, and by the way, bunnies are cute.

What we have here is a lack of redundancy

What kind of idiot stores his home directory and his entire web content on a striped set of four disks, with no redundancy?

Me, that’s who. You’d think I would have learned from January’s fiasco that disk crashes don’t just happen to other people. But back in January, I was (relatively) lucky, as the disk that crashed was part of a mirrored set. Not so this time.

The file server actually crashes when trying to read from the faulty disk, so I had to get creative and figure out a way of not only copying it to a healthy disk over the network, but doing so in a way that allows me to recover from crashes and continue where I left off. The result is ndr, the Network-assisted Disk Recovery tool.
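
The core idea, copying block by block and checkpointing progress so that a crash only costs the block in flight, can be sketched in a few lines of Python. This is not ndr's actual code: every name here (resumable_copy, load_offset, save_offset, the state file, the block size) is hypothetical, the copy is local rather than over the network, and it assumes a bad sector shows up as a read error rather than a hang.

```python
import os

BLOCK_SIZE = 64 * 1024  # hypothetical default; smaller blocks lose less data per bad read


def load_offset(state_path):
    """Return the last checkpointed offset, or 0 if no checkpoint exists yet."""
    try:
        with open(state_path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_offset(state_path, offset):
    """Write the checkpoint to a temp file and rename it into place,
    so a crash never leaves a half-written checkpoint behind."""
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(offset))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, state_path)


def resumable_copy(src_path, dst_path, state_path, block_size=BLOCK_SIZE):
    """Copy src to dst one block at a time, resuming from the checkpoint.

    Blocks that raise a read error are replaced with zeros. Returns the
    offsets of the blocks that could not be read during this run.
    """
    offset = load_offset(state_path)
    bad_blocks = []
    # Open the destination without truncating it, creating it if needed.
    dst_fd = os.open(dst_path, os.O_RDWR | os.O_CREAT, 0o600)
    with open(src_path, "rb", buffering=0) as src, os.fdopen(dst_fd, "r+b") as dst:
        while True:
            try:
                src.seek(offset)
                data = src.read(block_size)
            except OSError:
                # Unreadable block: substitute zeros and remember where it was.
                data = b"\0" * block_size
                bad_blocks.append(offset)
            if not data:
                break  # end of source
            dst.seek(offset)
            dst.write(data)
            dst.flush()
            os.fsync(dst.fileno())  # slow, but the checkpoint must not run ahead of the data
            offset += len(data)
            save_offset(state_path, offset)
    return bad_blocks
```

Running the same command again after an interruption picks up at the last checkpointed offset. In a networked setup, the checkpoint would presumably have to live on the healthy receiving machine, since the sending side is the one that crashes.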

Detect drives done, no any drive found

(actual diagnostic message from the on-board JMicron RAID controller on an Asus P5B-V motherboard)

I learned a few lessons on Monday:

Lesson the first
If you have a RAID 1 array, and one of the drives suddenly drops out of it, do not simply assume it was a software error and reassign it, or you will be very unhappy a few months later when that drive really fails.