Silent drive corruption (LVM — the silent killer)

UPDATE: it appears to be software raid, rather than LVM, that is causing the corruption. More tomorrow after more testing.

There’s a fair amount of disk space on my mythtv backend, set up like this:

Onboard SATA 2×250GB -> md0 (raid 1) -> LVM [1]
PCI SATA 2×300GB -> md1 (raid 1) -> LVM [2]

The LVM volume is formatted as one big ext3 filesystem. I was going to copy a bunch of AVIs off my main file server to the mythtv machine (since that’s the only place I watch them anyway), and after doing a big rsync, before deleting the originals, I thought “Hey, I should do that checksum rsync just for fun”. The -c option makes rsync checksum the files to decide whether to sync a new copy, rather than just comparing file size and modification time. All the files failed that checksum.
Hmm! That’s rather strange! rsync must be broken!
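For reference, here is a minimal sketch of that kind of checksum comparison, done as a dry run so nothing gets copied. The function name is made up for illustration, and the paths you pass in are whatever your own source and destination are.

```shell
# list_checksum_mismatches SRC_DIR DEST_DIR   (hypothetical helper name)
# Prints every file rsync would re-send when comparing by checksum.
#   -r  recurse into directories
#   -c  compare file contents by checksum instead of size + mtime
#   -n  dry run: report only, copy nothing
list_checksum_mismatches() {
    rsync -rcn --out-format='%n' "$1"/ "$2"/
}
```

Something like `list_checksum_mismatches /srv/videos mythtv:/var/video` (both paths are placeholders) should print nothing when the two copies really are identical.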

So I did an md5 of the same file on each machine, and lo and behold, they were different. The next step was to copy a file locally on the mythtv machine to see whether the corruption was disk related or network related.

Doing a cp of a file creates a new file with a different md5. The md5 is the same on each subsequent check of any given file, which means the corruption is happening during writing rather than reading. It appears that the corruption is somewhere in LVM, MD, ext3, SATA, or at the drive level.
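The cp/md5 test can be sketched roughly like this. The function name and the .copytest suffix are invented for the example, and it assumes md5sum from coreutils is available. Note that a read straight after a write may be served from the page cache, so a clean result on a small file proves less than a clean result on a file larger than RAM.

```shell
# copy_and_verify FILE   (hypothetical helper name)
# Copies FILE next to itself and compares the MD5 of original and copy.
# Prints "OK <md5>" and returns 0 when they match,
# prints "MISMATCH <md5> <md5>" and returns 1 when they do not.
copy_and_verify() {
    cv_src=$1
    cv_copy="$cv_src.copytest"
    cp -- "$cv_src" "$cv_copy" || return 2
    sync    # push the copy out to disk before reading it back
    cv_a=$(md5sum < "$cv_src" | cut -d' ' -f1)
    cv_b=$(md5sum < "$cv_copy" | cut -d' ' -f1)
    rm -f -- "$cv_copy"
    if [ "$cv_a" = "$cv_b" ]; then
        echo "OK $cv_a"
    else
        echo "MISMATCH $cv_a $cv_b"
        return 1
    fi
}
```

Running it over a handful of large files on the suspect filesystem gives a quick yes/no on whether writes are coming back intact.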
The next step is to pull out one of the pairs of drives and test them outside of LVM. To do this, I’ll need to remove enough content from the partition to shrink it down to 230GB, then remove the md1 pair (the more likely source of corruption because of the PCI SATA card). After they’re pulled out, I’ll make a filesystem on them and try the same cp/md5 test. If there’s no corruption on them, I’ll need to move all the content to them (outside of LVM), reformat the LVM one, try the cp/md5 test on the new LVM one, then copy the content back, add md1 back into the pool, and expand the partition.

The only reason I think this will help is that while I was trying to get the myth machine more stable, it crashed about a million times, possibly corrupting the filesystem. The causes of the crashes appear to be primarily the eepro driver (e100 has been rock solid for me) and actually using the smart daemon to try to find out about drive failure (the crashes and their causes are worth their own post at some point).

The steps for the test are:

Remove content

resize2fs /dev/myth_volume/myth_logical 230G
lvreduce -L -279.46G /dev/myth_volume/myth_logical
vgreduce myth_volume /dev/md1
pvremove /dev/md1
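Since the order matters here (the filesystem has to shrink before the logical volume does), here is a hedged sketch that only prints a plan and runs nothing. The function name and the mount point argument are made up for illustration. Unlike the commands above, it uses an absolute size for lvreduce. It also adds the usual umount and e2fsck -f before the offline resize, plus a pvmove that is only needed if extents are still allocated on the physical volume being removed.

```shell
# print_shrink_plan VG LV PV FS_SIZE MOUNTPOINT   (hypothetical helper name)
# Echoes the shrink-and-remove sequence; nothing is executed.
print_shrink_plan() {
    sp_dev="/dev/$1/$2"
    cat <<EOF
umount $5
e2fsck -f $sp_dev
resize2fs $sp_dev $4
lvreduce -L $4 $sp_dev
pvmove $3
vgreduce $1 $3
pvremove $3
EOF
}
```

For example, `print_shrink_plan myth_volume myth_logical /dev/md1 230G /mnt/store` (the mount point is a placeholder) prints the seven commands in order so they can be reviewed before anything destructive is typed.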

I’m at this point now and have made a new ext2 partition on md1, and I’m not seeing the md5s change, either on the (now half size) LVM partition or on the new ext2 partition. It’s possible that the ext2 resize fixed it, or something like that.

I’m going to go through with the rest of rebuilding it all though. I’m tempted to go back to XFS since the ext3 performance is so bad, but I think I’m going to hold off for now.


[1]
0000:00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
[2]
0000:00:0a.0 Unknown mass storage controller: Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)
