Silent corruption with software RAID1 under Linux
EDIT: I was wrong. The problem turned out to be the VIA chipset on the motherboard.
While reorganizing the files on my main home fileserver, I copied a bunch of video files to my MythTV machine, figuring I might even leave them there since that’s the only place I watch them anyway. The MythTV machine locked up multiple times (this turned out to be caused by the eepro driver for the Intel/100 card; e100 has been rock solid). Since rsync normally uses just timestamp and file size to decide whether a file needs to be updated, I figured I’d rerun it with the -c flag to force a checksum comparison.
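To illustrate why a corrupted copy can slip past a size-and-timestamp comparison but not a checksum comparison, here is a minimal Python sketch. It is not rsync's code; the function names `quick_check_same` and `checksum_same` are made up for this example, and the quick check is only an approximation of what rsync does by default.

```python
import hashlib
import os


def quick_check_same(src, dst):
    """Compare size and modification time only; file contents are never read."""
    a, b = os.stat(src), os.stat(dst)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_same(src, dst):
    """Compare full-content digests; both files are read completely."""
    return md5_of(src) == md5_of(dst)
```

A file whose bytes were damaged in transit or on disk keeps its size and timestamp, so only the second function would report a difference.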
All the files were corrupted!
I’ve tracked it down to something in the software RAID (not ext, LVM, bad hardware, etc.), perhaps in combination with the PCI SATA card I’m using.
To test, I used a simple script that newfs’d the partition, mounted it, wrote out multiple files of the requested size via dd, md5’d each resulting file, and unmounted the partition.
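The write-and-verify part of a test like the one described above can be sketched in Python. This is not the original script: the filesystem creation, mount, and unmount steps are left out because they need root and a real device, `os.urandom` stands in for dd as the data source, and the names `write_test_file`, `md5_of_file`, and `run_trial` are invented for illustration.

```python
import hashlib
import os


def write_test_file(path, size_bytes, block_size=1 << 20):
    """Write size_bytes of random data and return the MD5 of the bytes written."""
    expected = hashlib.md5()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(block_size, remaining))
            f.write(block)
            expected.update(block)
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to the device
    return expected.hexdigest()


def md5_of_file(path, block_size=1 << 20):
    """Read the file back and return its MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def run_trial(directory, file_count, size_bytes):
    """Return the paths whose read-back MD5 differs from what was written."""
    mismatches = []
    for i in range(file_count):
        path = os.path.join(directory, f"test_{i}.bin")
        expected = write_test_file(path, size_bytes)
        if md5_of_file(path) != expected:
            mismatches.append(path)
    return mismatches
```

Calling `run_trial` on a directory inside the mounted test partition returns an empty list when every file reads back intact. Note that a read immediately after a write may be served from the page cache rather than from disk, so a sketch like this is more convincing if the filesystem is remounted between writing and checking.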
The only time there was ever corruption was with RAID1. The first set of tests was done with the software RAID set up like this:
Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [raid6] [raid10] [faulty]
md1 : active raid0 sdb1[1] sda1[0]
      146480384 blocks 64k chunks

md2 : active raid1 sdb2[1] sda2[0]
      73240256 blocks [2/2] [UU]

md3 : active raid5 sdb4[3] sdb3[1] sda4[2] sda3[0]
      219720768 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 sdd1[1] sdc1[0]
      244195904 blocks [2/2] [UU]

unused devices: <none>
The second set was done with it set up like this:
Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [raid6] [raid10] [faulty]
md1 : active raid0 sda3[1] sdb3[0]
      146480512 blocks 64k chunks

md2 : active raid1 sdb4[1] sda4[0]
      73312512 blocks [2/2] [UU]

md3 : active raid5 sdb2[3] sdb1[2] sda2[1] sda1[0]
      219720576 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 sdd1[1] sdc1[0]
      244195904 blocks [2/2] [UU]

unused devices: <none>
md0 can be ignored in both cases; that’s the original RAID1 (which works fine as far as I know).
I also tested partitions sda1-sda4 and sdb1-sdb4 in the same way, independently.
The only difference between the configurations above is which partitions the RAIDs use: none of them uses the same partition in the second test that it did in the first.
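The claim that no array reuses a partition between the two layouts can be checked mechanically. The Python sketch below parses the array lines from the two mdstat listings (md0 is left out, since it is unchanged) and intersects the member sets. The helper name `parse_members` is invented for this example, and the regular expression only handles simple lines of the form shown here.

```python
import re


def parse_members(mdstat_text):
    """Map each md array name to the set of member partitions listed for it."""
    members = {}
    for line in mdstat_text.splitlines():
        match = re.match(r"^(md\d+)\s*:\s*active\s+\S+\s+(.*)$", line)
        if match:
            name, devices = match.groups()
            # Strip the "[n]" role suffix (and anything after it) from each device.
            members[name] = {re.sub(r"\[\d+\].*$", "", d) for d in devices.split()}
    return members


FIRST = """\
md1 : active raid0 sdb1[1] sda1[0]
md2 : active raid1 sdb2[1] sda2[0]
md3 : active raid5 sdb4[3] sdb3[1] sda4[2] sda3[0]
"""

SECOND = """\
md1 : active raid0 sda3[1] sdb3[0]
md2 : active raid1 sdb4[1] sda4[0]
md3 : active raid5 sdb2[3] sdb1[2] sda2[1] sda1[0]
"""

first, second = parse_members(FIRST), parse_members(SECOND)
for name in sorted(first):
    shared = first[name] & second[name]
    print(name, "shared partitions:", sorted(shared))
```

For the listings above, every array reports an empty list of shared partitions, so the RAID1 corruption followed the RAID level rather than a particular region of either disk.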
The md5s were all correct except on the RAID1 partitions, where a file would occasionally (and apparently at random) come back with a different checksum. No messages were logged about anything going wrong.




