Time to rebuild (or, why linux is not a choice when you want stability)
So, I’m finally recreating my site after losing years of work. It was my own fault I lost it, but linux could have been a bit more forgiving.
1st mistake: Using linux (yeah, kind of a troll). I needed it for some of the hardware support.
2nd mistake: Using XFS. It’s so much faster than ext3, but apparently it just isn’t quite ready for prime time.
3rd mistake: Using software raid rather than the 3ware raid. I wanted to be able to expand the raid, or combine with other drives (like using an onboard controller to have a 9th drive as a spare). Also, it’s much faster using the CPU instead of the 3ware (7500 series) chip for raid 5.
I have a server with 8 250GB drives. There’s a 10GB /, a 10GB /home, and the rest is either swap or a large /raid partition. I used to have it as XFS, it’s now back to ext3.
The server was completely locking up pretty regularly (once every 2 days or so on average), and of course there was nothing on the console or in the logs. I think this is where the problem started. It may have been bad hardware or something, but I just wish there had been any sign of what it might have been. Even remote syslog wasn’t showing anything. Every once in a while I’d see a panic, but I don’t know if it was related.
Then I decided to take everything out of the Antec case it was in and put it in a server case with better cooling on the drives to see if that would help (and maybe bring down the temperature of the machine as a whole to help with stability). That’s where the real problems started.
On one of the drives (a Hitachi somethingorother) one of the IDE pins pushed in when I put the connector in, and it went and broke the connection between that pin and the board. “Crap! Well, at least it’s part of a raid 5, so I have to be really really careful with the rest, then I can use my spare 250GB and rebuild it and everything’s fine.”
So, I got it all together, booted up, and it didn’t work. For some reason, linux had decided that one of the drives was now a spare, and with the missing drive that meant that the raid was gone.
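The reason that combination is fatal comes down to simple arithmetic: raid 5 keeps one drive's worth of parity, so it can run with at most one member missing. Here is a minimal sketch of that rule in Python. The function name is made up for illustration and has nothing to do with the actual md tools.

```python
def raid5_can_start(total_members: int, active_members: int) -> bool:
    """Return True if a raid 5 array can run with this many active members.

    Raid 5 stores one drive's worth of parity, so it tolerates
    exactly one missing member and no more.
    """
    if total_members < 3:
        raise ValueError("raid 5 needs at least three member drives")
    if not 0 <= active_members <= total_members:
        raise ValueError("active_members must be between 0 and total_members")
    return active_members >= total_members - 1


if __name__ == "__main__":
    # 8 drives, one physically broken: degraded, but still usable.
    print(raid5_can_start(8, 7))
    # 8 drives, one broken and one demoted to spare: only 6 active members.
    print(raid5_can_start(8, 6))
```

With 8 drives, losing one leaves 7 active members and the array runs degraded. Losing one and having a second treated as a spare leaves 6, which is below the minimum.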
There are guides online about how to restore from this situation, and I tried them, being very careful, and finally got it back up. I forced an xfs fsck, which was difficult because it kept running out of ram (1GB of ram on the server), but it got a little further each time, and eventually finished. I was left with ~50,000 files in lost+found, and little else. Ugh. So, I started working on the largest directories in there, mv’ing things back to their place. However, I started noticing some oddities, like songs being messed up (static and such) when played on my slimserver.
I started checking, and it turned out that pretty much every file was corrupted. I think I must have rebuilt the raid incorrectly, and the xfs fsck was too incompetent to figure out that all the data was invalid. So, pretty much all of it had to be recreated.
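A filesystem check only looks at metadata, so it can't tell whether the contents of a file are still what they were. One way to catch that kind of damage is to keep a list of checksums taken while the data was known to be good. The sketch below is a rough Python version of that idea, not something that was in place on this server. The function names are invented for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in ram."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    damaged = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel)
    return sorted(damaged)
```

The manifest has to be stored somewhere other than the disk it describes, otherwise it goes down with everything else.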
The worst part was that I had moved mysql to that partition about a month prior, because of cacti being idiotic and trying to put the log file in the db once a night and failing miserably because it was taking like 8 gigs or something. I had forgotten that this meant that it would no longer be backed up (I back up everything except the raid at least once daily).
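This kind of slip is easy to make and easy to check for mechanically. Here is a small Python sketch that asks whether a given path is actually covered by a backup's include and exclude lists. It assumes a simple "excludes always win" rule, and the paths in the usage example (such as /raid/mysql) are hypothetical stand-ins rather than the real layout.

```python
import os


def is_covered(path, included, excluded):
    """Return True if path lies under an included dir and under no excluded dir."""
    path = os.path.abspath(path)

    def inside(parent):
        parent = os.path.abspath(parent)
        # commonpath compares whole path components, so /raidx is not inside /raid.
        return os.path.commonpath([path, parent]) == parent

    if any(inside(e) for e in excluded):
        return False
    return any(inside(i) for i in included)


if __name__ == "__main__":
    # Hypothetical layout: back up everything except the raid.
    print(is_covered("/var/lib/mysql", ["/"], ["/raid"]))
    print(is_covered("/raid/mysql", ["/"], ["/raid"]))
```

Running a check like this against the list of directories that really matter, every time something gets moved, would flag data that has quietly fallen out of the backup set.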
A large part of the raid was stuff that can be recreated (music, ripping all my CDs yet again, what fun), other stuff I don’t really care about (movies I’ve downloaded, although there were quite a few rare things that I’ll have to spend some time searching for again), but there were also digital pictures that existed nowhere else.
What this made me do is think about backups and a better way to make sure that the stuff I really care about is preserved in multiple places. I don’t want this to happen again.
So, I think that what I’m going to do for now is have a nightly backup of the /home and / stuff to a different computer in my house, and regularly burn that to DVD and take it in to work for safe keeping.
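As a rough idea of what the nightly step could look like, here is a Python sketch that packs a set of directories into a dated, compressed archive. It only covers creating the archive. Copying it to the other computer and burning it to DVD are left out. The function name and archive naming scheme are assumptions made for the example.

```python
import tarfile
from datetime import date
from pathlib import Path


def make_archive(sources, dest_dir, day=None):
    """Pack the source directories into dest_dir/backup-YYYY-MM-DD.tar.gz.

    dest_dir must not be inside any of the sources, or the archive
    would try to include itself.
    """
    day = day or date.today()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"backup-{day.isoformat()}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for source in sources:
            # tarfile stores absolute paths with the leading slash stripped.
            tar.add(str(source))
    return archive
```

Something like `make_archive(["/home"], "/backups")` run from cron each night would leave one archive per day, and the dated names make it obvious when a night was skipped.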
I really wish there was an economical way to back up large amounts of data, but nothing has kept up with drive space.
The ext3 performance is really appalling. I’m willing to sacrifice some speed for stability, but geez, it’s almost unusable if you’re doing large writes. There are some people with suggestions for ext3 performance tuning, and I’m going to try them when I get a chance. It’s like crawling in mud right now though.




