3ware card goes up in smoke

It didn’t really let the magic blue smoke out, but it might as well have.

It started with a weird error in syslog:

Nov 1 21:57:19 <user.crit> XXX /kernel: twe0: AEN: <sbuf integrity check failure>

When we contacted 3ware, they suggested that we pull a full log from the web interface (3dmd). We did, and oddly the 3ware log buffer also contained ordinary BSD syslog entries from that machine. Nothing really useful, though.

About 20 minutes after we pulled the log, however, the machine locked up completely. On physical inspection it seemed OK, but the BIOS wasn’t seeing the 3ware card. After we removed the card and reseated it, the card was visible again and the machine booted just fine.

We ordered a replacement card at that point. It came in that afternoon, but since everything seemed to be behaving, we didn’t think it was necessary to take a production machine down.

That was our big mistake.

Several days later it locked up again. This time half the drives were missing from the array, but the card thought the array was degraded rather than dead. We got it to rebuild the array (removing the “incomplete” drives, then adding them back in), but BSD wouldn’t boot: it stopped at the boot manager’s F1 prompt and wouldn’t get any further.

We booted up a rescue CD and started doing some fscks to see if there was anything salvageable, but there were so many inode errors that it looked pretty bad from the outset.

Given that there was about a TB of data on there (8×250 GB drives in RAID 10), this was a lot of data lost.
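For reference, the capacity arithmetic behind that figure can be sketched as below. This is only an illustration: the function name is made up for this post, and it assumes the usual RAID 10 layout in which every drive is mirrored once, so usable space is half the raw total.

```python
def raid10_usable_gb(drives: int, drive_gb: int) -> int:
    """Usable capacity of a RAID 10 array, assuming two-way mirroring."""
    if drives < 4 or drives % 2 != 0:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    # Half of the drives hold mirror copies, so only half the raw space is usable.
    return drives * drive_gb // 2


print(raid10_usable_gb(8, 250))  # usable GB for eight 250 GB drives
```

Eight 250 GB drives give 2000 GB raw, so roughly 1 TB usable, which matches the amount of data at stake here.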

We were able to get back much of the config information, since anything that fit in a 64k slice was pretty intact, but all the large files were gone.
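One plausible reading of why small files survived is sketched below. It assumes the “64k slice” is the array’s stripe unit and that stripe units are dealt round-robin across four mirrored pairs. That is a simplified model for illustration, not 3ware’s documented on-disk layout, and the names are hypothetical.

```python
STRIPE_BYTES = 64 * 1024  # assumed stripe unit (the "64k slice")
MIRROR_PAIRS = 4          # assumed: 8 drives in RAID 10 -> 4 mirrored pairs


def pairs_touched(offset: int, length: int) -> set[int]:
    """Mirror pairs that the byte range [offset, offset + length) lands on."""
    if length <= 0:
        return set()
    first_unit = offset // STRIPE_BYTES
    last_unit = (offset + length - 1) // STRIPE_BYTES
    # Stripe units are assumed to rotate across the pairs in order.
    return {unit % MIRROR_PAIRS for unit in range(first_unit, last_unit + 1)}


print(pairs_touched(0, 4096))         # a small config file
print(pairs_touched(0, 1024 * 1024))  # a contiguous 1 MB file
```

Under this model a small file sits on a single pair, while anything much larger than the stripe unit is spread over every pair, so damage to any part of the array hits it.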

3ware has since replied, saying that it was most likely a failing card. That doesn’t help us much now, though.

It’s very disappointing that the 3ware card failed in this way. I guess we can’t really rely on them to not “do the wrong thing” when there’s a bad situation, so in addition to having fault-tolerant disks, we need a completely redundant server.
