Tuesday, December 29, 2009

It won't happen to me...

Really. Everyone believes it won't happen to them because it hasn't happened yet. Well, it happened. To me. *Twice* in the same week. *sigh*

Both of my Linux servers lost a hard drive, and in both cases it was the root volume. I do keep regular backups, but what you learn when something like this happens is how much you've been forgetting: settings and other one-offs that you never thought to back up.

The way you notice the failure is never how you expect. I had always imagined that I'd be doing something on the system when it failed, or that the system wouldn't boot. In this case, it started with the media frontend not finding some files. Further inspection showed that the SMB share wasn't responding. Then, a ping to the host failed... panic building... Check the console... and I/O errors rolling across the screen. *sigh*

I grabbed two older 250G IDE drives and reinstalled Ubuntu 9.04 on a RAID1 root mirror. Clearly no one installs a RAID1 root mirror because the process is a royal pain in the ass.

You can't simply partition one drive and then point to a second and say "mirror this one here." Rather, you create a partition and set its type to Linux RAID. You must do this for each partition, and then repeat it exactly on the second drive. Only then can you enter the RAID menu, which offers the various RAID levels (0, 1 and 5). Curiously, it asks if you have any spares when creating a RAID1. This threw me off, since I had never thought about having a hot spare for a mirror. You then create the new devices (md0, md1, mdX, etc.), each composed of one partition from each of the two drives in the mirror, making sure that you pair up the matching partitions. Only after that can you return to the more "normal" partitioner setup, namely selecting which device is mounted where, with what filesystem, etc. *whew*.
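For anyone who would rather see the pairing written down than click through menus, here is a minimal Python sketch of it. It only builds and prints the `mdadm --create` command lines for a two-drive mirror; it does not run them. The drive names `/dev/sda` and `/dev/sdb`, the partition count, and the `mirror_commands` helper are assumptions made up for illustration, and the installer does all of this through its own menus rather than through a script like this.

```python
def mirror_commands(drive_a, drive_b, partition_count, spares=0):
    """Build one `mdadm --create` command per mirrored partition pair.

    Partition N of drive_a is always paired with partition N of drive_b,
    which is the matching the installer makes you do by hand.
    """
    commands = []
    for index in range(partition_count):
        part = index + 1  # partitions are numbered from 1, md devices from 0
        commands.append([
            "mdadm", "--create", f"/dev/md{index}",
            "--level=1",                  # RAID1, i.e. a mirror
            "--raid-devices=2",           # two active members
            f"--spare-devices={spares}",  # the "any spares?" question
            f"{drive_a}{part}", f"{drive_b}{part}",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical layout: two partitions (say, root and swap) on each drive.
    for cmd in mirror_commands("/dev/sda", "/dev/sdb", 2):
        print(" ".join(cmd))
```

Printing the commands instead of executing them is deliberate: creating an array over the wrong partitions destroys data, so this is something to read over, not to pipe into a shell.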

Note that once you format these devices Linux makes the RAID *active*. In a mirror scenario this means that every write to the first drive also goes to the second. Practically, you'll be doing *double* the writes during the install. Assuming you have a reasonably new system and the drives are SATA, you probably won't have much trouble; however, I initially did my mirror operation with two IDE drives on the same channel. This was horrendously slow. Normally a server install takes maybe 20 minutes. The install over RAID1 took over an hour, and the second drive still hadn't been completely mirrored when the install finished. After the first boot into Ubuntu 9.04 the system was unusable for about a day as the software RAID continued to replicate the data to the second drive.
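Linux software RAID reports its state in `/proc/mdstat`, so the replication can be watched rather than guessed at. Below is a rough Python sketch that pulls out two things per array: whether a member is missing (an underscore in the `[UU]` field) and how far a rebuild has progressed. The `SAMPLE` text is invented to show the general shape of the file, not copied from my servers, and `parse_mdstat` is a hypothetical helper name; the exact layout may differ between kernel versions.

```python
import re

# Illustrative sample only; on a real system read /proc/mdstat instead.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      244091520 blocks [2/1] [U_]
      [=>...................]  recovery =  8.5% (20812032/244091520) finish=75.3min speed=49400K/sec

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"degraded": bool, "progress": float or None}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            arrays[current] = {"degraded": False, "progress": None}
            continue
        if current is None:
            continue
        # "[UU]" means both members are up; "_" marks a missing member.
        members = re.search(r"\[([U_]+)\]", line)
        if members:
            arrays[current]["degraded"] = "_" in members.group(1)
        # A rebuild shows up as "recovery = N%" or "resync = N%".
        progress = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", line)
        if progress:
            arrays[current]["progress"] = float(progress.group(1))
    return arrays


if __name__ == "__main__":
    for name, state in parse_mdstat(SAMPLE).items():
        print(name, state)
```

On a live box you would pass in the contents of `/proc/mdstat` instead of `SAMPLE`. Run from cron, a check like this would also have flagged a dropped drive long before the media frontend started losing files.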

After having survived the loss of a root drive on one of my two servers I was ready for a break. No such luck. I discovered the second root failure in a similar manner: files not accessible, followed by a failed ping, followed by a console inspection that turned up a screen full of ext3 errors. This time I didn't give up hope. I booted a live CD, fscked, and then ran badblocks -c, hoping that I could mark the bad blocks, relocate the data and move on. I met with little success. After waiting over a day while badblocks ran I finally gave up and ordered another two 250G SATA drives for the server.

Now, this second server is home to the major part of my total storage. I have a 4-port 3Ware RAID card with four 500G drives, plus four 1.5TB drives, all LVM'ed together, and one root drive, for a total of 9 drives. The motherboard has 7 SATA ports: a 5-port Nvidia fakeraid and a 2-port JMicron. The JMicron ports are separate since the chipset provides a SATA to IDE conversion for legacy OSes that don't support booting from SATA. The existing setup occupied 4 of the 5 Nvidia ports and one of the JMicron ports. After replacing the busted root drive with two 250G SATA drives, I now had the same four 1.5TB drives on the Nvidia controller and the 250s on the JMicron. I proceeded to the Ubuntu installer only to find that it would only "see" one of the two root drives. Hrm, that's odd. OK, pull the cable for one of the root drives, and now it "sees" the other drive fine; maybe one of the ports on the JMicron is bad... no, the same drive can be seen in either port. Either drive can be seen in either port... So strange. I tried to use the 5th port of the Nvidia; no dice. That port didn't even seem active. Well, maybe there isn't enough power. I first disconnected the drives on the Nvidia controller. Success!

I installed 9.04 onto a RAID1 mirror and, on my laptop, started shopping for a larger power supply. The next day I logged into the system to find I/O errors all over the place. Sometime during the night the SATA controller had encountered an error and dropped one of the root drives. Frantic Google searching only turned up that the SATA controller might be bad. I was still concerned that I might have a power issue, so I unplugged *all* drives except *one* 250G drive. I plugged that single drive into the Nvidia controller and installed Ubuntu. I waited another day and was greeted with the same SATA failure. It can't be the power supply, since I'm pulling next to nothing here; just the main board and a single SATA drive. Reluctantly I placed an order for a new motherboard which can use the same CPU (AMD Sempron) and RAM (4G of DDR2).

In my haste, I screwed up and got a new board with only 2 DIMM slots instead of four. Even so, after swapping the board I loaded the system up with all 10 drives and installed Ubuntu, and the system is now restored with greater protection than before. Space is cheap; do yourself a favor and use disk mirroring.
