Raid 5 Distress and File Server Woes

Today I thought it was time to do a little preventive maintenance. About twice a year I take the home/office computers outside and use my compressor to blow all the dust out of the fans, power supply and case. It is something I have done for over 20 years and it has served me well. No matter how clean you think your house or office is, any computer that sits close to the floor will collect a huge amount of dust inside the enclosure, even over a 6-month period. It had been about 8 months for one of my file servers, and I was going to be rebooting it anyway because of a failed mouse on a cheap KVM switch. Since it was going down anyway, and the weather was especially nice for this time of year, I thought I would do the maintenance at the same time … so why not? Well, here is why. 🙂

This file server does double duty for a few virtual environments here at the office/house. I found a shell and ran sync to force any outstanding I/O out to the drives. On an idle machine it generally comes back in about 3-5 seconds, worst case. This time it took about 40 seconds. I thought that was strange, but when I repeated the command it came back right away. That should have been my first warning sign that something might be wrong, and I now wish I had run the following command.


dualcore:~:41> cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sdb1[0] sdd1[1]
      4008064 blocks [2/2] [UU]

md1 : active raid5 sdd2[3] sdc2[2] sdb2[1] sda1[0]
      723310080 blocks level 5, 256k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
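A quick way to spot trouble in that output is the status field at the end of each "blocks" line: an underscore in place of a U (for example [_UUU]) marks a missing or failed member. The following is a minimal sketch, not something from the original setup; degraded_arrays is a hypothetical helper name, and it assumes the mdstat layout shown above.

```shell
#!/bin/sh
# Print the name of every md array whose member status (e.g. [_UUU])
# contains an underscore. Reads mdstat-formatted text on stdin.
# "degraded_arrays" is an illustrative name, not a standard tool.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/       { name = $1 }     # remember which array we are in
        /blocks/ && $NF ~ /_/ { print name }  # last field is the [UU_U] status
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

An empty result means every array reported all of its members as up.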

About 15 minutes later, I was back inside with the computer connected to the KVM switch and restarting. The first noise I heard sounded like a bearing going bad on one of the drives or a fan. Experience over the years has taught me to wait a few minutes to see if this is a real problem or a one-off. This computer has 4 disk drives and the server spins them up with a staggered start. Nothing on the display yet, and then I hear it: a drive attempting to spin up and then spinning down. Still nothing on the display, but this cycle repeats over and over. No POST, nothing. My first thought is that I must have bumped something while I had everything apart. I have never had a computer not come up after one of these preventive maintenance routines, and I have been doing this since our Sun 3/80 in 1989. I am at a loss, so I open the case and double-check the power and other connections to see if something came loose. This particular server is heavily tied down and every cable is in place, so there is no easy fix. I am in denial for about 5 minutes. I refuse to believe that it is a disk drive. Nothing will come up and the motherboard won’t even POST. It has to be something that I did during maintenance, but I quickly exhaust the other possibilities. I check the power supply with my multimeter and finally reach the stage where I turn my attention to the drives.

There are four Western Digital WD2500YD 250GB SATA drives (Date: March 2006) installed in the chassis, so all have passed the 5-year mark. In the past, I have seen a failed drive hold the bus and stop the POST, so I start to systematically determine which drive or drives may have failed. The four drives are in a Linux software raid 5 configuration, but at this point I don’t know if it’s one of them or all of them. The SATA cables are marked 0-3. I start at 3 because it is the easiest to get to and most likely the first drive that spins up. My initial guess is that it will be either drive 0 or drive 3. I unplug drive 3 and power up the computer… Same effect and no POST. I plug drive 3 back in and repeat the procedure with drive 0. I power the server on and it comes up but fails with the message:

Filesystem type unknown, partition type 0x82
Kernel ..... bla bla
Error 17: Cannot mount selected partition

I have grub as my boot loader, and the 4 disks carry a raid 1 for /boot and a raid 5 for everything else. I have two swap partitions, so all 4 disks share the same partitioning: each has a partition used for the raid 5 and a smaller partition that is either /boot or swap. I try to boot again, and this time I edit the grub menu entry to change the disk from (hd1,0) to (hd0,0), presumably needed because the remaining disks are numbered from hd0 with drive 0 unplugged, and issue the boot command. It comes up with the 3 disks. When I look at the raid configuration, I see that my md0 (/boot) is intact but my md1 (/) is degraded because of the failed drive. That means I lost a swap partition and a raid 5 disk, which is not too bad given what I was thinking about 10 minutes ago.

Repairing and Replacing the Disk

I check my storage closet for decommissioned disks and find one that has 250GB. It takes me about 10 minutes to figure out how to pull the old drive out and put the rails onto this new drive, and into the case it goes. Now all I need to do is the following.

partition the disk with fdisk and make two partitions
    one partition is for software raid and is the same size as (or larger than) the other raid 5 partitions
    the second partition is swap
set the partition types (typically fd for Linux raid autodetect and 82 for Linux swap) and set the boot flag
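Before adding the new partition to the array, its size can be compared against an existing member. This is only a sketch under assumptions: big_enough is a hypothetical helper, and the device names in the usage comment follow the layout shown earlier in this article. blockdev --getsz reports a device's size in 512-byte sectors.

```shell
#!/bin/sh
# Succeeds when the candidate partition is at least as large as an
# existing array member. Both arguments are sizes in 512-byte sectors,
# as printed by `blockdev --getsz`. "big_enough" is an illustrative name.
big_enough() {
    [ "$1" -ge "$2" ]
}

# Usage on a live system (as root; device names are from this article):
#   new=$(blockdev --getsz /dev/sda1)
#   old=$(blockdev --getsz /dev/sdb2)
#   big_enough "$new" "$old" && echo "ok to add" || echo "partition too small"
```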

Note: The replacement disk is slightly smaller, so I steal the extra space from the swap partition and cannot use the exact same partition table as the other identical drives. The only thing that matters is that the raid 5 partition is the same size or larger. CentOS tells me I have to reboot because the kernel can’t re-read the new partition table, so I play it safe and reboot before I add the disk to the raid array. Next I perform the following commands:

mkswap -L SWAP-sda2 /dev/sda2
Add an entry for this label to /etc/fstab (for example: LABEL=SWAP-sda2 swap swap defaults 0 0) so the swap is enabled on reboot. You can test it first with swapon -a.
mdadm --manage /dev/md1 --add /dev/sda1
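While the array rebuilds, /proc/mdstat gains a progress line containing "recovery = N%" and "finish=...min". The sketch below pulls those two values out; rebuild_progress is a hypothetical helper name, and the exact line layout can vary between kernel versions, so treat it as an illustration only.

```shell
#!/bin/sh
# Summarize the rebuild progress line from mdstat-formatted text on stdin.
# "rebuild_progress" is an illustrative name, not a standard tool.
rebuild_progress() {
    awk '
        /recovery *=|resync *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) pct = $i                      # e.g. 12.6%
                if ($i ~ /^finish=/) {                       # e.g. finish=127.5min
                    eta = $i
                    sub(/^finish=/, "", eta)
                }
            }
            print pct " done, about " eta " remaining"
        }
    '
}

# Usage on a live system:
#   rebuild_progress < /proc/mdstat
```

For a fuller view of the array state, mdadm --detail /dev/md1 reports similar information.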

It starts to rebuild and about 70 minutes later, all is back. My raid 5 is healthy and I have replaced a disk. Now back to the slow sync I mentioned initially. You know what you get when your raid 5 is degraded? Yep, a very slow sync. I saw exactly the same slowness just after I rebooted following the fdisk on the replacement disk. After the raid was repaired, it worked as expected: sync comes back within a few seconds, but on a degraded raid it can take much longer. Just another data point for the next time you issue a ‘sync’ and think something is odd.
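Since a slow sync was the early symptom here, the check can be made repeatable with a small wrapper that times a command and complains when it runs long. This is a rough sketch: timed is a hypothetical helper, and the 10-second threshold in the usage line is an arbitrary choice, not a measured limit.

```shell
#!/bin/sh
# Run a command, report how many whole seconds it took, and return
# non-zero if that exceeded the limit given as the first argument.
# "timed" is an illustrative name, not a standard tool.
timed() {
    limit=$1
    shift
    start=$(date +%s)
    "$@"
    elapsed=$(( $(date +%s) - start ))
    echo "$* took ${elapsed}s"
    [ "$elapsed" -le "$limit" ]
}

# Usage: look at the raid status whenever sync takes longer than 10 seconds.
#   timed 10 sync || cat /proc/mdstat
```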

Next order of business is to get 4 new disks in and migrate this server to larger disks but that will be an article for another blog.

Update:

Murphy strikes again. The server above should have sent out a status email when the raid became degraded. My ISP has had an on-again, off-again affair with blocking port 25 traffic. They blocked it the day before the incident, so the mail letting me know that my raid 5 was in a DegradedArray state was queued up on the server. Argh! As a result I never received that email. I am not sure how that server slipped through my sendmail config, but here is how you can force your mail server to use another port so you don’t have to deal with this type of spam prevention tactic from your ISP. My other server here at the house is configured this way, and I had completely forgotten that I had configured it and solved this once before. That is scary in its own right! Anyway, here is the note straight out of the sendmail docs so this doesn’t happen to you.

The port used for outgoing SMTP connections can be changed via the
respective *_MAILER_ARGS macros for the various SMTP mailers. In a default
configuration, sendmail uses either the relay mailer (for e.g. SMART_HOST
when no mailer is specified) or the esmtp mailer (when sending directly to
the MX of the recipient domain).

So, if you want all outgoing SMTP connections to use port 2525, you can use
this in your .mc file:

	define(`RELAY_MAILER_ARGS', `TCP $h 2525')
	define(`ESMTP_MAILER_ARGS', `TCP $h 2525')

If you want to use an alternate port only for specific destinations, change
(e.g.) only the RELAY_MAILER_ARGS, and make sure the relay mailer is not
used for anything else. E.g. you can have sendmail use port 2525 only when
sending to your domain with this in your .mc file:
	FEATURE(`mailertable')
	define(`confRELAY_MAILER', `esmtp')
	define(`RELAY_MAILER_ARGS', `TCP $h 2525')

and then in your mailertable:
	yourdomain.com		relay:mail.yourdomain.com

This will force sendmail to use port 2525 for connections to yourdomain.com.
Of course, change 2525 to whatever alternate port number you wish to use.
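
To find out whether outbound port 25 is currently blocked, a plain TCP connection test from the server is usually enough. The sketch below assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available; port_open is a hypothetical helper name, and mail.yourdomain.com is the same placeholder host used in the note above.

```shell
#!/bin/sh
# Succeeds if a TCP connection to host:port opens within the timeout.
# $1 = host, $2 = port, $3 = timeout in seconds (default 5).
# Assumes bash and coreutils `timeout`; "port_open" is an illustrative name.
port_open() {
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage: compare the standard port with the alternate one.
#   port_open mail.yourdomain.com 25   && echo "25 open"   || echo "25 blocked"
#   port_open mail.yourdomain.com 2525 && echo "2525 open" || echo "2525 blocked"
```

A failure on port 25 together with a success on the alternate port points at ISP filtering rather than a problem on the mail host itself.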