RAID1 Recovery HowTo
Suitable for: e-smith 4.1.2/Mitel SME5

Author: Darrell May
Contributor:

Problem: You want to easily recover from a RAID1 failure.

Solution: First implement the steps outlined in the RAID1 Monitor HowTo, which saves the partition, RAID and lilo information used below. Then follow these steps:

STEP 1: Back up your computer!

I cannot stress this point strongly enough. Your first priority on a failed RAID1 system should be to perform an immediate backup.

So, DO IT NOW!

[root@myezserver /root]# /sbin/e-smith/backup

STEP 2: Power down, replace the failed drive, power up.

Before we continue, a note for testing purposes only: to completely erase a drive, do the following. This destroys all data on the target drive, so double-check the device name first:

[root@myezserver /root]# dd if=/dev/zero of=/dev/hdb

This will write zeroes across the entire /dev/hdb drive. For all command-line entries in this HowTo, remember to substitute your correct /dev/hdX, where:

/dev/hda = primary master
/dev/hdb = primary slave
/dev/hdc = secondary master
/dev/hdd = secondary slave
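
Because dd gives no second chance, it can help to confirm that the target disk is not an active member of any array before erasing it. The following is a minimal sketch, not part of the original procedure. The function name disk_in_mdstat is hypothetical, and it assumes array members are listed in /proc/mdstat as hdXN[n], which is the form shown later in this HowTo.

```shell
# Hypothetical helper: succeed if any partition of the given disk
# is listed as an array member in an mdstat-style file.
#   $1 = disk name without /dev/ (e.g. hdb)
#   $2 = file to inspect (defaults to /proc/mdstat)
disk_in_mdstat() {
    grep -Eq "(^|[[:space:]])$1[0-9]+\[" "${2:-/proc/mdstat}"
}

# Example use (commented out so nothing is erased by accident):
# if disk_in_mdstat hdb; then
#     echo "hdb is still part of an array, not erasing"
# else
#     dd if=/dev/zero of=/dev/hdb
# fi
```

The check only looks at RAID membership. It does not detect mounted filesystems or swap on the disk.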

STEP 3: Recover the saved partition information and use it to quickly prepare the replacement drive.

[root@myezserver /root]# cat /root/raidmonitor/sfdisk.out
# partition table of /dev/hda
unit: sectors

/dev/hda1 : start=       63, size=  530082, Id=fd, bootable
/dev/hda2 : start=   530145, size=39487770, Id= 5
/dev/hda3 : start=        0, size=       0, Id= 0
/dev/hda4 : start=        0, size=       0, Id= 0
/dev/hda5 : start=   530208, size=   32067, Id=fd
/dev/hda6 : start=   562338, size=39455577, Id=fd
# partition table of /dev/hdb
unit: sectors

/dev/hdb1 : start=       63, size=  530082, Id=fd, bootable
/dev/hdb2 : start=   530145, size=39487770, Id= 5
/dev/hdb3 : start=        0, size=       0, Id= 0
/dev/hdb4 : start=        0, size=       0, Id= 0
/dev/hdb5 : start=   530208, size=   32067, Id=fd
/dev/hdb6 : start=   562338, size=39455577, Id=fd

Copy the "# partition table of /dev/hdX" section for the drive you are replacing. In my case I am replacing /dev/hdb, so this is the section I need to transfer into a file for quick import:

[root@myezserver /root]# pico hdb.out

The file should now contain the following entries:

# partition table of /dev/hdb
unit: sectors

/dev/hdb1 : start=       63, size=  530082, Id=fd, bootable
/dev/hdb2 : start=   530145, size=39487770, Id= 5
/dev/hdb3 : start=        0, size=       0, Id= 0
/dev/hdb4 : start=        0, size=       0, Id= 0
/dev/hdb5 : start=   530208, size=   32067, Id=fd
/dev/hdb6 : start=   562338, size=39455577, Id=fd
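
As an alternative to cutting and pasting by hand, the section for one drive could be pulled out of the combined dump with awk. This is only a sketch. The function name extract_table is hypothetical, and it assumes each section starts with a "# partition table of /dev/hdX" line, as in the listing above.

```shell
# Hypothetical helper: print only the section of a combined sfdisk dump
# that belongs to one device.
#   $1 = device (e.g. /dev/hdb)
#   $2 = combined dump file (e.g. /root/raidmonitor/sfdisk.out)
extract_table() {
    awk -v dev="$1" '
        # A header line switches printing on or off for the lines that follow.
        /^# partition table of / { keep = ($5 == dev) }
        keep { print }
    ' "$2"
}

# Example use:
# extract_table /dev/hdb /root/raidmonitor/sfdisk.out > hdb.out
```

Review the resulting file before importing it, exactly as you would a hand-edited one.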

Next perform the partition table import using the sfdisk command as shown below:

[root@myezserver /root]# sfdisk /dev/hdb < hdb.out
Checking that no-one is using this disk right now ...
OK

Disk /dev/hdb: 2491 cylinders, 255 heads, 63 sectors/track
Old situation:
Empty

New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End #sectors Id System
/dev/hdb1   *        63    530144    530082 fd Linux raid autodetect
/dev/hdb2        530145 40017914 39487770   5 Extended
/dev/hdb3             0         -         0   0 Empty
/dev/hdb4             0         -         0   0 Empty
/dev/hdb5        530208    562274     32067 fd Linux raid autodetect
/dev/hdb6        562338 40017914 39455577 fd Linux raid autodetect
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
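
After the import, it may be worth confirming that the drive now carries the layout you intended. The sketch below compares two sfdisk dump files while ignoring spacing. It assumes that sfdisk -d on your system writes the same dump format as the sfdisk.out file above. The function name same_layout and the /tmp path are hypothetical.

```shell
# Hypothetical helper: succeed if two sfdisk dump files describe the same
# layout, ignoring differences in spaces and tabs.
#   $1 = reference dump (e.g. hdb.out)
#   $2 = dump taken from the drive after the import
same_layout() {
    want=$(tr -d ' \t' < "$1")
    have=$(tr -d ' \t' < "$2")
    [ "$want" = "$have" ]
}

# Example use:
# sfdisk -d /dev/hdb > /tmp/hdb.now
# same_layout hdb.out /tmp/hdb.now && echo "partition layout matches"
```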

STEP 4: Review your last known good RAID configuration:

[root@myezserver /root]# /usr/local/bin/raidmonitor -v

ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda1[0] 264960 blocks [2/1] [U_]
md0 : active raid1 hda5[0] 15936 blocks [2/1] [U_]
md1 : active raid1 hda6[0] 19727680 blocks [2/1] [U_]
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[1] hda1[0] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[1] hda6[0] 19727680 blocks [2/2] [UU]
unused devices: <none>
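
If you want to pick out the degraded arrays without reading the status by eye, the member map at the end of each array entry is enough: an underscore in [U_] marks a missing member. A minimal sketch, assuming the mdstat format shown above. The function name degraded_arrays is hypothetical.

```shell
# Hypothetical helper: print the name of every array whose member map
# contains an underscore, i.e. at least one missing member.
#   $1 = mdstat-style file (defaults to /proc/mdstat)
degraded_arrays() {
    awk '
        # Remember which array the following status belongs to.
        /^md[0-9]+ :/ { name = $1 }
        # A member map such as [U_] or [_U] marks a degraded array.
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}

# Example use:
# degraded_arrays
```

The array name is remembered across lines, so the sketch should also cope with kernels that print the member map on the line after the array name.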

STEP 5: Add the newly partitioned hard drive back into the RAID1 array, using the last known good configuration above as your guide:

[root@myezserver /root]# /sbin/raidhotadd /dev/md2 /dev/hdb1
[root@myezserver /root]# /sbin/raidhotadd /dev/md0 /dev/hdb5
[root@myezserver /root]# /sbin/raidhotadd /dev/md1 /dev/hdb6
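
The mapping from array to partition can also be derived from a saved copy of the last known good configuration rather than read off by hand. The sketch below only prints the raidhotadd commands so you can review them before running anything. The function name hotadd_commands and the file name lastgood.txt are hypothetical, and the input is assumed to be in the mdstat format shown in STEP 4.

```shell
# Hypothetical helper: print one raidhotadd command for each partition of
# the replaced disk found in a last-known-good mdstat-style file.
#   $1 = replaced disk without /dev/ (e.g. hdb)
#   $2 = file holding the last known good configuration
hotadd_commands() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 1; i <= NF; i++) {
                part = $i
                # Strip the trailing member index, e.g. hdb1[1] -> hdb1.
                sub(/\[[0-9]+\]$/, "", part)
                if (part != $i && part ~ ("^" disk "[0-9]+$"))
                    print "/sbin/raidhotadd /dev/" $1 " /dev/" part
            }
        }
    ' "$2"
}

# Example use (prints the commands, does not run them):
# hotadd_commands hdb lastgood.txt
```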

STEP 6: Use raidmonitor to watch the recovery process. Note that this information will also be e-mailed to root every 15 minutes until the recovery is complete.

[root@myezserver /root]# /usr/local/bin/raidmonitor -v

ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[1] hda1[0] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[2] hda6[0] 19727680 blocks [2/1] [U_] recovery=5% finish=10.0min
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[1] hda1[0] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[1] hda5[0] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[1] hda6[0] 19727680 blocks [2/2] [UU]
unused devices: <none>
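
If you would rather wait at the console than poll by hand, a small loop can watch for the recovery marker. This is a sketch under assumptions: it treats the word "recovery" (as in the output above) or "resync" in /proc/mdstat as a sign that a rebuild is still running. The function name rebuild_in_progress is hypothetical.

```shell
# Hypothetical helper: succeed while an mdstat-style file still reports
# a rebuild in progress.
#   $1 = file to inspect (defaults to /proc/mdstat)
rebuild_in_progress() {
    grep -Eq 'recovery|resync' "${1:-/proc/mdstat}"
}

# Example use: check once a minute, then report completion.
# while rebuild_in_progress; do sleep 60; done
# echo "RAID rebuild finished"
```

Do not continue with the remaining steps until every array shows [2/2] [UU] again.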

STEP 7: Restore the last known good master boot record (MBR) onto the drive you just replaced:

[root@myezserver /root]# /sbin/lilo -C /root/raidmonitor/lilo.conf -b /dev/hdb

STEP 8: Shut down the server, reboot, and test the RAID functions

If you have the time, you should test the RAID functionality to make sure the server will boot under simulated hard drive failures.

Start by booting with both drives attached.
Power down, disconnect one of the drives, power up, and check that the server boots.
Power down, reconnect the drive, power up, and rebuild the array by repeating steps 5 and 6 only.
Power down, disconnect the other drive, power up, and check that the server boots.
Power down, reconnect the drive, power up, and rebuild the array by repeating steps 5 and 6 only.

OK, now you can confidently say you're ready for anything. Remember, if anything goes wrong here, you simply reconnect all the hardware, perform a fresh RAID install, and then restore from your backup tape. You did perform STEP 1, correct?

STEP 9: When all looks well, re-initialize raidmonitor:

[root@myezserver /root]# /usr/local/bin/raidmonitor -iv

STEP 10: Go have a drink. Job well done ;->