RAID5 multi disk failure


 
# 1  
Old 01-23-2012
RAID5 multi disk failure

Hi there,

I don't know if my title is accurate, but I'm dealing with a delicate situation I don't fully understand, and I'm afraid of making things worse.

I have a Debian 5.0.4 server with 4 x 1TB hard drives.

I have the following mdstat
Code:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0] sdd1[3] sdb1[1] sdc1[2]
      1024896 blocks [4/4] [UUUU]

md5 : active raid1 sda5[0] sdd5[3] sdb5[1] sdc5[2]
      1023872 blocks [4/4] [UUUU]

md6 : active raid1 sda6[0] sdd6[3] sdb6[1]
      1023872 blocks [4/3] [UU_U]

md7 : active raid1 sda7[0] sdd7[3] sdb7[1] sdc7[2]
      1023872 blocks [4/4] [UUUU]

md8 : active raid1 sdd8[3] sdb8[1] sdc8[2]
      1023872 blocks [4/3] [_UUU]

unused devices: <none>

That's odd, because I used to have a huge md10 array holding an enormous amount of important files.

I have no idea where to start!

I tried to examine the partitions in the multi-disk array:
Code:
root@titan:~# mdadm --examine /dev/sda10
/dev/sda10:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0b972a2e:3aaabcf9:a4d2adc2:26fd5302
  Creation Time : Sat Apr 17 16:30:50 2010
     Raid Level : raid5
  Used Dev Size : 1459502912 (1391.89 GiB 1494.53 GB)
     Array Size : 4378508736 (4175.67 GiB 4483.59 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Sun Jun  5 16:00:41 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ac3fac12 - correct
         Events : 2552115

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8       10        0      active sync   /dev/sda10

   0     0       8       10        0      active sync   /dev/sda10
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       8       42        2      active sync   /dev/sdc10
   3     3       8       58        3      active sync   /dev/sdd10
root@titan:~# mdadm --examine /dev/sdb10
/dev/sdb10:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0b972a2e:3aaabcf9:a4d2adc2:26fd5302
  Creation Time : Sat Apr 17 16:30:50 2010
     Raid Level : raid5
  Used Dev Size : 1459502912 (1391.89 GiB 1494.53 GB)
     Array Size : 4378508736 (4175.67 GiB 4483.59 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Mon Jan 23 12:05:02 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
       Checksum : ade16f37 - correct
         Events : 6224199

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       26        1      active sync   /dev/sdb10

   0     0       0        0        0      removed
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       0        0        2      faulty removed
   3     3       8       58        3      active sync   /dev/sdd10
root@titan:~# mdadm --examine /dev/sdc10
/dev/sdc10:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0b972a2e:3aaabcf9:a4d2adc2:26fd5302
  Creation Time : Sat Apr 17 16:30:50 2010
     Raid Level : raid5
  Used Dev Size : 1459502912 (1391.89 GiB 1494.53 GB)
     Array Size : 4378508736 (4175.67 GiB 4483.59 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Fri Jan 20 23:16:43 2012
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ad7f1c03 - correct
         Events : 6223465

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       42        2      active sync   /dev/sdc10

   0     0       0        0        0      removed
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       8       42        2      active sync   /dev/sdc10
   3     3       8       58        3      active sync   /dev/sdd10
root@titan:~# mdadm --examine /dev/sdd10
/dev/sdd10:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0b972a2e:3aaabcf9:a4d2adc2:26fd5302
  Creation Time : Sat Apr 17 16:30:50 2010
     Raid Level : raid5
  Used Dev Size : 1459502912 (1391.89 GiB 1494.53 GB)
     Array Size : 4378508736 (4175.67 GiB 4483.59 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Mon Jan 23 12:05:02 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
       Checksum : ade16f5b - correct
         Events : 6224199

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       58        3      active sync   /dev/sdd10

   0     0       0        0        0      removed
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       0        0        2      faulty removed
   3     3       8       58        3      active sync   /dev/sdd10

But that doesn't really help...
I have no idea how to interpret the results!
I'm worried by the "faulty" and "removed" states.
Can anyone give me a hint?
Is there any other command I can run to regain access to the data, at least read-only?

Thanks for your help.
Santiago
# 2  
Old 01-23-2012
It's been a while since I worked with md, so I can't help you much there. I would check all the disks and their SMART data for any errors.

RAID is not a substitute for backups.

Are you able to mount the file systems that are using those volumes?
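To expand on the SMART suggestion: a minimal sketch of a health check across the four drives, assuming the smartmontools package is installed (device names taken from the thread; run as root):

```shell
# Quick SMART survey of the four members of the arrays.
# smartctl -H prints the overall health verdict; -A dumps the
# attribute table, from which the reallocated/pending sector
# counters are the usual early-failure indicators.
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo "=== $d ==="
    smartctl -H "$d"
    smartctl -A "$d" | grep -Ei 'reallocated|pending|uncorrect'
done
```

A non-zero Reallocated_Sector_Ct or Current_Pending_Sector on a drive is a strong hint it is the one md kicked out.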
# 3  
Old 01-25-2012
OK, thanks to your advice, I got a little further:
I can tell that two of my four disks have been removed from the array.
Code:
# mdadm --examine /dev/sda10 | grep 'Update Time'
    Update Time : Sun Jun  5 16:00:41 2011
# mdadm --examine /dev/sdb10 | grep 'Update Time'
    Update Time : Mon Jan 23 12:05:02 2012
# mdadm --examine /dev/sdc10 | grep 'Update Time'
    Update Time : Fri Jan 20 23:16:43 2012
# mdadm --examine /dev/sdd10 | grep 'Update Time'
    Update Time : Mon Jan 23 12:05:02 2012

One failed in June 2011; the second one failed five days ago.
I thought that RAID5 would turn read-only as soon as one disk failed.
Does anyone know more?
Please let's not discuss how crazy it was to let my RAID5 run with one disk removed for six months. I didn't know what SMART was before now (believe me, I'm reading the manual).

For more information, here is the status of the array
Code:
# mdadm --examine /dev/sdb10 | tail -6
this     1       8       26        1      active sync   /dev/sdb10

   0     0       0        0        0      removed
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       0        0        2      faulty removed
   3     3       8       58        3      active sync   /dev/sdd10

Is there any chance I can bring the array back with only two of the four disks in sync?

Any help will be appreciated.
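One way to see which members md considers freshest is to compare the Events counters printed by --examine: md only trusts members whose counters (nearly) agree. A small sketch, using the superblock fields already posted above:

```python
import re

# "Update Time" / "Events" fields copied verbatim from the
# mdadm --examine output earlier in the thread.
examine = {
    "/dev/sda10": "Update Time : Sun Jun  5 16:00:41 2011\nEvents : 2552115",
    "/dev/sdb10": "Update Time : Mon Jan 23 12:05:02 2012\nEvents : 6224199",
    "/dev/sdc10": "Update Time : Fri Jan 20 23:16:43 2012\nEvents : 6223465",
    "/dev/sdd10": "Update Time : Mon Jan 23 12:05:02 2012\nEvents : 6224199",
}

def events(text):
    """Extract the md event counter from an `mdadm --examine` dump."""
    return int(re.search(r"Events\s*:\s*(\d+)", text).group(1))

# Rank members newest-first; the highest event counts are the most
# up to date, and badly stale members should be left out of any
# forced reassembly.
ranked = sorted(examine, key=lambda d: events(examine[d]), reverse=True)
print(ranked)
# -> ['/dev/sdb10', '/dev/sdd10', '/dev/sdc10', '/dev/sda10']
```

Here sdb10 and sdd10 agree exactly, sdc10 is only slightly behind (it dropped out last), and sda10 is months stale.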
# 4  
Old 01-27-2012
Hi there, me again,

I think my problem lies elsewhere.
I know no disk is completely dead, because several other RAID arrays using the same four disks are fine:
Code:
root@titan:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0] sdd1[3] sdb1[1] sdc1[2]
      1024896 blocks [4/4] [UUUU]

md5 : active raid1 sda5[0] sdd5[3] sdb5[1] sdc5[2]
      1023872 blocks [4/4] [UUUU]

md6 : active raid1 sdc6[2] sda6[0] sdd6[3] sdb6[1]
      1023872 blocks [4/4] [UUUU]

md7 : active raid1 sda7[0] sdd7[3] sdb7[1] sdc7[2]
      1023872 blocks [4/4] [UUUU]

md8 : active raid1 sda8[0] sdd8[3] sdb8[1] sdc8[2]
      1023872 blocks [4/4] [UUUU]

unused devices: <none>

So I thought I should just check the disks.
Problem: fsck doesn't work:
Code:
root@titan:~# fsck.ext3 /dev/sdc10
e2fsck 1.41.3 (12-Oct-2008)
fsck.ext3: Superblock invalid, trying backup blocks...
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdc10

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

root@titan:~# fsck.ext3 -b 8193 /dev/sdc10
e2fsck 1.41.3 (12-Oct-2008)
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdc10

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

How can I repair the filesystem on /dev/sdc10?

Thanks for your help
Santiago
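A note for anyone hitting the same symptom: fsck fails on /dev/sdc10 because a RAID5 member holds interleaved stripes, not a mountable filesystem, so e2fsck will never find a superblock there. The usual last-resort sketch, assuming (per the Events counters above) that sdb10, sdc10 and sdd10 are the freshest members and sda10 is long stale, is a forced reassembly followed by a read-only mount. Warning: --force rewrites superblocks, so image the disks with dd first if at all possible.

```shell
# Last-resort sketch only: verify the Events counters before trying this,
# and work on dd images of the disks if you can.
mdadm --stop /dev/md10                      # clear any partial assembly
mdadm --assemble --force --run /dev/md10 \
      /dev/sdb10 /dev/sdc10 /dev/sdd10     # leave the stale sda10 out
mkdir -p /mnt/recovery
mount -o ro /dev/md10 /mnt/recovery        # mount read-only and copy data off
```

Only after the data is safely copied should fsck be run, and then against /dev/md10, never against an individual member.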