MDADM Failure - where it came from?


# 1  
Old 11-23-2014

Hello,
I have a system with 6 SATA3 Seagate st3000dm01 disks running stable Debian with mdadm software RAID. md0 is for root, md1 for swap, and md2 for the files. I now want to add one more disk (partition sdh4) to md2, but I got these errors:
Quote:
md: recovery of RAID array md2
Nov 23 22:28:06 client02 kernel: [ 529.148663] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Nov 23 22:28:06 client02 kernel: [ 529.148669] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Nov 23 22:28:06 client02 kernel: [ 529.148687] md: using 128k window, over a total of 2918318592k.
Nov 23 22:28:06 client02 mdadm[2723]: RebuildStarted event detected on md device /dev/md/2
Nov 23 22:30:25 client02 snmpd[3012]: Connection from UDP: [192.168.0.104]:51053->[192.168.0.21]
Nov 23 22:30:25 client02 snmpd[3012]: Connection from UDP: [192.168.0.104]:53071->[192.168.0.21]
Nov 23 22:30:25 client02 snmpd[3012]: Connection from UDP: [192.168.0.104]:48954->[192.168.0.21]
Nov 23 22:30:47 client02 kernel: [ 689.824086] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 23 22:30:47 client02 kernel: [ 689.824147] ata7.00: failed command: SMART
Nov 23 22:30:47 client02 kernel: [ 689.824180] ata7.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 21 pio 512 in
Nov 23 22:30:47 client02 kernel: [ 689.824181] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 23 22:30:47 client02 kernel: [ 689.824279] ata7.00: status: { DRDY }
Nov 23 22:30:47 client02 kernel: [ 689.824308] ata7: hard resetting link
Nov 23 22:30:52 client02 kernel: [ 695.192171] ata7: link is slow to respond, please be patient (ready=0)

Message from syslogd@client02 at Nov 23 22:32:09 ...
kernel:[ 772.437567] Oops: 0000 [#1] SMP

Message from syslogd@client02 at Nov 23 22:32:09 ...
kernel:[ 772.440997] Stack:

Message from syslogd@client02 at Nov 23 22:32:09 ...
kernel:[ 772.440997] Call Trace:

Message from syslogd@client02 at Nov 23 22:32:09 ...
kernel:[ 772.440997] Code: 01 74 14 48 8b 85 b0 00 00 00 f6 c4 04 74 08 f0 80 8d b0 00 00 00 08 48 8b 85 b0 00 00 00 f6 c4 80 74 36 f0 80 a5 b0 00 00 00 f7 <49> 8b 84 24 b0 00 00 00 a8 02 75 1a 49 8d bc 24 e8 00 00 00 c7

Message from syslogd@client02 at Nov 23 22:32:09 ...
kernel:[ 772.440997] CR2: 00000000000000b0
Nov 23 22:30:57 client02 kernel: [ 699.840171] ata7: COMRESET failed (errno=-16)
Nov 23 22:30:57 client02 kernel: [ 699.840287] ata7: hard resetting link
Nov 23 22:31:02 client02 kernel: [ 705.200168] ata7: link is slow to respond, please be patient (ready=0)
Nov 23 22:31:07 client02 kernel: [ 709.848171] ata7: COMRESET failed (errno=-16)
Nov 23 22:31:07 client02 kernel: [ 709.848286] ata7: hard resetting link
Nov 23 22:31:12 client02 kernel: [ 715.208173] ata7: link is slow to respond, please be patient (ready=0)
Nov 23 22:31:18 client02 kernel: [ 720.808150] ata9.00: exception Emask 0x0 SAct 0x60000000 SErr 0x0 action 0x6 frozen
Nov 23 22:31:18 client02 kernel: [ 720.808334] ata9.00: failed command: WRITE FPDMA QUEUED
Nov 23 22:31:18 client02 kernel: [ 720.808457] ata9.00: cmd 61/00:e8:d8:73:6c/04:00:02:00:00/40 tag 29 ncq 524288 out
Nov 23 22:31:18 client02 kernel: [ 720.808461] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 23 22:31:18 client02 kernel: [ 720.808772] ata9.00: status: { DRDY }
Nov 23 22:31:18 client02 kernel: [ 720.808854] ata9.00: failed command: WRITE FPDMA QUEUED
Nov 23 22:31:18 client02 kernel: [ 720.808975] ata9.00: cmd 61/00:f0:d8:77:6c/04:00:02:00:00/40 tag 30 ncq 524288 out
Nov 23 22:31:18 client02 kernel: [ 720.808978] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 23 22:31:18 client02 kernel: [ 720.809288] ata9.00: status: { DRDY }
Nov 23 22:31:18 client02 kernel: [ 720.809373] ata9: hard resetting link
Nov 23 22:31:18 client02 kernel: [ 721.632193] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 23 22:31:23 client02 kernel: [ 726.632180] ata9.00: qc timeout (cmd 0xec)
Nov 23 22:31:24 client02 kernel: [ 727.136125] ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 23 22:31:24 client02 kernel: [ 727.136135] ata9.00: revalidation failed (errno=-5)
Nov 23 22:31:24 client02 kernel: [ 727.136261] ata9: hard resetting link
Nov 23 22:31:25 client02 kernel: [ 727.960194] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 23 22:31:35 client02 kernel: [ 737.960182] ata9.00: qc timeout (cmd 0xec)
Nov 23 22:31:35 client02 kernel: [ 738.464177] ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 23 22:31:35 client02 kernel: [ 738.464187] ata9.00: revalidation failed (errno=-5)
Nov 23 22:31:35 client02 kernel: [ 738.464310] ata9: limiting SATA link speed to 3.0 Gbps
Nov 23 22:31:35 client02 kernel: [ 738.464320] ata9: hard resetting link
Nov 23 22:31:36 client02 kernel: [ 739.288167] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Nov 23 22:31:42 client02 kernel: [ 744.888104] ata7: COMRESET failed (errno=-16)
Nov 23 22:31:42 client02 kernel: [ 744.888222] ata7: limiting SATA link speed to 3.0 Gbps
Nov 23 22:31:42 client02 kernel: [ 744.888228] ata7: hard resetting link
Nov 23 22:31:47 client02 kernel: [ 749.912172] ata7: COMRESET failed (errno=-16)
Nov 23 22:31:47 client02 kernel: [ 749.912288] ata7: reset failed, giving up
Nov 23 22:31:47 client02 kernel: [ 749.912379] ata7.00: disabled
Nov 23 22:31:47 client02 kernel: [ 749.912422] ata7: EH complete
Nov 23 22:32:06 client02 kernel: [ 769.288181] ata9.00: qc timeout (cmd 0xec)
Nov 23 22:32:07 client02 kernel: [ 769.792178] ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 23 22:32:07 client02 kernel: [ 769.792188] ata9.00: revalidation failed (errno=-5)
Nov 23 22:32:07 client02 kernel: [ 769.792307] ata9.00: disabled
Nov 23 22:32:07 client02 kernel: [ 769.792333] ata9.00: device reported invalid CHS sector 0
Nov 23 22:32:07 client02 kernel: [ 769.792347] ata9.00: device reported invalid CHS sector 0
Nov 23 22:32:07 client02 kernel: [ 770.296175] ata9: hard resetting link
Nov 23 22:32:08 client02 kernel: [ 771.120191] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Nov 23 22:32:08 client02 kernel: [ 771.624189] ata9: EH complete
Nov 23 22:32:08 client02 kernel: [ 771.624269] sd 8:0:0:0: [sdh] Unhandled error code
Nov 23 22:32:08 client02 kernel: [ 771.624276] sd 8:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 23 22:32:08 client02 kernel: [ 771.624285] sd 8:0:0:0: [sdh] CDB: Write(10): 2a 00 02 6c 77 d8 00 04 00 00
Nov 23 22:32:08 client02 kernel: [ 771.624304] end_request: I/O error, dev sdh, sector 40663000
Nov 23 22:32:08 client02 kernel: [ 771.624430] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Nov 23 22:32:08 client02 kernel: [ 771.624477] md/raid:md2: Disk failure on sdh4, disabling device.
Nov 23 22:32:08 client02 kernel: [ 771.624481] md/raid:md2: Operation continuing on 6 devices.
Nov 23 22:32:08 client02 kernel: [ 771.624605] sd 8:0:0:0: [sdh] Unhandled error code
Nov 23 22:32:08 client02 kernel: [ 771.624610] sd 8:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 23 22:32:08 client02 kernel: [ 771.624616] sd 8:0:0:0: [sdh] CDB: Write(10): 2a 00 02 6c 73 d8 00 04 00 00
Nov 23 22:32:08 client02 kernel: [ 771.624628] end_request: I/O error, dev sdh, sector 40661976
Nov 23 22:32:08 client02 kernel: [ 771.624815] end_request: I/O error, dev sdh, sector 194568
Nov 23 22:32:08 client02 kernel: [ 771.624821] md: super_written gets error=-5, uptodate=0
Nov 23 22:32:08 client02 kernel: [ 771.624831] md/raid1:md0: Disk failure on sdh2, disabling device.
Nov 23 22:32:08 client02 kernel: [ 771.624834] md/raid1:md0: Operation continuing on 6 devices.
Nov 23 22:32:08 client02 kernel: [ 771.625361] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Nov 23 22:32:09 client02 kernel: [ 772.437151] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
Nov 23 22:32:09 client02 kernel: [ 772.437349] IP: [<ffffffffa0115e23>] handle_stripe+0x2e2/0x1b6a [raid456]
Nov 23 22:32:09 client02 kernel: [ 772.437512] PGD 0
Nov 23 22:32:09 client02 kernel: [ 772.437567] Oops: 0000 [#1] SMP
Nov 23 22:32:09 client02 kernel: [ 772.437653] CPU 1
Nov 23 22:32:09 client02 kernel: [ 772.437699] Modules linked in: bnep rfcomm bluetooth rfkill uinput nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc xfs loop snd_pcm snd_page_alloc snd_timer radeon snd soundcore ttm pcspkr k10temp drm_kms_helper powernow_k8 mperf serio_raw drm joydev i2c_piix4 evdev power_supply i2c_algo_bit i2c_core button processor thermal_sys ext4 crc16 jbd2 mbcache raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 md_mod usbhid hid sg sd_mod crc_t10dif ohci_hcd xhci_hcd ehci_hcd ahci libahci r8169 mii libata usbcore scsi_mod usb_common [last unloaded: scsi_wait_scan]
Nov 23 22:32:09 client02 kernel: [ 772.439252]
Nov 23 22:32:09 client02 kernel: [ 772.439300] Pid: 318, comm: md2_raid6 Not tainted 3.2.0-4-amd64 #1 Debian 3.2.63-2+deb7u1 Gigabyte Technology Co., Ltd. GA-A75M-S2V/GA-A75M-S2V
Nov 23 22:32:09 client02 kernel: [ 772.439654] RIP: 0010:[<ffffffffa0115e23>] [<ffffffffa0115e23>] handle_stripe+0x2e2/0x1b6a [raid456]
Nov 23 22:32:09 client02 kernel: [ 772.439910] RSP: 0018:ffff880210aefc90 EFLAGS: 00010002
Nov 23 22:32:09 client02 kernel: [ 772.440051] RAX: 0000000000008001 RBX: ffff880210efb2c0 RCX: ffff88021310e200
Nov 23 22:32:09 client02 kernel: [ 772.440237] RDX: 0000000000000202 RSI: ffff88021310e3a8 RDI: ffff88021310e3a8
Nov 23 22:32:09 client02 kernel: [ 772.440423] RBP: ffff880210efb778 R08: 0000000000000000 R09: 0000000000013780
Nov 23 22:32:09 client02 kernel: [ 772.440609] R10: 0000000000013780 R11: ffffffffa00e13b0 R12: 0000000000000000
Nov 23 22:32:09 client02 kernel: [ 772.440795] R13: ffff88021310e200 R14: ffff880210efb300 R15: 0000000000000008
Nov 23 22:32:09 client02 kernel: [ 772.440983] FS: 00007f824b5a47c0(0000) GS:ffff88021ec40000(0000) knlGS:0000000000000000
Nov 23 22:32:09 client02 kernel: [ 772.440997] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 23 22:32:09 client02 kernel: [ 772.440997] CR2: 00000000000000b0 CR3: 0000000001605000 CR4: 00000000000006e0
Nov 23 22:32:09 client02 kernel: [ 772.440997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 23 22:32:09 client02 kernel: [ 772.440997] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 23 22:32:09 client02 kernel: [ 772.440997] Process md2_raid6 (pid: 318, threadinfo ffff880210aee000, task ffff880212ff6880)
Nov 23 22:32:09 client02 kernel: [ 772.440997] Stack:
Nov 23 22:32:09 client02 kernel: [ 772.440997] ffff880200000000 0000000000000000 ffff880200000000 0000000012bb5000
Nov 23 22:32:09 client02 kernel: [ 772.440997] ffffffff00000006 ffff88021310e200 ffff880200000010 ffffffff81013971
Nov 23 22:32:09 client02 kernel: [ 772.440997] ffffffff81013de3 ffffffff81013df0 ffffffff81063f64 0000000000000001
Nov 23 22:32:09 client02 kernel: [ 772.440997] Call Trace:
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff81013971>] ? paravirt_read_tsc+0x5/0x8
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff81013de3>] ? native_sched_clock+0x27/0x2f
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff81013df0>] ? sched_clock+0x5/0x8
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff81063f64>] ? sched_clock_local+0xd/0x6f
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff8134ffee>] ? __mutex_unlock_slowpath+0x29/0x2f
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffffa0117ad4>] ? raid5d+0x429/0x483 [raid456]
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff81071295>] ? arch_local_irq_save+0x11/0x17
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff81071295>] ? arch_local_irq_save+0x11/0x17
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffffa00d4256>] ? md_thread+0x114/0x132 [md_mod]
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff8105fddb>] ? add_wait_queue+0x3c/0x3c
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffffa00d4142>] ? md_rdev_init+0xea/0xea [md_mod]
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff8105f789>] ? kthread+0x76/0x7e
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff81357bf4>] ? kernel_thread_helper+0x4/0x10
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff8105f713>] ? kthread_worker_fn+0x139/0x139
Nov 23 22:32:09 client02 kernel: [ 772.440997] [<ffffffff81357bf0>] ? gs_change+0x13/0x13
Nov 23 22:32:09 client02 kernel: [ 772.440997] Code: 01 74 14 48 8b 85 b0 00 00 00 f6 c4 04 74 08 f0 80 8d b0 00 00 00 08 48 8b 85 b0 00 00 00 f6 c4 80 74 36 f0 80 a5 b0 00 00 00 f7 <49> 8b 84 24 b0 00 00 00 a8 02 75 1a 49 8d bc 24 e8 00 00 00 c7
Nov 23 22:32:09 client02 kernel: [ 772.440997] RIP [<ffffffffa0115e23>] handle_stripe+0x2e2/0x1b6a [raid456]
Nov 23 22:32:09 client02 kernel: [ 772.440997] RSP <ffff880210aefc90>
Nov 23 22:32:09 client02 kernel: [ 772.440997] CR2: 00000000000000b0
Nov 23 22:32:09 client02 kernel: [ 772.440997] ---[ end trace d3bf072b78030bf5 ]---
Nov 23 22:32:10 client02 kernel: [ 772.689988] RAID1 conf printout:
Nov 23 22:32:10 client02 kernel: [ 772.696202] --- wd:6 rd:8
Nov 23 22:32:10 client02 kernel: [ 772.702296] disk 0, wo:0, o:1, dev:sda2
Nov 23 22:32:10 client02 kernel: [ 772.708449] disk 1, wo:0, o:1, dev:sdb2
Nov 23 22:32:10 client02 kernel: [ 772.714518] disk 2, wo:1, o:0, dev:sdh2
Nov 23 22:32:10 client02 kernel: [ 772.720534] disk 3, wo:0, o:1, dev:sdd2
Nov 23 22:32:10 client02 kernel: [ 772.726435] disk 4, wo:0, o:1, dev:sdc2
Nov 23 22:32:10 client02 kernel: [ 772.732217] disk 5, wo:0, o:1, dev:sdf2
Nov 23 22:32:10 client02 kernel: [ 772.737907] disk 6, wo:0, o:1, dev:sde2
Nov 23 22:32:10 client02 kernel: [ 772.744116] RAID1 conf printout:
Nov 23 22:32:10 client02 kernel: [ 772.749927] --- wd:6 rd:8
Nov 23 22:32:10 client02 kernel: [ 772.755374] disk 0, wo:0, o:1, dev:sda2
Nov 23 22:32:10 client02 kernel: [ 772.760595] disk 1, wo:0, o:1, dev:sdb2
Nov 23 22:32:10 client02 kernel: [ 772.765533] disk 3, wo:0, o:1, dev:sdd2
Nov 23 22:32:10 client02 kernel: [ 772.770193] disk 4, wo:0, o:1, dev:sdc2
Nov 23 22:32:10 client02 kernel: [ 772.774767] disk 5, wo:0, o:1, dev:sdf2
Nov 23 22:32:10 client02 kernel: [ 772.779135] disk 6, wo:0, o:1, dev:sde2
Nov 23 22:32:10 client02 mdadm[2723]: FailSpare event detected on md device /dev/md/2, component device /dev/sdh4
Nov 23 22:32:11 client02 mdadm[2723]: Fail event detected on md device /dev/md/0
Nov 23 22:32:11 client02 mdadm[2723]: FailSpare event detected on md device /dev/md/0, component device /dev/sdh2
The new disk is connected to a 4-port SATA controller that sits in the x4 slot on the mainboard. At first I thought it was the cable, so I changed the cables for all disks, but the error occurs again. The disk is new, fresh from the factory. Any ideas where the problem could be?
# 2  
Old 11-23-2014
Looks like sdh2 failed, and you got a NULL pointer dereference immediately afterwards. So much for robustness in the event of a disk failure...

Check all your disks.
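
One way to read the oops in post #1: on the "Code:" line, the byte marked <49> is where RIP pointed when the fault happened. The sketch below decodes only that one instruction by hand (it is not a general x86 decoder) and compares the result with the R12 and CR2 values from the register dump. All variable names are illustrative.

```python
# Bytes starting at the faulting RIP, copied from the "Code:" line
# (the <49> marker shows where RIP pointed).
insn = bytes.fromhex("49 8b 84 24 b0 00 00 00")

rex, opcode, modrm, sib = insn[0], insn[1], insn[2], insn[3]
disp32 = int.from_bytes(insn[4:8], "little")   # 32-bit displacement

rex_w = (rex >> 3) & 1            # 1 -> 64-bit operand size
rex_b = rex & 1                   # extends the SIB base register number
mod = modrm >> 6                  # 0b10 -> a 32-bit displacement follows
reg = (modrm >> 3) & 7            # destination register, 0 -> rax
rm = modrm & 7                    # 0b100 -> a SIB byte follows
index = (sib >> 3) & 7            # 0b100 -> no index register
base = (sib & 7) | (rex_b << 3)   # base register number

# Only this exact encoding is handled: MOV r64, [base + disp32].
assert opcode == 0x8B and rex_w == 1
assert mod == 0b10 and rm == 0b100 and index == 0b100

# Values copied from the register dump in the log.
r12 = 0x0000000000000000
cr2 = 0x00000000000000B0          # faulting address reported by the CPU

fault_address = r12 + disp32
print(f"mov r{reg}, [r{base} + {disp32:#x}] -> address {fault_address:#x}")
print("matches CR2:", fault_address == cr2)
```

If this reading is right, the instruction is mov rax, [r12+0xb0] with R12 = 0, so handle_stripe followed a NULL pointer and tried to read a field at offset 0xb0. That agrees with the "NULL pointer dereference at 00000000000000b0" message; which structure the pointer belonged to cannot be told from the log alone.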
# 3  
Old 11-23-2014
Does this mean the disk has failed? It's a new one. How could I check this? With the Debian tool badblocks?

Edit: as I understand the timing, ata7 first reported an error on a SMART command (Nov 23 22:30:47) and only later did the new disk fail, right? Could it be a problem with smartd?

Last edited by Sunghost; 11-24-2014 at 07:18 AM..
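
The timing question can be checked mechanically. The sketch below takes a handful of lines copied verbatim from the log in post #1 and prints the first message seen on each ATA port, ordered by kernel uptime. The regular expression and the function name first_events are illustrative, and the assumption that ata9 is the new disk sdh rests only on the ata9 errors being followed directly by the sdh I/O errors in the log.

```python
import re

# A few lines copied verbatim from the log in post #1.
LOG = """\
Nov 23 22:28:06 client02 mdadm[2723]: RebuildStarted event detected on md device /dev/md/2
Nov 23 22:30:47 client02 kernel: [ 689.824086] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 23 22:30:47 client02 kernel: [ 689.824147] ata7.00: failed command: SMART
Nov 23 22:31:18 client02 kernel: [ 720.808150] ata9.00: exception Emask 0x0 SAct 0x60000000 SErr 0x0 action 0x6 frozen
Nov 23 22:31:18 client02 kernel: [ 720.808334] ata9.00: failed command: WRITE FPDMA QUEUED
Nov 23 22:31:47 client02 kernel: [ 749.912379] ata7.00: disabled
Nov 23 22:32:07 client02 kernel: [ 769.792307] ata9.00: disabled
Nov 23 22:32:08 client02 kernel: [ 771.624477] md/raid:md2: Disk failure on sdh4, disabling device.
Nov 23 22:32:09 client02 kernel: [ 772.437151] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
"""

# Matches kernel lines of the form "[ uptime] ataN[.MM]: message".
LINE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (ata\d+)(?:\.\d+)?: (.*)")


def first_events(log):
    """Return {port: (uptime_seconds, message)} for each port's first message."""
    first = {}
    for line in log.splitlines():
        m = LINE.search(line)
        if m:
            uptime, port, msg = float(m.group(1)), m.group(2), m.group(3)
            first.setdefault(port, (uptime, msg))  # keep only the earliest
    return first


events = first_events(LOG)
ordered = sorted(events.items(), key=lambda kv: kv[1][0])
for port, (uptime, msg) in ordered:
    print(f"{uptime:10.3f}s  {port}: {msg}")
```

On these lines ata7 shows up first, about 31 seconds before ata9, so two different ports ran into timeouts during the rebuild, and the first one was not the port that carried the write errors for sdh.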
# 4  
Old 11-25-2014
Hey guys,
I searched further on the web and found reports of possible problems with the Marvell 88SE9230 chip, which is on the add-on SATA controller. Could this be the problem?
# 5  
Old 11-25-2014
This is a kernel panic. It's not supposed to happen. You're probably not going to get a concrete answer unless it happens more than once in predictable circumstances.
# 6  
Old 11-25-2014
Hi,
I get this every time I try to add a disk to the array. Meanwhile I think it's a problem with the Marvell chipset on the SATA controller, and perhaps SMART. Together they seem to crash the disks when they are under heavy load.