13 disk raidz2 pool lost Post: 302707289

Operating Systems Solaris 13 disk raidz2 pool lost Post 302707289 by tatxo on Friday 28th of September 2012 10:05:05 AM

09-28-2012


Hi guys, I would appreciate any help with this; we have lost sensitive company data.

We have one box with 2 mirrored disks and a 3ware controller handling 13 disks in a raidz2 pool. The box suddenly restarted and then sat at "Reading ZFS config" for hours.

By unplugging the disks one at a time we isolated the disk that was preventing the system from restarting, and we ran 'zpool clear -F' as suggested by the 'zpool status' command. After hours of processing we got a console error from the controller and the system hung, so we decided to swap that disk, which took the pool from DEGRADED to FAULTED. After one 'zpool clear' the pool went back to DEGRADED, but with no access to the data, so we tried to roll back by putting the previous disks in again (we never ran 'zpool replace').

The box kept restarting, freezing and failing to boot, so we decided to plug the original 13 disks into another box with the same hardware.

Now we are trying to import the pool there. After hours of processing and heavy disk activity, the box hangs and the import does not succeed. This is the output of the 'zpool import' command:

Code:

state: DEGRADED
status: The pool was last accessed by another system.
action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        zsan08rz2     DEGRADED
          raidz2-0    DEGRADED
            c10t2d0   FAULTED  corrupted data
            c10t2d0   ONLINE
            c10t5d0   ONLINE
            c10t9d0   ONLINE
            c10t0d0   ONLINE
            c10t1d0   ONLINE
            c10t4d0   ONLINE
            c10t8d0   ONLINE
            c10t12d0  ONLINE
            c10t11d0  ONLINE
            c10t3d0   ONLINE
            c10t7d0   ONLINE
            c10t6d0   ONLINE
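
The config listing above can be checked mechanically. What follows is only a minimal sketch in Python: the listing is pasted in as a string, and the helper names (`parse_config`, `summarize`) are made up for illustration. It counts the leaf devices, flags any device name that appears more than once, and compares the number of devices that are not ONLINE with the parity level taken from the vdev name (a raidzN vdev tolerates N failed devices). It only reports what the listing says; it cannot tell which physical disk sits behind a duplicated name.

```python
import re
from collections import Counter

# Config section copied from the 'zpool import' output in the post.
CONFIG = """\
        zsan08rz2     DEGRADED
          raidz2-0    DEGRADED
            c10t2d0   FAULTED  corrupted data
            c10t2d0   ONLINE
            c10t5d0   ONLINE
            c10t9d0   ONLINE
            c10t0d0   ONLINE
            c10t1d0   ONLINE
            c10t4d0   ONLINE
            c10t8d0   ONLINE
            c10t12d0  ONLINE
            c10t11d0  ONLINE
            c10t3d0   ONLINE
            c10t7d0   ONLINE
            c10t6d0   ONLINE
"""


def parse_config(text):
    """Return (name, state, indent) for every non-empty line of the listing."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        fields = line.split()
        rows.append((fields[0], fields[1], indent))
    return rows


def summarize(text):
    rows = parse_config(text)
    # Assumption: leaf devices are the most deeply indented entries.
    deepest = max(indent for _, _, indent in rows)
    leaves = [(name, state) for name, state, indent in rows if indent == deepest]
    counts = Counter(name for name, _ in leaves)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    not_online = [(name, state) for name, state in leaves if state != "ONLINE"]
    # Parity level read from a vdev name such as "raidz2-0".
    parity = None
    for name, _, _ in rows:
        match = re.match(r"raidz(\d)-\d+$", name)
        if match:
            parity = int(match.group(1))
    return {
        "leaves": len(leaves),
        "duplicates": duplicates,
        "not_online": not_online,
        "parity": parity,
        "within_parity": parity is not None and len(not_online) <= parity,
    }


summary = summarize(CONFIG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A repeated name means the listing shows 13 entries but only 12 distinct device names, so the "within parity" result should be read with that in mind.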

Any ideas? Note that c10t2d0 is listed twice, and that during the last import attempt we got this error from the controller on the console:

Code:

zsan08 tw: WARNING: tw0: tw_aen_task AEN 0x000a Drive error detected unit=7 port=13

This drive seems to be a different one from c10t2d0.
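
To keep track of which unit and port the controller complains about across several import attempts, the console warning can be split into fields. This is a rough sketch in Python; the pattern is inferred only from the single line quoted above, so other tw driver messages may not match it, and `parse_aen` is a made-up helper name. It does not establish how unit/port numbers map to c10tNd0 device names.

```python
import re

# Pattern inferred from the one console line in the post (an assumption,
# not a documented format of the tw driver).
AEN_LINE = re.compile(
    r"(?P<host>\S+) tw: WARNING: (?P<ctrl>tw\d+): tw_aen_task AEN "
    r"(?P<code>0x[0-9a-fA-F]+) (?P<text>.+?) "
    r"unit=(?P<unit>\d+) port=(?P<port>\d+)"
)


def parse_aen(line):
    """Return the fields of a tw AEN warning as a dict, or None if no match."""
    match = AEN_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"], 16)
    fields["unit"] = int(fields["unit"])
    fields["port"] = int(fields["port"])
    return fields


event = parse_aen(
    "zsan08 tw: WARNING: tw0: tw_aen_task AEN 0x000a "
    "Drive error detected unit=7 port=13"
)
print(event)
```

Collecting these events from the console log and counting them per port would show whether the errors always point at the same drive.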

Suggestions? Thanks!

tatxo




TWE(4)							   BSD Kernel Interfaces Manual 						    TWE(4)

NAME

     twe -- 3ware 5000/6000/7000/8000 series PATA/SATA RAID adapter driver

SYNOPSIS

     To compile this driver into the kernel, place the following lines in your kernel configuration file:

	   device pci
	   device twe

     Alternatively, to load the driver as a module at boot time, place the following line in loader.conf(5):

	   twe_load="YES"

DESCRIPTION

     The twe driver provides support for AMCC's 3ware 5000/6000/7000/8000 series PATA/SATA RAID adapters.  These adapters were formerly known as
     "3ware Escalade".

     These devices support 2, 4, 8, or 12 ATA disk drives and provide RAID0 (striping) and RAID1 (mirroring) functionality.

HARDWARE

     The twe driver supports the following PATA/SATA RAID controllers:

     o	 AMCC's 3ware 5000 series
     o	 AMCC's 3ware 6000 series
     o	 AMCC's 3ware 7000-2
     o	 AMCC's 3ware 7006-2
     o	 AMCC's 3ware 7500-4LP
     o	 AMCC's 3ware 7500-8
     o	 AMCC's 3ware 7500-12
     o	 AMCC's 3ware 7506-4LP
     o	 AMCC's 3ware 7506-8
     o	 AMCC's 3ware 7506-12
     o	 AMCC's 3ware 8006-2LP
     o	 AMCC's 3ware 8500-4LP
     o	 AMCC's 3ware 8500-8
     o	 AMCC's 3ware 8500-12
     o	 AMCC's 3ware 8506-4LP
     o	 AMCC's 3ware 8506-8
     o	 AMCC's 3ware 8506-8MI
     o	 AMCC's 3ware 8506-12
     o	 AMCC's 3ware 8506-12MI

DIAGNOSTICS

   Controller initialisation phase
     twe%d: microcontroller not ready

     The controller's onboard CPU is not reporting that it is ready; this may be due to either a board or system failure.  Initialisation has
     failed.

     twe%d: no attention interrupt
     twe%d: can't drain AEN queue
     twe%d: reset not reported
     twe%d: controller errors detected
     twe%d: can't drain response queue
     twe%d: reset %d failed, trying again

     The controller is not responding correctly to the driver's attempts to reset and initialise it.  This process is retried several times.

     twe%d: can't initialise controller, giving up

     Several attempts to reset and initialise the controller have failed; initialisation has failed and the driver will not attach to this
     controller.

   Driver initialisation/shutdown phase
     twe%d: register window not available
     twe%d: can't allocate register window
     twe%d: can't allocate parent DMA tag
     twe%d: can't allocate interrupt
     twe%d: can't set up interrupt
     twe%d: can't establish configuration hook

     A resource allocation error occurred while initialising the driver; initialisation has failed and the driver will not attach to this
     controller.

     twe%d: can't detect attached units

     Fetching the list of attached units failed; initialisation has failed.

     twe%d: error fetching capacity for unit %d
     twe%d: error fetching state for unit %d
     twe%d: error fetching descriptor size for unit %d
     twe%d: error fetching descriptor for unit %d
     twe%d: device_add_child failed
     twe%d: bus_generic_attach returned %d

     Creation of the disk devices failed, either due to communication problems with the adapter or due to resource shortage; attachment of one or
     more units may have been aborted.

   Operational phase
     twe%d: command completed - %s

     A command was reported completed with a warning by the controller.  The warning may be one of:

     redundant/inconsequential request ignored
     failed to write zeroes to LBA 0
     failed to profile TwinStor zones

     twe%d: command failed - %s

     A command was reported as failed by the controller.  The failure message may be one of:

     aborted due to system command or reconfiguration
     aborted
     access error
     access violation
     device failure
     controller error
     timed out
     invalid unit number
     unit not available
     undefined opcode
     request incompatible with unit
     invalid request
     firmware error, reset requested

     The command will be returned to the operating system after a fatal error.

     twe%d: command failed submission - controller wedged

     A command could not be delivered to the controller because the controller is unresponsive.

     twe%d: AEN: <%s>

     The controller has reported a change in status using an AEN (Asynchronous Event Notification).  The following AENs may be reported:

     queue empty
     soft reset
     degraded mirror
     controller error
     rebuild fail
     rebuild done
     incomplete unit
     initialisation done
     unclean shutdown detected
     drive timeout
     drive error
     rebuild started
     aen queue full

     AENs are also queued internally for use by management tools.
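
     As an illustration only, console lines of the form twe%d: AEN: <%s> can be filtered and compared against the list above.  The sketch below is
     in Python; the helper name extract_aen and the sample lines are hypothetical, and it assumes the text between the angle brackets is exactly
     one of the listed strings, which a real driver may not guarantee.

```python
import re

# AEN names as listed in this manual page.
KNOWN_AENS = {
    "queue empty", "soft reset", "degraded mirror", "controller error",
    "rebuild fail", "rebuild done", "incomplete unit", "initialisation done",
    "unclean shutdown detected", "drive timeout", "drive error",
    "rebuild started", "aen queue full",
}

# Mirrors the documented format string "twe%d: AEN: <%s>".
AEN_RE = re.compile(r"^(twe\d+): AEN: <(.*)>$")


def extract_aen(line):
    """Return (controller, text, is_listed) for an AEN line, else None."""
    match = AEN_RE.match(line.strip())
    if match is None:
        return None
    controller, text = match.groups()
    return controller, text, text in KNOWN_AENS


# Hypothetical sample lines, not taken from a real log.
samples = [
    "twe0: AEN: <drive error>",
    "twe1: AEN: <something else>",
    "twe0: command interrupt",
]
results = [extract_aen(line) for line in samples]
print(results)
```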

     twe%d: error polling for signalled AENs

     The controller has reported that one or more status messages are ready for the driver, but attempting to fetch one of these has returned an
     error.

     twe%d: AEN queue overflow, lost AEN <%s>

     A status message was retrieved from the controller, but there is no more room to queue it in the driver.  The message is lost (but will be
     printed to the console).

     twe%d: missing expected status bits %s
     twe%d: unexpected status bits %s

     A check of the controller's status bits indicates an unexpected condition.

     twe%d: host interrupt

     The controller has signalled a host interrupt.  This serves an unknown purpose and is ignored.

     twe%d: command interrupt

     The controller has signalled a command interrupt.  This is not used, and will be disabled.

     twe%d: controller reset in progress...

     The controller is being reset by the driver.  Typically this is done when the driver has determined that the controller is in an
     unrecoverable state.

     twe%d: can't reset controller, giving up

     The driver has given up on resetting the controller.  No further I/O will be handled.

     controller reset done, %d commands restarted

     The controller was successfully reset, and outstanding commands were restarted.

AUTHORS

     The twe driver and manual page were written by Michael Smith <msmith@FreeBSD.org>.

     Extensive work done on the driver by Vinod Kashyap <vkashyap@FreeBSD.org> and Paul Saab <ps@FreeBSD.org>.

BUGS

     The controller cannot handle I/O transfers that are not aligned to a 512-byte boundary.  In order to support raw device access from
     user-space, the driver will perform alignment fixup on non-aligned data.  This process is inefficient, and thus in order to obtain best
     performance user-space applications accessing the device should do so with aligned buffers.
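
     One way a user-space program might obtain an aligned buffer is sketched below in Python.  It relies on the assumption that an anonymous
     mmap starts on a page boundary and that the page size is a multiple of 512, which holds on common platforms but is not stated in this
     manual page.  The helper names aligned_buffer and address_of are made up for illustration.

```python
import ctypes
import mmap

SECTOR = 512  # alignment boundary named in the BUGS section


def aligned_buffer(size):
    """Return an anonymous mmap whose length is rounded up to whole sectors."""
    rounded = -(-size // SECTOR) * SECTOR  # ceiling to a multiple of SECTOR
    return mmap.mmap(-1, rounded)


def address_of(buf):
    """Return the memory address where the buffer starts."""
    view = ctypes.c_char.from_buffer(buf)
    try:
        return ctypes.addressof(view)
    finally:
        del view  # drop the exported view so the mmap can be closed later


buf = aligned_buffer(1000)
print(len(buf), address_of(buf) % SECTOR)
```

     Such a buffer could then be passed to readinto() on a file object opened unbuffered on the raw device, so that both the buffer address and
     the transfer length are multiples of 512 bytes.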

BSD
								  August 15, 2004							       BSD
