Why didn't she panic? (Sol 10 + SVM + HDS)

11-13-2008


Hi folks,

The following incident occurred today:

By mistake, one of our renowned administrators deleted the complete zoning for a 25K domain running Solaris 10.

Thus the system lost all of its external disks.

We've got Oracle datafiles and Oracle software residing on those lost disks.

The system logged read and write errors to /var/adm/messages. But it did not panic, because the write errors were qualified as "retryable".

The external disks were mounted as metadevices in metasets.

Does SVM keep the system from panicking?

Background information:

uname -a:
SunOS <servername> 5.10 Generic_127111-11 sun4u sparc SUNW,Sun-Fire-15000

cat /etc/release:
Solaris 10 8/07 s10s_u4wos_12b SPARC
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 16 August 2007

/var/adm/messages:

Nov 13 10:28:47 sv0703 md_stripe: [ID 641072 kern.warning] WARNING: md: 0703m1/d101: write error on /dev/dsk/c6t60060E801526C300000126C300003265d0s0
Nov 13 10:28:47 sv0703 md_stripe: [ID 641072 kern.warning] WARNING: md: 0703m1/d80: write error on /dev/dsk/c6t60060E801526C300000126C300002265d0s0
Nov 13 10:28:47 sv0703 md_sp: [ID 641072 kern.warning] WARNING: md: 0703m1/d101: write error on /dev/md/0703m1/dsk/d90
Nov 13 10:56:09 sv0703 Error for Command: write(10) Error Level: Retryable
...and so on...
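
To see at a glance which metadevices were hit and how the SCSI layer classified the failures, the messages can be tallied with a small script. The following is only a sketch: the regular expressions are assumptions derived from the three line shapes shown above, not a complete grammar for /var/adm/messages, and the names `summarize`, `MD_ERROR`, `SCSI_LEVEL` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shapes, taken only from the excerpt in this post.
MD_ERROR = re.compile(
    r"WARNING: md: (?P<dev>\S+): (?P<op>read|write) error on (?P<target>\S+)"
)
SCSI_LEVEL = re.compile(r"Error Level: (?P<level>\w+)")


def summarize(lines):
    """Count md read/write errors per metadevice and SCSI error levels."""
    md_errors = Counter()
    levels = Counter()
    for line in lines:
        m = MD_ERROR.search(line)
        if m:
            md_errors[(m.group("dev"), m.group("op"))] += 1
        m = SCSI_LEVEL.search(line)
        if m:
            levels[m.group("level")] += 1
    return md_errors, levels


# The lines quoted above, used as sample input.
SAMPLE = [
    "Nov 13 10:28:47 sv0703 md_stripe: [ID 641072 kern.warning] WARNING: md: "
    "0703m1/d101: write error on /dev/dsk/c6t60060E801526C300000126C300003265d0s0",
    "Nov 13 10:28:47 sv0703 md_stripe: [ID 641072 kern.warning] WARNING: md: "
    "0703m1/d80: write error on /dev/dsk/c6t60060E801526C300000126C300002265d0s0",
    "Nov 13 10:28:47 sv0703 md_sp: [ID 641072 kern.warning] WARNING: md: "
    "0703m1/d101: write error on /dev/md/0703m1/dsk/d90",
    "Nov 13 10:56:09 sv0703 Error for Command: write(10) Error Level: Retryable",
]

if __name__ == "__main__":
    md_errors, levels = summarize(SAMPLE)
    for (dev, op), count in sorted(md_errors.items()):
        print(f"{dev}: {count} {op} error(s)")
    for level, count in sorted(levels.items()):
        print(f"SCSI error level {level}: {count}")
```

To run it against the real log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/adm/messages"))`.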

df -h /u02:
/dev/md/<metasetname>/dsk/d100 103G 85G 17G 83% /u02

metaset -s <metasetname>
Set name = <metasetname>, Set number = 4
Host Owner
<hostname> Yes (auto)
Drive Dbase
/dev/dsk/c6t60060E801526C300000126C3000022A2d0 Yes
/dev/dsk/c6t60060E801526C300000126C3000032A2d0 Yes
/dev/dsk/c6t50060E80000000000000F8FE000000A2d0 Yes
/dev/dsk/c6t50060E80000000000000F8FE000004A2d0 Yes

For each FS we've got:

submirror -> mirror -> soft partition on metaset

Thanks & Regards

Mika

MikaBaghinen


11-13-2008


In this case I would suspect a bug in your SAN firmware or the Leadville driver stack. This is something to try on another machine with the same software setup. If the error can be reproduced, you can try adding newer patches or newer SAN firmware... maybe this fixes your "problem".

greets,
DN2

DukeNuke2


11-13-2008


Isn't that normal? The panic option you can set with mount (onerror=) specifies the action to take when the filesystem is inconsistent, not when the box can't reach the filesystem. I tried that a few times and it did nothing; I also saw retries in the log. There is a new option in Sun Cluster 3.2 called "reboot_on_path_failure" to prevent such issues:

# clnode set -p reboot_on_path_failure=enabled <node1> <node2>

rgds
- pressy

pressy


11-13-2008


Quote:

Originally Posted by pressy

Isn't that normal? The panic option you can set with mount (onerror=) specifies the action to take when the filesystem is inconsistent, not when the box can't reach the filesystem. I tried that a few times and it did nothing; I also saw retries in the log. There is a new option in Sun Cluster 3.2 called "reboot_on_path_failure" to prevent such issues:

# clnode set -p reboot_on_path_failure=enabled <node1> <node2>

rgds
- pressy

My understanding is that a system should panic whenever cached data cannot be written to a disk device.

This is why I think the "retryable" qualifier in /var/adm/messages is the culprit.

Anyhow, thanks for all the replies!

Regards

Mika

MikaBaghinen


11-13-2008


Quote:

Originally Posted by MikaBaghinen

My understanding is that a system should panic whenever cached data cannot be written to a disk device.

Absolutely not. A system should panic when it is so confused that attempting to write cached data may cause further damage. Imagine a system with external disks and you bump your knee into a power button, turning off the disk. All you need to do is power the drive back on. And yes, that really happened to me and I was grateful that the HP-UX system did not panic.

Perderabo


11-13-2008


Perderabo - you are either really tall or that power button is super low!!!

pupp


11-13-2008


Quote:

Originally Posted by pupp

Perderabo - you are either really tall or that power button is super low!!!

I was sitting in a chair using the top of the drive cabinet as a desk. There were two drives in it, each 571 MB, with a separate power button for each drive. I mentioned this pair of drives before: https://www.unix.com/unix-dummies-que...ixed-size.html

Perderabo
