Oracle/Xen/OVM/OCFS2/Multipath/SAN Problems

Tags
multipath, oel, oracle, ovm, solved, xen

 
# 1  
Old 07-18-2012
[SOLVED] Oracle/Xen/OVM/OCFS2/Multipath/SAN Problems

Setup:
  1. Multiple virtual machines running OEL5 / Oracle RDBMS 11.2
  2. OVM 2 / Xen 3.4.0 cluster consisting of 3 machines
  3. Shared SAN storage synced with OCFS2
  4. SAN connected via 4 Gb FC with 4 paths per LUN
  5. SAN targets are 2 EMC Clariions mirroring each other
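
For anyone poking at a similar stack: each layer can be inspected with its standard tool. A rough sketch (device names and output will obviously differ):
Code:
# Xen guests running on this node
xm list
# multipath topology and per-path states
multipath -ll
# OCFS2 devices, and which cluster nodes have them mounted
mounted.ocfs2 -d
mounted.ocfs2 -f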

The problems we're facing: the OCFS2 file system and the file systems inside the VMs are reporting I/O errors, cluster nodes are dying for no apparent reason, and the file systems are getting corrupted time and again.

dmesg fills the screen with these error messages:
Code:
end_request: I/O error, dev sdae, sector 6144
device-mapper: multipath: Failing path 65:224.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
device-mapper: multipath emc: emc_pg_init: sending switch-over command
sd 6:0:1:2: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

end_request: I/O error, dev sdaf, sector 6144
device-mapper: multipath: Failing path 65:240.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
device-mapper: multipath emc: emc_pg_init: sending switch-over command
device-mapper: multipath emc: emc_pg_init: sending switch-over command
sd 6:0:1:1: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

end_request: I/O error, dev sdae, sector 6144
device-mapper: multipath: Failing path 65:224.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
sd 6:0:1:2: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

end_request: I/O error, dev sdaf, sector 6144
device-mapper: multipath: Failing path 65:240.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
device-mapper: multipath emc: emc_pg_init: sending switch-over command
sd 6:0:1:1: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

/etc/multipath.conf (same on all nodes):
Code:
defaults {
          user_friendly_names yes
          default_features "fail_if_no_path"
         }
blacklist {
           devnode "^sda[[0-9]*]"
           devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
           devnode "^hd[a-z][[0-9]*]"
           devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
          }

devices {
          device {
                vendor "DGC"
                product ".*"
                product_blacklist "LUNZ"
                #path_grouping_policy "group_by_prio"
                path_grouping_policy "failover"
                path_checker "emc_clariion"
                features "0"
                hardware_handler "1 emc"
                prio "emc"
                failback "immediate"
               # no_path_retry "60"
                }
        }
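
For reference, the configuration multipathd actually runs with (built-in device defaults merged with this file) can be dumped from its console; the exact syntax may vary between device-mapper-multipath versions:
Code:
# show the merged runtime configuration
multipathd -k"show config"
# verbose dry run of map assembly, without touching the devmaps
multipath -v3 -d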

multipath -ll:
Code:
mpath1 (360060160ce601d0084016ace2aa3e011) dm-4 DGC,RAID 5
[size=401G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:1  sdae 65:224 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:1  sdc  8:32   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:1  sdi  8:128  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:0:1  sdx  65:112 [active][ready]
mpath0 (360060160ce601d006415374d27a3e011) dm-3 DGC,RAID 5
[size=100M][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:1:0  sdad 65:208 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:0  sdb  8:16   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:0  sdh  8:112  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:0:0  sdt  65:48  [active][ready]
mpath9 (36006016054701d00f6fd290acbc9e111) dm-8 DGC,RAID 5
[size=780G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:5  sdal 66:80  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:5  sdaq 66:160 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:5  sdp  8:240  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:5  sdv  65:80  [active][ready]
mpath8 (36006016054701d00fe9762c9ccc9e111) dm-9 DGC,RAID 5
[size=156G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:6  sdam 66:96  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:6  sdar 66:176 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:6  sdq  65:0   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:6  sdw  65:96  [active][ready]
mpath7 (36006016054701d008a720d61c9c9e111) dm-7 DGC,RAID 5
[size=490G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:4  sdak 66:64  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:4  sdap 66:144 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:4  sdo  8:224  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:4  sdu  65:64  [active][ready]
mpath6 (360060160ce601d00d4d98a68cac9e111) dm-1 DGC,RAID 5
[size=200G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:0:8  sdab 65:176 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:8  sdah 66:16  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:8  sdf  8:80   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:8  sdl  8:176  [active][ready]
mpath5 (360060160ce601d00d65e288bc9c9e111) dm-0 DGC,RAID 5
[size=370G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:0:7  sdaa 65:160 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:7  sdag 66:0   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:7  sde  8:64   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:7  sdk  8:160  [active][ready]
mpath11 (36006016054701d005214f471d6c9e111) dm-10 DGC,RAID 5
[size=100G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:9  sdan 66:112 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:9  sdas 66:192 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:9  sdr  65:16  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:9  sdz  65:144 [active][ready]
mpath4 (360060160ce601d00f2d025a2c2c9e111) dm-5 DGC,RAID 5
[size=470G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:2  sdaf 65:240 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:2  sdd  8:48   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:2  sdj  8:144  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:0:2  sdy  65:128 [active][ready]
mpath10 (360060160ce601d009a1792b828a3e011) dm-2 DGC,RAID 5
[size=444G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:0:10 sdac 65:192 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:10 sdai 66:32  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:10 sdg  8:96   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:10 sdm  8:192  [active][ready]
mpath3 (36006016054701d001c9376bd542be111) dm-6 DGC,RAID 5
[size=100M][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:0  sdaj 66:48  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:0  sdao 66:128 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:0  sdn  8:208  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:0  sds  65:32  [active][ready]

OCFS2 cluster.conf (auto-generated by OVM):
Code:
node:
	ip_port     = 7777
	ip_address  = x.y.z.1
	number      = 0
	name        = node1
	cluster     = ocfs2

node:
	ip_port     = 7777
	ip_address  = x.y.z.2
	number      = 1
	name        = node2
	cluster     = ocfs2

node:
	ip_port     = 7777
	ip_address  = x.y.z.3
	number      = 2
	name        = node3
	cluster     = ocfs2

cluster:
	node_count  = 3
	name        = ocfs2
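
The nodes dying "without apparent reason" would at least be consistent with o2cb self-fencing: if a node can't write its disk heartbeat within the configured threshold (say, because all paths to the LUN just failed), it panics itself to protect the file system. On OEL5 the stack status and the thresholds can be checked like this (paths are from a stock ocfs2-tools install and may differ):
Code:
# o2cb cluster stack status and registered heartbeats
service o2cb status
# heartbeat/fencing thresholds (O2CB_HEARTBEAT_THRESHOLD etc.)
cat /etc/sysconfig/o2cb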

Any help is appreciated; we've basically been trying to get everything running again for the last 24 hours, and we're out of ideas.

No, we don't have a support contract for OVM with Oracle.

If you need any more information, just ask.
# 2  
Old 07-18-2012
Some questions:
  1. Was it ever running correctly? If so, for how long?
  2. Have there been any recent changes to the SAN, OVM, or OEL configuration (assuming it was running correctly before)?
# 3  
Old 07-18-2012
Hi pludi,
Check the alert logs of the failing database instances. Could you also post the content of /var/log/messages (dmesg's output may not be enough) from around the time the errors occurred?
# 4  
Old 07-18-2012
@bartus
Ad 1) Yes, for the past 2 weeks.
Ad 2) We removed an empty OVM repository and the associated LUN.

@radoulov
I'll try to get the contents of the messages file from around the incident(s); it's currently around 650M, so I won't post it all.
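To carve the relevant window out of a file that size, something like this should do (the timestamps are placeholders):
Code:
# extract everything logged between two example timestamps
sed -n '/^Jul 18 03:0/,/^Jul 18 05:0/p' /var/log/messages > messages-incident.txt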
EDIT: Added the contents of syslog from the three cluster nodes.

# 5  
Old 07-18-2012
I'm not able to find anything on MOS (My Oracle Support). I found various threads related to OCFS2/SAN problems on the OTN forums, but they don't seem to match your situation.
Let me know if the database alert log(s) contain something interesting.
# 6  
Old 07-23-2012
It's fixed. Basically, a consultant told us we were lucky it even lasted that long, as both the dm-multipath configuration and the EMC storage configuration were off. I don't know exactly what they did on the SAN side, but after ALUA was enabled on the Clariions, everything runs as smoothly as can be. No extra settings were required on the machines: dm-multipath automagically configures itself when it detects an EMC array with ALUA enabled.
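For reference, the auto-detected setup with ALUA roughly corresponds to a device stanza like the one below. This is a sketch of typical ALUA settings for a Clariion, not a dump of our running config, and keyword names vary between device-mapper-multipath versions:
Code:
# typical ALUA device stanza for a Clariion (sketch)
device {
        vendor                "DGC"
        product               ".*"
        product_blacklist     "LUNZ"
        path_grouping_policy  "group_by_prio"
        path_checker          "emc_clariion"
        hardware_handler      "1 alua"
        prio                  "alua"
        failback              "immediate"
}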
# 7  
Old 07-23-2012
Thanks for sharing the solution!

