[SOLVED] Oracle/Xen/OVM/OCFS2/Multipath/SAN Problems


 
# 1  
Old 07-18-2012

Setup:
  1. Multiple virtual machines running OEL 5 / Oracle RDBMS 11.2
  2. An OVM 2 / Xen 3.4.0 cluster consisting of 3 machines
  3. Shared SAN storage synced with OCFS2
  4. SAN connected via 4 Gb FC with 4 paths per LUN
  5. SAN targets are two EMC CLARiiON arrays mirroring each other

The problems we're facing: the OCFS2 file system and the file systems inside the VMs are reporting I/O errors, cluster nodes are dying for no apparent reason, and the file systems are getting corrupted time and again.

dmesg fills the screen with these error messages:
Code:
end_request: I/O error, dev sdae, sector 6144
device-mapper: multipath: Failing path 65:224.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
device-mapper: multipath emc: emc_pg_init: sending switch-over command
sd 6:0:1:2: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

end_request: I/O error, dev sdaf, sector 6144
device-mapper: multipath: Failing path 65:240.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
device-mapper: multipath emc: emc_pg_init: sending switch-over command
device-mapper: multipath emc: emc_pg_init: sending switch-over command
sd 6:0:1:1: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

end_request: I/O error, dev sdae, sector 6144
device-mapper: multipath: Failing path 65:224.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
sd 6:0:1:2: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required

end_request: I/O error, dev sdaf, sector 6144
device-mapper: multipath: Failing path 65:240.
device-mapper: multipath emc: emc_pg_init: sending switch-over command
device-mapper: multipath emc: emc_pg_init: sending switch-over command
sd 6:0:1:1: Device not ready: <6>: Current: sense key: Not Ready
    Add. Sense: Logical unit not ready, manual intervention required
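For anyone following the log: the numbers in the "Failing path 65:224" lines are the failing block device's major:minor pair, so they can be tied back to a SCSI device and its multipath map. A quick sketch (device names taken from the logs here):
Code:
# the major:minor pair appears in the device node itself
ls -l /dev/sdae        # brw-r----- 1 root disk 65, 224 ... /dev/sdae
# and in the multipath topology, to find the map the path belongs to
multipath -ll | grep -B 5 "65:224"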

/etc/multipath.conf (same on all nodes):
Code:
defaults {
          user_friendly_names yes
          default_features "fail_if_no_path"
         }
blacklist {
           devnode "^sda[[0-9]*]"
           devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
           devnode "^hd[a-z][[0-9]*]"
           devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
          }

devices {
          device {
                vendor "DGC"
                product ".*"
                product_blacklist "LUNZ"
                #path_grouping_policy "group_by_prio"
                path_grouping_policy "failover"
                path_checker "emc_clariion"
                features "0"
                hardware_handler "1 emc"
                prio "emc"
                failback "immediate"
               # no_path_retry "60"
                }
        }
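One detail that may matter here: with default_features "fail_if_no_path" and no_path_retry commented out, dm-multipath hands I/O errors straight up to OCFS2 the moment all paths to a LUN are failed, instead of queueing. A queueing variant would look like the sketch below; whether that is wise on a cluster is a judgment call, since indefinite queueing can stall the OCFS2 disk heartbeat just as badly:
Code:
defaults {
          user_friendly_names yes
          # queue I/O while all paths are down instead of failing it upward
          default_features "1 queue_if_no_path"
         }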

multipath -ll:
Code:
mpath1 (360060160ce601d0084016ace2aa3e011) dm-4 DGC,RAID 5
[size=401G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:1  sdae 65:224 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:1  sdc  8:32   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:1  sdi  8:128  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:0:1  sdx  65:112 [active][ready]
mpath0 (360060160ce601d006415374d27a3e011) dm-3 DGC,RAID 5
[size=100M][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:1:0  sdad 65:208 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:0  sdb  8:16   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:0  sdh  8:112  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:0:0  sdt  65:48  [active][ready]
mpath9 (36006016054701d00f6fd290acbc9e111) dm-8 DGC,RAID 5
[size=780G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:5  sdal 66:80  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:5  sdaq 66:160 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:5  sdp  8:240  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:5  sdv  65:80  [active][ready]
mpath8 (36006016054701d00fe9762c9ccc9e111) dm-9 DGC,RAID 5
[size=156G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:6  sdam 66:96  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:6  sdar 66:176 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:6  sdq  65:0   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:6  sdw  65:96  [active][ready]
mpath7 (36006016054701d008a720d61c9c9e111) dm-7 DGC,RAID 5
[size=490G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:4  sdak 66:64  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:4  sdap 66:144 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:4  sdo  8:224  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:4  sdu  65:64  [active][ready]
mpath6 (360060160ce601d00d4d98a68cac9e111) dm-1 DGC,RAID 5
[size=200G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:0:8  sdab 65:176 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:8  sdah 66:16  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:8  sdf  8:80   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:8  sdl  8:176  [active][ready]
mpath5 (360060160ce601d00d65e288bc9c9e111) dm-0 DGC,RAID 5
[size=370G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:0:7  sdaa 65:160 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:7  sdag 66:0   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:7  sde  8:64   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:7  sdk  8:160  [active][ready]
mpath11 (36006016054701d005214f471d6c9e111) dm-10 DGC,RAID 5
[size=100G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:9  sdan 66:112 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:9  sdas 66:192 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:9  sdr  65:16  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:9  sdz  65:144 [active][ready]
mpath4 (360060160ce601d00f2d025a2c2c9e111) dm-5 DGC,RAID 5
[size=470G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:2  sdaf 65:240 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:2  sdd  8:48   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:2  sdj  8:144  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:0:2  sdy  65:128 [active][ready]
mpath10 (360060160ce601d009a1792b828a3e011) dm-2 DGC,RAID 5
[size=444G][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:0:10 sdac 65:192 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:1:10 sdai 66:32  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:0:10 sdg  8:96   [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:1:10 sdm  8:192  [active][ready]
mpath3 (36006016054701d001c9376bd542be111) dm-6 DGC,RAID 5
[size=100M][features=0][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=1][active]
 \_ 6:0:2:0  sdaj 66:48  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 6:0:3:0  sdao 66:128 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:2:0  sdn  8:208  [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 5:0:3:0  sds  65:32  [active][ready]
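Reading the topology above: because of path_grouping_policy "failover", every path sits alone in its own priority group, so only one path ever carries I/O and each path failure forces a LUN trespass on the CLARiiON (the "emc_pg_init: sending switch-over command" lines in dmesg). The commented-out alternative in the config would instead group paths by storage-processor priority; a sketch of just that stanza:
Code:
device {
        vendor "DGC"
        product ".*"
        # paths to the owning SP share one group; failover only crosses SPs
        path_grouping_policy "group_by_prio"
        prio "emc"
       }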

OCFS2 cluster.conf (auto-generated by OVM):
Code:
node:
	ip_port     = 7777
	ip_address  = x.y.z.1
	number      = 0
	name        = node1
	cluster     = ocfs2

node:
	ip_port     = 7777
	ip_address  = x.y.z.2
	number      = 1
	name        = node2
	cluster     = ocfs2

node:
	ip_port     = 7777
	ip_address  = x.y.z.3
	number      = 2
	name        = node3
	cluster     = ocfs2

cluster:
	node_count  = 3
	name        = ocfs2
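One more data point on the nodes "dying without apparent reason": OCFS2's o2cb stack deliberately self-fences a node when its disk heartbeat writes fail, so the I/O errors above are enough to take a node down by design. The relevant knobs live in /etc/sysconfig/o2cb; the values below are the usual defaults, shown for illustration, not our settings:
Code:
# 2-second disk heartbeat iterations a node may miss before self-fencing
O2CB_HEARTBEAT_THRESHOLD=31
# network idle timeout between cluster nodes, in milliseconds
O2CB_IDLE_TIMEOUT_MS=30000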

Any help is appreciated; we've basically been trying to get everything running again for the last 24 hours, and we're out of ideas.

No, we don't have a support contract for OVM with Oracle.

If you need any more information just ask.
# 2  
Old 07-18-2012
Some questions:
  1. Was it ever running correctly? If so, for how long?
  2. Have there been any recent changes to the SAN, OVM, or OEL configuration (assuming it was running correctly before)?
# 3  
Old 07-18-2012
Hi pludi,
Check the alert logs of the failing database instances. Could you also post the relevant parts of /var/log/messages (dmesg output alone may not be enough) from the time the errors occurred?
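For 11.2 the alert log lives under the ADR; a quick way to locate it (assuming the default ADR layout):
Code:
# ask the instance where its trace directory is
sqlplus -s / as sysdba <<'EOF'
select value from v$diag_info where name = 'Diag Trace';
EOF
# the alert log is alert_<SID>.log in that directory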
# 4  
Old 07-18-2012
@bartus
Ad 1) Yes, it ran correctly for the past 2 weeks
Ad 2) We removed an empty OVM repository and the associated LUN (see the note below)
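Side note, with hindsight: removing a LUN without first flushing its multipath map and deleting the underlying SCSI paths leaves stale devices behind, which would fit the failing-path messages above. A cleanup sketch; the map and device names are placeholders:
Code:
# flush the multipath map of the LUN being removed
multipath -f mpathX
# then drop each SCSI path device that belonged to it
echo 1 > /sys/block/sdX/device/delete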

@radoulov
I'll try to get the contents of the messages file from around the incident(s); it's currently around 650 MB, so I won't post it all.
EDIT: Added the contents of syslog from the three cluster nodes.

Last edited by pludi; 07-18-2012 at 07:01 AM.
# 5  
Old 07-18-2012
I'm not able to find anything on MOS. I found various threads about OCFS2/SAN problems on the OTN forums, but none of them seems to match your situation.
Let me know if the database alert log(s) contain anything interesting.
# 6  
Old 07-23-2012
It's fixed. Basically, the consultant told us we were lucky the setup even lasted that long, as both the dm-multipath configuration and the EMC storage configuration were off. I don't know exactly what they did on the SAN side, but after ALUA was enabled on the CLARiiONs, everything has been running as smoothly as can be. No extra settings were needed on the machines: dm-multipath configures itself automatically when it detects an EMC array with ALUA enabled.
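For reference, the explicit multipath.conf equivalent of what dm-multipath auto-detects on an ALUA-enabled array would look roughly like the sketch below; with ALUA switched on we needed no manual stanza at all, so treat this as illustration only:
Code:
device {
        vendor "DGC"
        product ".*"
        # ALUA exposes per-path priorities, so group and order paths by them
        path_grouping_policy "group_by_prio"
        prio "alua"
        hardware_handler "1 alua"
        failback "immediate"
       }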
# 7  
Old 07-23-2012
Thanks for sharing the solution!