Quorum and hdisk issue


 
# 1  
Old 06-27-2013
Quorum and hdisk issue

Hi. I found an issue with the appvg volume group on my server. The server is a single standalone node and is not part of an HACMP cluster.
Code:
pmut3# lspv
hdisk0          00c5c9cf92ebb96a                    rootvg          active
hdisk1          00c5c9cfcf30eee9                    appvg
hdisk2          00c5c9cfcf30ef98                    appvg
hdisk3          00c5c9cfba868e2c                    rootvg          active

Code:
pmut3# lsvg -o
appvg
rootvg

Code:
pmut3# lsvg appvg
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

The two filesystems below (/websp and /opt/websp) belong to appvg:

Code:
/dev/fslv06      5.00      1.92   62%     4491     1% /websp
/dev/fslv07     10.00      4.20   58%    37905     4% /opt/websp
Code:
pmut3# lslv -m fslv07
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

Code:
pmut3# lslv -m fslv06
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

I suspect something is wrong with one of the hdisks in appvg.

Can someone tell me which hdisk (hdisk1 or hdisk2) in appvg has the problem?
Also, what should I do to fix this quorum issue?

Please let me know if you need the output of any other commands from this server. I have informed the application team that
I need downtime to fix the issue on this server and I'm waiting for their reply. I'm afraid I may lose the data in appvg.

Code:
pmut3# lslv fslv06
LOGICAL VOLUME:     fslv06                 VOLUME GROUP:   appvg
LV IDENTIFIER:      00c5c9cf00004c0000000116cf30f4b8.2 PERMISSION:     ?
VG STATE:           active/complete        LV STATE:       ?
TYPE:               jfs2                   WRITE VERIFY:   ?
MAX LPs:            ?                      PP SIZE:        ?
COPIES:             ?                      SCHED POLICY:   ?
LPs:                ?                      PPs:            ?
STALE PPs:          ?                      BB POLICY:      ?
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    1024
MOUNT POINT:        /websp                 LABEL:          /websp
MIRROR WRITE CONSISTENCY: ?
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     ?
INFINITE RETRY:     ?
lslv: open(): There is a request to a device or address that does not exist.
DEVICESUBTYPE:      DS_LVZ

Code:
pmut3# lslv fslv07
LOGICAL VOLUME:     fslv07                 VOLUME GROUP:   appvg
LV IDENTIFIER:      00c5c9cf00004c0000000116cf30f4b8.3 PERMISSION:     ?
VG STATE:           active/complete        LV STATE:       ?
TYPE:               jfs2                   WRITE VERIFY:   ?
MAX LPs:            ?                      PP SIZE:        ?
COPIES:             ?                      SCHED POLICY:   ?
LPs:                ?                      PPs:            ?
STALE PPs:          ?                      BB POLICY:      ?
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    1024
MOUNT POINT:        /opt/websp             LABEL:          /opt/websp
MIRROR WRITE CONSISTENCY: ?
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     ?
INFINITE RETRY:     ?
lslv: open(): There is a request to a device or address that does not exist.
DEVICESUBTYPE:      DS_LVZ

Code:
pmut3# errpt -aj CAD234BE | pg
---------------------------------------------------------------------------
LABEL:          LVM_SA_QUORCLOSE
IDENTIFIER:     CAD234BE

Date/Time:       Thu Jun 20 10:44:39 GMT+01:00 2013
Sequence Number: 8144
Machine Id:      00C5C9CF4C00
Node Id:         pmut3
Class:           H
Type:            UNKN
WPAR:            Global
Resource Name:   LVDD
Resource Class:  NONE
Resource Type:   NONE
Location:

Description
QUORUM LOST, VOLUME GROUP CLOSING

Probable Causes
PHYSICAL VOLUME UNAVAILABLE

Detail Data
MAJOR/MINOR DEVICE NUMBER
8000 0028 0000 0000
QUORUM COUNT
           2
ACTIVE COUNT
           1
SENSE DATA
0000 0000 0000 0645 00C5 C9CF 0000 4C00 0000 0116 CF30 F4B8 0000 0000 0000 0000

# 2  
Old 06-27-2013
The system told you to issue a "varyoffvg" and then a "varyonvg". Have you done that? What was the outcome? Were there any error messages?

Which disk (if it is a disk at all) may have caused a possible problem I can't tell from here, because my line of sight to Bangalore is blocked and my crystal ball is in for repair.

I suggest you start advanced troubleshooting instead, by applying your reading skills to the OS output. Your data are as safe as they can be, given the circumstances, because an inactive VG with only inaccessible filesystems can't get any worse than it already is: either you can revive it or the data on it is already lost.

I hope this helps.

bakunin
# 3  
Old 06-27-2013
Quorum and hdisk issue

Hi Bakunin

Thanks for your reply. Before I do the varyoffvg I need to unmount the two filesystems in that VG.

I have already informed the application team about this issue. When they give me the downtime I will unmount the filesystems and then try to vary off the VG.

Does the procedure below look fine to you?

Code:
umount /websp
umount /opt/websp
varyoffvg appvg
varyonvg appvg
# 4  
Old 06-27-2013
You should really describe your problem better / more exactly. The following was in no way obvious from your first posting:

Quote:
Originally Posted by newtoaixos
before i do the varyoffvg i need to unmount the 2 filesystems present under that VG.
Are the FSs mounted and accessible? What is the output of:

Code:
lsvg -l appvg
lsvg -p appvg

Somehow I doubt that the filesystems are still available when the VG has been closed.

What does "errpt" tell you? The "quorum" is the minimum number of disks that have to be present for a VG to remain valid. Once fewer disks than this quorum are available, the VG is forced offline, which means all the FSs belonging to it are unmounted (which is why I doubt they are really there). Further, there must be some entry in the "errpt" log about an hdisk device failing, otherwise the quorum wouldn't have been lost.
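For reference, checking the quorum setting and scanning the error log for hardware problems could look roughly like this (just a sketch using the VG name from your posting; the lsvg part only works once the VG can be queried again):

Code:
lsvg appvg | grep -i quorum     # shows the quorum setting (only once the VG is varied on)
errpt -d H -T PERM              # permanent hardware errors only (disks, adapters, cables)
errpt -aj CAD234BE | pg         # full detail of the quorum-loss entries you already found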

Quote:
I have informed the application team already about this issue. probably when they give me the downtime I will unmount and then try to varyoff the VG.
NO!

When they decided to commission a system where disks are not redundant they forfeited any right to an uninterruptible service. Hardware fails from time to time, that is old news. Either you have hardware (regardless of what it is: network cards, disks, processors, power supplies, ...) redundant, so that when one part fails the other is still there, or you have hardware that is not redundant: then you have to expect the service to be interrupted from time to time. Everything else is "wash me, but don't make me wet in the process": rubbish. No admin in his right mind lets himself get into such a double-bind situation.

Your two disks cannot have been redundant, because in that case the quorum should have been deactivated: a VG consisting of two mirrored disks is safe even if only one of these disks is present. (If the disks were indeed mirrored: I suggest firing the idiot who configured such horse manure on the spot, for proven incompetence.)
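For a properly mirrored two-disk VG the setup would look roughly like this (a sketch only, using your VG and disk names; the quorum change may need a varyoff/varyon to take effect):

Code:
mirrorvg appvg hdisk2    # mirror every LV in appvg onto the second disk
chvg -Q n appvg          # disable the quorum check, so one surviving disk keeps the VG online
varyoffvg appvg          # the quorum change becomes effective with the next varyon
varyonvg appvg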

Additional question: what are these disks? LUNs (provided via VIOS? NPIV? other?)? Physical disks? RAID sets? Show the output of these commands:

Quote:
lsdev -Cc disk
lsattr -El hdisk1 / hdisk2
Background is: is there a chance that the unavailability of the disk(s) might be temporary in nature? It might work if you issue

Code:
varyonvg -bu appvg

in this case.

I hope this helps.

bakunin
# 5  
Old 06-30-2013
Quorum and hdisk issue

Hi bakunin,

Thanks for your help.

What you said is correct. One disk error came up in errpt. Below are the outputs you requested:

Code:
pmut3# lsvg -l appvg
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

pmut3# lsvg -p appvg
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

pmut3# lsdev -Cc disk
hdisk0 Available 06-08-01-3,0 16 Bit LVD SCSI Disk Drive
hdisk1 Available 06-08-01-4,0 16 Bit LVD SCSI Disk Drive
hdisk2 Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive
hdisk3 Available 06-08-01-8,0 16 Bit LVD SCSI Disk Drive

pmut3# lsattr -El hdisk1
PCM             PCM/friend/scsiscsd                    Path Control Module           False
algorithm       fail_over                              Algorithm                     True
dist_err_pcnt   0                                      Distributed Error Percentage  True
dist_tw_width   50                                     Distributed Error Sample Time True
hcheck_interval 0                                      Health Check Interval         True
hcheck_mode     nonactive                              Health Check Mode             True
max_transfer    0x40000                                Maximum TRANSFER Size         True
pvid            00c5c9cfcf30eee90000000000000000       Physical volume identifier    False
queue_depth     3                                      Queue DEPTH                   False
reserve_policy  single_path                            Reserve Policy                True
size_in_mb      73400                                  Size in Megabytes             False
unique_id       260800023B980AST373455LC08IBM   H0scsi Unique device identifier      False

pmut3# lsattr -El hdisk2
PCM             PCM/friend/scsiscsd                    Path Control Module           False
algorithm       fail_over                              Algorithm                     True
dist_err_pcnt   0                                      Distributed Error Percentage  True
dist_tw_width   50                                     Distributed Error Sample Time True
hcheck_interval 0                                      Health Check Interval         True
hcheck_mode     nonactive                              Health Check Mode             True
max_transfer    0x40000                                Maximum TRANSFER Size         True
pvid            00c5c9cfcf30ef980000000000000000       Physical volume identifier    False
queue_depth     3                                      Queue DEPTH                   False
reserve_policy  single_path                            Reserve Policy                True
size_in_mb      73400                                  Size in Megabytes             False
unique_id       2608000239C40AST373455LC08IBM   H0scsi Unique device identifier      False




pmut3# errpt -aj 8647C4E2 | pg
---------------------------------------------------------------------------
LABEL:          DISK_ERR3
IDENTIFIER:     8647C4E2

Date/Time:       Sun Jun 30 00:24:07 GMT+01:00 2013
Sequence Number: 8370
Machine Id:      00C5C9CF4C00
Node Id:         pmut3
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   hdisk1
Resource Class:  disk
Resource Type:   scsd
Location:        U788C.001.AAB1650-P1-T11-L4-L0

VPD:
        Manufacturer................IBM   H0
        Machine Type and Model......ST373455LC
        FRU Number..................03N6347
        ROS Level and ID............43383036
        Serial Number...............00023B98
        EC Level....................D76038
        Part Number.................03N6346
        Device Specific.(Z0)........000004129F000136
        Device Specific.(Z1)........0309C806
        Device Specific.(Z2)........0002
        Device Specific.(Z3)........07202
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........D76038
        Brand.......................H0

Description
DISK OPERATION ERROR

Probable Causes
DASD DEVICE
STORAGE DEVICE CABLE

Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS
STORAGE DEVICE CABLE

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
           0
SENSE DATA
0604 0000 0800 0046 0100 0000 0000 0000 0200 0400 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0014 B596 0009

Diagnostic Analysis
Diagnostic Log sequence number: 50
Resource tested:        hdisk1
Resource Description:   16 Bit LVD SCSI Disk Drive
Location:               U788C.001.AAB1650-P1-T11-L4-L0
SRN:                    000-129
Description:            Error log analysis indicates a SCSI bus problem.

# 6  
Old 06-30-2013
OK.

The output confirms that hdisk1 is the cause of your problem. Your volume group is definitely offline and gone with it are the filesystems it may have (once) contained. If they appear to be still mounted: don't believe it, they are gone.
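If you want to convince yourself, something along these lines should do (a sketch; the exact error or hang you get depends on how far the VG got before closing):

Code:
mount | grep websp            # does the kernel still list the mounts?
df -g /websp /opt/websp       # typically errors out (or hangs) when the underlying VG is gone
ls /websp                     # any access attempt should fail the same way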

What you see here is a description of the disk (hdisk1) in increasing detail:

Quote:
Originally Posted by newtoaixos
Code:
pmut3# lsdev -Cc disk
hdisk0 Available 06-08-01-3,0 16 Bit LVD SCSI Disk Drive
hdisk1 Available 06-08-01-4,0 16 Bit LVD SCSI Disk Drive
hdisk2 Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive
hdisk3 Available 06-08-01-8,0 16 Bit LVD SCSI Disk Drive

pmut3# lsattr -El hdisk1
PCM             PCM/friend/scsiscsd                    Path Control Module           False
algorithm       fail_over                              Algorithm                     True
dist_err_pcnt   0                                      Distributed Error Percentage  True
dist_tw_width   50                                     Distributed Error Sample Time True
hcheck_interval 0                                      Health Check Interval         True
hcheck_mode     nonactive                              Health Check Mode             True
max_transfer    0x40000                                Maximum TRANSFER Size         True
pvid            00c5c9cfcf30eee90000000000000000       Physical volume identifier    False
queue_depth     3                                      Queue DEPTH                   False
reserve_policy  single_path                            Reserve Policy                True
size_in_mb      73400                                  Size in Megabytes             False
unique_id       260800023B980AST373455LC08IBM   H0scsi Unique device identifier      False

And this is the probable cause for hdisk1 failing. I SNIPped to the interesting part:

Quote:
Originally Posted by newtoaixos
Code:
pmut3# errpt -aj 8647C4E2 | pg
<...SNIP....>
Resource tested:        hdisk1
Resource Description:   16 Bit LVD SCSI Disk Drive
Location:               U788C.001.AAB1650-P1-T11-L4-L0
SRN:                    000-129
Description:            Error log analysis indicates a SCSI bus problem.

Looks like your SCSI disk was failing somehow - this could be anything from a broken cable or a terminator gone bad to the disk itself being broken. First, make sure that the SCSI link is up again. Delete the hdisk1 device and run "cfgmgr" to rediscover it. If it won't come back, the disk is not connected (or broken); if it comes back in status "Available", the disconnection is gone. You should still investigate, because a symptom gone is not a problem solved. Find the reason for the disconnection; only that will solve your problem.
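The delete/rediscover part could look roughly like this (a sketch only; make sure appvg is varied off and nothing else uses hdisk1 before removing the device):

Code:
varyoffvg appvg      # the VG must not be (half-)active while the disk is removed
rmdev -dl hdisk1     # delete the device definition
cfgmgr               # rediscover the hardware
lsdev -Cc disk       # hdisk1 should be back in state "Available"
varyonvg appvg       # then try to bring the VG online again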

Still, don't be shy to start repair action - this server will do nothing without the data necessary for carrying out its function anyway. If business complains: see above. If they are too greedy to pay for mirrored disks they will have to live with failing ones and the time necessary for repair. If the disks are indeed mirrored whoever forgot to (un)set the quorum is to blame and business will have every right to be angry. This is administration basics and should not happen at all.

I hope this helps.

bakunin