Quorum and hdisk issue


 
# 1  
Old 06-27-2013
Quorum and hdisk issue

Hi. I found an issue with the appvg volume group on my server. The server is a single standalone node and is not part of an HACMP cluster.
Code:
pmut3# lspv
hdisk0          00c5c9cf92ebb96a                    rootvg          active
hdisk1          00c5c9cfcf30eee9                    appvg
hdisk2          00c5c9cfcf30ef98                    appvg
hdisk3          00c5c9cfba868e2c                    rootvg          active

Code:
pmut3# lsvg -o
appvg
rootvg

Code:
pmut3# lsvg appvg
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

The two filesystems below (/websp and /opt/websp) belong to appvg:

Code:
/dev/fslv06      5.00      1.92   62%     4491     1% /websp
/dev/fslv07     10.00      4.20   58%    37905     4% /opt/websp
Code:
pmut3# lslv -m fslv07
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

Code:
pmut3# lslv -m fslv06
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

I suspect something is wrong with one of the hdisks in appvg.

Can someone tell me which hdisk (hdisk1 or hdisk2) in appvg has the problem?
Also, what should I do to fix this quorum issue?

Please let me know if you need the output of any other commands from this server. I have informed the application team that
I need downtime to fix the issue on this server and I'm waiting for their reply. I'm afraid I may lose the data in appvg.

Code:
pmut3# lslv fslv06
LOGICAL VOLUME:     fslv06                 VOLUME GROUP:   appvg
LV IDENTIFIER:      00c5c9cf00004c0000000116cf30f4b8.2 PERMISSION:     ?
VG STATE:           active/complete        LV STATE:       ?
TYPE:               jfs2                   WRITE VERIFY:   ?
MAX LPs:            ?                      PP SIZE:        ?
COPIES:             ?                      SCHED POLICY:   ?
LPs:                ?                      PPs:            ?
STALE PPs:          ?                      BB POLICY:      ?
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    1024
MOUNT POINT:        /websp                 LABEL:          /websp
MIRROR WRITE CONSISTENCY: ?
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     ?
INFINITE RETRY:     ?
lslv: open(): There is a request to a device or address that does not exist.
DEVICESUBTYPE:      DS_LVZ

Code:
pmut3# lslv fslv07
LOGICAL VOLUME:     fslv07                 VOLUME GROUP:   appvg
LV IDENTIFIER:      00c5c9cf00004c0000000116cf30f4b8.3 PERMISSION:     ?
VG STATE:           active/complete        LV STATE:       ?
TYPE:               jfs2                   WRITE VERIFY:   ?
MAX LPs:            ?                      PP SIZE:        ?
COPIES:             ?                      SCHED POLICY:   ?
LPs:                ?                      PPs:            ?
STALE PPs:          ?                      BB POLICY:      ?
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    1024
MOUNT POINT:        /opt/websp             LABEL:          /opt/websp
MIRROR WRITE CONSISTENCY: ?
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     ?
INFINITE RETRY:     ?
lslv: open(): There is a request to a device or address that does not exist.
DEVICESUBTYPE:      DS_LVZ

Code:
pmut3# errpt -aj CAD234BE | pg
---------------------------------------------------------------------------
LABEL:          LVM_SA_QUORCLOSE
IDENTIFIER:     CAD234BE

Date/Time:       Thu Jun 20 10:44:39 GMT+01:00 2013
Sequence Number: 8144
Machine Id:      00C5C9CF4C00
Node Id:         pmut3
Class:           H
Type:            UNKN
WPAR:            Global
Resource Name:   LVDD
Resource Class:  NONE
Resource Type:   NONE
Location:

Description
QUORUM LOST, VOLUME GROUP CLOSING

Probable Causes
PHYSICAL VOLUME UNAVAILABLE

Detail Data
MAJOR/MINOR DEVICE NUMBER
8000 0028 0000 0000
QUORUM COUNT
           2
ACTIVE COUNT
           1
SENSE DATA
0000 0000 0000 0645 00C5 C9CF 0000 4C00 0000 0116 CF30 F4B8 0000 0000 0000 0000

# 2  
Old 06-27-2013
The system told you to issue a "varyoffvg" and then a "varyonvg". Have you done that? What was the outcome? Were there any error messages?

Which disk (if it is a disk at all) may have caused a possible problem I can't tell from here, because my line of sight to Bangalore is blocked and my crystal ball is in for repair.

I suggest you start advanced troubleshooting instead, by applying your reading skills to the OS output. Your data are as safe as they can be, given the circumstances, because an inactive VG with only inaccessible filesystems can't get any worse than it already is: either you can revive it or the data on it is already lost.

I hope this helps.

bakunin
# 3  
Old 06-27-2013
Quorum and hdisk issue

Hi Bakunin

Thanks for your reply. Before I do the varyoffvg I need to unmount the two filesystems in that VG.

I have already informed the application team about this issue. When they give me the downtime I will unmount the filesystems and then try to vary off the VG.

Does the procedure below look fine to you?

Code:
umount /websp
umount /opt/websp
varyoffvg appvg
varyonvg appvg
# 4  
Old 06-27-2013
You should really describe your problem better / more exactly. The following was in no way obvious from your first posting:

Quote:
Originally Posted by newtoaixos
before i do the varyoffvg i need to unmount the 2 filesystems present under that VG.
Are the FSs mounted and accessible? What is the output of:

Code:
lsvg -l appvg
lsvg -p appvg

Somehow I doubt that the filesystems are still available when the VG has been closed.

What does "errpt" tell you? The "quorum" is the minimum number of disks that have to be present for a VG to remain valid. Once fewer disks than this quorum are available, the VG is forced offline, which means all the FSs belonging to it are unmounted (which is why I doubt they are really there). Further, there must be some entry in the "errpt" log about an hdisk device failing, otherwise the quorum wouldn't have been lost.
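For reference, checking the quorum setting and scanning the error log for hardware problems could look roughly like this (just a sketch using the VG name from your posting; the lsvg part only works once the VG can be queried again):

Code:
lsvg appvg | grep -i quorum     # shows the quorum setting (only once the VG is varied on)
errpt -d H -T PERM              # permanent hardware errors only (disks, adapters, cables)
errpt -aj CAD234BE | pg         # full detail of the quorum-loss entries you already found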

Quote:
I have informed the application team already about this issue. probably when they give me the downtime I will unmount and then try to varyoff the VG.
NO!

When they decided to commission a system where disks are not redundant they forfeited any right to an uninterruptible service. Hardware fails from time to time, that is old news. Either you have hardware (regardless of what it is: network cards, disks, processors, power supplies, ...) redundant, so that when one part fails the other is still there, or you have hardware that is not redundant: then you have to expect the service to be interrupted from time to time. Everything else is "wash me, but don't make me wet in the process": rubbish. No admin in his right mind lets himself get into such a double-bind situation.

Your two disks cannot have been redundant, because in that case the quorum should have been deactivated: a VG consisting of two mirrored disks is safe even if only one of these disks is present. (If the disks were indeed mirrored: I suggest firing the idiot who configured such horse manure on the spot, for proven incompetence.)
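For a properly mirrored two-disk VG the setup would look roughly like this (a sketch only, using your VG and disk names; the quorum change may need a varyoff/varyon to take effect):

Code:
mirrorvg appvg hdisk2    # mirror every LV in appvg onto the second disk
chvg -Q n appvg          # disable the quorum check, so one surviving disk keeps the VG online
varyoffvg appvg          # the quorum change becomes effective with the next varyon
varyonvg appvg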

Additional question: what are these disks? LUNs (provided via VIOS? NPIV? other?)? Physical disks? RAID sets? Show the output of these commands:

Quote:
lsdev -Cc disk
lsattr -El hdisk1 / hdisk2
Background is: is there a chance that the unavailability of the disk(s) might be temporary in nature? It might work if you issue

Code:
varyonvg -bu appvg

in this case.

I hope this helps.

bakunin
# 5  
Old 06-30-2013
Quorum and hdisk issue

Hi bakunin,

Thanks for your help.

What you said is correct. One disk error came up in errpt. Below are the outputs you requested:

Code:
pmut3# lsvg -l appvg
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

pmut3# lsvg -p appvg
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.

pmut3# lsdev -Cc disk
hdisk0 Available 06-08-01-3,0 16 Bit LVD SCSI Disk Drive
hdisk1 Available 06-08-01-4,0 16 Bit LVD SCSI Disk Drive
hdisk2 Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive
hdisk3 Available 06-08-01-8,0 16 Bit LVD SCSI Disk Drive

pmut3# lsattr -El hdisk1
PCM             PCM/friend/scsiscsd                    Path Control Module           False
algorithm       fail_over                              Algorithm                     True
dist_err_pcnt   0                                      Distributed Error Percentage  True
dist_tw_width   50                                     Distributed Error Sample Time True
hcheck_interval 0                                      Health Check Interval         True
hcheck_mode     nonactive                              Health Check Mode             True
max_transfer    0x40000                                Maximum TRANSFER Size         True
pvid            00c5c9cfcf30eee90000000000000000       Physical volume identifier    False
queue_depth     3                                      Queue DEPTH                   False
reserve_policy  single_path                            Reserve Policy                True
size_in_mb      73400                                  Size in Megabytes             False
unique_id       260800023B980AST373455LC08IBM   H0scsi Unique device identifier      False

pmut3# lsattr -El hdisk2
PCM             PCM/friend/scsiscsd                    Path Control Module           False
algorithm       fail_over                              Algorithm                     True
dist_err_pcnt   0                                      Distributed Error Percentage  True
dist_tw_width   50                                     Distributed Error Sample Time True
hcheck_interval 0                                      Health Check Interval         True
hcheck_mode     nonactive                              Health Check Mode             True
max_transfer    0x40000                                Maximum TRANSFER Size         True
pvid            00c5c9cfcf30ef980000000000000000       Physical volume identifier    False
queue_depth     3                                      Queue DEPTH                   False
reserve_policy  single_path                            Reserve Policy                True
size_in_mb      73400                                  Size in Megabytes             False
unique_id       2608000239C40AST373455LC08IBM   H0scsi Unique device identifier      False




pmut3# errpt -aj 8647C4E2 | pg
---------------------------------------------------------------------------
LABEL:          DISK_ERR3
IDENTIFIER:     8647C4E2

Date/Time:       Sun Jun 30 00:24:07 GMT+01:00 2013
Sequence Number: 8370
Machine Id:      00C5C9CF4C00
Node Id:         pmut3
Class:           H
Type:            PERM
WPAR:            Global
Resource Name:   hdisk1
Resource Class:  disk
Resource Type:   scsd
Location:        U788C.001.AAB1650-P1-T11-L4-L0

VPD:
        Manufacturer................IBM   H0
        Machine Type and Model......ST373455LC
        FRU Number..................03N6347
        ROS Level and ID............43383036
        Serial Number...............00023B98
        EC Level....................D76038
        Part Number.................03N6346
        Device Specific.(Z0)........000004129F000136
        Device Specific.(Z1)........0309C806
        Device Specific.(Z2)........0002
        Device Specific.(Z3)........07202
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........D76038
        Brand.......................H0

Description
DISK OPERATION ERROR

Probable Causes
DASD DEVICE
STORAGE DEVICE CABLE

Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS
STORAGE DEVICE CABLE

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
           0
SENSE DATA
0604 0000 0800 0046 0100 0000 0000 0000 0200 0400 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0014 B596 0009

Diagnostic Analysis
Diagnostic Log sequence number: 50
Resource tested:        hdisk1
Resource Description:   16 Bit LVD SCSI Disk Drive
Location:               U788C.001.AAB1650-P1-T11-L4-L0
SRN:                    000-129
Description:            Error log analysis indicates a SCSI bus problem.

# 6  
Old 06-30-2013
OK.

The output confirms that hdisk1 is the cause of your problem. Your volume group is definitely offline and gone with it are the filesystems it may have (once) contained. If they appear to be still mounted: don't believe it, they are gone.
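If you want to convince yourself, something along these lines should do (a sketch; the exact error or hang you get depends on how far the VG got before closing):

Code:
mount | grep websp            # does the kernel still list the mounts?
df -g /websp /opt/websp       # typically errors out (or hangs) when the underlying VG is gone
ls /websp                     # any access attempt should fail the same way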

What you see here is a description of the disk (hdisk1) in increasing detail:

Quote:
Originally Posted by newtoaixos
Code:
pmut3# lsdev -Cc disk
hdisk0 Available 06-08-01-3,0 16 Bit LVD SCSI Disk Drive
hdisk1 Available 06-08-01-4,0 16 Bit LVD SCSI Disk Drive
hdisk2 Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive
hdisk3 Available 06-08-01-8,0 16 Bit LVD SCSI Disk Drive

pmut3# lsattr -El hdisk1
PCM             PCM/friend/scsiscsd                    Path Control Module           False
algorithm       fail_over                              Algorithm                     True
dist_err_pcnt   0                                      Distributed Error Percentage  True
dist_tw_width   50                                     Distributed Error Sample Time True
hcheck_interval 0                                      Health Check Interval         True
hcheck_mode     nonactive                              Health Check Mode             True
max_transfer    0x40000                                Maximum TRANSFER Size         True
pvid            00c5c9cfcf30eee90000000000000000       Physical volume identifier    False
queue_depth     3                                      Queue DEPTH                   False
reserve_policy  single_path                            Reserve Policy                True
size_in_mb      73400                                  Size in Megabytes             False
unique_id       260800023B980AST373455LC08IBM   H0scsi Unique device identifier      False

And this is the probable cause for hdisk1 failing. I SNIPped to the interesting part:

Quote:
Originally Posted by newtoaixos
Code:
pmut3# errpt -aj 8647C4E2 | pg
<...SNIP....>
Resource tested:        hdisk1
Resource Description:   16 Bit LVD SCSI Disk Drive
Location:               U788C.001.AAB1650-P1-T11-L4-L0
SRN:                    000-129
Description:            Error log analysis indicates a SCSI bus problem.

Looks like your SCSI disk was failing somehow - this could be anything from a broken cable or a terminator gone bad to the disk itself being broken. First, make sure that the SCSI link is up again. Delete the hdisk1 device and run "cfgmgr" to rediscover it. If it won't come back, the disk is not connected (or broken); if it comes back in status "Available", the disconnection is gone. You should still investigate, because a symptom gone is not a problem solved. Find the reason for the disconnection; only that will solve your problem.
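The delete/rediscover part could look roughly like this (a sketch only; make sure appvg is varied off and nothing else uses hdisk1 before removing the device):

Code:
varyoffvg appvg      # the VG must not be (half-)active while the disk is removed
rmdev -dl hdisk1     # delete the device definition
cfgmgr               # rediscover the hardware
lsdev -Cc disk       # hdisk1 should be back in state "Available"
varyonvg appvg       # then try to bring the VG online again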

Still, don't be shy to start repair action - this server will do nothing without the data necessary for carrying out its function anyway. If business complains: see above. If they are too greedy to pay for mirrored disks they will have to live with failing ones and the time necessary for repair. If the disks are indeed mirrored whoever forgot to (un)set the quorum is to blame and business will have every right to be angry. This is administration basics and should not happen at all.

I hope this helps.

bakunin