Hi. I found an issue with the appvg volume group on my server. The server is a single node and is not part of an HACMP cluster.
Code:
pmut3# lspv
hdisk0 00c5c9cf92ebb96a rootvg active
hdisk1 00c5c9cfcf30eee9 appvg
hdisk2 00c5c9cfcf30ef98 appvg
hdisk3 00c5c9cfba868e2c rootvg active
Code:
pmut3# lsvg -o
appvg
rootvg
Code:
pmut3# lsvg appvg
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.
The two filesystems mentioned below (/websp and /opt/websp) belong to appvg:
Code:
pmut3# lslv -m fslv07
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.
Code:
pmut3# lslv -m fslv06
0516-034 : Failed to open VG special file. Probable cause
is the VG was forced offline. Execute the varyoffvg and varyonvg
commands to bring the VG online.
I guess something is wrong with the hdisks under appvg.
Can someone tell me which hdisk (hdisk1 or hdisk2) under appvg has a problem?
Also, what should I do to fix this quorum issue?
Please let me know if you need the output of any commands from this server. I have informed the application team that I need downtime to fix the issue on this server, and I'm waiting for their reply. I'm afraid I may lose the data under appvg.
Code:
pmut3# lslv fslv06
LOGICAL VOLUME: fslv06 VOLUME GROUP: appvg
LV IDENTIFIER: 00c5c9cf00004c0000000116cf30f4b8.2 PERMISSION: ?
VG STATE: active/complete LV STATE: ?
TYPE: jfs2 WRITE VERIFY: ?
MAX LPs: ? PP SIZE: ?
COPIES: ? SCHED POLICY: ?
LPs: ? PPs: ?
STALE PPs: ? BB POLICY: ?
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: middle UPPER BOUND: 1024
MOUNT POINT: /websp LABEL: /websp
MIRROR WRITE CONSISTENCY: ?
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: ?
INFINITE RETRY: ?
lslv: open(): There is a request to a device or address that does not exist.
DEVICESUBTYPE: DS_LVZ
Code:
pmut3# lslv fslv07
LOGICAL VOLUME: fslv07 VOLUME GROUP: appvg
LV IDENTIFIER: 00c5c9cf00004c0000000116cf30f4b8.3 PERMISSION: ?
VG STATE: active/complete LV STATE: ?
TYPE: jfs2 WRITE VERIFY: ?
MAX LPs: ? PP SIZE: ?
COPIES: ? SCHED POLICY: ?
LPs: ? PPs: ?
STALE PPs: ? BB POLICY: ?
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: middle UPPER BOUND: 1024
MOUNT POINT: /opt/websp LABEL: /opt/websp
MIRROR WRITE CONSISTENCY: ?
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: ?
INFINITE RETRY: ?
lslv: open(): There is a request to a device or address that does not exist.
DEVICESUBTYPE: DS_LVZ
Code:
pmut3# errpt -aj CAD234BE | pg
---------------------------------------------------------------------------
LABEL: LVM_SA_QUORCLOSE
IDENTIFIER: CAD234BE
Date/Time: Thu Jun 20 10:44:39 GMT+01:00 2013
Sequence Number: 8144
Machine Id: 00C5C9CF4C00
Node Id: pmut3
Class: H
Type: UNKN
WPAR: Global
Resource Name: LVDD
Resource Class: NONE
Resource Type: NONE
Location:
Description
QUORUM LOST, VOLUME GROUP CLOSING
Probable Causes
PHYSICAL VOLUME UNAVAILABLE
Detail Data
MAJOR/MINOR DEVICE NUMBER
8000 0028 0000 0000
QUORUM COUNT
2
ACTIVE COUNT
1
SENSE DATA
0000 0000 0000 0645 00C5 C9CF 0000 4C00 0000 0116 CF30 F4B8 0000 0000 0000 0000
The system told you to issue a "varyoffvg" and then a "varyonvg". Have you done that? What was the outcome? Were there any error messages?
Which disk (if a disk at all) may have caused the problem I can't tell from here, because my line of sight to Bangalore is blocked and my crystal ball is in repair.
I suggest you start advanced troubleshooting instead, by applying your reading skills to the OS output. Your data are as safe as they can be, given the circumstances: an inactive VG with only inaccessible filesystems can't get any worse than it already is. Either you can revive it or the data on it is already lost.
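For reference, the sequence the error message points at would look something like this (a sketch only; the filesystems must be unmounted first, and expect varyonvg to fail again if a disk is still missing):

```shell
# Take the VG fully offline first (filesystems must be unmounted)
umount /websp
umount /opt/websp
varyoffvg appvg

# Then bring it back online; this re-reads the VGDA from the disks
varyonvg appvg

# If a disk is still missing, a forced varyon may work - at your own risk:
# varyonvg -f appvg
```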
You should really describe your problem better / more exactly. The following was in no way obvious from your first posting:
Quote:
Originally Posted by newtoaixos
Before I do the varyoffvg I need to unmount the 2 filesystems present under that VG.
Are the FSs mounted and accessible?? What is the output of:
Code:
lsvg -l appvg
lsvg -p appvg
Somehow I doubt that the filesystems are still available when the VG has been closed.
What does "errpt" tell you? The "quorum" is the minimum number of disks that have to be present for a VG to remain valid. Once fewer disks than the quorum are available, the VG is forced offline, which means all the FSs belonging to it are unmounted (which is why I doubt they are really there). Further, there must be some entry in the "errpt" log about a failing hdisk device, otherwise the quorum wouldn't have been lost.
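A quick way to check both points (the quorum state and any disk errors) would be along these lines; the command names are standard AIX, the VG name is taken from this thread:

```shell
# Show the VG characteristics, including the QUORUM: field
lsvg appvg

# List the PVs in the VG and their state (a failed disk shows as "missing" or "removed")
lsvg -p appvg

# Summarize the error log; look for hdisk entries around the time the quorum was lost
errpt | grep -i hdisk
```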
Quote:
I have informed the application team already about this issue. probably when they give me the downtime I will unmount and then try to varyoff the VG.
NO!
When they decided to commission a system where the disks are not redundant, they forfeited any right to an uninterruptible service. Hardware fails from time to time; that is old news. Either your hardware (regardless of what it is: network cards, disks, processors, power supplies, ...) is redundant, so that when one part fails the other is still there, or it is not redundant: then you have to expect the service to be interrupted from time to time. Everything else is "wash me, but don't make me wet in the process": rubbish. No admin in his right mind lets himself get into such a double-bind situation.
Your 2 disks cannot have been redundant, because in that case the quorum should have been deactivated: a VG consisting of two mirrored disks is safe even if only one of these disks is present. (If the disks were indeed mirrored: I suggest firing the idiot who configured such horse manure, on the spot, for proven incompetence.)
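For a two-disk mirrored VG, deactivating the quorum is a one-liner (a sketch; the change takes effect at the next varyon, so plan a varyoff/varyon or a reboot):

```shell
# Disable the quorum check on appvg so the VG stays online with one of two disks
chvg -Q n appvg

# Verify: lsvg should now report "QUORUM: 1 (Disabled)"
lsvg appvg | grep -i quorum
```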
Additional question: what are these disks? LUNs (provided via VIOS? NPIV? other)? Physical disks? RAID sets? Show the output of these commands:
Quote:
lsdev -Cc disk
lsattr -El hdisk1
lsattr -El hdisk2
Background: is there a chance that the unavailability of the disk(s) is only temporary in nature? If so, it might work to simply remove and rediscover the device once the connection is restored.
The output confirms that hdisk1 is the cause of your problem. Your volume group is definitely offline and gone with it are the filesystems it may have (once) contained. If they appear to be still mounted: don't believe it, they are gone.
What you see here is a description of the disk (hdisk1) in increasing detail:
Quote:
Originally Posted by newtoaixos
Code:
pmut3# lsdev -Cc disk
hdisk0 Available 06-08-01-3,0 16 Bit LVD SCSI Disk Drive
hdisk1 Available 06-08-01-4,0 16 Bit LVD SCSI Disk Drive
hdisk2 Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive
hdisk3 Available 06-08-01-8,0 16 Bit LVD SCSI Disk Drive
pmut3# lsattr -El hdisk1
PCM PCM/friend/scsiscsd Path Control Module False
algorithm fail_over Algorithm True
dist_err_pcnt 0 Distributed Error Percentage True
dist_tw_width 50 Distributed Error Sample Time True
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
max_transfer 0x40000 Maximum TRANSFER Size True
pvid 00c5c9cfcf30eee90000000000000000 Physical volume identifier False
queue_depth 3 Queue DEPTH False
reserve_policy single_path Reserve Policy True
size_in_mb 73400 Size in Megabytes False
unique_id 260800023B980AST373455LC08IBM H0scsi Unique device identifier False
And this is the probable cause for hdisk1 failing. I SNIPped to the interesting part:
Quote:
Originally Posted by newtoaixos
Code:
pmut3# errpt -aj 8647C4E2 | pg
<...SNIP....>
Resource tested: hdisk1
Resource Description: 16 Bit LVD SCSI Disk Drive
Location: U788C.001.AAB1650-P1-T11-L4-L0
SRN: 000-129
Description: Error log analysis indicates a SCSI bus problem.
Looks like your SCSI disk was failing somehow - this could be anything from a broken cable or a terminator gone bad to the disk itself having broken. First, make sure that the SCSI link is up again. Delete the hdisk1 device and run "cfgmgr" to rediscover it. If it won't come back, the disk is not connected (or broken); if it comes back in status "Available", the disconnection is gone. You should still investigate, because a symptom gone is not a problem solved. Find the reason for the disconnection; only this will solve your problem.
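The delete-and-rediscover sequence described above would look roughly like this (a sketch; the VG must be offline, or rmdev will report the device as busy):

```shell
# Remove the device definition for the failed disk
rmdev -dl hdisk1

# Walk the buses again and rediscover attached devices
cfgmgr

# Check whether the disk came back with its PVID intact
lspv | grep hdisk1
```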
Still, don't be shy about starting the repair action - this server will do nothing without the data necessary for carrying out its function anyway. If business complains: see above. If they are too greedy to pay for mirrored disks, they will have to live with failing ones and with the time necessary for repair. If the disks are indeed mirrored, then whoever forgot to (un)set the quorum is to blame, and business will have every right to be angry. This is administration basics and should not happen at all.
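For completeness, setting up the mirroring correctly in the first place would be along these lines (a sketch, assuming hdisk1 and hdisk2 as in this thread and both already belonging to the VG):

```shell
# Mirror every LV in appvg onto the second disk
mirrorvg appvg hdisk2

# Make sure quorum is off so one surviving mirror keeps the VG online
# (don't rely on mirrorvg having done this for you)
chvg -Q n appvg

# Re-synchronize any stale partitions
syncvg -v appvg
```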