MPIO reliability


 
# 1  
Old 08-26-2010

Hi,

we have a few boxes using MPIO; they are connected to storage virtualization software that manages some disk subsystems and presents volumes to the AIX boxes.
Sometimes, when a cable is pulled for a test or when a real problem occurs, lspath correctly shows the state of the paths, for example one path Failed and the other Enabled. When the cable is plugged back in, or the problem is otherwise resolved, that path keeps showing as Failed. It does not recover even after waiting for some time. Nothing we tried changed that, except a reboot of the box. I do not remember exactly whether the path shown as Failed still carried traffic despite what lspath showed (I think I ran fcstat and saw byte counters going up, but I am not sure, it was too long ago).
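For reference, the check looks roughly like this; hdisk2 and fscsi0 are just example names from one of the boxes:
Code:
# list all paths of the disk with their status, parent adapter and connection
lspath -l hdisk2 -F "status:name:parent:connection"
The first field is the path status (Enabled, Failed, Missing, ...).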

Has anybody had a similar experience with MPIO? We thought that, since MPIO has been on the market for some years now, an obvious problem like not updating the status of a path should long be fixed. So we came to the conclusion that it might be some kind of incompatibility with our virtualization software.

I never saw anything like this on a box using PowerPath.

Additionally, this problem does not happen every time, nor on all of the MPIO boxes.

Our boxes are running AIX 5.3 TL11 SP4.

Any hints are welcome.

---------- Post updated at 09:08 AM ---------- Previous update was at 08:54 AM ----------

Here is the configuration of a disk from a box that has had no problem so far - the other boxes have the same parameters for health check etc.:
Code:
> lsattr -El hdisk2
PCM             PCM/friend/dcfcpother                              Path Control Module              False
algorithm       fail_over                                          Algorithm                        True
clr_q           no                                                 Device CLEARS its Queue on error True
dist_err_pcnt   0                                                  Distributed Error Percentage     True
dist_tw_width   50                                                 Distributed Error Sample Time    True
hcheck_cmd      inquiry                                            Health Check Command             True
hcheck_interval 60                                                 Health Check Interval            True
hcheck_mode     nonactive                                          Health Check Mode                True
location                                                           Location Label                   True
lun_id          0x1000000000000                                    Logical Unit Number ID           False
max_transfer    0x40000                                            Maximum TRANSFER Size            True
node_name       0x20070030d910849e                                 FC Node Name                     False
pvid            00c6c34f19954aed0000000000000000                   Physical volume identifier       False
q_err           yes                                                Use QERR bit                     True
q_type          simple                                             Queuing TYPE                     True
queue_depth     16                                                 Queue DEPTH                      True
reassign_to     120                                                REASSIGN time out value          True
reserve_policy  single_path                                        Reserve Policy                   True
rw_timeout      70                                                 READ/WRITE time out value        True
scsi_id         0x829980                                           SCSI ID                          False
start_timeout   60                                                 START unit time out value        True
unique_id       3214fi220001_somelunidentifier                     Unique device identifier         False
ww_name         0x210100e08ba2958f                                 FC World Wide Name               False
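Just for completeness: if the health check settings ever needed to be changed, it would look roughly like the sketch below. Since the disk is normally open in a volume group, -P stages the change in the ODM and it takes effect at the next reboot (hdisk2 is simply the disk shown above, the value 30 is only an example):
Code:
# stage a different health check interval in the ODM; active after reboot
chdev -l hdisk2 -a hcheck_interval=30 -P
# check the values stored in the ODM
lsattr -El hdisk2 -a hcheck_interval -a hcheck_mode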

# 2  
Old 08-26-2010
I wonder if you could post the adapter settings as well?
# 3  
Old 08-26-2010
Hi, I know this problem; you then have to set the path online manually. We use

Code:
smitty mpio -> mpio path management -> enable paths for a device

but in my case the paths come from two VIO servers, which are connected to an IBM DS8300



Directly on the VIO servers there are driver commands for setting paths online again, after replacing a damaged adapter for example.

With SDDPCM it's:
Code:
pcmpath set adapter x online
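With native AIX MPIO (no SDDPCM), the command-line equivalent of that smitty menu should be chpath; a small sketch, with hdisk2/fscsi0 only as placeholder names:
Code:
# set the failed path through fscsi0 back to Enabled
chpath -l hdisk2 -p fscsi0 -s enable
# verify
lspath -l hdisk2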

# 4  
Old 08-26-2010
@funksen
Thanks so far for the info - I don't remember if we tried that one, but I will try it next time I get the chance.

Quote:
Originally Posted by shockneck
I wonder if you could you post the adapter settings as well?
Neither cost nor effort spared:
Code:
> lsattr -El fcs0
bus_intr_lvl  65765      Bus interrupt level                                False
bus_io_addr   0xefc00    Bus I/O address                                    False
bus_mem_addr  0xf0040000 Bus memory address                                 False
init_link     pt2pt      INIT Link flags                                    True
intr_priority 3          Interrupt priority                                 False
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x100000   Maximum Transfer Size                              True
num_cmd_elems 200        Maximum number of COMMANDS to queue to the adapter True
pref_alpa     0x1        Preferred AL_PA                                    True
sw_fc_class   2          FC Class for Fabric                                True

The other adapter has the same settings.

Here is the fscsi device:
Code:
> lsattr -El fscsi0
attach       switch    How this adapter is CONNECTED         False
dyntrk       yes       Dynamic Tracking of FC Devices        True
fc_err_recov fast_fail FC Fabric Event Error RECOVERY Policy True
scsi_id      0xa9f00   Adapter SCSI ID                       False
sw_fc_class  3         FC Class for Fabric                   True

The other device has the same settings.
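For completeness: dyntrk=yes and fc_err_recov=fast_fail, as already set above, are the usual recommendations for clean path failover in a fabric. If they were not set, the change would look roughly like this (the fscsi device has to be unconfigured first, or -P is used and the change only takes effect at the next reboot):
Code:
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P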

Thanks.

Edit:
Just a note - I currently have no way to test/reproduce it, so don't put too much effort into it. Any hint is good though.

Last edited by zaxxon; 08-26-2010 at 06:02 AM..
# 5  
Old 08-26-2010
In my case it takes some time for MPIO to rebuild the paths (VIO + N_Port ID virtualization). We have a script (sketched below) that:
- runs lsdev (looking for Defined disks) and rmdev on them (if any)
- runs lspath (looking for Missing paths) and rmpath on them
- runs cfgmgr
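A minimal sketch of such a cleanup script, assuming native MPIO and that the Defined disks / Missing paths may safely be removed and rediscovered (not the poster's actual script):
Code:
#!/bin/ksh
# remove disk devices that are only in Defined state
lsdev -Cc disk | awk '$2 == "Defined" {print $1}' | while read d; do
    rmdev -dl "$d"
done

# remove paths whose status is Missing
lspath -F "status:name:parent:connection" | awk -F: '$1 == "Missing"' |
while IFS=: read status name parent conn; do
    rmpath -dl "$name" -p "$parent" -w "$conn"
done

# rediscover devices and paths
cfgmgr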
# 6  
Old 09-03-2010
Did you set different priorities on your paths? We had similar problems as long as all our paths had the same priority ...

Regards
zxmaus
# 7  
Old 09-03-2010
No clue if that was the case back then. Currently I find mixed settings: paths with the same priority on one box, and paths on another box with different priorities according to which virtualized storage they primarily talk to (while having algorithm=fail_over).
I also asked a coworker about it a moment ago, who told me he has the task of checking and setting all paths to different priorities.
I will keep path priority in mind as something to check, just in case we see those strange effects again.
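For checking and setting the priority per path with native MPIO, something like the following should work; hdisk2/fscsi0/fscsi1 are placeholder names, and if there is more than one path per adapter the connection (-w) has to be given as well. With algorithm=fail_over, the lower priority value marks the preferred path:
Code:
# show the path attributes (including priority) of one path
lspath -AHE -l hdisk2 -p fscsi0

# give the two paths different priorities
chpath -l hdisk2 -p fscsi0 -a priority=1
chpath -l hdisk2 -p fscsi1 -a priority=2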