MPIO reliability


# 1  
Old 08-26-2010
MPIO reliability

Hi,

we have a few boxes using MPIO. They are connected to virtualization software that manages some disk subsystems and offers volumes to the AIX boxes.
Sometimes, when a cable is pulled for a test or a real problem occurs, lspath correctly shows that, for example, one path is Failed while the other is Enabled. But when the cable is plugged back in, or the problem is otherwise resolved, that path keeps showing as Failed. Even after waiting for some time it does not recover. Nothing we tried changed that except a reboot of the box. I do not remember exactly whether the path shown as Failed still carried I/O (I think I ran fcstat and saw the byte counters increasing, but I am not sure, it was too long ago) despite what lspath reported.
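For reference, this is roughly how we check the situation (hdisk2 and fcs0 are just the names on our boxes, yours will differ):
Code:
# show the state of every path of the disk: status, adapter, connection
lspath -l hdisk2 -F "status parent connection"
# FC adapter statistics - to see whether traffic still flows over the link
fcstat fcs0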

Has anybody had a similar experience with MPIO? We thought that since MPIO has been on the market for some years now, an obvious problem like not updating the status of a path should have been fixed long ago. So we came to the conclusion that it might be some kind of incompatibility with our virtualization software.

I have never seen anything like this on a box using PowerPath.

Additionally, this problem does not happen every time, and not on all of the MPIO boxes.

Our boxes are running AIX 5.3 TL11 SP4.

Any hints are welcome.


Here is the disk configuration from a box that has had no problems so far - the other boxes have the same parameters for health check etc.:
Code:
> lsattr -El hdisk2
PCM             PCM/friend/dcfcpother                              Path Control Module              False
algorithm       fail_over                                          Algorithm                        True
clr_q           no                                                 Device CLEARS its Queue on error True
dist_err_pcnt   0                                                  Distributed Error Percentage     True
dist_tw_width   50                                                 Distributed Error Sample Time    True
hcheck_cmd      inquiry                                            Health Check Command             True
hcheck_interval 60                                                 Health Check Interval            True
hcheck_mode     nonactive                                          Health Check Mode                True
location                                                           Location Label                   True
lun_id          0x1000000000000                                    Logical Unit Number ID           False
max_transfer    0x40000                                            Maximum TRANSFER Size            True
node_name       0x20070030d910849e                                 FC Node Name                     False
pvid            00c6c34f19954aed0000000000000000                   Physical volume identifier       False
q_err           yes                                                Use QERR bit                     True
q_type          simple                                             Queuing TYPE                     True
queue_depth     16                                                 Queue DEPTH                      True
reassign_to     120                                                REASSIGN time out value          True
reserve_policy  single_path                                        Reserve Policy                   True
rw_timeout      70                                                 READ/WRITE time out value        True
scsi_id         0x829980                                           SCSI ID                          False
start_timeout   60                                                 START unit time out value        True
unique_id       3214fi220001_somelunidentifier                     Unique device identifier         False
ww_name         0x210100e08ba2958f                                 FC World Wide Name               False
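For what it is worth, the health check attributes above are the ones that should make a failed path recover on its own: as far as I understand it, with hcheck_mode=nonactive the PCM sends the hcheck_cmd down paths that have no active I/O - including failed ones - every hcheck_interval seconds, for disks that are actually open. If one wants to play with the interval, it would look roughly like this (hdisk2 assumed; -P only updates the ODM when the disk is busy, so the change takes effect at the next reboot):
Code:
# show the current health check settings of the disk
lsattr -El hdisk2 -a hcheck_interval -a hcheck_mode
# change the interval to 30 seconds; -P defers it if the disk is in use
chdev -l hdisk2 -a hcheck_interval=30 -P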

# 3  
Old 08-26-2010
Hi, I know this problem - you then have to set the path online manually. We use:

Code:
smitty mpio -> mpio path management -> enable paths for a device
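That smitty dialog just drives chpath underneath; doing it directly would look like this (a sketch - hdisk2 and fscsi1 are assumed names, take yours from the lspath output):
Code:
# re-enable the failed path to hdisk2 that runs through adapter fscsi1
chpath -l hdisk2 -p fscsi1 -s enable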

but in my case the paths come from two VIO servers, which are connected to an IBM DS8300.



When working directly on the VIO servers, there are driver commands to set paths online again, for example after replacing a damaged adapter.

With SDDPCM it is:
Code:
pcmpath set adapter x online
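To check the adapter and path states before and after, SDDPCM also has query commands (the adapter number x comes from the query output):
Code:
# list all adapters with their state and number of paths
pcmpath query adapter
# list all MPIO devices and the state of each of their paths
pcmpath query device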

# 4  
Old 08-26-2010
@funksen
Thanks so far for the info - I don't remember whether we tried that one, but I will give it a try the next time I get the chance.

Quote:
Originally Posted by shockneck
I wonder if you could post the adapter settings as well?
Neither cost nor effort spared:
Code:
> lsattr -El fcs0
bus_intr_lvl  65765      Bus interrupt level                                False
bus_io_addr   0xefc00    Bus I/O address                                    False
bus_mem_addr  0xf0040000 Bus memory address                                 False
init_link     pt2pt      INIT Link flags                                    True
intr_priority 3          Interrupt priority                                 False
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x100000   Maximum Transfer Size                              True
num_cmd_elems 200        Maximum number of COMMANDS to queue to the adapter True
pref_alpa     0x1        Preferred AL_PA                                    True
sw_fc_class   2          FC Class for Fabric                                True

The other adapter has the same settings.

Here is the fscsi device:
Code:
> lsattr -El fscsi0
attach       switch    How this adapter is CONNECTED         False
dyntrk       yes       Dynamic Tracking of FC Devices        True
fc_err_recov fast_fail FC Fabric Event Error RECOVERY Policy True
scsi_id      0xa9f00   Adapter SCSI ID                       False
sw_fc_class  3         FC Class for Fabric                   True

The other device has the same settings.

Thanks.

Edit:
Just a note - I currently have no way to test or reproduce this, so don't put too much effort into it. Any hint is welcome though.

# 5  
Old 08-26-2010
In my case it takes some time for MPIO to rebuild the paths (VIO + NPIV). We have a script that:
- runs lsdev to look for Defined disks and removes them with rmdev (if any)
- runs lspath to look for Missing paths and removes them with rmpath
- runs cfgmgr
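A minimal sketch of what such a cleanup script could look like (assumptions: ksh, and that every Defined disk and Missing path on the box really is stale and safe to remove before cfgmgr rediscovers them):
Code:
#!/bin/ksh
# remove disks stuck in the Defined state
lsdev -Cc disk | awk '$2 == "Defined" { print $1 }' | while read disk; do
    rmdev -dl $disk
done

# remove paths stuck in the Missing state
lspath -F "status name parent connection" | while read status name parent conn; do
    [ "$status" = "Missing" ] && rmpath -d -l $name -p $parent -w "$conn"
done

# rediscover devices and paths
cfgmgr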
# 6  
Old 09-03-2010
Did you set different priorities for your paths? We had similar problems as long as all our paths had the same priority ...
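Checking and setting a path priority can be done with lspath/chpath, roughly like this (hdisk2 and fscsi0 are assumed names; with more than one path per adapter you also need -w with the connection value from lspath):
Code:
# show the attributes of the path through fscsi0, including its priority
lspath -AHE -l hdisk2 -p fscsi0
# lower the preference of this path (1 is the highest priority)
chpath -l hdisk2 -p fscsi0 -a priority=2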

Regards
zxmaus
# 7  
Old 09-03-2010
No clue whether that was the case back then. Currently I see mixed settings: paths with the same priority, and paths on another box with different priorities according to which virtualized storage they primarily talk to (while having algorithm=fail_over).
I also asked a coworker about it a moment ago; he told me he has the task of checking and setting all paths to different priorities.
I will keep checking the path priorities in mind, in case we see those strange effects again.