MPIO reliability


 
# 1  
Old 08-26-2010

Hi,

we have a few boxes using MPIO; they are connected to storage virtualization software that manages some disk subsystems and presents volumes to the AIX boxes.
Sometimes, when a cable is pulled for a test or when a real problem occurs, lspath correctly shows the state of the paths, for example one path Failed and the other Enabled. When the cable is plugged back in, or the problem is otherwise resolved, that path keeps showing as Failed. It does not recover even after waiting for some time. Nothing we tried changed that, except a reboot of the box. I do not remember exactly whether the path shown as Failed still carried traffic despite what lspath showed (I think I ran fcstat and saw byte counters going up, but I am not sure, it was too long ago).
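For reference, the check looks roughly like this; hdisk2 and fscsi0 are just example names from one of the boxes:
Code:
# list all paths of the disk with their status, parent adapter and connection
lspath -l hdisk2 -F "status:name:parent:connection"
The first field is the path status (Enabled, Failed, Missing, ...).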

Has anybody had a similar experience with MPIO? We thought that, since MPIO has been on the market for some years now, an obvious problem like not updating the status of a path should long be fixed. So we came to the conclusion that it might be some kind of incompatibility with our virtualization software.

I never saw anything like this on a box using PowerPath.

Additionally, this problem does not happen every time, nor on all of the MPIO boxes.

Our boxes are running AIX 5.3 TL11 SP4.

Any hints are welcome.

---------- Post updated at 09:08 AM ---------- Previous update was at 08:54 AM ----------

Here is the configuration of a disk from a box that has had no problem so far - the other boxes have the same parameters for health check etc.:
Code:
> lsattr -El hdisk2
PCM             PCM/friend/dcfcpother                              Path Control Module              False
algorithm       fail_over                                          Algorithm                        True
clr_q           no                                                 Device CLEARS its Queue on error True
dist_err_pcnt   0                                                  Distributed Error Percentage     True
dist_tw_width   50                                                 Distributed Error Sample Time    True
hcheck_cmd      inquiry                                            Health Check Command             True
hcheck_interval 60                                                 Health Check Interval            True
hcheck_mode     nonactive                                          Health Check Mode                True
location                                                           Location Label                   True
lun_id          0x1000000000000                                    Logical Unit Number ID           False
max_transfer    0x40000                                            Maximum TRANSFER Size            True
node_name       0x20070030d910849e                                 FC Node Name                     False
pvid            00c6c34f19954aed0000000000000000                   Physical volume identifier       False
q_err           yes                                                Use QERR bit                     True
q_type          simple                                             Queuing TYPE                     True
queue_depth     16                                                 Queue DEPTH                      True
reassign_to     120                                                REASSIGN time out value          True
reserve_policy  single_path                                        Reserve Policy                   True
rw_timeout      70                                                 READ/WRITE time out value        True
scsi_id         0x829980                                           SCSI ID                          False
start_timeout   60                                                 START unit time out value        True
unique_id       3214fi220001_somelunidentifier                     Unique device identifier         False
ww_name         0x210100e08ba2958f                                 FC World Wide Name               False
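Just for completeness: if the health check settings ever needed to be changed, it would look roughly like the sketch below. Since the disk is normally open in a volume group, -P stages the change in the ODM and it takes effect at the next reboot (hdisk2 is simply the disk shown above, the value 30 is only an example):
Code:
# stage a different health check interval in the ODM; active after reboot
chdev -l hdisk2 -a hcheck_interval=30 -P
# check the values stored in the ODM
lsattr -El hdisk2 -a hcheck_interval -a hcheck_mode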

# 2  
Old 08-26-2010
I wonder if you could post the adapter settings as well?
# 3  
Old 08-26-2010
Hi, I know this problem; you then have to set the path online manually. We use

Code:
smitty mpio -> mpio path management -> enable paths for a device

but in my case the paths come from two VIO servers, which are connected to an IBM DS8300



Directly on the VIO servers there are driver commands for setting paths online again, after replacing a damaged adapter for example.

With SDDPCM it's:
Code:
pcmpath set adapter x online
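With native AIX MPIO (no SDDPCM), the command-line equivalent of that smitty menu should be chpath; a small sketch, with hdisk2/fscsi0 only as placeholder names:
Code:
# set the failed path through fscsi0 back to Enabled
chpath -l hdisk2 -p fscsi0 -s enable
# verify
lspath -l hdisk2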

# 4  
Old 08-26-2010
@funksen
Thanks so far for the info - I don't remember if we tried that one, but I will try it next time I get the chance.

Quote:
Originally Posted by shockneck
I wonder if you could you post the adapter settings as well?
Neither cost nor effort spared:
Code:
> lsattr -El fcs0
bus_intr_lvl  65765      Bus interrupt level                                False
bus_io_addr   0xefc00    Bus I/O address                                    False
bus_mem_addr  0xf0040000 Bus memory address                                 False
init_link     pt2pt      INIT Link flags                                    True
intr_priority 3          Interrupt priority                                 False
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x100000   Maximum Transfer Size                              True
num_cmd_elems 200        Maximum number of COMMANDS to queue to the adapter True
pref_alpa     0x1        Preferred AL_PA                                    True
sw_fc_class   2          FC Class for Fabric                                True

The other adapter has the same settings.

Here is the fscsi device:
Code:
> lsattr -El fscsi0
attach       switch    How this adapter is CONNECTED         False
dyntrk       yes       Dynamic Tracking of FC Devices        True
fc_err_recov fast_fail FC Fabric Event Error RECOVERY Policy True
scsi_id      0xa9f00   Adapter SCSI ID                       False
sw_fc_class  3         FC Class for Fabric                   True

The other device has the same settings.
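For completeness: dyntrk=yes and fc_err_recov=fast_fail, as already set above, are the usual recommendations for clean path failover in a fabric. If they were not set, the change would look roughly like this (the fscsi device has to be unconfigured first, or -P is used and the change only takes effect at the next reboot):
Code:
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P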

Thanks.

Edit:
Just a note - I currently have no way to test/reproduce it, so don't put too much effort into it. Any hint is good though.

Last edited by zaxxon; 08-26-2010 at 06:02 AM..
# 5  
Old 08-26-2010
In my case it takes some time for MPIO to rebuild the paths (VIO + N_Port ID virtualization). We have a script (sketched below) that:
- runs lsdev (looking for Defined disks) and rmdev on them (if any)
- runs lspath (looking for Missing paths) and rmpath on them
- runs cfgmgr
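A minimal sketch of such a cleanup script, assuming native MPIO and that the Defined disks / Missing paths may safely be removed and rediscovered (not the poster's actual script):
Code:
#!/bin/ksh
# remove disk devices that are only in Defined state
lsdev -Cc disk | awk '$2 == "Defined" {print $1}' | while read d; do
    rmdev -dl "$d"
done

# remove paths whose status is Missing
lspath -F "status:name:parent:connection" | awk -F: '$1 == "Missing"' |
while IFS=: read status name parent conn; do
    rmpath -dl "$name" -p "$parent" -w "$conn"
done

# rediscover devices and paths
cfgmgr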
# 6  
Old 09-03-2010
Did you set different priorities on your paths? We had similar problems as long as all our paths had the same priority ...

Regards
zxmaus
# 7  
Old 09-03-2010
No clue if that was the case back then. Currently I find mixed settings: paths with the same priority on one box, and paths on another box with different priorities according to which virtualized storage they primarily talk to (while having algorithm=fail_over).
I also asked a coworker about it a moment ago, who told me he has the task of checking and setting all paths to different priorities.
I will keep path priority in mind as something to check, just in case we see those strange effects again.
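For checking and setting the priority per path with native MPIO, something like the following should work; hdisk2/fscsi0/fscsi1 are placeholder names, and if there is more than one path per adapter the connection (-w) has to be given as well. With algorithm=fail_over, the lower priority value marks the preferred path:
Code:
# show the path attributes (including priority) of one path
lspath -AHE -l hdisk2 -p fscsi0

# give the two paths different priorities
chpath -l hdisk2 -p fscsi0 -a priority=1
chpath -l hdisk2 -p fscsi1 -a priority=2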