Sense key unit attention & iostat hardware and transport errors on SAN disks

Tags
solaris

# 1  
Old 07-23-2013

Hello, I'm trying to get to the bottom of SAN disk errors we've been seeing.
The server is a Sun Fire X4270 M2 that has been running Solaris 10 8/11 (u10) x86 since April 2012. The SAN HBAs are SG-PCIE2FC-QF8-Z (Sun-branded QLogic). The SAN storage system is a Hitachi VSP. We have 32 LUNs in use and another 8 LUNs not yet brought into Symantec Storage Foundation.

We started seeing hardware and transport errors on the LUNs on July 2, which led to corruption of 3 Veritas filesystems. I got that resolved on the 3rd, and we had to restore the 3 filesystems from tape. The SAN team found no SAN switch errors, and Hitachi's analysis showed no disk errors.
We originally had Solaris MPxIO enabled by default for multipathing, along with Veritas DMP. Symantec said that the two multipathing systems could co-exist, but the errors returned, so I disabled MPxIO and rebooted on July 17. I didn't see any more errors until yesterday at 11:10 am. Is this a problem with the SAN HBAs? What do these errors mean? Any help would be appreciated.

Code:
 Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
  Jul 22 11:10:26 cscgbwndc004    SCSI transport failed: reason 'tran_err': retrying command
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,16 (sd77):
  Jul 22 11:10:26 cscgbwndc004    SCSI transport failed: reason 'tran_err': retrying command
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 141200                    Error Block: 141200
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931367
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,16 (sd77):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 14223776                  Error Block: 14223776
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931366
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,10 (sd90):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 12622176                  Error Block: 12622176
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931360
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,1c (sd70):
  Jul 22 11:10:29 cscgbwndc004    Error for Command: read(10)                Error Level: Retryable
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 13265440                  Error Block: 13265440
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE93136C
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,18 (sd74):
  Jul 22 11:10:29 cscgbwndc004    Error for Command: read(10)                Error Level: Retryable
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 13264016                  Error Block: 13264016
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931368
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
   
  iostat -en | egrep "device|errors|c3"
    ---- errors --- 
    s/w h/w trn tot device
      0   1   0   1 c3t50060E8006FE93BBd39
      0   1   0   1 c3t50060E8006FE93BBd38
      0   1   0   1 c3t50060E8006FE93BBd37
      0   1   0   1 c3t50060E8006FE93BBd36
      0   1   0   1 c3t50060E8006FE93BBd35
      0   1   0   1 c3t50060E8006FE93BBd34
      0   1   0   1 c3t50060E8006FE93BBd33
      0   1   0   1 c3t50060E8006FE93BBd32
      0   1   0   1 c3t50060E8006FE93BBd31
      0   1   0   1 c3t50060E8006FE93BBd30
      0   1   0   1 c3t50060E8006FE93BBd29
      0   1   0   1 c3t50060E8006FE93BBd28
      0   1   0   1 c3t50060E8006FE93BBd27
      0   1   0   1 c3t50060E8006FE93BBd26
      0   1   0   1 c3t50060E8006FE93BBd25
      0   1   0   1 c3t50060E8006FE93BBd24
      0   2   1   3 c3t50060E8006FE93BBd23
      0   2   1   3 c3t50060E8006FE93BBd22
      0   1   0   1 c3t50060E8006FE93BBd21
      0   1   0   1 c3t50060E8006FE93BBd20
      0   1   0   1 c3t50060E8006FE93BBd19
      0   1   0   1 c3t50060E8006FE93BBd18
      0   1   0   1 c3t50060E8006FE93BBd17
      0   1   0   1 c3t50060E8006FE93BBd16
      0   1   0   1 c3t50060E8006FE93BBd15
      0   0   0   0 c3t50060E8006FE93BBd14
      0   0   0   0 c3t50060E8006FE93BBd13
      0   0   0   0 c3t50060E8006FE93BBd12
      0   0   0   0 c3t50060E8006FE93BBd11
      0   0   0   0 c3t50060E8006FE93BBd10
      0   0   0   0 c3t50060E8006FE93BBd9
      0   0   0   0 c3t50060E8006FE93BBd8
      0   0   0   0 c3t50060E8006FE93BBd7
      0   0   0   0 c3t50060E8006FE93BBd6
      0   0   0   0 c3t50060E8006FE93BBd5
      0   0   0   0 c3t50060E8006FE93BBd4
      0   0   0   0 c3t50060E8006FE93BBd3
      0   0   0   0 c3t50060E8006FE93BBd2
      0   0   0   0 c3t50060E8006FE93BBd1
      0   0   0   0 c3t50060E8006FE93BBd0
   
  Disk instance (sd) names to device names (cXtXdX).
  Excluding md|st|nfs and including c3|c4 - SAN controller/paths:
  sd43=/dev/dsk/c4t50060E8006FE93ABd39
  sd44=/dev/dsk/c4t50060E8006FE93ABd38
  sd45=/dev/dsk/c4t50060E8006FE93ABd37
  sd46=/dev/dsk/c4t50060E8006FE93ABd36
  sd47=/dev/dsk/c4t50060E8006FE93ABd35
  sd48=/dev/dsk/c4t50060E8006FE93ABd34
  sd49=/dev/dsk/c4t50060E8006FE93ABd33
  sd50=/dev/dsk/c4t50060E8006FE93ABd32
  sd51=/dev/dsk/c4t50060E8006FE93ABd31
  sd52=/dev/dsk/c3t50060E8006FE93BBd39
  sd53=/dev/dsk/c3t50060E8006FE93BBd38
  sd54=/dev/dsk/c4t50060E8006FE93ABd30
  sd55=/dev/dsk/c3t50060E8006FE93BBd37
  sd56=/dev/dsk/c3t50060E8006FE93BBd36
  sd57=/dev/dsk/c4t50060E8006FE93ABd29
  sd58=/dev/dsk/c3t50060E8006FE93BBd35
  sd59=/dev/dsk/c4t50060E8006FE93ABd28
  sd60=/dev/dsk/c3t50060E8006FE93BBd34
  sd61=/dev/dsk/c3t50060E8006FE93BBd33
  sd62=/dev/dsk/c4t50060E8006FE93ABd27
  sd63=/dev/dsk/c3t50060E8006FE93BBd32
  sd64=/dev/dsk/c3t50060E8006FE93BBd31
  sd65=/dev/dsk/c4t50060E8006FE93ABd26
  sd66=/dev/dsk/c3t50060E8006FE93BBd30
  sd67=/dev/dsk/c4t50060E8006FE93ABd25
  sd68=/dev/dsk/c3t50060E8006FE93BBd29
  sd69=/dev/dsk/c4t50060E8006FE93ABd24
  sd70=/dev/dsk/c3t50060E8006FE93BBd28
  sd71=/dev/dsk/c3t50060E8006FE93BBd27
  sd72=/dev/dsk/c3t50060E8006FE93BBd26
  sd73=/dev/dsk/c3t50060E8006FE93BBd25
  sd74=/dev/dsk/c3t50060E8006FE93BBd24
  sd75=/dev/dsk/c3t50060E8006FE93BBd23
  sd76=/dev/dsk/c4t50060E8006FE93ABd23
  sd77=/dev/dsk/c3t50060E8006FE93BBd22
  sd78=/dev/dsk/c4t50060E8006FE93ABd22
  sd79=/dev/dsk/c4t50060E8006FE93ABd21
  sd80=/dev/dsk/c3t50060E8006FE93BBd21
  sd81=/dev/dsk/c4t50060E8006FE93ABd20
  sd82=/dev/dsk/c3t50060E8006FE93BBd20
  sd83=/dev/dsk/c4t50060E8006FE93ABd19
  sd84=/dev/dsk/c3t50060E8006FE93BBd19
  sd85=/dev/dsk/c3t50060E8006FE93BBd18
  sd86=/dev/dsk/c4t50060E8006FE93ABd18
  sd87=/dev/dsk/c4t50060E8006FE93ABd17
  sd88=/dev/dsk/c3t50060E8006FE93BBd17
  sd89=/dev/dsk/c4t50060E8006FE93ABd16
  sd90=/dev/dsk/c3t50060E8006FE93BBd16
  sd91=/dev/dsk/c3t50060E8006FE93BBd15
  sd92=/dev/dsk/c4t50060E8006FE93ABd15
  sd93=/dev/dsk/c3t50060E8006FE93BBd14
  sd94=/dev/dsk/c4t50060E8006FE93ABd14
  sd95=/dev/dsk/c3t50060E8006FE93BBd13
  sd96=/dev/dsk/c4t50060E8006FE93ABd13
  sd97=/dev/dsk/c4t50060E8006FE93ABd12
  sd98=/dev/dsk/c3t50060E8006FE93BBd12
  sd99=/dev/dsk/c4t50060E8006FE93ABd11
  sd100=/dev/dsk/c3t50060E8006FE93BBd11
  sd101=/dev/dsk/c3t50060E8006FE93BBd10
  sd102=/dev/dsk/c4t50060E8006FE93ABd10
  sd103=/dev/dsk/c3t50060E8006FE93BBd9
  sd104=/dev/dsk/c4t50060E8006FE93ABd9
  sd105=/dev/dsk/c3t50060E8006FE93BBd8
  sd106=/dev/dsk/c4t50060E8006FE93ABd8
  sd107=/dev/dsk/c3t50060E8006FE93BBd7
  sd108=/dev/dsk/c4t50060E8006FE93ABd7
  sd109=/dev/dsk/c3t50060E8006FE93BBd6
  sd110=/dev/dsk/c4t50060E8006FE93ABd6
  sd111=/dev/dsk/c3t50060E8006FE93BBd5
  sd112=/dev/dsk/c4t50060E8006FE93ABd5
  sd113=/dev/dsk/c3t50060E8006FE93BBd4
  sd114=/dev/dsk/c4t50060E8006FE93ABd4
  sd115=/dev/dsk/c4t50060E8006FE93ABd3
  sd116=/dev/dsk/c3t50060E8006FE93BBd3
  sd117=/dev/dsk/c4t50060E8006FE93ABd2
  sd118=/dev/dsk/c3t50060E8006FE93BBd2
  sd119=/dev/dsk/c4t50060E8006FE93ABd1
  sd120=/dev/dsk/c3t50060E8006FE93BBd1
  sd121=/dev/dsk/c3t50060E8006FE93BBd0
  sd122=/dev/dsk/c4t50060E8006FE93ABd0

# 2  
Old 07-24-2013
It could be anywhere along the transport path: from the HBA to the transceiver, over the cables, to the other end of the system. It can also be a hardware error on the HBA itself. There can also be a faulty device that causes bus resets and therefore a rescan of the bus all the time.

The first thing I would try is to swap two transceivers (between two HBAs, if you have two) and see whether the error follows the transceiver or stays on the same controller.
This User Gave Thanks to DukeNuke2 For This Post:
TKD (07-24-2013)
# 3  
Old 07-24-2013
Are your switches zoned? Or is your SAN set up as "everyone sees everything"?

I wonder if those resets correlate to any server on the SAN rebooting.
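
One way to check this is to pull the Unit_Attention events out of /var/adm/messages and group them by timestamp, so that each burst can be lined up against reboot or maintenance times of other hosts on the fabric. Below is a minimal Python sketch, assuming the message format shown in the first post. The function name and the regular expressions are my own and are not part of any Solaris tool.

```python
import re
from collections import defaultdict

# Matches the WARNING line that names the device, e.g.
# "... WARNING: /pci@.../disk@w50060e8006fe93bb,17 (sd75):"
WARNING_RE = re.compile(
    r"^\s*(\w{3}\s+\d+\s+[\d:]{8})\s.*WARNING:.*disk@w[0-9a-f]+,([0-9a-f]+)\s+\((sd\d+)\):"
)
SENSE_RE = re.compile(r"Sense Key:\s*Unit_Attention")


def unit_attention_bursts(lines):
    """Group Unit_Attention events by timestamp.

    Returns {timestamp: [(sd_instance, lun_decimal), ...]}.
    """
    bursts = defaultdict(list)
    current = None  # (timestamp, sd instance, LUN) from the latest WARNING line
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            # The LUN in the device path is hexadecimal (0x17 == 23).
            current = (m.group(1), m.group(3), int(m.group(2), 16))
        elif current and SENSE_RE.search(line):
            ts, sd, lun = current
            bursts[ts].append((sd, lun))
            current = None
    return dict(bursts)


# A few lines copied from the log in the first post.
sample = """\
Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,1c (sd70):
Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
""".splitlines()

for ts, events in unit_attention_bursts(sample).items():
    print(ts, events)
```

The decimal LUN it reports should match the dNN suffix in the sd-to-device list from the first post (sd75 is d23, sd70 is d28), which makes it easier to see whether a burst hits every LUN on one path at the same second.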
This User Gave Thanks to achenle For This Post:
TKD (07-24-2013)
# 4  
Old 07-24-2013
Thanks for the replies.

I'm waiting for our Oracle support to be renewed, so until then it's hard to get anything done with the existing hardware support team unless I can point to a specific piece of defective hardware. This is the only Intel X4270 we have that's connected to SAN storage; most of our SAN-connected servers are SPARC, so I'm less familiar with this Intel hardware.

The SAN switches are zoned to specific servers.
# 5  
Old 07-26-2013
I think this may be related to the queue_depth settings in your environment.

Basically, the storage port is saturated and is rejecting HBA requests, which normally shows up as ASC=0x29 messages and transport errors.

In Solaris, the queue_depth concept is called max_throttle.

Each storage vendor (NetApp, EMC, HDS, IBM) has a different recommendation for this value.

Review your storage vendor's recommendation and set this parameter.

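On systems using the Sun drivers for the HBA, you can set max_throttle and io_time in /etc/system. The lines below use the ssd names, which apply on SPARC; on x86 the FC disks attach through the sd driver (your log shows sd75, sd77, ...), so there the equivalents would be sd:sd_max_throttle and sd:sd_io_time.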

Code:
set ssd:ssd_max_throttle=x
set ssd:ssd_io_time=180

The queue_depth is basically the maximum number of simultaneous operations that a storage port can receive.

For example, I can see that your storage is Hitachi:

for the enterprise line (USPV, VSP) this value is 2048
for the midrange line (AMS, HUS) this value is 512

Then max_throttle (queue_depth) can be calculated with the following formula (for enterprise):

Code:
Queue Depth = 2048 / Total number of LUNs defined on the port <= 32
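
As a worked example of that formula, here is a small Python sketch. The function and parameter names are only illustrative, and it assumes that all 40 LUNs from the first post (d0 to d39 on the c3 path) sit on one storage port that no other host shares. If other hosts have LUNs on the same port, those LUNs would need to be counted too.

```python
def max_throttle_enterprise(luns_on_port, port_queue_limit=2048, cap=32):
    """Per-LUN queue depth: port limit divided by LUN count, capped at `cap`."""
    if luns_on_port < 1:
        raise ValueError("luns_on_port must be at least 1")
    return min(cap, port_queue_limit // luns_on_port)


print(max_throttle_enterprise(40))   # 2048 // 40 = 51, capped to 32
print(max_throttle_enterprise(128))  # 2048 // 128 = 16
```

So with 40 LUNs the cap of 32 is what applies, and the division only starts to matter once more than 64 LUNs are defined on the port.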

I hope this helps.

Just for reference, the following is text from the Hitachi Knowledge Base:

Code:
Fix: All SUN HBAs 6799, 6727, 6748, 6757, 6767, 6768 are supported by the SUN device driver stack.
It is an SSD device driver.
Sun Solaris has a default Queue_Depth of 64 and I/O timeout of 30 seconds.  In most cases these 
parameters MUST be changed to avoid transport errors, such as:
SCSI transport failed: reason 'tran_err': retrying command

Set as follows:
/etc/system
set ssd:ssd_io_time=180
set ssd:ssd_max_throttle=n
Where n is 8 or 256/# of active initiators/# of LUNs

(Refer to the 9900V or 9900 Series Hitachi Freedom Storage Lightning Sun Solaris Configuration Guides.) 
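
To make the "256/# of active initiators/# of LUNs" rule concrete, here is a minimal Python sketch. It reads the rule as 256 divided by the number of initiators and then by the number of LUNs, which is my interpretation of the quoted text. The function name is hypothetical, and the 256 figure comes from the quoted 9900-series text, so it may not be the right port limit for a VSP (2048 is the figure given earlier in this post for the enterprise line).

```python
def kb_max_throttle(active_initiators, luns, port_limit=256):
    """Integer value for ssd_max_throttle under the quoted rule.

    Assumed reading: port_limit / active_initiators / luns, rounded down,
    and never less than 1.
    """
    if active_initiators < 1 or luns < 1:
        raise ValueError("initiators and LUN count must both be at least 1")
    return max(1, port_limit // (active_initiators * luns))


print(kb_max_throttle(1, 40))  # 256 // 40 = 6
print(kb_max_throttle(2, 16))  # 256 // 32 = 8
```

With one initiator and the 40 LUNs from this thread the result falls below the fixed value of 8 that the text offers as the alternative, so check with the storage vendor which of the two applies.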
