Sense key unit attention & iostat hardware and transport errors on SAN disks


 
# 1  
Old 07-23-2013

Hello, I'm trying to get to the bottom of SAN disk errors we've been seeing.
The server is a Sun Fire X4270 M2 running Solaris 10 8/11 (u10) x86, in service since April 2012. The SAN HBAs are SG-PCIE2FC-QF8-Z (Sun-branded QLogic). The SAN storage system is a Hitachi VSP. We have 32 LUNs in use and another 8 LUNs not yet brought into Symantec Storage Foundation.

We started seeing hardware and transport errors on the LUNs on July 2, which led to corruption of 3 Veritas filesystems. I got that resolved on July 3, and we had to restore the 3 filesystems from tape. The SAN team found no SAN switch errors, and Hitachi's analysis showed no disk errors.
We originally had Solaris MPxIO enabled by default for multipathing, along with Veritas DMP. Symantec said the two multipathing systems could co-exist, but the errors returned, so I disabled MPxIO and rebooted on July 17. I didn't see any more errors until yesterday at 11:10 AM. Is this a problem with the SAN HBAs? What do these errors mean? Any help would be appreciated.
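
For reference, here is a minimal sketch of how I confirm the multipathing state after a change like the July 17 one (assuming the stock Solaris 10 and Storage Foundation tools; exact output varies by patch level):

Code:
stmsboot -L                              # lists MPxIO name mappings, or reports that MPxIO is not enabled
grep mpxio-disable /kernel/drv/fp.conf   # "yes" here disables MPxIO for the FC ports
vxdmpadm listctlr all                    # DMP's view of the controllers and their states

Here are the messages from yesterday's event and the iostat error counters: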

Code:
 Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
  Jul 22 11:10:26 cscgbwndc004    SCSI transport failed: reason 'tran_err': retrying command
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,16 (sd77):
  Jul 22 11:10:26 cscgbwndc004    SCSI transport failed: reason 'tran_err': retrying command
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 141200                    Error Block: 141200
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931367
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,16 (sd77):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 14223776                  Error Block: 14223776
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931366
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,10 (sd90):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 12622176                  Error Block: 12622176
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931360
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,1c (sd70):
  Jul 22 11:10:29 cscgbwndc004    Error for Command: read(10)                Error Level: Retryable
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 13265440                  Error Block: 13265440
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE93136C
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,18 (sd74):
  Jul 22 11:10:29 cscgbwndc004    Error for Command: read(10)                Error Level: Retryable
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 13264016                  Error Block: 13264016
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931368
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
   
  iostat -en | egrep "device|errors|c3"
    ---- errors --- 
    s/w h/w trn tot device
      0   1   0   1 c3t50060E8006FE93BBd39
      0   1   0   1 c3t50060E8006FE93BBd38
      0   1   0   1 c3t50060E8006FE93BBd37
      0   1   0   1 c3t50060E8006FE93BBd36
      0   1   0   1 c3t50060E8006FE93BBd35
      0   1   0   1 c3t50060E8006FE93BBd34
      0   1   0   1 c3t50060E8006FE93BBd33
      0   1   0   1 c3t50060E8006FE93BBd32
      0   1   0   1 c3t50060E8006FE93BBd31
      0   1   0   1 c3t50060E8006FE93BBd30
      0   1   0   1 c3t50060E8006FE93BBd29
      0   1   0   1 c3t50060E8006FE93BBd28
      0   1   0   1 c3t50060E8006FE93BBd27
      0   1   0   1 c3t50060E8006FE93BBd26
      0   1   0   1 c3t50060E8006FE93BBd25
      0   1   0   1 c3t50060E8006FE93BBd24
      0   2   1   3 c3t50060E8006FE93BBd23
      0   2   1   3 c3t50060E8006FE93BBd22
      0   1   0   1 c3t50060E8006FE93BBd21
      0   1   0   1 c3t50060E8006FE93BBd20
      0   1   0   1 c3t50060E8006FE93BBd19
      0   1   0   1 c3t50060E8006FE93BBd18
      0   1   0   1 c3t50060E8006FE93BBd17
      0   1   0   1 c3t50060E8006FE93BBd16
      0   1   0   1 c3t50060E8006FE93BBd15
      0   0   0   0 c3t50060E8006FE93BBd14
      0   0   0   0 c3t50060E8006FE93BBd13
      0   0   0   0 c3t50060E8006FE93BBd12
      0   0   0   0 c3t50060E8006FE93BBd11
      0   0   0   0 c3t50060E8006FE93BBd10
      0   0   0   0 c3t50060E8006FE93BBd9
      0   0   0   0 c3t50060E8006FE93BBd8
      0   0   0   0 c3t50060E8006FE93BBd7
      0   0   0   0 c3t50060E8006FE93BBd6
      0   0   0   0 c3t50060E8006FE93BBd5
      0   0   0   0 c3t50060E8006FE93BBd4
      0   0   0   0 c3t50060E8006FE93BBd3
      0   0   0   0 c3t50060E8006FE93BBd2
      0   0   0   0 c3t50060E8006FE93BBd1
      0   0   0   0 c3t50060E8006FE93BBd0
   
  Mapping of disk instance (sd) names to device names (cXtXdX),
  excluding md|st|nfs and including only c3|c4 (the SAN controllers/paths); a sketch of one way to regenerate this mapping follows the list:
  sd43=/dev/dsk/c4t50060E8006FE93ABd39
  sd44=/dev/dsk/c4t50060E8006FE93ABd38
  sd45=/dev/dsk/c4t50060E8006FE93ABd37
  sd46=/dev/dsk/c4t50060E8006FE93ABd36
  sd47=/dev/dsk/c4t50060E8006FE93ABd35
  sd48=/dev/dsk/c4t50060E8006FE93ABd34
  sd49=/dev/dsk/c4t50060E8006FE93ABd33
  sd50=/dev/dsk/c4t50060E8006FE93ABd32
  sd51=/dev/dsk/c4t50060E8006FE93ABd31
  sd52=/dev/dsk/c3t50060E8006FE93BBd39
  sd53=/dev/dsk/c3t50060E8006FE93BBd38
  sd54=/dev/dsk/c4t50060E8006FE93ABd30
  sd55=/dev/dsk/c3t50060E8006FE93BBd37
  sd56=/dev/dsk/c3t50060E8006FE93BBd36
  sd57=/dev/dsk/c4t50060E8006FE93ABd29
  sd58=/dev/dsk/c3t50060E8006FE93BBd35
  sd59=/dev/dsk/c4t50060E8006FE93ABd28
  sd60=/dev/dsk/c3t50060E8006FE93BBd34
  sd61=/dev/dsk/c3t50060E8006FE93BBd33
  sd62=/dev/dsk/c4t50060E8006FE93ABd27
  sd63=/dev/dsk/c3t50060E8006FE93BBd32
  sd64=/dev/dsk/c3t50060E8006FE93BBd31
  sd65=/dev/dsk/c4t50060E8006FE93ABd26
  sd66=/dev/dsk/c3t50060E8006FE93BBd30
  sd67=/dev/dsk/c4t50060E8006FE93ABd25
  sd68=/dev/dsk/c3t50060E8006FE93BBd29
  sd69=/dev/dsk/c4t50060E8006FE93ABd24
  sd70=/dev/dsk/c3t50060E8006FE93BBd28
  sd71=/dev/dsk/c3t50060E8006FE93BBd27
  sd72=/dev/dsk/c3t50060E8006FE93BBd26
  sd73=/dev/dsk/c3t50060E8006FE93BBd25
  sd74=/dev/dsk/c3t50060E8006FE93BBd24
  sd75=/dev/dsk/c3t50060E8006FE93BBd23
  sd76=/dev/dsk/c4t50060E8006FE93ABd23
  sd77=/dev/dsk/c3t50060E8006FE93BBd22
  sd78=/dev/dsk/c4t50060E8006FE93ABd22
  sd79=/dev/dsk/c4t50060E8006FE93ABd21
  sd80=/dev/dsk/c3t50060E8006FE93BBd21
  sd81=/dev/dsk/c4t50060E8006FE93ABd20
  sd82=/dev/dsk/c3t50060E8006FE93BBd20
  sd83=/dev/dsk/c4t50060E8006FE93ABd19
  sd84=/dev/dsk/c3t50060E8006FE93BBd19
  sd85=/dev/dsk/c3t50060E8006FE93BBd18
  sd86=/dev/dsk/c4t50060E8006FE93ABd18
  sd87=/dev/dsk/c4t50060E8006FE93ABd17
  sd88=/dev/dsk/c3t50060E8006FE93BBd17
  sd89=/dev/dsk/c4t50060E8006FE93ABd16
  sd90=/dev/dsk/c3t50060E8006FE93BBd16
  sd91=/dev/dsk/c3t50060E8006FE93BBd15
  sd92=/dev/dsk/c4t50060E8006FE93ABd15
  sd93=/dev/dsk/c3t50060E8006FE93BBd14
  sd94=/dev/dsk/c4t50060E8006FE93ABd14
  sd95=/dev/dsk/c3t50060E8006FE93BBd13
  sd96=/dev/dsk/c4t50060E8006FE93ABd13
  sd97=/dev/dsk/c4t50060E8006FE93ABd12
  sd98=/dev/dsk/c3t50060E8006FE93BBd12
  sd99=/dev/dsk/c4t50060E8006FE93ABd11
  sd100=/dev/dsk/c3t50060E8006FE93BBd11
  sd101=/dev/dsk/c3t50060E8006FE93BBd10
  sd102=/dev/dsk/c4t50060E8006FE93ABd10
  sd103=/dev/dsk/c3t50060E8006FE93BBd9
  sd104=/dev/dsk/c4t50060E8006FE93ABd9
  sd105=/dev/dsk/c3t50060E8006FE93BBd8
  sd106=/dev/dsk/c4t50060E8006FE93ABd8
  sd107=/dev/dsk/c3t50060E8006FE93BBd7
  sd108=/dev/dsk/c4t50060E8006FE93ABd7
  sd109=/dev/dsk/c3t50060E8006FE93BBd6
  sd110=/dev/dsk/c4t50060E8006FE93ABd6
  sd111=/dev/dsk/c3t50060E8006FE93BBd5
  sd112=/dev/dsk/c4t50060E8006FE93ABd5
  sd113=/dev/dsk/c3t50060E8006FE93BBd4
  sd114=/dev/dsk/c4t50060E8006FE93ABd4
  sd115=/dev/dsk/c4t50060E8006FE93ABd3
  sd116=/dev/dsk/c3t50060E8006FE93BBd3
  sd117=/dev/dsk/c4t50060E8006FE93ABd2
  sd118=/dev/dsk/c3t50060E8006FE93BBd2
  sd119=/dev/dsk/c4t50060E8006FE93ABd1
  sd120=/dev/dsk/c3t50060E8006FE93BBd1
  sd121=/dev/dsk/c3t50060E8006FE93BBd0
  sd122=/dev/dsk/c4t50060E8006FE93ABd0
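
For reference, a mapping like the one above can be regenerated by joining /etc/path_to_inst (physical path to instance number) against the /dev/dsk symlinks. This is a hypothetical sketch of one way to do it, not necessarily how the list was produced:

Code:
#!/bin/sh
# For each whole-disk (s2) link on the SAN controllers c3/c4, resolve the
# symlink to its /devices physical path, then look up the sd instance
# number that /etc/path_to_inst records for that path.
for link in /dev/dsk/c[34]t*d*s2; do
    phys=`ls -l $link | awk '{print $NF}' | sed -e 's|.*/devices||' -e 's|:.*$||'`
    inst=`grep "\"$phys\"" /etc/path_to_inst | awk '{print $2}'`
    name=`echo $link | sed 's|s2$||'`
    echo "sd${inst}=${name}"
done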

# 2  
Old 07-24-2013
It could be anywhere along the transport path, from the HBA through the transceivers and cables to the system at the other end. It can also be a hardware error on the HBA itself. A faulty device can likewise cause bus resets, and therefore constant rescans of the bus.

The first thing I would try is to swap the transceivers between two HBAs (if you have two) and see whether the error follows the transceiver or stays on the same controller.
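
A hedged suggestion before physically swapping parts: check whether one HBA port is accumulating link-level errors (fcinfo ships with the Solaris 10 FC stack):

Code:
# Show each HBA port with its link error statistics; a climbing
# "Link Failure Count" or "Loss of Sync Count" on one port points at
# that port's transceiver/cable, while clean counters on both ports
# point further downstream (switch or array side).
fcinfo hba-port -l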
# 3  
Old 07-24-2013
Are your switches zoned? Or is your SAN set up as "everyone sees everything"?

I wonder if those resets correlate to any server on the SAN rebooting.
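
A quick way to test that, assuming the default /var/adm/messages location: pull the timestamps of the reset warnings and compare them against the reboot times of other hosts zoned to the same array ports.

Code:
# Extract month/day/time of each reset warning and count repeats;
# a minute-for-minute match with another server's boot times would
# support the shared-port-reset theory.
cat /var/adm/messages* | grep "power on, reset, or bus reset occurred" |
    awk '{print $1, $2, $3}' | sort | uniq -c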
# 4  
Old 07-24-2013
Thanks for the replies.

I'm waiting for our Oracle support to be renewed, so until then it's hard to get anything done with the existing hardware support team unless I can point to a specific piece of defective hardware. This is the only Intel X4270 we have that's connected to SAN storage; most of our SAN-connected servers are SPARC, so I'm less familiar with this Intel hardware.

The SAN switches are zoned to specific servers.
# 5  
Old 07-26-2013
I think this can be related to the queue depth settings in your environment.

Basically, the storage port is saturated and is rejecting HBA requests; this is normally identified by the ASC=0x29 messages and transport errors.

In Solaris, the queue_depth concept is called max_throttle.

Each storage vendor (NetApp, EMC, HDS, IBM) has a different recommendation for this value.

Review your storage vendor's recommendation and set this parameter accordingly.

On systems using the Sun HBA drivers, you can use the following lines in /etc/system to set max_throttle and io_time:

Code:
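* Note: ssd applies to SPARC FC disks. On an x86 host like this one the
* disk driver is sd (the logs above show sdNN instances), so the
* equivalent tunables are sd:sd_max_throttle and sd:sd_io_time.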
set ssd:ssd_max_throttle=x
set ssd:ssd_io_time=180
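
To check what's currently in effect (the /etc/system change needs a reboot to take hold), you can query the live kernel. This assumes root access to mdb -k; on this x86 host the tunable lives in the sd module:

Code:
# Print the running kernel's throttle value (use ssd_max_throttle on SPARC).
echo "sd_max_throttle/D" | mdb -k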

The queue depth is basically the maximum number of simultaneous operations that a storage port can receive.

For example, I can see that your storage is Hitachi:

for the enterprise line (USP V, VSP) this value is 2048
for the midrange line (AMS, HUS) this value is 512

Then the max_throttle (queue_depth) can be calculated with the following formula (for the enterprise line):

Code:
Queue Depth = 2048 / (total number of LUNs defined on the port), capped at a maximum of 32
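
As a worked example, assuming all 40 of your LUNs (the 32 in use plus the 8 not yet assigned) sit behind a single VSP port:

Code:
* 2048 / 40 = 51.2, which is above the cap, so the maximum of 32 applies.
* /etc/system entries (sd: on this x86 host; ssd: on SPARC):
set sd:sd_max_throttle=32
set sd:sd_io_time=180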

I hope this helps you.

For reference only, the following is text from the Hitachi Knowledge Base:

Code:
Fix: All SUN HBAs 6799, 6727, 6748, 6757, 6767, 6768 are supported by the SUN
device driver stack. It is an SSD device driver.
Sun Solaris has a default Queue_Depth of 64 and I/O timeout of 30 seconds.  In most cases these 
parameters MUST be changed to avoid transport errors, such as:
SCSI transport failed: reason 'tran_err': retrying command

Set as follows:
/etc/system
set ssd:ssd_io_time=180
set ssd:ssd_max_throttle=n
Where n is 8 or 256/# of active initiators/# of LUNs

(Refer to the 9900V or 9900 Series Hitachi Freedom Storage Lightning Sun Solaris Configuration Guides.) 
