Sense key unit attention & iostat hardware and transport errors on SAN disks


 
# 1  
Old 07-23-2013

Hello, I'm trying to get to the bottom of SAN disk errors we've been seeing.
The server is a Sun Fire X4270 M2 running Solaris 10 8/11 (u10) x86, in service since April 2012. The SAN HBAs are SG-PCIE2FC-QF8-Z (Sun-branded QLogic). The SAN storage system is a Hitachi VSP. We have 32 LUNs in use and another 8 LUNs not yet brought into Symantec Storage Foundation.

We started seeing hardware and transport errors on the LUNs on July 2, which led to corruption of 3 Veritas filesystems. I got that resolved on July 3, and we had to restore the 3 filesystems from tape. The SAN team found no SAN switch errors, and Hitachi's analysis showed no disk errors.
We originally had Solaris MPxIO enabled by default for multipathing, along with Veritas DMP. Symantec said the two multipathing systems could co-exist, but the errors returned, so I disabled MPxIO and rebooted on July 17. I didn't see any more errors until yesterday at 11:10 AM. Is this a problem with the SAN HBAs? What do these errors mean? Any help would be appreciated.
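
For reference, here is a minimal sketch of how I confirm the multipathing state after a change like the July 17 one (assuming the stock Solaris 10 and Storage Foundation tools; exact output varies by patch level):

Code:
stmsboot -L                              # lists MPxIO name mappings, or reports that MPxIO is not enabled
grep mpxio-disable /kernel/drv/fp.conf   # "yes" here disables MPxIO for the FC ports
vxdmpadm listctlr all                    # DMP's view of the controllers and their states

Here are the messages from yesterday's event and the iostat error counters: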

Code:
 Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
  Jul 22 11:10:26 cscgbwndc004    SCSI transport failed: reason 'tran_err': retrying command
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,16 (sd77):
  Jul 22 11:10:26 cscgbwndc004    SCSI transport failed: reason 'tran_err': retrying command
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 141200                    Error Block: 141200
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931367
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,16 (sd77):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 14223776                  Error Block: 14223776
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931366
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,10 (sd90):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 12622176                  Error Block: 12622176
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931360
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,1c (sd70):
  Jul 22 11:10:29 cscgbwndc004    Error for Command: read(10)                Error Level: Retryable
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 13265440                  Error Block: 13265440
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE93136C
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,18 (sd74):
  Jul 22 11:10:29 cscgbwndc004    Error for Command: read(10)                Error Level: Retryable
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 13264016                  Error Block: 13264016
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931368
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
   
  iostat -en | egrep "device|errors|c3"
    ---- errors --- 
    s/w h/w trn tot device
      0   1   0   1 c3t50060E8006FE93BBd39
      0   1   0   1 c3t50060E8006FE93BBd38
      0   1   0   1 c3t50060E8006FE93BBd37
      0   1   0   1 c3t50060E8006FE93BBd36
      0   1   0   1 c3t50060E8006FE93BBd35
      0   1   0   1 c3t50060E8006FE93BBd34
      0   1   0   1 c3t50060E8006FE93BBd33
      0   1   0   1 c3t50060E8006FE93BBd32
      0   1   0   1 c3t50060E8006FE93BBd31
      0   1   0   1 c3t50060E8006FE93BBd30
      0   1   0   1 c3t50060E8006FE93BBd29
      0   1   0   1 c3t50060E8006FE93BBd28
      0   1   0   1 c3t50060E8006FE93BBd27
      0   1   0   1 c3t50060E8006FE93BBd26
      0   1   0   1 c3t50060E8006FE93BBd25
      0   1   0   1 c3t50060E8006FE93BBd24
      0   2   1   3 c3t50060E8006FE93BBd23
      0   2   1   3 c3t50060E8006FE93BBd22
      0   1   0   1 c3t50060E8006FE93BBd21
      0   1   0   1 c3t50060E8006FE93BBd20
      0   1   0   1 c3t50060E8006FE93BBd19
      0   1   0   1 c3t50060E8006FE93BBd18
      0   1   0   1 c3t50060E8006FE93BBd17
      0   1   0   1 c3t50060E8006FE93BBd16
      0   1   0   1 c3t50060E8006FE93BBd15
      0   0   0   0 c3t50060E8006FE93BBd14
      0   0   0   0 c3t50060E8006FE93BBd13
      0   0   0   0 c3t50060E8006FE93BBd12
      0   0   0   0 c3t50060E8006FE93BBd11
      0   0   0   0 c3t50060E8006FE93BBd10
      0   0   0   0 c3t50060E8006FE93BBd9
      0   0   0   0 c3t50060E8006FE93BBd8
      0   0   0   0 c3t50060E8006FE93BBd7
      0   0   0   0 c3t50060E8006FE93BBd6
      0   0   0   0 c3t50060E8006FE93BBd5
      0   0   0   0 c3t50060E8006FE93BBd4
      0   0   0   0 c3t50060E8006FE93BBd3
      0   0   0   0 c3t50060E8006FE93BBd2
      0   0   0   0 c3t50060E8006FE93BBd1
      0   0   0   0 c3t50060E8006FE93BBd0
   
  Mapping of disk instance (sd) names to device names (cXtXdX),
  excluding md|st|nfs and including only c3|c4 (the SAN controllers/paths); a sketch of one way to regenerate this mapping follows the list:
  sd43=/dev/dsk/c4t50060E8006FE93ABd39
  sd44=/dev/dsk/c4t50060E8006FE93ABd38
  sd45=/dev/dsk/c4t50060E8006FE93ABd37
  sd46=/dev/dsk/c4t50060E8006FE93ABd36
  sd47=/dev/dsk/c4t50060E8006FE93ABd35
  sd48=/dev/dsk/c4t50060E8006FE93ABd34
  sd49=/dev/dsk/c4t50060E8006FE93ABd33
  sd50=/dev/dsk/c4t50060E8006FE93ABd32
  sd51=/dev/dsk/c4t50060E8006FE93ABd31
  sd52=/dev/dsk/c3t50060E8006FE93BBd39
  sd53=/dev/dsk/c3t50060E8006FE93BBd38
  sd54=/dev/dsk/c4t50060E8006FE93ABd30
  sd55=/dev/dsk/c3t50060E8006FE93BBd37
  sd56=/dev/dsk/c3t50060E8006FE93BBd36
  sd57=/dev/dsk/c4t50060E8006FE93ABd29
  sd58=/dev/dsk/c3t50060E8006FE93BBd35
  sd59=/dev/dsk/c4t50060E8006FE93ABd28
  sd60=/dev/dsk/c3t50060E8006FE93BBd34
  sd61=/dev/dsk/c3t50060E8006FE93BBd33
  sd62=/dev/dsk/c4t50060E8006FE93ABd27
  sd63=/dev/dsk/c3t50060E8006FE93BBd32
  sd64=/dev/dsk/c3t50060E8006FE93BBd31
  sd65=/dev/dsk/c4t50060E8006FE93ABd26
  sd66=/dev/dsk/c3t50060E8006FE93BBd30
  sd67=/dev/dsk/c4t50060E8006FE93ABd25
  sd68=/dev/dsk/c3t50060E8006FE93BBd29
  sd69=/dev/dsk/c4t50060E8006FE93ABd24
  sd70=/dev/dsk/c3t50060E8006FE93BBd28
  sd71=/dev/dsk/c3t50060E8006FE93BBd27
  sd72=/dev/dsk/c3t50060E8006FE93BBd26
  sd73=/dev/dsk/c3t50060E8006FE93BBd25
  sd74=/dev/dsk/c3t50060E8006FE93BBd24
  sd75=/dev/dsk/c3t50060E8006FE93BBd23
  sd76=/dev/dsk/c4t50060E8006FE93ABd23
  sd77=/dev/dsk/c3t50060E8006FE93BBd22
  sd78=/dev/dsk/c4t50060E8006FE93ABd22
  sd79=/dev/dsk/c4t50060E8006FE93ABd21
  sd80=/dev/dsk/c3t50060E8006FE93BBd21
  sd81=/dev/dsk/c4t50060E8006FE93ABd20
  sd82=/dev/dsk/c3t50060E8006FE93BBd20
  sd83=/dev/dsk/c4t50060E8006FE93ABd19
  sd84=/dev/dsk/c3t50060E8006FE93BBd19
  sd85=/dev/dsk/c3t50060E8006FE93BBd18
  sd86=/dev/dsk/c4t50060E8006FE93ABd18
  sd87=/dev/dsk/c4t50060E8006FE93ABd17
  sd88=/dev/dsk/c3t50060E8006FE93BBd17
  sd89=/dev/dsk/c4t50060E8006FE93ABd16
  sd90=/dev/dsk/c3t50060E8006FE93BBd16
  sd91=/dev/dsk/c3t50060E8006FE93BBd15
  sd92=/dev/dsk/c4t50060E8006FE93ABd15
  sd93=/dev/dsk/c3t50060E8006FE93BBd14
  sd94=/dev/dsk/c4t50060E8006FE93ABd14
  sd95=/dev/dsk/c3t50060E8006FE93BBd13
  sd96=/dev/dsk/c4t50060E8006FE93ABd13
  sd97=/dev/dsk/c4t50060E8006FE93ABd12
  sd98=/dev/dsk/c3t50060E8006FE93BBd12
  sd99=/dev/dsk/c4t50060E8006FE93ABd11
  sd100=/dev/dsk/c3t50060E8006FE93BBd11
  sd101=/dev/dsk/c3t50060E8006FE93BBd10
  sd102=/dev/dsk/c4t50060E8006FE93ABd10
  sd103=/dev/dsk/c3t50060E8006FE93BBd9
  sd104=/dev/dsk/c4t50060E8006FE93ABd9
  sd105=/dev/dsk/c3t50060E8006FE93BBd8
  sd106=/dev/dsk/c4t50060E8006FE93ABd8
  sd107=/dev/dsk/c3t50060E8006FE93BBd7
  sd108=/dev/dsk/c4t50060E8006FE93ABd7
  sd109=/dev/dsk/c3t50060E8006FE93BBd6
  sd110=/dev/dsk/c4t50060E8006FE93ABd6
  sd111=/dev/dsk/c3t50060E8006FE93BBd5
  sd112=/dev/dsk/c4t50060E8006FE93ABd5
  sd113=/dev/dsk/c3t50060E8006FE93BBd4
  sd114=/dev/dsk/c4t50060E8006FE93ABd4
  sd115=/dev/dsk/c4t50060E8006FE93ABd3
  sd116=/dev/dsk/c3t50060E8006FE93BBd3
  sd117=/dev/dsk/c4t50060E8006FE93ABd2
  sd118=/dev/dsk/c3t50060E8006FE93BBd2
  sd119=/dev/dsk/c4t50060E8006FE93ABd1
  sd120=/dev/dsk/c3t50060E8006FE93BBd1
  sd121=/dev/dsk/c3t50060E8006FE93BBd0
  sd122=/dev/dsk/c4t50060E8006FE93ABd0
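
For reference, a mapping like the one above can be regenerated by joining /etc/path_to_inst (physical path to instance number) against the /dev/dsk symlinks. This is a hypothetical sketch of one way to do it, not necessarily how the list was produced:

Code:
#!/bin/sh
# For each whole-disk (s2) link on the SAN controllers c3/c4, resolve the
# symlink to its /devices physical path, then look up the sd instance
# number that /etc/path_to_inst records for that path.
for link in /dev/dsk/c[34]t*d*s2; do
    phys=`ls -l $link | awk '{print $NF}' | sed -e 's|.*/devices||' -e 's|:.*$||'`
    inst=`grep "\"$phys\"" /etc/path_to_inst | awk '{print $2}'`
    name=`echo $link | sed 's|s2$||'`
    echo "sd${inst}=${name}"
done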

# 2  
Old 07-24-2013
It could be anywhere along the transport path, from the HBA through the transceivers and cables to the system at the other end. It can also be a hardware error on the HBA itself. A faulty device can likewise cause bus resets, and therefore constant rescans of the bus.

The first thing I would try is to swap the transceivers between two HBAs (if you have two) and see whether the error follows the transceiver or stays on the same controller.
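
A hedged suggestion before physically swapping parts: check whether one HBA port is accumulating link-level errors (fcinfo ships with the Solaris 10 FC stack):

Code:
# Show each HBA port with its link error statistics; a climbing
# "Link Failure Count" or "Loss of Sync Count" on one port points at
# that port's transceiver/cable, while clean counters on both ports
# point further downstream (switch or array side).
fcinfo hba-port -l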
# 3  
Old 07-24-2013
Are your switches zoned? Or is your SAN set up as "everyone sees everything"?

I wonder if those resets correlate to any server on the SAN rebooting.
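
A quick way to test that, assuming the default /var/adm/messages location: pull the timestamps of the reset warnings and compare them against the reboot times of other hosts zoned to the same array ports.

Code:
# Extract month/day/time of each reset warning and count repeats;
# a minute-for-minute match with another server's boot times would
# support the shared-port-reset theory.
cat /var/adm/messages* | grep "power on, reset, or bus reset occurred" |
    awk '{print $1, $2, $3}' | sort | uniq -c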
# 4  
Old 07-24-2013
Thanks for the replies.

I'm waiting for our Oracle support to be renewed, so until then it's hard to get anything done with the existing hardware support team unless I can point to a specific piece of defective hardware. This is the only Intel X4270 we have that's connected to SAN storage; most of our SAN-connected servers are SPARC, so I'm less familiar with this Intel hardware.

The SAN switches are zoned to specific servers.
# 5  
Old 07-26-2013
I think this can be related to the queue depth settings in your environment.

Basically, the storage port is saturated and is rejecting HBA requests; this is normally identified by the ASC=0x29 messages and transport errors.

In Solaris, the queue_depth concept is called max_throttle.

Each storage vendor (NetApp, EMC, HDS, IBM) has a different recommendation for this value.

Review your storage vendor's recommendation and set this parameter accordingly.

On systems using the Sun HBA drivers, you can use the following lines in /etc/system to set max_throttle and io_time:

Code:
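* Note: ssd applies to SPARC FC disks. On an x86 host like this one the
* disk driver is sd (the logs above show sdNN instances), so the
* equivalent tunables are sd:sd_max_throttle and sd:sd_io_time.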
set ssd:ssd_max_throttle=x
set ssd:ssd_io_time=180
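
To check what's currently in effect (the /etc/system change needs a reboot to take hold), you can query the live kernel. This assumes root access to mdb -k; on this x86 host the tunable lives in the sd module:

Code:
# Print the running kernel's throttle value (use ssd_max_throttle on SPARC).
echo "sd_max_throttle/D" | mdb -k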

The queue depth is basically the maximum number of simultaneous operations that a storage port can receive.

For example, I can see that your storage is Hitachi:

for the enterprise line (USP V, VSP) this value is 2048
for the midrange line (AMS, HUS) this value is 512

Then the max_throttle (queue_depth) can be calculated with the following formula (for the enterprise line):

Code:
Queue Depth = 2048 / (total number of LUNs defined on the port), capped at a maximum of 32
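
As a worked example, assuming all 40 of your LUNs (the 32 in use plus the 8 not yet assigned) sit behind a single VSP port:

Code:
* 2048 / 40 = 51.2, which is above the cap, so the maximum of 32 applies.
* /etc/system entries (sd: on this x86 host; ssd: on SPARC):
set sd:sd_max_throttle=32
set sd:sd_io_time=180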

I hope this helps you.

For reference only, the following is text from the Hitachi Knowledge Base:

Code:
Fix: All SUN HBAs 6799, 6727, 6748, 6757, 6767, 6768 are supported by the SUN
device driver stack. It is an SSD device driver.
Sun Solaris has a default Queue_Depth of 64 and I/O timeout of 30 seconds.  In most cases these 
parameters MUST be changed to avoid transport errors, such as:
SCSI transport failed: reason 'tran_err': retrying command

Set as follows:
/etc/system
set ssd:ssd_io_time=180
set ssd:ssd_max_throttle=n
Where n is 8 or 256/# of active initiators/# of LUNs

(Refer to the 9900V or 9900 Series Hitachi Freedom Storage Lightning Sun Solaris Configuration Guides.) 
