Solaris

The Solaris Operating System, usually known simply as Solaris, is a Unix operating system originally developed by Sun Microsystems. The Solaris OS is now owned by Oracle.

Sense key unit attention & iostat hardware and transport errors on SAN disks

    #1  
Old 07-23-2013
TKD

Hello, I'm trying to get to the bottom of SAN disk errors we've been seeing.
The server is a Sun Fire X4270 M2 running Solaris 10 8/11 (Update 10) x86, in service since April 2012. The SAN HBAs are Sun-branded QLogic SG-PCIE2FC-QF8-Z cards. The SAN storage system is a Hitachi VSP. We have 32 LUNs in use and another 8 LUNs not yet brought into Symantec Storage Foundation.

We started seeing hardware and transport errors on the LUNs on July 2, which led to corruption of three Veritas filesystems. I got that resolved on July 3, and we had to restore the three filesystems from tape. The SAN team found no SAN switch errors, and Hitachi's analysis showed no disk errors.
We originally had Solaris MPxIO enabled by default for multipathing, along with Veritas DMP. Symantec said the two multipathing stacks could coexist, but the errors returned, so I disabled MPxIO and rebooted on July 17. I didn't see any more errors until yesterday at 11:10 AM. Is this a problem with the SAN HBAs? What do these errors mean? Any help would be appreciated.
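For what it's worth, the FC link error counters on the HBA ports can help answer the HBA question: if they climb between samples while the errors recur, the local HBA, SFP, or cabling is implicated; if they stay flat, the resets are more likely coming from the fabric or the array. A rough sketch using Solaris 10's fcinfo (the exact output fields vary by update level):

Code:
# Per-HBA-port FC link error counters.
# Rising Link Failure / Loss of Sync / Invalid CRC counts point at the
# local HBA, SFP, or cabling rather than the array.
fcinfo hba-port -l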

Code:
 Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
  Jul 22 11:10:26 cscgbwndc004    SCSI transport failed: reason 'tran_err': retrying command
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,16 (sd77):
  Jul 22 11:10:26 cscgbwndc004    SCSI transport failed: reason 'tran_err': retrying command
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,17 (sd75):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 141200                    Error Block: 141200
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931367
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,16 (sd77):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 14223776                  Error Block: 14223776
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931366
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,10 (sd90):
  Jul 22 11:10:26 cscgbwndc004    Error for Command: write(10)               Error Level: Retryable
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 12622176                  Error Block: 12622176
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931360
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:26 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,1c (sd70):
  Jul 22 11:10:29 cscgbwndc004    Error for Command: read(10)                Error Level: Retryable
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 13265440                  Error Block: 13265440
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE93136C
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1077,171@0,1/fp@0,0/disk@w50060e8006fe93bb,18 (sd74):
  Jul 22 11:10:29 cscgbwndc004    Error for Command: read(10)                Error Level: Retryable
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Requested Block: 13264016                  Error Block: 13264016
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Vendor: HITACHI                            Serial Number: 50 0FE931368
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      Sense Key: Unit_Attention
  Jul 22 11:10:29 cscgbwndc004 scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
   
  iostat -en | egrep "device|errors|c3"
    ---- errors --- 
    s/w h/w trn tot device
      0   1   0   1 c3t50060E8006FE93BBd39
      0   1   0   1 c3t50060E8006FE93BBd38
      0   1   0   1 c3t50060E8006FE93BBd37
      0   1   0   1 c3t50060E8006FE93BBd36
      0   1   0   1 c3t50060E8006FE93BBd35
      0   1   0   1 c3t50060E8006FE93BBd34
      0   1   0   1 c3t50060E8006FE93BBd33
      0   1   0   1 c3t50060E8006FE93BBd32
      0   1   0   1 c3t50060E8006FE93BBd31
      0   1   0   1 c3t50060E8006FE93BBd30
      0   1   0   1 c3t50060E8006FE93BBd29
      0   1   0   1 c3t50060E8006FE93BBd28
      0   1   0   1 c3t50060E8006FE93BBd27
      0   1   0   1 c3t50060E8006FE93BBd26
      0   1   0   1 c3t50060E8006FE93BBd25
      0   1   0   1 c3t50060E8006FE93BBd24
      0   2   1   3 c3t50060E8006FE93BBd23
      0   2   1   3 c3t50060E8006FE93BBd22
      0   1   0   1 c3t50060E8006FE93BBd21
      0   1   0   1 c3t50060E8006FE93BBd20
      0   1   0   1 c3t50060E8006FE93BBd19
      0   1   0   1 c3t50060E8006FE93BBd18
      0   1   0   1 c3t50060E8006FE93BBd17
      0   1   0   1 c3t50060E8006FE93BBd16
      0   1   0   1 c3t50060E8006FE93BBd15
      0   0   0   0 c3t50060E8006FE93BBd14
      0   0   0   0 c3t50060E8006FE93BBd13
      0   0   0   0 c3t50060E8006FE93BBd12
      0   0   0   0 c3t50060E8006FE93BBd11
      0   0   0   0 c3t50060E8006FE93BBd10
      0   0   0   0 c3t50060E8006FE93BBd9
      0   0   0   0 c3t50060E8006FE93BBd8
      0   0   0   0 c3t50060E8006FE93BBd7
      0   0   0   0 c3t50060E8006FE93BBd6
      0   0   0   0 c3t50060E8006FE93BBd5
      0   0   0   0 c3t50060E8006FE93BBd4
      0   0   0   0 c3t50060E8006FE93BBd3
      0   0   0   0 c3t50060E8006FE93BBd2
      0   0   0   0 c3t50060E8006FE93BBd1
      0   0   0   0 c3t50060E8006FE93BBd0
   
  Disk instance (sd) names to device names (cXtXdX).
  Excluding md|st|nfs and including c3|c4 - SAN controller/paths:
  sd43=/dev/dsk/c4t50060E8006FE93ABd39
  sd44=/dev/dsk/c4t50060E8006FE93ABd38
  sd45=/dev/dsk/c4t50060E8006FE93ABd37
  sd46=/dev/dsk/c4t50060E8006FE93ABd36
  sd47=/dev/dsk/c4t50060E8006FE93ABd35
  sd48=/dev/dsk/c4t50060E8006FE93ABd34
  sd49=/dev/dsk/c4t50060E8006FE93ABd33
  sd50=/dev/dsk/c4t50060E8006FE93ABd32
  sd51=/dev/dsk/c4t50060E8006FE93ABd31
  sd52=/dev/dsk/c3t50060E8006FE93BBd39
  sd53=/dev/dsk/c3t50060E8006FE93BBd38
  sd54=/dev/dsk/c4t50060E8006FE93ABd30
  sd55=/dev/dsk/c3t50060E8006FE93BBd37
  sd56=/dev/dsk/c3t50060E8006FE93BBd36
  sd57=/dev/dsk/c4t50060E8006FE93ABd29
  sd58=/dev/dsk/c3t50060E8006FE93BBd35
  sd59=/dev/dsk/c4t50060E8006FE93ABd28
  sd60=/dev/dsk/c3t50060E8006FE93BBd34
  sd61=/dev/dsk/c3t50060E8006FE93BBd33
  sd62=/dev/dsk/c4t50060E8006FE93ABd27
  sd63=/dev/dsk/c3t50060E8006FE93BBd32
  sd64=/dev/dsk/c3t50060E8006FE93BBd31
  sd65=/dev/dsk/c4t50060E8006FE93ABd26
  sd66=/dev/dsk/c3t50060E8006FE93BBd30
  sd67=/dev/dsk/c4t50060E8006FE93ABd25
  sd68=/dev/dsk/c3t50060E8006FE93BBd29
  sd69=/dev/dsk/c4t50060E8006FE93ABd24
  sd70=/dev/dsk/c3t50060E8006FE93BBd28
  sd71=/dev/dsk/c3t50060E8006FE93BBd27
  sd72=/dev/dsk/c3t50060E8006FE93BBd26
  sd73=/dev/dsk/c3t50060E8006FE93BBd25
  sd74=/dev/dsk/c3t50060E8006FE93BBd24
  sd75=/dev/dsk/c3t50060E8006FE93BBd23
  sd76=/dev/dsk/c4t50060E8006FE93ABd23
  sd77=/dev/dsk/c3t50060E8006FE93BBd22
  sd78=/dev/dsk/c4t50060E8006FE93ABd22
  sd79=/dev/dsk/c4t50060E8006FE93ABd21
  sd80=/dev/dsk/c3t50060E8006FE93BBd21
  sd81=/dev/dsk/c4t50060E8006FE93ABd20
  sd82=/dev/dsk/c3t50060E8006FE93BBd20
  sd83=/dev/dsk/c4t50060E8006FE93ABd19
  sd84=/dev/dsk/c3t50060E8006FE93BBd19
  sd85=/dev/dsk/c3t50060E8006FE93BBd18
  sd86=/dev/dsk/c4t50060E8006FE93ABd18
  sd87=/dev/dsk/c4t50060E8006FE93ABd17
  sd88=/dev/dsk/c3t50060E8006FE93BBd17
  sd89=/dev/dsk/c4t50060E8006FE93ABd16
  sd90=/dev/dsk/c3t50060E8006FE93BBd16
  sd91=/dev/dsk/c3t50060E8006FE93BBd15
  sd92=/dev/dsk/c4t50060E8006FE93ABd15
  sd93=/dev/dsk/c3t50060E8006FE93BBd14
  sd94=/dev/dsk/c4t50060E8006FE93ABd14
  sd95=/dev/dsk/c3t50060E8006FE93BBd13
  sd96=/dev/dsk/c4t50060E8006FE93ABd13
  sd97=/dev/dsk/c4t50060E8006FE93ABd12
  sd98=/dev/dsk/c3t50060E8006FE93BBd12
  sd99=/dev/dsk/c4t50060E8006FE93ABd11
  sd100=/dev/dsk/c3t50060E8006FE93BBd11
  sd101=/dev/dsk/c3t50060E8006FE93BBd10
  sd102=/dev/dsk/c4t50060E8006FE93ABd10
  sd103=/dev/dsk/c3t50060E8006FE93BBd9
  sd104=/dev/dsk/c4t50060E8006FE93ABd9
  sd105=/dev/dsk/c3t50060E8006FE93BBd8
  sd106=/dev/dsk/c4t50060E8006FE93ABd8
  sd107=/dev/dsk/c3t50060E8006FE93BBd7
  sd108=/dev/dsk/c4t50060E8006FE93ABd7
  sd109=/dev/dsk/c3t50060E8006FE93BBd6
  sd110=/dev/dsk/c4t50060E8006FE93ABd6
  sd111=/dev/dsk/c3t50060E8006FE93BBd5
  sd112=/dev/dsk/c4t50060E8006FE93ABd5
  sd113=/dev/dsk/c3t50060E8006FE93BBd4
  sd114=/dev/dsk/c4t50060E8006FE93ABd4
  sd115=/dev/dsk/c4t50060E8006FE93ABd3
  sd116=/dev/dsk/c3t50060E8006FE93BBd3
  sd117=/dev/dsk/c4t50060E8006FE93ABd2
  sd118=/dev/dsk/c3t50060E8006FE93BBd2
  sd119=/dev/dsk/c4t50060E8006FE93ABd1
  sd120=/dev/dsk/c3t50060E8006FE93BBd1
  sd121=/dev/dsk/c3t50060E8006FE93BBd0
  sd122=/dev/dsk/c4t50060E8006FE93ABd0
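In case it is useful, one way to rebuild that sd-to-cXtXdX table is to walk the /dev/dsk symlinks back through /etc/path_to_inst. A rough sketch (it assumes the standard symlink layout and this host's c3/c4 controllers; slice s2 is used only to get one link per LUN):

Code:
#!/bin/sh
# Map sd instance numbers to cXtXdX device names via /etc/path_to_inst.
for slice in /dev/dsk/c[34]t*d*s2
do
    # The symlink target looks like ../../devices/pci@.../disk@w...,17:c
    phys=`ls -l $slice | awk '{print $NF}' | sed -e 's|.*/devices||' -e 's|:.*$||'`
    inst=`grep "\"$phys\" " /etc/path_to_inst | awk '{print $2}'`
    dev=`echo $slice | sed 's|s2$||'`
    echo "sd${inst}=${dev}"
done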

    #2  
Old 07-24-2013
DukeNuke2, Forum Staff
It could be anywhere along the transport path, from the HBA through the transceiver and cables to the far end of the fabric. It could also be a hardware fault on the HBA itself. There could also be a faulty device causing bus resets, and therefore constant rescans of the bus.

The first thing I would try is to swap two transceivers between two HBAs (if you have more than one) and see whether the error follows the transceiver or stays on the same controller.
    #3  
Old 07-24-2013
achenle, Forum Advisor
Are your switches zoned? Or is your SAN set up as "everyone sees everything"?

I wonder if those resets correlate to any server on the SAN rebooting.
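One quick way to check that from the Solaris side is to pull the timestamps of the reset-type Unit Attentions out of the messages files and line them up against reboot or maintenance times on the other hosts zoned to the same array ports. A rough sketch (adjust the log path and rotation to your setup):

Code:
# Count the reset-type Unit Attention messages per minute so the bursts can
# be compared against reboot times of other hosts on the same array ports.
awk '/power on, reset, or bus reset occurred/ {print $1, $2, substr($3, 1, 5)}' \
    /var/adm/messages* | sort | uniq -c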
    #4  
Old 07-24-2013
TKD
Thanks for the replies.

I'm waiting for our Oracle support to be renewed, so until then it's hard to get anything done with the existing hardware support team unless I can point to a specific piece of defective hardware. This is the only Intel X4270 we have that's connected to SAN storage; most of our SAN-connected servers are SPARC, so I'm less familiar with this Intel hardware.

The SAN switches are zoned to specific servers.
    #5  
Old 07-26-2013
cerber0
I think this can be related to the queue_depth settings in your environment.

Basically, the storage port is saturated and is rejecting HBA requests, which typically shows up as the ASC=0x29 messages and transport errors.

In Solaris, the queue_depth concept is called max_throttle.

Each storage vendor (NetApp, EMC, HDS, IBM) has a different recommendation for this value.

Review your storage vendor's recommendation and set this parameter accordingly.

On systems using the Sun driver stack for the HBA, you can use the following lines in /etc/system to set max_throttle and io_time:

Code:
set ssd:ssd_max_throttle=x
set ssd:ssd_io_time=180

The queue_depth is basically the maximum number of simultaneous operations that a storage port can receive.

For example, I can see that your storage is Hitachi:

For the enterprise line (USP V, VSP) this value is 2048.
For the midrange line (AMS, HUS) this value is 512.

The max_throttle (queue_depth) can then be calculated with the following formula (for the enterprise line):

Code:
Queue Depth = 2048 / Total number of LUNs defined on the port <= 32
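As a worked example with the numbers from this thread (and assuming, purely for illustration, that all 40 LUNs sit behind a single VSP port): 2048 / 40 = 51, which is above the ceiling, so the throttle would be capped at 32:

Code:
* /etc/system sketch: 2048 / 40 LUNs = 51, so use the 32 ceiling instead
set ssd:ssd_max_throttle=32
set ssd:ssd_io_time=180

One caveat: the kernel messages in the first post show sd instance names (sd75, sd77, ...), so on this x86 box the LUNs are most likely bound to the sd driver rather than ssd, in which case the equivalent tunables would be sd:sd_max_throttle and sd:sd_io_time. That is worth confirming against the Hitachi host attachment guide for Solaris x86 before editing /etc/system.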

I hope this helps.

For reference only, the following is text from the Hitachi Knowledge Base:

Code:
Fix: All SUN HBAs 6799, 6727, 6748, 6757, 6767, 6768 are supported by the SUN device driver stack.
It is an SSD device driver.
Sun Solaris has a default Queue_Depth of 64 and I/O timeout of 30 seconds.  In most cases these 
parameters MUST be changed to avoid transport errors, such as:
SCSI transport failed: reason 'tran_err': retrying command

Set as follows:
/etc/system
set ssd:ssd_io_time=180
set ssd:ssd_max_throttle=n
Where n is 8 or 256/# of active initiators/# of LUNs

(Refer to the 9900V or 9900 Series Hitachi Freedom Storage Lightning Sun Solaris Configuration Guides.) 
