Sense key unit attention & iostat hardware and transport errors on SAN disks
Hello, I'm trying to get to the bottom of SAN disk errors we've been seeing.
The server is a Sun Fire X4270 M2 running Solaris 10 8/11 (u10) x86, in service since April 2012. The SAN HBAs are SG-PCIE2FC-QF8-Z Sun-branded QLogic cards, and the SAN storage system is a Hitachi VSP. We have 32 LUNs in use and another 8 LUNs not yet brought into Symantec Storage Foundation.
We started seeing hardware and transport errors on the LUNs on July 2, which led to corruption of three Veritas filesystems. I got that resolved on the third, and we had to restore the three filesystems from tape. The SAN team found no SAN switch errors, and Hitachi's analysis showed no disk errors.
We originally had Solaris MPxIO enabled by default for multipathing, along with Veritas DMP. Symantec said the two multipathing systems could coexist, but the errors returned, so I disabled MPxIO and rebooted on July 17. I didn't see any more errors until yesterday at 11:10 AM. Is this a problem with the SAN HBAs? What do these errors mean? Any help would be appreciated.
The problem could be anywhere along the transport path: from the HBA, through the transceiver and cables, to the far end of the system. It could also be a hardware fault on the HBA itself. A faulty device can also cause bus resets, and therefore constant rescans of the bus.
The first thing I would try is swapping two transceivers (between two HBAs, if you have more than one) to see whether the error follows the transceiver or stays on the same controller.
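While testing, it helps to watch the per-device error counters so you can see which controller or LUN is accumulating them. A sketch of the usual Solaris commands (device names and counts will differ on your system):

```shell
# Cumulative per-device error counters since boot; the Hard Errors and
# Transport Errors columns should stay flat on a healthy path
iostat -En

# Live extended statistics including error columns, refreshed every 5 seconds
iostat -xne 5
```

If the counters keep climbing on devices behind one controller only, that points at that HBA, its transceiver, or its cable rather than at the array.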
I'm waiting for our Oracle support to be renewed, so until then it's hard to get anything done with the existing hardware support team unless I can point to specific defective hardware. This is the only Intel-based X4270 we have that's connected to SAN storage; most of our SAN-connected servers are SPARC, so I'm less familiar with this Intel hardware.
I think this could be related to the queue-depth settings in your environment.
Basically, the storage port is saturated and is rejecting HBA requests; this is normally identified by ASC=0x29 messages and transport errors.
In Solaris, the queue_depth concept is called max_throttle.
Each storage vendor (NetApp, EMC, HDS, IBM) has a different recommendation for this value.
Review your storage vendor's recommendation and set this parameter accordingly.
On systems using the Sun drivers for the HBA, you can set the max_throttle and io_time in /etc/system.
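As an illustration only, such /etc/system entries typically look like the following. The driver name (sd on this x86 box, ssd on SPARC fibre stacks) and the values are assumptions here; use the driver your system actually has and the value from your storage vendor's formula:

```
* /etc/system -- illustrative values only; derive the throttle from
* your storage vendor's recommendation for your port/LUN layout
set sd:sd_max_throttle=32
set sd:sd_io_time=0x3c
```

A reboot is required for /etc/system changes to take effect.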
The queue_depth is basically the maximum number of simultaneous operations that a storage port can accept.
For example, I can see that your storage is Hitachi:
for the enterprise line (USP V, VSP) this value is 2048;
for the midrange line (AMS, HUS) this value is 512.
Then the max_throttle (queue_depth) can be calculated with the following formula (for the enterprise line):
Queue Depth = 2048 / (total number of LUNs defined on the port), capped at <= 32
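Applied to this environment as a sketch, assuming all 40 LUNs (the 32 in use plus the 8 pending) end up behind one storage port; adjust the LUN count to your actual port layout:

```shell
# Hypothetical worked example of the formula above
luns=40                       # 32 active + 8 not yet under Storage Foundation
qd=$(( 2048 / luns ))         # integer division: 51
if [ "$qd" -gt 32 ]; then qd=32; fi   # cap at 32 per the formula
echo "max_throttle = $qd"     # prints: max_throttle = 32
```

With fewer LUNs per port the uncapped result grows, but the 32 ceiling still applies.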
I hope this helps.
For reference, the following is a text from the Hitachi Knowledge Base:
Fix: All Sun HBAs 6799, 6727, 6748, 6757, 6767, 6768 are supported by the Sun device driver stack.
It is the ssd device driver.
Sun Solaris has a default queue depth of 64 and an I/O timeout of 30 seconds. In most cases these
parameters MUST be changed to avoid transport errors such as:
SCSI transport failed: reason 'tran_err': retrying command
Set as follows in /etc/system:
where n is 8, or 256 / (# of active initiators) / (# of LUNs)
(Refer to the 9900V or 9900 Series Hitachi Freedom Storage Lightning Sun Solaris Configuration Guides.)
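If you want to confirm what the running kernel is actually using before and after any change, one way (assuming the sd driver applies on this x86 box; the tunable is ssd_max_throttle under the ssd driver on SPARC fibre stacks) is to read the tunable with the kernel debugger:

```shell
# Print the live value of the sd driver's global max throttle (run as root)
echo "sd_max_throttle/D" | mdb -k
```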