Sense key unit attention & iostat hardware and transport errors on SAN disks
Hello, I'm trying to get to the bottom of SAN disk errors we've been seeing.
Server is a Sun Fire X4270 M2 running Solaris 10 8/11 u10 x86 since April 2012. SAN HBAs are SG-PCIE2FC-QF8-Z (Sun-branded QLogic). SAN storage system is a Hitachi VSP. We have 32 LUNs in use and another 8 LUNs not brought into Symantec Storage Foundation yet.
We started seeing hardware and transport errors on the LUNs on July 2, which led to corruption of 3 Veritas filesystems. I got that resolved on the third, and we had to restore 3 filesystems from tape. The SAN team found no SAN switch errors and Hitachi's analysis showed no disk errors.
We originally had Solaris MPxIO enabled by default for multipathing, along with Veritas DMP. Symantec was saying that the two multipathing systems could co-exist, but the errors returned, so I disabled MPxIO and rebooted on July 17. I didn't see any more errors until yesterday at 11:10 am. Is this a problem with the SAN HBAs? What do these errors mean? Any help would be appreciated.
It could be anywhere along the transport path: from the HBA to the transceiver, over the cables, to the other end of the system. It can also be a hardware error on the HBA itself. There can also be a faulty device that causes bus resets and therefore a rescan of the bus all the time.
The first thing I would try is to swap the transceivers between two HBAs (if you have two) and see whether the error follows the transceiver or stays on the same controller.
I'm waiting for our Oracle support to be renewed, so until then it's hard to get something done with the existing hardware support team unless I can specifically point out that some hardware is defective. This is the only Intel X4270 we have that's connected to SAN storage; most of our SAN-connected servers are SPARC, so I'm less familiar with this Intel hardware.
I think this could be related to the queue_depth settings in your environment.
Basically, the storage port is saturated and is rejecting HBA requests, which normally shows up as ASC=0x29 messages and transport errors.
In Solaris, the queue_depth concept is named max_throttle.
Each storage vendor (NetApp, EMC, HDS, IBM) has a different recommendation for this value.
Review your storage vendor's recommendation and set this parameter accordingly.
On systems using the Sun drivers for the HBA, you can add the following lines to /etc/system to set max_throttle and io_time:
Code:
set ssd:ssd_max_throttle=x
set ssd:ssd_io_time=180
The queue_depth is basically the maximum number of simultaneous operations that a storage port can receive.
For example, I can see that your storage is Hitachi:
for the enterprise line (USPV, VSP) this value is 2048
for the midrange line (AMS, HUS) this value is 512
The max_throttle (queue_depth) can then be calculated with the following formula (for enterprise):
Code:
Queue Depth = 2048 / Total number of LUNs defined on the port <= 32
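To make the arithmetic concrete, here is a minimal sketch of the formula above in Python. The function name max_throttle is hypothetical, and the example assumes that all of the LUNs are presented through a single storage port, which may not match your zoning. The 2048 limit and the cap of 32 are taken from the figures quoted in this thread, not from any other source.

```python
# Sketch of the enterprise formula quoted above:
#   Queue Depth = 2048 / total LUNs on the port, capped at 32.
PORT_QUEUE_LIMIT = 2048  # per-port figure quoted in this thread for USPV / VSP
MAX_PER_LUN = 32         # the "<= 32" cap from the formula


def max_throttle(luns_on_port, port_limit=PORT_QUEUE_LIMIT, cap=MAX_PER_LUN):
    """Return a candidate max_throttle value for one storage port."""
    if luns_on_port < 1:
        raise ValueError("luns_on_port must be at least 1")
    # Integer division, then clamp to the range 1..cap.
    return max(1, min(cap, port_limit // luns_on_port))


if __name__ == "__main__":
    # 40 = the 32 LUNs in use plus the 8 not yet in Storage Foundation.
    for luns in (32, 40, 128):
        print(luns, "LUNs ->", max_throttle(luns))
```

With 32 or 40 LUNs on one port the result is limited by the cap of 32; the per-port limit only starts to matter once more than 64 LUNs share a port.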
I hope this helps.
For reference only, the following text is from the Hitachi Knowledge Base:
Code:
Fix : All SUN HBAs 6799, 6727, 6748, 6757, 6767, 6768 are supported by the SUN device driver stack.
It is an SSD device driver.
Sun Solaris has a default Queue_Depth of 64 and I/O timeout of 30 seconds. In most cases these
parameters MUST be changed to avoid transport errors, such as:
SCSI transport failed: reason 'tran_err': retrying command
Set as follows:/etc/system
set ssd:ssd_io_time=180
set ssd:ssd_max_throttle=n
Where n is 8 or 256/# of active initiators/# of LUNs
(Refer to the 9900V or 9900 Series Hitachi Freedom Storage Lightning Sun Solaris Configuration Guides.)
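The Knowledge Base formula can be sketched the same way. This is only an illustration: the function names are hypothetical, the initiator count of 2 is an assumed example value, and the figure of 256 comes from text that refers to the 9900/9900V series, so it may not apply to a VSP. One further assumption to verify: the quoted lines use the ssd driver, but on x86 Solaris FC disks are commonly attached through the sd driver, in which case the tunables would be sd:sd_max_throttle and sd:sd_io_time. The driver prefix is therefore left as a parameter.

```python
# Sketch of the Knowledge Base rule: n = 256 / active initiators / LUNs.
# The KB also offers a fixed value of 8 as an alternative.
def kb_max_throttle(active_initiators, luns, port_limit=256):
    """Return the computed n, rounded down, never below 1."""
    if active_initiators < 1 or luns < 1:
        raise ValueError("active_initiators and luns must be at least 1")
    return max(1, port_limit // (active_initiators * luns))


def etc_system_lines(n, io_time=180, driver="ssd"):
    """Build the /etc/system lines; driver is 'ssd' as quoted, or 'sd'."""
    return [
        f"set {driver}:{driver}_io_time={io_time}",
        f"set {driver}:{driver}_max_throttle={n}",
    ]


if __name__ == "__main__":
    # Assumed example: 2 active initiators and 40 LUNs on the port.
    n = kb_max_throttle(active_initiators=2, luns=40)
    for line in etc_system_lines(n, driver="sd"):
        print(line)
```

A change to /etc/system only takes effect after a reboot, so it is worth confirming the driver name and the vendor's recommended value before applying it.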