Receiving: 4B436A3D 0313233216 T H fscsi0 LINK ERROR
Hey All,
I'm receiving the following error off of a Power5 9133-55A after I write 2-5 files to the LUN:
4B436A3D 0313233216 T H fscsi0 LINK ERROR
I can create the filesystem, volume groups etc etc. All goes well until there is sustained activity to the LUN then the above error shows up with no messages on the target.
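For reference, the fields of that errpt summary line can be pulled apart with a few lines of shell. This is only a sketch: the function name `decode_errpt_line` is made up for illustration, it assumes the usual errpt column order (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION) with a MMDDhhmmyy timestamp, and it assumes the two-digit year belongs to the 2000s.

```shell
#!/bin/sh
# Sketch: split one errpt summary line into its fields.
# Assumed column order: IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
# Assumed timestamp layout: MMDDhhmmyy (year taken to be 20yy).
decode_errpt_line() {
    # Re-split the single argument on whitespace into positional parameters.
    set -- $1
    id=$1; ts=$2; type=$3; class=$4; res=$5
    shift 5
    desc=$*   # whatever is left is the free-text description

    mm=$(printf '%s' "$ts" | cut -c1-2)
    dd=$(printf '%s' "$ts" | cut -c3-4)
    hh=$(printf '%s' "$ts" | cut -c5-6)
    mi=$(printf '%s' "$ts" | cut -c7-8)
    yy=$(printf '%s' "$ts" | cut -c9-10)

    printf 'id=%s date=20%s-%s-%s time=%s:%s type=%s class=%s resource=%s desc=%s\n' \
        "$id" "$yy" "$mm" "$dd" "$hh" "$mi" "$type" "$class" "$res" "$desc"
}

decode_errpt_line "4B436A3D 0313233216 T H fscsi0 LINK ERROR"
# prints: id=4B436A3D date=2016-03-13 time=23:32 type=T class=H resource=fscsi0 desc=LINK ERROR
```

In the T column, T stands for a temporary error, and H in the C column marks the hardware class, so the entry reads as a temporary hardware error on fscsi0.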
The initiator card on the above is an LP11002 and the target card is a QLogic 2464. I have tried all sorts of things over the last 2 months, but no luck: I still get the above error. The connection breaks each time a significant amount of data (1-4 GB) is being transferred. How can I debug that card further? I'm aware of an APAR on some AIX versions that throws the above error, and I upgraded the OS as suggested, yet the error remains. Is there any other way to debug this? I tried P2P: the cards negotiate for a few seconds, then the connection is dropped. Arbitrated loop seems to work best, but the connection fails on sustained writes.
These two error messages below always accompany each other, so I'll post both to get feedback from others as I work through the solutions on this page (IBM Technical support search - United States). I have already tried disabling dynamic tracking as that page suggests (first on the initiator only, then on both initiator and target), but that didn't help with the issue:
Cheers,
DH
---------- Post updated at 08:17 PM ---------- Previous update was at 08:15 PM ----------
Just checking the time, I noticed that all of these ended up getting logged at the exact same time:
Just let me know if you need to see the first two. They seem symptomatic however.
Cheers,
DH
---------- Post updated at 09:15 PM ---------- Previous update was at 08:17 PM ----------
The first error you receive, FCP_ERR4 4B436A3D, means according to the sense information provided that the AIX driver sent a RESET command to the SAN device and didn't receive an answer. Usually that means you have a SAN problem and you should open a case with your SAN switch vendor or, better, your storage device vendor.
But as far as I can see from the output of lsattr -El fscsi0, you don't have a SAN; you have direct-attached storage. If you do have a SAN fabric rather than direct-attached storage, then you have a problem connecting to the fabric, and a broken cable is the most common cause.
If you really have direct-attached storage, then I have some other questions:
- how many LDEVs/LUNs do you receive from the storage?
- does the problem happen only with this LDEV (Nr. 00:00:00:00:00:00:00:02) or also with other LDEVs?
- is the storage connected through multiple adapters or is it the only adapter to the storage?
- how many different storage devices are connected using this adapter?
If it is a single storage device directly connected through a single adapter, I would recommend:
- to check the cable
- to switch off dyntrk and fc_err_recov
- to minimize max_xfer_size and corresponding parameters on the hdisk
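On the AIX side, the last two recommendations could look roughly like the sketch below. It is a dry run by default (it only prints the commands) and rests on several assumptions: `fcs0` is taken to be the parent adapter of `fscsi0`, `hdisk1` is a hypothetical name for the LUN, and the attribute values shown are commonly seen minimum/default values that may not apply to this adapter or disk type. The `lsattr -Rl` call lists the values a device actually accepts, so check those first.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set RUN=1 on a real AIX host to execute the commands.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

FSCSI=fscsi0   # protocol device named in the thread
FCS=fcs0       # assumption: parent adapter of fscsi0
HDISK=hdisk1   # hypothetical hdisk name for the LUN

suggested_changes() {
    # Show the current protocol-device settings.
    run lsattr -El "$FSCSI"
    # Switch off dynamic tracking and fast-fail error recovery.
    run chdev -l "$FSCSI" -a dyntrk=no -a fc_err_recov=delayed_fail -P
    # Smallest adapter transfer size (assumed value; verify with lsattr -Rl).
    run chdev -l "$FCS" -a max_xfer_size=0x100000 -P
    # List the transfer sizes the disk accepts, then pick a small one.
    run lsattr -Rl "$HDISK" -a max_transfer
    run chdev -l "$HDISK" -a max_transfer=0x40000 -P
}

suggested_changes
```

The -P flag only records the change in the ODM, so it takes effect after the device is reconfigured or the system is rebooted.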
It's fiber card to fiber card and I'm zoning a single FILEIO device, which itself sits on RAID 6 / XFS storage (6 disks). I tried disabling dynamic tracking, no luck. I tried changing the cable, no luck. I'll read about the other options you mentioned as well. There's only one LUN involved and I'm able to write to it fine until some large data is being written, but then failure is 100% in each case.
The target system is SCST. (Apologies, I thought I had mentioned it, but reading back I hadn't.) Funny thing is that after restarting the SCST subsystem I can get the LUN back following a failure. (Maybe a memory leak.) I might try LIO / targetcli next if the above doesn't work.
I see this thread has been open for over a day without resolution so, although I'm not qualified to answer the specifics, I thought I'd chip in anyway.
Firstly, my disclaimer. I'm not an AIX expert by any means and I have no knowledge of the LP11002. However, I do know the QL2464 very well; I was the technical director of a storage distributor many years ago and we shipped loads of fibre channel kit. So all I can do is tell you where I'd be looking in the first instance. I could well be completely wrong but here goes...
The symptoms you describe indicate that everything is fine until the link gets really busy, then it screws up. Normal FC payload is 2112 bytes, giving an MTU of 2148 bytes total allowing for headers, etc. Some FC adapters support "jumbo" packets with a payload of up to 9000 bytes, giving an MTU of 9036 bytes with headers. If the adapter supports jumbos, whether jumbo packets are enabled or not is a setting in the adapter BIOS. So if one adapter is set for jumbo and the other doesn't support jumbo, then everything will work fine with low traffic, but when things really get going one of the adapters suddenly sends a jumbo packet that the other adapter cannot understand. So if I were fighting this issue, I would look at both adapters and set the max payload to 2112, or the max MTU to 2148, or set "support jumbo packets=no". Then test to see if the problem has gone away.
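The two totals quoted above follow from the fixed per-frame overhead of 36 bytes: a 4-byte start-of-frame delimiter, a 24-byte frame header, a 4-byte CRC and a 4-byte end-of-frame delimiter. A small sketch of the arithmetic (the helper name `frame_size` is made up for illustration, and the 9000-byte payload is the figure from this post, not a value checked against either adapter):

```shell
#!/bin/sh
# Total frame size = payload + SOF(4) + header(24) + CRC(4) + EOF(4).
frame_size() {
    echo $(( $1 + 4 + 24 + 4 + 4 ))
}

frame_size 2112   # prints 2148 (standard maximum payload)
frame_size 9000   # prints 9036 (the "jumbo" payload quoted in the post)
```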
Needless to say, should you get to a known good working situation only change one thing at a time afterwards and fully test that it hasn't screwed up again.
I have no clue whether this will help you or not.
Good luck anyway.
Last edited by hicksd8; 03-15-2016 at 05:35 PM..