It's a puzzle


 
# 1  
Old 09-06-2008
Question It's a puzzle

Hi,

Recently I installed Fedora 9 on the following hardware
- Asus A8N-SLI Deluxe motherboard bios version 1805
- 2GB twinmos ram
- AMD 4400 CPU
- Tagan PSU 550 W
- Asus EN6200LE video card
- WD 74 GB Raptor
- Areca ARC-1222 raid controller
- 4x 1TB Seagate Baracudas
- Symbios Logic 53C875J SCSI controller card (made for Compaq)
- HP surestore DAT40 tape drive

Fedora installed, booted and worked fine for a couple of days. With yum I installed all relevant updates.

Trouble started when I used Amanda for the first backups to tape. Amanda would work OK a few times, but then
the entire machine crashed. I mean really crashed: afterwards the machine would not even get through the BIOS POST.
So I cleared the CMOS by removing the battery, setting the jumper appropriately and waiting for 15 seconds. To no avail; the motherboard was dead.

I ordered a replacement identical motherboard and put everything back together. Linux boots fine and I did not touch Amanda
for a week. All was well, so I thought. I used the machine intensively, copying over 1 TB of data to the raid array, installing
Horde packages and all kinds of other fun stuff. No problems whatsoever.

I did look through the logs obviously. The only entries of note were related to the scsi controller card. A couple of
SCSI bus resets just prior to the crash. I did find a few articles from 2005 on the net about 53c8XX driver problems:
Please fix bug #1852 (hald causes SYM53C8xx SCSI errors, device disconnects + GNOME hang). Surely this problem was fixed a long time ago?
For good measure I also checked all termination caps and SCSI cables.
I am pretty sure, but not absolutely sure, these resets were related to the 53c875J scsi controller card and not to the
Areca raid card. Anyhow, I had no problems with the raid array at all, even when using it intensively.

The next weekend I ran an Amanda backup again. Two amflush jobs went fine, so old backups on holding disk were flushed to tape ok.
Then I proceeded with a new backup (amdump). After some time the machine crashed again. Absolutely identical symptoms.

This time, I stripped the machine down to bare minimum.
Only motherboard, PSU, 1 GB ram, AMD 4400 CPU, old pci videocard, keyboard, monitor.
Result: only one beep (that's good) and colorful gibberish on the monitor, not even the BIOS memory check and such.

So, it appears that some error related to using the backup software (scsi?) causes the motherboard to die.

Presently, I have two courses of action I can think of:
1. I ordered a new bios chip, hoping that the board will then get through post.
If this works it suggests to me that some error in scsi subsystem can actually overwrite (flash!) the motherboard bios.
Two weeks ago I would not have believed this possible, but here it is.
2. If option 1 does not work I will order yet another replacement motherboard and think of a new backup strategy.
I do not mind chasing bugs, but losing a motherboard at every step of the way is not very appealing.
So out with the scsi card.

BTW until I get the machine up and running again I cannot look at the logs and present more detailed error reports.
This is all from memory.

I have spent quite some time googling this particular problem. I cannot find any similar cases.
So anyone out there, does this ring any bells?

Thanks
Jos
# 2  
Old 09-07-2008
Follow-up it's a puzzle

Well, the story continues.

Replacing the BIOS chip did not work, so the motherboard was indeed dead.
I replaced the motherboard yet again (this is the third) and all is well again.

The following entries are from the log:

(Fedora 9, kernel 2.6.25.14-108.fc9.i686)

(First A8N-SLI Deluxe motherboard)

Aug 29 21:46:01 medusix kernel: skge 0000:05:0c.0: PCI error cmd=0x147 status=0xc2b0
Aug 29 21:46:01 medusix kernel: sym0:2: ERROR (81:0) (8-0-0) (10/95/0) @ (scripta 2f0:d2340004).
Aug 29 21:46:01 medusix kernel: sym0: script cmd = f3340004
Aug 29 21:46:01 medusix kernel: sym0: regdump: da 00 00 95 47 10 02 0b 00 08 82 00 80 00 0f 0a 0d 82 08 cf 02 ff ff df.
Aug 29 21:46:01 medusix kernel: skge 0000:05:0c.0: PCI error cmd=0x147 status=0x82b0
Aug 29 21:46:01 medusix kernel: sym0: SCSI BUS reset detected.
Aug 29 21:46:01 medusix kernel: sym0: SCSI BUS has been reset.
Aug 29 21:46:01 medusix kernel: skge 0000:05:0c.0: PCI error cmd=0x147 status=0x82b0
Aug 29 21:46:01 medusix kernel: sym0:12: ERROR (81:0) (8-0-0) (7d/56/0) @ (mem c:d000e81a).
Aug 29 21:46:01 medusix kernel: sym0: regdump: ca 00 00 56 47 7d 0c 0b 00 08 02 00 80 00 08 0a 00 40 3d 17 20 ff ff df.
Aug 29 21:46:01 medusix kernel: skge 0000:05:0c.0: PCI error cmd=0x147 status=0x82b0
Aug 29 21:46:01 medusix kernel: sym0: SCSI BUS reset detected.
Aug 29 21:46:01 medusix kernel: sym0: SCSI BUS has been reset.


(After replacing first with second A8N-SLI Deluxe motherboard)

Sep 5 12:01:26 medusix kernel: skge 0000:05:0c.0: PCI error cmd=0x7 status=0x82b0
Sep 5 12:01:26 medusix kernel: sym0:2: ERROR (81:0) (0-87-80) (10/9d/0) @ (scripta 410:d3030001).
Sep 5 12:01:26 medusix kernel: sym0: script cmd = f3050001
Sep 5 12:01:26 medusix kernel: sym0: regdump: da 10 80 9d 47 10 02 0b 82 00 82 87 80 00 07 0b 8b 76 59 8f 08 ff ff df.
Sep 5 12:01:26 medusix kernel: skge 0000:05:0c.0: PCI error cmd=0x147 status=0x82b0
Sep 5 12:01:26 medusix kernel: sym0: SCSI BUS reset detected.
Sep 5 12:01:26 medusix kernel: sym0: SCSI BUS has been reset.
Sep 5 12:01:26 medusix kernel: skge 0000:05:0c.0: PCI error cmd=0x147 status=0x82b0
Sep 5 12:01:26 medusix kernel: sym0:10: ERROR (81:0) (8-0-0) (5a/4e/0) @ (mem c:d000e81a).
Sep 5 12:01:26 medusix kernel: sym0: regdump: ca 00 00 4e 47 5a 0a 0b 00 08 02 00 80 00 00 0a 00 40 3e 17 20 ff ff df.
Sep 5 12:01:26 medusix kernel: skge 0000:05:0c.0: PCI error cmd=0x147 status=0x82b0
Sep 5 12:01:26 medusix kernel: sym0: SCSI BUS reset detected.
Sep 5 12:01:26 medusix kernel: sym0: SCSI BUS has been reset.

(at this point second A8N-SLI Deluxe motherboard is dead too.)
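
The cmd= and status= values in the skge lines above look like the raw PCI Command and Status registers of the network card. A minimal sketch for decoding them, assuming the standard PCI configuration-space bit layout (the table and function names are made up for illustration):

```python
# Bit positions follow the standard PCI configuration-space Command and
# Status registers. Names are illustrative, not taken from the skge driver.
COMMAND_BITS = {
    0: "I/O space",
    1: "memory space",
    2: "bus master",
    3: "special cycles",
    4: "memory write and invalidate",
    5: "VGA palette snoop",
    6: "parity error response",
    8: "SERR# enable",
    9: "fast back-to-back enable",
    10: "interrupt disable",
}

STATUS_BITS = {
    3: "interrupt status",
    4: "capabilities list",
    5: "66 MHz capable",
    7: "fast back-to-back capable",
    8: "master data parity error",
    11: "signaled target abort",
    12: "received target abort",
    13: "received master abort",
    14: "signaled system error",
    15: "detected parity error",
}


def decode(value, names):
    """Return the names of all known bits that are set in value."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Values copied from the log lines above.
    for status in (0xC2B0, 0x82B0):
        print(hex(status), decode(status, STATUS_BITS))
    for cmd in (0x147, 0x7):
        print(hex(cmd), decode(cmd, COMMAND_BITS))
```

Read this way, status=0x82b0 has bit 15 (detected parity error) set, and the 0xc2b0 value from the first board additionally has bit 14 (signaled system error). That would suggest corrupted traffic on the PCI bus at the moment of the sym0 errors, although the log alone does not say which device caused it.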

The third motherboard is working fine so far. The only strange entries in the logs are:

Sep 7 18:14:38 medusix kernel: ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0xE next cpb idx 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 0: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 1: ctl_flags 0x1f, resp_flags 0x2
Sep 7 18:14:38 medusix kernel: ata1: CPB 2: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 3: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 4: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 5: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 6: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 7: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 8: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 9: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 10: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 11: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 12: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 13: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: CPB 14: ctl_flags 0x1f, resp_flags 0x0
Sep 7 18:14:38 medusix kernel: ata1: timeout waiting for ADMA IDLE, stat=0x400
Sep 7 18:14:38 medusix kernel: ata1: timeout waiting for ADMA LEGACY, stat=0x400
Sep 7 18:14:38 medusix kernel: ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x2180000 action 0x2 frozen
Sep 7 18:14:38 medusix kernel: ata1: SError: { 10B8B Dispar UnrecFIS }
Sep 7 18:14:38 medusix kernel: ata1.00: cmd 60/00:00:01:ea:9f/01:00:08:00:00/40 tag 0 ncq 131072 in
Sep 7 18:14:38 medusix kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 7 18:14:38 medusix kernel: ata1.00: status: { DRDY }

As far as I can tell this is unrelated to the crashes.
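
For reference, the SErr and SAct values in the ata1 exception above can be decoded by hand. A minimal sketch, assuming the usual SATA SError diagnostic bit layout and using the short labels the kernel prints in its "SError: { ... }" line (the function names are made up for illustration):

```python
# Diagnostic half (bits 16-26) of the SATA SError register, labelled the
# way the kernel prints them in the "SError: { ... }" log line.
SERROR_DIAG_BITS = {
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the labels of the diagnostic bits set in an SError value."""
    return [name for bit, name in sorted(SERROR_DIAG_BITS.items())
            if (value >> bit) & 1]


def outstanding_tags(sact):
    """Count the NCQ tags marked active in an SAct bitmask."""
    return bin(sact).count("1")


if __name__ == "__main__":
    # Values copied from the "exception Emask ..." line above.
    print(decode_serror(0x2180000))
    print(outstanding_tags(0x7FFF))
```

The decoded bits match the "10B8B Dispar UnrecFIS" line in the log, and SAct 0x7fff corresponds to the 15 CPB entries listed. These are link-level errors, which are often associated with cabling or signal quality on the SATA link; that reading is an assumption, not something the log itself states.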

With some trepidation I have started an Amanda backup run again.
It is running right now ...

So was this a fluke of two defective motherboards in a row?
Just bad luck? I do not know...
# 3  
Old 09-07-2008
Crashed again ...

To all,

Just now, during an Amanda backup run, the machine crashed again, for the third time.

It looks like the motherboard is dead again. It will not go through POST. The last message captured on another machine is:

Third A8N-SLI Deluxe motherboard
Sep 7 19:57:07 medusix kernel: sym0:2: ERROR (81:0) (0-a7-80) (0/9d/0) @ (mem d6044410:f3030001).
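
The sym0 ERROR lines can be picked apart with a small script. The sketch below assumes that the first parenthesised pair is DSTAT:SIST (as the sym53c8xx driver source appears to print it) and that the DSTAT bit names match the driver's header definitions. Both are assumptions and should be checked against the driver source before relying on them.

```python
import re

# DSTAT bit names as assumed from the sym53c8xx driver headers.
DSTAT_BITS = {
    0x01: "IID (illegal instruction detected)",
    0x04: "SIR (script interrupt)",
    0x08: "SSI (single-step interrupt)",
    0x10: "ABRT (aborted)",
    0x20: "BF (bus fault)",
    0x40: "MDPE (master data parity error)",
    0x80: "DFE (DMA FIFO empty)",
}

# Matches e.g. "sym0:2: ERROR (81:0) ..." and captures host, target and
# the first hex pair, assumed to be dstat:sist.
SYM_ERROR = re.compile(
    r"(?P<host>sym\d+):(?P<target>\d+): ERROR "
    r"\((?P<dstat>[0-9a-f]+):(?P<sist>[0-9a-f]+)\)"
)


def parse_sym_error(line):
    """Parse one syslog line; return a dict, or None if it is not a sym ERROR."""
    m = SYM_ERROR.search(line)
    if m is None:
        return None
    dstat = int(m.group("dstat"), 16)
    return {
        "host": m.group("host"),
        "target": int(m.group("target")),
        "dstat": dstat,
        "sist": int(m.group("sist"), 16),
        "dstat_flags": [name for mask, name in sorted(DSTAT_BITS.items())
                        if dstat & mask],
    }


if __name__ == "__main__":
    # Line copied from the log above.
    line = ("Sep 7 19:57:07 medusix kernel: sym0:2: ERROR (81:0) (0-a7-80) "
            "(0/9d/0) @ (mem d6044410:f3030001).")
    print(parse_sym_error(line))
```

Under those assumptions, the value 81 that appears in every sym0 ERROR line on all three boards would mean the controller's script processor hit an illegal instruction, which fits with the PCI parity errors logged alongside it on the first two boards.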

I am getting pretty desperate. Tomorrow I will hunt for yet another mb. I am at a loss what's going on.

Please help.

Jos
# 4  
Old 09-07-2008
I know I would not be trying amanda again at this point. Try a different controller card perhaps?
# 5  
Old 09-08-2008
This would be the first time I have seen a motherboard die because of a software problem.
I don't have a clue what might be going on there, but all my instinct tells me that software cannot harm hardware.
# 6  
Old 09-08-2008
Hardware Compatibility Check

1) Definitely : Software does NOT degrade Hardware ;

2) Using software to violate hardware recommendations (like writing out-of-range values to video-card registers) is not an excuse ;

3) The most probable cause of your problem is the SCSI card, which, as you say, is made for Compaq ;

4) You MUST (or should) double-check the Hardware-Compatibility-List provided by the manufacturer of the SCSI card ;

5) As a general rule, keep in mind that ANY hardware from major manufacturers (e.g. COMPAQ, IBM, HP, DELL) is crafted specifically for their equipment ;

5.1) using their boards/parts on a generic PC is strongly discouraged ;

5.2) using generic boards/parts on their PCs is strongly discouraged as well ;

6) my strongest advice at this time is to remove your SCSI card and use your equipment for a while (even without backups), just to prove that it runs alright without it ;

good luck, and success !
# 7  
Old 09-20-2008
Issue solved

Closure

I replaced the motherboard with an A8N32-SLI model. This board offers two PCI Express slots with a full 16 lanes each.

I replaced the SCSI controller with an Adaptec AHA-2940U2 (Ultra2) model.

I do not know - or care - which of these two interventions solved my problem, but solved it is.

The Amanda backup process works like a charm again.

Finally.

Thanks to all who responded.

Jos