Sun Fire V490 occasionally goes down


 
# 1  
Old 11-30-2009

I can't find the cause, but I suspect a hardware failure. I have a Sun Fire V490 running Solaris 10 5/08 with Sun Cluster 3.2, and it occasionally goes down under circumstances I can't pin down. I've also noticed strange behavior after the server fails: when I run any diagnostics from OBP, they report a lot of errors:

Code:
{12} ok test-all
Testing /pci@9,600000/SUNW,qlc@2

   ERROR   : RISC RAM failed to load from host buffer.
   DEVICE  : /pci@9,600000/SUNW,qlc@2
   SUBTEST : selftest:mats-test
   CALLERS : (f010633c)
   MACHINE : Sun Fire V490
   SERIAL# : 71218358
   DATE    : 11/25/2009 07:44:25  GMT
   CONTR0LS: diag-level=max test-args=

/pci@9,600000/SUNW,qlc@2 selftest failed, return code = 1
Testing /pci@9,600000/network@1

   ERROR   : TX DMA block never received packet.
   DEVICE  : /pci@9,600000/network@1
   SUBTEST : selftest:mltpkt-gmii-int-lpb-test
   CALLERS : (f010a928)
   MACHINE : Sun Fire V490
   SERIAL# : 71218358
   DATE    : 11/25/2009 07:44:37  GMT
   CONTR0LS: diag-level=max test-args=

/pci@9,600000/network@1 selftest failed, return code = 1
Testing /pci@9,700000/network@2

   ERROR   : TX DMA block never received packet.
   DEVICE  : /pci@9,700000/network@2
   SUBTEST : selftest:mltpkt-gmii-int-lpb-test
   CALLERS : (f010e9e0)
   MACHINE : Sun Fire V490
   SERIAL# : 71218358
   DATE    : 11/25/2009 07:44:49  GMT
   CONTR0LS: diag-level=max test-args=

/pci@9,700000/network@2 selftest failed, return code = 1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1

   ERROR   : DMA control status register 1
   SUMMARY : Obs=0xef Exp=0x00 XOR=0xef Addr=0x0
   DEVICE  : /pci@9,700000/ebus@1
   SUBTEST : selftest:dma-func-test
   CALLERS : (f010162c)
   MACHINE : Sun Fire V490
   SERIAL# : 71218358
   DATE    : 11/25/2009 07:44:50  GMT
   CONTR0LS: diag-level=max test-args=

/pci@9,700000/ebus@1 selftest failed, return code = 1

But after powering the server off and leaving it off for a while, the diagnostics don't show any issues. The system log and fmdump -vu also show some interesting output:
Code:
Nov 30 12:00:45 server1 EVENT-TIME: Fri Mar 27 13:56:44 MSK 2009
Nov 30 12:00:45 server1 PLATFORM: SUNW,Sun-Fire-V490, CSN: -, HOSTNAME: server1
Nov 30 12:00:45 server1 SOURCE: eft, REV: 1.16
Nov 30 12:00:45 server1 EVENT-ID: 56b30d73-1f74-6d5c-812b-bca359fdc999
Nov 30 12:00:45 server1 DESC: The transmitting device sent an invalid request.
Nov 30 12:00:45 server1   Refer to Sun Message ID: PCIEX-8000-5Y for more information.
Nov 30 12:00:45 server1 AUTO-RESPONSE: One or more device instances may be disabled
Nov 30 12:00:45 server1 IMPACT: Loss of services provided by the device instances associated with this fault
Nov 30 12:00:45 server1 REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s).  Use fmdump -v -u <EVENT_ID> to identify the devices or contact Sun for support.

[ root@server1 Mon Nov 30 12:09:17 2009 ]
/ # fmdump -vu 56b30d73-1f74-6d5c-812b-bca359fdc999             
TIME                                           UUID                                 SUNW-MSG-ID
Nov 30 12:00:45.5664 56b30d73-1f74-6d5c-812b-bca359fdc999 PCIEX-8000-5Y
   50%  fault.io.pci.device-invreq

        Problem in: hc://:product-id=SUNW,Sun-Fire-V490:server-id=server1/motherboard=0/hostbridge=1/pcibus=0/pcidev=1/pcifn=0
           Affects: dev:////pci@9,600000/network@1
               FRU: hc:///component=MB
          Location: MB

   50%  fault.io.pci.device-invreq

        Problem in: hc://:product-id=SUNW,Sun-Fire-V490:server-id=server1/motherboard=0/hostbridge=1/pcibus=0/pcidev=2/pcifn=0
           Affects: dev:////pci@9,600000/SUNW,qlc@2
               FRU: hc:///component=MB
          Location: MB

I'm stuck! I don't know how to determine the cause of this strange behavior. I'm considering checking the server with SunVTS, but according to the documentation it cannot be used in a Sun Cluster environment. One way to find out what is going on would be to uninstall the Sun Cluster software and run stress tests for a couple of days. Maybe that will reveal the cause of the failures...
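To get an overview of the test-all results before deciding on a stress test, the console output can be summarized by device and host bridge. The following is only a minimal sketch: it assumes the console output was saved as text, it keeps just the "Testing" and "selftest failed" lines from the output quoted in this post, and the function names `summarize` and `by_bridge` are made up for illustration.

```python
import re

# "Testing" and "selftest failed" lines copied from the test-all output in this post.
OBP_SAMPLE = """\
Testing /pci@9,600000/SUNW,qlc@2
/pci@9,600000/SUNW,qlc@2 selftest failed, return code = 1
Testing /pci@9,600000/network@1
/pci@9,600000/network@1 selftest failed, return code = 1
Testing /pci@9,700000/network@2
/pci@9,700000/network@2 selftest failed, return code = 1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1
/pci@9,700000/ebus@1 selftest failed, return code = 1
"""

FAILED_RE = re.compile(r"(\S+) selftest failed, return code = (\d+)$")


def summarize(text):
    """Return (tested, failed) lists of device paths found in the text."""
    tested, failed = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Testing "):
            tested.append(line[len("Testing "):])
            continue
        match = FAILED_RE.match(line)
        if match:
            failed.append(match.group(1))
    return tested, failed


def by_bridge(paths):
    """Group device paths by their first path component (the PCI host bridge)."""
    groups = {}
    for path in paths:
        bridge = "/" + path.split("/")[1]
        groups.setdefault(bridge, []).append(path)
    return groups


tested, failed = summarize(OBP_SAMPLE)
print(f"{len(failed)} of {len(tested)} tested devices reported a selftest failure")
for bridge, devices in by_bridge(failed).items():
    print(bridge, "->", ", ".join(devices))
```

In this sample the failures are spread over both /pci@9,600000 and /pci@9,700000, and only usb@1,3 has no failure line.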
# 2  
Old 12-01-2009
A few things to check: the OBP version and the kernel patch level. If it turns out not to be a cluster issue after all, it could be a motherboard problem. Also analyse the messages file.
# 3  
Old 12-01-2009
You mean check the OBP? But how? And which kernel patch should I install? SunSolve shows about 30 patches when I search for the keywords 'pci network' and 'pci qlc', but I can't find any that match my problem. The other node in this cluster has no such problems at all. Unfortunately, the system messages stop abruptly when the server fails. Could it be an SB error?

---------- Post updated at 03:20 ---------- Previous update was at 03:18 ----------

Also, my metadevices are degraded after such failures, but after running metasync things get better. I've put the faulty server under component stress testing with SunVTS; I hope it will find something.
# 4  
Old 12-01-2009
OBP -> prtconf -V
kernel patch -> is it the latest patch cluster? showrev

From the OBP test-all, it looks like something related to the MB. It's unlikely that both the HBA card and the network card would go faulty at the same time.
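The same reasoning can be checked against the fmdump output from post #1 by counting which FRU each suspect names. This is only a rough sketch: the sample text is the suspect list copied from that post, the parser assumes the "Key: value" layout shown there, and `parse_suspects` is a hypothetical helper, not part of any Solaris tool.

```python
import re
from collections import Counter

# Suspect list copied from the `fmdump -vu` output in post #1.
FMDUMP_SAMPLE = """\
   50%  fault.io.pci.device-invreq

        Problem in: hc://:product-id=SUNW,Sun-Fire-V490:server-id=server1/motherboard=0/hostbridge=1/pcibus=0/pcidev=1/pcifn=0
           Affects: dev:////pci@9,600000/network@1
               FRU: hc:///component=MB
          Location: MB

   50%  fault.io.pci.device-invreq

        Problem in: hc://:product-id=SUNW,Sun-Fire-V490:server-id=server1/motherboard=0/hostbridge=1/pcibus=0/pcidev=2/pcifn=0
           Affects: dev:////pci@9,600000/SUNW,qlc@2
               FRU: hc:///component=MB
          Location: MB
"""

SUSPECT_RE = re.compile(r"(\d+)%\s+(\S+)$")


def parse_suspects(text):
    """Return one dict per suspect: certainty, fault class and the Key: value fields."""
    suspects = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = SUSPECT_RE.match(line)
        if match:
            # A "NN%  fault.class" line starts a new suspect entry.
            current = {"certainty": int(match.group(1)), "fault": match.group(2)}
            suspects.append(current)
            continue
        if current is None or ":" not in line:
            continue
        # Split on the first colon only; the values themselves contain colons.
        key, _, value = line.partition(":")
        current[key.strip().lower()] = value.strip()
    return suspects


suspects = parse_suspects(FMDUMP_SAMPLE)
fru_counts = Counter(s["fru"] for s in suspects)
for fru, count in fru_counts.items():
    print(f"{fru}: named by {count} suspect(s)")
for s in suspects:
    print(f'{s["certainty"]}% {s["fault"]} affects {s["affects"]}')
```

Both suspects affect different devices (network@1 and qlc@2) but name the same FRU, which fits the motherboard theory rather than two cards failing independently.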
# 5  
Old 12-01-2009
Quote:
Originally Posted by incredible
OBP -> prtconf -V
kernel patch -> is it the latest patch cluster? showrev

From the OBP test-all, it looks like something related to the MB. It's unlikely that both the HBA card and the network card would go faulty at the same time.
I'm a little confused. I have a relatively recent OS version and two servers with identical hardware configurations. I haven't installed any patch clusters on either server, yet one of them fails from time to time while the other doesn't. Also, since the Sun Fire V490 was released before my OS version, I assumed it would already include all the required patches and software updates. Maybe I'm wrong?
# 6  
Old 12-01-2009
The recommended patch cluster is very important. Please provide the outputs as requested, so that at least we can tell you whether the patch level is too low.
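Since the two nodes are supposed to be identical, it may also help to compare their `showrev -p` listings once they are collected. The sketch below is hedged accordingly: it assumes each patch line starts with `Patch: <id>-<rev>`, the two sample listings are made-up placeholders and not output from these servers, and `installed_patches` and `compare_nodes` are hypothetical helper names.

```python
import re

# Made-up sample lines in the `showrev -p` style, for illustration only.
NODE1_SAMPLE = """\
Patch: 127127-11 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 125166-07 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWqlc
"""
NODE2_SAMPLE = """\
Patch: 127127-11 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
Patch: 125166-12 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWqlc
Patch: 120011-14 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
"""

PATCH_RE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b")


def installed_patches(showrev_output):
    """Map each patch base ID to the highest revision listed in the text."""
    patches = {}
    for line in showrev_output.splitlines():
        match = PATCH_RE.match(line)
        if match:
            base, rev = match.group(1), int(match.group(2))
            patches[base] = max(rev, patches.get(base, -1))
    return patches


def compare_nodes(node_a, node_b):
    """Return {patch_id: (rev_on_a, rev_on_b)} where the nodes differ; None means missing."""
    diff = {}
    for base in sorted(set(node_a) | set(node_b)):
        if node_a.get(base) != node_b.get(base):
            diff[base] = (node_a.get(base), node_b.get(base))
    return diff


differences = compare_nodes(installed_patches(NODE1_SAMPLE), installed_patches(NODE2_SAMPLE))
for patch_id, (rev_a, rev_b) in differences.items():
    print(f"{patch_id}: node1={rev_a} node2={rev_b}")
```

An empty result would suggest that the patch level is not what distinguishes the failing node from the healthy one.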
# 7  
Old 12-02-2009
incredible
Ok, I will post the output soon...