  #1  
Old 11-30-2009
Registered User
 

Join Date: Feb 2009
Posts: 111
Thanks: 0
Thanked 0 Times in 0 Posts
Sun Fire V490 occasionally goes down

I can't find the cause, but I suspect a hardware failure... I have a Sun Fire V490 running Solaris 10 5/08 with Sun Cluster 3.2, and it occasionally goes down under circumstances I can't pin down... I have also noticed strange behavior after the server fails: when I run any diagnostics from OBP, they report a lot of errors:


Code:
{12} ok test-all
Testing /pci@9,600000/SUNW,qlc@2

   ERROR   : RISC RAM failed to load from host buffer.
   DEVICE  : /pci@9,600000/SUNW,qlc@2
   SUBTEST : selftest:mats-test
   CALLERS : (f010633c)
   MACHINE : Sun Fire V490
   SERIAL# : 71218358
   DATE    : 11/25/2009 07:44:25  GMT
   CONTR0LS: diag-level=max test-args=

/pci@9,600000/SUNW,qlc@2 selftest failed, return code = 1
Testing /pci@9,600000/network@1

   ERROR   : TX DMA block never received packet.
   DEVICE  : /pci@9,600000/network@1
   SUBTEST : selftest:mltpkt-gmii-int-lpb-test
   CALLERS : (f010a928)
   MACHINE : Sun Fire V490
   SERIAL# : 71218358
   DATE    : 11/25/2009 07:44:37  GMT
   CONTR0LS: diag-level=max test-args=

/pci@9,600000/network@1 selftest failed, return code = 1
Testing /pci@9,700000/network@2

   ERROR   : TX DMA block never received packet.
   DEVICE  : /pci@9,700000/network@2
   SUBTEST : selftest:mltpkt-gmii-int-lpb-test
   CALLERS : (f010e9e0)
   MACHINE : Sun Fire V490
   SERIAL# : 71218358
   DATE    : 11/25/2009 07:44:49  GMT
   CONTR0LS: diag-level=max test-args=

/pci@9,700000/network@2 selftest failed, return code = 1
Testing /pci@9,700000/usb@1,3
Testing /pci@9,700000/ebus@1

   ERROR   : DMA control status register 1
   SUMMARY : Obs=0xef Exp=0x00 XOR=0xef Addr=0x0
   DEVICE  : /pci@9,700000/ebus@1
   SUBTEST : selftest:dma-func-test
   CALLERS : (f010162c)
   MACHINE : Sun Fire V490
   SERIAL# : 71218358
   DATE    : 11/25/2009 07:44:50  GMT
   CONTR0LS: diag-level=max test-args=

/pci@9,700000/ebus@1 selftest failed, return code = 1

But after powering the server off and leaving it off for some time, the diagnostics don't show any issues... The system log and fmdump -vu also show some interesting output:

Code:
Nov 30 12:00:45 server1 EVENT-TIME: Fri Mar 27 13:56:44 MSK 2009
Nov 30 12:00:45 server1 PLATFORM: SUNW,Sun-Fire-V490, CSN: -, HOSTNAME: server1
Nov 30 12:00:45 server1 SOURCE: eft, REV: 1.16
Nov 30 12:00:45 server1 EVENT-ID: 56b30d73-1f74-6d5c-812b-bca359fdc999
Nov 30 12:00:45 server1 DESC: The transmitting device sent an invalid request.
Nov 30 12:00:45 server1   Refer to Sun Message ID: PCIEX-8000-5Y for more information.
Nov 30 12:00:45 server1 AUTO-RESPONSE: One or more device instances may be disabled
Nov 30 12:00:45 server1 IMPACT: Loss of services provided by the device instances associated with this fault
Nov 30 12:00:45 server1 REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s).  Use fmdump -v -u <EVENT_ID> to identify the devices or contact Sun for support.

[ root@server1 Mon Nov 30 12:09:17 2009 ]
/ # fmdump -vu 56b30d73-1f74-6d5c-812b-bca359fdc999             
TIME                                           UUID                                 SUNW-MSG-ID
Nov 30 12:00:45.5664 56b30d73-1f74-6d5c-812b-bca359fdc999 PCIEX-8000-5Y
   50%  fault.io.pci.device-invreq

        Problem in: hc://:product-id=SUNW,Sun-Fire-V490:server-id=server1/motherboard=0/hostbridge=1/pcibus=0/pcidev=1/pcifn=0
           Affects: dev:////pci@9,600000/network@1
               FRU: hc:///component=MB
          Location: MB

   50%  fault.io.pci.device-invreq

        Problem in: hc://:product-id=SUNW,Sun-Fire-V490:server-id=server1/motherboard=0/hostbridge=1/pcibus=0/pcidev=2/pcifn=0
           Affects: dev:////pci@9,600000/SUNW,qlc@2
               FRU: hc:///component=MB
          Location: MB

I'm stuck! I don't know how to determine the cause of this strange behavior. I am considering checking the server with SunVTS, but according to the documentation it cannot be used in a Sun Cluster environment... One way to find out what is going on is to uninstall the Sun Cluster software and run stress tests for a couple of days... Maybe that will reveal the cause of the failures...
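For readers comparing the suspects in the fmdump -vu output above, here is a minimal Python sketch that pulls out each suspect block and counts how often each FRU is named. The sample text is copied from the post; the function name parse_suspects and the dictionary keys are illustrative choices, not part of any Solaris tool, and the parser assumes the "NN%  fault-class" and "Key: value" layout shown above.

```python
import re
from collections import Counter

# Suspect list copied from the fmdump -vu output in the post.
FMDUMP_SAMPLE = """\
   50%  fault.io.pci.device-invreq

        Problem in: hc://:product-id=SUNW,Sun-Fire-V490:server-id=server1/motherboard=0/hostbridge=1/pcibus=0/pcidev=1/pcifn=0
           Affects: dev:////pci@9,600000/network@1
               FRU: hc:///component=MB
          Location: MB

   50%  fault.io.pci.device-invreq

        Problem in: hc://:product-id=SUNW,Sun-Fire-V490:server-id=server1/motherboard=0/hostbridge=1/pcibus=0/pcidev=2/pcifn=0
           Affects: dev:////pci@9,600000/SUNW,qlc@2
               FRU: hc:///component=MB
          Location: MB
"""


def parse_suspects(text):
    """Return one dict per suspect block (hypothetical helper, assumed layout)."""
    suspects = []
    for raw in text.splitlines():
        line = raw.strip()
        # A line such as "50%  fault.io.pci.device-invreq" starts a new suspect.
        start = re.match(r"(\d+)%\s+(\S+)$", line)
        if start:
            suspects.append({"certainty": int(start.group(1)),
                             "fault": start.group(2)})
            continue
        # Remaining "Key: value" lines belong to the most recent suspect.
        key, sep, value = line.partition(": ")
        if sep and suspects:
            suspects[-1][key.strip()] = value.strip()
    return suspects


suspects = parse_suspects(FMDUMP_SAMPLE)
fru_count = Counter(s["FRU"] for s in suspects)

for s in suspects:
    print(s["certainty"], s["Affects"], s["FRU"])
print(dict(fru_count))
```

In this sample both suspects name the same FRU (MB) even though they affect different devices. That is only a summary of what fmdump already reports, not a diagnosis.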
  #2  
Old 12-01-2009
Registered User
 

Join Date: May 2008
Location: SINGAPORE.. The "FINE" City
Posts: 2,671
Thanks: 0
Thanked 8 Times in 8 Posts
A few things to check: the OBP version and the kernel patch level. If it turns out not to be a cluster issue after all, it could be a motherboard problem. Also analyse the messages file.
  #3  
Old 12-01-2009
Registered User
 

Join Date: Feb 2009
Posts: 111
Thanks: 0
Thanked 0 Times in 0 Posts
You mean check OBP? But how?.. And which kernel patch should I install?.. SunSolve returns about 30 patches when I search for the keywords 'pci network' and 'pci qlc', but I can't find any that match my problem. The other node in this cluster has no such problems at all... And unfortunately, the system messages break off unexpectedly when the server fails... Can it be a SB error?..

---------- Post updated at 03:20 ---------- Previous update was at 03:18 ----------

Also, my metadevices are degraded after such failures, but after I run metasync things get better... I've put the faulty server under component stress testing with SunVTS; I hope it will find the cause.
  #4  
Old 12-01-2009
Registered User
 

Join Date: May 2008
Location: SINGAPORE.. The "FINE" City
Posts: 2,671
Thanks: 0
Thanked 8 Times in 8 Posts
OBP -> prtconf -V
kernel patch -> is it the latest patch cluster? showrev

The OBP test-all output points to something related to the motherboard (MB). It is unlikely that both the HBA card and the network card would go faulty at the same time.
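To make the reasoning above easier to check, here is a small Python sketch that groups the "selftest failed" lines from the test-all output in post #1 by their top-level PCI node. The failure lines are copied from that output; the helper name failures_by_bridge is made up for illustration, and the grouping only summarises device paths, it does not prove which part is at fault.

```python
import re
from collections import defaultdict

# "selftest failed" lines copied from the test-all output in post #1.
OBP_FAILURES = """\
/pci@9,600000/SUNW,qlc@2 selftest failed, return code = 1
/pci@9,600000/network@1 selftest failed, return code = 1
/pci@9,700000/network@2 selftest failed, return code = 1
/pci@9,700000/ebus@1 selftest failed, return code = 1
"""


def failures_by_bridge(text):
    """Map each top-level /pci@... node to the child devices that failed.

    Hypothetical helper: assumes every failure line has the form
    "<device path> selftest failed, ..." as in the output above.
    """
    groups = defaultdict(list)
    pattern = re.compile(r"^(/pci@[^/\s]+)/(\S+) selftest failed", re.MULTILINE)
    for match in pattern.finditer(text):
        groups[match.group(1)].append(match.group(2))
    return dict(groups)


groups = failures_by_bridge(OBP_FAILURES)
for bridge, devices in sorted(groups.items()):
    print(bridge, devices)
```

The four failures cover an HBA, two network ports and the ebus, spread over two PCI nodes. That pattern fits a shared component better than several independent card failures.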
  #5  
Old 12-01-2009
Registered User
 

Join Date: Feb 2009
Posts: 111
Thanks: 0
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by incredible View Post
OBP -> prtconf -V
kernel patch -> is it the latest patch cluster? showrev

The OBP test-all output points to something related to the motherboard (MB). It is unlikely that both the HBA card and the network card would go faulty at the same time.
I'm a little confused. I have a relatively recent OS release and two servers with identical hardware configurations. I have not installed any patch clusters on either server, yet one of them fails occasionally and the other doesn't. Also, since the Sun Fire V490 was released before my OS version, I assumed the OS already includes all the required patches and software updates... Maybe I'm wrong?..
  #6  
Old 12-01-2009
Registered User
 

Join Date: May 2008
Location: SINGAPORE.. The "FINE" City
Posts: 2,671
Thanks: 0
Thanked 8 Times in 8 Posts
The Recommended patch cluster is very important. Please provide the outputs as requested. Then at least we can advise if the patch level is too low.
  #7  
Old 12-01-2009
Registered User
 

Join Date: Feb 2009
Posts: 111
Thanks: 0
Thanked 0 Times in 0 Posts
incredible
Ok, I will post the output soon...