Sun Fire v440 Hard disk or controller broken? WARNING: /pci@1f,700000/scsi@2/sd@0,0 (sd1)

02-13-2020

I have a Sun Fire V440 server that fails to boot up correctly. A lot of services are not started and the system responds very slowly to commands. During boot I can see the following error:

WARNING: /pci@1f,700000/scsi@2/sd@0,0 (sd1):
        SCSI transport failed: reason 'reset': retrying command
WARNING: /pci@1f,700000/scsi@2/sd@0,0 (sd1):
        Error for Command: read                    Error Level: Retryable
        Requested Block: 689376                    Error Block: 689390
        Vendor: LSILOGIC                           Serial Number: LSI INTERNAL
        Sense Key: Media Error
        ASC: 0x11 (read retries exhausted), ASCQ: 0x1, FRU: 0x0

The first two disks, sd0 and sd1, appear to be configured as RAID 1, so I would assume that one of those disks is bad. But raidctl shows no errors:

RAID    Volume  RAID            RAID            Disk
Volume  Type    Status          Disk            Status
c1t0d0  IM      RESYNCING       c1t0d0          OK
                                 c1t1d0          OK

But iostat -en shows soft and hard errors for the raid:

bash-3.00# iostat -en
  ---- errors ---
  s/w h/w trn tot
    3   6   0   9 c1t0d0
    0   0   0   0 c1t2d0
    0   0   0   0 c1t3d0
    1   0   0   1 c3t600144F0A549542200005CC83C9C0003d0
    1   0   0   1 ssd3

Is it possible that the Raid controller is broken?

bash-3.00# prtdiag -v
System Configuration: Sun Microsystems  sun4u Sun Fire V440
System clock frequency: 183 MHZ
Memory size: 16GB

==================================== CPUs ====================================
               E$          CPU                    CPU
CPU  Freq      Size        Implementation         Mask    Status      Location
---  --------  ----------  ---------------------  -----   ------      --------
0    1281 MHz  1MB         SUNW,UltraSPARC-IIIi    2.4    on-line      -
1    1281 MHz  1MB         SUNW,UltraSPARC-IIIi    2.4    on-line      -
2    1281 MHz  1MB         SUNW,UltraSPARC-IIIi    2.4    on-line      -
3    1281 MHz  1MB         SUNW,UltraSPARC-IIIi    2.4    on-line      -

================================= IO Devices =================================
Bus     Freq  Slot +      Name +
Type    MHz   Status      Path                          Model
------  ----  ----------  ----------------------------  --------------------
pci     66    MB          pci108e,abba (network)        SUNW,pci-ce
              okay        /pci@1c,600000/network@2

pci     33    MB          isa/su (serial)
              okay        /pci@1e,600000/isa@7/serial@0,3f8

pci     33    MB          isa/su (serial)
              okay        /pci@1e,600000/isa@7/serial

pci     33    MB          isa/rmc-comm-rmc_comm (seria+
              okay        /pci@1e,600000/isa@7/rmc-comm@0,3e8

pci     33    MB          pci10b9,5229 (ide)
              okay        /pci@1e,600000/ide

pci     66    MB          pci108e,abba (network)        SUNW,pci-ce
              okay        /pci@1f,700000/network@1

pci     66    MB          scsi-pci1000,30 (scsi-2)      LSI,1030
              okay        /pci@1f,700000/scsi@2

pci     66    MB          scsi-pci1000,30 (scsi-2)      LSI,1030
              okay        /pci@1f,700000/scsi

============================ Memory Configuration ============================
Segment Table:
Base Address       Size       Interleave Factor  Contains
0x0                4GB               16          BankIDs 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0x1000000000       4GB               16          BankIDs 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
0x2000000000       4GB               16          BankIDs 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47
0x3000000000       4GB               16          BankIDs 48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63

Bank Table:
           Physical Location
ID       ControllerID  GroupID   Size       Interleave Way
0        0             0         256MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1        0             0         256MB
2        0             1         256MB
3        0             1         256MB
4        0             0         256MB
5        0             0         256MB
6        0             1         256MB
7        0             1         256MB
8        0             1         256MB
9        0             1         256MB
10       0             0         256MB
11       0             0         256MB
12       0             1         256MB
13       0             1         256MB
14       0             0         256MB
15       0             0         256MB
16       1             0         256MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
17       1             0         256MB
18       1             1         256MB
19       1             1         256MB
20       1             0         256MB
21       1             0         256MB
22       1             1         256MB
23       1             1         256MB
24       1             1         256MB
25       1             1         256MB
26       1             0         256MB
27       1             0         256MB
28       1             1         256MB
29       1             1         256MB
30       1             0         256MB
31       1             0         256MB
32       2             0         256MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
33       2             0         256MB
34       2             1         256MB
35       2             1         256MB
36       2             0         256MB
37       2             0         256MB
38       2             1         256MB
39       2             1         256MB
40       2             1         256MB
41       2             1         256MB
42       2             0         256MB
43       2             0         256MB
44       2             1         256MB
45       2             1         256MB
46       2             0         256MB
47       2             0         256MB
48       3             0         256MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
49       3             0         256MB
50       3             1         256MB
51       3             1         256MB
52       3             0         256MB
53       3             0         256MB
54       3             1         256MB
55       3             1         256MB
56       3             1         256MB
57       3             1         256MB
58       3             0         256MB
59       3             0         256MB
60       3             1         256MB
61       3             1         256MB
62       3             0         256MB
63       3             0         256MB

Memory Module Groups:
ControllerID   GroupID  Labels         Status
0              0        C0/P0/B0/D0
0              0        C0/P0/B0/D1
0              1        C0/P0/B1/D0
0              1        C0/P0/B1/D1
1              0        C1/P0/B0/D0
1              0        C1/P0/B0/D1
1              1        C1/P0/B1/D0
1              1        C1/P0/B1/D1
2              0        C2/P0/B0/D0
2              0        C2/P0/B0/D1
2              1        C2/P0/B1/D0
2              1        C2/P0/B1/D1
3              0        C3/P0/B0/D0
3              0        C3/P0/B0/D1
3              1        C3/P0/B1/D0
3              1        C3/P0/B1/D1

============================ Environmental Status ============================
Fan Status:
Location             Sensor          Status
FT0/F0               TACH            okay
FT1/F0               TACH            okay
FT1/F1               TACH            okay
PS0                  FF_PDCT_FAN     okay

Temperature sensors:
Location       Sensor              Status
C0/P0          T_CORE              okay
C1/P0          T_CORE              okay
C2/P0          T_CORE              okay
C3/P0          T_CORE              okay
C0             T_AMB               okay
C1             T_AMB               okay
C2             T_AMB               okay
C3             T_AMB               okay
SCSIBP         T_AMB               okay
MB             T_AMB               okay
Current sensors:
Location             Sensor       Status
MB                   FF_SCSIA     okay
MB                   FF_SCSIB     okay
MB                   FF_POK       okay
C0/P0                FF_POK       okay
C1/P0                FF_POK       okay
C2/P0                FF_POK       okay
C3/P0                FF_POK       okay
Voltage sensors:
Location       Sensor        Status
MB             V_+1V5        okay
MB             V_VCCTM       okay
MB             V_NET0_1V2D   okay
MB             V_NET1_1V2D   okay
MB             V_NET0_1V2A   okay
MB             V_NET1_1V2A   okay
MB             V_+3V3        okay
MB             V_+3V3STBY    okay
MB/BAT         V_BAT         warning (0.00V)
MB             V_SCSI_CORE   okay
MB             V_+5V         okay
MB             V_+12V        okay
MB             V_-12V        okay
PS0            P_PWR         okay
PS0            FF_POK        okay
Location       Keyswitch   State
SYS            SYSCTRL     NORMAL
Led State:
Location               Led                   State       Color
SYS                    ACT                   on          green
SYS                    SERVICE               on          amber
SYS                    LOCATE                off         white
PS0                    POK                   on          green
PS0                    STBY                  on          green
PS0                    SERVICE               off         amber
PS0                    OK2RM                 off         blue
HDD0                   SERVICE               off         amber
HDD0                   OK2RM                 off         blue
HDD1                   SERVICE               off         amber
HDD1                   OK2RM                 off         blue
HDD2                   SERVICE               off         amber
HDD2                   OK2RM                 off         blue
HDD3                   SERVICE               off         amber
HDD3                   OK2RM                 off         blue

=========================== FRU Operational Status ===========================
Fru Operational Status:
Location                Status
SC                      okay
HDD0                    present
HDD1                    present
HDD2                    present
HDD3                    present
PS0                     okay

================================ HW Revisions ================================
ASIC Revisions:
Path                   Device           Status             Revision
/pci@1c,600000         pci108e,a801     okay               4
/pci@1d,700000         pci108e,a801     okay               4
/pci@1e,600000         pci108e,a801     okay               4
/pci@1f,700000         pci108e,a801     okay               4

System PROM revisions:
OBP 4.16.4 2004/12/18 05:20 Sun Fire V440,Netra 440
OBDIAG 4.16.4 2004/12/18 05:21

I'm really thankful for any hints, as I have no clue how to proceed with this.

Best Regards,
02-13-2020
The RAID controller is not reporting "no problems", as you put it.

RESYNCING means that the controller is remirroring the RAID 1 disks because of a problem. Depending on the capacity of the RAID 1 disks (they will typically be exactly the same size), this resyncing shouldn't take very long. While it is in progress, however, system response time will be impacted. Once complete, the status should become OPTIMAL.

If the resyncing is falling over for some reason, then the process might be restarting over and over so that OPTIMAL is never achieved. Watch for that. If that is the case, I would be inclined first, if possible, to take the system down and re-seat all SCSI/SATA cables at both ends (disk and mobo) and all disk power supply plugs. Reboot and see if the problem persists. If it does, then most likely one of the disks is faulty. It's possible but unlikely that the RAID controller is faulty; the disks are the only moving parts.

You could remove the faulty raid1 drive (the one continuously resyncing) and put it on another machine running diagnostics. Perhaps completely reformat and try again. Otherwise, it's a new disk required.
02-13-2020

The status is shown as optimal. I would guess that if a disk had failed or was failing, raidctl would show that? How can I identify which of the two disks is bad if raidctl claims everything is OK? I have powered down the server many times. I have not replugged all the cables yet. I will give it a try.

02-13-2020
Watch for a repeat of resyncing. If it keeps happening, something is wrong (probably with one of the disks). You will also see high disk activity on the disk LEDs, which might be easier to spot than repeatedly running raidctl.
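
If you would rather not keep typing raidctl by hand, something along these lines could log each status change for you. It is only a rough sketch: it assumes that plain raidctl prints the same five-column table as in the first post, and the function names and the 60-second default interval are my own inventions, not anything shipped with Solaris.

```shell
# Read raidctl output on stdin and print "<volume> <status>" per RAID volume.
# Assumes the 5-column layout from the first post; header lines and the
# 2-column continuation lines for member disks are skipped.
volume_status() {
    awk 'NF == 5 && $1 ~ /^c[0-9]/ { print $1, $3 }'
}

# Poll raidctl and print a timestamped line whenever the status changes.
# The interval (seconds) is a hypothetical default; adjust to taste.
watch_resync() {
    interval=${1:-60}
    last=""
    while :; do
        now=$(raidctl | volume_status)
        if [ "$now" != "$last" ]; then
            echo "$(date) $now"
            last=$now
        fi
        sleep "$interval"
    done
}
```

Run watch_resync in a spare terminal. If the log keeps flipping back to RESYNCING after reaching OPTIMAL, the rebuild is restarting.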
02-13-2020
Do I need to issue a command in order to remove one of the disks? Is the raid hotplug capable?
02-13-2020
Yes, the onboard raid is hotplug capable but, of course, you need to be sure that you're pulling the right disk.

With the system down you can pull out and re-seat both of them to try to ensure good connection with the hotplug sockets.
02-13-2020
Also, your original post shows that c1t0d0 is the disk being rebuilt (RESYNCING) and c1t1d0 is running OK.
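
To keep an eye on the error counters as well, a small filter over iostat -en output might help. This is a sketch that assumes the column order shown in your post (s/w, h/w, trn, tot, device); the function name is made up for illustration.

```shell
# Read `iostat -en` output on stdin and print every device whose hard (h/w)
# or transport (trn) error count is nonzero.
# Assumed columns, as in the original post: s/w h/w trn tot device
devices_with_hard_errors() {
    awk 'NF == 5 && $1 ~ /^[0-9]/ && ($2 > 0 || $3 > 0) {
        print $5, "h/w=" $2, "trn=" $3
    }'
}
```

Usage would be: iostat -en | devices_with_hard_errors. Bear in mind that your iostat listing has no c1t1d0 entry, so c1t0d0 there appears to be the mirrored volume as the OS sees it. The counters alone therefore narrow things down to the volume, not to one member disk.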

