Recover failed system disk


 
# 1  
Old 02-10-2016
I have an oldish Solaris 10 system (SunFire x4240) which, due to a recent overheating event in the server room, lost its system disk.

I have rsync backups of all the other (data) disks, but apparently I do not have a backup of /. :-( I can start the machine up in failsafe mode, but running fsck on the system disk always reports a couple of bad sectors, which I don't seem to be able to repair or ignore (tried format->analyze->read, etc.).

It looks like I can mount the disk read only, so I'm hoping I can copy most of the pertinent info off of it, install Solaris 10 on a fresh replacement disk, and then copy that pertinent system info back onto the new system so that I don't have to recreate network info, users, disk mounts, NIS info, and various other things from scratch.

Once I have the fresh OS installed on a new disk, is it safe to mount the failed disk read-only and use rsync to copy the accessible files on that disk to a safe location - or is there a better way to do this?

Thanks.

-J

---------- Post updated at 02:38 PM ---------- Previous update was at 10:07 AM ----------

I installed Solaris 10 on the new disk, but now I'm wondering how best to get the files I want off of the old disk. The system has 16 SAS bays, allocated as follows:

0: new system disk
1-3: single-volume disks
4-15: RAID 5 array

All bays are filled.

The raid controller knows about all the disks, but the new system does not (yet), though I can easily mount the single-volume disks. The question is: can I turn off the computer, swap out one of the single-volume disks for the bad disk, and then power up and mount the bad disk and copy files off of it - without irrevocably screwing up the raid-controller's knowledge of the disk I pulled out to make space for the bad disk?

Thanks.

-J
# 2  
Old 02-11-2016
It seems to me that the main thing you need to do is save whatever files you can from the root filesystem that you have no backup of.

You should be able to mount it read-only onto your new O/S (or you could boot into single-user from CD/DVD and mount it under that).

If you have sufficient other storage available, you could then attempt to 'dd' the whole raw partition off to a file and/or attempt to find/cpio the whole filesystem off to an archive.

The comments you have already made lead me to assume that you have quite a bit of experience of Solaris so I haven't gone into much detail.

What RAID controller is it? Make/model?

When you say that fsck doesn't straighten out the filesystem what command are you using? If you're using a "-n" flag then it will test the filesystem without correcting anything. I definitely would not use a "-y" flag because the filesystem could be damaged beyond repair before you can stop the operation. I would use neither flag and just see what questions it asks. I would also perhaps use the (often undocumented) flag "-o full" to examine the whole filesystem although this could take a very long time. Again, don't use -y or -n so you can abort if needed.

---------- Post updated at 11:28 AM ---------- Previous update was at 11:26 AM ----------

I guess the main point I'm making is to mount the filesystem read-only and then back it up. If you subsequently lose that filesystem completely, you can get some data back from archive.
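
For illustration, here is a minimal sketch of that mount-then-back-up sequence. The slice name, mount point, and image path are hypothetical placeholders, not values from this thread, and the `run` helper is only an assumption made for safety: it prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: every name below is a hypothetical placeholder.
OLD_SLICE=c1t1d0s0            # slice holding the old root filesystem
MNT=/mnt/oldroot              # where to mount it
IMG=/backup/oldroot.img       # image file on healthy storage

# Print each command by default; set DRY_RUN=0 to really execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rescue_plan() {
    # Mount the UFS slice read-only so nothing is written to the failing disk.
    run mount -F ufs -o ro "/dev/dsk/$OLD_SLICE" "$MNT"
    # Image the raw slice. noerror keeps dd going past unreadable blocks and
    # sync pads them, so offsets in the image stay aligned with the disk.
    run dd "if=/dev/rdsk/$OLD_SLICE" "of=$IMG" bs=64k conv=noerror,sync
    # File-level alternative once mounted (not run here):
    #   cd "$MNT" && find . -depth -print | cpio -oc > /backup/oldroot.cpio
}

rescue_plan
```

With `conv=noerror,sync`, one unreadable sector costs the whole `bs`-sized block in the image, so a smaller block size trades speed for less lost data around the bad sectors.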
# 3  
Old 02-11-2016
"Quite a bit of experience with Solaris" is probably a big overstatement. I have done system administration only out of necessity for the past 20 years on various *nix systems. So I have experience over many years, but it is infrequent experience. The main sysadmin around here has not dealt with Solaris for years. So while I remember a few things, Google is my friend.

Running StorMan, I see the claim that the raid controller is "Sun STK RAID INT", but that's about all the info I seem to be able to find without rebooting - which I guess I will be doing soon. My main worry right now is that if I swap out a disk (a single-volume disk, not one from the RAID), the controller may lose knowledge of it. However, I have THAT data backed up, so it's not a big deal if I lose it. And I guess if I can actually mount and read the old system disk in the freed-up slot, I should be able to repeat that process when I put back whatever disk I pulled out to make room.

I've already run fsck -y a number of times on the bad disk, so whatever damage that may have done is already done.

I know what to do in general, but I am trying to avoid a misstep that will damage the disk further and prevent me from getting as much info as I can off of it.

Thanks.

-J
# 4  
Old 02-11-2016
"Sun STK Raid int" tells me it's a StorageTek RAID controller often found in Sun boxes.

If you Google search for it there's plenty of info.

Also, search this forum for it, for example this thread:

How to find missing disks on Sun x4150 without reboot?

Others on this forum, such as DukeNuke2, have more experience with this RAID controller than I do.

If you don't know the existing RAID configuration I'd be inclined to avoid removal of any working disks as they may be part of an array; a RAID5 array for example or, even worse, a RAID0 where loss of one drive takes the array off-line.

You either need to back up the drive where it is now, or remove the drive and connect it to another machine just to take an image (sector-by-sector) backup. That way, if the data you lose afterwards turns out to be vital, you can write that image out to a new drive.

So you are saying that despite running fsck -y a number of times the filesystem still isn't fixed? It still shows errors?
# 5  
Old 02-11-2016
So far so good.

I knew the configuration of all of the disks from the raid controller setup screen (ctrl-a when booting). I had already removed the failed disk in order to put in a fresh one on which to put a new system (no free slots).

The raid controller must be smart enough because everything worked smoothly. I shut down the computer and pulled a known single-volume disk with not too much data on it (for which I had a complete backup as well) and inserted the failed disk into that slot. On boot, the controller detected the change and made a new configuration and came up just fine. I was able to mount the bad disk and copy everything off of it except for the contents of /usr/lib. This should effectively get me everything (config files, etc.) that I need to rebuild the system the way it was before. I copied the files off of the bad disk by rsyncing what I thought were the most important directories first (in case something bad should happen). After turning the computer off, re-inserting the disk that I had swapped out, and turning it back on again - the raid controller once again detected the change, did a reconfiguration and now everything looks as it did earlier today. I can see and mount all of the disks and access their data - except that now I also have a copy of everything that was on the old system disk (except for /usr/lib) from which I can (hopefully) get the system back into its pre-crash state.
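
The "most important directories first" approach could be sketched roughly as follows. The mount point, destination, and directory order are hypothetical choices for illustration, not the exact ones used here, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: paths and the priority order are hypothetical placeholders.
SRC=/mnt/oldroot              # read-only mount of the failed disk
DEST=/backup/oldroot          # destination on healthy storage
PRIORITY="etc var export opt" # most valuable directories first

# Print each command by default; set DRY_RUN=0 to really execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

copy_in_priority_order() {
    for dir in $PRIORITY; do
        # Create the target first: rsync does not create missing parents.
        run mkdir -p "$DEST/$dir"
        # -a preserves owners, permissions, times, and symlinks.
        run rsync -a "$SRC/$dir/" "$DEST/$dir/"
    done
}

copy_in_priority_order
```

Copying one directory per rsync invocation means that if the disk dies partway through, everything earlier in the list is already safe.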

Thanks.

-J

---------- Post updated at 01:57 PM ---------- Previous update was at 12:57 PM ----------

Apparently, I foolishly chose the default disk partitioning when installing the new system. Now / (slice 0) has very little space on it (6.4 GB), while the rest of the space on the disk (124 GB) is mounted as /export/home (slice 7).

So far, I've only made a few minor changes to /, and none to /export/home.
Is there any way to repartition the disk so that the whole thing is allocated to / in slice 0, or will I have to reinstall the system (again)?

User home directories are all on a separate disk anyway.

Thanks.

-J

---------- Post updated at 03:29 PM ---------- Previous update was at 01:57 PM ----------

FYI to anyone who might read this thread in the future:

I was actually able to increase the size of partition 0 to the full disk (except for the swap and boot sectors) by basically following the instructions at https://blogs.oracle.com/michel/entr...aris_partition (modified for my purposes) and running growfs. And it didn't even bork my system! I was prepared to have to reinstall Solaris.

However, I didn't think to first remove the line from my vfstab which attempts to mount the partition that I removed. This caused problems on boot and ended up requiring another reboot.
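
For anyone repeating this, a rough sketch of disabling the stale entry before rebooting. The slice names and the sample entries are hypothetical, and a POSIX sed is assumed. The function only filters text from stdin to stdout, so it can be tried on a copy of /etc/vfstab first.

```shell
#!/bin/sh
# Sketch only: comment out the vfstab entry whose first field
# (device to mount) is the given slice. Reads stdin, writes stdout.
comment_out_slice() {
    sed "s|^$1[[:space:]]|#&|"
}

# Demo on two made-up entries; c0t0d0s7 stands in for the removed slice.
comment_out_slice /dev/dsk/c0t0d0s7 <<'EOF'
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /export/home ufs 2 yes -
EOF
```

Commenting the line out rather than deleting it keeps a record of the old layout in case it needs to be checked later.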

The system is probably not properly tuned. Of course, now that I have done it, it occurs to me that maybe swap should be much larger. This is what happens when you only do system management occasionally.

-J
