DiskSuite: Breaking mirrors.

12-23-2005

OK, so I have a remote system (7 states away) that's using SDS (Solstice DiskSuite) to mirror its two 18 GB disks. The mirrored slices are /, swap, /var, /home, and /opt.

The mirroring procedure I created uses installboot to ensure there's a bootblk on both disks of an SDS mirror.

The system has a problem booting (can't write to /var/adm/utmp) and there's no bootable CD on site. I have a "hands & eyes" person who's not familiar with Solaris. My intention was to break the mirror, boot to one disk, fsck the second disk, and boot to it to recover, then remirror the system once it's back up.

OBP has boot-device=disk net
c0t0d0 and c0t2d0 are the two disks.
c0t2d0 is identified as the left side of the mirror, whereas normally t0 is left and t2 is right.

Remember, I'm not there so all the commands are being entered by the H&E guy.

1. Enter root's password to go to single-user mode.
2. fsck all the slices on t0 and t2 except t0's /var slice since it's mounted in ro mode.
3. mount c0t2d0s0 /mnt
4. Remove the MDD stuff from /mnt/etc/system
5. Change the mounts in /mnt/etc/vfstab
6. eeprom to set boot-device=disk1
7. umount /mnt (ensures everything's written to disk).
8. to be sure, installboot bootblk /dev/rdsk/c0t0d0s0 and /dev/rdsk/c0t2d0s0
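Step 5 is the easiest one for a remote helper to get wrong by hand, so here is a rough sketch of how the vfstab edit could be scripted. It assumes the usual /dev/md/dsk/dN and /dev/md/rdsk/dN entries; the function name `unmirror_vfstab` and the dN-to-slice pairs are made up for illustration and would have to match the real metastat layout.

```shell
#!/bin/sh
# Sketch only. The metadevice-to-slice pairs passed in (d0:s0 and so on) are
# HYPOTHETICAL; the real mapping has to come from your own metastat output.

# unmirror_vfstab FILE DISK PAIR...
#   FILE  vfstab to read (for example /mnt/etc/vfstab)
#   DISK  disk to point the entries at (for example c0t2d0)
#   PAIR  metadevice:slice, for example d0:s0
# Prints the rewritten vfstab on stdout; FILE itself is not modified.
unmirror_vfstab() {
    file=$1
    disk=$2
    shift 2
    awk -v disk="$disk" -v map="$*" '
        BEGIN {
            OFS = "\t"
            n = split(map, pairs, " ")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, ":")
                slice[kv[1]] = kv[2]
            }
        }
        /^#/ { print; next }            # leave comment lines alone
        {
            # Fields 1 and 2 are "device to mount" and "device to fsck".
            for (f = 1; f <= 2; f++) {
                if (index($f, "/dev/md/") == 1) {
                    m = split($f, part, "/")    # "", dev, md, dsk|rdsk, dN
                    md = part[m]
                    if (md in slice)
                        $f = "/dev/" part[m - 1] "/" disk slice[md]
                }
            }
            print
        }' "$file"
}
```

Something like `unmirror_vfstab /mnt/etc/vfstab c0t2d0 d0:s0 d1:s1 d3:s3 > /tmp/vfstab.new` would then let you diff the result against the original before copying it into place. Entries for metadevices that are not listed are passed through unchanged, so a forgotten pair shows up as a leftover /dev/md line rather than a wrong device.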

Upon boot, the system said that c0t0d0s0 was not of this fstype and we received the same "can't write to /var/adm/utmp" error.

I did some Google searching and didn't find anything specific to this issue. I can boot into single-user mode and mount the slice without a problem, so it's puzzling.

Because of this, I figured that "disk" might be defined as t2 instead of t0, so we brought the system to single-user mode and changed eeprom boot-device=disk, and it generated the exact same error.

Now, aside from the problems (we ultimately left the mirror broken, reinstalled Solaris on one disk and recovered the data from the second disk), does this sound like it should have worked?

One of the results of this for me is to ensure installboot is run on all SDS mirrors and to check the status of boot-device (some systems weren't "disk disk1").

Carl

BOFH

12-23-2005

same process worked for me with the older version of disksuite on solaris 2.5.1 but definitely went bonkers with the newer version on solaris 8 ... your process would've worked if you reformatted the metadb slice on 1 of the drives prior to rebooting ... the system enables disksuite on bootup and sees intact metadbs so it tries to configure the filesystems under disksuite control like normal ... removing the metadbs ensures that the system doesn't have disksuite running ... and everything related to disksuite in /etc/system --- from "Begin MDD" to "End MDD" needed to get removed and not just commented out ...
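For the /etc/system part, a minimal sketch of removing (rather than commenting out) everything between the "Begin MDD" and "End MDD" markers might look like the following. It assumes the markers appear as lines containing those exact phrases, and `strip_mdd` is just an illustrative name. It works on a copy and prints to stdout so the original file is never edited in place.

```shell
#!/bin/sh
# Sketch only: delete every block that runs from a line containing
# "Begin MDD" through the next line containing "End MDD".
# ASSUMPTION: the markers look like "* Begin MDD ... info (do not edit)".

# strip_mdd FILE
#   FILE  a copy of /etc/system (for example /mnt/etc/system)
# Prints the cleaned file on stdout; FILE itself is not modified.
strip_mdd() {
    sed '/Begin MDD/,/End MDD/d' "$1"
}
```

Running `strip_mdd /mnt/etc/system > /tmp/system.new` and reading the result before moving it into place keeps a fallback copy around. Note that this only deals with /etc/system; it does nothing about the metadb replicas on disk, which is the other half of the point above.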

the fsck of the individual filesystems while they were still mirrored, however, did not sound too good --- i think they should have been done after the mirrors were broken and the box rebooted ...

if you didn't know this yet ...
you don't need to go into single user mode to reset the eeprom entries if you don't want to (see "man eeprom") ... and you could also set them from the ok prompt as required

Just Ice

12-23-2005

Well, I must not be focusing on something in your post. OK, /var is screwed up somehow. So you break the mirror, do nothing to repair /var, and boot from one side of the mirror. And then... 'we received the same "can't write to /var/adm/utmp" error.' Was that not the expected result?

You say 'fsck all the slices on t0 and t2 except t0's /var slice since it's mounted in ro mode.' How did it come to pass that /var was mounted in ro mode? Even if you couldn't get it unmounted for some odd reason, fsck -n should have been possible. Repairing /var seems like the key to recovery. What's wrong with booting into single-user mode, squaring away /var, and rebooting?

Perderabo

12-26-2005

Quote:

Originally Posted by Just Ice

same process worked for me with the older version of disksuite on solaris 2.5.1 but definitely went bonkers with the newer version on solaris 8 ...

Well, that might explain it since the last time I had to do this was on a 2.5.1 Solaris system.

Quote:

your process would've worked if you reformatted the metadb slice on 1 of the drives prior to rebooting ... the system enables disksuite on bootup and sees intact metadbs so it tries to configure the filesystems under disksuite control like normal ... removing the metadbs ensures that the system doesn't have disksuite running ... and everything related to disksuite in /etc/system --- from "Begin MDD" to "End MDD" needed to get removed and not just commented out ...

I hadn't thought that formatting the metadb slice would have mattered. I removed the MDD entries from /etc/system, and removed /etc/system altogether on further boots just in case.

Quote:

the fsck of the individual filesystems while they were still mirrored, however, did not sound too good --- i think they should have been done after the mirrors were broken and the box rebooted ...

Yeah, that could have been a problem. However, we did boot it several times over a few days and fscked the file systems a few times, so DiskSuite should have been off the system pretty quickly.

Quote:

if you didn't know this yet ...
you don't need to go into single user mode to reset the eeprom entries if you don't want to (see "man eeprom") ... and you could also set them from the ok prompt as required

Yep, knew that. Thanks though.

Carl

BOFH

12-26-2005

Quote:

Originally Posted by Perderabo

Well, I must not be focusing on something in your post. OK, /var is screwed up somehow. So you break the mirror, do nothing to repair /var

Well, I didn't "do nothing". I attempted to fsck /var but it responded that it couldn't since /var was already mounted as read-only. I don't know why it was mounted read only. I can only assume (which is why I'm asking here) that due to the initial problem, it couldn't remount read-write.

I thought the initial boot process mounted root and /var read-only, fscked them, and then remounted them read-write before mounting the rest of the slices.

Correction would be appreciated of course.

Quote:

, and boot from one side of the mirror. And then... 'we received the same "can't write to /var/adm/utmp" error.' Was that not the expected result?

The initial problem was that the system wasn't able to create utmp, possibly because /var was not able to be remounted read-write (again, assuming my comment above is true).

So my steps were to break the mirror so that I had two separate disks, reboot to a single disk not under DiskSuite control, fsck the other disk, and bring the mirror back.

I cleared /etc/system, changed the md entries in /etc/vfstab back to mounting the disk slices rather than the metadevices, fixed eeprom so that it booted from the second disk, and booted the system.

Since it should have been booting from a clean disk not under SDS control, it should have booted successfully. I was puzzled when I received the exact same error.

Quote:

You say 'fsck all the slices on t0 and t2 except t0's /var slice since it's mounted in ro mode.' How did it come to pass that /var was mounted in ro mode?

I don't know. See my theory above. Since I don't have a Sun contract here at work (AIX/Red Hat shop), I can't check some of the deeper knowledge available within Sunsolve that I had available when I was working at a Sun shop.

Quote:

Even if you couldn't get it unmounted for some odd reason, fsck -n should have been possible.

I don't know how -n would have worked (reply 'n' to all prompts) and it wouldn't have occurred to me to try it. Can you explain further how it might have helped?

Quote:

Repairing /var seems like the key to recovery.

Well, yeah.

That's what I was trying to do. I thought, perhaps in error, that getting it mounted without DiskSuite would let me fsck the other disk and then boot to the repaired disk to get the system back up. I could then remirror the disk afterwards.

Quote:

What's wrong with booting into single-user mode, squaring away /var, and rebooting?

Hence the questions. Thanks for taking the time though.

Carl

BOFH

12-26-2005

I'm not sure what version of Solaris you're using, but I thought that when I boot into single-user mode, only / and /usr were mounted. Even if /var was mounted, I would think that a "umount /var" would take care of that. I'm not sure if stuff is mirrored in single-user mode. But I would resist breaking a mirror if I didn't need to, and I don't see how that will help here. Actually, I now think that you simply had a typo in /etc/vfstab: someone had changed /var to "ro", and a later reboot caused your problem to occur.

As for "fsck -n", it simply provides information and I try to gather information when I don't understand something. "fsck -n" might result in anything from "all looks cool" to "file system? what file system?". Knowing the state of /var would be a help. If I'm right about the typo in /etc/vfstab, "fsck -n" will not find a problem. That would lead me to stop looking at /var. Add that to the odd read-only status of /var and my next step would be to check vfstab.

I also like to run "fsck -n" to see how bad stuff is before I run a plain "fsck". I have deeply regretted not doing that on several occasions.
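The vfstab check can be done mechanically instead of by eye. Here is a small sketch that lists any entry whose mount options (the seventh field) include "ro". The function name `find_ro_mounts` is made up for illustration, and it assumes the standard seven-field vfstab layout.

```shell
#!/bin/sh
# Sketch only: report vfstab entries that are set to mount read-only.
# ASSUMPTION: standard seven-field vfstab lines; field 3 is the mount
# point and field 7 is the comma-separated mount option list.

# find_ro_mounts FILE
#   FILE  vfstab to inspect (for example /etc/vfstab or a saved copy)
# Prints "mountpoint options" for each entry whose options include "ro".
find_ro_mounts() {
    awk '!/^#/ && NF >= 7 {
        n = split($7, opt, ",")
        for (i = 1; i <= n; i++) {
            if (opt[i] == "ro") {
                print $3, $7
                break
            }
        }
    }' "$1"
}
```

No output from `find_ro_mounts /etc/vfstab` would mean no entry asks for a read-only mount, which would point back toward the file system itself rather than a typo.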

Perderabo

12-26-2005

Quote:

Originally Posted by Perderabo

I'm not sure what version of Solaris you're using,

Solaris 8.

Quote:

but I thought that when I boot into single-user mode, only / and /usr were mounted. Even if /var was mounted, I would think that a "umount /var" would take care of that.

Nope, umount /var gave me a "mount point busy" type of message.

Quote:

I'm not sure if stuff is mirrored in single user mode. But I would resist breaking a mirror if I didn't need to. And I don't see how that will help here.

Actually, I now think that you simply had a typo in /etc/vfstab. Someone had changed /var to "ro". Later, a reboot caused your problem to occur.

Well, I can check again. We're using Explorer to get a weekly dump of the system, and I did check the copied system and vfstab just to make sure there wasn't a problem. However, I wasn't looking for that in particular, so I'll check again when I get in tomorrow.

Quote:

As for "fsck -n", it simply provides information and I try to gather information when I don't understand something. "fsck -n" might result in anything from "all looks cool" to "file system? what file system?". Knowing the state of /var would be a help. If I'm right about the typo in /etc/vfstab, "fsck -n" will not find a problem. That would lead me to stop looking at /var. Add that to the odd read-only status of /var and my next step would be to check vfstab.

I also like to run "fsck -n" to see how bad stuff is before I run a plain "fsck". I have deeply regretted not doing that on several occasions.

I just use fsck since it'll ask about each item, and I can review them as they come up. If you auto-answer 'y' or 'n', you won't be able to evaluate the problems as they come up.

I appreciate the thought on vfstab. I have found an error in another server's vfstab so there's a chance that was it. I'll check the explorer output.

I've spent a lot of time these past several months discovering problems, making repairs and whipping up scripts so they won't happen again so it wouldn't surprise me.

Thanks.

Carl

BOFH
