Servers lacked maintenance, here's my to-do list

01-06-2013

I'll be taking over administration of a rack of Solaris machines that haven't had an admin for the last 9 months. Prior to that they had limited maintenance. I understand there are a few tickets that will need to be addressed, but I won't have the details on them for a few days. Regardless, I'm trying to compile a to-do list. What would you add to this list?

Check for hardware failures (disks, fans, PSUs, etc.) and repair as needed
Ensure backups are being taken and are restorable
Snapshot filesystems
Check who has permissions to access these servers, internally and externally. Verify they all should have access.
Reset the root passwords, and check who else may have root access via sudo, powerbroker (if used), or uids.
Check all installed packages for exploits, update as needed
Verify I have account access to the SC/SP/ALOM/ILOM over serial console. If I don't have access, look into resetting the password.
Set up monitoring that notifies me of issues immediately.
Identify critical apps, machines, etc... and prioritize them for support
Acquire Oracle Support agreement details so if/when I need them, I have ready access.
Check the cron tables on each system as well, just to see what the prior admins have tried to automate (system admin related or application related).
Check the messages file on each system as well to catch any other issues that may have been written via syslog.
Review logs specifically with a view to what has happened before/after reboots to return the server to the expected state.
Check /var/crash/<hostname> to see if and when the server last panicked.
Check if startup and shutdown of applications is implemented well and if it is automatic or manual
Check for possible dependencies on other systems. Track incoming and outgoing traffic if needed to check dependencies.
Check external hardware, for example NAS / SAN Disk Arrays, Network and SAN-switches, UPS, Airco, etc...
Try to track down documentation and, if possible, reports of past changes and logs; if not available, see if I can interview a former admin.
Acquire a test system so I can try stuff out.
Make a runbook.
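
For the root-access item, here is a minimal Python sketch of the UID check. It assumes the standard colon-separated passwd(4) layout and only reads the local /etc/passwd, so accounts served from NIS or LDAP would not show up. The function name uid_zero_accounts is made up for illustration.

```python
def uid_zero_accounts(passwd_text):
    """Return login names whose UID field is 0 in passwd(4)-format text."""
    names = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) < 7:
            continue  # malformed entry, ignore it here
        try:
            uid = int(fields[2])
        except ValueError:
            continue  # non-numeric UID field
        if uid == 0:
            names.append(fields[0])
    return names


if __name__ == "__main__":
    # Assumption: local accounts only, read straight from /etc/passwd.
    with open("/etc/passwd") as fh:
        for name in uid_zero_accounts(fh.read()):
            print(name)
```

Anything other than root in the output is worth questioning. sudo and PowerBroker rules still need a separate review, since they grant root without a UID of 0.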

What would you add/change/remove on this list? Thanks in advance for your help.

Last edited by DustinT; 01-07-2013 at 09:36 AM.. Reason: Added prioritize. Added Scrutinize and Bryan's tips. Attempted to prioritize.

DustinT

01-06-2013

You should also get some information on what kind of applications are running there and which servers are critical - so they should be handled first.

bartus11

01-06-2013

Quote:

Originally Posted by bartus11

You should also get some information on what kind of applications are running there and which servers are critical - so they should be handled first.

Yes, I'll add that to the written list. I believe this is just a single rack of Solaris equipment so I don't expect it will be too hard to hit the priorities.

DustinT

01-06-2013

Also maybe:

Try to track down documentation and, if possible, reports of past changes and logs; if not available, see if you can interview past admins. It is really nice to be confident that systems will come back up without problems if rebooted.
Check if startup and shutdown of applications is implemented well and if it is automatic or manual.
Check for possible dependencies on other systems. Track incoming and outgoing traffic.
Also check external hardware, for example NAS / SAN Disk Arrays, Network and SAN-switches, UPS, Airco, etc.
Acquire a test system so you can try stuff out.

Scrutinizer

01-06-2013

Brainstorming a bit here...

Check the cron tables on each system as well, just to see what the prior admins have tried to automate (system admin related or application related).

I'd also verify you have account access to the SC/SP/ALOM/ILOM over serial console; having this information handy will go a long way if a critical server goes down. If you don't have access, look into resetting the password.

You hit on user access, but to expand that, reset the root passwords, and check who else may have root access via sudo, powerbroker (if used), or uids.

Check the messages file on each system as well to catch any other issues that may have been written via syslog.

Oh yeah, check /var/crash/<hostname> to see if and when the server last panicked.
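
A rough Python sketch of the crash-directory check, assuming savecore writes its usual unix.N, vmcore.N, or vmdump.N files under /var/crash/<hostname>. The dump location can be changed with dumpadm, so treat the path as an assumption; the helper name list_crash_dumps is hypothetical.

```python
import os
import socket
import time

# File name prefixes savecore normally uses for saved crash dumps.
DUMP_PREFIXES = ("unix.", "vmcore.", "vmdump.")


def list_crash_dumps(crash_dir):
    """Return (filename, mtime) pairs for dump files, newest first."""
    if not os.path.isdir(crash_dir):
        return []  # no directory means no saved dumps in this location
    found = []
    for name in os.listdir(crash_dir):
        if name.startswith(DUMP_PREFIXES):
            path = os.path.join(crash_dir, name)
            found.append((name, os.path.getmtime(path)))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumption: the default savecore directory layout is in use.
    crash_dir = os.path.join("/var/crash", socket.gethostname())
    for name, mtime in list_crash_dumps(crash_dir):
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
        print(stamp, name)
```

The file modification time only approximates when the panic happened, since savecore runs on the following boot. An empty result does not prove there were no panics if savecore was disabled.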

bryanNJ

01-07-2013

Quote:

Originally Posted by Scrutinizer

These are some excellent tips. I'll be adding all of them to my to-do list. It's starting to seem that I may not get through them as quickly as I hoped. Oh well, job security, I suppose. I honestly don't see how I could skip any of the steps. They're all critical things that could take down the system.

Regarding the test system, there's a large VMware cluster. At a minimum, I'll use that to provide a test environment. Because of the size of the environment, I tend to think there are no unused servers, but I'll look for one.

I will have the chance to interview a former admin. I'll try to find out if there's a log, change log, etc... If not, I'll press for details. Any specific questions or terms I might want to use?

---------- Post updated at 09:38 PM ---------- Previous update was at 09:30 PM ----------

Quote:

Originally Posted by bryanNJ

Great brainstorm, man. There are some great security tips in here. I'll have to add all these in too. At some point, I may have to get some help in prioritizing these. It's a good problem to have, I suppose. I want my client to get their money's worth.

---------- Post updated 01-07-13 at 08:47 AM ---------- Previous update was 01-06-13 at 09:38 PM ----------

I cross-posted this on Oracle's forums and got a nice tip: take a snapshot. Another was to review the logs around reboots, looking for anything unusual that had to be done to return the server to its expected state.
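
The snapshot tip might look something like this in Python, assuming the filesystems are on ZFS. The thread does not say which filesystem is in use, and UFS would need fssnap instead. The pool name rpool and the pre-maint label format are guesses for illustration only. The script just prints the command unless --run is given.

```python
import subprocess
import sys
import time


def snapshot_command(dataset, label=None, recursive=True):
    """Build the argv list for `zfs snapshot` without running it."""
    if "@" in dataset:
        raise ValueError("pass the dataset name without an @snapshot suffix")
    if label is None:
        # Hypothetical naming scheme: timestamped pre-maintenance label.
        label = time.strftime("pre-maint-%Y%m%d-%H%M%S")
    cmd = ["zfs", "snapshot"]
    if recursive:
        cmd.append("-r")  # also snapshot all descendant datasets
    cmd.append("{0}@{1}".format(dataset, label))
    return cmd


if __name__ == "__main__":
    cmd = snapshot_command("rpool")  # assumption: root pool is named rpool
    if "--run" in sys.argv[1:]:
        subprocess.check_call(cmd)  # needs root or delegated ZFS rights
    else:
        print("dry run:", " ".join(cmd))
```

A snapshot lives in the same pool as the data, so it guards against a bad change but not against losing the disks. It does not replace the backup check on the list.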

DustinT

01-08-2013

Well, I'd just like to say thanks for everyone's help. You have been most helpful.

DustinT
