Administrator responsibilities, in case of power outage?


 
# 1  
Old 01-24-2011

Hi guys,

I was wondering if you could share some of your knowledge about handling a power outage.
Let's presume you are on duty and you get a call at midnight because half of your cabinets have no power, the air conditioning is down and you are dealing with a ton of 500 error messages on your boxes.

What would you do in this situation? From my very limited experience, I would do this:
Make sure all vital boxes with sensitive data get hooked up to a UPS ASAP, so they can be shut down gracefully. Once the power supply is restored, I would check each system for errors and restore any corrupted data from backup.

I would appreciate it if you could give me an example of how you would deal with this situation in a more appropriate manner. My goal is to find out what you would do before the power issues are solved. Thanks for sharing your experience.
# 2  
Old 01-24-2011
UPSes are nearly always essential - even small ones can make the difference between a system shutting down gracefully and just turning off. (I've found in the past that if you calculate the downtime of the system and the cost of re-installing, including your own time spent doing that, then you tend to justify a UPS on nearly all equipment.)

Transactional filesystems can improve things when hardware has an abrupt power failure, but you can't rely on that. I have also found that network equipment is often forgotten when spec'ing up UPSes - services such as DNS, network shared filesystems and the like can stop systems shutting down in a timely manner if the network has just been turned off.

Make sure that systems with databases have a large UPS, as they can take a while to sync their disks and stop. I found that Active Directory and Exchange servers can take ages and ages to stop, so they can need a long-running UPS. With machines that host virtual machines, you can often get the virtual machines to "suspend" instead of shutting down - this can make the overall shutdown of the host system quicker.

My last tip is to get the UPSes to check their batteries regularly - I've too often found UPSes whose batteries have degraded to the point that they are useless.

I generally feel that if I am at the point of restoring a system image, then I have failed in my emergency measures. So although that is obviously the most important backup measure, I would try to make sure you never have to use it.
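
To make the downtime-versus-UPS calculation mentioned above a bit more concrete, here is a minimal Python sketch of the arithmetic. The names `OutageCost` and `ups_pays_off` and every number in it are made-up placeholders, so plug in your own estimates; it only compares an expected loss against a purchase price and ignores things like battery replacement.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    """Rough cost of one ungraceful power loss on a single system (placeholder model)."""
    downtime_hours: float          # how long the service stays unavailable
    downtime_cost_per_hour: float  # lost revenue or productivity per hour
    rebuild_hours: float           # admin time spent re-installing and restoring
    admin_rate_per_hour: float     # cost of that admin time

    def total(self) -> float:
        downtime = self.downtime_hours * self.downtime_cost_per_hour
        rebuild = self.rebuild_hours * self.admin_rate_per_hour
        return downtime + rebuild


def ups_pays_off(outage: OutageCost, ups_price: float,
                 outages_per_year: float, years: float) -> bool:
    """True when the expected outage cost over the UPS lifetime exceeds its price."""
    expected_loss = outage.total() * outages_per_year * years
    return expected_loss > ups_price


# Hypothetical figures: 6 h of downtime at 200/h plus 8 h of rebuild work at 50/h.
example = OutageCost(downtime_hours=6, downtime_cost_per_hour=200,
                     rebuild_hours=8, admin_rate_per_hour=50)
print(example.total())  # 6*200 + 8*50 = 1600
print(ups_pays_off(example, ups_price=800, outages_per_year=0.5, years=4))
```

With these invented numbers, one bad outage every two years already costs more than the UPS, which is the kind of result that tends to convince management.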

I hope some of these points help in your UPS decisions...
This User Gave Thanks to citaylor For This Post:
# 3  
Old 01-24-2011
UPSes with degraded batteries can be worse than useless; they might forget their state and stay off after an extended power outage is fixed! I had to drive 250km to swap one stupid box over that once...
This User Gave Thanks to Corona688 For This Post:
# 4  
Old 01-24-2011
Once the electricity issues are dealt with, what would you do next? Suppose you reboot several boxes and their services simply refuse to start properly (deadlocks etc.). I'm also trying to find out how I should deal with a situation where several essential boxes cannot be started, for whatever reason.

I presume I could investigate why the services don't start, beginning with a disk check and ending with data integrity (i.e. service reinstall, database restore, etc.)?
# 5  
Old 01-24-2011
I would use this as a reason to make management aware that every vital system needs a UPS and a regimented backup and recovery process.
These 2 Users Gave Thanks to mark54g For This Post:
# 6  
Old 01-24-2011
But you will still be stuck fixing the issues at midnight... while your boss is sleeping like a baby. :)
# 7  
Old 01-24-2011
Well, I guess that electrical issues would be the biggest. If the boxes shut down gracefully, there is no obvious reason why they wouldn't then boot again gracefully.

First, I guess you'd have to identify which boxes are down (boxes on a UPS may have survived, or some may be down depending on the UPS battery time). I would recommend keeping a list of boxes so you can check whether each is up or down. An SNMP system may be useful for checking on hosts, network appliances and services.

Then, once you have identified which services need to be started, you need to work out in what order (network switches, DNS, DHCP, SAN/NAS boxes, Active Directory, file servers, etc.). Make sure they have booted OK before you move on to secondary services. Create a document detailing the definitive order in which to boot these services and how you would test that they are working. Once they are up, list the secondary services you would need to reboot and how to test that they are working.

On UNIX hosts check /var/log/messages (or the appropriate syslog entries); on Windows check Event Viewer to confirm that everything is running OK. To be honest, you can't really second-guess why services may be down, so it is hard to preempt that. You should make sure you have all the necessary documentation, including error messages, for all the services you are trying to run, so that in an emergency you can find it quickly.

You could build a plan for what you would do in the event that a piece (or multiple pieces) of hardware has failed, e.g. spare hardware, restore documentation, etc. Keep a telephone list of people who may be called upon to fix hardware or software services in an emergency. Keep a list of hardware serial numbers, contracts, SLAs and emergency callout telephone numbers for hardware and software vendors, so that you can call them in an emergency to get things fixed.

Virtual machines are very useful, as you can have 2 or more host machines with standby virtual images containing up-to-date backups that can be started in the event that a given piece of hardware has died. VMware, for example, allows you to create pools of virtual machine hosts that can take over functionality easily and quickly should one fail.

Otherwise, I would get a book on the subject or google the subject as a whole, as I'm sure there are major areas I've missed. I hope this helps...
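
As a rough illustration of the "list of boxes" and "definitive boot order" ideas above, here is a minimal Python sketch that probes services tier by tier and stops at the first tier with a failure. Everything in the inventory (names, IP addresses, ports, tier numbers) is hypothetical, and a plain TCP connect only shows that something is listening, not that the service is healthy; an SNMP or monitoring system would give a fuller picture.

```python
import socket
from typing import NamedTuple


class Check(NamedTuple):
    tier: int   # lower tiers must be healthy before higher ones are worth checking
    name: str
    host: str
    port: int   # TCP port used as a crude "is it listening" probe


# Hypothetical inventory; replace with your own documented boot order.
CHECKS = [
    Check(1, "core switch mgmt", "10.0.0.2", 22),
    Check(2, "DNS", "10.0.0.10", 53),
    Check(3, "NAS (NFS)", "10.0.0.20", 2049),
    Check(4, "file server (SMB)", "10.0.0.30", 445),
]


def is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def first_failing_tier(checks, probe=is_up):
    """Probe tier by tier; return (tier, failed_names) for the first tier
    with failures, or None if every check passes."""
    for tier in sorted({c.tier for c in checks}):
        failed = [c.name for c in checks
                  if c.tier == tier and not probe(c.host, c.port)]
        if failed:
            return tier, failed
    return None
```

Calling `first_failing_tier(CHECKS)` returns something like `(2, ["DNS"])` when DNS is the earliest thing still down, which tells you where to start working before looking at anything that depends on it. The `probe` parameter is there so the ordering logic can be tried out with a fake probe, without touching the network.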