01-24-2011
Join Date: Jul 2010
Last Activity: 10 July 2018, 5:08 PM EDT
Location: Somerset, UK
Posts: 406
Thanks Given: 0
Thanked 72 Times in 70 Posts
Well, I guess that electrical issues would be the biggest. If the boxes shut down gracefully, there is no obvious reason why they wouldn't then boot again gracefully.
First, I guess you'd have to identify which boxes are down (boxes on a UPS may have survived, or some may be down depending on the UPS battery time). I would recommend keeping a list of boxes to check whether they are up or down. An SNMP-based monitoring system may be useful for checking on hosts, network appliances and services.

Once you have identified which services need to be started, you need to identify in what order (network switches, DNS, DHCP, SAN/NAS boxes, Active Directory, file servers, etc.). Make sure they booted OK before you move on to secondary services. Create a document detailing how you would test these services to make sure they are working, and the definitive order in which to boot them. Once they are up, list the secondary services you would need to restart and how to test that they are working. On UNIX hosts check /var/log/messages (or the appropriate syslog entries); on Windows check Event Viewer to confirm that everything is running OK.

To be honest, you can't really second-guess why services may be down, so it is hard to pre-empt that. You should make sure you have all the necessary documentation, including error messages for all the services you are trying to run, so that in an emergency you can find it quickly. You could build a plan for what you would do if a piece (or multiple pieces) of hardware has failed, e.g. spare hardware, restore documentation, etc.

Keep a telephone list of people who may be called upon to fix hardware or software services in an emergency. Keep a list of hardware serial numbers, contracts, SLAs and emergency callout telephone numbers for hardware and software vendors, so that you can call them in an emergency to get things fixed.

Virtual machines are very useful, as you can have two or more host machines with standby virtual images containing up-to-date backups that can be started in the event that a given piece of hardware has died.
VMware, for example, allows you to create pools of virtual machine hosts that can take over functionality easily and quickly should one fail. Otherwise I would get a book on the subject or Google the subject as a whole, as I'm sure there are major areas I've missed. I hope this helps...
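
To make the "check what is up, in boot order" idea a bit more concrete, here is a rough Python sketch. It is only an illustration under some assumptions: the host names, ports and tier names are made-up placeholders, and a plain TCP connect is used as the "is it up" test (a real site might use SNMP or a monitoring tool instead, and switches usually need a different kind of check). It walks the tiers in order and stops at the first tier where something does not answer, so you fix core services before looking at secondary ones.

```python
import socket
from typing import Callable, List, Tuple

Endpoint = Tuple[str, int]
Tier = Tuple[str, List[Endpoint]]

# Hypothetical inventory, listed in the order services must come up.
# Host names and ports are placeholders, not from any real site.
BOOT_TIERS: List[Tier] = [
    ("dns", [("dns1.example.internal", 53)]),
    ("storage", [("nas1.example.internal", 2049)]),
    ("directory", [("ad1.example.internal", 389)]),
    ("file servers", [("files1.example.internal", 445)]),
]


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def check_tiers(
    tiers: List[Tier],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Tuple[List[str], List[Endpoint]]:
    """Check tiers in order and stop at the first tier with a failure.

    Returns (verified_tier_names, down_endpoints), where down_endpoints
    holds the endpoints that did not answer in the first failing tier.
    """
    verified: List[str] = []
    for name, endpoints in tiers:
        down = [(h, p) for h, p in endpoints if not probe(h, p)]
        if down:
            return verified, down
        verified.append(name)
    return verified, []


# Example use (performs real network checks against the placeholder hosts):
#   verified, down = check_tiers(BOOT_TIERS)
#   print("tiers verified:", verified)
#   for host, port in down:
#       print(f"DOWN: {host}:{port}")
```

The probe is passed in as a parameter so the same ordering logic can be reused with a different check (ping, SNMP, an HTTP request) without rewriting the loop.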