Quote:
Originally Posted by
anaigini45
All the servers are mission critical.
And in terms of risk management, we have an SLA of maximum 4 hours to bring the server back up in an event of a catastrophe.
Does not sound very "mission critical" to me.
If you define everything that can be down for four hours under an SLA as "mission critical", what do you call a server that costs the company 100K to 1M USD per hour while it is down?
Frankly, most people would not define a service as "MISSION CRITICAL" if it has an SLA of four hours. But then again, that depends on the "MISSION".
With an SLA of four hours, you can make a mistake and still recover from it long before the four-hour window is reached. That is more like "A STANDARD BUSINESS SLA", for lack of a better term.
Do you have a risk management team (normally part of either the IT security or the audit team) responsible for the risk management of all these servers?
If so, get them involved.
The biggest losses any company suffers are usually mistakes by well-intentioned, trusted employees. Often, these big mistakes come from trying to automate an upgrade across hundreds of devices (routers, servers, firewalls, etc.).
Best to set up a test bed, work on the changes, and get them working there first. You cannot just point "YUM" at the box and try to upgrade if the original installs were done manually. That is a formula for a lot of downtime!
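For what it's worth, a quick way to gauge how much of a box was installed outside the package manager is to ask rpm which files it does not own. The sketch below is just an illustration, not anything from this thread; the prefixes it scans (/usr/local, /opt) are assumptions you would adjust per server.

```python
# Rough sketch (assumptions: an RPM-based box, manual installs living under
# /usr/local or /opt). Files that no package owns were likely installed by
# hand and can be clobbered or orphaned by a blind yum upgrade.
import os
import subprocess

PREFIXES = ["/usr/local/bin", "/usr/local/sbin", "/opt"]  # adjust per server

def unowned_files(prefixes):
    """Yield files that rpm does not know about (i.e., likely manual installs)."""
    for prefix in prefixes:
        for root, _dirs, files in os.walk(prefix):
            for name in files:
                path = os.path.join(root, name)
                # 'rpm -qf' exits non-zero when no package owns the file
                result = subprocess.run(
                    ["rpm", "-qf", path],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
                if result.returncode != 0:
                    yield path

if __name__ == "__main__":
    for path in unowned_files(PREFIXES):
        print("not owned by any package:", path)
```

If that list comes back long, you know the server's state is not what the package database thinks it is, and a test-bed run of the upgrade is even more important.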