Reducing downtime in the data centre

Unplanned changes are a major source of downtime in the data centre, with more than 80% of all failures being the result of human error. Whenever you make a change, even in a controlled environment, you put your critical services at risk.

Although a lot of effort is spent improving the reliability and resilience of the IT infrastructure, the weakest link is usually the operational processes.

The most effective way to counter this threat is with documentation and training. Establishing a documentation and training programme is key to building robust processes that keep the data centre stable.

Anything that happens in a data centre, from swapping out a server to performing maintenance on a computer room air conditioning (CRAC) unit, should be well documented so that everyone understands how the task should be performed. The expected result of each procedure should also be clearly documented, so that employees quickly notice any variance from that result and can take steps to remedy it.
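To make the idea of a documented expected result concrete, here is a minimal sketch in Python of how a procedure and its expected outcome might be recorded and compared against what was actually observed. The `Procedure` class, the `find_variances` function and the check names (`power_led`, `ping`) are all hypothetical and chosen purely for illustration; a real data centre would keep this information in whatever runbook or change management tool it already uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Procedure:
    """A documented task (hypothetical structure, for illustration only)."""
    name: str
    steps: list[str]
    # Maps each post-task check to the value the documentation says to expect.
    expected_result: dict[str, str]


def find_variances(
    procedure: Procedure, observed: dict[str, str]
) -> dict[str, tuple[str, str | None]]:
    """Return each check whose observed value differs from the documented one.

    The result maps check name -> (expected, observed). A check that was
    never recorded shows up with an observed value of None.
    """
    variances = {}
    for check, expected in procedure.expected_result.items():
        actual = observed.get(check)
        if actual != expected:
            variances[check] = (expected, actual)
    return variances


# Example values are invented for this sketch.
swap = Procedure(
    name="Replace server in rack",
    steps=["Drain workloads", "Power down", "Swap hardware", "Power up"],
    expected_result={"power_led": "green", "ping": "ok"},
)
observed = {"power_led": "green", "ping": "timeout"}
print(find_variances(swap, observed))  # {'ping': ('ok', 'timeout')}
```

An empty result means the task ended as documented; anything else is a variance that the person performing the task can act on straight away.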

While human error cannot be eliminated, it can be substantially reduced with the right tools.