Outages continue to be a major problem for the IT industry; cloud providers are mostly to blame, but facilities problems and management issues play a major role
Critical systems and data centers are immeasurably more reliable than they were two or three decades ago. In most cases, problems are identified and resolved before users and customers notice. Not only has equipment become far more reliable over time, but management processes are now in place to anticipate failures or limit the consequences. In an era of cloud, distributed architectures, traffic management, and low-cost replication, IT can re-route around many failures, in some cases automatically.
There is also repeated anecdotal evidence, backed by research, that management shortcomings play a major role in these failures, and it is clear that many could, and should, have been anticipated. This calls for an approach to understanding the causes and impacts that is more holistic or service-based and less siloed, because a siloed view can allow critical vulnerabilities to be overlooked.
However, outages continue to be a major and expensive problem for the IT industry, and a combination of facilities issues is still the biggest cause of outages.
These failures have become complex, and affected services can span multiple systems and data centers. Problems with personnel training, operational processes and management often compound an initial failure. Many failures have multiple causes, or can cascade between data centers and networks, triggering secondary failures.
Public cloud-based services account for a significant number of reported service outages, with causes ranging from power failures to wide-area network issues. Although the reliability and availability of these services are generally good, their scale and complexity mean that outages are likely to have a highly visible, clear and well-recorded impact.
The increased use of outsourced services and public cloud has led to a loss of visibility, and the organisation most exposed to the risk is often not legally or technically indemnified, and sometimes not even directly involved in designing, operating, or managing the service. To address this issue, there is likely to be more pressure to increase transparency and accountability across service providers for hybrid infrastructures.