There’s a humorous old dictum in the IT industry that “To err is human, but to really foul things up you need a computer.” When you combine the two, however – human error PLUS computers – then you can achieve catastrophic disruptions of truly epic proportions, like the massive AWS outage suffered by Amazon in early 2017.
The Uptime Institute estimates that human error lies behind about 70 percent of the problems plaguing data centers today. Seventy percent. Think about that. In 70 percent of cases, all it took to bring today’s most powerful high-tech to its knees was just one person making an honest mistake.
A mistyped keyboard stroke here. An erroneous mouse click there. We’ve all done it. Who among us hasn’t absent-mindedly pressed “Reply All” to an email meant for one person, then realized with horror that the errant message went organization-wide?
The people working in data centers are no different from you and me. They’re human and therefore just as vulnerable to those same kinds of errors, if not more so, especially given the number of systems they have to interface with, the profusion of processes they’re responsible for and the sheer monotony of their workload. Together these dynamics congeal into a brain-fog of tedium that can easily cause one to lose focus and make mistakes.
Where data center operators do primarily differ from us is that they often run the same mission critical jobs over and over, performing their tasks flawlessly hundreds, thousands, even tens of thousands of consecutive times. At some point though, that perfect streak comes to an end. It always does, because humans are fallible, and not even the best data center operators are 100 percent perfect 100 percent of the time.
In fact, when you think about it, it’s a wonder the rate of outages due to human error isn’t even higher than it is.
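The arithmetic behind that inevitability can be sketched in a few lines of Python. This is only an illustration under two simplifying assumptions that do not come from any study cited here: each run of a job is independent of the others, and the chance of an operator mistake on any single run is a small constant. The function name and the one-in-a-thousand rate are hypothetical.

```python
def prob_at_least_one_error(per_run_error_rate: float, runs: int) -> float:
    """Chance of one or more mistakes across `runs` independent executions.

    Assumes a constant, independent per-run error rate (an illustrative
    simplification, not a measured figure).
    """
    if not 0.0 <= per_run_error_rate <= 1.0:
        raise ValueError("per_run_error_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # The probability of a flawless streak is (1 - p) ** runs,
    # so the probability that the streak breaks is its complement.
    return 1.0 - (1.0 - per_run_error_rate) ** runs


if __name__ == "__main__":
    # Hypothetical operator who slips on 1 run in 1,000.
    for runs in (100, 1_000, 10_000):
        chance = prob_at_least_one_error(0.001, runs)
        print(f"{runs:>6} runs: {chance:.1%} chance of at least one error")
```

Under these assumptions, even a very reliable operator becomes close to certain to make at least one mistake once the number of repetitions reaches the tens of thousands.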
If you’re in upper management, the probability of downtime caused by human error should worry you. Not just because of the resulting damage and its consequential costs, but because of the repercussions that inevitably follow the outage. After the shock and anger subside, the finger pointing begins, thanks to impeccable 20/20 hindsight. Once the blame game starts, it usually doesn’t end until reputations are ruined and secure jobs are lost. Many promising corporate careers have been terminated by preventable disasters like these.
So what’s upper management to do about the potential for IT failures in their data center caused by human error?
Optimizing existing procedures or building in procedural circuit breakers to prevent wider damage are strategies that ultimately ignore the root cause of the issue – the fact that humans performing repetitive and often laborious manual processes inevitably commit errors. Therefore, the key to drastically reducing the potential for outages caused by human error is to eliminate the potential for human error in the first place by automating the part that humans do.
It’s the exact same logic behind the push for self-driving automobiles.
Various studies have pegged human error as the cause of anywhere between 90 and 95 percent of all traffic accidents. According to McKinsey, widespread use of autonomous vehicles could “eliminate 90 percent of all auto accidents in the United States, prevent up to $190 billion in damages and health-costs annually and save thousands of lives.”
The inescapable conclusion many are reaching is that eliminating the potential for human errors also eliminates big problems and the big costs associated with them. The most practical and cost-effective way of eliminating human errors is with automation.
In the NOCs and SOCs where headline-making outages occur, this revelation may initially be met with concern and even resistance. After all, your data center operators have spent many years honing their skill sets and developing their expertise in order to run the mission critical systems your organization relies upon. They may not take kindly to the notion that much of what they’ve done all these years can now be done more accurately with software.
In due time, however, these same operators will realize that they stand to benefit from automation as much as anyone. Automation frees up data center personnel from tedious, repetitive, mind-numbing processes (that many of them secretly despise) so they can focus on more strategic and challenging work that will allow them to make better use of their skills and expertise. This realignment will also likely improve their level of job satisfaction since they’ll be contributing to the organization at a higher level. As one of our customers once put it, “If they spend less time staring at blinking lights then they can spend more time on higher value projects.” Happier employees mean lower turnover and the savings that come with it.
Furthermore, combining automation with the knowledge and skills of your best operators will make them far more productive and vastly more effective. In other words, automation is a force multiplier for your data centers, as well as being a highly effective risk mitigation tool.
When contemplating automation, upper management should consider one more factor – the competition. Many of your competitors have already embraced automation as a risk mitigation strategy to eliminate service disruptions caused by human error. As they automate more and more of their operations, they not only become less prone to outages, but their businesses also become more efficient and much more scalable, potentially leaving your organization at a competitive disadvantage.
If you’re ready to take the plunge and give automation a try, here’s a high-level view of how we recommend you begin:
- Identify the critical processes most vulnerable to human error, and those which would be most costly to the organization should an operator make a mistake leading to a disruption.
- Document these processes, paying particular attention to the parts involving potential for human error.
- Automate these processes using an enterprise-strength orchestration tool which has a proven track record in mission critical environments (like Ayehu).
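The first step in the list above, deciding which processes to tackle first, can be roughed out with a simple ranking. The Python sketch below is one possible way to do it, not a prescribed method and not part of any product: it scores each process by how often it runs, how likely an operator is to slip on a given run, and what a resulting disruption would cost. The `Process` class, its field names, the scoring formula and every figure in the example are hypothetical placeholders to be replaced with your own estimates.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Process:
    """A manually executed process and rough estimates about it (all hypothetical)."""
    name: str
    runs_per_month: int
    error_rate: float       # estimated chance of an operator mistake per run
    cost_per_error: float   # estimated cost of the disruption a mistake causes

    @property
    def expected_monthly_loss(self) -> float:
        # Frequency x likelihood x impact: a crude expected-cost score.
        return self.runs_per_month * self.error_rate * self.cost_per_error


def rank_for_automation(processes: Iterable[Process]) -> List[Process]:
    """Return processes ordered from highest to lowest expected monthly loss."""
    return sorted(processes, key=lambda p: p.expected_monthly_loss, reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only; substitute estimates from your own data center.
    candidates = [
        Process("password reset", 900, 0.001, 50),
        Process("firewall rule change", 60, 0.01, 40_000),
        Process("database failover drill", 4, 0.02, 250_000),
    ]
    for p in rank_for_automation(candidates):
        print(f"{p.name:<25} expected monthly loss: {p.expected_monthly_loss:>10,.0f}")
```

A ranking like this is only a starting point for the documentation step that follows; the processes at the top of the list are the ones whose error-prone steps deserve the closest attention.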
Why let costly errors rob your organization of efficiency, employee retention and competitive advantage? Check out automation in action and see how it can become a force multiplier for your enterprise by launching your demo today.
About the Author: Guy Nadivi is the Sr. Director of Customer Success for Ayehu, and the first employee hired by the company in North America. Having previously served in numerous roles for Ayehu, Guy now leads the customer success initiative, which has emerged as a leading customer success program among all automation vendors. Previously, Guy founded three technical consulting firms, one of which was acquired by a publicly traded NASDAQ company. He has authored numerous articles of both a business and technical nature, for Forbes, The Jerusalem Post, Lotus Notes Advisor, and others. Guy received a Bachelor of Science degree from California State University, Northridge.