Author: Guy Nadivi
As most of you already know, there’s a digital transformation underway at many enterprise organizations, and it’s revolutionizing how they do business. That transformation, though, is also leading to increasingly complex and sophisticated infrastructure environments. The more complicated these environments get, the more frequently performance monitoring alerts get generated. Sometimes these alerts come in so fast and furious, and in such high volume, that they turn into alert storms, which overwhelm staff and lead to unnecessary downtime.
Since the environments these alerts are being generated from can be so intricate, this presents a multi-dimensional problem that requires more than just a single-point solution. Ayehu has partnered with LogicMonitor to demonstrate how end-to-end intelligent automation can help organizations better manage alert storms from incident all the way to remediation.
The need for that sort of best-of-breed solution is being driven by some consistent trends across IT reflecting a shift in how IT teams are running their environments, and how costly it becomes when there is an outage. Gartner estimates that:
- 75% of organizations will deploy a multi-cloud or hybrid cloud model by 2020 (September 13, 2018 – ID G00372986)
- By 2022, more than 75% of global organizations will be running containerized applications in production
- Based on industry surveys, the cost of network downtime extrapolates to well over $300K/hour
Further exacerbating the situation are the complexity of multi-vendor point solutions, workloads distributed across on-premise data centers, off-premise facilities, and the public cloud, and relentless end-user demands for highly available, secure, “always-on” services.
From a monitoring standpoint, enterprise organizations need a solution that can monitor any infrastructure that uses any vendor on any cloud with any method required, e.g. SNMP, WMI, JDBC, JMX, SD-WAN, etc. In short, if there’s a metric behind an IP address, IT needs to keep an eye on it, and if IT wants to set a threshold for that metric, then alerts need to be enabled for it.
The monitoring solution must also provide an intuitive analytical view of the metrics generated from these alerts to anyone needing visibility into infrastructure performance. This is critical for proactive IT management in order to prevent “degraded states” where services go beyond the point of outage prevention.
This is where automating remediation of the underlying incident that generated the alert becomes vital.
The average MTTR (Mean Time To Resolution) for remediating incidents is 8.40 business hours, according to MetricNet, a provider of benchmarks, performance metrics, scorecards and business data to Information Technology and Call Center Professionals.
When dealing with mission critical applications that are relied upon by huge user communities, MTTRs of that duration are simply unacceptable.
But it gets worse.
What happens when the complexities of today’s hybrid infrastructures lead to an overwhelming number of alerts, many of them flooding in close together?
You know exactly what happens.
You get something known as an alert storm. And when alert storms occur, MTTRs degrade even further because they overwhelm people in the data center who are already working at a furious pace just to keep the lights on.
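To make the idea of an alert storm concrete, here is a minimal sketch of one way to flag a storm: count how many alerts arrive inside a sliding time window. The class name, the threshold, and the window length are all hypothetical and chosen for illustration; neither LogicMonitor nor Ayehu is claimed to detect storms this way.

```python
from collections import deque


class AlertStormDetector:
    """Flags a storm when too many alerts arrive within a sliding time window.

    Hypothetical illustration only; the limits are assumptions, not vendor defaults.
    """

    def __init__(self, max_alerts: int, window_seconds: float):
        self.max_alerts = max_alerts          # alerts tolerated per window
        self.window_seconds = window_seconds  # length of the sliding window
        self._timestamps = deque()            # arrival times still inside the window

    def record(self, timestamp: float) -> bool:
        """Record one alert arrival and return True if a storm is in progress."""
        self._timestamps.append(timestamp)
        # Drop arrivals that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_alerts


detector = AlertStormDetector(max_alerts=3, window_seconds=60)
results = [detector.record(t) for t in (0, 10, 20, 30, 200)]
```

With these example settings, the fourth alert inside one minute trips the detector, and an alert arriving much later does not, because the earlier arrivals have aged out of the window.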
If data center personnel are overwhelmed by alert storms, it’s going to affect their ability to do other things.
That inability to do other things due to alert storms is very important, especially if customer satisfaction is one of your IT department’s major KPIs, as it is for many IT departments these days.
Take a look at the results of a survey Gartner conducted less than a year ago, asking respondents what they considered the most important characteristic of an excellent internal IT department.
If an IT department performed dependably and accurately, 40% of respondents considered them to be excellent.
If an IT department offered prompt help and service, 25% of respondents considered them to be excellent.
So if your IT department can deliver on those 2 characteristics, about 2/3 of your users will be very happy with you.
But here’s the rub. When your IT department is flooded with alert storms generated by incidents that have to be remediated manually, it takes you away from providing your users with dependable, accurate, and prompt service. However, if you can provide that level of service regardless of alert storms, then nearly 2/3 of your users will consider you to be an excellent IT department.
One proven way to achieve that level of excellence is by automating manual incident remediation processes, which in some cases can reduce MTTRs from hours down to seconds.
Here’s how that would work. It involves using the Ayehu platform as an integration hub in your environment. Ayehu would then connect to every system that needs to be interacted with when remediating an incident.
So for example, if your environment has a monitoring system like LogicMonitor, that’s where an incident will be detected first. And LogicMonitor, now integrated with Ayehu, will generate an alert which Ayehu will instantaneously intercept.
Ayehu will then parse that alert to determine what the underlying incident is, and launch an automated workflow to remediate that specific underlying incident.
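The parse-and-launch step can be pictured as a lookup from incident type to workflow. The sketch below is an assumption-laden illustration in plain Python: the alert fields (`type`, `host`), the workflow functions, and the `WORKFLOWS` table are all hypothetical, and it does not use or represent the actual Ayehu or LogicMonitor APIs.

```python
from typing import Callable, Dict

# Hypothetical alert payload shape; a real monitoring integration defines its own fields.
Alert = Dict[str, str]


def free_disk_space(alert: Alert) -> str:
    """Stand-in for a disk cleanup workflow."""
    return f"freeing disk space on {alert['host']}"


def restart_service(alert: Alert) -> str:
    """Stand-in for a service restart workflow."""
    return f"restarting service on {alert['host']}"


# Maps each known incident type to the workflow that remediates it.
WORKFLOWS: Dict[str, Callable[[Alert], str]] = {
    "low_disk_space": free_disk_space,
    "service_down": restart_service,
}


def handle_alert(alert: Alert) -> str:
    """Determine the underlying incident and launch the matching workflow."""
    incident_type = alert.get("type", "")
    workflow = WORKFLOWS.get(incident_type)
    if workflow is None:
        # Unknown incidents are handed to a person rather than guessed at.
        return f"no workflow for '{incident_type}'; escalating to a human"
    return workflow(alert)


outcome = handle_alert({"type": "low_disk_space", "host": "db01"})
```

Keeping the mapping in a table means supporting a new incident type is a matter of adding one entry, and anything unrecognized falls through to a human instead of triggering the wrong remediation.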
As a first step in our workflow we’re going to automatically create a ticket in ServiceNow, BMC Remedy, JIRA, or any ITSM platform you prefer. Here again is where automation really shines over taking the manual approach, because letting the workflow handle the documentation will ensure that it gets done in a timely manner, in fact in real-time. Automation also ensures that documentation gets done thoroughly. Service Desk staff often don’t have the time or the patience to document every aspect of a resolution properly because they’re under such a heavy workload.
The next step, which can actually be placed at any point within the workflow, is pausing execution to notify a human and seek approval to continue. To illustrate why you might do this, let’s say that a workflow got triggered because LogicMonitor generated an alert that a server dropped below 10% free disk space. The workflow could delete temp files, compress log files and move them somewhere else, or take other actions to free up space. Before it does any of that, though, the workflow can be configured to require human approval for any of those steps.
The human can either grant or deny approval so the workflow can continue on, and that decision can be delivered by laptop, smartphone, email, Instant Messenger, or even via a regular telephone. However, note that this notification/approval phase is entirely optional. You can also choose to put the workflow on autopilot and proceed without any human intervention. It’s all up to you, and either option is easy to implement.
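The optional approval gate can be sketched as a workflow runner that takes an approver callback. Everything here is hypothetical: `run_workflow`, the step descriptions, and the choice to stop the workflow at the first denial are assumptions made for illustration, not a description of how Ayehu implements approvals.

```python
from typing import Callable, List, Optional, Tuple

# A step is a human-readable description plus the action to perform.
Step = Tuple[str, Callable[[], None]]


def run_workflow(
    steps: List[Step],
    approver: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Run steps in order and return a log of what happened.

    If `approver` is given, it is asked before every step. If it is None,
    the workflow runs on autopilot with no human in the loop.
    """
    log: List[str] = []
    for description, action in steps:
        if approver is not None and not approver(description):
            log.append(f"denied: {description}")
            break  # assumption: a denial halts the rest of the workflow
        action()
        log.append(f"done: {description}")
    return log


performed: List[str] = []
steps: List[Step] = [
    ("delete temp files", lambda: performed.append("temp")),
    ("compress log files", lambda: performed.append("logs")),
]

autopilot_log = run_workflow(steps)
gated_log = run_workflow(steps, approver=lambda d: d != "compress log files")
```

In a real deployment the approver would be whatever channel delivers the decision (email, phone, chat); here it is just a function that returns True or False.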
Then the workflow can begin remediating the incident which triggered the alert.
As the remediation is taking place, Ayehu can update the service desk ticket in real-time by documenting every step of the incident remediation process.
Once the incident remediation is completed, Ayehu can automatically close the ticket.
And finally, it can go back into LogicMonitor and automatically dismiss the alert that triggered this entire process. This is how you can leverage intelligent automation to better manage alert storms, while also eliminating the potential for human error that can lead to outages in your environment.
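Put together, the lifecycle described above (open a ticket, remediate, document each step, close the ticket, dismiss the alert) can be sketched with in-memory stand-ins. The `Ticket` and `Alert` classes and the `remediate` function are hypothetical; a real integration would call the ITSM and monitoring systems instead of mutating local objects.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Ticket:
    """In-memory stand-in for a service desk ticket."""
    summary: str
    status: str = "open"
    notes: List[str] = field(default_factory=list)


@dataclass
class Alert:
    """In-memory stand-in for a monitoring alert."""
    alert_id: str
    message: str
    active: bool = True


def remediate(alert: Alert, steps: List[Tuple[str, Callable[[], None]]]) -> Ticket:
    """Walk one alert through the ticket lifecycle and return the closed ticket."""
    ticket = Ticket(summary=alert.message)  # 1. create the ticket
    for description, action in steps:
        action()                            # 2. perform the remediation step
        ticket.notes.append(description)    # 3. document it as it happens
    ticket.status = "closed"                # 4. close the ticket
    alert.active = False                    # 5. dismiss the originating alert
    return ticket


alert = Alert(alert_id="a1", message="free disk space below 10% on db01")
ticket = remediate(
    alert,
    [
        ("deleted temp files", lambda: None),
        ("compressed and moved log files", lambda: None),
    ],
)
```

This sketch deliberately omits failure handling: if a step raises an exception, the ticket is never closed and the alert stays active, which is where a production workflow would escalate to a person.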
Gartner concurs with this approach.
In a recently refreshed paper (ID G00336149 – April 11, 2019), one of their Vice-Presidents wrote: “The intricacy of access layer network decisions and the aggravation of end-user downtime are more than IT organizations can handle. Infrastructure and operations leaders must implement automation and artificial intelligence solutions to reduce mundane tasks and lost productivity.”
No ambiguity there.