IT Incidents: From Alert to Remediation in 15 seconds [Webinar Recap]

Author: Guy Nadivi

Remediating IT incidents just seconds after receiving an alert isn’t merely a good performance goal to strive for. Rapid remediation can also be critical to reducing, or even preventing, downtime. That’s important, because the cost of downtime to an enterprise can be scary. Even scarier, though, is what can happen to people’s jobs if they’re found responsible for failing to prevent the incidents that caused those downtimes.

So let’s talk a bit about how automation can help you avoid situations that imperil your organization, and possibly your career.

Mean Time to Resolution (MTTR) is a foundational KPI for just about every organization. If someone asked you, “On average, how long does it take your organization to remediate IT incidents after an alert?”, what would your answer be from the choices below?

  • Less than 5 minutes
  • 5 – 15 minutes
  • As much as an hour
  • More than an hour

In an informal poll during a webinar, here’s how our audience responded:

More than half said that, on average, it takes them more than an hour to remediate IT incidents after an alert. That’s in line with research by MetricNet, a provider of benchmarks, performance metrics, scorecards and business data to Information Technology and Call Center Professionals.

Their global benchmarking database shows that the average incident MTTR is 8.40 business hours, but it ranges widely, from a high of 33.67 hours to a low of 0.67 hours. This wide variation is driven by several factors, including ticket backlog, user population density, and the complexity of the tickets handled.

Your mileage may vary, but obviously, it’s taking most organizations far longer than 15 seconds to remediate their incidents.

If that incident needing remediation involves a server outage, then the longer it takes to bring the server back up, the more it’s going to cost the organization.

Statista recently calculated the cost of enterprise server downtime, and what they found makes the phrase “time is money” seem like an understatement. According to Statista’s research, 60% of organizations worldwide reported that the average cost PER HOUR of enterprise server downtime was anywhere from $301,000 to $2 million!

With server downtime being so expensive, Gartner has some interesting data points to share on that issue (ID G00377088 – April 9, 2019).

First off, they report receiving over 650 client inquiries on this topic between 2017 and 2019, and we’re still not done with 2019. So clearly this is a topic that’s top-of-mind with C-suite executives.

Secondly, they state that through 2021, just 2 years from now, 65% of Infrastructure and Operations leaders will underinvest in their availability and recovery needs because they use estimated cost-of-downtime metrics.

As it turns out, Ayehu can help you get a more accurate estimate of your downtime costs so they’re not underestimated.

In our eBook titled “How to Measure IT Process Automation ROI”, there’s a specific formula for calculating the cost of downtime. The eBook is free to download on our website, and also includes access to all of our ROI formulas, which are fairly straightforward to calculate.
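
The eBook’s exact formula isn’t reproduced here, but to give a feel for the kind of calculation involved, here’s a rough back-of-the-envelope sketch in Python. The structure (lost revenue plus lost productivity) and every figure below are illustrative assumptions, not numbers from the eBook:

    # Rough illustration only; the eBook's formula may differ, and all figures are placeholders.
    def downtime_cost(hours_down, revenue_per_hour, employees_idled,
                      avg_hourly_wage, productivity_loss_pct):
        """Estimate the cost of an outage lasting `hours_down` hours."""
        lost_revenue = hours_down * revenue_per_hour
        lost_productivity = (hours_down * employees_idled *
                             avg_hourly_wage * productivity_loss_pct)
        return lost_revenue + lost_productivity

    # Example with made-up numbers: a 2-hour outage
    print(downtime_cost(hours_down=2, revenue_per_hour=150_000,
                        employees_idled=500, avg_hourly_wage=40,
                        productivity_loss_pct=0.75))   # -> 330000.0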

Let’s look at another data point about outages, this one from the Uptime Institute’s 2019 Annual Data Center Survey Results. They report that “Outages continue to cause significant problems for operators. Just over a third (34%) of all respondents had an outage or severe IT service degradation in the past year, while half (50%) had an outage or severe IT service degradation in the past three years.”

So if you were thinking painful outages only happen at your organization, think again. They’re happening everywhere. And as the research from Statista emphasized, when outages hit, it’s usually very expensive.

The Uptime Institute has published an even more alarming statistic.

They’ve found that more than 70% of all data center outages are caused by human error and not by a fault in the infrastructure design!

Let’s pause for a moment to ponder that. In more than 70% of cases, all it took to bring today’s most powerful high-tech infrastructure to its knees was a person making an honest mistake.

That’s actually not too surprising though, is it? All of us have mistyped a keystroke here or misclicked a mouse there. How many times has someone absent-mindedly pressed “Reply All” on an email meant for one person, then realized with horror that their message just went out to the entire organization?

So mistakes happen to everyone, and that includes data center operators. And unfortunately, when they make a mistake that leads to an outage, the consequences can be catastrophic.

One well-known example of an honest human mistake that led to a spectacular outage occurred back in late February of 2017. Someone on Amazon’s S3 team entered a command incorrectly, which took down the Amazon Simple Storage Service (S3) in the US-EAST-1 region, impacting 150,000 organizations and leading to many millions of dollars in losses.

If infrastructure design usually isn’t the issue, and 70% of the time outages are a direct result of human error, then logic suggests that the key would be to eliminate the potential for human error. And just to emphasize the nuance of this point, we’re NOT advocating eliminating humans, but eliminating the potential for human error while keeping humans very much involved. How do we do that?

Well, you won’t be too surprised to learn we do it through automation.

Let’s start by taking a look at the typical infrastructure and operations troubleshooting process.

This process should look pretty familiar to you.

In general, many organizations (including large ones) still perform most of these phases manually. The problem is that this leaves every phase of the process vulnerable to human error.

There’s a better way, however: automating much of this process, which can cut the time it takes to remediate an IT incident down to seconds. And automation isn’t just faster; by removing the potential for human error, it should also radically reduce the likelihood of an outage in your environment.

Here’s how that would work. It involves using the Ayehu platform as an integration hub in your environment. Ayehu would then connect to every system that needs to be interacted with when remediating an incident.

For example, if your environment has a monitoring system like SolarWinds, Big Panda, or Microsoft System Center, that’s where an incident will be detected first. The monitoring system (now integrated with Ayehu) will generate an alert which Ayehu will instantaneously intercept. (BTW – if there’s a monitoring system or any kind of platform in your environment that we don’t have an off-the-shelf integration for, it’s usually still pretty easy to connect to it via a REST API call.)

Ayehu will then parse that alert to determine what the underlying incident is, and launch an automated workflow to remediate it.
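
To make that concrete, here’s a minimal sketch of what forwarding an alert into an automation platform’s REST webservice could look like. The endpoint URL and payload fields are placeholder assumptions rather than Ayehu’s documented API; in practice, Ayehu’s off-the-shelf integrations handle this wiring for you.

    # Hypothetical sketch: the endpoint URL and field names are assumptions,
    # not Ayehu's documented API.
    import requests

    ALERT_ENDPOINT = "https://automation.example.com/api/alerts"  # placeholder URL

    def forward_alert(source, host, severity, message):
        """Forward a monitoring alert to the automation platform's REST webservice."""
        payload = {
            "source": source,      # e.g. "SolarWinds"
            "host": host,          # affected server
            "severity": severity,  # e.g. "critical"
            "message": message,    # raw alert text, parsed downstream by the platform
        }
        resp = requests.post(ALERT_ENDPOINT, json=payload, timeout=10)
        resp.raise_for_status()
        return resp.json()         # e.g. the ID of the workflow it triggered

    # Example: a low-disk-space alert from SolarWinds
    forward_alert("SolarWinds", "srv-db-01", "critical",
                  "Free disk space below 10% on volume C:")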

As a first step in our workflow, we’re going to automatically create a ticket in ServiceNow, BMC Remedy, JIRA, or any ITSM platform you prefer. Here again is where automation really shines over the manual approach: letting the workflow handle the documentation ensures that it gets done in a timely manner (in fact, in real time) and that it gets done thoroughly. That’s a relief for service desk staff, who are often under such a heavy workload that they don’t have the time or patience to document every aspect of a resolution properly.
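
For the curious, here’s roughly what that ticket-creation step does behind the scenes, expressed as a direct call to ServiceNow’s Table API. Within Ayehu this is a pre-built workflow activity, so you wouldn’t write this yourself; the instance name, credentials, and field values below are placeholders.

    # Illustration of the "create ticket" step as a direct ServiceNow Table API call.
    # Instance name, credentials, and field values are placeholders.
    import requests

    def create_incident(short_description, description):
        url = "https://your-instance.service-now.com/api/now/table/incident"
        resp = requests.post(
            url,
            auth=("api_user", "api_password"),            # placeholder credentials
            headers={"Content-Type": "application/json",
                     "Accept": "application/json"},
            json={
                "short_description": short_description,   # one-line summary
                "description": description,               # full alert details
                "urgency": "2",
            },
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["result"]["number"]            # e.g. "INC0012345"

    ticket = create_incident("Low disk space on srv-db-01",
                             "SolarWinds alert: free space below 10% on C:")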

The next step, which can actually be placed at any point within the workflow, is to pause execution, notify a human, and seek their approval before continuing. To illustrate why you might do this, let’s say a workflow got triggered because SolarWinds generated an alert that a server dropped below 10% free disk space. The workflow could then delete a batch of temp files, compress a bunch of log files and move them somewhere else, and do all sorts of other things to free up space. Before it does any of that, though, the workflow can be configured to require human approval for any of those steps.

The human can either grant or deny approval so the workflow can continue on, and that decision can be delivered via laptop, smartphone, email, instant messenger, or even regular telephone. However, please note that this notification/approval phase is entirely optional. You can also choose to put the workflow on autopilot and proceed without any human intervention. It’s all up to you, and either option is easy to implement.
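
Here’s a minimal sketch of that approval-gated remediation logic for the low-disk-space example. In Ayehu this is assembled from workflow activities rather than written as code, so the helper functions, paths, and the console prompt standing in for the notification channels below are all hypothetical:

    # Minimal sketch of approval-gated remediation for a low-disk-space alert.
    # The helpers and paths are hypothetical; Ayehu builds this from workflow activities.
    import glob, gzip, os, shutil

    def request_approval(step_description):
        """Placeholder for notifying an operator (email/IM/phone) and awaiting a yes/no."""
        return input(f"Approve step '{step_description}'? [y/N] ").lower() == "y"

    def delete_temp_files(path="/tmp"):
        for f in glob.glob(os.path.join(path, "*.tmp")):
            os.remove(f)

    def archive_logs(src="/var/log/app", dst="/archive/logs"):
        os.makedirs(dst, exist_ok=True)
        for f in glob.glob(os.path.join(src, "*.log")):
            gz_path = os.path.join(dst, os.path.basename(f) + ".gz")
            with open(f, "rb") as fin, gzip.open(gz_path, "wb") as fout:
                shutil.copyfileobj(fin, fout)   # compress the log into the archive
            os.remove(f)                        # then free the space it occupied

    def remediate_low_disk_space(require_approval=True):
        steps = [("Delete temp files", delete_temp_files),
                 ("Compress and move log files", archive_logs)]
        for description, action in steps:
            if require_approval and not request_approval(description):
                continue                        # operator denied this step; skip it
            action()

    remediate_low_disk_space(require_approval=True)   # set False for full autopilot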

Then the workflow can begin remediating the incident which triggered the alert.

As the remediation is taking place, Ayehu can update the service desk ticket in real-time by documenting every step of the incident remediation process.

Once the incident remediation is completed, Ayehu can automatically close the ticket.

Finally, Ayehu can go back into the monitoring system and automatically dismiss the alert that triggered the entire process.
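
As a sketch of what those last two steps involve, closing the incident can go through the same ServiceNow Table API shown earlier, while dismissing the alert is shown here against a placeholder endpoint, since each monitoring tool exposes its own acknowledge/clear API. None of this is Ayehu-specific code; it’s just an illustration of the calls the workflow makes on your behalf.

    # Sketch of the wrap-up steps: close the ticket, then dismiss the triggering alert.
    # The ServiceNow call uses its documented Table API (state/close values vary by
    # instance); the monitoring-system endpoint is a placeholder.
    import requests

    def close_incident(instance, sys_id, close_notes):
        url = f"https://{instance}.service-now.com/api/now/table/incident/{sys_id}"
        resp = requests.patch(
            url,
            auth=("api_user", "api_password"),          # placeholder credentials
            json={"state": "6",                         # 6 = Resolved in a default instance
                  "close_code": "Solved (Permanently)",
                  "close_notes": close_notes},
            timeout=10,
        )
        resp.raise_for_status()

    def dismiss_alert(alert_id):
        # Placeholder URL; substitute your monitoring system's acknowledge/clear API.
        resp = requests.post(
            f"https://monitoring.example.com/api/alerts/{alert_id}/dismiss", timeout=10)
        resp.raise_for_status()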

This, by the way, illustrates why we think of Ayehu as a virtual operator which we sometimes refer to as “Level 0 Tech Support”. A lot of incidents can be resolved automatically by Ayehu without any human intervention, and thus without the need for attention from a Level 1 technician.

This then is how you can go from alert to remediation in 15 seconds, while simultaneously eliminating the potential for human error that can lead to outages in your environment.

Gartner concurs with this approach.

In a recently refreshed paper they published (ID G00336149 – April 11, 2019) one of their Vice-Presidents wrote that “The intricacy of access layer network decisions and the aggravation of end-user downtime are more than IT organizations can handle. Infrastructure and operations leaders must implement automation and artificial intelligence solutions to reduce mundane tasks and lost productivity.”

No ambiguity there.

Gartner’s advice is a good opportunity for me to segue into one last topic – artificial intelligence.

The Ayehu platform has AI built-in, and it’s one of the reasons you’ll be able to not only quickly remediate your IT incidents, but also quickly build the workflows that will do that remediation.

Ayehu is partnered with SRI International (SRI), formerly known as the Stanford Research Institute. In case you’re not familiar with them, SRI does high-level research for government agencies, commercial organizations, and private foundations. They also license their technologies, form strategic partnerships (like the one they have with us), and create spin-off companies. They’ve received more than 4,000 patents and patent applications worldwide to date. SRI is our design partner, and they’ve designed the algorithms and other elements of our AI/ML functionality. What they’ve done so far is pretty cool, but what we’re working on going forward is what’s really exciting.

One of the ways Ayehu implements AI is through VSAs, which is shorthand for “Virtual Support Agents”.

VSAs differ from chatbots in that they’re not only conversational but, more importantly, also actionable. That makes them the next logical evolutionary step up from a chatbot. However, in order for a VSA to execute actionable tasks and be functionally useful, it has to be plugged into an enterprise-grade automation platform that can carry out a user’s request intelligently.

We deliver a lot of our VSA functionality through Slack, and we also have integrations with Alexa and IBM Watson. We’re also incorporating an MS-Teams interface, and looking into others as well.

How is this relevant to remediating incidents?

Well, if a service desk can offload a larger portion of its tickets to VSAs and provide its users with more of a self-service modality, that frees up service desk staff to automate more of the kinds of data center tasks that are tedious, repetitive, and prone to human error. And as I’ve previously stated, eliminating the potential for human error is key to reducing the likelihood of outages.

Speaking of tickets, another informal webinar poll we conducted asked:

On average, how many support tickets per month does your IT organization deal with?

  • Less than 100
  • 101 – 250
  • 251 – 1,000
  • More than 1,000

Here’s how our audience responded:

Nearly 90% receive 251 or more tickets per month. Over half get more than 1,000!

For comparison, the Zendesk Benchmark reports that among their customers, the average is 777 tickets per month.

Given the volume of tickets received per month, the current average duration it takes to remediate an incident, and most importantly the onerous cost of downtime, automation can go a long way towards helping service desks maximize their efficiency by being a force multiplier for existing staff.

Q:          What types of notifications can the VSA send at the time of incident?

A:           Notifications can be delivered either as text or speech.

Q:          How does the Ayehu tool differ from other leading RPA tools available on the market?

A:           RPA tools typically do screen automation with an agent. Ayehu, by contrast, is an agentless platform that primarily interfaces with backend APIs.

Q:          Do we have to do API programming or other scripting as a part of implementation?

A:           No. Ayehu’s out-of-the-box integrations typically only require a few configuration parameters.

Q:          Do we have an option to create custom activities? If so, which programming language should be used?

A:           The ability to create custom activity content out-of-the-box is on our roadmap.

Q:          Do out-of-the-box workflows work on all types of operating systems?

A:           Yes. You just define the type of operating system within the workflow.

Q:          How does Ayehu connect and authenticate with various endpoint devices (e.g. Windows, UNIX, network devices, etc.)? Is it password-less, does it connect through a password vault, etc.?

A:           That depends on what type of authentication the organization requires internally. Ayehu’s integration with the CyberArk password vault can be leveraged when privileged account credentials are involved. Any user credential information that is manually input into a workflow or device is encrypted within Ayehu’s database. Certificates on SSH commands, Windows authentication, and localized authentication are also all available out-of-the-box. Please contact us with questions about security scenarios specific to your environment.

Q:          What are all the possible modes that VSAs can interact with End Users?

A:           Text, Text-to-Speech, and Buttons.

Q:          Can we create role-based access for Ayehu?

A:           Yes. That’s a standard function which can also be controlled by and synchronized with Active Directory groups out-of-the-box.

Q:          Apart from incident tickets, does Ayehu operate on request tickets (e.g. on-demand access management, software requests from end-users, etc.)?

A:           Yes. The integration packs we offer for ServiceNow, JIRA, BMC Remedy, etc. all provide this capability for both standard and custom forms.

Q:          Does Ayehu provide APIs for an integration that’s not available out of the box?

A:           Yes. There are two options. You can either forward an event to Ayehu using our webservice, which is based on a RESTful API, or you can send outbound messages from within a workflow, either scheduled or event-driven. This allows you to do things such as make a database call, set an SNMP trap, handle SYSLOG messages, etc.

Q:          Does Ayehu provide any learning portal for developers to learn how to use the tool?

A:           Yes. The Ayehu Automation Academy is an online Learning Management System we recently launched. It includes exams that give you an opportunity to bolster your professional credentials by earning a certification. If you’re looking to advance your organization’s move to an automated future, as well as your career prospects, be sure to check out the Academy.

Q:          Does Ayehu identify issues like a monitoring tool does?

A:           Ayehu is not a monitoring tool like SolarWinds, Big Panda, etc. Once Ayehu receives an alert from one of those monitoring systems, it can trigger a workflow that remediates the underlying incident which generated that alert.

Q:          We have 7 different monitoring systems in our environment. Can Ayehu accept alerts from all of them simultaneously?

A:           Yes. Ayehu’s integrations are independent of one another, and it can also accept alerts from webservices. We have numerous deployments where thousands of alerts are received from a variety of sources and Ayehu can scale up to handle them all.

Q:          What does the AI in Ayehu do?

A:           AI is used in several areas: understanding user intent through chatbots, making workflow design recommendations, and suggesting workflows to remediate events via the Ayehu Brain service. Please contact an account executive to learn more.
