How to Predict and Remediate IT Incidents Before They Affect Business Outcomes [Webinar Recap]

Author: Guy Nadivi

The ability to proactively predict and remediate IT incidents BEFORE they occur, rather than react to them after they’ve already happened, is one of the key value propositions of a new IT operations category called AIOps, which stands for Artificial Intelligence for IT Operations.

Leveraging the AI part of AIOps to mitigate issues before they become problems is a game changer for IT. So we’ve partnered with Loom Systems, who, like ourselves, are a Gartner Cool Vendor in their category, to demonstrate how two best-of-breed providers can integrate their respective platforms to create an enterprise-grade AIOps solution. In doing so, we believe the result is an early glimpse at the self-healing data center of tomorrow, and we think you’ll be intrigued to experience how you can peek over the horizon to see and automatically remediate incidents before they impact end-users.

Let’s start with the obvious question many of you might have on your mind – what is AIOps? It is, after all, a term that kind of snuck up on all of us.

The term AIOps, like a lot of buzzwords in our industry, was originated by Gartner. In this case, a Sr. Director Analyst named Colin Fletcher coined it in 2016, and its earliest published appearance (as best I can tell) was in early 2017.

Interestingly though, Colin told me he originally meant the term to refer to Algorithmic IT Operations.

Since then it’s evolved to refer to Artificial Intelligence for IT Operations.

Now we all know how it is in IT marketing. New buzzwords are used to refresh a category and create excitement. So is AIOps basically just a recycling of the term “IT monitoring”? Are IT monitoring and AIOps basically the same? Twins, so to speak, but with different names?

Here’s the definition for IT Monitoring, courtesy of an internet publication many of you are probably aware of called TechTarget:

  “IT monitoring is the process to gather metrics about the operations of an IT environment’s hardware and software to ensure everything functions as expected to support applications and services. Basic monitoring is performed through device operation checks, while more advanced monitoring gives granular views on operational statuses, including average response times, number of application instances, error and request rates, CPU usage and application availability.”

The operative words there are “gather metrics” – “through device operation checks”.

This reflects one of the primary characteristics of IT Monitoring – namely that it’s passive in nature.

And here’s Colin Fletcher’s original definition for AIOps:

“AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight. AIOps platforms enable the concurrent use of multiple data sources, data collection methods, analytical (real-time and deep) technologies, and presentation technologies.”

Unlike IT Monitoring, AIOps is proactive and far more sophisticated. So AIOps is a LOT MORE than just IT Monitoring.

At this point you may be asking yourself, “OK, but how can this benefit me?”

As we all know, in today’s Digital Era, most businesses are digital or undergoing a digital transformation, which means that IT systems are replacing many traditional physical business processes, and that in turn means more work for IT Operations.

In fact, IT Operations engineers have become responsible for the customers’ digital experience. When your organization’s systems are misbehaving, underperforming, or, worse, not working at all, your customers’ satisfaction is affected, which often leads to customer churn.

It’s that simple.

End users often use applications or websites and love how simple and intuitive they can be. In IT though, we all know that building something to look nice and simple can actually be quite difficult. That’s because there are usually many technologies under the hood that need to work together seamlessly in order for these digital experiences to run smoothly.

As if that wasn’t enough, let’s add some more complexity:

With Cloud Computing on the one hand, and Microservices architectures on the other, things become even more complex, for the following reasons:

  1. Cloud computing means abstraction, which can make it hard to understand how a performance issue on one host will affect other components of your applications.
  2. These environments change dynamically, making it harder to stay on top of everything.
  3. Microservices often require disparate data sources, each generating its own logs and metrics, making tracing and correlation an inherent part of root cause analysis (RCA).

So, the increased complexity of digital business architectures, coupled with the explosion of different data types and the elevated expectations consumers have these days for seamless end user experiences, makes the life of IT Operations teams quite challenging.

Enter AIOps.

AIOps is a set of tools that enable achievement of optimum availability and performance by leveraging machine learning technologies against massive data stores with wide variance. The big idea here is to use machines to deal with machines.

Here are some examples of the challenges customers often look to address by implementing AIOps:

  • Outage prevention – organizations in the process of cloud migration or architecture change often look for modern technologies like AIOps to help them prevent outages before the business is affected. This is a marked difference from 2 years ago, when the market was just focused on noise reduction. Artificial intelligence and machine learning have raised expectations of how much more is possible.
  • Capturing different data feeds – this means it’s not just about alerts anymore. There’s a huge need to consolidate logs, metrics, and events together, and to make sense out of them as a whole.
  • Consolidation of tools – this one is mainly about the workflow of the users. They’d like AIOps to make their daily lives easier and consolidate everything into one system.

A monitoring architecture for modern enterprises that can do all of the above would be a real-life example of a self-healing architecture.

Everything starts with observability. Many enterprises use one or more infrastructure monitoring tools. Application Performance Management (APM) monitors do a great job of monitoring performance, but they are very limited when it comes to the rest of the application stack and log management, which makes them less helpful for triage and forensic investigations.

These monitoring tools are usually focused on specific data feeds or IT layers, and they emit alerts when things go wrong. However, these can lead to confusing alert storms.

This is another reason why organizations are beginning to leverage AIOps to work for them and make sense out of it all. Think of AIOps as a robot that turns monotonous data into information you cannot ignore. In our case, turning logs into predictions or early stage detection of an outage.
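To make "turning logs into predictions" a little more concrete, here is a deliberately simple sketch. It is not how Loom Systems or any other AIOps product works internally; commercial platforms rely on far richer machine learning. It only assumes that logs have already been reduced to a per-minute error count, and it flags any minute that deviates sharply from the preceding few. The function name, window size, threshold, and sample data are all illustrative assumptions.

```python
from statistics import mean, stdev


def detect_anomalies(counts, window=5, threshold=3.0):
    """Return the indices whose value deviates from the trailing
    `window` values by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if counts[i] != mu:
                anomalies.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical error counts per minute, extracted from application logs.
errors_per_minute = [2, 3, 2, 4, 3, 3, 2, 19, 3, 2]
print(detect_anomalies(errors_per_minute))  # [7]
```

The spike at index 7 stands out against its quiet baseline, which is the kind of early signal that could be handed to an automation platform before users notice anything.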

Now that you know something is about to break, can you prevent it from happening? That’s exactly the idea of self-healing. When working with an intelligent automation platform like Ayehu, you can build simple (or complex) remediation workflows that take the alert from Loom Systems and automatically remediate the incident BEFORE it becomes something more calamitous.

In your monitoring architecture, you want the Automation tool to seamlessly interact with both the AIOps solution and your ITSM platform, to open a ticket and update it as you’re taking remedial action.

When configured properly, this architecture can resolve issues before they affect the business, while also documenting what happened for future reference.
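The flow described above (alert arrives, ticket opens, remediation runs, ticket is updated) can be sketched in a few lines. This is a toy model under stated assumptions, not the Ayehu, Loom Systems, or any ITSM vendor's API: `Alert`, `InMemoryItsm`, `REMEDIATIONS`, and the two remediation functions are hypothetical names invented for illustration, and the "remediations" only return a string instead of touching a real host.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical shape of a prediction sent by the AIOps platform."""
    source: str
    incident_type: str
    host: str


@dataclass
class Ticket:
    ticket_id: int
    summary: str
    status: str = "open"
    notes: list = field(default_factory=list)


class InMemoryItsm:
    """Stand-in for a real ITSM platform; it only keeps tickets in a dict."""

    def __init__(self):
        self.tickets = {}

    def open_ticket(self, summary):
        ticket = Ticket(ticket_id=len(self.tickets) + 1, summary=summary)
        self.tickets[ticket.ticket_id] = ticket
        return ticket

    def update(self, ticket, note, status=None):
        ticket.notes.append(note)
        if status is not None:
            ticket.status = status


def restart_service(alert):
    # Placeholder: a real workflow would act on the host here.
    return f"restarted service on {alert.host}"


def free_disk_space(alert):
    # Placeholder: a real workflow would clean up the host here.
    return f"purged temp files on {alert.host}"


# Maps each (assumed) incident type to its remediation workflow.
REMEDIATIONS = {
    "service_degradation": restart_service,
    "disk_filling": free_disk_space,
}


def handle_alert(alert, itsm):
    """Open a ticket, attempt remediation, and document every step."""
    ticket = itsm.open_ticket(f"{alert.incident_type} predicted on {alert.host}")
    action = REMEDIATIONS.get(alert.incident_type)
    if action is None:
        itsm.update(ticket, "no automated remediation; escalating to a human",
                    status="escalated")
        return ticket
    itsm.update(ticket, f"running {action.__name__}")
    result = action(alert)
    itsm.update(ticket, result, status="resolved")
    return ticket


itsm = InMemoryItsm()
ticket = handle_alert(Alert("aiops", "disk_filling", "db-01"), itsm)
print(ticket.status, ticket.notes)
```

Note that the ticket is opened before any action is taken and updated after each step, so the audit trail exists even when no automated remediation is available and the incident has to be escalated to a person.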

Gartner concurs with this approach.

In a paper published earlier this year (ID G00384249 – April 24, 2019), they wrote that:

  “AI technologies play an important role in I&O, providing benefits such as reduced mean time to response (MTTR), faster root cause analysis (RCA) and increased I&O productivity. AI technologies enable I&O teams to minimize low-value repetitive tasks and engage in higher-productivity/value-oriented actions.”

No ambiguity there.

A little further down in the same paper, Gartner gave the following recommended actions, representing their most current advice to infrastructure and operations leaders regarding AIOps and automation:

  “Embark on a journey toward driving intelligent automation. This involves managing and driving AI capabilities that are embedded by infrastructure vendors, in addition to reusing artificial intelligence for operations (AIOps) capabilities to drive end-to-end (from digital product to infrastructure) automation.”

With AIOps + Automation, it’s possible to predict and prevent network outages or other major disruptions by proactively detecting the conditions leading up to them and automatically remediating them BEFORE disaster strikes. Given how costly a service interruption can be to an enterprise, avoiding issues before they happen will be a critical function in the self-healing data center of tomorrow.
