Alert escalation chains provide a structured approach to communicating incident information to people and third-party applications. Automating this process, though, is about much more than communication.
It’s about data enrichment, meaning you can not only automate the gathering of forensic information such as CPU and memory utilization, which apps are currently running, and whether the OS can be pinged, but also quickly determine whether an alert is a false positive before taking any further steps. This reduces the need for human intervention, ultimately making your IT operations more efficient.
It’s also about automating incident remediation as well as automating documentation of everything that transpired and all the people involved in remediating a given incident.
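To make the enrichment idea concrete, here is a minimal Python sketch, not Ayehu NG or LogicMonitor code. The `Alert` class, the `enrich` and `is_false_positive` helpers, and the probe names are all hypothetical, and the probes return fixed stand-in values instead of querying a real host. It assumes an alert fires when a metric rises above a threshold, and treats the alert as a false positive when a fresh reading is back under that threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    host: str
    metric: str       # name of the metric that triggered the alert
    threshold: float  # the alert fires when the metric exceeds this value
    context: Dict[str, float] = field(default_factory=dict)


def enrich(alert: Alert, probes: Dict[str, Callable[[str], float]]) -> Alert:
    """Run every probe against the alert's host and attach the readings."""
    for name, probe in probes.items():
        alert.context[name] = probe(alert.host)
    return alert


def is_false_positive(alert: Alert) -> bool:
    """True when a fresh reading of the alerting metric is at or under
    its threshold, i.e. the condition no longer holds."""
    reading = alert.context.get(alert.metric)
    return reading is not None and reading <= alert.threshold


# Stand-in probes with fixed values; real ones would query the host.
probes = {
    "cpu_percent": lambda host: 42.0,
    "memory_percent": lambda host: 71.5,
    "ping_ok": lambda host: 1.0,
}

alert = enrich(Alert(host="web-01", metric="cpu_percent", threshold=90.0), probes)
print(is_false_positive(alert))  # True, because 42.0 <= 90.0
```

In a sketch like this, only alerts that survive the false-positive check would go on to notify a person.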
Given LogicMonitor’s growing popularity, automating an alert escalation chain for that platform can add significant value to IT operations.
By now everyone reading this post has heard about the SolarWinds hack. That was the cyberattack, perpetrated by agents of the Russian government, in which malicious code was inserted into software updates for SolarWinds Orion.
Since the attack took place, cyber forensic specialists have been investigating, and the deeper they’ve probed the bigger they’ve found the breach to be. Among the many high-profile organizations known to be affected by the hack are NATO, the U.K. government, the European Parliament, numerous federal and state government agencies in the U.S., and a laundry list of private sector businesses.
The SolarWinds hack was so broad and so pernicious that it’s escalated international tensions. It also led to political repercussions when the U.S. government ordered financial sanctions and diplomatic expulsions against Russia.
This has put SolarWinds in a very bad situation and has led many of their customers to re-evaluate their relationship with the company. Many of those customers have decided to move off the SolarWinds platform altogether and onto something they consider safer.
We’ve seen anecdotal evidence of this among our own customers, and one of the leading platforms we’ve seen people abandoning SolarWinds for is LogicMonitor, which touts itself as “…the only fully automated, cloud-based infrastructure monitoring platform for enterprise IT and managed service providers.”
Let’s take a closer look at why LogicMonitor has been getting a lot of attention as an alternative to SolarWinds.
G2 is one of the world’s leading business software review sites. It’s kind of a Yelp for enterprise software, with lots of listings, ratings, and reviews.
G2 just released their Spring 2021 Report, and LogicMonitor has emerged as one of the top competitors in the Cloud Infrastructure Monitoring Software category.
This chart shows them rated very strongly for market presence along the vertical Y-axis, and as the top-rated choice for satisfaction along the horizontal X-axis.
LogicMonitor was among the top 4 vendors for satisfaction ratings in G2’s Spring 2021 Report, and this chart breaks down the constituent components their score is based on. As you can see, LogicMonitor finished first or tied for first in every category but one, and they finished a close second in that one.
The G2 Spring 2021 Report also ranked LogicMonitor #1 for usability. Pretty impressive.
So why am I singing the praises of LogicMonitor? Am I trying to sell you a license?
No, not at all. Ayehu is all about automation and LogicMonitor is all about observability, the fashionable new term for monitoring. Those are two very different things.
The purpose in showing you these last few charts was simply to provide context for why LogicMonitor is growing in popularity. As that popularity increases, demand for automated workflows that integrate with LogicMonitor is growing as well, and that’s where Ayehu can add value to LogicMonitor customers and partners.
Back in the good old days of early 2020 (and prior), this is what an IT department might’ve looked like.
If a call came in to the service desk and the L1 technician couldn’t remediate the incident on their own, they might’ve just gotten up out of their seat and walked a few steps over to the cubicle where the L2 technician was seated to recruit their help.
If the L2 technician couldn’t figure it out, they might’ve simply swiveled around to get assistance from the service owner, or the subject matter expert (the SME), or even the vendor representative if one was deployed onsite to provide real-time troubleshooting.
This work paradigm may or may not return, but one of the lessons learned from the pandemic is this isn’t how teams work right now, at least in the vast majority of IT departments. Instead, here’s how things have been functioning for over a year now.
When incident alert notifications come in from a monitoring system, they go to an L1 technician just like before, except now the technician is working from home.
If the L1 technician can’t resolve the underlying incident that triggered the alert, they reach out to the L2 technician, who is also working from home, and may now have a personal assistant sitting on their lap constantly interrupting them.
If the L2 technician is stumped by the alert, they can’t just swivel around in their chair anymore to talk to the SME. They now have to find the SME remotely and hope they’re available, but the SME might be taking a break for a heart-to-heart discussion with their 4-legged personal assistant.
So now the L2 technician needs to find another SME who can help with this incident. And on it goes.
This is called an alert escalation chain, and it’s something you very much want to automate, especially given the current remote work paradigm overlaid upon the same pre-existing complex environments.
Those pre-existing environments may have already included geographic dispersal for your organization. In fact, many IT operations teams have a global footprint, not just those for large enterprises. So having a worldwide alert escalation chain might very well be something you’re already familiar with.
Then of course there’s the need for high availability, because we live in an always-on world with expectations that systems be available and accessible 24/7/365 with 99.99%+ uptime.
Finally, if all that wasn’t enough, your IT department may be committed to rigid and very demanding SLAs, which require short response times from your incident response teams.
These separate, but highly intertwined, dynamics elevate the importance of proper communication within IT when responding to incidents.
All these factors (the need for high availability, geographic dispersal, and rigid service level agreements) mean that, more than ever, IT operations teams require resilient, structured communication chains that help get the right notification to the right person at the right time. Automation technologies like Ayehu NG can deliver precisely that capability.
From the moment an incident occurs, through alert creation, acknowledgement by a technician, any needed escalation, and final resolution, the entire chain should be automated so that this tedious administrative process runs on its own and people can focus on the work that keeps systems up and running.
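As a rough illustration of that flow, the Python sketch below walks an alert down a chain of responders, escalating whenever the current responder doesn’t acknowledge in time, and records every step along the way. It is a simplified sketch under stated assumptions, not how Ayehu NG or LogicMonitor implement escalation: the `Responder` class, the `escalate` function, and the `notify` and `wait_for_ack` callbacks are hypothetical names, and the callbacks are stubbed so the example runs instantly.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Responder:
    """Hypothetical entry in an escalation chain."""
    name: str
    tier: str             # e.g. "L1", "L2", "SME"
    ack_timeout_min: int  # minutes to wait for an acknowledgement


def escalate(
    alert_id: str,
    chain: List[Responder],
    notify: Callable[[Responder, str], None],
    wait_for_ack: Callable[[Responder, str, int], bool],
    log: List[str],
) -> Optional[Responder]:
    """Notify each responder in order until one acknowledges the alert.

    Returns the responder who acknowledged, or None if the chain is
    exhausted. Every step is appended to `log` as an audit trail.
    """
    for responder in chain:
        notify(responder, alert_id)
        log.append(f"{alert_id}: notified {responder.tier} {responder.name}")
        if wait_for_ack(responder, alert_id, responder.ack_timeout_min):
            log.append(f"{alert_id}: acknowledged by {responder.name}")
            return responder
        log.append(f"{alert_id}: no ack from {responder.name}, escalating")
    log.append(f"{alert_id}: chain exhausted with no acknowledgement")
    return None


chain = [
    Responder("Alex", "L1", ack_timeout_min=5),
    Responder("Sam", "L2", ack_timeout_min=10),
    Responder("Jordan", "SME", ack_timeout_min=15),
]

sent: List[str] = []
audit: List[str] = []

owner = escalate(
    "ALERT-1",
    chain,
    notify=lambda r, a: sent.append(r.name),        # stub: record who was paged
    wait_for_ack=lambda r, a, mins: r.tier == "L2",  # stub: only L2 responds
    log=audit,
)
print(owner.name if owner else "unacknowledged")
print("\n".join(audit))
```

Passing `notify` and `wait_for_ack` in as callbacks keeps the chain logic separate from whichever channel (email, SMS, chat) actually delivers the message, and the log doubles as the documentation of who was involved.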
I think I’ve made a pretty good argument for automating your alert escalation chain, but some of you might be wondering what the consequences are of not doing so. Here are a few to consider:
- Keeping alert escalation as a manual process relying on human labor makes it harder and more time-consuming to find the right people for remediating any given incident
- When it takes longer to find the right person or expert to remediate an incident, that leads to a deterioration in mean time to resolution (MTTR), which is a critical KPI for most IT Ops teams
- That deterioration in MTTR will ultimately lead to a decline in customer satisfaction, often measured as a Net Promoter Score (NPS), which is another increasingly important KPI and, from what we hear anecdotally, is becoming a popular basis for bonus calculations at many enterprises
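For readers who want to track the KPI mentioned above, mean time to resolution is commonly computed as total resolution time divided by the number of incidents. The short Python sketch below shows that arithmetic. The function name and the timestamps are made up purely for illustration and are not measurements from any real environment.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(
    incidents: List[Tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - opened) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - opened for opened, resolved in incidents),
        timedelta(),  # start value so the sum stays a timedelta
    )
    return total / len(incidents)


# Illustrative (opened, resolved) pairs: 30, 60, and 90 minutes to resolve.
start = datetime(2021, 4, 1, 9, 0)
incidents = [
    (start, start + timedelta(minutes=30)),
    (start, start + timedelta(minutes=60)),
    (start, start + timedelta(minutes=90)),
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Any time spent hunting for the right person is part of each incident’s resolution time, so it feeds directly into this average.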
It turns out that for all the upside there is to automating alert escalation chains, there’s also a lot of downside to keeping them as manual processes.
If you’re interested in test driving Ayehu NG and seeing how easy it is to automate alert escalation chains for LogicMonitor or any monitoring/observability platform, click here to download your very own free 30-day trial version today.