How to Automate Incident Response for Splunk Alerts in Minutes

Author: Guy Nadivi, Sr. Director of Marketing, Ayehu

Let’s talk about Splunk, a leader in the Security Information and Event Management (SIEM) market. By the way, you can always tell who the market leader is in any category when its competitors start touting themselves as the ones who will eradicate that company. Recently, one of Splunk’s competitors described itself as a “Splunk Killer”, reaffirming that Splunk is indeed at the head of its class in that segment.

In Gartner’s 2018 Magic Quadrant for the SIEM market, Splunk appears higher than everyone else and further to the right than anybody but IBM. What this means is they excel above all other competitors on the y-axis of the Magic Quadrant, which is a measurement of “Ability to Execute.” On the x-axis of the Magic Quadrant, which measures “Completeness of Vision,” they exceed almost everyone except IBM.

Score highly on those two measurements, and Gartner considers you a Market Leader.

Market share is another key indicator of market leadership, and here Splunk is ranked No. 2 with 13.7% market share. Only IBM has a larger market share when it comes to SIEMs.

Thanks to Splunk’s January 31, 2019 Form 10-K filing with the SEC, we also know they have 17,500 customers in more than 130 countries, including 90% of the Fortune 100. That is another clear indication that they are a leader in this market.

With a market position like that, it seems worthwhile to talk about how to quickly and easily automate incident remediation for Splunk alerts in minutes.

As many people know, Splunk produces software for capturing, indexing, correlating, searching, monitoring, and analyzing machine-generated big data.

Some sources of that data include logs for Windows events, Web servers, and live applications, as well as network feeds, metrics, change monitoring, message queues, archive files, and so on.

Generally, these data sources can be categorized as:

  • Files and directories
  • Network events
  • Windows sources
  • And the catch-all category of “other sources”

There are a number of outputs and outcomes Splunk generates from this data, including:

  • Analyzing system performance
  • Troubleshooting failure conditions
  • Monitoring business metrics
  • Creating dashboards to visualize and analyze results
  • And of course storing and retrieving data for later use

That’s A LOT of data. The more systems Splunk monitors, and the more those systems grow, the greater the volume of machine data that gets generated. This is becoming a problem because IT and security operations are getting inundated by all this data. It doesn’t come only from Splunk, but Splunk generates a big chunk of it.

Every time there’s an incident, an event, a threshold being crossed, etc., new data is generated, adding to the surge already flooding over IT and Security Operations. And it’s only getting worse.

Ultimately, it’s people who have to deal with all this data, and the problem is that (as we often say) people don’t scale very well.

Even the very best data center workers in NOCs and SOCs can only handle so much. At some point, and that point is pretty much right now, automation has to take on a greater share of the workload that all this growth in data creates.

Why automation? Because people may not scale very well, but automation DOES! And if you’re in one of these overwhelmed data centers, that should be music to your ears.

Here are just a few of the ways automation can bring relief to NOCs and SOCs drowning in Splunk data:

Triggering Workflows

Let’s say Splunk detects that a corporate website has been hacked and defaced. This event can trigger an automatic workflow that quickly restores the website to its pre-defacement state. In fact, an automation platform like Ayehu can do this MUCH quicker than humans could do manually once they got the alert. Restoring the website automatically and almost instantaneously minimizes the damage to corporate reputation, not to mention the threat to job security because the defacement happened in the first place.

Remediating Incidents

In addition to the example of remediating a website defacement incident, let’s consider a situation where Splunk generates an alert about a specific machine due to some observed suspicious activity. Ayehu can remotely lock it either automatically or at the SOC analyst’s manual command, to mitigate any damage until a hands-on inspection can take place. Furthermore, this automated incident remediation workflow could also include doing things like deactivating that user’s Active Directory credentials, turning off their card key’s ability to swipe in or out of a building, etc.
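To make the idea of a containment workflow with an optional analyst approval step concrete, here is a minimal Python sketch. It is not Ayehu’s implementation: the step functions (`lock_machine`, `disable_ad_account`, `revoke_badge`), the `contain` helper, and the host and user names are all hypothetical, and each step only records what a real integration would do.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ContainmentRun:
    """Records which containment steps ran, in order."""
    host: str
    user: str
    log: List[str] = field(default_factory=list)


# Hypothetical stand-ins for real integrations (endpoint control, Active
# Directory, badge system). Each one only records what it would do.
def lock_machine(run: ContainmentRun) -> None:
    run.log.append(f"locked {run.host}")


def disable_ad_account(run: ContainmentRun) -> None:
    run.log.append(f"disabled AD account {run.user}")


def revoke_badge(run: ContainmentRun) -> None:
    run.log.append(f"revoked badge for {run.user}")


STEPS = [lock_machine, disable_ad_account, revoke_badge]


def contain(host: str, user: str,
            approve: Callable[[str], bool] = lambda step_name: True) -> ContainmentRun:
    """Run each step in order; `approve` lets an analyst veto a step by name."""
    run = ContainmentRun(host, user)
    for step in STEPS:
        if approve(step.__name__):
            step(run)
        else:
            run.log.append(f"skipped {step.__name__} (not approved)")
    return run


# Fully automatic: every step is approved by default.
auto = contain("ws-042", "jdoe")

# Analyst in the loop: only the machine lock is approved.
manual = contain("ws-042", "jdoe", approve=lambda name: name == "lock_machine")
```

The default `approve` callback models the fully automatic case, while passing a different callback models an analyst who signs off on individual steps.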

Data Enrichment

This task is well known to anyone who’s ever had to perform cybersecurity forensics during and after an incident. It involves aggregating all the information a SOC analyst needs to make an informed decision about what’s happening in real-time, or what happened as part of an after-incident evaluation. This can be a laborious manual task, and certainly one that’s difficult to script out.

If your automation platform easily integrates with just about anything in a typical, heterogeneous IT environment, however, then it can gather this critical information very rapidly as well as add more precise context to it about the nature of the incident. This will greatly reduce time-to-decision-making for SOC analysts, which is vital when, for example, you’re watching a ransomware virus swiftly encrypt your enterprise data and you need to decide on a course of action fast.
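As a rough illustration of enrichment, the Python sketch below merges context from several lookup sources into a single alert record and notes any source that fails, so the analyst still gets a partial picture. The sources (`asset_lookup`, `user_lookup`, `threat_lookup`), their return values, and the field names are invented for the example; real lookups would call a CMDB, a directory service, a threat-intelligence feed, and so on.

```python
from typing import Callable, Dict


# Hypothetical lookup sources returning canned data for illustration.
def asset_lookup(alert: dict) -> dict:
    return {"asset_owner": "finance", "os": "Windows 10"}


def user_lookup(alert: dict) -> dict:
    return {"user_department": "accounting"}


def threat_lookup(alert: dict) -> dict:
    # Simulates a source that is down at the moment of the incident.
    raise TimeoutError("feed unavailable")


SOURCES: Dict[str, Callable[[dict], dict]] = {
    "assets": asset_lookup,
    "directory": user_lookup,
    "threat_intel": threat_lookup,
}


def enrich(alert: dict) -> dict:
    """Return a copy of the alert with context from every reachable source."""
    enriched = dict(alert)
    enriched["enrichment_errors"] = {}
    for name, lookup in SOURCES.items():
        try:
            enriched.update(lookup(alert))
        except Exception as exc:
            # One failing source should not block the rest of the enrichment.
            enriched["enrichment_errors"][name] = str(exc)
    return enriched


result = enrich({"host": "ws-042", "user": "jdoe"})
```

Recording failures instead of raising them is a design choice for this sketch: during an incident, partial context delivered quickly is usually more useful than complete context delivered late.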

Opening Tickets

Just about every data center uses an ITSM platform like ServiceNow, JIRA, BMC Remedy, or one of many others. It’s very important to document what steps were taken to remediate an incident or conduct a cybersecurity forensics investigation. SOC analysts are pretty overwhelmed these days, and often don’t have the time to do that. When they do have time, they often don’t document as thoroughly as necessary in order to provide a complete picture of what transpired.

An automation tool like Ayehu can do this much quicker, and in real-time during workflow execution, so everything is properly documented, and nothing slips through the cracks.

Now let’s walk through the flow of events that uses Splunk data and alerts as triggers for actions.

We call this flow a closed-loop, automated incident management process. It starts out with Ayehu NG creating an integration between Splunk and whatever IT Service Management or help desk platform you’re using, be it ServiceNow, JIRA, BMC Remedy, etc.

When Splunk generates an alert or any kind of data you want to act upon, Ayehu intercepts it via the integration point. It then parses the alert to determine the underlying incident and launches the appropriate workflow for that situation, whether that means remediating the incident, gathering information for forensic analysis, or something else.
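The parse-then-dispatch step can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of how Ayehu’s integration works: it assumes the alert arrives as JSON shaped like the payload of Splunk’s webhook alert action (with `search_name` and `result` fields), and the saved-search names and workflow functions are hypothetical.

```python
import json


def _host(result: dict) -> str:
    return result.get("host", "unknown host")


# Hypothetical workflows; each returns a description of what it would do.
def restore_website(result: dict) -> str:
    return f"restore site on {_host(result)}"


def isolate_host(result: dict) -> str:
    return f"isolate {_host(result)}"


def collect_forensics(result: dict) -> str:
    return f"collect forensics for {_host(result)}"


# Maps the name of the saved search that fired to the workflow to launch.
WORKFLOWS = {
    "Website Defacement Detected": restore_website,
    "Suspicious Host Activity": isolate_host,
}


def handle_alert(raw_body: str) -> str:
    """Parse an alert payload and launch the matching workflow."""
    payload = json.loads(raw_body)
    search_name = payload.get("search_name", "")
    result = payload.get("result", {})
    # Unmapped alerts fall back to evidence gathering rather than being dropped.
    workflow = WORKFLOWS.get(search_name, collect_forensics)
    return workflow(result)


body = json.dumps({
    "search_name": "Suspicious Host Activity",
    "result": {"host": "ws-042"},
})
outcome = handle_alert(body)
```

A lookup table keeps the mapping from alerts to workflows in one place, so adding a new alert type means adding one entry rather than another branch.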

While this is taking place, Ayehu also automatically creates a ticket in your ITSM, and updates it in real-time by documenting every step of the workflow. Once the workflow is done executing, Ayehu automatically closes the ticket. All of this can occur without any human intervention, or you can choose to keep humans in the loop.
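The closed loop described above (open a ticket, document each step as it runs, close the ticket at the end) might look like the following Python sketch. The `Ticket` class is an in-memory stand-in for an ITSM record, not a real ServiceNow or JIRA client, and escalating to a human when a step fails is our assumption about one way to keep people in the loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Ticket:
    """In-memory stand-in for an ITSM ticket (hypothetical)."""
    summary: str
    status: str = "open"
    notes: List[str] = field(default_factory=list)


def run_closed_loop(summary: str,
                    steps: List[Tuple[str, Callable[[], str]]]) -> Ticket:
    """Open a ticket, log every workflow step on it, then close it."""
    ticket = Ticket(summary)
    for name, action in steps:
        try:
            ticket.notes.append(f"{name}: {action()}")
        except Exception as exc:
            # Assumption: a failed step stops the workflow and hands off to a human.
            ticket.notes.append(f"{name}: FAILED ({exc})")
            ticket.status = "escalated"
            return ticket
    ticket.status = "closed"
    return ticket


# Hypothetical workflow steps.
def restart_service() -> str:
    return "service restarted"


def verify_health() -> str:
    return "health check passed"


def unreachable() -> str:
    raise RuntimeError("host unreachable")


resolved = run_closed_loop("Web service down",
                           [("restart", restart_service), ("verify", verify_health)])
escalated = run_closed_loop("Web service down",
                            [("restart", unreachable), ("verify", verify_health)])
```

Because every step writes its outcome to the ticket as it happens, the documentation exists whether the workflow finishes cleanly or stops halfway.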

This closed-loop illustration also reveals why we think of Ayehu as a virtual operator, which we sometimes refer to as “Level 0 Tech Support”. Many incidents can simply be resolved automatically by Ayehu without human intervention, and without the need for attention from a Level 1 technician.

Imagine automating manual processes like Capture, Triage, Enrich, Respond, and Communicate. Automating resolution and remediation can result in a pretty significant savings of time, which can be particularly critical for data centers feeling overwhelmed.

Customers tell us over and over that automating the manual, tedious, time-intensive stuff accelerated their incident resolution by 90% or more.

We can also say with confidence that you can automate incident response for Splunk alerts in minutes, because Ayehu’s automation platform is agentless. Being agentless also makes us non-intrusive since we leverage APIs, SSH, and HTTPS behind enterprise firewalls under that organization’s security policy to perform automation. The only software to install is on a server, either physical or virtual, which centralizes management and greatly simplifies maintenance and upgrades.

Another reason it only takes minutes to automate incident response for Splunk alerts is that the Ayehu automation platform is codeless. This is something really important to consider because while there are many vendors out there touting their platforms as “automation”, the fact remains that they’re really just frameworks for scripting, and we steadfastly believe that scripting IS NOT automation.

For starters, in order to script you need to have programming expertise. With a true automation tool, however, you shouldn’t need to have any programming expertise. In fact, the automation platform should be so easy to use, even a junior SysAdmin with zero programming expertise should be able to master it in less than a day. Why is that so important? Because one of the promises of true automation is that you don’t have to rely on specialized talent to orchestrate activities in your environment. Requiring specialized programmers would be a bottleneck to that goal.

Finally, the Ayehu automation platform includes AI and Machine Learning built into the product.

The first thing you should know about Ayehu’s AI and Machine Learning efforts is that we’re partnered with SRI International (SRI), formerly known as the Stanford Research Institute. For those not familiar, SRI does high-level research for government agencies, commercial organizations, and private foundations. They also license their technologies, form strategic partnerships (like the one they have with us), and create spin-off companies. They’ve received more than 4,000 patents and patent applications worldwide to date. SRI is our design partner, and they’ve designed the algorithms and other elements of our AI/ML functionality. What they’ve done so far is pretty cool, but what we’re working on going forward is really exciting.

Questions and Answers

Q: What are the pros and cons of using general purpose bot engines compared to your solution?

A: General purpose bot engines won’t actually perform the actions on your infrastructure, devices, monitoring tools, business applications, etc. All they can really do is ingest a request. By contrast, Ayehu not only ingests requests, but actually executes the actions needed to fulfill those requests. This adds a virtual operator to your environment that’s available 24x7x365. Additionally, Ayehu is a vendor-agnostic tool that interfaces with MS-Teams, Skype, etc. to provide these general purpose chat tools with intelligent automation capabilities.

Q: Do you have an on-premise solution?

A: Yes. Ayehu can be installed on-premise, on a public or private cloud, or in a hybrid combination of all three.

Q: Do you have voice integration?

A: Ayehu integrates with Amazon Alexa, and now also offers Angie™, a voice-enabled Intelligent Virtual Support Agent for IT Service Desks.

Q: If a user selects a wrong choice (clicks the wrong button) how does he or she fix it?

A: It depends on how the workflow is designed. Breakpoints can be inserted in the workflow to ask the endpoint user to confirm their button selection, or go back to reselect. Ayehu also offers error-handling mechanisms within the workflow itself.

Q: Does Ayehu provide orchestration capabilities or do you rely on a 3rd party orchestration tool?

A: Ayehu IS an enterprise-grade orchestration tool, offering over 500 pre-built platform-specific activities that allow you to orchestrate multi-platform workflows from a single pane of glass.

Q: Can you explain intent-based interactions in a bit more detail?

A: Intent is just that: what the user intends when interacting with the Virtual Support Agent (VSA). For example, if a user types “Change my password”, the intent could be categorized as “Password Reset”. That would then trigger the “Password Reset” workflow.
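As a toy illustration of that answer, the Python sketch below maps an utterance to an intent and then to a workflow. It uses plain keyword matching, which is a deliberate simplification and not the NLP/NLU service the VSA relies on; the intent table, the workflow names, and the fallback to a live agent when nothing matches are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical intents and the keywords that signal them.
INTENT_KEYWORDS = {
    "Password Reset": ("password",),
    "Unlock Account": ("unlock", "locked out"),
}

# Hypothetical workflow names keyed by intent.
WORKFLOWS = {
    "Password Reset": "password_reset_workflow",
    "Unlock Account": "unlock_account_workflow",
}


def classify_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose keyword appears in the utterance."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def route(utterance: str) -> str:
    """Pick the workflow for an utterance, or hand off to a person."""
    intent = classify_intent(utterance)
    if intent is None:
        return "escalate_to_live_agent"
    return WORKFLOWS[intent]


chosen = route("Change my password")
```

A real NLP/NLU service would replace `classify_intent`, but the routing step after it stays the same: one recognized intent, one workflow.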

Q: Thanks for the information so far, great content! I would like to know if I can use machine learning from an external source, train my model, and let Ayehu query my external source for additional information?

A: Yes. Ayehu can integrate with any external source or application, especially when it has an API for us to interface with.

Q: Can I create new automations for my in-house applications?

A: Yes. Ayehu can integrate with any application bi-directionally. Once integrated with your in-house applications, Ayehu can execute automated actions upon them.

Q: Is there an auto form-filling feature? (which can fill in a form in an existing web application)

A: Yes. Ayehu provides a self-service capability that will allow this.

Q: How can I improve or check how my workflows are working and helping my employees to resolve their issues?

A: Ayehu provides an audit trail and reporting that give visibility into workflow performance. Additionally, reports are available on time saved, ROI, MTTR, etc. that can quantify the benefits of those workflows.

Q: What happens when your VSA cannot help the end user?

A: The workflow behind the VSA can be configured to escalate to a live support agent.

Q: If there is a long list of choices – what options do you have? Dropdown?

A: In addition to the buttons, dropdowns will be provided soon in Slack as well.

Q: Did I understand correctly, an admin will need to create the questions and button responses? If so, is this a scripted Virtual Agent to manage routine questions?

A: Ayehu is scriptless and codeless. The workflow behind the VSA is configured to mimic the actions of a live support agent, which requires you to pre-configure the questions and expected answers in a deterministic manner.

Q: Is NLP/NLU dependent on IBM Watson to understand intent?

A: Yes, and soon Ayehu will be providing its own NLP/NLU services.

Q: Are you using machine learning for creating the conversations? Or do I have to use intents and entities along with the dialogs?

A: You currently have to use intents and entities, but our road map includes using machine learning to provide suggestions that will improve the dialogs.

Q: What are the other platforms that I can deploy the VSA apart from Slack?

A: Microsoft Teams, Amazon Alexa, ServiceNow ConnectNow, LogMeIn, and any other chatbot using APIs.

This is a recap of a live webinar we hosted in May 2019.