4 Essential Steps for Successful Incident Management

It never hurts to go back to basics. Recently, we were surprised by how confused some organizations are about the incident management process, so we thought: why not put a quick primer down on paper?

To be successful, first you need a process – a repeatable sequence of steps and procedures. Such a process may include four broad categories of steps: detection, diagnosis, repair, and recovery.

1 – Detection

Identification: Incident management begins with problem identification. This can be handled using different tools. For instance, infrastructure monitoring tools help identify specific resource utilization issues, such as disk space, memory, CPU, etc. End-user experience tools can mimic user behavior and identify problems from the user’s point of view, such as slow response times and service unavailability. Last but not least, domain-specific tools detect problems within specific environments or applications, such as a database or an ERP system.

On the other hand, users can help you detect unknown problems that are not reported by infrastructure or user behavior monitoring tools. The drawback of problem detection by users is that it usually happens late (the problem is already there). Moreover, the reported symptoms may point you in the wrong direction.

So which method should you use? Depending on your environment, a combination of multiple methods and tools is usually the best solution. Unfortunately, no single tool can detect every problem.
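To make the infrastructure monitoring side of detection concrete, here is a minimal sketch of a threshold check. The resource names, the threshold values, and the `Alert` and `detect` names are our own assumptions for illustration; a real monitoring tool would collect the metrics itself and offer far richer rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    resource: str
    value: float
    threshold: float


# Hypothetical thresholds, expressed as percent of capacity in use.
THRESHOLDS = {"disk": 90.0, "memory": 85.0, "cpu": 95.0}


def detect(metrics: dict[str, float],
           thresholds: dict[str, float] = THRESHOLDS) -> list[Alert]:
    """Return one alert for every metric at or above its threshold."""
    return [
        Alert(name, value, thresholds[name])
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    ]


# Sample readings; in practice these would come from a monitoring agent.
alerts = detect({"disk": 93.5, "memory": 40.0, "cpu": 97.0})
for alert in alerts:
    print(f"{alert.resource} at {alert.value}% (threshold {alert.threshold}%)")
```

A check like this only catches what it measures, which is why user reports and end-user experience tools still matter.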

Logging events allows you to trace them at any point to improve your process. Properly logged incidents help you investigate past trends and identify problems (repeated incidents of the same kind), as well as review who took ownership and responsibility.
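As a rough sketch of how a log helps you spot repeated incidents of the same kind, the snippet below counts incidents by service and symptom. The record fields and the `recurring` helper are hypothetical; your ticketing system will have its own schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class IncidentRecord:
    opened_at: datetime
    service: str
    symptom: str
    owner: str  # who took ownership of the incident


def recurring(log: list[IncidentRecord], min_count: int = 2) -> dict[tuple[str, str], int]:
    """Return (service, symptom) pairs that appear at least min_count times."""
    counts = Counter((record.service, record.symptom) for record in log)
    return {key: n for key, n in counts.items() if n >= min_count}


# Made-up sample data for illustration only.
log = [
    IncidentRecord(datetime(2024, 1, 3, 9, 0), "mail", "queue full", "dana"),
    IncidentRecord(datetime(2024, 1, 9, 14, 30), "erp", "login timeout", "lee"),
    IncidentRecord(datetime(2024, 1, 17, 8, 15), "mail", "queue full", "sam"),
]
repeats = recurring(log)
```

Anything that shows up in `repeats` is a candidate for the prevention work described in the recovery step.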

Classification of events lets you categorize data for reporting and analysis purposes, so you know whether an event relates to hardware, software, service, etc. It is recommended to have no more than 5 levels of classification; otherwise it can get very confusing. You can start the top level with something like Hardware / Software / Service, or Problem / Service request.
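One simple way to enforce the five-level limit is to store each classification as a path and validate its depth. This is only a sketch: the "/" separator and the `parse_category` name are assumptions, not a feature of any particular tool.

```python
MAX_LEVELS = 5  # the recommended upper bound on classification depth


def parse_category(path: str) -> tuple[str, ...]:
    """Split a path such as 'Hardware / Server / Disk' into its levels."""
    levels = tuple(part.strip() for part in path.split("/") if part.strip())
    if not levels:
        raise ValueError("category path is empty")
    if len(levels) > MAX_LEVELS:
        raise ValueError(f"{len(levels)} levels given, at most {MAX_LEVELS} allowed")
    return levels


category = parse_category("Hardware / Server / Disk")
top_level = category[0]  # the level you would typically report on first
```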

Prioritization lets you determine the order in which events should be handled and how to assign your resources. Prioritization of events requires a longer discussion, but be aware that you need to consider impact, urgency, and risk. Consider the impact critical when a large group of users is unable to use a specific service. Consider the urgency high when the impacted service is critical and any downtime affects the business itself. The third factor, risk, should be considered when the incident has not yet occurred but is highly likely to happen. One example is a data center whose temperature is rising quickly because of an air conditioning malfunction. A crashing data center takes countless services down with it, so in this case the risk is enormous, and the incident should be handled at the highest priority.
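The three factors can be combined in many ways. The sketch below is one possible scheme, not a standard: it assumes a three-point scale for each factor, derives a priority from 1 (handle first) to 5 (handle last) from impact and urgency, and lets a high risk override everything else, as in the data center example.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level, risk: Level = Level.LOW) -> int:
    """Return a priority from 1 (handle first) to 5 (handle last)."""
    if risk is Level.HIGH:
        # A looming failure is treated like the worst incident.
        return 1
    # impact + urgency ranges from 2 to 6, so the result ranges from 5 to 1.
    return 7 - (impact + urgency)


many_users_critical_service = priority(Level.HIGH, Level.HIGH)
overheating_data_center = priority(Level.LOW, Level.LOW, risk=Level.HIGH)
single_user_minor_issue = priority(Level.LOW, Level.LOW)
```

Whatever scheme you choose, write it down so that everyone on the team prioritizes the same way.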

2 – Diagnosis

Diagnosis is where you figure out the source of the problem and how it can be fixed. This stage includes investigation and escalation.

Investigation is probably one of the most difficult parts of the incident management process. In fact, some argue that when resolving IT problems, 80% of the time is spent on root cause analysis versus 20% on fixing the problem. With more straightforward problems, runbook procedures can help accelerate an investigation, as they outline troubleshooting steps in a methodical way.

Runbook tip: The most crucial part of the runbook is the troubleshooting steps. They should be written by an expert, and be detailed enough so every team member can follow them quickly. Write all your runbooks using the same format, and insist on using the same terms in all of them. New team members who are not familiar yet with every system will be able to navigate through the troubleshooting steps much more easily.

Following a runbook by hand, however, can be very time consuming and lengthen the recovery time considerably. Instead, consider automating the diagnostic steps with runbook automation software. If you build the flow carefully and weigh all the steps that lead to a conclusion, automating the diagnostic process will give you quick answers and help you decide on your next step.
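To give a feel for what an automated diagnostic flow looks like, here is a small sketch that runs runbook checks in order and stops at the first one that fails. The `Step` structure and the stubbed checks are hypothetical; runbook automation products provide their own way to define steps, and real checks would query the systems involved.

```python
from typing import Callable, NamedTuple, Optional


class Step(NamedTuple):
    name: str
    check: Callable[[], bool]  # returns True when the check passes
    conclusion: str            # what a failed check suggests


def diagnose(steps: list[Step]) -> Optional[str]:
    """Run the checks in order and report the first one that fails."""
    for step in steps:
        if not step.check():
            return f"{step.name}: {step.conclusion}"
    return None  # every check passed, so escalate or investigate by hand


# Stubbed checks stand in for real probes of the environment.
runbook = [
    Step("service process", lambda: True, "the service process is not running"),
    Step("disk space", lambda: False, "the data volume is out of free space"),
    Step("database connection", lambda: True, "the database is unreachable"),
]
finding = diagnose(runbook)
```

The order of the steps matters here, just as it does in a written runbook: cheaper and more likely checks usually go first.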

Escalation procedures are needed in cases when the incident needs to be resolved by a higher support level.
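Escalation rules differ from one organization to the next. As one illustrative assumption, the sketch below escalates purely on how long an incident has been open; the team names and time limits are made up, and many teams also escalate on priority or required expertise.

```python
# (minutes the incident has been open, support level that should own it)
# Hypothetical values, sorted by ascending time limit.
ESCALATION_LEVELS = [
    (0, "first-line help desk"),
    (30, "second-line system administrators"),
    (120, "third-line engineering"),
]


def support_level(minutes_open: int) -> str:
    """Return the highest support level whose time limit has been reached."""
    current = ESCALATION_LEVELS[0][1]
    for after_minutes, team in ESCALATION_LEVELS:
        if minutes_open >= after_minutes:
            current = team
    return current


owner_after_45_minutes = support_level(45)
```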

3 – Repair

The repair step, well… it fixes the problem. This may sometimes be a gradual process, where a temporary fix or workaround is implemented first to bring a service back quickly. An incident repair may involve anything from a service restart to a hardware replacement or even a complex software code change. Note that fixing the current incident does not mean that the issue won’t recur, but more on that in the next step.

Here too, straightforward repairs such as a service restart or a disk cleanup can be automated.
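A minimal sketch of that idea maps each known diagnosis to a repair action and hands everything else to a person. The diagnosis keys and function names are hypothetical, and the actions only describe what they would do; a real implementation would call your service manager or cleanup tooling at the marked spots.

```python
from typing import Callable


def restart_service(target: str) -> str:
    # A real implementation would invoke the platform's service manager here.
    return f"restarted service {target}"


def clean_disk(target: str) -> str:
    # A real implementation would remove temporary files or rotate logs here.
    return f"removed temporary files on {target}"


# Hypothetical mapping from a diagnosis to its automated repair.
REPAIRS: dict[str, Callable[[str], str]] = {
    "service_down": restart_service,
    "disk_full": clean_disk,
}


def repair(diagnosis: str, target: str) -> str:
    """Run the automated repair for a diagnosis, or hand over to a person."""
    action = REPAIRS.get(diagnosis)
    if action is None:
        return f"no automated repair for '{diagnosis}'; hand over to an engineer"
    return action(target)


result = repair("service_down", "web-frontend")
```

Keeping the fallback explicit matters: hardware replacements and code changes still need a human.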

4 – Recovery

The recovery phase involves two parts: closure and prevention.

Closure means following up on any notifications previously sent to users about the problem, and on any escalation alerts, so that everyone who was told about the problem is now told that it has been resolved. Closure also entails closing the incident in your logging system.

Prevention covers the activities you undertake, where possible, to keep a single incident from occurring again in the future and therefore becoming a problem. Two important tools can help you in this task:

RCA process (Root Cause Analysis): The purpose of the RCA process is to investigate the root cause that led to the service downtime. It is important to mention that the RCA process should be performed by the service owners, who are not necessarily the ones who solved the specific incident. This is an additional reason why incident logging is so important: the information in the ticket is crucial for this investigation process.

And finally, incident reports: while an incident report will not prevent the problem from occurring again, it will allow you to continually learn and improve your incident management process.


How IT Process Automation Can Benefit Over Human Intervention

In the not so distant past, businesses in just about every industry held steadfastly to the belief that computers could never be as valuable as a human employee. Even when automation began to take hold in the manufacturing field, with cars being assembled by machines rather than assembly line workers, there was still a belief that human intervention was the most coveted asset of an organization.

While it’s true that technology will never fully replace people, it’s becoming increasingly clear that automation can provide a distinct benefit beyond what any living, breathing employee could. Here’s how.

If you think about the typical day-to-day tasks of a data center in any given industry, you’ll inevitably come up with a list of routine, repetitive actions: managing storage, assigning network access, monitoring and responding to incoming incidents, adhering to SLAs, and countless other activities. Leaving these tasks in the hands of human employees could actually be causing your business more harm than good. Not only do these things cost valuable time and resources, but they’re very easy to get wrong, leaving your organization vulnerable to costly human error.


IT process automation provides a solution to these risks by taking just about every manual, repetitive task and allowing technology to do the heavy lifting instead. This vastly improves speed and efficiency, which in turn boosts service levels. It also eliminates the chance of mistakes made by overworked or tired human workers. And because the IT team will no longer be bogged down by menial day to day tasks, they will be able to focus on other, much more important items, vastly improving overall productivity.

Take this concept to the next level, and you’ve got the possibility of automating not just simple, repetitive tasks, but entire complex workflows. The term “workflow” applies broadly to any series of events that take place in a certain pattern to achieve a desired outcome. One example of a common IT workflow is the service ticket process. A user initiates a ticket, which is retrieved and investigated by an IT team member, and either handled directly or escalated. The flow continues through resolution and the original ticket is closed, completing the workflow process.

IT workflow automation is designed to reduce or eliminate human intervention as much as possible. In the example above, rather than having an IT worker handle the service ticket process, the workflow could continue automatically, with responses and actions taken based on predetermined instructions. This eliminates the need for most, if not all, human intervention in the process, making it faster and more accurate.

Some workflows may still require human intervention in certain situations, such as when a workflow encounters an error and cannot be completed, or when a step in a more complex process requires approval. Even with these occasional interruptions, automated workflows are far more efficient than workflows handled entirely by IT personnel.
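The service ticket workflow described above can be sketched as a small state machine driven by predetermined instructions. The state and event names below are our own assumptions rather than those of any specific product; the "needs_human" event marks the point where a person has to step in.

```python
# (current state, event) -> next state; all names are hypothetical.
TRANSITIONS = {
    ("open", "picked_up"): "investigating",
    ("investigating", "fixed"): "resolved",
    ("investigating", "needs_human"): "escalated",
    ("escalated", "fixed"): "resolved",
    ("resolved", "confirmed"): "closed",
}


def advance(state: str, event: str) -> str:
    """Move a ticket to its next state, rejecting events that do not apply."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def run(events: list[str], state: str = "open") -> str:
    """Apply a sequence of events and return the final ticket state."""
    for event in events:
        state = advance(state, event)
    return state


fully_automated = run(["picked_up", "fixed", "confirmed"])
waiting_for_person = run(["picked_up", "needs_human"])
```

In the first run the ticket closes without anyone touching it; in the second it stops in the escalated state until a person resolves it.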

It’s important to note the difference between IT process automation and scripts, which many organizations still rely on to assist with internal workflows. In comparison, IT automation provides a much greater level of control and efficiency than scripts. Automation is also much easier to manage, since scripts can be quite complicated and typically require the expertise of a tech-savvy person to write, manage and troubleshoot them. IT process automation is much more user-friendly and intuitive, and also much less prone to error.

Additionally, IT workflow automation can be integrated with existing systems to provide enhanced benefit and a more robust solution than standalone products. For example, the right automation tool integrated with an existing monitoring system can enhance the quality, speed and accuracy of incident management.

Imagine how much more valuable your IT team would be if they didn’t have to spend hours upon hours every day managing and monitoring workflows. Now, think about how your organization as a whole could benefit from improved efficiency, fewer errors, better service levels and lower expenses. When you look at it from that angle, it’s easy to see how beneficial IT workflow automation truly can be above and beyond the human team you’ve got in place.
