Incident Management: A Tale of Two Approaches

by Aaron Geer
Cloud Services Manager

It’s 10:32 am and your entire email system just crashed. No one can access their email. No email is coming in or out. It’s a classic “priority 1” incident, and everyone from the accounts receivable department to the CEO is freaking out.

In a situation like this, how well will your incident management system function? Whether your system helps or hinders you as you work to resolve the problem depends largely on the incident management processes and procedures you have in place. To illustrate the difference this makes, let’s look at a Tale of Two Approaches…

Approach #1: Manual Systems

Over at J.T. Kaplan & Co. the scene is chaotic. There are multiple monitoring systems in place, and they’re all sending alerts. Alerts are buzzing…phones are ringing…and there’s so much noise (both literal and figurative) that it’s hard for the Admin on duty to filter it all out and focus on figuring out what went wrong and finding the solution.

J.T. Kaplan & Co. has a manual incident management process. During the first five minutes of the incident, the Admin creates a ticket and begins troubleshooting. At the five-minute mark, however, he has to stop troubleshooting and start the escalation process. This involves manually calling one of the three contacts on the company’s escalation table (a process that can take at least five minutes in and of itself) and giving that stakeholder a variety of information about the incident; creating a group conference bridge with the IT department’s Senior Incident Management team and inviting the appropriate people to join it; and updating the ticket with outage information, timeline and status.

In other words, the fire is burning and instead of manning the fire hose, the Admin is busy talking about the fire. Then, as the fire continues to burn and the clock keeps ticking, he’ll have to stop problem solving, again, at other crucial points in time to start the next phase of the escalation process. The continued stopping/starting of the troubleshooting process is dangerous…and can allow the fire to further spread through the enterprise.

Approach #2: Automated Systems

The scene is quite different over at CossTech, because CossTech’s IT team has Atlassian’s Opsgenie automated incident management platform in place.

Right off the bat, Opsgenie reduces the chaos level by filtering out the noise. Opsgenie interfaces with over 200 apps and programs, and it de-duplicates and normalizes the alerts they send. As a result, CossTech’s Admin isn’t being bombarded with multiple alerts regarding the same issue.
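To make the idea of de-duplicating and normalizing alerts concrete, here is a minimal sketch in Python. It is not Opsgenie's API or its actual algorithm; the field names (source, service, host, problem, check) and the helper names are hypothetical, chosen only to show how alerts from several monitoring tools might collapse into one.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    key: str                      # normalized identity shared by duplicates
    message: str
    sources: list = field(default_factory=list)
    count: int = 0                # how many raw alerts were folded in


def normalize(raw):
    """Map differently shaped raw alerts onto one (service, problem) pair.

    The field names here are assumptions for illustration only.
    """
    service = raw.get("service", raw.get("host", "unknown")).strip().lower()
    problem = raw.get("problem", raw.get("check", "unknown")).strip().lower()
    return service, problem


def dedupe(raw_alerts):
    """Fold raw alerts that describe the same issue into a single Alert."""
    open_alerts = {}
    for raw in raw_alerts:
        service, problem = normalize(raw)
        key = f"{service}:{problem}"
        alert = open_alerts.get(key)
        if alert is None:
            alert = Alert(key=key, message=f"{problem} on {service}")
            open_alerts[key] = alert
        alert.count += 1
        source = raw.get("source", "unknown")
        if source not in alert.sources:
            alert.sources.append(source)
    return list(open_alerts.values())
```

In this sketch, three raw alerts about the email outage from two monitoring tools would reach the Admin as a single alert with a count of three.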

Second, and this is key, Opsgenie automates the entire incident management process, based on pre-configured parameters and rules. Opsgenie creates the ticket. Based on the flexible rules that CossTech established, it automatically contacts the appropriate people at the appropriate times, via voice call, email and/or text. It creates a group conference bridge (such as a WebEx or Zoom) and automatically sends that information out to everyone who needs to join that bridge. It updates the ticket. And so forth.

Meanwhile, CossTech’s Admin is able to focus all of her attention on resolving the incident. She doesn’t have to think about who else needs to be pulled in or informed.

If the Admin is still trying to put the fire out as time goes on, Opsgenie will automatically call/email/text the next-level stakeholders, providing all the necessary details and bridge information. With Opsgenie “reducing the noise” and automating the escalation process, however, the Admin can most likely resolve the issue before it reaches this next level. This is because the Admin has more time to focus on putting the fire out, instead of manually calling/talking to stakeholders.
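The time-based escalation described above can be sketched as a small rule table. This is an illustration only, not Opsgenie's configuration format or API; the step timings, contact names and function name are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int   # minutes since the incident opened
    contacts: tuple      # who gets notified at this step
    channels: tuple      # how they are notified


# Hypothetical policy: timings and contacts are illustrative assumptions.
POLICY = (
    EscalationStep(0, ("on-call admin",), ("text",)),
    EscalationStep(5, ("escalation contact",), ("voice", "text")),
    EscalationStep(15, ("senior incident management team",),
                   ("voice", "email", "text")),
)


def steps_due(policy, minutes_open, steps_already_run=0, resolved=False):
    """Return the escalation steps that should fire now.

    Nothing fires once the incident is resolved, which is how a quick fix
    keeps the next level of stakeholders from being contacted at all.
    """
    if resolved:
        return []
    return [step for step in policy[steps_already_run:]
            if step.after_minutes <= minutes_open]
```

Because the rules run on their own, the person troubleshooting never has to stop to work the phones; a scheduler would simply call something like steps_due every minute and notify whoever it returns.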

Automation is the way to go

As this example illustrates, Opsgenie offers many benefits. Opsgenie lets you create custom notification policies and on-call schedules in order to streamline incident management. It enables effective collaboration by working seamlessly with many different communications systems and automatically opening up communication channels. It automates incident management and reduces alerting noise. And much more.

Learn more about using Opsgenie for incident management at our upcoming webinar

Coyote Creek will be exploring Opsgenie in an upcoming two-part webinar series. In webinar #1, “Using Opsgenie for Incident Management,” we’ll focus on Opsgenie’s incident management capabilities. Webinar #2 will home in on its monitoring and alerting functions.

Join us at 10 am PST on June 26th. We’ll cover:

  • Overview of Opsgenie
  • How Opsgenie automates Incident Management
  • How Opsgenie improves collaboration
  • How Opsgenie streamlines management
  • How Opsgenie integrates with other apps and programs
  • And more