5 Best Practices for Automating Incident Management Alerts

With more and more large companies, Coyote Creek included, opting into a remote workforce, we believe that having a well-connected, automated Incident Management and Alerting system is more important than ever.

As an IT professional services firm, we believe that in order for IT service management (ITSM) and DevOps teams to thrive in a remote environment and to compete in a rapidly innovating field, businesses will need to update their processes. 

Today, we’re going to share the best practices that we’ve developed as a remote team to restore normal service operation as fast as possible after an interruption and to minimize the business impact for our clients.

If you’d like to dive deeper into how Coyote Creek does Incident Management and Alerts, please download our new whitepaper, Opsgenie: Modern Incident Management and Alerts for Remote Teams.

In the meantime, let’s get to Coyote Creek’s best practices for automating incident management alerts.

Best Practices for Automating Incident Management Alerts

At Coyote Creek, we use Opsgenie by Atlassian to automate our incident management alert notifications.

We love Opsgenie because it connects all of our monitoring, alerting, and collaboration tools. We never miss a critical alert, and we have the clarity we need to spring into action and start troubleshooting without confusion or process bottlenecks.

This is particularly important for remote teams, who don’t have the fall-back safeguards of a traditional office setting, like instant communication with co-workers.

Here’s how we use Opsgenie to streamline and improve our processes. 

Step 1: Automate retrieval and prioritization of tickets to begin the incident management process. 

Through our integration, Opsgenie automatically pulls data from our ticketing system (Jira Service Desk) and then, using rules we’ve defined, applies a priority level to the ticket. Already we’ve saved two manual steps: pulling tickets and analyzing them to assign a priority level.
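To make the idea of rule-based prioritization concrete, here is a minimal Python sketch. It is not Opsgenie’s rule syntax or our actual configuration: the Ticket class, the keyword rules, and the P3 default are all assumptions made for illustration. Only the P1 and P4 outcomes mirror the two examples later in this post.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """Hypothetical ticket record pulled from a ticketing system."""
    key: str
    summary: str


# Ordered (predicate, priority) rules; the first match wins.
# The keywords are illustrative assumptions, not real production rules.
RULES = [
    (lambda t: "offline" in t.summary.lower(), "P1"),
    (lambda t: "disk utilization" in t.summary.lower(), "P4"),
]

# Assumed fallback when no rule matches.
DEFAULT_PRIORITY = "P3"


def assign_priority(ticket: Ticket) -> str:
    """Return the priority of the first matching rule, or the default."""
    for matches, priority in RULES:
        if matches(ticket):
            return priority
    return DEFAULT_PRIORITY
```

Keeping the rules in an ordered list means the most urgent conditions can be checked first, and a new rule is a one-line addition.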

Step 2: Create alerts for team members based on pre-set on-call schedule and notification preferences. 

After a priority level has been set, Opsgenie automatically sends a new alert to the indicated responders based on the routing rules that we’ve given Opsgenie, as well as the notification rules we’ve set for each alert action (e.g. text message, email, or phone call).
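The routing logic described above can be sketched in a few lines of Python. This is an illustration only, not how Opsgenie implements schedules: the shift table, the engineer names, and the per-person channel preferences are hypothetical.

```python
from datetime import datetime

# Hypothetical shifts: (start hour in 24h time, engineer on call from that hour).
SHIFTS = [(0, "alice"), (8, "bob"), (16, "carol")]

# Hypothetical per-engineer notification preferences, in contact order.
PREFERENCES = {
    "alice": ["sms", "email"],
    "bob": ["call", "slack"],
    "carol": ["email"],
}


def on_call(now: datetime) -> str:
    """Return the engineer whose shift covers the given time."""
    current = SHIFTS[0][1]
    for start_hour, engineer in SHIFTS:
        if now.hour >= start_hour:
            current = engineer
    return current


def build_notifications(now: datetime) -> list:
    """Return (engineer, channel) pairs for whoever is on call right now."""
    engineer = on_call(now)
    return [(engineer, channel) for channel in PREFERENCES[engineer]]
```

Separating "who is on call" from "how that person wants to be reached" lets the schedule and the preferences change independently.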

Step 3: Deduplicate notifications to avoid “Alert Fatigue”

We’ve also set Opsgenie to deduplicate alerts from all of our systems so that our engineers aren’t getting the same alert from several systems.

This is a critical process because, otherwise, the overflow of repetitive alerts not only causes “alert fatigue” but also greatly increases the chance that a critical alert will get missed among the litany of repeat notifications.
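As a rough illustration of deduplication, the Python sketch below collapses alerts that describe the same problem into one entry with a counter. The choice of (host, check) as the fingerprint and the field names are assumptions for this example, not a description of Opsgenie’s internals.

```python
def deduplicate(alerts: list) -> list:
    """Collapse alerts sharing a (host, check) fingerprint into one entry.

    Each surviving alert keeps the fields of its first occurrence and
    gains a "count" of how many times it was reported.
    """
    open_alerts = {}
    for alert in alerts:
        fingerprint = (alert["host"], alert["check"])  # assumed dedup key
        if fingerprint in open_alerts:
            open_alerts[fingerprint]["count"] += 1
        else:
            open_alerts[fingerprint] = {**alert, "count": 1}
    return list(open_alerts.values())


# Hypothetical input: two monitoring tools report the same offline server.
incoming = [
    {"host": "web01", "check": "server offline", "source": "monitor-a"},
    {"host": "web01", "check": "server offline", "source": "monitor-b"},
    {"host": "db02", "check": "disk utilization", "source": "monitor-a"},
]
unique_alerts = deduplicate(incoming)
```

The engineer sees two alerts instead of three, and the count still shows that the first one was reported twice.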

Step 4: Automate information gathering so engineers can save precious time and get problems fixed faster. 

Once the appropriate person is notified, they are immediately given all of the information that they would previously have had to gather manually. Opsgenie collects the ticket name, summary, full description, the affected client(s), and a direct link to the ticket.

This saves an enormous amount of time that used to be spent gathering data manually and, inevitably, asking clarifying questions of other staff members.
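The sketch below shows one way such an enriched notification could be assembled in Python. The fields match the ones listed above, but the TicketDetails class, the message layout, and the example URL are hypothetical and are not taken from Opsgenie.

```python
from dataclasses import dataclass


@dataclass
class TicketDetails:
    """The details an engineer needs up front (hypothetical structure)."""
    name: str
    summary: str
    description: str
    clients: list
    url: str


def format_alert(ticket: TicketDetails) -> str:
    """Build a single message so nothing has to be looked up manually."""
    lines = [
        f"Ticket: {ticket.name}",
        f"Summary: {ticket.summary}",
        f"Affected clients: {', '.join(ticket.clients)}",
        f"Link: {ticket.url}",
        "",
        ticket.description,
    ]
    return "\n".join(lines)


# Hypothetical ticket; the URL is a placeholder, not a real system.
message = format_alert(TicketDetails(
    name="OPS-123",
    summary="Server offline: web01",
    description="Monitoring reports web01 unreachable since 02:14.",
    clients=["Client A", "Client B"],
    url="https://jira.example.com/browse/OPS-123",
))
```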

Step 5: Connect all necessary parties through the Opsgenie-to-Slack integration.

Finally, Opsgenie opens our conference bridge as well as the appropriate Slack channel so that affected stakeholders and other contributing staff members can communicate as needed and receive updates regarding an open ticket. 

This saves the time otherwise wasted reaching out to people individually and shortens the feedback loop during incident management.


Together, these five steps have reduced our overall response time by at least 50% and have helped us make our process simple, fast, and effective. By automating many of the menial tasks involved in responding to system incidents, we’ve enabled our engineers and escalation managers to jump into action and resolve problems much more quickly.

What we love the most is that we’re able to configure Opsgenie to do all the things we need based on OUR business rules, not rules pre-configured by someone who knows nothing about our business. 

Again, this is a must for remote teams that need to be empowered to work efficiently without needing a ton of hand-holding and supervision.

Now that we’ve discussed how we use Opsgenie, let’s take a quick look at what our process actually looks like when put into action. 

An Inside Look at Opsgenie Alerts at Coyote Creek

It’s important to note that all of our Opsgenie integrations are bi-directional, so all of our connected systems communicate consistently and update in real-time. 

Once we’ve programmed Opsgenie with our on-call schedules, prioritization rules, and alerting preferences, it works through those rules to determine who should be alerted, when, and how.

Here are two examples of how the Opsgenie alert process might play out.

Example 1: Server Offline Error

Here, a system offline alert is sent via the Opsgenie email integration.

Opsgenie Disk Utilization Alert

  • A ‘server offline’ alert is generated by our monitoring tool and an email is triggered.

Opsgenie Server Offline Alert

  • A forwarding rule exists for the group that receives the alert and forwards it to the Opsgenie integration email.
  • A forwarding rule exists for the Opsgenie mailbox that forwards this alert to Opsgenie.
  • The alert is then filtered and assigned a priority status.
  • In this case, it’s a P1-High Priority alert since it’s an offline server alert.

Opsgenie Forwarding Rule

  • The alert is then escalated using the team’s escalation policies and on-schedule (on-shift) rotation.

Opsgenie escalation policies

  • The on-shift engineer will get alerted via Slack, email, and mobile (phone call or SMS).
  • The engineer will then take action by first acknowledging and then working on the incident.
  • In this case, the on-shift engineer investigated the offline alert, found it was a false positive, restarted the monitoring agent, and closed the alert.
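The escalation step in this flow can be pictured with a short Python sketch. The delays and the escalation targets below are assumptions chosen for illustration. They are not our real policy, and this is not Opsgenie’s configuration format.

```python
# Hypothetical policy: (minutes without acknowledgement, who gets notified).
ESCALATION_POLICY = [
    (0, "on-shift engineer"),
    (10, "backup engineer"),
    (30, "escalation manager"),
]


def notified_by(minutes_unacknowledged: int) -> list:
    """Return everyone who has been notified after the given wait."""
    return [
        who
        for delay, who in ESCALATION_POLICY
        if minutes_unacknowledged >= delay
    ]
```

In the example above, the on-shift engineer acknowledged the alert right away, so under a policy like this one nobody further up the chain would have been notified.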

Example 2: Disk Utilization Alert

In this next example, we’ll take a look at a P4 alert that also uses the Opsgenie email integration.

Opsgenie Disk Utilization Alert

  •  A disk utilization alert is generated by our monitoring tool, and an email is sent to our monitoring distribution list, which includes the Opsgenie mailbox.
  •  A forwarding rule exists for the Opsgenie mailbox that forwards this alert to Opsgenie.

Opsgenie Filtering

  • The alert is then filtered and assigned a priority. In this case, it’s a P4-Low Priority.
  • The alert is then escalated using the team’s escalation policies and on-schedule (on-shift) rotation.

Opsgenie Escalation Policy

  • The on-shift CS engineer will get alerted via Slack, email, and mobile (phone call or SMS).
  • The engineer will then take action by first acknowledging and then working on the incident.
  • The on-shift engineer cleared up space and closed out the alert.

Conclusion

With the rapid innovation of technology and growing expectations for fast problem resolution 24/7/365, software businesses are having to step up their game in order to remain competitive and keep their customers happy. This is particularly true for those with remote teams who can’t fall back on the traditional safeguards of working in person.

To succeed, teams must automate their processes and connect their systems to streamline workflows and remove time-wasting activities and disorganization that can lead to missed critical alerts, unhappy customers, and lost revenue. 

For us, Opsgenie is the answer to this. Opsgenie connects all of our monitoring, ticketing, and communication tools so that we can see things through a single pane of glass and shorten the alert lifecycle by saving time normally wasted on manual processes and other inefficiencies.

Ready to go more in-depth on Opsgenie? Download our latest whitepaper here.