When it comes to supporting enterprise software users, effective incident response is an organization’s best line of defense against downtime, security breaches, and ultimately, unhappy customers.
A large part of incident management revolves around alerting the right parties to an issue as it arises, but this can quickly get out of hand with a large company.
With no way to prioritize alerts by severity, and multiple notification channels piling an unmanageable volume of alerts onto your on-call engineer, critical notifications can easily become a needle in a haystack.
This is commonly referred to as alert fatigue, and the results can be devastating for your business and your customers.
How to Avoid Alert Fatigue
As IT consultants ourselves, we’ve experienced this. To help us optimize our processes and automate our alerts so that the right people are notified promptly, according to their preferences, and critical issues receive the highest priority, we turned to Atlassian’s Opsgenie incident management platform.
In our upcoming webinar, How to Automate Incident Alerts and Escalation with Opsgenie, on November 12th we’re sharing what we’ve learned from onboarding and optimizing Opsgenie for ourselves and our clients.
Today, we’re going to share a brief overview of some of our optimization best practices with you, and if you want to take a deeper dive, register for our webinar.
Step 1: Incident Detection and Alert Trigger
The first step in an incident management workflow is obviously incident detection. Ideally, your system monitoring tools will recognize the error before your client does, but this isn’t always the case. When it’s not, you’ll receive a support ticket.
Opsgenie will take note of that support ticket and track its lifecycle. Already you’re at an advantage by not having to catalog and track the tickets yourself.
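As a rough sketch of what happens at this step, a support ticket can be turned into an Opsgenie alert through its v2 Alert API (`POST /v2/alerts`). The payload fields below (`message`, `alias`, `priority`) follow that API; the ticket structure itself is a hypothetical example:

```python
def ticket_to_alert_payload(ticket):
    """Map a support ticket (hypothetical structure) to an
    Opsgenie v2 Alert API payload (POST /v2/alerts)."""
    return {
        "message": f"[{ticket['id']}] {ticket['summary']}",
        # The alias lets Opsgenie deduplicate repeat alerts for the same issue
        "alias": f"ticket-{ticket['id']}",
        "description": ticket.get("description", ""),
        # Opsgenie priorities run P1 (critical) through P5 (informational)
        "priority": ticket.get("priority", "P3"),
        "tags": ["support-ticket"],
    }

ticket = {"id": "SUP-421", "summary": "Login page returns 500", "priority": "P2"}
payload = ticket_to_alert_payload(ticket)
print(payload["message"])  # [SUP-421] Login page returns 500
```

In practice you would send this payload to Opsgenie with an authenticated HTTP request; the point here is that every ticket becomes a tracked alert with a stable identity.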
Step 2: Alert the Proper Team Member
The next step is to set up notification rules so that the appropriate engineer receives actionable alerts, whether that’s someone during working hours or an off-hours on-call representative. This step, unfortunately, is where a lot of trouble can occur with a less-than-optimal process.
- The wrong person could receive a critical alert
- The right person could receive the alert, but via a channel they aren’t prepared for. For example, they’re used to receiving chat alerts and the alert is sent by email.
- The right person receives the alert, along with 1000 others, and doesn’t have the means to prioritize the most critical alerts.
To combat these pitfalls, Opsgenie allows you to set up real-time alerts through your communication channels, preferences, and team schedules to ensure the right person receives the alert, via their preferred communication channel.
Communication channels could be email, a messaging service like Slack, or Zoom if video chat is part of your process. If using a chatops service like Slack, Opsgenie can create a dedicated room for alerts so that notifications don’t get lost among other unrelated notifications.
Furthermore, your communication channels are connected to Opsgenie so that it can maintain an audit of the entire incident management process and all communication that has transpired regarding it.
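To make the routing logic concrete, here is a minimal sketch of preference-based channel selection. The preference table and the rule that critical alerts fan out to every registered channel are illustrative assumptions, not Opsgenie’s actual API:

```python
# Hypothetical notification-preference table (illustrative only)
PREFERENCES = {
    "alice": ["slack", "sms"],  # try Slack first, then SMS
    "bob": ["email"],
}

def channels_for(engineer, severity):
    """Pick notification channels: severity-1 alerts go out on every
    channel the engineer has registered; others use the first choice."""
    prefs = PREFERENCES.get(engineer, ["email"])  # default to email
    return prefs if severity == 1 else prefs[:1]

print(channels_for("alice", 1))  # ['slack', 'sms']
print(channels_for("alice", 3))  # ['slack']
```

Opsgenie implements this kind of logic for you through per-user notification rules; the sketch just shows why per-person preferences matter.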
Step 3: Assess the Impact
Once you’ve identified the appropriate communication channels to ensure your alerts never go unnoticed, the next step is to automate alert prioritization.
This covers the severity level, the response required, and who is assigned to begin troubleshooting. It means addressing questions like:
- How are we (or the customer) impacted by this issue?
- What is the user experiencing/seeing?
- How many users are affected?
- When did the issue start?
- How many support tickets are connected to this issue?
- Are there any external factors involved? E.g. data loss, a security breach, or other systems impacted.
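One way to make this assessment repeatable is to encode the answers to those questions and derive a starting severity from them. The thresholds below are illustrative, not Opsgenie defaults:

```python
def initial_severity(all_users_down, security_or_data_issue,
                     users_affected, workaround_exists):
    """Derive a starting severity level (1 = worst) from impact answers.
    Thresholds are illustrative; tune them to your own SLAs."""
    if all_users_down or security_or_data_issue:
        return 1  # outage, breach, or data loss: all hands
    if users_affected > 100 and not workaround_exists:
        return 2  # core functionality impaired for many users
    return 3      # minor inconvenience, workaround available

print(initial_severity(False, False, 500, False))  # 2
```

A human should still be able to override the computed level, but a rule like this keeps the first triage decision consistent across responders.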
Step 4: Prioritizing Critical Alerts
Another critical piece of an ideal incident management process is being able to separate low-priority alerts from critical ones by applying a severity level. Opsgenie categorizes the severity level in the following way:
- Severity Level 1 – This level is for those incidents that have a very high level of impact. Examples can include customer-facing issues like an outage affecting all users, a security or privacy breach, or a data loss issue.
- Severity Level 2 – This level is for a major incident that has a significant impact on core functionality but is not affecting all customers.
- Severity Level 3 – This is a minor incident with a low level of impact that is a minor inconvenience to customers and often has a workaround available. System performance is degraded but still usable.
With a numbered system like this, users can take action quickly just by someone mentioning that the alert is a “potential level 1 issue,” and don’t need a lengthy explanation.
This also helps to prioritize alerts for your on-call teams. For example, a less severe alert like a level 3 can probably wait for operating hours while a level 1 or 2 likely cannot.
Step 5: Communicating with Key Stakeholders
With Opsgenie, you can have steps 1-4 done in record time and often be able to notify the key stakeholders before they’ve even realized the full impact of the situation.
While internal communication helps avoid confusion and expedite a resolution, communicating with the customer lets them know you’re aware of the issue and responding, so they can be at ease. This helps to build trust and loyalty with your customers.
You can also use a status page so that the customer has something to refer to as they wait.
Step 6: Alert Deduplication
For many systems administrators, fielding repeated alerts for the same issue causes fatigue and can result in missing a separate, critical alert. Opsgenie addresses this with alert deduplication, which reduces the number of alerts and helps prevent alert fatigue.
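Opsgenie keys deduplication on an alert’s alias: repeated alerts with the same alias are collapsed into the existing open alert rather than creating new ones. A simplified sketch of that behavior:

```python
open_alerts = {}  # alias -> alert record

def receive(alias, message):
    """De-duplicate incoming alerts by alias: a repeat bumps a count
    on the existing open alert instead of creating a new one."""
    if alias in open_alerts:
        open_alerts[alias]["count"] += 1
    else:
        open_alerts[alias] = {"message": message, "count": 1}
    return open_alerts[alias]

# A flapping monitor fires 50 times, but the on-call sees one alert
for _ in range(50):
    receive("db-conn-refused", "Database connection refused")

print(len(open_alerts), open_alerts["db-conn-refused"]["count"])  # 1 50
```

The responder still sees how noisy the issue is (the count), without fifty separate pages.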
Step 7: Alert Escalation
From time to time, it may become necessary to escalate, or transfer, management of an incident to another party.
This could be due to a level of severity, not initially noted, that requires a more senior engineer. Or it could simply be because a shift has ended and the issue must be picked up by the next scheduled representative. Either way, appropriate escalation policies should be in place for ops teams.
With Opsgenie, you can configure team schedules and on-call schedules to automate incident routing as opposed to manually determining who should respond to an incident, which could waste valuable time.
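An escalation policy is essentially a timetable: if the alert is still unacknowledged after each delay, it moves to the next responder. The policy below is a hypothetical example of that idea, not Opsgenie’s configuration format:

```python
from datetime import timedelta

# Illustrative escalation policy: who holds an unacknowledged alert
# after each delay (hypothetical roles, not Opsgenie's API)
ESCALATION_POLICY = [
    (timedelta(minutes=0), "on-call engineer"),
    (timedelta(minutes=10), "secondary on-call"),
    (timedelta(minutes=30), "team lead"),
]

def current_responder(elapsed):
    """Return who should hold an unacknowledged alert after `elapsed` time."""
    responder = ESCALATION_POLICY[0][1]
    for delay, who in ESCALATION_POLICY:
        if elapsed >= delay:
            responder = who  # keep the latest step whose delay has passed
    return responder

print(current_responder(timedelta(minutes=15)))  # secondary on-call
```

In Opsgenie this routing is driven by your configured escalations and on-call schedules, so nobody has to watch a clock.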
Step 8: Resolution & Post-Mortem Learnings
Once the incident is resolved, and the current impact has ended, it’s time to shift into clean up, communicate resolution to all affected parties, and begin post-mortem analysis.
Ultimately, your incident management and alert system are only as powerful as how well you can resolve incidents when using it and what you can learn from the process.
While much of what we do can be automated and simplified with Opsgenie, it isn’t a substitute for observing what’s going on and sharing theories internally so we can prevent issues from recurring. Additionally, metrics can help responders improve internal IT operations.
These learnings can also be invaluable for our DevOps and engineers who create improvements that prevent future issues and enhance software over time.
As IT consultants, we’ve learned that the key components of a successful incident management solution and alerts process are a streamlined system that detects incidents, helps us assess the impact, automates notifications, prioritizes alerts, resolves issues efficiently, and helps us learn from them.
At Coyote Creek, Opsgenie does all this for us and more.
To learn more about how to optimize your incident management and alerts protocols, watch our webinar-on-demand now.