Behind every agile ITSM team is a desire to ensure that their service stays online and works properly. When it doesn’t, they go to work as efficiently as possible to correct the problem and restore service to the customer.
We know this well as IT consultants ourselves. To ensure uptime, streamline our service management, and automate alerting, we use Opsgenie by Atlassian. Opsgenie allows us to customize notification rules, routing, and prioritization based on our unique business needs.
If you’d like to learn more about best practices for Opsgenie, register for our upcoming webinar, How to Automate Incident Alerts, and Escalation with Opsgenie, on November 12th. In the meantime, we’re going to share five ways to achieve 99.999% uptime with Opsgenie.
Please note that while these best practices have helped our organization, they are not a one-size-fits-all solution. If you need help implementing Opsgenie in a way that is tailored to your unique needs, we’re here and ready to help.
Five Incident Management Best Practices
1. Build to encourage uptime
DevOps teams and IT service managers are tasked with implementing their tools and building their add-ons and integrations in a way that best supports their ability to maintain and scale their solution.
Therefore, if the goal is to achieve 99.999% uptime, it stands to reason that some of the goals you will engineer into your system architecture will include reliability and the mitigation of data loss risk.
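To make the 99.999% target concrete, it helps to translate it into a downtime budget. The short sketch below does that arithmetic; the function name and the 365-day year are our own assumptions for illustration, not anything specific to Opsgenie.

```python
# Minutes in a non-leap year (assumption: 365 days).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed at the given availability."""
    return period_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

At 99.999%, the budget works out to roughly 5.26 minutes per year, which is why reliability has to be designed in rather than added later.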
One way to do this is to organize your applications by purpose so that each one is used as intended. For example, at Coyote Creek we have tools that support our ability to service our clients, a tool that streamlines internal collaboration, and another that monitors our web applications and communicates with our other tools when warnings occur.
When each component of a framework is broken down into microservices, each managed by a different owner, it is far easier to maintain, develop, and test on an individual basis, and identify potential threats early on.
2. Deploy via an immutable infrastructure
In an immutable infrastructure, live servers are never altered, so changes are not deployed to servers that would then be at risk of failure. Instead, new servers are created from a duplicate image, and changes are deployed on those servers.
If they prove to be effective and pass testing, then the old servers are de-provisioned in favor of the new servers. This way, if an error occurs, changes can be rolled back without risking a widespread outage. This also makes deployments more predictable and allows for more creative experimentation without fear of system failure.
The folks at Opsgenie do this as well, in addition to building their platform on mature services like AWS Lambda, SQS, Kinesis, S3, and others. Opsgenie traffic is also distributed evenly over multiple availability zones, which can be adjusted at any time, so that a zone failure isn’t an issue.
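The replace-rather-than-modify idea can be sketched in a few lines. This is a simplified in-memory model, not a real provisioning script: the Server class, the image names, and the passes_tests check are all hypothetical stand-ins for whatever your own tooling provides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)  # frozen: a server is never modified after creation
class Server:
    name: str
    image: str


def deploy(live: List[Server], new_image: str,
           passes_tests: Callable[[Server], bool]) -> List[Server]:
    """Build replacement servers from new_image; swap only if all pass testing."""
    candidates = [Server(name=s.name, image=new_image) for s in live]
    if all(passes_tests(s) for s in candidates):
        return candidates  # the old servers can now be de-provisioned
    return live  # rolling back means keeping the untouched old servers


def healthy(server: Server) -> bool:
    # Hypothetical health check: pretend one image is known to be bad.
    return server.image != "app:broken"


fleet = [Server("web-1", "app:v1"), Server("web-2", "app:v1")]
fleet = deploy(fleet, "app:v2", healthy)      # passes testing, fleet is replaced
fleet = deploy(fleet, "app:broken", healthy)  # fails testing, fleet is kept
print([s.image for s in fleet])
```

Because the live servers are never touched, a failed release leaves the running fleet exactly as it was.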
3. Invest in monitoring
All of your efforts to maintain availability and engineer reliability will be for naught if monitoring is not a priority. By using monitoring tools to proactively watch your service and consistently reviewing performance metrics, you are better able to catch warning signs before they become an incident or, in the worst case, a system failure.
At Coyote Creek, we have monitoring in place for our network, our systems, and our tools. Every unanticipated exception triggers an alert that is routed, via Opsgenie, to our on-call IT staff around the clock. We also test services routinely so we’re better able to anticipate an issue before downtime occurs.
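As a rough illustration of the exception-to-alert flow described above, here is a minimal local sketch. It does not call Opsgenie; the rota, the responder names, and the rule that connectivity errors are critical are all assumptions made up for the example. The P1 and P3 labels simply borrow Opsgenie's P1 to P5 priority scale.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical on-call rota: (start_hour, end_hour, responder), hours in UTC.
ROTA = [(0, 8, "night-oncall"), (8, 16, "day-oncall"), (16, 24, "evening-oncall")]


@dataclass
class Alert:
    message: str
    priority: str   # "P1" (critical) through "P5" (informational)
    responder: str


def on_call_at(when: datetime) -> str:
    """Return whoever is on call at the given timezone-aware moment."""
    hour = when.astimezone(timezone.utc).hour
    for start, end, responder in ROTA:
        if start <= hour < end:
            return responder
    raise ValueError("rota does not cover this hour")


def alert_from_exception(exc: Exception, when: datetime) -> Alert:
    # Assumed rule: connectivity failures are critical, everything else is P3.
    priority = "P1" if isinstance(exc, (ConnectionError, TimeoutError)) else "P3"
    return Alert(f"{type(exc).__name__}: {exc}", priority, on_call_at(when))


alert = alert_from_exception(
    ConnectionError("database unreachable"),
    datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
)
print(alert)
```

In a real setup, the resulting alert would be handed to your alerting tool, which applies its own routing and notification rules.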
In addition, we use reporting features that track our incident duration, response time, and communications so we can benefit from post-mortem learnings that help us improve our operations.
4. Prepare to act quickly during an outage
The truth is, every minute your service is offline hurts your user experience and, ultimately, your bottom line. Many times this can be avoided, but not every time. When an outage or a service interruption occurs, you can ensure that it’s resolved as quickly and efficiently as possible by equipping your team with the right tools.
In addition to the right tools, you also need a plan for all the most common scenarios and how you can resolve them in the least amount of time possible. Other helpful tactics include:
- System failover backup
- Automated notifications including SMS and chat
- Slack command automation to improve response time
- Studying previous cases
- Running disaster recovery simulations in real-time
- A status page that keeps customers updated on fixes in progress
5. Set recovery expectations
It’s important to set clear expectations with your teams to help them prevent incidents and prioritize recovery when they do occur. We encourage you to proactively monitor service and to also coordinate on-call teams so that incidents can be handled promptly, regardless of when they occur.
Furthermore, we encourage you to set goals for mean response time and mean time to acknowledge new incidents. This incentivizes your team to fix issues quickly and avoid them if possible.
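Tracking these goals only requires a few timestamps per incident. The sketch below computes both means from a small, made-up incident log and compares them to example targets. The field names, the sample data, the 5 and 10 minute goals, and the choice to measure response time from incident creation to the first responder action are all assumptions for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log; real timestamps would come from your alerting tool's reports.
INCIDENTS = [
    {"created": datetime(2024, 1, 1, 9, 0),
     "acknowledged": datetime(2024, 1, 1, 9, 4),
     "responded": datetime(2024, 1, 1, 9, 10)},
    {"created": datetime(2024, 1, 2, 22, 0),
     "acknowledged": datetime(2024, 1, 2, 22, 2),
     "responded": datetime(2024, 1, 2, 22, 6)},
]

# Assumed goals: (label, target in minutes) per tracked timestamp.
GOALS = {
    "acknowledged": ("mean time to acknowledge", 5),
    "responded": ("mean response time", 10),
}


def mean_minutes(incidents, end_key):
    """Average minutes from incident creation to the given timestamp."""
    return mean((i[end_key] - i["created"]).total_seconds() / 60 for i in incidents)


for key, (label, target) in GOALS.items():
    actual = mean_minutes(INCIDENTS, key)
    status = "within" if actual <= target else "over"
    print(f"{label}: {actual:.1f} min ({status} the {target} min goal)")
```

Reviewing these numbers regularly gives the team a shared, measurable definition of "prompt."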
We hope that these 5 tips will help you to engineer your infrastructure, systems, monitoring, and recovery processes in a way that ensures 99.999% uptime.
If you’d like to learn more about best practices and how to use Opsgenie to best support your business goals, please watch our webinar, How to Automate Incident Alerts, and Escalation with Opsgenie, on demand.