In the early years of enterprise systems, developers would build products, services, and infrastructure, and IT admins would maintain them during normal business hours. But as technology advanced and user expectations increased, it quickly became clear that a more flexible, consistently available approach was needed.
Thus, the on-call shift became a critical duty that both DevOps teams and IT admins must embrace to keep their services available and customers happy.
At Coyote Creek, we have on-call shifts too, and each time a new engineer joins our team, we prepare them to carry out their critical duties while on call.
We familiarize our team members with what it means to be “on-call,” how to analyze, diagnose, and rectify errors, how to use Atlassian’s Opsgenie software, and anything else they need to be successful.
Based on this experience, we’ve prepared a collection of best practices to help you get your teams ready for on-call duties.
Best Practices for On-Call IT Scheduling
Teach your team the basics of on-call schedules and escalations
It’s important that each of your team members understands their on-call schedule and how your escalation procedures determine who receives an issue when it needs to be escalated.
At Coyote Creek, our on-call schedule includes a rotation for both day and night shifts, plus redundancy in case the primary escalation point misses an alert. In that case, the alert goes to a secondary on-call team member.
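As a rough illustration of that primary-to-secondary fallback, here is a minimal Python sketch. It does not use the Opsgenie API. The `Responder` class, the `escalate` function, and the names "alice" and "bob" are all hypothetical, and the acknowledgment check is simulated with a simple callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Responder:
    """Hypothetical on-call team member used only for this illustration."""
    name: str
    role: str  # "primary" or "secondary"


def escalate(
    chain: Sequence[Responder],
    acknowledged_by: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Page each responder in order and return the first who acknowledges."""
    for responder in chain:
        if acknowledged_by(responder):
            return responder
    # Nobody acknowledged; a real policy would escalate further from here.
    return None


chain = [Responder("alice", "primary"), Responder("bob", "secondary")]

# Simulate the primary missing the alert and the secondary picking it up.
handler = escalate(chain, lambda r: r.role == "secondary")
print(handler.name if handler else "unacknowledged")  # prints "bob"
```

In a real tool, the callback would be replaced by "wait a configured number of minutes for an acknowledgment," but the ordering logic stays the same.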
Set up individual alerting notification rules
It’s important that on-call team members are accessible during their shift. To ensure that your on-call engineer is notified appropriately, configure and review notification rules with them. It’s also important that you have a method in place for prioritizing alerts by severity and for sending different notifications at each severity level.
For example, high-priority alerts can use push and voice notifications to ensure the most timely response, whereas low-priority alerts can be sent by email, text/SMS, push notification, or another method. We caution against choosing too many options, as this can cause alert fatigue.
There should also be redundancy built into everything you do, not just in escalation from a primary to a secondary team member but also in your notifications. For example, if an engineer is not at their computer, they should be able to get a notification on their mobile phone. Furthermore, we recommend creating separate channels in Slack or whichever chat service you use.
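To make the idea of severity-based rules with a mobile fallback concrete, here is a small Python sketch. The channel names, the `RULES` mapping, and the `channels_for` function are assumptions chosen for illustration and are not taken from Opsgenie or any other tool.

```python
# Hypothetical notification rules: which channels fire for each priority.
RULES = {
    "high": ["push", "voice"],
    "low": ["email"],
}

# Channels assumed to reach the engineer's phone.
MOBILE_CHANNELS = {"push", "voice", "sms"}
MOBILE_FALLBACK = "sms"


def channels_for(priority: str, at_computer: bool) -> list[str]:
    """Return the channels to use, adding a mobile fallback when needed."""
    channels = list(RULES[priority])
    reaches_phone = any(c in MOBILE_CHANNELS for c in channels)
    if not at_computer and not reaches_phone:
        # Redundancy: make sure at least one channel reaches the phone.
        channels.append(MOBILE_FALLBACK)
    return channels


print(channels_for("high", at_computer=False))  # ['push', 'voice']
print(channels_for("low", at_computer=False))   # ['email', 'sms']
```

Keeping each priority to one or two channels, with a single fallback, is one way to stay redundant without adding to alert fatigue.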
Provide access with appropriate permissions and the right tools
At the end of the day, our goal is to expedite a response to the most severe incidents. Therefore, it is critical that an on-call engineer be able to identify, diagnose, troubleshoot, and fix a software issue.
To be effective, they must have the access needed to make changes and API calls, along with the right tools to solve issues. This can include access to your hosting environment, login credentials, change privileges, chat commands, and runbooks.
They will also need permission to access the deployment environment and must understand which commands can adjust processes to provide a fix.
Train your on-call staff in your infrastructure and tech stack
Before they are ever given an on-call schedule, each of your engineers and team members needs to be familiar with your organizational infrastructure as well as all the applications you use to conduct business and handle call management.
This will ensure they can quickly identify issues, their causes, and their impact, making resolution easier and faster. In addition, you’ll want to make sure that any documentation around applications or standardized processes is up to date so that, if it’s used as a reference, it doesn’t lead the engineer down the wrong path.
Train your on-call staff on the appropriate diagnostic tools
Teams vary in the management tools they use for deployment health tracking, application performance monitoring, and resource utilization. Any new employee, and especially one who is on-call, should have access to these tools and learn how to use them.
Set up their schedule management notification rules
It is crucial that engineers are aware of an upcoming on-call shift. To achieve this, ensure that your engineer has configured schedule notification rules. These rules cover when they are alerted that their shift is about to start and how that reminder is delivered. Don’t forget to account for the appropriate time zones for your on-call staff.
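The time zone point is easy to get wrong, so here is a minimal Python sketch of computing a shift reminder in an engineer's local time. The `reminder_time` function, the one-hour lead time, and the example shift are hypothetical. A fixed UTC-8 offset is used to keep the example dependency-free; in practice you would more likely pass an IANA zone such as `zoneinfo.ZoneInfo("America/Los_Angeles")` so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def reminder_time(shift_start: datetime, lead: timedelta, local_tz: tzinfo) -> datetime:
    """Return when to send the shift reminder, in the engineer's local time."""
    if shift_start.tzinfo is None:
        raise ValueError("shift_start must be timezone-aware")
    return (shift_start - lead).astimezone(local_tz)


# Hypothetical shift stored in UTC, as many scheduling systems do.
shift_start = datetime(2024, 3, 4, 2, 0, tzinfo=timezone.utc)

# Fixed offset for illustration only; it does not follow daylight saving time.
engineer_tz = timezone(timedelta(hours=-8))

reminder = reminder_time(shift_start, timedelta(hours=1), engineer_tz)
print(reminder.isoformat())  # 2024-03-03T17:00:00-08:00
```

Note that the reminder lands on the previous calendar day in the engineer's local time, which is exactly the kind of surprise that explicit time zones help avoid.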
Define incident response duties
To be successful, an on-call engineer or incident responder’s duties should be clearly stated to avoid any possibility of confusion, frustration, or incidents falling through the cracks. Additionally, incident response workflows and templates should be documented and kept up to date to avoid as many bottlenecks as possible.
Some of the expectations you might address in your documentation include:
- When should an alert be acknowledged?
- What are the response time expectations?
- What are the classification guidelines for issue severity?
- What are the routing rules for when an issue should be escalated?
- What are the escalation policies, including who should receive the escalation based on different rotations?
- When should other departmental stakeholders (e.g., customer support) be notified?
- How should away time and breaks be handled?
- How should incidents be documented (including performance metrics) to support post-mortem analysis?
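Once you have answered questions like these, it can help to write the answers down in a form a tool can check. The Python sketch below shows one possible way to encode acknowledgment targets per severity. The severity names, the time limits in `ACK_TARGETS`, and the `should_escalate` function are illustrative assumptions, not recommendations, so replace them with the values from your own policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical acknowledgment targets; set these from your own policy.
ACK_TARGETS = {
    "critical": timedelta(minutes=5),
    "major": timedelta(minutes=15),
    "minor": timedelta(hours=4),
}


def should_escalate(
    severity: str,
    created_at: datetime,
    now: datetime,
    acknowledged: bool,
) -> bool:
    """Return True when an unacknowledged alert has exceeded its target."""
    if acknowledged:
        return False
    return now - created_at > ACK_TARGETS[severity]


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
six_minutes_later = created + timedelta(minutes=6)

print(should_escalate("critical", created, six_minutes_later, acknowledged=False))  # True
print(should_escalate("major", created, six_minutes_later, acknowledged=False))     # False
```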
Having a prepared, competent on-call staff is a critical component of any organization whose priority is to keep its services dependable and avoid widespread outages.
For your engineers to be successful in their on-call duties, you must ensure that they are well trained and prepared, with everything they need to execute their on-call shift. The best practices shared in this guide are a great first step in ensuring that your team and your on-call processes are efficient and ready to create the best experience for your customers.
If you’d like to learn more about how to optimize your own incident management and alerts with Opsgenie, we invite you to join us for a webinar on November 12 where we’ll be discussing:
- Opsgenie and how it can help your Help Desk or IT Operations team thrive
- Opsgenie features
- Best practices for using Opsgenie alerts
- Recommended Opsgenie Integrations
- And much more!