If you’ve ever been on call, you’ve probably experienced the pain of being woken up at 4 a.m., unactionable alerts, alerts going to the wrong team, and other unfortunate events. But there’s an aspect of being on call that is talked about less and is even more ubiquitous: the cognitive load. “Cognitive load” has perhaps become a buzzword lately. It refers to how much we hold in our “working memory,” and it is commonly divided into intrinsic, extraneous, and germane cognitive load.
When you are on-call, a “background worry” (file it under “intrinsic load”) is present whether you get paged every night or have a completely uneventful shift. The more unpredictable and untrustworthy an on-call rotation is, the greater this worry becomes, and the greater the chance of experiencing alert fatigue.
The cognitive burden of being on-call is different for everyone. Working parents need to juggle school and work schedules, heavy sleepers need to adjust (possibly by adjusting medications), and neurodiverse folks may have a heightened sense of anxiety during an on-call period. Everyone starts with a different threshold for how much cognitive load they can take on, and being on-call affects everyone’s mental well-being differently. For example, one person might not be at all bothered by the prospect of being woken up every night during their shift, whereas a different team member may be prone to illness if they are woken from a deep sleep even once in a week.
Furthermore, during their on-call shift, engineers are expected to prioritize on-call duties over other areas of their life. For example, going out to dinner means bringing a laptop with you and making sure you’ll have Wi-Fi; being paged during a live event might require you to step away for a period of time and miss the action; and seeing a movie in a theatre is completely out.
Fortunately, there are ways to reduce or address the cognitive load of being on-call (even if it’s still not a good idea to go to a movie theatre during your shift). This list is highly influenced by Liz Frost’s DevOpsDays Boston 2016 talk on Oncall Equity.
- Make alerting trustworthy.
Decreasing the cognitive load of being on-call starts with trustworthy alerting. As mentioned, alert fatigue is a serious problem. It most commonly arises from unactionable alerts or repeat alerts for the same problem.
Prioritize alert management for your team. What tools they use shouldn’t matter (and, in fact, should be up to the team), but alert management needs to be given serious consideration and revisited often in the spirit of continual improvement.
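As one illustration of cutting down on repeat alerts for the same problem, here is a minimal Python sketch. It is not tied to any alerting tool: the alert fields (`service`, `check`, `time`), the `suppress_repeats` function, and the 30-minute window are all hypothetical choices made for this example, and most alerting products offer their own grouping or deduplication settings that a team would likely use instead.

```python
from datetime import datetime, timedelta


def suppress_repeats(alerts, window=timedelta(minutes=30)):
    """Return the alerts that should page someone.

    An alert is dropped if an alert with the same (service, check) key
    already paged less than `window` ago. The field names are assumptions
    for this sketch, not any real tool's schema.
    """
    last_paged = {}  # key -> time of the most recent alert that paged
    to_page = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["service"], alert["check"])
        previous = last_paged.get(key)
        if previous is None or alert["time"] - previous >= window:
            to_page.append(alert)
            last_paged[key] = alert["time"]
    return to_page


# Hypothetical night: the same API check fires three times in ten minutes.
sample = [
    {"service": "api", "check": "latency", "time": datetime(2024, 1, 1, 4, 0)},
    {"service": "api", "check": "latency", "time": datetime(2024, 1, 1, 4, 5)},
    {"service": "db", "check": "disk", "time": datetime(2024, 1, 1, 4, 7)},
    {"service": "api", "check": "latency", "time": datetime(2024, 1, 1, 4, 10)},
    {"service": "api", "check": "latency", "time": datetime(2024, 1, 1, 4, 40)},
]

paged = suppress_repeats(sample)
for alert in paged:
    print(alert["time"].strftime("%H:%M"), alert["service"], alert["check"])
```

With this sample, five incoming alerts turn into three pages: the first API latency alert, the database disk alert, and the API alert that is still firing 40 minutes later. The right window is a judgment call for each team, which is part of why alert rules are worth revisiting often.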
- Assign a secondary on-call engineer.
When management assigns a secondary on-call person to your rotation, it signals to the primary on-call engineer that they’re not the “last line of defense” and don’t have to know everything about everything. Of course, knowing a bit about all the company’s systems, and being able to understand a problem with little context or time, is critical to being on-call, but the software is built and maintained by teams. Keep this team mentality throughout the on-call rotation.
- Minimize context switching for the on-call engineer.
Being on-call is full of context switching by nature. An alert comes in and the on-call engineer needs to pivot from whatever they were doing to address it. Frequent interruptions can increase extraneous cognitive load.
Some teams minimize context switching for the on-call engineer by having them solely assigned to triage during their on-call shift. From the Atlassian post, A Manager’s Guide To Improving On Call:
“This is why, as a best practice, we recommend keeping on-call duties and development duties separate. When on-call employees have free time, they can work on improving on-call-related documentation and automation to eventually improve the sustainability of systems and services.”
- Make it frictionless to trade shifts and on-call responsibilities between team members.
This requires trust among team members and between team members and management, solid foundational knowledge across the team of the systems and applications it is responsible for, and alerting software written for on-call teams (again, what software the team uses should be up to the team).
The ability to trade shifts can seriously decrease cognitive overload if team members know someone will be there to back them up if they need to step away for a personal emergency, or for something as routine as driving a child to school.
- Give a “rest” day after an on-call shift.
Offer your engineers a rest day after their on-call shift, and make sure they take it at least part of the time. This should be taken immediately so as to avoid engineers “banking” comp days. The entire point of this day is to rest after an on-call shift in order to be refreshed.
A “rest day” can mean different things to different people (as the cognitive burden is different for everyone!), and it can vary depending on how active an on-call shift has been. A rest day might mean not coming into work or not signing on at all, sometimes called a “comp day”. Alternatively, this could be a day of “light” work (whatever that means to the engineer), or a shortened day. This could even be a day dedicated to work the engineer wants to do but has not been allotted the time for.
The on-call engineer has had to adjust their life outside of regular hours. The stress of being on-call makes work bleed into personal time, even if the engineer is never paged during their shift. It is only fair to give some personal time back.
- Make being on-call a positive and rewarding experience.
There are many talks and blog posts on this already. One approach is making sure on-call is only a certain percentage of work. As Marc Hornbeek writes regarding SRE (Site Reliability Engineering) transformations, “SRE on-call work budgets are typically 25% or less. Developers share on-call work.”
Being on-call often means being exposed to broken processes and technical debt. On-call engineers often see this and have ideas to make these processes and systems better, but are not given the time in their work to do so. Combined with point 3 – minimize context switching for the on-call engineer – give the on-call engineer time to implement these fixes. This will, in fact, decrease the intrinsic cognitive load of future on-call shifts, and make your engineers feel empowered.
Each of these six points could be a blog post in and of itself, and a lot has been written on each of them. But, to sum it up: your on-call engineers keep your systems running and your business online. They are critical to your bottom line. Treat them like it.