I started my career as a developer in the mid-90s, around the DC area, as companies were just starting to realize the benefits the web could offer. Back then, as you would around that part of the country, I started off by doing a lot of government contracts for different agencies. As at many companies during that time (and especially given government role segmentation), it was a widely accepted practice that developers could never access production. We sat quietly in the corner, wrote the code, and threw it over the wall to operations to deploy and manage. We were not allowed to touch the production database or any of the production services. Which made the feedback loop … interesting. With time, some companies became a bit more liberal (and I moved away from the public sector) and we, as developers, were granted limited permission to deploy and troubleshoot issues in production.
Soon after, some companies realized that giving developers production access could be beneficial, so we were given more and more production freedom. These days, we can deploy to production, troubleshoot in production, even fix things in production. However, the more production control developers get, the more frequently the question comes up — should developers be on call?
I’ve talked to a good number of people over the past few months, people from different backgrounds, and it seems that different companies employ different models when it comes to on call strategy. I’ve heard both success and failure stories with developers being in the same rotation as ops; developers acting as backup for the front line of support; developers being the front line of support; and even some hybrid approaches that in theory should never work. But if we abstract ourselves from tactical implementation and concentrate on the approach, your strategy should be based on the answer to a single question — what breaks the most in your production environment?
In complex systems (and it seems like in this day and age there are no others) a lot of things can break (and subsequently alert). Hardware, network, software components, performance issues, third party endpoints, you name it. So, at some point, you need to make a conscious decision on who will receive that dreaded page. Logically, if most of the things that break are networking and hardware related, chances are your ops people will be the first line of defense. If the application is the main source of alert spam — congratulations, you’ve just won the on call sweepstakes. That said, regardless of who may be the first person to receive a page, you want to eliminate the escalation step. As a rule of thumb, you want to wake up as few people as possible in the middle of the night. If your on call process is predicated on the person who receives a page calling someone else — you have failed. The goal is to design your on call strategy so that the person who gets alerted is the one who can solve the problem. That’s where the philosophy of actionable alerts comes in.
Actionable alerts, in case you haven’t heard the term, are an approach where nothing should page a person if there is nothing that person can do. In layman’s terms, it’s very much like the game you played with fortune cookies as a child, adding “in bed” to the end of each fortune. Except instead of “in bed” you add “at 2 in the morning”. And instead of a fortune, you have to answer three questions: “do I care about it … at 2 in the morning”, “can I fix it … at 2 in the morning” and “can somebody else fix it … at 2 in the morning”? If the answer to the first two questions is a “no” — you shouldn’t be alerted. If the answer is a “no” to the last one — congratulations, you’ve just won the on call sweepstakes. For life.
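That three-question filter can be sketched as a small routing function. This is only an illustration of the idea, not a real alerting API — the function and outcome names are made up:

```python
# Minimal sketch of the "actionable alerts" filter: every alert must
# answer the three 2-in-the-morning questions before it pages anyone.
def route_alert(i_care: bool, i_can_fix: bool, someone_else_can_fix: bool) -> str:
    """Decide what happens to an alert instead of blindly paging."""
    if i_care and i_can_fix:
        return "page-me"          # actionable: waking me up is justified
    if someone_else_can_fix:
        return "route-elsewhere"  # page the person who can actually act
    return "ticket-only"          # nobody can act at 2 a.m.; review in daylight
```

The point is that “ticket-only” is a perfectly valid outcome: the condition still gets recorded and reviewed, it just doesn’t wake anyone up.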
So how can you, as a developer, help with this process? Regardless of the role developers play in on call responsibility in your organization, there are a few basic tenets that should be followed to improve time-to-remedy, as well as let people on call get some much needed sleep.
As much as most developers don’t like to hear this, documentation is the most essential aspect of your on call escalation process. Make sure that every single alert that goes out has a comprehensive list of steps to troubleshoot and fix the problem. Remember, an alert is a predefined problem, which should come with a predefined solution. You won’t be able to cover 100% of problematic use cases, but a continual effort to document new use cases as issues arise is a good way to eliminate many unnecessary escalation calls.
Let me add, I am a firm believer that anyone on any tech team should be able to add an alert into the system regardless of role, whether it’s the Devs, DBAs, Ops or mythical “DevOps” people. Anyone should have the ability to add an alert, as long as each alert comes with a complementary wiki page/manual/run book covering the steps to reproduce and troubleshoot the problem, and (if all fails) additional escalation points.
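One way to enforce the “no alert without a run book” rule is a small validation step in whatever pipeline registers alerts. The field names below (`runbook_url`, `escalation_contact`) are assumptions for the sake of the sketch:

```python
# Sketch of a gate that rejects alert definitions lacking the run book
# and escalation information described above. Field names are illustrative.
def validate_alert(alert: dict) -> list:
    """Return a list of problems; an empty list means the alert may be registered."""
    problems = []
    if not alert.get("runbook_url"):
        problems.append("missing runbook_url: every alert needs troubleshooting steps")
    if not alert.get("escalation_contact"):
        problems.append("missing escalation_contact for when the run book fails")
    return problems
```

Run as a CI check or a pre-registration hook, this keeps the “anyone can add an alert” policy honest without slowing anyone down.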
Another mitigation technique that should fall in the developers’ lap is monitoring business metrics. I am a huge fan of a top-down monitoring strategy, where you alert on business critical processes (revenue, registration, etc.) and then, during troubleshooting, correlate the alert to a technical problem (running out of memory, disk failure, performance issues, etc.). Nobody cares if your cache hit ratio is slightly below threshold in the middle of the night. A lot of people do care, however, if the revenue suddenly drops. A revenue drop caused by performance problems could indeed trace back to the cache hit ratio being off. However, the last link of that chain, by itself, does not show the full problem scope. Make sure to monitor everything, but only alert on what’s important. And more often than not — you’ll find business processes affecting revenue are most important.
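The top-down idea can be sketched in a few lines: the business metric is the only thing that decides whether to page, while the technical metrics are merely recorded so they can be correlated during troubleshooting. Thresholds and metric names here are illustrative assumptions:

```python
# Top-down monitoring sketch: page on the business metric only;
# technical metrics are recorded for later correlation, never paged on.
def revenue_alert(current_revenue: float, baseline_revenue: float,
                  tolerance: float = 0.20) -> bool:
    """Page only if revenue falls more than `tolerance` below the baseline."""
    return current_revenue < baseline_revenue * (1 - tolerance)

def record_for_correlation(metrics_store: dict, name: str, value: float) -> None:
    """Cache hit ratio, memory, disk, etc. get stored, not alerted on."""
    metrics_store.setdefault(name, []).append(value)
```

When the revenue page does fire, the stored technical metrics are what let you walk the chain back from “revenue dropped” to “cache hit ratio tanked an hour ago”.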
Also, as you examine your monitoring needs, it’s important to remember that much like security and performance, monitoring is not a feature. You can’t deploy an application that you’ve spent the last six months developing and then say, “Hey, let me slap some monitors on this”. Monitoring your application should be a part of your development process. Implementing checks (much like tests) for individual features as you’re developing them will help to collect all of the essential metrics that you will need in order to have a true overview of the system health. Then when that page does come at 2 in the morning, you’ll be able to correlate the information to assist with troubleshooting. Remember, monitoring is a continuous process; you are not going to cover everything 100% up front. Things will break in production. But when they do, after fixing them, if you add another metric collecting the deciding data to alert on a similar issue, you’ll improve your production coverage. Until next time.
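One lightweight way to build checks alongside features is to instrument them as you write them, the same way you would write a test next to new code. The decorator below is a sketch, with an in-memory store standing in for whatever metrics backend you actually use:

```python
# Sketch of per-feature instrumentation added during development,
# not bolted on afterwards. The dict stands in for a real metrics backend.
import functools
import time
from collections import defaultdict

metrics = defaultdict(list)

def instrumented(name: str):
    """Record latency and errors for a feature as it is being developed."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                metrics[f"{name}.errors"].append(1)  # count failures per feature
                raise
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                metrics[f"{name}.latency_ms"].append(elapsed_ms)
        return inner
    return wrap

@instrumented("checkout")
def checkout(cart):
    """Hypothetical feature: total up a cart."""
    return sum(cart)
```

Because the metric is born with the feature, the data you need for that 2 a.m. correlation already exists by the time the feature ships.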
Lastly, of course, there is the topic of availability. It’s simple, really. No matter what role you, as a developer, are going to play within the on call process — be a good citizen. If you push code to production Friday afternoon — don’t go out and party it up Friday night. If you know there is a big marketing initiative going on over the weekend — don’t drive off into the sunset with no access to the interwebs.
No matter what your role in the on call process is, remember — be kind to those who have to wake up in the middle of the night. Because next time it may be you.