Patrick Clinger
Senior Eng Director
On-call at Fora
At Fora, we know how important it is to have a work environment that supports a healthy work-life balance, and nowhere is that more true than for our on-call team. Engineers on rotation are responsible for managing the health of our production systems, including off-hours. It’s a big responsibility, encompassing more than 1,000 unique community sites hosted on Google Cloud. Being on call both requires and develops valuable DevOps skills, including helping to manage our Kubernetes clusters, Vespa, Apache, MySQL, Golang services, and more.
We use a number of tools and technologies to support our systems and on-call team. Kubernetes handles deployment, scaling, and management of our production environment, so allocated resources automatically increase with platform traffic and the majority of issues self-resolve within minutes. We use Tideways to monitor response times and failure rates, identify bottlenecks, and dive into performance details for specific requests. On the observability side, Honeycomb lets us dig into our data points and spot the outliers and patterns that get us to the root of problems quickly. Finally, Google Cloud Monitoring routes rule-based alerts to Slack and Splunk so we can easily share and discuss them both within the on-call team and with cross-functional teammates.
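To make the autoscaling piece concrete, here is a minimal sketch of the kind of HorizontalPodAutoscaler that drives this behavior. The deployment name, namespace, replica bounds, and CPU threshold below are illustrative placeholders rather than our actual configuration; the manifest is built with the Kubernetes Go API types and printed as YAML.

```go
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	minReplicas := int32(3)
	targetCPU := int32(70) // scale out when average CPU utilization exceeds 70%

	// Hypothetical names for illustration only ("community-web", "production").
	hpa := autoscalingv2.HorizontalPodAutoscaler{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "autoscaling/v2",
			Kind:       "HorizontalPodAutoscaler",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "community-web",
			Namespace: "production",
		},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "community-web",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 20,
			Metrics: []autoscalingv2.MetricSpec{
				{
					Type: autoscalingv2.ResourceMetricSourceType,
					Resource: &autoscalingv2.ResourceMetricSource{
						Name: corev1.ResourceCPU,
						Target: autoscalingv2.MetricTarget{
							Type:               autoscalingv2.UtilizationMetricType,
							AverageUtilization: &targetCPU,
						},
					},
				},
			},
		},
	}

	// Print the object as a YAML manifest so it could be applied with kubectl.
	out, err := yaml.Marshal(hpa)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

In practice you would typically write this directly as a YAML manifest and apply it with kubectl; the point is simply that CPU-based scaling rules like this are what let capacity follow traffic without paging anyone.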
Growing our team
At one point, our on-call team had just four members. With week-long rotations, this meant that every team member was on call every month. This wasn’t sustainable, and we made it a priority to onboard additional team members.
Documentation
The biggest obstacle to adding members was the amount of tacit knowledge held by the existing on-call team. To onboard new team members, we had to improve our documentation so that knowledge wasn’t siloed in the heads of a few engineers.
We started by documenting every incident from recent months. These write-ups were less detailed than a postmortem, but still covered three key pieces of information:
- a description of the incident
- a detailed list of steps that were taken to investigate the incident
- how the incident was resolved
The team included step-by-step details and avoided generic language. Instead of writing, "I checked the logs," they wrote, "I checked the Google Cloud Logs <at this link> and found <this error message>," along with a brief note on how that finding helped resolve the incident.
In addition to incident documentation, we also improved our documentation around onboarding, environment setup, debugging and monitoring, and common alerts, and we included some example exercises for new on-call engineers to practice managing our Kubernetes environment.
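To give a flavor of those exercises (this is an illustrative sketch, not one of our actual exercises), a starter task might be to write a small Go program with client-go that lists the pods in a namespace whose Ready condition isn’t True, the kind of quick health check an on-call engineer reaches for constantly. The flag names and defaults here are assumptions for the example:

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Illustrative exercise: list pods in a namespace that are not Ready.
	namespace := flag.String("namespace", "default", "namespace to inspect")
	kubeconfig := flag.String("kubeconfig", filepath.Join(homedir.HomeDir(), ".kube", "config"), "path to kubeconfig")
	flag.Parse()

	// Build a client from the local kubeconfig (e.g. a staging cluster).
	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods(*namespace).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Print any pod whose Ready condition is not True, with its current phase.
	for _, pod := range pods.Items {
		if !isReady(&pod) {
			fmt.Printf("%s\t%s\n", pod.Name, pod.Status.Phase)
		}
	}
}

// isReady reports whether the pod's Ready condition is True.
func isReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

Running something like this against a staging cluster is a low-stakes way to get comfortable with kubeconfig setup, namespaces, and pod conditions before a real incident demands it.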
Onboarding
After cleaning up our documentation, we reached out to the broader Fora engineering team in 1:1 sessions, inviting them to join the on-call team. Within a week, we had eight volunteers.
We held a few meetings to discuss the logistics of being on call, went over our documentation, and had an open Q&A session. There were lots of great questions, such as how team members could get help if they were stuck, or how to properly document incidents in our incident log. We answered all the questions that came up, making sure our new team members felt supported.
We also assigned each new team member a dedicated mentor to answer questions, help with environment setup, and support the team member during any incident, day or night.
Results 🎉
Onboarding eight new team members brought the rotation to twelve engineers, decreasing the on-call frequency for any given person from about once a month to about once a quarter. We were also able to spread knowledge throughout the organization through improved documentation and mentorship.
With a more robust on-call team, engineers are generally refreshed and able to respond to incidents during their shifts, and the larger rotation helps us maintain several healthy practices. For instance, we conduct a formal postmortem for any P1 or P2 incident using Atlassian’s postmortem process, which is blameless so that incidents are treated as learning opportunities. Our postmortems include a root cause analysis, and we encourage on-call engineers to propose changes that would prevent recurrence and generally improve system reliability. We also treat keeping documentation up to date as an ongoing requirement: if the existing documentation doesn’t cover how to handle an incident, it’s the on-call engineer’s responsibility to update it after the incident.
Lastly, while out-of-hours incidents are rare, they do occasionally occur and cut into the on-call team member’s sleep. In these cases, we encourage the on-call engineer to ask someone else to stand in for their shift during normal working hours the next day, allowing them to catch up on some much-needed shut-eye. With these healthy practices, we are striving to support both healthy team members and healthy production systems.