An Insider's Look Into Incident Management With Codeship's Florian Motlik
Recently, I took some time to sit down with Florian Motlik, CTO of Codeship, to chat about the company’s infrastructure, on-call alerting, and incident communication. For a bit of background, Codeship is one of the leading continuous delivery platforms, letting developers focus on writing code by taking care of the testing and release process. With a quick push up to GitHub or Bitbucket, Codeship will run all of your tests and deploy to specific environments if all tests pass.
Once integrated into your workflow, Codeship becomes a mission-critical part of your software development, making reliability core to the service and a main focus for the team. Flo’s put many hours into thinking about incident management, so we’re happy he had the chance to swing by the StatusPage San Francisco office. Codeship is currently 19 full-timers with 11 engineers spanning two offices in Boston and Vienna. Let’s jump in.
In terms of infrastructure, we run on Heroku and AWS. Our main application is a Rails app running on Heroku. When a build comes in, it immediately gets queued up in Sidekiq. We run a cluster of large AWS machines that are split using containers to then take and run builds from the queue. The AWS machines are started from a common AMI that is rebuilt every time we want to deploy, making the whole build infrastructure immutable. Updates from the build are sent back to our main application through Sidekiq as well. As soon as a build has finished, notifications go out through email, Slack, and other services via Sidekiq background workers. We're heavy Sidekiq users and process millions of jobs every day.
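For readers who want a concrete picture, here’s a rough sketch of what a build pipeline built around Sidekiq might look like. The worker and model names below (BuildWorker, BuildRunner, NotificationWorker) are ours for illustration, not Codeship’s actual code:

```ruby
require "sidekiq"

# A webhook from GitHub or Bitbucket enqueues a job; a background worker
# picks it up and hands the build off to a container on the build cluster.
class BuildWorker
  include Sidekiq::Worker
  sidekiq_options queue: "builds", retry: 5 # Sidekiq retries failed jobs automatically

  def perform(build_id)
    build = Build.find(build_id)                # assumed ActiveRecord model
    BuildRunner.run(build)                      # hypothetical: dispatches to the AWS build cluster
    NotificationWorker.perform_async(build_id)  # email/Slack notifications, also via Sidekiq
  end
end

# Enqueued from the webhook controller once new code is pushed:
#   BuildWorker.perform_async(build.id)
```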
The nice thing about a system like ours is that it is very event-driven. We make sure to queue up and buffer on different levels so that if some part of the infrastructure goes down, we won’t lose any data. For customers, it’s not a huge issue if a build is delayed a couple of seconds or even up to a minute. It’s a big issue if the build doesn’t actually run.
"Instead of optimizing for time in queue, we optimize the system to be fault tolerant and to make sure we never lose a build."
We have several ways of making sure that builds run properly, from retrying failures in the infrastructure before the actual build occurs to shutting down machines gracefully when they start misbehaving. As we’re running an immutable infrastructure, it’s very easy to check our systems for high load or failed ping checks and simply replace those servers with equivalent ones. Also, all data exchanged between different parts of our system is routed through a queue so that we never lose a single data point if a system goes down. We’re currently working on more levels of replication for the entry point of data into the system to make sure we can withstand even a complete AWS or Heroku outage.
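Because every build host boots from the same AMI, “fixing” an unhealthy machine can simply mean swapping it out. Here’s a minimal sketch of what that might look like with the AWS SDK for Ruby; the instance type, region, IDs, and helper name are placeholders, not Codeship’s actual tooling:

```ruby
require "aws-sdk-ec2" # gem "aws-sdk-ec2"

# Illustrative only: replace a build host that failed its health check
# with a fresh instance booted from the current (immutable) AMI.
def replace_build_host(ec2, unhealthy_instance_id, ami_id)
  # Launch the replacement first so build capacity never drops.
  ec2.run_instances(
    image_id: ami_id,
    instance_type: "c4.4xlarge", # placeholder size
    min_count: 1,
    max_count: 1
  )

  # Then terminate the unhealthy host. In practice you would drain or
  # requeue any running builds before pulling the plug.
  ec2.terminate_instances(instance_ids: [unhealthy_instance_id])
end

ec2 = Aws::EC2::Client.new(region: "us-east-1")
replace_build_host(ec2, "i-0123456789abcdef0", "ami-0abcdef1234567890")
```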
We don’t want to have a dedicated Ops team - definitely not at our size currently and most likely not in the future. For us, it simply sets the wrong incentive. If you build something, you should have the incentive not to just ship it, but to ship it in a way that really works well. If it breaks, you need to be the one fixing the issue.
"In general, I’m a big believer in you built it, you run it."
We use PagerDuty for all of our internal alerting and Librato for our metrics and dashboards. For PagerDuty, we have two layers of alerting. The first layer always alerts an engineer who has the ability to fix the problem. The second layer is for all non-technical employees.
While we need to have people on call that can actually fix the problem at hand, it’s equally important to have a dedicated person communicating about an issue to customers on our status page and through Twitter. First-layer engineers always have the ability to call on the second layer when it comes to communicating around incidents, allowing engineers to focus on fixing the issue. This setup ensures that the whole team feels responsible for reliability and communication as the company grows.
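In PagerDuty this kind of split is usually modeled as separate services or escalation policies, but the idea is easy to show with the Events API v2. The routing keys and alert text below are made up for illustration:

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical routing keys, one per escalation layer.
ENGINEERING_KEY   = ENV["PD_ENGINEERING_ROUTING_KEY"]
COMMUNICATION_KEY = ENV["PD_COMMS_ROUTING_KEY"]

def page(routing_key, summary)
  uri  = URI("https://events.pagerduty.com/v2/enqueue")
  body = {
    routing_key:  routing_key,
    event_action: "trigger",
    payload: { summary: summary, source: "build-monitoring", severity: "critical" }
  }
  Net::HTTP.post(uri, body.to_json, "Content-Type" => "application/json")
end

# First layer: an engineer who can actually fix the problem.
page(ENGINEERING_KEY, "Build queue depth climbing for 10+ minutes")

# Second layer, pulled in by the engineer when customers need to hear about it.
page(COMMUNICATION_KEY, "Customer-facing incident: builds delayed, please update the status page")
```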
Our VP of engineering spent a lot of time in PagerDuty making sure that we have an overlapping schedule so that someone in Vienna is not getting woken up if there is someone awake in Boston and vice versa, although you may have to stop watching Netflix for a couple of hours! There are really very few hours in the day when nobody is up, so having two offices has been great so far.
Definitely applicable to us as well. Getting paged for unimportant issues happened quite a bit in the past when our alerting and dashboards weren’t up to par. That said, when you think about building a product over time, there’s always a trade-off between how much dev time you put into customer-facing features versus internal tools like alerting. It’s unavoidable. In the beginning, it may be okay to do the minimum needed on the ops side, but it becomes much more important as you grow.
"We’ve recently invested a lot more time into dashboarding and alerting to decrease false positives in PagerDuty. No one wants to be woken up in the morning just to see that everything is working fine and that the metric you’ve alerted on really isn't useful."
Sometimes, we’ll have a build server start to misbehave or its load will climb too high. The issue doesn't impact the builds of our customers and we don't need to alert the whole team, but it looks kinda strange and an engineer needs to know about it. Once the engineer gets alerted, we'll gracefully shut down the machine and spin up a new one. These tweaks over time have made our lives much easier. Our general philosophy is to build a system where you can exchange every part without bringing the whole system to its knees.
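Catching a “looks strange but not customer-impacting” host usually comes down to having the right per-host metrics in the first place. Here’s one way such a metric could be fed, sketched with the librato-metrics gem; the metric name is invented for the example, and the actual alert (and its PagerDuty hook) would be configured in Librato itself:

```ruby
require "socket"
require "librato/metrics" # gem "librato-metrics"

Librato::Metrics.authenticate ENV["LIBRATO_EMAIL"], ENV["LIBRATO_API_KEY"]

# Report this build host's load average as a per-source gauge.
# A Librato alert on a sustained high value can then page the on-call
# engineer for just this machine instead of waking up the whole team.
load_avg = File.read("/proc/loadavg").split.first.to_f

Librato::Metrics.submit(
  "build_host.load_avg" => { type: :gauge, value: load_avg, source: Socket.gethostname }
)
```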
Side note: read more about immutable infrastructure from Flo here.
Communicating With Customers
We have various levels of customer communication from displaying current build status in-app to incident communication on our status page. As an example, we’ll always show the current status of a build to a customer once they’ve pushed up new code. Is it currently waiting? Is it queued up? Was there an infrastructure failure that impacted their specific build? Making that information accessible is an area where we have invested quite a bit of time recently.
"When it comes to incident identification and communication, alerting our customers through StatusPage is critical."
At this point in the incident lifecycle, we need to let our customers know at a very high level that a problem simply exists and that we are looking into it. What we don’t want is to leave customers in the dark wondering whether or not we’re currently experiencing any issues.
Having this customer-first policy is not only helpful for end users, but also for us. Updating our status page immediately diminishes the inbound flow of support tickets.
It eliminates inquiries that would otherwise pile up and burden both the support and engineering teams with simple questions like, “Hey, is there a problem right now with builds?” Acknowledging that we have a problem and communicating it to customers will always be the first step. We’ll never try to hide it.
It’s typically someone on our customer support team. The entire team lives in Slack, allowing engineers to communicate efficiently with those tasked with updating the status page.
We have a dev channel in Slack where the team on-call congregates during an incident. The second layer on-call gets regular updates from engineering, keeping the support team informed about what they’ve been working on, along with any new information that can be sent to customers. As part of fixing the problem, every engineer understands that they need to regularly update the support team. The information doesn’t need to be perfect, but even something like “We’ve identified the issue and are moving in a specific direction to fix the problem” helps tremendously. I could also see us moving to an on-call or incident-specific room in the future.
That’s right. One of our big pushes right now is to make internal performance metrics more accessible to both internal team members and external customers. We want to get to a point where customers are treated the same way as our marketing and customer support teams. As an example, we could start showing metrics such as the number of builds currently running, number of builds enqueued, and number of builds stuck - if any. Surfacing this information will let the customer support team diagnose high level problems and potentially even update customers without the back and forth required from engineering. The less they have to ask a developer about a problem to send a quick message out to customers, the better.
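Sidekiq actually exposes most of the numbers Flo mentions through its built-in API, so a first cut at that kind of shared view could be a small helper like the one below. The method name and the choice of metrics are ours, purely for illustration:

```ruby
require "sidekiq/api"

# Hypothetical helper behind an internal dashboard (or, eventually, a public
# status page): surface high-level build health so support can answer
# "is something wrong with builds right now?" without asking an engineer.
def build_health_snapshot
  builds_queue = Sidekiq::Queue.new("builds")

  {
    builds_running:   Sidekiq::Workers.new.size,   # jobs currently being processed
    builds_enqueued:  builds_queue.size,           # builds waiting in the queue
    oldest_wait_secs: builds_queue.latency.round,  # how long the oldest queued build has waited
    retrying:         Sidekiq::RetrySet.new.size   # jobs that failed and are waiting to be retried
  }
end
```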
Not currently, but this would definitely be the next step as it sets the right incentives to show customers the same metrics that we’re viewing internally (in a way that they can understand). It’s a forcing function for accountability and transparency, which is key for us.
Absolutely. Thorough analysis is key. The first thing we’ll do is create a GitHub issue to begin tracking actions that occurred along with their respective timestamps. We retrieve this data from Librato and Papertrail.
"If you don’t have logs available, you’re flying blind."
Logs are generally very underused by most teams. For us, if we don’t have any data in the logs, it’s the first clear indicator of a change that needs to be made.
Once we have an established timeline of events, we’ll start asking ourselves a set of questions including, “Why did this specific system or set of systems fail? Did we not get alerted appropriately ahead of time? If so, what metrics do we need to add so that we can make sure this doesn’t happen again in the future? Can we automate a fix?”
Depending on the severity of the issue, we’ll do a follow-up blog post, clearly outlining what happened, what we’re working on for the future, and a timeline for when we’d like to have an implementation in place.
Using Logs To Improve Customer Support
Yep. Within different parts of the application, we’ll link directly to Papertrail queries so that customer support can pull up pre-constructed searches. As an example, we link every build to a query in Papertrail so that admins can easily dig into the log history whenever a customer writes in wondering about an issue with a build. With one click, you have everything available to you - that makes a huge difference. This holds true for Librato as well. When a build is misbehaving, we can automatically pull up a Librato dashboard showing metrics for the exact machine that the build ran on. These simple tools are game changers for such a small amount of work. If I needed to manually log into Librato and build a dashboard every time I wanted to look at a specific build, it wouldn’t happen nearly as often and our customer retention would take a hit.
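Those deep links are cheap to build because Papertrail’s event viewer accepts a search query straight in the URL. A hypothetical Rails helper along these lines (the build tag convention is our own invention) shows the idea:

```ruby
require "cgi"

# Hypothetical Rails view helper: deep-link from a build's admin page
# straight to its logs in Papertrail so support can investigate in one click.
module BuildLinksHelper
  PAPERTRAIL_EVENTS_URL = "https://papertrailapp.com/events"

  # Assumes build output is logged with a tag like "build-<id>".
  def papertrail_url_for(build)
    query = "build-#{build.id}"
    "#{PAPERTRAIL_EVENTS_URL}?q=#{CGI.escape(query)}"
  end
end
```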
Tapping Into The Incident Command System (ICS)
So, one thing. I met with Peter van Hardenberg from Heroku and he mentioned a workflow they use called the Incident Command System, which provides a step-by-step guide to dealing with incident response. It was started by a government agency in the US and, from what I’ve gathered, can be adapted to even small software teams. I’d definitely take a look at his blog post for more info and see if it could be useful for software teams as they grow.
How Does Your Incident Management Compare?
Over the coming months, we'll be featuring a variety of companies' incident management processes. If you're interested in being featured, get in touch with us at firstname.lastname@example.org and we'd love to talk shop!