On October 21, 2016 at approximately 4am PDT, the internet broke. OK, we know the internet doesn't "break." But hundreds of important services powering our modern web infrastructure had outages – all stemming from a DDoS targeting Dyn, one of the largest DNS providers on the internet.
Here is the initial status notification Dyn customers received that morning:
This was the first of 11 status notifications that would follow. As the hours passed, the incident amplified. We're all pretty used to the idea of using and building web services that depend on uptime from many other web services. It helps us work, build, and share information faster than ever before. But with this efficiency comes a vulnerability: sometimes a really important domino, like Dyn, in your stack tips over. As more and more cloud services became unavailable, the domino effect ensued, causing one of the largest internet outages in recent history. The sheer scale of the incident set this day apart from past cyber attacks.
We never want to see our customers go through a day like Friday. But, in reality, it's the reason we exist as a product – to give companies like Dyn a clear communication path to customers in the worst of times. And the truth is that downtime is not unique to Dyn. It's inevitable for all services. Downtime also doesn't need to be caused by a massive DDoS attack, but can be triggered by a one-line-of-code bug in production. So, as the internet recovers from the outage, let's figure out how as an industry we can learn from the incident. On our end, we'll be working on better ways for companies to identify issues from third party service providers and building the right tools to keep end users informed even during a cloud crisis. It's what drives us to come to work every day.
Thousands of the top cloud services in tech including Dropbox, Twilio, and Squarespace use StatusPage to keep millions of end users informed during incidents like this. Let's look back on Friday, when the dominos began to fall and we found ourselves watching 1 million notifications firing through the system.
The Domino Effect
How did Dyn handle the incident?
We want to give some major kudos to the teams at Dyn who worked around the clock to get things up and running again. They did a killer job of communicating clearly and regularly through their status page even in crisis mode. News outlets like the New York Times were even able to link to Dyn's status page and quote their status updates during the incident.
Let's take a closer look:
- Total number of updates: 11
- Average time between updates: 61 minutes
Dyn also posted an incident report the following day and provided a link to a retrospective blog post that gives readers more information about what happened, how their teams responded, and how grateful they were to the community that came together to help resolve the issue. The frequency and transparency of Dyn's communication allowed users to stay informed even during a time of great stress and uncertainty.
In the end, it was a tough day for Dyn and hundreds of other service providers, but the open and honest communication during the incident helped to build trust and transparency within their community.
@Dyn Thank you for the frequent updates on the DDoS attacks. Good luck.— Matt Fagala (@mfagala) October 21, 2016
Creating your incident response plan
There is no time like the present – especially while incidents are top of mind for many people – to create an incident response plan or hone your existing one. There doesn't have to be a major DDoS attack for it to come in handy (think bugs in production, website issues, login issues, etc.). No matter what the problem is, you'll never regret some good old-fashioned preparation. Here are a few questions to think through ahead of time.
Define what constitutes an incident:
- How many customers are impacted?
- How long does the incident need to last?
- How degraded is the service and how do you determine severity level?
Create an incident communication plan:
- How do you identify the incident commander?
- Who owns the communication plan?
- How will you communicate with users? What channel(s) will be used for different severity levels?
- Do you have templates for common issues you can pull from to accelerate the time from detection to communication?
- How will you follow-up post-incident and work to avoid similar incidents in the future? Are you prepared for a post-incident review? At what point do you need to write a postmortem?
While we hope there is never a next time, we know incidents similar to this are an inevitable part of our cloud world. Having a vetted plan in place before an incident happens and a dedicated way to communicate with your customers – whether that's through a status page or other dedicated channels – will let your team focus on fixing the issue at hand all while building customer trust. If you do choose StatusPage, we pledge to be there for you so you can be there for your customers.
I recently met with Brad Henrickson, Director of Platform Engineering at Keen IO, to chat about engineering at the company, the effect of company culture on incident management, and an AMA that the team held in response to a couple of recent backend incidents. Brad’s been working with tech companies for a while now, including co-founding Zoosk, and brings some great ideas to the table.
Recently, I took some time to sit down with Florian Motlik, CTO of Codeship, to chat about the company’s infrastructure, on-call alerting, and incident communication. For a short background, Codeship is one of the leading continuous delivery platforms, letting developers focus on writing code by taking care of the testing and release process. With a quick push up to GitHub or Bitbucket, Codeship will run all of your tests and deploy to specific environments if all tests pass.
Once integrated into your workflow, Codeship becomes a mission-critical part of your software development, making reliability core to their service and a main focus for the team. Flo’s put many hours into thinking about incident management so we’re happy he had the chance to swing by the StatusPage San Francisco office. Codeship is currently 19 full-timers with 11 engineers spanning 2 offices in Boston and Vienna. Let’s jump in.
As I write this post, one of our customers, DNSimple, is currently undergoing a sustained service outage due to a volumetric DDoS attack. DNS outages can be some of the worst to deal with since there's rarely a backup provider in place, TTLs and NXDOMAIN errors live on as you struggle to get a new one stood up, and it often means hard downtime for the majority of your stack.
Luckily for DNSimple, they were smart enough to run their status page outside of their own infrastructure, and on a separate DNS provider, with a dedicated domain. As their outage progresses, they're still able to communicate with their customers via dnsimplestatus.com.
Last week, Mailgun experienced a significant backup of messages, leading to email delays and duplication. As a Mailgun customer that relies on sending email notifications, this issue affected us and several of our customers.
Our reaction to the issue at hand was most likely the same reaction that any other business would have that relies on a cloud service such as Mailgun:
- Were we affected?
- When did the issue begin?
- When was it resolved?
- What exactly happened and why?
- What is being done to make sure the same issue doesn't happen again in the future?
Thankfully, Mailgun answered each of these questions with a step by step knockout postmortem reblogged below. While we need Mailgun to keep email deliverability issues to a minimum, their level of transparency has earned our trust along with many others (see the comments section on the original post).
(disclaimer: Mailgun switched to using StatusPage.io during the incident and mentions us in the post)
For those of you that didn't see, Twilio suffered a pretty nasty outage last Thursday that resulted in accounts being inadvertently disabled. Nobody likes outages, especially ones caused by a billing system failure, but the way they went about handling it was pretty stellar.
This post is Part 1 of what will be a 2-part series on effective communication when you're suffering from unexpected downtime. We'll look at Twilio's initial response, what they did well, and what they could have done better (part 2 will talk about their followup and postmortem).
The Initial Communication, Gotta Nail It
The first step in effectively communicating during an outage is to acknowledge - in ANY form - that there is a problem in the first place, and that you have an actual human being working hard on fixing it.
We obviously have strong opinions about what tool people should be using as a status page, but the fact that they have one in the first place puts them miles ahead of most vendors. Their status page acts as the authoritative place for communication when outages or service disruptions occur.
Here are the basic items that your initial incident response post should contain.
- What's causing the problem (if known)
- What effects the problem is having on customers
- How many people are affected
- Any workarounds to remedy the issue in the short term
- The severity of the issue (your status page design should communicate this)
How Twilio Fared
Twilio's postmortem acknowledges the incident response began at 3:28am, with the first post on their status page happening at 3:58am (30 minutes later). The next update was a full 57 minutes later, with no indication of whether or not the humans were still awake and fixing the problem. Information about what was causing the issue and what the effects were was put forth clearly, and it's likely they didn't know the full effect at that time nor if any workarounds were possible.
One other note - make sure your status page reflects accurately the severity of the issue. "Warning" is not an accurate portrayal of the billing system suspending accounts, and a severity of "Major" or "Critical" with a red color would have been more appropriate.
- Initial Status Page Response Time: 30 minutes
- Next Update Response Time: 57 minutes
- Status Page Severity Indication: Not strong enough
- Grade: C
This was possibly the worst part of the public response (as we'll see below). I cut them a little bit of slack since it was 3:30am, but if I'm a customer with an account shut off I'm definitely upset right from the start. 90 minutes have passed with only a few cricket chirps - my only sense of hope is that they've promised an update within the next 30 minutes.
No Fix Yet, Keep Talking
Stuff is still broke. Bad. You're busy ruling out possible causes and narrowing your focus on hunting down the culprit. Time is moving slowly as you navigate the matrix, but it moves even slower for customers who don't know what's wrong or when it's going to be fixed.
The second phase of communication happens when you're actively working on resolving the issue, and it's this communication that lets your customers keep their sanity as they twiddle their thumbs and respond to their own angry customers.
Here are your priorities (in order) after the initial communication. Think of this as our own DevOps version of Aviate, Navigate, Communicate.
- Find and fix the root cause, or rectify the effects of the issue even if the root cause is still not found
- If you have new information, send it out right away
- Pick a decent interval as an upper bound on your next communication, even if all you say is "no new info right now". 30 minutes is a good max
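To make the 30-minute upper bound concrete, here is a minimal Ruby sketch that measures the gaps between status updates and flags any that run long. The timestamps are the ones quoted earlier in this post for Twilio's initial response (3:28am, 3:58am, and an update 57 minutes after that); the helper names and the `MAX_GAP_MINUTES` constant are made up for illustration, and the parsing assumes all times fall on the same day.

```ruby
# Upper bound between updates, taken from the list above.
MAX_GAP_MINUTES = 30

# Convert an "H:MM" clock time (24-hour, same day) into minutes since midnight.
def to_minutes(clock)
  hours, minutes = clock.split(":").map(&:to_i)
  hours * 60 + minutes
end

# Gaps in minutes between each pair of consecutive timestamps.
def update_gaps(timestamps)
  timestamps.map { |t| to_minutes(t) }.each_cons(2).map { |a, b| b - a }
end

# Incident response began, first status post, next update.
timeline = ["3:28", "3:58", "4:55"]

gaps = update_gaps(timeline)
overdue = gaps.select { |gap| gap > MAX_GAP_MINUTES }

puts "gaps between updates (minutes): #{gaps.inspect}"
puts "gaps over the #{MAX_GAP_MINUTES}-minute limit: #{overdue.inspect}"
```

Something like this could run against your own incident history after the fact, as one input to a post-incident review.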
How Twilio Fared
Twilio did a fantastic job here, even if they were slightly over 30 minutes on a couple of their message followups.
- Patched Issue For Now: Yes!
- Communicate often: Yes
- Communicate regularly: Yes
- Grade: A
Why Communicating Frequently is Important
Humans are funny creatures. We'd much rather have bad news than no news, and updating at a regular interval puts tensions at ease. Keep in mind that your customers have customers too, and if you're mission critical to them then they'll be fighting the same fire you are - the same fight with no way to fix it!
Put another way, the people paying Patrick McKenzie money care nothing about who is powering their appointment reminder SMS messages, they're just mad at Patrick for picking the vendor that goes down. The best Patrick can do is relay whatever information Twilio gives him, and new information coming every 30 minutes is likely enough to quell the uprising.
Go To Where Your Customers Are
I found out about the outage first on Hacker News, where many of the internet denizens live, and likely where many of Twilio's customers found out about it as well.
Twilio is known for its developer outreach and support, and boy did it shine that morning! Let's check out Rob Spectre, their head dev evangelist, and his down-to-earth style. He met developers where they were at, and you'll notice that even non-customers of Twilio loved the humility and the consideration.
How Twilio Fared
Hats off to Rob. Not sure if this was part of the process, but it sure appeared to be that way. There are a couple of things that jump out to me that make this type of outreach successful.
- Made himself personally available to his customers
- Provided a channel (email@example.com) for customers to get personal resolution to their issues
- Didn't use canned language ("we apologize for the inconvenience"), and was humble and honest about it being a rough morning
- Gave link to authoritative source of information, and used the word "authoritative"
- Empathy and authenticity move mountains to keep customers' trust and respect
Being an on-demand customer liaison: A++
Report Card, and What's Next
After starting out a little sparse with the communication, Twilio really rose to the occasion. They had the luxury of getting a fix in for the accounts before they fully resolved the billing issue, and all in all they saw about 3 hours of downtime in the worst case for the affected accounts.
Final Grade: A-
The next article will focus on the same-day-communication after the issue is resolved, as well as the full postmortem that was released 5 days later. Our jobs don't end with the golden git-push, and remembering to be humans for the week following is just as important as making sure the bug is fixed and won't regress.
We just released a new gem tonight called 'librato-sidekiq' to, you guessed it, send your Sidekiq metrics into Librato. Beta testers are very welcome, as are feedback and bug reports of course. The gem doesn't have tests yet, but we're running it in production, so use at your own peril.
Not that they need my endorsement, but Sidekiq kicks some serious ass. Doing a ton of super latent communication in the form of emails and text messages lends itself way better to the threaded model than it does to the process model. We saved gobs of time and probably a factor of 50 in cost after switching to Sidekiq.
Delayed Job was the only thing I had ever used to run background processes and, although it works well when coupled with foreman, it isn't the friendliest solution when you don't have an infinitely parallel slack pool at your disposal like those running on Heroku. With Delayed Job, each process must complete a single job before moving on to the next, so each unit of concurrency requires its own process. I estimate there to be something like 25ms of Ruby processing time for each 975ms of waiting for Mailgun or Twilio to return the API call.
StatusPage does most things in feast or famine fashion - either we're hosting lots of traffic and sending lots of notifications, or things are cricket silent. Not only that, notifications are time sensitive and need to go out ASAP.
Starting out with Delayed Job was tough: each unit of concurrency required its own slice of RAM and its own boot of the Rails environment. On a basic m1.medium instance we could run something like 20 concurrent workers, but getting all of the processes launched took about 7 minutes! Using foreman to publish to upstart, we did some pretty nasty hacks to have all "processes" start at once, but each had a certain amount of sleep such that we could slowly ramp up the processes 1 by 1 (remember, they all have to boot the full Rails environment).
The Long Wait
Even once we got all that figured out, there was a grave mismatch between the amount of Ruby processing and the amount of waiting for the network to return. You can imagine even a single-core m1.medium machine, with all those Rails processes, still not utilizing a full core. The RAM was tapped out and most of the processor time was spent asleep and waiting for network sockets.
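A quick back-of-the-envelope check shows how lopsided this is. The sketch below takes the rough figures from this post (25ms of Ruby work per 975ms of network wait, about 20 workers on the box) and estimates how much of a single core the process model keeps busy. The function name is ours, and the numbers are the same simplification used above, not measurements.

```ruby
# Percent of one core kept busy by `workers` single-job processes, where each
# job does cpu_ms of Ruby work and then blocks for wait_ms on the network.
def core_utilization_percent(workers, cpu_ms, wait_ms)
  (workers * cpu_ms * 100).fdiv(cpu_ms + wait_ms)
end

one_worker  = core_utilization_percent(1, 25, 975)
all_workers = core_utilization_percent(20, 25, 975)

puts "one worker:  #{one_worker}% of a core"
puts "20 workers:  #{all_workers}% of a core"
```

Under these assumptions, even a machine whose RAM is completely full of workers leaves about half of its one core idle.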
After all of the work of coaxing the elephant up on the stand, we took a good hard look at Sidekiq and realized it was exactly what we would need to solve the CPU/RAM mismatch that came with the process model for background jobs.
Moving to Sidekiq
Sidekiq is a single Rails process that delegates work to worker threads to perform. Because Ruby can only ever be executing one thread at a time, it's not recommended you do any computationally intensive tasks (like image processing) using Sidekiq - Delayed Job would be much more appropriate for that. Conversely, typical SaaS background tasks almost always involve offloading some network-intensive call from the synchronous web request and, for this, Sidekiq is a darling. Our main use case is around sending notifications via email and SMS, but we do a bit of the computationally intensive kind as well.
Each of our background workers requires just a single boot of Rails (since it's just 1 process), and with our recent move to Ruby 2.0.0 it's faster than ever. Once the process is up, it launches 50 worker threads (configurable) to start consuming jobs off of the Redis queue. Going back to our previous simplification of 25ms of Ruby processing for every 975ms of network latency, we can max out a single AWS core at a throughput of 40+ emails or SMS messages per second outbound from the StatusPage.io app...with only 1 physical process! While one thread is waiting for the network to return, another thread can start some meaningful ruby processing generating an email or a text message.
The process stays occupied as many threads wait to get some meaningful work done before shipping it out over the wire. Using our 25ms simplification, we should have a theoretical throughput of about 40 messages / second (40 * 25ms = 1000ms), but in practice we've found that a concurrency of 50 works best for us.
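The arithmetic behind that 40 can be sketched in a few lines of Ruby. This uses the same 25ms / 975ms simplification and assumes a single core with only one thread running Ruby code at a time; the constant and method names are ours, and real latencies will of course vary.

```ruby
CPU_MS  = 25    # Ruby processing per message (the simplification used above)
WAIT_MS = 975   # time spent blocked on the network per message

# Number of threads needed so that one of them always has Ruby work to do.
threads_to_saturate = (CPU_MS + WAIT_MS) / CPU_MS

# Ceiling on messages per second for one core, however many threads we add.
max_per_second = 1000 / CPU_MS

# Estimated messages per second for a given thread count: limited by network
# latency while threads are scarce, and by the CPU once the core is saturated.
def throughput(threads, cpu_ms, wait_ms)
  latency_bound = threads * 1000.0 / (cpu_ms + wait_ms)
  cpu_bound = 1000.0 / cpu_ms
  [latency_bound, cpu_bound].min
end

puts "threads to saturate one core: #{threads_to_saturate}"
puts "ceiling: #{max_per_second} messages / second"
puts "10 threads: #{throughput(10, CPU_MS, WAIT_MS)} messages / second"
puts "50 threads: #{throughput(50, CPU_MS, WAIT_MS)} messages / second"
```

In this simple model, going past 40 threads buys no extra throughput. It does leave some slack for the times when network calls take longer than the estimate.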
Spinning up more concurrency comes in the form of 1 process and 50 threads. Even if we have to take on a huge customer with 10,000 subscribers, we can clear the whole messaging queue in under a minute with only 4 worker processes (200 threads).
Remember kids, practice safe threads
Most Ruby and Rails libraries seem to assume only a single thread, and strange things begin to surface when you're sharing clients used for external communication to other services. For us, this surfaced with the twilio-ruby and redis-semaphore gems.
To fix the twilio-ruby issue, we needed to move from an application-wide client singleton over to initiating a new client for each communication with Twilio.
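The shape of that change looks roughly like the sketch below. To keep it runnable on its own, it uses a made-up stand-in class (`FakeSmsClient`) instead of the real gem. With twilio-ruby you would construct the gem's REST client at the same spot. The credentials, phone numbers, and method names here are all placeholders.

```ruby
# Hypothetical stand-in for an API client that keeps per-request state on the
# instance, which is what makes one shared instance risky across threads.
class FakeSmsClient
  def initialize(account_sid, auth_token)
    @account_sid = account_sid
    @auth_token = auth_token
    @last_recipient = nil
  end

  def send_sms(to, body)
    @last_recipient = to   # instance state that changes on every request
    sleep(0.001)           # pretend to wait on the network
    "sent to #{@last_recipient}: #{body}"
  end
end

# Build a fresh client for every delivery instead of reusing an
# application-wide singleton, so no two threads ever share one.
def deliver(to, body)
  client = FakeSmsClient.new("ACxxxx", "token")  # placeholder credentials
  client.send_sms(to, body)
end

recipients = (1..20).map { |i| "+1555000#{format('%04d', i)}" }
threads = recipients.map { |to| Thread.new { deliver(to, "Service degraded") } }
results = threads.map(&:value)

puts results.first
```

The cost is one small object allocation per message, which is negligible next to the network call itself.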
To fix the redis-semaphore issue, we needed to pass the Sidekiq Redis connection around in order to avoid deadlocking in the synchronization stuff (technical term), and ensure only a single Redis client was doing the communication with the server. Sidekiq doesn't appear to accept another Redis connection (only accepts URL), so unfortunately it must become the authoritative connection for all Redis communication application-wide.
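    # Before: the semaphore uses the application-wide Redis client
    Redis::Semaphore.new(:name, :redis => Rails.application.redis_client).lock do
      # do protected work
    end

    # After: the semaphore borrows Sidekiq's Redis connection
    Sidekiq.redis do |redis|
      Redis::Semaphore.new(:name, :redis => redis).lock do
        # do protected work
      end
    end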
Admin Back End
All hail mperham for creating a fantastic monitoring back end! The wiki has great documentation for mounting sidekiq/web somewhere in your app, and you immediately gain visibility into realtime statistics, manual retry/delete of failed tasks, historical view of jobs completed, etc.
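For reference, the mount itself is only a couple of lines in a Rails routes file. This is a minimal sketch: the '/sidekiq' path is an arbitrary choice, and since the UI can retry and delete jobs you will probably want to put it behind whatever authentication your app already uses.

```ruby
# config/routes.rb
require 'sidekiq/web'

Rails.application.routes.draw do
  # Serve the Sidekiq monitoring UI under /sidekiq (the path is up to you).
  mount Sidekiq::Web => '/sidekiq'
end
```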
StatusPage.io is now running Ruby 2.0 in production. Making the switch wasn't very difficult, but there were definitely a couple gotchas. Below is a documented version of our switchover on Mac+RVM and Ubuntu 12.04 LTS compiling from source.
Mac + RVM
We use RVM to manage our ruby version installs because we are often maintaining up to 4 versions and gemsets at any given time between StatusPage.io and our contract work. Please don't harangue me for not using rbenv. Let's check out some dependencies and then we'll do the install.
Homebrew for openssl and readline
The version of openssl that ships with Mac OS X is incompatible with ruby-2.0.0-p0, and the ruby install will annoyingly just skip quietly over this incompatibility. We ran into this on our initial bundle install, and it took a while to track down why 1.9.3 didn't have this issue. Let's get a new version of openssl, and we'll grab readline as well while we're at it (methinks so debugger operations will work correctly with the debugger gem).
brew install openssl readline
Do the actual install with RVM
RVM requires you to be completely explicit with installing Ruby 2, likely because of a name conflict with something like a JRuby or Rubinius. Earlier versions let you just do something like 1.9.3, but now we're forced to use the full ruby-2.0.0-p0. We'll also specify the newly installed openssl and readline as directories to accompany the install.
rvm install ruby-2.0.0-p0 --with-openssl-dir=`brew --prefix openssl` --with-readline-dir=`brew --prefix readline`
Edge Bundler and Gemfile specs
Let's make sure we're using the edge version of bundler so that we can take advantage of the ruby directive, and we'll ensure that we're using the https version of rubygems so that we can't get hit with a MITM attack.
gem install bundler --pre
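On the Gemfile side, a minimal sketch of those two pieces might look like this. The rest of your Gemfile stays as it is, and the version string should match whichever Ruby you actually installed.

```ruby
# Gemfile
source 'https://rubygems.org'   # fetch gems over https rather than plain http

ruby '2.0.0'                    # Bundler will refuse to run under a different Ruby version
```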
We hit a couple random issues with gems that we had installed long ago. You may experience the same, and I recommend checking out the latest version of the gem to see if it makes the gremlins go away.
gem 'rest-client', '~> 1.6.7'
gem 'binding_of_caller', '~> 0.7.1'
Ubuntu LTS 12.04 and Chef
We use a custom Chef cookbook to manage installing ruby on our production systems by compiling from source. Changing the system ruby when you depend on Chef is always a bootstrappy web of emotion and anti-gravity, and moving to Ruby 2 was even more scary given how many dependencies are assumed on Chef's end. We only ended up hitting one snag, and luckily somebody over at Opscode had already merged a change for this.
Long story short, we modified the chef-ruby-src repo to build and install a custom chef gem that had this patch already committed.
For those of you not using Chef, below is a small bash script that will just get you ruby-2.0.0-p0 installed.
Just to recap, we use chef to build a new system ruby and install a new chef gem under itself. Take that, grandfather paradox!
We push a lot of code, and although the unicorn setup we have allows for hot reloads there can be times when we have to pause traffic with HAproxy and do a hard restart of the unicorns. Cutting ~40% off of the startup time is significant, and this feature alone was enough of a driver to get us to upgrade.
Suggestions and Corrections
Please email firstname.lastname@example.org with any suggestions or corrections for this article.