Communicating like a pro during an outage - learning from Twilio (Part 1)
For those of you who missed it, Twilio suffered a pretty nasty outage last Thursday that resulted in accounts being inadvertently disabled. Nobody likes outages, especially ones caused by a billing system failure, but the way they went about handling it was pretty stellar.
This post is Part 1 of what will be a 2-part series on effective communication when you're suffering from unexpected downtime. We'll look at Twilio's initial response, what they did well, and what they could have done better (part 2 will talk about their followup and postmortem).
The Initial Communication, Gotta Nail It
The first step in effectively communicating during an outage is to acknowledge - in ANY form - that there is a problem in the first place, and that you have an actual human being working hard on fixing it.
We obviously have strong opinions about what tool people should be using as a status page, but the fact that they have one in the first place puts them miles ahead of most vendors. Their status page acts as the authoritative place for communication when outages or service disruptions occur.
Here are the basic items that your initial incident response post should contain.
- What's causing the problem (if known)
- What effects the problem is having on customers
- How many people are affected
- Any workarounds to remedy the issue in the short term
- The severity of the issue (your status page design should communicate this)
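To make the checklist concrete, here is a minimal sketch of an initial post modeled as a small Python data structure. Everything in it is hypothetical and for illustration only: the class name, the field names, and the severity labels are assumptions, not anything Twilio or a particular status page tool uses. The idea it tries to show is that unknown items get an explicit "we don't know yet" instead of being silently left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    # Illustrative labels only; use whatever levels your status page supports.
    WARNING = "warning"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass
class InitialIncidentPost:
    """Hypothetical container for the checklist items above."""
    customer_impact: str                   # what effects customers are seeing
    severity: Severity                     # how bad it is
    cause: Optional[str] = None            # what's causing it, if known
    affected_scope: Optional[str] = None   # how many people are affected
    workaround: Optional[str] = None       # short-term remedy, if any

    def render(self) -> str:
        # Unknown items are stated as unknown rather than omitted.
        lines = [
            f"[{self.severity.value.upper()}] {self.customer_impact}",
            f"Cause: {self.cause or 'Under investigation'}",
            f"Who is affected: {self.affected_scope or 'Still being determined'}",
            f"Workaround: {self.workaround or 'None known yet'}",
        ]
        return "\n".join(lines)
```

Calling `InitialIncidentPost(customer_impact="...", severity=Severity.CRITICAL).render()` gives you a post you can publish within minutes, even when the cause, scope, and workaround are all still blank.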
How Twilio Fared
Twilio's postmortem acknowledges the incident response began at 3:28am, with the first post on their status page happening at 3:58am (30 minutes later). The next update was a full 57 minutes later, with no indication of whether or not the humans were still awake and fixing the problem. Information about what was causing the issue and what the effects were was put forth clearly, and it's likely they didn't yet know the full effect or whether any workarounds were possible.
One other note - make sure your status page accurately reflects the severity of the issue. "Warning" is not an accurate portrayal of the billing system suspending accounts, and a severity of "Major" or "Critical" with a red color would have been more appropriate.
- Initial Status Page Response Time: 30 minutes
- Next Update Response Time: 57 minutes
- Status Page Severity Indication: Not strong enough
- Grade: C
This was possibly the worst part of the public response (as we'll see below). I cut them a little bit of slack since it was 3:30am, but if I'm a customer with an account shut off I'm definitely upset right from the start. Nearly 90 minutes have passed with only a few cricket chirps - my only sense of hope is that they've promised an update within the next 30 minutes.
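For anyone who wants to check the arithmetic, here is a small Python sketch of the timeline. The 3:28am and 3:58am times and the 57-minute gap come from the figures quoted above; the calendar date is an arbitrary placeholder, and the variable and function names are made up for this example.

```python
from datetime import datetime, timedelta

# Placeholder date; only the clock times matter here.
day = datetime(2000, 1, 1)

response_began = day.replace(hour=3, minute=28)    # incident response starts
first_post = day.replace(hour=3, minute=58)        # first status page post
second_post = first_post + timedelta(minutes=57)   # next update, 57 minutes later


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


time_to_first_post = minutes_between(response_began, first_post)
gap_to_second_post = minutes_between(first_post, second_post)
total_elapsed = minutes_between(response_began, second_post)
```

That works out to 30 minutes to the first post, 57 to the second, and 87 in total, which is where the "90 minutes" figure comes from.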
No Fix Yet, Keep Talking
Stuff is still broke. Bad. You're busy ruling out possible causes and narrowing your focus on hunting down the culprit. Time is moving slowly as you navigate the matrix, but it moves even slower for customers who don't know what's wrong or when it's going to be fixed.
The second phase of communication happens when you're actively working on resolving the issue, and it's this communication that lets your customers keep their sanity as they twiddle their thumbs and respond to their own angry customers.
Here are your priorities (in order) after the initial communication. Think of this as our own DevOps version of Aviate, Navigate, Communicate.
- Find and fix the root cause, or rectify the effects of the issue even if the root cause is still not found
- If you have new information, send it out right away
- Pick a decent interval as an upper bound on your next communication, even if all you say is "no new info right now". 30 minutes is a good max
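The upper-bound rule is easy to turn into a reminder for whoever is running the incident. The sketch below is one hypothetical way to do it in Python: the function names and the `MAX_SILENCE` constant are assumptions for illustration, and the 30 minutes is simply the max suggested above.

```python
from datetime import datetime, timedelta

# The 30-minute ceiling suggested above; tune it to your own promise to customers.
MAX_SILENCE = timedelta(minutes=30)


def next_update_due(last_update: datetime,
                    max_silence: timedelta = MAX_SILENCE) -> datetime:
    """Latest time by which the next status update should go out."""
    return last_update + max_silence


def is_update_overdue(last_update: datetime, now: datetime,
                      max_silence: timedelta = MAX_SILENCE) -> bool:
    """True once we've been quiet longer than the promised interval."""
    return now > next_update_due(last_update, max_silence)
```

When `is_update_overdue` flips to true, that's the cue to post something, even if it's only "no new info right now".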
How Twilio Fared
Twilio did a fantastic job here, even if they were slightly over 30 minutes on a couple of their message followups.
- Patched Issue For Now: Yes!
- Communicate often: Yes
- Communicate regularly: Yes
- Grade: A
Why Communicating Frequently is Important
Humans are funny creatures. We'd much rather have bad news than no news, and updating at a regular interval puts tensions at ease. Keep in mind that your customers have customers too, and if you're mission critical to them then they'll be fighting the same fire you are - the same fight with no way to fix it!
Put another way, the people paying Patrick McKenzie money care nothing about who is powering their appointment reminder SMS messages, they're just mad at Patrick for picking the vendor that goes down. The best Patrick can do is relay whatever information Twilio gives him, and new information coming every 30 minutes is likely enough to quell the uprising.
Go To Where Your Customers Are
I found out about the outage first on Hacker News, where many of the internet denizens live, and likely where many of Twilio's customers found out about it as well.
Twilio is known for its developer outreach and support, and boy did it shine that morning! Let's check out Rob Spectre, their head dev evangelist, and his down-to-earth style. He met developers where they were at, and you'll notice that even non-customers of Twilio loved the humility and the consideration.
How Twilio Fared
Hats off to Rob. Not sure if this was part of the process, but it sure appeared to be that way. There are a couple of things that jump out to me that make this type of outreach successful.
- Made himself personally available to his customers
- Provided a channel (email@example.com) for customers to get personal resolution to their issues
- Didn't use canned language ("we apologize for the inconvenience"), and was humble and honest about it being a rough morning
- Gave link to authoritative source of information, and used the word "authoritative"
- Empathy and authenticity move mountains to keep customers' trust and respect
Being an on-demand customer liaison: A++
Report Card, and What's Next
After starting out a little sparse with the communication, Twilio really rose to the occasion. They had the luxury of getting a fix in for the accounts before they fully resolved the billing issue, and all in all they saw about 3 hours of downtime in the worst case for the affected accounts.
Final Grade: A-
The next article will focus on the same-day communication after the issue is resolved, as well as the full postmortem that was released 5 days later. Our jobs don't end with the golden git-push, and remembering to be humans for the week following is just as important as making sure the bug is fixed and won't regress.
Bonus: 50% Off Your StatusPage
We like to think StatusPage is an important tool for any business. You could sweat the details on hosting and providing the page yourself. Or let us do it so you can focus on what really matters to your team and customers.
That's why we want to offer 50% off the first three months of any StatusPage plan. Simply shoot us a note at firstname.lastname@example.org when you're ready to activate.
And, of course, our free trial is free forever.