This is a guest post from Jordan Munson, Support Engineer at Wistia.
What do you do when your software is experiencing a critical outage? Post an update to your status page, send out some updates via social, answer emails and calls that come in about it, etc. It all seems pretty obvious what to do in 2017, but for Wistia in 2013 things weren’t so clear. A handful of months into my tenure at Wistia, we faced what is still likely the biggest service outage in our company’s history. We were not ready, plain and simple.
We’ve seen a lot of status pages over the years. Everything from scrappy DIY pages for side projects to totally bespoke plans for global corporations has crossed our monitors in one way or another.
We wanted to share some examples of great status pages we’ve seen through the years.
Here are ten status pages that showcase some of the excellent design, planning, integrations, and creative thinking that go into incident communication.
Here’s a way to build a bridge that never fails:
Drain the river and fill it in with concrete.
Expensive, ugly, and stupid. But it’s certainly fail-proof.
This is a really simplified version of the problem web developers face when aiming to build high-availability services. We’ve talked about the increasingly interconnected nature of cloud tools and the domino-goes-crashing-down effect that can happen when just one critical service has downtime. Web uptime is more important than ever, and it’s critical that the services we all rely on are up and running as often as possible.
Incidents have always been a fact of life for people in IT and Ops. Today, it’s web developers, cloud service providers, and DevOps practitioners who are getting a crash course in incident communication.
Web-scale incident communication is more complex than simply sending a bulk email. There are different audiences to consider, each with different thresholds for messaging and different response expectations.
Since downtime is inevitable, it’s best to plan ahead and make sure your team is ready.
If you’re hosting on AWS, you can expect some pretty excellent reliability and availability.
If your service isn’t responding, it’s likely an issue with your own code. On the other hand, system outages do happen. They’re usually pretty minor.
Sometimes, they’re not.
While AWS is the largest cloud provider and boasts excellent reliability, the service still experiences downtime. On Feb. 28, 2017, Amazon’s heavily used S3 storage service went dark in one US region for the better part of four hours. As a result, major organizations across the web experienced total or partial outages, including Quora, Medium, Imgur, Twilio, MailChimp, and many more.
This is our best attempt at a guide to actively keeping yourself in the loop on AWS system status. When incidents like this happen, it’s helpful to know how AWS thinks about reporting system-wide status, and what you can do to keep yourself updated.
Whether you’re a developer building on AWS, or a journalist or analyst keeping in the know on the web’s largest cloud provider, this guide aims to help you understand AWS status a little better.
When teams at Facebook, Twitter, Netflix, and Airbnb turn to a service for design collaboration, they fire up InVision. With millions of users worldwide, InVision is a powerful platform for product design collaboration.
As a cloud service serving so many end users, it’s critical for InVision to keep users updated about service status. The team brought on StatusPage to help communicate with its community around incidents, downtime, and scheduled maintenance.
“We viewed StatusPage as the industry leader and [it] was the obvious choice by our support and engineering leadership,” said Brandon Wolf, Vice President of User Enablement at InVision. “Personally, I was enamored with the extensibility and stock integration with other best-of-breed services, affording a quick-to-implement convenient web of statuses. Knowing StatusPage resides within the greater Atlassian family only made our selection and continued relationship easier.”
At StatusPage, we’ve come across this question a lot.
“I’ve got my users on all these different deployments. How do I let one group know about an outage without alarming all the users on different servers who aren’t affected?”
It’s a good question. We talked with the team at Duo Security and learned about how they’re solving this problem.
Duo Security provides two-factor authentication and other security services for thousands of companies and millions of end users. Teams at Facebook, NASA, Yelp, and many more top companies count on Duo to keep their IT secure.
Launched in 2010, Duo puts a lot of effort into what security means for teams using cloud tools and working remotely. As a security service hosted in the cloud, Duo’s system status is extremely critical to their customers. When incidents occur, their customers need clear, correct, and immediate updates.
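To make the question above concrete, here is a minimal sketch of the general idea: tag each subscriber with the deployment they live on, then notify only the groups an incident touches. This is an illustration under our own assumptions, not a description of how Duo or StatusPage implements it. The `Subscriber` class, the `affected_subscribers` helper, and the deployment labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscriber:
    email: str
    deployment: str  # hypothetical deployment label, e.g. "us-1"


def affected_subscribers(subscribers, affected_deployments):
    """Return only the subscribers hosted on a deployment hit by the incident."""
    affected = set(affected_deployments)
    return [s for s in subscribers if s.deployment in affected]


# Made-up subscribers spread across three made-up deployments.
subscribers = [
    Subscriber("ana@example.com", "us-1"),
    Subscriber("ben@example.com", "us-2"),
    Subscriber("cho@example.com", "eu-1"),
]

# An incident limited to "us-2" should reach only the subscriber on that deployment.
recipients = affected_subscribers(subscribers, ["us-2"])
print([s.email for s in recipients])
```

Users on unaffected deployments never see the alert, which is the whole point of the question.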
Just over three years ago, we embarked on a journey with a simple goal in mind. The software world was moving quickly in the direction of rented servers, hosted solutions, and outsourced vendors, all in service of allowing teams and companies to move quicker and to be more nimble than ever before. What used to be built and maintained internally was now delegated to other services or vendors.
And although every service provider and vendor strives for perfect uptime and operations, the data around availability tell us what we already know. Unexpected problems happen, and they happen to everyone. From Amazon Web Services, to Salesforce, to Comcast phone service this week, nobody is safe from things going wrong (even Pokemon Go!).
This new world was great, but it was missing a core component in the relationship between companies and their service vendors. That component, of course, was status communication.
Before we got started, status communication was very costly to build and maintain, and in most cases just didn’t exist. Our simple goal was to provide the ability for every software company in the world to build and maintain their own custom status page. Having felt this pain ourselves, we were as equipped as anyone to build the right solution, and the timing couldn’t have been better. From a handful of customers in early 2013 to thousands of customers today, it’s been amazing to watch all different types of companies build trust with their customers and their colleagues, saving everyone lots of time and money in the process.
Today, we’re super excited to announce that we’re joining forces with Atlassian to accelerate our progress and our continued march toward transparency across the web.
This is a guest post from Alistair Mclachlan. Alistair is Head of Support at FiveStars Loyalty, a San Francisco-based startup that helps businesses and communities thrive by turning every transaction into a relationship.
To hear ‘Server Down’ is to hear two words that instill fear in the heart of any Support Leader. System outages are unavoidable, but while Engineering scrambles to fix the issue, there are certain things we can do in Customer Support to mitigate the impact on paying customers.
Under normal circumstances, FiveStars Support carefully treads the tightrope, balancing call deflection with issue resolution in a way that doesn’t negatively impact Customer Experience. But if you have a Support Team staffed to take 20 calls per hour, a sustained increase to ten times that volume is going to sink the ship unless you have effective outage planning.
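The arithmetic behind that warning is worth a quick look. The sketch below uses the figures above (a team staffed for 20 calls per hour, and a surge to ten times that volume) and assumes, for simplicity, that the team answers exactly its staffed capacity and that no caller gives up. The `backlog_after` function is a hypothetical helper for illustration only.

```python
def backlog_after(hours, staffed_per_hour=20, surge_multiplier=10):
    """Estimate unanswered calls after a sustained surge of the given length."""
    # Incoming volume during the outage: ten times the staffed rate by default.
    incoming_per_hour = staffed_per_hour * surge_multiplier
    # Calls the team cannot get to each hour (never negative).
    unanswered_per_hour = max(incoming_per_hour - staffed_per_hour, 0)
    return unanswered_per_hour * hours


print(backlog_after(1))  # 180 unanswered calls after one hour
print(backlog_after(3))  # 540 unanswered calls after three hours
```

Under these assumptions the queue grows by 180 calls every hour the outage lasts, which is why deflection channels matter so much once an incident begins.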