An Insider's Look Into Incident Management: Keen IO
I recently met with Brad Henrickson, Director of Platform Engineering at Keen IO, to chat about engineering at the company, the effect of company culture on incident management, and an AMA that the team held in response to a couple of recent backend incidents. Brad’s been working with tech companies for a while now, including co-founding Zoosk, and brings some great ideas to the table.
A Look Into Keen IO
Absolutely. Keen IO is a platform for event data analytics. Our customers use our API to push data to us - we collect it, store it, and make it super easy to retrieve and do custom analysis on that data. If you think about how we stack up against Google Analytics and other prescriptive analytics tools, we take a very different approach. As opposed to telling you what metrics you should track and what your dashboards are going to be, we give you the tools you need to build analytics in-house with your own dashboards and your own workflows on top of it. At the same time, we enable you to share these dashboards anywhere – with your teams or with your customers. We have recipes and starter kits that make it easy to get started, but it’s more of an open suite to build the analytics and the dashboards you need to solve your own business problems.
A lot of people use Keen to track the internal business metrics that matter to them while several SaaS companies use us to white-label the in-product dashboards that they show to their customers. It’s been really interesting to see what people end up building and tracking!
We recently split engineering into Experience, Middleware, and Platform. I’m the Director of the Platform team, which is specifically responsible for the storage, retrieval, and analysis of data stored in Keen. My role is essentially to help people on the team determine what we want to accomplish together and what’s best to contribute back to the company. For example, discussing how we could optimize funnel analysis for our users and helping to sort out who would do that work.
Evolution of the Stack
The first version the founders built was very much a V1, just to get something out in front of customers. At the time, we were using Mongo as a data store alongside a Python codebase that cobbled together results from our data store and provided them to our customers. You’re not thinking, “How are we going to deal with tens of billions of events per month?” You’re thinking, “Let’s just deal with millions and make the platform work.” As we brought on more customers, we inevitably had to start thinking about how we were going to scale and implement something more performant for our customers at scale. We took a step back to re-imagine the best technologies we could bring to the table, while still building parity with the current solution. Overall, it turned out to be about a 9-month process where we went from a monolithic, web-dev codebase to a more distributed-systems approach, focusing on scalability and availability. At that point, we brought in Kafka, Storm, Cassandra, and a couple of other pieces of technology to grow with our customers. Our V2.
Well, there are always bigger customers. If we went out tomorrow and signed the largest customer in the world, there very well could be a larger one the next day. And there are also huge costs that come with that.
"Instead of basing our infrastructure off of individual customers, we think about what our customer base is going to look like 2 years from now. Who are the types of customers that we want to be servicing? What do we theoretically want to be able to support?"
Scalability is always something we’re investing in, rather than adjusting our infrastructure around any given customer.
Dealing With An Outage
There was a period in February where we had some operational issues, including a ZooKeeper failure that impacted all of our customers. We were observing significant query slowdowns, which traced back to capacity issues with ZooKeeper, which we use for orchestration and coordinating data. We weren’t monitoring it closely enough, which caused things to be rescheduled internally and consequently slowed down queries for our customers. It took us a little while to figure out why things were slow. As you can imagine, a query slowdown is a very nebulous thing.
We did a lot of deep investigating to figure out what was going on. Eventually, we found out that disk I/O was saturated because it had been configured in a suboptimal way. That was causing query slowdowns on the order of 5 times the typical query time. Customers would then resubmit their queries, causing even more pileups, and the whole thing became a cascading issue. To triage, we increased the amount of capacity in our ZooKeeper cluster while also reducing the number of writes and reads against the cluster.
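That retry pileup is a common failure mode: when every client resubmits immediately, the load spike feeds on itself. As an illustration only – hypothetical function names, not Keen’s actual client code – a minimal sketch of client-side exponential backoff with jitter looks like this:

```python
import random
import time

def query_with_backoff(run_query, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry a slow or failing query with exponential backoff plus jitter,
    so many clients retrying at once don't resubmit in lockstep and pile up."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Double the delay each attempt, cap it, and add full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```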
"It was a tactical solution, but obviously not the end of the story since you can’t just add more capacity to most infrastructure problems and call it a day."
The core problem hadn’t been addressed yet. Part of the main problem was that the hosts weren’t designed appropriately. We went in and reconfigured the disks with a proper RAID config, moved around where the data volumes were pointed, and ended up updating the version of ZooKeeper we were running. For consistency, we also moved the hosts to our new configuration management tools. These changes made the hosts far more responsive and allowed our infrastructure to perform efficiently.
Throughout the incident, we were in constant communication with our customers. We were very transparent about it and didn’t just say, “Hey, queries are slow,” and run away. We believe very much in transparency, and a crucial part of that is telling our customers what’s happening. It’s saying, “We’re having these problems right now, this is exactly what’s going on, here are some of the steps we’re taking to fix the issue, and here are some more steps we’re going to take to improve going forward.”
With any incident or issue, it's never enough to just be content with the fix. You have to make sure your customers trust you with the decisions you’re making. A huge part of that is straightforward, honest communication.
A lot of those learnings are very individual. I was part of the AMA and it was fantastic to hear from customers – both the ones that encouraged us and the ones that surfaced their frustrations.
"I think when you treat people with respect and you give people that transparency as we did, you get surprise."
I was incredibly surprised by the support we got from customers. We all try to have a lot of empathy for customers, but it was really surprising how empathetic they were for us. We’d hear things like, “We really love how your company operates and we’re going to stand by you through this.” It was refreshing to hear since there is a bias within engineers to make sure you’re up 100% of the time (which we strive for), but when you fall short, it’s comforting to hear customers aren’t just kicking down the door with torches. At the same time, there were a variety of customers with hard questions that made us think critically about the solution before jumping to conclusions on the right implementation.
It’s easy to be transparent when life is good. Not so much when you’re on-call. It’s a question that really revolves around how a company defines its culture and figures out its values. At Keen, a lot of these values have been instilled since the very beginning. When I was first looking to join the company, there was a great post by our Chief Data Scientist, Michelle Wetzler, where she talks about her process of figuring out what her compensation would be here – everything from the offer process to negotiations.
So, it’s not just about the service offering. It’s important across the board that we’re open and transparent with each other, for issues that are both hard and easy.
"When we went through a tough time back in February, we could have just not said anything and who knows how that would have played out, but true values don’t just sit in a HR folder. They percolate throughout the company and allow individuals to make decisions based off of them. It’s in our DNA and an important part of who we are."
Structuring Incident Communication At Keen
It’s dependent upon the situation. If it’s a small issue, with low overhead, whoever is on-call will update the status page. If it’s a larger issue, we’ll escalate to our communications team to craft the message and update the page. If it’s an even larger issue, we’ll continue to work with the communications team, but might also do a blog post, an AMA, or send individual emails to customers. The degree of communication gets triaged based upon the severity. As an example, if query latencies are 5% slower than usual for a 5-minute period, it most likely isn’t going to require a StatusPage update, but if we lose the entire cluster and have major issues for 8 hours, it’s going to require a whole-team effort, including engineers fixing the problem, communications updating the status page, and customer support answering inbound tickets.
Whenever there is an incident, we have the notion of an incident manager. That incident manager is the one responsible for making the call on updating the page or asking for assistance from others on the communications team if they need help. The incident manager defaults to the person who got paged, but they can hand that role over to someone else as they feel necessary.
Yep, it’d be much easier if we had a simple web application running on MySQL! We’re talking about a system that can theoretically scale to very large event volumes and is theoretically fault-tolerant, depending on how we use our technologies.
"Being an expert in Scala, Kafka, Cassandra, caching, provisioning hardware, networking is pretty much impossible. It’s an incredibly broad domain and you never really know exactly what has happened."
Did the data center lose power? Did something else happen? On top of that, it’s especially hard when so much state is involved. We have a highly stateful system. As an organization, we’ve had to grow and change our processes to solve these issues.
Where we’re at right now is a partitioning between the Middleware and Platform teams, as I mentioned earlier. Platform works on storage and compute services – far-backend problems. The Middleware team is higher in the stack and deals with caching, dispatching API requests, some deployment services, and a couple of other pieces of tech. If an incident occurs, we have a top-level funnel that splits the incident based on whether it’s a middleware or platform problem. We also have a catch-all bucket if we don’t exactly know where the incident fits, although we try to limit the catch-all alerts. This specificity allows people to become experts in a particular domain, but at the same time, it also reduces the number of people available for on-call. It means more people on call at any given time, but each for a smaller set of potential problems. Instead of being on call once every 8 weeks, you’re now on call once every 4 weeks. That said, the partitioning lets us resolve issues much more rapidly, as opposed to having everyone on the team learn the ins and outs of every piece of technology that we use.
Yep, it’s definitely evolved over time. We actually have a meeting called "Hey, how about that on-call?" every 2 weeks. It lets us have an open dialogue about on-call and share both our successes and our problems.
"We’re essentially asking ourselves the question, 'What’s the best way we can do better for our customers and for our own lives?' That continual improvement dialogue has born a lot of fruits."
One conversation we’re having now is based around this Middleware and Platform split, while also recognizing that there’s a lot of stuff that doesn’t require much domain knowledge to deal with. As an example, what happens if there is an Apache instance on the fritz and we haven’t built a whole lot of automation to deal with it yet? Other people in the organization should be able to step in to ease the burden. We’ve thought about some type of first-layer triage that would give them the ability to learn more about our system while taking some of the weight off of the engineers on-call.
To be frank, we’ve been challenged by this. There’s this default notion that the person on call is the person to bear all of the weight, but it doesn’t have to be that way. You can find ways to service your customers to the best of your ability, but also have it be in line with the way you want your company to operate, and that’s where the interesting, hard conversations are. Otherwise, your default mode will keep on going and will cause burnout and resentment. So, I’m in full support of having those internal conversations. Ask your team - what’s going well and what’s not going well? Make sure people feel comfortable standing up to talk about it.
Our earlier architecture implementation was very much first come, first served. If you sent 1,000 queries to us all at once, we would try to run all 1,000 queries.
"It’s like going to a checkout line and waiting your turn. I remember when I was kid and going to Dairy Queen. All of a sudden, the local soccer team comes in right before you and now you’re at the back of the line and have to wait for 30 minutes. That’s essentially the way the system worked previously."
Since then, we’ve increased our sophistication. As an example, Customer A will have a line for servicing queries different from Customer B’s, which will be different from Customer C’s. If there’s extra capacity available, we’ll run some extra queries for you, but we try to do a degree of partitioning for inbound queries. Further down the stack, we try not to partition too much or have more infrastructure than needed, since that would make us too brittle in the long run, increase costs to our customers, and reduce our overall service level.
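To make that idea concrete, here is a minimal sketch of per-customer query queueing – hypothetical names, not Keen’s actual scheduler – where each tenant gets its own line and available capacity is handed out round-robin so one customer’s burst can’t push everyone else to the back:

```python
from collections import deque

class PerTenantScheduler:
    """Round-robin service across per-customer queues, instead of one
    global first-come, first-served line shared by every customer."""

    def __init__(self):
        self.queues = {}  # customer_id -> deque of pending queries

    def submit(self, customer_id, query):
        """Enqueue a query on the submitting customer's own line."""
        self.queues.setdefault(customer_id, deque()).append(query)

    def next_batch(self, capacity):
        """Take up to `capacity` queries, one per customer per pass,
        cycling until capacity is used or every queue is empty."""
        batch = []
        while len(batch) < capacity and any(self.queues.values()):
            for customer_id, queue in list(self.queues.items()):
                if queue and len(batch) < capacity:
                    batch.append((customer_id, queue.popleft()))
        return batch
```

With this shape, a customer that submits 1,000 queries at once still only gets one slot per pass, while spare capacity naturally flows to whoever has work queued.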
When I think about Keen, I think about two major facets that have contributed to our success so far. The first is the technology and product piece. The second is something we’ve touched on, which is values. We operate a bit differently than other companies by distributing decision making to individual people as opposed to a centralized command-and-control operation. It tells our employees, "You know as well as anyone else here, if not better, so you should make the decision on what to do." We don’t have managers sitting around telling you what to do. For us, it’s more of a discussion within a team about what the important things are to work on, as opposed to someone in a room somewhere passing those priorities down.
It’s especially interesting when you get to talking about services, incidents, and other pieces. Engineers are empowered on an individual basis to think about how we can improve the experience for our customers, and we trust the people working on the issue to make the right decisions.
"We really believe in distributed decision making and having the power to make the change that you want to make. It let’s let the great people be great. Why else did you hire them?"
StatusPage has been huge for us. The folks at Runscope initially told us about the product. Prior to that, we were small enough to send an email or post on Twitter as a starting point.
The ability to have a simple interface to reach our customers and communicate clearly has really changed the way we interact with customers during incidents. Without it, we’d have to build our own status page solution, and it wouldn’t be nearly as elegant or smooth. It’d be important, but we wouldn’t be able to focus on it without building a whole team around it. We’re also now making use of metrics. It’s been really fast and easy for us to push in our response times as well as event processing queues. For some, that may be too much information, but for others, it’s great to see. It lets us move a step closer to our customers and get information in front of them in a simple way.
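As an illustration, pushing a data point into a StatusPage system metric over its REST API looks roughly like the sketch below. The page ID, metric ID, and API key are placeholders, and the exact endpoint and parameters should be confirmed against StatusPage’s API documentation:

```python
import time
import requests

# Placeholders -- fill in from your own StatusPage account.
PAGE_ID = "your_page_id"
METRIC_ID = "your_metric_id"
API_KEY = "your_api_key"

def push_metric_point(value):
    """Push a single data point (e.g. a response time in ms, or a queue
    depth) to a StatusPage system metric so customers can see it."""
    resp = requests.post(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/metrics/{METRIC_ID}/data.json",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={"data": {"timestamp": int(time.time()), "value": value}},
    )
    resp.raise_for_status()
```

A small job like this can run on a schedule, sampling response times or queue depths from internal monitoring and forwarding them to the public page.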
It’s a simple interface that an engineer can deal with, someone in marketing can deal with, or someone in data science or customer success can deal with. It’s easy to use and incredibly valuable for communicating with our customers at the most important times.
How does your incident management compare to Keen IO's? We'd love to learn more in the comments below.