There’s a lot to be said about making systems stable and reliable. On one hand, there are buggy systems that produce no errors: the payment portal for my neighborhood’s HOA hasn’t actually ever given me an error message, but it doesn’t behave correctly. That makes it bad.
On the other hand, you have software that behaves rather smoothly and tends to be a joy to use, but experiences occasional random errors. Maybe the front-end even handles these gracefully and the user never actually experiences problems—they’re exclusively the frustration that one software system feels in communicating with another. That, in turn, translates to slowness and compounds into scaling issues.
All software lives somewhere on this spectrum. Even the most well-tested software won’t match user expectations sometimes, and external errors (e.g., those produced when talking to a third-party API) cannot be avoided. And most commercial software has plenty of user-visible bugs, some of which are errors (in the technical sense) and some of which are not.
Today I want to write about exceptions that occur in code (specifically the sort that percolate to the top, like 500 HTTP responses, thrown exceptions, etc.). A healthy engineering org needs to monitor these. Almost everyone does, but for a variety of reasons not everyone takes action on them. Or, they don’t take the right actions. Either way, letting error rates creep up unaddressed can and will lead to catastrophic consequences: software that errors out is definitionally not doing its job, and software that doesn’t do its job isn’t going to get sold.
You will always have errors.
For Pinecast, I use Rollbar. It’s sufficient for my use case, tends to be quite fast, and has many of the features that I care about. Runway uses Sentry’s hosted offering, which is another fine option. Stripe used (and perhaps still uses) internal tools for Ruby errors and Sentry for front-end errors[1]. When I was at Box, they tossed their errors into Splunk, and I have a vague recollection of using NewRelic.[2]
Without fail, though, these error tracking systems end up clogged with noise. They have elaborate issue tracking features that I’ve never seen a team successfully use: the statuses (resolved, assigned, etc.), assignments, ticket links, and other features inevitably break down at scale.
I can imagine a small shop (say, under ten people) using this effectively. Errors, though, break out of the normal arithmetic of business operations: they grow as you build more features, but they also scale independently of lines of code written, as the number of users poking and prodding your app grows.
You can freeze your repo and stop deployments and only do non-technical work all day, and you’ll still accumulate new unique errors. The reason is simple: as you acquire more users (and specifically, more different users) with slightly different browsers and extensions and operating systems and GPUs and whatever else, you’ll end up exercising more code paths and edge cases. Any reasonably sized codebase will have bugs, and when you think you’ve fixed all of the ones you know about, there will always be more lurking and waiting. The mortal nature of our existence and the fuzzy, human considerations of the products we build preclude us from ending up with an immaculate, bug-free codebase. As a consequence, even when nothing changes, we can always expect to see more errors.
I’ve seen teams have “cleanup sprints” to fix a bulk of errors at once as a means of keeping this trend under control. Or someone will be assigned to fixing errors on a rotation. This helps, but getting the number to zero is something of a Pyrrhic victory: it’s only going to go back up.
Some organizations go through a revolving door of rewrites and refactors to address errors: the errors are sometimes a symptom of changing needs and old assumptions. These efforts can work, but they also carry a great deal of risk and can leave you with more problems than you started with.
The goal, though, isn’t really to stay at zero bugs. One of the great lessons of becoming a senior engineer is knowing that “perfect” is the enemy of “done” and pretty much every other kind of business impact, especially product work. You can whack-a-mole errors all day to keep the number at zero, or you can build value for the 99.999% of customers who aren’t encountering errors.
And this is true of even the best: AWS’s S3 advertises four nines of availability, meaning one in every ten thousand requests to get an object from S3 is allowed to fail. S3 advertises eleven nines of durability (there’s a 99.999999999% chance they won’t lose your file), but that still leaves a one in 100,000,000,000 chance that they do. And if you have, say, 100,000,000 small files, that works out to roughly a 1 in 1,000 chance you lose a file. Errors happen.
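If you want to check that last figure, here’s a quick back-of-the-envelope sketch in Python (my own arithmetic, not an official AWS calculation):

```python
# Probability of losing at least one of n objects, given the per-object
# loss probability implied by eleven nines of durability.
p_loss_per_object = 1e-11       # 1 - 0.99999999999
n_objects = 100_000_000         # 100 million small files

p_any_loss = 1 - (1 - p_loss_per_object) ** n_objects
print(f"{p_any_loss:.4%}")      # ~0.1%, i.e. roughly 1 in 1,000
```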
You need to read every error.
While it’s Sisyphean to try to keep the errors at zero, doing something to keep the errors low is still necessary. At the very least, the error rate shouldn’t increase superlinearly with the growth of your business (for some definition of growth that incorporates number of users, number of features, lines of code, etc.). What the process of addressing this looks like is going to vary; there’s no one-size-fits-all solution to addressing KTLO (“keep the lights on”) work.
Part of that process is reading every error. You can’t not read every error. At the very least, if you’re addressing some errors, you want to be addressing the errors that actually matter. Reading only a sampling of errors means you’ll eventually end up reading mostly common errors that have gone unaddressed and missing the one or two more serious (but rare) problems.
A naive approach is to address the errors that are most frequent. And sometimes this is valuable: errors hit frequently by users are likely more of a problem. But it’s rarely the case that the most frequent errors are the most impactful:
At Runway, we have a third party vendor whose API frequently drops requests (TCP connection failures and timeouts, 504 errors, etc.). These requests are high-volume but low-priority, and their failure doesn’t impact the user experience.
Common browser extensions often produce huge numbers of completely inactionable errors that can be difficult to filter or block (e.g., if they modify a UI that’s managed by React, you’ll get lots of invariant errors).
Some errors appear more frequently because they’re part of processes that are retried. User experience might be degraded by a delay caused by the retry, but there is no other impact. The volume of errors is magnified by the number of attempts.
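As a hedged illustration of that last point (not code from any of the systems mentioned above), a retry wrapper like the one below reports one error per attempt, so a single user action can show up in your error tracker several times:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_with_retries(do_request, attempts=3):
    """Each failed attempt gets reported, so error volume tracks the
    number of attempts rather than the number of user actions."""
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return do_request()
        except Exception as exc:
            logger.exception("attempt %d failed", attempt)  # one report per attempt
            last_exc = exc
    raise last_exc
```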
In fact, the most insidious errors are quite often the ones that don’t happen very often:
Some errors are just the tip of the iceberg. N+1 bugs, for instance, generate lots of excess database load but generally only produce errors for users with lots of records (when timeouts or memory limits are encountered); see the sketch after this list.
Some user actions are necessarily infrequent. At Runway, we have a video editor tool. The video editor makes lots of API calls to edit and save the video composition. When a user is done editing, they might export the video with a single API call. If the API endpoint to export were to break, that one failed call would be very low-volume compared to the many lower-impact failures that might occur during the editing process.
Some errors happen on a schedule. A cron job that runs once per day can only error once per day. An error with a rate of one per day is always going to rank lower than errors that can happen as a direct result of user action (assuming you have more than one user action per day).
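To make the N+1 case concrete, here’s a hedged sketch (a hypothetical podcast-ish schema, not actual Pinecast or Runway code): it works fine for a user with five shows and becomes a timeout for the user with five thousand.

```python
def podcast_stats(conn, user_id):
    """Classic N+1: one query for the shows, then one more query per show."""
    shows = conn.execute(
        "SELECT id, title FROM show WHERE owner_id = ?", (user_id,)
    ).fetchall()
    stats = []
    for show_id, title in shows:  # N additional round trips to the database
        (episode_count,) = conn.execute(
            "SELECT COUNT(*) FROM episode WHERE show_id = ?", (show_id,)
        ).fetchone()
        stats.append((title, episode_count))
    return stats  # a single JOIN + GROUP BY would do this in one query
```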
And so going beyond naive prioritization, you need to look at all—or at least the majority—of your errors. Adopting a regular process for inspecting and triaging/fixing problems is a fundamental first step, and I don’t think there’s a single right way to go about this.
Pulumi recently wrote this blog post where they go into detail about their process for reading every error:
To quote one of the paragraphs right at the top:
You should read every error message that your system produces. Simple but effective. Our team pumps every 5XX into a Slack channel and reviewing each of these is a top priority for the current on-call engineer. There’s a little more to it, but that’s the gist! Commit to this process and your error rates are guaranteed to drop.
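Pulumi doesn’t publish the plumbing, but a minimal sketch of that kind of pipeline (Django-flavored middleware and a hypothetical SLACK_WEBHOOK_URL, purely illustrative) might look something like this:

```python
import os
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical config

class FiveHundredsToSlack:
    """Post every 5XX response to a Slack channel for the on-call to review."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        if 500 <= response.status_code < 600:
            requests.post(
                SLACK_WEBHOOK_URL,
                json={"text": f"{response.status_code} on {request.method} {request.path}"},
                timeout=2,
            )
        return response
```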
I certainly don’t want to say that they’re doing it wrong, because if it works for them, it works for them. And it could work for you too! But there are inevitable pitfalls that you’ll run up against, and some kinds of software will encounter those pitfalls sooner than others.
You can’t read every single error.
Not every error is created equal. Some errors, when investigated, have a clear root cause. There’s a bug in the code with a clear change that needs to be made, and it can be triaged and prioritized. Other errors live in a fuzzy gray area where it’s not entirely clear what you should be doing other than erroring.
Pinecast is home to a lot of great examples. In the early days, I’d have almost no errors: even with a few hundred DAUs, I’d have maybe one error per week. And almost always, these were “real” bugs that I could patch up and ship a fix for. Today, Pinecast integrates with dozens of third party services. Here’s a recent wishy-washy error:
ReadTimeout: HTTPSConnectionPool(host='itunes.apple.com', port=443): Read timed out. (read timeout=8)
timeout The read operation timed out
The software attempted to contact one of Apple’s servers and it failed with a timeout after eight seconds. This request was made as the result of a user request, and a user’s browser was sitting waiting for a response from our server.
It’s not clear why there was a timeout, and it’s not a pattern. Maybe something on their end crapped out, maybe the NAT gateway in my VPC crapped out, maybe a gremlin living in the fiber optic cable connecting AWS to Apple gobbled up the packets—who knows. It’s the public internet, requests fail.
When this specific failure occurs, the front-end shows a meaningful error to the user and allows them to retry. That’s about as good as you can do, really. We probably don’t want to auto-retry (eight seconds is quite a long time; a second timeout would push that request to 16 seconds). And we need the user’s input to act on the response, so we can’t just queue the request and process it later.
Just to take a quick beat: I should underscore that this class of problem isn’t special. Almost every business has this flavor of error. If you interact with third parties, if you deal with locking or distributed systems (read: databases), if you run on composed-of-molecules hardware that exists in the real world, you will have some expected failures. Maybe a disk failed. Maybe your EC2 spot instance got preempted. At some point and at some scale, normal operating conditions[3] will not meet your expectations and will translate to failures. Obviously nobody wants these failures, but they will exist.
HTTP doesn’t have a great status code for us here. We could respond with a 200 containing an error, but that’s not right. A 5XX error is definitely the right response (we fucked up, not you). 504 Gateway Timeout is probably the most correct: in a way, we’re acting as a gateway. A plain-old 500 would also suffice, as the error is internal to the server, and the details of our interaction with Apple are immaterial to the client anyway.
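In code, the handling looks something like this hedged sketch (a hypothetical Django-style view, not Pinecast’s actual implementation):

```python
import requests
from django.http import JsonResponse

def itunes_lookup(request):
    """Surface an upstream timeout as a 504 the front-end can show to the
    user, instead of letting it bubble up as an unhandled 500."""
    try:
        upstream = requests.get(
            "https://itunes.apple.com/lookup",
            params={"id": request.GET.get("id", "")},
            timeout=8,  # the eight-second timeout from the error above
        )
        upstream.raise_for_status()
    except requests.exceptions.Timeout:
        # Expected-but-unwanted failure: let the user decide whether to retry.
        return JsonResponse({"error": "upstream_timeout"}, status=504)
    return JsonResponse(upstream.json())
```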
If you’re thinking ahead, you probably already see the problem: suddenly you have requests that do fail but that you don’t care about, which means your way of measuring errors is broken. Your rate of 500s probably can’t (practically) be zero at any real scale. At some point, you simply will not be able to read every error.
“Well what if we only read unique errors?” I challenge you: what is a unique error? How do you differentiate between a random bit flip in your router causing a once-per-month timeout in a request and “we forgot to set appropriate limits and now we’re timing out trying to upload 10GB of some random user-uploaded data to a third-party endpoint” if they both present with the same stack trace? Or even just “we’re trying to talk to a third party that’s having an incident”? The answer is that you can’t truly differentiate. Anyone who has maintained a React project knows that a litany of “Invariant” errors with different causes will have nearly-identical stack traces.
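To see why, consider a naive grouping key built from the exception type and the frames of the stack trace (a hedged sketch; real tools like Sentry and Rollbar group more cleverly, but they run into the same wall):

```python
import hashlib
import traceback

def fingerprint(exc: BaseException) -> str:
    """A flaky router and a missing upload limit that both surface as a
    ReadTimeout at the same call site produce the exact same key."""
    frames = traceback.extract_tb(exc.__traceback__)
    locations = [(f.filename, f.name, f.lineno) for f in frames]
    raw = f"{type(exc).__name__}:{locations}"
    return hashlib.sha1(raw.encode()).hexdigest()[:12]
```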
And for what it’s worth, you could work around all of this diligently and catch/handle every “expected” error. You could add something to your HTTP responses so your middleware knows whether a 5XX is expected or unexpected. You could write your own framework, or use a non-HTTP transport where the notion of 5XX isn’t so fuzzy. You could build your own error reporting system so that you can still collect the “expected” errors but segment them off into their own world so alerting still works.
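If you did go down that road, the first option might look like this hedged sketch (a hypothetical ExpectedUpstreamError exception and X-Error-Expected header, Django-flavored and purely illustrative):

```python
from django.http import JsonResponse

class ExpectedUpstreamError(Exception):
    """Failures we consider normal operating conditions: third-party
    timeouts, dropped connections, and so on."""

class TagExpectedErrors:
    """Label 'expected' 5XXs so alerting can segment them from real defects."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        return self.get_response(request)

    def process_exception(self, request, exception):
        if isinstance(exception, ExpectedUpstreamError):
            response = JsonResponse({"error": "upstream_failure"}, status=502)
            response["X-Error-Expected"] = "1"  # alerting filters on this header
            return response
        return None  # unexpected errors still reach the error tracker
```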
Will you? Probably not, because companies need to get work done and unless you’re FAANG and have teams of engineers who can devote themselves to this full time and drive a big migration to get the whole company on board, you’re either wasting time that could be spent on far more important problems or you’re working towards achieving an outcome that was never important in the first place.
WAYRTTD?
What are you really trying to do?
In every role I’ve had at every sizable company I’ve worked for, there’s simply no way resources can practically be dedicated to looking at every error. The noise of expected errors means a firehose approach (like posting every 500 error to Slack, as Pulumi does) is simply going to train folks to ignore the stream, because more than a few of the errors are unactionable. Of course it works for some folks, and that’s fine! Scale becomes a problem, though. Even at Runway, where we still have a small engineering team, we’re well past the point where this is feasible.
Stepping back, what is feasible? Our primary motivation is finding and fixing defects. Sometimes the symptoms of a defect are mostly-indistinguishable from signs of “normal” operation. Some errors we care about and some we don’t.
Here’s what Pulumi writes:
The team began obsessing over the user experience. Following this process builds a visceral understanding of how your system behaves. You know when a new feature has a bug before the first support tickets get opened. You know which customer workloads will require scaling investments in the next 90 days. You know what features see heavy usage and which ones customers ignore. You’re forced to confront every wart in your application.
The values/goals they state:
Great user experience is important
Knowing about bugs soon after they’re experienced is important
Resource use should be observable
A process around problems should exist
A breadth of problems should be seen, not just the most common ones
And so the goal was never really “look at every error”; it was more nuanced than that: address errors, preferably quickly, preferably with a bias towards the problems that impact customers, and preferably on a regular cadence.
This is a problem of procedure, in two parts:
A diverse sampling of recent errors should be investigated regularly.
Why bother collecting errors if you’re not going to look at them? Of course every team that cares about reliability and UX should be looking at errors. The approach should bias for breadth.
The difficulty/complexity and impact of the fix should be documented and considered.
The reason some errors are common is that they don’t get fixed. If the most common errors were easily fixable, they wouldn’t be the most common errors anymore. Often the effort of fixing a problem far exceeds the expected impact on the customer experience or the business. Making sure it’s understood why a fix hasn’t been written and deployed makes it possible to triage very common issues alongside very infrequent ones.
How you break down the full list of errors to find the ones that are infrequent-but-serious is never going to be one-size-fits-all: how many errors come in and the processes for resolving them vary wildly by product. Having something here is better than nothing, and you can then decide whether it’s working or needs improvement.
The key for me is in the second point: too often, errors are treated as untriaged mysteries until a fix is actually being written for them (or until someone has expressed intent to write a fix). Before that point, lots of people usually look at an error, investigate it, and decide it’s low-impact or expected, or too complex to figure out or fix at the moment.
Everyone has done this: you look at a stack trace, read some code, dig into a git blame, maybe even write a quick test. And some amount of time later, you don’t have a clear answer and move on with your life. By not documenting what you found, though, the organization is no closer to having the problem resolved. I’m very guilty of this myself: it’s easy to dip your toe into a quick investigation without committing any effort towards a resolution.
In a perfect world, each issue shows:
Any investigation effort, including what the error is believed to be
An explanation of why the error is low-impact
An explanation of why the error is low-priority
Ultimately, the goal is to be able to go through the list of reported errors and rank it so that common problems that have been deemed less important can be separated from the ones which haven’t. For what it’s worth, I’m not aware of any tools like Sentry or Rollbar that do this for you, but they should. Integration with ticket tracking is probably the best way to manage this workflow.
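Concretely, the record I’d want attached to each issue might look like this (a hedged sketch with hypothetical field names, not a feature of any existing tool):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ErrorTriage:
    fingerprint: str          # grouping key from the error tracker
    believed_cause: str       # what the investigation concluded
    impact: str               # why it is (or isn't) hurting users
    priority_rationale: str   # why it is (or isn't) being fixed right now
    last_reviewed: date = field(default_factory=date.today)

triaged = ErrorTriage(
    fingerprint="itunes-read-timeout",
    believed_cause="Upstream timeout talking to itunes.apple.com",
    impact="User sees a retry prompt; no data loss",
    priority_rationale="Expected third-party failure; no fix planned",
)
```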
What’s the goal?
The goal is better software. I think a lot of the time we get deeply stuck at the point where it’s apparent that we aren’t keeping up with KTLO work and new errors, and the easy reaction is to reach for some all-encompassing solution. “We need to drive the number to zero!” No! You only need to improve the process enough to have the number trend downward (and remain downward-trending over time).
Perfectionism is naive and impossible. All software has bugs and probably always will. If you’re doing anything at all to address errors and keep your users happy, you’ve already cleared the biggest hurdle. Beyond that, adding structure to your process and making sure your attention to KTLO work doesn’t lose momentum is the key to success. When the process isn’t working, first make sure you’re giving it resources, and then look at which parts of it are failing and iterate.
[1] I don’t know what Stripe uses for other languages; I never worked on anything in another language while I was there.
[2] I truthfully can’t remember what Uber used, which is either a sign they were doing great or a sign they were doing terribly.
[3] Things in the real world failing is normal!
I think the goal isn't actually better software; it's durable business success, which typically translates to happy users, which translates to better software. Of course "better" is doing a lot of work here -- better to whom? Turtles all the way down 🙃.
Anyway, I think as it relates to operational excellence, we'd do well to think of Meadows's advice to "dance with systems". Vis-a-vis happier users, we need to *understand* the health of our systems *as they appear to our users*, and we need to gain this understanding with a reasonable time investment.
I think this largely tracks with your suggestions and is in tension with Pulumi's blog post -- I mean, if it works for them, great! But as an argument, I'm not at all convinced that the only way to have happy users is to read every error. Plenty of folks don't do that and have happy users.
So I guess my biggest quibble here is just: let's stop calling this stuff KTLO. Gardening our systems so as to avoid haunted forests and increase the rate at which we can safely change them (read: ship new features) is no less deep a domain or focus area than "distributed systems" or "user experience". I think calling it KTLO does it a disservice and primes us to think about it as a tax. Sure, *toil* is annoying, and some amount of toil is part of operational work, but toil and operational excellence and KTLO are not synonyms!
there is so much alpha in just looking at errors that people are ignoring
usually it points to parts of the software that are poorly observed
which then points to things that are unknown to others
which again points to places where large pareto chunks of opportunity (muh impact) have not been noticed (low hanging fruit)