What happens when an app goes down? All about outages
Death, taxes, and outages: why being a software engineer isn't always so cushy.
A wise engineer once said that only 3 things are for sure: death, taxes, and outages. And when an app goes down, it’s the proverbial Titanic event for a company – all hands on deck, engineers getting paged at odd hours of the night, and frantic Slack Huddles until they find the culprit (it’s usually DNS). But what exactly is an outage? What does it mean for an app to go down? And why can’t teams just build apps that never go down?
Please enjoy the following 2,000 words on every engineer’s greatest nightmare.
The magical inner workings of cloud apps
The best way to understand what’s happening when an app goes down is to first understand what’s happening when an app isn’t down. What needs to go right for you to be able to open Twitter, scroll through your home page, click on a tweet, and send a reply? A lot!
Technical readers will be very familiar with the common engineering trope, “what happens when you type a URL into your browser?” – it’s a classic coding 101 interview question that is a lot more complicated than it seems. Instead let’s focus on all of the different components that a functioning app needs to have. All of these are hard at work, doing their jobs, when you use an app like Twitter:
1) A frontend
The frontend of an app is like the front of the house in a restaurant. The FOH’s job is to be the client interface: taking reservations, sitting you down, bringing you your food, etc. And an app frontend’s job is also to be the client interface: this is the part of the app that you actually see and interact with, click on, etc.
The frontend needs to sit on a server somewhere in the cloud, so that when you type that URL into your browser, there are some nicely formatted, sleek-looking HTML pages to send back to you. There’s not usually much raw processing power required for the stuff that the frontend does, so popular hosting options like Vercel will cache your frontend in data centers around the world so it’s closer to your users and loads faster.
2) A backend (two parts)
The backend is like the back of the house in the restaurant. The BOH’s job is to prep and make the food, which is (maybe) the most important part of keeping the joint open in the first place. All of the data you see on Twitter – tweets, usernames, like counts, etc. – that’s all food that was prepped and cooked in the Twitter backend.
When it comes to understanding outages, I like to think of the backend as having two parts: the database, and everything else. The database stores the actual data in question: the number of likes, the text of the tweet, etc. The “everything else” takes care of retrieving that data, formatting it, and getting it ready for the frontend to use. Developers are big on separating data and logic; think of the “everything else” as all of the supporting code that makes the data that the database stores useful.
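To make that split concrete, here’s a tiny sketch (in Python, with a made-up tweets table) of what a backend boils down to: the database holds the raw rows, and the “everything else” pulls them out and formats them into something the frontend can actually render.

import sqlite3

# A hypothetical, minimal "backend": an in-memory database standing in for
# the real one, plus the "everything else" code that formats raw rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, author TEXT, body TEXT, likes INTEGER)")
db.execute("INSERT INTO tweets VALUES (1, 'jack', 'just setting up my twttr', 150000)")

def get_tweet(tweet_id: int) -> dict:
    # The database stores the raw data...
    row = db.execute(
        "SELECT author, body, likes FROM tweets WHERE id = ?", (tweet_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no tweet with id {tweet_id}")
    author, body, likes = row
    # ...and the "everything else" retrieves it and shapes it for the frontend.
    return {"author": f"@{author}", "body": body, "likes": f"{likes:,} likes"}

print(get_tweet(1))  # {'author': '@jack', 'body': 'just setting up my twttr', 'likes': '150,000 likes'}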
The backend is usually the performance bottleneck for an application: how fast you can retrieve and format data is the biggest difference between fast apps and slow apps. So backend infrastructure is very important. This is why you have complicated app deployment frameworks like Kubernetes¹, to make sure that complex backends are running smoothly and quickly.
3) Networking
Networking is how all of your different app components talk to each other. Back in the day networking for apps was fairly simple, because your entire app ran on one big server. All a client (you on your laptop) needed to do was make requests to this server and get back a response.
Alas, for we have strayed from the simplicity of our fathers. Apps today, even ones that aren’t huge, can get incredibly complicated with tens, or even hundreds of different servers communicating with each other (“microservices”). Each of those servers needs permission to access those other servers on the right ports. In fact, this stuff has gotten so complicated that there are entire frameworks (like Envoy) that are just about managing the communication between your huge microservice fleet.
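To get a feel for what “servers talking to each other” looks like, here’s a toy sketch – the service name and port are hypothetical – of one service calling another over HTTP, with a timeout so a slow or unreachable dependency doesn’t drag the whole request down with it.

from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical internal URL: a "tweets" service asking a separate "users"
# service (running elsewhere, on its own port) for profile data.
USERS_SERVICE_URL = "http://users-service.internal:8081/users/123"

def fetch_user_profile():
    try:
        # The timeout matters: if the users service (or the network, or DNS)
        # is down, we want to fail fast instead of hanging this request too.
        with urlopen(USERS_SERVICE_URL, timeout=2) as response:
            return {"status": response.status, "body": response.read().decode()}
    except URLError as err:
        # One unreachable service is exactly how partial outages start.
        print(f"users service unreachable: {err.reason}")
        return None

fetch_user_profile()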
So you want to know what happens when an app goes down? If any single one of these things breaks badly enough, your app is toast.
What happens when [] doesn’t work?
So what exactly happens when one of those app components doesn’t work? First, it’s worth mentioning that there are actually three types of major outages you can run into:
1) The obvious one, where an app literally won’t load at all for you,
2) The less obvious one, where an app loads but some of the functionality runs into errors – like you try to tweet and it says “sorry, our servers are busy right now,” and
3) The most subtle, sinister one, where everything appears to be working fine, but on the backend something is going horribly wrong and your data is getting messed up. A common manifestation: you tweet, the app says everything went great, but in reality your tweet never actually got posted (there’s a sketch of how this can happen below).
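Here’s a contrived sketch of how that third, sinister kind of failure can happen: the code that saves the tweet swallows its own error, so the frontend cheerfully reports success while nothing actually got written. (The names are made up, and real bugs are usually subtler than this.)

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")

def save_tweet(body: str) -> bool:
    try:
        # Bug: inserting into a column that doesn't exist raises an error...
        db.execute("INSERT INTO tweets (body, author) VALUES (?, ?)", (body, "me"))
        db.commit()
        return True
    except sqlite3.Error:
        # ...but it gets swallowed here, so the caller never finds out.
        return True  # should be False (or re-raise)

# The frontend sees True and shows "Tweet sent!" -- but nothing was saved.
print(save_tweet("hello world"))                                # True
print(db.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # 0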
So, keeping in mind that outages can manifest in a bunch of different ways…
Frontend outages
These are generally pretty rare, since frontends are relatively simple to run and tend to be cached on servers around the world these days. Occasionally, some innocuous frontend code trying to format a date correctly will misfire, and an entire page won’t load. But most likely, if your frontend is going to cause issues, it will be smaller, annoying things like a button not working, a UI not acknowledging an action, etc. I think of these as bugs, not outages.
But there’s a thin line. Suppose the Twitter frontend is broken, so that when you write a tweet and click tweet, it doesn’t actually work. The backend is working completely fine, but the frontend won’t let you press that button; so everyone in the world is blocked from tweeting. I think most people would call this downtime or an outage.
The way I think about it is that the frontend is your interface to the backend. And if it’s sufficiently broken such that you can’t use the backend normally, then that’s an outage.
Backend outages
Unlike frontend outages, backend outages / downtime happen all the time. The most common culprit is the database. A few common scenarios:
Data is accidentally deleted from the database
This was the reason behind the massive Atlassian outage in 2022: a script accidentally deleted customer data. An innocent bystander might wonder: how dumb would you have to be to delete customer data from your database? It’s actually a lot easier than you think it might be. Consider the following SQL query that deletes a single user’s data from a users table, which is a common use case if you need to comply with GDPR regulations:
DELETE FROM users
WHERE user_id = '123'
See that clause at the end with the WHERE? That tells the database to only delete the row for that particular customer. If you accidentally deleted that single line from the query, you’d have this:
DELETE FROM users
And, you guessed it, this would delete your entire users table. Nowadays most companies back up their data, etc. etc., but my point is that this kind of fat finger is not impossible at all.
A variation on the deletion theme is accidentally rewriting data that needs to be in a specific format, like Retool did a few years ago.
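This is also why careful teams put guard rails around destructive queries. Here’s a minimal sketch of the idea (Python and SQLite purely for illustration): run the delete, check how many rows it actually touched, and roll back if the number looks wrong.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, email TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [("123", "a@example.com"), ("456", "b@example.com")])
db.commit()

def delete_user(user_id: str) -> None:
    cursor = db.execute("DELETE FROM users WHERE user_id = ?", (user_id,))
    # Guard rail: a single-user delete should touch exactly one row. If the
    # WHERE clause went missing, rowcount would be the size of the whole table.
    if cursor.rowcount != 1:
        db.rollback()
        raise RuntimeError(f"expected to delete 1 row, touched {cursor.rowcount}; rolled back")
    db.commit()

delete_user("123")
print(db.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 1 row left, as intended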
The database is under too much load and doesn’t work
If things are going well, your app grows and more users sign up every day. Which is great! But if your database isn’t equipped to handle all of that new data and new traffic, it can get so overwhelmed that it stops responding to requests normally. This happened several times when I was working at Retool. It’s also what caused a short AWS Lambda outage last year.
To fix the issue, you usually just need to size up the database. But it takes a while to figure out your app is down, find out why, and then actually fix it. Just because a fix is obvious or basic doesn’t mean things will get resolved quickly (but more on that later).
🔍 Deeper Look 🔍
There are a bunch of different reasons that databases can struggle to scale to higher amounts of traffic. The obvious one is just storage size: when you deploy a database, you pick how much storage you want. And if you get a lot of users, you can run out of that storage (just like on an SD card in your camera). But your database can also run into issues with speed: if you have tons and tons of requests coming in, you might need to scale your database horizontally, i.e. spread the load across multiple machines.
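One common flavor of horizontal scaling is sending reads to copies of the database (“read replicas”) so the primary only has to handle writes. Here’s a toy sketch of that routing idea, with two SQLite connections standing in for a primary and a replica – real setups involve actual replication, connection pools, and a lot more care.

import sqlite3

# Stand-ins for a real primary + read replica pair (entirely hypothetical).
primary = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
for conn in (primary, replica):
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")

def run_query(sql: str, params: tuple = ()):
    # Route reads to the replica and everything else to the primary, so read
    # traffic can be spread across as many replicas as you need.
    conn = replica if sql.lstrip().upper().startswith("SELECT") else primary
    cursor = conn.execute(sql, params)
    conn.commit()
    return cursor.fetchall()

run_query("INSERT INTO tweets (body) VALUES (?)", ("hello",))  # goes to the primary
print(run_query("SELECT COUNT(*) FROM tweets"))                # goes to the replica: [(0,)]
# (In real life, replication copies the primary's writes over to the replica,
# so it wouldn't stay empty like it does in this toy version.)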
Database migrations
Database migrations (Technically post on this topic forthcoming) are a normal part of growing companies. At some point, you’ll want to move your production database to another vendor with a faster, cheaper, or otherwise better solution for you. And if you’re ever doing a database migration, statistically it seems like you’re more likely to experience downtime than not. Examples in the past couple of years include Slack and Google Cloud.
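Part of why migrations are so fraught is that you usually can’t just flip a switch. One common pattern – sketched very loosely below, with two SQLite connections standing in for the old and new vendors – is to write to both databases for a while and only move reads over once the new one is backfilled and verified.

import sqlite3

# Hypothetical: "old_db" is the vendor you're leaving, "new_db" is where you're headed.
old_db = sqlite3.connect(":memory:")
new_db = sqlite3.connect(":memory:")
for conn in (old_db, new_db):
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")

CUTOVER_DONE = False  # flipped only after the new database is backfilled and verified

def write_tweet(body: str) -> None:
    # Dual-write phase: every write goes to both databases so neither falls behind.
    for conn in (old_db, new_db):
        conn.execute("INSERT INTO tweets (body) VALUES (?)", (body,))
        conn.commit()

def read_tweets():
    # Reads stay on the old database until you trust the new one.
    conn = new_db if CUTOVER_DONE else old_db
    return conn.execute("SELECT body FROM tweets").fetchall()

write_tweet("migrating soon")
print(read_tweets())  # [('migrating soon',)] -- still served from the old database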
Networking outages
Perhaps the most common type of outage is a networking outage, or when the different parts of your application fail to talk to each other properly. There are about a million ways for networks to mess up (e.g. this Microsoft outage), but far and away the most common cause of network outages for software is DNS.
🧠 Jog your memory 🧠
DNS, or Domain Name System, is how nice, human-readable URLs like Twitter.com get mapped to IP addresses, which are like the street addresses of computers and networks.
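You can watch that mapping happen yourself with a few lines of Python’s standard library (the exact IPs you get back will vary, since big sites resolve to lots of different servers):

import socket

# Ask DNS: "what IP address does this human-readable name point to?"
for hostname in ("twitter.com", "facebook.com"):
    try:
        ip_address = socket.gethostbyname(hostname)
        print(f"{hostname} -> {ip_address}")
    except socket.gaierror as err:
        # This is what a DNS outage looks like from a program's point of view:
        # the name simply doesn't resolve, so there's no address to connect to.
        print(f"{hostname} -> DNS lookup failed: {err}")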
This was the cause of the big Facebook/Instagram outage back in 2021, which I wrote about here:
The final piece you need to understand what happened to Facebook is called BGP routing. BGP stands for Border Gateway Protocol, and it’s commonly referred to as the “postal service” of the internet. Its job is simple: find the most efficient route for data to travel between two points on the internet. When you load www.facebook.com, BGP is responsible for finding the quickest, most efficient path to Facebook’s servers (and back), via the crazy, disorganized network of computers that is the internet.
The wacky thing about BGP is that it’s basically autonomous – there’s no central body controlling it, even though it’s seemingly one of the most important parts of internet bedrock. What that means is that misconfiguring anything related to BGP can take entire swaths of the internet offline, because traffic can’t find them.
This is what many people believe happened to Facebook. Big Blue operates its own data centers – thousands or even more interconnected servers – that store all of your data, host the app, and also carry internal Facebook services like email and internal tools. They seem to have (accidentally?) removed the BGP routes that connect their DNS – the mapping of Facebook domains to their server IP addresses – to the rest of the web. There was nothing wrong with Facebook’s servers or their apps; it’s just that we can’t access them via the internet.
DNS and adjacent things like BGP are responsible for so many outages that “it’s always DNS” has become a meme at this point.
For more examples see Square, Microsoft, and Virgin Media.
[bonus] Misc. outages
The most fun part of outages is that they come up with new reasons to happen every day. Though most will in some way relate to the three groups we’ve covered, there are more (many more). For example, Cloudflare had a big outage last year because one of their data centers lost power (like, literally, electricity) when the utility company did unscheduled maintenance. Or it gets too hot in the summer and your data center’s cooling systems break. Who knows!!
OK, but why do they take so long to resolve?
If you’ve read this far, hopefully you feel a decent degree of confidence in understanding why outages happen. But something that always bothered me is this: why does it sometimes take so long to fix these things? The Atlassian outage last year lasted 9 days, and you can bet your Bitcoin that as many engineers as possible were all hands on deck trying to fix it. Are they dumb?
The first thing to note is that most outages are resolved extremely quickly. But just as no local news channel has ever reported on no murders happening, you don’t hear much about small, quickly resolved outages (unless you’re a power user of the tool in question). So that’s important to mention. Having said that…
The reason that most outages take a while to fix is that it takes a while to figure out what has actually gone wrong. Developers have relatively sophisticated infrastructure to alert them that something isn’t working – like Datadog, PagerDuty, etc. – but that’s just step 1. When you know your app is down, you need to look for what in particular caused it to go down. And then once you know that it’s one thing (frontend, backend, database), you need to figure out which particular part of that thing caused it to go down. And then you need to figure out why that happened, so you can fix it and make sure it doesn’t happen again. All of this takes time! And sometimes the things that cause an app to go down are so comically niche and unrepeatable that it can be a genuine whodunit.
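Step 1 – knowing something is wrong in the first place – usually comes from monitoring: the app exposes a health check, and tools like Datadog or PagerDuty page someone when it starts failing. Here’s a bare-bones sketch of what such a check might look like (a hypothetical setup, standard library only):

import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

db = sqlite3.connect(":memory:", check_same_thread=False)  # stand-in for the real database

class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            # Is the database still answering? Real checks also cover caches,
            # downstream services, queue depth, and so on.
            db.execute("SELECT 1")
            self.send_response(200)  # monitoring sees 200s -> all quiet
        except sqlite3.Error:
            self.send_response(503)  # a run of 503s is what triggers the page
        self.end_headers()

if __name__ == "__main__":
    # Monitoring polls http://localhost:8000/ every few seconds.
    HTTPServer(("localhost", 8000), HealthCheckHandler).serve_forever()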
But that’s not always the reason that outages can take a while to resolve: sometimes the treatment isn’t entirely obvious from the diagnosis. Let’s use the Atlassian example again: it seems like they quickly realized why their apps were down, but the situation was so complicated that fixing it required a bespoke, customer-by-customer process:
So why is the restore taking weeks? On their “How Atlassian Does Resilience” page, Atlassian confirms they can restore deleted data in a matter of hours:
“Atlassian tests backups for restoration on a quarterly basis, with any issues identified from these tests raised as Jira tickets to ensure that any issues are tracked until remedied.”
There is a problem, though:
Atlassian can, indeed, restore all data to a checkpoint in a matter of hours.
However, if they did this, while the impacted ~400 companies would get back all their data, everyone else would lose all data committed since that point
So now each customer’s data needs to be selectively restored. Atlassian has no tools to do this in bulk.
They also confirm this is the root of the problem in the update:
“What we have not (yet) automated is restoring a large subset of customers into our existing (and currently in use) environment without affecting any of our other customers.”
For the first several days of the outage, they restored customer data with manual steps. They are now automating this process. However, even with the automation, restoration is slow and can only be done in small batches:
“Currently, we are restoring customers in batches of up to 60 tenants at a time. End-to-end, it takes between 4 and 5 elapsed days to hand a site back to a customer. Our teams have now developed the capability to run multiple batches in parallel, which has helped to reduce our overall restore time.”
So there you have it, folks. To resolve an outage quickly you need to know it’s happening, find the culprit, and fix it, all within a matter of hours (or ideally quicker). Sometimes being a software engineer isn’t as cushy as people think it is!
¹ Although you still probably don’t need it.