In what was probably the biggest IT failure in the history of humanity, computers across the globe got fried over the weekend by an errant software update. Planes were grounded, banks were offline, and machines in hospitals stopped working. There has been a big pile of articles analyzing what this means and why it’s a big deal. Instead of that, this post will explain what actually happened from a technical perspective, in simple, easy-to-understand language.
The outage itself: what went wrong
Let’s first focus on what actually happened. What broke?
Let’s start from the basics. Every device on the planet that runs software – whether it’s your phone, a laptop, a large server in the cloud, a TV at an airport, or a teleprompter in a studio – is built on an operating system. An operating system is the mastermind behind computers: it orchestrates all of the behind-the-scenes magic that lets something like Excel work on your laptop, or flight arrival times display on a TV.
The most popular desktop operating system in the world by far is Microsoft Windows. As of last month, more than 70% of desktop systems were running Windows, which may come as a surprise to Mac-loyal readers (15%). Windows is especially popular in commercial applications like factory robots, hospital devices, airplanes, and any type of computer that’s not a phone or laptop. The initial Windows release was 38 years ago, in November 1985.
Now, if something goes wrong with the operating system, like Windows, your device is completely fucked. If the operating system can’t run, the device can’t run; simple as that. And this is exactly what happened with the CrowdStrike outage. It (we’ll get to what “it” is) messed up the Windows operating system on the machines it runs on, so computers that use the software couldn’t start at all.
Of all of the types of outages out there, an operating system not being able to boot up is perhaps the most sinister, because the only way to fix it is manually, by rebooting the computer in a special way that allows you to remove the offending code. But the people using these devices are mostly not computer experts, and in many cases probably don’t even know what an operating system is. This is part of why it took so long to resolve this outage, and why it required significant hand-holding from CrowdStrike (more on this later).
So what caused this outage? And who is CrowdStrike?
What does CrowdStrike do?
If you grew up in the ’90s like I did, you probably had something like Norton installed on your computer.
Norton was (and is) antivirus software. It sits on your computer and attempts to protect it from viruses that hackers and malicious actors try to spread around the web.
Antivirus software works in some generic ways – like checking for generally suspicious looking files, or files that it hasn’t seen before. But the main way that antivirus software does its job today is roughly the same way that “Wanted” signs work at the post office. Norton has a team of researchers finding all of the latest vulnerabilities and nasty software that hackers are using [1]. Whenever they find new stuff, they update your computer to look for that stuff, and sound the alarm if it’s there. They constantly publish these kinds of updates to the software so your computer knows what to look for.
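To make the “Wanted” poster idea concrete, here is a minimal sketch in Python. It assumes the simplest possible kind of signature, a fingerprint (SHA-256 hash) of a known-bad file; real products use far richer rules. The names `KNOWN_BAD_HASHES`, `is_known_malware`, and `apply_update` are made up for illustration and are not anything Norton or CrowdStrike actually ships.

```python
import hashlib

# Hypothetical "wanted poster" list: fingerprints of files known to be bad.
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"pretend this is malware").hexdigest(),
}


def is_known_malware(file_bytes: bytes, signatures: set[str] = KNOWN_BAD_HASHES) -> bool:
    """Return True if the file's fingerprint is on the wanted list."""
    return hashlib.sha256(file_bytes).hexdigest() in signatures


def apply_update(signatures: set[str], new_signatures: set[str]) -> set[str]:
    """Return a new wanted list with the vendor's latest posters added."""
    return signatures | new_signatures
```

A file that isn't on the list sails through until an update adds its fingerprint, which is why these updates get pushed out so frequently.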
This is pretty much exactly what CrowdStrike does, just in 2024 and with better marketing. But instead of selling their software to consumers like you and me, they’re focused on selling it to businesses. 300 of the Fortune 500 use CrowdStrike; it’s everywhere. That’s how this outage ended up affecting so many different types of devices in so many industries: the owners of those devices, from Delta to Chase, were all CrowdStrike customers.
Antivirus and the operating system
In order for modern antivirus software to work, it needs to have permission to live in the deepest recesses of a computer’s operating system. Like, way in there. It’s monitoring all of the little different things an operating system does, so it needs to be able to see those things and stop them if it doesn’t like what it sees. In fact, it probably has the deepest “security clearance” of any piece of software on any computer. It has total control over what your operating system runs, plus how and when it runs it.
This privileged access that antivirus software has to operating systems has long been a topic of controversy. Way back in 2008, a researcher wrote a whitepaper outlining vulnerabilities in antivirus software itself, and how bad things could get given this relationship. For more depth on this topic, I highly recommend Jan Kammerath’s post on the outage.
On Friday, CrowdStrike released a software update that had a new Channel File in it. Remember how antivirus companies have teams researching the latest vulnerabilities? When they find a new one, they update their software to start looking for it. That’s what a CrowdStrike Channel File is: a file configured to detect a specific type of malware. It’s like a new “Wanted” poster for your computer. This one was called something innocuous like:
C-00000291.sys
Unfortunately, this file had a major issue: it tried to access a piece of data on Windows that doesn’t actually exist. This is in the same family of bugs as what developers call a Null Pointer Exception: a program tries to find a piece of data, it’s not there, and things go haywire. This is a very common kind of bug that software engineers accidentally overlook all the time, without drastic consequences; usually, the operating system just shuts down the offending program and everything else keeps running. I’ve encountered many Null Pointer Exceptions in my own code over the years. Generally, a run-of-the-mill thing.
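Here is a rough analogy for the ordinary, low-stakes version of this bug. Python has no raw pointers, so this sketch uses a missing list entry as a stand-in for data that isn't there; `read_field` and `run_ordinary_program` are hypothetical names, and this is not CrowdStrike's code.

```python
def read_field(fields: list[str], index: int) -> str:
    """Read one field from a record; raises IndexError if it is missing."""
    return fields[index]


def run_ordinary_program(fields: list[str], index: int) -> str:
    """Simulate a normal app: a bad read stops this one task, nothing else."""
    try:
        return f"read {read_field(fields, index)!r}"
    except IndexError:
        # The safety net: the faulty task ends, the rest of the machine carries on.
        return "program stopped: data was not there"
```

Asking for a field that exists returns it; asking for one past the end of the record trips the safety net, and only this one program is affected.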
But in this case, it wasn’t run-of-the-mill at all. Precisely because CrowdStrike has such an intimate relationship with the operating system, this failure in its software took down the entire operating system. Once this messed-up file reached your system in the software update, your Windows machine couldn’t even start. Hence, the dreaded blue screen of death.
Why did this take so long to resolve?
Most incidents like this get resolved relatively quickly. But despite the fact that CrowdStrike pushed a fix to the offending file almost immediately after the issue was discovered, some systems continued to be offline for hours, and in many cases even days.
This, too, is a symptom of that privileged relationship with the operating system we talked about. Normal software can receive updates over the internet that fix a bug. But for many devices running different versions of CrowdStrike, in order to fix the issue and get the operating system running again, users needed to manually restart in safe mode – a special type of restart mode made for fixing stuff like this – and then manually find and delete that C-00000291.sys file.
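For the curious, the “find and delete” step boils down to something like the sketch below. It is an illustration, not official remediation guidance: `remove_bad_channel_files` is a hypothetical helper, the folder is whatever the caller passes in, and the wildcard pattern is an assumption (that the real file name may carry extra characters after `C-00000291`).

```python
from pathlib import Path


def remove_bad_channel_files(driver_dir: Path, pattern: str = "C-00000291*.sys") -> list[str]:
    """Delete files in driver_dir matching the pattern; return their names, sorted."""
    removed = []
    for path in sorted(driver_dir.glob(pattern)):
        if path.is_file():
            path.unlink()  # remove the offending channel file
            removed.append(path.name)
    return removed
```

The hard part was never the deletion itself. It was that someone had to physically get each broken machine into safe mode before anything like this could run.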
Remember: most users of these kinds of devices aren’t computer experts. Major hand-holding is (and was) required to walk someone through how to do this type of thing (have you ever tried using the Terminal?). And that’s part of why, even at the time of writing, some airlines are still cancelling flights.
Why Windows in particular?
The last thing (probably) worth mentioning is why this issue affected Microsoft Windows in particular. CrowdStrike does have sister products for macOS and Linux, the other two largest desktop operating systems in the world. The simple explanation is that this particular channel file (remember: the wanted poster) was for a Microsoft Windows vulnerability, not a macOS or Linux one. The even simpler one is that Windows is just dominant in the enterprise, commercial universe: most of these devices just don’t run on macOS or Linux.
But that’s not the whole picture. Compared to Apple, Microsoft’s approach to building Windows is much more open, with a looser integration between software and hardware. Apple doesn’t allow software like CrowdStrike to have control over what the operating system does to the same degree, so a failure like this is far less likely on a Mac. There’s also a narrative here (true or not, I can’t opine) about Microsoft’s weakened stance on security, and neglect of Windows since they started focusing on the cloud.
Questions? Email me or leave a comment.
[1] Do not ask me how they do this.
Good article, thanks. One thing I wonder if you'd research more (I'm no expert): I have often read that Linux is the dominant operating system on the server side of the world (not the desktop side). Here's one link - https://w3techs.com/technologies/comparison/os-linux,os-windows. I have experience as an enterprise architect, and in the enterprise I worked in, Linux had a far larger share of the servers. Also, if we looked at the percentage of services people use that were unavailable, large numbers of services were unaffected.
Your article is excellent in explaining a complex subject in a way that a large number of people can understand. I'm hoping that the clarification of Windows not being the dominant OS for servers and the services people use - would just further help what you have accomplished so well.
Very timely article, thanks!