CrowdStrike IT Outage - Your questions answered
In this article, I outline the cause of the CrowdStrike outage affecting Microsoft Windows and attempt to answer the questions you may have.
Today (19 July 2024) saw a global IT outage caused by a bad update rolled out by CrowdStrike to their Falcon endpoint security suite. Let's explore the cause and try to answer any questions in layman's terms.
⚠️ These are my opinions and understanding of an evolving situation, and information here may be out of date or factually incorrect.
What is CrowdStrike?
CrowdStrike is a company that provides endpoint security. In particular, the update was for endpoint agents on CrowdStrike's Falcon service. Falcon provides a suite of services which include USB device protection, antivirus and threat intelligence, to name a few.
They control around 24% of the endpoint security market, meaning the number of systems affected is huge. Think of endpoint security as being like an advanced CCTV and lock system on your home: while your home will have a lock on the doors as standard, the addition of CCTV and better locks provides extra security and warning of intrusion.
What went wrong?
CrowdStrike rolled out an update for its endpoint security suite for Microsoft Windows which contained a bug. This bug meant that those Windows-based systems experienced a 'Blue Screen of Death' (or 'BSoD'), which is a crash state for the core Windows operating system.
How could this have happened?
Many news outlets are being critical of CrowdStrike for rolling out a defective update, but as a software developer myself, I know that these things aren't always clear-cut. It's still too early to know how the defective update came to be rolled out, but generally there's a code review, a quality assurance process and a testing phase prior to rolling out updates.
The issue with software is that, broadly speaking, you can only account for scenarios that you know about, and it's sometimes difficult to capture the edge cases that can cause issues. That said, it could highlight a process issue within CrowdStrike where a critical bug was allowed to be released.
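To make the point about edge cases concrete, here is a minimal Python sketch. It is entirely hypothetical: the rule format and the function names parse_rule and parse_rule_safely are invented for illustration and have nothing to do with CrowdStrike's actual code or file formats. It simply shows how code can pass every test the developer thought of and still crash on an input nobody anticipated.

```python
from typing import Optional


def parse_rule(line: str) -> dict:
    """Parse a hypothetical 'name,action,priority' rule line (illustrative only)."""
    fields = line.split(",")
    # Assumes three fields are always present; nothing here enforces that.
    return {"name": fields[0], "action": fields[1], "priority": int(fields[2])}


def parse_rule_safely(line: str) -> Optional[dict]:
    """Same hypothetical parser, but it rejects malformed input instead of crashing."""
    fields = line.split(",")
    if len(fields) != 3 or not fields[2].strip().isdigit():
        return None
    return {"name": fields[0], "action": fields[1], "priority": int(fields[2])}


# The scenario the developer thought of passes.
assert parse_rule("block_usb,deny,1")["priority"] == 1

# An input nobody anticipated (a missing field) crashes the naive parser...
try:
    parse_rule("block_usb,deny")
    crashed = False
except IndexError:
    crashed = True

# ...while the defensive version flags the problem and carries on.
rejected = parse_rule_safely("block_usb,deny") is None
```

In an ordinary application, a crash like this closes one program. In software that runs as part of the operating system's startup, the same class of mistake can take the whole machine down, which is why defensive handling of unexpected input matters so much there.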
Who is affected?
Businesses including airlines, airports, rail operators, supermarkets, broadcasters, healthcare trusts and governments around the world are all affected. Essentially, any organisation that uses CrowdStrike's Falcon endpoint security on Windows systems with the bad update is affected in some way.
The degree to which those organisations are affected largely depends on whether the endpoints are part of their critical infrastructure or not, as well as whether an organisation has contingency plans in place.
Are CrowdStrike to blame for organisations being affected?
Yes and no. Yes, where CrowdStrike's agent runs on critical infrastructure, the faulty update is to blame for that infrastructure being affected. However, there is an onus on any organisation to have contingency plans for when its critical infrastructure is impacted.
Why does it only affect Windows?
The issue is specific to their Windows endpoint agent, but this could easily have occurred on macOS or Linux too. This is in no way indicative of any kind of vulnerability in Windows itself.
When Windows starts, it also starts the endpoint agent from CrowdStrike. A bug that surfaces in a single driver file (csagent.sys) causes the system to crash, and because the agent loads on every start, the crash repeats on each reboot.
Are Microsoft partly to blame?
No. Microsoft cannot be held responsible when a third party rolls out updates that cause crashes in Windows. Windows is the operating system and the CrowdStrike agent is a piece of software that runs within that operating system.
Think of it like a car: if bad petrol causes a car to break down, it's not the responsibility of the car manufacturer to rectify the situation. Also, just as the issue only affects Windows, in my scenario with the car it's an issue that only affects petrol cars and not those that run on diesel.
Why is this issue so difficult to fix?
When an issue like this prevents a system from starting, some kind of manual intervention is going to be the only solution. It involves booting the affected system into safe mode to enable a patch to be installed.
There are technologies such as netboot which allow organisations to redeploy affected systems with either a fix or with the CrowdStrike agent removed. This still comes with a massive overhead, and will inevitably put IT support departments under massive time pressure, made worse by the trend for remote working, since many affected machines are not physically on site. As a result, the rectification will take organisations some time to roll out.
Is this a hack or cyber attack?
No, all indicators point to this being a bug within an update that was rolled out across systems and not the result of any kind of cyber attack.
Should organisations not roll out software updates?
Choosing not to roll out software updates, or to delay them, exposes organisations to business-critical risks. The general advice was, and still is, to roll out updates, particularly those which address security, at the first opportunity.
That said, sometimes things do still go wrong, as we've seen today (19 July 2024).
Is there a fix?
CrowdStrike have issued a fix. Anyone affected should refer to CrowdStrike's support article to find out how to deploy it. Microsoft are understood to be assisting CrowdStrike with the fix as well as putting in place mechanisms to roll it out.