Last Friday, a wave of digital disruptions swept across the globe as Microsoft Windows users found themselves facing an unexpected and severe issue: their computers crashed and simply refused to boot.
This seemingly small glitch spiraled into a global crisis, causing major television networks to go dark, disrupting air travel with mass flight cancellations, and forcing countless hospital appointments to be rescheduled.
Initially, it seemed as though the culprit might be a Microsoft update gone awry. However, it soon became clear that the issue was not with Microsoft’s software but rather with an update from cybersecurity firm CrowdStrike. All affected systems were running CrowdStrike’s security solution: Falcon.
Now, CrowdStrike has released its preliminary post-incident review (PIR), detailing exactly how the outage happened.
Here, we’ll take a look at what went wrong, what the firm is doing to prevent the same mistake from happening again, and the lessons all organizations can learn.
What happened?
The chaos last Friday, which saw Windows PC users worldwide grappling with unresponsive computers, traced back to a problematic update for CrowdStrike’s Falcon Sensor product.
Falcon, CrowdStrike’s flagship platform, is designed to thwart breaches with a suite of cloud-based technologies that tackle various types of cyber threats, including malware.
CrowdStrike’s Falcon platform receives updates in two main ways: Sensor Content, which is delivered with the sensor itself, and Rapid Response Content, which is delivered more frequently in smaller updates so the platform can adapt quickly to the evolving threat landscape.
However, last week’s disaster was triggered by a flaw in one Rapid Response Content update. According to CrowdStrike’s PIR, the update contained problematic content data that slipped through validation undetected, and the flaw only surfaced when the sensor processed the update on customers’ machines.
CrowdStrike detailed that the problematic update, delivered via Channel File 291, caused an out-of-bounds memory read. This triggered an exception that could not be gracefully handled, and because the Falcon sensor runs as a Windows kernel driver, the unhandled exception brought down the whole operating system, resulting in a Blue Screen of Death (BSOD) on affected machines.
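To make the failure mode concrete, here is a minimal, hedged sketch in Python. The Falcon sensor is a kernel driver, not Python, and the names `template_fields`, `interpret`, and `channel_entry` are hypothetical; the point is the pattern: content data drives an index past the end of a table, the out-of-bounds access raises an exception, and because nothing handles it, the whole program dies. In kernel mode, "the whole program" is Windows itself, hence the BSOD.

```python
# Illustrative sketch only: this mimics the general failure pattern described
# in the PIR, not CrowdStrike's actual code. The names `template_fields`,
# `interpret`, and `channel_entry` are hypothetical.

template_fields = ["field_0", "field_1", "field_2"]  # what the interpreter expects

def interpret(channel_entry: dict) -> str:
    # The content data supplies an index; malformed content can point past
    # the end of the table. Python raises IndexError here; in a C kernel
    # driver, the equivalent out-of-bounds read is undefined behaviour.
    index = channel_entry["field_index"]
    return template_fields[index]

print(interpret({"field_index": 1}))   # well-formed entry: works fine
print(interpret({"field_index": 20}))  # malformed entry: the unhandled IndexError
                                       # kills the whole process, the user-mode
                                       # analogue of a kernel-mode BSOD
```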
In essence, the root cause of the outage was a bug in the Content Validator, the component responsible for checking Rapid Response Content before release. The defective content passed validation, was pushed to Windows systems, and caused widespread crashes and operational disruption.
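The exact checks the Content Validator performs are not public, so the sketch below is only an assumption-laden illustration of the idea: a pre-release gate that inspects each content entry and rejects anything the sensor could not safely interpret.

```python
# Hypothetical pre-release validation gate, sketched for illustration only.
# The field names and limits are assumptions, not CrowdStrike's real checks.

MAX_FIELD_INDEX = 2  # highest index the (hypothetical) interpreter supports

def validate_channel_file(entries: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the content may ship."""
    problems = []
    for i, entry in enumerate(entries):
        index = entry.get("field_index")
        if not isinstance(index, int):
            problems.append(f"entry {i}: field_index missing or not an integer")
        elif not 0 <= index <= MAX_FIELD_INDEX:
            problems.append(f"entry {i}: field_index {index} is out of bounds")
    return problems

# The second entry would be rejected here instead of crashing hosts later.
print(validate_channel_file([{"field_index": 1}, {"field_index": 20}]))
```

Had a check along these lines fired, the defective file would have been blocked at build time rather than crashing machines in the field.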
Why didn’t testing catch the flaw?
In the wake of last week’s widespread disruptions, there has been increased scrutiny over CrowdStrike’s quality assurance (QA) processes.
In statements, CrowdStrike has emphasized that its updates undergo a rigorous QA regimen designed to catch potential issues before they impact users. According to the firm, this process includes a combination of automated testing, manual testing, validation, and a controlled rollout.
The typical sensor release process involves comprehensive automated testing both before and after the code is integrated into the main codebase. Once the update is live, customers are able to apply it in a managed manner, minimizing risk.
However, the recent outage stemmed from a Rapid Response Content update, which follows a different set of procedures.
Unlike standard updates, Rapid Response Content is designed for rapid deployment to address emergent threats. This expedited process, while essential for timely responses, is subject to different testing protocols intended to balance speed with thoroughness, although in this case the tests were clearly not thorough enough.
Could this happen again?
While you can never say never, it appears CrowdStrike has taken this incident very seriously. The firm has outlined a number of new measures to prevent a similar outage from happening again, including:
- Staggered deployment strategy: To enhance the safety of Rapid Response Content updates, CrowdStrike will adopt a staggered deployment approach. This involves gradually rolling out updates to increasing segments of the sensor base, starting with a smaller “canary” deployment. This phased approach allows issues to be detected early, before the update reaches the full fleet (a sketch of this pattern follows the list below).
- Improved monitoring: CrowdStrike plans to enhance its monitoring of both sensor and system performance during Rapid Response Content deployments. By actively collecting feedback throughout the deployment process, the company aims to identify and address any issues in real time, facilitating a more controlled and informed rollout.
- Granular control for customers: Recognizing the need for greater flexibility, CrowdStrike will introduce features allowing customers more control over Rapid Response Content updates. This includes options for granular selection of deployment timing and scope, enabling users to better manage and schedule updates according to their operational needs.
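To make the first item concrete, below is a minimal sketch of ring-based (“canary”) rollout logic. The ring sizes, the health check, and the host list are all assumptions for illustration; CrowdStrike has not published its rollout mechanics, so this shows the general pattern rather than its actual pipeline.

```python
# Minimal illustration of a staggered ("canary") rollout. Ring sizes, the
# health check, and the host names are all hypothetical.
import random

def healthy(host: str) -> bool:
    # Stand-in health check; in practice this would be crash telemetry,
    # sensor heartbeats, etc. Here we simulate a 99% per-host success rate.
    return random.random() < 0.99

def staged_rollout(hosts: list[str], rings=(0.01, 0.10, 0.50, 1.0)) -> bool:
    deployed = 0
    for fraction in rings:
        target = int(len(hosts) * fraction)
        batch = hosts[deployed:target]
        # "Deploy" to this ring, then check health before expanding further.
        failures = [h for h in batch if not healthy(h)]
        if failures:
            print(f"Halting at {fraction:.0%} of fleet: {len(failures)} unhealthy host(s)")
            return False  # roll back and investigate before going any wider
        deployed = target
        print(f"Ring at {fraction:.0%} of fleet looks healthy ({deployed} hosts updated)")
    return True

staged_rollout([f"host-{n}" for n in range(1000)])
```

The key property is that an update never reaches the full fleet unless every earlier ring stays healthy, which is exactly the safeguard last week’s rollout lacked.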
Lessons Learned
Supply chain incidents like this have become troublingly prevalent in recent years. While the likes of SolarWinds and NotPetya stemmed from cyber-attacks, this incident highlights that devastating supply chain outages can also occur due to simple coding errors.
Despite these risks, research shows that over half of businesses don’t vet third-party vendors before onboarding them. In CrowdStrike’s case, even a thorough risk assessment might not have prevented the outage, but vetting suppliers is good practice and could stop a similar situation from unfolding in the future.
Moreover, with the proliferation of SaaS and AI applications, taking a methodical approach to supplier risk assessments is more important than ever. Just think: the number of software suppliers your company uses is probably in the tens, if not hundreds.
Each and every one of those suppliers could be the potential cause of an outage or data leak.
To discover more about securing the software supply chain, read our guidance here.