Using the CrowdStrike outage as a case study, this talk explores how quality is created, maintained, and lost across code, release practices, organisational decisions, and wider ecosystems.
On 19 July 2024, what first looked like isolated IT issues at airports quickly became a global outage affecting hospitals, banks, broadcasters, supermarkets, and more. The trigger was small: a single faulty content update. The impact was enormous: Microsoft estimated that 8.5 million Windows devices crashed.
In this talk, I use the CrowdStrike outage to explore how quality is won and lost in complex software systems.
First, I walk through what happened and why recovery was so difficult at scale. This was not just an application crash. Because the Falcon sensor runs at the kernel level, the faulty update crashed affected Windows machines during boot, before remote remediation tools could reach them. Each machine typically needed hands-on repair, such as booting into safe mode and deleting the offending channel file, which turned one bad update into a slow, manual repair effort.
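To make the failure mode concrete, here is a deliberately simplified, user-space C sketch, not CrowdStrike's actual code: the function and constant names are hypothetical, but the shape of the bug matches the published root-cause analysis, in which an interpreter read a 21st input field when only 20 had been supplied. In kernel mode an invalid read like this triggers a bug check (blue screen), and because the sensor loads at boot, the machine crashes again on every restart.

    /* A deliberately simplified user-space model of the reported defect.
     * An interpreter fetches input fields by index, trusting the content
     * data to stay within the number of fields the caller supplied. */
    #include <stdio.h>

    #define SUPPLIED_FIELDS 20   /* fields the sensor code actually passes in */

    const char *get_field(const char **fields, size_t count, size_t index)
    {
        (void)count;             /* the missing guard: count is never checked */
        return fields[index];    /* out-of-bounds read when index >= count */
    }

    int main(void)
    {
        const char *fields[SUPPLIED_FIELDS] = { "field0" };

        /* Content references a 21st field (index 20): one past the end.
         * In user space this is undefined behaviour that may print garbage;
         * in a kernel driver the same read bug-checks the whole machine. */
        const char *past_end = get_field(fields, SUPPLIED_FIELDS, SUPPLIED_FIELDS);
        printf("read past end of array: %p\n", (const void *)past_end);
        return 0;
    }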
From there, I examine the incident through two lenses.
The micro view looks at the engineering and release decisions that allowed a simple defect to escape: a new template type declared 21 input fields, the code that invoked it supplied only 20, and the content validator missed the mismatch. I cover validation choices, limited negative and regression testing, and release practices that increased the blast radius.
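As a hedged sketch of the missing safeguards (reusing the hypothetical names above), the guarded variant below rejects out-of-range field references instead of dereferencing them, and a simple negative test feeds it deliberately malformed input, the kind of test that would have caught the defect before release.

    #include <assert.h>
    #include <stdio.h>

    /* Guarded variant: reject out-of-range field references instead of
     * dereferencing them, so the caller can fail safe (skip the rule)
     * rather than crash the host. */
    const char *get_field_checked(const char **fields, size_t count, size_t index)
    {
        if (index >= count)
            return NULL;
        return fields[index];
    }

    int main(void)
    {
        const char *fields[20] = { "field0" };  /* only 20 fields supplied */

        /* Positive case: an in-range reference resolves normally. */
        assert(get_field_checked(fields, 20, 0) != NULL);

        /* Negative case: malformed content asking for a 21st field
         * (index 20) must be rejected, not dereferenced. */
        assert(get_field_checked(fields, 20, 20) == NULL);

        puts("negative test passed: out-of-range reference rejected");
        return 0;
    }

The fail-safe return is the design point: a sensor that cannot evaluate a piece of content should skip it and report the anomaly, not take the machine down.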
The macro view looks at why the consequences were so widespread, including market concentration, ecosystem dependencies, limited customer control over update rollout, operational readiness, and the limits of single root-cause thinking.
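One concrete mitigation the macro view points to is staged, ring-based rollout, a practice CrowdStrike itself committed to for rapid-response content after the incident. The sketch below is hypothetical (the ring names, sizes, and health check are invented): an update is promoted ring by ring only while telemetry from the previous ring stays healthy, so a defect halts at the canary instead of reaching every host at once.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct {
        const char *name;
        long hosts;              /* machines in this ring */
    } Ring;

    /* Stub: a real implementation would inspect crash and health telemetry
     * from hosts that already received the update. */
    static bool ring_healthy(const Ring *r)
    {
        (void)r;
        return true;
    }

    int main(void)
    {
        const Ring rings[] = {
            { "internal canary", 100 },
            { "early adopters",  10000 },
            { "general release", 1000000 },
        };

        for (size_t i = 0; i < sizeof rings / sizeof rings[0]; i++) {
            printf("deploying to %s (%ld hosts)\n", rings[i].name, rings[i].hosts);
            if (!ring_healthy(&rings[i])) {
                printf("halting rollout at %s\n", rings[i].name);
                return 1;    /* the defect never reaches the widest ring */
            }
        }
        puts("rollout complete");
        return 0;
    }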
Attendees will leave with practical ideas for safer change, stronger resilience, and a clearer understanding of how quality engineering helps teams improve the system in which quality emerges.