Friday was a mess for just about anyone who relies on Windows - whether they knew it or not. As Friday afternoon rolled along, the little bit of social media I monitor was filled with ideas about what happened - including wild conspiracy theories and a whole lot of people with not enough information insisting that they knew exactly how they would have fixed the issue.
The briefest explanation is that Gary (my made-up name for the person who introduced the flaw) made a mistake.
It’s a Testing Issue
I found it a little cringe-inducing how many folks from the testing community blamed lack of testing or lack of test teams as the root cause. More than one person volunteered to work with CrowdStrike to help them improve their testing. I don’t know CrowdStrike’s engineering approach, but it’s ludicrous to hypothesize that they didn’t test the change. I have as many details as the rest of you who don’t work at CS, but it’s more likely that they tested the wrong binary, tested in an environment where the flawed binary would still work, or that the test system itself failed in some way.
Of course, if they had run the right test in the right way, they would have found this before they destroyed the world, but I am pretty certain that this issue is mired in at least some complexity.
It’s a Deployment Issue
There were a few others who claimed it was a deployment issue - that they should have released the change to smaller audiences, collected feedback, and only expanded worldwide once they knew the change was rock solid. This is absolutely good advice in general, but the offending (offensive?) error was in a file type that’s used to respond to active vulnerabilities - so there’s a balance between canary deployments and getting changes out before those vulnerabilities are exploited. Since this is CS’s business, I find it naive to think that they somehow have never thought of this before.
It’s a Knowledge Issue
Most people are familiar with the Dunning-Kruger effect. It’s a cognitive bias where people overestimate their knowledge or competence in a particular area because they lack the information needed to accurately assess their own abilities. Highly related to Dunning-Kruger is the concept of illusory superiority. I probably first read about this concept in Thinking, Fast and Slow, but it’s mentioned in a lot of books on my bookshelf (The Invisible Gorilla, Predictably Irrational, Why We Make Mistakes, etc.).
Illusory superiority bias means - briefly - that even though you know a lot, you don’t know as much as you think you do. One of the most common examples is this study that shows that 80% of drivers think they are above average (a mathematical impossibility). Illusory superiority occurs everywhere - in academics, parenting, general knowledge, and even/especially in workplace competence.
But probably most often, illusory superiority occurs in internet forums and social media where everyone, apparently, is an expert.
It’s a Human Issue
It’s always a people problem.
- Jerry Weinberg
My take on the CrowdStrike issue (potentially also a victim of Dunning-Kruger or illusory superiority) is that a human made a mistake. Many of the statements and memes over the last 48 hours have speculated about why Gary checked in shitty code that broke the internet, and when Gary is going to get fired.
Gary should not get fired.
Humans make mistakes. One of my most dog-eared books is Human Error by James Reason - an analysis and categorization of the types of human error and why they occur that has been a basis for a lot of my own learning. I’ve also spent a good chunk of my career investigating and discussing software errors, their impact, and their causes. I’m a firm believer that every mistake is an opportunity to learn - whether it’s a build break or an outage that crashes several million computers, it happened for a reason.
When I’ve conducted analysis into high-impact issues, one of the points I constantly bring up is that, at the time the mistake was made, Gary felt safe in doing what he did. Why did he feel safe? The problem isn’t that Gary is stupid or malicious - the system allowed Gary to feel safe in his change. Again - I don’t know the details, but it’s safe to assume - given that CS has done hundreds or thousands of these changes with few errors - that there’s a system in place (test beds, deployments, static analysis, etc.) that checked the boxes for rollout.
People make mistakes when the systems they rely on for safety fail. Blaming a human for a system going down is almost never the right thing to do.
The system failed Gary.
Fix the System
This was a pretty expensive learning opportunity for Gary (but he will have a great story to tell). For the rest of us, it’s important to remember that humans make mistakes, and those mistakes provide us opportunities to learn and to fix the systems that enabled those mistakes to occur.
Continuous Integration pipelines exist so that developers can get quick feedback. When CI “misses” something, we can add additional testing or static analysis to “catch” that issue in the future. Canary deployments solve the problem of exposing too large an audience to risk. Automation solves the problem of a human forgetting steps or details.
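To make the canary idea concrete, here’s a minimal sketch of a ring-based rollout gate. This is not CrowdStrike’s pipeline - the cohort sizes, soak time, failure threshold, and the health_check, deploy_to_fraction, and rollback functions are all hypothetical placeholders. The point is only that each ring has to look healthy before the next ring receives the change.

```python
# Hypothetical sketch of a staged (canary) rollout gate.
# Nothing here reflects CrowdStrike's actual systems; all names and
# thresholds are made up for illustration.

import time

ROLLOUT_RINGS = [0.001, 0.01, 0.10, 1.00]   # fraction of the fleet per stage
MAX_FAILURE_RATE = 0.001                    # abort if more than 0.1% of hosts report problems
SOAK_SECONDS = 15 * 60                      # how long to watch each ring before expanding


def deploy_to_fraction(change_id: str, fraction: float) -> None:
    """Placeholder for whatever actually pushes the change to a subset of hosts."""
    print(f"deploying {change_id} to {fraction:.1%} of hosts")


def rollback(change_id: str) -> None:
    """Placeholder for reverting the change everywhere it has landed."""
    print(f"rolling back {change_id}")


def health_check(ring_fraction: float) -> float:
    """Placeholder: return the observed failure rate (crash loops, hosts gone
    offline) for the ring. A real system would pull this from telemetry."""
    return 0.0


def rollout(change_id: str) -> bool:
    """Expand ring by ring, stopping before the blast radius grows if a ring looks unhealthy."""
    for ring in ROLLOUT_RINGS:
        deploy_to_fraction(change_id, ring)
        time.sleep(SOAK_SECONDS)             # let telemetry accumulate
        if health_check(ring) > MAX_FAILURE_RATE:
            rollback(change_id)
            return False
    return True
```

The tension described above lives mostly in SOAK_SECONDS: shrink it and a protection update ships faster, stretch it and the canary actually has time to fail before the whole world gets the change.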
Software needs humans - but humans need systems.
Fix the system.
It Will Be OK
A huge #hugops to everyone who has been dealing with the aftermath of the CS issue, but my heart goes out to Gary. It was probably a shitty weekend for you, but it’s going to be ok.
-A
++
I avoided the trap of blaming CS’s lack of testing or even Gary himself. I do wonder what their rollout strategy is (or was) and as you pointed out, what systems failed that allowed this change to become problematic in the wild.
In the end I hope the postmortem is publicly available so we can all learn a bit more.