That one, then go up the chain of command.
It's a systemic, multi-layered problem.
The simplest, lowest-effort thing that could have prevented issues at this scale is not installing updates automatically, but waiting four days and triggering them afterwards if no issues have surfaced.
Automatically forwarding updates also forwards risk. The larger the impact area, the more worthwhile safeguards become.
Testing/staging or partial, successive rollouts could also have mitigated a large share of the issues, but they require more investment.
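A minimal sketch of what that hold-and-verify default could look like, assuming a hypothetical `known_bad_ids` feed (vendor advisories, your own canary fleet, or an incident tracker) — this is not CrowdStrike's API, just the idea:

```python
import datetime as dt

HOLD_PERIOD = dt.timedelta(days=4)  # grace period before auto-applying an update

def should_apply(update_id: str,
                 released_at: dt.datetime,
                 known_bad_ids: set[str],
                 now: dt.datetime | None = None) -> bool:
    """Apply an update only once it has aged past the hold period
    and no issue reports have been filed against it."""
    now = now or dt.datetime.now(dt.timezone.utc)
    aged_enough = now - released_at >= HOLD_PERIOD
    return aged_enough and update_id not in known_bad_ids
```

The design point is that the default becomes hold rather than install. The obvious tension, raised in the reply below, is that holding security content back also extends your exposure window.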
The update that crashed things was an anti-malware definitions update. CrowdStrike offers no way to delay or stage those (they are downloaded automatically as soon as they are available), and there's a good reason for not wanting to delay definition updates: it leaves you vulnerable to known malware for longer.
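One way to reconcile the two comments above, sketched as an assumption rather than anything CrowdStrike actually offers: tier the hold period by update type, so definition updates get only a short canary soak while kernel-level code waits out the full hold.

```python
import datetime as dt

# Hypothetical hold periods per update type; the trade-off is the
# malware-exposure window vs. the blast radius of a bad update.
HOLD_BY_TYPE = {
    "definitions": dt.timedelta(hours=1),  # short canary soak only
    "agent_code": dt.timedelta(days=4),    # full hold for kernel-level code
}

def hold_period(update_type: str) -> dt.timedelta:
    # Unknown update types get the most conservative treatment.
    return HOLD_BY_TYPE.get(update_type, max(HOLD_BY_TYPE.values()))
```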
CrowdStrike ToS, section 8.6 Disclaimer
[…] THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES, COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY DAMAGE. […]
It's about safety, but it's truly ironic how it mentions aircraft-related systems twice, as well as communication systems (very broad).
It certainly doesn't inspire confidence in the overall stability. But it's also generic ToS-speak, and may only seem noteworthy now, after the fact.
Yesterday I was browsing /r/programming
:tabclose
We don't blame the leopards who ate the guy's face. We blame the guy who stuck his face near the leopards.
But how do you identify a leopard when you don't know about animals and it's wearing a shiny mask?
Sure, it's the dev who is to blame, and not the clueless managers who evaluate devs based on the number of commits/reviews per day, or the CEOs who think such managers are on top of their game.
Is that the case at CrowdStrike?
I don't have any information on that; this was more a criticism of where the world seems to be heading.
If a single person can make the system fail then the system has already failed.
If a single outsider can make your system fail, then it has already failed.
Now consider it in the context of supply-chain tuberculosis like npm.
Left-pad, right?
That is a lot of bile, even for a rant. Agreed that it's nonsensical to blame the dev, though. This is software; human error should not be enough to cause such massive damage. The real question is: what's wrong with the test suites? Did someone consciously decide the team would skimp on them?
As for blame, if we take the word of CrowdStrike's CEO, then there is no individual negligence or malice involved. Therefore it is the company's responsibility as a whole, plain and simple.
Therefore it is the company's responsibility as a whole.
The governance of the company as a whole is the CEO's responsibility. Thus a company-wide failure is 100% the CEO's fault.
If the CEO does not resign over this, the governance of the company will not change significantly, and it will happen again.
The real question is: what's wrong with the test suites?
This is what I'm asking myself too. If they had tested it, and they should have, this massive error would not have happened: a) controlled test suites and machines in their own labs, b) at least one test machine connected over the internet and acting like a customer, tested by a real human, c) updates rolled out in waves throughout the day. They can't tell me that they did all three of these steps. A sketch of the wave idea follows.
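On point (c), here's a hedged sketch of what rolling out in waves might look like; `fleet`, `apply_update`, and `healthy_fraction` are hypothetical stand-ins, not anything from CrowdStrike's tooling:

```python
import time

def rollout_in_waves(fleet, apply_update, healthy_fraction,
                     wave_fractions=(0.01, 0.10, 0.50, 1.0),
                     min_health=0.99, soak_seconds=3600):
    """Push an update to progressively larger slices of the fleet,
    halting if health drops below the threshold after any wave."""
    done = 0
    for fraction in wave_fractions:
        target = int(len(fleet) * fraction)
        for host in fleet[done:target]:
            apply_update(host)
        done = target
        time.sleep(soak_seconds)  # give problems time to surface
        if healthy_fraction(fleet[:done]) < min_health:
            raise RuntimeError(f"rollout halted after {done} hosts")
    return done
```

Even the first 1% wave here would likely have caught a driver that blue-screens every machine it touches, which is the commenter's point.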
Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.
That means spending time and money on developing such a system, which means increasing costs in the short term... which is kryptonite for current-day CEOs.
Hey, why not just ask Dave Plummer, former Windows developer...
https://youtube.com/watch?v=wAzEJxOo1ts
When anywhere from 8.5 million to over a billion systems went down (the numbers I've read so far vary significantly), that's still way too much failure for a single borked update to a kernel-level driver, one not even made by Microsoft.
I hope this incident shines more light on the greedy rich CEOs and the corners they cut, the taxes they owe, the underpaid employees and understaffed facilities, and now probably some hefty fines... just a slap on the wrist, of course.
Note: Dmitry Kudryavtsev is the article's author, and he argues that the real blame should go to the CrowdStrike CEO and other higher-ups.
Edited the title to have a "by" in front to make that a bit clearer.