That one, then go up the chain of command.
It's a systemic, multi-layered problem.
The simplest, lowest-effort thing that could have prevented issues at this scale is not installing updates automatically, but waiting four days and triggering them afterwards if no issues have surfaced.
Automatically forwarding updates also forwards risk. The larger the impact area, the more worthwhile safeguards become.
Testing/staging or partial, successive rollouts could also have mitigated a large share of the issues, but they require more investment.
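A minimal sketch of what that hold-and-verify default could look like, assuming a hypothetical `known_bad_ids` feed (vendor advisories, your own canary fleet, or an incident tracker) — this is not CrowdStrike's API, just the idea:

```python
import datetime as dt

HOLD_PERIOD = dt.timedelta(days=4)  # grace period before auto-applying an update

def should_apply(update_id: str,
                 released_at: dt.datetime,
                 known_bad_ids: set[str],
                 now: dt.datetime | None = None) -> bool:
    """Apply an update only once it has aged past the hold period
    and no issue reports have been filed against it."""
    now = now or dt.datetime.now(dt.timezone.utc)
    aged_enough = now - released_at >= HOLD_PERIOD
    return aged_enough and update_id not in known_bad_ids
```

The design point is that the default becomes hold rather than install. The obvious tension, raised in the reply below, is that holding security content back also extends your exposure window.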
The update that crashed things was an anti-malware definitions update. CrowdStrike offers no way to delay or stage those (they are downloaded automatically as soon as they are available), and there's a good reason for not wanting to delay definition updates: it leaves you vulnerable to known malware for longer.
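One way to reconcile the two comments above, sketched as an assumption rather than anything CrowdStrike actually offers: tier the hold period by update type, so definition updates get only a short canary soak while kernel-level code waits out the full hold.

```python
import datetime as dt

# Hypothetical hold periods per update type; the trade-off is the
# malware-exposure window vs. the blast radius of a bad update.
HOLD_BY_TYPE = {
    "definitions": dt.timedelta(hours=1),  # short canary soak only
    "agent_code": dt.timedelta(days=4),    # full hold for kernel-level code
}

def hold_period(update_type: str) -> dt.timedelta:
    # Unknown update types get the most conservative treatment.
    return HOLD_BY_TYPE.get(update_type, max(HOLD_BY_TYPE.values()))
```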
CrowdStrike ToS, section 8.6 Disclaimer
[…] THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES, COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY DAMAGE. […]
It's about safety, but it's truly ironic how it mentions aircraft-related systems twice, as well as communication systems (very broad).
It certainly doesn't inspire confidence in the overall stability. But it's also generic ToS-speak, and may only seem noteworthy now, after the fact.
Yesterday I was browsing /r/programming
:tabclose
We don't blame the leopards who ate the guy's face. We blame the guy who stuck his face near the leopards.
But how do you identify a leopard when you don't know about animals and it's wearing a shiny mask?
Sure, it's the dev who is to blame, and not the clueless managers who evaluate devs based on the number of commits/reviews per day, or the CEOs who think such managers are on top of their game.
Is that the case at CrowdStrike?
I don't have any information on that; this was more a criticism of where the world seems to be heading.
If a single person can make the system fail then the system has already failed.
If a single outsider can make your system fail, then it has already failed.
Now consider it in the context of supply-chain tuberculosis like npm.
Left-pad, right?
That is a lot of bile, even for a rant. Agreed that it's nonsensical to blame the dev, though. This is software; human error should not be enough to cause such massive damage. The real question is: what's wrong with the test suites? Did someone consciously decide the team would skimp on them?
As for blame, if we take the word of CrowdStrike's CEO, then there is no individual negligence or malice involved. Therefore it is the company's responsibility as a whole, plain and simple.
Therefore it is the company's responsibility as a whole.
The governance of the company as a whole is the CEO's responsibility. Thus a company-wide failure is 100% the CEO's fault.
If the CEO does not resign over this, the governance of the company will not change significantly, and it will happen again.
The real question is: what's wrong with the test suites?
This is what I'm asking myself too. If they had tested it, and they should have, this massive error would not have happened: a) controlled test suites and machines in their own labs, b) at least one test machine connected over the internet and acting like a customer, tested by a real human, c) updates rolled out in waves throughout the day. They can't tell me that they did all three of these steps. A sketch of the wave idea follows.
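On point (c), here's a hedged sketch of what rolling out in waves might look like; `fleet`, `apply_update`, and `healthy_fraction` are hypothetical stand-ins, not anything from CrowdStrike's tooling:

```python
import time

def rollout_in_waves(fleet, apply_update, healthy_fraction,
                     wave_fractions=(0.01, 0.10, 0.50, 1.0),
                     min_health=0.99, soak_seconds=3600):
    """Push an update to progressively larger slices of the fleet,
    halting if health drops below the threshold after any wave."""
    done = 0
    for fraction in wave_fractions:
        target = int(len(fleet) * fraction)
        for host in fleet[done:target]:
            apply_update(host)
        done = target
        time.sleep(soak_seconds)  # give problems time to surface
        if healthy_fraction(fleet[:done]) < min_health:
            raise RuntimeError(f"rollout halted after {done} hosts")
    return done
```

Even the first 1% wave here would likely have caught a driver that blue-screens every machine it touches, which is the commenter's point.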
Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.
That means spending time and money on developing such a system, which means increasing costs in the short term... which is kryptonite for current-day CEOs.
Hey, why not just ask Dave Plummer, former Windows developer...
https://youtube.com/watch?v=wAzEJxOo1ts
When anywhere from 8.5 million to over a billion systems went down (the numbers I've read so far vary significantly), that's still way too much failure for a single borked update to a kernel-level driver, one not even made by Microsoft.
I hope this incident shines more light on the greedy rich CEOs and the corners they cut, the taxes they owe, the underpaid employees and understaffed facilities, and now probably some hefty fines... just a slap on the wrist, of course.
Note: Dmitry Kudryavtsev is the article's author, and he argues that the real blame should go to the CrowdStrike CEO and other higher-ups.
Edited the title to have a "by" in front to make that a bit clearer.