this post was submitted on 22 Jul 2024
254 points (95.0% liked)

(page 2) 29 comments
[–] [email protected] 6 points 3 months ago* (last edited 3 months ago)

That one, then go up the chain of command.

[–] [email protected] 24 points 3 months ago* (last edited 3 months ago) (4 children)

It's a systematic multi-layered problem.

The simplest, least-effort thing that could have prevented issues at this scale is not installing updates automatically, but waiting four days and triggering the install afterwards if no issues have surfaced.

Automatically forwarding updates is also forwarding risk. The higher the impact area, the more worthwhile safeguards are.

Testing/staging or partial successive rollouts could also have mitigated a large number of issues, but they require more investment.
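
To make the "wait four days" idea concrete, here is a minimal sketch of a hold-period gate. Everything in it is an assumption for illustration: the `Update` record, the `known_issues` counter and the `should_install` helper are hypothetical names, not part of any CrowdStrike product or API, and whether a given vendor lets you delay updates at all is a separate question.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hold period taken from the comment above: wait four days before installing.
HOLD_PERIOD = timedelta(days=4)


@dataclass
class Update:
    """Hypothetical record of a vendor update (illustrative only)."""
    version: str
    released_at: datetime
    known_issues: int = 0  # assumed count of issue reports since release


def should_install(update: Update, now: datetime,
                   hold: timedelta = HOLD_PERIOD) -> bool:
    """Return True only if the hold period has elapsed with no reported issues."""
    aged_enough = now - update.released_at >= hold
    return aged_enough and update.known_issues == 0
```

The trade-off is the one raised in the reply below: every day an update is held back is a day without whatever it fixes or detects.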

[–] [email protected] 11 points 3 months ago (2 children)

The update that crashed things was an anti-malware definitions update. CrowdStrike offers no way to delay or stage them (they are downloaded automatically as soon as they are available), and there's good reason not to delay definition updates: doing so leaves you vulnerable to known malware for longer.

[–] [email protected] 52 points 3 months ago (6 children)

CrowdStrike ToS, section 8.6 Disclaimer

[…] THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES, COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY DAMAGE. […]

It's about safety, but it's truly ironic that it mentions aircraft-related uses twice, as well as communication systems (very broad).

It certainly doesn't inspire confidence in the overall stability. But it's also general ToS-speak, and may only be noteworthy now, after the fact.

[–] [email protected] 23 points 3 months ago

Yesterday I was browsing /r/programming

:tabclose

[–] [email protected] 11 points 3 months ago (1 children)

We don't blame the leopards who ate the guy's face. We blame the guy who stuck his face near the leopards.

[–] [email protected] 2 points 3 months ago

But how do you identify a leopard when you don't know about animals and it's wearing a shiny mask?

[–] [email protected] 52 points 3 months ago (1 children)

sure it is the dev who is to blame and not the clueless managers who evaluate devs based on number of commits/reviews per day and CEOs who think such managers are on top of their game.

[–] [email protected] 7 points 3 months ago (1 children)

Is that the case at CrowdStrike?

[–] [email protected] 12 points 3 months ago (1 children)

I don't have any information on that; this was more a criticism of where the world seems to be heading.

[–] [email protected] 140 points 3 months ago (1 children)

If a single person can make the system fail then the system has already failed.

[–] [email protected] 21 points 3 months ago (1 children)

If a single outsider can make your system fail then it's already failed.

Now consider in the context of supply-chain tuberculosis like npm.

[–] [email protected] 11 points 3 months ago

Left-pad right?

[–] [email protected] 14 points 3 months ago (3 children)

That is a lot of bile even for a rant. Agreed that it's nonsensical to blame the dev though. This is software; human error should not be enough to cause such massive damage. The real question is: what's wrong with the test suites? Did someone consciously decide the team would skimp on them?

As for blame, if we take the word of CrowdStrike's CEO then there is no individual negligence or malice involved. Therefore it is the company's responsibility as a whole, plain and simple.

[–] [email protected] 9 points 3 months ago

Therefore it is the company's responsibility as a whole.

The governance of the company as a whole is the CEO's responsibility. Thus a company-wide failure is 100% the CEO's fault.

If the CEO does not resign over this, the governance of the company will not change significantly, and it will happen again.

[–] [email protected] 10 points 3 months ago

Real question is: what’s wrong with the test suites?

This is what I'm asking myself too. If they had tested it, and they should have, then this massive error would not have happened: a) controlled test suites and machines in their labs, b) at least one test machine connected through the internet and acting like a customer, tested by a real human, c) updates in waves throughout the day. They can't tell me that they did all 3 of these steps. -meme

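
For step c), a rough sketch of how hosts could be split into update waves. This is a generic technique, not a description of how CrowdStrike ships updates; `wave_for`, `rollout_plan` and the host IDs are made-up names, and the health check that would gate each later wave is left out.

```python
import hashlib


def wave_for(host_id: str, waves: int = 4) -> int:
    """Map a host to a stable wave number in [0, waves) by hashing its ID."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % waves


def rollout_plan(hosts: list[str], waves: int = 4) -> list[list[str]]:
    """Group hosts into waves; wave 0 would be updated first."""
    plan: list[list[str]] = [[] for _ in range(waves)]
    for host in hosts:
        plan[wave_for(host, waves)].append(host)
    return plan
```

Hashing the host ID keeps each machine in the same wave from one update to the next, so the first wave acts as a consistent early-warning group.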
[–] [email protected] 134 points 3 months ago (12 children)

Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.

[–] [email protected] 18 points 3 months ago (1 children)

That means spending time and money on developing such a system, which means increasing costs in the short term, which is kryptonite for current-day CEOs.

[–] [email protected] 19 points 3 months ago* (last edited 3 months ago) (1 children)

Hey, why not just ask Dave Plummer, former Windows developer...

https://youtube.com/watch?v=wAzEJxOo1ts

Anywhere from 8.5 million to over a billion systems went down (the numbers I've read so far vary significantly). Either way, that's far too much failure for a simple borked update to a kernel-level driver, one not even made by Microsoft.

[–] [email protected] 40 points 3 months ago* (last edited 3 months ago)

I hope this incident shines more light on the greedy rich CEOs and the corners they cut, the taxes they owe, the underpaid employees and understaffed facilities, and now probably some hefty fines, as just a slap on the wrist of course..

[–] [email protected] 82 points 3 months ago (1 children)

Note: Dmitry Kudryavtsev is the article author, and he argues that the real blame should go to the CrowdStrike CEO and other higher-ups.

[–] [email protected] 23 points 3 months ago

Edited the title to put a "by" in front, to make that a bit clearer.
