this post was submitted on 21 Jul 2024

196 points (77.4% liked)

CrowdStrike Isn't the Real Problem (lemmy.world)

submitted 2 months ago by [email protected] to c/[email protected]

This is an unpopular opinion, and I get why – people crave a scapegoat. CrowdStrike undeniably pushed a faulty update demanding a low-level fix (booting into recovery). However, this incident lays bare the fragility of corporate IT, particularly for companies entrusted with vast amounts of sensitive personal information.

Robust disaster recovery plans, including automated processes to remotely reboot and remediate thousands of machines, aren't revolutionary. They're basic hygiene, especially when considering the potential consequences of a breach. Yet, this incident highlights a systemic failure across many organizations. While CrowdStrike erred, the real culprit is a culture of shortcuts and misplaced priorities within corporate IT.

Too often, companies throw millions at vendor contracts, lured by flashy promises and neglecting the due diligence necessary to ensure those solutions truly fit their needs. This is exacerbated by a corporate culture where CEOs, vice presidents, and managers are often more easily swayed by vendor kickbacks, gifts, and lavish trips than by investing in innovative ideas with measurable outcomes.

This misguided approach not only results in bloated IT budgets but also leaves companies vulnerable to precisely the kind of disruptions caused by the CrowdStrike incident. When decision-makers prioritize personal gain over the long-term health and security of their IT infrastructure, it's ultimately the customers and their data that suffer.

[–] [email protected] 8 points 2 months ago

This doesn't seem to be a problem with disaster recovery plans. It is perfectly reasonable for disaster recovery to take several hours, or even days. As far as DR goes, this was easy. It did not generally require rebuilding systems from backups.

In a sane world, no single party would even have the technical capability of causing a global disaster like this. But executives have been tripping over themselves for the past decade to outsource all their shit to centralized third parties so they can lay off expensive IT staff. They have no control over their infrastructure, their data, or, by extension, their business.

[–] [email protected] 5 points 2 months ago (2 children)

Is there a way to remotely boot into network activated recovery mode? Genuine question, I never looked into it.

[–] [email protected] 5 points 2 months ago (1 children)

For physical servers there are out-of-band management systems like Dell DRAC that allow you to manage the server even when the OS is broken or nonexistent.

For clients there are systems like Intel vPro (which includes AMT) and AMD's DASH-based equivalent. I have not used either of them but they apparently work similarly to the systems used on servers.
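As a rough sketch of what talking to such a management controller can look like: many server BMCs expose the DMTF Redfish REST API, which defines a `ComputerSystem.Reset` action. The host name, system ID, and credentials below are placeholders, and the function only builds the request without sending it. A plain reset does not by itself get a machine into recovery mode; that would need additional vendor-specific or boot-override steps not shown here.

```python
import base64
import json
import urllib.request

# A few common Redfish ResetType values; a real BMC advertises which
# ones it actually supports, so treat this set as an assumption.
RESET_TYPES = {"On", "ForceOff", "GracefulShutdown", "GracefulRestart", "ForceRestart"}


def build_reset_request(bmc_host, system_id, username, password,
                        reset_type="ForceRestart"):
    """Build (but do not send) a Redfish reset request for one server."""
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unsupported reset type: {reset_type}")
    url = (f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
           "/Actions/ComputerSystem.Reset")
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    # HTTP Basic auth; Redfish also offers session-based login.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`, looped over an inventory of BMC addresses, which is roughly what "remotely reboot thousands of machines" comes down to for servers.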

[–] [email protected] 1 points 2 months ago

Ah neat, I'll look those up. Thanks a lot!

[–] [email protected] 4 points 2 months ago (1 children)

An expensive KVM card, or PiKVM for the home server.

[–] [email protected] 3 points 2 months ago (1 children)

At least for virtual servers, there has to be a cheaper software equivalent, as my cheap VPS allows this (via VNC) with no issues.

[–] [email protected] 2 points 2 months ago* (last edited 2 months ago) (1 children)

Virtual servers (as opposed to hardware workstations or servers) will usually have their "KVM" (Keyboard Video Mouse) built in to the hypervisor control plane. ESXi, Proxmox (KVM - Kernel Virtual Machine), XCP-ng/Citrix XenServer (Xen), Nutanix (KVM-like), and many others all provide access to this. It all comes down to what's configured on the hypervisor OS.

VMs are easy because the video and control feeds are software constructs, so you can just hook into what's already there. Hardware (especially workstations) is harder because you don't always have a chip on the motherboard that can tap that data. Servers usually have a dedicated co-computer soldered onto the motherboard to do this, but if there's nothing nailed down to do it, your remote access is limited to what you can plug in. PiKVM is one such plug-in option.

[–] [email protected] 2 points 2 months ago

Thank you for the explanation, I really appreciate it. Bystanders probably will too :)

[–] [email protected] 44 points 2 months ago (2 children)

Bloated IT budgets?

Where do you work, and are they hiring?

[–] [email protected] 18 points 2 months ago

The bloat isn't for workers, otherwise there'd be enough people to go reboot the machines and fix the issue manually in a reasonable amount of time. It's only for executives, managers, and contracts with kickbacks. In fact usually they buy software because it promises to cut the need for people and becomes an excuse for laying off or eliminating new hire positions.

[–] [email protected] 14 points 2 months ago

As the post was stating, they get bloated by relying on vendors rather than in-house IT/Security.

My grandfather works IT for my state government tho and it's a pretty good gig according to him

[–] [email protected] 1 points 2 months ago

Well said!

[–] [email protected] 18 points 2 months ago (1 children)

I think it's most likely a little of both. The fact that most systems failed at around the same time suggests that this was the default automatic upgrade/deployment option.

So, for sure the default option should have had upgrades staggered within an organisation. But at the same time organisations should have been ensuring they aren't upgrading everything at once.

As it is, the way the upgrade was deployed made the software a single point of failure that completely negated redundancies and in many cases hobbled disaster recovery plans.
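A minimal sketch of what staggering within an organisation could look like, assuming a plain list of host names and made-up ring percentages. This is not how CrowdStrike or any particular deployment tool works; it only illustrates the idea of canary rings that gate the rest of the fleet.

```python
import hashlib


def assign_ring(hostname: str, ring_weights: list[int]) -> int:
    """Deterministically map a host to a rollout ring.

    ring_weights are percentages that sum to 100, e.g. [5, 25, 70]:
    ring 0 (canaries) gets roughly 5% of hosts, ring 1 roughly 25%,
    and ring 2 the rest.
    """
    if sum(ring_weights) != 100:
        raise ValueError("ring weights must sum to 100")
    # Hash the hostname so the assignment is stable across runs.
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    upper = 0
    for ring, weight in enumerate(ring_weights):
        upper += weight
        if bucket < upper:
            return ring
    return len(ring_weights) - 1  # not reached when weights sum to 100


def hosts_allowed(hosts: list[str], ring_weights: list[int],
                  healthy_rings: int) -> list[str]:
    """Return the hosts that may receive the update right now.

    healthy_rings is the number of earlier rings already confirmed
    healthy, so only rings 0..healthy_rings are eligible.
    """
    return [h for h in hosts if assign_ring(h, ring_weights) <= healthy_rings]
```

With `healthy_rings=0` only the canaries update; the number is bumped once they have run cleanly for a while, so a bad update stops at a small slice of machines.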

[–] [email protected] 25 points 2 months ago (2 children)

Speaking as someone who manages CrowdStrike in my company, we do stagger updates and turn off all the automatic things we can.

This channel file update wasn’t something we can turn off or control. It’s handled by CrowdStrike themselves, and we confirmed that in discussions with our TAM and account manager at CrowdStrike while we were working on remediation.

[–] [email protected] 4 points 2 months ago

There was a "hack" mentioned in another thread - you can block it via firewall and then selectively open it.

[–] [email protected] 6 points 2 months ago (2 children)

That's interesting. We use CrowdStrike, but I'm not in IT so I don't know about the configuration. Is a channel file somehow similar to AV definitions? That would make sense, and I guess it means this was a bug in the CrowdStrike code that parses the file?

[–] [email protected] 8 points 2 months ago (1 children)

Yes, CrowdStrike says they don’t need to do conventional AV definitions updates, but the channel file updates sure seem similar to me.

The file they pushed out consisted of all zeroes, which somehow corrupted their agent and caused the BSOD. I wasn’t on the meeting where they explained how this happened to my company; I was one of the people woken up to deal with the initial issue, and they explained this later to the rest of my team and our leadership while I was catching up on missed sleep.

I would have expected their agent to ignore invalid updates, which would have prevented this whole thing, but this isn’t the first time I’ve seen examples of bad QA and/or their engineering making assumptions about how things will work. For the amount of money they charge, their product is frustratingly incomplete. And asking them to fix things results in them asking you to submit your request to their Ideas Portal, so the entire world can vote on whether it’s a good idea, and if enough people vote for it they will “consider” doing it. My company spends a fortune on their tool every year, and we haven’t been able to even get them to allow non-case-sensitive searching, or searching for a list of hosts instead of individuals.
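To make "ignore invalid updates" concrete, here is a rough sketch. The file layout is entirely made up for illustration (a hypothetical 4-byte magic value `CHNL`, then a SHA-256 digest of the payload, then the payload); the real channel file format is not described anywhere in this thread.

```python
import hashlib

MAGIC = b"CHNL"   # hypothetical file signature, not a real format
DIGEST_LEN = 32   # SHA-256 digest size in bytes


def is_valid_update(blob: bytes) -> bool:
    """Return True only if the blob passes basic integrity checks."""
    header_len = len(MAGIC) + DIGEST_LEN
    if len(blob) <= header_len:
        return False  # too short to hold a header plus any payload
    if not any(blob):
        return False  # every byte is zero
    if blob[:len(MAGIC)] != MAGIC:
        return False  # wrong or missing signature
    expected = blob[len(MAGIC):header_len]
    payload = blob[header_len:]
    return hashlib.sha256(payload).digest() == expected


def apply_update(blob: bytes, current: bytes) -> bytes:
    """Keep the current definitions unless the new blob validates."""
    return blob if is_valid_update(blob) else current
```

The point is the fallback in `apply_update`: a file that fails validation leaves the previous known-good content in place instead of being loaded.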

[–] [email protected] 3 points 2 months ago

Thanks. That explains a lot of what I didn't think was right regarding the almost simultaneous failures.

I don't write kernel code at all for a living. But I do understand the rationale behind it, and it seems to me this doesn't fit that expectation. Now, it's a lot of hypothetical. But if I were writing this software, any processing of these files would happen in userspace. That way, bad or badly formatted data would be rejected there, and even if it managed to crash the code processing it, that would just be an app crash.

The general rule I've always heard is that you want to keep the minimum required work in the kernel code. So I think processing/rejection should have been happening in userspace (and perhaps even using code written in a higher level language with better memory protections etc) and then a parsed and validated set of data would be passed to the kernel code for actioning.

But, I admit I'm observing from the outside, and it could be nothing like this. But, on the face of it, it does seem to me like they were processing too much in the kernel code.
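As a loose analogy for that split, here is a sketch that parses untrusted input in a throwaway child process and only hands validated data back to the caller. The separate process stands in for userspace and the caller for the privileged component; the `name=value` format is invented for the example and has nothing to do with the real files.

```python
import json
import subprocess
import sys
from typing import Optional

# Code run in the throwaway child process. The "format" is hypothetical:
# one rule per line, written as name=value with an integer value.
PARSER_SOURCE = r"""
import json, sys
rules = {}
for line in sys.stdin.read().splitlines():
    name, value = line.split("=")   # raises on malformed input
    rules[name] = int(value)
json.dump(rules, sys.stdout)
"""


def parse_isolated(raw: str, timeout: float = 5.0) -> Optional[dict]:
    """Parse untrusted data in a child process.

    Returns the validated rules, or None if the parser crashed, hung,
    or produced output that is not the expected shape.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", PARSER_SOURCE],
            input=raw,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None  # the parser died; the caller is unaffected
    try:
        rules = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None
    if not isinstance(rules, dict) or not all(
        isinstance(k, str) and isinstance(v, int) for k, v in rules.items()
    ):
        return None
    return rules
```

A crash in the child shows up as a non-zero exit code rather than taking the caller down with it, which is the property being argued for: the worst case is an app crash.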

[–] [email protected] 2 points 2 months ago

Yes to all of that.

[–] [email protected] 20 points 2 months ago (3 children)

Issue is definitely corporate greed outsourcing issues to a mega monolith IT company.

Most IT departments are idiots now. Even 15 years ago, those were the smartest nerds in most buildings. They had to know how to do it all. Now it's just installing the corporate overlord software and the bullshit spyware. When something goes wrong, you call the vendor's support line. That's not IT, you've just outsourced all your brains to a monolith that can go at any time.

None of my servers running windows went down. None of my infrastructure. None of the infrastructure I manage as side hustles.

[–] [email protected] 6 points 2 months ago* (last edited 2 months ago)

I've seen the same thing. IT departments are less and less interested in building and maintaining in-house solutions.

I get why, it requires more time, effort, money, and experienced staff to pay.

But you gain more robust systems when it's done well. Companies want to cut costs everywhere they can, and it's cheaper to just pay an outside company to do X, Y & Z for you and hire an MSP to manage your web portals for it, or maybe 2-3 internal sysadmins who are expected to do all that plus level 1 help desk support.

Same thing has happened with end users. We spent so much time trying to make computers "friendly" to people, that we actually just made people computer illiterate.

I find myself in a strange place where I am having to help Boomers, older Gen-X, and Gen-Z with incredibly basic computer functions.

Things like:

Changing their passwords when the policy requires it.
Showing people where the Start menu is and how to search for programs there.
How to pin a shortcut to their task bar.
How to snap windows to half the screen.
How to un-mute their volume.
How to change their audio device in Teams or Zoom from their speakers to their headphones.
How to log out of their account and log back in.
How to move files between folders.
How to download attachments from emails.
How to attach files in an email.
How to create and organize Browser shortcuts.
How to open a hyperlink in a document.
How to play an audio or video file in an email.
How to expand a basic folder structure in a file tree.
How to press buttons on their desk phone to hear voicemails.

It's like only older Millennials and younger gen-X seem to have a general understanding of basic computer usage.

Much of this stuff has been the same for literally 30+ years. The Start menu, folders, voicemail, email, hyperlinks, browser bookmarks, etc. The coat of paint changes every 5-7 years, but almost all the principles are identical.

Can you imagine people not knowing how to put a car in drive, turn on the windshield wipers, or fill it with petrol, just because every 5-7 years the body style changes a little?

[–] [email protected] 7 points 2 months ago

Man, as someone who's been cross-discipline at my former companies, the way people treat IT, and the way companies consider IT an afterthought, is just insane. The technical debt is piled high.

[–] [email protected] 3 points 2 months ago

And you probably paid less to not have that happen as well!

[+] [email protected] -7 points 2 months ago (2 children)

C++ is the problem. C++ is an unsafe language that should definitely not be used for kernel space code in 2024.

[–] [email protected] 3 points 2 months ago (1 children)

Let's rewrite everything in Rust. That'll surely solve the world's problems.

[–] [email protected] -2 points 2 months ago

Thank you. Finally someone understands. Jokes aside though, I think we can acknowledge that C/C++ have caused decades of problems due to their lack of memory safety.

[–] [email protected] 4 points 2 months ago (1 children)

the virus definition is not written in c++. And even then, the problem was that the file was full of zeros.

[–] [email protected] -2 points 2 months ago* (last edited 2 months ago) (2 children)

Maybe I heard some bad information, but I thought the issue was caused by a null pointer exception in C/C++ code. If you have a link to a technical analysis of the issue I would be interested to read it.

[–] [email protected] 1 points 2 months ago (1 children)

No one does, it's not public yet, if ever. This is close enough.

The real problem was, among others, lack of testing, regardless of the programming language used. Blaming C++ is dumb af. Put a chimpanzee behind the wheel of a Ferrari and you'll still run into... problems.

[–] [email protected] 0 points 2 months ago* (last edited 2 months ago) (1 children)

I'll reiterate, if it was a null pointer exception (I honestly don't know that it was, but every comment I've made is based on that assumption, so let's go with it for now) then I absolutely can blame C++, and the code author, and the code reviewer, and QA. Many links in the chain failed here.

C++ is not a memory safe language, and while it's had massive improvements in that area in the last two decades, there are languages that make better guarantees about memory safety.

[–] [email protected] 1 points 2 months ago

but it very probably was not a memory error. Rust isn't magic. It probably could not have prevented this bug anyway.

[–] [email protected] 3 points 2 months ago

They said it was a "logic error". So I think it was more likely some divide by zero or something like that.
