this post was submitted on 20 Mar 2025

502 points (99.6% liked)

Technology

FOSS infrastructure is under attack by AI companies (thelibre.news)

submitted 3 days ago by [email protected] to c/[email protected]

[–] [email protected] -2 points 2 days ago (1 children)

I'm not sure how they actually implemented it, but you can easily block ML crawlers via Cloudflare. Isn't just about every small site/service behind CF anyway?

[–] [email protected] 5 points 2 days ago (1 children)

Last I checked, Cloudflare requires the user to have JavaScript and cookies enabled. My institution doesn't want to require those because it would likely impact legitimate users as well as bots.

[–] [email protected] 1 points 1 day ago (1 children)

Huh? I can reach my site via curl that has neither. How did you come up with this random set of requirements?

[–] [email protected] 0 points 1 day ago (1 children)

Odd. I just tried

curl https://www.scrapingcourse.com/cloudflare-challenge

and got

Enable JavaScript and cookies to continue

I'm clearly not on the same setup as you are, but my off-the-cuff guess is that your curl command was issued from a system that Cloudflare already recognized (IP whitelist, cookies, I dunno).

Anyways, I'm reading through this blog post on using cURL with Cloudflare-protected sites and I'm finding it interesting.
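For anyone who wants to repeat the curl check above from a script, here is a rough sketch. It only looks for the exact message I got back, so it is a heuristic and not an official way to detect a challenge. The names `CHALLENGE_MARKER`, `looks_like_challenge` and `fetch_body` are made up for this example.

```python
import urllib.error
import urllib.request

# The text returned by the curl test above. Assumption: other challenge
# pages may word this differently, so treat a miss as "unknown".
CHALLENGE_MARKER = "Enable JavaScript and cookies to continue"


def looks_like_challenge(body):
    """Return True if the response body contains the challenge message."""
    return CHALLENGE_MARKER.lower() in body.lower()


def fetch_body(url, timeout=10):
    """Fetch a URL and return its body as text, even on an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Error responses still carry a body, which is where the
        # challenge message would be.
        return err.read().decode("utf-8", errors="replace")
```

Usage would be something like `looks_like_challenge(fetch_body(url))`, which needs network access and so is not run here.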

[–] [email protected] 1 points 1 day ago

Of course their challenge requires those things. How else could they implement it? Most users will never be presented with a challenge, though, and it is trivial to disable if you don't want to ever challenge anyone. I was just saying CF blocks ML crawlers.

[–] [email protected] 26 points 2 days ago (1 children)

AI scraping is so cancerous. I host a public RedLib instance (redlib.nadeko.net), and because of BingBot and Amazon bots, my instance was always rate limited; the number of requests they make is insane. What makes me angrier is that these fucking fuckers use free, privacy-respecting services to access Reddit and scrape it. THEY CAN'T BE SO GREEDY. Hopefully, blocking their user-agent works fine ;)
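A minimal sketch of what user-agent blocking could look like, assuming the check runs somewhere that can see the request headers. The pattern list and the function names are hypothetical and this is not the actual config of any instance. It also only stops bots that identify themselves honestly, since a crawler can send any User-Agent it likes.

```python
import re

# Hypothetical blocklist: substrings matched case-insensitively against
# the User-Agent header. Only the two crawlers named above are listed.
BLOCKED_AGENT_PATTERNS = ["bingbot", "amazonbot"]

_BLOCKED_RE = re.compile(
    "|".join(re.escape(p) for p in BLOCKED_AGENT_PATTERNS),
    re.IGNORECASE,
)


def is_blocked_agent(user_agent):
    """Return True if the User-Agent string matches a blocked pattern."""
    if not user_agent:
        return False
    return _BLOCKED_RE.search(user_agent) is not None


def status_for_request(headers):
    """Pick an HTTP status for a request: 403 for blocked agents, else 200."""
    user_agent = headers.get("User-Agent", "")
    return 403 if is_blocked_agent(user_agent) else 200
```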

[–] [email protected] 3 points 1 day ago

Thanks for hosting your instances. I use them often and they're really well maintained.

[–] [email protected] 22 points 2 days ago

It's also a huge problem for library/archive/museum websites. We try so hard to make data available to everyone, then some rude bots come along and bring the site down. Adding more resources just uses more resources--the bots expand to fill the container.

[–] [email protected] 25 points 2 days ago (4 children)

ELI5 why the AI companies can't just clone the git repos and do all the slicing and dicing (running git blame etc.) locally instead of running expensive queries on the projects' servers?
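To make the question concrete, this is roughly what "clone once, then work locally" could look like. It is a sketch under assumptions: the helper names are invented, the repo URL and paths are placeholders, and it says nothing about what any crawler actually does. After the initial clone, only the occasional `pull` touches the project's server; blame runs entirely on local disk.

```python
import subprocess


def clone_cmd(repo_url, dest):
    """Command for a one-time full clone of the repository."""
    return ["git", "clone", repo_url, dest]


def update_cmd(dest):
    """Command that fetches new commits into an existing clone."""
    return ["git", "-C", dest, "pull", "--ff-only"]


def blame_cmd(dest, path):
    """Command that runs blame on a file using only the local clone."""
    return ["git", "-C", dest, "blame", "--line-porcelain", "--", path]


def run(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `run(clone_cmd(url, "project"))` once, then `run(blame_cmd("project", "README.md"))` as often as needed.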

[–] [email protected] 11 points 1 day ago (1 children)

Too many people overestimate the actual capabilities of these companies.

I really do not like saying this because it lacks a lot of nuance, but 90% of programmers are not skilled in their profession. This is not to say they are stupid (though they likely are, see cat-v/harmful) but they do not care about efficiency nor gracefulness - as long as the job gets done.

You assume they are using source control (which is unironically unlikely), you assume they know that they can run a server locally (which I pray they do), and you assume their deadlines allow them to think about actual solutions to problems (which they probably don't).

Yes, they get paid a lot of money. But this does not say much about skill in an age of apathy and lawlessness.

[–] [email protected] 4 points 1 day ago

Also, everyone's solution to a problem is stupid if they're only given 5 minutes to work on it.

Combine that with it being "free" for them to query the website and expensive to have enough local storage to replicate, even temporarily, all the stuff they want to scrape, and it's kind of a no-brainer to 'just not do that'. The only thing stopping them is morals / whether they want to keep paying rent.

[–] [email protected] 17 points 2 days ago

Because that would cost you money, so just "abusing" someone else's infrastructure is much cheaper.

[–] [email protected] 16 points 2 days ago (1 children)

Takes more effort and results in a static snapshot without being able to track the evolution of the project. (disclaimer: I don't work with ai, but I'd bet this is the reason and also I don't intend to defend those scraping twatwaffles in any way, but to offer a possible explanation)

[–] [email protected] 12 points 2 days ago

Also, having your victim cover the costs is an added benefit.

[–] [email protected] 28 points 2 days ago* (last edited 2 days ago) (1 children)

I hear there's a tool called (I think) 'nepenthe' that creates a loop for an LLM. If you use that in combination with a fairly tight blacklist of IPs you're certain are LLM crawlers, I bet you could do a lot of damage, and maybe make them slow their shit down, or do this in a more reasonable way.

[–] [email protected] 7 points 2 days ago (1 children)

nepenthe

It's a Markov-chain-based text generator which could be difficult for people to implement on repos depending upon how they're hosting them. Regardless, any sensibly-built crawler will have rate limits. This means that although Nepenthe is an interesting thought exercise, it's only going to do anything to things knocked together by people who haven't thought about it, not the Big Big companies with the real resources who are likely having the biggest impact.
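For anyone unfamiliar with the term, a Markov-chain text generator in its simplest form looks something like the sketch below. This is a generic word-level illustration and not Nepenthe's actual code. The function names and the `seed` parameter are invented for the example.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, seed=None):
    """Walk the chain from a start word, picking a random follower each step."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            # Dead end: this word never had a successor in the source text.
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)
```

The output is locally plausible but globally meaningless, which is the point: it is cheap to produce and useless to train on.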

[–] [email protected] 2 points 2 days ago

It might hit a few times, or maybe there's a version that can puff up the data in terms of space and salt it in terms of utility.

[–] [email protected] 8 points 2 days ago

They're afraid
