I'm not sure how they actually implemented it, but you can easily block ML crawlers via Cloudflare. Isn't just about every small site/service behind CF anyway?
Last I checked, Cloudflare requires the user to have JavaScript and cookies enabled. My institution doesn't want to require those because it would likely impact legitimate users as well as bots.
Huh? I can reach my site via curl that has neither. How did you come up with this random set of requirements?
Odd. I just tried
and got
Enable JavaScript and cookies to continue
I'm clearly not on the same setup as you are, but my off-the-cuff guess is that your curl command was issued from a system that Cloudflare already recognized (IP whitelist, cookies, I dunno).
Anyways, I'm reading through this blog post on using cURL with Cloudflare-protected sites and I'm finding it interesting.
Of course their challenge requires those things. How else could they implement it? Most users will never be presented with a challenge though and it is trivial to disable if you don't want to ever challenge anyone. I was just saying CF blocks ML crawlers.
AI scraping is so cancerous. I host a public RedLib instance (redlib.nadeko.net), and because of BingBot and Amazon bots my instance was always rate limited: the number of requests they make is insane. What makes me angrier is that these fucking fuck fuckers use free, privacy-respecting services to access Reddit and scrape it. THEY CAN'T BE SO GREEDY. Hopefully, blocking their user-agent works fine ;)
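For anyone wondering what "blocking their user-agent" can look like in practice, here is a minimal sketch in Python. It is not how RedLib does it. The substring list and the names `BLOCKED_UA_SUBSTRINGS`, `is_blocked` and `block_crawlers` are made up for illustration, and the WSGI wrapper assumes your service runs behind a WSGI server.

```python
from typing import Optional

# Illustrative substrings only (assumption); check your own access logs
# for the user agents that are actually hammering your instance.
BLOCKED_UA_SUBSTRINGS = ("bingbot", "amazonbot")


def is_blocked(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent contains a blocked crawler substring."""
    if not user_agent:
        # Assumption: requests without a User-Agent header are let through.
        return False
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_SUBSTRINGS)


def block_crawlers(app):
    """WSGI middleware: answer 403 to blocked user agents, pass the rest on."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

This only stops crawlers that identify themselves honestly. Anything that spoofs a browser user agent sails straight through, so it is a first line of defence rather than a fix.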
Thanks for hosting your instances. I use them often and they're really well maintained
It's also a huge problem for library/archive/museum websites. We try so hard to make data available to everyone, then some rude bots come along and bring the site down. Adding more resources just uses more resources--the bots expand to fill the container.
ELI5 why the AI companies can't just clone the git repos and do all the slicing and dicing (running git blame etc.) locally instead of running expensive queries on the projects' servers?
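To make the question concrete, this is roughly what "clone once, then slice and dice locally" would look like. It is only a sketch: the repo URL, destination directory and file path are placeholders, and `local_history_commands` and `run_locally` are hypothetical helper names. The git subcommands themselves (`clone`, `blame --line-porcelain`, `log --follow`) are standard.

```python
import subprocess


def local_history_commands(repo_url, dest, path):
    """Build the git commands: one clone over the network, the rest local."""
    dest = str(dest)
    return [
        # The only step that touches the project's server.
        ["git", "clone", repo_url, dest],
        # Everything below reads from the local clone.
        ["git", "-C", dest, "blame", "--line-porcelain", "--", path],
        ["git", "-C", dest, "log", "--follow", "--oneline", "--", path],
    ]


def run_locally(repo_url, dest, path):
    """Run the commands in order and return the stdout of each one."""
    outputs = []
    for argv in local_history_commands(repo_url, dest, path):
        result = subprocess.run(argv, check=True, capture_output=True, text=True)
        outputs.append(result.stdout)
    return outputs
```

After the clone, every blame and log call costs the project's server nothing, which is the whole point of the question.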
Too many people overestimate the actual capabilities of these companies.
I really do not like saying this because it lacks a lot of nuance, but 90% of programmers are not skilled in their profession. This is not to say they are stupid (though they likely are, see cat-v/harmful) but they do not care about efficiency nor gracefulness - as long as the job gets done.
You assume they are using source control (which is unironically unlikely), you assume they know that they can run a server locally (which I pray they do), and you assume their deadlines allow them to think about actual solutions to problems (which they probably don't)
Yes, they get paid a lot of money. But this does not say much about skill in an age of apathy and lawlessness
Also, everyone's solution to a problem is stupid if they're only given 5 minutes to work on it.
Combine that with it being "free" for them to query the website and expensive to have enough local storage to replicate, even temporarily, all the stuff they want to scrape, and it's kind of a no-brainer to 'just not do that'. The only thing stopping them is morals / whether they want to keep paying rent.
Because that would cost you money, so just "abusing" someone else's infrastructure is much cheaper.
Takes more effort and results in a static snapshot without being able to track the evolution of the project. (disclaimer: I don't work with ai, but I'd bet this is the reason and also I don't intend to defend those scraping twatwaffles in any way, but to offer a possible explanation)
Also having your victim host the costs is an added benefit
I hear there's a tool called (I think) 'nepenthe' that creates a loop for an LLM. If you use that in combination with a fairly tight blacklist of IPs you're certain are LLM crawlers, I bet you could do a lot of damage, and maybe make them slow their shit down, or do this in a more reasonable way.
nepenthe
It's a Markov-chain-based text generator which could be difficult for people to implement on repos depending upon how they're hosting them. Regardless, any sensibly-built crawler will have rate limits. This means that although Nepenthe is an interesting thought exercise, it's only going to do anything to things knocked together by people who haven't thought about it, not the Big Big companies with the real resources who are likely having the biggest impact.
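For anyone who hasn't seen one, a Markov-chain text generator is only a few lines. The sketch below is a generic word-level version, not the tool's actual code. The function names `build_chain` and `generate` and the `order` parameter are my own.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Walk the chain to produce up to `length` words of plausible babble."""
    if not chain:
        return ""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (state only seen at the end of the corpus): restart.
            state = rng.choice(sorted(chain))
            followers = chain[state]
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out[:length])
```

Feed it any text you have lying around and it produces endless locally-plausible nonsense, which is cheap to generate but, as said above, only hurts crawlers that don't rate limit themselves.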
Might hit a few times, or maybe there's a version that can puff up the data in the sense of space, and salt it in the sense of utility.
They're afraid