this post was submitted on 20 Mar 2025

507 points (99.6% liked)

Technology

67338 readers

4622 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related news or articles.
Be excellent to each other!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
Check for duplicates before posting, duplicates may be removed
Accounts 7 days and younger will have their posts automatically removed.

Approved Bots

founded 2 years ago

MODERATORS

[email protected]

507

FOSS infrastructure is under attack by AI companies (thelibre.news)

submitted 4 days ago by [email protected] to c/[email protected]

74 comments fedilink hide all child comments

(page 2) 25 comments

sorted by: hot top controversial new old

[–] [email protected] 46 points 4 days ago (7 children)

I too read Drew DeVault's article the other day and I'm still wondering how the hell these companies have access to "tens of thousands" of unique IP addresses. Seriously, how the hell do they have access to so many IP addresses that SysAdmins are resorting to banning entire countries to make it stop?

load more comments (7 replies)

[–] [email protected] 61 points 4 days ago

Yep, it hit many lemmy servers as well, including mine. I had to block multiple alibaba subnet to get things back to normal. But I'm expecting the next spam wave.

[–] [email protected] 12 points 4 days ago* (last edited 4 days ago) (4 children)

Assuming we could build a new internet from the ground up, what would be the solution? IPFS for load-balancing?

[–] [email protected] 10 points 4 days ago (2 children)

what would be the solution?

Simple, not allowing anonymous activity. If everything was required to be crypto-graphically signed in such a way that it was tied to a known entity then this could be directly addressed. It's essentially the same problem that e-mail has with SPAM and not allowing anonymous traffic would mostly solve that problem as well.

Of course many internet users would (rightfully) fight that solution tooth and nail.

[–] [email protected] 15 points 4 days ago (2 children)

Proof of work before connections are established. The Tor network implemented this in August of 2023 and it has helped a ton.

load more comments (2 replies)

load more comments (1 replies)

[–] [email protected] 37 points 4 days ago (1 children)

There is no technical solution that will stop corporations with deep pockets in a capitalist society

[–] [email protected] 6 points 4 days ago (3 children)

Maybe letters through the mail to receive posts.

[–] [email protected] 1 points 4 days ago

How long will USPS last?

load more comments (2 replies)

[–] [email protected] 52 points 4 days ago (2 children)

The Linux Mint forums have been knocked offline multiple times over the last few months, to the point where the admins had to block all Chinese and Brazilian IPs for a while.

load more comments (2 replies)

[–] [email protected] 63 points 4 days ago (1 children)

I wish these companies would realise that acting like this is a very fast way to get scraping outlawed altogether, which is a shame because it can be genuinely useful (archival, automation, etc).

[–] [email protected] 53 points 4 days ago (2 children)

How can you outlaw something a company in another conhtinent is doing? And specially when they are becoming better as disguising themselves as normal traffic? What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

[–] [email protected] -4 points 4 days ago (5 children)

What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

Yes, because like or not that's the only possible solution. If all traffic was required to be signed and the signatures were tied to an entity then you could refuse unsigned traffic and if signed traffic was causing problems you'd know who it was and have recourse.

I don't like this solution but it's the only way forward that I can see.

load more comments (5 replies)

[–] [email protected] 14 points 4 days ago (3 children)

What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

You're right. Which is exactly why companies should be exhibiting better behaviour and self regulate before they make the internet infinitely worse off for everyone.

[–] [email protected] 28 points 4 days ago* (last edited 4 days ago) (1 children)

self regulation is a joke. a few bad apples always spoil the bunch.

what needs to happen is regulation, period. force all companies to abide by laws that just make sense, and all these problems go away.

see: GDPR

[–] [email protected] 0 points 4 days ago (6 children)

What did GDPR solve? Did we get rid of advertisers sharing data?

load more comments (6 replies)

[–] [email protected] 3 points 4 days ago

Exactly, we've already seen this in the past. GDPR is a good example. Whilst I'm glad this regulation exists, it wouldn't be necessary if megacorps would have behaved.

[–] [email protected] 5 points 4 days ago

according to history, this sadly never works

[–] [email protected] 25 points 4 days ago (1 children)

If an AI is detecting bugs, the least it could do is file a pull request, these things are supposed to be master coders right? 🙃

[–] [email protected] 4 points 4 days ago

to me, ai is a bit like bucket of water if you replace the water with "information". Its a tool and it cant do anything on its own, you could make a program and instruct it to do something but it would work just as chaotically as when you generate stuff with ai. It annoys me so much to see so many(people in general) think that what they call ai is in anyway capable of independent action. It just does what you tell it to do and it does it based on how it has been trained, which is also why relying on ai trained by someone you shouldnt trust is bad idea.

[–] [email protected] 114 points 4 days ago* (last edited 4 days ago) (3 children)

Really great piece. We have recently seen many popular lemmy instances struggle under recent scraping waves, and that is hardly the first time its happened. I have some firsthand experience with the second part of this article that talks about AI-generated bug reports/vulnerabilities for open source projects.

I help maintain a python library and got a bug report a couple weeks back of a user getting a type-checking issue and a bit of additional information. It didn't strictly follow the bug report template we use, but it was well organized enough, so I spent some time digging into it and came up with no way to reproduce this at all. Thankfully, the lead maintainer was able to spot the report for what it was and just closed it and saved me from further efforts to diagnose the issue (after an hour or two were burned already).

[–] [email protected] 33 points 4 days ago

AI scrapers are a massive issue for Lemmy instances. I'm gonna try some things in this article because there are enough of them identifying themselves with user agents that I didn't even think of the ones lying about it.

I guess a bonus (?) is that with 1000 Lemmy instances, the bots get the Lemmy content 1000 times so our input has 1000 times the weighting of reddit.

[–] [email protected] 29 points 4 days ago (1 children)

Any idea what the point of these are then? Sounds like its reporting a fake bug.

[–] [email protected] 80 points 4 days ago (2 children)

The theory that the lead maintainer had (he is an actual software developer, I just dabble), is that it might be a type of reinforcement learning:

Get your LLM to create what it thinks are valid bug reports/issues
Monitor the outcome of those issues (closed immediately, discussion, eventual pull request)
Use those outcomes to assign how "good" or "bad" that generated issue was
Use that scoring as a way to feed back into the model to influence it to create more "good" issues

If this is what's happening, then it's essentially offloading your LLM's reinforcement learning scoring to open source maintainers.

[–] [email protected] 40 points 4 days ago

Thats wild. I don't have much hope for llm's if things like this is how they are doing things and I would not be surprised given how well they don't work. Too much quantity over quality in training.

load more comments (1 replies)

[–] [email protected] 6 points 4 days ago* (last edited 4 days ago) (1 children)

Testing out a theory with ChatGPT there might be a way, albeit clunky, to detect AI. I asked ChatGPT a simple math question then told it to disregard the rest of the message, then I asked it if it was AI. It answered the math question and told me it was ai. Now a bot probably won't admit to being AI but it might be foolish enough to consider instruction that you explicitly told it not to follow.

Or you might simply be able to waste its resources by asking it to do something computationally difficult that most people would just reject outright.

Of course all of this could just result in making AI even harder to detect once it learns these tricks. 😬

load more comments (1 replies)

load more comments