this post was submitted on 26 Feb 2024
166 points (93.7% liked)

Reddit


In case you didn't know, you can't train an AI on content generated by another AI because it causes distortion that reduces the quality of the output. It is also very difficult to filter out AI text from human text in a database. This phenomenon is known as AI collapse.

So if you were to start using AI to generate comments and posts on Reddit, their database would be less useful for training AI and therefore the company wouldn't be able to sell it for that purpose.
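The degradation the post describes can be illustrated with a toy experiment: repeatedly fit a simple model (here, a Gaussian) to a finite sample drawn from the previous generation's model, and watch the diversity of the "data" shrink away. This is only a minimal sketch of the idea, not a claim about how production LLM training behaves:

```python
import numpy as np

def generations_of_self_training(n_samples=100, n_generations=5000, seed=0):
    """Fit a Gaussian to samples of the previous fit, over and over.

    Each generation only sees a finite sample of the last model's
    output, so estimation noise compounds and the spread drifts
    toward zero -- a toy version of model collapse."""
    rng = np.random.default_rng(seed)
    mean, std = 0.0, 1.0          # generation 0: the "human data"
    stds = [std]
    for _ in range(n_generations):
        sample = rng.normal(mean, std, n_samples)  # "model output"
        mean, std = sample.mean(), sample.std()    # "retrain" on it
        stds.append(std)
    return stds

stds = generations_of_self_training()
print(f"std of generation 0:     {stds[0]:.3f}")
print(f"std of final generation: {stds[-1]:.6g}")  # far smaller
```

The spread performs a multiplicative random walk with negative drift, so after enough generations the "model" produces nearly identical outputs, which is the distortion the post is referring to.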

top 50 comments
[–] [email protected] 4 points 8 months ago

In the year 3000: we have hyper intelligent AI, but it's limited to knowledge from 2022.

[–] [email protected] 11 points 8 months ago

It seems this assumes that Reddit cares about the quality of the data, but as long as they can sell it I doubt they care.

[–] [email protected] 11 points 8 months ago (1 children)

No, because the upvote ratio on posts and comments will be used to signal higher quality content.

It would take considerable effort and coordination to generate low quality content and give it an upvote history that isn't obviously suspicious and do that for enough content that it actually matters to the training.

Even if you could accomplish that, you can't backdate this activity, so they could simply filter out posts and comments after a recent date and still have an enormous amount of data to train.
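The date-plus-score filter described above is cheap to implement. In this sketch the field names, cutoff date, and score threshold are all hypothetical, just to make the idea concrete:

```python
from datetime import datetime, timezone

# Hypothetical comment records; real dumps expose similar fields.
comments = [
    {"body": "useful answer", "score": 57,
     "created": datetime(2022, 5, 1, tzinfo=timezone.utc)},
    {"body": "lorem gibberish teapot", "score": -3,
     "created": datetime(2024, 2, 1, tzinfo=timezone.utc)},
]

CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)  # pre-sabotage date
MIN_SCORE = 1

def keep_for_training(c):
    # Drop anything posted after the cutoff or voted down by users.
    return c["created"] < CUTOFF and c["score"] >= MIN_SCORE

clean = [c for c in comments if keep_for_training(c)]
print([c["body"] for c in clean])  # only the 2022 comment survives
```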

[–] [email protected] 2 points 8 months ago* (last edited 8 months ago) (1 children)

Upvoted content is not higher quality. An AI trained only on the top posts of Reddit would be very funny though.

They could filter posts by time, but that prevents any further data from being used which still limits the value of Reddit to buyers. Even all of Reddit pre-AI is probably too small to be useful indefinitely.

[–] [email protected] 2 points 8 months ago

If the goal of training is to produce output that users "like" or engage with, then yes, upvoted content is higher quality. The definition of quality here will certainly depend on their goals.

My point is a bunch of spammed content intended to poison AI training is unlikely to gather upvotes, and so it could easily be filtered out if they're also okay with discarding some human generated content that was not upvoted.

[–] [email protected] 2 points 8 months ago

Anonymous could wipe the servers.

[–] [email protected] 30 points 8 months ago

With the amount of bot-generated content already on Reddit, that data can't be of much value.

[–] [email protected] 6 points 8 months ago* (last edited 8 months ago)

For extreme cases = maybe something like this;

It is not („actual code or anything“) in fact.It is just random formatting;

Though_something_like-this-might-also_work.png

.But honestly‘ just leaving some- random pieces here@ and there‘ is probably enough to be a head‘ache .

[–] [email protected] -5 points 8 months ago (1 children)

I think people are going a bit overboard with the Reddit hate. I used the site for over a decade, learned a lot, and had a lot of laughs, fun, and gripes. Wanting to "burn it down" now because of the API and AI stuff is understandable, but it seems just vindictive. What public good is supposed to be achieved by this?

[–] [email protected] 1 points 8 months ago

The public good is denying corporations the ability to enshittify more of the internet. The public good is creating a precedent saying that bad actions have consequences, so that corporations don't act badly in the future.

[–] [email protected] 10 points 8 months ago (1 children)

I don't think that you can prevent Reddit data from being used for AI training, but you could reduce its value. Based on that, I'd probably

  1. Generate low quality text that machines would have a hard time sorting out.
  2. Replace your current Reddit content with said gibberish.

I'm saying this based on the following:

  • I don't think that Reddit has any sort of complex content versioning system; at most, I think that it keeps your deleted posts/comments.
  • Odds are that the data is filtered before being used for "training", and both user karma + content score play a role on that. As such, it would be pointless to add nonsense content that humans will downvote.
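Step 1 doesn't strictly need an LLM. A sketch like the following (the word pools and sentence template are made up for illustration) already produces text that parses as English but carries no signal:

```python
import random

# Hypothetical generator of grammatical-looking nonsense: sentences
# built from small word pools, so they read as English but mean nothing.
SUBJECTS = ["The teapot", "A rhinoceros", "My compiler", "That gnome"]
VERBS = ["juggles", "polishes", "serenades", "misplaces"]
OBJECTS = ["kumquats", "sea shanties", "stack traces", "bubblegum"]

def nonsense_sentence(rng):
    return f"{rng.choice(SUBJECTS)} {rng.choice(VERBS)} {rng.choice(OBJECTS)}."

def nonsense_paragraph(n_sentences=3, seed=None):
    rng = random.Random(seed)
    return " ".join(nonsense_sentence(rng) for _ in range(n_sentences))

print(nonsense_paragraph(seed=42))
```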

Funnily enough, AI might be a good way to generate this poisoning data against AI. For example I asked Gemini "Generate three paragraphs of nonsense text, containing three sentences each.", and here's the output

Gemini output:

The purple teapot sang opera while the moonbeams juggled kumquats. A nearby rhinoceros tap-danced with a flock of butterflies, their laughter echoing through the forest of lollipops. Suddenly, a gust of wind carrying the scent of chocolate chip cookies swept through, carrying the teapot and rhinoceros off on a whimsical adventure.

The clock chimed a symphony of kazoo sounds as the grumpy gnome polished his collection of spoons. Outside, a flock of polka-dotted penguins argued passionately about the best flavor of bubblegum. Meanwhile, a mischievous squirrel knitted a tiny sweater for a bumblebee, humming happily as it buzzed around the garden.

The gigantic rubber ducky sailed across the sea of marshmallow fluff, its captain a pirate with a monocle and a penchant for pickles. In the distance, a mermaid with hair made of spaghetti twirled underwater, chasing after a school of goldfish wearing tiny tutus. On an island of cheese, a group of singing cacti serenaded the sun with their off-key melodies.

You could tweak the prompt to get something even more nonsense or even more passable, but you get the idea.

[–] [email protected] 1 points 8 months ago (1 children)

Reddit's surely got a copy of the PushShift archives, it'll have all the pre-sabotage versions of those comments.

[–] [email protected] 5 points 8 months ago (1 children)

The PS archives are publicly available. If either OpenAI or Google were to use them, they wouldn't pay Reddit Inc. a single penny; and yet Google is paying it 60 million dollars to do so. This means that there's content that they cannot retrieve through the PS archives that would still be valuable as LLM data.

[–] [email protected] 1 points 8 months ago (1 children)

They're paying Reddit to not sue them.

Regardless, the content that's available through PS is the content that people are talking about overwriting or deleting. They can't edit or delete stuff that PushShift couldn't see in the first place.

[–] [email protected] 1 points 8 months ago (1 children)

They’re paying Reddit to not sue them.

Given how many defences Google would have against that ant called Reddit suing it, ranging from actual fair points to "ackshyually", I find it unlikely.

Regardless, the content that’s available through PS is the content that people are talking about overwriting or deleting. They can’t edit or delete stuff that PushShift couldn’t see in the first place.

Emphasis mine. Can you back up this claim?

I'm asking this because the content from PS only goes up to March 2023; it's literally a year old. There was a lot of activity on Reddit in the meantime, and my impression is that the people talking about this are the ones who already erased their content in the APIcalypse but kept using Reddit because there's some subject "stuck" there that they'd like to use.

[–] [email protected] 1 points 8 months ago

Academic Torrents has Reddit data up to December 2023. This data isn't live-updated, my understanding is that it's scraped when it's first posted. That's how services like removeddit worked, it would show the "original" version of a post or comment from when it was scraped rather than the edited or deleted version that Reddit shows now.

The age isn't really the most important thing when it comes to training a base AI model. If you want to teach it about current events there are better ways to do that than social media scrapes. Stuff like Reddit is good for teaching an AI about how people talk to each other.

[–] [email protected] -1 points 8 months ago (1 children)

I'm buying Reddit when it's traded

[–] [email protected] 6 points 8 months ago (1 children)

When it's done, sell it to me. I have some leftover coins from buying cigs and bread, they'll be more than enough to buy Reddit once enough time passes.

[–] [email protected] -2 points 8 months ago (1 children)

you won't be able to afford deez stonks

[–] [email protected] 7 points 8 months ago (1 children)

Fuck, you're right. I forgot that the 1 cent coins went out of circulation here in Brazil.

...if I give you a five cents coin, you're allowed to keep the change, OK?

[–] [email protected] 1 points 8 months ago (1 children)

haha, the Real trades at 5/USD. you're going to need to chop down or burn some more rainforest or something to come up with the coin

[–] [email protected] 3 points 8 months ago

haha, the Real trades at 5/USD.

Ah, it's one pila for five tacos? That's great, you don't need to bother with the change then!

Serious now, the Reddit stock will drop quickly after the IPO. For me the only question is how quickly - hours? or months?

you’re going to need to chop down or burn some more rainforest

I'm grabbin' my axe!

Just kidding. I live as "close" to the rainforest as Berlin is from Tunis.

[–] [email protected] 14 points 8 months ago

There's something you're missing about this, and that's how low quality the human generated content is on that site. The default subs are utter dumpster fires, with the top few comments typically being pop culture references being yelled into the ether, followed by unhinged rants, nutty takes, and assorted nonsense, all with poor spelling, grammar, and often the entirely wrong word used.

Flooding the place with AI content would be an improvement.

[–] [email protected] 5 points 8 months ago* (last edited 8 months ago)

You can train AI models on AI-generated content, though. AI collapse only occurs if you train on bad AI-generated content. Bots and people talking gibberish are just as bad for training an AI model, but there are ways to filter that out of the training data, such as language analysis. They will also most likely filter out any lowly upvoted comments, or those edited long after their original post date.

And if you start posting now, any sufficiently good AI generated material, which other humans will like and upvote, will not be bad for the model.
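A crude version of the "language analysis" filtering mentioned above can be sketched in a few lines. The word list and threshold here are hypothetical stand-ins for a real dictionary or language model:

```python
# Hypothetical filter: reject comments whose tokens rarely appear in a
# known-word list. A tiny set stands in for a real dictionary here.
KNOWN = {"the", "cat", "sat", "on", "mat", "this", "is", "a",
         "normal", "comment", "about", "trains"}

def looks_like_language(text, threshold=0.5):
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return False
    hits = sum(w in KNOWN for w in words)
    return hits / len(words) >= threshold

print(looks_like_language("This is a normal comment about trains"))  # True
print(looks_like_language("xj qpt zzork flemb grubnak"))             # False
```

A real pipeline would more likely score text with a language model's perplexity, but the principle of discarding statistical outliers is the same.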

[–] [email protected] 35 points 8 months ago (2 children)

In case you didn’t know, you can’t train an AI on content generated by another AI because it causes distortion that reduces the quality of the output.

This is incorrect in the general case. You can run into problems if you do it incorrectly or in a naive manner. But this is stuff that the professionals have figured out months or years ago already. A lot of the better AIs these days are trained on "synthetic data", which is data that's been generated by other AIs.

I've seen a lot of people fall for wishful thinking on this subject. They don't like AI for whatever reason, they hear some news article that says something that sounds like "AI won't work because of problem X", and so they grab hold of that. "Model collapse" is one of those things, it's not really a problem that serious researchers consider insurmountable.

If you don't want Reddit to use your posts to train AI then don't post on Reddit. If you already did post on Reddit, it's too late, you already gave them your content. Bear this in mind next time you join a social media site, I guess.

[–] [email protected] 2 points 8 months ago (1 children)

Training on synthetic data is not a quality improvement; it's an edge-case reducer for a small set of edge cases by decreasing "overfitting", and it only achieves even that if you're very, very careful about what you add and how. If you're ONLY training on AI-generated data repeatedly, then it does start to degrade and lose coherence after a few generations of training.

[–] [email protected] 2 points 8 months ago (1 children)

Which is why nobody trains on ONLY AI generated data.

Really, experts have thought of this stuff already. Because they're experts. Synthetic data means that the amount of "real" data required is much less, so giant repositories like Reddit aren't so important.

[–] [email protected] 1 points 8 months ago

No, "much less" training data isn't possible with synthetic data. That's not what it's there for. The experts would tell you as much if you asked them.

[–] [email protected] 8 points 8 months ago (1 children)

Biased models are still absolutely a massive concern to serious researchers.

"AI collapse" isn't the only mechanism to throw a monkey wrench into someone's AI ambitions.

Intentionally introducing and reinforcing biases in an automated fashion adds an additional burden to those developing a model. I haven't actually looked into the economic asymmetry of those attacks, though.

[–] [email protected] 7 points 8 months ago

Absolutely this. AI isn't some bastion of truth. I envision a future where AIs trained by different stakeholders (e.g. Dem vs. Repub, US vs. Russia vs. China, etc.) are all fighting for eyeballs. It's just going to get harder to tell what's real from fake because of the insane amount of content these bots are going to churn out. It's already a huge problem with human-monitored sources.

[–] [email protected] 4 points 8 months ago

They probably want you to edit your comments to poison them.

They probably are using AI bots to make astroturf posts already.

Imagine how much it’s worth to Google to train an AI to recognize other AI-generated posts. Imagine how much it’s worth to Google to have a training set of “poisoned” data (and to be able to compare it to the original post, which they can do since Reddit saves your edits on the backend). Not to mention training on genuine reactions by users to AI posts and to obvious poisoning. They’ll be able to use that to train their own AI not to be defeated by these issues.

I don’t know what should be done but I feel like trying to defeat the AI training actually plays right into their hands.
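The before/after comparison described above is trivial if the platform keeps pre-edit text: a mass "poisoning" edit leaves almost nothing of the original intact, while a normal edit changes very little. The records below are hypothetical, sketched with the standard library's `difflib`:

```python
import difflib

def edit_similarity(original, edited):
    # Ratio of matching content between the two versions, 0.0 to 1.0.
    return difflib.SequenceMatcher(None, original, edited).ratio()

original = "Here's how I fixed the driver issue on my laptop ..."
poisoned = "The purple teapot sang opera while moonbeams juggled kumquats."
typo_fix = "Here's how I fixed the driver issue on my laptop..."

print(f"{edit_similarity(original, poisoned):.2f}")  # low: suspicious edit
print(f"{edit_similarity(original, typo_fix):.2f}")  # high: normal edit
```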

[–] [email protected] 6 points 8 months ago

Yes of course you could poison it. The thing is, Reddit leaders don't care. They want to inflate the price as much as possible, get as much money from it, and then bail when the time is right.

Many downstream customers also don't care. They're riding the AI bubble to get richer, not because they actually care about high-quality products. Of course there is some cool legitimate AI work and research, as we have all seen, but the expensive decisions are being made based on expected short-run profit.

[–] [email protected] 10 points 8 months ago

Frankly, reddit is so overrun with bots and nazis that it's already a superfund site.

[–] [email protected] 7 points 8 months ago

"It's fascinating yet concerning to see Reddit's move to sell its content for training AI models. This feels like a plot twist in a 'Black Mirror' episode, making us wonder if we're heading towards a 'Brave New World' or just caught in the 'Net' of progress. While the potential to enhance AI's grasp of human banter is immense, it also opens a Pandora's box of privacy and ethical issues. It's a byte-sized dilemma in a world hungry for data, urging us for a transparent dialogue to ensure we don't scroll past our own rights. Perhaps, in a twist of fate, it's humanity's destiny to carve a future where we coexist with, if not be guided by, the very machines we've created."

[–] [email protected] 44 points 8 months ago* (last edited 8 months ago) (3 children)

So if you were to start using AI to generate comments and posts on Reddit, their database would be less useful for training AI and therefore the company wouldn't be able to sell it for that purpose.

It feels like Reddit was already using bots to make posts after they killed 3rd party apps. It's been pointed out a lot here how so many comment chains on the site these days make no sense unless they are AI/bots.

[–] [email protected] 2 points 8 months ago

It's not just the content, it's the ecosystem.

If you're training AI, you need a way to evaluate outputs. What better way than through karma scores?
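Using karma as an evaluation signal could be as simple as weighting training samples by score. Everything below (records, the log-compression choice) is a hypothetical sketch of that idea:

```python
import math

# Hypothetical replies with karma; highly upvoted ones count for more.
replies = [
    {"text": "detailed, sourced answer", "karma": 1840},
    {"text": "me too", "karma": 3},
    {"text": "off-topic rant", "karma": -12},
]

def sample_weight(karma):
    # Log-compress large scores; drop net-negative replies entirely.
    return math.log1p(karma) if karma > 0 else 0.0

for r in replies:
    r["weight"] = sample_weight(r["karma"])

print([round(r["weight"], 2) for r in replies])
```

Log compression keeps one viral comment from dominating the loss while still ranking it above an average one.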

[–] [email protected] 4 points 8 months ago (2 children)

Even before then, you'd always find comments in any larger section that were irrelevant praise posted by bots to generate a "realistic" Reddit account to sell later to marketing companies.

Hell, I believe I once used a tool to value my Reddit account at like $200, and it literally told me how kind my responses were. Also, to generate comment karma, responding to a post early is much more valuable than writing a good response.

[–] [email protected] 2 points 8 months ago (1 children)

I don't suppose you remember the tool? I'm curious about mine. lol

[–] [email protected] 1 points 8 months ago

I can't remember the specific site and it may not be up anymore. I either found it by googling "Reddit account value" or words to that effect, or stumbled across the link in Reddit.

I do remember it worked a bit like redditmetis.com: it knew the age of the account and its karma, but also the use of kind vs. obscene language. I was also a mod of a subreddit that just made everyone mods for the heck of it.

I think I already type like generative AI too, which may be worth something nowadays. Honestly, setting up a bot that uses a large language model to pump out vaguely relevant top-level comments soon after posts go up will probably net you more karma in a month than a decade of using the site sincerely; for this same reason, I presume old accounts are particularly valued now.

[–] [email protected] 2 points 8 months ago

Oh I forgot about people selling accounts. Not only will they be training on a bunch of bot posts, they’ll be training on ad spam as well.

I never bothered deleting my account or deleting my posts. But I might consider selling my account.

[–] [email protected] 13 points 8 months ago

Fr. A couple of months ago I went to check, and all I saw were posts with a ton of upvotes and no comments, or posts with a ton of upvotes and a thousand comments, not a single one with anything of substance.
