this post was submitted on 22 Feb 2024
1020 points (98.7% liked)

[–] [email protected] 16 points 1 year ago (1 children)

I went through my comment history and changed all my comments with 100+ karma to a bunch of nonsense I found on the Internet, mostly from bots posting YouTube comments. It's mostly English words, so it shouldn't get discarded for being gibberish, but it doesn't add up to coherent information. I was sad to see some of my posts go away, but I don't want to feed the imitative AI.

Also did the first 6 pages of my "controversial" comments.

I know they have backups, but that's why I didn't simply delete them. Hopefully these edited versions get into the training set and fuck it up, even if only a little.

It'd be funny if someone could come up with a "drop table" post that would maybe make it into the set...

[–] [email protected] 10 points 1 year ago (5 children)

I'm so confused about how AI learning is supposed to work. Does it just need any data at all in significant quantity? Is the quality of the data almost irrelevant? Because otherwise surely they could just feed it back issues of Scientific American, or the scanned copies of the Library of Congress. I can't reasonably believe that Reddit is going to add anything unless it's just pure unadulterated quantity that's important.

[–] [email protected] 4 points 1 year ago* (last edited 1 year ago) (2 children)

If you wanted the AI to just create book-like texts, then you could train it purely on books from a library, but if you want it to converse like a human being, you need training data that imitates that.

[–] [email protected] 9 points 1 year ago (2 children)

Is it time to go back to Reddit and post the stupidest shit possible? For science, of course.

[–] [email protected] 6 points 1 year ago (1 children)

While Reddit has some of the most unhinged posts on the internet, it's also home to some of the most insightful and niche knowledge on the internet. For every insane, venting, politically misguided post, there are posts about electronic configurations, coding, athletic conditioning, parenting, psychology, astronomy, and media criticism.

[–] [email protected] 8 points 1 year ago (1 children)

But about half of those posts are wrong, or misinformation.

Seriously, go into any somewhat popular Reddit thread on a subject you are familiar with. There will be multiple highly upvoted parent comments going into great detail on the subject, and they will be completely wrong about all of it.

[–] [email protected] 38 points 1 year ago (5 children)

since they're gorging on reddit data, they should take the next logical step and scrape 4chan as well

[–] [email protected] 5 points 1 year ago

Good, it’s hard getting LLMs to return slurs one letter at a time.

[–] [email protected] 8 points 1 year ago (2 children)

Imagine training an AI exclusively off of 4chan posts.

Tbf Tay bot and other chat bots that learned by interacting with users sorta already did this, just indirectly over time.

[–] [email protected] 6 points 1 year ago

pretty sure someone did train an ai off 4chan before

[–] [email protected] 8 points 1 year ago

> Imagine training an AI exclusively off of 4chan posts.

I'd pay good money to see that dumpster fire lol

[–] [email protected] 16 points 1 year ago

Turns out Poole was a decade ahead of AI, with the self-destructing threads.

[–] [email protected] 29 points 1 year ago (2 children)

I wasted some mental health on this, and I want it to be the thing Google learns from.

Comment editing routine is as follows:

  1. Start with a mass find-and-replace: change 'not' to 'indeed', delete every "n't", and replace 'and' with 'but'.
  2. Take all groups like [*](*) and change the contents of the links in the brackets to the How to play a cowbell tutorial video.
  3. Collapse double line breaks into single ones, so every message becomes a single paragraph of failed markdown.
  4. Delete commas and replace full stops with question marks.
  5. Change the case of letters, picking the next letter to flip by counting ahead by the next digit of π.
  6. Make a table of all pronouns and replace half of them with Red Pants and half with Blue Pants, to keep it political.
  7. And, finally, end every 13th message with the disclaimer "Retired 2023, thirteen year daily forums volunteer, Windows MVP 2010-2020."
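
The plain text substitutions in that routine (steps 1, 3 and 4) could be sketched roughly as below. This is only an illustration, not the commenter's actual script: the function name `scramble` is hypothetical, matching 'not' and 'and' as whole, lowercase words is an assumption, and the link rewriting, π-based case flipping, pronoun table and disclaimer steps are left out.

```python
import re


def scramble(comment: str) -> str:
    """Apply steps 1, 3 and 4 of the routine to one comment (hypothetical helper)."""
    # Step 1: 'not' -> 'indeed', drop every "n't", 'and' -> 'but'.
    # Whole-word, case-sensitive matching is an assumption.
    text = re.sub(r"\bnot\b", "indeed", comment)
    text = re.sub(r"n['’]t\b", "", text)
    text = re.sub(r"\band\b", "but", text)
    # Step 3: collapse paragraph breaks into single line breaks.
    text = re.sub(r"\n{2,}", "\n", text)
    # Step 4: delete commas and turn full stops into question marks.
    text = text.replace(",", "").replace(".", "?")
    return text


print(scramble("I do not like cats, and I don't care.\n\nBye."))
```

Running the steps in the listed order matters: if the link step were added, step 4 would also turn the dots inside any URL into question marks.
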
[–] [email protected] 11 points 1 year ago* (last edited 1 year ago) (1 children)

If they have access to Reddit’s database then they have all the previous versions of everything, including deleted comments and deleted accounts.

You don’t think they paid simply to scrape, do you? They already do that.

[–] [email protected] 9 points 1 year ago

Do they have the access to all my grammatical mistakes?

REEEEEEEEEE!

[–] [email protected] 4 points 1 year ago (1 children)

Here is an alternative Piped link(s):

How to play a cowbell

Piped is a privacy-respecting open-source alternative frontend to YouTube.

I'm open-source; check me out at GitHub.

[–] [email protected] 0 points 1 year ago

Mine among them, I hope. So cool, my calls to all good people to assemble and go kill all bad people will be used by big LLMs. Aw

[–] [email protected] 7 points 1 year ago* (last edited 1 year ago)

Did reddit pay a dime for that content? I guess not. That is what social media is all about.

[–] [email protected] 1 points 1 year ago

Another wave of new and undecided users coming to Lemmy! Reddit CEO is on our side after all.

[–] [email protected] 30 points 1 year ago* (last edited 1 year ago) (5 children)

Hey guys, let's be clear.

Google now has a full complete set of logs including user IPs (correlate with gmail accounts), PRIVATE MESSAGES, and also reddit posts.

They pinky promise they will only train AI on the data.

I can pretty much guarantee someone can subpoena google for your information communicated on reddit, since they now have this PII (username(s)/ip/gmail account(s)) combo. Hope you didn't post anything that would make the RIAA upset! And let's be clear... your deleted or changed data is never actually deleted or changed... it's in an audit log chain somewhere so there's no way to stop it.

"GDPR WILL SAVE ME!" - GDPR was adopted in 2016 and only became enforceable in May 2018. Can you ever be truly sure they followed your deletion requests?

[–] [email protected] 6 points 1 year ago (1 children)

> it's in an audit log chain somewhere so there's no way to stop it.

Gut feel based on common tech platform procedures, right? (As opposed to a sourceable certainty.)

I’d bet $100 you’re right. That said, I’d give a caveat if I were you and I were going with my instincts.

[–] [email protected] 3 points 1 year ago

They definitely won't be selling any of that to scammers /s

[–] [email protected] 26 points 1 year ago (2 children)

"let's be clear"

You're making things up and presenting them as facts, how is any of this "clear"?

[–] [email protected] 6 points 1 year ago

How do you think Reddit is restoring posts that people have been deleting?

Do you think Google’s deal simply allowed them to scrape old.reddit? Hell no, there is probably a live replica of Reddit prod at Google somewhere, including deleted posts and all edits.

You don’t think they paid $60m just to scrape, do you?

[–] [email protected] 17 points 1 year ago (4 children)

Where does it say they have access to PII?
I would imagine reddit would be anonymising the data. Hashes of usernames (and any matches of usernames in content), post/comment content with upvote/downvote counts. I would hope they are also screening content for PII.
I don't think the deal is for PII, just for training data.
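
For illustration only, the kind of anonymising imagined above (hashed usernames, with matches inside the content hashed too, and scores kept) might look something like this. Nothing here reflects what Reddit or Google actually do: the key, the record layout (`author`, `body`, `score`) and the function names are all assumptions, and a keyed hash is swapped in for a plain one so that names can't be recovered by simply hashing guesses.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"hypothetical-secret"  # assumption: a key held only by the data owner


def pseudonym(username: str) -> str:
    """Map a username to a stable token using a keyed SHA-256 hash."""
    digest = hmac.new(SECRET_KEY, username.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]


def anonymise(record: dict, known_usernames: set) -> dict:
    """Replace the author and any in-text username matches; keep the score."""
    body = record["body"]
    for name in known_usernames:
        # Match the name only when it is not part of a longer word or handle.
        pattern = re.compile(r"(?<![\w-])" + re.escape(name) + r"(?![\w-])", re.IGNORECASE)
        body = pattern.sub(pseudonym(name), body)
    return {"author": pseudonym(record["author"]), "body": body, "score": record["score"]}


example = {"author": "alice", "body": "As Alice said, ask bob.", "score": 5}
print(anonymise(example, {"alice", "bob"}))
```

Even under these assumptions this only covers usernames; screening free text for other PII (real names, emails, addresses) is a much harder problem that the sketch does not attempt.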
