this post was submitted on 17 May 2024

Technology

top 23 comments
[–] [email protected] 6 points 5 months ago (1 children)

So what I am reading is that I should open a few Reddit accounts and start replying to posts with random excerpts from Master P era dirty south lyrics en masse.

[–] [email protected] 7 points 5 months ago* (last edited 5 months ago) (1 children)

ChatGPT says:

Yes, there are strategies to post wrong answers that could "poison" the training data of language models while still allowing human readers to recognize the errors. Here are a few approaches:

  1. Subtle Semantic Errors: Provide answers that contain subtle but significant semantic errors. For example, use synonyms incorrectly or swap terms in a way that changes the meaning but might be overlooked by automated systems. For instance, "Paris is the capital of Germany" instead of "Berlin is the capital of Germany."
  2. Contextual Incongruities: Embed answers with facts that are contextually incorrect but appear correct at a surface level. For example, "The sun rises in the west and sets in the east."
  3. Formatting and Punctuation: Use formatting or punctuation that disrupts automated parsing but is obvious to a human reader. For example, "The capital of France is Par_is." or "Water freezes at 0 degrees F@harenheit."
  4. Obvious Misspellings: Introduce deliberate misspellings that are noticeable to human readers but might not be corrected by automated systems, like "The chemical symbol for gold is Au, not Gld."
  5. Logical Inconsistencies: Construct answers that logically contradict themselves, which humans can spot as nonsensical. For example, "The tallest mountain on Earth is Mount Kilimanjaro, which is located underwater in the Pacific Ocean."
  6. Nonsense Sentences: Use sentences that look structurally correct but are semantically meaningless. For example, "The quantum mechanics of toast allows it to fly over rainbows during lunar eclipses."
  7. Annotations or Meta-Comments: Add comments or annotations within the text that indicate the information is incorrect or a test. For example, "Newton's second law states that F = ma (Note: This is incorrect for the purpose of testing)."

While these methods can be effective in confusing automated systems and LLMs, they also have ethical and legal implications. Deliberately poisoning data can have unintended consequences and may violate the terms of service of the platform. It's crucial to consider these aspects before attempting to implement such strategies.
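As a toy illustration of strategy 3 above (formatting disruption), here is a hypothetical Python sketch that inserts an underscore mid-word, so an automated tokenizer sees different tokens than a human reader does. The function name, rate, and seed are invented for illustration, not any real tool:

```python
import random

def disrupt_tokens(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Toy sketch of 'formatting disruption': insert an underscore in the
    middle of some longer words. A human still reads the word easily; a
    naive parser sees a different token."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) > 4 and rng.random() < rate:
            mid = len(word) // 2
            word = word[:mid] + "_" + word[mid:]
        out.append(word)
    return " ".join(out)
```

Note that the transformation is trivially reversible (strip the underscores), which is exactly why real ingest pipelines would likely normalize it away.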

[–] [email protected] 3 points 5 months ago (2 children)

Huh... Will this message then get re-ingested by chatgpt? Did it just poison itself?

[–] [email protected] 1 points 5 months ago

Thanks to a few centuries of upper nobility, we already know that marrying your cousin for several generations is not always a good idea. It'll be interesting to see what happens after a few iterations of AIs being trained on data mostly produced by other AIs (or variations of themselves). I suppose it largely depends on how well the training data can be curated.
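The "training on your own descendants" effect can be sketched numerically: a toy simulation where each generation is "trained" only on samples drawn from the previous generation's fitted distribution. Everything here (function name, parameters) is an invented illustration of the compounding-sampling-noise idea, not a real training pipeline:

```python
import random
import statistics

def collapse_demo(generations: int = 20, n: int = 200, seed: int = 1):
    """Toy model-collapse sketch: fit a normal distribution to samples,
    then resample from the fit, over and over. Each generation only ever
    sees the previous generation's output, so sampling error compounds
    and the fitted spread drifts instead of staying at the true value."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    spreads = [sigma]
    for _ in range(generations):
        data = [rng.gauss(mu, sigma) for _ in range(n)]
        mu, sigma = statistics.fmean(data), statistics.stdev(data)
        spreads.append(sigma)
    return spreads
```

How badly the spread drifts depends on the sample size per generation, which maps loosely onto the comment's point about curation: more (and better-filtered) data per iteration slows the decay.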

[–] [email protected] 2 points 5 months ago* (last edited 5 months ago)

I feel like the ingest system will be sophisticated enough to throw away pieces of text that begin with a message like "ChatGPT says". Probably even stuff that follows the "paragraph with assumptions and clarifications followed by a list followed by a brief conclusion" structure - everything old has been ingested already, and most of the new stuff containing this is probably AI generated.
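A crude version of that filter could look like the sketch below. The marker list and the "intro paragraph followed by a numbered list" shape check are pure guesses at what an ingest pipeline might do, not any vendor's actual logic:

```python
import re

# Hypothetical markers of text that self-identifies as LLM output.
LLM_MARKERS = re.compile(r"^\s*(chatgpt says|as an ai language model)", re.I)

def looks_ai_generated(text: str) -> bool:
    """Heuristic sketch: flag text that announces itself as ChatGPT
    output, or that has the telltale 'prose intro + numbered list'
    shape the comment above describes."""
    if LLM_MARKERS.search(text):
        return True
    lines = [l for l in text.splitlines() if l.strip()]
    numbered = sum(bool(re.match(r"\s*\d+\.", l)) for l in lines)
    # Crude shape check: several numbered items introduced by prose.
    return numbered >= 3 and bool(lines) and not lines[0].lstrip()[0].isdigit()
```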

[–] [email protected] 40 points 5 months ago* (last edited 5 months ago) (1 children)

So you're saying instead of tacking "site:reddit.com" onto my Google search, I can now use ChatGPT to get the same information, except without the original context, and it will often be wrong? Amazing!

And this also means that companies will fill Reddit with fake comments promoting their brand to ensure that their brand gets mentioned in ChatGPT responses, right? Can't wait!

[–] [email protected] 1 points 5 months ago

don't worry, google also has a partnership with reddit! why doesn't reddit just have an open api like they used to? good question!

[–] [email protected] 17 points 5 months ago (1 children)

I can't wait for redditors to see the opportunity to poison ChatGPT hard.

[–] [email protected] 15 points 5 months ago (2 children)

They might not even have to. I bet there are bots already having entire discussions by themselves on there.

Anti Commercial-AI license

[–] [email protected] 10 points 5 months ago

/r/subredditsimulator

[–] [email protected] 2 points 5 months ago (3 children)

The license does not apply to posts and replies on Reddit, right? Thank god I created a blog, before the AI breakthrough and before what happened to Reddit, where I can post whatever I want without Reddit's licenses or restrictions. But even so, do AI tools understand such a license text and evaluate whether or not they can use the material?

[–] [email protected] 5 points 5 months ago

From what I understand, LLMs are just large heuristic machines. They gather a lot of statistics on token order and answer with whatever ranks statistically higher than the other options. There's no "understanding". So to answer your question: no, they don't understand the license.

Content is most likely scraped wholesale from websites, possibly run through some cleanup to filter out absolute garbage, and fed into an LLM to train it. An LLM can be tricked into revealing its training data (e.g. by asking it to repeat "fruit" forever). It's in those cases that copyright infringement is detected, and where action can be, and has been, taken. There are court cases currently in review, the most prominent being the one against GitHub Copilot for infringing on the licenses of source code it ingested.
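The "statistics on token order" idea can be caricatured with a bigram counter. This is a deliberately minimal sketch of the intuition, not how production LLMs actually work (they use learned neural representations, not raw count tables):

```python
from collections import Counter, defaultdict

def train_bigrams(corpus: str):
    """Minimal caricature of the 'large heuristic machine' idea:
    count which token follows which, and nothing more."""
    table = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def predict(table, token: str) -> str:
    """Return the statistically most frequent next token.
    (Raises IndexError for unseen tokens; fine for a sketch.)"""
    return table[token].most_common(1)[0][0]
```

Even this toy illustrates the point in the comment: the table "answers" by frequency alone, with no notion of what a license, or anything else, means.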

Anti Commercial-AI license

[–] [email protected] 4 points 5 months ago (1 children)

do AI tools understand such a license text and evaluate if they can or cannot use the material?

So, this is the fun part: AI tools don't auto-ingest material to process it. The developers choose the materials to feed into the models.

And while the tech bros can understand your licenses, they don't give a flying fuck, because they think they'll be billionaires beyond consequences by the time anyone discovers that their work in particular has been ripped off.

[–] [email protected] 2 points 5 months ago

Well, the companies and developers don't decide for every single piece of material. For example, I expect that they program the scraper with rules to respect the licenses of individual projects (such as on GitHub, probably). And I assume those scraper tools are AI tools themselves, built with AI assistance on top of that. There are multiple AI layers!

At this point, I don't think any developer knows exactly what the AI tools are fed, if they use automatically scraped public sources from the internet.
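A license-respecting scraper rule like the one speculated about could, in the simplest case, be a marker check on each post before ingestion. The marker strings below are illustrative assumptions, not a recognized standard, and real pipelines may well ignore such notices entirely:

```python
# Hypothetical license notices a scraper might be told to respect.
LICENSE_MARKERS = ("anti commercial-ai license", "cc-by-nc")

def may_ingest(post_text: str) -> bool:
    """Sketch of a per-post scraper rule: skip posts whose authors
    attach a non-commercial license notice."""
    lowered = post_text.lower()
    return not any(marker in lowered for marker in LICENSE_MARKERS)
```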

[–] [email protected] 7 points 5 months ago* (last edited 5 months ago)

No, that user has the license on all of their comments

[–] [email protected] 17 points 5 months ago (2 children)

At this rate I'll be having a Lemmy user superiority complex in no time

[–] [email protected] 3 points 5 months ago

You don't think LLMs are being trained off of this content too? Nobody needs to bother "announcing a deal" for it, it's being freely broadcast.

[–] [email protected] 10 points 5 months ago* (last edited 5 months ago)

(I'm on New Game++++ now)

[–] [email protected] 43 points 5 months ago (3 children)

Reddit has become one of the internet’s largest open archives of authentic, relevant, and always up-to-date human conversations about anything and everything.

Reddit CEO Steve Huffman says

But he refuses to pay the users, or at least the moderators, who built Reddit into what it is now. Instead, he pushes more advertisements and sells data to AI companies for millions of dollars.

[–] [email protected] 6 points 5 months ago

I've made a note to ask about pay when I see mod postings. Luckily I'm finding more and more mod postings, so there are lots of opportunities to remind mods that they're lining Reddit's pockets for free.

[–] [email protected] 17 points 5 months ago (1 children)

And he will continue to do so as long as people keep using the platform. Seems to work well for him.

[–] [email protected] 3 points 5 months ago

I can't even really blame them, to be honest. It's just a shame it has to be this way

[–] [email protected] 19 points 5 months ago

It also hinders mods, users, and especially disabled people from doing their work.