we appear to be the first to write up the outrage coherently too. much thanks to the illustrious @self

[–] fasterandworse@awful.systems 0 points 10 months ago (1 children)
[–] Lumisal@lemmy.world 0 points 10 months ago (2 children)

Well, since you want to use computers to continue the discussion, here's ChatGPT as well:

Determining the exact percentage of copyrighted data used to train a large language model (LLM) is challenging for several reasons:

  1. Scale and Variety of Data Sources: LLMs are typically trained on vast and diverse datasets collected from the internet, including books, articles, websites, and social media. This data encompasses both copyrighted and non-copyrighted content. The datasets are often so large and varied that it is difficult to precisely categorize each piece of data.

  2. Data Collection and Processing: During the data collection process, the primary focus is on acquiring large volumes of text rather than cataloging the copyright status of each individual piece. While some datasets, like Common Crawl, include metadata about the sources, they do not typically include detailed copyright status information.

  3. Transformation and Use: The data used for training is transformed into numerical representations and used to learn patterns, making it even harder to trace back and identify the copyright status of specific training examples.

  4. Legal and Ethical Considerations: The legal landscape regarding the use of copyrighted materials for AI training is still evolving. Many AI developers rely on fair use arguments, which complicates the assessment of what constitutes a copyright violation.

Efforts are being made within the industry to better understand and address these issues. For example, some organizations are working on creating more transparent and ethically sourced datasets. Projects like RedPajama aim to provide open datasets that include details about data sources, helping to identify and manage the use of copyrighted content more effectively.

Overall, while it is theoretically possible to estimate the proportion of copyrighted content in a training dataset, in practice, it is a complex and resource-intensive task that is rarely undertaken with precision.
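
To make that last point concrete, here's a toy sketch (not from ChatGPT, purely illustrative) of why any such figure is a bounded guess rather than an exact percentage. The manifest format and license tags are made up for the example; real crawl metadata rarely records license status at all.

```python
# Toy sketch: "X% of the training data is copyrighted" is a range,
# not a number, because most documents have unknown license status.
from collections import Counter

# Hypothetical per-document metadata; in a real corpus most entries
# would look like the "unknown" ones.
manifest = [
    {"source": "project_gutenberg", "license": "public_domain"},
    {"source": "random_blog",       "license": None},          # unknown
    {"source": "news_site",         "license": "all_rights_reserved"},
    {"source": "github_repo",      "license": "mit"},
    {"source": "forum_scrape",     "license": None},           # unknown
    {"source": "forum_scrape",     "license": None},           # unknown
]

OPEN = {"public_domain", "mit", "cc0"}  # treated as not copyright-restricted

counts = Counter(
    "open" if d["license"] in OPEN
    else "copyrighted" if d["license"] is not None
    else "unknown"
    for d in manifest
)

n = len(manifest)
# Lower bound: only documents we *know* are copyrighted.
# Upper bound: every unknown document might also be copyrighted.
low = counts["copyrighted"] / n
high = (counts["copyrighted"] + counts["unknown"]) / n
print(f"copyrighted share: between {low:.0%} and {high:.0%} of {n} docs")
```

On this made-up manifest that prints a range from 17% to 67%, and the width of the range depends entirely on how much of the corpus carries usable license metadata, which is the whole problem.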

[–] froztbyte@awful.systems 0 points 10 months ago

you should speak to a physicist, they might be able to find a way your density can contribute to science

[–] fasterandworse@awful.systems 0 points 10 months ago (1 children)

"exact percentage"

just fuck right off. wasting my fucken time.

[–] Lumisal@lemmy.world 0 points 10 months ago (2 children)

You're the one who linked to an exact percentage, not me. Have a good day.

[–] fasterandworse@awful.systems 0 points 10 months ago (2 children)

re-read your chatgpt response and think about whether the percentages in my original link could be too high or too low.

[–] froztbyte@awful.systems 0 points 10 months ago

too high or too low

trick question: everyone knows this late on a friday you want a body high for that nice mellow low feeling

[–] fasterandworse@awful.systems 0 points 10 months ago

but, like, really think this time. at this point i'm not arguing with you, i'm trying to help you.

[–] froztbyte@awful.systems 0 points 10 months ago (1 children)

no, you utter fucking clown. they're literally posting to take the piss out of you. you're the only person in the room who isn't getting that everyone is laughing at you, not with you

[–] Lumisal@lemmy.world 0 points 10 months ago

Whatever makes you feel better buddy