this post was submitted on 06 Sep 2024

1722 points (90.2% liked)

The Irony of 'You Wouldn't Download a Car' Making a Comeback in AI Debates (lemmy.world)

submitted 8 months ago by [email protected] to c/[email protected]

484 comments

Those claiming AI training on copyrighted works is "theft" misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in "vector space". When generating new content, the AI isn't recreating copyrighted works, but producing new expressions inspired by the concepts it's learned.

This is fundamentally different from copying a book or song. It's more like the long-standing artistic tradition of being influenced by others' work. The law has always recognized that ideas themselves can't be owned - only particular expressions of them.

Moreover, there's precedent for this kind of use being considered "transformative" and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it's understandable that creators feel uneasy about this new technology, labeling it "theft" is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn't make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

(page 6) 50 comments

[–] [email protected] 10 points 8 months ago

it's rich cunts asking for handouts again. hey, we call this feasibility, you should have thought about it before, not now. your business is not feasible. fuck off forever. thanks.

[–] [email protected] 2 points 8 months ago* (last edited 8 months ago)

They are laundering the creative works of humans. That's it. The end. They are laundering machines for art. They should be treated and legislated as such.

[–] [email protected] 8 points 8 months ago (4 children)

So if I watch all Star Wars movies, and then get a crew together to make a couple of identical movies that were inspired by my earlier watching, and then sell the movies, then this is actually completely legal.

It doesn't matter if they stole the source material. They are selling a machine that can create copyright infringements at a click of a button, and that's a problem.

This is not the same as an artist looking at every single piece of art in the world and being able to replicate it to hang it in the living room. This is an army of artists that are enslaved by a single company to sell any copy of any artwork they want. That army works as long as you feed it electricity and free labor of actual artists.

Theft actually seems like a great word for what these scammers are doing.

If you run some open source model on your own machine, that's a different story.

[–] [email protected] 7 points 8 months ago (2 children)

You've made a lot of confident assertions without supporting them. Just like an LLM! :)

[–] [email protected] 65 points 8 months ago (11 children)

The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works. This has been suppressed by OpenAI in a rather brute force kind of way, by prohibiting the prompts that have been found so far to do this (e.g. the infamous "poetry poetry poetry..." ad infinitum hack), but the possibility is still there, no matter how much they try to plaster over it. In fact there are some people, much smarter than me, who see technical similarities between compression technology and the process of training an LLM, calling it a "blurry JPEG of the Internet"... the point being, you wouldn't allow distribution of a copyrighted book just because you compressed it in a ZIP file first.

[–] [email protected] 4 points 8 months ago (9 children)

ML techniques have been very useful in compression, yes, but it's sort of nuts to say that a data structure that encodes only relationships between values (sometimes overly so for certain regions of its latent space/embedding space/semantics space/whatever you want to call it right now), rather than the value sequences themselves, is storing contiguous copyright-protected works, i.e. particularized creative works in a particularly identifiable manner.

[+] [email protected] -8 points 8 months ago (7 children)

Equating LLMs with compression doesn't make sense. Model sizes are larger than their training sets. If it requires "hacking" to extract text of sufficient length to break copyright, and the platform is doing everything it can to prevent it, that just makes it like every other platform. I can download © material from YouTube (or wherever) all day long.

[–] [email protected] 16 points 8 months ago (1 children)

Model sizes are larger than their training sets

Excuse me, what? You think Huggingface is hosting hundreds of checkpoints, each of which is a multiple of the size of its training data, which is on the order of terabytes or petabytes in disk space? I don't know if I agree with the compression argument, myself, but for other reasons. Your retort is objectively false.

[–] [email protected] 2 points 8 months ago* (last edited 8 months ago)

Just taking GPT-3 as an example, its training set was 45 terabytes, yes. But that set was filtered and processed down to about 570 GB. GPT-3 was only actually trained on that 570 GB. The model itself is about 700 GB. Much of the generalized intelligence of an LLM comes from abstraction to other contexts.

"Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens." (Language Models are Few-Shot Learners)

*Did some more looking, and that model size estimate assumes 32-bit floats. It's actually 16-bit, so the model size is 350 GB... technically some compression after all!
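
For anyone who wants to check the arithmetic in the comment above, here is a minimal sketch. It assumes GPT-3's published parameter count of 175 billion (the comment doesn't state it) and decimal gigabytes; the function and variable names are only illustrative.

```python
# Rough weight-storage estimate: parameters * bytes per parameter.
# Assumption: GPT-3 has 175 billion parameters (from the GPT-3 paper,
# not from the comment above). Sizes are in decimal GB (1 GB = 1e9 bytes).

GPT3_PARAMS = 175_000_000_000
FILTERED_COMMONCRAWL_GB = 570  # filtered training text size quoted above


def model_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the storage needed for the raw weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


fp32_gb = model_size_gb(GPT3_PARAMS, 4)  # 32-bit floats: 4 bytes each
fp16_gb = model_size_gb(GPT3_PARAMS, 2)  # 16-bit floats: 2 bytes each

print(f"fp32 weights: {fp32_gb:.0f} GB")
print(f"fp16 weights: {fp16_gb:.0f} GB")
print(f"fp16 weights / filtered text: {fp16_gb / FILTERED_COMMONCRAWL_GB:.2f}")
```

This only counts the raw weights. It says nothing about how much of the training text is recoverable from them, which is the part people are actually arguing about.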

[–] [email protected] 36 points 8 months ago (1 children)

You know, those obsessed with pushing AI would do a lot better if they dropped the patronizing tone in every single one of their comments defending them.

It's always fun reading "but you just don't understand".

[–] [email protected] 7 points 8 months ago (13 children)

On the other hand, it's hard to have a serious discussion with people who insist that building an LLM or diffusion model amounts to copying pieces of material into an obfuscated database. And when the typical reply to an attempted explanation is "that isn't the point!" with no elaboration, that strongly implies to me that some people just want to be pissy and don't want to hear how they may have been manipulated into taking a pro-corporate, hyper-capitalist position on something.

[–] [email protected] 22 points 8 months ago (2 children)

The joke is of course that "paying for copyright" is impossible in this case. ONLY the large social media companies that own all the comments and content accumulated by their communities have enough data to train AI models. Or sites like stock photo libraries or DeviantArt, which own the distribution rights for the content. That means all copyright arguments practically argue that AI should be owned by big corporations and should be inaccessible to normal people.

Basically the "means of generation" will be owned by the capitalists, since they are the only ones with the economic power to license these things.

That is basically the worst case scenario. Not only will the value of work diminish greatly, the advances in productivity will also be only accessible to big capitalists.

Of course, that is basically inevitable anyway. Why wouldn't they want this? It's just sad seeing the stupid morons arguing for this as if they had anything to gain.

[–] [email protected] 4 points 8 months ago

It's just sad seeing the stupid morons arguing for this as if they had anything to gain.

The real money shot here... How did we get to a point where people will argue against common working slave good?

There is a pattern too... Iraq, Afghanistan, Israeli genocide, bailouts. Anytime there is money to be made for the regime, we get a solid 30% of the population working as hard as zealots.

Then two decades later, when the two wars have failed, we can't find a single guy around who supported either war 🤡

The same people are somehow now shilling that we "shouldn't invade Ukraine but Israel needs tools to defend themselves"

[–] [email protected] 13 points 8 months ago (4 children)

I'm getting really tired of saying this over and over on the Internet and getting either ignored or pounced on by pompous AI bros and boomers, but this "there isn't enough free data" claim has never been tested. The experiments that have come close (look up the early Phi and Starcoder papers, or the CommonCanvas text-to-image model) suggested that the claim is false, by showing that a) models trained on small, well-curated datasets can match and outperform models trained on lazily curated large web scrapes, and b) models trained solely on permissively licensed data can perform on par with at least the earlier versions of models trained more lazily (e.g. StarCoder 1.5 performing on par with Code-Davinci). But yes, a social network or other organization that has access to a bunch of data that they own, or have licensed, could almost certainly fine-tune a base LLM trained solely on permissively licensed data to get a tremendously useful tool that would probably be safer and more helpful than ChatGPT for that organization's specific business, at vastly lower risk of copyright claims or toxic generated content, for that matter.

[–] [email protected] 3 points 8 months ago

I hate to say this, but "let the market decide": if AI is something the consumer wants/needs, they'll pay for it; otherwise let it die.

[–] [email protected] -5 points 8 months ago (1 children)

A perfect analogy.

[–] [email protected] 2 points 8 months ago

I don't feel it is. They aren't saying that their physical requirements should be free (computers, engineers, programmers, electricity, etc...) which is what is being used for the analogy (cheese, ingredients, etc...).

It would be better to claim "I run a sandwich shop and couldn't afford to run it if I had to pay for every recipe, idea, and technique I use in the business."

Now, it's not as simple as this, and I'm not claiming it is. But this example isn't anywhere near correct. It's like the old claim that pirating something is the same as stealing it. The usage of one thing doesn't equal the loss of something physical.

It's one of those reasons why laws about this are difficult. Too strict and no one would be able to do "fan"-anything and many other issues ("if it uses AI" takes out many digital tools, etc...), too loose and you don't really have laws at all.

[–] [email protected] 34 points 8 months ago (2 children)

Considering that original works are discarded, it's strange how effective they are at plagiarizing them.

[–] [email protected] -1 points 8 months ago (1 children)

In the same way that a person can learn the material and also use that knowledge to potentially plagiarize it, though. It's no different in that sense. What is different is the speed of learning and both the speed and capacity of recall. However, it doesn't change the fundamental truths of OP's explanation.

Also, when you're talking specifically about music, you're talking about a very limited subset of note combinations that will sound pleasing to human ears. Additionally, even human composers commonly struggle to not simply accidentally reproduce others' work, which is partly why the music industry is filled with constant copyright litigation.

[–] [email protected] 0 points 8 months ago

Yep, it's definitely not possible that nice small businesses like Universal and Sony would sue without an actual case in order to try and crush competitors with costs.

[–] [email protected] 0 points 8 months ago* (last edited 8 months ago) (1 children)

Counteroffer. We eliminate copyright laws altogether. For anyone and everyone.

Let's move to a system in which we fund the projects before their release. And once released, they are available to everyone for free.

Also, let's make a system where everyone can work a basic job, like 20-30 hours a week, and get a living wage, and the rest of the time we can just produce art or any kind of thing for free for anyone, as we'll already have had our needs covered and won't need to monetize every second of our existence.

[–] [email protected] 11 points 8 months ago

And free cotton candy and rainbows for everybody!
