this post was submitted on 22 Dec 2024
1569 points (97.5% liked)

Technology


It's all made from our data, anyway, so it should be ours to use as we want

(page 4) 50 comments
[–] [email protected] 61 points 1 day ago (5 children)

A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

I'm not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

[–] [email protected] 1 points 1 day ago

Doesn't seem like this helps out all the writers / artists that the LLM stole from.

[–] [email protected] 3 points 1 day ago (1 children)

Are you threatening me with a good time?

First of all, whether these LLMs are "illegally trained" is still a matter before the courts. When an LLM is trained it doesn't literally copy the training data, so it's unclear whether copyright is even relevant.

Secondly, I don't think that making these models "public domain" would have the negative effects that people angry about AI think it would. When a company is running a closed model internally, like ChatGPT for example, the model is never available for download in the first place. It doesn't matter if it's public domain or not because you can't get a copy of it. When a company releases an open-weight model for public use, on the other hand, they usually encumber them with some sort of license that makes them harder for competitors to monetize or build on. Making those public-domain would greatly increase their utility. It might make future releases less likely, but in the meantime it'll greatly enhance AI development.

[–] [email protected] 2 points 1 day ago (4 children)

The LLM does reproduce copyrighted data though.

[–] [email protected] 85 points 1 day ago (7 children)

So banks will be public domain when they're bailed out with taxpayer funds, too, right?

[–] [email protected] 10 points 1 day ago* (last edited 1 day ago) (1 children)

Public domain wouldn't be the right term for banks being publicly owned. At least for the normal usage of Public Domain in copyright. You can copy text and data, you can't copy a company with unique customers and physical property.

[–] [email protected] 58 points 1 day ago (3 children)

They should be, but currently it depends on the type of bailout, I suppose.

For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank's assets, and now effectively owns the bank.

[–] [email protected] 25 points 1 day ago (1 children)

Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.

[–] [email protected] 13 points 1 day ago (1 children)

It's like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same "buy an album from a record store" model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.

Spotify's solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their "buy an album" business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.

I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.

[–] [email protected] 6 points 1 day ago (1 children)

Bandcamp still runs on this model though, and quite well.

[–] [email protected] 8 points 1 day ago (1 children)

It's also one of the few places that have lossless audio files available for download. I'm a big fan of Bandcamp. I like having all my music local.

[–] [email protected] 9 points 1 day ago* (last edited 1 day ago) (2 children)

The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs... probably on par with a steel mill running for a long time, a comparative drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don't have much pressure to optimize GPU usage.

Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to fruition.

Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.

[–] [email protected] 2 points 1 day ago (1 children)

Doesn't OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?

What are bitnet models and what does that change in a nutshell?

[–] [email protected] 4 points 1 day ago* (last edited 1 day ago)

What are bitnet models and what does that change in a nutshell?

Read the pitch here: https://github.com/ridgerchu/matmulfreellm

Basically, using ternary weights (each weight is -1, 0, or +1), all inference-time matrix multiplication can be replaced with much simpler additions and subtractions. This is theoretically more efficient on GPUs, and astronomically more efficient on dedicated hardware (as adders take up a fraction of the silicon area that multipliers do). This would be particularly fantastic for, say, local inference on smartphones or laptop ASICs.
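To make the ternary-weight idea concrete, here is a minimal pure-Python sketch, not code from the linked repo. It assumes weights restricted to -1, 0, and +1; the function names `ternary_matvec` and `reference_matvec` are hypothetical and exist only for this illustration. It shows that a matrix-vector product with such weights needs no multiplications at all.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, +1}) by a vector
    using only additions and subtractions."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            # 0 weight: the input is skipped entirely
        out.append(acc)
    return out


def reference_matvec(weights, x):
    """Ordinary multiply-accumulate version, for comparison only."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Toy 2x3 ternary weight matrix and input vector (illustrative values).
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]

result = ternary_matvec(W, x)
assert result == reference_matvec(W, x)
```

On real hardware the gain would come from replacing multiplier circuits with adders, which this Python loop can only mimic, not measure.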

The catch is no one has (publicly) risked a couple of million dollars to test it with a large model, as (so far) training it isn't more efficient than "regular" LLMs.

Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?

No one really knows, because they're so closed and opaque!

But it appears that their models perform relatively poorly for their "size." Qwen is nearly matching GPT-4 in some metrics, yet is probably an order of magnitude smaller, while Google/Claude and some Chinese models are also pulling ahead.

[–] [email protected] 38 points 1 day ago (4 children)

It could also contain non-public domain data, and you can't declare someone else's intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.

Laws are never simple.

[–] [email protected] 7 points 1 day ago

It wouldn't contain any non-public-domain data, though. That's the thing with LLMs: once they're trained on data, the data is gone, folded into the series of weights in the model somewhere. If it ingested something private like your tax data, it couldn't re-create your tax data on command; that data is now gone. But if it's seen enough private tax data, it could produce something that looks a lot like a tax return to an untrained eye, though a tax accountant would easily see flaws in it.

[–] [email protected] 4 points 1 day ago

Right, like I did. They're safeguarding Disney and other places like that now. It's just the little guys who get screwed.

https://imgur.com/a/these-are-new-niki-mice-drawings-phone-company-chainsaws-merms-donut-logos-burger-mc-winfruit-computers-republunch-political-party-logos-Rhgi0OC

[–] [email protected] 13 points 1 day ago (1 children)

So what you're saying is that there's no way to make it legal and it simply needs to be deleted entirely.

I agree.

[–] [email protected] 4 points 1 day ago (13 children)

There's no need to "make it legal", things are legal by default until a law is passed to make them illegal. Or a court precedent is set that establishes that an existing law applies to the new thing under discussion.

Training an AI doesn't involve copying the training data, the AI model doesn't literally "contain" the stuff it's trained on. So it's not likely that existing copyright law makes it illegal to do without permission.

[–] [email protected] 17 points 1 day ago (1 children)

Forcing a bunch of neural weights into the public domain doesn't make the data they were trained on also public domain, in fact it doesn't even reveal what they were trained on.

[–] [email protected] 127 points 1 day ago* (last edited 1 day ago) (6 children)

It won't really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

[–] [email protected] 6 points 1 day ago* (last edited 1 day ago) (1 children)

If we can't train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.

Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.

The field would still remain hyper competitive for artists and other trades that are affected by AI. It would only cause all the new AI based tools to be behind expensive censored subscription models owned by either Microsoft or Google.

I think forcing all models trained on unlicensed data to be open source is a great idea but actually rooting for civil lawsuits which essentially entail a huge broadening of copyright laws is simply foolhardy imo.

[–] [email protected] 0 points 1 day ago (4 children)

Unlicensed from the POV of the trainer, meaning they didn't contact or license content from someone who didn't approve. If it's posted under Creative Commons, that's fine. If it's otherwise posted that it's not open in any other way and not for corporate use, then they need to contact the owner and license it.

[–] [email protected] 6 points 1 day ago

It's already illegal in some form. Via piracy of the works and regurgitating protected data.

The issue is a megacorp with many rich investors vs. everyone else. If this were some university student, their life would probably be ruined, like what happened to Aaron Swartz.

The US justice system is different for different people.

[–] [email protected] 26 points 1 day ago (1 children)

Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.

[–] [email protected] -3 points 1 day ago (1 children)

Not really. The same way you can't sell live and public performance music for profit and not get sued. Case law right there, and the fact it's performance vs publicly published doesn't matter. How the owner and originator classifies or licenses it is the defining classification. It's going to be years before anyone sees this get a ruling in court though.

[–] [email protected] 11 points 1 day ago (14 children)

That's not what's going on here, though. The LLM model doesn't contain the actual copyrighted data, it's the result of analyzing the copyrighted data.

An analogous example would be a site like TV Tropes. TV Tropes doesn't contain the works that it's discussing, it just contains information about those works.

[–] [email protected] 35 points 1 day ago (1 children)

They pulled a very pubic and out in the open data heist

Oh no, not the pubes! Get those curlies outta here!

[–] [email protected] 14 points 1 day ago

Best correction ever. Fixed. ♥️

[–] [email protected] 3 points 1 day ago (2 children)

But wouldn't making it open source, and then having it not function properly without the data, prove that it is using a huge amount of unlicensed data?

Probably not "burden of proof in a court of law" prove though.

[–] [email protected] 2 points 1 day ago* (last edited 1 day ago) (1 children)

In civil matters, the burden of proof is usually just a preponderance of the evidence, not proof beyond a reasonable doubt. In other words, to win a lawsuit you only need more compelling evidence than the other side.

[–] [email protected] 4 points 1 day ago (1 children)

But you still have to have EVIDENCE. Not derivative evidence. The output of a model could be argued to be hearsay because it's not direct evidence of originating content, it's derivative.

You'd have to have somebody backtrack generations of model data to even find snippets of something that defines copyright material, or a human actually saying "Yes, we definitely trained on unlicensed data".

[–] [email protected] 3 points 1 day ago

so like I am not making any comment on anything but the legal system here. but it’s absolutely the case that you can win a lawsuit on purely circumstantial evidence if the defense is unable to produce a compelling alternative set of circumstances which can lead to the same outcome.

[–] [email protected] 8 points 1 day ago (1 children)

Making it open source doesn't change how it works. It doesn't need the data after it's been trained. Most of these AIs are just figuring out patterns to look for in the new data it comes across.

[–] [email protected] 3 points 1 day ago (2 children)

So you're saying the data wouldn't exist anywhere in the source code, but it would still be able to answer questions based on the data it has previously seen?

[–] [email protected] 16 points 1 day ago (1 children)

That is how LLMs work: they don't store the data as data, but as weight values.
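As a toy analogy only (a two-parameter line fit, nowhere near an actual LLM), the sketch below shows what "stored as weight values" can look like: training turns a dataset into a couple of numbers, and predictions afterwards use only those numbers. The function name `fit_line` and the data are hypothetical. Real LLMs have billions of weights and, as noted elsewhere in this thread, can still reproduce some training text, so this illustrates the mechanism and does not settle the legal question.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training data": four made-up points that lie on y = 2x + 1.
training_data = [(0, 1), (1, 3), (2, 5), (3, 7)]

# "Training": the dataset is reduced to two weight values.
slope, intercept = fit_line(training_data)

# The data can now be thrown away; the model is just the two numbers.
del training_data

# "Inference" on an input the model never saw uses only the weights.
prediction = slope * 10 + intercept
```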

[–] [email protected] 1 points 1 day ago (4 children)

So then why, if it were all open sourced, including the weights, would the AI be worthless? Surely having an identical but open source version would strip profitability from the original paid product.
