ocassionallyaduck

joined 2 years ago
[–] [email protected] 1 points 1 week ago* (last edited 1 week ago)

Ingesting all the artwork you ever created by obtaining it illegally and feeding it into my plagiarism remix machine is theft of your work, because I did not pay for it.

Separately, keeping a copy of this work so I can do this repeatedly is also stealing your work.

The judge ruled the first was okay but the second was not, because the first is "transformative". Sadly, that tells me the judge, despite best efforts, does not understand how a weighted matrix of tokens works. They may have some prevention steps in place now, but early models showed the tech for what it was, regurgitating training text with only minor differences in word choice here and there.

Current models have layers on top that try to catch this kind of prompt, but escaping those safeguards is common, and it's also only masking the fact that the entire model is built off the theft of others' work.

[–] [email protected] 0 points 1 week ago (4 children)

There is nothing intelligent about "AI" as we call it. It parrots based on probability. If you remove the randomness value from the model, it parrots the same thing every time based on its weights, and if the weights were trained on Harry Potter, it will consistently give you giant chunks of Harry Potter verbatim when prompted.

Most of the LLM services attempt to avoid this by adding arbitrary randomness values to churn the soup. But this is also inherently part of the cause of hallucinations, as the model cannot preserve a single correct response as always the right way to respond to a certain query.
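To make the "randomness value" concrete: at each step the model turns its weights into scores over possible next tokens, and a temperature knob controls how far it strays from the single highest-scoring one. A minimal toy sketch in plain Python (the logits here are made up purely for illustration, not from any real model):

```python
import math
import random

def sample_next_token(logits, temperature):
    """Pick the next token index from raw scores ("logits").

    temperature == 0 -> always take the single highest-scoring token
    (deterministic: the model parrots the same continuation every time).
    temperature > 0  -> soften the scores and sample, which adds the
    "churn" that also lets lower-probability (sometimes wrong) tokens in.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

# Toy scores for four candidate tokens (entirely made up).
logits = [2.0, 1.5, 0.3, -1.0]

print([sample_next_token(logits, 0.0) for _ in range(5)])  # always [0, 0, 0, 0, 0]
print([sample_next_token(logits, 1.0) for _ in range(5)])  # varies run to run
```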

LLMs are insanely "dumb", they're just lightspeed parrots. The fact that Meta and these other giant tech companies claim it's not theft because they sprinkle in some randomness is just obscuring the reality and the fact that their models are derivative of the work of organizations like the BBC and Wikipedia, while also dependent on the works of tens of thousands of authors to develop their corpus of language.

In short, there was an ethical way to train these models. But that would have been slower. And the court just basically gave them a pass on theft. Facebook would have been entirely in the clear had it not stored the books in a dataset, which in itself is insane.

I wish I knew when I was younger that stealing is wrong, unless you steal at scale. Then it's just clever business.

[–] [email protected] 12 points 1 week ago (9 children)

Terrible judgement.

Turn the K value down on the model and it reproduces text near verbatim.
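For anyone unfamiliar with the knob presumably being described here (top-k sampling): it throws away all but the k highest-scoring candidate tokens before sampling, so at k = 1 there is no randomness left and the model just replays whatever continuation its weights rank highest, which is how memorized text comes back near verbatim. A rough sketch with made-up scores:

```python
import math
import random

def top_k_sample(logits, k):
    """Keep only the k highest-scoring tokens, softmax them, then sample.

    With k == 1 this collapses to pure argmax: the model always emits
    its single most likely continuation.
    """
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:k]
    exps = [math.exp(logits[i] - logits[ranked[0]]) for i in kept]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(kept, weights=probs)[0]

logits = [4.2, 1.1, 0.7, 0.2]                        # toy scores, not from a real model
print([top_k_sample(logits, 1) for _ in range(5)])   # always token 0
print([top_k_sample(logits, 4) for _ in range(5)])   # occasional variety
```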

[–] [email protected] 0 points 3 weeks ago (1 children)

Provided there is an "upper limit" on the scale we're talking about, I've often wondered: couldn't private users also host a sharded copy of a server instance to offset load and bandwidth? Like Folding@Home, but for site support.

I realize this isn't exactly feasible today for most infra, but if we're trying to "solve" the problem, imagine if you were able to voluntarily give up, say, 100 GB of HDD space and have your PC host 2-3% of an instance's server load for a month or something. Or maybe just be a CDN node for the media- and bandwidth-heavy parts to ease server load, while the server code runs on different machines.

This kind of distributed "load balancing" on private hardware may be a complete pipe dream today, but I think it might be the way federated services need to head. I can tell you that if we could get it as simple as volunteers spinning up a Docker container, then dropping the generated WireGuard key and their IP into a "federate" form to hand the mini-node over to an instance, it would be a lot easier to support sites this way.

Speaking for myself, I have enough bandwidth and space that I could lend some compute and offset a small amount of traffic. But the full load of a popular instance would be more than my simple home setup is equipped for. If contributing hosting were as easy as contributing compute, it could have a chance to catch on.
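As a sketch of the "CDN node for the media-heavy parts" idea above: the volunteer machine serves cached copies of static media so the origin instance only gets hit once per file. A minimal, hypothetical version using only the Python standard library (the origin URL and port are placeholders, and a real deployment would need auth, cache eviction, and the WireGuard tunnel mentioned earlier):

```python
import hashlib
import pathlib
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ORIGIN = "https://lemmy.example.org"   # hypothetical instance to offload
CACHE_DIR = pathlib.Path("./media-cache")
CACHE_DIR.mkdir(exist_ok=True)

class MediaCacheHandler(BaseHTTPRequestHandler):
    """Serve media from local disk, fetching from the origin on a cache miss."""

    def do_GET(self):
        cache_file = CACHE_DIR / hashlib.sha256(self.path.encode()).hexdigest()
        if not cache_file.exists():
            # Cache miss: pull the file from the origin once and store it.
            with urllib.request.urlopen(ORIGIN + self.path) as resp:
                cache_file.write_bytes(resp.read())
        body = cache_file.read_bytes()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Volunteer node listens locally; the instance would route media
    # requests here over the WireGuard tunnel.
    ThreadingHTTPServer(("0.0.0.0", 8080), MediaCacheHandler).serve_forever()
```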

[–] [email protected] 0 points 1 month ago

Bring them nothing, admit nothing. Consult a lawyer for a few hundred dollars to get advice.

Depending on the size of the company and the sensitivity of the information in question, you just need to say very noncommittal and very specific phrases (lawyer speak):

I understand your concerns, however I think there has been a misunderstanding. How can I help resolve your concerns?

Again, consult a lawyer. And tread the fine line between being contrite and admitting anything. Even in my advice I almost worded it wrong. See a lawyer. Admit nothing concrete, because your admissions become fact. Give them nothing without a lawyer, and even if you truly did hand over the only PC and USB stick with the files on it, acknowledge nothing directly (again, that becomes an admission in a legal sense). Whatever evidence they have right now, it is on them to prove intent unless you do it for them. And they don't want to spend on legal fees for no reason.

In the end you want them to have the feeling that this was truly a misunderstanding while giving no admissions to concrete actions like copying, moving, or replicating any data. They may have records, but that's a different matter.

It may be that a lawyer could just act as an intermediary to clarify things and ask what evidence they would need to feel assured their IP is safe. Note: most documents produced while at an employer are technically company IP in most cases, and you aren't alone in wanting to reference your own code work later, but many companies are paranoid about this for good reason.

Try not to lose your nerve. You just need to plant your feet, and keep in mind you have done nothing wrong morally. You have fucked up technically, but your mission now is to keep their trust in you somehow without digging your hole any deeper. So again, speak with a lawyer and get the right language.

The company does not want to waste money suing you. So don't hand them an obvious violation that they have to pursue. Try to keep it together. Remember your intent, and convey it as obliquely and indirectly as you can.

I don't want any issues with XXXX corp, and always planned to have a good reference from my time here. The files XXX Corp is concerned with today hold no value to me, only my personal photos and data do. If you could acknowledge there's been no wrongdoing I think we can easily resolve all this.

Avoid the temptation to let them set the terms too much. They will ask for devices and hardware. What you want is a document clearing you of any wrongdoing before you provide anything that could be used against you. Then handing things over might be okay. Again. Speak to a lawyer, even briefly.

[–] [email protected] 0 points 7 months ago (1 children)

Thing is, for your average user with no GPU who never thinks about RAM, running a local LLM is intimidating. But it shouldn't be. Any system with an integrated GPU, and the more RAM the better, can run simple models locally.

The not-so-dirty secret is that ChatGPT 3 vs 4 isn't that big a difference, and neither is leaps and bounds ahead of the publicly available models for about 99% of tasks. For that 1% people will ooh and aah over it, but 99% of use cases are only seeing marginal gains on 4o.

And the simplified models that run "only" 95% as well? They can use 90% fewer resources and give pretty much identical answers outside of hyperspecific use cases.

Running a "smol" model, as some are called, gets you all the bang for none of the buck, and your data stays on your system and never leaves.
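For the curious, running one of these small models locally is only a few lines. A minimal sketch assuming the Hugging Face transformers library and a small instruct model (the model name below is just an example, swap in whatever fits your RAM; it runs on CPU and nothing leaves your machine):

```python
# pip install transformers torch
from transformers import pipeline

# A small ("smol") instruct model as an example; device=-1 forces CPU,
# so no GPU is required.
generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",
    device=-1,
)

result = generator(
    "Explain in one sentence why running a language model locally keeps your data private.",
    max_new_tokens=60,
)
print(result[0]["generated_text"])
```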

I've been yelling from the rooftops to some stupid corporate types that once the model is trained, it's trained. Unless you are training models yourself, there is no need for the massive AI clusters; you just need the model. Run it locally on your hardware at a fraction of the cost.