this post was submitted on 18 Jul 2024

TechTakes

1432 readers
16 users here now

Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

founded 1 year ago
MODERATORS
 

we appear to be the first to write up the outrage coherently too. much thanks to the illustrious @self

(page 3) 50 comments
[–] [email protected] 0 points 4 months ago (2 children)

Mistral isn't trained on copyrighted data. It's based on selected databases that were open for use. This article in general is full of false information. But I suppose most people only read the headlines.

[–] [email protected] 0 points 4 months ago (2 children)

https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/8#6527a6fca6eaf92e6c26fa59

Unfortunately we're unable to share details about the training and the datasets (extracted from the open Web) due to the highly competitive nature of the field.

The "open web" is full of copyrighted material.

[–] [email protected] 0 points 4 months ago* (last edited 4 months ago)

We had a social contract!

Mustafa Suleyman

[–] [email protected] 0 points 4 months ago

but it's apache2 sega! tooooootes freebies!

[–] [email protected] 0 points 4 months ago (1 children)
[–] [email protected] 0 points 4 months ago (1 children)
[–] [email protected] 0 points 4 months ago (1 children)

if you're not gonna read the fucken thing then fuck off.

[–] [email protected] 0 points 4 months ago (2 children)

I did read the thing, then provided an article explaining why detecting copyrighted material / determining if something is written by AI is very inaccurate.

Perhaps take your own advice to "read the fucken thing" next time instead of making yourself look like an idiot. Though I doubt you've ever heard of "better to stay silent and let them think you the fool than to speak and remove all doubt".

Btw, I even recall that Ars specifically covered the company you linked to in a separate article as well. I'd be glad to provide it once you've come to your senses and want to discuss things like an adult.

[–] [email protected] 0 points 4 months ago (2 children)

Mistral’s Mixtral-8x7B-Instruct-v0.1 produced copyrighted content on 22% of the prompts.

did you know that a lesser-known side effect of the infinite monkeys approach is that they will produce whole sections of copyright content abso-dupo-lutely by accident? wild, I know! totes coinkeedink!

I’d be glad to provide it once you’ve come to your senses and want to discuss things like an adult

jesus fucking christ you must be a fucking terrible person to work with

I've seen toddlers throw more mature tantrums

[–] [email protected] 0 points 4 months ago (2 children)

I'm too old to argue against bad faith arguments.

Especially with people who won't read the information I provide showing that their initial information was wrong.

One is a company that has something to sell; the other is an article with citations explaining why it's not easy to determine what percentage of a dataset infringes copyright, or whether exact reproduction via "fishing expedition" prompting is a useful metric for determining whether copyrighted material was used in training without authorization.

The dumbest take, though, is attacking Mistral of all LLMs, even though it's released under an Apache 2.0 license.
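
For context, the 22% figure upthread comes from a verbatim-reproduction test: prompt the model with the opening of a protected text and check whether its output reproduces the real continuation word for word. A minimal sketch of that kind of metric, not the actual benchmark behind the number; `generate`, the sample pairs, and the 50-character threshold are all placeholders:

```python
# Sketch of an "exact reproduction" metric: prompt a model with the opening
# of a protected text and check whether its continuation shares a long
# verbatim run with the real continuation.  `generate` stands in for
# whatever model API is being tested; the threshold is an arbitrary choice.

from difflib import SequenceMatcher


def longest_shared_run(a: str, b: str) -> int:
    """Length of the longest verbatim substring shared by a and b."""
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size


def reproduction_rate(samples, generate, threshold=50):
    """Fraction of (prompt, reference_continuation) pairs where the model's
    output shares a verbatim run of at least `threshold` characters with
    the reference continuation."""
    hits = 0
    for prompt, reference in samples:
        completion = generate(prompt)
        if longest_shared_run(completion, reference) >= threshold:
            hits += 1
    return hits / len(samples) if samples else 0.0
```

The dispute in this thread is whether a number produced that way tells you much about the training set, which is what the infinite-monkeys quip above is getting at.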

[–] [email protected] 0 points 4 months ago (1 children)
[–] [email protected] 0 points 4 months ago (2 children)

Well since you want to use computers to continue the discussion, here's also ChatGPT:

Determining the exact percentage of copyrighted data used to train a large language model (LLM) is challenging for several reasons:

  1. Scale and Variety of Data Sources: LLMs are typically trained on vast and diverse datasets collected from the internet, including books, articles, websites, and social media. This data encompasses both copyrighted and non-copyrighted content. The datasets are often so large and varied that it is difficult to precisely categorize each piece of data.

  2. Data Collection and Processing: During the data collection process, the primary focus is on acquiring large volumes of text rather than cataloging the copyright status of each individual piece. While some datasets, like Common Crawl, include metadata about the sources, they do not typically include detailed copyright status information.

  3. Transformation and Use: The data used for training is transformed into numerical representations and used to learn patterns, making it even harder to trace back and identify the copyright status of specific training examples.

  4. Legal and Ethical Considerations: The legal landscape regarding the use of copyrighted materials for AI training is still evolving. Many AI developers rely on fair use arguments, which complicates the assessment of what constitutes a copyright violation.

Efforts are being made within the industry to better understand and address these issues. For example, some organizations are working on creating more transparent and ethically sourced datasets. Projects like RedPajama aim to provide open datasets that include details about data sources, helping to identify and manage the use of copyrighted content more effectively.

Overall, while it is theoretically possible to estimate the proportion of copyrighted content in a training dataset, in practice, it is a complex and resource-intensive task that is rarely undertaken with precision.
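
As an aside on point 2 and the RedPajama note above: "details about data sources" in practice means per-document metadata, and you can only tally whatever license information a corpus actually carries. A minimal sketch, assuming a JSONL corpus with a hypothetical `meta.license` field (real corpora differ, and most web-crawled records carry no license field at all):

```python
# Minimal sketch of what "datasets that include details about data sources"
# buys you in practice: you can only count license metadata that the corpus
# actually carries.  The field names ("meta", "license") are hypothetical;
# most web-crawled records simply have no declared license.

import json
from collections import Counter


def tally_licenses(path: str) -> Counter:
    """Count declared licenses in a JSONL corpus, one document per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            license_tag = record.get("meta", {}).get("license")
            counts[license_tag or "undeclared"] += 1
    return counts


# e.g. tally_licenses("corpus.jsonl")
#   -> Counter({"undeclared": 9_812_441, "cc-by-4.0": 120_033, ...})
```

In a web-scale crawl the "undeclared" bucket tends to dominate, which is exactly why nobody can name an exact percentage.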

[–] [email protected] 0 points 4 months ago

you should speak to a physicist, they might be able to find a way your density can contribute to science

[–] [email protected] 0 points 4 months ago (1 children)

"exact percentage"

just fuck right off. wasting my fucken time.

[–] [email protected] 0 points 4 months ago (2 children)

You're the one who linked to an exact percentage, not me. Have a good day.

[–] [email protected] 0 points 4 months ago (2 children)

re-read your chatgpt response and think about whether the percentages in my original link could be too high or too low.

[–] [email protected] 0 points 4 months ago

but, like, really think this time. at this point i'm not arguing with you, i'm trying to help you.

[–] [email protected] 0 points 4 months ago (1 children)

no, you utter fucking clown. they're literally posting to take the piss out of you, the only person in the room who isn't getting that everyone is laughing at them, not with them

[–] [email protected] 0 points 4 months ago

Whatever makes you feel better buddy

[–] [email protected] 0 points 4 months ago (2 children)

I've read the article you posted: it does not refute the fucking datapoint provided, it literally DOES NOT EVEN MENTION MISTRAL AT ALL.

so all I can tell you is to take your pearlclutching tantrum bullshit and please fuck off already

[–] [email protected] 0 points 4 months ago (8 children)

god these weird little fuckers’ ability to fill a thread with garbage is fucking notable isn’t it? something about loving LLMs makes you act like an LLM. how depressing for them.

[–] [email protected] 0 points 4 months ago

Yes, clearly I'm the one throwing a tantrum 🙄

Btw, you can just fact check my claim about what Mistral is licensed under. The article talks about copyright and AI detection in general, which anyone with basic critical thinking skills could then understand would apply to other LLMs like Mistral.

You might want to look up what pearl clutching means as well. You're using it wrong:

https://dictionary.cambridge.org/us/dictionary/english/pearl-clutching

Considering I've done the opposite of a shocked reaction. While you're at it, maybe also look up "projection"

https://www.psychologytoday.com/us/basics/projection

Anyhow, have a good day.

[–] [email protected] 0 points 4 months ago

she wrote harry potter with an llm, didn't she?

[–] [email protected] 0 points 4 months ago (1 children)

you're conflating "detecting ai text" with "detecting an ai trained on copyrighted material"

send the relevant article or shut up

[–] [email protected] 0 points 4 months ago (6 children)

Ignoring the logical inconsistency you just spouted for a moment (can't tell if it's written by AI but knows it used copyrighted material? Do you not hear yourself?), you do realize Mistral is released under the Apache 2.0 license, a highly permissive scheme that has no restrictions on use or reproduction beyond attribution, right?

I think it's clear, however, that you're arguing in bad faith, with no intention of changing your misinformed opinion at this point. Perhaps you'd enjoy an echo chamber like the "fuckai" Lemmy instance.

[–] [email protected] 0 points 4 months ago

holy shit you really are quite dumb. the fuck is wrong with you?

actually don’t answer that

[–] [email protected] 0 points 4 months ago (1 children)
[–] [email protected] 0 points 4 months ago

the reading comprehension of an llm and the contextual capacity of a gnat

[–] [email protected] 0 points 4 months ago (25 children)

Great, just as I've decided to switch some services to Proton (mail and VPN).

Now I'll have to reconsider this decision.

[–] [email protected] 0 points 4 months ago

the usefulness of any feature should be measured in how deep you can bury its "opt-in" option in the settings pages without hurting its adoption

[–] [email protected] 0 points 4 months ago* (last edited 4 months ago) (3 children)

"Pro privacy" company that cucked to the state to get a climate activist arrested (against their privacy policy that they sneakily change after the fact) are actually a bunch of typical corporate grifters that sell out their userbase to promote shitty llm garbage? Nawwwwwww. Say it ain't so! It's like every week or month after I argue about these shitty fake privacy companies with idiots in c/privacy I recieve massive vindication. Maybe this is my sign to become a man of faith.

[–] [email protected] 0 points 4 months ago (12 children)

What's your alternative to the fake privacy company? I'm assuming the correct answer is: if your threat model does not include governments, self-host your email; if it does include governments, probably don't use email at all.

[–] [email protected] 0 points 4 months ago (1 children)

Self hosted email is its own can of worms. I wouldn't recommend it to anyone outside of experienced IT people. You'll end up blacklisted before you send your first email if you do anything wrong (and there's a lot that can go wrong), and it doesn't solve any security problems email has.

Anything sent over email just isn't private. That goes for Proton customers when they send or receive anything from a non-Proton address too. The one thing privacy email providers can actually do is keep your inbox from being scanned by LLMs and advertisers. That doesn't prevent the inboxes and outboxes of your contacts from being scanned, though.

If you use email, the best thing you can do is be mindful of what kinds of information you send through it. Use aliases via services like SimpleLogin or AnonAddy when possible. Having your email address leaked is a security vulnerability. Once bad actors have your email, they have half of what they need to breach multiple accounts.

[–] [email protected] 0 points 4 months ago* (last edited 4 months ago) (9 children)

I have been that sysadmin setting up a company email server. postfix is trivial to set up, absolutely the easiest experience. What followed, though, was weeks of supplicant emails to MS begging them please not to block us. My recommendation was: never do this again, use a third-party outgoing email vendor, email is lost.
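
For anyone tempted anyway: part of "a lot that can go wrong" is just missing DNS records. A minimal pre-flight sketch (it assumes the `dig` CLI is installed, checks only SPF and DMARC TXT records, and says nothing about DKIM signing, reverse DNS, or IP reputation, which is where the weeks of begging actually come in):

```python
# Pre-flight check for a self-hosted mail domain: do SPF and DMARC TXT
# records exist?  Assumes the `dig` command-line tool is available.
# This is necessary but nowhere near sufficient for deliverability.

import subprocess


def txt_records(name: str) -> list[str]:
    """Return TXT records for a DNS name via dig."""
    out = subprocess.run(
        ["dig", "+short", "TXT", name],
        capture_output=True, text=True, check=True,
    )
    return [line.strip('"') for line in out.stdout.splitlines()]


def check_mail_dns(domain: str) -> dict:
    """Report whether SPF and DMARC records are published for the domain."""
    spf = any(r.startswith("v=spf1") for r in txt_records(domain))
    dmarc = any(r.startswith("v=DMARC1") for r in txt_records(f"_dmarc.{domain}"))
    return {"spf": spf, "dmarc": dmarc}


# check_mail_dns("example.com") -> {"spf": True, "dmarc": False}, say
```

Passing this kind of check is the easy part; the big providers' reputation systems are the part you can't script your way around.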

[–] [email protected] 0 points 4 months ago* (last edited 4 months ago) (1 children)

I'm also sick of hearing about Swiss privacy laws. Their intelligence service got busted covering for a US and German spy front operation in Switzerland. If it happened once, I promise it has happened before and since.

Edit for those who can't click: a front company in Switzerland sold fake encrypted communications services around the world for years, possibly decades, with the assistance of Swiss intelligence agencies.

[–] [email protected] 0 points 4 months ago (1 children)

it's as if swiss privacy laws are only useful for big time, pre-crypto, old school money laundering

[–] [email protected] 0 points 4 months ago (1 children)

I must have missed the climate activist getting arrested because of protonmail. Any link or a name to search from?

[–] [email protected] 0 points 4 months ago (3 children)
[–] [email protected] 0 points 4 months ago

as we said in the article, they can't ignore subpoenas

[–] [email protected] 0 points 4 months ago

Well that's disappointing as hell...

[–] [email protected] 0 points 4 months ago
[–] [email protected] 0 points 4 months ago

Oh for fucks sake. I don't use their email but I don't want to have to switch VPN service AGAIN.
