this post was submitted on 24 Apr 2025
103 points (100.0% liked)

[–] [email protected] 1 points 2 days ago (1 children)

Exactly.

But Ed said you could also use your own models and train them yourself.

[–] [email protected] 2 points 2 days ago* (last edited 2 days ago)

From my own fractured understanding, this is indeed true, but the "DeepSeek" everybody is excited about, which performs as well as OpenAI's best products but faster, is a prebuilt flagship model called R1. (Benchmarks here.)

The training data will never see the light of day. It would be an archive of every ebook under the sun, every scraped website, just copyright infringement as far as the eye can see. That's the source they would have to release for the model to be truly open source, and I doubt they ever will.

DeepSeek does publish the code for "distilling" larger, more complex models into something smaller and faster (and a bit worse) - but, of course, the input models are themselves not open source, because those models (like Facebook's restrictive Llama model) were also trained on stolen data. (I've downloaded a couple of these distillations just to mess around with them. It feels like having a dumber, slower ChatGPT in a terminal.)
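
If you want to try that yourself, here's roughly what the "ChatGPT in a terminal" experience looks like with the Hugging Face transformers library. This is just a minimal sketch, assuming you have the libraries installed, enough memory for a ~7B model, and that you use one of DeepSeek's published distillation IDs (swap in whichever one you actually grab):

```python
# Minimal "dumber, slower ChatGPT in a terminal" using one of the R1 distills.
# Assumes: pip install torch transformers accelerate, and enough RAM/VRAM
# for a ~7B model. The model ID below is one of DeepSeek's published
# distillations; use whichever one you downloaded.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so it fits in less memory
    device_map="auto",          # spread layers across GPU/CPU as available
)

while True:
    prompt = input("you> ")
    if prompt.strip().lower() in {"quit", "exit"}:
        break
    # Format the prompt with the model's chat template, then generate.
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=512)
    # Print only the newly generated tokens, not the prompt we fed in.
    print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```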

Theoretically, you could train a model using DeepSeek's open-source code and ethically sourced input data, but that would be quite the task. Most people just add an extra layer of training data on top of an existing model and call it a day. Here's one such example (I hate it). I can't even imagine how much data you would have to create yourself to train one of these things from scratch. George R.R. Martin himself probably couldn't train an AI to speak comprehensibly by feeding it his life's work.
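
For what it's worth, that "extra layer of training data" step is fine-tuning, not training from scratch. Here's a hedged sketch of what it looks like with the transformers/peft/datasets stack - the corpus path, hyperparameters, and output directory are all made up for illustration, and the base model is the same distill as above:

```python
# Sketch of "adding an extra layer of training data": LoRA fine-tuning an
# existing model on your own text instead of training one from scratch.
# Assumes: pip install torch transformers datasets peft accelerate
# "my_corpus.txt" and every hyperparameter here are placeholders, not a recipe.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # same distill as above
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)  # in practice you'd shard/quantize a 7B

# LoRA trains a small set of adapter weights on top of the frozen base model.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# "my_corpus.txt" is a stand-in for whatever text you actually own the rights to.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        learning_rate=2e-4,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("finetuned")  # saves just the small adapter weights
```

Even this shortcut still needs serious hardware and a meaningful amount of text you actually own, which is kind of the point about how hopeless training from scratch on only your own writing would be.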