Microsoft CEO of AI: Online content is 'freeware' for models • The Register (www.theregister.com)

submitted 4 months ago by [email protected] to c/[email protected]

[–] [email protected] 24 points 4 months ago (1 children)

I'm fine with that, but let's put some rules around it.

Any AI model should be able to determine the source of its data to a defined level of accuracy.
There should be a well-defined way to block data from being used by AI. If one of these mechanisms (e.g. robots.txt) has been breached, the model has to be rebuilt without the data, and reparations made to the content owners.
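
As a rough illustration of what honoring such a block could look like on the crawler side, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the helper `may_crawl`, and the example.com URLs are made up for illustration. robots.txt is only advisory, so this shows the check a well-behaved crawler would run, not any kind of enforcement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"
    print("ExampleAIBot allowed:", may_crawl(ROBOTS_TXT, "ExampleAIBot", url))
    print("SomeBrowser allowed:", may_crawl(ROBOTS_TXT, "SomeBrowser", url))
```

A crawler that skips this check, or runs it and ignores the answer, is the "breach" case described above.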

[–] [email protected] 4 points 4 months ago (3 children)

What you're asking for is literally impossible.

A neural network is basically nothing more than a set of weights. If one word makes a weight go up by 0.0001 and then another word makes it go down by 0.0001, and you do that billions of times for billions of weights, how do you determine what in the data created those weights? Every single thing that's in the training data had some kind of effect on everything else.

It's like combining billions of buckets of water together in a pool and then taking out 1 cup from that and trying to figure out which buckets contributed to that cup. It doesn't make any sense.
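
A toy sketch of that entanglement, assuming a single-weight linear model trained with plain SGD (nowhere near a real network; `sgd_deltas` and the three data points are invented for illustration). The final weight is just the sum of the per-example nudges, but each nudge depends on the weight at that moment, so taking one example out changes the nudges of every example after it.

```python
def sgd_deltas(examples, lr=0.1, w=0.0):
    """Fit y ~ w * x with plain SGD; return the final weight and each example's update."""
    deltas = []
    for x, y in examples:
        grad = 2 * (w * x - y) * x  # derivative of (w*x - y)**2 with respect to w
        step = -lr * grad
        w += step
        deltas.append(step)
    return w, deltas


data = [(1.0, 2.0), (2.0, 3.0), (1.5, 3.5)]

w_full, d_full = sgd_deltas(data)      # trained on all three examples
w_less, d_less = sgd_deltas(data[1:])  # same data minus the first example

if __name__ == "__main__":
    print("updates with all data:     ", d_full)
    print("updates without example 0: ", d_less)
    print("final weights:", w_full, w_less)
```

The update attributed to the example `(2.0, 3.0)` is different in the two runs, even though the example itself is identical.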

[–] [email protected] 0 points 4 months ago

Sounds like homeopathy lol

[–] [email protected] 3 points 4 months ago

It’s not impossible lol. All a company would need to do is keep track of where it gets its content. If I use a script to download as much of the internet as possible and end up with a bunch of copyrighted content, I could still get in trouble. Hell, there was even a guy arrested for downloading from JSTOR without authorization. Stop letting these guys get away with crimes just because you like the idea of the end product.

[–] [email protected] 11 points 4 months ago (1 children)

Respectfully, I worked for Alexa AI on compositional ML, and we were largely able to do exactly this with customer utterances, so to say it is impossible is simply not true. Many companies need at least some ability to remove troublesome data, and while tracing data inside a model is rather difficult (historically it would be done while building datasets or measured at evaluation time), it's definitely something that most big tech companies will do.
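
For a sense of what tracking at dataset-building time can look like, here is a minimal sketch. It is not how Alexa AI or any particular company does it; `SourceRecord`, `catalog`, `drop_domain`, and the example URLs are all hypothetical. The idea is simply to record where each document came from, plus a content hash, before anything is trained on it.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SourceRecord:
    """Provenance entry for one training document (illustrative schema)."""
    url: str
    sha256: str


def catalog(documents: dict[str, str]) -> list[SourceRecord]:
    """Build one provenance record per document, keyed by its source URL."""
    return [
        SourceRecord(url, hashlib.sha256(text.encode("utf-8")).hexdigest())
        for url, text in documents.items()
    ]


def drop_domain(records: list[SourceRecord], domain: str) -> list[SourceRecord]:
    """Remove every record whose URL hostname matches an opted-out domain."""
    return [r for r in records if urlparse(r.url).hostname != domain]


if __name__ == "__main__":
    records = catalog({
        "https://a.example/post/1": "first document",
        "https://b.example/post/2": "second document",
    })
    kept = drop_domain(records, "a.example")
    print([r.url for r in kept])
```

This only covers the dataset side. It says nothing about which weights a document influenced once a model has been trained on it.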

[–] [email protected] 2 points 4 months ago

Sorry, I misinterpreted what you meant. You said "any AI models", so I thought you meant the model itself should somehow know where its data came from. Obviously the companies training the models can catalog their data sources.

But besides that, if you work on AI you should know better than anyone that removing training data is counter to the goal of fixing overfitting. You need more data to make the model more generalized. All you'd be doing is making it more likely to reproduce existing material because it has less to work off of. That's worse for everyone.