This post was submitted on 22 Jul 2024
1 point (100.0% liked)

Hacker News

A mirror of Hacker News' best submissions.

1 comment
[email protected] 0 points 3 months ago

This is the best summary I could come up with:


For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models. Increasingly, the websites that supply that material are restricting crawlers' access to it.

Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method that lets website owners keep automated bots from crawling their pages by publishing a file called robots.txt.
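
As an aside, the protocol itself is simple enough to poke at with Python's standard library. Below is a minimal sketch that parses a made-up robots.txt and checks which crawlers it allows; the bot names, paths and site are illustrative assumptions, not taken from any real robots.txt.

import urllib.robotparser

# Illustrative robots.txt: block a hypothetical AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The hypothetical AI crawler is asked to stay away from the whole site...
print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/story"))   # False
# ...while other bots are still allowed to crawl it.
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/story"))   # True

Worth remembering that the file is only a request: whether a crawler honors it is entirely up to the crawler, which is part of why the blowback described below matters.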

Some AI companies have struck deals with publishers, including The Associated Press and News Corp, the owner of The Wall Street Journal, that give them ongoing access to those publishers' content.

Common Crawl, a data set that comprises billions of pages of web content and is maintained by a nonprofit, has been cited in more than 10,000 academic studies, Mr. Longpre said.
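
If you're curious whether a given page ended up in Common Crawl, the project exposes a public CDX index you can query over HTTP. Here's a rough sketch using only the standard library; the collection id "CC-MAIN-2024-26" and the exact response fields are assumptions on my part, so check index.commoncrawl.org for the current collection list.

import json
import urllib.parse
import urllib.request

def commoncrawl_captures(url, collection="CC-MAIN-2024-26"):
    """Ask one Common Crawl collection's index which captures it holds for a URL."""
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    endpoint = f"https://index.commoncrawl.org/{collection}-index?{query}"
    request = urllib.request.Request(endpoint, headers={"User-Agent": "cc-index-example"})
    with urllib.request.urlopen(request) as response:
        # The index replies with one JSON record per line.
        return [json.loads(line) for line in response.read().decode().splitlines()]

for record in commoncrawl_captures("commoncrawl.org/"):
    print(record.get("timestamp"), record.get("url"), record.get("status"))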

“Unsurprisingly, we’re seeing blowback from data creators after the text, images and videos they’ve shared online are used to develop commercial systems that sometimes directly threaten their livelihoods,” he said.

“Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers.”


The original article contains 1,235 words; the summary contains 171 words, saving 86%. I'm a bot and I'm open source!