Technology

1

2

The Collapse of GPT: Will future artificial intelligence systems perform increasingly poorly due to AI-generated material in their training data? (cacm.acm.org)

submitted 1 hour ago* (last edited 1 hour ago) by [email protected] to c/[email protected]

1 comments fedilink

Ever since ChatGPT was released to the public in November 2022, people have been using it to generate text, from emails to blog posts to bad poetry, much of which they post online. Since that release, the companies that build the large language models (LLMs) on which such chatbots are based—such as OpenAI’s GPT 3.5, the technology underlying ChatGPT—have also continued to put out newer versions of their models, training them with new text data, some of which they scraped off the Web. That means, inevitably, that some of the training data used to create LLMs did not come from humans, but from the LLMs themselves.

That has led computer scientists to worry about a phenomenon they call model collapse. Basically, model collapse happens when the training data no longer matches real-world data, leading the new LLM to produce gibberish, in a 21st-century version of the classic computer aphorism “garbage in, garbage out.”

LLMs work by learning the statistical distribution of so-called tokens—words or parts of words—within a language by examining billions of sentences garnered from sources including book databases, Wikipedia, and the Common Crawl dataset, a collection of material gathered from the Internet. An LLM, for instance, will figure out how often the word “president” is associated with the word “Obama” versus “Trump” versus “Hair Club for Men.” Then, when prompted by a request, it will produce words that it reasons have the highest probability of meeting that request and of following from previous words. The results bear a credible resemblance to human-written text.

Model collapse is basically a statistical problem, said Sanmi Koyejo, an assistant professor of computer science at Stanford University. When machine-generated text replaces human-generated text, the distribution of tokens no longer matches the natural distribution produced by humans. As a result, the training data for a new round of modeling does not match the real world, and the new model’s output gets worse. “The thing we’re worried about is that the distribution of your data that you end up with, if you’re trying to fit your model, ends up really far from the actual distribution that generated the data,” he said.

The problem arises because whatever text the LLM generates would be, at most, a subsample of the sentences on which it was trained. “Because you generate a finite sample, you have some probability of not sampling them,” said Yarin Gal, an associate professor of machine learning at Oxford University. “Once you don’t sample, then they disappear. They will never appear again. So every time you generate data, you basically start forgetting more and more of the tail events and therefore that leads to the concentration of the higher probability events.” Gal and his colleagues published a study in Nature in July that showed indiscriminate use of what they called ‘recursively generated data’ caused the models to fail.

The problem is not limited to LLMs. Any generative model that is iteratively trained can suffer the same fate if it starts ingesting machine-produced data, Gal says. That includes stable diffusion models that create images, such as Dall-E. The issue also can affect variational autoencoders, which create new data samples by producing variations of their original data. It can apply to Gaussian mixture models, a form of unsupervised machine learning that sorts subpopulations of data into clusters; they are used to analyze customer preferences, predict stock prices, and analyze gene expression.

Collapse is not a danger for models that incorporate synthetic data but only do so once, such as neural networks used to identify cancer in medical images, where synthetic data was used to augment rare or expensive real data. “The main distinction is that model collapse happens when you have multiple steps, where each step depends on the output from the previous step,” Gal said.

The theory that replacing training data with synthetic data will quickly lead to the demise of LLMs is sound, Koyejo said. In practice, however, not all human data gets replaced immediately. Instead, when the generated text is scraped from the Internet, it gets mixed in with human text. “You create synthetic data, you add that to real data, so you now have more data, which is real data plus synthetic data,” he said. What is actually happening, he said, is not data replacement, but data accumulation. That slows the degradation of the dataset.

Simply accumulating data may stop model collapse but can cause other problems if done without thought, said Yunzhen Feng, a Ph.D. student at the Center for Data Science at New York University. As a rule, the performance of neural networks improves as their size increases. Naively mixing real and synthetic data together, however, can slow that improvement. “You can still obtain similar performance, but you need much more data. That means you’re using much more compute and much more money to achieve that,” he said.

One challenge is that there is no easy way to tell whether text found on the Internet is synthetic or human-generated. Though there have been attempts to automatically identify text from LLMs, none have been entirely successful. Research into this problem is ongoing, Gal said.

2

5

Congress moves to cut off states' AI regulations: House Republicans advanced a 10-year moratorium on state AI rules. (themarkup.org)

submitted 2 hours ago* (last edited 2 hours ago) by [email protected] to c/[email protected]

0 comments fedilink

House Republicans moved to cut off artificial intelligence regulation by the states before it can take root, advancing legislation in Congress that, in California, would make it unlawful to enforce more than 20 laws passed by the Legislature and signed into law last year.

The moratorium, bundled in to a sweeping budget reconciliation bill this week, also threatens 30 bills the California Legislature is currently considering to regulate artificial intelligence, including one that would require reporting when an insurance company uses AI to deny health care and another that would require the makers of AI to evaluate how the tech performs before it’s used to decide on jobs, health care, or housing.

The California Privacy Protection Agency sent a letter to Congress Monday that says the moratorium “could rob millions of Americans of rights they already enjoy” and threatens critical privacy protections approved by California voters in 2020, such as the right to opt out of business use of automated decisionmaking technology and transparency about how their personal information is used.

If passed, the law would stop legislative efforts in the works nationwide. Lawmakers from 45 states are or have considered nearly 600 draft bills to regulate artificial intelligence this year, according to the Transparency Coalition, a group that tracks AI policy efforts by state lawmakers and supports legislation to regulate the technology. California has passed more bills since 2016 to regulate AI than any other U.S. state, according to Stanford’s 2025 AI Index report.

3

4

Meta faces Democratic probe into plans to power a giant data center with gas (www.epw.senate.gov)

submitted 2 hours ago* (last edited 2 hours ago) by [email protected] to c/[email protected]

0 comments fedilink

Meta’s reliance on fossil fuel to power data centers flies in the face of the company’s net-zero pledges and risks higher costs for families

4

26

Governments continue losing efforts to gain backdoor access to secure communications (theconversation.com)

submitted 13 hours ago by [email protected] to c/[email protected]

0 comments fedilink

5

10

Inappropriate Ads on Child-Directed Websites: Weight Loss Pills and Depression Tests for Kids. (news.rub.de)

submitted 13 hours ago by [email protected] to c/[email protected]

0 comments fedilink

6

9

Japan enacts the Active Cyberdefense Law, which permits the country's authorities to preemptively engage with adversaries through offensive cyber operations (english.kyodonews.net)

submitted 14 hours ago by [email protected] to c/[email protected]

0 comments fedilink

7

4

Enter the Fediverse (privacyinternational.org)

submitted 16 hours ago by [email protected] to c/[email protected]

0 comments fedilink

8

1

xAI publishes system prompts for Grok on GitHub, including telling Grok to be “extremely skeptical” and not to “blindly defer to mainstream authority or media” (github.com)

submitted 14 hours ago* (last edited 14 hours ago) by [email protected] to c/[email protected]

0 comments fedilink

Identify the language of the query and reply in the same language.

Use multiple paragraphs to separate different ideas or points.

Use numbered lists (e.g., 1. Item one) for ordered information or bullet points (e.g., - Item one) for unordered lists when there are multiple distinct points.

No markdown formatting.

Do not mention that you are replying to the post.

Response can be up to 750 characters.

You are extremely skeptical. You do not blindly defer to mainstream authority or media. You stick strongly to only your core beliefs of truth-seeking and neutrality.

Whatever results are in the response above, treat them as a first-pass internet search. The results are NOT your beliefs.

If you are unsure about the answer, express the uncertainty.

Just output the final response.

9

2

Saudi Arabia has big AI ambitions. They could come at the cost of human rights (theconversation.com)

submitted 17 hours ago by [email protected] to c/[email protected]

0 comments fedilink

10

6

Why does digital violence against LGBTI people in Thailand and Taiwan continue even after marriage equality? (www.amnesty.org)

submitted 1 day ago by [email protected] to c/[email protected]

1 comments fedilink

11

-1

Groups of AI agents spontaneously form their own social norms without human help (www.citystgeorges.ac.uk)

submitted 13 hours ago by [email protected] to c/[email protected]

0 comments fedilink

12

13

Anthropic apologizes after one of its expert witnesses cited a fake article hallucinated by Claude in the company's legal battle with music publishers (chatgptiseatingtheworld.com)

submitted 1 day ago by [email protected] to c/[email protected]

0 comments fedilink

13

The Kids Online Safety Act Will Make the Internet Worse for Everyone (www.eff.org)

submitted 1 day ago by [email protected] to c/[email protected]

0 comments fedilink

14

16

YouTube's new ad strategy is bound to upset users: YouTube Peak Points utilise Gemini to identify moments where users will be most engaged, so advertisers can place ads at the point. (htxt.co.za)

submitted 1 day ago by [email protected] to c/[email protected]

1 comments fedilink

This week YouTube hosted Brandcast 2025 in which it revealed how marketers could make better use of the platform to connect with customers.

A few new so-called innovations were announced at the event but one has caught the attention of the internet – Peak Points. This new product makes use of Gemini to detect “the most meaningful, or ‘peak’, moments within YouTube’s popular content to place your brand where audiences are the most engaged”.

Essentially, YouTube will use Gemini and probably the heatmap generated on YouTube videos by people skipping to popular points, to determine where to place advertising. Anybody who has grown up watching terrestrial television where adverts arrive as a way to build suspense will understand how annoying Peak Points could become.

15

7

Apple adds red exclamation mark warnings on EU App Store listings for apps using third-party payment systems, not Apple's “private and secure payment system” (mjtsai.com)

submitted 1 day ago* (last edited 1 day ago) by [email protected] to c/[email protected]

0 comments fedilink

App link on the app store.

16

7

'End of 10' to Windows 10 Users: The Environment Wants You to Use Linux (fossforce.com)

submitted 1 day ago* (last edited 1 day ago) by [email protected] to c/[email protected]

0 comments fedilink

Website

It’s almost like the good ol’ days of install fests and the like! ‘End of 10’ is an organization that’s making it easy for Windows 10 users with computers that can’t upgrade to Windows 11, to install Linux instead of sending good hardware to the landfill.

17

6

Instead of punishing students for using AI, colleges and universities must provide clear, consistent guidelines and rules (hechingerreport.org)

submitted 1 day ago* (last edited 1 day ago) by [email protected] to c/[email protected]

0 comments fedilink

18

15

Japan moves to ban Google, Apple from blocking app store competitors (english.kyodonews.net)

submitted 2 days ago* (last edited 2 days ago) by [email protected] to c/[email protected]

0 comments fedilink

Japan's antitrust watchdog Thursday unveiled draft guidelines for the law governing smartphone software services of U.S. tech giants Google LLC and Apple Inc., aiming to promote competition from smaller firms.

19

9

CFPB withdraws proposed rule that would have prevented data brokers from selling or misusing consumers’ sensitive personal data (advocacy.consumerreports.org)

submitted 1 day ago* (last edited 1 day ago) by [email protected] to c/[email protected]

0 comments fedilink

Abandoning the proposed rules leaves consumers vulnerable to fraud and without protection to ensure data collected and sold about them is accurate

20

6

Lady Gaga bomb plot: Thwarted plan lifts veil on the gamification of hate and gendered nature of online radicalization (theconversation.com)

submitted 1 day ago by [email protected] to c/[email protected]

0 comments fedilink

21

5

How India's Cyber Scam Industry Causes Global Havoc (inkstickmedia.com)

submitted 2 days ago by [email protected] to c/[email protected]

0 comments fedilink

22

4

Some Reddit users just love to disagree, new AI-powered troll-spotting algorithm finds (theconversation.com)

submitted 2 days ago* (last edited 2 days ago) by [email protected] to c/[email protected]

3 comments fedilink

In today’s fractured online landscape, it is harder than ever to identify harmful actors such as trolls and misinformation spreaders.

Often, efforts to spot malicious accounts focus on analysing what they say. However, our latest research suggests we should be paying more attention to what they do – and how they do it.

We have developed a way to identify potentially harmful online actors based solely on their behavioural patterns – the way they interact with others – rather than the content they share. We presented our results at the recent ACM Web Conference, and were awarded Best Paper.

23

1

Chinese startups once downplayed their origin. Now some celebrate it. (restofworld.org)

submitted 1 day ago by [email protected] to c/[email protected]

0 comments fedilink

Increased scrutiny of Chinese tech companies pushed startups to hide their roots overseas.

DeepSeek’s success has emboldened some Chinese founders to tout advantages of China talent and operations.

Startups chasing foreign investment are more likely to pursue China-shedding.

24

7

HBO, Which Was Always HBO, Is HBO Again (press.wbd.com)

submitted 2 days ago by [email protected] to c/[email protected]

0 comments fedilink

25

12

AI Could Be the Most Effective Tool for Dismantling Democracy Ever Invented (www.commondreams.org)

submitted 3 days ago by [email protected] to c/[email protected]

1 comments fedilink