this post was submitted on 17 May 2025

Ever since ChatGPT was released to the public in November 2022, people have been using it to generate text, from emails to blog posts to bad poetry, much of which they post online. Since that release, the companies that build the large language models (LLMs) on which such chatbots are based (OpenAI's GPT-3.5, the technology underlying ChatGPT, is one example) have continued to put out newer versions of their models, training them on new text data, some of which was scraped from the Web. That means, inevitably, that some of the training data used to create LLMs did not come from humans, but from the LLMs themselves.

That has led computer scientists to worry about a phenomenon they call model collapse. Model collapse happens when the training data no longer matches real-world data, leading the new LLM to produce gibberish: a 21st-century version of the classic computer aphorism "garbage in, garbage out."

LLMs work by learning the statistical distribution of so-called tokens (words or parts of words) within a language by examining billions of sentences drawn from sources including book databases, Wikipedia, and the Common Crawl dataset, a collection of material gathered from the Internet. An LLM, for instance, will figure out how often the word "president" is associated with the word "Obama" versus "Trump" versus "Hair Club for Men." Then, when prompted by a request, it will produce the words it calculates have the highest probability of meeting that request and of following from the previous words. The results bear a credible resemblance to human-written text.
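As a rough illustration of "learning the statistical distribution of tokens," here is a toy sketch in Python. It is an assumption-laden simplification, not how a real LLM works: it only counts which word follows which (a bigram model), the three-sentence corpus is made up, and the function names `train_bigram_counts` and `next_token_probs` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {token: n / total for token, n in followers.items()}


# Made-up corpus, purely for illustration.
corpus = [
    "president obama spoke today",
    "president obama signed the bill",
    "president trump spoke today",
]

counts = train_bigram_counts(corpus)
probs = next_token_probs(counts, "president")
most_likely = max(probs, key=probs.get)
```

In this toy corpus "obama" follows "president" two times out of three, so `most_likely` is `"obama"`. A real model conditions on far more context than one previous word, but the idea of estimating probabilities from observed text is the same.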

Model collapse is basically a statistical problem, said Sanmi Koyejo, an assistant professor of computer science at Stanford University. When machine-generated text replaces human-generated text, the distribution of tokens no longer matches the natural distribution produced by humans. As a result, the training data for a new round of modeling does not match the real world, and the new model’s output gets worse. “The thing we’re worried about is that the distribution of your data that you end up with, if you’re trying to fit your model, ends up really far from the actual distribution that generated the data,” he said.
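One way to put a number on "really far from the actual distribution" is the Kullback-Leibler (KL) divergence. The article does not say which measure the researchers use, so treat KL here as an assumption chosen for illustration; the token probabilities below are invented, and `kl_divergence` is a hypothetical helper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two token -> probability dicts.

    Returns infinity when q assigns zero probability to a token
    that p considers possible.
    """
    total = 0.0
    for token, p_prob in p.items():
        if p_prob == 0:
            continue
        q_prob = q.get(token, 0.0)
        if q_prob == 0:
            return math.inf
        total += p_prob * math.log(p_prob / q_prob)
    return total


# Invented distributions, for illustration only.
human = {"the": 0.5, "cat": 0.3, "axolotl": 0.2}
drifted = {"the": 0.6, "cat": 0.35, "axolotl": 0.05}  # rare word under-represented
collapsed = {"the": 0.6, "cat": 0.4}                  # rare word gone entirely

same = kl_divergence(human, human)
mild = kl_divergence(human, drifted)
severe = kl_divergence(human, collapsed)
```

Here `same` is zero, `mild` is a small positive number, and `severe` is infinite, because the collapsed distribution treats a word humans actually use as impossible.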

The problem arises because whatever text an LLM generates is only a finite sample from the distribution it learned, so rare words and phrases (the "tail events") may not show up in it at all. "Because you generate a finite sample, you have some probability of not sampling them," said Yarin Gal, an associate professor of machine learning at Oxford University. "Once you don't sample, then they disappear. They will never appear again. So every time you generate data, you basically start forgetting more and more of the tail events and therefore that leads to the concentration of the higher probability events." Gal and his colleagues published a study in Nature in July 2024 that showed indiscriminate use of what they called "recursively generated data" caused the models to fail.
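The "once you don't sample, they disappear" effect can be sketched with a toy simulation. This is not the setup from the Nature study; it assumes a 50-token vocabulary with Zipf-like weights, a "model" that is nothing more than the observed token frequencies, and a sample size of 100 per generation. All names and numbers are hypothetical.

```python
import random
from collections import Counter


def resample_generation(probs, sample_size, rng):
    """Draw a finite sample from `probs`, then re-estimate probabilities from it."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(sample)
    # Tokens that were never drawn are simply absent from the new estimate.
    return {token: n / sample_size for token, n in counts.items()}


def run_generations(probs, sample_size, generations, seed=0):
    """Repeatedly train on the previous generation's output."""
    rng = random.Random(seed)
    support_sizes = [len(probs)]
    for _ in range(generations):
        probs = resample_generation(probs, sample_size, rng)
        support_sizes.append(len(probs))
    return probs, support_sizes


# Zipf-like starting distribution: the token of rank r has weight 1/r.
raw = {f"tok{rank}": 1 / rank for rank in range(1, 51)}
norm = sum(raw.values())
initial = {token: w / norm for token, w in raw.items()}

final, support_sizes = run_generations(initial, sample_size=100, generations=20)
```

`support_sizes` records how many distinct tokens survive each generation. It can only stay flat or shrink, because a token with zero estimated probability can never be drawn again, which is the forgetting of tail events Gal describes.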

The problem is not limited to LLMs. Any generative model that is iteratively trained can suffer the same fate if it starts ingesting machine-produced data, Gal said. That includes diffusion models that create images, such as DALL-E. The issue also can affect variational autoencoders, which create new data samples by producing variations of their original data. It can apply to Gaussian mixture models, a form of unsupervised machine learning that sorts subpopulations of data into clusters; they are used to analyze customer preferences, predict stock prices, and analyze gene expression.

Collapse is not a danger for models that incorporate synthetic data but only do so once, such as neural networks used to identify cancer in medical images, where synthetic data was used to augment rare or expensive real data. “The main distinction is that model collapse happens when you have multiple steps, where each step depends on the output from the previous step,” Gal said.

The theory that replacing training data with synthetic data will quickly lead to the demise of LLMs is sound, Koyejo said. In practice, however, not all human data gets replaced immediately. Instead, when the generated text is scraped from the Internet, it gets mixed in with human text. “You create synthetic data, you add that to real data, so you now have more data, which is real data plus synthetic data,” he said. What is actually happening, he said, is not data replacement, but data accumulation. That slows the degradation of the dataset.
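The difference between replacing and accumulating data can be sketched with a very small stand-in for an LLM: a single Gaussian fitted to a list of numbers. This is an illustrative assumption, not the experiment Koyejo's group ran; the function names, the dataset size of 50, and the 10 generations are all hypothetical.

```python
import random
import statistics


def fit_gaussian(data):
    """The 'model': just the mean and standard deviation of the data."""
    return statistics.fmean(data), statistics.pstdev(data)


def iterate(real_data, generations, accumulate, seed=0):
    """Refit and regenerate repeatedly.

    accumulate=False: each generation trains only on the previous
                      generation's synthetic output (replacement).
    accumulate=True:  synthetic output is added to everything seen so
                      far, so the real data never leaves the dataset.
    """
    rng = random.Random(seed)
    n = len(real_data)
    dataset = list(real_data)
    fitted_stds = []
    for _ in range(generations):
        mean, std = fit_gaussian(dataset)
        fitted_stds.append(std)
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        dataset = dataset + synthetic if accumulate else synthetic
    return dataset, fitted_stds


data_rng = random.Random(1)
real = [data_rng.gauss(0.0, 1.0) for _ in range(50)]

replaced, std_replace = iterate(real, generations=10, accumulate=False)
accumulated, std_accumulate = iterate(real, generations=10, accumulate=True)
```

Comparing `std_replace` with `std_accumulate` shows how the fitted spread moves from one generation to the next under each regime. In the replacement run nothing ties the model back to the real data after the first step, while in the accumulation run the original 50 points stay in the dataset throughout.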

Simply accumulating data may stop model collapse but can cause other problems if done without thought, said Yunzhen Feng, a Ph.D. student at the Center for Data Science at New York University. As a rule, the performance of neural networks improves as their size increases. Naively mixing real and synthetic data together, however, can slow that improvement. “You can still obtain similar performance, but you need much more data. That means you’re using much more compute and much more money to achieve that,” he said.

One challenge is that there is no easy way to tell whether text found on the Internet is synthetic or human-generated. Though there have been attempts to automatically identify text from LLMs, none have been entirely successful. Research into this problem is ongoing, Gal said.

[–] [email protected] 2 points 8 hours ago

Self-Enshitification. :)