[–] [email protected] 47 points 4 weeks ago (14 children)

According to the article, it's a bigger problem for the "reasoning models" than for the older-style LLMs. Since those explicitly break problems down into multiple smaller steps, I wonder if that's creating higher hallucination rates because each step introduces the potential for errors/fabrications. Even a very small amount of "cognitive drift" might have a very large impact on the final answer if it compounds across multiple steps.
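As a rough back-of-the-envelope illustration (the numbers here are made up): if each step independently had a 2% chance of going wrong, a ten-step chain would have about a 1 - 0.98^10 ≈ 18% chance of containing at least one bad step, before even accounting for errors feeding forward into later steps.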

[–] [email protected] 21 points 4 weeks ago* (last edited 4 weeks ago) (10 children)

AI alchemists discovered that the statistics machine will be in a better ballpark if you give it multiple examples and clarifications as part of your ask. This is called Chain of Thought prompting. Example:
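Something along these lines, where you show the model a worked, step-by-step answer before asking your real question (the wording and numbers are just made up for illustration):

```python
# A chain-of-thought style prompt: include a worked example with explicit steps,
# then pose the actual question. Purely illustrative.
prompt = """\
Q: A shop sells pens at $2 each. How much do 7 pens cost?
A: Let's think step by step. One pen costs $2, so 7 pens cost 7 * 2 = $14.
   The answer is $14.

Q: A train travels at 60 km/h for 2.5 hours. How far does it go?
A: Let's think step by step.
"""
```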

Then the AI alchemists said, hey, we can automate this by having the model eat more of its own shit. So a reasoning model will ask itself "What does the user want when they say <your prompt>?", and the text that generates gets added to your query before the final answer is produced. All models with "chat memory" effectively eat their own shit already; the tech works by reprocessing the whole chat history (sometimes there's a cache) every time you reply. Because reasoning models emulate chain of thought on top of that, they eat more of their own shit than non-reasoning models do.
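Roughly, as pseudocode (the function names and exact plumbing are made up; real implementations differ):

```python
def reasoning_turn(chat_history, user_prompt, generate):
    """One illustrative 'reasoning' turn; generate() stands in for the model call."""
    # The model first chews on its own restatement of the request.
    reasoning = generate(chat_history + [f"What does the user want when they say: {user_prompt}?"])
    # That generated text is added alongside your prompt to produce the visible answer.
    answer = generate(chat_history + [user_prompt, reasoning])
    # The whole turn goes back into the history, which gets reprocessed on the next reply.
    chat_history.extend([user_prompt, reasoning, answer])
    return answer
```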

Some reasoning models are worse than others because some refeed the entire reasoning history, while others only refeed the current prompt's reasoning.
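The split between those two designs, again as a made-up sketch:

```python
def update_history(chat_history, user_prompt, reasoning, answer, keep_reasoning):
    """Illustrative only: how much of its own reasoning a model refeeds next turn."""
    if keep_reasoning:
        # The entire reasoning trace is carried forward, so later turns
        # reprocess every previous turn's reasoning text as well.
        return chat_history + [user_prompt, reasoning, answer]
    # Otherwise only the visible answer is carried forward and the
    # reasoning text is dropped once the turn is over.
    return chat_history + [user_prompt, answer]
```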

Essentially it's a form of compound error.

[–] [email protected] 3 points 4 weeks ago (1 children)

Information isn't attached to the user's query; the CoT still happens in the output of the model, like in the first example that you gave. This can be done without any finetuning on the policy, but reinforcement learning can also be used to encourage the chat output to break the problem down into "logical" steps. Chat models have always passed the chat history back into the next input while appending the user's turn; that's just how they work (I have no idea if o1 passes the CoT into the chat history though, so I can't comment). But that alone wouldn't account for the massive degradation of performance between o1 and o3/o4.
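In the usual chat-format framing (illustrative; the role names follow common convention, not any specific provider's API), the CoT lives in the model's own turn rather than being bolted onto the user's query:

```python
# Typical chat transcript shape: each turn is appended and the whole list is
# fed back in as the next input. The chain of thought, if exposed at all,
# appears inside the assistant's output.
messages = [
    {"role": "user", "content": "How many 250 ml glasses can I fill from 2 litres?"},
    {"role": "assistant", "content": "2 litres is 2000 ml. 2000 / 250 = 8. You can fill 8 glasses."},
    {"role": "user", "content": "And from 3 litres?"},  # appended before the next call
]
```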

[–] [email protected] 4 points 4 weeks ago* (last edited 4 weeks ago)

From my other comment about potential issues with o1 vs o3/o4:

The other big difference between o1 and o3/o4 that may explain the higher rate of hallucinations is that o1's reasoning is not user accessible, and it's purposefully trained not to have safeguards on its reasoning, whereas o3 and o4 have public reasoning and reasoning safeguards. I think safeguards may be a significant source of hallucination, because they change prompt intent, encoding and output. So on a non-o1 model that safeguard process happens twice per turn, once for reasoning and once for output, and is then accumulated into the input. On an o1 model it happens once per turn, only for output, and is then accumulated.
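Purely as an illustration of the flow described above (this is speculation about internals; every name here is made up):

```python
def o1_style_turn(history, prompt, generate, apply_safeguards):
    # Reasoning stays hidden and unsafeguarded; only the visible
    # output passes through the safeguard step before accumulating.
    reasoning = generate(history + [prompt])
    answer = apply_safeguards(generate(history + [prompt, reasoning]))
    return history + [prompt, answer], answer

def o3_o4_style_turn(history, prompt, generate, apply_safeguards):
    # Safeguards run twice per turn: once on the (public) reasoning and
    # once on the output, and both are accumulated back into the input.
    reasoning = apply_safeguards(generate(history + [prompt]))
    answer = apply_safeguards(generate(history + [prompt, reasoning]))
    return history + [prompt, reasoning, answer], answer
```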
