My observation
Humans think about different things and concepts for different lengths of time. Coming up with "and" takes less effort than coming up with "telephone", as the latter is more context-sensitive.
Example
User: What color does an apple have?
LLM: Apples are red.
Here, generating the words "Apples" and "are" takes exactly the same inference time as generating "red", which should be the most difficult word to come up with. It should require the most compute.
Or let's think about this the other way around. The model thought just as hard about the word "red" as it did about the far less important words "are" and "Apples".
My idea
We add about 1000 new tokens to an LLM which are not word tokens, but thought tokens, or reasoning tokens. Then we train the AI as usual. Every time it generates one of these reasoning tokens, we don't interpret it as a word and simply let it keep generating. This way, the AI would kinda be able to "think" before saying a word. This thought is not human-interpretable, but it should be much more efficient than the pre-output reasoning tokens of o1, which fills its own context window with human language.
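To make the decoding side of this concrete, here is a minimal sketch. It assumes the thought tokens simply occupy the ids right after the ordinary vocabulary; `WORD_VOCAB_SIZE`, `is_thought_token`, `generate`, and the `next_token` callable are all hypothetical names standing in for a real model and tokenizer, not anything that exists in a library.

```python
from typing import Callable, List, Tuple

WORD_VOCAB_SIZE = 50_000    # assumed size of the ordinary word vocabulary
NUM_THOUGHT_TOKENS = 1_000  # the "about 1000" extra tokens from the idea above


def is_thought_token(token_id: int) -> bool:
    """Thought tokens are assumed to occupy the ids right after the word vocabulary."""
    return WORD_VOCAB_SIZE <= token_id < WORD_VOCAB_SIZE + NUM_THOUGHT_TOKENS


def generate(
    next_token: Callable[[List[int]], int],  # stand-in for one model forward pass
    prompt: List[int],
    eos_id: int,
    max_steps: int = 64,
) -> Tuple[List[int], List[int]]:
    """Return (full context, visible tokens).

    Every generated token is appended to the context so the model can attend
    to its own thoughts, but only word tokens are shown to the user.
    """
    context = list(prompt)
    visible: List[int] = []
    for _ in range(max_steps):
        token = next_token(context)
        context.append(token)
        if token == eos_id:
            break
        if not is_thought_token(token):
            visible.append(token)
    return context, visible
```

The only change compared to a normal sampling loop is the filter at the end: thought tokens stay in the context but never reach the detokenizer.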
Chances
- My hope for this is to make the AI able to think about what to say next like a human would. It is reasonable to assume that at first in training, it doesn't use the reasoning tokens all that much, but later on, when it has to solve more difficult things in training, it will very likely use these reasoning tokens to improve its chances of succeeding.
- This could drastically lower the number of parameters we need to get better output from models, as less thought-heavy tasks like smalltalk or very commonly used sentence structures could be generated quickly, while more complex topics are allowed to take longer. It would also make better LLMs more accessible to people running models at home, as not the parameters, but the inference time is scaled.
- It would train itself to provide useful reasoning tokens. Compared to how o1 does it, this is a much more token-friendly approach, as we allow for non-human-text generation, which the LLM is probably going to enjoy a lot, as it fills up its context less.
- This approach might also lead to more concise answers, as now it doesn't need to use CoT (chain of thought) to come to good conclusions.
Pitfalls and potential risks
- Training an AI using some blackboxed reasoning tokens can be considered a bad idea, as its thought process is literally uninterpretable.
- We would have to constrain the number of reasoning tokens, so that it doesn't take too long to output a single normal word token. This is a thing with other text-only LLMs too: they tend to generate long blocks of text for simple questions.
- We are hoping that during training, the model will use these reasoning tokens in its response, even though we as humans can't even read them. This may lead to the model completely ignoring these tokens, as they don't seem to lead to a better output. Later on in training however, I do expect the model to use more of these tokens, as it realizes how useful it can be to have thoughts.
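The cap on reasoning tokens from the list above could be enforced at sampling time. This is only a sketch of one way to do it, assuming thought tokens sit at the end of the logit vector; `mask_thought_logits`, `greedy_pick`, and the parameter names are made up for illustration.

```python
import math
from typing import List


def mask_thought_logits(
    logits: List[float],
    word_vocab_size: int,        # assumed: ids >= this value are thought tokens
    thoughts_in_a_row: int,      # thought tokens emitted since the last word token
    max_thoughts_in_a_row: int,  # hypothetical hard cap
) -> List[float]:
    """Once the cap is reached, forbid thought tokens so a word token must come next."""
    if thoughts_in_a_row < max_thoughts_in_a_row:
        return list(logits)
    return [
        value if token_id < word_vocab_size else -math.inf
        for token_id, value in enumerate(logits)
    ]


def greedy_pick(logits: List[float]) -> int:
    """Pick the id with the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)
```

With a cap of, say, 3, the model can think for at most three steps before it is forced to say something.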
What do you think?
I like this approach, because it might be able to achieve o1-like performance without the long wait before the output. While an o1-like approach is probably better for coding tasks, where planning is very important, in other tasks this way of generating reasoning tokens while writing the answer might be better.
How about adding a mechanism for storing the raw, embedding-dimensional vectors as a part of the sequence instead of introducing a set of additional discrete "invisible" tokens? So basically something like checking the final element of each vector in the sequence before the final linear layer and, if the element is larger than, say, 0, giving the vector as-is as the output instead of passing it through the de-embedding process. Then, when generating the next token, one could just interleave the thought vectors between the embedded "real" tokens after the embedding. This would allow the "thoughts" of the LLM to be continuous and thus more nuanced - a transformer doesn't need the sequence to be discrete, that's something imposed on LLMs by the nature of natural language. Could be an advantage over traditional CoT!
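A rough sketch of that gating idea, in plain Python so the shapes are easy to follow. Everything here is an assumption for illustration: `route_output`, `build_next_input`, the `"thought"`/`"token"` tags, and the toy `embed`/`unembed` matrices are hypothetical, and a real model would do this on tensors inside the forward pass.

```python
from typing import List, Sequence, Tuple, Union

Vector = List[float]
Item = Tuple[str, Union[int, Vector]]  # ("token", id) or ("thought", vector)


def route_output(
    hidden: Vector,
    unembed: Sequence[Vector],    # one row per word token, same width as hidden
    gate_threshold: float = 0.0,  # the "larger than, say, 0" check
) -> Item:
    """Use the last element of the hidden vector as a gate.

    Above the threshold, the vector is kept as a continuous thought.
    Otherwise it is de-embedded into a normal token (greedy argmax here).
    """
    if hidden[-1] > gate_threshold:
        return ("thought", list(hidden))
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return ("token", max(range(len(logits)), key=logits.__getitem__))


def build_next_input(sequence: Sequence[Item], embed: Sequence[Vector]) -> List[Vector]:
    """Embed real tokens and pass thought vectors through unchanged, in order."""
    return [embed[item] if kind == "token" else list(item) for kind, item in sequence]
```

The next forward pass then runs on the output of `build_next_input`, so thought vectors sit right next to the tokens they were produced for.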
One other reason as to why something like this might beat o1's thought document (at least for some tasks) is the way the attention mechanism works: it's much more natural to attend to nearby tokens than to far away ones.
Training thought tokens like this is pretty simple in principle: one could construct a loss for them based on whether they increase the odds of producing the correct token next. Probably should pair that with some minimum increase threshold (below which we actually penalize for thought token generation) and an increasing penalty for outputting multiple thought tokens in a row (in addition to the hard constraint suggested in the OP). The training does pose one major challenge, though: it would need to be done autoregressively instead of pushing the whole sequence through at once, as we don't have ground truth for these thought tokens. So this would slow things down quite a bit!
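For illustration, the per-thought-token training signal described above might look something like this. It is a sketch under my own assumptions, not a tested recipe: `thought_token_loss`, `min_gain`, and `run_penalty` are hypothetical, the quadratic run cost is just one way to get an increasing penalty, and since the result is a plain scalar it would have to be used as a reward-style signal rather than backpropagated directly.

```python
import math


def thought_token_loss(
    logp_without: float,        # log-prob of the correct next token without the thought
    logp_with: float,           # log-prob of the correct next token with the thought
    run_length: int,            # position of this thought token in its consecutive run (1, 2, ...)
    min_gain: float = 0.05,     # hypothetical minimum improvement threshold
    run_penalty: float = 0.01,  # hypothetical weight of the run-length penalty
) -> float:
    """Lower is better.

    Negative when the thought improves the odds of the correct token by more
    than min_gain, positive otherwise. The run cost grows quadratically, so
    each additional thought token in a row costs more than the previous one.
    """
    gain = logp_with - logp_without
    run_cost = run_penalty * run_length ** 2
    return (min_gain - gain) + run_cost


# A thought that lifts the correct token from p=0.2 to p=0.6 is rewarded,
# one that changes nothing is penalized.
helpful = thought_token_loss(math.log(0.2), math.log(0.6), run_length=1)
useless = thought_token_loss(math.log(0.2), math.log(0.2), run_length=1)
```

Getting `logp_without` and `logp_with` means running the model both with and without the thought token in the context, which is part of why the training would be slow.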