this post was submitted on 28 Mar 2025

1 points (100.0% liked)

TechTakes

1842 readers

13 users here now

Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

founded 2 years ago

MODERATORS

[email protected]

Gemini 2.5 "reasoning", no real improvement on river crossings. (awful.systems)

submitted 1 month ago by [email protected] to c/[email protected]

33 comments fedilink hide all child comments

So I signed up for a free month of their crap because I wanted to test if it solves novel variants of the river crossing puzzle.

Like this one:

You have a duck, a carrot, and a potato. You want to transport them across the river using a boat that can take yourself and up to 2 other items. If the duck is left unsupervised, it will run away.

Unsurprisingly, it does not:

https://g.co/gemini/share/a79dc80c5c6c

https://g.co/gemini/share/59b024d0908b

The only 2 new things seem to be that old variants are no longer novel, and that it is no longer limited to producing incorrect solutions - now it can also incorrectly claim that the solution is impossible.

I think chain of thought / reasoning is a fundamentally dishonest technology. At the end of the day, just like older LLMs it requires that someone solved a similar problem (either online or perhaps in a problem solution pair they generated if they do that to augment the training data).

But it outputs quasi reasoning to pretend that it is actually solving the problem live.

top 33 comments

sorted by: hot top controversial new old

[–] [email protected] 0 points 1 month ago

Take the potato and the carrot, show the carrot to the duck and make it follow you, dude can swim.

One trip, solved.

[–] [email protected] 0 points 1 month ago

Sounds like a perfectly reasonable duck.

[–] [email protected] 0 points 1 month ago* (last edited 1 month ago) (4 children)

I just asked the same question to my preferred onprem model, Mistral-Small 3.1.

13s on a 7900xtx

justinledwards/mistral-small-3.1-Q6_K:latest

To successfully transport the duck, carrot, and potato across the river without losing any of them, you need to plan your trips carefully. Here’s a step-by-step strategy:

First Trip: Take the duck across the river first.

Items on this side: Carrot, Potato
Items on the other side: Duck

Return Trip: Go back alone to the original side.

Items on this side: Carrot, Potato (you)
Items on the other side: Duck

Second Trip: Take either the carrot or the potato across the river.

Let's say you take the carrot first for illustration.

Items on this side: Potato
Items on the other side: Duck, Carrot (you)

Return Trip: Go back with the duck to the original side.

Items on this side: Potato, Duck
Items on the other side: Carrot

Third Trip: Leave the duck behind and take the potato across the river.

Items on this side: Duck
Items on the other side: Carrot, Potato (you)

Return Trip: Go back alone to get the duck.

Items on this side: Duck (you)
Items on the other side: Carrot, Potato

Final Trip: Take the duck across the river. All items are now on the other side.

By following these steps, you ensure that the duck is never left unsupervised and all items make it safely across the river.

[–] [email protected] 0 points 1 month ago

While all the comments are correct, i.e. what the fuck is wrong with you, this is also wrong since it immediately leaves the duck unsupervised after the first trip. So really not sure what you're trying to prove here.

[–] [email protected] 0 points 1 month ago

@kata1yst @diz boiling the oceans for this

[–] [email protected] 0 points 1 month ago

What's your point, that your favourite LLM doesn't reason either?

[–] [email protected] 0 points 1 month ago

no-one fucking asked you to spam us with pointless llm trash

the fuck is wrong with you

[–] [email protected] 0 points 1 month ago (2 children)

I'm kinda tired, but this puzzle's shoved itself into my brain. The obvious solution I can see is, roughly speaking:

Take the duck and carrot across
Take the duck back
Take the duck and potato across

[–] [email protected] 0 points 1 month ago (1 children)

My two solutions:

Eat the carrot. Take the duck and potato across.
It's a row boat. Take the carrot and potato, supervise the duck as it swims behind you.

I'm not doing three river crossings, you can't make me.

[–] [email protected] 0 points 1 month ago

another solution:

take duck, carrot and potato at once. if boat is fine if you put duck and carrot in but will sink if you put in duck, carrot and potato then you're already on horrifyingly narrow engineering margins and probably shouldn't use it in the first place

[–] [email protected] 0 points 1 month ago* (last edited 1 month ago) (1 children)

Yeah, exactly. There's no trick to it at all, unlike the original puzzle.

I also tested OpenAI's offerings a few months back with similarly nonsensical results: https://awful.systems/post/1769506

All-vegetables no duck variant is solved correctly now, but I doubt it is due to improved reasoning as such, I think they may have augmented the training data with some variants of the river crossing. The river crossing is one of the top most known puzzles, and various people have been posting hilarious bot failures with variants of it. So it wouldn't be unexpected that their training data augmentation has river crossing variants.

Of course, there's very many ways in which the puzzle can be modified, and their augmentation would only cover obvious stuff like variation on what items can be left with what items or spots on the boat.

[–] [email protected] 0 points 1 month ago (5 children)

It's just overtrained on the puzzle such that it mostly ignores your prompt. Changing a few words out doesn't change that it recognises the puzzle. Try writing it out in ASCII or uploading an image with it written or some other weird way that it hasn't been specifically trained on and I bet it actually performs better.

[–] [email protected] 0 points 1 month ago

"it can't be that stupid, you must be prompting it wrong"

[–] [email protected] 0 points 1 month ago* (last edited 1 month ago)

Not really. Here's the chain-of-word-vomit that led to the answers:

https://pastebin.com/HQUExXkX

Note that in "its impossible" answer it correctly echoes that you can take one other item with you, and does not bring the duck back (while the old overfitted gpt4 obsessively brought items back), while in the duck + 3 vegetables variant, it has a correct answer in the wordvomit, but not being an AI enthusiast it can't actually choose the correct answer (a problem shared with the monkeys on typewriters).

I'd say it clearly isn't ignoring the prompt or differences from the original river crossings. It just can't actually reason, and the problem requires a modicum of reasoning, much as unloading groceries from a car does.

[–] [email protected] 0 points 1 month ago (2 children)

oh look it's a loadbearing "just" in the wild. better hope you can shore that fucker up with some facts

Try writing it out in ASCII

my poster in christ, what in the fuck are you on about. stop prompting LLMs and go learn some things instead

some other weird way that it hasn’t been specifically trained on and I bet it actually performs better

"no no see, you just need to prompt it different. just prompt it different bro it'll work bro I swear bro"

god, every fucking time

[–] [email protected] 0 points 1 month ago (1 children)

Well has anyone tried prompting it in EBCDIC? How do we know doing so won't immediately create the super intelligence that "or whatever"s us to silicon Valhalla? Asking for a friend.

[–] [email protected] 0 points 1 month ago

you know, I was briefly considering trying to, and I figured you'd probably have to be forcing it by content escaping tricks or something (at least I presume their APIs will do basic type-checking...)

got other yaks to do atm tho

[–] [email protected] 0 points 1 month ago (1 children)

All along my mistake was that I was prompting it in unicode instead of latin1, alphameric BCD, or "modified UTF-8".

[–] [email protected] 0 points 1 month ago (1 children)

I thought everyone knew that you had to structure prompts in ALGOL 420 to get the best performance by going close to the metal

[–] [email protected] 0 points 1 month ago (1 children)

I use UTF-9 to efficiently handle Unicode on my PDP-10.

[–] [email protected] 0 points 1 month ago

@bitofhope @techtakes Surely you need a PDP-9 for that?

[–] [email protected] 0 points 1 month ago (1 children)

write it out in ASCII

My dude what do you think ASCII is? Assuming we're using standard internet interfaces here and the request is coming in as UTF-8 encoded English text it is being written out in ASCII

Sneers aside, given that the supposed capability here is examining a text prompt and reason through the relevant information to provide a solution in the form of a text response this kind of test is, if anything, rigged in favor of the AI compared to some similar versions that add in more steps to the task like OCR or other forms of image parsing.

It also speaks to a difference in how AI pattern recognition compared to the human version. For a sufficiently well-known pattern like the form of this river-crossing puzzle it's the changes and exceptions that jump out. This feels almost like giving someone a picture of the Mona Lisa with aviators on; the model recognizes that it's 99% of the Mona Lisa and goes from there, rather than recognizing that the changes from that base case are significant and intentional variation rather than either a totally new thing or a 'corrupted' version of the original.

[–] [email protected] 0 points 1 month ago (2 children)

Exactly. It's overtrained on the test, ignoring the differences. If you instead used something it recognises but doesn't recognise as the test pattern (having the same tokens/embeddings) it will perform better. I'm not joking, it's a common tactic to get around censoring. You're just going around the issue. What I'm saying is they've trained the model so much on benchmarks that it is indeed dumber.

[–] [email protected] 0 points 1 month ago (1 children)

I don't think that the actual performance here is as important as the fact that it's clearly not meaningfully "reasoning" at all. This isn't a failure mode that happens if it's actually thinking through the problem in front of it and understanding the request. It's a failure mode that comes from pattern matching without actual reasoning.

[–] [email protected] 0 points 1 month ago* (last edited 1 month ago) (1 children)

It’s a failure mode that comes from pattern matching without actual reasoning.

Exactly. Also looking at its chain-of-wordvomit (which apparently I can't share other than by cut and pasting it somewhere), I don't think this is the same as GPT 4 overfitting to the original river crossing and always bringing items back needlessly.

Note also that in one example it discusses moving the duck and another item across the river (so "up to two other items" works); it is not ignoring the prompt, and it isn't even trying to bring anything back. And its answer (calling it impossible) has nothing to do with the original.

In the other one it does bring items back, it tries different orders, even finds an order that actually works (with two unnecessary moves), but because it isn't an AI fanboy reading tea leaves, it still gives out the wrong answer.

Here's the full logs:

https://pastebin.com/HQUExXkX

Content warning: AI wordvomit which is so bad it is folded hidden in a google tool.

[–] [email protected] 0 points 1 month ago (1 children)

That's fascinating, actually. Like, it seems like it shouldn't be possible to create this level of grammatically correct text without understanding the words you're using, and yet even immediately after defining "unsupervised" correctly the system still (supposedly) immediately sets about applying a baffling number of alternative constraints that it seems to pull out of nowhere.

OR alternatively despite letting it "cook" for longer and pregenerate a significant volume of its own additional context before the final answer the system is still, at the end of the day, an assembly of sochastic parrots who don't actually understand anything.

[–] [email protected] 0 points 1 month ago* (last edited 1 month ago)

Yeah it really is fascinating. It follows some sort of recipe to try to solve the problem, like it's trained to work a bit like an automatic algebra system.

I think they had employed a lot of people to write generators of variants of select common logical puzzles, e.g. river crossings with varying boat capacities and constraints, generating both the puzzle and the corresponding step by step solution with "reasoning" and re-printing of the state of the items on every step and all that.

It seems to me that their thinking is that successive parroting can amount to reasoning, if its parroting well enough. I don't think it can. They have this one-path approach, where it just tries doing steps and representing state, just always trying the same thing.

What they need for this problem is to take a different kind of step, reduction (the duck can not be left unsupervised -> the duck must be taken with me on every trip -> rewrite a problem without the duck and with 1 less boat capacity -> solve -> rewrite the solution with "take the duck with you" on every trip).

But if they add this, then there's two possible paths it can take on every step, and this thing is far too slow to brute force the right one. They may get it to solve my duck variant, but at the expense of making it fail a lot of other variants.

The other problem is that even seemingly most elementary reasoning involves very many applications of basic axioms. This is what doomed symbol manipulation "AI" in the past and this is what is dooming it now.

[–] [email protected] 0 points 1 month ago (1 children)

The machine I love can't be dumb, I love the machine and I can't love what is dumb.

[–] [email protected] 0 points 1 month ago

another classic induncetive reasoning completed successfully!

[–] [email protected] 0 points 1 month ago (1 children)

I bet it generates stochastic nonsense you’ll read like tea leaves

[–] [email protected] 0 points 1 month ago (2 children)

Bet

[–] [email protected] 0 points 1 month ago

Butt

( Y )

[–] [email protected] 0 points 1 month ago

The accumulated filth of all their slop and murder will foam up about their waists and all the whores and prompt enjoyers will look up and shout: 'Bet' - and I'll whisper 'no.'