this post was submitted on 08 Jun 2025
1 points (100.0% liked)

TechTakes


Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

top 7 comments
[–] [email protected] 0 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

Further support for the memorization claim: I've posted examples on this forum of novel river crossing puzzles where LLMs completely fail.

Note that Apple’s actors / agents river crossing is a well-known “jealous husbands” variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can’t follow its own explanation (since of course it isn’t its own explanation but a plagiarized one, even if the words are changed).

edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506

I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second set (not even testing it on any online chatbots). Then, in a year or two, see how much the set that's public improves versus the one that's held back.
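(Rough sketch of what I mean by the random split, assuming a hypothetical list of hand-written puzzle variants; the names and counts here are placeholders:)

```python
import random

# Hypothetical hand-written puzzle variants (placeholders).
puzzles = [f"river-crossing variant {i}" for i in range(1, 21)]

rng = random.Random(2025)  # fixed seed so the split is reproducible
rng.shuffle(puzzles)

half = len(puzzles) // 2
public_set = puzzles[:half]    # test these now and post them
heldout_set = puzzles[half:]   # never sent to any online chatbot

print("public:", public_set)
print("held out:", heldout_set)
```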

[–] [email protected] 0 points 2 weeks ago (1 children)

The latter test fails if they write a specific bit of code to put out the 'LLMs fail the river crossing' fire, btw. Still a good test.

[–] [email protected] 0 points 2 weeks ago

It would have to be more than just river crossings, yeah.

Although I'm also dubious that their LLM is good enough for universal river crossing puzzle solving using a tool. It's not that simple: the constraints have to be translated into a format the tool understands, and the answer translated back. I was told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I think it wasn't anything quite as general as that.
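(For what "a tool" would even look like: the standard approach is a brute-force breadth-first search over bank states, with the puzzle's safety rule supplied as a predicate. A rough sketch below, assuming the plain three-couples "jealous husbands" setup with a two-person boat; the names and the rule are illustrative, and translating a novel variant's constraints into that predicate is exactly the part I don't trust the LLM to do:)

```python
from collections import deque
from itertools import combinations

# Three couples, boat carries at most two people (illustrative setup).
PEOPLE = frozenset(["H1", "H2", "H3", "W1", "W2", "W3"])

def safe(group):
    """Jealous-husbands rule: a wife may not be in a group containing
    another husband unless her own husband is also in that group."""
    for person in group:
        if person.startswith("W"):
            own_husband = "H" + person[1]
            if own_husband not in group and any(p.startswith("H") for p in group):
                return False
    return True

def solve():
    # State: (people on the starting bank, boat on starting bank?)
    start = (PEOPLE, True)
    goal = (frozenset(), False)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, boat_left), path = queue.popleft()
        if (left, boat_left) == goal:
            return path
        bank = left if boat_left else PEOPLE - left
        for size in (1, 2):
            for movers in combinations(bank, size):
                movers = frozenset(movers)
                new_left = left - movers if boat_left else left | movers
                # The rule has to hold in the boat and on both banks.
                if not (safe(movers) and safe(new_left) and safe(PEOPLE - new_left)):
                    continue
                state = (new_left, not boat_left)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [sorted(movers)]))
    return None  # no solution under these constraints

for step, crossing in enumerate(solve(), 1):
    print(step, crossing)
```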

[–] [email protected] 0 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

The promptfondlers on places like /r/singularity are trying so hard to spin this paper. "It's still doing reasoning, it just somehow mysteriously fails when its reasoning gets too long!" or "LRMs improved with an intermediate number of reasoning tokens" or some other excuse. They are missing the point that short and medium length "reasoning" traces are potentially the result of pattern memorization. If the LLMs were actually reasoning and not just pattern-memorizing, then extending the number of reasoning tokens proportionately with the task length should let them maintain performance on the tasks instead of catastrophically failing. Because this isn't the case, Apple's paper is evidence for what big names like Gary Marcus, Yann LeCun, and many pundits and analysts have been repeatedly saying: LLMs achieve their results through memorization, not generalization, especially not out-of-distribution generalization.

[–] [email protected] 1 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

promptfondlers

Holy shit, I love it.

[–] [email protected] 0 points 2 weeks ago (1 children)
[–] [email protected] 0 points 2 weeks ago

i prefer that you take your ableist vocabulary somewhere else, preferably stick it up your arse.