this post was submitted on 15 Oct 2024

493 points (96.6% liked)

Apple study exposes deep cracks in LLMs’ “reasoning” capabilities (arstechnica.com)

submitted 6 months ago by [email protected] to c/[email protected]

109 comments

(page 2) 50 comments

[–] [email protected] 56 points 6 months ago (11 children)

So do I every time I ask it a slightly complicated programming question

[–] [email protected] 17 points 6 months ago* (last edited 6 months ago)

Here's the cycle we've gone through multiple times and are currently in:

AI winter (low research funding) -> incremental scientific advancement -> breakthrough in capabilities as multiple incremental advances to the scientific models build on each other over time (expert systems, LLMs, neural networks, etc.) -> engineering creates new tech products/frameworks/services based on the new science -> hype for the new tech creates sales and economic activity, research funding, subsidies, etc. -> (for LLMs we're here) people become familiar with the new tech's capabilities and limitations through use -> hype spending bubble bursts when the overspend doesn't keep up with "infinite money, line goes up" expectations or with new research breakthroughs -> AI winter -> etc...

[–] [email protected] 9 points 6 months ago

Someone needs to pull the plug on all of that stuff.

[–] [email protected] 90 points 6 months ago (4 children)

Did anyone believe they had the ability to reason?

[–] [email protected] 37 points 6 months ago

Yes

[–] [email protected] 56 points 6 months ago (1 children)

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding "seemingly relevant but ultimately inconsequential statements" to the questions

Good thing they're being trained on random posts and comments on the internet, which are known for being succinct and accurate.

[–] [email protected] 23 points 6 months ago (3 children)

Yeah, especially given that so many popular vegetables are members of the brassica genus

[–] [email protected] 46 points 6 months ago (2 children)

statistical engine suggesting words that sound like they'd probably be correct is bad at reasoning

How can this be??

[–] [email protected] 19 points 6 months ago (2 children)

I would say that if anything, LLMs are showing cracks in our way of reasoning.

[–] [email protected] 7 points 6 months ago (1 children)

Totally unexpectable!!!

[–] [email protected] 8 points 6 months ago* (last edited 6 months ago) (1 children)

antianticipatable!

[–] [email protected] 15 points 6 months ago

They predict; they don't reason.

[–] [email protected] 20 points 6 months ago (1 children)

I hope this gets circulated enough to reduce the ridiculous amount of investment and energy waste that the ramping-up of "AI" services has brought. All the companies have just gone way too far off the deep end with this shit that most people don't even want.

[–] [email protected] 18 points 6 months ago (2 children)

People working with these technologies have known this for quite a while. It's nice of Apple's researchers to formalize it, but nobody is really surprised, least of all the companies funnelling traincars of money into the LLM furnace.

[–] [email protected] 27 points 6 months ago* (last edited 6 months ago) (1 children)

I feel like a draft landed on Tim's desk a few weeks ago, which would explain why they suddenly pulled back on OpenAI funding.

People on the removed superfund birdsite are already saying Apple is missing out on the next revolution.

[–] [email protected] 16 points 6 months ago (1 children)

"Superfund birdsite" I am shamelessly going to steal from you

[–] [email protected] 6 points 6 months ago

please, be my guest.

[–] [email protected] 1 points 6 months ago* (last edited 6 months ago) (1 children)

Are we not flawed too? Does that not make AI... human?

[–] [email protected] 24 points 6 months ago (1 children)

How dare you imply that humans just make shit up when they don't know the truth

[–] [email protected] 6 points 6 months ago

Did I misremember something, or is my memory easily influenced by external stimuli? No, the Mandela Effect must be real!

[–] [email protected] -3 points 6 months ago (1 children)

Real headline: Apple research presents possible improvements in benchmarking LLMs.

[–] [email protected] 20 points 6 months ago (2 children)

Not even close. The paper is questioning LLMs' ability to reason. The article talks about fundamental flaws of LLMs and how we might need different approaches to achieve reasoning. The benchmark is only used to prove the point. It is definitely not the headline.

[–] [email protected] -3 points 6 months ago* (last edited 6 months ago) (1 children)

You say “Not even close.” in response to the suggestion that Apple’s research can be used to improve benchmarks for AI performance, but then later say the article talks about how we might need different approaches to achieve reasoning.

Now, mind you - achieving reasoning can only happen if the model is accurate and works well. And to have a good model, you must have good benchmarks.

Not to belabor the point, but here’s what the article and study say:

The article talks at length about the reliance on a standardized set of questions (GSM8K) and how the questions themselves may have made their way into the training data. It notes that modifying the questions dynamically leads to decreases in performance of the tested models, even if the complexity of the problem to be solved has not gone up.

The third sentence of the paper (Abstract section) says this: “While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics.” The rest of the abstract goes on to discuss (paraphrased in layman’s terms) that LLMs are ‘studying for the test’ and not generally achieving real reasoning capabilities.

By presenting their methodology - dynamically changing the evaluation criteria to reduce data pollution and require models be capable of eliminating red herrings - the Apple researchers are offering a possible way benchmarking can be improved.
Which is what the person you replied to stated.

The commenter is fairly close, it seems.
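
To make the methodology above a bit more concrete, here is a rough Python sketch of the general idea: generate question variants from a template so the exact wording and numbers are unlikely to be in any training set, and optionally append a red herring that does not change the answer. The template text, the name list, and the `make_variant` helper are all made up for illustration; this is not the paper's code or its actual templates.

```python
import random

# Hypothetical template pieces, invented for this sketch
# (not taken from GSM8K or GSM-Symbolic).
TEMPLATE = "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
DISTRACTOR = "{c} of the apples are slightly smaller than average. "
QUESTION = "How many apples does {name} have?"
NAMES = ["Ava", "Ben", "Chloe", "Dev"]


def make_variant(rng, with_distractor=False):
    """Return (question_text, correct_answer) for one random variant."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    text = TEMPLATE.format(name=name, a=a, b=b)
    if with_distractor:
        # The red herring mentions a number but never affects the answer.
        c = rng.randint(1, min(a, b))
        text += DISTRACTOR.format(c=c)
    text += QUESTION.format(name=name)
    return text, a + b


if __name__ == "__main__":
    # Same seed, so both variants share the name and numbers.
    plain, answer = make_variant(random.Random(0))
    noisy, same_answer = make_variant(random.Random(0), with_distractor=True)
    print(plain, "->", answer)
    print(noisy, "->", same_answer)
```

A model that is actually reasoning should score about the same on the plain and noisy variants, since the correct answer is identical by construction.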

[–] [email protected] 0 points 6 months ago

Once there’s a benchmark, LLMs can optimise for it. This is just another piece of news where people call “game over” but the money poured into R&D isn’t stopping anytime soon. Wasn’t synthetic data supposed to be game over for LLMs? Its limitations have been identified and it’s still being leveraged.
