this post was submitted on 07 Jul 2025

957 points (98.0% liked)

Technology

AI agents wrong ~70% of time: Carnegie Mellon study (www.theregister.com)

submitted 4 days ago by [email protected] to c/[email protected]

[–] [email protected] 5 points 4 days ago (2 children)

yes, that's generally useless. It should not be shoved down people's throats. 30% accuracy still has its uses, especially if the result can be programmatically verified.
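
A minimal sketch of what "programmatically verified" can look like in practice: keep generating candidates and only accept one that passes a cheap check. Everything here is hypothetical, including `solve_with_retries`, `flaky_generate` (a stand-in for a model that is right roughly 30% of the time), and the trivial `verify`.

```python
import random


def solve_with_retries(generate, verify, max_attempts=10):
    """Call generate() until verify() accepts a candidate or attempts run out.

    Returns (candidate, attempts_used) on success,
    or (None, max_attempts) if every attempt was rejected.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if verify(candidate):
            return candidate, attempt
    return None, max_attempts


# Hypothetical stand-in for a model that is right about 30% of the time:
# the correct answer is 42, but it usually returns some other number.
rng = random.Random(0)


def flaky_generate():
    return 42 if rng.random() < 0.3 else rng.randint(0, 41)


def verify(candidate):
    # The cheap, programmatic check. Here it is just an equality test;
    # in real use it might be a test suite, a type checker, or a schema.
    return candidate == 42


answer, attempts = solve_with_retries(flaky_generate, verify)
print(answer, attempts)
```

The point of the design is that wrong candidates never leave the loop, so the unreliable generator costs time rather than correctness. That only holds if the verifier itself is trustworthy.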

[–] [email protected] 3 points 3 days ago (3 children)

Run something with a 70% failure rate 10x and you get to a cumulative 98% pass rate. LLMs don't get tired and they can be run in parallel.

[–] [email protected] 0 points 3 days ago (1 children)

What's 0.7^10?

[–] [email protected] 2 points 3 days ago (1 children)

About 0.02

[–] [email protected] 0 points 3 days ago (1 children)

So the chances of it being right ten times in a row are 2%.

[–] [email protected] 2 points 3 days ago* (last edited 3 days ago) (2 children)

No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
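
For anyone who wants to check the arithmetic, here is a small sketch. It assumes every attempt is independent and fails 70% of the time; the function name is made up for illustration. The figures quoted in the thread are rounded: the exact values come out closer to 2.8% and 97.2%.

```python
def p_at_least_one_success(p_success, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p_success) ** attempts


p_all_fail = 0.7 ** 10                         # every one of 10 tries fails
p_any_pass = p_at_least_one_success(0.3, 10)   # at least one of 10 succeeds

print(f"all ten fail:        {p_all_fail:.4f}")   # 0.0282
print(f"at least one passes: {p_any_pass:.4f}")   # 0.9718
```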

[–] [email protected] 2 points 3 days ago (1 children)

Ah, my bad, you're right, for being consistently correct, I should have done 0.3^10=0.0000059049

so the chances of it being right ten times in a row are less than one thousandth of a percent.

No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
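
The "right ten times in a row" case is the opposite calculation: every step has to succeed. A quick sketch, again assuming independent steps (the helper names are hypothetical), which also shows how accurate each step would need to be for a ten-step chain to come out right 90% of the time:

```python
def p_all_correct(p_step, steps):
    """Probability that every one of `steps` independent steps is correct."""
    return p_step ** steps


def required_step_accuracy(target, steps):
    """Per-step accuracy needed so that all `steps` succeed with probability `target`."""
    return target ** (1 / steps)


print(f"{p_all_correct(0.3, 10):.10f}")          # 0.0000059049
print(f"{required_step_accuracy(0.9, 10):.4f}")  # 0.9895
```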

[–] [email protected] 1 points 3 days ago (1 children)

That looks better. Even with a fair coin, 10 heads in a row is very unlikely (about 0.1%).

And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.

[–] [email protected] 1 points 2 days ago (1 children)

Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

[–] [email protected] 1 points 2 days ago (1 children)

Dunno. Ask 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

[–] [email protected] 1 points 2 days ago* (last edited 2 days ago)

You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. Always slower than an LLM, but LLMs get more and more likely to veer off on some flight of fancy, further and further from reality, the more they say to you. The chances of one staying factual in the long term are really low.

It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.

[–] [email protected] 1 points 3 days ago

don’t you dare understand the explicitly obvious reasons this technology can be useful and the essential differences between P and NP problems. why won’t you be angry >:(

[–] [email protected] 4 points 3 days ago (1 children)

The problem is they are not i.i.d., so this doesn't really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we're already looking at "agents," so they're probably already doing chain-of-thought.
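
To illustrate why the lack of independence matters, here is a rough sketch with made-up numbers. Suppose the 30% average hides two kinds of task: 60% that the model never solves and 40% that it solves 75% of the time. The split and the function names are assumptions for illustration only, not figures from the study.

```python
def pass_at_k_independent(p, k):
    """Chance of at least one success in k tries if every try succeeds with probability p."""
    return 1 - (1 - p) ** k


def pass_at_k_mixture(groups, k):
    """Same quantity when tasks fall into groups with different per-try success rates.

    groups: list of (share_of_tasks, per_try_success_probability).
    """
    return sum(share * (1 - (1 - p) ** k) for share, p in groups)


# Hypothetical split: 60% of tasks are never solved, 40% are solved 75% of the time.
groups = [(0.6, 0.0), (0.4, 0.75)]
average = sum(share * p for share, p in groups)

print(f"average single-try success: {average:.3f}")                          # 0.300
print(f"10 tries, independent:      {pass_at_k_independent(0.3, 10):.3f}")   # 0.972
print(f"10 tries, mixture:          {pass_at_k_mixture(groups, 10):.3f}")    # 0.400
```

Same 30% headline number, but in the mixture case retries top out near 40%, because no amount of resampling helps on the tasks the model cannot do at all.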

[–] [email protected] 3 points 3 days ago

Very fair comment. In my experience, even when you increase the temperature, you get stuck in local minima.

I was just trying to illustrate how 70% failure rates can still be useful.

[–] [email protected] 4 points 3 days ago (1 children)

I have actually been doing this lately: iteratively prompting AI to write software and fix its errors until something useful comes out. It's a lot like machine translation. I speak fluent C++, but I don't speak Rust, but I can hammer away on the AI (with English language prompts) until it produces passable Rust for something I could write for myself in C++ in half the time and effort.

I also don't speak Finnish, but Google Translate can take what I say in English and put it into at least somewhat comprehensible Finnish without egregious translation errors most of the time.

Is this useful? When C++ is getting banned for "security concerns" and Rust is the required language, it's at least a little helpful.

[–] [email protected] 3 points 3 days ago (1 children)

I'm impressed you can make strides with Rust with AI. I am in a similar boat, except I've found LLMs are terrible with Rust.

[–] [email protected] 3 points 3 days ago (1 children)

I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library. I can't make the thing I want work in that library either (library docs say it's possible, but...). When I posed a more open-ended request without specifying the library to use, it succeeded, after a fashion. It will give you code with cargo build errors, I copy-paste the error back to it like "address: " and a bit more than half of the time it is able to respond with a working fix.
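
The copy-paste loop described above could be automated along these lines. This is only a sketch under assumptions: `ask_model` is a hypothetical callback that sends the compiler output to whatever model you use and applies its suggested fix to the source tree, and `repair_loop` is a made-up name. The only real tool call is `cargo build`, run through Python's `subprocess.run`.

```python
import subprocess


def cargo_build(project_dir):
    """Run `cargo build` in project_dir and return (succeeded, compiler_output)."""
    result = subprocess.run(
        ["cargo", "build"], cwd=project_dir, capture_output=True, text=True
    )
    # Cargo writes its diagnostics to stderr.
    return result.returncode == 0, result.stderr


def repair_loop(build, ask_model, max_fixes=5):
    """Build, and on failure hand the error text to the model, then build again.

    build() must return (ok, errors). ask_model(errors) is a hypothetical
    callback that is expected to edit the source in place.
    Returns the number of fixes requested before the build passed,
    or None if the build still fails after max_fixes requests.
    """
    fixes = 0
    while True:
        ok, errors = build()
        if ok:
            return fixes
        if fixes == max_fixes:
            return None
        ask_model(errors)
        fixes += 1
```

Usage would look something like `repair_loop(lambda: cargo_build("my_crate"), ask_model)`, where `my_crate` is a placeholder path. The cap on fixes matters: with a fix rate of a bit more than half, as reported above, some errors will simply not converge and need a human.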

[–] [email protected] 1 points 3 days ago (1 children)

i find that rust’s architecture and design decisions give the LLM quite good guardrails and kind of keep it from doing anything too wonky. the issue arises in cases like these where the rust ecosystem is quite young and documentation/instruction can be poor, even for a human developer.

i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.

[–] [email protected] 2 points 3 days ago

"i think rust actually is quite well suited to agentic development workflows, it just needs to mature more."

I agree. The agents also need to mature more to handle multi-level structures - work on a collection of smaller modules to get a larger system with more functionality. I can see the path forward for those tools, but the ones I have access to definitely aren't there yet.

[+] [email protected] -6 points 4 days ago* (last edited 4 days ago) (1 children)

Less broadly useful than 20 tons of mixed-texture human shit, and more ecologically devastating.

[–] [email protected] 4 points 4 days ago (1 children)

Are you just trolling or do you seriously not understand how something which can do a task correctly with 30% reliability can be made useful if the result can be automatically verified?

[–] [email protected] -3 points 4 days ago* (last edited 4 days ago) (1 children)

It's not a magical 30%, factors apply. It's not even a mind that thinks and just isn't very good.

This isn't like a magical die that gives you truth on a 5 or a 6, and lies on 1, 2, 3, and 4.

This is a (very complicated, very large) language or other data graph that programmatically identifies an average. 30% of the time, according to one Potemkin-ass demonstration. Which means the more possible that is, the easier it is to use a simpler, cheaper tool that will give you a better, more reliable answer much faster.

And 20 tons of human shit has uses! If you know its provenance, there's all sorts of population-level public health surveillance you can do to get ahead of disease trends! It's also got some good agricultural stuff in it: phosphorus and stuff, if you can extract it.

Stop. Just please fucking stop glazing these NERVE-ass fascist shit-goblins.

[–] [email protected] 4 points 4 days ago (1 children)

I think everyone in the universe is aware of how LLMs work by now, you don't need to explain it to someone just because they think LLMs are more useful than you do.

IDK what you mean by glazing but if by "glaze" you mean "understanding the potential threat of AI to society instead of hiding under a rock and pretending it's as useless as a plastic radio," then no, I won't stop.

[–] [email protected] -3 points 4 days ago* (last edited 3 days ago) (2 children)

It's absolutely dangerous but it doesn't have to work even a little to do damage; hell, it already has. Your thing just makes it sound much more capable than it is. And it is not.

Also, it's not AI.

Edit: and in a comment replying to this one, one of your fellow fanboys proved "everyone knows how they work" wrong.

[–] [email protected] 0 points 3 days ago

the industrial revolution could be seen as dangerous, yet it brought the highest standard of living increase in centuries

[–] [email protected] 3 points 4 days ago (1 children)

semantics.

[–] [email protected] -3 points 4 days ago (2 children)

No, it matters. You're pushing the lie they want pushed.

[–] [email protected] 2 points 3 days ago (1 children)

Hitler liked to paint, doesn't make painting wrong. The fact that big tech is pushing AI isn't evidence against the utility of AI.

That common parlance is to call machine learning "AI" these days doesn't matter to me in the slightest. Do you have a definition of "intelligence"? Do you object when pathfinding is called AI? Or STRIPS? Or bots in a video game? Dare I say it, the main difference between those AIs and LLMs is their generality -- so why not just call it GAI at this point tbh. This is a question of semantics so it really doesn't matter to the deeper question. Doesn't matter if you call it AI or not, LLMs work the same way either way.

[–] [email protected] 0 points 3 days ago (1 children)

Semantics, of course, famously never matter.

[–] [email protected] 2 points 3 days ago

yeah.