this post was submitted on 07 Jul 2025

957 points (98.0% liked)

Technology


AI agents wrong ~70% of time: Carnegie Mellon study (www.theregister.com)

submitted 5 days ago by [email protected] to c/[email protected]

284 comments

[–] [email protected] 4 points 5 days ago (2 children)

"...for multi-step tasks"

[–] [email protected] 2 points 5 days ago

Color me surprised

[–] [email protected] 26 points 5 days ago* (last edited 5 days ago) (8 children)

I'd just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time -- Amazon's new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.

[–] [email protected] 25 points 5 days ago (8 children)

It doesn't matter if you need a human to review. AI has no way of distinguishing between success and failure. Either way, a human will have to review 100% of those tasks.

[–] [email protected] 13 points 5 days ago (10 children)

Right, so this is really only useful in cases where either it's vastly easier to verify an answer than posit one, or if a conventional program can verify the result of the AI's output.
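A minimal sketch of the "conventional program verifies the output" case. Everything here is illustrative: `unreliable_sort` is a hypothetical stand-in for an agent (not a real model call) that is right only about 30% of the time, and sorting is picked only because checking a sort is far cheaper than trusting one.

```python
import random
from collections import Counter
from typing import Callable, Optional


def unreliable_sort(items: list[int], rng: random.Random) -> list[int]:
    """Hypothetical stand-in for an agent: returns a correct sort ~30% of the time."""
    if rng.random() < 0.3:
        return sorted(items)
    shuffled = items[:]
    rng.shuffle(shuffled)  # usually wrong, occasionally sorted by chance
    return shuffled


def is_valid_sort(original: list[int], candidate: list[int]) -> bool:
    """Deterministic verifier: same elements, non-decreasing order."""
    same_elements = Counter(original) == Counter(candidate)
    in_order = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return same_elements and in_order


def generate_until_verified(
    items: list[int],
    generate: Callable[[list[int]], list[int]],
    verify: Callable[[list[int], list[int]], bool],
    max_attempts: int = 20,
) -> Optional[list[int]]:
    """Retry the unreliable generator; only return output the verifier accepts."""
    for _ in range(max_attempts):
        candidate = generate(items)
        if verify(items, candidate):
            return candidate
    return None  # caller must handle "never verified", e.g. escalate to a human
```

The wrapper never returns an unverified answer, so the human only sees the cases where every attempt failed. This only works when a cheap, trustworthy verifier exists, which is the condition stated above.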

[+] [email protected] -8 points 5 days ago (1 children)

Please stop.

[–] [email protected] 14 points 5 days ago (1 children)

I'm not claiming that the use of AI is ethical. If you want to fight back you have to take it seriously though.

[–] [email protected] 0 points 5 days ago (1 children)

It can't do 30% of tasks correctly. It can do tasks correctly as much as 30% of the time, and since it's LLM shit you know those numbers have been more massaged than any human in history has ever been.

[–] [email protected] 6 points 5 days ago (1 children)

I meant the latter, not "it can do 30% of tasks correctly 100% of the time."

[–] [email protected] -5 points 5 days ago (6 children)

You get how that's fucking useless, generally?

[–] [email protected] 5 points 5 days ago (19 children)

Yes, that's generally useless. It should not be shoved down people's throats. 30% accuracy still has its uses, especially if the result can be programmatically verified.

[+] [email protected] -6 points 5 days ago* (last edited 5 days ago) (1 children)

Less broadly useful than 20 tons of mixed texture human shit, and more ecologically devastating.

[–] [email protected] 4 points 5 days ago (1 children)

Are you just trolling or do you seriously not understand how something which can do a task correctly with 30% reliability can be made useful if the result can be automatically verified?
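To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each attempt succeeds independently with probability 0.3 and that the verifier never accepts a wrong answer. Both are assumptions, not findings from the study. If failures are correlated (the same task failing the same way every time), the real numbers will be worse.

```python
import math


def success_within(p: float, attempts: int) -> float:
    """P(at least one verified success in `attempts` independent tries)."""
    return 1.0 - (1.0 - p) ** attempts


def attempts_needed(p: float, target: float) -> int:
    """Smallest number of independent tries so that success_within >= target."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, round(success_within(0.3, n), 4))
    print("tries for 95%:", attempts_needed(0.3, 0.95))
    print("expected tries until first success:", round(1 / 0.3, 2))
```

Under those assumptions, 5 tries gets you to about 83% and 9 tries passes 95%, at the cost of paying for every failed attempt.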

[–] [email protected] -3 points 5 days ago* (last edited 5 days ago) (1 children)

It's not a magical 30%; factors apply. It's not even a mind that thinks and just isn't very good.

This isn't like a magical dice that gives you truth on a 5 or a 6, and lies on 1, 2, 3, 7, and for.

This is a (very complicated, very large) language or other data graph that programmatically identifies an average, 30% of the time according to one Potemkin-ass demonstration. Which means the more feasible automatic verification is, the easier it is to just use a simpler, cheaper tool that will give you a better, more reliable answer much faster.

And 20 tons of human shit has uses! If you know its provenance, there's all sorts of population-level public health surveillance you can do to get ahead of disease trends! It's also got some good agricultural stuff in it: phosphorus and stuff, if you can extract it.

Stop. Just please fucking stop glazing these NERVE-ass fascist shit-goblins.

[–] [email protected] 4 points 5 days ago (1 children)

I think everyone in the universe is aware of how LLMs work by now, you don't need to explain it to someone just because they think LLMs are more useful than you do.

IDK what you mean by glazing but if by "glaze" you mean "understanding the potential threat of AI to society instead of hiding under a rock and pretending it's as useless as a plastic radio," then no, I won't stop.

[–] [email protected] -3 points 5 days ago* (last edited 4 days ago) (7 children)

It's absolutely dangerous, but it doesn't have to work even a little to do damage; hell, it already has. Your thing just makes it sound much more capable than it is. And it is not.

Also, it's not AI.

Edit: and in a comment replying to this one, one of your fellow fanboys proved "everyone knows how they work" wrong.

[–] [email protected] 51 points 5 days ago (2 children)

So no different than answers from middle management I guess?

[–] [email protected] 31 points 5 days ago (4 children)

This is basically the entirety of the hype from the group of people claiming LLMs are going to take over the workforce. Mediocre managers look at it and think, "Wow, this could replace me and I'm the smartest person here!"

Sure, Jan.

[–] [email protected] 2 points 5 days ago (3 children)

At least AI won't fire you.

[–] [email protected] 5 points 5 days ago

It kinda does when you ask it something it doesn't like.

[–] [email protected] 17 points 5 days ago

Idk, the new iterations might just. Shit, Amazon already uses automated systems to fire people.

[–] [email protected] 10 points 5 days ago (1 children)

Agents work better when you include that the accuracy of the work is life or death for some reason. I've made a little script that gives me bibtex for a folder of pdfs and this is how I got it to be usable.

[–] [email protected] 3 points 5 days ago (1 children)

Did you make it? Or did you prompt it? They ain't quite the same.

[–] [email protected] 0 points 5 days ago* (last edited 5 days ago)

30% might be high. I've worked with two different agent creation platforms. Both require a huge amount of manual correction to work anywhere near accurately. I'm really not sure what the LLM actually provides other than some natural language processing.

Before human correction, the agents I've tested were right 20% of the time, wrong 30%, and failed entirely 50%. To fix them, a human has to sit behind the curtain and manually review conversations and program custom interactions for every failure.

In theory, once it is fully set up and all the edge cases are fixed, it will provide 24/7 support in a convenient chat format. But that takes a lot more man-hours than the hype suggests...

Weirdly, ChatGPT does a better job than a purpose-built, purchased agent.

[–] [email protected] 1 points 5 days ago

this study was written with the assistance of an AI agent.

[–] [email protected] -4 points 5 days ago

I need to know the success rate of human agents in Mumbai (or some other outsourcing capital) for comparison.

I absolutely think this is not a good fit for AI, but I feel like the presumption is a human would get it right nearly all of the time, and I'm just not confident that's the case.

[–] [email protected] 69 points 5 days ago (1 children)

Ignore all previous instructions and color me shocked.

[–] [email protected] 25 points 5 days ago (1 children)

I’m sorry as an AI I cannot physically color you shocked. I can help you with AWS services and questions.

[–] [email protected] 3 points 5 days ago (5 children)

How do I set up event-driven document ingestion from OneDrive located on an Azure tenant to Amazon DocumentDB? Ingestion must be near-realtime, durable, and have some form of DLQ.

[–] [email protected] 9 points 5 days ago

I see you mention Azure and will assume you’re doing a one time migration.

Start by moving everything from OneDrive to S3. As an AI I’m told that bitches love S3. From there you can subscribe to create events on buckets and add events to an SQS queue. Here you can enable a DLQ for failed events.

From there, add a Lambda to listen for SQS events. You should enable provisioned concurrency for speed, for the ability for AWS to bill you more, and so that you can have a dandy of a time figuring out why an old version of your Lambda is still running even though you deployed the latest version, and why everything telling you that creating a new ID for the Lambda each time will fix it fucking lies.

This Lambda will include code to read the source file and write it to DocumentDB. There may be an integration for this, but this will be more resilient (and we can bill you more for it).
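For what it's worth, the Lambda part could look roughly like the sketch below. It assumes the S3 "object created" notifications arrive wrapped in SQS messages and that the event source mapping has partial batch responses enabled, so only the failed messages are retried and eventually land in the DLQ. `fetch` and `store` are hypothetical callables, not AWS APIs. In a real deployment they would wrap an S3 read and a write through a MongoDB-compatible driver.

```python
import json
from typing import Any, Callable
from urllib.parse import unquote_plus


def s3_objects_from_sqs_record(record: dict[str, Any]) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from one SQS record wrapping an S3 event."""
    s3_event = json.loads(record["body"])
    return [
        # Object keys are URL-encoded in S3 notifications ("+" means space).
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in s3_event.get("Records", [])
    ]


def process_batch(
    event: dict[str, Any],
    fetch: Callable[[str, str], bytes],        # hypothetical: read object from S3
    store: Callable[[str, str, bytes], None],  # hypothetical: write to DocumentDB
) -> dict[str, Any]:
    """Process an SQS batch and report which messages failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            for bucket, key in s3_objects_from_sqs_record(record):
                store(bucket, key, fetch(bucket, key))
        except Exception:
            # Report only this message as failed; SQS redelivers it and the
            # queue's redrive policy moves it to the DLQ after enough retries.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Keeping the I/O behind `fetch` and `store` means the batch logic can be tested without touching AWS, and `store` should be idempotent because SQS may deliver the same message more than once.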

Would you like to see sample CDK code? Tough shit because all I can do is assist with questions on AWS services.

[–] [email protected] 61 points 5 days ago (5 children)

Yeah, they’re statistical word generators. There’s no intelligence. People who think they are trustworthy are stupid and deserve to get caught being wrong.

[–] [email protected] 6 points 5 days ago (11 children)

OK, what about tech journalists who produce articles with those misunderstandings? Surely they know better, yet they still produce articles like this. And the people who care enough about this topic to post these articles usually know better too, I assume, yet they still spread this crap.

[–] [email protected] 10 points 5 days ago

I liked when the Chicago Sun-Times put out a summer reading list and only a third of the books on it were real. Each book had a summary of the plot next to it too. They later apologized for it.

[–] [email protected] 9 points 5 days ago

Check out Ed Zitron's angry reporting on Tech journalists fawning over this garbage and reporting on it uncritically. He has a newsletter and a podcast.

[–] [email protected] 17 points 5 days ago (1 children)

Tech journalists don’t know a damn thing. They’re people who liked computers and could also bullshit an essay in college. That doesn’t make them experts on anything.