this post was submitted on 02 Jul 2025
20 points (100.0% liked)

TechTakes

Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

[–] [email protected] 5 points 7 hours ago* (last edited 7 hours ago)

When they tested on bugs not in SWE-Bench, the success rate dropped to 57‑71% on random items, and 50‑68% on fresh issues created after the benchmark snapshot. I’m surprised they did that well.

After the benchmark snapshot. That could still be before the LLM's training data cutoff, or the fixes could be available via RAG.

edit: For a fair test you have to use GitHub issues that have not yet been resolved by a human.
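
A rough sketch of that filter, for what it's worth. Everything here is made up for illustration: the `Issue` record, the field names, the sample data and the cutoff date are assumptions, not anything from the paper or from SWE-Bench. It keeps only issues that were opened after the model's training cutoff and that still have no human fix at evaluation time, so the answer can't come from training data or from retrieval.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Issue:
    """Hypothetical issue record; not a real SWE-Bench or GitHub schema."""
    id: str
    created: date
    resolved: Optional[date]  # None means no human has fixed it yet


def is_uncontaminated(issue: Issue, training_cutoff: date) -> bool:
    # Opened after the (assumed) training cutoff, so the issue text
    # is not in the training data...
    opened_after_cutoff = issue.created > training_cutoff
    # ...and still unresolved, so no human fix exists to retrieve via RAG.
    no_human_fix = issue.resolved is None
    return opened_after_cutoff and no_human_fix


def fair_test_set(issues: List[Issue], training_cutoff: date) -> List[Issue]:
    return [i for i in issues if is_uncontaminated(i, training_cutoff)]


# Made-up sample data and cutoff, purely illustrative.
cutoff = date(2024, 6, 1)
issues = [
    Issue("a", date(2023, 1, 10), date(2023, 2, 1)),  # old and fixed
    Issue("b", date(2024, 9, 5), date(2024, 9, 20)),  # fresh but already fixed
    Issue("c", date(2024, 10, 1), None),              # fresh and still open
    Issue("d", date(2024, 3, 1), None),               # open, but predates cutoff
]
fair = fair_test_set(issues, cutoff)
print([i.id for i in fair])
```

Note that "fresh issues created after the benchmark snapshot" would let issue "b" through, which is exactly the problem.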

This is how these fuckers talk, all of the time. See also Sam Altman's not-quite-denials about training on Scarlett Johansson's voice: OpenAI just asserted that they had hired a voice actor, but didn't deny training on Scarlett Johansson's actual voice.

[–] [email protected] 11 points 1 day ago

Artificial intelligence and cheating/lying: two great tastes that go together