this post was submitted on 15 Apr 2024

58 points (74.6% liked)

GPT-4 performance comparable with physicians on official medical board residency examinations. Model performance near or above official passing rate in all medical specialties tested (ai.nejm.org)

submitted 7 months ago by [email protected] to c/[email protected]

[–] [email protected] 19 points 7 months ago* (last edited 7 months ago)

All these always do the same thing.

Researchers reduced [the task] to producing a plausible corpus of text, and then published the not-so-shocking results that the thing that is good at generating plausible text did a good job generating plausible text.

From the OP, buried deep in the methodology:

Because GPT models cannot interpret images, questions including imaging analysis, such as those related to ultrasound, electrocardiography, x-ray, magnetic resonance, computed tomography, and positron emission tomography/computed tomography imaging, were excluded.

Yet here's their conclusion:

The advancement from GPT-3.5 to GPT-4 marks a critical milestone in which LLMs achieved physician-level performance. These findings underscore the potential maturity of LLM technology, urging the medical community to explore its widespread applications.

It's literally always the same. They reduce a task such that ChatGPT can do it, then report in the headline that it can do it, with the caveats buried way later in the text.

[–] [email protected] 4 points 7 months ago (2 children)

This research has been done a lot of times but I don't see the point of it. Exams are something I would expect LLMs, especially the higher end ones, to do well on because of their nature. But it says next to nothing about how reliable the LLM is as an actual doctor.

[–] [email protected] 0 points 7 months ago (1 children)

But it says next to nothing about how reliable the LLM is as an actual doctor.

Yet do these tests say anything about how a human would be as an actual doctor?

[–] [email protected] 1 points 7 months ago (1 children)

It says as much as it does for an LLM but doctors have to have a lot of field experience after passing these tests before they get certified as doctors.

[–] [email protected] 0 points 7 months ago (1 children)

Then we should remove such tests and, if anything, increase such field experience

[–] [email protected] 1 points 7 months ago (1 children)

Why?

[–] [email protected] 0 points 7 months ago (2 children)

Because clearly passing such tests doesn't matter. If it did matter then it would be noteworthy and have implications for the labor value of doctors that gpt could pass the tests to a better extent than many of them

[–] [email protected] 1 points 7 months ago (1 children)

Tests are meant to gatekeep who gets to get the field training required to become a doctor. Sending every jabroni into residency willy-nilly is probably gonna collapse the healthcare system completely.

[–] [email protected] 1 points 7 months ago

That wouldn't collapse the health care system, it would devalue the salaries of doctors which would be good for everyone else as it would lower costs. Which is what has happened to practically every other profession

[–] [email protected] 1 points 7 months ago

If you can't read a licence plate at 20 metres you can't safely drive. Being able to read a licence plate at 20 metres does not make you a safe driver. The test still matters.

[–] [email protected] 2 points 7 months ago* (last edited 7 months ago)

Even those who do well in testing of rote knowledge can perform poorly in practical exercises. That’s why medical doctors have to train and qualify through several years of supervised residency before being allowed to practice even basic medicine.

GPT-4 can’t do even that.

[–] [email protected] 71 points 7 months ago* (last edited 7 months ago) (2 children)

It's just a multiple choice test with question prompts. This is the exact sort of thing an LLM should be very good at. This isn't ChatGPT trying to do the job of an actual doctor; it would be quite abysmal at that. And even this multiple choice test had to be stacked in favor of ChatGPT.

Because GPT models cannot interpret images, questions including imaging analysis, such as those related to ultrasound, electrocardiography, x-ray, magnetic resonance, computed tomography, and positron emission tomography/computed tomography imaging, were excluded.

Don't get me wrong though, I think there are some interesting ways AI can provide useful assistive tools in medicine, especially for tasks involving integrating large amounts of data. I think the authors use some misleading language though, saying things like AI "are performing at the standard we require from physicians," which would only be true if the job of a physician was filling out multiple choice tests.

[–] [email protected] 11 points 7 months ago

I, too, can pass the Boards if you remove all the questions I don't understand.

[–] [email protected] 8 points 7 months ago

I’d be fine with LLMs being a supplementary aid for medical professionals, but not with them doing the whole thing.

[–] [email protected] 2 points 7 months ago

The 17th percentile in peds is not surprising. The model mixing its training data with adults would absolutely kill someone.

[–] [email protected] 10 points 7 months ago (1 children)

Neat, but I don't think LLMs are the way to go for this sort of thing.

[–] [email protected] 4 points 7 months ago (1 children)

I don’t mind so long as all results are vetted by someone qualified. Zero tolerance for unfiltered AI in this kind of context.

[–] [email protected] 3 points 7 months ago (3 children)

If you need someone qualified to examine the case anyway, what's the point of the AI?

[+] [email protected] 1 points 7 months ago* (last edited 6 months ago) (1 children)

[deleted]

[–] [email protected] 1 points 7 months ago (1 children)

In the example you provided, you're doing it by hand afterwards anyway. How is a doctor going to vet the work of the AI without examining the case in as much detail as they would have without the AI?

[–] [email protected] 1 points 7 months ago* (last edited 7 months ago)

Input symptoms and patient info -> spits out odds they have x, y, or z -> doctor looks at that as a supplement to their own work or to look for more unlikely possibilities they haven't thought of because they're a bit unusual. Doctors aren't gods, they can't recall everything perfectly. It's as useful as any toxicology report or other information they get.

I am not doing my edits by hand. I am not using a blade tool and spooling film. I am not processing it. My computer does everything for me, I simply tell it what to do and it spits out the desired result (usually lol). Without my eyes and knowledge the inputs aren't good and the outputs aren't vetted. With a person, both are satisfied. This is how all computer usage basically works, and AI tools are no different. Input->output, quality depends on the computer/software and who is handling it.

TL;DR: Garbage in, garbage out.

[–] [email protected] 1 points 7 months ago (1 children)

Why do skilled professionals have less-skilled assistants?

[–] [email protected] 1 points 7 months ago* (last edited 7 months ago)

Usually to do work that needs done but does not need the direct attention of the more skilled person. The assistant can do that work by themselves most of the time. In the example above, the assistant is doing all of the most challenging work and then the doctor is checking all of its work

[–] [email protected] 6 points 7 months ago (1 children)

The AI can examine hundreds of thousands of data points in ways that a human cannot.

[–] [email protected] 1 points 7 months ago* (last edited 7 months ago) (2 children)

In the test here, it literally only handled text. Doctors can do that. And if you need a doctor to check its work in every case, it has saved zero hours of work for doctors.

[–] [email protected] 1 points 7 months ago* (last edited 6 months ago) (1 children)

asdfasfasf

[–] [email protected] 1 points 7 months ago

how high processing power computers with AI/LLM’s can assist in a lab and/or hospital environment

This is an enormously broader scope than the situation I actually responded to, which was LLMs making diagnoses and then getting their work checked by a doctor

[–] [email protected] 1 points 7 months ago

Residents need their work checked also. I don’t understand your point.
