submitted 2 months ago* (last edited 2 months ago) by XLE@piefed.social to c/technology@lemmy.world

Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn't ready to take on the role of the physician.”

“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice,” the study’s authors wrote. “One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.”

[-] alzjim@lemmy.world 9 points 2 months ago

Calling chatbots “terrible doctors” misses what actually makes a good GP — accessibility, consistency, pattern recognition, and prevention — not just physical exams. AI shines here — it’s available 24/7 🕒, never rushed or dismissive, asks structured follow-up questions, and reliably applies up-to-date guidelines without fatigue. It’s excellent at triage — spotting red flags early 🚩, monitoring symptoms over time, and knowing when to escalate to a human clinician — which is exactly where many real-world failures happen. AI shouldn’t replace hands-on care — and no serious advocate claims it should — but as a first-line GP focused on education, reassurance, and early detection, it can already reduce errors, widen access, and ease overloaded systems — which is a win for patients 💙 and doctors alike.

/s

[-] BeigeAgenda@lemmy.ca 9 points 2 months ago

Anyone who has knowledge about a specific subject says the same: LLMs are constantly incorrect and hallucinate.

Everyone else thinks it looks right.

[-] IratePirate@feddit.org 7 points 2 months ago* (last edited 2 months ago)

A talk on LLMs I was listening to recently put it this way:

If we hear the words of a five-year-old, we assume the knowledge of a five-year-old behind those words, and treat the content with due caution.

We're not adapted to something with the "mind" of a five-year-old speaking to us in the words of a fifty-year-old, and thus are more likely to assume competence just based on language.

[-] leftzero@lemmy.dbzer0.com 6 points 2 months ago

LLMs don't have the mind of a five year old, though.

They don't have a mind at all.

They simply string words together according to statistical likelihood, without having any notion of what the words mean, or what words or meaning are; they don't have any mechanism with which to have a notion.

They aren't any more intelligent than old Markov chains (or than your average rock), they're simply better at producing random text that looks like it could have been written by a human.

[-] IratePirate@feddit.org 2 points 2 months ago

I am aware of that, hence the scare quotes. But you're correct, that's where the analogy breaks. Personally, I prefer to liken them to parrots, mindlessly reciting patterns they've found in somebody else's speech.

[-] agentTeiko@piefed.social 2 points 2 months ago

Yep, it's why C-levels think it's the Holy Grail. Everything that comes out of their mouths is bullshit as well, so they don't see the difference.

[-] zewm@lemmy.world 0 points 2 months ago

It is insane to me how anyone can trust LLMs when their information is incorrect 90% of the time.

[-] irate944@piefed.social 9 points 2 months ago

I could've told you that for free, no need for a study

[-] rudyharrelson@lemmy.radio 18 points 2 months ago* (last edited 2 months ago)

People always say this on stories about "obvious" findings, but it's important to have verifiable studies to cite in arguments for policy, law, etc. It's kinda sad that it's needed, but formal investigations are a big step up from just saying, "I'm pretty sure this technology is bullshit."

I don't need a formal study to tell me that drinking 12 cans of soda a day is bad for my health. But a study that's been replicated by multiple independent groups makes it way easier to argue to a committee.

[-] irate944@piefed.social 2 points 2 months ago

Yeah you're right, I was just making a joke.

But it does create some silly situations like you said

[-] rudyharrelson@lemmy.radio 0 points 2 months ago

I figured you were just being funny, but I'm feeling talkative today, lol

[-] IratePirate@feddit.org -1 points 2 months ago

A critical, yet respectful and understanding exchange between two individuals on the interwebz? Boy, maybe not all is lost...

[-] Knot@lemmy.zip 2 points 2 months ago

I get that this thread started from a joke, but I think it's also important to note that no matter how obvious some things may seem to some people, the exact opposite will seem obvious to many others. Without evidence, like the study, both groups are really just stating their opinions

It's also why the formal investigations are required. And whenever policies and laws are made based on verifiable studies rather than people's hunches, it's not sad, it's a good thing!

[-] BillyClark@piefed.social 2 points 2 months ago

it’s important to have verifiable studies to cite in arguments for policy, law, etc.

It's also important to have for its own merit. Sometimes, people have strong intuitions about "obvious" things, and they're completely wrong. Without science studying things, it's "obvious" that the sun goes around the Earth, for example.

I don’t need a formal study to tell me that drinking 12 cans of soda a day is bad for my health.

Without those studies, you cannot know whether it's bad for your health. You can assume it's bad for your health. You can believe it's bad for your health. But you cannot know. These aren't bad assumptions or harmful beliefs, by the way. But the thing is, you simply cannot know without testing.

[-] Slashme@lemmy.world 1 points 2 months ago

Or how bad something is. "I don't need a scientific study to tell me that looking at my phone before bed will make me sleep badly", but the studies actually show that the effect is statistically robust but small.

In the same way, studies like this can make the distinction between different levels of advice and warning.

[-] Telorand@reddthat.com 0 points 2 months ago

The thing that frustrates me about these studies is that they all continue to come to the same conclusions. AI has already been studied in mental health settings, and it's always performed horribly (except for very specific uses with professional oversight and intervention).

I agree that the studies are necessary to inform policy, but at what point are lawmakers going to actually lay down the law and say, "AI clearly doesn't belong here until you can prove otherwise"? It feels like they're hemming and hawing in the vain hope that it will live up to the hype.

[-] spaghettiwestern@sh.itjust.works 5 points 2 months ago* (last edited 2 months ago)

Most doctors make terrible doctors.

[-] sbbq@lemmy.zip 1 points 2 months ago

My dad always said, you know what they call the guy who graduated last in his class at med school? Doctor.

[-] GnuLinuxDude@lemmy.ml 4 points 2 months ago

If you want to read an article that's optimistic about AI and healthcare but falls apart if you start asking too many questions, try this one:

https://text.npr.org/2026/01/30/nx-s1-5693219/

Because it's clear that people are starting to use it, and many times the successful outcome is that it just tells you to see a doctor. And doctors are beginning to use it, but they should have the professional expertise to understand and evaluate the output. And we already know that LLMs can spout bullshit.

For the purposes of using and relying on it, I don’t see how it is very different from gambling. You keep pulling the lever, oh excuse me I mean prompting, until you get the outcome you want.

[-] HeyThisIsntTheYMCA@lemmy.world 0 points 2 months ago

The one time my doctor used it and I didn't get mad at them (they did the Google and said "the AI says", and I started making angry Nottingham noises, even though all the AI did was confirm that exactly what we had just been discussing was correct)... uh, well, that's pretty much it. I'm not sure where my parens are supposed to open and close on that story.

[-] GnuLinuxDude@lemmy.ml 6 points 2 months ago

Be glad it was merely that and not something like this https://www.reuters.com/investigations/ai-enters-operating-room-reports-arise-botched-surgeries-misidentified-body-2026-02-09/

In 2021, a unit of healthcare giant Johnson & Johnson announced “a leap forward”: It had added artificial intelligence to a medical device used to treat chronic sinusitis, an inflammation of the sinuses...

At least 10 people were injured between late 2021 and November 2025, according to the reports. Most allegedly involved errors in which the TruDi Navigation System misinformed surgeons about the location of their instruments while they were using them inside patients’ heads during operations.

Cerebrospinal fluid reportedly leaked from one patient’s nose. In another reported case, a surgeon mistakenly punctured the base of a patient’s skull. In two other cases, patients each allegedly suffered strokes after a major artery was accidentally injured.

FDA device reports may be incomplete and aren’t intended to determine causes of medical mishaps, so it’s not clear what role AI may have played in these events. The two stroke victims each filed a lawsuit in Texas alleging that the TruDi system’s AI contributed to their injuries. “The product was arguably safer before integrating changes in the software to incorporate artificial intelligence than after the software modifications were implemented,” one of the suits alleges.

[-] pleksi@sopuli.xyz 4 points 2 months ago

As a physician, I've used AI to check if I've missed anything in my train of thought. It never really changed my decision, though. It has also been useful for gathering up relevant citations for my presentations. But that's about it. It's truly shite at interpreting scientific research data on its own, for example. Most of the time it will parrot the conclusions of the authors.

[-] Buddahriffic@lemmy.world 3 points 2 months ago

Funny because medical diagnosis is actually one of the areas where AI can be great, just not fucking LLMs. It's not even really AI, but a decision tree that asks about what symptoms are present and missing, eventually getting to the point where a doctor or nurse is required to do evaluations or tests to keep moving through the flowchart until you get to a leaf, where you either have a diagnosis (and ways to confirm/rule it out) or something new (at least to the system).

Problem is that this kind of a system would need to be built up by doctors, though they could probably get a lot of it there using journaling and some algorithm to convert the journals into the decision tree.

The end result would be a system that can start triage at the user's home to help determine the urgency of a medical visit (is this "get to the ER ASAP", "go to a walk-in or family doctor in the next week", "it's OK if you can't get an appointment for a month", or "stay home, monitor it, and seek medical help if x, y, z happens"?). It could then give that info to the HCW you see next, so they can recheck the things non-doctors often get wrong and pick up from there. Plus it helps doctors be more consistent, informs them when symptoms match things they aren't familiar with, and makes it harder to excuse incompetence or apathy leading to a "just get rid of them" response.
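
As a toy sketch of what that home-triage layer could look like (in Python, with every question and disposition invented for illustration, not taken from any clinical source):

```python
from dataclasses import dataclass

@dataclass
class Node:
    question: str | None = None     # None marks a leaf
    yes: "Node | None" = None       # branch if the answer is yes
    no: "Node | None" = None        # branch if the answer is no
    disposition: str | None = None  # set on leaves only

# A toy tree. The point is that every branch is authored by a human,
# so every path through it can be audited and corrected.
TREE = Node(
    question="Sudden, worst-ever headache?",
    yes=Node(disposition="Get to the ER ASAP."),
    no=Node(
        question="Symptoms lasting longer than two weeks?",
        yes=Node(disposition="See a walk-in or family doctor this week."),
        no=Node(disposition="Monitor at home; seek care if it worsens."),
    ),
)

def triage(node: Node, answer) -> str:
    """Walk the tree by asking yes/no questions until reaching a leaf."""
    while node.question is not None:
        node = node.yes if answer(node.question) else node.no
    return node.disposition

if __name__ == "__main__":
    ask = lambda q: input(q + " [y/n] ").strip().lower().startswith("y")
    print(triage(TREE, ask))
```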

Instead, people are trying to make AI doctors out of word-correlation engines, like the Hardy Boys following a clue of random word associations (except reality isn't written to make them right in the end for laughs, the way South Park is).

[-] sheogorath@lemmy.world 2 points 2 months ago

Yep, I've worked on systems like these, and we actually had doctors as part of our development team to make sure the diagnoses were accurate.

[-] selokichtli@lemmy.ml 1 points 2 months ago

Have you seen LLMs trying to play chess? They can move some pieces alright, but at some point it's like they just decide to put their cat in the middle of the board. True chess engines, meanwhile, are playing at a level not even grandmasters can follow.

[-] XLE@piefed.social 1 points 2 months ago* (last edited 2 months ago)

I think ~~I~~ you just described a conventional computer program. It would be easy to make that. It would be easy to debug if something was wrong. And it would be easy to read both the source code and the data that went into it. I've seen rudimentary symptom checkers online since forever, and unlike the paper forms in doctors' offices, a digital one could actually expand to show only the relevant sections.

Edit: you caught my typo

[-] Buddahriffic@lemmy.world 1 points 2 months ago

(Assuming you meant "you" instead of "I" for the 3rd word)

Yeah, it fits more with the older definition of AI from before NNs took the spotlight, when it meant more of a normal program that acted intelligent.

The learning part is being able to add new branches or leaf nodes to the tree, where the program isn't learning on its own but is improving based on the experiences of the users.

It could also be encoded as a series of probability multiplications instead of a tree, where it checks on whatever issue has the highest probability, using the checks/questions that are cheapest to ask but affect the probability the most.
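
A toy sketch of that probability version, with all conditions, likelihoods, and costs invented: a naive-Bayes-style belief over conditions, asking next whichever question shifts the belief most per unit cost.

```python
import math

# All numbers below are invented for illustration.
PRIORS = {"flu": 0.6, "migraine": 0.3, "hemorrhage": 0.1}
LIKELIHOOD = {  # P(answer is "yes" | condition)
    "fever":    {"flu": 0.8, "migraine": 0.1, "hemorrhage": 0.1},
    "headache": {"flu": 0.4, "migraine": 0.9, "hemorrhage": 0.9},
}
COST = {"fever": 1.0, "headache": 2.0}  # e.g. time/invasiveness of each check

def posterior(beliefs, symptom, present):
    """One 'probability multiplication' step: scale each condition by its
    likelihood for the observed answer, then renormalize."""
    scaled = {
        c: p * (LIKELIHOOD[symptom][c] if present else 1 - LIKELIHOOD[symptom][c])
        for c, p in beliefs.items()
    }
    total = sum(scaled.values())
    return {c: p / total for c, p in scaled.items()}

def entropy(beliefs):
    return -sum(p * math.log2(p) for p in beliefs.values() if p > 0)

def next_question(beliefs, asked):
    """Pick the unasked check with the largest expected drop in uncertainty
    per unit cost (cheap questions that move the probabilities the most)."""
    def gain(s):
        p_yes = sum(beliefs[c] * LIKELIHOOD[s][c] for c in beliefs)
        expected = (p_yes * entropy(posterior(beliefs, s, True))
                    + (1 - p_yes) * entropy(posterior(beliefs, s, False)))
        return (entropy(beliefs) - expected) / COST[s]
    candidates = [s for s in LIKELIHOOD if s not in asked]
    return max(candidates, key=gain) if candidates else None

if __name__ == "__main__":
    beliefs = dict(PRIORS)
    q = next_question(beliefs, asked=set())
    print("Ask about:", q)
    print("Beliefs if yes:", posterior(beliefs, q, present=True))
```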

That, in turn, could be encoded as a NN, because both forms boil down to a series of matrix multiplications that a NN can approximate to arbitrary precision, given enough parameters. Also, NNs are proven to be able to approximate any continuous function over any number of real-valued inputs if given enough neurons and connections, which means they can exactly represent any discrete function (which a decision tree is).

It's an open question still, but it's possible that the equivalence goes both ways, as in a NN can represent a decision tree and a decision tree can approximate any NN. So the actual divide between the two is blurrier than you might expect.

Which is also why I'll always be skeptical that NNs on their own can give rise to true artificial intelligence (though there's also a part of me that wonders if we can be represented by a complex enough decision tree or series of matrix multiplications).

[-] _g_be@lemmy.world 1 points 2 months ago

It could be a great idea if people could be trusted to correctly interpret things that are not in their scope of expertise. The parallel I'm thinking of is IT, where people will happily and repeatedly call a monitor "the computer". Imagine telling the AI your heart hurts when it's actually muscle spasms or indigestion.

The value in medical professionals is not just the raw knowledge but the practice of objective assessment and deduction of symptoms, which I don't foresee a public-facing system being able to replicate.

[-] Etterra@discuss.online 3 points 2 months ago

I didn't need a study to tell me not to listen to a hallucinating parrot-bot.

[-] Treczoks@lemmy.world 2 points 2 months ago

One needs a study for that?

[-] SuspciousCarrot78@lemmy.world 2 points 2 months ago* (last edited 2 months ago)

So, I can speak to this a little bit, as it touches two domains I'm involved in. TL;DR: LLMs bullshit and are unreliable, but there's a way to use them in this domain as a force multiplier of sorts.

In one, I've created a Python router that takes my (deidentified) clinical notes, extracts and compacts the input (per user-defined rules), and creates a summary, then:

  1. benchmarks the summary against my (user-defined) gold standard and provides a management plan (again, based on a user-defined database).

  2. this is then dropped into my on-device LLM for light editing and polishing to condense, which I then eyeball, correct, and escalate to a supervisor for review.

Additionally, the LLM-generated note can be approved/denied by the Python router in the first instance, based on certain policy criteria I've defined.

It can also suggest probable DDXs based on my databases (which are .CSV based).

Finally, if the LLM output fails the policy check, the router tells me why it failed and just says "go look at the prior summary and edit it yourself".

This three-step process takes the tedium of paperwork from 15-20 minutes down to 1 minute of generation plus 2 minutes of manual editing, which is approximately a 5-7x speedup.

The reason why this is interesting:

All of this runs within the LLM (it calls/invokes the Python tooling via a >> command) and is 100% deterministic; no LLM jazz until the final step, which the router can outright reject and which is user-auditable anyway.

I've found that using a fairly "dumb" LLM (Qwen2.5-1.5B) with the settings dialed down produces consistently solid final notes (5 out of 6 are graded as a pass on the first run by the router invoking the policy document and checking the output). It's too dumb to jazz, which is useful in this instance.
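
Conceptually, the policy gate at the end is just a deterministic function over the note. A stripped-down sketch (the rule names and thresholds here are placeholders, not the actual criteria):

```python
import re

POLICY = [
    # (rule name, predicate over the LLM-polished note)
    ("has_plan_section", lambda note: "Plan:" in note),
    ("no_ai_filler",     lambda note: not re.search(r"\bAs an AI\b", note, re.I)),
    ("within_length",    lambda note: len(note.split()) <= 350),
]

def policy_check(note: str) -> tuple[bool, list[str]]:
    """Return (approved, failed rule names). Pure function: same note in,
    same verdict out; no LLM involved in the decision itself."""
    failures = [name for name, ok in POLICY if not ok(note)]
    return (not failures, failures)

if __name__ == "__main__":
    sample = "Summary: presenting complaint reviewed. Plan: follow up in two weeks."
    approved, failures = policy_check(sample)
    if approved:
        print("Approved; escalate to supervisor for review.")
    else:
        print("Failed checks:", failures)
        print("Go look at the prior summary and edit it yourself.")
```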

Would I trust the LLM end to end? Well, I'd trust my system approx 80% of the time. I wouldn't trust ChatGPT, even though it's been more right than wrong in similar tests.

[-] Tollana1234567@lemmy.today 1 points 2 months ago* (last edited 2 months ago)

It's basically a convoluted version of WebMD. Even the MD mods in medical subs are more accurate.

[-] dandelion@lemmy.blahaj.zone 1 points 2 months ago* (last edited 2 months ago)

link to the actual study: https://www.nature.com/articles/s41591-025-04074-y

Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice.

The findings were more that users were unable to effectively use the LLMs (even when the LLMs were competent when provided the full information):

despite selecting three LLMs that were successful at identifying dispositions and conditions alone, we found that participants struggled to use them effectively.

Participants using LLMs consistently performed worse than when the LLMs were directly provided with the scenario and task

Overall, users often failed to provide the models with sufficient information to reach a correct recommendation. In 16 of 30 sampled interactions, initial messages contained only partial information (see Extended Data Table 1 for a transcript example). In 7 of these 16 interactions, users mentioned additional symptoms later, either in response to a question from the model or independently.

Participants employed a broad range of strategies when interacting with LLMs. Several users primarily asked closed-ended questions (for example, ‘Could this be related to stress?’), which constrained the possible responses from LLMs. When asked to justify their choices, two users appeared to have made decisions by anthropomorphizing LLMs and considering them human-like (for example, ‘the AI seemed pretty confident’). On the other hand, one user appeared to have deliberately withheld information that they later used to test the correctness of the conditions suggested by the model.

Part of what a doctor is able to do is recognize a patient's blind-spots and critically analyze the situation. The LLM on the other hand responds based on the information it is given, and does not do well when users provide partial or insufficient information, or when users mislead by providing incorrect information (like if a patient speculates about potential causes, a doctor would know to dismiss incorrect guesses, whereas a LLM would constrain responses based on those bad suggestions).

[-] theunknownmuncher@lemmy.world 1 points 2 months ago

A statistical model of language isn't the same as medical training??????????????????????????

[-] scarabic@lemmy.world 0 points 2 months ago* (last edited 2 months ago)

It’s actually interesting. They found the LLMs gave the correct diagnosis high-90-something percent of the time if they had access to the notes doctors wrote about their symptoms. But when thrust into the room, cold, with patients, the LLMs couldn’t gather that symptom info themselves.

[-] Hacksaw@lemmy.ca 3 points 2 months ago

LLM gives correct answer when doctor writes it down first.... Wowoweewow very nice!

[-] scarabic@lemmy.world -1 points 2 months ago

If you think there’s no work between symptoms and diagnosis, you’re dumber than you think LLMs are.

[-] pageflight@piefed.social 1 points 2 months ago

Chatbots are terrible at anything but casual chatter, humanity finds.

[-] Sterile_Technique@lemmy.world 1 points 2 months ago* (last edited 2 months ago)

Chipmunks, 5-year-olds, salt/pepper shakers, and paint thinner also all make terrible doctors.

Follow me for more studies on 'shit you already know because it's self-evident immediately upon observation'.

[-] kescusay@lemmy.world 0 points 2 months ago

I would like to subscribe to your newsletter.

[-] Etterra@discuss.online 1 points 2 months ago

Don't worry guys, I found us a new doctor!

[-] sbv@sh.itjust.works 0 points 2 months ago

It looks like the LLMs weren't trained for medical tasks. The study would be more interesting if it had been run on something built for the task.

[-] softwarist@programming.dev 0 points 2 months ago

As neither a chatbot nor a doctor, I have to assume that subarachnoid hemorrhage has something to do with bleeding a lot of spiders.

[-] dandelion@lemmy.blahaj.zone 0 points 2 months ago

https://en.wikipedia.org/wiki/Subarachnoid_hemorrhage

https://en.wikipedia.org/wiki/Arachnoid_mater

it is one of the protective membranes around the brain and spinal cord, and it is named after its resemblance to spider webs, so - close enough

[-] end_stage_ligma@lemmy.world 1 points 2 months ago

can confirm, this is where spiders live inside your body

also pee is stored in the balls

[-] generic_computers@lemmy.zip 0 points 2 months ago

Um actually, water itself isn't wet. What water touches is wet.

[-] Buddahriffic@lemmy.world 1 points 2 months ago

Water loves touching itself.

[-] vivalapivo@lemmy.today -1 points 2 months ago

"but have they tried Opus 4.6/ChatGPT 5.3? No? Then disregard the research, we're on the exponential curve, nothing is relevant"

Sorry, I opened Reddit this week.
