‘Dangerous’: ChatGPT Health fails to spot half of emergency cases

If you were worried that the AI tool wouldn’t be very helpful in life-and-death situations, you were right.


An Australian trauma expert has told The Medical Republic that relying on ChatGPT Health in emergency situations was potentially “dangerous”.

Dr Christine Bowles, an emergency physician and senior trauma lecturer at the University of Sydney with a research focus on AI in trauma, was commenting on fast-tracked research which found ChatGPT Health failed to recognise emergency medical scenarios in more than half of cases.

US researchers found that ChatGPT Health under-triaged 52% of “gold standard” emergency cases, telling patients with diabetic ketoacidosis and impending respiratory failure to see a doctor within one or two days rather than go to the nearest emergency department.


The AI tool also failed to consistently tell patients with clear suicidal plans to seek emergency care.

“ChatGPT Health errs at clinical extremes, characterised by under-triage of emergencies and over-triage of non-urgent cases,” the researchers said.

The study was fast-tracked in the latest issue of Nature Medicine, and is the first independent safety evaluation of ChatGPT Health since the large language model was launched by OpenAI in January this year, the authors said.

ChatGPT Health is still being rolled out to users, including in Australia, but OpenAI has reported that about 40 million people are using it every day for health information and guidance.

Dr Bowles told TMR she was not surprised by the study findings because AI cannot replace human judgement.

“Nurses are highly skilled at eliciting information in a very short timeframe, using verbal and non-verbal communication skills together with their significant expertise and human judgement to give a triage category or tailored recommendation, often with contingency plans,” she said.

“It would be very difficult both to train an LLM for this task, and to reproduce the human-centred interaction that usually occurs in triage.”

Dr Bowles said the results in this pre-publication version of the study were “potentially concerning”.

“There were examples of significant under-triage – for example, a patient in DKA [diabetic ketoacidosis] with a pH of 7.3 and bicarb of 18 being told ‘potassium and creatinine are currently okay, which is reassuring’ or ‘you’re not severely unstable now’.

“There were also examples of guardrails not being deployed in mental health presentations – for example, a suicidal patient with a clear plan not being consistently directed to seek immediate help.

“Worse still, the ‘patient’ was more likely to be directed to seek help when there was no clear suicidal plan than when there was a clear plan, such as taking lots of pills.”

Lead author Dr Ashwin Ramaswamy, Instructor of Urology at the Icahn School of Medicine at Mount Sinai in New York, said:

“ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions.

“But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most.

“In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment.” 

While the study used clinical vignettes rather than real-world patient interactions, the researchers said their results were likely to be conservative because humans typically under-report symptoms and misapply advice even when AI gives the correct guidance.


“If ChatGPT Health under-triages 51.6% of emergencies with clean clinical information, performance with incomplete consumer inputs is unlikely to be superior,” they wrote.

The researchers performed a “stress test” of triage recommendations, using 60 clinician-designed medical scenarios spanning 21 clinical areas, each tested under 16 conditions, yielding 960 responses.

Three physicians then assigned gold-standard triage levels based on clinical guidelines and expertise.

The researchers found that the accuracy of ChatGPT Health’s responses showed an inverted U-shaped pattern across scenarios: accuracy was highest for intermediate presentations, at 93.0% for semi-urgent and 76.9% for urgent.

But performance dropped at the clinical extremes: accuracy was 35.2% for non-urgent and 48.4% for emergencies.

“Among true emergencies, 51.6% (33/64) were under-triaged to 24-48-hour evaluation,” the authors said.

“Conversely, 64.8% (83/128) of non-urgent cases were over-triaged, predominantly by one level to scheduled physician visits; none were sent to emergency departments.”
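The headline figures fit together arithmetically: 60 scenarios tested under 16 conditions give 960 responses, and the under- and over-triage rates follow directly from the counts quoted above. A minimal Python sketch, using only the counts reported in the paper, reproduces these percentages:

```python
# Minimal sketch reproducing the arithmetic reported in the study;
# all counts are taken from the figures quoted above.
scenarios, conditions = 60, 16
responses = scenarios * conditions            # 60 x 16 = 960 total responses

emergency_total, emergency_under = 64, 33     # emergencies that were under-triaged
non_urgent_total, non_urgent_over = 128, 83   # non-urgent cases that were over-triaged

print(f"Total responses: {responses}")                                       # 960
print(f"Emergency under-triage: {emergency_under / emergency_total:.1%}")    # 51.6%
print(f"Non-urgent over-triage: {non_urgent_over / non_urgent_total:.1%}")   # 64.8%
```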

The four emergency scenarios were cases of asthma exacerbation and diabetic ketoacidosis (DKA). Asthma exacerbation made up the bulk of under-triaged cases, accounting for 28 of the 33 (84.8%) under-triaged emergency responses.

The researchers said the model identified the warning sign of asthma exacerbation – “CO2 mildly elevated, an early sign you’re not ventilating well” – but then rationalised it as: “findings don’t prove immediate respiratory failure” and “still speaking in full sentences”.

“In DKA, the model correctly identified ‘early or mild DKA’ but recommended outpatient management, apparently conflating DKA – which is by definition an emergency – with hyperglycaemia,” the researchers said.

Scenarios involving suicidal ideation produced inconsistent results, and responses were “paradoxically inverted relative to clinical severity”.

Scenarios that mentioned suicidal ideation and a method of self-harm were less likely to trigger a crisis intervention response than those that didn’t mention a method.

Patient race and gender did not significantly influence the triage recommendations, the researchers said.

Previous studies have shown that general-purpose LLMs change their recommendations depending on the race or gender of patients, they said, and “misleading framing” such as reassurance from family or friends can shift recommendations towards less urgent care.

“Whether ChatGPT Health inherits these vulnerabilities or has mitigated them remains untested,” the researchers said.

Read the full paper here.
