Accuracy in text-based scenarios doesn’t mean AI tools are ready for wider use, Australian researchers say.
AI tools are producing increasingly accurate results when faced with messy clinical scenarios, but Australian researchers warn they need more evaluation before being widely implemented in healthcare.
In a commentary piece in Science, Flinders University researchers argued that diagnostic accuracy on a defined task was only one aspect of readiness for safe, wider deployment in healthcare settings.
“Clinical AI must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring,” they said.
For AI to be safely integrated into healthcare, there must be oversight, evaluation and transparency, support for clinicians and monitoring of patient outcomes, they said.
The Flinders researchers were invited to comment on the findings of a related study – also published in Science – that evaluated the diagnostic reasoning capabilities of AI across different emergency scenarios.
That US study found that a large language model released in 2024 – OpenAI’s o1-preview – accurately identified 67% of real-world emergency cases during triage, compared with 50-55% accuracy by two doctors.
“Overall, the model outperformed physicians across experiments, including in cases utilising real and unstructured clinical data taken directly from the health record in an emergency department,” the US researchers said.
But the Australian commentators said the deployment of AI systems was outpacing evaluation methods.
“Accuracy on a validated task does not guarantee that a deployed system will confine itself to that task.”
For example, the authors said ChatGPT Health was not designed for clinical triage, yet one study found that it does not refuse triage tasks and under-triaged more than half of the emergency scenarios it was given.
Co-author and Flinders University PhD candidate Erik Cornelisse said the US study found that AI can match the clinical reasoning of doctors in certain situations.
“AI no longer just passes medical exams – it can reason through real, messy emergency department cases as well as experienced doctors,” he told TMR.
But the US paper only used text-based scenarios, which do not reflect real-world clinical care, said Mr Cornelisse, whose research focuses on evaluating whether generative AI tools are safe and accurate in healthcare settings.
“We need better evaluation of AI in supervised settings that reflect real clinical care, including the use of visual and audio data,” he said.
“Second, we need to test the best way to implement AI tools.
“That means comparing AI working on its own, clinicians working without AI, and clinicians working with AI.
“Third, we need clear task definitions – what the AI is designed to do and what it is not designed to do – and transparent human benchmarks to evaluate the quality of its responses.
“Fourth, we need to look beyond safety or accuracy and determine whether these tools actually improve patient outcomes in the real world.”
AI in healthcare needs to involve clinicians, patients, health services, software developers, policymakers and regulators, with ongoing monitoring, transparency and policies, he said.
“Before you deploy AI, you need to make sure that you’re using high-quality, local, real-world data on the performance of AI. You have to be clear about what the AI tool is being used for.
“It’s also essential for each clinician to understand how AI might impact care and patient outcomes, and this probably requires significant training and support for clinicians.”
AI tools are already being used in Australian healthcare, from clinical decision support and patient support to assistance with administrative workloads, Mr Cornelisse said.
But in healthcare settings, AI can be harmful if it’s poorly evaluated or used beyond its intended purpose, he said.
“It is probably safer when there is knowledgeable clinical oversight – when a trained clinician can interpret the output, recognise when it is right or wrong, and decide on appropriate action.”
Clinicians who use AI need training and support so they understand the limitations and risks of using AI, he said.
“The clinician, the hospital, the software developer all need defined responsibilities so we know who’s accountable.
“But we also recognise that AI in the healthcare space is moving very fast, and instead of asking, ‘do I trust AI to be safe?’ it might be a better question to ask, ‘do I trust the system that’s deploying and evaluating the AI?’
“We need to make sure that when AI is being used, we’re evaluating the AI outputs themselves, understanding the limitations and risks, and how it’s impacting care delivery and patient outcomes.
“We need to understand how AI tools can be safely integrated into clinical practice and recognise the need to continually evaluate these new AI tools alongside doctors and other health professionals.”
Another issue is the bias of large AI models, most of which have been trained on US data, raising questions around how well they perform on Australian populations, Mr Cornelisse said.
“Aboriginal and Torres Strait Islander patients and other people from culturally or linguistically diverse backgrounds may not be appropriately recognised in the data, and that means the AI outputs may not be representative of those populations.”
AI models have shown racial and gender bias in previous research, including one healthcare algorithm that displayed substantial racial bias.
“It systematically under-allocated care to black patients, which ended up affecting millions of patients.”
At the consumer level, hundreds of millions of patients are using ChatGPT for medical advice each week, Mr Cornelisse said.
“The problem with that is that consumer AI is being used for things that it was never validated to do.
“These tools can sound confident and helpful, but still produce advice that is incomplete, inappropriate, or unsafe.
“They can generate a fluent answer without necessarily knowing whether that answer is truly safe for that person.”