Checking its work is essential and not always as easy as it sounds.
AI use in primary care is moving faster than evaluation and regulation, but if physicians take care with how they use it, it has enormous potential for clinicians and patients, say the authors of an Australian study published in The Lancet today.
“There is a lot of potential, and that’s not only on the clinician side of things, but also on the patient side,” said lead author Associate Professor Liliana Laranjo, a digital health researcher at the University of Sydney’s Westmead Applied Research Centre.
“We need to take advantage of that potential, but put the safeguards and guardrails in place to avoid errors that can have catastrophic consequences,” she told The Medical Republic.
AI scribes have made their way into a growing number of GP practices around the country, with GPs exhorted to check their work.
Trials assessing consultation length and the time spent on associated administration afterwards have shown AI scribes do not necessarily save GPs time overall, but GPs still like using them. And patients seem to like the scribes too, because they allow the GP to focus on them rather than on a screen.
Professor Laranjo said clinician and patient satisfaction was important.
“In general practice, we don’t have enough clinicians, and they’re a very tired and exhausted workforce. We want to support them. We also want patients to be satisfied, and so far it seems that patients do appreciate having the face to face and eye contact with clinicians,” she said.
“But we also want to make sure that there’s no problem in terms of safety. And that’s one of the things that is not studied enough; the potential errors of these tools.
“We know that they perform well, their accuracy is good, but what happens when they miss something? There can be catastrophic consequences.”
Putting the onus on the clinician to check everything not only takes additional time; it also invites automation bias, the paper says.
“If we have an AI tool that seems to perform really well most of the time, clinicians start trusting it. They stop verifying as closely everything that the tool does. And that can be problematic, because that’s where errors occur and things are missed,” said Professor Laranjo.
“If you think about the analogy with self-driving cars, we know that the technology is good and performs well most of the time.
“But there’s a reason why we’re not seeing them around, and that’s because there were a few fatal accidents, and that was related to automation bias. The drivers trusted the technology so much that they were not paying attention, and that led to catastrophic problems.
“We want to avoid that in health.”
Large language models also present their output in a way that convincingly appears to show sound reasoning. And that’s a problem, the paper says.
“[T]he explanations can provide a strong illusion of a clear box and have a higher potential to elicit trust and lead to biases in decision making, such as confirmation bias and automation bias,” the authors wrote.
AI can be the equivalent of particularly transparent “yes men”, with all the associated consequences.
“LLMs can show sycophantic behaviour and mirror or endorse the user’s thinking, even when the user’s reasoning is flawed or factually incorrect,” wrote the authors.
GPs are using AI (especially ChatGPT) to get information out of clinical guidelines and scientific articles because it’s better than a keyword search in a search engine, the paper says.
There are more clinically specific LLMs which use retrieval augmented generation – combining a generative LLM with specific, real-time information sources such as medical literature and guidelines – and medical domain fine-tuning, where an LLM undergoes further training on domain-specific data sets so it handles them in an optimised way. It’s this kind of AI that has been shown to do well in medical exams.
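For readers wondering what retrieval augmented generation actually involves, the toy sketch below illustrates the idea in a few lines of Python: retrieve the most relevant guideline excerpts, then ask the model to answer only from them. The guideline snippets, the crude keyword-overlap retriever and the final call to a model are all illustrative assumptions, not a description of how any particular product works.

```python
# Minimal illustration of retrieval augmented generation (RAG).
# The snippets, the scoring and the model call are hypothetical placeholders,
# not any vendor's actual implementation.

GUIDELINE_SNIPPETS = [
    "Hypertension: confirm elevated clinic BP with ambulatory or home monitoring.",
    "Type 2 diabetes: first-line pharmacotherapy is usually metformin unless contraindicated.",
    "Asthma: review inhaler technique and adherence before stepping up therapy.",
]

def retrieve(question: str, snippets: list[str], top_k: int = 2) -> list[str]:
    """Score snippets by crude keyword overlap and return the best matches.
    Real systems would use embeddings and vector search over licensed sources."""
    q_words = set(question.lower().split())
    scored = sorted(snippets, key=lambda s: -len(q_words & set(s.lower().split())))
    return scored[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    """Ground the model's answer in the retrieved guideline text."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the guideline excerpts below. "
        "If they do not cover the question, say so.\n"
        f"Excerpts:\n{joined}\n\nQuestion: {question}"
    )

question = "What is first-line drug treatment for type 2 diabetes?"
prompt = build_prompt(question, retrieve(question, GUIDELINE_SNIPPETS))
print(prompt)
# A real tool would now send `prompt` to an LLM and return its grounded answer.
```

The design point is that the model is constrained to the retrieved, up-to-date sources rather than whatever it absorbed during training, which is why these systems can cite current guidelines.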
These specialist LLMs have access to the articles and information available behind paywalls. But they aren’t commonly available in Australia yet, said Professor Laranjo. And they are not free.
And even they have their limitations.
“Despite its futuristic allure, AI is trained on data from the past, embedding biases that threaten to perpetuate inequities in health care,” the authors wrote.
“The datasets used to train and test an AI algorithm can be a source of bias when they are unrepresentative of the intended population in which the AI system will be used.”
An example of this would be decision-making support algorithms similar to present-day risk calculators, explained Professor Laranjo.
“If the algorithm is trained on existing data, it’s probably going to miss a lot of under-represented populations, and so will not perform as well when applied to those populations.
“And so that’s a big problem that we have, as most algorithms in medicine are trained on data sets that are mostly European, white and male. And we don’t want to see those same biases replicated in these AI tools.”
Meanwhile, like their patients, GPs are using publicly available LLMs, which can’t access literature behind paywalls – only the abstracts. And the pitfalls are there for doctors, just like for everybody else.
Importantly, “LLMs are not sensitive to the strength of the evidence and can generate responses based on weak evidence or unreputable sources; LLMs rely on publicly available data”.
That public data is not necessarily updated in real time (so the LLM’s information could be out of date), and it’s a problem if the model places as much weight on a Reddit thread as on a Lancet article.
At least they now provide a list of sources, so you can check where the information is coming from.
The paper suggests strategies for primary care practitioners to get the most out of LLMs.
“In general, good prompts are direct, include examples, and provide context. Examples of prompting strategies include few-shot prompting (ie, providing examples describing the task), chain-of-thought prompting (ie, including a breakdown of the reasoning for each example), or persona (ie, assigning a role to the LLM, eg, ‘imagine you are a doctor…’),” the authors wrote.
“So you would say something like, ‘imagine you are a general practitioner in Western Sydney, Australia, and you are seeing a patient that has these conditions’,” Professor Laranjo added. “You want to provide as much detail as possible.”
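To make those strategies concrete, the snippet below is a purely illustrative sketch of how a persona, a worked few-shot example and a chain-of-thought request might be assembled into a single prompt. The clinical content is invented for the example, is not drawn from the Lancet paper, and any answer would still need checking against real guidelines.

```python
# Illustrative only: composing a prompt that combines the persona,
# few-shot and chain-of-thought strategies described in the paper.
# The clinical details are invented for the example.

persona = "Imagine you are a general practitioner in Western Sydney, Australia."

# One worked example (few-shot) with its reasoning written out (chain of thought).
few_shot_example = (
    "Example question: A 55-year-old with type 2 diabetes and an eGFR of 28 "
    "asks about metformin.\n"
    "Example reasoning: Metformin is usually avoided once eGFR falls below 30 "
    "because of the risk of lactic acidosis, so an alternative agent should be "
    "considered.\n"
    "Example answer: Discuss stopping metformin and switching to an agent "
    "appropriate for that level of kidney function."
)

query = (
    "Question: A 62-year-old with hypertension and gout needs a change in "
    "antihypertensive. Which classes are preferred and which should be avoided? "
    "Explain your reasoning step by step."
)

prompt = "\n\n".join([persona, few_shot_example, query])
print(prompt)  # In practice this text would be pasted into, or sent to, the LLM.
```

The point is not the code but the structure: role first, a worked example with its reasoning made visible, then the actual question with as much context as possible.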
And because you can now have a conversational interaction with an LLM, not everything has to be in the first prompt, she said.
“Now most large language models, at the end of their response, suggest something else. You’re able to continue with your problem, so you end up providing that chain of thought: this is what I’m looking for, I want to help this patient, I want to understand this medical condition and potential diagnosis, differential diagnosis and treatments.
“And so you’re telling it what you want, but then you can refine that based on the response as well.”
Crucially, anything you tell a publicly available AI tool can be used to continue training the LLM, and that includes patient-identifying data.
“That’s one of the key takeaways, I think: if GPs are using a large language model to try and better understand a clinical case, no identifiable information should be provided, and the questions or queries should be phrased in generic terms,” said Professor Laranjo.
But what about the AI scribes being used in the office, alongside other data systems? Can AI pull identifiable information to feed into training? Well, it’s basically buyer beware.
“That’s dependent on the specific software company involved in deploying that AI scribe, and that’s a problem with the lack of regulation,” said Professor Laranjo.
“We know that with the most commonly used ones the information is not used to train the models and is kept securely. But with the lack of regulation, it’s very hard to safeguard that for all AI scribe software.”
Despite the dangers, AI has “huge potential”, not just for clinicians, but also for patients, said Professor Laranjo.
“So far, LLMs have been mostly applied on the clinician side of things, but I think there’s a lot of potential to empower patients to participate in their care,” she said.
AI can be used to gather patient-reported experience and outcome measures (PREMs and PROMs), use them to improve the patient experience, and learn from them, she said.
“It’s very important to have that patient side of things, and the patient voice and the patient outcomes informing these models as much as possible, which hasn’t happened so far,” said Professor Laranjo.
“One of the potential innovations around large language models and tools like ChatGPT is if we can leverage a patient-specific AI tool, a coach, something in people’s phones that is able to collect that information regularly and then use it to inform the care they’re receiving, not only in their GP clinics but at the hospital. That would obviously be a benefit to the quality of care they’re receiving.
“I think large language models are offering that possibility now, which hasn’t really been available to patients in the past.”