Artificial intelligence chatbots pose significant risks to patients seeking medical advice due to their tendency to provide inaccurate and inconsistent information, according to research published on Monday in the journal Nature Medicine.
The study, led by scientists from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, found that despite advances in AI technology, chatbots frequently deliver a mix of good and bad information that users struggle to tell apart. Researchers tested nearly 1,300 participants across 10 medical scenarios to evaluate how successfully people use chatbots to identify health problems and to decide whether those problems warrant seeing a doctor or going to hospital.
Dr Rebecca Payne, a co-author of the study and a general practitioner, said the research showed AI is not ready to take on the role of the physician. She warned that asking a large language model (LLM) about symptoms can be dangerous, as models may give wrong diagnoses and fail to recognise when urgent help is needed.
Some participants used large language model software, including GPT-4o, Llama 3, and Cohere’s Command R+, to obtain potential diagnoses and next steps, while others relied on traditional approaches such as consulting a general practitioner or using trusted medical websites. The scenarios included ailments ranging from a young man developing a severe headache after a night out with friends to a new mother feeling constantly out of breath and exhausted.
After evaluating the responses, researchers found that while chatbots now excel at standardised tests of medical knowledge, their use as a clinical tool would pose risks to real users seeking help with their own medical symptoms. When a chatbot listed three possible conditions, people were left to guess which one might fit, according to lead author Andrew Bean of the Oxford Internet Institute.
The study revealed that when large language models were directly presented with the test scenarios, they correctly identified the relevant conditions 94.9% of the time. Human participants using LLMs to assess the same scenarios, however, identified the correct conditions in fewer than 34.5% of cases. Participants using LLMs performed even worse than a control group instructed simply to diagnose themselves using whatever methods they would typically employ at home: the control group was 76% more likely to identify the correct conditions than the group assisted by LLMs.
Bean explained that interacting with humans poses a challenge even for top-performing large language models. He expressed hope that the work will contribute to the development of safer and more useful AI systems.
Dr Payne explained that users tend to share information gradually and may omit key details that a medical professional would have elicited during a face-to-face examination, adding to the risk of repeating errors already embedded in modern medical practice.
David Shaw, a bioethicist at Maastricht University in the Netherlands who was not involved in the research, called the study very important, as it highlights the real medical risks chatbots pose to the public. He advised people to trust medical information only from reliable sources, such as the UK’s National Health Service.
Despite the criticism, experts see potential in specialised models. Dr Bertalan Mesko noted that newer versions of chatbots from OpenAI and Anthropic, developed specifically for the healthcare sector, may show better results. He added, however, that safety ultimately depends on clear national rules, regulatory guardrails, and official medical guidance for improving such systems.