OpenAI’s Whisper Tool Frequently Hallucinates in a Medical Setting

Reports indicate widespread issues with OpenAI's Whisper AI transcription tool when it is used in settings where accuracy is crucial.

Serious concerns have emerged about Whisper, OpenAI’s AI transcription tool, as researchers have found that it frequently invents text that was never spoken, raising particular alarm about its use in healthcare settings.

Despite OpenAI’s claims of “human level robustness and accuracy” for Whisper, multiple experts have found the tool regularly creates fabricated content – known as hallucinations in the artificial intelligence field – ranging from racial commentary to non-existent medical treatments.

A Widespread Problem

Research findings reveal troubling error rates across different use cases. A researcher at the University of Michigan, while studying public meeting recordings, identified fabricated content in 80% of the audio transcriptions examined. Similarly concerning results came from a machine learning engineer who found invented text in over half of more than 100 hours of transcriptions analyzed.

The scale of the problem became even more apparent when a developer reported finding fabricated content in almost all of the 26,000 transcripts produced using Whisper. Even in ideal conditions with clear, short audio recordings, problems persist. Computer scientists in a recent study identified 187 instances of fabricated content in their examination of over 13,000 audio snippets.

Medical and Industry Implementation Raises Concerns

The discovery of these accuracy issues is particularly worrying as medical centers rush to implement Whisper-based tools for transcribing doctor-patient consultations. This adoption continues despite OpenAI’s explicit warnings against using the tool in “high-risk domains.”

Whisper’s use extends well beyond healthcare: the tool is deployed globally for tasks such as interview transcription, text generation in consumer technologies, and video subtitle creation.
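
To illustrate how such transcriptions are typically produced, here is a minimal sketch using the open-source "openai-whisper" Python package; the model size, audio file name, and per-segment printout are illustrative assumptions rather than details from the reports above.

```python
# Minimal sketch of transcribing audio with the open-source "openai-whisper"
# package (pip install openai-whisper). The model size ("base"), the audio
# file name, and the segment printout are illustrative assumptions, not
# details taken from the reporting above.
import whisper

# Load a pretrained Whisper checkpoint; larger checkpoints trade speed for
# accuracy but can still hallucinate text that was never spoken.
model = whisper.load_model("base")

# Transcribe an audio file; fp16=False silences a warning on CPU-only machines.
result = model.transcribe("consultation_audio.mp3", fp16=False)

# The result contains the full transcript plus timestamped segments; checking
# segments against the original audio is how fabricated passages are spotted.
print(result["text"])
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f}s - {segment['end']:7.2f}s] {segment['text']}")
```

Because the output is fluent, well-punctuated text, a fabricated passage looks no different from an accurate one, which helps explain how these errors can slip into records unnoticed.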

In response to these findings, an OpenAI spokesperson acknowledged the concerns, stating the company is “continually working to improve the accuracy of our models, including reducing hallucinations.” The spokesperson also emphasized that their usage policies prohibit Whisper’s use “in certain high-stakes decision-making contexts.”

Researchers indicate that, based on current error rates, the tool could generate tens of thousands of faulty transcriptions across millions of recordings. This level of inaccuracy raises significant questions about the tool’s reliability in settings where accurate transcription is crucial for decision-making or record-keeping.

These findings add to growing concerns about deploying AI tools in sensitive environments before their limitations and potential risks are fully understood and addressed.

Maria is a freelance journalist whose passion is writing about technology. She loved reading sci-fi books as a kid (still does) and suspects that that's the bug that got her interested in all things tech-y and science-y. Maria studied engineering at university but after graduating discovered that she finds more joy in writing about inventions than actually making them. She is really excited (and a little scared) about everything that's going on in the AI landscape and the break-neck speed at which the field is developing. When she’s not writing, Maria enjoys capturing the beauty of nature through her camera lens and taking long walks with her scruffy golden retriever, Goldie.
