Serious concerns have emerged about OpenAI’s AI transcription tool Whisper, as researchers have found that it frequently invents text that was never spoken, raising particular alarm about its use in healthcare settings.
Despite OpenAI’s claims of “human level robustness and accuracy” for Whisper, multiple experts have found the tool regularly creates fabricated content – known as hallucinations in the artificial intelligence field – ranging from racial commentary to non-existent medical treatments.
A Widespread Problem
Research findings reveal troubling error rates across different use cases. A researcher at the University of Michigan, while studying public meeting recordings, identified fabricated content in 80% of the audio transcriptions examined. Similarly concerning results came from a machine learning engineer who found invented text in over half of more than 100 hours of transcriptions analyzed.
The scale of the problem became even more apparent when a developer reported finding fabricated content in almost all of the 26,000 transcripts produced using Whisper. Even in ideal conditions with clear, short audio recordings, problems persist. Computer scientists in a recent study identified 187 instances of fabricated content in their examination of over 13,000 audio snippets.
Medical and Industry Implementation Raises Concerns
The discovery of these accuracy issues is particularly worrying as medical centers rush to implement Whisper-based tools for transcribing doctor-patient consultations. This adoption continues despite OpenAI’s explicit warnings against using the tool in “high-risk domains.”
Whisper’s use also extends well beyond healthcare: the tool is deployed globally for tasks such as interview transcription, text generation in consumer technologies, and video subtitle creation.
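Part of the reason for that spread is how little code a transcription pipeline requires. As a minimal sketch, assuming the open-source `whisper` Python package (the model size and filename below are illustrative placeholders), a typical integration looks like this:

```python
# Minimal sketch: transcribing an audio file with the open-source
# `whisper` package (pip install openai-whisper). Model size and the
# audio filename are illustrative placeholders.
import whisper

model = whisper.load_model("base")          # smaller models trade accuracy for speed
result = model.transcribe("interview.mp3")  # returns a dict containing the decoded text
print(result["text"])                       # output may include fabricated passages
```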
In response to these findings, an OpenAI spokesperson acknowledged the concerns, stating the company is “continually working to improve the accuracy of our models, including reducing hallucinations.” The spokesperson also emphasized that their usage policies prohibit Whisper’s use “in certain high-stakes decision-making contexts.”
Researchers estimate that, at current error rates, the tool would produce tens of thousands of faulty transcriptions across millions of recordings. That level of inaccuracy raises serious questions about the tool’s reliability in settings where accurate transcription is crucial for decision-making or record-keeping.
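As a rough, illustrative back-of-envelope calculation (not the researchers’ own methodology), scaling the roughly 1.4% hallucination rate observed in the 13,000-snippet study to a few million recordings already lands in the tens of thousands:

```python
# Back-of-envelope estimate (illustrative only): scale the hallucination
# rate observed in one study to a larger, assumed recording volume.
observed_hallucinations = 187    # fabricated passages found in the study
snippets_examined = 13_000       # "over 13,000 audio snippets"
rate = observed_hallucinations / snippets_examined   # roughly 1.4%

recordings = 5_000_000           # assumed volume ("millions of recordings")
estimate = rate * recordings
print(f"estimated faulty transcriptions: {estimate:,.0f}")  # on the order of tens of thousands
```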
These findings reinforce growing concerns about deploying AI tools in sensitive environments before their limitations and potential risks are fully understood and addressed.