Speech recognition remains a challenging problem in AI and machine learning. In a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company claims enables “robust” transcription in multiple languages as well as translation from those languages into English.
Countless organizations have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants like Google, Amazon and Meta. But what makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and “multitask” data collected from the web, which leads to improved recognition of unique accents, background noise and technical jargon.
“The primary intended users of [the Whisper] models are AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition,” OpenAI wrote in the GitHub repo for Whisper, where several versions of the system can be downloaded. “[The models] show strong ASR results in ~10 languages. They may exhibit additional capabilities … if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization but have not been robustly evaluated in these areas.”
Whisper has its limitations, particularly in the area of text prediction. Because the system was trained on a large amount of “noisy” data, OpenAI cautions that Whisper might include words in its transcriptions that weren’t actually spoken, possibly because it’s both trying to predict the next word in audio and trying to transcribe the audio itself. Moreover, Whisper doesn’t perform equally well across languages, with a higher error rate for speakers of languages that aren’t well-represented in the training data.
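For readers wondering how an “error rate” like this is typically measured: a common yardstick for speech recognition is word error rate, the number of word-level insertions, deletions and substitutions needed to turn the system’s transcript into a reference transcript, divided by the length of the reference. The article does not say which metric OpenAI used for these comparisons, so the Python sketch below is only an illustration; the function name and the example sentences are made up for this purpose.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper only (hypothetical name); not taken from Whisper.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output, never spoken
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the transcript contains one word that was never spoken.
spoken = "the cat sat on the mat"
transcribed = "the cat sat on the the mat"
print(f"WER: {word_error_rate(spoken, transcribed):.3f}")
```

An extra word that was never spoken counts as an insertion, so the kind of invented text OpenAI warns about would push this number up even when every spoken word is transcribed correctly.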
Despite all this, OpenAI sees Whisper’s transcription capabilities being used to improve existing accessibility tools.
“While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation,” the company continued on GitHub. “The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications … [While] we hope the technology will be used primarily for beneficial purposes, making automatic speech recognition technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication.”