DeepSpeech, a suite of speech-to-text and text-to-speech engines maintained by Mozilla’s Machine Learning Group, this morning received an update (to version 0.6) that incorporates one of the fastest open source speech recognition models to date. In a blog post, senior research engineer Reuben Morais lays out what’s new and enhanced, as well as other spotlight features coming down the pipeline.
The latest version of DeepSpeech adds support for TensorFlow Lite, a version of Google’s TensorFlow machine learning framework that’s optimized for compute-constrained mobile and embedded devices. That support has reduced DeepSpeech’s package size from 98MB to 3.7MB and its built-in English model size from 188MB to 47MB. The English model has a 7.5% word error rate on a popular benchmark and was trained on 5,516 hours of transcribed audio from WAMU (NPR), LibriSpeech, Fisher, Switchboard, and Mozilla’s Common Voice English data sets. Plus, the update has cut DeepSpeech’s memory consumption by 22 times and boosted its startup speed by over 500 times.
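To put the size figures above in perspective, the short sketch below works out the reduction factor and percentage for each pair of numbers. The function name `reduction` is purely illustrative; only the megabyte values come from the article.

```python
def reduction(before_mb: float, after_mb: float) -> tuple:
    """Return (shrink factor, percent saved) for a size change."""
    factor = before_mb / after_mb
    percent_saved = (1 - after_mb / before_mb) * 100
    return factor, percent_saved


# Sizes in MB as quoted in the article.
package_factor, package_saved = reduction(98, 3.7)
model_factor, model_saved = reduction(188, 47)

print(f"package: {package_factor:.1f}x smaller ({package_saved:.1f}% saved)")
print(f"model:   {model_factor:.1f}x smaller ({model_saved:.1f}% saved)")
```

On these numbers the package shrinks by roughly 26 times and the English model by 4 times.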
This more efficient English language model — which runs “faster than real time” on a single core of a Raspberry Pi 4 and which is 50% smaller than before (including the inference code and the trained model) — is available on Windows, macOS, and Linux as well as Android.
Above: DeepSpeech’s memory usage during startup.
Image Credit: Mozilla
DeepSpeech 0.6 is much more performant overall, thanks in part to a new streaming decoder that enables “consistent” low latency and memory utilization regardless of the length of audio being transcribed. Additionally, the platform’s two main subsystems — an acoustic model that receives audio features as inputs and outputs character probabilities and a decoder that transforms character probabilities into textual transcripts — are both now capable of streaming. This means that there’s no longer any need for carefully tuned silence detection algorithms, said Morais.
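The idea of a decoder that streams can be illustrated with a toy example. The sketch below is not DeepSpeech’s decoder: it swaps in simple greedy best-path decoding and assumes a CTC-style output in which each audio frame yields one probability per character plus a "blank" symbol. The alphabet, the class name `StreamingGreedyDecoder`, and the helper `one_hot` are all hypothetical. What it shows is that the decoder keeps only a small amount of state between chunks, so memory use does not grow with the amount of audio already processed (apart from the transcript itself), and an intermediate transcript can be requested at any point.

```python
from typing import Iterable, List, Optional

ALPHABET = " abcdefghijklmnopqrstuvwxyz'"  # hypothetical character set
BLANK = len(ALPHABET)  # assumed index of the blank symbol (last position)


class StreamingGreedyDecoder:
    """Toy decoder that consumes per-frame character probabilities in chunks."""

    def __init__(self) -> None:
        self._prev: Optional[int] = None  # best index of the previous frame
        self._chars: List[str] = []       # transcript built so far

    def feed(self, frames: Iterable[List[float]]) -> None:
        """Process one chunk of frames; can be called repeatedly."""
        for probs in frames:
            best = max(range(len(probs)), key=probs.__getitem__)
            # Collapse rule: emit a character only if it is not blank and
            # differs from the previous frame's best symbol.
            if best != BLANK and best != self._prev:
                self._chars.append(ALPHABET[best])
            self._prev = best

    def intermediate(self) -> str:
        """Return the transcript decoded so far without ending the stream."""
        return "".join(self._chars)


def one_hot(symbol: Optional[str]) -> List[float]:
    """Build a fake frame that puts all probability on one symbol (None = blank)."""
    frame = [0.0] * (len(ALPHABET) + 1)
    frame[BLANK if symbol is None else ALPHABET.index(symbol)] = 1.0
    return frame


decoder = StreamingGreedyDecoder()
decoder.feed([one_hot("h"), one_hot("h")])                 # first chunk
print(decoder.intermediate())
decoder.feed([one_hot(None), one_hot("i"), one_hot("i")])  # second chunk
print(decoder.intermediate())
```

Because each chunk is decoded as it arrives, only the tail end of the audio remains to be processed once the speaker stops, which is the intuition behind the low end-of-audio latency described here.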
The new DeepSpeech provides transcriptions 260 milliseconds after the end of the audio, or 73% faster than before the streaming decoder was implemented. As for intermediate transcript requests at seconds 2 and 3 of audio files, they’re returned in a fraction of the time.
That’s not all that’s improved on the performance side of the equation. Now, thanks to an upgrade to TensorFlow 1.14 and the adoption of newly available APIs, DeepSpeech is up to two times faster when it comes to model training. Moreover, it’s capable of fully training and deploying models at different sample rates (e.g., 8kHz for telephony data), and the new decoder exposes timing and confidence metadata for each character in the transcript.
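As a rough illustration of what per-character timing and confidence metadata makes possible, the sketch below groups characters into words and reports when each word starts and how confident the decoder was about it. The `CharItem` and `Word` structures and their field names are assumptions made for this example, not DeepSpeech’s actual metadata types, and taking the lowest character confidence as the word’s confidence is just one conservative choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharItem:
    """Hypothetical per-character metadata record."""
    character: str
    start_time: float  # seconds from the start of the audio
    confidence: float  # assumed to lie between 0 and 1


@dataclass
class Word:
    text: str
    start_time: float
    confidence: float


def words_from_chars(items: List[CharItem]) -> List[Word]:
    """Group characters into words, splitting on spaces."""
    words: List[Word] = []
    current: List[CharItem] = []
    # A trailing sentinel space flushes the final word.
    for item in items + [CharItem(" ", 0.0, 0.0)]:
        if item.character == " ":
            if current:
                words.append(Word(
                    text="".join(c.character for c in current),
                    start_time=current[0].start_time,
                    # Conservative choice: a word is only as certain
                    # as its least certain character.
                    confidence=min(c.confidence for c in current),
                ))
                current = []
        else:
            current.append(item)
    return words


sample = [
    CharItem("h", 0.10, 0.9), CharItem("i", 0.20, 0.8),
    CharItem(" ", 0.30, 1.0),
    CharItem("y", 0.40, 0.5), CharItem("o", 0.50, 0.7),
]
for word in words_from_chars(sample):
    print(word.text, word.start_time, word.confidence)
```

A tool built on such output could, for instance, flag low-confidence words for review or align captions to audio.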
Above: The DeepSpeech client.
Image Credit: Mozilla
Morais notes that startup Te Hiku Media — which is using DeepSpeech to develop and deploy the first Te reo Māori automatic speech recognizer — has been exploring the use of the confidence metadata in the decoder to build a digital pronunciation helper for Te reo Māori, starting with New Zealand English and Te reo Māori.
Mozilla’s work in natural language processing extends to the aforementioned Common Voice data set, which was recently updated with 1,400 hours of speech across 18 languages. It’s one of the largest multi-language datasets of its kind, Mozilla claims, and substantially larger than the Common Voice corpus it made publicly available eight months ago, which contained 500 hours (400,000 recordings) from 20,000 volunteers in English. It’ll soon grow larger still: the organization says that data collection efforts in 70 languages are actively underway via the Common Voice website and mobile apps.