Speech Recognition on Device using Deep Learning


“The human voice is the most perfect instrument of all.”

- Arvo Pärt

You have probably heard this before, but may not have given it much thought until now.

Today your voice really is used like an instrument, whether you are dictating to Google Speech Recognition or giving a command to Alexa.

What is special about these systems? How do they work? You will find out by building one yourself in about 10 lines of code.

Let’s get started.

There are many cloud-based speech recognition APIs available today. The Google Cloud Speech API and the IBM Watson Speech-to-Text API are among the most widely used. But what if you don’t want your application to depend on a third-party service? Or what if you want to create a speech recognition application that works offline? In that case, consider using Mozilla DeepSpeech.

DeepSpeech is an open-source, TensorFlow-based speech-to-text engine with reasonably high accuracy. It uses deep learning techniques based on Baidu’s Deep Speech research paper.

Installing and using it is surprisingly easy. In this tutorial, I’ll help you get started. Here is what you will need:


— A computer running Ubuntu 16.04 or higher. You are free to use a Google Compute Engine VM or a DigitalOcean Droplet.

— Python 3.6

— Git Large File Storage


First, create a virtual environment with Python 3.6, then activate it.
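If you would rather script this step, the standard library's venv module can create the environment. This is only a minimal sketch: the helper name create_env and the directory name used in the note below are assumptions, and the environment is built with whichever interpreter runs the script, so run it with Python 3.6.

```python
import os
import venv


def create_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return its activate script path."""
    venv.create(env_dir, with_pip=with_pip)
    # On Linux/macOS the activation script lives in bin/; on Windows in Scripts/.
    subdir = "Scripts" if os.name == "nt" else "bin"
    return os.path.join(env_dir, subdir, "activate")
```

After calling create_env("deepspeech-env") (a hypothetical directory name), activate the environment from your shell with: source deepspeech-env/bin/activate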

Create a new directory to store a few DeepSpeech-related files.

mkdir speech
cd speech

The easiest way to install DeepSpeech is to use the pip tool. Make sure you have it on your computer by running the following command:

sudo apt install python3-pip

And now, you can install DeepSpeech. Pinning the release to 0.4.1 keeps the package in step with the 0.4.1 model used in this tutorial.

pip3 install deepspeech==0.4.1

DeepSpeech needs a model to be able to run speech recognition. You can train your own model, but, for now, let’s use a pre-trained one released by Mozilla. Here’s how you can download it:

wget https://github.com/mozilla/DeepSpeech/releases/download/v0.4.1/deepspeech-0.4.1-models.tar.gz

You’ll be downloading about 2.0GB of data, so be patient.

Once the download is complete, extract it using the tar command.

tar -xvzf deepspeech-0.4.1-models.tar.gz

You should now have the following files:
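To check what the archive unpacked on your machine, a short script along these lines can list the files. It is a minimal sketch: list_files is a hypothetical helper name, and it assumes the archive was extracted into a models directory under the current folder.

```python
import os


def list_files(root):
    """Return every file under root as a sorted list of paths relative to root."""
    found = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)
```

For example, running this from the speech directory prints one file per line:

    for name in list_files("models"):
        print(name)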


Import Libraries

from deepspeech import Model
import scipy.io.wavfile as wav
import os
import pyaudio

You must now initialize a Model instance using the locations of the model and alphabet files. The constructor also expects the number of Mel-frequency cepstral coefficient (MFCC) features to use, the size of the context window, and a beam width for the connectionist temporal classification (CTC) decoder. These values should match those used during training. If you are using the pre-trained model from Mozilla, here’s what you can use:

LM_WEIGHT = 1.50

deep = Model(path + "/models/output_graph.pb",

P.S. Replace path with the location of your own models directory.
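Putting the constructor arguments described above together, the initialization might look like the sketch below. Treat it as an illustration under assumptions, not the definitive setup: it assumes the 0.4.x Python constructor (later DeepSpeech releases changed it), the file name alphabet.txt, and the values 26, 9 and 500, none of which are spelled out in this post. The helper names model_files and load_model are hypothetical.

```python
import os

# Assumed values for Mozilla's 0.4.1 pre-trained model; check them against
# the release you downloaded before relying on them.
N_FEATURES = 26   # number of MFCC features (assumed)
N_CONTEXT = 9     # size of the context window (assumed)
BEAM_WIDTH = 500  # beam width for the CTC decoder (assumed)


def model_files(base_dir):
    """Return the graph and alphabet file paths under base_dir/models."""
    models = os.path.join(base_dir, "models")
    return (os.path.join(models, "output_graph.pb"),
            os.path.join(models, "alphabet.txt"))  # file name is an assumption


def load_model(base_dir):
    """Build a DeepSpeech Model, assuming the 0.4.x constructor signature."""
    from deepspeech import Model  # imported lazily; needs the deepspeech package
    graph, alphabet = model_files(base_dir)
    return Model(graph, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)
```

With the files in place, deep = load_model(path) gives you the model object used in the rest of the tutorial.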

Next, you can read the WAV file using the read() method available in wavfile.

You can use pre-recorded audio, or you can record your own using Python libraries such as PyAudio.
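If you want to record your own clip, a rough sketch with PyAudio could look like this. It assumes a working microphone, and it assumes that 16 kHz, mono, 16-bit audio is what the pre-trained model expects. The names record_wav and chunks_needed are made up for this example.

```python
import wave

RATE = 16000  # samples per second; assumed to match the pre-trained model
CHUNK = 1024  # frames read from the microphone per call


def chunks_needed(seconds, rate=RATE, chunk=CHUNK):
    """Number of chunk-sized reads needed to cover the requested duration."""
    return int(rate / chunk * seconds)


def record_wav(path, seconds):
    """Record mono 16-bit audio from the default microphone into a WAV file."""
    import pyaudio  # imported lazily; needs PyAudio and an input device
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(chunks_needed(seconds))]
    stream.stop_stream()
    stream.close()
    sample_width = pa.get_sample_size(pyaudio.paInt16)
    pa.terminate()
    # Write the captured frames out as a standard WAV file.
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))
```

Calling record_wav("my_audio.wav", 5) would capture roughly five seconds of speech; the file name is only a placeholder.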

I am using pre-recorded audio.
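Reading the file can be sketched as follows. The wav_format helper is an extra sanity check that I am assuming is useful here, since the pre-trained model is generally expected to work on 16 kHz, mono, 16-bit audio. Both helper names and the file name in the note below are hypothetical.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, channels, sample_width_bytes) for a WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnchannels(), wf.getsampwidth()


def load_audio(path):
    """Read a WAV file with scipy.io.wavfile.read(), as described above."""
    import scipy.io.wavfile as wav  # imported lazily; needs SciPy installed
    fs, audio = wav.read(path)  # fs is the sample rate, audio the sample array
    return fs, audio
```

For a file named audio.wav, wav_format("audio.wav") should report (16000, 1, 2) if the clip is in the assumed format, and fs, audio = load_audio("audio.wav") gives you the two values passed to the model in the next step.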


Finally, to perform the speech-to-text operation, use the stt() method of the model.

print(deep.stt(audio, fs))

Congratulations! You now have the transcribed words on screen.


This is a good starting point and gives decent results, though not as good as the Google API. If you want to train your own model, you can do so with the same Mozilla DeepSpeech project.

Feel free to ask if you have any questions.