You have probably come across speech recognition somewhere before without paying it much attention. But now you can use your voice the way you use any other input device, whether that's through Google Speech Recognition or by talking to Alexa. What makes these systems special? How do they work? In this tutorial, you'll build one yourself in about 10 lines of code.
There are many cloud-based speech recognition APIs available today; the Google Cloud Speech API and the IBM Watson Speech-to-Text API are among the most widely used. But what if you don't want your application to depend on a third-party service? Or what if you want to create a speech recognition application that works offline? Well, you should consider using Mozilla DeepSpeech.
DeepSpeech is an open-source, TensorFlow-based speech-to-text engine with reasonably high accuracy. Needless to say, it uses state-of-the-art machine learning algorithms under the hood.
Installing and using it is surprisingly easy. In this tutorial, I'll help you get started.
— A computer running Ubuntu 16.04 or higher. You are free to use a Google Compute Engine VM or a DigitalOcean Droplet.
— Python 3.6
First, create a virtual environment with Python 3.6, then activate it.
Create a new directory to store a few DeepSpeech-related files.
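The two steps above might look like this on the command line (the environment and directory names here are my own choice, so feel free to change them):

```shell
# Create and activate a Python 3.6 virtual environment
# (assumes python3.6 is on your PATH)
python3.6 -m venv deepspeech-venv
source deepspeech-venv/bin/activate

# Create a working directory for the DeepSpeech-related files
mkdir deepspeech
cd deepspeech
```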
The easiest way to install DeepSpeech is with the pip tool. Make sure you have it on your computer by running the following command:
sudo apt install python3-pip
And now, you can install DeepSpeech for your current user. Since the pre-trained model you'll download below is version 0.4.1, it's best to pin the matching package version:
pip3 install --user deepspeech==0.4.1
DeepSpeech needs a model to be able to run speech recognition. You can train your own model, but, for now, let’s use a pre-trained one released by Mozilla. Here’s how you can download it:
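One way to fetch it from the command line is with wget; the URL below points at the v0.4.1 release on the Mozilla DeepSpeech GitHub releases page:

```shell
# Download the pre-trained v0.4.1 model package (~2 GB)
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.4.1/deepspeech-0.4.1-models.tar.gz
```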
You’ll be downloading about 2.0GB of data, so be patient.
Once the download is complete, extract it using the tar command:
tar -xvzf deepspeech-0.4.1-models.tar.gz
You should now have a models directory containing, among other files, the model (output_graph.pbmm), the alphabet (alphabet.txt), the language model (lm.binary), and the trie (trie). Next, create a Python script and import the required modules:
from deepspeech import Model
import scipy.io.wavfile as wav
You must now initialize a Model instance using the locations of the model and alphabet files. The constructor also expects the number of Mel-frequency cepstral coefficient (MFCC) features to use, the size of the context window, and a beam width for the connectionist temporal classification (CTC) decoder. These values should match the ones used during training. If you are using the pre-trained model from Mozilla, here's what you can use:
BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.10
N_FEATURES = 26
N_CONTEXT = 9
P.S.: use your own paths to the model files.
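Putting those constants together, initializing the model with the 0.4.x Python API looks roughly like this — a sketch that assumes you extracted the archive into a models/ directory next to your script (adjust the paths to your setup):

```python
from deepspeech import Model

BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.10
N_FEATURES = 26
N_CONTEXT = 9

# Paths into the extracted models/ directory -- adjust to your setup
MODEL_PATH = 'models/output_graph.pbmm'
ALPHABET_PATH = 'models/alphabet.txt'
LM_PATH = 'models/lm.binary'
TRIE_PATH = 'models/trie'

# Build the acoustic model
model = Model(MODEL_PATH, N_FEATURES, N_CONTEXT, ALPHABET_PATH, BEAM_WIDTH)

# Optional but recommended: attach the language model and trie,
# which noticeably improves transcription accuracy
model.enableDecoderWithLM(ALPHABET_PATH, LM_PATH, TRIE_PATH,
                          LM_WEIGHT, VALID_WORD_COUNT_WEIGHT)
```

Note that enableDecoderWithLM() is optional; without it, DeepSpeech decodes using the acoustic model alone.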
Next, you can read the WAV file using the read() method available in wavfile.
You can use a pre-recorded audio file, or record one yourself using a Python library. Here, I'm using pre-recorded audio:
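As a quick sanity check that wavfile.read() gives you what DeepSpeech expects (16 kHz, 16-bit, mono), here is a self-contained sketch that first writes a dummy one-second WAV file with the standard library — the filename test.wav is arbitrary:

```python
import wave
import scipy.io.wavfile as wav

# Write a dummy 1-second, 16 kHz, 16-bit mono WAV file of silence
with wave.open('test.wav', 'wb') as f:
    f.setnchannels(1)        # mono
    f.setsampwidth(2)        # 16-bit samples
    f.setframerate(16000)    # 16 kHz sample rate
    f.writeframes(b'\x00\x00' * 16000)

# read() returns the sample rate and a NumPy array of samples
fs, audio = wav.read('test.wav')
print(fs)           # 16000
print(audio.dtype)  # int16
```

For your own recordings, just pass the real filename to wav.read() and check that fs is 16000, since the pre-trained model was trained on 16 kHz audio.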
Finally, to perform the speech-to-text operation, use the stt() method of the model.
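In the 0.4.x API, stt() takes the raw sample buffer and the sample rate; assuming the model instance initialized earlier and the (fs, audio) pair returned by wavfile.read(), the final step is a one-liner:

```python
# Run speech-to-text on the audio buffer; returns the transcript as a string
text = model.stt(audio, fs)
print(text)
```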
Congratulations! You now have the transcribed words on your screen.
This is a good starting point and gives decent results, though not quite as good as the Google API. If you want better accuracy, you can train your own model using the same Mozilla DeepSpeech toolkit.
Feel free to ask if you have any questions.