Structured data usually lives in an RDBMS or a similar database where you can easily search records, read the numbers, and compare them. For example, a record can hold a name, an ID, a date of birth, a salary, an address, etc. The data is arranged in a tabular format, and it's simple to work with.
Unstructured data comprises audio, text, images, etc. Around 80% of enterprise data is stored in an unstructured format. It is not easy to work with because we can't directly use the data stored in an image or an audio file. In this article, we will mainly focus on audio data.
The human brain continuously perceives the audio around us. We hear birds chirping, traffic racing by, wind blowing, and people speaking. We have devices to store all this data in formats like MP3, WAV, WMA, etc. Now, what else can we do with this data?
Deep learning techniques are your best bet for working with unstructured data like this.
First, let's see what audio looks like.
Audio is represented in the form of a wave whose amplitude varies over time.
It is important to understand sampling: sounds are continuous analog signals, and to store them digitally we convert them into a digital signal composed of discrete data points. This process is called sampling, and the rate at which it is done is called the sample rate, measured in Hz (Hertz). Audio with a 48 kHz sample rate was sampled at 48,000 data points per second. A little information is always lost during sampling.
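To make sampling concrete, here is a small NumPy sketch (the tone frequency and sample rates are illustrative choices, not from the article) that "samples" a 440 Hz tone at two different rates. The higher rate keeps more data points per second of audio:

```python
import numpy as np

# Illustrative sketch: sample a continuous 440 Hz tone at two rates.
def sample_tone(freq=440.0, sr=48_000, duration=1.0):
    n = int(sr * duration)            # total number of samples
    t = np.arange(n) / sr             # sample instants, spaced 1/sr seconds apart
    return np.sin(2 * np.pi * freq * t)

high = sample_tone(sr=48_000)  # 48 kHz -> 48,000 data points per second
low = sample_tone(sr=8_000)    # 8 kHz  ->  8,000 data points per second
print(len(high), len(low))     # 48000 8000
```

The 8 kHz version describes the same second of sound with six times fewer points, which is exactly the trade-off between file size and fidelity that sampling introduces.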
LibROSA is a popular Python package for music and audio analysis. It provides the building blocks for music information retrieval systems.
To install the package with pip, run this command in your terminal:
pip install librosa
We will load a 23-second audio file of a dog barking.

import librosa

data, sample_rate = librosa.load("dog bark.wav")
The load method of librosa takes the path of the audio file and returns a tuple containing the sampled audio data and the sample rate. The default sample rate is 22050 Hz. You can also specify a custom sample rate as an argument; to keep the file's original sample rate, we pass sr=None:
data, sample_rate = librosa.load("dog bark.wav", sr=None)
(1049502,) [ 0.00019836 -0.00036621 0.00016785 …. 0.00099182 0.00161743 0.00135803]
The data is a NumPy array with 1,049,502 data points, and the original sample rate is 44100 Hz. Scaling down the sample rate gives us less data to process, so operations run faster, but scaling down too far results in information loss.
Librosa has a display module that plots the data as a waveform.
This is what the barking of a dog looks like. Now, with the sampled data and the sample rate, we can extract features from the audio.
There are various methods and techniques to extract audio features. These include:
Zero-Crossing Rate - If you look at the waveform above, the sampled data lies between -1 and 1. The zero-crossing rate is the rate at which the signal changes sign, i.e. crosses from a negative value to a positive one or vice versa. It is used heavily in speech recognition and music information retrieval.
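Librosa exposes this as librosa.feature.zero_crossing_rate; to show what it measures, here is a minimal from-scratch sketch in NumPy (the test signals are invented for illustration):

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of consecutive sample pairs whose signs differ
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(signs[:-1] != signs[1:])

# A signal that flips sign at every sample crosses zero at the maximum rate
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
print(zero_crossing_rate(alternating))   # 1.0

# A signal that never changes sign has a rate of zero
print(zero_crossing_rate(np.ones(10)))   # 0.0
```

Noisy or percussive sounds tend to have high zero-crossing rates, while voiced speech and tonal sounds have low ones, which is why the feature helps discriminate between them.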
Spectral Centroid - It indicates the "brightness" of a given sound and represents the spectral center of gravity. Suppose you are trying to balance a pencil on your finger: the spectral centroid would be the frequency where your finger touches the pencil when it is balanced.
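Librosa computes this per frame with librosa.feature.spectral_centroid; conceptually it is the magnitude-weighted mean frequency, as in this NumPy sketch (the 1 kHz test tone is an invented example):

```python
import numpy as np

def spectral_centroid(frame, sr):
    # Magnitude spectrum and the frequency of each FFT bin
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Weighted mean frequency: the spectrum's "centre of gravity" in Hz
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)

sr = 22_050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1_000 * t)     # pure 1 kHz sine
print(f"{spectral_centroid(tone, sr):.1f}")   # 1000.0
```

For a pure tone all the spectral weight sits at one frequency, so the centroid lands there; for real sounds, more high-frequency energy pulls the centroid up, which is perceived as brightness.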
Spectral Rolloff - The spectral roll-off is the frequency in Hz below which a predefined percentage (roll_percent) of the total spectral energy lies; the default is 85% in the librosa library.
This feature is useful in distinguishing voiced signals from unvoiced signals. It is also good for approximating the maximum or minimum frequency of a signal by setting roll_percent close to 1 or close to 0, respectively.
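Librosa provides this as librosa.feature.spectral_rolloff with a roll_percent parameter; the idea can be sketched from scratch with NumPy. The two-tone test signal below is an invented example: a strong 500 Hz component plus a weak 4000 Hz component.

```python
import numpy as np

def spectral_rolloff(frame, sr, roll_percent=0.85):
    # Frequency below which roll_percent of the spectral magnitude lies
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    return freqs[np.searchsorted(cumulative, threshold)]

sr = 22_050
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 500 * t) + 0.1 * np.sin(2 * np.pi * 4_000 * t)

print(spectral_rolloff(sig, sr))         # ~500 Hz: 85% of the energy is below here
print(spectral_rolloff(sig, sr, 0.99))   # ~4000 Hz: pushed up to the weak high tone
```

With the default 85%, the roll-off sits at the dominant 500 Hz component; raising roll_percent toward 1 pushes it out to the highest significant frequency, which is why it approximates the maximum frequency of a signal.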
MFCC — Mel-Frequency Cepstral Coefficients - Each voice sounds different because it is filtered by the vocal tract, including the tongue, teeth, etc. That shape determines how the voice sounds, and if we can determine the shape accurately, we can characterise the sound it produces. The job of the MFCCs is to capture the shape of the vocal tract as a compact representation of the envelope of the power spectrum.
MFCCs are the most widely used feature in audio and speech recognition. They were introduced in 1980 and have been the state of the art ever since.
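In practice you would call librosa.feature.mfcc, but the classic pipeline behind it can be sketched with plain NumPy. All the sizes below (26 mel filters, 13 coefficients, a 2048-sample frame) are conventional illustrative choices, not fixed requirements:

```python
import numpy as np

# From-scratch sketch of the MFCC pipeline (librosa.feature.mfcc wraps this idea).
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_mels=26, n_mfcc=13):
    # 1. Power spectrum of the frame
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)

    # 2. Triangular filters spaced evenly on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)

    # 3. Log of the energy captured by each mel band
    log_mel = np.log(fbank @ power + 1e-10)

    # 4. DCT-II decorrelates the bands; keep the first n_mfcc coefficients
    k = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), k + 0.5) / n_mels)
    return basis @ log_mel

sr = 22_050
frame = np.sin(2 * np.pi * 440 * np.arange(2048) / sr)  # one 2048-sample frame
coeffs = mfcc(frame, sr)
print(coeffs.shape)  # (13,)
```

The low-order coefficients describe the smooth envelope of the spectrum, i.e. the vocal-tract shape, which is why a handful of them is enough for speech tasks.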
There is a huge amount of unstructured data on the internet. Analyzing it is not an easy task, as we have to perform many transformations on the data to extract features. Audio features fall into three categories: time-domain, spectral, and perceptual features.