The Cardio of Audio



Structured data usually lives in an RDBMS or a database where you can easily search records, view the values, and compare them. For example, a record can contain a name, an ID, a date of birth, a salary, an address, etc. The data is arranged in a tabular format and is simple to work with.

Unstructured data comprises audio, text, images, etc. Around 80% of enterprise data is stored in an unstructured format. It is harder to work with because we cannot directly use the data stored in an image or an audio file. In this article, we will focus mainly on audio data.


The human brain continuously perceives the audio around us. We hear birds chirping, cars racing down the road, the wind blowing, and people speaking. We have devices to store all this data in various formats like MP3, WAV, WMA, etc. Now, what else can we do with this data?

For working with unstructured data like this, deep learning techniques are your best bet.

First, let us see what audio looks like.

Audio is represented as a wave whose amplitude varies over time.


It is important to understand sampling. Sounds are continuous analog signals, and when we convert them into a digital signal, we keep only discrete data points from the signal. This process is called sampling, and the rate at which it is done is called the sample rate, measured in Hz (Hertz). Audio with a 48 kHz sample rate means the audio was sampled at 48,000 data points per second. A little bit of information is lost in sampling.
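The idea can be sketched in a few lines of NumPy: we choose a sample rate and evaluate a continuous tone only at the discrete time points it defines. The 440 Hz tone and one-second duration here are just illustrative choices, not anything from a real recording:

```python
import numpy as np

sample_rate = 48_000   # samples per second (Hz)
duration = 1.0         # seconds
frequency = 440.0      # an A4 tone, purely for illustration

# Discrete time points: one every 1/sample_rate seconds
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)

# The sampled signal: the wave's amplitude at each discrete point
signal = np.sin(2 * np.pi * frequency * t)

print(signal.shape)  # (48000,) -- 48,000 data points for one second of audio
```

Everything between two consecutive points in `t` is discarded, which is exactly the small information loss the sampling step introduces.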



LibROSA is a popular Python package for music and audio analysis. It provides the building blocks to create a music information retrieval system.

To install the package with pip, run this command in your terminal.

pip install librosa

Load an Audio File

We will load a 23-second audio file of a dog barking.

import librosa

data, sample_rate = librosa.load("dog bark.wav")

The load method of librosa takes the path of the audio file and returns a tuple containing the sampled audio data and the sample rate. The default sample rate is 22050 Hz. You can also specify a custom sample rate with the sr argument. To keep the file's original sample rate, we use sr=None.

data, sample_rate = librosa.load("dog bark.wav", sr=None)

Let us see what is in the data-

print(data.shape, data)
print(sample_rate)


(1049502,) [ 0.00019836 -0.00036621  0.00016785 ...  0.00099182  0.00161743  0.00135803]
44100


The data is a NumPy array with 1,049,502 data points, and the original sample rate is 44100 Hz. Scaling down the sample rate reduces the amount of data, so we can perform operations faster, but scaling down too much also results in information loss.

Displaying the Audio Data

Librosa has a display module that plots a graph of the data-

import librosa.display

librosa.display.waveplot(data)


This is what the barking of a dog looks like. Now, with the sampled data and the sample rate, we can extract features from the audio.


Feature extractions for Machine Learning

There are various methods and techniques to extract audio features. These are-

Time-domain Features

Zero-Crossing Rate -

If you look at the waveform above, you will see the sampled data lies between -1 and 1. The zero-crossing rate is the rate at which the signal changes sign, i.e., goes from a negative value to a positive value or back. It is used heavily in speech recognition and music information retrieval.
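Librosa computes this with librosa.feature.zero_crossing_rate; a minimal NumPy sketch of the same idea (not librosa's exact frame-based implementation) makes the definition concrete. A pure 440 Hz tone crosses zero about 880 times per second, so at a 22050 Hz sample rate we expect a rate near 880/22050 ≈ 0.04:

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs where the signal changes sign."""
    signs = np.sign(signal)
    crossings = np.abs(np.diff(signs)) > 0
    return crossings.mean()

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)   # a 440 Hz test tone
print(zero_crossing_rate(tone))      # roughly 0.04
```

A noisy or unvoiced signal changes sign far more often than a voiced tone, which is why this single number is such a useful discriminator.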

Spectral Features

Spectral Centroid -

It indicates the "brightness" of a given sound. It represents the spectral center of gravity: the magnitude-weighted mean of the frequencies present in the signal. Suppose you are trying to balance a pencil on your finger: the spectral centroid is the frequency where your finger would touch the pencil when the spectrum is balanced.
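Librosa provides librosa.feature.spectral_centroid; the "center of gravity" idea can be sketched directly from the definition with NumPy (a simplified whole-signal version, while librosa works frame by frame). For a pure 1 kHz tone, all the spectral weight sits at one frequency, so the centroid lands there:

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Magnitude-weighted mean frequency of the spectrum."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)   # a pure 1 kHz tone
print(spectral_centroid(tone, sr))    # close to 1000 Hz
```

Brighter sounds push more energy into high frequencies, which drags this weighted mean upward.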

Spectral Rolloff -

Spectral roll-off is the frequency in Hz below which a predefined percentage (roll_percent) of the total spectral energy lies; the default is 85% in the librosa library.

This feature is useful for distinguishing voiced signals from unvoiced signals. It is also good for approximating the maximum or minimum frequency by setting roll_percent to a value close to 1 or close to 0.
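Librosa exposes this as librosa.feature.spectral_rolloff; a NumPy sketch of the definition (again whole-signal rather than librosa's frame-based computation) shows how the threshold works. With a strong low tone and a weaker high tone, the 85% cutoff has to reach up to the high component before enough energy has accumulated:

```python
import numpy as np

def spectral_rolloff(signal, sr, roll_percent=0.85):
    """Frequency below which roll_percent of the cumulative spectral magnitude lies."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    # First frequency bin where the running total passes the threshold
    return freqs[np.searchsorted(cumulative, threshold)]

sr = 22050
t = np.arange(sr) / sr
# A strong 500 Hz tone plus a weaker 4000 Hz component
signal = np.sin(2 * np.pi * 500 * t) + 0.2 * np.sin(2 * np.pi * 4000 * t)
print(spectral_rolloff(signal, sr))   # the weaker high tone sets the roll-off
```

Setting roll_percent near 0 instead would return the low tone, which is the minimum/maximum-frequency trick mentioned above.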

Perceptual Features

MFCC — Mel-Frequency Cepstral Coefficients -

Each individual voice sounds different because it is filtered by our vocal tract, including the tongue, teeth, etc. This shape determines how the voice sounds, and by determining the shape accurately we can characterize the sound it will produce. The job of the MFCCs is to capture the shape of the vocal tract as a compact representation of the short-term power spectrum.

MFCCs are the most widely used feature in audio and speech recognition. They were introduced in 1980 and have been the state of the art ever since.
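In practice you would call librosa.feature.mfcc, which handles framing, windowing, and its own mel filterbank. To show what happens under the hood, here is a simplified single-frame sketch of the classic pipeline (power spectrum → triangular mel filterbank → log → DCT); the filter count, coefficient count, and frame length are illustrative choices, not librosa's defaults:

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc_frame(frame, sr, n_mels=26, n_coeffs=13):
    """Sketch of MFCCs for one frame: power spectrum -> mel filterbank -> log -> DCT."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Triangular filters spaced evenly on the mel scale
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    energies = np.empty(n_mels)
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = np.clip((freqs - lo) / (mid - lo), 0, None)
        falling = np.clip((hi - freqs) / (hi - mid), 0, None)
        energies[i] = np.sum(np.clip(np.minimum(rising, falling), 0, 1) * power)
    log_energies = np.log(energies + 1e-10)
    # DCT-II decorrelates the filterbank outputs into cepstral coefficients
    n = np.arange(n_mels)
    return np.array([np.sum(log_energies * np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels)))
                     for k in range(n_coeffs)])

sr = 22050
frame = np.sin(2 * np.pi * 440 * np.arange(2048) / sr)  # one 2048-sample frame
print(mfcc_frame(frame, sr).shape)  # (13,)
```

The log and DCT steps are what turn the spectrum into a smooth description of its envelope, i.e., the vocal-tract shape rather than the pitch.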


Unstructured data is abundant on the internet. It is not easy to analyze because we have to perform many transformations on the data to extract features. Audio features fall into three categories: time-domain, spectral, and perceptual features.
