Mysteries of Neural Networks
We live in fascinating times, where Deep Learning [DL] is continuously applied in new areas of our life and, very often, revolutionizes otherwise stagnant industries. At the same time, open-source frameworks such as Keras and PyTorch level the playing field and give everybody access to state-of-the-art tools and algorithms. The strong community and simple API of these libraries make it possible to have cutting-edge models at your fingertips, even without in-depth knowledge of the math that makes it all possible.
However, understanding what is happening inside the Neural Network [NN] helps a lot with tasks like architecture selection, hyperparameter tuning, or performance optimization. Since I believe that nothing teaches you more than getting your hands dirty, I’ll show you how to create a Convolutional Neural Network [CNN] capable of classifying MNIST images, with 90% accuracy, using only NumPy.
NOTE: A convolutional neural network is a type of deep neural network, most commonly used to analyze images.
This article is directed primarily to people with some experience with DL frameworks. However, if you are just a beginner — entering the world of Neural Networks — please don’t be afraid! This time, I’m not planning to analyze any math equations. Honestly, I’m not even going to write them down. Instead, I’ll try my best to give you an intuition about what happens under the cover of these well-known libraries.
As already mentioned, our primary goal is to build a CNN, based on the architecture shown in the illustration above and test its capabilities on the MNIST image dataset. This time, however, we won’t use any of the popular DL frameworks. Instead, we will take advantage of NumPy — a powerful but low-level library for linear algebra in Python. Of course, this approach will significantly complicate our job, but at the same time, it will allow us to understand what is happening at each stage of our model. Along the way, we will create a simple library containing all the necessary layers, so you will be able to continue experimenting and solve other classification problems.
NOTE: MNIST is a large database of handwritten digits that is commonly used as a benchmark for image recognition algorithms. Each black and white photo is 28x28 px.
Let’s stop for a second to analyze the structure of digital images, as it has a direct impact on our design decisions. In reality, digital photos are huge matrices of numbers. Each such number represents the brightness of a single pixel. In the RGB model, the color image is composed of three such matrices corresponding to three color channels: red, green, and blue. On the other hand, to represent grayscale images, like those we can find in the MNIST data set, we only need one such matrix.
In linear algebra, these structured and multidimensional matrices are called tensors. Tensor dimensions are described by their shape. For example, the shape of a single MNIST image is [28, 28, 1], where successive values indicate the height, width, and the number of color channels.
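To make the idea of a shape concrete, here is a small NumPy sketch. The arrays are zero-filled stand-ins rather than real MNIST data, and the variable names are only illustrative.

```python
import numpy as np

# Stand-in for one grayscale MNIST image: height, width, channels.
image = np.zeros((28, 28, 1), dtype=np.float32)

# Stand-in for an RGB image of the same size: three brightness matrices.
rgb_image = np.zeros((28, 28, 3), dtype=np.float32)

# Stacking images along a new leading axis produces a batch.
batch = np.stack([image, image, image, image])

print(image.shape)      # (28, 28, 1)
print(rgb_image.shape)  # (28, 28, 3)
print(batch.shape)      # (4, 28, 28, 1)
```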
A Sequential Model is one where successive layers form a linear flow — the outcome of the first layer is used as input to the second one, and so on. The model acts as a conductor in this orchestra and is responsible for controlling the data flow between the layers.
There are two flow types: forward and backward. We use forward propagation to make predictions based on already accumulated knowledge and new data provided as an input X. On the other hand, backpropagation is all about comparing our predictions Y_hat with real values Y and drawing conclusions. Thus, each layer of our network will have to provide two methods, one for the forward pass and backward_pass for the backward pass, which will be accessible by the model. Some of the layers, namely Dense and Convolutional, will also have the ability to gather knowledge and learn. They keep their own tensors called weights and update them at the end of each training iteration. In simple terms, a single training iteration consists of three elements: forward pass, backward pass, and weights update.
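The data flow described above can be sketched in a few lines. This is a minimal illustration, not the final library: the names forward_pass, update, Scale, SequentialModel and train_step are my assumptions (only backward_pass is named in the text), and the toy Scale layer and the squared-error gradient y_hat - y exist only to make the example runnable.

```python
import numpy as np

class Scale:
    """Hypothetical toy layer with one trainable weight: output = w * input."""

    def __init__(self, w):
        self.w = w

    def forward_pass(self, a_prev, training):
        self.a_prev = a_prev              # cache the input for the backward pass
        return self.w * a_prev

    def backward_pass(self, da_curr):
        self.dw = np.sum(da_curr * self.a_prev)  # gradient w.r.t. the weight
        return self.w * da_curr                  # gradient w.r.t. the input

    def update(self, lr):
        self.w -= lr * self.dw


class SequentialModel:
    """Passes data through the layers in order, and gradients in reverse."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, x, training=False):
        a = x
        for layer in self.layers:
            a = layer.forward_pass(a, training)
        return a

    def backward(self, d_out):
        d = d_out
        for layer in reversed(self.layers):
            d = layer.backward_pass(d)
        return d

    def train_step(self, x, y, lr):
        y_hat = self.forward(x, training=True)   # 1. forward pass
        self.backward(y_hat - y)                 # 2. backward pass (assumed loss gradient)
        for layer in self.layers:                # 3. weights update
            layer.update(lr)
        return y_hat


model = SequentialModel([Scale(2.0), Scale(3.0)])
print(model.forward(np.array([1.0, 2.0])))  # [ 6. 12.]
```

Layers without weights, such as pooling or flatten, would simply provide an update method that does nothing.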
Convolution is an operation where we take a small matrix of numbers (called a kernel or filter) and pass it over our image to transform it based on the filter values. After placing our kernel over a selected pixel, we take each value from the filter and multiply it by the corresponding value from the image. Finally, we sum everything up and put the result in the right place in the output matrix.
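The procedure above can be written as two nested loops. The sketch below assumes a single-channel image, a stride of one and no padding, and the function name convolve2d_valid is hypothetical. As is common in Deep Learning code, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over a 2D image (stride 1, no padding)."""
    h_in, w_in = image.shape
    h_f, w_f = kernel.shape
    h_out, w_out = h_in - h_f + 1, w_in - w_f + 1
    output = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # Part of the image currently covered by the kernel.
            window = image[i:i + h_f, j:j + w_f]
            # Multiply in pairs, then sum everything up.
            output[i, j] = np.sum(window * kernel)
    return output

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
result = convolve2d_valid(image, kernel)
print(result.shape)  # (3, 3)
print(result)        # every entry is -5.0
```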
It’s quite simple, right? Right? Well, often, things tend to be a little bit more complicated. In order to speed up calculations, a layer usually processes multiple images at once. Therefore, we pass a four-dimensional tensor with shape [n, h_in, w_in, c] as an input. Here n corresponds to the number of images processed in parallel, the so-called batch size. The rest of the dimensions are quite standard: height, width, and the number of channels.
Moreover, the input tensor usually has more than one channel. Above, you can see an example of a layer that performs the convolution on color images. That process is called convolution over volume. The most important rule, in that case, is that the filter and the image must have the same number of channels. We proceed very much like in standard convolution, but this time we multiply the pairs of numbers from the three-dimensional tensor.
Finally, to make the layers as versatile as possible, each of them usually contains multiple filters. We carry out the convolution for each of the kernels separately, stack the results one on top of the other and combine them into a whole. The convolutional layer forward pass produces a four-dimensional tensor with shape [n, h_out, w_out, n_f], where n_f corresponds to the number of filters applied in a given layer. Let’s take a look at the visualization below to gain a little bit more intuition about those dimensions.
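Putting batches, channels and multiple filters together, a forward pass might look like the sketch below. The weight layout [h_f, w_f, c, n_f], the bias vector b and the function name conv_forward are assumptions made for this example, and it again uses a stride of one with no padding.

```python
import numpy as np

def conv_forward(a_prev, w, b):
    """
    a_prev: input batch, shape [n, h_in, w_in, c]
    w:      filters, shape [h_f, w_f, c, n_f] (assumed layout)
    b:      one bias per filter, shape [n_f]
    Returns a tensor of shape [n, h_out, w_out, n_f].
    """
    n, h_in, w_in, c = a_prev.shape
    h_f, w_f, c_f, n_f = w.shape
    # Filter and image must have the same number of channels.
    assert c == c_f, "channel mismatch between input and filters"

    h_out, w_out = h_in - h_f + 1, w_in - w_f + 1
    output = np.zeros((n, h_out, w_out, n_f))
    for i in range(h_out):
        for j in range(w_out):
            # Window for every image in the batch: [n, h_f, w_f, c].
            window = a_prev[:, i:i + h_f, j:j + w_f, :]
            # Sum over height, width and channels for each filter: [n, n_f].
            output[:, i, j, :] = np.tensordot(
                window, w, axes=([1, 2, 3], [0, 1, 2])
            )
    return output + b

a_prev = np.ones((2, 5, 5, 3))   # two 5x5 RGB images
w = np.ones((3, 3, 3, 4))        # four 3x3 filters over three channels
b = np.zeros(4)
out = conv_forward(a_prev, w, b)
print(out.shape)  # (2, 3, 3, 4)
```

With all-ones inputs and filters, every output value equals 27, the number of elements in a 3x3x3 window, which is a quick way to check the indexing.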
It is commonly believed that a higher resolution improves photo quality. After all, smooth edges of objects visible in the picture make the overall scene more appealing to the human eye. Interestingly, very often, more pixels do not translate into more detailed image understanding. It seems that computers simply don’t care too much. Storing these redundant pixels is called over-representation. Very often, even a significant reduction of the tensor volume does not affect the quality of the achieved predictions.
NOTE: Nowadays, a standard smartphone camera is capable of producing 12 Mpx images. Such an image is represented by a colossal tensor consisting of 36 million numbers.
The main task of the pooling layer is to reduce the spatial size of our tensor. We do this to limit the number of parameters that we need to train, which shortens the whole training process. This effect is achieved by dividing the tensor into sections and then applying a function of our choice on each part separately. The function must be defined in such a way that for every section it returns a single value. Depending on our choice, we may deal with, for example, max or average-pooling.
The visualization above shows a simple max-pooling operation. During forward propagation, we iterate over each section and find its maximum value. We copy that number and save it in the output. At the same time, we also memorize the location of the number we selected. As a result, two tensors are created: the output, which is then passed on to the next layer, and the mask, which will be used during backpropagation. The pooling layer transforms the tensor from its original shape [n, h_in, w_in, c] to [n, h_out, w_out, c]. Here the ratio between h_in and h_out is defined by the stride and the size of the pooling sections.
When backpropagating through the pooling layer, we start with the differentials tensor and try to expand its dimensions. To begin with, we create an empty tensor with shape [n, h_in, w_in, c] and fill it with zeros. Then, we use the cached mask tensor to relocate the input values to the places previously occupied by the maximum numbers.
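Both passes of max-pooling can be sketched as follows. The function names and the pool_size and stride parameters are assumptions for illustration, and the sketch assumes non-overlapping sections (stride equal to pool_size). If a section contains tied maxima, every tied position receives the gradient.

```python
import numpy as np

def max_pool_forward(a_prev, pool_size=2, stride=2):
    """Returns the pooled output and a boolean mask marking the maxima."""
    n, h_in, w_in, c = a_prev.shape
    h_out = (h_in - pool_size) // stride + 1
    w_out = (w_in - pool_size) // stride + 1
    output = np.zeros((n, h_out, w_out, c))
    mask = np.zeros(a_prev.shape, dtype=bool)
    for i in range(h_out):
        for j in range(w_out):
            hs, ws = i * stride, j * stride
            window = a_prev[:, hs:hs + pool_size, ws:ws + pool_size, :]
            # Maximum of each section, per image and per channel.
            window_max = window.max(axis=(1, 2), keepdims=True)
            output[:, i, j, :] = window_max[:, 0, 0, :]
            # Memorize where the maximum was found.
            mask[:, hs:hs + pool_size, ws:ws + pool_size, :] |= (window == window_max)
    return output, mask

def max_pool_backward(da_curr, mask, pool_size=2, stride=2):
    """Routes each incoming gradient back to the position of the maximum."""
    da_prev = np.zeros(mask.shape)   # tensor of zeros with the input shape
    _, h_out, w_out, _ = da_curr.shape
    for i in range(h_out):
        for j in range(w_out):
            hs, ws = i * stride, j * stride
            window_mask = mask[:, hs:hs + pool_size, ws:ws + pool_size, :]
            da_prev[:, hs:hs + pool_size, ws:ws + pool_size, :] += (
                window_mask * da_curr[:, i:i + 1, j:j + 1, :]
            )
    return da_prev

a_prev = np.array([[1, 2, 5, 6],
                   [3, 4, 7, 8],
                   [9, 10, 13, 14],
                   [11, 12, 15, 16]], dtype=float).reshape(1, 4, 4, 1)
pooled, mask = max_pool_forward(a_prev)
print(pooled[0, :, :, 0])   # [[ 4.  8.] [12. 16.]]
da_prev = max_pool_backward(np.ones_like(pooled), mask)
print(da_prev[0, :, :, 0])  # ones where the maxima were, zeros elsewhere
```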
Dropout is one of the most popular methods for regularization and preventing Neural Network overfitting. The idea is simple: every unit of the dropout layer is given the probability of being temporarily ignored during training. Then, in each iteration, we randomly select the neurons that we drop according to the assigned probability. The visualization below shows an example of a layer subjected to a dropout. We can see how, in each iteration, random neurons are deactivated. As a result, the values in the weight matrix become more evenly distributed. The model balances the risk and avoids betting all the chips on a single number. During inference, the dropout layer is turned off so we have access to all parameters.
NOTE: Overfitting occurs when our model fits too closely to a limited set of data points. A model like that will generalize poorly and most likely fail given a new set of data.
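A dropout layer could be sketched as below. This version uses so-called inverted dropout, where the surviving activations are divided by keep_prob so that their expected value stays the same. That scaling, along with the class name, the keep_prob parameter and the seed argument, is my assumption for this example.

```python
import numpy as np

class DropoutLayer:
    """Minimal inverted-dropout sketch; keep_prob is the chance a unit survives."""

    def __init__(self, keep_prob, seed=None):
        self.keep_prob = keep_prob
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward_pass(self, a_prev, training):
        if not training:
            # During inference the layer is turned off.
            self.mask = None
            return a_prev
        # Randomly choose which units stay active in this iteration.
        self.mask = self.rng.random(a_prev.shape) < self.keep_prob
        return a_prev * self.mask / self.keep_prob

    def backward_pass(self, da_curr):
        if self.mask is None:
            return da_curr
        # Gradients flow only through the units that were kept.
        return da_curr * self.mask / self.keep_prob

layer = DropoutLayer(keep_prob=0.5, seed=0)
x = np.ones((4, 5))
inference_out = layer.forward_pass(x, training=False)  # identical to x
training_out = layer.forward_pass(x, training=True)    # entries are 0.0 or 2.0
grad = layer.backward_pass(np.ones((4, 5)))
```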
The flatten layer is surely the simplest layer that we implement during our journey. However, it serves a vital role as a link between the convolutional and densely connected layers. As the name suggests, during the forward pass, its task is to flatten the input and change it from a multidimensional tensor to a vector. We will reverse this operation during the backward pass.
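In NumPy both directions boil down to a reshape. The sketch below uses a hypothetical FlattenLayer class and assumes that the batch dimension is preserved, so every image becomes one row vector.

```python
import numpy as np

class FlattenLayer:
    """Reshapes [n, h, w, c] into [n, h * w * c] and back again."""

    def __init__(self):
        self.shape = None

    def forward_pass(self, a_prev, training=False):
        self.shape = a_prev.shape                   # remember the original shape
        return a_prev.reshape(a_prev.shape[0], -1)  # one vector per image

    def backward_pass(self, da_curr):
        return da_curr.reshape(self.shape)          # undo the flattening

flatten = FlattenLayer()
a_prev = np.zeros((8, 7, 7, 16))
flat = flatten.forward_pass(a_prev)
restored = flatten.backward_pass(flat)
print(flat.shape)      # (8, 784)
print(restored.shape)  # (8, 7, 7, 16)
```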
Amongst all the functions that we will use, there are a few straightforward but powerful ones. Activation functions can be written in a single line of code, but they give the Neural Network the non-linearity and expressiveness that it desperately needs. Without activations, the NN would become a combination of linear functions, so it would be just a linear function itself. Our model would have limited expressiveness, no greater than logistic regression. The non-linearity element allows for greater flexibility and the creation of complex functions during the learning process.
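As an illustration, here are two activations that would fit a model like ours, ReLU and softmax; choosing these two is my assumption, since the text does not name specific functions. The last part of the sketch checks the claim above numerically: two linear layers without an activation in between collapse into a single linear layer.

```python
import numpy as np

def relu(z):
    # Keep positive values, replace negative ones with zero.
    return np.maximum(0, z)

def relu_backward(da_curr, z):
    # The gradient passes only where the input was positive.
    return da_curr * (z > 0)

def softmax(z):
    # Subtracting the row maximum keeps the exponentials numerically stable.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[-1.0, 0.0, 2.0]])
print(relu(z))             # [[0. 0. 2.]]
print(softmax(z).sum())    # sums to 1 (up to floating point rounding)

# Two linear layers in a row are equivalent to one linear layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=(3, 2))
two_layers = (x @ w1) @ w2
one_layer = x @ (w1 @ w2)
print(np.allclose(two_layers, one_layer))  # True
```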
Similar to activation functions, dense layers are the bread and butter of Deep Learning. You can create fully functional Neural Networks, like the one you can see in the illustration below, using only those two components. Unfortunately, despite their obvious versatility, they have a fairly large drawback: they are computationally expensive. Each dense layer neuron is connected to every unit of the previous layer. A dense network like that requires a large number of trainable parameters. This is particularly problematic when processing images.
Luckily, the implementation of such a layer is very easy. The forward pass boils down to multiplying the input matrix by the weights and adding bias, which takes a single line of NumPy code. Each value of the weights matrix represents one arrow between neurons of the network visible in Figure 10. The backpropagation is a bit more complicated, but only because we have to calculate three values: dA (activation derivative), dW (weights derivative), and db (bias derivative). As promised, I am not going to post math formulas in this article. What is essential is that calculating these differentials is simple enough that it won’t cause us any problems. If you would like to dig a little deeper and are not afraid to face linear algebra, I encourage you to read my other article, where I explain in detail all the twists and turns of the dense layer backward pass.
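For reference, a dense layer along these lines could look like the sketch below. The layout is an assumption: inputs are stored as [n, units_in], weights as [units_in, units_out], and dW and db are averaged over the batch. The class and method names are hypothetical as well.

```python
import numpy as np

class DenseLayer:
    """Minimal dense layer sketch with weights w and bias b."""

    def __init__(self, w, b):
        self.w = w            # shape [units_in, units_out]
        self.b = b            # shape [1, units_out]
        self.a_prev = None

    def forward_pass(self, a_prev, training=False):
        self.a_prev = a_prev                 # cache the input for backpropagation
        return a_prev @ self.w + self.b      # the promised single line

    def backward_pass(self, da_curr):
        n = self.a_prev.shape[0]
        self.dw = self.a_prev.T @ da_curr / n                  # weights derivative
        self.db = da_curr.sum(axis=0, keepdims=True) / n       # bias derivative
        return da_curr @ self.w.T                              # derivative passed to the previous layer

    def update(self, lr):
        self.w -= lr * self.dw
        self.b -= lr * self.db

dense = DenseLayer(w=np.array([[1.0, 2.0], [3.0, 4.0]]), b=np.zeros((1, 2)))
out = dense.forward_pass(np.array([[1.0, 1.0]]))
da_prev = dense.backward_pass(np.array([[1.0, 1.0]]))
print(out)       # [[4. 6.]]
print(da_prev)   # [[3. 7.]]
```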
I hope that my article has broadened your horizons and increased your understanding of the math operations taking place inside the NN. I admit that I learned a lot by preparing the code, comments, and visualizations used in this post. If you have any questions, feel free to leave a comment under the article or reach out to me through social media.
This article is another part of the “Mysteries of Neural Networks” series. If you haven’t had the opportunity yet, please consider reading the other pieces. Also, if you like my work so far, follow me on Twitter, Medium, and Kaggle. Check out other projects I’m working on, like MakeSense, an online labeling tool for small Computer Vision projects. Most importantly, stay curious!