#011 CNN Why convolutions?
Why convolutions?
In this post we will talk about why convolutions, and convolutional neural networks in general, work so well in computer vision.
Convolutions are very useful when we include them in our neural networks. There are two main advantages of \(Convolutional \) layers over \(Fully\enspace connected\) layers:
- parameter sharing and
- sparsity of connections.
Let’s illustrate this with an example. Say we have a \(32\times32\times3\) dimensional image (this is the example from the previous post) and we apply \( 6 \) filters of size \(5 \times 5\) (\( f=5 \)). This gives a \(28\times28\times6 \) dimensional output.
The main advantage of \(Conv\) layers over \(Fully\enspace connected \) layers is the number of parameters to be learned.
If we multiply \(32\times32\times3\) we get \(3072\), and \(28\times28\times 6 \) gives \(4704\). If we built a network with \(3072\) units in one layer and \(4704 \) units in the next, and connected every unit in the first layer to every unit in the second, the weight matrix would have \(3072 \times 4704\) entries, which is about \(14\) million parameters to be learned. Today we can train neural networks with far more than \(14 \) million parameters, but for a single layer on such a small image this is a lot. Naturally, for a \(1000 \times 1000 \) image this weight matrix would become far too large. On the other hand, look at the number of parameters in the convolutional layer: each filter is \(5 \times 5\), so it has \(25 \) weights plus a bias, that is \(26\) parameters per filter, and with six filters the total is \(6 \times 26 = 156 \). (This count treats each filter as a single \(5 \times 5\) slice; if each filter spans all three input channels, it has \(5 \times 5 \times 3 + 1 = 76\) parameters, for \(456\) in total.) Either way, the number of parameters in this \(Conv \) layer remains very small.
A \(Convnet \) has a small number of parameters for two reasons:
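The arithmetic for this example can be checked with a few lines of Python. This is only a sketch of the counting; the variable names are made up for illustration, and the last count (filters spanning all three input channels) is an additional assumption that goes beyond the simplified count used in the text.

```python
# Parameter counts for the 32x32x3 -> 28x28x6 example.
n_in = 32 * 32 * 3       # 3072 input values
n_out = 28 * 28 * 6      # 4704 output values

# Fully connected: one weight per (input, output) pair (biases ignored).
fc_weights = n_in * n_out

# Convolutional, counted as in the text: 5x5 weights + 1 bias per filter.
f, n_filters = 5, 6
conv_params_simple = (f * f + 1) * n_filters

# Assumption: each filter spans all 3 input channels (5x5x3 weights + 1 bias).
conv_params_3ch = (f * f * 3 + 1) * n_filters

print(fc_weights)          # 14450688
print(conv_params_simple)  # 156
print(conv_params_3ch)     # 456
```

Under either way of counting, the convolutional layer needs tens of thousands of times fewer parameters than the fully connected one.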
- Parameter sharing
A feature detector, for example a vertical edge detector, that’s useful in one part of the image is probably useful in another part of the image. This means that if we apply a \(3\times3 \) filter for detecting vertical edges at one position in an image, we can apply the same \(3\times3 \) filter, with the same weights, at every other position in the image.
A feature detector, such as a vertical edge detector, that’s useful in one part of the image may be useful in another part of the image as well.
Each of these feature detectors can use the same parameters at many different positions in the input image in order to detect a vertical edge or some other feature. This holds for low-level features such as edges as well as for higher-level features, such as detecting an eye that indicates a face.
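Parameter sharing can be sketched in NumPy as follows. The helper `conv2d_valid` and the toy \(6\times6\) image are hypothetical, written only for this illustration; as is usual in deep learning, the "convolution" here does not flip the filter.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one square kernel over a 2-D image (no padding, stride 1)."""
    f = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same f*f weights are reused at every position (i, j).
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# Toy 6x6 image: bright left half, dark right half, so there is one
# vertical edge running down the middle.
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])

vertical_edge = np.array([[1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0]])

features = conv2d_valid(image, vertical_edge)
print(features)  # every row is [0, 30, 30, 0]
print(vertical_edge.size, "shared weights produce", features.size, "outputs")
```

The 9 weights of the filter are the only parameters, no matter how many positions the filter is applied to.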
- Sparsity of connections
\(Convnets \) are able to have relatively few parameters due to sparse connections.
In each layer, each output value depends only on a small number of inputs.
Let’s now explain what sparsity of connections means. Look at the zero in the upper-left corner of the output (marked in blue): it was computed by a \(3\times3 \) convolution, so it depends only on the corresponding \(3\times3\) grid of input cells. This output unit is therefore connected to only \(9 \) of the \(36 \) (\(6 \times 6\)) input features; the remaining pixel values have no effect on it. As another example, the output marked in red also depends on only \(9 \) input features, as if only those \(9 \) inputs were connected to it.
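One way to see this sparsity numerically is to nudge each input pixel in turn and record which nudges change a given output value. The sketch below assumes a random \(6\times6\) input and an all-ones \(3\times3\) filter (both chosen only for illustration), and `conv2d_valid` is a hypothetical helper.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one square kernel over a 2-D image (no padding, stride 1)."""
    f = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = np.ones((3, 3))          # all weights non-zero
base = conv2d_valid(image, kernel)

# influences[r, c] is True if input pixel (r, c) affects output (0, 0).
influences = np.zeros((6, 6), dtype=bool)
for r in range(6):
    for c in range(6):
        bumped = image.copy()
        bumped[r, c] += 1.0
        changed = not np.isclose(conv2d_valid(bumped, kernel)[0, 0], base[0, 0])
        influences[r, c] = changed

print(influences.astype(int))
print(int(influences.sum()), "of", influences.size, "inputs affect output (0, 0)")
```

Only the 9 pixels in the upper-left \(3\times3\) block are flagged; the other 27 inputs have no connection to that output.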
Through these two mechanisms a neural network has far fewer parameters, which allows it to be trained on smaller training sets and makes it less prone to overfitting. Another reason why convolutional neural networks are so good in computer vision is:
- Translation invariance
Convolutional neural networks are also very good at capturing translation invariance: a picture of a car shifted a couple of pixels to the right is still a car. The convolutional structure helps here, because an image shifted by a few pixels should produce very similar features and should probably be assigned the same output label. Because we apply the same filter at all positions of the image, in both the early and the later layers, the network automatically learns to be more robust and to better capture this desirable property of translation invariance.
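A small experiment can make this concrete. Strictly speaking, a convolution on its own is translation-equivariant (shifting the input shifts the feature map by the same amount), and it is a pooling step on top that turns this into invariance. The sketch below assumes a toy \(10\times10\) image and uses a global max over the feature map as a stand-in for pooling; both choices, and the helper `conv2d_valid`, are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one square kernel over a 2-D image (no padding, stride 1)."""
    f = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# A bright 3x3 patch on a dark 10x10 background.
image = np.zeros((10, 10))
image[3:6, 3:6] = 1.0

# The same image with the patch moved 2 pixels to the right.
shift = 2
shifted = np.roll(image, shift, axis=1)

kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

out = conv2d_valid(image, kernel)
out_shifted = conv2d_valid(shifted, kernel)

# Equivariance: the feature map moves by the same 2 pixels.
print(np.allclose(out_shifted[:, shift:], out[:, :-shift]))  # True

# Invariance after pooling: the strongest response is unchanged.
print(out.max() == out_shifted.max())  # True
```

The patch is kept away from the image border so that the shift does not push any response out of the feature map.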
Finally, let’s put it all together and see how we can train one of these networks.
Putting it all together: Training a Convolutional Neural Network
We use gradient descent to optimize the parameters so as to reduce the cost function \( J \):
$$ J=\frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\left ( {\hat{y}}^{\left ( i \right )},y^{\left ( i \right )} \right ) $$
Let’s say that we want to build a car detector and we have a training set in which \(x \) is an image and \( y \) is either a binary label or one of \( k \) classes. Let’s say we’ve chosen a convolutional neural network structure that suits the image. That network consists of some \(convolutional \) and \(Pooling\enspace layers \), followed by \(Fully\enspace connected\) layers and a \(softmax \) output that gives us \(\hat{y} \). The \(Convolutional \) layers and the \(Fully\ connected\) layers have various parameters \(w \), as well as biases \(b \), and any setting of these parameters lets us define a cost function. We compute the cost function \(J \) as the sum of the losses of the network’s predictions over the entire training set, divided by \( m \) to get an average value. To train this neural network we can use gradient descent or some other optimization algorithm, such as gradient descent with momentum, to optimize all the parameters of the network and reduce the cost function \(J \). If we do this, we can build a very effective car detector or some other detector.
In the next posts we will look at the architectures of some convolutional neural networks.
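As a rough sketch of the two pieces described above, the code below computes \(J \) as an average of per-example losses and applies one plain gradient descent update. The post does not name a specific loss, so cross-entropy on softmax outputs is an assumption here; the function names and the dictionary layout for parameters are also hypothetical, and the gradients are taken as given rather than computed by backpropagation.

```python
import numpy as np

def cost(y_hat, y):
    """Average loss J over m examples.

    y_hat: (m, k) predicted class probabilities (e.g. softmax outputs)
    y:     (m, k) one-hot labels
    Assumption: the per-example loss is the cross-entropy.
    """
    m = y.shape[0]
    losses = -np.sum(y * np.log(y_hat), axis=1)  # one loss per example
    return losses.sum() / m

def gradient_descent_step(params, grads, learning_rate=0.01):
    """One update w := w - learning_rate * dJ/dw for every parameter."""
    return {name: params[name] - learning_rate * grads[name] for name in params}

# Two training examples, two classes.
y_hat = np.array([[0.5, 0.5],
                  [0.25, 0.75]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
J = cost(y_hat, y)
print(J)

# Illustrative parameters and gradients (not from a real network).
params = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
grads = {"w": np.array([0.5, -0.5]), "b": np.array([1.0])}
params = gradient_descent_step(params, grads, learning_rate=0.1)
print(params)
```

In a real convnet the gradients would come from backpropagation through the convolutional, pooling and fully connected layers, and the update would be repeated over many iterations.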