#010 CNN An Example of a Neural Network
Convolutional Neural Network – An Example
In previous posts (CNN 004, CNN 005 and CNN 009) we have defined all building blocks for building a full convolutional neural network. Let’s now look at an example of a convolutional neural network (CNN). Let’s say that we have a \(32 \times 32 \times 3 \) dimensional image as an input to the CNN. So it’s an RGB image, and suppose we want to try to do handwritten digit recognition (e.g. the MNIST dataset).
An example of an input to a CNN when doing handwritten digit recognition
So, we can have a number, for example a number \(7 \) (as in the picture above), and we have to recognize which one of the ten digits from \(0\) to \(9 \) this is. Let’s build a neural network to do this!
We’re going to use a network similar to the classic neural network called \(LeNet-5 \), created by Yann LeCun many years ago. What we’re going to show here isn’t exactly \(LeNet-5 \), but it is inspired by it (especially when we choose the number of parameters).
We’re using \(32\times 32\times 3 \) input images. The first layer uses a \(5\times 5 \) filter, \(stride = 1\) and there is \(no\enspace padding \). The output of this layer, if we use \(6 \) filters, would be \(28\times 28 \times 6\). We are going to call this layer \(Conv\enspace 1 \). Calculations in the \(Conv\enspace 1 \) layer are:
- convolve the image with \(6 \) filters, so we get \(6 \) channels in the resulting volume; the dimension of those filters is \(5\times5 \), so the height and width of the resulting volume are \(\frac{32 + 2\cdot 0 - 5}{1} + 1 = 28\)
- add bias
- apply non-linearity (for example a \(ReLU\) non-linearity)
\(LeNet-5 \) neural network – the first \(Conv \) layer
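To make this arithmetic easy to re-check, here is a minimal Python helper (not part of the network itself, just an illustration) that evaluates the output-size formula \(\lfloor \frac{n + 2p - f}{s} \rfloor + 1 \):

```python
# A minimal sketch of the output-size formula: floor((n + 2p - f) / s) + 1.
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size of a convolution (or pooling) applied to an n x n input."""
    return (n + 2 * p - f) // s + 1

# Conv 1: 32x32 input, 5x5 filter, stride 1, no padding -> 28x28 output
print(conv_output_size(32, f=5, s=1, p=0))  # 28
```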
Next let’s apply a pooling layer. We’re going to apply \(Max\enspace pooling \) here and let’s use an \(f = 2 \) and \(s=2 \). When we don’t write a padding, that means \(p=0 \). This should reduce the height and width of the representation by a factor of \(2 \). So, \(28\times 28 \) now becomes \(14\times 14 \), and the number of channels remains the same. The output is \(14\times 14 \times 6 \) . We’re going to call this \(Pool\enspace1\) output.
\(LeNet-5 \) – the first layer (the \(Conv\) and the \(Max\enspace pooling \) layer)
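The same hypothetical helper from above also applies to pooling, since pooling only changes the spatial dimensions and leaves the channels untouched:

```python
# Pool 1: f = 2, s = 2 halves the spatial size; the 6 channels are unchanged.
print(conv_output_size(28, f=2, s=2))  # 14
```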
There are two slightly inconsistent conventions about what we call a layer:
- in one convention, a layer is the layer with the convolutional operation, with the pooling that follows grouped into it
- in the other convention, the \(Conv \) layer is one layer, and the \(Pooling \) layer is a separate layer.
When we count the number of layers in a neural network, we usually count just the layers where we have weights or parameters to learn. That’s why we will use the convention that \(Conv \enspace 1\) and \(Pool\enspace1\) together represent the first layer. Notice that the pooling layers have no parameters, just a few hyperparameters. In the names \(Conv\enspace 1 \) and \(Pool \enspace1 \), the number \(1 \) refers to the fact that both of these are part of \( layer \enspace 1 \) of the neural network. \(Pool\enspace1\) is grouped into layer \(1\) because it doesn’t have its own weights.
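As a quick sanity check of this convention, here is a small PyTorch sketch (PyTorch is just our choice for illustration here) showing that a pooling layer has no learnable parameters while a conv layer does:

```python
import torch.nn as nn

# A max-pooling layer has hyperparameters (f, s) but nothing to learn,
# which is why it is grouped with the preceding conv layer when counting layers.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(list(pool.parameters()))  # [] -> no learnable parameters

# The conv layer does carry weights: (5*5*3 + 1) * 6 = 456 parameters.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1)
print(sum(p.numel() for p in conv.parameters()))  # 456
```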
Now, let’s apply another layer to this. We’re going to use a \(5\times 5 \) filter, so \(f=5 \), stride \(s=1 \) and \(no\enspace padding \). Let’s use \(16 \) filters, so we obtain a \( 10\times10\times16 \) dimensional output. This gives us the \(Conv\enspace 2\) layer. Next, let’s apply \(Max\enspace pooling\) to this output with \(f = 2 \) and \(s=2\). This halves the height and width, so we obtain a \(5\times5\times 16 \) volume; the number of channels stays the same as before. We’re going to call this \(Pool \enspace 2\). In our convention this is \(layer \enspace 2 \), because only its \(Conv \enspace 2 \) part has a set of weights. Note that \(5\times 5 \times 16 \) is equal to \( 400 \).
Let’s now flatten our \(Pool\enspace 2 \) into a \( 400 \times 1 \) dimensional vector.
The first two layers of \(LeNet-5 \) neural network
We’re going to take these \(400 \) units and build the next layer with \(120 \) units. This is actually our first \(Fully \enspace connected \) layer, and we’re going to call it \(FC \enspace3 \). In this layer the \(400 \) units are densely connected to the \(120 \) units, just like in a single layer of a standard neural network. Hence, we have a weight matrix \(W^{[3]} \) of dimension \(120 \times 400 \). The layer is called \(Fully \enspace connected \) because each of the \(400 \) units is connected to each of the \(120 \) units of \(FC\enspace 3 \). We also have a bias parameter \(b^{[3]} \), which is \(120 \) dimensional. As the last step, let’s take the \(120 \) units and add another, slightly smaller layer with \(84 \) units; we’re going to call this \(Fully \enspace connected \) layer \(FC\enspace 4 \). Finally, we have \(84 \) real numbers that we can feed to a \(softmax \) unit. Since we’re trying to recognize which of the ten handwritten digits \(0, 1, 2, \dots, 9 \) the image shows, this will be a \(softmax \) with \(10 \) output neurons.
\(LeNet-5 \) neural network with \(Fully \enspace connected \) layers
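Putting the whole architecture together, here is a minimal PyTorch sketch of the network described above. The class name LeNetLike and the attribute names are ours, not part of the original LeNet-5; the sizes follow the text (6 and 16 filters, fully connected layers of 120 and 84 units, 10 outputs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNetLike(nn.Module):
    """LeNet-5-inspired network from the text: Conv1/Pool1, Conv2/Pool2, FC3, FC4, softmax."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5, stride=1)   # 32x32x3 -> 28x28x6
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)        # halves height and width
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1)   # 14x14x6 -> 10x10x16
        self.fc3 = nn.Linear(5 * 5 * 16, 120)                    # FC 3: 400 -> 120
        self.fc4 = nn.Linear(120, 84)                            # FC 4: 120 -> 84
        self.out = nn.Linear(84, num_classes)                    # softmax layer: 84 -> 10

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # Conv 1 + Pool 1
        x = self.pool(F.relu(self.conv2(x)))   # Conv 2 + Pool 2
        x = x.flatten(start_dim=1)             # 5x5x16 -> 400
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        return self.out(x)                     # logits; softmax is applied in the loss

model = LeNetLike()
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```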
This is a reasonably typical example of what a convolutional neural network might look like. It seems like there are a lot of hyperparameters. However, we’ll see some more specific suggestions on how to choose these types of hyperparameters.
One common guideline is to not try to invent our own settings of hyperparameters, but to look in the literature to see what hyperparameters have worked for others. Then just choose an architecture that has worked well for someone else and test it; there’s a good chance it will work for your application as well.
Usually the height \(n_{h} \) and the width \(n_{w} \) decrease as we go deeper, like in our example: from \(32 \times 32 \) to \(28\times 28 \) to \(14\times 14 \) to \(10 \times10\) to \(5\times 5 \). At the same time the number of channels \(n_{c} \) increases (from \(3 \) to \(6 \) to \(16 \)). Then we have \(Fully \enspace connected \) layers at the end. Another pretty common pattern in convolutional neural networks is one or more \(conv \) layers followed by a \(pooling \) layer, then again one or more \(conv \) layers followed by a \(pooling \) layer, and finally a few \(Fully \enspace connected \) layers followed by a \(softmax \) function (or a \(sigmoid \) for a binary classification problem).
Look at the following table to see what the dimensions of the particular volumes are and how many parameters need to be learned.
| Layer | Activation shape | Activation size | Number of parameters |
|---|---|---|---|
| \( Input \) | \( 32\times32\times3 \) | \( 3072 \) | \( 0 \) |
| \( Conv\enspace 1 \enspace (f=5, \enspace s=1) \) | \( 28\times28\times6 \) | \( 4704 \) | \( 456 \) |
| \( Pool\enspace 1 \) | \( 14\times14\times6 \) | \( 1176 \) | \( 0 \) |
| \( Conv\enspace 2 \enspace (f=5, \enspace s=1) \) | \( 10\times10\times16 \) | \( 1600 \) | \( 2416 \) |
| \( Pool\enspace 2 \) | \( 5\times5\times16 \) | \( 400 \) | \( 0 \) |
| \( FC\enspace 3 \) | \( 120\times1 \) | \( 120 \) | \( 48120 \) |
| \( FC\enspace 4 \) | \( 84\times1 \) | \( 84 \) | \( 10164 \) |
| \( Softmax \) | \( 10\times1 \) | \( 10 \) | \( 850 \) |
A table with the layer sizes and parameter counts in the \(LeNet-5 \) neural network
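As a cross-check of the parameter counts in the table, we can sum the parameters of each layer of the LeNetLike sketch defined earlier (the pooling layers are omitted because they have nothing to learn):

```python
# Reproduce the table's parameter counts from the LeNetLike model above.
for name, module in [("Conv 1", model.conv1), ("Conv 2", model.conv2),
                     ("FC 3", model.fc3), ("FC 4", model.fc4), ("Softmax", model.out)]:
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params}")
# Conv 1: 456, Conv 2: 2416, FC 3: 48120, FC 4: 10164, Softmax: 850
```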
Just to point out a few things:
- \(Pooling \) layers (such as the \(Max\enspace pooling \) layers) don’t have any parameters
- \(Conv \) layers tend to have relatively few parameters
- A lot of parameters tend to be in the \(Fully\enspace Connected\) layers of the neural network
- The activation size tends to go down gradually as we go deeper into a neural network (if it drops too quickly, that is usually not great for performance).
We’ve seen the basic building blocks of convolutional neural networks: the \(Conv \) layer, the \(Pooling \) layer and the \(Fully\enspace Connected \) layer. A lot of computer vision research has gone into figuring out how to put these basic building blocks together, and building an effective neural network actually requires quite a bit of insight. One of the best ways to gain intuition about how to put these things together is to see a number of concrete examples of how others have done it.
In the next post we will learn why convolutional neural networks work so well in the area of computer vision.