#010 CNN An Example of a Neural Network
Convolutional Neural Network – An Example
In previous posts (CNN 004, CNN 005 and CNN 009) we defined all the building blocks needed for a full convolutional neural network. Let’s now look at an example of a convolutional neural network (CNN). Say we have a 32 \times 32 \times 3 dimensional image as the input to the CNN, so it’s an RGB image, and suppose we want to do handwritten digit recognition (e.g. the MNIST dataset).
An example of an input to a CNN for handwritten digit recognition
So, we can have an image of a digit, for example the number 7 (as in the picture above), and we have to recognize which of the ten digits from 0 to 9 it is. Let’s build a neural network to do this!
We’re going to use a network similar to the classic LeNet-5 neural network, created by Yann LeCun many years ago. What we’re going to show here isn’t exactly LeNet-5, but it is inspired by it (especially in the choice of the number of parameters).
We’re using 32\times 32\times 3 input images. The first layer uses a 5\times 5 filter, a stride s=1 and no\enspace padding . If we use 6 filters, the output of this layer is 28\times 28 \times 6 . We are going to call this layer Conv\enspace 1 . The calculations in the Conv\enspace 1 layer are:
- convolve the image with 6 filters, so we get 6 channels in the resulting volume; the dimensions of those filters are 5\times5 , so the height and width of the resulting volume are \frac{32-5+0}{1} + 1 = 28
- add bias
- apply non-linearity (for example a ReLU non-linearity)
LeNet-5 neural network – the first Conv layer
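To make the Conv\enspace 1 arithmetic concrete, here is a minimal Python sketch of the output-size formula \lfloor \frac{n + 2p - f}{s} \rfloor + 1 applied to this layer (the function name conv_output_size is just an illustrative choice):

```python
def conv_output_size(n, f, p=0, s=1):
    """Height/width of the output volume for input size n, filter size f, padding p and stride s."""
    return (n + 2 * p - f) // s + 1

# Conv 1: 32x32x3 input, six 5x5 filters, stride 1, no padding
n_out = conv_output_size(32, f=5, p=0, s=1)
print(n_out, n_out, 6)   # 28 28 6 -> a 28x28x6 output volume
```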
Next, let’s apply a pooling layer. We’re going to apply Max\enspace pooling here with f = 2 and s=2 . When we don’t write the padding, it means p=0 . This reduces the height and width of the representation by a factor of 2 , so 28\times 28 becomes 14\times 14 , and the number of channels remains the same. The output is 14\times 14 \times 6 , and we’re going to call it the Pool\enspace 1 output.
LeNet-5 – the first layer (the Conv and the Max\enspace pooling layer)
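As a quick sanity check, a minimal PyTorch sketch of Conv\enspace 1 followed by a ReLU and Max\enspace pooling (assuming PyTorch is available; the variable names are illustrative) reproduces the 14\times 14\times 6 output shape:

```python
import torch
import torch.nn as nn

# one 32x32 RGB image in NCHW layout (batch, channels, height, width)
x = torch.randn(1, 3, 32, 32)

conv1 = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1, padding=0)
pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

a1 = pool1(torch.relu(conv1(x)))   # Conv 1 -> ReLU -> Pool 1
print(a1.shape)                    # torch.Size([1, 6, 14, 14])
```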
There are two slightly inconsistent conventions about what we call a layer:
- in one convention, a layer is only the part that contains the convolutional operation, so the Conv layer and the Pooling layer that follows it are counted together as one layer
- in the other convention, the Conv layer is one layer and the Pooling layer is a separate layer.
When we count the number of layers in a neural network, we usually count only the layers that have weights or parameters to learn. That’s why we will use the convention that Conv\enspace 1 and Pool\enspace 1 together represent the first layer. Notice that pooling layers have no parameters, just a few hyperparameters. In the names Conv\enspace 1 and Pool\enspace 1 , the number 1 refers to the fact that both of them are part of layer\enspace 1 of the neural network. Pool\enspace 1 is grouped into layer 1 because it doesn’t have its own weights.
Now, let’s apply another layer to this. We’re going to use a 5\times 5 filter, so f=5 , a stride s=1 and no\enspace padding . Let’s use 16 filters, so we obtain a 10\times10\times16 dimensional output. This gives us the Conv\enspace 2 layer. Next, let’s apply Max\enspace pooling to this output with f = 2 and s=2 . This halves the height and width, producing a 5\times5\times 16 volume; the number of channels stays the same. We’re going to call this Pool\enspace 2 , and in our convention this is layer\enspace 2 because it has one set of weights, in the Conv\enspace 2 layer. Now, 5\times 5 \times 16 equals 400 .
Let’s now flatten the Pool\enspace 2 output into a 400 \times 1 dimensional vector.
The first two layers of LeNet-5 neural network
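Continuing the sketch from the previous snippet, the second layer ( Conv\enspace 2 and Pool\enspace 2 ) and the flattening step can be written as follows (again an illustrative PyTorch sketch, not part of the original network definition):

```python
conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

a2 = pool2(torch.relu(conv2(a1)))  # Conv 2 -> ReLU -> Pool 2
a2_flat = a2.flatten(start_dim=1)  # flatten 5x5x16 into the 400-dimensional vector
print(a2.shape, a2_flat.shape)     # torch.Size([1, 16, 5, 5]) torch.Size([1, 400])
```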
We’re going to take these 400 units and build the next layer with 120 units. This is actually our first Fully \enspace connected layer, and we’re going to call it FC \enspace3 . In this layer the 400 units are densely connected to the 120 units, so this Fully \enspace connected layer behaves like a single layer of a standard neural network: it has a weight matrix W of dimension 120 \times 400 . It is called a Fully \enspace connected layer because each of the 400 units is connected to each of the 120 units in FC\enspace 3 . We also have a bias parameter, b^{[3]} , which is 120 dimensional. As the last step, let’s take the 120 units and add another, slightly smaller layer with 84 units; we’re going to call it FC\enspace 4 . Finally, we have 84 real numbers that we can feed to a softmax unit. If we’re trying to do handwritten digit recognition, i.e. to recognize whether the digit is 0, 1, 2 and so on up to 9 , this will be a softmax with 10 outputs. So for the 10 handwritten digits problem we use a softmax with 10 output neurons.
LeNet-5 neural network with Fully \enspace connected layers
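Putting all the pieces together, here is a minimal PyTorch sketch of the whole LeNet-5-like network described above (the class name LeNetLike and the layer grouping are illustrative choices, not the original LeNet-5 implementation):

```python
import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    """A LeNet-5-inspired CNN for 32x32x3 inputs and 10 output classes."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=5, stride=1),    # Conv 1: 32x32x3 -> 28x28x6
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # Pool 1: -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5, stride=1),   # Conv 2: -> 10x10x16
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                          # Pool 2: -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # -> 400-dimensional vector
            nn.Linear(400, 120),                         # FC 3
            nn.ReLU(),
            nn.Linear(120, 84),                          # FC 4
            nn.ReLU(),
            nn.Linear(84, num_classes),                  # scores fed to the softmax
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetLike()
logits = model(torch.randn(1, 3, 32, 32))
probs = torch.softmax(logits, dim=1)       # 10 class probabilities
print(logits.shape, probs.sum().item())    # torch.Size([1, 10]) 1.0
```

In practice the final softmax is usually folded into the loss function (for example, nn.CrossEntropyLoss in PyTorch expects the raw scores), but it is shown explicitly here to match the description above.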
This is a reasonably typical example of what a convolutional neural network might look like. It seems like there are a lot of hyperparameters, but later we’ll see some more specific suggestions on how to choose them.
One common guideline is not to try to invent our own settings of hyperparameters, but to look in the literature to see what hyperparameters work for others. Then just choose an architecture that has worked well for someone else and test it; there’s a good chance it will work for your application as well.
Usually, the height n_{h} and the width n_{w} will decrease as we go deeper, like in our example: from 32 \times 32 to 28\times 28 to 14\times 14 to 10 \times10 to 5\times 5 . So, as we go deeper, the height and width usually decrease whereas the number of channels n_{c} increases (from 3 to 6 to 16 ). Then we have Fully \enspace connected layers at the end. Another pretty common pattern in neural networks is one or more Conv layers followed by a Pooling\enspace layer, then one or more Conv layers followed by another Pooling layer, and finally a few Fully \enspace connected layers followed by a softmax function (or a sigmoid for a binary classification problem). This is a pretty common pattern in convolutional neural networks.
Look at the following table to see the dimensions of the particular volumes and the number of parameters to be learned.
| | Activation\enspace shape | Activation\enspace size | Number\enspace of\enspace parameters |
| --- | --- | --- | --- |
| Input | (32\times32\times3) | 3072 | 0 |
| Conv\enspace 1 (f=5, \enspace s=1) | (28\times28\times6) | 4704 | 456 |
| Pool\enspace 1 | (14\times14\times6) | 1176 | 0 |
| Conv\enspace 2 (f=5, \enspace s=1) | (10\times10\times16) | 1600 | 2416 |
| Pool\enspace 2 | (5\times5\times16) | 400 | 0 |
| FC\enspace 3 | (120\times1) | 120 | 48120 |
| FC\enspace 4 | (84\times1) | 84 | 10164 |
| Softmax | (10\times1) | 10 | 850 |
A table of the layer sizes in the LeNet-5-like neural network
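The numbers in the table can be reproduced with a short Python sketch using the standard counting rules: a Conv layer has (f \cdot f \cdot n_{c}^{in} + 1) \cdot n_{c}^{out} parameters (weights plus one bias per filter) and a Fully\enspace connected layer has n_{in} \cdot n_{out} + n_{out} parameters (the helper names below are illustrative):

```python
def conv_params(f, c_in, c_out):
    """Each filter has f*f*c_in weights plus one bias."""
    return (f * f * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """A weight matrix of shape n_out x n_in plus n_out biases."""
    return n_in * n_out + n_out

print(conv_params(5, 3, 6))    # Conv 1: 456
print(conv_params(5, 6, 16))   # Conv 2: 2416
print(fc_params(400, 120))     # FC 3:   48120
print(fc_params(120, 84))      # FC 4:   10164
print(fc_params(84, 10))       # softmax layer: 850
```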
Just to point out a few things:
- Pooling\enspace layers (such as Max\enspace pooling layers) don’t have any parameters
- Conv layers tend to have relatively few parameters
- A lot of parameters tend to be in the Fully\enspace Connected layers of the neural network
- The activation size tends to go down gradually as we go deeper into a neural network (if it drops too quickly, that is usually not great for performance either).
We’ve seen the basic building blocks of convolutional neural networks: the Conv layer, the Pooling layer and the Fully\enspace Connected layer. A lot of computer vision research has gone into figuring out how to put these basic building blocks together, and building effective neural networks actually requires quite a bit of insight. One of the best ways to gain intuition about how to put these pieces together is to see a number of concrete examples of how others have done it.
In the next post we will learn why convolutional neural networks work so well in the area of computer vision.