## #013 A CNN LeNet-5

## \(LeNet-5 \)

The goal of \(LeNet-5 \) was to recognize handwritten digits, so it takes a \(32\times32\times1 \) image as input. The image is grayscale, so the number of channels is \(1 \). Here is a picture of its architecture.

*\(LeNet-5 \) architecture*

In the first step we use six \(5 \times 5\) filters with a stride \(s=1 \) and \(no\enspace padding\). Therefore we end up with a \(28 \times 28 \times 6 \) volume. Notice that, because we are using \(s=1 \) and \(no\enspace padding\), the image dimensions reduce from \(32 \times32\) to \(28 \times 28\).
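The shrinkage from \(32\) to \(28\) follows the usual output-size rule \(\lfloor (n + 2p - f)/s \rfloor + 1\). As a minimal sketch, the helper below (the name `conv_output_size` is made up for illustration) checks that arithmetic for the first layer:

```python
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size for an n x n input, f x f filter, stride s, padding p."""
    return (n + 2 * p - f) // s + 1


# First LeNet-5 convolution: 32 x 32 input, 5 x 5 filter, stride 1, no padding.
first_conv = conv_output_size(32, f=5, s=1, p=0)
print(first_conv)  # 28
```

With padding \(p=2\) the same call would return \(32\), which is what a same convolution would give for this filter size.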

Next, \(LeNet\) applies \(pooling \). When this paper was written \(Average\enspace pooling \) was much more common, so that is what we use here, although nowadays we would probably use \(Max\enspace pooling\) instead. We implement \(Average\enspace pooling \) with filter size \(f=2 \) and stride \(s=2\). We get a \(14 \times 14 \times 6\) volume: the height and width are each reduced by a factor of \(2 \) because of the stride of \(2 \), while the number of channels stays the same.
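To make the pooling step concrete, here is a small pure-Python sketch of average pooling on a single channel. The function name `average_pool_2d` and the toy \(4 \times 4\) input are illustrative assumptions, not part of the original network code:

```python
def average_pool_2d(image, f=2, s=2):
    """Average-pool a 2D list of numbers with an f x f window and stride s."""
    height, width = len(image), len(image[0])
    out_h = (height - f) // s + 1
    out_w = (width - f) // s + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the f x f window whose top-left corner is (i*s, j*s).
            window = [image[i * s + di][j * s + dj]
                      for di in range(f) for dj in range(f)]
            row.append(sum(window) / (f * f))
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map (made-up values) pooled down to 2 x 2.
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 11, 10, 12],
    [13, 15, 14, 16],
]
pooled = average_pool_2d(feature_map)
print(pooled)  # [[4.0, 5.0], [12.0, 13.0]]
```

Applied to each of the six \(28 \times 28\) channels separately, the same operation would give six \(14 \times 14\) channels.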

Next we apply another \(convolutional\) layer. We will use \(16 \) filters of dimension \(5\times5\), so we end up with \(16\) channels in the next volume. The dimensions of the volume are \(10\times10\times16 \). Once again the height and width are reduced, because no padding is used: when this paper was written \(same \enspace convolutions \) were not much in use.

Next we apply another \(pooling \) layer with filter size \(f=2 \) and stride \(s=2\), so once again we reduce the height and width by a factor of \(2 \) (as we did with the first \(pooling \) layer). We now have a \(5\times5\times16\) volume, and if we multiply these numbers we get \(400\). Having reduced the dimensions this far, we can flatten the volume and apply a \(Fully\enspace connected\) layer with \(120\) nodes. Then we apply another \(Fully\enspace connected\) layer with \(84 \) nodes. The final step is to use these \(84 \) features to get the final output. The output can take on \(10 \) possible values because we have to recognize \(10 \) different digits (\(0\) to \(9 \)), so at the end we have a \(softmax \) layer with a \(10 \)-way classification output (although back then \(LeNet-5 \) actually used a different classifier at the output layer, one that is no longer used today).
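The whole sequence of shapes described so far can be traced with a few lines of arithmetic. This is only a sketch of the shape bookkeeping, not a trainable network; the names `output_size` and `lenet5_shapes` are hypothetical:

```python
def output_size(n, f, s=1, p=0):
    """Spatial size after a convolution or pooling window of size f."""
    return (n + 2 * p - f) // s + 1


def lenet5_shapes(n=32, channels=1):
    """Return (layer name, output shape) pairs for the layers described above."""
    shapes = [("input", (n, n, channels))]
    n = output_size(n, f=5, s=1)          # 6 filters, 5 x 5, no padding
    shapes.append(("conv1", (n, n, 6)))
    n = output_size(n, f=2, s=2)          # average pooling
    shapes.append(("pool1", (n, n, 6)))
    n = output_size(n, f=5, s=1)          # 16 filters, 5 x 5, no padding
    shapes.append(("conv2", (n, n, 16)))
    n = output_size(n, f=2, s=2)          # average pooling
    shapes.append(("pool2", (n, n, 16)))
    shapes.append(("flatten", (n * n * 16,)))
    shapes.append(("fc1", (120,)))
    shapes.append(("fc2", (84,)))
    shapes.append(("output", (10,)))
    return shapes


shapes = lenet5_shapes()
for name, shape in shapes:
    print(f"{name:8s} {shape}")
```

The printed shapes go from `(32, 32, 1)` through `(28, 28, 6)`, `(14, 14, 6)`, `(10, 10, 16)` and `(5, 5, 16)` to the flattened `(400,)` vector and the three dense layers.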

*The \(LeNet-5 \) architecture decreases the height and width of the volume while increasing the number of channels*

\(LeNet-5 \) Summary

- It consists of one or more \(conv \) layers, each followed by a \(pooling \) layer, then some \(Fully\enspace connected\) layers, and it ends with an output layer, which is a \(softmax \) layer

- As we go deeper into the network, the number of channels increases. It goes from \(1\) to \(6\) to \(16\)
- It has a small number of parameters, about \(60{,}000\), whereas today we use neural networks that have from \(10\) million to \(100\) million parameters
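The parameter figure can be sanity-checked with a rough count. The sketch below assumes that every filter in the second \(conv \) layer sees all \(6\) input channels and that every layer has one bias per output unit; under those assumptions the total comes out slightly above \(60{,}000\). The original network connected the second \(conv \) layer to only subsets of the input channels, so its exact count differs a little. The helper names are made up for illustration:

```python
def conv_params(f, in_channels, out_channels):
    """Weights plus one bias per filter for an f x f convolution."""
    return (f * f * in_channels + 1) * out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_in + 1) * n_out


# Assumption: full connectivity in conv2 and biases everywhere.
layer_params = {
    "conv1": conv_params(5, 1, 6),    # 156
    "conv2": conv_params(5, 6, 16),   # 2416
    "fc1": dense_params(400, 120),    # 48120
    "fc2": dense_params(120, 84),     # 10164
    "output": dense_params(84, 10),   # 850
}
total_params = sum(layer_params.values())
print(total_params)  # 61706
```

Most of the parameters sit in the first \(Fully\enspace connected\) layer, while the pooling layers contribute none in this simplified count.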

In the next post we will talk about \(AlexNet \).