
#013 A CNN LeNet-5

$$LeNet-5$$

The goal of $$LeNet-5$$ was to recognize handwritten digits, so it takes a $$32\times32\times1$$ image as input. It is a grayscale image, thus the number of channels is $$1$$. Here is a picture of its architecture.

[Figure: $$LeNet-5$$ architecture]

In the first step we use $$6\enspace5 \times 5$$ filters with a stride $$s=1$$ and $$no\enspace padding$$. Therefore we end up with a $$28 \times 28 \times 6$$ volume. Notice that, because we are using $$s=1$$ and $$no\enspace padding$$, the image dimensions reduce from $$32 \times32$$ to $$28 \times 28$$.
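The dimension change above follows the standard convolution output-size formula, $$\lfloor (n + 2p - f)/s \rfloor + 1$$. A minimal sketch in Python (the helper name `conv_output_size` is my own, for illustration):

```python
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# First LeNet-5 layer: 32x32 input, 5x5 filters, stride s=1, no padding
side = conv_output_size(32, f=5, s=1, p=0)
print(side)  # → 28  (the volume is 28 x 28 x 6: one channel per filter)
```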

Next, $$LeNet$$ applies $$pooling$$. When this paper was written, $$Average\enspace pooling$$ was much more in use, so here we will use $$Average\enspace pooling$$. However, nowadays we would probably use $$Max\enspace pooling$$ instead. So, here we will implement $$Average\enspace pool$$ with filter $$f=2$$ and stride $$s=2$$. We get a $$14 \times 14 \times 6$$ volume, so the height and width of the image are reduced by a factor of $$2$$ due to the use of a stride of $$2$$.
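To make the averaging concrete, here is a small NumPy sketch of average pooling on a single-channel map (a simplified illustration, not the original implementation):

```python
import numpy as np

def average_pool(x, f=2, s=2):
    """Average pooling on a single-channel map x of shape (H, W)."""
    h, w = x.shape
    out_h, out_w = (h - f) // s + 1, (w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output cell is the mean of an f x f window
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].mean()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(average_pool(x))
# → [[ 2.5  4.5]
#    [10.5 12.5]]
```

With $$f=2$$ and $$s=2$$ the windows do not overlap, which is why each spatial dimension is exactly halved.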

Next we apply another $$convolutional$$ layer. We will use $$16$$ filters of dimension $$5\times5$$, so we end up with $$16$$ channels in the next volume. The dimensions of the volume are $$10\times10\times16$$. Once again, the height and width are reduced, and that is because when this paper was written $$same \enspace convolutions$$ were not much in use.

Next we will apply another $$pooling$$ layer with filter size $$f=2$$ and stride $$s=2$$, so once again we reduce the size of the image by a factor of $$2$$ (as we did with the first $$pooling$$ layer). Finally we have a $$5\times5\times16$$ volume, and if we multiply these numbers $$5\times5\times16$$ we get $$400$$. We have reduced the dimensions of the image, so now we can apply a $$Fully\enspace connected$$ layer with $$120$$ nodes. Then we apply another $$Fully\enspace connected$$ layer with $$84$$ nodes. The final step is to use these $$84$$ features to get the final output. The output can take on $$10$$ possible values because we have to recognize $$10$$ different digits ($$0$$ to $$9$$), so at the end we have a $$softmax$$ layer with a $$10$$-way classification output (although back then $$LeNet-5$$ actually used a different classifier at the output layer, one that is no longer used today). The architecture of the $$LeNet-5$$ neural network leads to a decrease in the height and width of the volume and an increase in the number of channels.
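The whole sequence of shapes described above can be traced with two one-line helpers (the helper names are my own):

```python
def conv(n, f, s=1, p=0):
    """Spatial size after a convolution with filter f, stride s, padding p."""
    return (n + 2 * p - f) // s + 1

def pool(n, f=2, s=2):
    """Spatial size after pooling with filter f and stride s."""
    return (n - f) // s + 1

n = 32                         # input: 32 x 32 x 1
n = conv(n, f=5)               # conv1, 6 filters   -> 28 x 28 x 6
n = pool(n)                    # average pool       -> 14 x 14 x 6
n = conv(n, f=5)               # conv2, 16 filters  -> 10 x 10 x 16
n = pool(n)                    # average pool       -> 5 x 5 x 16
print(n * n * 16)              # → 400 features fed to the fully connected layers
```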

$$LeNet-5$$ Summary

• It consists of one or more $$conv$$ layers followed by a $$pooling$$ layer, then some $$Fully\enspace connected$$ layers, and ends with an output layer which is a $$softmax$$ layer
• As we go deeper into the layers of the network, the number of channels increases: it goes from $$1$$ to $$6$$ to $$16$$
• It has a small number of parameters – about $$60,000$$ – whereas today we use neural networks that have from $$10$$ million to $$100$$ million parameters
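The $$60,000$$ figure can be checked by counting weights and biases layer by layer. A sketch using the modern fully connected convolution convention (the original paper used partial connectivity between feature maps in the second $$conv$$ layer, so its exact count differs slightly):

```python
# Parameters per layer: (filter_height * filter_width * in_channels + bias) * filters,
# and (inputs * outputs + biases) for fully connected layers.
params = {
    "conv1": (5 * 5 * 1 + 1) * 6,     # 6 filters over 1 channel   -> 156
    "conv2": (5 * 5 * 6 + 1) * 16,    # 16 filters over 6 channels -> 2416
    "fc1":   400 * 120 + 120,         # 400 -> 120                 -> 48120
    "fc2":   120 * 84 + 84,           # 120 -> 84                  -> 10164
    "out":   84 * 10 + 10,            # 84 -> 10 (softmax)         -> 850
}
total = sum(params.values())
print(total)  # → 61706, i.e. roughly 60,000 parameters
```

Note that the vast majority of the parameters sit in the fully connected layers, not in the convolutional ones.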

In the next post we will talk about $$AlexNet$$.

More resources on the topic: 