#018 CNN Inception Network – Inception Module
In the previous post we have already seen all the basic building blocks of the Inception network. Here, we will see how to put these building blocks together and build the full network.
An example of an Inception module
To explain how the Inception network works, we will walk through the module in four steps, one per parallel branch:
- The first step (look at the red rectangle) is just a \(1\times 1 \) convolution, say with \(64 \) filters, so we get a \(28\times 28\times 64 \) volume.
- In the second step we first do a \(1\times 1 \) convolution with \(96 \) filters, so we get a \(28\times 28\times 96 \) volume, on which we then apply a \(3\times 3 \) convolution with \(128 \) filters and get a \(28\times 28\times 128 \) volume.
- In the third step we first do a \(1\times 1 \) convolution with \(16 \) filters, so we get a \(28\times 28\times 16 \) volume, on which we then apply a \(5\times 5 \) convolution with \(32 \) filters and get a \(28\times 28\times 32 \) volume.
- The fourth step in this Inception module consists of first using max pooling with a \(3\times 3 \) filter, a stride of \(1 \), and same padding; after the \(Max\enspace pooling \) we apply a \(1\times 1 \) convolution (that is, \(1\times 1\times 192 \)) with \(32 \) filters, so the output of this step is a \(28\times 28\times 32 \) dimensional volume.
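The four steps above can be sketched as simple shape bookkeeping. This is not a real network, just a minimal sketch of how each branch transforms the \(28\times 28\times 192 \) input volume used in this example; the helper `conv_same` is a hypothetical name introduced here.

```python
# Shape bookkeeping for one Inception module, assuming a 28x28x192 input.
# With "same" padding and stride 1, height and width are preserved, and the
# number of output channels equals the number of filters.

def conv_same(shape, n_filters):
    """Output shape of a same-padded, stride-1 convolution."""
    h, w, _ = shape
    return (h, w, n_filters)

x = (28, 28, 192)  # activation from the previous layer

branch1 = conv_same(x, 64)                  # 1x1 conv, 64 filters
branch2 = conv_same(conv_same(x, 96), 128)  # 1x1 (96 filters), then 3x3 (128 filters)
branch3 = conv_same(conv_same(x, 16), 32)   # 1x1 (16 filters), then 5x5 (32 filters)

pooled = x  # 3x3 max pool, stride 1, same padding: shape unchanged
branch4 = conv_same(pooled, 32)  # then 1x1 conv with 32 filters

print(branch1, branch2, branch3, branch4)
```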
The Inception module takes as input the activations, i.e. the output, from some previous layer. Let's say that we have a \(28\times28\times 192 \) volume as the previous activation. The example we worked through in depth was the \(1\times 1 \) convolution followed by the \(5\times 5 \) layer: the \(1\times 1 \) has \(16 \) channels, and then the \(5\times 5 \) outputs a \(28\times28\times32 \) volume. To save computation in our \(3\times 3 \) convolution we can do the same there, and then the \(3\times 3 \) outputs \(28\times 28\times 128 \). Then, maybe we want a plain \(1\times 1 \) convolution as well; there is no need to follow a \(1\times 1 \) conv with another \(1\times 1 \) conv, so that branch is just one step, and let's say it outputs \(28\times 28\times 64 \). Finally, there is the pooling branch.
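To see why the \(1\times 1 \) bottleneck saves computation, we can count multiplications for the \(5\times 5 \) branch directly. This is a sketch using the standard cost formula for a convolutional layer; the numbers are plain arithmetic for the \(28\times 28\times 192 \) input in this example.

```python
# Rough multiplication counts for the 5x5 branch, with and without the 1x1
# bottleneck. A same-padded conv layer costs roughly
# H_out * W_out * n_filters * (f * f * n_in) multiplications.

def conv_mults(h, w, n_filters, f, n_in):
    return h * w * n_filters * f * f * n_in

direct = conv_mults(28, 28, 32, 5, 192)         # 5x5 conv applied directly
bottleneck = (conv_mults(28, 28, 16, 1, 192)    # 1x1 conv down to 16 channels
              + conv_mults(28, 28, 32, 5, 16))  # then the 5x5 conv

print(direct)      # 120,422,400 multiplications
print(bottleneck)  # 12,443,648 multiplications, roughly a 10x saving
```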
We will use same padding for pooling, so that the output height and width are still \(28\times 28 \); in this way we can concatenate it with the other outputs. However, notice that max pooling, even with same padding, a \(3\times 3 \) filter, and a stride of \(1 \), outputs \(28\times28\times 192 \): it has the same number of channels, the same depth, as the input. This seems like a lot of channels, so what we're going to do is add one more \(1\times 1 \) conv layer and then do what we saw in the post on \(1\times 1 \) convolutions to shrink the number of channels down to \(28\times28\times 32 \). The way we do that is by using \(32 \) filters of dimension \(1\times 1\times 192 \), which is why the output's number of channels shrinks down to \(32 \). Then, we don't end up with the pooling branch taking up most of the channels in the final output. Finally, we take all of these blocks and do channel concatenation: concatenating the \(64 \) plus \(128 \) plus \(32 \) plus \(32 \) channels gives us a \(28\times 28\times 256 \) dimensional output.
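The channel concatenation at the end can be illustrated with NumPy. This is a minimal sketch in which zero-filled arrays stand in for the real branch activations; only the shapes matter here.

```python
import numpy as np

# Dummy activations for the four branches of the module.
b1 = np.zeros((28, 28, 64))   # 1x1 conv branch
b2 = np.zeros((28, 28, 128))  # 3x3 conv branch
b3 = np.zeros((28, 28, 32))   # 5x5 conv branch
b4 = np.zeros((28, 28, 32))   # max-pool + 1x1 conv branch

# Concatenate along the channel axis: 64 + 128 + 32 + 32 = 256 channels.
out = np.concatenate([b1, b2, b3, b4], axis=-1)
print(out.shape)  # (28, 28, 256)
```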
This was one Inception module. The overall Inception network consists of a large number of such modules stacked together. We can observe a lot of repeated blocks below. Although this network seems complex, it is actually built from the same, though slightly modified, blocks (marked in red).
Last but not least, there is one final detail of the Inception network that has to be clarified. In the original research paper we can read that there are additional side branches, depicted with green lines. What do they do? The last few layers of the network are a fully connected layer followed by a softmax layer that makes a prediction. What these side branches do is take a hidden layer and try to use it to make a prediction: each side branch takes a hidden layer, passes it through a few fully connected layers, and then a softmax layer tries to predict the output label. We should think of this as just another detail of the Inception network, but it helps to ensure that the features computed in the hidden units, even at the intermediate layers, can still predict the class of an image. This appears to have a regularizing effect on the Inception network and prevents the network from overfitting.
Finally, here is one fun fact: where does the name Inception network come from? The Inception paper actually cites the “we need to go deeper” meme, and this URL is an actual reference in the Inception paper which links to that image. It comes from the movie “Inception”.
Once we fully understand the Inception module, it is easy for us to understand the whole Inception network. After the development of the original Inception module, the authors and others have extended it and come up with other versions as well. Their research papers on newer versions of the Inception algorithm refer to networks like Inception v2, Inception v3, and Inception v4.
After explaining a large number of deep neural network architectures, it is time to see how we can apply them to solve real-world computer vision problems. Let’s see this in the following post!