#016 CNN Network in Network – 1×1 Convolutions
Network in Network – 1×1 Convolutions
In terms of designing \(ConvNet \) architectures, one of the ideas that really helps is the \(1\times 1 \) convolution. You might be wondering: what does a \(1\times 1 \) convolution do? Isn’t that just multiplying by a number? It seems like a funny thing to do. However, it turns out that it’s not quite like that. Let’s take a look!
What does a \(1\times 1 \) convolution do?
An example of a \(1\times 1 \) convolution
We can see a \(1\times 1 \) filter which consists of just one number, the number \(2 \). If we take this \(6\times 6\times 1 \) image and convolve it with a \(1\times 1\times 1 \) filter, we obtain the same image with every pixel multiplied by \(2 \).
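As a quick sanity check, here is a minimal NumPy sketch of this single-channel case; the image values are random placeholders, and the filter value \(2 \) just follows the example above.

```python
import numpy as np

# A 6x6x1 "image" and a 1x1x1 filter that holds the single number 2
image = np.random.rand(6, 6, 1)
filt = np.array([[[2.0]]])            # shape (1, 1, 1)

# Convolving with a 1x1x1 filter just scales every pixel by that number
output = image * filt[0, 0, 0]

assert np.allclose(output, 2.0 * image)
```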
A convolution with a \(1\times 1 \) filter doesn’t seem terribly useful here. We just multiply the image by some number, but that is only the case for \(6\times 6\times 1 \), single-channel images. If we have a \(6\times 6\times 32 \) volume instead of \(6\times 6\times 1 \), then a convolution with a \(1\times 1 \) filter can do something that makes much more sense. Let’s look at the following picture.
An example of a \(1\times 1 \) convolution on a 3D image
In particular, a \(1\times 1 \) convolution will look at each of the \(36 \) different positions (\(6\times 6 \)), and it will take the element-wise product between the \(32 \) numbers on the left and the \(32 \) numbers in the filter and sum them up. Then, a \(ReLU \) non-linearity is applied. Looking at one of the \(36 \) positions, that is, one \(1\times 1\times 32 \) slice through this volume, we take these \(32 \) numbers, multiply them element-wise by the \(32 \) numbers in the filter, sum the products, and we get a single number.
In fact, one way to think about the \(32 \) numbers we have in this \(1\times 1\times 32 \) filter is that it acts like a single neuron: it takes \(32 \) numbers as input, multiplies them by \(32 \) weights, applies a \(ReLU \) non-linearity, and produces the corresponding result as the output.
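Here is a minimal NumPy sketch of this single-neuron view of one position; the input values and the filter weights are random placeholders.

```python
import numpy as np

# One spatial position of the 6x6x32 volume: a 1x1x32 slice, i.e. 32 numbers
input_slice = np.random.rand(32)

# The 1x1x32 filter: 32 weights, just like a single neuron
filter_weights = np.random.rand(32)

# Element-wise product, summed up, then a ReLU -> one output number
z = np.sum(input_slice * filter_weights)
a = np.maximum(z, 0.0)
print(a)
```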
More generally, if we have not just one filter but multiple filters, then it’s as if we have not just one unit but multiple units that take as inputs all the numbers in one slice and build them up into an output volume of size \(6\times 6\times number \enspace of \enspace filters \). One way to think about the \(1\times 1 \) convolution is that it is basically a fully connected neural network applied to each of the \(36 \) different positions. This fully connected network takes a \(32 \)-dimensional input, and its number of outputs equals the number of \(1\times 1 \) filters applied. Doing this at every one of the \(36 \) positions, we end up with an output that is \(6\times 6\times number \enspace of \enspace filters \). This can carry out a pretty non-trivial computation on our input volume. This idea is often called a \(1\times 1 \) convolution, but sometimes it’s also called a \( Network\enspace in\enspace Network \). This idea has been very influential. It has influenced many other neural network architectures, including the \(Inception\enspace network \) which we’ll see in the next posts.
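To make the fully-connected-network-at-every-position picture concrete, here is a rough NumPy sketch; the input values, the weights, and the choice of \(16 \) filters are arbitrary assumptions for illustration.

```python
import numpy as np

h, w, n_c, n_filters = 6, 6, 32, 16      # 16 filters is an arbitrary choice

x = np.random.rand(h, w, n_c)            # the 6x6x32 input volume
W = np.random.rand(n_c, n_filters)       # all 1x1x32 filters stacked into a 32x16 matrix

# Treat each of the 36 positions as a 32-dimensional input to the same
# fully connected layer: reshape, multiply by the weights, apply ReLU
out = np.maximum(x.reshape(-1, n_c) @ W, 0.0).reshape(h, w, n_filters)

print(out.shape)                         # (6, 6, 16) -> 6 x 6 x number of filters
```

This is the same computation a \(1\times 1 \) convolutional layer with \(16 \) filters would perform, position by position.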
Let’s see an example where a \(1\times 1 \) convolution is useful.
Using \(1\times 1 \) convolutions
An example of how we can reduce the number of channels with a \(1\times 1 \) convolution
Say we have a \(28\times 28\times 192 \) volume. If we want to shrink its height and width we can use a \(pooling \) layer, and we know how to do that. But what if the number of channels has gotten too big and we want to shrink it? How do we shrink this into a \(28\times 28\times 32 \) dimensional volume?
What we can do is use \(32 \) filters that are \(1\times 1 \). Technically, each filter has dimension \(1\times 1\times 192 \), because the number of channels in the filter has to match the number of channels in the input volume. Using \(32 \) such filters, the output of this process will be a \(28\times 28\times 32 \) volume. This is a way to shrink \(n_{c} \) (the number of channels). We’ll see later how this idea of \(1\times 1 \) convolutions allows us to shrink the number of channels and therefore save on computation in some networks. But of course, if we want to keep the number of channels at \(192 \), that’s fine too. The effect of a \(1\times 1 \) convolution is that it applies a non-linearity, which allows the network to learn a more complex function. Adding another layer also helps us to learn more complex functions. So, we can have an input that is \(28\times 28\times 192 \) dimensional and an output that is also \(28\times 28\times 192 \) dimensional.
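As a rough sketch, a channel reduction like this could be written with \(tf.keras \) as follows; the framework choice and the random input are assumptions made purely for illustration.

```python
import numpy as np
import tensorflow as tf

# A batch with one 28x28x192 input volume (random placeholder values)
x = np.random.rand(1, 28, 28, 192).astype("float32")

# 32 filters of size 1x1 (each effectively 1x1x192), followed by a ReLU
conv_1x1 = tf.keras.layers.Conv2D(filters=32, kernel_size=1, activation="relu")

y = conv_1x1(x)
print(y.shape)    # (1, 28, 28, 32): the number of channels shrinks from 192 to 32
```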
That’s how a \(1\times 1 \) convolutional layer actually does something pretty non-trivial. It adds a non-linearity to our network and allows us to decrease, keep the same, or increase the number of channels in our volumes. In the next post we’ll see that this is very useful for building the \(Inception\enspace network \). To conclude, a \(1\times 1 \) convolution performs a pretty non-trivial operation and allows us to shrink the number of channels in our volumes, keep it the same, or even increase it if we want.