#024 CNN Convolutional Operation of Sliding Windows
Convolutional operation of sliding windows
In the previous post, we learned about the sliding windows object detection algorithm using a \(convnet \), but we saw that it was too slow. In this post, we will see how to implement that algorithm convolutionally. Let’s see what that means.
To build up the convolutional implementation of sliding windows, let’s first see how we can turn the \(Fully \enspace connected \) layers in our neural network into \(Convolutional \) layers. Let’s say that our object detection algorithm inputs \(14\times14\times3 \) images. This is quite small, but we will use it just for illustrative purposes. The network then applies \(16 \) filters of size \(5\times5 \) to map the \(14\times14\times3 \) input to a \(10\times10\times16 \) volume, and a \(2\times2 \) \(Max\enspace pooling\) layer reduces that volume to \(5\times5\times 16 \). Then we have a \(Fully\enspace connected\) layer with \(400 \) units, another \(Fully\enspace connected\) layer (also with \(400 \) units), and finally the network outputs \(Y \) using a \(softmax \) unit.
An example of a CNN
In order to make this change, we need to adjust the picture a little bit: instead, we are going to view \(Y \) as \(4 \) numbers corresponding to the class probabilities of the four classes that the \(softmax \) unit is classifying. The four classes could be, for example, pedestrian, car, motorcycle and background.
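The original post does not include code, but to make the architecture concrete, here is a minimal Keras sketch of the network described above (the use of Keras and the ReLU activations are assumptions for illustration, not details from the original):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# The original classifier with Fully connected layers:
# 14x14x3 -> conv 5x5 (16 filters) -> 10x10x16 -> max pool 2x2 -> 5x5x16
# -> FC 400 -> FC 400 -> softmax over 4 classes
fc_model = models.Sequential([
    layers.Input(shape=(14, 14, 3)),
    layers.Conv2D(16, (5, 5), activation='relu'),   # activation is an assumption
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(400, activation='relu'),
    layers.Dense(400, activation='relu'),
    layers.Dense(4, activation='softmax'),  # pedestrian, car, motorcycle, background
])
fc_model.summary()
```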
How to turn \(Fully\enspace Connected\) layers into \(Convolutional \) layers?
The \(convnet \) is the same as before for the first few layers. Now, one way of implementing the first \(Fully \enspace connected \) layer is to use \(400 \) filters of size \(5\times5 \) (see the picture below). So, we take the \(5\times5\times16 \) volume and convolve it with a \(5\times5 \) filter. Remember that a \(5\times5 \) filter is implemented as a \(5\times5\times16 \) filter, because our convention is that the filter looks across all \(16 \) channels. If we have \(400 \) of these \(5\times5\times 16 \) filters, then the output dimension is going to be \(1\times1\times400 \). Rather than viewing these \(400 \) values as just a set of nodes (units), we are going to view them as a \(1\times1\times400 \) volume. Mathematically, this is the same as a \(Fully \enspace connected \) layer, because each of these \(400 \) nodes has a filter of dimension \(5\times5\times16 \), so each of those \(400 \) values is some arbitrary linear function of the \(5\times5\times 16 \) activations from the previous layer.
Turning \(Fully \enspace connected \) layers into \(Convolutional \) layers
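A quick way to convince ourselves of this equivalence is a small NumPy check: one \(5\times5\times16 \) filter applied to a \(5\times5\times16 \) volume produces exactly the same number as one fully connected unit whose weights have been reshaped into that filter. This is a minimal sketch with random values standing in for real activations and weights:

```python
import numpy as np

# A 5x5x16 activation volume from the previous layer
a = np.random.rand(5, 5, 16)

# One "fully connected" unit: one weight per input activation, plus a bias
w = np.random.rand(5, 5, 16)
b = 0.1

# Fully connected view: flatten everything and take a dot product
fc_output = np.dot(a.ravel(), w.ravel()) + b

# Convolutional view: one 5x5x16 filter convolved over a 5x5x16 volume;
# the filter covers the whole volume, so there is only one output position
conv_output = np.sum(a * w) + b

print(np.allclose(fc_output, conv_output))  # True
```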
Next, to implement the following layer, we apply a \(1\times1 \) convolution. If we have \(400 \) \(1\times1 \) filters, then the next layer will again be \(1\times1\times400 \), which gives us the equivalent of the next \(Fully \enspace connected \) layer. Finally, we apply \(4 \) more \(1\times1 \) filters followed by a \(softmax \) activation, so as to give a \(1\times1\times4 \) volume that takes the place of the four numbers the network was outputting. This shows how we can take these \(Fully \enspace connected \) layers and implement them using \(Convolutional \) layers: the sets of units are now implemented as \(1\times1\times400 \) and \(1\times1\times4 \) volumes.
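Putting it together, a fully convolutional version of the same network could look like the following sketch (again assuming Keras; the \(None \) spatial dimensions are an illustrative choice that lets the same model accept larger images later):

```python
from tensorflow.keras import layers, models

# Fully convolutional version of the network above: the Dense layers are
# replaced by convolutions, so the model also accepts inputs larger than 14x14
conv_model = models.Sequential([
    layers.Input(shape=(None, None, 3)),
    layers.Conv2D(16, (5, 5), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(400, (5, 5), activation='relu'),   # replaces the first FC layer  -> 1x1x400
    layers.Conv2D(400, (1, 1), activation='relu'),   # replaces the second FC layer -> 1x1x400
    layers.Conv2D(4, (1, 1), activation='softmax'),  # replaces the softmax output  -> 1x1x4
])
```

On a \(14\times14\times3 \) input, this model outputs a \(1\times1\times4 \) volume, exactly the four class probabilities the original network produced.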
A convolutional implementation of sliding windows object detection
Let’s say that our sliding windows \(convnet \) inputs \(14\times14\times3 \) images. As before, we have the neural network described above, which eventually outputs a \(1\times1\times4 \) volume, the output of our \(softmax \) unit. We can see the implementation of this neural network in the following picture.
Turning \(Fully\enspace connected \) layers into \(Convolutional\enspace layers \)
Let’s say that our \(convnet \) inputs \(14\times14\times3 \) images and our test set image is \(16\times 16\times 3 \). We will now add the yellow stripe to the border of this image, as we can see in the picture below.
Turning \(Fully\enspace connected \) layers into \(Convolutional\enspace layers \)
In the original sliding windows algorithm, we would input the blue region into the \(convnet \) and run it once to generate a classification (to output \(0 \) or \(1 \)). Then we would slide the window to the right by a stride of, let’s say, \(2 \) pixels and input the green rectangle into the \(convnet \), rerunning the whole network to get another \(0 \) or \(1 \) label. Then we would input the orange region into the \(convnet \) and run it one more time to get another label, and finally do the same a fourth time with the lower-right purple square. So, to run sliding windows on this \(16\times 16\times3 \) image, which is a pretty small image, we run the \(convnet \) from above \(4 \) times in order to get \(4 \) labels.

It turns out that a lot of the computation done by these \(4 \) \(convnets \) is highly duplicated, so what the convolutional implementation of sliding windows does is allow these \(4 \) forward passes of the \(convnet \) to share a lot of computation. Specifically, here is what we can do. We can take the \(convnet \) and run it with the same parameters, the same \(16 \) \(5\times 5 \) filters, and we now get a \(12\times12\times16 \) output volume. Then we do the max pooling, same as before, and get a \(6\times6\times16 \) volume. We run that through our same \(400 \) \(5\times5 \) filters, so instead of a \(1\times1\times400 \) volume, we now have a \(2\times2\times 400 \) volume. We run it through our \(1\times1 \) filters and it gives us another \(2\times2\times 400 \) volume instead of \(1\times1\times400 \). We do that one more time, and now we have a \(2\times2\times4 \) output volume instead of \(1\times1\times4 \). It turns out that the blue \(1\times1\times4 \) subset of this output gives us the result of running the \(convnet \) on the upper-left \(14\times 14 \) region of the image, the upper-right \(1\times1\times4 \) volume gives us the upper-right result, the lower-left gives us the result of running the \(convnet \) on the lower-left \(14\times 14 \) region, and the lower-right \(1\times1\times4 \) volume gives us the same result as running the \(convnet \) on the lower-right \(14\times 14 \) region.
If we step through all the steps of the calculation, let’s look at the green example. If we had cropped out just this region and passed it through the \(convnet \) on top, then the first layer’s activations would have been exactly this region, the activations of the next max pooling layer would have been exactly this region, and so on for the following layers. What this convolutional implementation does is, instead of forcing us to run forward propagation on \(4 \) subsets of the input image independently, it combines all \(4 \) passes into \(1 \) forward computation and shares a lot of the computation in the regions of the image that are common to all four of the \(14\times 14 \) patches we saw here.
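Continuing the sketch from above, feeding a \(16\times16\times3 \) image into the same fully convolutional model produces the \(2\times2\times4 \) output volume described here, one \(1\times1\times4 \) prediction per \(14\times14 \) window (the model \(conv\_model \) and the random input are assumptions for illustration):

```python
import numpy as np

# One random 16x16x3 test image (batch dimension first)
x16 = np.random.rand(1, 16, 16, 3).astype('float32')

# conv_model is the fully convolutional network sketched earlier
predictions = conv_model.predict(x16)
print(predictions.shape)  # (1, 2, 2, 4): a 2x2 grid of 4 class probabilities

# predictions[0, 0, 0] corresponds to the upper-left 14x14 window,
# predictions[0, 0, 1] to the upper-right, predictions[0, 1, 0] to the
# lower-left, and predictions[0, 1, 1] to the lower-right window
```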
Let’s go through a bigger example. Let’s say we now want to run sliding windows on a \(28\times28\times3 \) image. It turns out that if we run forward propagation the same way, we end up with an \(8\times8\times4 \) output. This corresponds to running sliding windows with a \(14\times 14 \) window: the first position gives us the output in the upper-left corner, then using a stride of \(2 \) we shift the window over one position at a time, and so on. There are \(8 \) positions across, so that gives us the first row, and as we go down the image as well, that gives us all of the \(8\times8\times4 \) outputs. Because of the max pooling of \(2 \), this corresponds to running our neural network with a stride of \(2 \) on the original image.
A bigger example of turning \(Fully\enspace connected \) layers into \(Convolutional\enspace layers \)
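The size of this prediction grid follows directly from the window size and the effective stride of \(2 \) introduced by the max pooling; a small helper makes the arithmetic explicit (the function name is purely illustrative):

```python
def prediction_grid_size(image_size, window_size=14, stride=2):
    """Number of window positions per side when sliding a window_size
    window over an image_size image with the given stride."""
    return (image_size - window_size) // stride + 1

print(prediction_grid_size(16))  # 2 -> the 2x2x4 output volume
print(prediction_grid_size(28))  # 8 -> the 8x8x4 output volume
```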
To recap, to implement sliding windows, previously we would crop out a region, let’s say a \(14\times 14 \) one, run it through our \(convnet \), then do the same for the next region over, then the next \(14\times 14 \) region, then the next one, and so on, until hopefully one of them recognizes the car. Now, instead of doing this sequentially, with the convolutional implementation we saw above we can input the entire image, maybe \(28\times28 \), and convolutionally make all the predictions at the same time with one forward pass through this big \(convnet \), hoping it recognizes the position of the car.
Sliding Windows example
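Each cell of that output grid maps back to a specific \(14\times14 \) window in the original image. A hypothetical helper like the one below shows that mapping, and also hints at why the resulting boxes can only sit on a coarse grid, which is the weakness discussed next:

```python
def window_for_cell(row, col, stride=2, window_size=14):
    """Top-left corner (row, col) and size of the input window that output
    cell (row, col) of the convolutional sliding windows corresponds to."""
    return (row * stride, col * stride, window_size, window_size)

# For the 28x28 example: cell (0, 0) covers rows 0-13 and columns 0-13,
# cell (0, 1) covers rows 0-13 and columns 2-15, and so on.
print(window_for_cell(0, 0))  # (0, 0, 14, 14)
print(window_for_cell(7, 7))  # (14, 14, 14, 14)
```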
That’s how we implement sliding windows convolutionally, and it makes the whole thing much more efficient. However, this algorithm still has one weakness: the positions of the bounding boxes are not going to be very accurate.
This algorithm for object detection is computationally efficient, but it is not the most accurate one. In the next post, we will talk about Bounding Box Predictions and see how we can detect objects more accurately.