
#023 CNN Object Detection

Object Detection

We have learned about object localization as well as landmark detection, so now let’s build an object detection algorithm. In this post we’ll learn how to use a $$convnet$$ to perform object detection with the sliding windows detection algorithm.

Car detection – an example

Let’s say we want to build a car detection algorithm.

Some examples of a training set

We can first create a labeled training set $$(x,y)$$ with closely cropped examples of cars and some other pictures that aren’t pictures of cars. To make a training example, we take a picture and crop it so that anything that is not part of the car is cut out, and we end up with the car centered and filling pretty much the entire image. Given this labeled training set we can then train a $$convnet$$ that takes an image as input, like one of the closely cropped images above, and outputs $$y$$: $$1$$ if the image is a car and $$0$$ if it is not.

Once we have trained this $$convnet$$ we can use it in sliding windows detection. Given a test image like the following one, we start by picking a certain window size, shown below, and then feed this small rectangular region into the $$convnet$$.

Detecting a car with a sliding window

Take just the little red square drawn in the picture above, feed it into the $$convnet$$, and have the $$convnet$$ make a prediction. Presumably, for that region it will say that the red square does not contain a car. In the sliding windows detection algorithm we then process a second image, now bounded by the red square shifted a little bit over: we feed just the region inside the red square to the $$convnet$$ and run the $$convnet$$ again. Then we do that with a third image and so on, and we keep going until we have slid the window across every position in the image. We basically go through every region of this size at some stride, pass lots of little cropped images into the $$convnet$$, and have it classify each position as $$0$$ or $$1$$. This is one pass of a sliding window through the image. We then repeat it using a larger window, and then an even larger window (see the image below).
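A single pass at one window size can be sketched as a generator of window positions. This is only a minimal illustration, not the post's implementation: the function name `sliding_windows` and the example sizes are assumptions chosen for the sketch, and the image is represented only by its height and width.

```python
def sliding_windows(height, width, window, stride):
    """Yield the (top, left) corner of every window x window square
    that fits fully inside a height x width image, stepping by stride."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left


# Hypothetical example: a 6 x 8 image, a 4 x 4 window, and a stride of 2.
positions = list(sliding_windows(height=6, width=8, window=4, stride=2))
for top, left in positions:
    print(f"window at rows {top}..{top + 3}, columns {left}..{left + 3}")
```

Each yielded position corresponds to one cropped region that would be fed to the $$convnet$$.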

Different sizes of sliding windows

So, we take a slightly larger region, feed it to the $$convnet$$, and have it output $$0$$ or $$1$$. Then we slide the window over again using some stride and so on, until we have covered the entire image. Then we might do this a third time using even larger windows, and so on.

The hope is that if there’s a car somewhere in the image, there will be a window for which the $$convnet$$ outputs $$1$$, which means that there is a car in that input region. This algorithm is called sliding windows detection because we take these windows, these square red boxes, slide them across the entire image, and classify every square region, at some stride, as containing a car or not.
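Put together, the whole procedure is a loop over window sizes and positions that crops each region and classifies it. The sketch below assumes a toy setup: the image is a small grid of 0/1 pixels, and `toy_classifier` is a hypothetical stand-in for the trained $$convnet$$ (it outputs $$1$$ when more than half of the pixels in the crop are on). The names `crop` and `detect` are also illustrative.

```python
def crop(image, top, left, size):
    """Cut a size x size square out of an image stored as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def toy_classifier(patch):
    """Hypothetical stand-in for the trained convnet: returns 1 if more
    than half of the pixels in the patch are 1, otherwise 0."""
    total = sum(sum(row) for row in patch)
    return 1 if total > 0.5 * len(patch) * len(patch[0]) else 0


def detect(image, classify, window_sizes, stride):
    """Slide a window of each size over the image and collect every
    (top, left, size) for which the classifier outputs 1."""
    height, width = len(image), len(image[0])
    hits = []
    for size in window_sizes:
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                if classify(crop(image, top, left, size)) == 1:
                    hits.append((top, left, size))
    return hits


# Toy 8 x 8 image with a 4 x 4 "object" of ones in the middle.
image = [[1 if 2 <= r < 6 and 2 <= c < 6 else 0 for c in range(8)]
         for r in range(8)]
hits = detect(image, toy_classifier, window_sizes=[2, 4], stride=2)
print(hits)
```

Note that every crop is classified independently, so the classifier runs once per window position and per window size.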

Disadvantages of sliding window detection and how to overcome them

There’s a huge disadvantage of sliding windows detection, which is the computational cost, because we’re cropping out so many different square regions in the image and running each of them independently through a $$convnet$$. If we use a very coarse stride (a very big step size), that reduces the number of windows we need to pass through the $$convnet$$, but the coarser granularity may hurt performance. If instead we use a very fine granularity, that is, a very small stride, then the huge number of little regions we pass through the $$convnet$$ means a very high computational cost.

Before the rise of neural networks, people used much simpler classifiers, such as a simple linear classifier over hand-engineered features, to perform object detection. In that era, because each classifier was relatively cheap to compute (it was just a linear function), sliding windows detection ran okay; it was not a bad method. With $$convnets$$, however, running a single classification is much more expensive, and sliding windows this way is infeasibly slow. And unless we use a very fine granularity or a very small stride, we also end up unable to localize objects accurately within the image.
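To get a feel for the stride trade-off, we can count how many windows a single pass produces. This is a rough sketch under assumed numbers: the 600 x 800 image size, the 64-pixel window and the helper name `count_windows` are all illustrative, and the count assumes the window is no larger than the image.

```python
def count_windows(height, width, window, stride):
    """Number of window x window squares that fit in a height x width
    image when stepping by stride (assumes window <= height, width)."""
    rows = (height - window) // stride + 1
    cols = (width - window) // stride + 1
    return rows * cols


# Hypothetical 600 x 800 image and a 64 x 64 window at several strides.
for stride in (1, 4, 16, 64):
    n = count_windows(600, 800, 64, stride)
    print(f"stride {stride:>2}: {n} classifier runs")
```

Each counted window is one independent forward pass of the classifier, and the total grows further once several window sizes are used.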

Fortunately, this problem of computational cost has a pretty good solution. In particular, the sliding windows object detector can be implemented convolutionally, which is much more efficient. Let’s see in the next post how we can do that.