## #025 CNN Bounding Box Predictions

## Bounding box predictions

In the last post, we learned how to use a convolutional implementation of sliding windows. That’s more computationally efficient, but it still doesn’t output the most accurate bounding boxes.

In this post, we will see how we can obtain more accurate predictions of bounding boxes.

## Output accurate bounding boxes

With sliding windows, we move a window across the image and obtain a set of crops (the purple box). The next thing we do is apply a classifier to each crop to see whether there is a car in that particular sliding window or not.

This is not the most accurate way of getting bounding boxes. Let’s see what we can do.

A good way to get more accurate output bounding boxes is the \(YOLO \) algorithm. \(YOLO \) stands for \(You\enspace Only\enspace Look \enspace Once\).

## \(YOLO \) algorithm

Let’s say we have a \(100\times 100 \) input image. We’re going to place a grid on this image; for the purpose of illustration, we are going to use a \(3\times 3 \) grid. In actual implementations in practice, we would use a finer one, for example a \(19\times19 \) grid.

We can say that the basic idea of the \(YOLO \) algorithm is to apply both the image classification and localization algorithm to each of the nine grid cells.

### How do we define labels \(y \)?

In the following picture, we can see the output vectors \(y \) for the three grid cells that are in the purple, green and orange rectangles.

Our first output \(p_{c} \) is either \(0 \) or \(1 \) depending on whether or not there is an object in that grid cell. Then, we have \(b_{x}, \enspace b_{y}, \enspace b_{h},\enspace b_{w} \) to specify the bounding box of an object (in case there is an object associated with that grid cell). Finally, we use \(c_{1},\enspace c_{2},\enspace c_{3} \) to denote which class was recognized. So, \(c_{1},\enspace c_{2},\enspace c_{3} \) are labels for the pedestrian, car and motorcycle classes.

In this image, we have nine grid cells, so for each grid cell we can define a vector like the one we saw in the picture above. Let’s start with the upper left grid cell. For this grid cell, we see that there is no object present. So, the label vector \(y \) for the upper left grid cell will have \(p_c = 0 \), and then we don’t care what the remaining values in this vector are. The output label \(y \) would be the same for the first three grid cells, because none of these three grid cells contains an interesting object.

Subsequently, this analyzed image has two objects which are located in the remaining six grid cells. And what the \(YOLO \) algorithm does, **it takes the midpoint of each of the two objects and then assigns the object to the grid cell that contains the midpoint.** So, the left car is assigned to the green grid cell, whereas the car on the right is assigned to the orange grid cell.
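This midpoint-to-cell assignment can be sketched in a few lines. A minimal example, assuming midpoints are given in normalized \([0, 1]\) image coordinates; the function name is hypothetical:

```python
def assign_to_cell(mid_x, mid_y, grid_size=3):
    """Map an object midpoint (in [0, 1] image coordinates)
    to the (row, col) of the grid cell that contains it."""
    col = min(int(mid_x * grid_size), grid_size - 1)
    row = min(int(mid_y * grid_size), grid_size - 1)
    return row, col

# A car whose midpoint sits at (0.25, 0.7) lands in the
# bottom-left region of a 3x3 grid: row 2, column 0.
print(assign_to_cell(0.25, 0.7))  # -> (2, 0)
```

Note that even if the object spills over into neighboring cells, only the cell holding the midpoint is responsible for it.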

Even though four grid cells (bottom right) contain some parts of the right car, the object will be assigned to just one grid cell. So, for the central grid cell, the vector \(y \) also looks like a vector with no object: the first component \(p_{c} \) is equal to \(0 \), and the remaining values in this vector can be anything. We don’t care about them. Hence, for these two grid cells we have the following vector \(y \):

$$ y = \begin{bmatrix} 0 \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \end{bmatrix} $$

On the other hand, for the cell circled in green on the left, the target label \(y \) will be defined in the following way. First, there is an object, so \(p_c = 1 \), and then we write \(b_{x}, b_{y}, b_{h}, b_{w} \) to specify the position of that bounding box. Class one marks a pedestrian, so \(c_1 = 0 \); class two marks a car, so \(c_2 = 1 \); and class three marks a motorcycle, so \(c_3 = 0 \). Similarly, for the grid cell on the right, there is an object in it, and its vector will have the same structure as the previous one.

Finally, for each of these nine grid cells, we end up with an eight-dimensional output vector. Because we have a \(3\times 3 \) grid of nine cells, the total volume of the output is going to be \(3\times 3\times 8 \).

The target output volume is \(3\times 3\times 8 \), where, for example, the \(1\times 1\times 8 \) volume in the upper left corresponds to the target output vector for the upper left of the nine grid cells. For each of the \(3\times 3 \) positions, we have an eight-dimensional target vector \(y \) that we want to output, some of which correspond to cells without an object of importance. Therefore, the total target output is a \(3\times 3\times 8 \) volume.
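Putting the labeling rules together, we can sketch how the full \(3\times 3\times 8 \) target volume is built from a list of annotated objects. A minimal sketch, assuming each object is given as a hypothetical tuple of normalized midpoint, box size, and class id:

```python
import numpy as np

def build_target(objects, grid_size=3, num_classes=3):
    """Build the grid_size x grid_size x (5 + num_classes) target
    volume. Each object is (mid_x, mid_y, bh, bw, class_id) with
    midpoints in [0, 1] image coordinates."""
    y = np.zeros((grid_size, grid_size, 5 + num_classes))
    for mid_x, mid_y, bh, bw, cls in objects:
        col = min(int(mid_x * grid_size), grid_size - 1)
        row = min(int(mid_y * grid_size), grid_size - 1)
        # Midpoint coordinates relative to the cell's top-left corner.
        bx = mid_x * grid_size - col
        by = mid_y * grid_size - row
        y[row, col] = [1.0, bx, by, bh, bw, *np.eye(num_classes)[cls]]
    return y

# Two cars (class id 1): one on the left, one on the right.
targets = build_target([(0.25, 0.7, 0.5, 0.9, 1),
                        (0.8, 0.7, 0.5, 0.9, 1)])
print(targets.shape)  # -> (3, 3, 8)
```

Cells that receive no object keep \(p_c = 0 \); here the "don't care" values are simply left as zeros.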

**Let’s now see in more detail how we define the output vector \(y \).**

First, to train our neural network, the input is \(100\times 100\times 3 \) dimensional. Then, we have a usual convolutional neural network with convolutional layers, max pooling layers, and so on. This neural network maps from an input image to a \(3\times 3\times 8 \) output volume.

We have an input \(x \), which is an input image like the one in the picture above, and we have the target labels \(y \), which are \(3\times 3\times 8 \). Further, we use backpropagation to train the neural network to map any input \(x \) to this type of target output \(y \).

The advantage of this algorithm is that the neural network outputs precise bounding boxes. At test time, we feed in an input image \(x \) and run forward propagation until we get the output \(y \). Then, for each of the nine outputs, we can read off \(1 \) or \(0 \) to see whether there is an object in each of those nine positions.
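This test-time readout can be sketched as a simple scan over the output volume. A minimal example, assuming the \(3\times 3\times 8 \) layout described above; the function name and the confidence threshold are hypothetical:

```python
import numpy as np

def detections(y_pred, threshold=0.5,
               class_names=("pedestrian", "car", "motorcycle")):
    """Scan every grid cell of the predicted volume and report
    cells whose object score p_c exceeds the threshold."""
    found = []
    rows, cols, _ = y_pred.shape
    for r in range(rows):
        for c in range(cols):
            pc, bx, by, bh, bw = y_pred[r, c, :5]
            if pc > threshold:
                cls = class_names[int(np.argmax(y_pred[r, c, 5:]))]
                found.append((r, c, cls, (bx, by, bh, bw)))
    return found

# A made-up 3x3x8 prediction with one confident car in cell (1, 0).
y = np.zeros((3, 3, 8))
y[1, 0] = [0.9, 0.4, 0.3, 0.5, 0.9, 0.0, 1.0, 0.0]
print(detections(y))  # one detection: cell (1, 0), class "car"
```

Every cell with \(p_c \) below the threshold is simply skipped; only the confident cells contribute detections.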

As long as we don’t have more than one object in each grid cell, this algorithm should work properly. The problem of having multiple objects within the grid cell is something we’ll talk about later.

Here we have used a relatively coarse \(3\times 3 \) grid; in practice, we might use a much finer grid, maybe \(19\times 19 \). In that case, we end up with a \(19\times 19\times 8 \) output. This reduces the probability that multiple objects are assigned to the same grid cell.
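The effect of the grid resolution on the output size is a quick calculation:

```python
# Output volume per grid resolution: each cell carries p_c,
# four box coordinates, and three class labels (8 values).
for grid in (3, 19):
    cells = grid * grid
    print(f"{grid}x{grid} grid -> {cells} cells, "
          f"{grid}x{grid}x8 = {cells * 8} output values")
```

With \(19\times 19 = 361 \) cells, each cell covers a much smaller patch of the image, so two object midpoints rarely fall in the same cell.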

Let’s notice two things:

- This algorithm resembles the image classification and localization algorithm that we explained in our previous posts, in that it outputs the bounding box coordinates explicitly. This allows our network to output bounding boxes of any aspect ratio and to provide more precise coordinates than the sliding windows classifier.
- This is a convolutional implementation: we are not running the algorithm nine times on the \(3\times 3 \) grid, or \(361 \) times if we are using the \(19\times 19 \) grid. Instead, it is one single convolutional evaluation, and that’s why this algorithm is very efficient.

The \(YOLO \) algorithm gained a lot of popularity because its convolutional implementation can detect objects even in real-time scenarios.

Last but not least, before wrapping up, there’s one more detail: **how do we encode these bounding boxes** \(b_{x}, b_{y}, b_{h}, b_{w} \) ?

Let’s take the example of the car in the picture.

In this grid cell there is an object, so the target label \(y \) will have \(p_{c} \) equal to one. Then we have some values for \(b_{x}, b_{y}, b_{h}, b_{w} \), and the last three values in this output vector are \(0, \enspace 1, \enspace 0 \), because we have recognized a car in this cell, so class two, or \(c_2 \), is equal to \(1 \).

So, how do we specify the bounding box? In the \(YOLO \) algorithm, we take the convention that the upper left point of the grid cell is \((0,0) \) and the lower right point is \((1,1) \). To specify the position of the midpoint, that orange dot in the picture above, \(b_{x} \) might be \(0.4 \) (looking along the x-axis), because it is about 0.4 of the way to the right, and \(b_{y} \) might be \(0.3 \) (in the direction of the y-axis). Next, the height and width of the bounding box are specified as fractions of the size of the grid cell.

The width of this red box in the picture above is maybe 90% of the width of the grid cell, so \(b_{w} \) is \(0.9 \), and the height of the bounding box is maybe one half of the overall height of the grid cell, so \(b_{h} \) would be \(0.5 \). In other words, \(b_{x} \) and \(b_{y} \) are specified relative to the grid cell and have to be between \(0 \) and \(1 \), because pretty much by definition the orange midpoint lies within the bounds of the grid cell to which the object is assigned; if it were not between \(0 \) and \(1 \), the midpoint would be outside the square and the object would be assigned to another grid cell. On the other hand, \(b_{h} \) and \(b_{w} \) could be greater than \(1 \), in case we have a car that spans more than one grid cell.
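This encoding convention can be sketched directly. A minimal example, assuming box corners are given in normalized \([0, 1]\) image coordinates and the responsible cell is already known; the function name and the sample numbers are hypothetical:

```python
def encode_box(x1, y1, x2, y2, row, col, grid_size=3):
    """Encode a box (corner coordinates in [0, 1] image units)
    relative to the grid cell (row, col) holding its midpoint."""
    mid_x = (x1 + x2) / 2
    mid_y = (y1 + y2) / 2
    # Midpoint offset inside the cell: always between 0 and 1.
    bx = mid_x * grid_size - col
    by = mid_y * grid_size - row
    # Height and width as fractions of the cell size: these can
    # exceed 1 when the object spans more than one grid cell.
    bh = (y2 - y1) * grid_size
    bw = (x2 - x1) * grid_size
    return bx, by, bh, bw

# A car in the middle-left cell (row 1, col 0) of a 3x3 grid.
bx, by, bh, bw = encode_box(0.0, 0.35, 0.3, 0.55, row=1, col=0)
print(round(bx, 2), round(by, 2), round(bh, 2), round(bw, 2))
# -> 0.45 0.35 0.6 0.9
```

Note how \(b_{w} = 0.9 \) means the box is 90% as wide as one grid cell, while the midpoint offsets stay inside \([0, 1] \).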

Although there are multiple ways of specifying the bounding boxes, this convention can be quite a reasonable one.

In the \(YOLO \) research papers, there are other parameterizations that work even a little bit better, but we hope this gives one reasonable convention that should work properly.

### More resources on the topic:

- Convolutional Operation of Sliding Windows
- YOLO Object Detection
- Anchor Boxes — The key to quality object detection
- You Only Look Once: Unified, Real-Time Object Detection
- A Gentle Introduction to Object Recognition With Deep Learning