#028 CNN Anchor Boxes

datahacker.rs Deep Learning 25.11.2018

Anchor Boxes

As our previous posts have shown, object detection is quite challenging. Anchor boxes address the final challenge that we are going to explain. After that, we will put all the pieces together into the complete YOLO algorithm.
One scenario that we may encounter in practice is that several objects of interest fall into the same grid cell, as shown in the figure below. We can use the idea of anchor boxes to solve this problem. Let’s start with an example.

anchor boxes on images

In the figure above, we use a $3\times3$ grid. The midpoints of both objects, the car and the pedestrian, fall in almost the same place within the same grid cell. If we use the previously developed ideas, the output vector $y$ for that cell will have the following structure:

$$ y = \begin{bmatrix} 1\\ b_{x}\\ b_{y}\\ b_{h}\\ b_{w}\\ 1\\ 0\\ 0 \end{bmatrix} $$

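The obvious challenge is that this vector can describe only one object: it holds a single $p_c$, a single bounding box and a single set of class scores for our three classes (pedestrians, cars and motorcycles). That is, we can’t have two detections for a single cell, so we would have to choose only one of the two objects.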

The main idea of anchor boxes is to predefine two different shapes, called anchor boxes or anchor box shapes. In this way, we are able to associate two predictions with the two anchor boxes. In general, we might use more anchor boxes (five or even more), but to keep the description simple we will stick with only two shapes.

As we can see in the picture above, we defined anchor box $1$ and anchor box $2$. Each anchor box is described by the following values: $p_c, \enspace b_x, \enspace b_y, \enspace b_h, \enspace b_w, \enspace c_1, \enspace c_2, \enspace c_3 $. The shape of the pedestrian is more similar to the shape of anchor box $1$, and the shape of the car is more similar to the shape of anchor box $2$. Hence, the vector associated with the grid cell in the middle will be:

$$ y = \begin{bmatrix} 1\\ b_{x}\\ b_{y}\\ b_{h}\\ b_{w}\\ 1\\ 0\\ 0\\ 1\\ b_{x}\\ b_{y}\\ b_{h}\\ b_{w}\\ 0\\ 1\\ 0 \end{bmatrix} $$

Now we can see that the pedestrian is encoded with anchor box $1$ and the car with anchor box $2$.
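To make the layout of this vector concrete, here is a minimal sketch that builds the $16$ values for one grid cell. The helper name `anchor_slot`, the numeric box values and the use of zeros for an unused anchor box are assumptions made only for illustration. The class order ($c_1$ pedestrian, $c_2$ car, $c_3$ motorcycle) follows the vectors shown above.

```python
# Class order taken from the vectors in the text: c_1, c_2, c_3.
CLASSES = ("pedestrian", "car", "motorcycle")

def anchor_slot(box=None, label=None):
    """Return [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3] for one anchor box."""
    if box is None:
        # No object assigned to this anchor box: p_c = 0.
        # The remaining values are filled with zeros as placeholders (assumption).
        return [0.0] * 8
    b_x, b_y, b_h, b_w = box
    one_hot = [1.0 if name == label else 0.0 for name in CLASSES]
    return [1.0, b_x, b_y, b_h, b_w] + one_hot

# Hypothetical box values, chosen only for illustration.
pedestrian = anchor_slot((0.5, 0.6, 0.9, 0.3), "pedestrian")  # anchor box 1
car = anchor_slot((0.5, 0.7, 0.4, 0.9), "car")                # anchor box 2

# The cell vector is the two 8-value slots placed one after the other.
y_cell = pedestrian + car
print(len(y_cell))  # 16
print(y_cell)
```

The first eight entries belong to anchor box $1$ and the last eight to anchor box $2$, which matches the stacked vector above.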

Previously, before we used anchor boxes, we defined a grid for each training image and assigned each object to the grid cell that contains the object’s midpoint. The output was therefore $3 \times 3 \times 8 $ dimensional, because we use a $3\times3 $ grid and each grid cell holds the values: $p_c, \enspace b_x, \enspace b_y, \enspace b_h, \enspace b_w, \enspace c_1, \enspace c_2, \enspace c_3 $.

How do we encode the objects in the target label?

Previously, each object in a training image was assigned to the grid cell that contains that object’s midpoint. Now, with two anchor boxes, each object is assigned to the grid cell that contains its midpoint and, within that cell, to the anchor box that has the highest Intersection over Union ($IoU$) with the object’s shape.
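The anchor box choice can be sketched in a few lines. This sketch assumes that the $IoU$ is computed on shapes only, that is, with the object box and the anchor box placed on the same center, so that only heights and widths matter. The anchor shapes and the names `shape_iou` and `best_anchor` are hypothetical.

```python
def shape_iou(h1, w1, h2, w2):
    """IoU of two boxes that share the same center (shape comparison only)."""
    inter = min(h1, h2) * min(w1, w2)
    union = h1 * w1 + h2 * w2 - inter
    return inter / union

# Hypothetical anchor shapes as (height, width): one tall, one wide.
ANCHORS = [(0.9, 0.3), (0.4, 0.9)]

def best_anchor(h, w, anchors=ANCHORS):
    """Return the index of the anchor box with the highest IoU."""
    ious = [shape_iou(h, w, ah, aw) for ah, aw in anchors]
    return max(range(len(anchors)), key=lambda i: ious[i])

print(best_anchor(0.8, 0.25))  # 0: a tall, narrow box matches anchor box 1
print(best_anchor(0.35, 0.8))  # 1: a wide, flat box matches anchor box 2
```

Indices are zero-based here, so index `0` corresponds to anchor box $1$ and index `1` to anchor box $2$.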

Now the output $y $ is going to be $3\times 3\times 16 $, or equivalently $3\times 3 \times 2 \times 8 $, because we use $2$ anchor boxes and each of them is described by $8 $ values.

Let’s go through a concrete example and specify what $y $ is for this grid cell.
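The two ways of writing the output shape describe the same numbers, only grouped differently. The following sketch uses plain Python lists to show the regrouping. The helper names `split_by_anchor` and `shape` are assumptions introduced for this illustration.

```python
GRID = 3         # 3 x 3 grid, as in the text
NUM_ANCHORS = 2  # two anchor boxes
VALUES = 8       # p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3

# Flat layout: 3 x 3 x 16, filled with zeros (no objects yet).
y_flat = [[[0.0] * (NUM_ANCHORS * VALUES) for _ in range(GRID)]
          for _ in range(GRID)]

def split_by_anchor(cell):
    """Regroup one 16-value cell vector into 2 lists of 8 values."""
    return [cell[a * VALUES:(a + 1) * VALUES] for a in range(NUM_ANCHORS)]

def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# Grouped layout: 3 x 3 x 2 x 8.
y_nested = [[split_by_anchor(cell) for cell in row] for row in y_flat]

print(shape(y_flat))    # (3, 3, 16)
print(shape(y_nested))  # (3, 3, 2, 8)
```

With an array library such as NumPy, the same regrouping is a single reshape of the last dimension.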

Another example of anchor boxes

An example of anchor boxes

Looking at this image, we see that the shape of the pedestrian is more similar to anchor box $1$, so we assign the pedestrian to anchor box $1$. Likewise, the shape of the car is closer to anchor box $2$, so we assign the car to anchor box $2$. For the car, the class outputs $c_1$ and $c_3$ are $0$ and $c_2$ is $1$.

Note that this algorithm will not work properly in two different cases:

When we have 2 anchor boxes, but 3 objects in the same grid cell.
When 2 objects are in the same grid cell and both of them are matched to the same anchor box.

These special cases do not happen frequently in practice, so they do not affect the performance of the algorithm much. They are especially rare if we use a $19\times 19 $ grid, because with $361$ cells it is unlikely that the midpoints of two objects fall into the same cell.
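A small sketch can illustrate why a finer grid helps. It counts how many object midpoints land in each grid cell. The midpoints are hypothetical and given as fractions of the image width and height, and the function name `crowded_cells` is an assumption.

```python
from collections import Counter

def crowded_cells(midpoints, grid):
    """Return {(row, col): count} for cells that contain more than one midpoint."""
    cells = Counter(
        (min(int(cy * grid), grid - 1), min(int(cx * grid), grid - 1))
        for cx, cy in midpoints
    )
    return {cell: n for cell, n in cells.items() if n > 1}

# Hypothetical midpoints (x, y) of three objects that are close to each other.
midpoints = [(0.50, 0.80), (0.56, 0.86), (0.45, 0.75)]

print(crowded_cells(midpoints, 3))   # {(2, 1): 3}
print(crowded_cells(midpoints, 19))  # {}
```

On the $3\times 3$ grid all three midpoints share one cell, which is more than two anchor boxes can handle. On the $19\times 19$ grid each midpoint gets its own cell.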

Finally, how do we choose the anchor boxes?

A simple approach is to select the anchor boxes by hand, for example by choosing $5$ to $10$ anchor box shapes that span the variety of shapes of the objects we wish to detect. A more advanced technique is to apply the $k$-means clustering algorithm to group together the types of object shapes and to derive the anchor boxes from those groups.
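The clustering idea can be sketched as follows. This is a minimal $k$-means on (height, width) pairs, not a definitive implementation: the Euclidean distance, the deterministic initialization and the sample shapes are all assumptions made for illustration. An $IoU$-based distance is another common choice.

```python
def kmeans_shapes(shapes, k, iterations=20):
    """Cluster (height, width) pairs and return k cluster centers."""
    # Deterministic start: pick initial centers spread over the shapes
    # sorted by aspect ratio (height / width).
    ordered = sorted(shapes, key=lambda s: s[0] / s[1])
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    for _ in range(iterations):
        # Assignment step: each shape goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for h, w in shapes:
            nearest = min(
                range(k),
                key=lambda i: (h - centers[i][0]) ** 2 + (w - centers[i][1]) ** 2,
            )
            clusters[nearest].append((h, w))

        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                mean_h = sum(h for h, _ in members) / len(members)
                mean_w = sum(w for _, w in members) / len(members)
                new_centers.append((mean_h, mean_w))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center

        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Hypothetical box shapes: three tall (pedestrian-like), three wide (car-like).
shapes = [(0.9, 0.3), (0.8, 0.25), (0.85, 0.35),
          (0.4, 0.9), (0.35, 0.8), (0.45, 1.0)]

for h, w in sorted(kmeans_shapes(shapes, k=2)):
    print(round(h, 2), round(w, 2))
```

With these sample shapes the two centers come out as one wide box and one tall box, which would then serve as the two anchor box shapes.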

In the next post, we will talk about the YOLO algorithm.
