datahacker.rs@gmail.com

# #027 CNN Non-Max Suppression algorithm

## Non-Max Suppression

In this post, we will learn how the non-max suppression algorithm allows us to overcome multiple detections of the same object in an image. Let’s go through an example!

Let’s say we want to detect pedestrians, cars, and motorcycles in this image.

If we look at the picture above we can see that there are two cars. Each of these two cars has one midpoint so it should be assigned to just one grid cell which then actually predicts that there is a car in the picture. In practice, we’re running an object classification and localization algorithm for every one of these grid cells, so it’s quite possible that different cells might think that the center of a car is in many different cells.

$$Non-max \enspace supperesion$$ cleans up these multiple bounding boxes

Let’s see an example of how $$Non-Max\enspace suppression$$ works. Since we are running the image classification and localization algorithm on every grid cell, it is possible that many of them will be with a large probability $$p_c$$, that there is an object in that cell. When we run the algorithm we would end up with multiple detections of each object. What the $$Non-Max\enspace suppression$$ does is it bends together these detections so that we end up with just one detection per car.

More specifically, this algorithm will first search for the probability associated with each of these detections, so it looks at $$p_c$$ values first, and then it takes the largest one. In the picture above, there is a rectangle associated with $$0.9$$ (the car on the right) and this means that an algorithm has detected a car there. After this, the $$Non-Max\enspace suppression$$ looks at other rectangles that are close to the first one and the ones with the highest overlap with this one (highest $$IoU$$) will be suppressed. As an example, in the picture above we can see those two rectangles with the $$IoU$$ $$0.6$$ and the $$0.7$$ are going to be suppressed.

More specifically, it first searches for the probability associated with each of these detections (it looks at $$p_c$$ values) and first, it takes the largest one. In the picture above, there is a rectangle associated with $$0.9$$ (the car on the right). This means that an algorithm detected a car there. After this, the $$Non-Max\enspace suppression$$ looks at other rectangles that are close and the ones with a high overlap with this one (high $$IoU$$) will be suppressed. As an example, we can see those two rectangles with the $$IoU$$ $$0.6$$ and the $$0.7$$ are going to be suppressed.

Similarly, the same procedure can be applied for the car on the left, we go through the remaining rectangles and find the one with the highest probability, the highest $$p_c$$ value. So for the car on the left, it will be the box with the value $$p_{c} =0.8$$. Next, a task of $$Non-Max\enspace suppression$$ is to discard the remaining detections having high $$IoU$$ with this cell. In this case, we have two bounding boxes with lower $$p_c$$ that recognized the car on the left (two red rectangles in the left in the above picture) and because they both have big $$IoU$$ both of them will be discarded.

Similarly, for the car on the left, we go through the remaining rectangles and find the one with the highest probability, the highest $$p_c$$ value. Here, it is the box with $$p_c=0.8$$. Next, a task of $$Non-Max\enspace suppression$$ is to discard the remaining detections having high $$IoU$$ with this cell. Here, we have two bounding boxes with lower $$p_c$$ that recognized the car on the left (two red rectangles in the left in the above picture) and because they both have big $$IoU$$ both of them will be discarded.

Applying the $$Non-Max\enspace suppression$$ means that we’re going to output our maximal classifications probabilities, and suppress the other ones (the other detections of the same object) with lower values. That’s why we use the name $$Non-Max\enspace suppression$$.

First, within this $$19\times19$$ grid we’re going to get a $$19\times19\times8$$ output volume. For each of these $$361$$ cells we output the vector:

$$\begin{bmatrix} p_{c}\\b_{x}\\b_{y}\\b_{h}\\b_{w}\\c_{1}\\c_{2}\\c_{3} \end{bmatrix}$$

For this example, we will be detecting only cars. So, for every grid cell, we will obtain the following vector:

$$\begin{bmatrix} p_{c}\\b_{x}\\b_{y}\\b_{h}\\b_{w} \end{bmatrix}$$

Let’s see how $$Non-Max suppression$$ works for this example. It will first discard all the boxes with $$p_c$$ values less than some predefined threshold. We can use for instance a value of $$0.6$$.  For every grid cell we output a bounding box together with a probability of detecting a car within bounding box.

Next, while there are any remaining bounding boxes that we have not yet discarded or processed, we are going to repeatedly select the box with the highest $$p_c$$. That will be selected as our output prediction. Next, we will discard any remaining box with high overlap ( $$IoU$$) with the box that we have just outputted in the previous step. We keep doing this while there are still any remaining boxes that we’ve not yet process until we have taken each of the boxes and either output it as a prediction, or discarded it.

In case that we want to detect more objects, for instance: pedestrians, cars, and motorcycles, the output vector will have three $$p_c$$ components. It has been shown in practice that we can run $$Non-Max\enspace suppression$$ three times independently, one for every output class.

This completes the $$Non-Max suppression$$. In practice, it obtains rather decent results. However, to complete the idea of the $$YOLO$$ algorithm we need to introduce one additional idea. It is the idea of Anchor boxes, and we will explain it in the next post.