#009 CNN Pooling Layers
Pooling layers
Apart from convolutional layers, \(ConvNets \) often use pooling layers to reduce the size of the representation. This speeds up the computation and also makes some of the detected features a bit more robust. Let’s go through an example of pooling, and then we’ll talk about why we might want to apply it.
There are two types of pooling:
- \(Max \enspace pooling \)
- \(Average \enspace pooling \)
\(Max \enspace pooling \)
Suppose we have a \(4 \times 4 \) input image and we want to apply a type of pooling called \(Max\enspace pooling \). The output of this particular implementation of \(Max\enspace pooling \) will be \(2\times 2 \). The procedure is quite simple: we take our \(4 \times 4 \) input and break it into different regions. We’ll cover the four regions as shown in the figure below. Then, in the \(2\times 2 \) output, each element will be the max of the corresponding shaded region.
\(Max \enspace pooling \) with a \(2\times 2 \) filter and a stride of \(2 \)
In the upper left, the max of these four numbers is \(9 \). In the upper right, the max of the blue numbers is \(2 \), in the lower left the biggest number is \(6 \), and in the lower right the biggest number is \(3 \). To compute each of the numbers on the right we took the max over a \(2 \times 2 \) region. This is like applying a filter of size \(2 \) (\(f=2 \)), because we’re taking \(2 \times 2 \) regions, with a stride of \(2 \) (\(s=2 \)).
These are actually the hyperparameters of \(Max \enspace pooling \): we start with a \(2 \times 2 \) filter, which gives us the \(9 \), then step it over by \(2 \) to the region that gives us the \(2 \). For the next row we step down by \(2 \) steps to get the \(6 \) and then step to the right by \(2 \) steps to get the \(3 \). The regions are \(2 \times 2 \) because \(f=2 \), and we move by \(2 \) steps because \(s=2 \).
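To make the procedure concrete, here is a minimal NumPy sketch of this \(2\times 2 \), stride-\(2 \) max pooling. The input values are made up for illustration; they are chosen only so that the regional maxima match the \(9 \), \(2 \), \(6 \), and \(3 \) discussed above, and are not the exact numbers from the figure.

```python
import numpy as np

def max_pool_2d(x, f=2, s=2):
    """Max pooling on a 2D array with filter size f and stride s (no padding)."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take the maximum over the f x f window at this position
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

# A made-up 4x4 input whose regional maxima match the ones discussed above
x = np.array([[1, 9, 2, 1],
              [3, 4, 0, 2],
              [6, 5, 1, 3],
              [2, 0, 2, 1]])

print(max_pool_2d(x, f=2, s=2))
# [[9. 2.]
#  [6. 3.]]
```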
The intuition behind \(Max\enspace pooling \)
If we think of this \(4 \times 4 \) region as some set of features (the activations in some layer of the neural network), then a large number means that a particular feature has probably been detected. So, the upper left-hand quadrant has this particular feature, maybe a vertical edge or maybe an eye of an animal. Clearly that feature exists in the upper left-hand quadrant (maybe that’s a cat-eye detector), whereas it doesn’t really exist in the upper right-hand quadrant. So, if a feature is detected anywhere in one of these quadrants, the max operation preserves it in the output of max pooling. What the \(max \) operation really says is: if this feature is detected anywhere in this filter, keep a high number; if it is not detected, then it most likely doesn’t exist in the corresponding quadrant. That can be an intuition behind \(Max \enspace pooling\).
We can say that there are two main reasons that people use \(Max\enspace pooling\):
- It’s been found in a lot of experiments to work well.
- It has no parameters to learn. There’s actually nothing for the gradient descent to learn. Once we’ve fixed \(f \) and \(s \), it’s just a fixed computation and gradient descent doesn’t change anything.
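To see that a pooling layer really has nothing for gradient descent to learn, a quick check can be made in a deep learning framework. This is a small sketch assuming PyTorch is available; nn.MaxPool2d and nn.Conv2d are PyTorch’s layer classes, not something defined in this post.

```python
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)                 # f=2, s=2
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

print(len(list(pool.parameters())))   # 0 -> nothing for gradient descent to update
print(len(list(conv.parameters())))   # 2 -> the conv layer's weights and biases
```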
Let’s go through an example with some different hyperparameters!
Here, we’re going to use a \(5 \times 5 \) input and we’re going to apply max pooling with a filter size \(3 \times 3 \). So, \(f=3 \) and let’s use a stride of \(1\) (\(s=1 \)). In this case the output size is going to be \(3 \times 3\), and the formulas we have developed previously for all the \(conv \) layer outputs can be applied for \(Max\enspace pooling \) as well.
The formula for calculating a dimension of the output of a \(conv \) layer is \(\frac{n+2p-f}{s}+1 \), and it also works for calculating the output size of \(Max\enspace pooling \).
Here, we can compute every element of the \(3 \times 3 \) output. Note that the filter size is \(f=3 \) and that the stride is \(s=1 \). With this set of parameters we have obtained the following output (this \(3\times 3 \)):
\(Max\enspace pooling \) with a \(3\times 3 \) filter and a stride of \(1 \)
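As a quick sanity check of the output-size formula, here is a small helper (an illustrative sketch, not library code) applied to the two examples above:

```python
import math

def pool_output_size(n, f, s, p=0):
    """Output dimension of a conv/pooling layer: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

print(pool_output_size(n=4, f=2, s=2))  # 2 -> the 4x4 input gives a 2x2 output
print(pool_output_size(n=5, f=3, s=1))  # 3 -> the 5x5 input gives a 3x3 output
```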
So far, we’ve seen \(Max\enspace pooling \) on a 2D input. In the case of a 3D input, the output has the same number of channels as the input, as we can see in the picture below. For example, if we have a \(5 \times 5 \times 2 \) input, the output would be \(3 \times 3 \times 2 \). More generally, if we have \(5\times 5\times n_{c} \), the output would be \(3 \times 3 \times n_{c} \), because the \(Max\enspace pooling \) computation is performed independently on each of these \(n_{c} \) channels.
Result of the \(Max \enspace pooling\) applied on a 3D volume
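A sketch of this per-channel behavior in NumPy (again just an illustration; the random input volume is made up):

```python
import numpy as np

def max_pool(x, f, s):
    """Max pooling over the spatial dims of an n_H x n_W x n_C volume,
    applied independently to each channel (no padding)."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            # max over the f x f window, computed separately for every channel
            out[i, j, :] = x[i * s:i * s + f, j * s:j * s + f, :].max(axis=(0, 1))
    return out

x = np.random.rand(5, 5, 2)           # a 5x5x2 input volume
print(max_pool(x, f=3, s=1).shape)    # (3, 3, 2) -- the channels are preserved
```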
Average pooling
There is another type of pooling that isn’t used very often, called \(Average\enspace pooling \). Instead of taking the maximum within each region, it takes the average. In this example the average of the numbers in purple is \(3.75 \), then \(1.25 \), then \(4 \), and finally \(2 \). This is \(Average\enspace pooling \) with hyperparameters \(f=2 \) and \(s=2 \). We can choose other hyperparameters as well. These days \(Max \enspace pooling \) is used much more often than \(Average \enspace pooling \).
Average pooling
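The only change relative to the earlier max pooling sketch is the reduction applied to each window. The input values below are made up so that the regional averages match the \(3.75 \), \(1.25 \), \(4 \), and \(2 \) mentioned above; they are not taken from the figure.

```python
import numpy as np

def avg_pool_2d(x, f=2, s=2):
    """Average pooling: the same sliding-window procedure, but taking the mean."""
    n_h, n_w = x.shape
    out = np.zeros(((n_h - f) // s + 1, (n_w - f) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].mean()
    return out

# Made-up 4x4 input whose regional averages match the ones mentioned above
x = np.array([[1, 2, 1, 1],
              [5, 7, 1, 2],
              [2, 3, 1, 2],
              [5, 6, 2, 3]])

print(avg_pool_2d(x, f=2, s=2))
# [[3.75 1.25]
#  [4.   2.  ]]
```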
Summary of parameters and hyperparameters used in \(pooling \) layers
In the following table we will make an overview of values we use in a pooling layer.
| filter size | stride | padding | type of pooling |
| --- | --- | --- | --- |
| \(f\times f \) | \(s \) | \(p \) (rarely used) | \(Max \enspace pooling \) or \(Average \enspace pooling \) |

Table with hyperparameters for pooling
An Explanation of parameters used in pooling layers
In pooling layers, the common choice of hyperparameters is \(f=2 \) and \(s=2 \), which has the effect of roughly shrinking the height and width by a factor of \(2 \); padding is rarely used.
The input to \(Max\enspace pooling \) is a volume of size \(n_{H}\times n_{W}\times n_{C} \) and, assuming there’s no padding, the output is a volume of size \(\left \lfloor \frac{n_{H}-f}{s}+1 \right \rfloor\times \left \lfloor \frac{n_{W}-f}{s}+1 \right \rfloor\times n_{C} \). The number of output channels is equal to the number of input channels because pooling is applied to each channel independently.
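The shape arithmetic above can be written as a small helper; the example input volume (\(32\times 32\times 16 \)) is made up for illustration and is not from this post.

```python
import math

def pool_output_shape(n_h, n_w, n_c, f, s):
    """Output volume of a pooling layer with filter size f and stride s, no padding."""
    out_h = math.floor((n_h - f) / s) + 1
    out_w = math.floor((n_w - f) / s) + 1
    return (out_h, out_w, n_c)   # the channels pass through unchanged

print(pool_output_shape(n_h=32, n_w=32, n_c=16, f=2, s=2))  # (16, 16, 16)
```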
One thing to note about pooling is that there are no parameters to learn, so when we implement \(backpropagation \) we find that there are no parameters that \(backpropagation \) will adapt through \(Max \enspace pooling \). Instead, there are only hyperparameters, which are set once, either by hand or using cross-validation. After that, pooling is a fixed function that the neural network computes in one of its layers, and there are no parameters to be learned.
Now we know how to build convolutional layers and pooling layers!
In the next post we’ll see a more complex example of a \(convnet \), which will also allow us to use fully connected layers.