
In order to build deep neural networks, one modification to the basic convolutional operation that we have to use is padding. Let’s see how it works.

As we saw in earlier posts, if we take a $$6 \times 6$$ image and convolve it with a $$3 \times 3$$ filter, we end up with a $$4 \times 4$$ output matrix. That’s because there are only $$4 \times 4$$ possible positions where our $$3 \times 3$$ filter fits inside the $$6 \times 6$$ matrix.

If we convolve an $$n\times n$$ image with an $$f\times f$$ filter, what are the dimensions of the output matrix?

If we convolve an $$n\times n$$ image with an $$f\times f$$ filter, the dimensions of the output will be $$(n-f+1)\times (n-f+1)$$.

So, in our example, where we convolve a $$6\times 6$$ image with a $$3\times 3$$ filter, we get $$6-3+1=4$$, which is why we end up with a $$4 \times 4$$ output.
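The formula above can be checked with a few lines of Python. This is only a minimal sketch: the function name `conv_output_size` is made up for illustration, and it assumes a square image, a square filter and a stride of 1.

```python
def conv_output_size(n: int, f: int) -> int:
    """Side length of the output for an n x n image and an f x f filter.

    Assumes no padding and a stride of 1.
    """
    if f > n:
        raise ValueError("the filter must not be larger than the image")
    return n - f + 1


# The 6 x 6 image with a 3 x 3 filter from the text
print(conv_output_size(6, 3))
```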

Two downsides of convolution:

We see that there are two downsides:

1. Every time we apply a convolutional operator our image shrinks. We’ve gone from $$6 \times 6$$ down to $$4 \times 4$$, and we can only do this a few times before our image starts getting really small. Maybe it shrinks down to $$1 \times 1$$. Usually we don’t want our image to shrink every time we detect the edges or other features in it.
2. If we look at a pixel in the corner of the image, it is used in only one of the outputs, because only one $$3 \times 3$$ region touches it. A pixel in the middle, on the other hand, is overlapped by many $$3 \times 3$$ regions. Pixels in the corners and along the border are therefore used much less in the output, so we’re throwing away a lot of the information near the border of the image.
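To make the second downside concrete, we can count how many filter positions cover each pixel. The sketch below is an illustration only: the helper name `coverage_counts` is hypothetical, and it assumes a square image, no padding and a stride of 1.

```python
def coverage_counts(n: int, f: int):
    """For each pixel of an n x n image, count the f x f windows that cover it."""
    counts = [[0] * n for _ in range(n)]
    # Slide the window over every valid top-left position
    for top in range(n - f + 1):
        for left in range(n - f + 1):
            # Every pixel inside the current window is used once more
            for i in range(top, top + f):
                for j in range(left, left + f):
                    counts[i][j] += 1
    return counts


counts = coverage_counts(6, 3)
for row in counts:
    print(row)
```

For the $$6 \times 6$$ image and $$3 \times 3$$ filter, the corner pixels get a count of 1, while the pixels in the middle get a count of 9.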

We want the image neither to shrink at every layer of a deep network nor to lose the information near its edges, so to fix both of these problems we pad the image before applying the convolution operation. If we pad our $$6 \times 6$$ image with a border of one pixel, it becomes an $$8 \times 8$$ image. Convolving that with a $$3 \times 3$$ filter gives a $$6 \times 6$$ output, so we’ve managed to preserve the original input size of $$6 \times 6$$. By convention we pad with zeros, and $$p$$ denotes the padding amount.

Here we will use padding $$p = 1$$.

Applying padding of 1 before convolving with a $$3\times3$$ filter

So, in this example $$p=1$$ because we’re padding all around the image with an extra border of one pixel. The output then becomes $$(n+2p-f+1) \times (n+2p-f+1)$$. In our case this is $$6+2\times 1-3+1=6$$, so we end up with a $$6 \times 6$$ image and the original input image size is preserved.
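Here is a small pure-Python sketch of this step, so we can see the sizes for ourselves. The helpers `zero_pad` and `convolve_valid` are hypothetical names written for this illustration. They assume a square image stored as a list of lists and a stride of 1, and the filter is applied without flipping, as is usual in deep learning.

```python
def zero_pad(image, p):
    """Return a copy of the image with a border of p zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]          # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)  # left and right borders
    padded.extend([0] * width for _ in range(p))      # bottom border
    return padded


def convolve_valid(image, kernel):
    """Slide the f x f kernel over the n x n image without any padding."""
    n, f = len(image), len(kernel)
    out = n - f + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(f) for b in range(f))
            for j in range(out)
        ]
        for i in range(out)
    ]


image = [[1] * 6 for _ in range(6)]    # 6 x 6 image of ones
kernel = [[1] * 3 for _ in range(3)]   # 3 x 3 filter of ones

padded = zero_pad(image, 1)            # 8 x 8 after padding with p = 1
result = convolve_valid(padded, kernel)
print(len(padded), len(padded[0]))     # size of the padded image
print(len(result), len(result[0]))     # size of the output
```

The padded image is $$8 \times 8$$ and the output is $$6 \times 6$$, the same size as the original input.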

### Valid and same convolutions

We’ve shown here the effect of padding the border with just one pixel. We can also pad the border with two or more pixels. In terms of how much to pad, it turns out there are two common choices: valid and same convolutions.

A valid convolution basically means that we don’t pad the image. In a valid convolution we have an $$n \times n$$ image convolved with an $$f \times f$$ filter, and we get an $$(n-f+1) \times (n-f+1)$$ dimensional output. This is like the example we had in previous lectures, where a $$6 \times 6$$ image convolved with a $$3 \times 3$$ filter gave us a $$4 \times 4$$ output.

$$\textbf{|}$$ Valid convolutions $$\Leftrightarrow$$ no padding

| size of an input image | size of a filter | padding | size of an output image |
|---|---|---|---|
| $$n \times n$$ | $$f \times f$$ | no padding, $$p=0$$ | $$(n-f+1) \times (n-f+1)$$ |
| $$6$$ | $$3$$ | $$p=0$$ | $$6-3+1 = 4 \Rightarrow 4\times 4$$ |

Table of dimensions when we do a valid convolution

$$\textbf{|}$$ Same convolution $$\Leftrightarrow$$ pad so that output size is the same as the input size

The other common choice of padding is called the same convolution. In this case we pad so that the output size is the same as the input size. Looking at the formula $$n-f+1$$, when we pad by $$p$$ pixels, $$n$$ becomes $$n+2p$$, and we still add $$-f+1$$, giving $$n+2p-f+1$$.

| size of an input image | size of a filter | padding | size of an output image |
|---|---|---|---|
| $$n \times n$$ | $$f \times f$$ | $$p=1$$ | $$(n+2p-f+1) \times (n+2p-f+1)$$ |
| $$6$$ | $$3$$ | $$p=1$$ | $$6+2-3+1 = 6 \Rightarrow 6\times 6$$ |

Table of dimensions when we do the same convolution

Suppose we have an $$n \times n$$ image and pad it with a border of $$p$$ pixels. The output size is then $$n+2p-f+1$$. If we want $$n+2p-f+1$$ to be equal to $$n$$, that is, the output size to be the same as the input size, then solving for $$p$$ gives $$p=(f-1)/2$$. When $$f$$ is odd, this is a whole number, so we can always pad symmetrically to make the output size the same as the input size. For the $$3 \times 3$$ filter in the earlier example, the padding that makes the output size equal to the input size is $$(3-1)/2=1$$.
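The relation $$p=(f-1)/2$$ is easy to try out in code. The sketch below uses two hypothetical helper names, `same_padding` and `padded_output_size`, and assumes a stride of 1 and an odd filter size.

```python
def same_padding(f: int) -> int:
    """Padding p for which n + 2p - f + 1 == n, given an odd filter size f."""
    if f < 1 or f % 2 == 0:
        raise ValueError("symmetric same padding needs an odd, positive filter size")
    return (f - 1) // 2


def padded_output_size(n: int, f: int, p: int) -> int:
    """Side length of the output for an n x n image, f x f filter and padding p."""
    return n + 2 * p - f + 1


# Filter size, required padding and resulting output size for a 6 x 6 image
for f in (3, 5, 7):
    p = same_padding(f)
    print(f, p, padded_output_size(6, f, p))
```

For every odd filter size in the loop, the output size stays at 6, the size of the input.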

As another example, if our filter is $$5 \times 5$$, then plugging $$f=5$$ into that equation gives $$(5-1)/2=2$$, so a padding of $$2$$ is required to keep the output size the same as the input size. By convention, in computer vision $$f$$ is almost always odd. We rarely see even-sized filters. There are two reasons for that:

1. If $$f$$ were even, we would need asymmetric padding, with more pixels on one side than on the other. If $$f$$ is odd, the same convolution gives a natural, symmetric padding.
2. When we have an odd dimension filter such as $$3 \times 3$$ or $$5 \times 5$$, it has a central position, and sometimes in computer vision it’s nice to have such a distinguished position. It’s nice to have a pixel that we can call the central pixel so we can define the position of the filter.
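The first reason can be seen by splitting the total padding $$f-1$$ between the two sides of the image. In the sketch below, the function name `split_same_padding` is hypothetical, and putting the extra pixel on the "after" side (bottom or right) is just an assumption made for illustration.

```python
from typing import Tuple


def split_same_padding(f: int) -> Tuple[int, int]:
    """Split the total padding f - 1 into (before, after) for one dimension."""
    total = f - 1              # needed so that n + total - f + 1 == n
    before = total // 2        # padding on the top or left side
    after = total - before     # padding on the bottom or right side
    return before, after


print(split_same_padding(3))   # odd filter: both sides get the same padding
print(split_same_padding(4))   # even filter: one side gets an extra pixel
```

For $$f=3$$ both sides get 1 pixel, while for $$f=4$$ the split is 1 and 2, which is the asymmetric padding mentioned above.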

In summary, to use a padded convolution we need to specify the value of $$p$$. We can either use a valid convolution, which means $$p=0$$, or a same convolution, which means padding as much as needed to make sure that the output has the same dimensions as the input.

In the next post we’ll talk about how we can implement strided convolutions.