#006 CNN Convolution On RGB Images

How do we make convolutions on RGB images?

We’ve seen how convolutions over 2D images work. Now, let’s see how we can implement convolutions not just over 2D images, but over three-dimensional volumes. This is what we need if, for example, we want to detect features not just in a grayscale image, but in an RGB image.

Figure: A 2D (grayscale) image and a 3D (RGB) image

Instead of a $$6 \times 6$$ image, an RGB image could be $$6 \times 6 \times 3$$ where the $$3$$ here corresponds to the $$3$$ color channels. We can think of this as a stack of three $$6 \times 6$$ images.

In order to detect edges or some other feature in this image, we convolve it not with a $$3 \times 3$$ filter, as we did in previous posts, but with a three-dimensional filter. That’s going to be a $$3 \times 3 \times 3$$ filter, so the filter itself will also have three layers, corresponding to the red, green and blue channels.

Figure: The RGB image with the corresponding filter. The third dimension must be the same.

Let’s name them: the first $$6$$ here is the height of the image, the second $$6$$ is the width, and the $$3$$ is the number of channels. Similarly, our filter also has a height, a width and a number of channels. The number of channels in our image must match the number of channels in our filter, so these two numbers have to be equal. The output of this will be a $$4 \times 4$$ image. Notice that this is $$4 \times 4 \times 1$$: there’s no longer a $$3$$ at the end. Look at the image below.

Figure: Result of a convolution applied to an RGB image

Let’s see in detail how this works, using a more nicely drawn image.

Convolutions on RGB image

Figure: An RGB image, the corresponding filter for convolution and the result of the convolution

Here we can see the $$6 \times 6 \times 3$$ image and the $$3 \times 3 \times 3$$ filter. The last number is the number of channels and it matches between the image and the filter. To simplify the drawing of the $$3 \times 3 \times 3$$ filter, we can draw it as a stack of three matrices. Sometimes, the filter is drawn as a three-dimensional cube, as we can see in the image below.

Figure: We can think of the filter as a volume

To compute the output of this convolution operation, we take the $$3 \times 3 \times 3$$ filter and first place it in the upper-left position. Notice that the $$3 \times 3 \times 3$$ filter has $$27$$ numbers. We take each of these $$27$$ numbers and multiply it with the corresponding number from the red, green and blue channels of the image. So, we take the first nine numbers from the red channel, then the nine beneath them from the green channel, then the nine beneath those from the blue channel, and multiply them with the corresponding $$27$$ numbers of the filter, drawn as the yellow cube. Then, we add up all those products and this gives us the first number in the output. To compute the next output, we take this cube and slide it over by one. Again we do the $$27$$ multiplications, sum up the $$27$$ numbers, and that gives us the next output.

Figure: Applying a $$3\times 3\times 3$$ filter to the RGB image is like sliding a volume over it
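The sliding procedure described above can be sketched in NumPy. This is a minimal illustration, not an optimized implementation: it assumes stride $$1$$ and no padding (which matches the $$6 \times 6 \rightarrow 4 \times 4$$ result in this post), a height × width × channels array layout, and no flipping of the filter, as is common in deep learning. The function name `convolve_volume` and the random test data are hypothetical and used only for illustration.

```python
import numpy as np

def convolve_volume(image, kernel):
    """Convolve an (H, W, C) image with one (f_h, f_w, C) filter.

    Assumes stride 1 and no padding; returns a 2D array.
    """
    n_h, n_w, n_c = image.shape
    f_h, f_w, f_c = kernel.shape
    if n_c != f_c:
        raise ValueError("image and filter must have the same number of channels")

    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The patch covers all channels, so for a 3x3x3 filter
            # this is 27 multiplications followed by one sum.
            patch = image[i:i + f_h, j:j + f_w, :]
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical data: a random 6x6x3 "image" and a random 3x3x3 filter.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(6, 6, 3)).astype(float)
kernel = rng.standard_normal((3, 3, 3))

result = convolve_volume(image, kernel)
print(result.shape)  # (4, 4)
```

Each output value is a single number, which is why the channel dimension disappears from the result.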

Why does our filter have three channels, and what are the coefficients in that filter?

We choose the first filter as $$1, 0, -1, 1, 0, -1, 1, 0, -1$$ (as we already did). This can be the filter for the red channel; for the green channel the values will be all zeros, and the same goes for the blue channel. We stack these three matrices together to form our $$3 \times 3 \times 3$$ filter. Then, this would be a filter that detects vertical edges, but only in the red channel.

Figure: Red color edge detector and vertical edge detector for all 3 channels

Alternatively, if it is not important what color the vertical edges are, then we might have a filter with $$1$$s and $$-1$$s in all three channels. In this way we get a $$3 \times 3 \times 3$$ edge detector that detects edges in any color. Different choices of the parameters will result in different feature detectors. By convention, in computer vision, when you have an input with a certain height, width and number of channels, your filter can have a different height and width, but the number of channels will be the same. Again, notice that convolving a $$6 \times 6 \times 3$$ volume with a $$3 \times 3 \times 3$$ filter gives a $$4 \times 4$$, 2D output.
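To make the difference between the two filters described above concrete, here is a small sketch that builds both and applies them to a toy image. The toy image (a vertical edge that exists only in the green channel), the channel order red, green, blue along the last axis, and the helper `convolve_volume` are assumptions made for this example.

```python
import numpy as np

def convolve_volume(image, kernel):
    """Stride-1, no-padding convolution of an (H, W, C) image with one filter."""
    f_h, f_w, _ = kernel.shape
    out_h = image.shape[0] - f_h + 1
    out_w = image.shape[1] - f_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w, :] * kernel)
    return out

vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]], dtype=float)
zeros = np.zeros((3, 3))

# Vertical edges in the red channel only (green and blue layers are zero).
red_only = np.stack([vertical, zeros, zeros], axis=-1)
# Vertical edges in any color (same layer repeated for all three channels).
any_color = np.stack([vertical, vertical, vertical], axis=-1)

# Toy image: the left half is bright in the green channel, everything else is 0.
image = np.zeros((6, 6, 3))
image[:, :3, 1] = 10.0

red_response = convolve_volume(image, red_only)
any_response = convolve_volume(image, any_color)

print(red_response[0])  # [0. 0. 0. 0.]
print(any_response[0])  # [ 0. 30. 30.  0.]
```

The red-only filter ignores the green edge entirely, while the all-channel filter responds to it in the middle columns.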

Knowing how to convolve on volumes is crucial for building convolutional neural networks. The new question is: what if we want to detect vertical edges and horizontal edges, and maybe even edges at $$45°$$ or $$70°$$ as well? In other words, what if we want to use multiple filters at the same time?

We can add a second filter, denoted by the orange color, which could be a horizontal edge detector. Convolving the image with each of the two filters gives us two different $$4 \times 4$$ outputs. These two $$4 \times 4$$ outputs can be stacked together to obtain a $$4 \times 4 \times 2$$ output volume. This volume can be drawn as a $$4 \times 4 \times 2$$ box, where the $$2$$ denotes the fact that we used two different filters.

Figure: Convolving with two different filters simultaneously
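Stacking the outputs of two filters can be sketched as follows. As before, this assumes stride $$1$$, no padding and a height × width × channels layout; the helper `convolve_volume`, the choice of using the same edge detector in all three channels, and the random input image are hypothetical choices for illustration.

```python
import numpy as np

def convolve_volume(image, kernel):
    """Stride-1, no-padding convolution of an (H, W, C) image with one filter."""
    f_h, f_w, _ = kernel.shape
    out_h = image.shape[0] - f_h + 1
    out_w = image.shape[1] - f_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w, :] * kernel)
    return out

vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]], dtype=float)
horizontal = vertical.T  # rows of 1s, 0s and -1s

# Two 3x3x3 filters: a vertical and a horizontal edge detector.
filters = [
    np.stack([vertical] * 3, axis=-1),
    np.stack([horizontal] * 3, axis=-1),
]

image = np.random.default_rng(0).random((6, 6, 3))

# One 4x4 map per filter, stacked along a new last axis.
output = np.stack([convolve_volume(image, f) for f in filters], axis=-1)
print(output.shape)  # (4, 4, 2)
```

The last dimension of `output` indexes the filter, so `output[..., 0]` is the vertical-edge map and `output[..., 1]` is the horizontal-edge map.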

The idea of convolution on volumes turns out to be really powerful. One small part of it is that you can now operate directly on RGB images with $$3$$ channels, but even more important is that you can now detect $$2$$ features, like horizontal and vertical edges. Furthermore, there can be $$10$$, or maybe $$128$$, or maybe several hundred different features. The output will then have a number of channels equal to the number of features we are trying to detect.
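Under the same assumptions as the examples in this post (stride $$1$$, no padding), the shapes can be summarized in one line. The symbols are notation introduced here for convenience, not taken from the figures: $$n$$ is the image height and width, $$f$$ is the filter height and width, $$n_c$$ is the number of channels, and $$n_c'$$ is the number of filters.

$$ (n \times n \times n_c) \ast (f \times f \times n_c) \rightarrow (n - f + 1) \times (n - f + 1) \times n_c' $$

For the example above, $$n = 6$$, $$f = 3$$, $$n_c = 3$$ and $$n_c' = 2$$, which gives a $$4 \times 4 \times 2$$ output.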

In the next post we will learn more about one layer of a convolutional neural network.
