#001 CNN Convolutional Neural Networks
Source: Stanford CS 231n
Convolutional Neural Networks
What is Computer Vision?
Computer vision is an interdisciplinary field that deals with how computers can be made to gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do.
Computer Vision is one of the fields of artificial intelligence that is rapidly progressing thanks to Deep Learning.
With the help of Deep Learning, Computer Vision currently helps autonomous cars locate other vehicles and pedestrians. Facial recognition algorithms also work much better than ever before: the latest apps can unlock a phone, or even a door, using only an image of our face. On mobile phones, we use applications that rely on deep learning to show us the images that are most attractive and most relevant to us. In addition, deep learning enables new approaches to creating art.
Why is Deep Learning so important when talking about Computer Vision?
1. Rapid progress in Computer Vision allows the creation of completely new applications that could not have been designed a few years ago. Therefore, by learning the Deep Learning tools, we will be able to invent new products and applications.
2. Even if we are not concerned with developing Computer Vision systems, creative neural network architectures from this research field could inspire us to create many other methods in fields such as speech recognition, text processing or audio processing.
Here are some challenges of computer vision that we will study in the following blog posts:
- Image classification, also called object recognition. A cat picture (\(64 \times 64 \) pixels) is shown below. Our task is to determine whether the picture contains a cat or not.
Example of image classification
- Object detection is another problem that is studied in computer vision:
Example of object detection
Let’s say that we want to build autonomous cars. For that purpose, it is not enough to find out whether there are other cars in the image; we must also determine the position of these cars in order to avoid them. In object detection, we need not only to find out whether an object is in the picture, but also to draw a box around it in order to determine its position accurately. Note also that the same picture can contain several cars, each at a different distance from our car.
- Neural style transfer is a very entertaining area of computer vision. Imagine that we have a picture and want to modify it by changing its painting style.
Neural style transfer, images which are used in the process of transfer; content photo (left), image of the painting (right)
In neural style transfer we have a content photo (left) and an image of a painting (right). A neural network then combines them to repaint the content photo in the style of the painting! We can see the result in the photo below.
The image after the style is transferred
Why is computer vision challenging?
One of the challenges of computer vision problems is that the input can be very large. For instance, in the previous example we worked with a \(64 \times 64 \) pixel photo, so the input has dimensions \(64 \times 64 \times 3 \), because there are three color channels. Multiplying these dimensions gives \(12288 \) input values, which is already a large number for a relatively small image ( \(64 \times 64 \) pixels ).
Examples of number of pixels in two different sized images
If we work with larger images, for example \(1000 \times 1000 \) pixels, each image has \(1M \) pixels (one megapixel). The number of input features is then \(1000 \times 1000 \times 3 \) (because there are three RGB channels), which is three million. In practice, such images require a neural network with three million input features. Even if the first hidden layer has only \(1000 \) neurons, a fully connected layer will have \(3\enspace billion \) weight coefficients \((1000 \times 3M) \). Such a matrix is very large! Training and successfully optimizing so many parameters without overfitting requires a huge amount of data.
The memory required to train such a large neural network is also huge, which makes this an almost impossible task.
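To make the arithmetic above concrete, here is a minimal Python sketch that counts the input features and first-layer weights for both image sizes. The helper names `num_input_features` and `dense_layer_weights` are hypothetical, and the memory estimate assumes that each weight is stored as a 32-bit float and that bias terms are ignored; neither assumption comes from the original text.

```python
def num_input_features(height, width, channels=3):
    """Number of values in an image once it is flattened into one input vector."""
    return height * width * channels


def dense_layer_weights(n_inputs, n_hidden):
    """Number of weights in a fully connected layer (bias terms not counted)."""
    return n_inputs * n_hidden


BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats

for height, width in [(64, 64), (1000, 1000)]:
    n_in = num_input_features(height, width)
    n_w = dense_layer_weights(n_in, n_hidden=1000)
    gigabytes = n_w * BYTES_PER_FLOAT32 / 1e9
    print(f"{height}x{width}x3: {n_in:,} inputs, {n_w:,} weights, {gigabytes:.2f} GB")
```

Under these assumptions, the \(1000 \times 1000 \) case needs about 12 GB just to store the weights of the first layer, before counting gradients or activations.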
Finally, in computer vision applications we do not want to be limited to small images; our goal is to process very large photos as well. To do this, we need the convolution operation, which is one of the foundations of convolutional neural networks. Let’s see what that means!