#014 CNN Residual nets

datahacker.rs Other 12.11.2018 | 0

Residual networks

Last time we saw how VGG networks work. However, very deep neural networks are difficult to train because of vanishing and exploding gradient problems. In this post we’ll learn about skip connections, which allow us to take the activation from one layer and feed it to another layer much deeper in the neural network. Using them, we will build \(Resnets \), which enable us to train very deep networks that may have over \(100 \) layers.

Residual block

\(Resnets \) are built out of residual blocks. Let’s first describe what a residual block is!

It consists of two layers of a neural network. We start off with some activation \(a^{\left [ l \right ]} \), pass it through the residual block, and finally get \(a^{\left [ l+2 \right ]} \), as shown in the picture below.

A Residual block architecture

Let’s go through the steps in this computation, following the main path (green in the following picture).

Main path

We have \(a^{\left [ l \right ]} \) as an input, and first we apply a linear operator to it, which is governed by the following equation:

\(z^{\left [ l+1 \right ]}=W^{\left [ l+1 \right ]}a^{\left [ l \right ]}+b^{\left [ l+1 \right ]} \)

After that, we apply the \(ReLU \) non-linearity to get \(a^{\left [ l+1 \right ]} \) and that’s governed by this equation:

\(a^{\left [ l+1 \right ]}=g\left ( z^{\left [ l+1 \right ]} \right ) \)

Where \(g(z) = ReLU(z) \).

Then, in the next layer we apply the linear step again, so we have the equation:

\(z^{\left [ l+2 \right ]}=W^{\left [ l+2 \right ]}a^{\left [ l+1 \right ]}+b^{\left [ l+2 \right ]} \)

And finally after applying another \(ReLU \) function we will get \(a^{\left [ l+2 \right ]} \), and it is governed by this equation:

\(a^{ [ l+2 ]}=g ( z^{ [ l+2 ]} ) \)

In other words, for information from \(a^{\left [ l \right ]} \) to flow to \(a^{\left [ l+2 \right ]} \), it needs to go through all of these steps, which we are going to call the “main path” of this set of layers:

\(a^{[l]} \rightarrow linear \rightarrow ReLU \rightarrow a^{[l+1]} \rightarrow linear \rightarrow ReLU \rightarrow a^{[l+2]} \)
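
The main path above can be sketched in a few lines of NumPy. This is only a minimal sketch under stated assumptions: the layers are taken to be fully connected and of equal width, and the names `relu`, `main_path`, `W1`, `b1`, `W2`, `b2` are hypothetical choices for this example, not something defined in the post.

```python
import numpy as np

def relu(z):
    # g(z) = ReLU(z), applied element-wise
    return np.maximum(z, 0.0)

def main_path(a_l, W1, b1, W2, b2):
    # Layer l+1: linear step followed by ReLU
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    # Layer l+2: linear step followed by ReLU
    z2 = W2 @ a1 + b2
    a2 = relu(z2)
    return a2

# Assumed toy setup: width 4, random weights, zero biases
rng = np.random.default_rng(0)
n = 4
a_l = rng.standard_normal((n, 1))
W1, W2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
b1, b2 = np.zeros((n, 1)), np.zeros((n, 1))

a_l2 = main_path(a_l, W1, b1, W2, b2)
print(a_l2.shape)  # (4, 1)
```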

Now we’re going to take \(a^{\left [ l \right ]} \) and just fast forward it, copying it much further into the neural network. We add it just before the second \(ReLU \) non-linearity is applied, as drawn in purple in the following picture.

Main path and short cut connection

We will call this the “shortcut”. So, rather than following the main path, the information from \(a^{\left [ l \right ]} \) can now follow a shortcut to go much deeper into the neural network. That means that we can calculate the output in the following way:

\(a^{ [ l+2 ]}=g ( z^{ [ l+2 ]} +a^{[l]} ) \)
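
Substituting the expression for \(z^{[l+2]} \) from the main path, the same output can be written out in full. Note that the sum is only defined when \(z^{[l+2]} \) and \(a^{[l]} \) have the same dimension, which is an assumption made here rather than something stated above:

\(a^{ [ l+2 ]}=g ( W^{[l+2]}a^{[l+1]}+b^{[l+2]} +a^{[l]} ) \)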

The addition of \(a^{[l]} \) in this equation is what makes this a residual block. Notice that the shortcut is added before the \(ReLU \) non-linearity. Each layer in a residual block applies a linear function followed by a \(ReLU \), so \(a^{\left [ l \right ]} \) is injected after the linear part of the second layer but before its \(ReLU \) part.
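
As a rough illustration, the residual block can be sketched in NumPy. This is a minimal sketch, not a reference implementation: it assumes fully connected layers whose width stays the same, so that the addition of \(a^{[l]} \) is defined, and the names `relu`, `residual_block`, `W1`, `b1`, `W2`, `b2` are hypothetical choices made for this example.

```python
import numpy as np

def relu(z):
    # g(z) = ReLU(z), applied element-wise
    return np.maximum(z, 0.0)

def residual_block(a_l, W1, b1, W2, b2):
    # Main path, layer l+1: linear step followed by ReLU
    a1 = relu(W1 @ a_l + b1)
    # Main path, layer l+2: linear step only
    z2 = W2 @ a1 + b2
    # Shortcut: add a_l before the second ReLU is applied
    return relu(z2 + a_l)

# Assumed toy setup: width 4, random weights, zero biases
rng = np.random.default_rng(0)
n = 4
a_l = relu(rng.standard_normal((n, 1)))  # an activation, so non-negative
W1, W2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
b1, b2 = np.zeros((n, 1)), np.zeros((n, 1))

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(a_l2.shape)  # (4, 1)
```

The only difference from the main path alone is the `+ a_l` term inside the last `relu` call.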

Instead of the term shortcut, the term skip connection is sometimes used. It refers to \(a^{\left [ l \right ]} \) skipping over a layer, or over almost two layers, in order to pass this information deeper into the neural network.

The use of residual blocks allows us to train much deeper neural networks. \(ResNets \) are built by stacking a lot of residual blocks together. Let’s now see their architecture.

Comparison of Plain networks and Residual networks

Plain network

The above picture shows a “plain network”, which is the terminology of the \(Resnet \) paper. In order to turn this into a \(Resnet \), we add skip connections (or shortcut connections). The picture below shows residual blocks stacked together, and they form a \(Residual \) network.

Residual network
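
Stacking residual blocks can be sketched as a simple loop. This is only a sketch under stated assumptions: it uses fully connected layers of constant width to follow the equations in this post, whereas the layer types, widths, and initialization of a real \(Resnet \) are not specified here. The helpers `init_block` and `resnet_forward`, the width of 8, and the weight scale of 0.1 are hypothetical choices for illustration.

```python
import numpy as np

def relu(z):
    # g(z) = ReLU(z), applied element-wise
    return np.maximum(z, 0.0)

def residual_block(a, params):
    W1, b1, W2, b2 = params
    a1 = relu(W1 @ a + b1)          # first layer of the block
    return relu(W2 @ a1 + b2 + a)   # second layer plus the shortcut

def init_block(rng, n, scale=0.1):
    # Hypothetical initialization: small random weights, zero biases
    return (scale * rng.standard_normal((n, n)), np.zeros((n, 1)),
            scale * rng.standard_normal((n, n)), np.zeros((n, 1)))

def resnet_forward(a, blocks):
    # A residual network here is just residual blocks applied one after another
    for params in blocks:
        a = residual_block(a, params)
    return a

rng = np.random.default_rng(0)
n, num_blocks = 8, 50               # 50 blocks of 2 layers = 100 layers
blocks = [init_block(rng, n) for _ in range(num_blocks)]
x = rng.standard_normal((n, 1))

out = resnet_forward(x, blocks)
print(out.shape)  # (8, 1)
```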

It turns out that if we use a standard optimization algorithm, such as gradient descent or another algorithm, to train a plain network, we find that as we increase the number of layers the training error tends to decrease for a while, but then it tends to increase. In theory, as we make a neural network deeper, it should only do better and better on the training set. So, in theory, having a deeper network should only help. In practice, however, having a very deep plain network means that our optimization algorithm has a much harder time training it. The training error gets worse if we pick a network that’s too deep.

Training error for Plain networks

The deeper model performs worse, but it’s not caused by overfitting.

On the other hand, \(Resnets \) tend to have a steadily decreasing training error. This happens even if we train a network with over \(100 \) layers. \(ResNets \) allow us to train much deeper neural networks without a loss in performance. Maybe at some point the curve in the graph below will plateau, and increasing the number of layers will no longer help.

Training error of Residual network

In the next post we will see why \(Resnets \) work so well.
