#018 PyTorch – Popular Techniques to Prevent Overfitting in Neural Networks
Highlights: Hello and welcome to our new post. In today’s post, we will discuss one of the most common problems that arise during the training of deep neural networks. It is called overfitting, and it usually occurs when we increase the complexity of the network. In this post, you will learn the most common techniques to reduce overfitting while training neural networks. So, let’s begin.
Tutorial Overview:
- What is overfitting?
- Common techniques to reduce overfitting
- How to apply L2 regularization and Dropouts in PyTorch
1. What is overfitting?
When building a neural network, our goal is to develop a model that performs well not only on the training dataset but also on new data that it wasn't trained on.
However, when our model is too complex, it can start to learn irrelevant information in the dataset. That means that the model memorizes noise that is present only in the training dataset. In that case, the model is highly inaccurate because the memorized pattern does not reflect the important information in the data. We say that such a model is overfitted and unable to generalize well to new data. It has learned the features of the training set extremely well, but if we give the model any data that deviates even slightly from the exact data used during training, it is unable to generalize and accurately predict the output.
The best way to tell whether your model is overfitted is to use a validation dataset during training. If you notice that the validation metrics are considerably worse than the training metrics, you can be sure that your model is overfitted.
The logical response to prevent overfitting can either be to stop the training process earlier or to reduce the complexity of the network. However, in both of these cases, the opposite problem called underfitting can occur. It happens when the model has not trained for enough time, so it is unable to determine a meaningful relationship between the input and output variables.
So, based on performance on both the training and testing data, we classify our model into one of three categories.
- Underfit Model. A model that can't learn the problem and performs poorly on both the training and testing datasets.
- Overfit Model. A model that learns the training dataset very well but does not perform well on the testing dataset.
- Optimal Model. A model that learns the training dataset very well and also performs well on the testing dataset.
The following image illustrates all three scenarios: when the model is overfitted, underfitted, and optimal.
Now that we learned what overfitting is, the question that we need to ask is “how can we avoid this problem”? Well, to avoid overfitting in the neural network we can apply several techniques. Let’s look at some of them.
2. Common techniques to reduce overfitting
Simplifying The Model
The first method that we can apply to avoid overfitting is to decrease the complexity of the model. To do that we can simply remove layers and make the network smaller. Note that while removing layers it is important to adjust the input and output dimensions of the remaining layers in the neural network.
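For instance, a minimal sketch of this idea might look as follows; the layer sizes here are hypothetical and not taken from the model used later in this post.

import torch.nn as nn

# A hypothetical "complex" classifier with three hidden layers
complex_model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# A simplified version with one hidden layer removed. Note how the
# in_features/out_features of the neighboring layers are adjusted
# so that the dimensions still match.
simple_model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)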
Early stopping
Another common approach to avoiding overfitting is called early stopping. If you choose this method, your goal is very simple: you just need to stop the training process before the model starts learning the irrelevant information (noise). However, as we mentioned in the previous chapter, if you stop too early you can end up with the opposite problem of underfitting. That is why the main challenge of this approach is to find the right point just before your model starts to overfit. This method is illustrated in the following image.
In the image above, the green curve represents the training loss and the red curve represents the validation loss. As we can see, after several epochs, validation loss has started to increase while the training loss is still decreasing. Therefore, we can be sure that the model is overfitting and we can stop the training exactly at the point where validation loss has started to increase.
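One common way to implement this is to monitor the validation loss and stop once it has not improved for a certain number of epochs. Below is a minimal sketch of such patience-based early stopping; the names model, train_one_epoch, evaluate, and max_epochs are hypothetical placeholders, not code from this post.

import copy

best_val_loss = float('inf')
patience = 5           # number of epochs we tolerate without improvement
epochs_no_improve = 0
best_state = None

for epoch in range(max_epochs):
    train_one_epoch(model)        # hypothetical: one pass over the training set
    val_loss = evaluate(model)    # hypothetical: compute the validation loss

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())  # remember the best weights
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            print('Stopping early at epoch', epoch)
            break

# restore the weights from the point just before overfitting started
if best_state is not None:
    model.load_state_dict(best_state)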
Adding More Data To The Training Set
Probably the easiest way to eliminate overfitting is to add more data. The more data we have in the training set, the more our model will be able to learn. With more data, we also hope to add more diversity to the training set.
Data Augmentation
Data augmentation is the most popular technique used to increase the amount of data in the training set. To get more data, we just need to make minor alterations to our existing dataset. Even though the changes we are making are quite subtle, our neural network will treat the altered samples as distinct images. The benefit of data augmentation is that the model is unable to overfit all the samples and is forced to generalize. The most common data augmentation techniques are flipping, translation, rotation, scaling, changing brightness, and adding noise. For a more detailed explanation of this topic check this link.
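In PyTorch, such alterations are usually applied on the fly with torchvision transforms. Below is a small sketch; the particular transforms and their parameters are only an illustration, not the configuration used in the experiment later in this post.

import torchvision.transforms as transforms

# Each training image is randomly flipped, rotated, and brightness-jittered,
# so the network rarely sees exactly the same sample twice.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])

# The validation/test images are only converted to tensors - no augmentation.
test_transform = transforms.ToTensor()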
Regularization techniques
Another popular method that we can use to solve the overfitting problem is called regularization. It is a technique that reduces the complexity of the model. The most common regularization method is to add a penalty to the loss function that is proportional to the size of the weights in the model. In this way, input parameters with larger coefficients are penalized more heavily than parameters with smaller coefficients, which limits the amount of variance in the model.
There are a number of regularization methods, but the most common techniques are called L1 and L2 regularization. The L1 penalty minimizes the sum of the absolute values of the weights, whereas the L2 penalty minimizes the sum of the squared values of the weights. This is mathematically shown in the following equations.
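In common notation, writing the original loss as \(L(w)\), the individual weights as \(w_i\), and the regularization parameter as \(\lambda\), the two penalized losses can be written as:

\( L_{L1} = L(w) + \lambda \sum_{i} |w_i| \)

\( L_{L2} = L(w) + \lambda \sum_{i} w_i^{2} \)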
Here, the term we're adding to the loss is the sum of the squared norms of the weight matrices, multiplied by a small constant \(\lambda\) called the regularization parameter. Note that this is another hyperparameter that we'll have to tune in order to choose the correct value for our specific model.
L2 regularization is often used when the data is complex, because it is able to learn the inherent patterns present in the data. However, a disadvantage of this method is that it is not robust to outliers.
Dropouts
So, regularization methods like L1 and L2 reduce overfitting by modifying the cost function. On the other hand, Dropout is a technique that modifies the network itself. It is a method where randomly selected neurons are ignored during training in each iteration.
The idea behind this technique is that dropping out different sets of neurons is equivalent to training different neural networks.
For example, let’s assume that we train multiple models. Each of these models will learn the pattern from the data in different ways. If each component model learns a relationship from the data that contains the true signal with some addition of noise, a combination of models should maintain the relationship of the signal within the data while averaging out the noise.
This approach could solve the problem of overfitting. However, training multiple models can take several days, so we need a more efficient method that will save us a lot of time.
Using the dropout method we don’t need to train multiple models separately. We can build multiple representations of the relationship present in the data by randomly dropping neurons from the network during training. Then, in the end, the different networks will overfit in different ways, so the net effect of dropout will be to reduce overfitting.
This technique is shown in the diagram above. It has been proven to reduce overfitting in a variety of problems involving image classification, image segmentation, word embeddings, and semantic matching.
3. How to apply L2 regularization and Dropouts in PyTorch
During the last few years, PyTorch has become extremely popular for its simplicity. The implementation of Dropout and L2 regularization is a great example of how simple and easy coding in PyTorch has become. For our task, which at first glance seems to be very complicated, we just need two lines of code.
To apply dropout, we just need to add a dropout layer when we build our model. For that, we will use the torch.nn.Dropout() class. This class randomly deactivates some of the elements of the input tensor during training. The parameter p is the probability of a neuron being deactivated. Its default value is 0.5, which means that 50% of the neurons will be dropped out. Furthermore, the outputs are scaled by a factor of \(\frac{1}{1-p} \) during training, which means that during evaluation the module simply computes an identity function.
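We can verify this behavior with a small sketch; the tensor values here are arbitrary and chosen only for illustration.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()       # training mode: roughly half of the elements are zeroed,
print(drop(x))     # and the surviving ones are scaled by 1 / (1 - p) = 2

drop.eval()        # evaluation mode: dropout acts as an identity function
print(drop(x))     # prints the unchanged tensor of ones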
In our experiment, we are going to train the LeNet-5 model on the Fashion MNIST dataset which consists of 10 classes.
Let’s take a look at the architecture of our model. It is the same architecture that we have used in our previous post.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        # Feature extractor: three convolutional layers with ReLU activations
        # and max-pooling layers in between
        self.convolutional_layer = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2, padding=0),
            nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2, padding=0),
            nn.Conv2d(in_channels=16, out_channels=120, kernel_size=5, stride=1),
            nn.ReLU()
        )
        # Classifier: the dropout layer is placed just before the first fully-connected layer
        self.linear_layer = nn.Sequential(
            nn.Dropout(),
            nn.Linear(in_features=120, out_features=84),
            nn.ReLU(),
            nn.Linear(in_features=84, out_features=10),
        )

    def forward(self, x):
        x = self.convolutional_layer(x)
        x = torch.flatten(x, 1)
        x = self.linear_layer(x)
        x = F.softmax(x, dim=1)
        return x
We can apply dropout after any non-output layer in our model. As you can see in our example, we have defined the dropout layer just before the first fully-connected layer.
Applying L2 regularization to the parameters of your model is also very simple. It is already built into most optimizers, including the Adam optimizer that we will use in our example. To apply L2 regularization, we can simply set the weight_decay parameter of optim.Adam().
For our experiment, we are going to create two optimizers. In the first optimizer, the weight_decay parameter will be left at its default value of 0, whereas in the second optimizer we will set this parameter to 1e-4.
import torch.optim as optim

weight_decay = 1e-4  # strength of the L2 penalty

# optimizer1: no regularization; optimizer2: L2 regularization via weight_decay
optimizer1 = optim.Adam(model.parameters(), lr=0.001)
optimizer2 = optim.Adam(model.parameters(), lr=0.001, weight_decay=weight_decay)
Now we are going to train our model two times. For the first model, we will not apply any regularization methods. On the other hand, for the second model, we will apply both dropout and L2 regularization. So, let’s train our model and compare accuracy between the regularized version and the non-regularized version.
epochs = 30
train_loss = []
val_loss = []
t_accuracy_gain = []
accuracy_gain = []

# select the optimizer for this run: optimizer1 (no regularization) or optimizer2 (with L2 regularization)
optimizer = optimizer2

for epoch in range(epochs):
    total_train_loss = 0
    total_val_loss = 0

    # training our model
    model.train()
    total_t = 0
    for idx, (image, label) in enumerate(trainloader):
        image, label = image.to(device), label.to(device)

        optimizer.zero_grad()
        pred_t = model(image)

        loss = criterion(pred_t, label)
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

        # count correctly classified training samples
        pred_t = torch.nn.functional.softmax(pred_t, dim=1)
        for i, p in enumerate(pred_t):
            if label[i] == torch.max(p.data, 0)[1]:
                total_t = total_t + 1

    accuracy_t = total_t / train_data_size
    t_accuracy_gain.append(accuracy_t)

    total_train_loss = total_train_loss / (idx + 1)
    train_loss.append(total_train_loss)

    # validating our model
    model.eval()
    total = 0
    with torch.no_grad():
        for idx, (image, label) in enumerate(testloader):
            image, label = image.to(device), label.to(device)

            pred = model(image)

            loss = criterion(pred, label)
            total_val_loss += loss.item()

            # count correctly classified validation samples
            pred = torch.nn.functional.softmax(pred, dim=1)
            for i, p in enumerate(pred):
                if label[i] == torch.max(p.data, 0)[1]:
                    total = total + 1

    accuracy = total / test_data_size
    accuracy_gain.append(accuracy)

    total_val_loss = total_val_loss / (idx + 1)
    val_loss.append(total_val_loss)

    print('\nEpoch: {}/{}, Train Loss: {:.4f}, Val Loss: {:.4f}, Val Acc: {:.4f}'.format(epoch, epochs, total_train_loss, total_val_loss, accuracy))
As you can see, we achieved a validation accuracy of 89% with the model without regularization. After applying dropout and L2 regularization, the accuracy increased by one percentage point. This may not seem like a big improvement, but keep in mind that we are using grayscale images and a relatively simple neural network. If we trained a deeper network such as AlexNet on a dataset of color images, the improvement from regularization would likely be much greater.
Summary
In this post, we talked about the problem of overfitting which happens when a model learns the random fluctuations in the training data to the extent that it negatively impacts the performance of the model on new data. Then, we learned several techniques that we can apply to reduce overfitting. We paid special attention to the regularization techniques like dropout and L2 regularization. Finally, we learned how to apply dropout and L2 regularization in PyTorch.