
# #019 CNN Transfer Learning

## Transfer Learning

Last time we talked about the Inception network and some other neural network architectures. All of these architectures are very large and hard to train, which poses a problem.

So, if we are building a computer vision application, rather than training a neural network from scratch we often make much faster progress by downloading a network’s weights. In other words, someone else has already trained the network architecture, and we can apply it to the new task that we are solving. The computer vision research community has posted lots of datasets on the internet, like ImageNet, MS COCO or Pascal VOC. Many computer vision researchers have trained their algorithms on these datasets. This training sometimes takes several weeks and many GPUs. The fact that someone else has done this work and gone through the painful high-performance-computing process means that we can often download open source weights and use them as a very good initialization for our own neural network. That is, we can use transfer learning to transfer knowledge from some of these very large public datasets to our own problem. Let’s take a deeper look at how to do this!

Let’s say we’re building a cat detector to recognize our own pet cats. According to the internet, Tigger is a common cat name and Misty is another. Let’s say our cats are called Tigger and Misty. We also have Neither as an output, for the case when the detector sees neither of these two cats.

So, we have a classification problem with three classes: the algorithm needs to decide whether the picture shows Tigger, Misty, or Neither. We will ignore the case of both cats appearing in the same picture. We probably don’t have a lot of pictures of Tigger or Misty, so our training set will be small. What can we do? A good idea is to go online and download some open source implementation of a neural network, both the code and the weights.

### How does transfer learning work?

An example of a neural network with 1000 outputs

There are a lot of networks that have been trained on, for example, the ImageNet dataset, which has 1000 different classes. Hence, the network might have a $$softmax$$ unit that outputs one of a thousand possible classes, as we can see in the picture above. What we can do is get rid of that $$softmax$$ layer and create our own $$softmax$$ unit that outputs Tigger, Misty or Neither. So, we can transform the neural network shown in the picture above into the following one.

A neural network that outputs one of three possible classes – Tigger, Misty or Neither; only the $$softmax$$ layer is trained, all the other parameters are frozen

Because we are using downloaded weights, we will train only the parameters associated with our new $$softmax$$ layer, which has three possible outputs: Tigger, Misty or Neither. Using pretrained weights, we might get pretty good performance on this task even with a small dataset. Deep learning frameworks have parameters that control this, for example something like trainableParameter = 0 or freeze = 1, which says “don’t train those weights”. In this case we will train only the $$softmax$$ layer and freeze the weights of all the earlier layers.

Another neat trick is that, because all of these early layers are frozen, they form a fixed function that doesn’t change. We can speed up training by pre-computing the activations of the last frozen layer, the features, for all the examples in the training set and saving them to disk. In other words, we use this fixed function in the first part of the neural network to compute a feature vector for every example, and then train a shallow $$softmax$$ model on these feature vectors to make predictions. The advantage of saving to disk is that we don’t need to re-compute those activations every time we take a pass through the training set.

Save activations from the last layer (the layer before a $$softmax$$ layer) to the disk
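The caching trick can be sketched in plain NumPy. Here the frozen part of the network is a stand-in fixed function (a toy random ReLU layer, not a real ConvNet), and the data is random; the point is the pattern: compute features once, save them, then train a shallow softmax classifier on the cached features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen layers: a fixed function whose weights
# W_frozen never change during training.
W_frozen = rng.standard_normal((64, 16)) / 8.0
def frozen_features(x):
    return np.maximum(x @ W_frozen, 0.0)  # toy fixed ReLU layer

# Toy "training set": 30 flattened images, 3 classes.
X = rng.standard_normal((30, 64))
y = rng.integers(0, 3, size=30)

# Pre-compute the activations ONCE and cache them to disk, so each
# epoch of softmax training reads features instead of re-running
# the whole frozen network.
np.save("features.npy", frozen_features(X))
feats = np.load("features.npy")

# Train a shallow softmax classifier on the cached features.
W = np.zeros((16, 3))
for _ in range(200):
    logits = feats @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0            # d(cross-entropy)/d(logits)
    W -= 0.1 * feats.T @ p / len(y)           # gradient descent step
```

With a real network, feats would be the activations of the layer just before the $$softmax$$, computed once per training example.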

If we have a larger labeled dataset, that is, a lot of pictures of Tigger, Misty and of neither of them, one thing we could do is freeze fewer layers. Maybe we freeze just the first few layers, as presented in the picture below, and then train the later layers.

Freezing the first few layers and training the other layers

However, if the final layer outputs a different number of classes, then we need our own output unit that answers whether the input is Tigger, Misty or Neither. There are a couple of ways to do this: we could take the last few layers’ weights and just use them as an initialization, then run gradient descent.

Alternatively, we can remove the last few layers’ weights and add our own new hidden units and our own final $$softmax$$ output. Either of these methods could be worth trying (look at the picture below).

Freezing the first few layers and adding new layers after them

One rule of thumb is that the more data we have, the fewer layers we freeze and the more layers we train on top. The idea is that with a bigger dataset we may have enough data not just to train a single $$softmax$$ unit, but to train a moderately sized neural network that comprises the last few layers of the final network we end up using.

Finally, if we have a lot of data, one thing we might do is take this open source network and its weights, use the whole architecture just as an initialization, and train the whole network.

Training a whole neural network

However, this network has a $$1000$$-node $$softmax$$ layer and we have just 3 output classes, so we still need to replace the output layer. The more pictures we have of Tigger, Misty and Neither, the more layers we can train, and in the extreme case we use the downloaded weights just as an initialization in place of random initialization. Then we run gradient descent, updating all the weights in all the layers of the network.

So that’s transfer learning for training ConvNets. In practice nowadays, datasets on the internet are open, and we can also download code on which someone else has spent weeks of training. We find that for a lot of computer vision applications we do much better if we download someone else’s open source weights and use them as an initialization for our problem.

Across all the different disciplines and applications of deep learning, computer vision is one where transfer learning is something we should almost always do. It is very much worth considering unless we have an exceptionally large dataset and a very large computational budget to train everything from scratch by ourselves.

In the next post we will talk more about what we can do if we have only a limited number of training images and we will use a technique called Data Augmentation.