## #032 CNN Triplet Loss

In the last post, we talked about Siamese Network, but we didn’t talk how to actually define an objective function to make our neural network learn. So, in order to do that, here we will define Triplet Loss.

## Triplet Loss

One way to learn the parameters of the neural network, which gives us a good encoding for our pictures of faces, is to define and apply gradient descent on the Triplet loss function. Let’s see what that means.

To apply the triplet loss we need to compare pairs of images. For example, given the pair of images on the left in the picture below, we want their encodings to be similar because these are the same person. On the other hand, given a pair of images on the right, we want their encodings to be quite different because these are different persons.

*Learning Objective*

**In the terminology of the Triplet loss we always look at one anchor image and then we want the distance between the anchor (A) and a positive image (P), to be a positive example meaning that this is the same person. In contrast, we want the anchor image when compared to the negative example to be much further apart (or to have a larger distance).** So, we are always looking at three images at a time. That is, we will be looking at an anchor image, a positive image, as well as a negative image.

To formalize this, we want the parameters of our neural network, or our encodings, to have the following property (think of \(d \) as a distance function ):

$$ \left \| f\left ( A \right )-f\left ( P \right ) \right \|^{2}\leq \left \| f\left ( A \right )-f\left ( N \right ) \right \|^{2} $$

$$ d\left ( A,P \right ) \leq d\left ( A,N \right ) $$

Now we are going to make a slight change to this expression:

$$\left \| f\left ( A \right )-f\left ( P \right ) \right \|^{2} – \left \| f\left ( A \right )-f\left ( N \right ) \right \|^{2} \leq 0 $$

If \(f \) always outputs \(0 \), these two norms (distances) are \(0-0=0 \) and \( 0-0=0 \), and by saying \(f\left ( img \right )=\vec{0} \), we can almost trivially satisfy this equation. We need to make sure that the neural network doesn’t just output \(0 \) for all the encoding – that it doesn’t set all the encodings equal to each other. One way for the neural network to give a trivial output is if the encoding for every image was identical to the encoding to every other image, in which case we again get \(0-0=0 \). To prevent our network from doing that, we are going to modify this objective so that it doesn’t need to be just less than or equal to \(0 \), it needs to be a bit smaller than \(0\). In particular we say this needs to be less than \(– \alpha \) where \(-\alpha \) is another hyper parameter (it is also called a margin).

$$ | f ( A )-f( P ) |^{2}- | f ( A )-f ( N ) |^{2} \leq 0 – \alpha $$

This prevents the neural network from outputting the trivial solutions, and by convention usually we write plus \(+ \alpha \) on the left side of equation.

$$ \| f ( A )-f ( P ) \|^{2}- \| f ( A t )-f ( N ) \|^{2}+ \alpha \leq 0 $$

So, we want \(d(A,N) \) to be much bigger than \(d(A,P) \). To achieve this, we could either push \(d(A,P) \) up or push \(d(A,N) \) down, so that there is this gap of this hyper parameter \(\alpha \) between the distance between the anchor and the positive versus the anchor and the negative. This is **the role of a margin parameter. **Let’s define the Triplet loss function.

The Triplet loss function is defined on triples of images. The positive examples are of the same person as the anchor, but the negative are of a different person than the anchor. Now, we are going to define the loss as follows:

$$ L ( A,P,N )=max( \| f ( A )-f( P ) \|^{2}- \| f ( A) -f( N ) \|^{2} , 0)$$

As long as we achieve the goal of making \(\| f ( A )-f( P ) \|^{2}- \| f ( A) -f( N ) \|^{2} \) less than or equal to zero, then the loss on this example is equal to zero. On the other hand, if this is greater than zero then we take the max so we get a positive loss.

This is how we define the loss on a single triplet, and the overall cost function for our neural network can be a sum over a training set of these individual losses on different triplets:

$$ J=\sum_{i=1}^{M}h\left ( A^{\left ( i \right )},P^{\left ( i \right )},N^{\left ( i \right )} \right ) $$

Let’s imagine that we have a training set of \(10,000\) pictures with a \(1000 \) different persons. Then, we have to take our \(10,000 \) pictures and use them to generate triplets. Next, we train our learning algorithm using a gradient descent on cost function that we have defined previously. Notice that in order to define this data set of triplets we do need some pairs of A and P, pairs of pictures of the same person, so** for the purpose of training our system we do need a data sets where we have multiple pictures of the same person, at least for some people in our training set so that we can have pairs of anchor and positive images.** **If we had just one picture of each person then we can’t actually train this system.** Naturally, **after training this system we can apply it to our one-shot learning problem** where for our face recognition system maybe we have only a single picture of someone we might be trying to recognize.

### How do we actually choose these triplets to form our training set?

The first thing that comes to mind for most people is what if we choose A, P and N randomly from our training set. And the answer is that if we choose them sort of at random, then this constraint is very easy to satisfy, because given two randomly chosen pictures of people chances are that A and N are “more different” than A and P, so this is a problem.

If A and N are two randomly chosen different persons then there is a very high chance that \(\| f ( A )-f( P ) \|^{2}- \| f ( A) -f( N ) \|^{2} \) will be much bigger than margin \(\alpha \) then that term on the left, and so the neural network won’t learn much from it. Moreover, we want to construct our training set by choosing triplets A, P and N so that they represent “hard cases” to train on.

In particular, what we want is for all triplets that this constraint be satisfied, so a triplet that is hard would be when we choose values so that \(d(A,P) \) is actually quite close to \(d(A,N)\).

$$ d(A,P)+\alpha \leq d(A,N)$$

$$ d(A,P) \approx d(A,N)$$

In that case the learning algorithm has to try extra hard to take this \(d(A,N)\) and to try to push it up. The effect of choosing these triplets is that it increases the computational efficiency of our learning algorithm. If we choose the triplets randomly then too many triplets would be really easy, and so gradient descent won’t do anything because neural network will just get them right pretty much all the time. So, with choosing hard triplets, the gradient descent procedure has to do some work to try to push \(d(A,P) \) further away from \(d(A,N) \).

### Training set using triplet loss

To train on triplet loss, we need to take our training set and map it to a lot of triples.

Here is a nice example. The Anchor and Positive are the same person, but the Anchor and Negative are different persons. We need to find the training set of Anchor, Positive and Negative triples. Then, we will use gradient descent to try to minimize the cost function \(J \) that we defined earlier. It will have the effect of back propagating to all the parameters of the neural network in order to learn an encoding. Hence, a function \(d \) of two images will be small when these two images are of the same person. However, they will be large when these are two images of different persons.

*Examples of triplets*

That is it for the triplet loss and how we can use it to train a neural network to output a good encoding for face recognition task. It turns out that today’s face recognition systems, especially the large scale commercial face recognition systems are trained on very large data sets. Datasets with more than a million images is not uncommon, so this is one domain where often it might be useful to download someone else’s pretrained model, rather than doing everything from scratch.