# 031 CNN Siamese Network

## Siamese Network

The job of the function \(d\), which we presented in the previous post, is to use two faces and to tell us *how similar* or *how different* they are. A good way to accomplish this is to use a Siamese network.

We are used to seeing pictures of \(convnets \), like the two networks in the picture below. We have an input image, denoted with \(x^{(1)}\), and through a sequence of \(Convolutional \), \(Pooling \) and \(Fully\enspace connected \) layers we end up with a feature vector.

Sometimes this output is fed to a \(softmax \) unit to make a classification, but we are not going to use that approach in this post. Instead, we are going to focus on a vector of \(128 \) numbers computed by some \(Fully\enspace connected\) layer that is deeper in the network. We will give this list of \(128 \) numbers the name \(f(x^{(1)}) \), and we should think of \(f(x^{(1)}) \) as an encoding of the input image \(x^{(1)} \). That is, we have taken the input image and represented it as a vector of \(128 \) numbers. Next, to build a face-recognition system we need to compare two pictures; let's say the first picture with the second picture below. To do this, we can feed the second picture to the same neural network with the same parameters and get a different vector of \(128 \) numbers. This will be our representation of the second picture. We also say that the picture is encoded in this way.

*Siamese network*

We will call the encoding of the second picture \(f(x^{(2)}) \). Note, here we are using \(x^{(1)} \) and \(x^{(2)} \) just to denote two input images; they don't necessarily have to be the first and second examples in our training set. Finally, if it turns out that these encodings are a good representation of the two images, we can define the distance \(d\) between \(x^{(1)}\) and \(x^{(2)}\). A common way is to use a norm of the difference between the encodings of these two images.

$$ d\left ( x^{\left ( 1 \right )},x^{\left ( 2 \right )} \right )=\left \| f\left ( x^{\left ( 1 \right )} \right )-f\left ( x^{\left ( 2 \right )} \right ) \right \|_{2}^{2} $$
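This distance is just the squared \(L_{2} \) norm of the difference between the two encodings. A minimal sketch in numpy (the random vectors here are hypothetical stand-ins for the \(128 \)-dimensional encodings a real convnet would produce):

```python
import numpy as np

def distance(f_x1, f_x2):
    """Squared L2 distance between two 128-dimensional encodings."""
    diff = f_x1 - f_x2
    return float(np.dot(diff, diff))

# Hypothetical encodings of two images (random stand-ins for f(x^(1)), f(x^(2))).
rng = np.random.default_rng(0)
f_x1 = rng.normal(size=128)
f_x2 = rng.normal(size=128)

d = distance(f_x1, f_x2)  # a single non-negative number
```

If the two encodings are identical, the distance is exactly zero; the further apart they are, the larger it grows.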

This idea of running two identical convolutional neural networks on two different inputs and then comparing them is called a Siamese neural network architecture.
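The key architectural point is that both branches are the *same* network with the *same* parameters. A toy sketch of this idea, where a single linear projection stands in for the whole convnet (the encoder, its dimensions and weights are illustrative assumptions, not a real face model):

```python
import numpy as np

class ToyEncoder:
    """Stand-in for the convnet: flattens an image and projects it to a
    128-dimensional encoding with one shared weight matrix."""
    def __init__(self, input_dim, encoding_dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(encoding_dim, input_dim))

    def encode(self, image):
        return self.W @ image.reshape(-1)

def siamese_distance(encoder, x1, x2):
    """Run the SAME encoder (same parameters) on both inputs,
    then compare the two encodings with a squared L2 norm."""
    f1 = encoder.encode(x1)
    f2 = encoder.encode(x2)
    return float(np.sum((f1 - f2) ** 2))

encoder = ToyEncoder(input_dim=64 * 64)
x1 = np.random.default_rng(1).random((64, 64))
x2 = np.random.default_rng(2).random((64, 64))
print(siamese_distance(encoder, x1, x2))
```

Because the encoder is shared, feeding the same image down both branches always gives a distance of zero, which is exactly the behavior we want from \(d \).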

**How do we train this Siamese neural network?** Remember that the two branches share the same parameters, so we are really training a single neural network. We want to train it so that the encoding it computes results in a function \(d \) that tells us when two pictures are of the same person.

To put it more formally, the parameters of the neural network define an encoding \(f(x^{(i)}) \). So, given any input image \(x^{(i)} \) the neural network outputs a \(128 \)-dimensional encoding \(f(x^{(i)}) \). **What we want to do is to learn parameters so that if two pictures \(x^{(i)} \) and \(x^{(j)} \) are of the same person, then the distance between their encodings is small.** At the beginning of this post we used \(x^{(1)} \) and \(x^{(2)} \), but this can be any pair \(x^{(i)} \) and \(x^{(j)} \) from our training set.

*The goal of learning*

**In contrast, if \(x^{(i)} \) and \(x^{(j)} \) are of different persons, then we want the distance between their encodings to be large.**
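Written in the same notation as the distance formula above, the learning goal for the encoding \(f \) is:

$$ \left \| f\left ( x^{\left ( i \right )} \right )-f\left ( x^{\left ( j \right )} \right ) \right \|_{2}^{2}\enspace is\enspace small,\enspace if\enspace x^{(i)},x^{(j)}\enspace are\enspace the\enspace same\enspace person $$

$$ \left \| f\left ( x^{\left ( i \right )} \right )-f\left ( x^{\left ( j \right )} \right ) \right \|_{2}^{2}\enspace is\enspace large,\enspace if\enspace x^{(i)},x^{(j)}\enspace are\enspace different\enspace persons $$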

Now, we have a sense of what we want the neural network to output for us in terms of what would make a good encoding. However, we still don't know how to define an objective function that lets us actually train our neural network.

Let’s see how we can do that in the next post using the Triplet loss function.