#022 CNN Landmark Detection
Landmark Detection
In the previous post we saw how we can get a neural network to output \(4 \) numbers: \(b_{x} \), \(b_{y} \), \(b_{h} \), and \(b_{w} \) to specify the bounding box of an object we want the neural network to localize. In more general cases we can have a neural network which outputs just the \(x \) and \(y \) coordinates of important points in the image, sometimes called landmarks.
Let’s see a few examples. Let’s say we’re building a face recognition application, and for some reason we want the algorithm to tell us where the corner of someone’s eye is.
Landmark detection
Every point has an \(x \) and \(y \) coordinate, so we can just have a neural network whose final layer outputs two more numbers, which we will call \(l_{x} \) and \(l_{y} \), to specify the coordinates of a point, for example the corner of the person’s eye.
Now, what if we wanted the neural network to tell us all four corners of the eye, or both eyes? If we call the points the first, second, third, and fourth point, going from left to right, then we can modify the neural network to output \(l_{1x} \), \(l_{1y} \) for the first point, \(l_{2x} \), \(l_{2y} \) for the second point, and so on. The neural network can then output the estimated positions of all four of those points on the person’s face. What if we don’t want just those four points? What if we want to output many points? For example, we might want to output different positions along the eyes or the shape of the mouth, to see whether the person is smiling or not. We could define some number of points, for the sake of argument, let’s say \(64 \) points or \(64 \) landmarks on the face, maybe even some points that help us define the edge of the face, i.e. the jawline. By selecting a number of landmarks and generating a labeled training set that contains all of these landmarks, we can then have a neural network that tells us where all the key positions, or key landmarks, on a face are.
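To make the labeling concrete, here is a minimal sketch (using NumPy, with made-up coordinate values) of how \(64 \) landmarks, each an \((x,y) \) pair, might be flattened into a single \(128 \)-dimensional label vector in the order \(l_{1x}, l_{1y}, \dots, l_{64x}, l_{64y} \):

```python
import numpy as np

# Hypothetical example: 64 landmarks, each an (x, y) pair normalized to [0, 1]
num_landmarks = 64
rng = np.random.default_rng(0)
landmarks = rng.random((num_landmarks, 2))  # shape (64, 2): row k is (l_kx, l_ky)

# Flatten into a single label vector [l_1x, l_1y, l_2x, l_2y, ..., l_64x, l_64y]
label_vector = landmarks.reshape(-1)
print(label_vector.shape)  # (128,)

# Recover the k-th landmark (0-indexed) from the flat vector
k = 2
l_kx, l_ky = label_vector[2 * k], label_vector[2 * k + 1]
```

The same fixed ordering is used for every training image, which is exactly the label-consistency requirement discussed below.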
Landmark detection using a \(Convnet \)
So, what we do is take an image of a person’s face as input, have it go through a \(convnet \) to produce a set of features, and have the network output \(0 \) or \(1 \), for whether there is a face in the image or not, and then also output \(l_{1x}\), \( l_{1y} \), and so on down to \( l_{64x} \), \( l_{64y} \). We use \(l \) to stand for a landmark.
This example would have \(129 \) output units: \(1 \) for whether there is a face or not, and then, with \(64 \) landmarks, \(64 \times 2 = 128 \) coordinate outputs, plus that \(1\) presence output. This can tell us whether there is a face as well as where all the key landmarks on the face are. Of course, in order to train a network like this we will need a labeled training set: a set of images as well as labels \(Y \), where someone would have had to go through and laboriously annotate all of these landmarks.
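As a rough sketch (not the course’s actual implementation), the final layer of such a network can be modeled as a single dense layer over the convnet’s feature vector: one sigmoid unit for face presence plus \(128 \) linear units for the landmark coordinates. The feature dimension and weights below are hypothetical placeholders:

```python
import numpy as np

def landmark_head(features, W, b, num_landmarks=64):
    """Map a convnet feature vector to 1 presence score + 2*num_landmarks coords.

    features: shape (d,) feature vector produced by the convnet
    W: shape (1 + 2*num_landmarks, d) weight matrix
    b: shape (1 + 2*num_landmarks,) bias vector
    """
    z = W @ features + b                         # raw outputs, shape (129,)
    p_face = 1.0 / (1.0 + np.exp(-z[0]))         # sigmoid: is there a face?
    landmarks = z[1:].reshape(num_landmarks, 2)  # rows are (l_kx, l_ky) pairs
    return p_face, landmarks

# Toy usage with random (hypothetical) weights and a 256-d feature vector
rng = np.random.default_rng(1)
d = 256
W = rng.normal(scale=0.01, size=(129, d))
b = np.zeros(129)
features = rng.normal(size=d)

p_face, landmarks = landmark_head(features, W, b)
print(landmarks.shape)  # (64, 2)
```

In a real framework this head would simply be a fully connected layer of \(129 \) units attached after the convolutional feature extractor.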
Pose detection
If we are interested in detecting a person’s pose, we could also define a few key positions (as we can see in the picture below) like the midpoint of the chest, the left shoulder, the left elbow, the wrist, and so on. We then need a neural network to annotate the key positions of the person’s pose as well. By having the neural network output all of those annotated points, we can have it output the pose of the person.
Pose detection – an example
To do that we again need to specify these key landmarks, which may be \(l_{1x} \), \(l_{1y} \) for the midpoint of the chest, down to maybe \(l_{32x} \), \(l_{32y} \), if we use \(32 \) points to specify the pose of the person.
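One plausible training loss for such a network (a sketch, not necessarily the loss used in any particular course) combines a log loss on the presence output with a squared error on the landmark coordinates, where the landmark term only applies to images that actually contain the target:

```python
import numpy as np

def detection_loss(p_face, pred_landmarks, y_face, true_landmarks):
    """Combined loss: log loss on presence + squared error on landmarks.

    p_face: predicted probability that the target (face/person) is present
    pred_landmarks, true_landmarks: arrays of shape (K, 2) of (x, y) pairs
    y_face: 1 if the target is present in the image, else 0
    The landmark term is masked out when y_face == 0, since the
    ground-truth coordinates are undefined in that case.
    """
    eps = 1e-12  # avoid log(0)
    presence_loss = -(y_face * np.log(p_face + eps)
                      + (1 - y_face) * np.log(1 - p_face + eps))
    landmark_loss = y_face * np.sum((pred_landmarks - true_landmarks) ** 2)
    return presence_loss + landmark_loss

# Toy usage: a perfect prediction gives (near) zero loss
true = np.zeros((32, 2))
loss_perfect = detection_loss(1.0, true, 1, true)
loss_wrong = detection_loss(0.5, true + 0.1, 1, true)
```

Masking the coordinate term by \(y \) is the standard trick for ignoring outputs whose labels are undefined on negative examples.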
This idea might seem quite simple: just add a bunch of output units to output the \((x,y) \) coordinates of the different landmarks we want to recognize. To be clear, the identity of each landmark must be consistent across different images: maybe landmark \(1 \) is always one corner of the eye, landmark \(2 \) is always the other corner of the same eye, etc. The labels have to be consistent across different images.
That is it for landmark detection. Let’s take these building blocks and use them to start building up toward object detection in the following posts.