#009 3D Face Modeling – Face Tracking and Reconstruction
Highlights: Hello and welcome. Today we will continue our series of posts about 3D face modeling. In this post, we are going to talk about parametric face model tracking and reconstruction. In the previous posts, we talked about how to create a parametric face model. Now, let’s talk about what we can do once a face model is created. More specifically, how can we reconstruct a face from an image into a 3D form (3D mesh)? So, let’s begin with our post.
In this post, we are going to review the YouTube video “3D Scanning & Motion Capture: 9. Parametric Face Models”[1]. So let’s begin!
Image formation model
First, let’s refresh our knowledge of the image formation model. Take a look at the following image.
Here we can see a 3D face that is defined by the parameters \(P \). This parametric face model tells us how to create something we can feed into a rendering pipeline to produce a 2D image.
So, first of all, we have our blendshape bases (morphable model bases, defined by parameters \(\alpha \), \(\beta \), \(\gamma \), and so on). We use these parameters to produce a mesh out of our bases. We can use our pose, illumination, and albedo to transform, light, and color it accordingly. After that, we have our mesh, we can feed it into the rendering pipeline, and we get an image out of it. So, in other words, these parameters \(P \) tell us how to project a 3D mesh into a 2D image.
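To make the synthesis direction a bit more concrete, here is a minimal NumPy sketch of how a linear morphable model could turn parameters into vertex positions. All dimensions and names (e.g. `identity_basis`, `expression_basis`) are made up for illustration; they are not the actual model from the lecture.

```python
import numpy as np

# Hypothetical linear face model with N vertices (all numbers are made up).
N = 5000
mean_shape = np.zeros((N, 3))                         # average face geometry
identity_basis = 0.01 * np.random.randn(N, 3, 80)     # 80 identity components (alpha)
expression_basis = 0.01 * np.random.randn(N, 3, 64)   # 64 expression blendshapes (delta)

def synthesize_mesh(alpha, delta):
    """Linear synthesis: mean shape plus identity and expression offsets."""
    return mean_shape + identity_basis @ alpha + expression_basis @ delta

alpha = np.zeros(80)                       # identity parameters
delta = np.zeros(64)                       # expression parameters
vertices = synthesize_mesh(alpha, delta)   # (N, 3) mesh to feed into the renderer
print(vertices.shape)
```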
Now, the most important thing is that, after we have applied dimensionality reduction, our goal is to invert this process and reconstruct the model. For that, we will use the analysis-by-synthesis idea.
Analysis by synthesis
Remember that the parameters \(P \) are a low-dimensional approximation of the face. So, we might be able to figure out a way to go from a single input image to the optimal parameters \(P \) that produced this image.
Let’s explore the following example.
Here, we have a 2D input image in the upper left corner. We can synthesize our model with a random set of parameters \(P \). Then we can render it and overlay this rendering on top of the original input image. You can see the result in the bottom left corner. Notice that our model doesn’t look like the input image at all. Instead, it looks like just some random mask over the face that doesn’t really have any of the features and traits of the original face.
The next step is to compare this rendering with the original input image and optimize the parameters of the face. Our goal is to minimize the differences between these two images. In other words, we want to make our rendered model look more like the original image by changing the parameters \(P \) accordingly. Then, we get an updated set of parameters, and we re-render the model. In the following example, we can see that, after several iterations, our model already looks a bit more similar to the input image. The pose looks much better, as does the albedo.
However, the model doesn’t look ideal yet. So, we can continue to iterate over and over again until we find our final set of parameters. In the following example, we can see that our model looks pretty close to the input image.
So, this is an optimization process. Now, you may ask how exactly the parameters \(P \) are updated. In the previous paragraph, we explained that we just need to look at the difference between the two images. However, in practice, it’s actually not so straightforward. So, let’s dive deeper and explain this process in a little bit more detail.
Parameter Estimation as Energy Minimization
The first thing that we have to think about is what is our input. In the previous example, we showed you one single image. However, in practice, our input can be:
- Single-image
- Webcam
- YouTube videos
- RGB-D data from a Kinect
Now, let’s explore this process step by step.
Step one – render the images
The first step is to take our parameters \(P \) and use the GPU as our rendering engine. As an output, we get a rendered image, as you can see in the following image.
Step two – compare the input image with a rendered image
The second step is to compare whatever we have as an input (image, video), with whatever our current set of parameters \(P \) is rendering.
Step three – formulate an energy term
Next, we need to formulate an energy term. Let’s talk about this step in more detail. An energy term is given by the following equation.
$$ E(P)=E_{\text {dense }}(P)+E_{\text {sparse }}(P)+E_{\text {reg }}(P) $$
This is the energy equation. In this equation, we can see 3 different terms. First, we have the \(E _{dense} \) term. It is a data term based on the input images and the input depth maps. Typically, we have another data term, which is the sparse term \(E _{sparse} \). It uses a number of facial landmarks in order to get an initial alignment. The third term is the regularization \(E _{reg} \), which makes sure that the result is a reasonable face.
First, let’s focus on the \(E _{dense} \) term. Take a look at the following image.
In the illustration, we can see that the \(E _{dense} \) term compares literally every pixel of the two masks on the right side of the image above with its counterpart in the observed data shown on the left side of the image.
In practice, these are actually two terms: a color term and a depth term. What we want to do here is to figure out the optimal parameters \(P \) that minimize the distance between those two images.
$$ P^* = \underset{P}{\operatorname{argmin}}\, E_{\text{dense}}(P) $$
This process is relatively straightforward. We have two images of the same resolution: the input image and the rendered image. We go over every pixel in the input image and compare it to the corresponding pixel in the rendered image. We compute their squared difference, and we do the same thing for both the color and the depth terms. After that, we sum up all of these measures.
Now, simply summing over every pixel would be a little bit naive, because we would also consider the background. The background is actually not even part of our face model. So in practice, we only sum over the pixels covered by the rendered face model and then check the corresponding pixel value in the input image.
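As a rough sketch of this masked summation, the following NumPy snippet accumulates squared color and depth differences only over pixels covered by the rendered face model. The inputs (`rendered_rgb`, `face_mask`, and so on) are hypothetical placeholders, not the lecture’s actual data.

```python
import numpy as np

H, W = 480, 640
# Hypothetical inputs: a rendering of the current model and the observed data.
rendered_rgb = np.random.rand(H, W, 3)
rendered_depth = np.random.rand(H, W)
face_mask = rendered_depth > 0.5          # pixels actually covered by the face model
input_rgb = np.random.rand(H, W, 3)
input_depth = np.random.rand(H, W)

def dense_energy(rendered_rgb, rendered_depth, input_rgb, input_depth, face_mask,
                 w_color=1.0, w_depth=1.0):
    """Sum of squared color and depth differences over face-model pixels only."""
    color_diff = rendered_rgb[face_mask] - input_rgb[face_mask]      # (K, 3)
    depth_diff = rendered_depth[face_mask] - input_depth[face_mask]  # (K,)
    return w_color * np.sum(color_diff ** 2) + w_depth * np.sum(depth_diff ** 2)

print(dense_energy(rendered_rgb, rendered_depth, input_rgb, input_depth, face_mask))
```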
Now, let’s talk about this process in more detail. Notice that we have measurements in pixel space (take every pixel and measure the difference in color space) and measurements in world space (take the 3D values and compute their difference).
First, let’s talk about the color differences. Take a look at the following image.
There are a few things we can do to improve this process and make it a little bit more robust. Generally speaking, this is a photometric term: we compare every pixel on the face mask with what we have observed. The simplest option is to measure distances in RGB space. We have red, green, and blue channels, and we can take the two-norm between the two colors to measure the difference between them. The two-norm is just a measure of length given by the square root of the sum of squares. Also, note that we can use other color spaces.
The other thing that we have to consider here is that there are different norms we can use. We can use the two-norm between every pixel pair, but we can also use any other norm to compare the color values. We can even compare image gradients to get a little bit more robustness towards white balance changes. The bottom line is, there are a lot of different options for computing this photometric term.
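As an example of one such variant, here is a small sketch (assuming grayscale images given as NumPy arrays) that compares image gradients instead of raw intensities, which tends to be less sensitive to global color shifts such as white balance changes:

```python
import numpy as np

def gradient_term(rendered_gray, input_gray, face_mask):
    """Compare image gradients instead of raw intensities; this is less
    sensitive to global changes such as white balance or exposure."""
    gy_r, gx_r = np.gradient(rendered_gray)
    gy_i, gx_i = np.gradient(input_gray)
    diff = (gx_r - gx_i) ** 2 + (gy_r - gy_i) ** 2
    return np.sum(diff[face_mask])

H, W = 480, 640
print(gradient_term(np.random.rand(H, W), np.random.rand(H, W),
                    np.ones((H, W), dtype=bool)))
```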
We can also do some convexification. That addresses one problem we are often going to run into: photometric terms are generally non-convex. For example, imagine that we have a checkerboard pattern. When we overlay this pattern, we can easily converge to the wrong offsets. This is something we have to consider in practice.
Next, let’s talk about the depth differences. Take a look at the following image.
This process is a little bit easier than the previous one. Here, we take every pixel that lies on the face mask and project it to the respective depth value in the input depth map. Then, we have some distance that we need to compute, and we do this in world space. We have a 3D coordinate on the face mask that is stored at this highlighted pixel, and we compare it with the 3D point that is stored in the input depth map. So, how can we compute the distance between two points? Well, there are a couple of options. One thing we can do is calculate a point-to-point distance: we just go to the nearest neighbor and check the distance. However, we can also use other measures like the point-to-plane distance, the symmetric point-to-plane distance, and so on.
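Here is a minimal sketch of the two basic distance measures for a single correspondence; the point coordinates and the normal are made-up example values:

```python
import numpy as np

def point_to_point(p_model, p_data):
    """Squared Euclidean distance between corresponding 3D points."""
    return np.sum((p_model - p_data) ** 2)

def point_to_plane(p_model, p_data, n_data):
    """Squared distance of the model point to the plane through the data
    point with normal n_data; only the offset along the normal counts."""
    return np.dot(p_model - p_data, n_data) ** 2

p_model = np.array([0.01, 0.02, 0.50])   # 3D point stored at a face-mask pixel
p_data = np.array([0.00, 0.00, 0.52])    # 3D point from the input depth map
n_data = np.array([0.0, 0.0, 1.0])       # normal at the data point

print(point_to_point(p_model, p_data), point_to_plane(p_model, p_data, n_data))
```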
The next step is to combine the photometric and depth terms and optimize the problem. What we will realize is that this works really well if the 3D face is already pretty close to the image; it works relatively well despite not being truly convex. However, we first have to get a rough alignment in order to make sure that these renderings make any sense. What people typically use to solve this problem is the following sparse term. We can also call it a facial landmark term.
In the image above, we can see a facial landmark detector and the landmarks on the face and on the template model. We need to make sure that for every landmark point on the template model, we know where it is in the face. Then we compute a 2D image distance in order to help the optimization.
In practice, this term is relatively convex. Therefore, it’s much easier to optimize, because we have a one-to-one mapping by construction. So, if we have a landmark point detected on the face, we know exactly where the corresponding point is on the 3D mesh. Note that the landmark detectors may not be super accurate, but the rough high-level alignment works pretty well.
There are a couple of options for how to do landmark detection with different existing handcrafted feature detectors. This step is really important to get close enough to a coarse alignment so that we can run the dense optimization afterward. In practice, at the beginning of the optimization, we typically only use the sparse term. So, the sparse term initializes the process, and once we’re close enough, we give more and more weight to the dense terms.
Now, let’s talk about the regularization term.
This term is a prior that controls that we don’t deviate too much from the average face of our face model. So, what does it mean mathematically? First, we assume a Gaussian distribution in the parameter space. Practically speaking, we sum up the squared face parameters \(P _{i} \), each divided by its standard deviation, which penalizes the deviation from the mean.
Now, brace yourself. We will go very quickly over the math terms. Take a look at the following image.
First, we have the first two terms \(E _{geom}(P) \) and \(E _{col}(P) \) which are geometry and color. These are the dense terms. The third one \(E _{mrk}(P) \) is our sparse term, which is based on landmarks. And the last one \(E _{reg}(P) \) is our regularization term.
So let’s dive deeper into each of these terms.
First of all, we have our model \(M \), which takes parameters \(P \) as input; \(M _{i} \) denotes its \(i \)-th vertex. In this case, we’re going over every vertex. In the previous paragraphs, we explained that we would be going over every pixel, not every vertex. Do not be confused, because both of them are valid methods. However, for the sake of simplicity, we will use vertices here. The \(i \) could also index the pixels of the rendering, but then it would change every time we re-render, because we would be getting a different number of pixels.
Next, let’s talk about the geometric term \(E _{geom}(P) \).
$$ E_{\text{geom}}(P)=\sum_i \left\| M_i(P)-D\left(\pi\left(M_i(P)\right)\right)\right\|_2^2+\left[\left(M_i(P)-D\left(\pi\left(M_i(P)\right)\right)\right) \cdot n_i\right]^2 $$
It has two parts. It has a point-to-point term, and a point-to-plane term. These two terms can be re-weighted accordingly. In the equation above the weights are missing, but you can imagine just adding a lambda.
Okay, so what does it mean? We are re-rendering our model and we take a given vertex \(M _{i}(P) \). Its position is determined by the parameters \(P \), and we compare it with the corresponding entry of the depth map, \(D\left(\pi\left(M_i(P)\right)\right) \). Next, we do the same thing with the point-to-plane term, except that here we additionally multiply with the normal \(n _{i} \).
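Below is a small NumPy sketch of how such a geometric term could be evaluated. The pinhole intrinsics, the per-pixel 3D points `input_points` (playing the role of \(D \)), and the normals are all hypothetical stand-ins:

```python
import numpy as np

# Hypothetical camera intrinsics and observed data; all values are placeholders.
fx = fy = 500.0
cx, cy = 320.0, 240.0
H, W = 480, 640
input_points = np.random.rand(H, W, 3)                    # D(.): back-projected 3D point per pixel
normals = np.tile(np.array([0.0, 0.0, 1.0]), (H, W, 1))   # per-pixel normals n_i

def project(p):
    """pi(.): perspective projection of a 3D point to integer pixel coordinates."""
    return int(round(fx * p[0] / p[2] + cx)), int(round(fy * p[1] / p[2] + cy))

def e_geom(model_vertices, lambda_plane=1.0):
    """Point-to-point plus (weighted) point-to-plane residuals over all vertices."""
    energy = 0.0
    for m_i in model_vertices:
        u, v = project(m_i)
        if not (0 <= u < W and 0 <= v < H):
            continue                                      # vertex projects outside the image
        diff = m_i - input_points[v, u]                   # M_i(P) - D(pi(M_i(P)))
        energy += np.sum(diff ** 2)                       # point-to-point
        energy += lambda_plane * np.dot(diff, normals[v, u]) ** 2  # point-to-plane
    return energy

# Toy mesh roughly one meter in front of the camera.
vertices = np.concatenate([0.2 * np.random.rand(100, 2) - 0.1, np.ones((100, 1))], axis=1)
print(e_geom(vertices))
```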
For the color term \(E _{col}(P) \), we have the same idea.
$$ E_{\text{col}}(P)=\sum_i \left\| M_i(P)_c-I\left(\pi\left(M_i(P)\right)\right)\right\|_2^2 $$
Again, we go over all vertices and check out the model \(M _{i} \). We take a 3D point of the model based on the parameters \(P \), project it in the color image, and compare the distance in RGB space. In this case, we have a two-norm.
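A corresponding sketch for the color term, again with a hypothetical projection \(\pi \) and a random input image standing in for \(I \):

```python
import numpy as np

H, W = 480, 640
input_rgb = np.random.rand(H, W, 3)                   # observed color image I

def project(p, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """pi(.): perspective projection to integer pixel coordinates."""
    return int(round(fx * p[0] / p[2] + cx)), int(round(fy * p[1] / p[2] + cy))

def e_col(model_vertices, model_colors):
    """Sum of squared RGB differences between vertex colors M_i(P)_c and the
    image pixels they project to, I(pi(M_i(P)))."""
    energy = 0.0
    for m_i, c_i in zip(model_vertices, model_colors):
        u, v = project(m_i)
        if 0 <= u < W and 0 <= v < H:
            energy += np.sum((c_i - input_rgb[v, u]) ** 2)
    return energy

vertices = np.concatenate([0.2 * np.random.rand(50, 2) - 0.1, np.ones((50, 1))], axis=1)
colors = np.random.rand(50, 3)                        # per-vertex color from the model
print(e_col(vertices, colors))
```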
So, we explained two dense data terms. Now, we’re going to talk about the sparse terms. This is our landmark term \(E _{mrk}(P) \).
$$ E_{\text{mrk}}(P)=\sum_i\left\|\pi\left(M_i(P)_m\right)-C_i\right\|_2^2 $$
Note that the \(i \) in this equation does not go over every vertex on the face; it goes over every marker location. Then, we look up the marker location \(m \) based on \(P \). Then we feed it into \(\pi \) to project it into the current 2D image. Finally, we compare this with \(C_{i} \), which is the 2D position where the respective landmark was detected. Again, we have a two-norm between 2D points.
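A minimal sketch of the landmark term, assuming 68 marker vertices on the mesh and 68 detected 2D landmark positions (both made up here):

```python
import numpy as np

def project(p, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """pi(.): project a 3D marker vertex to 2D pixel coordinates."""
    return np.array([fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy])

def e_mrk(marker_vertices, detected_landmarks):
    """Squared 2D distances between projected model markers and detections C_i."""
    return sum(np.sum((project(m) - c) ** 2)
               for m, c in zip(marker_vertices, detected_landmarks))

# Toy data: 68 marker vertices on the mesh and 68 detected 2D landmark positions.
markers = np.concatenate([0.2 * np.random.rand(68, 2) - 0.1, np.ones((68, 1))], axis=1)
detections = np.random.rand(68, 2) * np.array([640.0, 480.0])
print(e_mrk(markers, detections))
```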
Note that it is always important to figure out the dimensionality. The first equation is the difference between points in 3D, the second equation is the difference of color values in RGB space, and the third equation is the difference in landmark coordinates in 2D.
Now, let’s talk about the last term, which is the regularization term \(E _{reg}(P) \).
$$ E_{r e g}(P)=\sum_i\left[\left(\frac{\alpha_i}{\sigma_{i d, i}}\right)^2+\left(\frac{\beta_i}{\sigma_{a l b, i}}\right)^2\right]+\sum_i\left(\frac{\delta_i}{\sigma_{e x p, i}}\right)^2 $$
Here, we have three sets of parameters. First, we have the \(\alpha \) parameters for the identity geometry and the \(\beta \) parameters for the albedo. These parameters are divided by their respective standard deviations. All we’re doing here is summing up the squared terms, and we want to minimize the deviation from the average face.
The last set of parameters, \(\delta \), is for the expressions. These delta blendshape weights are squared and divided by their respective standard deviations. Here, too, we try to minimize the deviation from the average face.
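And a sketch of the regularizer, with made-up parameter vectors and per-component standard deviations (in a real model these would come from the PCA of the training data):

```python
import numpy as np

def e_reg(alpha, beta, delta, sigma_id, sigma_alb, sigma_exp):
    """Sum of squared parameters, each scaled by its standard deviation."""
    return (np.sum((alpha / sigma_id) ** 2)
            + np.sum((beta / sigma_alb) ** 2)
            + np.sum((delta / sigma_exp) ** 2))

# Toy parameter vectors and made-up per-component standard deviations.
alpha = 0.1 * np.random.randn(80)    # identity (geometry)
beta = 0.1 * np.random.randn(80)     # albedo
delta = 0.1 * np.random.randn(64)    # expression
sigma_id, sigma_alb, sigma_exp = np.ones(80), np.ones(80), np.ones(64)
print(e_reg(alpha, beta, delta, sigma_id, sigma_alb, sigma_exp))
```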
So, this was the math part. Now we want to figure out, how we do optimization. Let’s write the simplified version of the equation again.
So, here again, we have the dense terms, the sparse term, and the regularizer; in other words, data terms and priors. And now we want to find the optimal parameters \(P \) that minimize this energy. However, this is an optimization in a high-dimensional space, and it is essentially a nonlinear least-squares problem.
To solve this problem efficiently, we want to use a second-order solver. In this case, we will use the Gauss-Newton method. Here’s a quick reminder of how the Gauss-Newton method works.
$$ \boldsymbol{E}(\boldsymbol{P})= \left\| \boldsymbol{F}(\boldsymbol{P})\right\|_2^2 $$
Iterative Update Rule:
$$ P_{n+1}=P_n-\underbrace{\left(J_F^T \cdot J_F\right)^{-1} \cdot J_F^T \cdot F}_{=\Delta} $$
$$ \Longrightarrow\left(J_F^T \cdot J_F\right) \cdot \Delta=J_F^T \cdot F $$
Here, we have our energy \(E \) and our parameters \(P \), and the energy is a sum of squared residuals. Then we apply the iterative update rule. So, we start with \(P_n \) in the \(n \)-th iteration, and we subtract the inverse of \(J_F^T \cdot J_F \) multiplied with \(J_F^T \cdot F \), which corresponds to the gradient. This whole update is the delta vector \(\Delta \). Next, we solve this linear system at every iteration, get the update, and then iterate accordingly.
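To illustrate the update rule, here is a generic Gauss-Newton sketch on a toy residual function. It uses a finite-difference Jacobian for simplicity; in the actual face-fitting pipeline the derivatives are computed analytically, including through the renderer, as discussed later in this post:

```python
import numpy as np

def gauss_newton(residual_fn, P, num_iters=10, eps=1e-6):
    """Generic Gauss-Newton: solve (J^T J) delta = J^T F and update P."""
    for _ in range(num_iters):
        F = residual_fn(P)
        # Numerical Jacobian via finite differences (analytic in real systems).
        J = np.zeros((F.size, P.size))
        for j in range(P.size):
            P_eps = P.copy()
            P_eps[j] += eps
            J[:, j] = (residual_fn(P_eps) - F) / eps
        delta = np.linalg.solve(J.T @ J, J.T @ F)
        P = P - delta
    return P

# Toy problem: fit P so that the residuals A @ P - b vanish (a linear case,
# where Gauss-Newton converges in a single step).
A = np.random.randn(20, 3)
b = np.random.randn(20)
print(gauss_newton(lambda P: A @ P - b, np.zeros(3)))
```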
Okay, so what does it mean? Again, as a reminder, these residual terms are the terms we already discussed in this post.
As you can see, we have the distances in depth space, the distances in color space, and the distances between landmarks. Also, we have the regularizer, which controls that we do not deviate too much from the average face.
The next thing that we have to compute is the Jacobian of this function \(F(P) \). The Jacobian matrix is illustrated in the following image.
Here, you can see the Jacobian of \(F \) with respect to the parameters \(P \). Every row contains partial derivatives, where each entry of the row is taken with respect to a different component of \(P \). So, the number of columns is equal to the number of unknowns in our parameter vector \(P \), which is exactly the dimensionality of the parameters of our face model. The depth term then contributes one row per dense residual; in other words, we get one row for each depth value that we projected.
Then we apply the same process for the color term, the landmarks, and the regularizer.
Some computations of these derivatives are easy and some of them are not. The easiest part is the regularizer at the bottom. It is just a sum of square terms. Computing landmarks and depth is also easy. However, the color part is not so straightforward. Here we would like to compute partial derivatives of the synthetic rendering with respect to the parameters \(P \).
In other words, we would like to know, if we change the parameters \(P \) in a certain way, how our 2D values are changing in RGB space. What this means in practice is that we need to differentiate through the rendering process.
So, we would like to understand how our rendering changes when we change the parameters. This is called differentiable rendering.
Now, we are going to cover this topic in more detail, because this process is not just relevant to these face models. This will be relevant to many other applications you’re going to use in practice if you ever want to do anything with analysis by synthesis.
Differentiable Rendering
How does differentiable rendering work? Let’s have a look at the following image.
The rendering takes these parameters \(P \) as input and constructs the parametric model (the mesh) accordingly. Then, it feeds the model into the GPU, and the GPU renders something out of it. What we would like to know is: how do we compute the partial derivative for the highlighted pixel in the following image?
In other words, we want to know how this pixel is being affected with respect to the parameters \(P \). For that, we have to understand the rendering process a little bit.
As we already learned, the mesh that we have is a set of triangles. Our goal is to feed these triangles into the rasterizer of the GPU. The rasterizer is figuring out which pixels are being covered by that specific triangle that we can see in the following image.
The rasterizer does the same process for all the triangles very fast and all in parallel. Afterward, the graphics pipeline has to figure out a way to assign a color to this pixel. The pixel color depends on the original triangle. To be more precise, once the rasterizer figures out which pixel is covered by the triangle, it checks which vertices were involved in this triangle to begin with.
As you can see in the image above, we have three points: \(p_{k} \), \(p_{i} \), and \(p_{j} \). These points are the vertices of the original mesh that we used. Now, the rasterizer finds an interpolation between these three vertices and figures out how to interpolate them for this one pixel. In other words, it interpolates the color values in order to find the color value for this pixel.
Now, let’s dive deeper into the following formula:
$$ C_{\text {pixel }}=\alpha \cdot C_{p_i}+\beta \cdot C_{p_j}+\gamma \cdot C_{p_k} $$
This formula is relatively straightforward. We have three color values for the three vertices: \(C_{p_i} \), \(C_{p_j} \), \(C_{p_k} \).
We also have three coefficients: \(\alpha \), \(\beta \), and \(\gamma \). These coefficients determine how to reweight the vertex colors in order to find the color for this one pixel. This equation represents a linear combination of the vertex colors. In practice, what the rasterizer does here is a barycentric interpolation. It knows where this pixel lies relative to the triangle, and from that it computes the barycentric coordinates. These \(\alpha \), \(\beta \), and \(\gamma \) coefficients are the barycentric coordinates given by the DirectX/OpenGL pipeline. Based on these coordinates, we can interpolate the color values of the vertices in order to get the final color for this pixel.
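Here is a small self-contained sketch of this barycentric color interpolation for a single pixel; the triangle coordinates and vertex colors are toy values:

```python
import numpy as np

def barycentric_coords(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    T = np.column_stack([b - a, c - a])        # 2x2 matrix of edge vectors
    beta, gamma = np.linalg.solve(T, p - a)    # solve for the two free coordinates
    alpha = 1.0 - beta - gamma
    return alpha, beta, gamma

def interpolate_color(p, tri_2d, tri_colors):
    """C_pixel = alpha*C_i + beta*C_j + gamma*C_k."""
    alpha, beta, gamma = barycentric_coords(p, *tri_2d)
    return alpha * tri_colors[0] + beta * tri_colors[1] + gamma * tri_colors[2]

tri_2d = [np.array([0.0, 0.0]), np.array([10.0, 0.0]), np.array([0.0, 10.0])]
tri_colors = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
print(interpolate_color(np.array([2.0, 3.0]), tri_2d, tri_colors))
```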
Now, if we want to make this whole thing differentiable, what we have to do is to first memorize which triangle it is. Based on this triangle, we know the three vertices, and based on these three vertices, we know the barycentric interpolation. Then, we can go back to the original parametric model.
Now, we know that this pixel was affected in a certain way by the parameters \(P \). Therefore we can do this partial derivative accordingly.
In practice, we store this information during the forward rendering pass. In other words, we just render the 3D mesh, and for each pixel, we store the ID of the triangle to which this pixel belongs. Also, we store the three barycentric coordinates.
Once we have stored these values, we can go back to our parametric model \(P \).
Here, the colors of these vertices are determined by the parametric model with parameters \(P \). For each of these vertices, the barycentric coordinates are given. Now we want to calculate how the color at this pixel is affected by the parameters \(P \). For that, we calculate the partial derivatives with respect to all three of these points; these are the partial derivatives of the parametric face model. Computing this is actually relatively straightforward, because it is all pretty much supported in the graphics pipeline. We just need to write the pixel shader accordingly to store the triangle and vertex IDs and the barycentric coordinates. Then we have our favorite differentiable rendering with a forward rasterizer implemented.
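As a sketch of how the stored information could be used in the backward pass, the snippet below scatters per-pixel color gradients back to vertex colors using the saved triangle IDs and barycentric coordinates. All buffers are toy placeholders; a real pipeline would do this in a shader or with an autodiff framework:

```python
import numpy as np

def backprop_pixel_colors(d_loss_d_pixels, triangle_ids, bary_coords, triangles, num_vertices):
    """Scatter per-pixel color gradients back to vertex colors using the
    triangle IDs and barycentric coordinates stored in the forward pass.
    dC_pixel/dC_vertex is simply the barycentric weight of that vertex."""
    d_vertex_colors = np.zeros((num_vertices, 3))
    H, W = triangle_ids.shape
    for y in range(H):
        for x in range(W):
            tri = triangle_ids[y, x]
            if tri < 0:
                continue                      # background pixel, no face triangle
            for k in range(3):                # the three vertices of this triangle
                v = triangles[tri, k]
                d_vertex_colors[v] += bary_coords[y, x, k] * d_loss_d_pixels[y, x]
    return d_vertex_colors

# Toy buffers, as a rasterizer might write them in the forward pass.
H, W, num_vertices = 4, 4, 3
triangles = np.array([[0, 1, 2]])             # a single triangle
triangle_ids = np.zeros((H, W), dtype=int)    # every pixel covered by triangle 0
bary = np.full((H, W, 3), 1.0 / 3.0)          # constant barycentric weights (toy)
d_loss = np.ones((H, W, 3))                   # gradient of the loss w.r.t. each pixel color
print(backprop_pixel_colors(d_loss, triangle_ids, bary, triangles, num_vertices))
```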
Summary
Okay, that is all. In this post, we talked about parametric face model tracking and reconstruction. We learned about an important idea called analysis by synthesis, and we learned how to estimate the parameters by minimizing an energy.
References:
[1] – 3D Scanning & Motion Capture: 9. Face Tracking & Reconstruction – Prof. Matthias Nießner