#015 3D Face Modeling – The geometric image formation

Highlight: In this post, we are going to talk about the geometric image formation process and some basic camera models. We will explain in detail how 3D points and lines are projected onto 2D planes.

How it all started: inside an entirely dark room with only a tiny pinhole in the door, the outside scene was projected upside down onto the opposite wall by the light passing through the hole. This is an example of the so-called Camera Obscura.

Several artists have used this principle[1]: an ordinary room is illuminated only through a very small pinhole, and the image projected inside the room is captured with a very long shutter time.

Introduction

The most basic pinhole camera can be described as a box with a tiny pinhole in one of its walls. The hole lets light pass through and hit the opposite wall, each ray continuing in the same direction it came in.

Looking at the image above, we can visualize a pinhole camera model and an object in the 3D scene. A light ray travels from the top point of the object through the pinhole and hits the wall inside the camera. Because light does not change direction, but enters the camera along the same line it arrived on, the object is projected upside down on the wall.

Usually, when working with camera models, we place the image plane not behind the pinhole, but in front of it, as in the image below.

In this case, the image is not upside down, but it keeps the same dimensions as if the image plane were behind the focal point. This is due to a parameter called the focal length, the distance between the image plane and the focal point. When the image plane is behind the focal point, the focal length is negative; when it is in front, it is positive. Apart from the sign, the value is the same, so the projected object has the same dimensions.

Projection Models

There are several ways of projecting a 3D scene onto a 2D plane. The two important models that we are going to discuss are the Orthographic and the Perspective projection.

The Orthographic projection assumes that light rays travel parallel to each other, so the 3D scene is projected along parallel rays.

On the other hand, the Perspective projection model is the one we discussed previously (the pinhole camera model). Here we have a focal point, and all the light rays coming from the 3D scene must pass through it. Since the image plane sits in front of the focal point, each ray intersects the image plane on its way there, forming the image. With this type of projection, real-world distances are not preserved: the farther an object is from the camera, the smaller it appears in the image.

One of the key differences between these two projections is the size of the projected objects. Under the Orthographic projection an object always has the same dimensions, while under the Perspective projection its dimensions depend on its distance from the camera and on the focal length.

Mathematical understanding of projection models

In all the illustrations that follow, we will show only the \(x\) and \(z\) coordinates. We omit \(y\) because all the computations and visualizations done for \(x\) are the same for \(y\).

In the image above, we can see the Orthographic projection model. The camera center is where the camera coordinate system is defined. For this projection, the light rays travel parallel to the \(z\) axis (the principal axis), so the \(x\) values remain the same. When projecting from 3D to 2D with this model, we simply drop the \(z\) coordinate and keep the \(x\) values as they are; the same holds for the \(y\) coordinate.

This can be written in the following way:

$$ x_s = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} x_c \Longleftrightarrow \bar{x}_s = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \bar{x}_c $$
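To make this concrete, here is a minimal NumPy sketch of the orthographic projection (the variable names and the example point are our own, purely illustrative):

```python
import numpy as np

# Orthographic projection: drop the z coordinate, keep x and y.
M_ortho = np.array([[1, 0, 0],
                    [0, 1, 0]])

x_c = np.array([0.5, -0.2, 3.0])  # a 3D point in camera coordinates
x_s = M_ortho @ x_c               # 2D projection: [0.5, -0.2], z is discarded
print(x_s)
```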

The orthographic projection is mainly used to model telecentric and telephoto lenses. After projection, the distance of the 3D point from the image can't be recovered.

In practice, we will not use this model directly, because the image sensor measures in pixels rather than millimeters or meters, so we need to scale the coordinates to match the pixels. Therefore, we introduce another type of projection, the Scaled Orthographic projection: we simply replace the ones in the previous matrix with a scalar value \(s\).

$$ x_s = \begin{bmatrix} s & 0 & 0 \\ 0 & s & 0 \end{bmatrix} x_c \Longleftrightarrow \bar{x}_s = \begin{bmatrix} s & 0 & 0 & 0 \\ 0 & s & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \bar{x}_c $$

So \(s\) represents a ratio between two units, i.e., how we convert from one unit to the other. For example, it can be expressed in \(pixels/meter\) or \(pixels/millimeter\).
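As a small illustration of where such a scale could come from, the sketch below derives \(s\) from made-up sensor specifications (the numbers are assumptions, not real hardware values):

```python
# s converts millimeters on the sensor into pixels.
sensor_width_mm = 36.0    # assumed physical sensor width
sensor_width_px = 7200.0  # assumed pixel count across that width
s = sensor_width_px / sensor_width_mm  # 200 pixels per millimeter

x_mm, y_mm = 0.5, -0.2            # projected point in millimeters
x_px, y_px = s * x_mm, s * y_mm   # same point in pixels: (100.0, -40.0)
```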

We now come to the Perspective projection, the model behind all traditional cameras, where all the light rays converge to a single focal point, denoted as the camera center in the figure below.

The figure above shows how a 3D point in the scene, \(x_c \in \mathbb{R}^3\), projects to pixel coordinates \(x_s \in \mathbb{R}^2\). The image plane is displaced from the camera center, and all light rays must pass through the image plane to arrive at the camera center, or focal point. The principal axis is orthogonal to the image plane and aligned with the \(z\) axis.

Now, how does the 3D point project onto the image plane? We know the point's \(x_c\), \(y_c\), and \(z_c\) coordinates, and also the focal length, a camera parameter indicating the distance between the focal point (the camera center) and the image plane.

We know that the ratio \(\frac{x_c}{z_c}\) must equal the ratio \(\frac{x_s}{f}\). Visually, the small triangle to the left of the image plane and the big triangle are similar: their angles are the same, and their sides differ only by a scale factor. So by computing this ratio and multiplying it by the known focal length, we get the location on the image plane.
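As a quick worked example with made-up numbers: for a focal length \(f = 50\,\text{mm}\) and a point at \(x_c = 1\,\text{m}\), \(z_c = 10\,\text{m}\), we get

$$ x_s = f \, \frac{x_c}{z_c} = 50\,\text{mm} \cdot \frac{1\,\text{m}}{10\,\text{m}} = 5\,\text{mm} $$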

Mathematically, it can be expressed as:

$$ \begin{pmatrix} x_s \\ y_s \end{pmatrix} = \begin{pmatrix} fx_c/z_c \\ fy_c/z_c \end{pmatrix} \Longleftrightarrow \tilde{x}_s = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \bar{x}_c $$

This projection is linear only in homogeneous coordinates, and once the projection is done, the 3D point cannot be retrieved, since its depth is lost.
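The following sketch shows this projection in homogeneous coordinates with NumPy (the focal length and the point are illustrative assumptions):

```python
import numpy as np

f = 50.0  # assumed focal length
P = np.array([[f, 0, 0, 0],
              [0, f, 0, 0],
              [0, 0, 1, 0]])

x_c_bar = np.array([1.0, 0.4, 10.0, 1.0])  # homogeneous 3D point (x, y, z, 1)
x_tilde = P @ x_c_bar                      # (f*x, f*y, z)
x_s = x_tilde[:2] / x_tilde[2]             # divide by z: [5.0, 2.0]
```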

So far, the origin of the image coordinate system lies at the center of the image plane, where the principal axis (usually \(z\)) intersects it. With this convention, half of the pixels have negative coordinates, which is inconvenient to store.

In practice, the origin is therefore translated to the top-left corner of the image plane, where it has coordinates (0, 0), as illustrated in the image below.

This way, all pixel coordinates are positive. To achieve this, two offsets \((c_x, c_y)\), the pixel coordinates of the principal point, are added to the projected coordinates.

To do this, we need to modify our perspective projection model. You can see the new formula below.

$$ \begin{pmatrix} x_s \\ y_s \end{pmatrix} = \begin{pmatrix} f_xx_c/z_c + sy_c/z_c + c_x \\ f_yy_c/z_c + c_y \end{pmatrix} \Longleftrightarrow \tilde{x}_s = \begin{bmatrix} f_x & s & c_x & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \bar{x}_c $$

In addition to the ratio calculation and the multiplication by the focal length, we now add the \(c\) offsets. We have also allowed the focal length in the \(x\) direction to differ from the one in \(y\). Finally, a skew factor \(s\) was added, which models sensors that are not mounted perpendicular to the optical axis due to manufacturing inaccuracies.

The left \(3\times3\) submatrix of the \(3\times4\) matrix on the right is usually called the calibration matrix \(K\), also known as the intrinsic matrix because it stores the camera intrinsics.
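A minimal sketch of building such a calibration matrix and projecting a point with it, assuming illustrative parameter values rather than a real calibration:

```python
import numpy as np

f_x, f_y = 800.0, 790.0  # focal lengths in pixels, allowed to differ
skew = 0.0               # skew, usually close to zero in practice
c_x, c_y = 320.0, 240.0  # principal point, e.g. the center of a 640x480 image

K = np.array([[f_x, skew, c_x],
              [0.0,  f_y, c_y],
              [0.0,  0.0, 1.0]])

x_c = np.array([0.5, -0.2, 4.0])  # 3D point in camera coordinates
x_h = K @ x_c                     # homogeneous pixel coordinates
x_s = x_h[:2] / x_h[2]            # [f_x*x/z + c_x, f_y*y/z + c_y] = [420.0, 200.5]
```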

Chaining transformations

Let's see what we can do if a 3D point is represented not in the camera coordinate system but in the world coordinate system.

Well, we can chain the transformations: a rigid body transformation, consisting of a rotation matrix \(R\) and a translation vector \(t\) (the extrinsic parameters), first maps the point into camera coordinates. We then multiply the result by the camera intrinsic matrix \(K\) to get the 2D coordinates on the image plane.

In most cases, it is more convenient to multiply the intrinsic and extrinsic matrices together and apply a single transformation to the coordinates.

$$ \tilde{x}_s = \begin{bmatrix} K & 0 \end{bmatrix} \bar{x}_c = \begin{bmatrix} K & 0 \end{bmatrix} \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} \bar{x}_w = K \begin{bmatrix} R & t \end{bmatrix} \bar{x}_w = P \bar{x}_w $$

This way we obtain a \(3\times4\) projection matrix that directly maps a 3D scene from the world coordinates to the screen.
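As a sketch of this chaining (with an assumed identity rotation and a small translation, purely for illustration):

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                        # assumed: camera aligned with world axes
t = np.array([[0.0], [0.0], [2.0]])  # assumed: world origin 2 units ahead

P = K @ np.hstack([R, t])            # 3x4 projection matrix P = K [R | t]

x_w_bar = np.array([0.5, -0.2, 4.0, 1.0])  # homogeneous world point
x_h = P @ x_w_bar
x_s = x_h[:2] / x_h[2]               # pixel coordinates on the screen
```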

Full Rank Representation

In some cases, it is preferable to use a full-rank \(4\times4\) projection matrix. We simply append a last row of zeros with a one as its last element.

$$ \bar{x}_s = \begin{bmatrix} K & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} \bar{x}_w = \tilde{P} \bar{x}_w $$

When the two matrices are multiplied together, we get a \(4\times4\) matrix, and multiplying it with a homogeneous world point gives a 4D homogeneous vector representing a 3D point. Because it represents a point on the image plane, we need to normalize it with respect to its third coordinate, as shown below:

$$ \bar{x}_s = \tilde{x}_s/z_s = (x_s/z_s,y_s/z_s,1,1/z_s)^T $$

The first and second elements are the coordinates on the image, and the third is one by definition. The last element is the so-called inverse depth, which, combined with the full-rank \(4\times4\) matrix, lets us go back from the 2D point to the 3D point in the world coordinate system.
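Here is a sketch of the full round trip under the same illustrative assumptions as above: project a world point with the full-rank matrix, normalize, and use the stored inverse depth to recover the original point:

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R, t = np.eye(3), np.array([[0.0], [0.0], [2.0]])  # assumed extrinsics

K4 = np.block([[K, np.zeros((3, 1))], [np.zeros((1, 3)), np.eye(1)]])
E = np.block([[R, t], [np.zeros((1, 3)), np.eye(1)]])
P_full = K4 @ E                            # invertible 4x4 projection matrix

x_w = np.array([0.5, -0.2, 4.0, 1.0])      # homogeneous world point
x_tilde = P_full @ x_w
x_bar = x_tilde / x_tilde[2]               # (x_s, y_s, 1, 1/z_s)

z_s = 1.0 / x_bar[3]                       # depth recovered from inverse depth
x_w_back = np.linalg.inv(P_full) @ (x_bar * z_s)  # equals x_w again
```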

Summary

We have come to an end. In this post, we talked about two main projection models, the Orthographic and Perspective projection models. We have seen several examples of how 3D scenes are projected into 2D and also the mathematics behind these approaches.

References

[1] Morell, A. (2022) Camera Obscura, Abelardo Morell. Available at: https://www.abelardomorell.net/camera-obscura.