## #011 Linear Algebra – Nonlinear Least Squares

*Highlights:* In real-world scenarios, models rarely produce linear graphs. Most of the time, the equation of the model involves higher-order and higher-degree functions.

In this post, we will learn how to solve harder, nonlinear equations using heuristic algorithms that find a least-squares approximate solution. So let’s begin.

**Tutorial Overview:**

**1. Nonlinear Equations**

**2. Nonlinear Least Squares**

**3. Optimality Condition**

**4. Gauss-Newton Algorithm**

### Introduction

Practically, almost all problems can be represented as a set of nonlinear equations. However, solving them is a little more complex than the methods we used for solving linear equations.

Let us consider a set of \(m \) nonlinear equations and \(n \) variables. We represent the variables as an \(n \)-dimensional vector as follows.

$$ x=\left(x_{1}, \ldots, x_{n}\right) $$

And, the equations as a scalar-valued function represented below.

$$ f_{i}(x)=0, \quad i=1, \ldots, m $$

In the above representations, \(f_{i}(x) \) is known as the \(i \)-th residual since this is the value we need to be equal to zero. We often write the above expression in a more compact vector form as follows.

$$ f(x)=0 $$

Here, \(f(x)=\left(f_{1}(x), \ldots, f_{m}(x)\right) \) is an \(m \)-dimensional residual vector, viewed as a function of the \(n \)-dimensional variable vector. The zero vector on the right-hand side also has \(m \) dimensions. Therefore, our goal is to find an \(x \) associated with zero residual.

It is important to note that, just as in the case of linear equations (where \(f \) is affine), we classify the system by the relative sizes of \(m \) and \(n \). When the number of equations is greater than the number of variables, i.e., \(m>n \), the system of equations is known as *over-determined*.

When \(m=n \), the system of equations is called *square*.

And, when \(m<n \), the system of equations is called *under-determined*.

### Difficulty Of Solving Nonlinear Equations

Unlike linear equations, solving nonlinear equations is a very difficult computational problem: a set of nonlinear equations can have zero, one, or infinitely many solutions. Even determining whether a given set of nonlinear equations has any solution is a complex task.

Non-heuristic algorithms exist that solve nonlinear equations exactly. However, they are complicated, computationally demanding, and hardly used in applications.

This is the reason heuristic algorithms are preferred: they tend to produce a good, if not always the best possible, point \(x\) with a small residual.

In the upcoming sections, we will learn about some heuristic algorithms for solving nonlinear equations using the least-squares method that also satisfy the *Optimality Condition*. We will also get into further detail about the *Optimality Condition* going ahead.

But first, let us see how we can generalise a rule for solving nonlinear equations using the least squares method.

## 2. Nonlinear Least Squares

As in the case of linear equations, if we can’t find an exact solution to a set of equations, we must aim for an approximate solution by finding an \(x \) that minimizes the sum of squares of the residual functions. So, for the equation \(f(x)=0 \), we can write this sum of squares as follows.

$$ f_{1}(x)^{2}+\cdots+f_{m}(x)^{2}=\|f(x)\|^{2} $$
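As a concrete sketch, here is how the squared residual norm \(\|f(x)\|^{2}\) might be evaluated in plain Python. The particular residual function below is a made-up example for illustration, not one from the text.

```python
# Hypothetical residual function f: R^2 -> R^3 (three equations, two variables),
# chosen only to illustrate the sum-of-squares objective.
def f(x):
    x1, x2 = x
    return [x1**2 - 1.0, x2 - 2.0, x1 * x2 - 3.0]

def sum_of_squares(x):
    # ||f(x)||^2 = f_1(x)^2 + ... + f_m(x)^2
    return sum(fi**2 for fi in f(x))

# At (1, 2) only the third residual is nonzero (1*2 - 3 = -1), so the sum is 1.0
print(sum_of_squares([1.0, 2.0]))  # 1.0
```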

To solve the nonlinear least-squares problem above, we must find an \(\hat{x} \) such that, for all \(x \),

$$ \|f(x)\|^{2} \geq\|f(\hat{x})\|^{2} $$

This point represents the least-squares approximate solution, which is nothing but the solution of the problem

$$ \text{minimize} \quad\|f(x)\|^{2} $$

In other words, we look for the \(n \)-dimensional vector \(x \) that makes \(\|f(x)\|^{2} \) as small as possible; if that minimum value is zero, \(\hat{x} \) is an exact solution of \(f(x)=0 \).

Now, in order to understand heuristic algorithms, we need to first learn about the Optimality Condition, a term we previously mentioned.

## 3. Optimality Condition

Basically, the Optimality Condition is a necessary but not sufficient condition: every solution \(\hat{x} \) satisfies it. However, this condition is not enough to guarantee that a point is a solution to the nonlinear least-squares problem.

So, every solution must satisfy the Optimality Condition, but other points, which are not solutions, may satisfy it as well.

In mathematical terms, the partial derivative of \(\|f(x)\|^{2} \) with respect to each of \(x_{1}, \ldots, x_{n} \) must vanish at \(\hat{x} \). Here’s how we represent it.

$$ \frac{\partial}{\partial x_{i}}\|f(\hat{x})\|^{2}=0, \quad i=1, \ldots, n $$

We can also write the above expression in its vector form, \(\nabla\|f(\hat{x})\|^{2}=0 \). This gradient can be expressed as:

$$ \nabla\|f(x)\|^{2}=\nabla\left(\sum_{i=1}^{m} f_{i}(x)^{2}\right)=2 \sum_{i=1}^{m} f_{i}(x) \nabla f_{i}(x)=2 D f(x)^{T} f(x) $$

Here, the \(m \times n \) matrix \(D f(x) \) is the derivative or Jacobian matrix of the function \(f \) at the point \(x \), which is the matrix of its partial derivatives.

Therefore, if \(\hat{x} \) minimizes \(\|f(x)\|^{2} \), it must satisfy

$$ 2 D f(\hat{x})^{T} f(\hat{x})=0 $$
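To see the condition in action, here is a small sketch that evaluates \(2 D f(x)^{T} f(x)\) for a made-up affine residual function (the choice of \(f\) and its minimizer below are assumptions for illustration).

```python
# Check the Optimality Condition 2 Df(x)^T f(x) = 0 at a candidate point.
# Made-up affine example with m = 3 equations and n = 2 variables:
#   f(x) = (x1 - 1, x2 - 2, x1 + x2 - 3), whose least-squares minimizer is (1, 2).
def f(x):
    x1, x2 = x
    return [x1 - 1.0, x2 - 2.0, x1 + x2 - 3.0]

# For an affine f the Jacobian Df is a constant matrix; row i is the gradient of f_i.
DF = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]

def gradient(x):
    # 2 Df(x)^T f(x), computed entry by entry
    fx = f(x)
    return [2.0 * sum(DF[i][j] * fx[i] for i in range(3)) for j in range(2)]

print(gradient([1.0, 2.0]))  # [0.0, 0.0] -> Optimality Condition holds
print(gradient([0.0, 0.0]))  # nonzero, so (0, 0) cannot be a minimizer
```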

That’s the Optimality Condition which all heuristic algorithms must necessarily satisfy. Let us, now, move ahead to understand a bit about some of the heuristic algorithms used to solve nonlinear least-squares problems.

## 4. Gauss-Newton Algorithm

One of the most important heuristic algorithms for solving the nonlinear least-squares problem is the Gauss-Newton Algorithm, an iterative algorithm. A variation that addresses some of its shortcomings is known as the Levenberg-Marquardt Algorithm, which is also iterative.

Both the Gauss-Newton and Levenberg-Marquardt algorithms generate a sequence of points \(x^{(1)}, x^{(2)}, \ldots \). The vector \(x^{(1)} \) is the starting point of the algorithm and \(x^{(k)} \) is called the \(k \)-th iterate. An iteration is when we move from \(x^{(k)}\) to \(x^{(k+1)} \).

The quality of each iterate is judged by its associated residual norm, \(\left\|f\left(x^{(k)}\right)\right\| \), or its square. The algorithm terminates once \(\left\|f\left(x^{(k)}\right)\right\| \) is small enough, when \(x^{(k+1)}\) is very near \(x^{(k)} \), or when a maximum number of iterations is reached.

Let us understand the basic architecture of the Gauss-Newton Algorithm.

### Basic Gauss-Newton Algorithm

Simply put, the Gauss-Newton Algorithm is a combination of Calculus and Least Squares.

The algorithm involves alternating between finding an affine approximation of the function \(f \) at the current iterate and solving the associated linear least squares problem to find the next iterate.

The affine approximation \(\hat{f} \) of \(f \) at the current iterate \(x^{(k)} \) is given by the Taylor approximation for each iteration \(k \), as follows.

$$ \hat{f}\left(x ; x^{(k)}\right)=f\left(x^{(k)}\right)+D f\left(x^{(k)}\right)\left(x-x^{(k)}\right) $$

Here, the \(m \times n \) matrix \(D f\left(x^{(k)}\right) \) is the Jacobian or derivative matrix of \(f \). The affine function \(\hat{f}\left(x ; x^{(k)}\right) \) is a very good approximation of \(f(x) \) provided \(x \) is near \(x^{(k)} \), i.e., \(\left\|x-x^{(k)}\right\| \) is small.

Then, the subsequent iterate \(x^{(k+1)} \) is taken to be the minimizer of \(\left\|\hat{f}\left(x ; x^{(k)}\right)\right\|^{2} \), the squared norm of the affine approximation of \(f \) at \(x^{(k)} \). Assuming that the derivative matrix \(D f\left(x^{(k)}\right)\) has linearly independent columns (which requires \(m \geq n \)), we can say:

$$ x^{(k+1)}=x^{(k)}-\left(D f\left(x^{(k)}\right)^{T} D f\left(x^{(k)}\right)\right)^{-1} D f\left(x^{(k)}\right)^{T} f\left(x^{(k)}\right) $$

The result of this iteration is, in fact, the basic Gauss-Newton Algorithm.

Given a differentiable function \(f: \mathbf{R}^{n} \rightarrow \mathbf{R}^{m} \) and an initial point \(x^{(1)} \), repeat for \(k=1,2, \ldots, k^{\max } \):

**1. Form affine approximation at current iterate using calculus.** Evaluate the Jacobian \(D f\left(x^{(k)}\right) \) and set

$$ \hat{f}\left(x ; x^{(k)}\right)=f\left(x^{(k)}\right)+D f\left(x^{(k)}\right)\left(x-x^{(k)}\right) $$

**2. Update iterate using linear least squares.** Set \(x^{(k+1)} \) as the minimizer of \(\left\|\hat{f}\left(x ; x^{(k)}\right)\right\|^{2} \):

$$ x^{(k+1)}=x^{(k)}-\left(D f\left(x^{(k)}\right)^{T} D f\left(x^{(k)}\right)\right)^{-1} D f\left(x^{(k)}\right)^{T} f\left(x^{(k)}\right) $$

Now, if \(f\left(x^{(k+1)}\right) \) is very small or \(x^{(k+1)} \) is very near \(x^{(k)} \), the Gauss-Newton Algorithm terminates. This goes to show that when the Optimality Condition \(2 D f(\hat{x})^{T} f(\hat{x})=0 \) holds at an iterate, the Gauss-Newton Algorithm stops there.
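The iteration above can be sketched in plain Python for the simplest case \(n=1\), where all the matrix operations collapse to scalar arithmetic. The residual function, starting point, and tolerances below are assumptions chosen for illustration.

```python
# A minimal Gauss-Newton sketch for n = 1 with the made-up residual
#   f(x) = (x^2 - 2, x - 1.4)   and   Df(x) = (2x, 1).
def f(x):
    return [x**2 - 2.0, x - 1.4]

def Df(x):
    return [2.0 * x, 1.0]

def gauss_newton(x, iters=20, tol=1e-10):
    for _ in range(iters):
        fx, dfx = f(x), Df(x)
        # Least-squares update: (Df^T Df)^{-1} Df^T f, which is scalar for n = 1
        step = sum(d * r for d, r in zip(dfx, fx)) / sum(d * d for d in dfx)
        x -= step
        if abs(step) < tol:  # x^{(k+1)} very near x^{(k)} -> terminate
            break
    return x

x_hat = gauss_newton(1.0)
# At x_hat the Optimality Condition 2 Df^T f = 0 holds (approximately)
grad = 2.0 * sum(d * r for d, r in zip(Df(x_hat), f(x_hat)))
print(abs(grad) < 1e-8)  # True
```

Note that the final residual is small but not zero here: the two equations cannot be satisfied simultaneously, so Gauss-Newton settles on a least-squares compromise.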

### Shortcomings Of The Gauss-Newton Algorithm

The Gauss-Newton Algorithm usually works really well. The iterates \(x^{(k)} \) converge quite quickly to a point with a small residual. However, there are a couple of issues with this algorithm.

First, the residual \(\left\|f\left(x^{(k)}\right)\right\| \) may increase rather than decrease and reach a large value, in which case the algorithm diverges and fails.

And second, the assumption we made about the linear independence of the columns of the derivative matrix \(D f\left(x^{(k)}\right) \) often doesn’t hold, in which case the algorithm stops since \(x^{(k+1)} \) is not defined.

These shortcomings can be rectified by modifying the Gauss-Newton Algorithm into a new one called the Levenberg-Marquardt Algorithm. However, we won’t go into much detail about this algorithm in this post.

Let us move ahead and understand a special case of the Gauss-Newton Algorithm.

### Newton Algorithm

The Gauss-Newton Algorithm transforms into the Newton Algorithm, sometimes known as the Newton-Raphson Algorithm, when \(m=n \).

When \(m=n \), the matrix \(D f\left(x^{(k)}\right) \) is square. Therefore, we can simplify the basic Gauss-Newton iteration we learnt above as follows.

$$\begin{aligned}x^{(k+1)} &=x^{(k)}-\left(D f\left(x^{(k)}\right)\right)^{-1}\left(D f\left(x^{(k)}\right)^{T}\right)^{-1} D f\left(x^{(k)}\right)^{T} f\left(x^{(k)}\right) \\&=x^{(k)}-\left(D f\left(x^{(k)}\right)\right)^{-1} f\left(x^{(k)}\right)\end{aligned} $$

The result of the above iteration gives us the Newton Algorithm.

Given a differentiable function \(f: \mathbf{R}^{n} \rightarrow \mathbf{R}^{n} \) and an initial point \(x^{(1)} \), repeat for \(k=1,2, \ldots, k^{\max } \):

**1. Form affine approximation at current iterate.** Evaluate the Jacobian \(D f\left(x^{(k)}\right) \) and set

$$ \hat{f}\left(x ; x^{(k)}\right)=f\left(x^{(k)}\right)+D f\left(x^{(k)}\right)\left(x-x^{(k)}\right) $$

**2. Update iterate by solving linear equations.** Set \(x^{(k+1)} \) as the solution of \(\hat{f}\left(x ; x^{(k)}\right)=0 \):

$$ x^{(k+1)}=x^{(k)}-\left(D f\left(x^{(k)}\right)\right)^{-1} f\left(x^{(k)}\right) $$
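For a square system we can carry the Newton iteration out directly. Below is a sketch for a hypothetical \(m=n=2\) system (the equations, starting point, and iteration count are all assumptions for illustration), with the \(2 \times 2\) Jacobian inverted explicitly.

```python
import math

# Newton iteration for a made-up square system (m = n = 2):
#   f(x, y) = (x^2 + y^2 - 4, x - y), which has a solution at (sqrt(2), sqrt(2)).
def f(v):
    x, y = v
    return [x**2 + y**2 - 4.0, x - y]

def Df(v):
    x, y = v
    return [[2.0 * x, 2.0 * y],
            [1.0, -1.0]]

def newton_step(v):
    # x^{(k+1)} = x^{(k)} - Df(x^{(k)})^{-1} f(x^{(k)}), via an explicit 2x2 inverse
    (a, b), (c, d) = Df(v)
    det = a * d - b * c  # must be nonzero for the step to be defined
    f1, f2 = f(v)
    dx = (d * f1 - b * f2) / det  # first component of Df^{-1} f
    dy = (a * f2 - c * f1) / det  # second component of Df^{-1} f
    return [v[0] - dx, v[1] - dy]

v = [1.0, 2.0]
for _ in range(10):
    v = newton_step(v)
print(v)  # converges to about [1.41421, 1.41421]
```

Unlike the Gauss-Newton example, a square system that has a solution is driven all the way to \(f(x)=0\), not just to a least-squares compromise.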

The Newton Algorithm is much easier to understand for \(n=1 \), in which case the iteration can be written as

$$ x^{(k+1)}=x^{(k)}-f\left(x^{(k)}\right) / f^{\prime}\left(x^{(k)}\right) $$

We can see the graphical representation of this iteration in the image below.

Now, to update \(x^{(k)} \) we form the Taylor approximation as follows.

$$ \hat{f}\left(x ; x^{(k)}\right)=f\left(x^{(k)}\right)+f^{\prime}\left(x^{(k)}\right)\left(x-x^{(k)}\right) $$

Next, we set the above to zero in order to find the next iterate \(x^{(k+1)} \).

If \(f^{\prime}\left(x^{(k)}\right) \neq 0 \), the solution of \(\hat{f}\left(x ; x^{(k)}\right)=0 \) is given by the right-hand side of the iteration above. If \(f^{\prime}\left(x^{(k)}\right)=0 \), the Newton Algorithm terminates with an error.
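The scalar case, including the zero-derivative failure mode, can be sketched in a few lines of Python. The example function \(f(x)=x^{2}-2\) and the starting point are assumptions chosen for illustration; its positive root is \(\sqrt{2}\).

```python
import math

# Scalar Newton iteration x_{k+1} = x_k - f(x_k) / f'(x_k)
def newton(f, fprime, x, iters=10):
    for _ in range(iters):
        d = fprime(x)
        if d == 0.0:  # f'(x^{(k)}) = 0 -> the Newton step is undefined
            raise ZeroDivisionError("f'(x^(k)) = 0: Newton step undefined")
        x = x - f(x) / d
    return x

# Solve x^2 - 2 = 0 starting from x = 1
root = newton(lambda x: x**2 - 2.0, lambda x: 2.0 * x, 1.0)
print(abs(root - math.sqrt(2)) < 1e-12)  # True
```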

### Shortcomings Of The Newton Algorithm

Similar to the Gauss-Newton Algorithm, the Newton Algorithm diverges when the residual becomes too large, and the iteration terminates if the derivative matrix is not invertible. To solve these issues, we would need to look at some variations of this algorithm, which we’ll study in our upcoming posts.

Well, that’s it for this post. We hope you have understood how complex it is to solve a set of nonlinear equations as compared to linear equations. With the use of heuristics such as the Gauss-Newton Algorithm, one can solve the system of nonlinear equations using simple Calculus and Least Squares.

**Nonlinear Least Squares & Heuristic Algorithms**

- Nonlinear least-squares problems typically have more equations than unknowns/variables
- Determining whether a system of nonlinear equations has a solution is a difficult task
- A set of nonlinear equations can have zero, one, or infinitely many solutions
- Heuristic algorithms are used to solve the nonlinear least squares problem
- The Optimality Condition is a necessary but not a sufficient condition for a solution of the nonlinear least-squares problem
- Gauss-Newton, Newton-Raphson, and Levenberg-Marquardt are some of the heuristic algorithms used to solve nonlinear least-squares problems

**Summary**

So friends, how did you like this post on Nonlinear Least Squares, an important topic from Applied Linear Algebra? You are now equipped with the fundamental knowledge of solving or say, approximating solutions for a system of nonlinear equations. We hope you will practice the concepts by finding some relevant examples online. Do tell us about a least-squares problem you recently solved. We’ll see you soon with some more interesting topics. Till then, keep having some nonlinear fun!