
Section 6.5 The Method of Least Squares

Objectives
  1. Learn examples of best-fit problems.
  2. Learn to turn a best-fit problem into a least-squares problem.
  3. Recipe: find a least-squares solution.
  4. Picture: geometry of a least-squares solution.
  5. Vocabulary words: least-squares solution.

In this section, we answer the following important question:

Suppose that Ax = b does not have a solution. What is the best approximate solution?

For our purposes, the best approximate solution is called the least-squares solution. We will present two methods for finding least-squares solutions, and we will give several applications to best-fit problems.

Subsection 6.5.1 Least-Squares Solutions

We begin by clarifying exactly what we will mean by a “best approximate solution” to an inconsistent matrix equation Ax = b .

Definition

Let A be an m × n matrix and let b be a vector in R^m. A least-squares solution of the matrix equation Ax = b is a vector x̂ in R^n such that

dist(b, Ax̂) ≤ dist(b, Ax)

for all other vectors x in R^n.

Recall that dist(v, w) = ‖v − w‖ is the distance between the vectors v and w. The term “least squares” comes from the fact that dist(b, Ax̂) = ‖b − Ax̂‖ is the square root of the sum of the squares of the entries of the vector b − Ax̂. So a least-squares solution minimizes the sum of the squares of the differences between the entries of Ax̂ and b. In other words, a least-squares solution solves the equation Ax = b as closely as possible, in the sense that the sum of the squares of the difference b − Ax is minimized.
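As a concrete illustration (a minimal NumPy sketch with made-up data, not part of the text), the following code computes the sum of the squares of the entries of b − Ax for two candidate vectors x; the smaller value corresponds to the better approximate solution in the least-squares sense.

```python
# Minimal sketch with made-up data: compare two candidate solutions x
# by the sum of the squares of the entries of b - Ax.
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 3.0])

def sum_of_squares(x):
    """Sum of the squares of the entries of b - Ax."""
    residual = b - A @ x
    return float(np.sum(residual**2))

x1 = np.array([0.5, 1.0])
x2 = np.array([0.0, 2.0])
print(sum_of_squares(x1), sum_of_squares(x2))

# dist(b, Ax) is the square root of this quantity:
print(np.sqrt(sum_of_squares(x1)), np.linalg.norm(b - A @ x1))
```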

Least Squares: Picture

Suppose that the equation Ax = b is inconsistent. Recall from this note in Section 2.3 that the column space of A is the set of all vectors c such that Ax = c is consistent. In other words, Col(A) is the set of all vectors of the form Ax. Hence, the closest vector of the form Ax to b is the orthogonal projection of b onto Col(A). This is denoted b_Col(A), following this notation in Section 6.3.

(Figure: the vector b, its orthogonal projection Ax̂ = b_Col(A) onto Col(A), and the difference b − Ax̂; other vectors of the form Ax lie farther from b.)

A least-squares solution of Ax = b is a solution x̂ of the consistent equation Ax = b_Col(A).

Note

If Ax = b is consistent, then b_Col(A) = b, so that a least-squares solution is the same as a usual solution.

Where is x̂ in this picture? If v_1, v_2, ..., v_n are the columns of A, then

A\hat{x} = A\begin{pmatrix} \hat{x}_1 \\ \hat{x}_2 \\ \vdots \\ \hat{x}_n \end{pmatrix} = \hat{x}_1 v_1 + \hat{x}_2 v_2 + \cdots + \hat{x}_n v_n.

Hence the entries of x̂ are the “coordinates” of b_Col(A) with respect to the spanning set { v_1, v_2, ..., v_n } of Col(A).

Figure 4: The violet plane is Col(A). The closest that Ax can get to b is the closest vector on Col(A) to b, which is the orthogonal projection b_Col(A) (in blue). The vectors v_1, v_2 are the columns of A, and the coefficients of x̂ are the lengths of the green lines.
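To connect this picture to a computation (a NumPy sketch with made-up data, not part of the text): if x̂ is a least-squares solution, then Ax̂ is the projection b_Col(A), so the residual b − Ax̂ should be orthogonal to every column of A. The check below uses np.linalg.lstsq as a black box to produce x̂.

```python
# Sketch: the residual b - A @ xhat is orthogonal to the columns of A,
# so A @ xhat is the orthogonal projection of b onto Col(A).
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

xhat, *_ = np.linalg.lstsq(A, b, rcond=None)  # a least-squares solution
residual = b - A @ xhat

# Each entry is the dot product of a column of A with the residual;
# all of them should be numerically zero.
print(A.T @ residual)
```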

We learned to solve this kind of orthogonal projection problem in Section 6.3.

Theorem

Let A be an m × n matrix and let b be a vector in R^m. The least-squares solutions of Ax = b are the solutions of the matrix equation

A^T Ax = A^T b.

In particular, finding a least-squares solution means solving a consistent system of linear equations. We can translate the above theorem into a recipe:

Recipe: Compute a least-squares solution

Let A be an m × n matrix and let b be a vector in R^m. Here is a method for computing a least-squares solution of Ax = b:

  1. Compute the matrix A^T A and the vector A^T b.
  2. Form the augmented matrix for the matrix equation A^T Ax = A^T b, and row reduce.
  3. This equation is always consistent, and any solution x̂ is a least-squares solution.

To reiterate: once you have found a least-squares solution x̂ of Ax = b, then b_Col(A) is equal to Ax̂.
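As an illustration of this recipe (a NumPy sketch, not part of the original text), the code below forms A^T A and A^T b and solves the normal equations for the data of the best-fit line example later in this section.

```python
# Sketch of the recipe: compute A^T A and A^T b, then solve A^T A x = A^T b.
import numpy as np

# Data from the best-fit line example below: points (0,6), (1,0), (2,0).
A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([6.0, 0.0, 0.0])

AtA = A.T @ A   # step 1: the matrix A^T A
Atb = A.T @ b   # step 1: the vector A^T b

# Steps 2-3: solve A^T A x = A^T b.  np.linalg.solve suffices when A^T A is
# invertible; in general one would row reduce the augmented matrix [A^T A | A^T b].
xhat = np.linalg.solve(AtA, Atb)
print(xhat)       # [-3.  5.]

# b_Col(A) is A @ xhat, the orthogonal projection of b onto Col(A).
print(A @ xhat)   # [ 5.  2. -1.]
```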

The reader may have noticed that we have been careful to say “the least-squares solutions” in the plural, and “a least-squares solution” using the indefinite article. This is because a least-squares solution need not be unique: indeed, if the columns of A are linearly dependent, then Ax = b_Col(A) has infinitely many solutions. The following theorem, which gives equivalent criteria for uniqueness, is an analogue of this corollary in Section 6.3.

Theorem

Let A be an m × n matrix. The following are equivalent:

  1. The equation Ax = b has a unique least-squares solution for every b in R^m.
  2. The columns of A are linearly independent.
  3. The matrix A^T A is invertible.

In this case, the least-squares solution of Ax = b is given by x̂ = (A^T A)^{-1} A^T b.
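As an illustration (a NumPy sketch, not from the text): when the columns of A are linearly independent, A^T A is invertible and the formula above gives the unique least-squares solution; when the columns are dependent, there are infinitely many least-squares solutions, all with the same projection Ax̂.

```python
import numpy as np

# Independent columns: A^T A is invertible, so (A^T A)^{-1} A^T b is the
# unique least-squares solution.
A = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
b = np.array([6.0, 0.0, 0.0])
xhat = np.linalg.inv(A.T @ A) @ (A.T @ b)
print(xhat)   # [-3.  5.]

# Dependent columns (third column = first + second): A^T A is singular and
# there are infinitely many least-squares solutions.
A2 = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + A[:, 1]])
x0, *_ = np.linalg.lstsq(A2, b, rcond=None)   # one least-squares solution
shift = np.array([1.0, 1.0, -1.0])            # A2 @ shift = 0
print(np.allclose(A2 @ x0, A2 @ (x0 + shift)))  # True: both are least-squares solutions
```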

Subsection 6.5.2 Best-Fit Problems

In this subsection we give an application of the method of least squares to data modeling. We begin with a basic example.

Example(Best-fit line)

Suppose that we have measured three data points

(0, 6), (1, 0), (2, 0),

and that our model for these data asserts that the points should lie on a line. Of course, these three points do not actually lie on a single line, but this could be due to errors in our measurement. How do we predict which line they are supposed to lie on?


The general equation for a (non-vertical) line is

y = Mx + B .

If our three data points were to lie on this line, then the following equations would be satisfied:

6 = M · 0 + B
0 = M · 1 + B
0 = M · 2 + B.     (6.5.1)

In order to find the best-fit line, we try to solve the above equations in the unknowns M and B . As the three points do not actually lie on a line, there is no actual solution, so instead we compute a least-squares solution.

Putting our linear equations into matrix form, we are trying to solve Ax = b for

A = \begin{pmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{pmatrix}, \qquad x = \begin{pmatrix} M \\ B \end{pmatrix}, \qquad b = \begin{pmatrix} 6 \\ 0 \\ 0 \end{pmatrix}.

We solved this least-squares problem in this example: the only least-squares solution to Ax = b is x̂ = (M, B) = (−3, 5), so the best-fit line is

y = −3x + 5.
(Figure: the three data points and the best-fit line y = −3x + 5.)

What exactly is the line y = f(x) = −3x + 5 minimizing? The least-squares solution x̂ minimizes the sum of the squares of the entries of the vector b − Ax̂. The vector b is the left-hand side of (6.5.1), and

A\begin{pmatrix} -3 \\ 5 \end{pmatrix} = \begin{pmatrix} -3(0) + 5 \\ -3(1) + 5 \\ -3(2) + 5 \end{pmatrix} = \begin{pmatrix} f(0) \\ f(1) \\ f(2) \end{pmatrix}.

In other words, Ax̂ is the vector whose entries are the y-coordinates of the graph of the line at the values of x we specified in our data points, and b is the vector whose entries are the y-coordinates of those data points. The difference b − Ax̂ is the vertical distance of the graph from the data points:

b - A\hat{x} = \begin{pmatrix} 6 \\ 0 \\ 0 \end{pmatrix} - A\begin{pmatrix} -3 \\ 5 \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}

(Figure: the line y = −3x + 5 and the vertical distances between the graph and the data points.)

The best-fit line minimizes the sum of the squares of these vertical distances.
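To make this concrete (a NumPy sketch, not part of the text), the code below computes the sum of squared vertical distances 1² + (−2)² + 1² = 6 for the best-fit line and checks that nearby lines do worse.

```python
import numpy as np

# Data points (0,6), (1,0), (2,0).
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([6.0, 0.0, 0.0])

def vertical_sse(M, B):
    """Sum of squared vertical distances from the line y = Mx + B to the data."""
    return float(np.sum((ys - (M * xs + B))**2))

print(vertical_sse(-3.0, 5.0))   # best-fit line: 1^2 + (-2)^2 + 1^2 = 6
print(vertical_sse(-3.0, 4.5))   # 6.75 -- a nearby line does worse
print(vertical_sse(-2.5, 5.0))   # 7.25 -- so does this one
```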

All of the above examples have the following form: some number of data points ( x , y ) are specified, and we want to find a function

y = B_1 g_1(x) + B_2 g_2(x) + ··· + B_m g_m(x)

that best approximates these points, where g_1, g_2, ..., g_m are fixed functions of x. Indeed, in the best-fit line example we had g_1(x) = x and g_2(x) = 1; in the best-fit parabola example we had g_1(x) = x^2, g_2(x) = x, and g_3(x) = 1; and in the best-fit linear function example we had g_1(x_1, x_2) = x_1, g_2(x_1, x_2) = x_2, and g_3(x_1, x_2) = 1 (in this example we take x to be a vector with two entries). We evaluate the above equation on the given data points to obtain a system of linear equations in the unknowns B_1, B_2, ..., B_m (once we evaluate the g_i, they just become numbers, so it does not matter what they are), and we find the least-squares solution. The resulting best-fit function minimizes the sum of the squares of the vertical distances from the graph of y = f(x) to our original data points.
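The following sketch (assuming NumPy; the helper name best_fit_coefficients is illustrative, not from the text) sets up this framework in general: each column of the matrix is one of the functions g_i evaluated at the data points, and a least-squares solution gives the coefficients B_i.

```python
import numpy as np

def best_fit_coefficients(gs, xs, ys):
    """Least-squares coefficients B_1, ..., B_m for y = B_1 g_1(x) + ... + B_m g_m(x).

    gs: list of the fixed functions g_i; xs, ys: the given data points (x, y).
    """
    # Column i of the matrix is g_i evaluated at every data point x.
    A = np.column_stack([np.array([g(x) for x in xs], dtype=float) for g in gs])
    b = np.asarray(ys, dtype=float)
    B, *_ = np.linalg.lstsq(A, b, rcond=None)
    return B

# Best-fit line y = Mx + B through (0,6), (1,0), (2,0): g_1(x) = x, g_2(x) = 1.
print(best_fit_coefficients([lambda x: x, lambda x: 1.0], [0, 1, 2], [6, 0, 0]))
# [-3.  5.]

# Best-fit parabola y = B_1 x^2 + B_2 x + B_3 through four sample points.
print(best_fit_coefficients([lambda x: x**2, lambda x: x, lambda x: 1.0],
                            [0, 1, 2, 3], [1, 0, 2, 9]))
```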

To emphasize that the nature of the functions g_i really is irrelevant, consider the following example.

The next example has a somewhat different flavor from the previous ones.

Note

Gauss invented the method of least squares to find a best-fit ellipse: he correctly predicted the (elliptical) orbit of the asteroid Ceres as it passed behind the sun in 1801.