What this post covers
This post introduces least squares and linear regression.
- Why approximate solutions are needed in real data
- What least squares minimizes
- How projection and the residual are connected
- How regression fits into a linear-algebra viewpoint
Key terms
- least squares: minimizing the squared length of the error vector
- linear regression: fitting a linear model to data
- normal equation: the equation A^T A x = A^T b
- residual: the difference between an observation and the model output
- column space: the space of outputs the model can explain
Core idea
Real data rarely fits a line or plane exactly. For example, the three points (1, 2), (2, 3), and (3, 5.5) do not all lie on one perfect line. That means Ax = b often has no exact solution.
Instead of demanding perfection, we look for the best approximation by minimizing
||Ax - b||^2
This is the least-squares problem.
We square the error because signed errors can cancel out, and because the squared Euclidean norm is smooth and easy to optimize.
The vector Ax is what the model can explain. The vector b is what we actually observe. Their difference is the error, or residual.
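As a quick numeric check on the example above, the slopes between consecutive pairs of the three points differ, so no single line passes through all of them:

```python
# Points from the example: (1, 2), (2, 3), (3, 5.5).
# If all three lay on one line, consecutive slopes would match.
slope_12 = (3 - 2) / (2 - 1)    # slope between the first two points: 1.0
slope_23 = (5.5 - 3) / (3 - 2)  # slope between the last two points: 2.5
print(slope_12 == slope_23)     # False: the system Ax = b has no exact solution
```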
Step-by-step examples
Example 1) Fitting a line
Suppose we want a line y = β0 + β1 x for data points (x1, y1), ..., (xn, yn). Each data point gives one equation β0 + β1 x_i = y_i, and stacking those equations gives one matrix problem.
Then we can write
A = [ 1  x1
      1  x2
      ...
      1  xn ]

β_hat = [ β0
          β1 ]

b_vec = [ y1
          y2
          ...
          yn ]
So the problem becomes
A β_hat ≈ b_vec
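This setup can be sketched in NumPy. The sketch below uses the three example points from earlier, builds A by stacking a column of ones next to the x values, and solves the least-squares problem with `np.linalg.lstsq`:

```python
import numpy as np

# One equation β0 + β1·x_i = y_i per data point,
# using the example points (1, 2), (2, 3), (3, 5.5).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.5])

A = np.column_stack([np.ones_like(x), x])  # each row is [1, x_i]
beta_hat, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)  # [β0, β1] minimizing ||A β_hat - b_vec||^2
```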
Example 2) The projection viewpoint
The outputs a model can explain form the column space of A.
If b lies outside that space, no exact solution exists. The least-squares solution chooses A β_hat to be the projection of b onto Col(A).
That means the residual
r = b - A β_hat
is orthogonal to the column space.
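The orthogonality claim can be verified numerically: multiplying the residual by A^T should give (approximately) the zero vector, since the residual is perpendicular to every column of A. A minimal sketch with the same example data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.5])
A = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

r = y - A @ beta_hat  # residual r = b - A β_hat
print(A.T @ r)        # ≈ [0, 0]: r is orthogonal to every column of A
```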
Example 3) The normal equation
If the residual is orthogonal to every column of A, then
A^T r = 0
Substituting r = b - Ax gives
A^T(b - Ax) = 0
A^T A x = A^T b
which is the normal equation.
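Solving the normal equation directly should give the same coefficients as the least-squares solver, which the following sketch checks:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.5])
A = np.column_stack([np.ones_like(x), x])

# Normal equation: A^T A x = A^T b
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(beta_normal, beta_lstsq))  # True: both routes agree
```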
Math notes
- Least squares minimizes the squared Euclidean norm of the residual.
- If A has full column rank, the least-squares solution is unique.
- If not, multiple least-squares solutions exist, and one often chooses the minimum-norm solution.
- In practice, QR decomposition or SVD is often preferred to solving the normal equation directly.
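The QR route mentioned above can be sketched as follows: with A = QR and Q having orthonormal columns, the normal equation reduces to the triangular system R x = Q^T b, so A^T A is never formed explicitly (its condition number is the square of A's, which hurts accuracy):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.5])
A = np.column_stack([np.ones_like(x), x])

# Reduced QR: A = Q R, with Q (3x2) orthonormal and R (2x2) upper triangular.
Q, R = np.linalg.qr(A)
beta_qr = np.linalg.solve(R, Q.T @ y)  # solve R x = Q^T b instead of A^T A x = A^T b
```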
Linear regression is also a statistical model, but here the main focus is its linear-algebra skeleton.
Common mistakes
Thinking regression is only a statistics formula sheet
It is also a projection problem in linear algebra.
Thinking “no exact solution” means “no useful answer”
Approximation is the normal case in real data.
Assuming least-squares solutions are always unique
Uniqueness depends on the columns of A being independent enough.
Practice or extension
- Why does real data often fail to satisfy Ax = b exactly?
- Why do we square the error instead of just summing signed errors?
- What does it mean geometrically to project b onto Col(A)?
Wrap-up
This post introduced least squares and linear regression.
- When exact solutions fail, best approximations still exist.
- Least squares minimizes squared residual size.
- The solution can be read as a projection.
- Next, we move to eigenvalues and eigenvectors, where transformations reveal their special directions.