What this post covers
This post defines the length of a vector and the distance between two vectors.
- What a norm is
- How to read Euclidean norm and Euclidean distance intuitively
- When L1, L2, and L∞ norms are commonly used
- Why distance and similarity are related but not identical ideas
Key terms
- vector: the object whose size and distance we measure
- norm: a rule for measuring the size of a vector
- dot product: the operation that will connect length and angle in the next post
Core idea
Once you accept vectors as data representations, the next questions come naturally.
- How large is this vector?
- How different are these two vectors?
The concept that answers the first question is a norm. A norm is a rule for measuring the size of a vector. The most familiar one is the Euclidean norm, which matches what we usually call length in 2D or 3D geometry.
If this is your first pass, treat the Euclidean norm as the main anchor for this post. Other norms matter too, but you do not need all of them at once to understand the core idea.
For a vector v = (v1, v2, ..., vn), the Euclidean norm is
||v|| = sqrt(v1^2 + v2^2 + ... + vn^2)
The meaning is simple even if the notation looks new. We square each axis contribution, add them, and then take the square root to recover one overall size.
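The three steps in that sentence map directly onto code. Here is a minimal sketch of the formula, using a hypothetical helper name `euclidean_norm`:

```python
import math

def euclidean_norm(v):
    """Square each component, add the squares, take the square root."""
    return math.sqrt(sum(x * x for x in v))

print(euclidean_norm([3.0, 4.0]))       # 5.0
print(euclidean_norm([1.0, 2.0, 2.0]))  # 3.0
```

The same three steps apply unchanged whether the vector has 2 components or 2000.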
Distance uses the same idea on the difference between two vectors. For vectors u and v, the Euclidean distance is
d(u, v) = ||u - v||
So distance asks: how large is the difference vector?
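Since distance is just the norm of a difference vector, the implementation is two steps: subtract, then measure. A minimal sketch, with `euclidean_distance` as an illustrative name:

```python
import math

def euclidean_distance(u, v):
    """Distance is the Euclidean norm of the difference vector u - v."""
    diff = [a - b for a, b in zip(u, v)]
    return math.sqrt(sum(x * x for x in diff))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
```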
That point of view appears everywhere in programming. In regression, we compare predictions and ground truth. In graphics, we compare images. In embeddings, we compare queries, documents, users, or items.
Norms beyond Euclidean length
The Euclidean norm is not the only norm. For p >= 1, an Lp norm is usually written as
||v||_p = (|v1|^p + |v2|^p + ... + |vn|^p)^(1/p)
Common examples are:
- L1 norm: sum of absolute values
- L2 norm: square root of the sum of squares
- L∞ norm, often called the max norm: the largest absolute component
The restriction p >= 1 matters. If p < 1, the triangle inequality fails, so the formula no longer defines a true norm.
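The general Lp formula and the max-norm special case can be sketched together. The function names here are illustrative, not from any particular library:

```python
def lp_norm(v, p):
    """General Lp norm; only valid as a true norm for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1: below that, the triangle inequality fails")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def max_norm(v):
    """L-infinity norm: the largest absolute component."""
    return max(abs(x) for x in v)

v = [3.0, -4.0, 1.0]
print(lp_norm(v, 1))  # 8.0
print(lp_norm(v, 2))  # about 5.099 (sqrt of 26)
print(max_norm(v))    # 4.0
```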
Why length and distance matter in applications
Programming constantly turns mismatch into numbers. If we collect prediction errors into one vector, the size of that error vector tells us how badly the model missed overall.
If we flatten two images into vectors and subtract them, the norm of the difference gives one number that summarizes how different the images are. If we compare embeddings, distance can tell us how close two objects are in representation space.
But this is also the first place where an important distinction appears: distance is not the same thing as similarity. A small distance often suggests similarity, but the right notion of similarity depends on the problem. Sometimes magnitude matters. Sometimes only direction matters. The next post will show why cosine similarity is often the better tool when direction matters more than length.
Step-by-step examples
Example 1) Length of a 2D vector
Take v = (3, 4). Then
||v|| = sqrt(3^2 + 4^2) = 5
This is just the Pythagorean theorem in vector form. That is why the Euclidean norm feels natural: in 2D it is arrow length, in 3D it is spatial length, and in higher dimensions it is the same idea generalized.
Example 2) An error vector
Suppose the true values are (10, 20, 30) and the predictions are (12, 18, 29). Then the error vector is
error = (12, 18, 29) - (10, 20, 30) = (2, -2, -1)
Its Euclidean norm is
||error|| = sqrt(2^2 + (-2)^2 + (-1)^2) = 3
So one way to summarize the total error is to say its size is 3. In practice, people also optimize quantities like mean squared error or sum of squared errors, but the central idea is the same: treat the error as a vector and measure its size.
A useful detail: ||error||^2 is the square of a norm, not itself a norm. That difference matters later when we talk about loss functions.
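The whole example, including the related squared quantities, fits in a few lines. A minimal sketch of the computation above:

```python
import math

truth = [10.0, 20.0, 30.0]
pred = [12.0, 18.0, 29.0]

error = [p - t for p, t in zip(pred, truth)]       # (2, -2, -1)
error_norm = math.sqrt(sum(e * e for e in error))  # size of the error vector: 3.0
sse = sum(e * e for e in error)                    # sum of squared errors: 9.0
mse = sse / len(error)                             # mean squared error: 3.0
print(error, error_norm, sse, mse)
```

Note that `sse` is exactly `error_norm ** 2`: the squared norm from the remark above, which is what loss functions usually optimize.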
Example 3) Difference between two small images
Suppose two small grayscale images are flattened into vectors.
img1 = (0, 120, 255, 80)
img2 = (10, 110, 250, 100)
Their difference vector is (10, -10, -5, 20). Its norm gives a numeric summary of how different the images are.
Still, this also shows a practical caveat: raw pixel Lp distances do not perfectly match human perceptual similarity. So while norms are foundational, real image systems often use more specialized metrics too.
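The image comparison is the same subtract-and-measure recipe as before, just on flattened pixel vectors. A minimal sketch using the values above:

```python
import math

img1 = [0.0, 120.0, 255.0, 80.0]
img2 = [10.0, 110.0, 250.0, 100.0]

diff = [a - b for a, b in zip(img1, img2)]
distance = math.sqrt(sum(d * d for d in diff))
print(distance)  # sqrt(100 + 100 + 25 + 400) = 25.0
```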
Example 4) Distance between normalized vectors
Suppose u and v are unit vectors. Then the following identity holds.
||u - v||^2 = 2(1 - cos(theta))
Think of this as a preview of the next post rather than a formula you must fully own right now. It works because expanding the square gives ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u·v, and for unit vectors ||u||^2 + ||v||^2 = 2 while u·v = cos(theta). Once vectors are normalized, Euclidean distance and angular similarity become tightly connected.
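You can check the identity numerically by building two unit vectors at a known angle. A small sketch, picking theta = 60 degrees as an arbitrary example:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Two unit vectors separated by angle theta
theta = math.pi / 3
u = [1.0, 0.0]
v = [math.cos(theta), math.sin(theta)]

lhs = norm([a - b for a, b in zip(u, v)]) ** 2
rhs = 2 * (1 - math.cos(theta))
print(lhs, rhs)  # both close to 1.0 for theta = 60 degrees
```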
Math notes
A norm is not just any function. It must behave like a consistent measure of size.
1. ||v|| >= 0
2. ||v|| = 0 <=> v = 0
3. ||a v|| = |a| ||v||
4. ||u + v|| <= ||u|| + ||v||
These are the norm axioms. Distance can then be defined by d(u, v) = ||u - v||. In that sense, Euclidean distance is a metric induced by a norm.
The fourth property is the triangle inequality. Intuitively, going directly is never longer than going in pieces. That is exactly the behavior we expect from a valid notion of distance.
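A quick numerical sanity check makes the triangle inequality concrete: for many random pairs of vectors, the direct path is never longer than the detour. A minimal sketch (the small epsilon only guards against floating-point rounding):

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

random.seed(0)
ok = True
for _ in range(1000):
    u = [random.uniform(-10, 10) for _ in range(3)]
    v = [random.uniform(-10, 10) for _ in range(3)]
    s = [a + b for a, b in zip(u, v)]
    # Triangle inequality: ||u + v|| <= ||u|| + ||v||
    ok = ok and norm(s) <= norm(u) + norm(v) + 1e-9
print(ok)  # True
```

This is evidence, not a proof, but it is a useful habit: norm axioms are exactly the kind of property worth asserting in tests.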
A practical guide is also helpful.
- L2 is common when large deviations should be penalized more strongly.
- L1 is useful when you want more robustness to outliers.
- max norm is useful when the largest single deviation matters most.
- Huber-style losses use L2-like behavior for small errors and L1-like behavior for large ones.
Also keep two terms separate.
- normalization: rescale a vector, often to unit length
- standardization: rescale features by mean and variance
Typical formulas look like this.
z = (x - mu) / sigma # feature-wise standardization across a dataset
v_normalized = v / ||v|| # vector normalization, assuming ||v|| != 0
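The two formulas above can be sketched side by side, which makes the difference in scope visible: normalization acts on one vector, standardization acts on one feature across a dataset. Function names here are illustrative:

```python
import math
import statistics

def normalize(v):
    """Rescale a single vector to unit length (undefined for the zero vector)."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / length for x in v]

def standardize(xs):
    """Rescale one feature column to zero mean and unit variance."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

print(normalize([3.0, 4.0]))         # [0.6, 0.8], a unit vector
print(standardize([1.0, 2.0, 3.0]))  # mean 0, variance 1
```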
Common mistakes
Assuming small distance always means meaningful similarity
Not always. The meaning of distance depends on the space and what each axis represents. For example, (1, 0) and (100, 0) are far apart in Euclidean distance, but they point in exactly the same direction. By contrast, (1, 0) and (0, 1) are much closer in Euclidean distance, yet their directions are orthogonal.
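The two pairs of vectors in that paragraph make the point numerically. As a preview of the next post, the sketch below also computes cosine similarity (the cosine of the angle between two vectors) alongside Euclidean distance:

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (euclid(u, [0.0] * len(u)) * euclid(v, [0.0] * len(v)))

a, b, c = [1.0, 0.0], [100.0, 0.0], [0.0, 1.0]
print(euclid(a, b), cosine(a, b))  # far apart (99.0) but same direction (1.0)
print(euclid(a, c), cosine(a, c))  # much closer (~1.414) but orthogonal (0.0)
```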
Memorizing the length formula without understanding why
The Euclidean norm is not an arbitrary formula. It is the Pythagorean theorem generalized to vector spaces. That link becomes important when we later connect norms to the dot product.
Confusing normalization and standardization
They solve different problems. Normalization adjusts vector length. Standardization adjusts feature scale across data.
Thinking L2 is always enough
Different problems emphasize different kinds of difference. Outliers, worst-case bounds, and sparse signals often motivate different choices.
Practice or extension
Try these by hand.
- Compute the length of v = (6, 8).
- Compute the distance between u = (1, 2) and v = (4, 6).
- Compute the error-vector norm between prediction (5, 9, 2) and truth (4, 10, 1).
- Compare the L1 and L2 norms of the same vector.
Then think about these questions.
- Why is distance defined through the difference vector?
- Why is a large vector not the same thing as two vectors being similar?
- Why do we need the triangle inequality for a usable distance notion?
Wrap-up
This post introduced vector length and distance.
- A norm measures the size of one vector.
- Distance measures the size of a difference vector.
- These ideas power error analysis, image comparison, and representation search.
- Distance alone does not capture every notion of similarity.
- Normalization and standardization solve different problems.
In the next post, we will use the dot product to explain how much two vectors point in the same direction, and then connect that idea to cosine similarity.