What this post covers
This post explains matrix multiplication as composition of transformations.
- Why matrix multiplication is not just a number rule
- Why applying two transformations in sequence becomes one matrix product
- Why order changes the result
- How this connects to graphics, coordinate transforms, and neural-network layers
Key terms
- matrix multiplication: the operation that combines two matrices into a single composite transformation
- matrix: here, a representation of one transformation
- linear transformation: the structure matrix multiplication is combining
Core idea
Ordinary scalar multiplication commutes: changing the order does not matter. Matrix multiplication usually does not.
That feels strange at first, but the reason is simple: a matrix is not just a number. It represents a transformation.
If you apply B to a vector x first and then apply A, the full result is
A(Bx)
The single matrix that represents the whole combined action is AB, so
A(Bx) = (AB)x
That is why matrix multiplication is best read as composition of transformations, much like function composition.
In this series we use the column-vector convention, so in ABx, the rightmost matrix acts first.
The component rule is still worth seeing
If A is m x n and B is n x k, then AB is m x k. Its (i, j) entry is computed by taking the dot product of row i of A with column j of B.
(AB)ij = sum_{l=1}^{n} Ail Blj
That formula is not separate from the composition idea. It is the coordinate version of it: the jth column of B is first turned into an intermediate vector, and A then acts on that result.
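In code, the component rule is just a nested loop over rows and columns. Here is a minimal sketch with plain Python lists; `matmul` is a hypothetical helper written for this post, not a library function:

```python
def matmul(A, B):
    """Dot-product rule: (AB)[i][j] = sum over l of A[i][l] * B[l][j]."""
    m, n, k = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must match"
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(k)]
            for i in range(m)]

# The same matrices used later in this post:
A = [[2, 0], [0, 1]]  # stretch x by 2
B = [[0, 1], [1, 0]]  # swap the axes
print(matmul(A, B))  # [[0, 2], [1, 0]]
```

Each entry of the result is one dot product, which is exactly the coordinate version of "B acts first, then A."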
Why order matters
Suppose
A = [2 0; 0 1]
B = [0 1; 1 0]
Here A stretches only the x direction, while B swaps the two axes.
Then
AB = [0 2; 1 0]
BA = [0 1; 2 0]
So AB != BA.
This is not a weird exception. It is the normal situation, because “do B then A” is usually different from “do A then B.”
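You can check this directly. A short sketch, again with a hypothetical `matmul` helper so the snippet stands on its own:

```python
def matmul(A, B):
    # (i, j) entry = dot product of row i of A with column j of B
    n = len(B)
    return [[sum(A[i][l] * B[l][j] for l in range(n))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[2, 0], [0, 1]]  # stretch only the x direction
B = [[0, 1], [1, 0]]  # swap the two axes
print(matmul(A, B))   # [[0, 2], [1, 0]]  ("do B, then A")
print(matmul(B, A))   # [[0, 1], [2, 0]]  ("do A, then B")
```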
Step-by-step examples
Example 1) One vector through two transformations
Let x = [1; 0].
First apply B:
Bx = [0 1; 1 0] [1; 0] = [0; 1]
Then apply A:
A(Bx) = [2 0; 0 1] [0; 1] = [0; 1]
Now compute the product itself:
AB = [0 2; 1 0]
and then
(AB)x = [0 2; 1 0] [1; 0] = [0; 1]
So A(Bx) and (AB)x agree, exactly as composition predicts.
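The whole example fits in a few lines of code. `matvec` and `matmul` below are small hypothetical helpers, included so the snippet is self-contained:

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[2, 0], [0, 1]]
B = [[0, 1], [1, 0]]
x = [1, 0]

two_steps = matvec(A, matvec(B, x))  # apply B, then A
one_step = matvec(matmul(A, B), x)   # apply the single matrix AB
print(two_steps, one_step)  # [0, 1] [0, 1]
```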
Example 2) Rotation and scaling
In graphics, rotating first and then scaling usually gives a different answer from scaling first and then rotating. Even a simple pair such as “swap axes, then stretch the x-axis” versus “stretch first, then swap” gives different matrices, as we saw above.
So the multiplication order is not cosmetic. It changes the geometry.
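An actual rotate-vs-scale pair makes this concrete. A sketch using a standard 2D rotation matrix (the angle and the stretch factor are arbitrary choices for illustration; `matmul` is a hypothetical helper):

```python
import math

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

t = math.pi / 2                   # rotate 90 degrees counterclockwise
R = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]
S = [[2, 0], [0, 1]]              # stretch x by 2

RS = matmul(R, S)  # scale first, then rotate
SR = matmul(S, R)  # rotate first, then scale
# RS sends [1, 0] to roughly (0, 2); SR sends it to roughly (0, 1).
```

The point [1, 0] lands in two different places, so the two orderings really are different transformations.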
Example 3) Stacking linear layers
If a neural network applies one linear layer and then another, and there is no activation in between, the total effect is still one linear transformation:
y = A(Bx) = (AB)x
That is why stacking only linear layers does not increase expressive power: without an activation between them, they collapse exactly into one linear transformation. Nonlinearity is what makes deep networks richer.
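The collapse is easy to demonstrate numerically. The weight matrices below are arbitrary made-up values, and `matvec`/`matmul` are hypothetical helpers, so the snippet runs on its own:

```python
A = [[1, 2], [3, 4]]  # second linear layer (hypothetical weights, no bias)
B = [[0, 1], [1, 1]]  # first linear layer
x = [5, -2]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

stacked = matvec(A, matvec(B, x))    # run the two layers in sequence
collapsed = matvec(matmul(A, B), x)  # one layer with precomputed weights AB
print(stacked, collapsed)  # [4, 6] [4, 6]
```

With a nonlinearity (say, a ReLU) between the two layers, this identity no longer holds, which is exactly what the activation is for.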
Math notes
- Matrix multiplication represents composition of linear transformations.
- Associativity holds, so (AB)C = A(BC), which also lets us choose a cheaper computation order in practice.
- Commutativity does not generally hold, so AB is usually not equal to BA.
- Dimensions must line up in a precise way: if A is m x n and B is n x k, then the number of columns of A matches the number of rows of B, so AB is defined and has shape m x k.
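The associativity note about computation cost can be seen with deliberately lopsided shapes (a hypothetical example; `matmul` is a small helper written for this post). With A of shape 10 x 1, B of shape 1 x 10, and C of shape 10 x 1, computing (AB)C builds a 10 x 10 intermediate, while A(BC) builds a 1 x 1 intermediate, yet both give the same answer:

```python
def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[i + 1] for i in range(10)]    # 10 x 1
B = [[j + 1 for j in range(10)]]    # 1 x 10
C = [[1] for _ in range(10)]        # 10 x 1

left = matmul(matmul(A, B), C)   # (AB)C: 10x10 intermediate, ~200 multiplies
right = matmul(A, matmul(B, C))  # A(BC): 1x1 intermediate, ~20 multiplies
print(left == right)  # True: same result, very different amounts of work
```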
Common mistakes
Treating matrix multiplication like scalar multiplication
That is the main reason AB != BA feels confusing. These are transformations, not plain numbers.
Reading left to right too literally
In the column-vector convention, the matrix closest to x acts first.
Thinking dimension matching is just formal syntax
It is a structural condition: one transformation's output space has to fit the next transformation's input space, which is why the inner dimensions must match.
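That structural condition can be phrased as a one-line shape check. `can_multiply` is a hypothetical helper, shown only to make the rule concrete:

```python
def can_multiply(shape_a, shape_b):
    # For AB, the input space A consumes (its column count) must be
    # the output space B produces (its row count).
    return shape_a[1] == shape_b[0]

print(can_multiply((2, 3), (3, 4)))  # True:  2x3 times 3x4 gives 2x4
print(can_multiply((3, 4), (2, 3)))  # False: inner dimensions 4 and 2 differ
```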
Thinking more linear layers automatically means more expressive power
Without nonlinear activations, many linear layers collapse into one linear transformation.
Practice or extension
- Why can a 2 x 3 matrix multiply a 3 x 4 matrix, but a 3 x 4 matrix cannot multiply a 2 x 3 matrix?
- Why do "rotate then scale" and "scale then rotate" usually differ?
- Why are activations necessary if two linear layers can collapse into one?
- Check whether a library you use follows a column-vector or row-vector convention.
Wrap-up
This post read matrix multiplication as composition.
- AB combines two transformations into one.
- In this convention, the rightmost matrix acts first.
- Order matters, so matrix multiplication is usually not commutative.
- The same idea appears in graphics, coordinate systems, and neural networks.
Next, we will use this viewpoint to read systems of linear equations in the form Ax = b.