What this post covers
This post explains matrix multiplication as composition of transformations.
- Why matrix multiplication is not just a number rule
- Why applying two transformations in sequence becomes one matrix product
- Why order changes the result
- How this connects to graphics, coordinate transforms, and neural-network layers
Key terms
- matrix multiplication: the operation that combines two matrices into a single composite transformation
- matrix: here, a representation of one transformation
- linear transformation: the structure matrix multiplication is combining
Core idea
Ordinary scalar multiplication commutes: changing the order does not matter. Matrix multiplication usually does not.
That feels strange at first, but the reason is simple: a matrix is not just a number. It represents a transformation.
If you apply B to a vector x first and then apply A, the full result is
A(Bx)
The single matrix that represents the whole combined action is AB, so
A(Bx) = (AB)x
That is why matrix multiplication is best read as composition of transformations, much like function composition.
In this series we use the column-vector convention, so in ABx, the rightmost matrix acts first.
The component rule is still worth seeing
If A is m x n and B is n x k, then AB is m x k. Its (i, j) entry is computed by taking the dot product of row i of A with column j of B.
(AB)ij = sum_{l=1}^{n} Ail Blj
That formula is not separate from the composition idea. It is the coordinate version of it: the jth column of B is first turned into an intermediate vector, and A then acts on that result.
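In code, the component rule is just a nested loop over rows and columns. Here is a minimal sketch with plain Python lists; `matmul` is a hypothetical helper written for this post, not a library function:

```python
def matmul(A, B):
    """Dot-product rule: (AB)[i][j] = sum over l of A[i][l] * B[l][j]."""
    m, n, k = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must match"
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(k)]
            for i in range(m)]

# The same matrices used later in this post:
A = [[2, 0], [0, 1]]  # stretch x by 2
B = [[0, 1], [1, 0]]  # swap the axes
print(matmul(A, B))  # [[0, 2], [1, 0]]
```

Each entry of the result is one dot product, which is exactly the coordinate version of "B acts first, then A."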
Why order matters
Suppose
A = [2 0; 0 1]
B = [0 1; 1 0]
Here A stretches only the x direction, while B swaps the two axes.
Then
AB = [0 2; 1 0]
BA = [0 1; 2 0]
So AB != BA.
This is not a weird exception. It is the normal situation, because “do B then A” is usually different from “do A then B.”
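You can check this directly. A short sketch, again with a hypothetical `matmul` helper so the snippet stands on its own:

```python
def matmul(A, B):
    # (i, j) entry = dot product of row i of A with column j of B
    n = len(B)
    return [[sum(A[i][l] * B[l][j] for l in range(n))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[2, 0], [0, 1]]  # stretch only the x direction
B = [[0, 1], [1, 0]]  # swap the two axes
print(matmul(A, B))   # [[0, 2], [1, 0]]  ("do B, then A")
print(matmul(B, A))   # [[0, 1], [2, 0]]  ("do A, then B")
```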
Step-by-step examples
Example 1) One vector through two transformations
Let x = [1; 0].
First apply B:
Bx = [0 1; 1 0] [1; 0] = [0; 1]
Then apply A:
A(Bx) = [2 0; 0 1] [0; 1] = [0; 1]
Now compute the product itself:
AB = [0 2; 1 0]
and then
(AB)x = [0 2; 1 0] [1; 0] = [0; 1]
So A(Bx) and (AB)x agree, exactly as composition predicts.
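The whole example fits in a few lines of code. `matvec` and `matmul` below are small hypothetical helpers, included so the snippet is self-contained:

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[2, 0], [0, 1]]
B = [[0, 1], [1, 0]]
x = [1, 0]

two_steps = matvec(A, matvec(B, x))  # apply B, then A
one_step = matvec(matmul(A, B), x)   # apply the single matrix AB
print(two_steps, one_step)  # [0, 1] [0, 1]
```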
Example 2) Rotation and scaling
In graphics, rotating first and then scaling usually gives a different answer from scaling first and then rotating. Even a simple pair such as “swap axes, then stretch the x-axis” versus “stretch first, then swap” gives different matrices, as we saw above.
So the multiplication order is not cosmetic. It changes the geometry.
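An actual rotate-vs-scale pair makes this concrete. A sketch using a standard 2D rotation matrix (the angle and the stretch factor are arbitrary choices for illustration; `matmul` is a hypothetical helper):

```python
import math

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

t = math.pi / 2                   # rotate 90 degrees counterclockwise
R = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]
S = [[2, 0], [0, 1]]              # stretch x by 2

RS = matmul(R, S)  # scale first, then rotate
SR = matmul(S, R)  # rotate first, then scale
# RS sends [1, 0] to roughly (0, 2); SR sends it to roughly (0, 1).
```

The point [1, 0] lands in two different places, so the two orderings really are different transformations.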
Example 3) Stacking linear layers
If a neural network applies one linear layer and then another, and there is no activation in between, the total effect is still one linear transformation:
y = A(Bx) = (AB)x
That is why stacking only linear layers does not increase expressive power: without an activation between them, they collapse exactly into one linear transformation. Nonlinearity is what makes deep networks richer.
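The collapse is easy to demonstrate numerically. The weight matrices below are arbitrary made-up values, and `matvec`/`matmul` are hypothetical helpers, so the snippet runs on its own:

```python
A = [[1, 2], [3, 4]]  # second linear layer (hypothetical weights, no bias)
B = [[0, 1], [1, 1]]  # first linear layer
x = [5, -2]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

stacked = matvec(A, matvec(B, x))    # run the two layers in sequence
collapsed = matvec(matmul(A, B), x)  # one layer with precomputed weights AB
print(stacked, collapsed)  # [4, 6] [4, 6]
```

With a nonlinearity (say, a ReLU) between the two layers, this identity no longer holds, which is exactly what the activation is for.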
Math notes
- Matrix multiplication represents composition of linear transformations.
- Associativity holds, so (AB)C = A(BC), which also lets us choose a cheaper computation order in practice.
- Commutativity does not generally hold, so AB is usually not equal to BA.
- Dimensions must line up in a precise way: if A is m x n and B is n x k, then the number of columns of A matches the number of rows of B, so AB is defined and has shape m x k.
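The associativity note about computation cost can be seen with deliberately lopsided shapes (a hypothetical example; `matmul` is a small helper written for this post). With A of shape 10 x 1, B of shape 1 x 10, and C of shape 10 x 1, computing (AB)C builds a 10 x 10 intermediate, while A(BC) builds a 1 x 1 intermediate, yet both give the same answer:

```python
def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[i + 1] for i in range(10)]    # 10 x 1
B = [[j + 1 for j in range(10)]]    # 1 x 10
C = [[1] for _ in range(10)]        # 10 x 1

left = matmul(matmul(A, B), C)   # (AB)C: 10x10 intermediate, ~200 multiplies
right = matmul(A, matmul(B, C))  # A(BC): 1x1 intermediate, ~20 multiplies
print(left == right)  # True: same result, very different amounts of work
```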
Common mistakes
Treating matrix multiplication like scalar multiplication
That is the main reason AB != BA feels confusing. These are transformations, not plain numbers.
Reading left to right too literally
In the column-vector convention, the matrix closest to x acts first.
Thinking dimension matching is just formal syntax
It is a structural condition: one transformation's output space has to fit the next transformation's input space, which is why the inner dimensions must match.
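That structural condition can be phrased as a one-line shape check. `can_multiply` is a hypothetical helper, shown only to make the rule concrete:

```python
def can_multiply(shape_a, shape_b):
    # For AB, the input space A consumes (its column count) must be
    # the output space B produces (its row count).
    return shape_a[1] == shape_b[0]

print(can_multiply((2, 3), (3, 4)))  # True:  2x3 times 3x4 gives 2x4
print(can_multiply((3, 4), (2, 3)))  # False: inner dimensions 4 and 2 differ
```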
Thinking more linear layers automatically means more expressive power
Without nonlinear activations, many linear layers collapse into one linear transformation.
Practice or extension
- Why can a 2 x 3 matrix multiply a 3 x 4 matrix, but a 3 x 4 matrix cannot multiply a 2 x 3 matrix?
- Why do "rotate then scale" and "scale then rotate" usually differ?
- Why are activations necessary if two linear layers can collapse into one?
- Check whether a library you use follows a column-vector or row-vector convention.
Wrap-up
This post read matrix multiplication as composition.
- AB combines two transformations into one.
- In this convention, the rightmost matrix acts first.
- Order matters, so matrix multiplication is usually not commutative.
- The same idea appears in graphics, coordinate systems, and neural networks.
Next, we will use this viewpoint to read systems of linear equations in the form Ax = b.