Written 28 August 2025 ~ 12 min read

Multi-layer Perceptrons and Backpropagation

Written by

Benjamin Clark

Part 3 of 3 in Foundations of AI


Introduction

An XKCD comic on Flawed Data — Flawed Data (2021) XKCD. Available at: https://xkcd.com/2494/ (Accessed: 27th August 2025)

This article is the third in our Foundations of AI series. If you haven’t read the earlier articles on perceptrons and activation functions, start there, as this one builds on those ideas.

Multi-Layer Perceptrons (MLPs) extend the simple perceptron by stacking layers of neurons with non-linear activations. This allows networks to model curved and complex decision boundaries. Training such models requires a systematic way to adjust parameters — ranging from thousands to billions of weights. Backpropagation is the algorithm that makes this feasible.

The Goal

By the end of this article, you will understand:

Why stacking perceptrons into layers creates more expressive models.
The structure of an MLP (input, hidden, and output layers).
The intuition behind backpropagation and how it is used to train deep networks.
Common challenges in training (vanishing gradients, overfitting) and why they matter.

Why Go Beyond Single Perceptrons?

A single perceptron computes a weighted sum of its inputs and applies an activation:

y = f\!\left(\sum_{i=1}^{n} w_i x_i + b\right)

Where:

$y$ = perceptron output
$f$ = activation function
$w_i$ = weight for input $x_i$
$x_i$ = input feature
$b$ = bias term
$n$ = number of input features

This makes it a linear classifier: its decision boundary is the hyperplane $\sum_{i} w_i x_i + b = 0$. If the classes can be separated by a hyperplane, a single perceptron can separate them. For patterns like XOR, it cannot.
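The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming a step activation and hand-picked weights that implement logical AND; the names `step`, `perceptron`, `w_and` and `b_and` are illustrative, not taken from the earlier articles.

```python
def step(z: float) -> int:
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def perceptron(x: list[float], w: list[float], b: float) -> int:
    """Compute y = f(sum_i w_i * x_i + b) with a step activation f."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return step(z)


# Hand-picked (illustrative) parameters: the line x1 + x2 = 1.5 separates
# the point (1, 1) from the other three corners of the unit square.
w_and, b_and = [1.0, 1.0], -1.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(list(x), w_and, b_and) for x in inputs]
print(and_outputs)  # [0, 0, 0, 1]
```

AND is linearly separable, so one neuron is enough here.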

The XOR Problem

Input A	Input B	Output
0	0	0
0	1	1
1	0	1
1	1	0

There is no single straight line that separates the 1s from the 0s.

Visualisation of the XOR problem — The XOR dataset: red points (1) and blue points (0) cannot be separated by a single linear boundary.
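One way to see this empirically is a coarse grid search over the three parameters of a single step-activation perceptron. The sketch below is an illustration rather than a proof: the grid range and resolution are arbitrary assumptions, and `predicts_xor` is a hypothetical helper name.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def predicts_xor(w1: float, w2: float, b: float) -> bool:
    """True if a single step-activation perceptron reproduces the XOR table."""
    return all(
        (1 if w1 * a + w2 * c + b >= 0 else 0) == target
        for (a, c), target in XOR.items()
    )


# Candidate values from -5.0 to 5.0 in steps of 0.25 for w1, w2 and b.
grid = [i / 4 for i in range(-20, 21)]
solutions = [p for p in product(grid, repeat=3) if predicts_xor(*p)]
print(len(solutions))  # 0
```

No combination on the grid works, which matches the geometric argument: the four constraints on $w_1$, $w_2$ and $b$ contradict each other.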

Important

The XOR limitation was highlighted in Marvin Minsky and Seymour Papert’s Perceptrons (1969), which proved that single-layer perceptrons cannot represent linearly inseparable functions such as XOR. This helped trigger a decline in neural-network research until multi-layer training revived the field in the 1980s.
Perceptrons: An Introduction to Computational Geometry (MIT Press, 1969)

Stacking Perceptrons

To solve XOR, you need at least two layers:

First layer: constructs intermediate features.
Second layer: combines them to form the final decision.

Mathematically, instead of a single transformation

y = f(w \cdot x + b)

we compose layers:

y = f^{(2)}\!\Big(W^{(2)} \, f^{(1)}\!\big(W^{(1)} x + b^{(1)}\big) + b^{(2)}\Big)

Where:

$f^{(1)}, f^{(2)}$ = activation functions for the first and second layers
$W^{(1)}, W^{(2)}$ = weight matrices
$b^{(1)}, b^{(2)}$ = bias vectors
$x$ = input vector

This composition of functions gives MLPs their expressive power.
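As a concrete instance of this composition, here is a minimal sketch of a two-layer network that computes XOR. The weights are hand-picked for illustration (one hidden unit acts as OR, the other as AND) and a step activation is assumed; a trained network would not necessarily find these values.

```python
def step(z: float) -> int:
    return 1 if z >= 0 else 0


def dense_step(x, W, b):
    """One layer: step(W x + b), with W given as a list of rows."""
    return [step(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


# First layer builds two intermediate features from the inputs.
W1 = [[1.0, 1.0],   # hidden unit 0 fires for A OR B
      [1.0, 1.0]]   # hidden unit 1 fires for A AND B
b1 = [-0.5, -1.5]

# Second layer combines them: OR and not AND.
W2 = [[1.0, -2.0]]
b2 = [-0.5]


def xor_mlp(x):
    h = dense_step(x, W1, b1)
    return dense_step(h, W2, b2)[0]


xor_outputs = [xor_mlp([a, c]) for a, c in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```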

Structure of a Multi-Layer Perceptron (MLP)

An MLP contains three types of layers:

Input layer – receives raw features (e.g., pixel values or embeddings).
Hidden layers – each neuron applies a weighted sum, adds a bias, and passes the result through a non-linear activation to extract new features.
Output layer – produces the final prediction (classification, regression, etc.).

Per layer, the mapping is:

h^{(l)} = f^{(l)}\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)

Where:

$h^{(l)}$ = output vector from layer $l$
$h^{(l-1)}$ = input vector to layer $l$ (output of layer $l-1$)
$W^{(l)}$ = weight matrix for layer $l$
$b^{(l)}$ = bias vector for layer $l$
$f^{(l)}$ = activation function at layer $l$

Structure of a simple multi-layer perceptron for digit classification — A simple MLP: four inputs, two hidden layers (four neurons each), and a single output neuron.

A Simple Example (Digit Classification)

Suppose we classify a 28×28 handwritten digit (MNIST). A small MLP might use:

Input: 784 features (flattened pixels).
Hidden: 16 neurons with a non-linear activation.
Output: 10 neurons (one per digit class), normalised with softmax.

Computation:

\begin{aligned} h^{(1)} &= f^{(1)}\!\left(W^{(1)} x + b^{(1)}\right) \\ y &= f^{(2)}\!\left(W^{(2)} h^{(1)} + b^{(2)}\right) \end{aligned}

Where:

$x$ = input vector (length 784)
$h^{(1)}$ = hidden outputs (length 16)
$y$ = output probabilities (length 10)
$f^{(1)}$ = a non-linear activation function
$f^{(2)}$ = softmax
$W^{(1)}, W^{(2)}$ = weight matrices
$b^{(1)}, b^{(2)}$ = bias vectors
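To make the shapes concrete, the sketch below runs one forward pass through a 784 → 16 → 10 network in plain Python. It assumes tanh as the hidden activation, small random weights, and a random vector standing in for a flattened image; none of these choices come from a trained model.

```python
import math
import random

random.seed(0)


def init_layer(n_in: int, n_out: int):
    """Small random weights (n_out rows of n_in values) and zero biases."""
    W = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def affine(W, b, x):
    """Compute W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    m = max(z)  # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


W1, b1 = init_layer(784, 16)
W2, b2 = init_layer(16, 10)

x = [random.random() for _ in range(784)]       # stand-in for 28x28 pixels
h1 = [math.tanh(v) for v in affine(W1, b1, x)]  # hidden outputs
y = softmax(affine(W2, b2, h1))                 # class probabilities

n_params = 784 * 16 + 16 + 16 * 10 + 10
print(len(h1), len(y), n_params)  # 16 10 12730
```

Even this small network has 12,730 trainable parameters, which is why hand-tuning weights stops being an option.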

Softmax (categorical probabilities)

The softmax function maps a length-$k$ logit vector to a length-$k$ vector of non-negative values that sum to 1 (a probability distribution over the $k$ classes):

\hat{y}_i \;=\; \text{softmax}(z)_i \;=\; \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}

Where:

$z \in \mathbb{R}^k$ = logits (real-valued scores)
$\hat{y} \in (0,1)^k$ = normalised outputs (“probabilities”)
$\sum_{i=1}^k \hat{y}_i = 1$ = normalisation (on the probability simplex)

Worked example (logits $[2.0, 1.0, 0.1]$): exponentials $\approx [7.39, 2.72, 1.11]$, sum $\approx 11.21$, softmax $\hat{y} \approx [0.66, 0.24, 0.10]$.
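The worked example can be reproduced with a short softmax function. This sketch subtracts the largest logit before exponentiating, a common trick that leaves the result unchanged but avoids overflow for large inputs.

```python
import math


def softmax(z: list[float]) -> list[float]:
    """Map logits to non-negative values that sum to 1."""
    m = max(z)                            # shifting by a constant cancels out
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
```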

How it’s used with cross-entropy: for single-label classification, the loss for the correct class index $c$ is

L \;=\; -\log\!\big(\hat{y}_c\big).

With the layers defined, the next question is how to make the parameters fit the data. We start by defining a loss and a rule for updating the weights.

The Learning Problem

Stacking perceptrons increases expressiveness; training requires a principled way to update many parameters.

Loss Functions

Mean Squared Error (regression):

L = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2

Where:

$n$ = number of samples
$y_i$ = true value
$\hat{y}_i$ = predicted value

Cross-Entropy (single-label classification):

L = -\log\!\big(\hat{y}_{c}\big)

Where:

$c$ = index of the correct class
$\hat{y}_{c}$ = predicted probability for class $c$

A loss function tells us how far off a prediction is; an optimiser tells us how to move the parameters to reduce that loss.
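Both losses are short enough to write out directly. The values below are illustrative inputs chosen for easy arithmetic, not results from a real model.

```python
import math


def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean squared error over n samples."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def cross_entropy(probs: list[float], c: int) -> float:
    """Single-label cross-entropy: -log of the probability of class c."""
    return -math.log(probs[c])


mse_value = mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
ce_confident = cross_entropy([0.66, 0.24, 0.10], c=0)  # correct class likely
ce_wrong = cross_entropy([0.66, 0.24, 0.10], c=2)      # correct class unlikely

print(round(mse_value, 4))     # 0.1667
print(round(ce_confident, 3))  # 0.416
print(round(ce_wrong, 3))      # 2.303
```

Cross-entropy grows quickly as the probability assigned to the correct class shrinks, which is what pushes the network towards confident correct predictions.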

Optimisation: Gradient Descent and SGD

Full-batch gradient descent uses the gradient over the entire dataset:

w_{t+1} = w_t - \eta \,\nabla L(w_t)

Stochastic Gradient Descent (SGD), usually in its mini-batch form, uses the gradient over a small batch $B_t$:

w_{t+1} = w_t - \eta \,\nabla L_{B_t}(w_t)

Where:

$w_t$ = parameters at step $t$
$\eta$ = learning rate (step size)
$L(\cdot)$ = loss over the full dataset
$L_{B_t}(\cdot)$ = loss over mini-batch $B_t$
$\nabla$ = gradient operator

Tiny worked example: assume

\nabla L_{B_t}(w_t)= \begin{bmatrix} 0.30 \\ -0.10 \end{bmatrix}, \qquad \eta=0.01

Then one step gives

w_{t+1} = w_t - 0.01 \begin{bmatrix} 0.30 \\ -0.10 \end{bmatrix} = w_t + \begin{bmatrix} -0.003 \\ \;\;0.001 \end{bmatrix}.
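The same step in code, as a minimal sketch. The starting point `w_t = [1.0, 2.0]` is an assumption added for illustration, since the example above leaves $w_t$ unspecified.

```python
def sgd_step(w: list[float], grad: list[float], lr: float) -> list[float]:
    """One update: w_next = w - lr * grad, applied element-wise."""
    return [w_i - lr * g_i for w_i, g_i in zip(w, grad)]


w_t = [1.0, 2.0]          # assumed starting parameters
grad = [0.30, -0.10]      # mini-batch gradient from the example
w_next = sgd_step(w_t, grad, lr=0.01)
print([round(v, 3) for v in w_next])  # [0.997, 2.001]
```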

Gradient descent still needs the gradients of the loss with respect to every parameter. Backpropagation provides them efficiently for multi-layer networks.

The Core Idea of Backpropagation

Backpropagation applies the chain rule to propagate error signals from the output layer back through the network, producing the gradients required by gradient descent.

Gradient Descent in One Equation

w \leftarrow w - \eta \,\frac{\partial L}{\partial w}

Where:

$w$ = a weight parameter
$\eta$ = learning rate
$L$ = loss
$\frac{\partial L}{\partial w}$ = gradient of $L$ w.r.t. $w$

Chain Rule (Mechanism)

\frac{\partial L}{\partial w} \;=\; \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w}

Where:

$L$ = loss
$y$ = neuron output (or intermediate)
$w$ = weight
$\frac{\partial L}{\partial y}$ = sensitivity of loss to the neuron output
$\frac{\partial y}{\partial w}$ = sensitivity of the neuron output to the weight

Worked example (single linear neuron):
If $y = w\,x$ and $L = \tfrac{1}{2}(y - t)^2$, then

\frac{\partial L}{\partial y} = (y - t), \qquad \frac{\partial y}{\partial w} = x, \qquad \Rightarrow\; \frac{\partial L}{\partial w} = (y - t)\,x.
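A quick way to check a chain-rule derivation is to compare it with a finite-difference estimate. The sketch below does this for the single linear neuron; the values of `w`, `x` and `t` are arbitrary illustrative choices.

```python
def loss(w: float, x: float, t: float) -> float:
    y = w * x
    return 0.5 * (y - t) ** 2


def analytic_grad(w: float, x: float, t: float) -> float:
    """Chain rule: dL/dw = (y - t) * x."""
    return (w * x - t) * x


def numeric_grad(w: float, x: float, t: float, eps: float = 1e-6) -> float:
    """Central finite difference approximation of dL/dw."""
    return (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)


w, x, t = 0.5, 2.0, 3.0   # y = 1.0, so (y - t) * x = -2.0 * 2.0
print(analytic_grad(w, x, t))  # -4.0
print(abs(analytic_grad(w, x, t) - numeric_grad(w, x, t)) < 1e-6)  # True
```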

A Tiny MLP (One Hidden Neuron)

Forward pass:

\begin{aligned} h &= f^{(1)}(w_1 x + b_1) \\ \hat{y} &= f^{(2)}(w_2 h + b_2) \end{aligned}

Loss (MSE for simplicity):

L = \tfrac{1}{2}\,(\hat{y} - t)^2

Backward pass (gradients):

\frac{\partial L}{\partial w_2} = (\hat{y} - t)\, f^{(2)'}(w_2 h + b_2)\, h

\frac{\partial L}{\partial w_1} = (\hat{y} - t)\, f^{(2)'}(w_2 h + b_2)\, w_2 \, f^{(1)'}(w_1 x + b_1)\, x

Where:

$t$ = target value
$f^{(1)}, f^{(2)}$ = non-linear activations (differentiable)
primes denote derivatives of the activations
other symbols as defined above

Together, the forward pass, loss, backward pass, and update rule form a complete training step. The reason this matters is efficiency: the backward pass reuses forward-pass intermediates, so computing all gradients costs only a small constant multiple of a forward pass, rather than one extra pass per parameter.
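Putting the four pieces together for the one-hidden-neuron network gives the sketch below. It assumes tanh for $f^{(1)}$ and a sigmoid for $f^{(2)}$, adds the bias gradients (which follow the same pattern as the weight gradients), and uses arbitrary starting parameters and a single training example.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def forward(p: dict, x: float):
    h = math.tanh(p["w1"] * x + p["b1"])     # f1 = tanh (assumed)
    y_hat = sigmoid(p["w2"] * h + p["b2"])   # f2 = sigmoid (assumed)
    return h, y_hat


def loss_fn(p: dict, x: float, t: float) -> float:
    return 0.5 * (forward(p, x)[1] - t) ** 2


def backward(p: dict, x: float, t: float) -> dict:
    h, y_hat = forward(p, x)
    delta2 = (y_hat - t) * y_hat * (1.0 - y_hat)  # (y_hat - t) * f2'(z2)
    delta1 = delta2 * p["w2"] * (1.0 - h * h)     # ... * w2 * f1'(z1)
    return {"w2": delta2 * h, "b2": delta2, "w1": delta1 * x, "b1": delta1}


params = {"w1": 0.5, "b1": 0.0, "w2": -0.3, "b2": 0.1}  # illustrative start
x, t, lr = 1.5, 1.0, 0.5

losses = []
for _ in range(20):
    losses.append(loss_fn(params, x, t))
    grads = backward(params, x, t)
    params = {k: params[k] - lr * grads[k] for k in params}

print(losses[-1] < losses[0])  # True
```

Note how `delta2` is computed once and reused for every gradient in both layers; that reuse is the whole trick.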

Why Backpropagation Matters

In practice, this efficiency is what lets MLPs scale to thousands or millions of parameters without prohibitive compute.

Efficiency in MLPs

Activations $h^{(l)}$ and pre-activations $z^{(l)}$ are cached in the forward pass.
The output error is computed first, then propagated backward layer by layer.
Each layer reuses cached values to compute its gradients.

The cost of computing all gradients is on the same order as a second forward pass.
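The caching pattern generalises to any number of layers. The following is a minimal sketch in plain Python, assuming tanh at every layer (including the output) and the loss $\tfrac{1}{2}\lVert \hat{y} - t \rVert^2$; the layer sizes and inputs are arbitrary, and a real implementation would use a tensor library.

```python
import math
import random

random.seed(1)


def init_layer(n_in, n_out):
    W = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def forward(layers, x):
    """Return the output plus a cache of (layer input, pre-activation)."""
    cache, h = [], x
    for W, b in layers:
        z = [sum(w * v for w, v in zip(row, h)) + bi for row, bi in zip(W, b)]
        cache.append((h, z))
        h = [math.tanh(v) for v in z]
    return h, cache


def backward(layers, cache, y_hat, t):
    """Walk the layers in reverse, reusing cached values at each step."""
    grads = []
    dh = [y - ti for y, ti in zip(y_hat, t)]              # dL/dy_hat
    for (W, _), (h_in, z) in zip(reversed(layers), reversed(cache)):
        delta = [d * (1.0 - math.tanh(v) ** 2) for d, v in zip(dh, z)]  # dL/dz
        dW = [[d * hi for hi in h_in] for d in delta]     # dL/dW
        grads.append((dW, delta))                         # dL/db equals delta
        dh = [sum(W[j][i] * delta[j] for j in range(len(delta)))
              for i in range(len(h_in))]                  # error for layer below
    return grads[::-1]


def loss(layers, x, t):
    y_hat, _ = forward(layers, x)
    return 0.5 * sum((y - ti) ** 2 for y, ti in zip(y_hat, t))


layers = [init_layer(3, 4), init_layer(4, 2)]
x, t = [0.2, -0.4, 0.7], [0.5, -0.5]
y_hat, cache = forward(layers, x)
grads = backward(layers, cache, y_hat, t)
print(len(grads), len(grads[0][0]), len(grads[0][0][0]))  # 2 4 3
```

Each layer's gradients come from one pass over its cached inputs, so the total work stays proportional to the number of weights.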

Scalability and Universality

Backprop works with any differentiable activations (ReLU, tanh, Mish, …) and losses (cross-entropy, MSE, …), and scales to deep/wide MLPs. This universality is why the same training loop underpins everything from small MLPs to large modern architectures.

Limitations and Challenges

Even with efficient gradients, training deep MLPs isn’t trivial. Common issues include vanishing/exploding gradients, saddle points, overfitting, and compute cost.

Vanishing and Exploding Gradients

As gradients flow backward through many layers, repeated multiplication by derivatives can shrink them towards zero or blow them up:

\frac{\partial L}{\partial W^{(1)}} \;\propto\; \Bigg(\prod_{l=2}^{N} f^{(l)'}(z^{(l)})\,W^{(l)}\Bigg)

Where:

$L$ = loss
$N$ = number of layers
$W^{(1)}$ = weights of the earliest layer
$z^{(l)}$ = pre-activation at layer $l$
$f^{(l)'}(z^{(l)})$ = derivative of the activation at layer $l$
$W^{(l)}$ = weight matrix at layer $l$

If $f^{(l)'}(z^{(l)})$ is small (e.g., saturating activations), gradients vanish and early layers learn very slowly. If products grow large, gradients explode and training becomes numerically unstable.
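A scalar toy version of this product shows how quickly the effect sets in. The sketch assumes a chain of one-neuron sigmoid layers evaluated at $z = 0$, where the sigmoid derivative reaches its maximum of 0.25, with every layer sharing the same weight `w`; both simplifications are for illustration only.

```python
import math


def sigmoid_prime(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backward_factor(depth: int, w: float, z: float = 0.0) -> float:
    """Product of f'(z) * w over `depth` scalar layers."""
    factor = 1.0
    for _ in range(depth):
        factor *= sigmoid_prime(z) * w
    return factor


for depth in (5, 10, 20):
    vanishing = backward_factor(depth, w=1.0)  # each layer multiplies by 0.25
    exploding = backward_factor(depth, w=8.0)  # each layer multiplies by 2.0
    print(depth, vanishing, exploding)
```

With `w = 1.0` the factor after ten layers is already below one millionth; with `w = 8.0` it has grown to 1024.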

Local Minima and Saddle Points

The loss landscape is non-convex:

Training may converge to a local minimum (not globally optimal).
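More often, networks encounter saddle points: points where the gradient is zero but the surface curves upwards in some directions and downwards in others, typically surrounded by flat regions where gradients are close to zero.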

Gradient descent trajectories converging to different local minima on f(x)=sin(3x)+0.1x^2 — Gradient descent on a non-convex objective, f(x)=\sin(3x)+0.1x^2, from multiple initializations. Each path converges to a different local minimum.
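The behaviour in the figure can be reproduced with a few lines of plain gradient descent on the same objective. The starting points, learning rate and step count below are assumptions chosen for illustration; they are not necessarily the settings used to draw the figure.

```python
import math


def f(x: float) -> float:
    return math.sin(3 * x) + 0.1 * x * x


def grad_f(x: float) -> float:
    return 3 * math.cos(3 * x) + 0.2 * x


def gradient_descent(x0: float, lr: float = 0.05, steps: int = 500) -> float:
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


starts = [-2.0, 0.0, 1.0]
minima = [gradient_descent(x0) for x0 in starts]
for x0, x_min in zip(starts, minima):
    print(f"start {x0:+.1f} -> x = {x_min:+.3f}, f(x) = {f(x_min):+.3f}")
```

The three runs settle at roughly -2.56, -0.51 and 1.54: three different minima, each with a different loss value, determined entirely by where the run started.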

In practice, stochastic gradient descent often escapes poor regions over time because mini-batch noise perturbs the updates.

Why SGD often escapes poor regions

SGD updates parameters using a mini-batch estimate of the gradient:

w_{t+1} = w_t - \eta \,\nabla L_{B_t}(w_t)

Where:

$w_t$ = parameters at step $t$
$\eta$ = learning rate
$L_{B_t}$ = loss over the mini-batch $B_t$
$\nabla L_{B_t}(w_t)$ = stochastic (noisy) gradient estimate

The mini-batch gradient can be written as the full gradient plus noise:

\nabla L_{B_t}(w_t) \;=\; \nabla L(w_t) \;+\; \xi_t

Where:

$\nabla L(w_t)$ = full-batch gradient
$\xi_t$ = zero-mean noise term whose variance shrinks as batch size increases

Near flat regions or saddle points, $\lVert \nabla L(w_t)\rVert$ is small, so the noise $\xi_t$ can dominate, nudging the iterate out of plateaus over many steps.
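The decomposition into full gradient plus noise is easy to simulate. The sketch below assumes a toy per-example loss $\tfrac{1}{2}(w - a_i)^2$ over synthetic Gaussian data, so each example's gradient is simply $w - a_i$; the dataset, batch sizes and trial count are all illustrative.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic a_i values
w = 0.5


def grad_on(batch: list[float], w: float) -> float:
    """Average gradient of 0.5 * (w - a)^2 over a batch."""
    return sum(w - a for a in batch) / len(batch)


full_grad = grad_on(data, w)


def noise_stats(batch_size: int, trials: int = 2000):
    """Mean and standard deviation of xi = mini-batch grad - full grad."""
    noise = [grad_on(random.sample(data, batch_size), w) - full_grad
             for _ in range(trials)]
    return statistics.mean(noise), statistics.stdev(noise)


stats = {bs: noise_stats(bs) for bs in (4, 64)}
for bs, (mean, std) in stats.items():
    print(f"batch size {bs:>2}: noise mean {mean:+.3f}, std {std:.3f}")
```

The noise averages out to about zero at both batch sizes, while its spread is several times larger for the smaller batch.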

Overfitting

MLPs with many parameters can memorise the training set instead of learning patterns that generalise to unseen inputs.

Typical signs:

Training loss decreases steadily.
Validation loss plateaus or increases.
Predictions are confident on training data and unreliable on held-out data.

This risk is higher when the model’s capacity far exceeds the amount of informative data (an MLP with many parameters trained on a small dataset), when labels are noisy or inconsistent, or when there is data leakage or a distribution shift between training and validation. In short: training error can be low while true out-of-sample error remains high — the hallmark of overfitting.
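The typical signs can be checked mechanically from the loss curves. The sketch below is one simple heuristic, not a standard API: `overfitting_epoch` and its `patience` parameter are hypothetical names, and the loss histories are made-up numbers shaped like a typical overfitting run.

```python
def overfitting_epoch(train_losses, val_losses, patience=3):
    """Return the epoch with the best validation loss if validation has not
    improved for `patience` epochs while training loss kept falling,
    otherwise None."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    epochs_since_best = len(val_losses) - 1 - best_epoch
    train_still_falling = train_losses[-1] < train_losses[best_epoch]
    if epochs_since_best >= patience and train_still_falling:
        return best_epoch
    return None


# Illustrative histories: training keeps improving, validation turns around.
train = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
val = [0.95, 0.70, 0.58, 0.55, 0.56, 0.58, 0.61, 0.65]

print(overfitting_epoch(train, val))  # 3
```

Here validation loss bottoms out at epoch 3 (counting from 0) and rises afterwards while training loss continues to fall, so the function flags that epoch.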

Computational Cost

Training time and memory usage grow with model size, batch size, dataset size, and sequence/feature length. The backward pass roughly doubles the compute of the forward pass and must cache intermediate activations; at typical batch sizes, activations (not parameters) often dominate peak memory.

Compute scales with the number of floating-point operations per step (FLOPs), which increases with depth, width, and input length.
Memory (training) = parameters + optimiser state + activations; the last term scales with batch size and the sum of per-layer feature maps.

At inference, memory and bandwidth are driven mostly by parameter tensors. Quantisation (covered next in the Foundations of AI series) reduces their precision (e.g., FP32 → INT8/INT4), shrinking model size and memory-bandwidth requirements and enabling CPU-only or edge deployments. Compact MLPs can run on low-power devices such as a Raspberry Pi 5; larger models typically require desktop-class GPUs or multi-node clusters.
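A back-of-the-envelope estimate for the 784 → 16 → 10 digit classifier from earlier makes these terms concrete. The sketch ignores optimiser state and quantisation metadata such as scales and zero-points, and counts the input vector among the cached activations; the helper names and the batch size of 64 are assumptions for illustration.

```python
def mlp_param_count(sizes: list[int]) -> int:
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def param_bytes(n_params: int, bits: int) -> int:
    """Storage for the parameters alone at a given precision."""
    return n_params * bits // 8


def activation_bytes(sizes: list[int], batch_size: int, bits: int = 32) -> int:
    """Values cached for the backward pass: one vector per layer, per sample."""
    return batch_size * sum(sizes) * bits // 8


sizes = [784, 16, 10]
n_params = mlp_param_count(sizes)
footprint = {name: param_bytes(n_params, bits)
             for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]}

print(n_params)                     # 12730
print(footprint)                    # {'FP32': 50920, 'INT8': 12730, 'INT4': 6365}
print(activation_bytes(sizes, 64))  # 207360
```

Even for this tiny model, the cached activations at batch size 64 take about four times the memory of the FP32 parameters, while INT8 cuts the parameter footprint to a quarter.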

Summary: Common Issues at a Glance

Problem	Practical effect
Vanishing gradients	Early layers learn very slowly
Exploding gradients	Numerical instability / divergence
Saddles / flat regions	Slow progress, apparent “stuck” behaviour
Overfitting	Low train loss, high validation loss
Compute cost	Long training times, high memory usage

Looking Ahead: Making inference cheaper with quantisation

In this article, we moved from single perceptrons to multi-layer perceptrons (MLPs) and the training mechanics that make them learn.

You’ve learned:

Why a single perceptron is limited, and how stacking layers with non-linear activations increases expressiveness.
The structure of an MLP and the per-layer mapping $h^{(l)} = f^{(l)}(W^{(l)}h^{(l-1)} + b^{(l)})$.
How we train: define a loss, choose an optimiser (full-batch vs mini-batch SGD), and obtain gradients via backpropagation.
Practical challenges: vanishing/exploding gradients, saddles/local minima, overfitting, and compute cost.

Next in this series: we’ll explore quantisation — representing weights/activations with fewer bits (e.g., FP32 → INT8/INT4) to reduce memory bandwidth and speed up inference while preserving accuracy.
