Written 28 August 2025 ~ 12 min read
Multi-layer Perceptrons and Backpropagation

Benjamin Clark
Part 3 of 3 in Foundations of AI
Introduction

This article is the third in our Foundations of AI series. If you haven’t read the first two, on perceptrons and activation functions, start there — this article builds on those ideas.
Multi-Layer Perceptrons (MLPs) extend the simple perceptron by stacking layers of neurons with non-linear activations. This allows networks to model curved and complex decision boundaries. Training such models requires a systematic way to adjust parameters — ranging from thousands to billions of weights. Backpropagation is the algorithm that makes this feasible.
The Goal
By the end of this article, you will understand:
- Why stacking perceptrons into layers creates more expressive models.
- The structure of an MLP (input, hidden, and output layers).
- The intuition behind backpropagation and how it is used to train deep networks.
- Common challenges in training (vanishing gradients, overfitting) and why they matter.
Why Go Beyond Single Perceptrons?
A single perceptron computes a weighted sum of its inputs and applies an activation:
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
Where:
- $y$ = perceptron output
- $f$ = activation function
- $w_i$ = weight for input $x_i$
- $x_i$ = input feature
- $b$ = bias term
- $n$ = number of input features
This makes it a linear classifier. If the classes can be separated by a hyperplane, a single perceptron can solve it. For patterns like XOR, it cannot.
The XOR Problem
| Input A | Input B | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
There is no single straight line that separates the 1s from the 0s.

The XOR limitation was highlighted in Marvin Minsky and Seymour Papert’s Perceptrons (1969), which proved that single-layer perceptrons cannot represent linearly inseparable functions such as XOR. This helped trigger a decline in neural-network research until multi-layer training revived the field in the 1980s.
Perceptrons: An Introduction to Computational Geometry (MIT Press, 1969)
Stacking Perceptrons
To solve XOR, you need at least two layers:
- First layer: constructs intermediate features.
- Second layer: combines them to form the final decision.
Mathematically, instead of a single transformation
$$y = f(w \cdot x + b)$$
we compose layers:
$$y = f^{(2)}\left(W^{(2)} f^{(1)}\left(W^{(1)} x + b^{(1)}\right) + b^{(2)}\right)$$
Where:
- $f^{(1)}, f^{(2)}$ = activation functions for the first and second layers
- $W^{(1)}, W^{(2)}$ = weight matrices
- $b^{(1)}, b^{(2)}$ = bias vectors
- $x$ = input vector
This composition of functions gives MLPs their expressive power.
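To make the composition concrete, here is a minimal NumPy sketch of the two-layer form solving XOR. The weights and step activation are hand-picked assumptions for illustration (one hidden unit behaves like OR, the other like AND), not learned parameters:
```python
import numpy as np

def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

# Hand-picked weights (illustrative, not learned).
# Hidden unit 1 fires for OR(a, b); hidden unit 2 fires for AND(a, b).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output unit computes OR(a, b) AND NOT AND(a, b), i.e. XOR.
W2 = np.array([[1.0, -2.0]])
b2 = np.array([-0.5])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h = step(X @ W1.T + b1)   # first layer: intermediate features
y = step(h @ W2.T + b2)   # second layer: final decision
print(y.ravel())          # [0. 1. 1. 0.]
```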
Structure of a Multi-Layer Perceptron (MLP)
An MLP contains three types of layers:
- Input layer – receives raw features (e.g., pixel values or embeddings).
- Hidden layers – each neuron applies a weighted sum, adds a bias, and passes the result through a non-linear activation to extract new features.
- Output layer – produces the final prediction (classification, regression, etc.).
Per layer, the mapping is:
$$h^{(l)} = f^{(l)}\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$$
Where:
- $h^{(l)}$ = output vector from layer $l$
- $h^{(l-1)}$ = input vector to layer $l$ (output of layer $l-1$)
- $W^{(l)}$ = weight matrix for layer $l$
- $b^{(l)}$ = bias vector for layer $l$
- $f^{(l)}$ = activation function at layer $l$
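In code, a forward pass is just this mapping applied in a loop. A minimal sketch, assuming NumPy, a ReLU hidden activation, and arbitrary layer sizes:
```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Apply h = f(W @ h + b) for each layer in turn."""
    h = x
    for W, b, f in layers:
        h = f(W @ h + b)
    return h

rng = np.random.default_rng(0)
# Illustrative 4 -> 8 -> 3 network with random weights.
layers = [
    (rng.normal(size=(8, 4)), np.zeros(8), relu),         # hidden layer
    (rng.normal(size=(3, 8)), np.zeros(3), lambda z: z),  # linear output
]
print(forward(rng.normal(size=4), layers))  # length-3 output vector
```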

A Simple Example (Digit Classification)
Suppose we classify a 28×28 handwritten digit (MNIST). A small MLP might use:
- Input: 784 features (flattened pixels).
- Hidden: 16 neurons with a non-linear activation.
- Output: 10 neurons (one per digit class), normalised with softmax.
Computation:
$$h^{(1)} = f^{(1)}\left(W^{(1)} x + b^{(1)}\right)$$
$$y = f^{(2)}\left(W^{(2)} h^{(1)} + b^{(2)}\right)$$
Where:
- $x$ = input vector (length 784)
- $h^{(1)}$ = hidden outputs (length 16)
- $y$ = output probabilities (length 10)
- $f^{(1)}$ = a non-linear activation function
- $f^{(2)}$ = softmax
- $W^{(1)}, W^{(2)}$ = weight matrices
- $b^{(1)}, b^{(2)}$ = bias vectors
The softmax function maps a length-$k$ logit vector to a length-$k$ vector of non-negative values that sum to 1 (a probability distribution over the $k$ classes):
$$\hat{y}_i = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$
Where:
- $z \in \mathbb{R}^k$ = logits (real-valued scores)
- $\hat{y} \in (0,1)^k$ = normalised outputs (“probabilities”)
- $\sum_{i=1}^{k} \hat{y}_i = 1$ (the outputs lie on the probability simplex)
Worked example (logits $[2.0, 1.0, 0.1]$): exponentials $\approx [7.39, 2.72, 1.11]$, sum $\approx 11.22$, softmax $\hat{y} \approx [0.66, 0.24, 0.10]$.
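In code, a numerically stable softmax subtracts the maximum logit before exponentiating (which leaves the result unchanged); this minimal sketch reproduces the worked example:
```python
import numpy as np

def softmax(z):
    """Stable softmax: shifting by max(z) avoids overflow, result unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659 0.242 0.099]
```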
How it’s used with cross-entropy: for single-label classification, the loss for the correct class index $c$ is
$$L = -\log(\hat{y}_c).$$
With the layers defined, the next question is how to make the parameters fit the data. We start by defining a loss and a rule for updating the weights.
The Learning Problem
Stacking perceptrons increases expressiveness; training requires a principled way to update many parameters.
Loss Functions
Mean Squared Error (regression):
$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Where:
- $n$ = number of samples
- $y_i$ = true value
- $\hat{y}_i$ = predicted value
Cross-Entropy (single-label classification):
$$L = -\log(\hat{y}_c)$$
Where:
- $c$ = index of the correct class
- $\hat{y}_c$ = predicted probability for class $c$
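Both losses are a few lines of NumPy. A minimal sketch; the epsilon guarding log(0) is an implementation detail, not part of the definition:
```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error over n samples."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(probs, c, eps=1e-12):
    """Negative log-probability of the correct class index c."""
    return -np.log(probs[c] + eps)

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))   # 0.25
print(cross_entropy(np.array([0.66, 0.24, 0.10]), c=0))  # ~0.416
```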
A loss function tells us how far off a prediction is; an optimiser tells us how to move the parameters to reduce that loss.
Optimisation: Gradient Descent and SGD
Full-batch gradient descent uses the gradient over the entire dataset:
$$w_{t+1} = w_t - \eta \nabla L(w_t)$$
Stochastic Gradient Descent (SGD) (usually mini-batch) uses the gradient over a small batch $B_t$:
$$w_{t+1} = w_t - \eta \nabla L_{B_t}(w_t)$$
Where:
- $w_t$ = parameters at step $t$
- $\eta$ = learning rate (step size)
- $L(\cdot)$ = loss over the full dataset
- $L_{B_t}(\cdot)$ = loss over mini-batch $B_t$
- $\nabla$ = gradient operator
Tiny worked example: assume
$$\nabla L_{B_t}(w_t) = \begin{bmatrix} 0.30 \\ -0.10 \end{bmatrix}, \qquad \eta = 0.01.$$
Then one step gives
$$w_{t+1} = w_t - 0.01 \begin{bmatrix} 0.30 \\ -0.10 \end{bmatrix} = w_t + \begin{bmatrix} -0.003 \\ 0.001 \end{bmatrix}.$$
Gradient descent still needs the gradients of the loss with respect to every parameter. Backpropagation provides them efficiently for multi-layer networks.
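Before moving on, here is the worked step as code: one line of array arithmetic. A sketch with an arbitrary assumed starting point $w_t$:
```python
import numpy as np

eta = 0.01
w = np.array([0.5, 0.5])        # arbitrary starting parameters (assumed)
grad = np.array([0.30, -0.10])  # the mini-batch gradient from the example

w = w - eta * grad              # one SGD step
print(w)                        # [0.497 0.501]
```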
The Core Idea of Backpropagation
Backpropagation applies the chain rule to propagate error signals from the output layer back through the network, producing the gradients required by gradient descent.
Gradient Descent in One Equation
$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$
Where:
- $w$ = a weight parameter
- $\eta$ = learning rate
- $L$ = loss
- $\frac{\partial L}{\partial w}$ = gradient of $L$ w.r.t. $w$
Chain Rule (Mechanism)
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w}$$
Where:
- $L$ = loss
- $y$ = neuron output (or intermediate)
- $w$ = weight
- $\frac{\partial L}{\partial y}$ = sensitivity of loss to the neuron output
- $\frac{\partial y}{\partial w}$ = sensitivity of the neuron output to the weight
Worked example (single linear neuron):
If $y = wx$ and $L = \frac{1}{2}(y - t)^2$, then
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w} = (y - t) \cdot x.$$
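A useful habit is to check a hand-derived gradient against a finite-difference estimate. A minimal sketch, assuming arbitrary values for $w$, $x$, and $t$:
```python
w, x, t = 0.8, 2.0, 1.0                 # arbitrary values (assumed)
y = w * x
analytic = (y - t) * x                  # chain-rule result: (y - t) * x

L = lambda w: 0.5 * (w * x - t) ** 2    # loss as a function of w
eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)
print(analytic, numeric)                # both ~1.2
```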
A Tiny MLP (One Hidden Neuron)
Forward pass:
$$h = f^{(1)}(w_1 x + b_1)$$
$$\hat{y} = f^{(2)}(w_2 h + b_2)$$
Loss (MSE for simplicity):
$$L = \frac{1}{2}(\hat{y} - t)^2$$
Backward pass (gradients):
$$\frac{\partial L}{\partial w_2} = (\hat{y} - t)\, f^{(2)\prime}(w_2 h + b_2)\, h$$
$$\frac{\partial L}{\partial w_1} = (\hat{y} - t)\, f^{(2)\prime}(w_2 h + b_2)\, w_2\, f^{(1)\prime}(w_1 x + b_1)\, x$$
Where:
- $t$ = target value
- $f^{(1)}, f^{(2)}$ = non-linear activations (differentiable)
- primes denote derivatives of the activations
- other symbols as defined above
Together, the forward pass, loss, backward pass, and update rule form a complete training step. The reason this matters is efficiency: the backward pass reuses forward-pass intermediates, so computing all gradients costs roughly another forward pass.
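Here is that complete step for the one-hidden-neuron network, as a minimal sketch. The sigmoid activations, initial values, and learning rate are assumptions for illustration, not prescribed choices:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed values for illustration.
w1, b1, w2, b2 = 0.5, 0.0, -0.3, 0.0
x, t, eta = 1.0, 1.0, 0.1

# Forward pass (intermediates h and y_hat are reused in the backward pass).
h = sigmoid(w1 * x + b1)
y_hat = sigmoid(w2 * h + b2)
loss = 0.5 * (y_hat - t) ** 2

# Backward pass, using sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
delta = (y_hat - t) * y_hat * (1 - y_hat)  # error signal at the output
dw2 = delta * h
dw1 = delta * w2 * h * (1 - h) * x

# Update rule (gradient descent).
w2 -= eta * dw2
w1 -= eta * dw1
print(f"loss={loss:.4f}, dw1={dw1:.5f}, dw2={dw2:.5f}")
```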
Why Backpropagation Matters
In practice, this efficiency is what lets MLPs scale to thousands or millions of parameters without prohibitive compute.
Efficiency in MLPs
- Activations $h^{(l)}$ and pre-activations $z^{(l)}$ are cached in the forward pass.
- The output error is computed first, then propagated backward layer by layer.
- Each layer reuses cached values to compute its gradients.
The cost of computing all gradients is on the same order as a second forward pass.
Scalability and Universality
Backprop works with any differentiable activations (ReLU, tanh, Mish, …) and losses (cross-entropy, MSE, …), and scales to deep/wide MLPs. This universality is why the same training loop underpins everything from small MLPs to large modern architectures.
Limitations and Challenges
Even with efficient gradients, training deep MLPs isn’t trivial. Common issues include vanishing/exploding gradients, saddle points, overfitting, and compute cost.
Vanishing and Exploding Gradients
As gradients flow backward through many layers, repeated multiplication by derivatives can shrink them towards zero or blow them up:
$$\frac{\partial L}{\partial W^{(1)}} \propto \prod_{l=2}^{L} f^{(l)\prime}\left(z^{(l)}\right) W^{(l)}$$
Where:
- $L$ = loss (the product's upper limit $L$ is the number of layers)
- $W^{(1)}$ = weights of the earliest layer
- $f^{(l)\prime}(z^{(l)})$ = derivative of the activation at layer $l$
- $W^{(l)}$ = weight matrix at layer $l$
If $f^{(l)\prime}(z^{(l)})$ is small (e.g., saturating activations), gradients vanish and early layers learn very slowly. If the products grow large, gradients explode and training becomes numerically unstable.
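You can see the shrinkage numerically: the sigmoid's derivative never exceeds 0.25, so (taking the weights as 1 for simplicity, an assumption) a product of such factors collapses geometrically with depth:
```python
# The sigmoid's derivative peaks at 0.25 (at z = 0), so even in the best
# case a product of such factors shrinks geometrically with depth.
max_slope = 0.25
for depth in (5, 10, 20):
    print(depth, max_slope ** depth)  # 0.000977, ~9.5e-07, ~9.1e-13
```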
Local Minima and Saddle Points
The loss landscape is non-convex:
- Training may converge to a local minimum (not globally optimal).
- More often, networks encounter saddle points — flat regions where gradients are close to zero.

In practice, stochastic gradient descent often escapes poor regions over time because mini-batch noise perturbs the updates.
Why SGD often escapes poor regions
SGD updates parameters using a mini-batch estimate of the gradient:
$$w_{t+1} = w_t - \eta \nabla L_{B_t}(w_t)$$
Where:
- $w_t$ = parameters at step $t$
- $\eta$ = learning rate
- $L_{B_t}$ = loss over the mini-batch $B_t$
- $\nabla L_{B_t}(w_t)$ = stochastic (noisy) gradient estimate
The mini-batch gradient can be written as the full gradient plus noise:
$$\nabla L_{B_t}(w_t) = \nabla L(w_t) + \xi_t$$
Where:
- $\nabla L(w_t)$ = full-batch gradient
- $\xi_t$ = zero-mean noise term whose variance shrinks as batch size increases
Near flat regions or saddle points, $\|\nabla L(w_t)\|$ is small, so the noise $\xi_t$ can dominate, nudging the iterate out of plateaus over many steps.
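A 1-D toy stand-in (all values here are assumptions for illustration): $f(w) = w^4 - w^2$ has zero gradient at $w = 0$, so full-batch descent started there never moves, while the noisy update drifts into one of the minima near $w \approx \pm 0.707$:
```python
import numpy as np

rng = np.random.default_rng(0)
grad = lambda w: 4 * w**3 - 2 * w  # gradient of f(w) = w**4 - w**2
eta = 0.1

w_full, w_sgd = 0.0, 0.0           # both start on the flat point at w = 0
for _ in range(200):
    w_full -= eta * grad(w_full)                       # gradient is 0: stuck
    w_sgd -= eta * (grad(w_sgd) + rng.normal(0, 0.1))  # noise kicks it out
print(w_full, w_sgd)               # 0.0 vs roughly +/-0.707 (a minimum)
```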
Overfitting
MLPs with many parameters can memorise the training set instead of learning patterns that generalise to unseen inputs.
Typical signs:
- Training loss decreases steadily.
- Validation loss plateaus or increases.
- Predictions are confident on training data and unreliable on held-out data.
This risk is higher when the model’s capacity far exceeds the amount of informative data (an MLP with many parameters trained on a small dataset), when labels are noisy or inconsistent, or when there is data leakage or a distribution shift between training and validation. In short: training error can be low while true out-of-sample error remains high — the hallmark of overfitting.
Computational Cost
Training time and memory usage grow with model size, batch size, dataset size, and sequence/feature length. The backward pass roughly doubles the compute of the forward pass and must cache intermediate activations; at typical batch sizes, activations (not parameters) often dominate peak memory.
- Compute scales with the number of floating-point operations per step (FLOPs), which increases with depth, width, and input length.
- Memory (training) = parameters + optimiser state + activations; the last term scales with batch size and the sum of per-layer feature maps (a rough estimator is sketched below).
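As a back-of-the-envelope aid, this sketch counts parameters, multiply-accumulates per step, and cached activation memory for a fully-connected network. It assumes FP32 storage and no optimiser state, and the layer sizes and batch size are illustrative:
```python
def mlp_footprint(sizes, batch_size, bytes_per_value=4):
    """Rough parameter / MAC / activation-memory estimate for an MLP."""
    pairs = list(zip(sizes, sizes[1:]))
    params = sum(n_in * n_out + n_out for n_in, n_out in pairs)     # weights + biases
    macs = batch_size * sum(n_in * n_out for n_in, n_out in pairs)  # per step
    acts = batch_size * sum(sizes)  # activations cached for the backward pass
    return params * bytes_per_value, macs, acts * bytes_per_value

p_bytes, macs, a_bytes = mlp_footprint([784, 256, 256, 10], batch_size=128)
print(f"params: {p_bytes/1e6:.1f} MB, MACs/step: {macs/1e6:.0f} M, "
      f"activations: {a_bytes/1e6:.1f} MB")
```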
At inference, memory and bandwidth are driven mostly by parameter tensors. Quantisation (covered next in the Foundations of AI series) reduces their precision (e.g., FP32 → INT8/INT4), shrinking model size and memory bandwidth and enabling CPU-only or edge deployments. Compact MLPs can run on low-power devices such as a Raspberry Pi 5; larger models typically require desktop-class GPUs or multi-node clusters.
Summary: Common Issues at a Glance
| Problem | Practical effect |
|---|---|
| Vanishing gradients | Early layers learn very slowly |
| Exploding gradients | Numerical instability / divergence |
| Saddles / flat regions | Slow progress, apparent “stuck” behaviour |
| Overfitting | Low train loss, high validation loss |
| Compute cost | Long training times, high memory usage |
Looking Ahead: Making inference cheaper with quantisation
In this article, we moved from single perceptrons to multi-layer perceptrons (MLPs) and the training mechanics that make them learn.
You’ve learned:
- Why a single perceptron is limited, and how stacking layers with non-linear activations increases expressiveness.
- The structure of an MLP and the per-layer mapping $h^{(l)} = f^{(l)}(W^{(l)} h^{(l-1)} + b^{(l)})$.
- How we train: define a loss, choose an optimiser (full-batch vs mini-batch SGD), and obtain gradients via backpropagation.
- Practical challenges: vanishing/exploding gradients, saddles/local minima, overfitting, and compute cost.
Next in this series: we’ll explore quantisation — representing weights/activations with fewer bits (e.g., FP32 → INT8/INT4) to reduce memory bandwidth and speed up inference while preserving accuracy.