Written 28 July 2025 · Last edited 04 August 2025 ~ 8 min read
Perceptrons

Benjamin Clark
Part 1 of 2 in Foundations of AI
Introduction

This is the first in a series exploring the foundational building blocks of modern AI.
In 1958, psychologist Frank Rosenblatt introduced the perceptron — a simple model for how a machine could separate inputs into categories by adjusting numerical “weights.” It may look primitive by today’s standards, but Rosenblatt’s model introduced key ideas that still underpin modern AI: weighted inputs, decision boundaries, and data‑driven learning.
Why start here? Because even the most advanced architectures — from GPT‑style transformers to convolutional networks — are built on these same principles.
As I near the end of my six‑year Open University degree — including TM358: Machine Learning and Artificial Intelligence — I’ve been reflecting on how these foundational concepts connect to the large‑scale systems I now build in my professional AI/ML work. This series is my way of sharing that journey.
The original paper is well worth a read if you want to see how these ideas were first described.
The Goal
By the end of this article, you’ll understand:
- What it means for data to be linearly separable and why that matters for perceptrons.
- How a perceptron processes inputs using weights, bias, and summation.
- How the step activation function enables binary classification.
- Why single‑layer perceptrons fail on non‑linear problems (like XOR) — and why that matters for modern neural networks.
What Does It Mean for Data to Be Linearly Separable?
Before we open up the perceptron, we need to understand the kinds of problems it was designed to solve. The key concept here is linear separability.
Defining Linear Separability
Data is linearly separable if you can draw a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that perfectly divides two classes of points: everything on one side belongs to Class A, everything on the other belongs to Class B.
A class is simply a label — the “bucket” each data point belongs to.
Examples:
- Email filtering: Spam vs Not spam
- Medical diagnosis: Has disease vs No disease
- Image recognition: Dog vs Cat
In this article, we focus on binary classification: every point is either Class A or Class B, and the perceptron’s job is to separate them as simply as possible.
Visualising Decision Boundaries
Imagine plotting two types of points:
- Red circles = Class A
- Blue circles = Class B
If you can separate them with a single straight line, the dataset is linearly separable.

In 3D, this becomes a plane. In higher dimensions (where most machine learning happens), it becomes a hyperplane.
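To make the idea concrete, here's a minimal Python sketch. The points and the candidate line are invented for illustration: a dataset is linearly separable if some choice of weights and bias puts every Class A point on one side of the line and every Class B point on the other.

```python
# Hypothetical 2D points for two classes (numbers chosen only for illustration).
class_a = [(1.0, 2.0), (2.0, 3.0), (1.5, 2.5)]   # red circles
class_b = [(3.0, 0.5), (4.0, 1.0), (3.5, 0.0)]   # blue circles

# A candidate separating line: w1*x1 + w2*x2 + b = 0
w = (-1.0, 1.0)
b = 0.0

def side(point):
    """Return +1 or -1 depending on which side of the line the point falls."""
    x1, x2 = point
    return 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1

# Separable by this line: every Class A point lands on one side,
# every Class B point on the other.
print(all(side(p) == +1 for p in class_a))   # True
print(all(side(p) == -1 for p in class_b))   # True
```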

Why Linear Separability Matters
A single‑layer perceptron can only handle linearly separable data. If the classes are arranged so that no single line (or plane) can divide them, the perceptron fails.
This famously shows up in problems like XOR — one of the simplest examples of non‑linear separability.
Rosenblatt’s 1958 paper didn’t use terms like “linearly separable” or “decision boundary.” Instead, he described how stimuli of one class produce stronger responses than another, dividing the perceptual world into classes. Later, Marvin Minsky and Seymour Papert (1969) formalized this limitation in Perceptrons.
When Data Isn’t Linearly Separable: The XOR Problem
Not all data is that tidy.
Consider XOR:
- If one input is “on” (1) and the other is “off” (0), the output is 1.
- If both are on or both are off, the output is 0.
This creates a checkerboard pattern that no single straight line can separate.

In 3D, the same idea holds: the classes are interwoven so no plane can divide them.

This is the perceptron’s fundamental limitation: if no single line or plane can separate the classes, it fails.
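To see why no line works, here's a short derivation, peeking ahead to the weighted-sum notation introduced in the next section (weights w₁ and w₂, bias b, threshold θ). For a single perceptron to compute XOR, its four input/output pairs would have to satisfy:

$$\begin{aligned}
(0,0) \mapsto 0 &\;\Rightarrow\; b < \theta \\
(1,0) \mapsto 1 &\;\Rightarrow\; w_1 + b \ge \theta \\
(0,1) \mapsto 1 &\;\Rightarrow\; w_2 + b \ge \theta \\
(1,1) \mapsto 0 &\;\Rightarrow\; w_1 + w_2 + b < \theta
\end{aligned}$$

Adding the second and third constraints gives $w_1 + w_2 + 2b \ge 2\theta$; combined with $b < \theta$ from the first, that forces $w_1 + w_2 + b > \theta$, which contradicts the fourth. No choice of weights, bias, and threshold satisfies all four conditions.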
This limitation pushed researchers toward multi‑layer networks, which can model more complex, non‑linear decision boundaries. We’ll cover this in a future article.
Inside the Perceptron: How It Processes Information
Now that we know what problems perceptrons solve, let’s look inside.
Formal definition:
A perceptron is a linear classifier. It maps input features to a binary output by computing a weighted sum of those features, adding a bias, and passing the result through a threshold‑based activation function.
Or put simply: it’s a basic neuron that decides “yes” or “no.”
Inputs, Weights and Bias
Think of it like a hiring manager scoring job applications:
- Inputs: Candidate attributes — experience, education, portfolio.
- Weights: How much the manager values each attribute.
- Bias: The manager’s baseline tendency (lenient or strict).
Formally, the perceptron multiplies each input by its weight, adds them up, adjusts with a bias, and then decides whether to “fire” (activate).
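Here's that analogy as a toy Python sketch. Every number is invented, and the final yes/no check previews the activation function covered later in this article.

```python
# Candidate attributes, scaled 0-1 (invented values).
experience, education, portfolio = 0.7, 0.5, 0.9

# How much the manager values each attribute, plus a strict baseline (negative bias).
weights = {"experience": 0.6, "education": 0.2, "portfolio": 0.4}
bias = -0.5

score = (weights["experience"] * experience
         + weights["education"] * education
         + weights["portfolio"] * portfolio
         + bias)

print(round(score, 2))   # 0.38
print(score >= 0.0)      # True -> the perceptron "fires" (hire)
```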

A Quick Primer: What’s a Dot Product?
The core operation inside a perceptron is the dot product.
- A vector is just an ordered list of numbers.
- The dot product multiplies two vectors element‑by‑element and adds the results.
Example:
$$\mathbf{x} = \begin{bmatrix} 1.0 \\ 0.5 \\ 0.2 \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} 0.5 \\ 0.8 \\ 0.2 \end{bmatrix}$$

Dot product:

$$\mathbf{w}^\top \mathbf{x} = (0.5 \cdot 1.0) + (0.8 \cdot 0.5) + (0.2 \cdot 0.2) = 0.94$$

This operation projects the input onto the weight vector (the direction perpendicular to the decision boundary), compressing many features into one meaningful score.
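For completeness, here's the same calculation in NumPy (a common choice for this kind of vector arithmetic, though any numeric library works):

```python
import numpy as np

w = np.array([0.5, 0.8, 0.2])
x = np.array([1.0, 0.5, 0.2])

print(w @ x)   # ~0.94, matching the worked example above
```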
GPUs were designed for graphics, which involve massive amounts of matrix multiplications on pixel grids. Neural networks use the same math — multiplying inputs by weights at scale — making GPUs perfect for AI training and inference.
Summation: Combining It All
The perceptron’s weighted sum:
$$z = \mathbf{w}^\top \mathbf{x} + b$$

Where:
- w = weights
- x = inputs
- b = bias
This single score becomes the perceptron’s basis for decision‑making.
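Continuing the NumPy snippet above, computing z just adds a bias term (the value chosen here is arbitrary):

```python
b = -0.5           # bias, arbitrary value for illustration
z = w @ x + b      # weighted sum plus bias
print(z)           # ~0.44
```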
The Step Activation Function: Turning Scores Into Decisions
So far, we’ve produced a single value z. But how do we turn that into a decision?
That’s where the activation function comes in.
What is an Activation Function?
An activation function takes the score z and converts it into an output. In perceptrons, this is a binary decision.
The Step Function
The simplest activation function is the step function:
$$f(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{if } z < \theta \end{cases}$$

If z crosses the threshold θ, the perceptron outputs 1. Otherwise, it outputs 0.

This is how the perceptron draws a line (or plane) to divide data into two classes.
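In code, the step function is a one-liner; θ is set to 0 here, but any threshold works the same way:

```python
def step(z, theta=0.0):
    """Step activation: 1 if z reaches the threshold, else 0."""
    return 1 if z >= theta else 0

print(step(0.44))    # 1 -> one class
print(step(-0.10))   # 0 -> the other class
```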
How Does a Perceptron Learn?
So far, we’ve assumed the perceptron’s weights are set. But how are they learned?
The Learning Rule
Rosenblatt’s perceptron learning rule:
$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,(y - \hat{y})\,x_i$$

Where:
- η = learning rate (step size)
- y = true label
- ŷ = predicted label
- xᵢ = input value
If the perceptron is wrong, it nudges the weights toward the correct classification. Over many iterations, this shifts the decision boundary into the right position.
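Here's a minimal sketch of the rule in Python, trained on the AND gate (which, unlike XOR, is linearly separable). The learning rate, epoch count, and zero initialisation are arbitrary choices:

```python
import numpy as np

def step(z):
    return 1 if z >= 0 else 0

# AND gate: output 1 only when both inputs are 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias (treated as a weight on a constant input of 1)
eta = 0.1         # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        prediction = step(w @ xi + b)
        error = target - prediction    # (y - ŷ)
        w += eta * error * xi          # Δw_i = η (y - ŷ) x_i
        b += eta * error               # same update, with input fixed at 1

print(w, b)                               # learned weights and bias
print([step(w @ xi + b) for xi in X])     # [0, 0, 0, 1]
```

Run the same loop with the XOR targets [0, 1, 1, 0] and the predictions never settle, which is exactly the limitation discussed earlier.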
This rule is a precursor to backpropagation, the algorithm that powers deep learning. Backpropagation applies the same idea — adjusting weights based on errors — across multiple layers.
From Training to Inference
Once trained, the perceptron freezes its weights. It no longer learns; it applies what it has learned to new inputs. This phase is called inference.
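Continuing the sketch above, inference is just the forward pass with the learned weights left untouched:

```python
def predict(x, w, b):
    """Apply frozen weights to a new input: no weight updates here."""
    return step(w @ x + b)

print(predict(np.array([1, 1]), w, b))   # 1
print(predict(np.array([0, 1]), w, b))   # 0
```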
Looking Ahead: From Single Neurons to Deep Learning
In this article, we’ve unpacked the perceptron — a deceptively simple model that introduced key principles for all neural networks.
You’ve learned:
- What linear separability means (and why it matters).
- How perceptrons use weights, bias, and a step function to make decisions.
- Why they fail on non‑linear problems like XOR.
- How they learn using a simple rule that inspired modern backpropagation.
This is the foundation for everything from multi‑layer perceptrons to the transformers powering today’s AI.
Next in this series:
We’ll explore activation functions beyond the step function — and how they made neural networks more flexible and powerful.
Later:
We’ll dive into multi‑layer perceptrons and backpropagation — the innovations that brought neural networks back to life.