28 July 2025 ~ 14 min read
Perceptrons

Benjamin Clark
Part 1 of 1 in Foundations of AI
Introduction

This is the first post in a multi-part series covering the foundational building blocks of artificial intelligence — from activation functions and multi-layer perceptrons to backpropagation, transformers, and beyond.
As I near the end of my six‑year Open University degree, I have been reflecting on what I have learned, how I’ve applied it to my work, and how I can share that knowledge with others.
Two years ago, I enrolled in TM358: Machine Learning and Artificial Intelligence, where I studied the mathematical foundations of AI and applied them to a range of practical problems. This experience aligned perfectly with my transition into professional AI/ML work — and I haven’t looked back since.
My final‑year project has been centred on creating my own chatbot, a journey that led me deep into the mechanics of transformers, natural language processing, and more.
But at its core, everything I’ve studied — from chatbots to large‑scale transformers — ultimately traces back to one foundational idea: Frank Rosenblatt’s 1958 Perceptron, a simple but revolutionary model for how information can be stored and organized in the brain.
In preparing for an upcoming community presentation on quantization in AI, I decided to start this series by revisiting Rosenblatt’s work — the starting point for much of what we now call modern artificial intelligence and machine learning.
The original paper is well worth a read.
The Goal
By the end of this post, you will understand:
- What it means for data to be linearly separable and why that matters for perceptrons (with clear 2D and 3D examples).
- How a perceptron takes inputs, applies weights and biases, and uses summation to make decisions.
- How a simple step activation function enables binary classification.
- Why perceptrons struggle with non‑linearly separable problems — and what that means for their limitations.
What Does It Mean for Data to Be Linearly Separable?
Before diving into how a perceptron works, it’s important to understand the type of problems it was designed to solve. The key concept here is linear separability.
In simple terms, data is linearly separable if you can draw a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that cleanly separates two classes of points. Every point on one side belongs to one class, and every point on the other side belongs to another class.
In data science, a class is simply a label — the “bucket” each data point belongs to.
For example:
- Email filtering: Spam vs Not spam
- Medical diagnosis: Has disease vs No disease
- Image recognition: Dog vs Cat
In this post, we’ll focus on binary classification: every point falls into either Class A or Class B, and the perceptron’s job is to find the simplest possible way to separate them.
Visualising Decision Boundaries
Imagine you’re plotting two types of points on a 2D graph:
- Red circles represent Class A.
- Blue circles represent Class B.
If these two groups can be divided by drawing a single straight line, they’re linearly separable.

In 3D, the concept is the same, except instead of a line, you’d use a plane. And in higher dimensions (which is where most machine learning happens), we generalize this idea to hyperplanes.
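To make this concrete, here is a minimal Python sketch. The points and the boundary line are made-up values for illustration: a line defined by weights and a bias splits the plane in two, and the classes are linearly separable by that line if each class sits entirely on its own side.

```python
# A minimal sketch with made-up points: the line w1*x + w2*y + b = 0 splits
# the plane in two, and the two classes are linearly separable by that line
# if each class sits entirely on its own side.

def side(point, w, b):
    """Return +1 or -1 for the side of the line the point falls on."""
    x, y = point
    return 1 if w[0] * x + w[1] * y + b >= 0 else -1

w, b = (1.0, 1.0), -1.0              # hypothetical boundary: x + y = 1

class_a = [(0.0, 0.0), (0.2, 0.3)]   # red circles, below the line
class_b = [(1.0, 1.0), (0.8, 0.9)]   # blue circles, above the line

separable = (all(side(p, w, b) == -1 for p in class_a) and
             all(side(p, w, b) == +1 for p in class_b))
print(separable)  # True: this one line cleanly separates the classes
```

Swap in points where the classes overlap and `separable` becomes `False`: no single line works, which is exactly the situation the next section explores.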

Why Linear Separability Matters
A single perceptron can only classify data that’s linearly separable.
If the classes are mixed together in a way that no single line (or plane) can separate them, the perceptron fails. This limitation famously shows up in problems like XOR — one of the simplest examples of non‑linear separability.
Understanding this boundary is crucial: it’s the dividing line (literally) between what the earliest neural networks could solve and why more complex, multi-layer networks became necessary.
This is exactly the kind of boundary Rosenblatt envisioned perceptrons learning — but it also shows why they hit a wall when faced with more complex problems.
Rosenblatt’s 1958 paper doesn’t use the now‑familiar terms “linearly separable” or “decision surface.”
Instead, he described the process as one in which “stimuli of one class will tend to evoke a stronger impulse in one response set than in another … [so that] the perceptual world is divided into classes of ‘things’ … determined by the organization of the system and its interaction with the environment” (Rosenblatt, 1958).
Today, we describe this idea — dividing the input space into distinct categories — as linear separability.
The formal terminology, and much of the critique of these limitations, emerged later, most notably in Marvin Minsky and Seymour Papert’s 1969 book Perceptrons.
When Data Isn’t Linearly Separable: The XOR Problem
But not all datasets are this simple.
A classic example is the XOR problem, where the classes are arranged so that no single straight line can separate them — making it impossible for a simple perceptron to classify them correctly.

In 3D, the situation is similar — the classes can be positioned so that no single plane can cleanly separate them.
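We can sanity-check the 2D claim in code. This brute-force sketch (illustrative, not from the original post) tries a grid of candidate lines and confirms that none of them puts the two XOR classes on opposite sides:

```python
# A brute-force sanity check: try a grid of candidate lines
# w1*x + w2*y + b = 0 and confirm that none of them puts the two
# XOR classes on opposite sides.

xor_points = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def separates(w1, w2, b):
    sides = {0: set(), 1: set()}
    for (x1, x2), label in xor_points.items():
        sides[label].add(w1 * x1 + w2 * x2 + b >= 0)
    # separable by this line only if each class sits entirely on one side
    # and the two classes sit on different sides
    return len(sides[0]) == 1 and len(sides[1]) == 1 and sides[0] != sides[1]

grid = [i / 4 for i in range(-8, 9)]  # candidate values from -2.0 to 2.0
found = any(separates(w1, w2, b)
            for w1 in grid for w2 in grid for b in grid)
print(found)  # False: no line in the grid separates XOR
```

A coarse grid is not a proof, of course, but the full geometric argument reaches the same conclusion: no line anywhere in the plane separates XOR.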

This simple example illustrates the core limitation of single-layer perceptrons: if no straight line (or plane) can separate the classes, they fail. It was exactly this limitation that led researchers to develop multi-layer networks — a technology I'll cover in a future post in this series.
Inside the Perceptron: How It Processes Information
So far, we’ve looked at what kinds of problems a perceptron can solve. Now let’s open the hood and see how it actually processes information.
At its core, a perceptron is a very simple model of a neuron. It takes several inputs, applies a bit of math to them, and then decides whether to “fire” (activate) or not.
Inputs, Weights and Biases
Think of a perceptron like a hiring manager scoring job applications.
- Inputs: These are the candidate’s attributes — years of experience, education level, portfolio quality, certifications. In data terms, these are the features of your dataset.
- Weights: Not all attributes are equally important. Weights represent how much the manager values each one. Maybe experience is weighted heavily, while a minor certification carries little influence.
- Bias: Even before looking at the applications, the manager has a baseline tendency — perhaps they’re under pressure to fill the role quickly (lowering the threshold for acceptance), or perhaps they’re being very selective (raising the bar). This “bias” shifts the decision point up or down, independent of the actual attributes.
Put together: The perceptron multiplies each attribute by its importance, adds them up, adjusts for the baseline bias, and then decides whether the candidate gets shortlisted.
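The analogy can be written as a few lines of Python. Every number here is invented for illustration: the attribute scores, how much the manager values each one, and a negative bias standing in for a selective manager.

```python
# The hiring analogy in code. All values are invented for illustration.

candidate = {"experience": 6.0, "education": 3.0, "portfolio": 8.0}
weights   = {"experience": 0.5, "education": 0.2, "portfolio": 0.3}
bias = -4.0  # selective manager: raises the bar for acceptance

# multiply each attribute by its importance, sum, then adjust by the bias
score = sum(weights[k] * candidate[k] for k in candidate) + bias
shortlisted = score >= 0
print(round(score, 2), shortlisted)  # 2.0 True
```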

A Quick Primer: What’s a Dot Product?
Before we jump into the math, let’s unpack two terms you’ll see: matrix multiplication and dot product.
- A vector is just an ordered list of numbers. For example, $\mathbf{x} = [0.5, 0.8, 0.2]$ could represent three features of an email (like spam score, length, and exclamation count).
- A dot product is how we combine two vectors: multiply each pair of numbers and add them together.
For example:

$$\begin{bmatrix} 0.5 & 0.8 & 0.2 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \\ 0.2 \end{bmatrix} = (0.5 \cdot 1.0) + (0.8 \cdot 0.5) + (0.2 \cdot 0.2) = 0.94$$

This is matrix multiplication in its simplest form: a row vector multiplied by a column vector.
That’s all the perceptron is doing at this stage: combining features (inputs) with their importance (weights) to produce a single number.
This dot product is the heart of a perceptron. It compresses many inputs into one meaningful value — a single score that the perceptron can use to make a decision.
Matrix multiplication isn’t just useful for perceptrons — it’s everywhere in computing, especially in graphics.
GPUs were originally designed to render images by manipulating huge grids of pixels, where each pixel has Red, Green, and Blue (RGB) values. Transforming these pixels — scaling, rotating, applying filters, or generating 3D scenes — involves massive amounts of matrix multiplication.
Because of this, GPUs are built to handle thousands of these matrix operations in parallel.
Machine learning uses the same idea: instead of pixels, we’re multiplying and adding weights and inputs in large neural networks. The very thing GPUs were designed for — rapidly performing matrix math — makes them uniquely suited for accelerating AI training and inference. This is why companies like Nvidia have thrived by providing GPUs to organizations competing to develop increasingly sophisticated AI systems.
Summation: Combining It All
Once we have our inputs, weights, and bias, the perceptron combines them into a single score z:
$$z = \mathbf{w}^\top \mathbf{x} + b = (w_1 \cdot x_1) + (w_2 \cdot x_2) + \cdots + (w_n \cdot x_n) + b$$

This is just the dot product of the inputs and weights, plus the bias. In plain English: multiply each input by its importance, add them all up, then adjust with the baseline bias.
Worked Example
Suppose we have three features:
- $x_1 = 0.50$
- $x_2 = 0.80$
- $x_3 = 0.20$
And three weights:
- $w_1 = 1.00$
- $w_2 = 0.50$
- $w_3 = 0.20$
With a bias $b = 0.00$.
The perceptron's weighted sum becomes:

$$z = (0.50 \cdot 1.00) + (0.80 \cdot 0.50) + (0.20 \cdot 0.20) + 0.00 = 0.94$$

This single value $z$ is what the perceptron uses in the next step — passing it through an activation function to make a binary decision.
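The same calculation in Python, using the feature values, weights, and bias from the worked example:

```python
# The worked example in code: dot product of inputs and weights, plus bias.

x = [0.50, 0.80, 0.20]  # inputs (features)
w = [1.00, 0.50, 0.20]  # weights (importance of each feature)
b = 0.00                # bias (baseline shift)

z = sum(wi * xi for wi, xi in zip(w, x)) + b
print(round(z, 2))  # 0.94
```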
This score on its own doesn’t mean much — we still need to decide: does the perceptron “fire” (output 1) or not (output 0)?
That’s where activation functions come in — which we’ll cover next.
The Step Activation Function: Turning Math Into Decisions
So far, our perceptron has taken inputs, multiplied them by their weights, added a bias, and produced a single score z. But that’s not the final output.
To actually make a decision, we need one more step: an activation function.
What is an Activation Function?
An activation function takes the combined score z and turns it into an output — often a simple decision like “yes” or “no.”
In neural networks, activation functions can be complex (like ReLU or sigmoid), but for perceptrons, we use the simplest one of all: the step function.
The Step Function
The step function does exactly what its name implies — it “steps” the output from one value to another once z crosses a threshold.
$$f(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{if } z < \theta \end{cases}$$

Where $\theta$ is the threshold.
- If z is above or equal to the threshold → output 1 (e.g., yes, Class A).
- If z is below the threshold → output 0 (e.g., no, Class B).
In other words: if the evidence is strong enough, classify as 1. Otherwise, classify as 0.
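As a sketch in Python — note that the threshold $\theta$ is a free choice; 0.5 here is an arbitrary value, picked so that the worked example's score of 0.94 fires:

```python
# The step activation as a sketch. The threshold theta is a free choice;
# 0.5 here is arbitrary, not a value from the post.

def step(z, theta=0.5):
    """Return 1 if the score clears the threshold, else 0."""
    return 1 if z >= theta else 0

print(step(0.94))  # 1: evidence strong enough, classify as Class A
print(step(0.30))  # 0: below the threshold, classify as Class B
```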

How This Enables Classification
With the step function, the perceptron turns its weighted sum into a binary decision.
- Spam or not spam
- Dog or cat
- Approve or reject
This is exactly how Rosenblatt’s perceptron performed classification: by drawing a boundary (line, plane, or hyperplane) in the data and assigning everything on one side to one class, and everything on the other to another.
How Does a Perceptron Learn?
So far, we’ve treated the perceptron’s weights and bias as if they magically exist. But how does it actually learn them?
The short answer is trial and error.
During training, the perceptron is shown inputs along with their correct outputs (labels). It makes a prediction, compares it to the correct answer, and adjusts its weights slightly if it was wrong. Over time, these small adjustments shift the perceptron’s decision boundary to correctly separate the training data.
This is the simplest form of learning for a neural network — and the ancestor of the more powerful backpropagation algorithm used in deep learning today.
The Perceptron Learning Rule
The perceptron adjusts its weights using a simple rule:
$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta \, (y - \hat{y}) \, x_i$$

Where:
- $\eta$ is the learning rate — how big a step to take each time.
- $y$ is the true label — the correct answer.
- $\hat{y}$ is the predicted output — what the perceptron guessed.
- $x_i$ is the input for that weight.
In plain English:
- If the perceptron gets it right, do nothing.
- If it gets it wrong, nudge each weight in the direction that would have produced the correct output.
Over many iterations, these small adjustments shift the perceptron’s decision boundary until it correctly separates the classes in the training data.
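Here is the learning rule in action on a small, linearly separable problem: the AND gate (a standard textbook example, not taken from the post). Using integer inputs and a learning rate of 1 keeps every update exact.

```python
# The perceptron learning rule applied to the AND gate.

def predict(w, b, x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1 if z >= 0 else 0

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w, b, eta = [0, 0], 0, 1
for _ in range(20):                      # a few passes over the data
    for x, y in and_data:
        error = y - predict(w, b, x)     # 0 if correct, +1 or -1 if wrong
        w[0] += eta * error * x[0]       # nudge each weight toward the
        w[1] += eta * error * x[1]       # output that would have been right
        b += eta * error

print(all(predict(w, b, x) == y for x, y in and_data))  # True
```

Run the same loop on the XOR data from earlier and the errors never reach zero, no matter how many passes you allow — the learning rule can only find a boundary that exists.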
This learning rule of Rosenblatt's is a primitive precursor to backpropagation — the algorithm that powers modern deep learning.
Backpropagation takes this same idea — adjusting weights based on errors — and extends it to multi‑layer networks, propagating those errors backward through each layer.
We will explore backpropagation in detail in a future post, when we move beyond single‑layer perceptrons to the deeper networks that underpin today’s AI systems.
From Training to Inference
Once training is complete, the perceptron stops adjusting its weights. These learned weights and bias are now fixed — this phase is called inference.
During inference, the perceptron no longer learns; it simply applies its trained weights to new, unseen inputs to make predictions.
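In code, inference is just the forward pass with frozen weights. The weight values below are illustrative (for example, roughly what training on an AND gate might produce), not taken from the post.

```python
# Inference sketch: training is over, so the weights are frozen and
# prediction is just the forward pass. Weight values are illustrative.

w, b = [2, 2], -3   # learned weights and bias, now fixed

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0

print(predict((1, 1)))  # 1: fires
print(predict((0, 1)))  # 0: does not fire
```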
Looking Ahead: From Single Neurons to Deep Learning
In this post, we’ve unpacked one of the simplest — yet most important — ideas in machine learning: the perceptron.
You learned:
- What linear separability means and why it defines the kinds of problems a perceptron can solve.
- How a perceptron processes inputs using weights, a bias, and summation to produce a single score.
- How the step activation function turns that score into a binary decision.
- Why single-layer perceptrons are limited — they fail on non-linearly separable problems like XOR.
- How perceptrons learn through trial and error, using a simple weight update rule — a primitive ancestor of the backpropagation algorithm used in deep learning today.
This foundation sets the stage for everything that came after Rosenblatt’s original idea — ultimately leading to the multi-layer perceptrons and deep networks that power modern AI.
In my next post, we’ll explore different activation functions and how they change what a perceptron can do — paving the way for more flexible and powerful models.
Later in the series, we’ll dive into multi‑layer perceptrons and backpropagation — the key innovations that transformed simple neurons into the deep networks powering today’s AI systems.
Thank you for reading.