Written 05 August 2025 · Last edited 17 September 2025 ~ 16 min read

Activation Functions

Written by

Benjamin Clark

Introduction

An XKCD comic on Supernovas — “Supernova” (2024) XKCD. Available at: https://xkcd.com/2878 (Accessed: 4th August 2025)

As covered in the previous article in this series, activation functions take the output of a neuron and turn it into a decision.

But, as covered there, the step activation function falls flat on its face when data is not linearly separable.

So a more detailed overview of the different activation functions available is required before this series moves on to multilayered neural networks. Otherwise, all the little neurons in future diagrams will just be black boxes rather than something you actually understand.

The Goal

By the end of this article, you will understand:

What activation functions do and why they’re essential to neural networks.
Why the step function is too limited for modern networks.
How common activations (sigmoid, tanh, ReLU, Mish) work and their trade‑offs.
How non‑linearity enables networks to build complex, multi‑layer representations.

What is an activation function?

Activation functions simply transform a neuron’s combined input (the weighted sum of its inputs plus a bias) into an output that is then either passed on to the next layer of the network, or used to inform the network’s final decision.

The step function, covered in the previous post about perceptrons, is fine and dandy for binary output. But modern neural networks require more nuance than a simple “yes” or “no”. How about a “maybe”, or working with non-linearly separable data?

Graded activation functions are fundamental to building more advanced neural networks. Rather than simple binary outputs, they allow networks to produce probabilistic scores, discover subtler features, and work with more complex data than the Perceptron ever could.

Why Neural Networks Need Non‑Linear Activation

Without non‑linear activations, layering up neurons doesn’t actually add any expressive power; the whole stack collapses into a single linear transformation.

Imagine you’re an Eques during the reign of Nero, and a really cushy administrative job has opened in Egypt. You climb the social ladder, do your military service, grease the palm of senators to eventually be considered for the role. However, Nero is famously corrupt and so all this hard work is for naught. When the final decision is made, who gets the job collapses into a singular linear transformation: does Nero like you (1) or not (0)?

For example, if we’re stacking two linear layers:

\begin{aligned} y &= W_2 (W_1 x + b_1) + b_2 \\ &= (W_2 W_1) x + (W_2 b_1 + b_2) \end{aligned}

This simplifies to just one linear transformation — offering no additional representational power compared to a single layer.
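To make the collapse concrete, here is a minimal sketch in plain Python. The weights, biases, and input are hypothetical values picked purely for illustration, and the helper names (`matvec`, `matmul`, `add`) are my own rather than anything from a library. It checks that two stacked linear layers give the same answer as the single merged layer from the equation above.

```python
# Hypothetical 2x2 weights and biases, chosen only for illustration.
W1 = [[1.0, 2.0], [0.5, -1.0]]
b1 = [0.5, -0.5]
W2 = [[2.0, 0.0], [1.0, 3.0]]
b2 = [1.0, 2.0]


def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def add(u, v):
    """Add two vectors element by element."""
    return [a + b for a, b in zip(u, v)]


def two_layers(x):
    """y = W2 (W1 x + b1) + b2, with no activation in between."""
    hidden = add(matvec(W1, x), b1)
    return add(matvec(W2, hidden), b2)


# Merge the two layers into one: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)


def one_layer(x):
    """y = W x + b, the single equivalent linear layer."""
    return add(matvec(W, x), b)


x = [3.0, -2.0]
print(two_layers(x))  # [0.0, 10.5]
print(one_layer(x))   # [0.0, 10.5]
```

However many linear layers you stack, the same merging trick applies, which is why depth alone buys nothing without a non‑linearity in between.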

Non‑linear activations break this limitation. They allow each layer to reshape the data space in new ways, enabling the network to learn curved, flexible decision boundaries instead of being limited to straight lines or flat planes.

Following on from my Nero example, imagine instead you are an Eques during the Punic Wars. You serve as a Centurion under Scipio Africanus, perform valiantly at the battle of Zama, and are mentioned in dispatches. When you come back from the War you settle down, open a shop, amass some wealth. When election time in the Senate comes round, you fancy your chances and decide to begin climbing the Cursus Honorum. Entering the election for Quaestor, your previous military service, good reputation among the elite, and considerable means all combine to inspire confidence in the electorate and you get the job. The decision didn’t collapse to a linear transformation; many contributing factors compounded to arrive at the outcome.

Comparison between linear and non-linear networks — Why non‑linearity matters: stacking linear layers only gives you another straight line. Adding non‑linear activations creates flexible boundaries capable of handling complex data.

The Step Function

In the previous article, we introduced the step function — Rosenblatt’s original choice for the perceptron. It outputs either 0 or 1 based on whether the combined input $z$ crosses a threshold:

f(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{if } z < \theta \end{cases}

Diagram showing a step function with a threshold — The step function outputs 0 until it reaches the threshold, then jumps to 1.

The lack of any gradient in the output, the sharp switch between classifications, and the restriction to linear boundaries make it wholly unsuitable for training any network which requires nuance. Researchers quickly replaced it with smoother, differentiable functions which could support multilayered learning.
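As a rough illustration of the "no gradient" problem, the sketch below implements the step function from the formula above and estimates its slope with a finite difference. The default threshold of 0 and the helper name `numerical_slope` are assumptions made for the example.

```python
def step(z, theta=0.0):
    """Return 1 if z reaches the threshold theta, otherwise 0."""
    return 1 if z >= theta else 0


def numerical_slope(f, z, h=1e-3):
    """Estimate the slope of f at z with a central finite difference."""
    return (f(z + h) - f(z - h)) / (2 * h)


# Away from the threshold the output is flat, so the slope is zero
# and gives no hint about which way to nudge a weight.
for z in (-2.0, -0.5, 0.5, 2.0):
    print(z, step(z), numerical_slope(step, z))
```

Everywhere except the threshold itself the slope is exactly zero, so a gradient-based learner has nothing to follow.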

Common Activation Functions

Modern activation functions address the limitations of Rosenblatt’s step function. They provide gradients for learning, handle a wider range of inputs, and give networks the expressive power needed for complex tasks.

Each one reshapes inputs differently, offering distinct trade-offs in training speed, stability, and representational power.

Sigmoid

Definition

The sigmoid function maps any real-valued input into a smooth curve between 0 and 1:

f(z) = \frac{1}{1 + e^{-z}}

Graph of the sigmoid activation function — The sigmoid function smoothly squashes inputs between 0 and 1.

Purpose

This is a softer version of the step function. Large negative inputs approach 0, large positive inputs saturate near 1, and an input of 0 gives exactly 0.5. In between there is a smooth transition - rather than a large jump or “step” - between the extremes.
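The formula translates almost directly into code. The version below is a sketch using only Python's standard `math` module; splitting on the sign of `z` is a design choice of mine to avoid overflow for very negative inputs, not something the formula itself requires.

```python
import math


def sigmoid(z):
    """Compute 1 / (1 + e^(-z)) without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z),
    # which avoids calling exp() on a large positive number.
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

An input of 0 lands exactly on 0.5, and the outputs for `z` and `-z` always add up to 1, which is the symmetry you can see in the graph.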

Pros

Differentiable everywhere. The derivative exists at all inputs, enabling gradient-based learning.
Outputs values between 0 and 1, making the output interpretable as a probability.

Cons

Suffers from vanishing gradients; at extreme inputs the curve flattens out near 0 or 1, so its slope approaches zero, which can slow or even halt learning altogether.
Not zero-centred; outputs are always positive, which can make weight updates “zig-zag” during optimisation and slows down learning in the layers that follow.

Gradient?

In machine learning, a gradient measures how much a function’s output changes when its inputs change. In the above image, it may be thought of as the slope of the curve at a given point.

The gradient informs how to adjust each weight to reduce errors. If increasing a weight would reduce the error, we nudge that weight up; if it would increase the error, we nudge it down.

Because backpropagation passes the error back through every layer, the gradient that reaches each weight is directly affected by the activation functions used in each layer of the neural network.
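Here is a small sketch of that idea for a single sigmoid neuron with one weight. Everything concrete in it is an assumption made for illustration: the squared-error loss, the input, the target, the starting weight, and the learning rate. It uses the standard identity that the slope of the sigmoid is f(z)(1 - f(z)), checks the hand-derived gradient against a finite-difference estimate, and then takes one update step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, target):
    """Squared error of a one-weight neuron: (sigmoid(w * x) - target)^2."""
    return (sigmoid(w * x) - target) ** 2


def loss_gradient(w, x, target):
    """Chain rule: dL/dw = 2 (y - target) * y (1 - y) * x."""
    y = sigmoid(w * x)
    return 2.0 * (y - target) * y * (1.0 - y) * x


# Hypothetical values, chosen only for illustration.
x, target = 1.5, 1.0
w = 0.2
learning_rate = 0.5

grad = loss_gradient(w, x, target)

# Sanity check: compare against a central finite difference.
h = 1e-6
numeric = (loss(w + h, x, target) - loss(w - h, x, target)) / (2 * h)

# Step against the gradient to reduce the loss.
w_new = w - learning_rate * grad

print(f"analytic gradient: {grad:.6f}, numeric estimate: {numeric:.6f}")
print(f"loss before: {loss(w, x, target):.4f}, after: {loss(w_new, x, target):.4f}")
```

The gradient is negative here, meaning a larger weight lowers the error, so the update increases the weight and the loss drops.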

More details on the vanishing gradient

In deep networks, some activation functions (like sigmoid and tanh) produce gradients that get very close to zero for extreme input values.

As these small gradients are propagated back through many layers, they are multiplied together and shrink further, effectively stopping weight updates in the earliest layers.

This is the vanishing gradient problem, one of the key reasons sigmoid fell out of favour for deep networks. For example, with a gradient of 0.000001, the error barely changes even if you move a weight from 2 to 5, so the network gets almost no signal about how to adjust it.
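To put rough numbers on this, the sketch below prints the slope of the sigmoid at a few inputs, then shows how quickly the gradient shrinks when that slope is multiplied across several layers. It deliberately ignores the weight matrices, which also scale the gradient in a real network, so treat it as a best-case illustration rather than a model of actual training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Slope of the sigmoid at z: f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The slope peaks at z = 0 and falls away quickly as the input grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  slope = {sigmoid_grad(z):.6f}")

# Best case: every layer sits at z = 0, where the slope is largest.
best_case = sigmoid_grad(0.0)
for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers: gradient scaled by at most {best_case ** depth:.3e}")
```

Even in the best case the slope is only 0.25, so after ten sigmoid layers the signal has already been scaled down to less than a millionth of its original size.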

And what about zero-centred?

A zero-centred activation outputs values roughly balanced between negative and positive.

This helps during optimisation: when activations are centred around zero, the weight updates in the next layer can be positive for some weights and negative for others, leading to faster and more stable convergence.

If an activation function is not zero-centred (like sigmoid), all outputs are positive, which pushes the inputs of subsequent layers in one direction and further slows down training.
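A quick way to see the difference is to feed the same symmetric set of inputs through sigmoid and through tanh (one of the other activations named earlier) and compare the averages. The inputs below are hypothetical values chosen to be balanced around zero.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs, balanced around zero.
inputs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

sigmoid_out = [sigmoid(z) for z in inputs]
tanh_out = [math.tanh(z) for z in inputs]

mean_sigmoid = sum(sigmoid_out) / len(sigmoid_out)
mean_tanh = sum(tanh_out) / len(tanh_out)

# Sigmoid outputs are all positive and average out around 0.5;
# tanh outputs are spread on both sides of zero and average out around 0.
print(f"sigmoid: min = {min(sigmoid_out):.3f}, mean = {mean_sigmoid:.3f}")
print(f"tanh:    min = {min(tanh_out):.3f}, mean = {mean_tanh:.3f}")
```

Even with perfectly balanced inputs, sigmoid hands the next layer values that are all positive, whereas tanh keeps them centred on zero.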

Use cases

Useful in final output layers for binary classification, where the output can be read as a probability.
Rarely used in hidden layers due to its vulnerability to vanishing gradients and the fact it’s not zero-centred.

2D decision boundary with sigmoid activation — In two dimensions, the sigmoid activation creates a smooth, non-linear decision boundary.

3D decision surface with sigmoid activation — In three dimensions, sigmoid activation allows for flexible, curved decision surfaces.

Some Historical Context

Sigmoid became popular in the 1980s alongside backpropagation, notably in the work of Rumelhart, Hinton, and Williams, “Learning Representations by Back‑propagating Errors” (1986).