Written 05 August 2025 ~ 15 min read
Activation Functions

Benjamin Clark
Part 2 of 2 in Foundations of AI
Introduction

An activation function defines how a neuron transforms its inputs before passing them forward. By introducing non‑linearity, these functions give neural networks the expressive power to model complex relationships. This article examines their role, evolution from step functions to ReLU and beyond, and the properties that make some more effective than others.
The Goal
By the end of this article, you will understand:
- What activation functions do and why they’re essential.
- Why the step function is too limited for modern networks.
- How common activations (sigmoid, tanh, ReLU, Mish) work and their trade‑offs.
- How non‑linearity enables networks to build complex, multi‑layer representations.
Activation Functions Explained
At its core, an activation function decides how much of a neuron’s input “makes it through” to the next layer.
The perceptron’s step function acts like a binary switch — fully on or fully off — whereas modern activation functions allow graded responses.
This ability to represent varying levels of activation is what makes neural networks so powerful. Instead of restricting neurons to binary outputs, activation functions enable networks to represent more nuanced relationships.
Why Neural Networks Need Non‑Linear Activation
Without non‑linear activations, stacking layers adds no new representational power: the network collapses to a single linear transformation.
For example, stacking two linear layers:
$$y = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$$
This simplifies to just one linear transformation — offering no additional representational power compared to a single layer.
Non‑linear activations break this limitation. They allow each layer to reshape the data space in new ways, enabling the network to learn curved, flexible decision boundaries instead of being limited to straight lines or flat planes.
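To see the collapse numerically, here is a minimal NumPy sketch (the layer sizes and random weights are arbitrary, chosen only for illustration): the two stacked linear layers produce exactly the same outputs as one merged linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between.
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

x = rng.standard_normal(3)

# Stacked linear layers.
stacked = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer: W = W2 W1, b = W2 b1 + b2.
merged = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(stacked, merged))  # True: the stack collapses to one linear map
```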

Non‑linear activations are what enable networks to recognize patterns such as handwritten digits or spoken words.
Linear models can’t capture the nuanced, curved boundaries needed for such tasks; non‑linear activations give networks the flexibility to extract and combine features at multiple levels.
The Step Function
In the previous article, we introduced the step function — Rosenblatt’s original choice for the perceptron. It outputs either 0 or 1 based on whether the combined input z crosses a threshold:
$$f(z) = \begin{cases} 1 & \text{if } z \ge \theta \\ 0 & \text{if } z < \theta \end{cases}$$
The step function's weaknesses — lack of gradient, hard switching, and only linear decision boundaries — make it unsuitable for training deep networks. Researchers quickly replaced it with smoother, differentiable functions that could support multi‑layer learning.
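To make the gradient problem concrete, here is a small NumPy sketch (the threshold of 0 and the sample inputs are arbitrary choices): the step function's numerical derivative is zero everywhere away from the threshold, so gradient-based learning gets no signal from it.

```python
import numpy as np

def step(z, theta=0.0):
    """Rosenblatt-style step activation: 1 if z >= theta, else 0."""
    return np.where(z >= theta, 1.0, 0.0)

z = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-3

# Central-difference estimate of the derivative.
grad = (step(z + eps) - step(z - eps)) / (2 * eps)

print(step(z))  # [0. 0. 1. 1.]
print(grad)     # [0. 0. 0. 0.] -- no gradient signal anywhere away from the threshold
```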
Common Activation Functions
Modern activation functions address the limitations of Rosenblatt’s step function. They provide gradients for learning, handle a wider range of inputs, and give networks the expressive power needed for complex tasks.
Each one reshapes inputs differently, offering distinct trade-offs in training speed, stability, and representational power.
Sigmoid
Definition:
The sigmoid function maps any real-valued input into a smooth curve between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Intuition:
Acts like a soft switch: small inputs approach 0, large inputs saturate near 1, and the transition between them is smooth.
Pros:
- Differentiable everywhere (supports backpropagation).
- Outputs between 0 and 1, which can be interpreted as probabilities.
Cons:
- Vanishing gradients: Extreme inputs flatten the curve, slowing or halting learning.
- Not zero-centered, which can make optimization less efficient.
In machine learning, a gradient measures how much a function changes when its inputs change — think of it as the slope of a curve.
During training, gradients tell us how to adjust each weight to reduce error.
Activation functions directly influence these gradients, making them crucial for learning.
In deep networks, some activation functions (like sigmoid and tanh) produce gradients that get very close to zero for extreme input values.
As these small gradients are propagated back through many layers, they shrink further, effectively stopping weight updates.
This is the vanishing gradient problem — one of the key reasons sigmoid fell out of favor for deep networks.
A zero-centered activation outputs values roughly balanced between negative and positive.
This helps during optimization: when activations are centered around zero, gradients can flow in both directions, leading to faster and more stable convergence.
Sigmoid is not zero-centered — all outputs are positive — which can bias updates and slow training.
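A short NumPy sketch (the sample inputs are arbitrary) makes the saturation visible: the sigmoid's derivative, $\sigma(z)\,(1 - \sigma(z))$, peaks at 0.25 at $z = 0$ and collapses toward zero for large positive or negative inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:>5}: sigmoid={sigmoid(z):.5f}, gradient={sigmoid_grad(z):.6f}")

# z=  0.0: sigmoid=0.50000, gradient=0.250000
# z= 10.0: sigmoid=0.99995, gradient=0.000045  <- almost no learning signal
```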
Use cases:
- Still common in output layers for binary classification tasks.
- Rarely used in hidden layers of deep networks (mostly replaced by ReLU).


Sigmoid became popular in the 1980s alongside backpropagation, notably through Rumelhart, Hinton, and Williams' Learning Representations by Back‑propagating Errors (1986).
Tanh
Definition:
The hyperbolic tangent (tanh) function is similar to sigmoid but outputs between -1 and 1, making it zero-centered:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Intuition:
Like sigmoid, but centered at zero, allowing for both negative and positive outputs.
Pros:
- Zero-centered: Keeps activations balanced, improving gradient flow.
- Smooth and differentiable.
Cons:
- Suffers from the same vanishing gradient issue at extreme values.
Use cases:
- Common in hidden layers for shallow networks.
- Still used in recurrent neural networks (RNNs).
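One way to see the zero-centering difference is to push the same zero-mean inputs through tanh and sigmoid and compare the average outputs, as in this small NumPy sketch (the random inputs are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.standard_normal(100_000)  # zero-mean inputs

tanh_out = np.tanh(z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

print(f"mean tanh output:    {tanh_out.mean():+.3f}")     # ~ 0.00 (zero-centered)
print(f"mean sigmoid output: {sigmoid_out.mean():+.3f}")  # ~ +0.50 (always positive)
```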


Tanh became a preferred alternative to sigmoid in the late 1980s, popularized by Yann LeCun and colleagues in Efficient BackProp (1998).
ReLU
Definition:
The Rectified Linear Unit (ReLU) outputs zero for negative inputs and passes positive inputs through unchanged:

$$f(z) = \max(0, z)$$
Intuition:
Like a one-way gate: negative inputs are blocked, positive inputs flow through.
Pros:
- Computationally cheap and simple to implement.
- No saturation for positive inputs → avoids vanishing gradients.
- Works well in deep networks, enabling faster training.
Cons:
- Dying ReLU problem: Neurons can become permanently inactive.
- Not zero-centered.
A ReLU neuron outputs 0 for all negative inputs. If its weights are updated such that its weighted input z is negative for every example, it will always output 0 — effectively “dying.”
Once dead, these neurons may never reactivate, reducing the model’s capacity to learn.
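The sketch below (the weights and inputs are hand-picked for illustration) shows a "dead" ReLU neuron: its weighted input is negative for every example, so both its output and its gradient are zero, leaving gradient descent nothing to push against.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

# A neuron whose weights/bias drive the weighted input negative for all inputs.
w, b = np.array([-1.5, -2.0]), -1.0
X = np.array([[0.3, 0.8], [1.2, 0.1], [0.5, 0.5]])  # all non-negative features

z = X @ w + b
print(z)             # all negative
print(relu(z))       # [0. 0. 0.] -- the neuron never fires
print(relu_grad(z))  # [0. 0. 0.] -- zero gradient, so the weights never change
```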
Use cases:
- Default activation for hidden layers in CNNs, MLPs, and transformers.


ReLU gained prominence after Nair and Hinton’s Rectified Linear Units Improve Restricted Boltzmann Machines (2010) and was popularized by Krizhevsky et al. in AlexNet (2012).
Leaky ReLU
Definition:
Leaky ReLU modifies ReLU by adding a small negative slope for $z < 0$:

$$f(z) = \begin{cases} z & \text{if } z \ge 0 \\ \alpha z & \text{if } z < 0 \end{cases}$$
Intuition:
Allows small negative outputs instead of zero, helping prevent neurons from dying.
Pros:
- Fixes the “dying ReLU” problem.
- Still fast and efficient.
Cons:
- Introduces a hyperparameter α.
- Not zero-centered.
A hyperparameter is a value you set before training (like α in Leaky ReLU or ELU) that controls how the model behaves.
These values are not learned from data — they’re chosen by the practitioner and can significantly influence performance.
Use cases:
- Preferred when training deep networks with dead neuron issues.


Leaky ReLU was introduced by Maas et al. in Rectifier Nonlinearities Improve Neural Network Acoustic Models (2013).
ELU
Definition:
The Exponential Linear Unit (ELU) smoothly blends negative inputs toward a small negative value:

$$f(z) = \begin{cases} z & \text{if } z \ge 0 \\ \alpha \left(e^{z} - 1\right) & \text{if } z < 0 \end{cases}$$
Intuition:
By curving the negative side, ELU avoids dead neurons and makes activations zero-centered, improving learning dynamics.
Pros:
- Zero-centered output.
- Smooth gradient for negative inputs.
Cons:
- Slightly more expensive than ReLU.
- Still requires tuning α.
Use cases:
- Deep networks where stable convergence is critical.


ELU was proposed by Clevert et al. in Fast and Accurate Deep Network Learning by Exponential Linear Units (2015).
Mish
Definition:
The Mish activation is a smooth, non-monotonic function:

$$f(z) = z \cdot \tanh\bigl(\text{softplus}(z)\bigr) = z \cdot \tanh\bigl(\ln(1 + e^{z})\bigr)$$
Intuition:
Mish behaves like a smooth ReLU variant with better gradient flow and zero-centered output. It has been shown to improve performance in some deep models.
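For reference, here is a minimal NumPy version of Mish, written as $z \cdot \tanh(\text{softplus}(z))$; the sample inputs are arbitrary and simply show that small negative values pass through slightly instead of being zeroed out.

```python
import numpy as np

def mish(z):
    """Mish: z * tanh(softplus(z)), with softplus(z) = ln(1 + e^z)."""
    return z * np.tanh(np.log1p(np.exp(z)))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(mish(z))
# approximately [-0.1456 -0.3034  0.      0.8651  2.9865]
```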
Pros:
- Smooth, zero-centered, and non-monotonic.
- Empirically effective in some vision/NLP tasks.
Cons:
- More computationally expensive.
- Still relatively new and less widely supported.
Use cases:
- Experimental models that benefit from smooth gradients.


Mish was introduced by Diganta Misra in Mish: A Self Regularized Non-Monotonic Neural Activation Function (2019).
Leaky ReLU and ELU include a tunable parameter α. Common defaults: α=0.01 for Leaky ReLU and α=1.0 for ELU.
In Parametric ReLU (PReLU), α becomes learnable, allowing the model to adapt it during training.
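Here is a minimal NumPy sketch of both functions using those default values of α (in PReLU, the same α would be a learned parameter rather than a fixed constant, which is not shown here):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: small fixed slope alpha for negative inputs."""
    return np.where(z >= 0, z, alpha * z)

def elu(z, alpha=1.0):
    """ELU: smooth exponential curve toward -alpha for negative inputs."""
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(z))  # [-0.05   -0.01    0.      1.      5.    ]
print(elu(z))         # [-0.9933 -0.6321  0.      1.      5.    ]
```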
Comparing Activation Functions
Here’s a quick reference comparing the activation functions we’ve covered — their output ranges, whether they’re zero‑centered, and why these details matter in practice.
| Function | Output Range | Zero-Centered? | Why It Matters |
|---|---|---|---|
| Step | 0 or 1 | No | Historically important but too limited for modern learning. |
| Sigmoid | 0 to 1 | No | Smooth, interpretable as probabilities, but saturates (vanishing gradients). |
| Tanh | -1 to 1 | Yes | Centered at zero for balanced outputs, but still suffers from vanishing gradients. |
| ReLU | 0 to ∞ | No | Simple, fast, and dominant in deep networks, though neurons can “die.” |
| Leaky ReLU | -∞ to ∞ | No | Fixes the “dying ReLU” problem by allowing small negative outputs. |
| ELU | ~-1 to ∞ | Yes | Improves convergence by smoothing negative values, but slower to compute. |
| Mish | ≈ -0.31 to ∞ | Yes | Smooth and promising for stability, though slower and less tested than ReLU. |

Some activation functions don’t just decide whether a neuron “fires.” They produce probability distributions across multiple classes.
This is where Softmax comes in — a function that allows networks to output class probabilities (e.g., “90% cat, 10% dog”) instead of binary decisions.
We’ll explore Softmax in a future article, showing how it underpins multi‑class classification and modern architectures like transformers.
A probability distribution assigns a likelihood to each possible outcome.
In classification tasks, Softmax turns a network’s outputs into values between 0 and 1 that sum to 1 — representing how confident the model is in each class.
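As a small preview, here is a numerically stable softmax sketch in NumPy (the example scores are made up): it turns raw scores into probabilities that sum to 1.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

logits = np.array([2.2, 0.0, -1.0])  # e.g., raw scores for cat / dog / bird
probs = softmax(logits)
print(probs)        # ~[0.87, 0.10, 0.04] -- the model's confidence in each class
print(probs.sum())  # 1.0
```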
Each of these activations represents a step forward from the binary switch of Rosenblatt’s step function — adding the flexibility, stability, and expressiveness needed for modern deep learning systems.
From Single‑Layer to Multi‑Layer Networks
In the Why Neural Networks Need Non‑Linear Activation section, we saw that stacking purely linear layers collapses into a single linear transformation, limiting the network to straight-line or flat-plane decision boundaries.
Now let’s make that intuition concrete with some math — and see why non‑linear activations are what make multi‑layer networks powerful.
From One Layer to Multiple Layers
Recall from the previous article that a single perceptron computes:
$$a = f(w^\top x + b)$$
This works well for classifying simple, linearly separable data.
But what if we want to model more complex relationships?
The natural idea is to stack perceptrons into layers — so the output of one layer becomes the input to the next.
If we stack two perceptrons without an activation function (i.e., using only linear transformations):
$$a^{(1)} = W^{(1)}x + b^{(1)}$$
$$a^{(2)} = W^{(2)}a^{(1)} + b^{(2)} = W^{(2)}\left(W^{(1)}x + b^{(1)}\right) + b^{(2)} = \left(W^{(2)}W^{(1)}\right)x + \left(W^{(2)}b^{(1)} + b^{(2)}\right)$$
This simplifies to a single linear transformation.
In other words: stacking linear layers gives you… one linear layer.
No matter how many you add, you can’t model anything more complex than a straight line or flat plane.
How Non‑Linear Activation Unlocks Complexity
Now insert a non‑linear activation function f between the layers:
$$a^{(1)} = f\left(W^{(1)}x + b^{(1)}\right)$$
$$a^{(2)} = f\left(W^{(2)}a^{(1)} + b^{(2)}\right)$$
This breaks the collapse.
Each layer can now reshape the data in a new, non‑linear way, making it possible for the network to model curved, complex decision boundaries and capture intricate patterns.
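The NumPy sketch below (random weights, with tanh chosen arbitrarily as the non-linearity) contrasts the two cases by testing the additivity property that every single affine layer must satisfy: the purely linear stack passes the test, while the stack with an activation between the layers does not.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def two_layer(x, f):
    """Two stacked layers with activation f applied between them."""
    return W2 @ f(W1 @ x + b1) + b2

x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
zero = np.zeros(3)

# Any affine map g (a single linear layer plus bias) satisfies
# g(x1) + g(x2) - g(0) == g(x1 + x2).
def affine_check(f):
    lhs = two_layer(x1, f) + two_layer(x2, f) - two_layer(zero, f)
    return np.allclose(lhs, two_layer(x1 + x2, f))

print(affine_check(lambda z: z))  # True  -- no activation: behaves like one linear layer
print(affine_check(np.tanh))      # False -- the activation breaks the collapse
```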
A multi‑layer perceptron (MLP) is simply a network of stacked perceptrons, where each layer passes its outputs through a non‑linear activation function before feeding the next layer.
MLPs are the backbone of many deep learning models, from image classifiers to transformers, and will be explored in more depth in a future article.
Imagine building a network to recognize handwritten digits (like in the MNIST dataset).
- First layer: Detects simple features like edges or curves in the pixels.
- Second layer: Combines those edges into shapes — loops, lines, and corners.
- Third layer: Combines those shapes into complete digit representations (like “3” vs “8”).
Without non‑linear activations, each layer would just perform another linear transformation, and these rich, layered features would never emerge.
Non‑linearity is what lets each layer extract increasingly complex features from the data.
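As a rough skeleton of that idea, here is an untrained three-layer MLP forward pass in NumPy for 28×28 inputs; the layer sizes are arbitrary, the weights are random, and the code only shows the stacked linear-plus-non-linear structure, not learned features.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Untrained 3-layer MLP skeleton for flattened 28x28 images (sizes are arbitrary).
sizes = [784, 128, 64, 10]
weights = [rng.standard_normal((m, n)) * 0.01 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)              # hidden layers: linear step + non-linearity
    return weights[-1] @ a + biases[-1]  # raw scores for the 10 digit classes

x = rng.random(784)       # a fake flattened "image"
print(forward(x).shape)   # (10,)
```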
In upcoming posts, we’ll explore backpropagation — the algorithm used to train multi‑layer networks.
Backpropagation relies on the gradients of activation functions to update weights across all layers. Without non‑linear activations, this process would fail to create useful deep representations.
From Activation Functions to Multi‑Layer Networks
We’ve now seen how activation functions transform simple perceptrons into powerful building blocks for deep learning.
They introduce non‑linearity, enabling networks to model complex decision boundaries, and their design choices directly affect training stability, learning speed, and model performance.
These functions form the backbone of multi‑layer perceptrons (MLPs) — networks that stack perceptrons into layers to build increasingly rich representations of data.
Without non‑linear activations, a deep network collapses to a single linear layer, no matter how many layers you add.
Activations are what make deep networks truly deep — giving each layer the ability to reshape and reinterpret the data.
In the next article, we’ll dive into multi‑layer perceptrons and explore backpropagation — the algorithm that lets these networks learn by adjusting weights across all layers.