Activation Functions: Sigmoid, ReLU, and Tanh Explained
Interview Preparation Hub for AI/ML Engineering Roles
1. Introduction
Activation functions are at the heart of neural networks. They introduce non-linearity into the model, enabling networks to learn complex patterns beyond simple linear relationships. Without activation functions, a stack of layers would collapse into a single linear transformation, making the network no more expressive than a linear model.
This guide explores three of the most widely used activation functions—Sigmoid, ReLU, and Tanh. We will cover their mathematical foundations, properties, advantages, limitations, applications, and relevance in modern deep learning architectures.
2. What is an Activation Function?
An activation function defines how a neuron's weighted sum of inputs plus bias is transformed before being passed to the next layer. Mathematically:
z = Σ (w_i * x_i) + b
a = f(z)
Where f is the activation function. The choice of f determines the network’s ability to capture non-linear relationships.
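As a concrete illustration, the sketch below computes the pre-activation z for a single neuron and applies an activation f. The inputs, weights, and bias are made-up values chosen only for demonstration.

```python
import numpy as np

# Hypothetical inputs, weights, and bias for a single neuron (illustrative values)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.7])
b = 0.2

z = np.dot(w, x) + b          # weighted sum of inputs plus bias
a = 1.0 / (1.0 + np.exp(-z))  # example activation f: the Sigmoid defined in Section 3
print(z, a)
```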
3. Sigmoid Function
The Sigmoid function maps inputs to the range (0, 1). It is defined as:
f(x) = 1 / (1 + e^(-x))
Properties:
- Range: (0, 1).
- S-shaped curve.
- Useful for probability interpretation.
Advantages:
- Simple and smooth.
- Interpretable as probability.
Limitations:
- Vanishing gradient problem (the derivative peaks at 0.25 and approaches zero for large |x|).
- Outputs not zero-centered.
- Slow convergence in deep networks.
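The following is a minimal NumPy sketch (with made-up sample inputs) of the Sigmoid and its derivative; it shows how the gradient shrinks toward zero for large |x|, which is the vanishing-gradient limitation listed above.

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(xs))       # outputs in (0, 1)
print(sigmoid_grad(xs))  # near zero for large |x| -> vanishing gradients
```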
4. Tanh Function
The Tanh function maps inputs to the range (-1, 1). It is defined as:
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Properties:
- Range: (-1, 1).
- S-shaped curve, centered at zero.
Advantages:
- Zero-centered outputs.
- Better convergence than Sigmoid.
Limitations:
- Still suffers from vanishing gradients.
- Computationally expensive compared to ReLU.
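A similar sketch for Tanh (illustrative inputs only): the outputs are zero-centered, but the derivative still vanishes for large |x|.

```python
import numpy as np

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)); np.tanh computes the same thing
    return np.tanh(x)

def tanh_grad(x):
    # f'(x) = 1 - tanh(x)^2; still approaches zero for large |x|
    return 1.0 - np.tanh(x) ** 2

xs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(tanh(xs))       # zero-centered outputs in (-1, 1)
print(tanh_grad(xs))  # largest gradient (1.0) at x = 0
```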
5. ReLU Function
ReLU is the most widely used activation function in modern deep learning. It is defined as:
f(x) = max(0, x)
Properties:
- Range: [0, ∞).
- Piecewise linear.
- Sparse activation (many neurons output zero).
Advantages:
- Efficient computation.
- Mitigates vanishing gradient problem.
- Enables deep networks to train faster.
Limitations:
- “Dying ReLU” problem: neurons whose pre-activations stay negative always output zero, receive zero gradient, and stop learning.
- Not zero-centered.
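A minimal sketch of ReLU and its gradient (illustrative inputs only): the gradient does not shrink on the positive side, but it is exactly zero for negative inputs, which is the mechanism behind the dying ReLU problem.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs are clipped to zero
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 otherwise; a neuron whose inputs stay
    # negative gets zero gradient -- the "dying ReLU" problem noted above
    return (x > 0).astype(float)

xs = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(relu(xs))       # sparse activation: many zeros
print(relu_grad(xs))  # no shrinking gradient on the positive side
```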
6. Comparison of Sigmoid, Tanh, and ReLU
| Function | Range | Advantages | Limitations |
|---|---|---|---|
| Sigmoid | (0, 1) | Probability interpretation | Vanishing gradients, not zero-centered |
| Tanh | (-1, 1) | Zero-centered, smoother convergence | Vanishing gradients |
| ReLU | [0, ∞) | Fast, efficient, mitigates vanishing gradients | Dying ReLU problem |
7. Applications in Deep Learning
- Sigmoid: Output layers for binary classification and probability outputs; also used as gating units in LSTMs/GRUs.
- Tanh: Hidden layers in shallow networks and recurrent cells (e.g., LSTM/GRU state updates).
- ReLU: Default choice for hidden layers in deep feedforward and convolutional networks; also common in Transformer feed-forward blocks (see the sketch below).
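To make this division of labour concrete, here is a small sketch of a two-layer network with made-up random weights: ReLU in the hidden layer, Sigmoid at the output to produce a binary probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up shapes and random weights, for illustration only
x = rng.normal(size=(4, 3))                # batch of 4 samples, 3 features
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

h = np.maximum(0.0, x @ W1 + b1)           # hidden layer: ReLU
p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # output layer: Sigmoid -> probability
print(p.ravel())                           # values in (0, 1) for binary classification
```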
8. Advanced Variants of ReLU
To address ReLU’s limitations, several variants were introduced (sketched in code after this list):
- Leaky ReLU: Uses a small, fixed non-zero slope for negative inputs instead of clipping to zero.
- Parametric ReLU (PReLU): Learns the slope for negative inputs during training.
- Exponential Linear Unit (ELU): Uses an exponential curve for negative inputs, giving smooth, near-zero-mean outputs.
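A minimal sketch of the three variants (default slope and alpha values are illustrative, not prescriptive):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Small fixed slope for negative inputs instead of a hard zero
    return np.where(x > 0, x, slope * x)

def prelu(x, a):
    # Same form as Leaky ReLU, but the slope `a` is a learned parameter
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    # Exponential curve for negative inputs: smooth and saturates at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(xs))
print(prelu(xs, a=0.25))
print(elu(xs))
```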
9. Challenges
- Choosing the right activation function for the task.
- Balancing computational efficiency with accuracy.
- Handling vanishing/exploding gradients.
10. Interview Notes
- Be ready to explain mathematical definitions of Sigmoid, Tanh, and ReLU.
- Discuss advantages and limitations of each.
- Explain vanishing gradient and dying ReLU problems.
- Describe applications and modern relevance.
Suggested study flow: Sigmoid → Tanh → ReLU → Variants → Applications → Challenges → Interview Prep