Gradient Descent Explained Visually: From Concept to Code

Introduction

"If you dropped a ball on a valley-shaped hill, where would it settle? That's gradient descent in action."

In simple terms:

  • Gradient Descent is an algorithm used to find the minimum of a function.
  • It is used heavily in machine learning to minimise the error/loss of predictions.

The Intuition Behind Gradient Descent

🚶‍♂️ Imagine Walking Downhill

  • You're blindfolded.
  • You can only feel the slope at your feet.
  • You take small steps in the direction of steepest descent.

This is what the algorithm does:

"Take a small step opposite to the gradient."

Where Does Gradient Descent Fit in a Machine Learning Pipeline?

Gradient Descent is the engine that powers the learning process in most supervised ML models. Here's a breakdown:

Typical ML Workflow:

  1. Data Preparation - Input features (X) and target (y) are prepared.
  2. Model Initialisation - Weights and biases (parameters) are set randomly.
  3. Forward Propagation - Predictions are made using current parameters.
  4. Loss Calculation - Compute how far predictions are from actual targets.
  5. Backward Propagation - Use backpropagation (the chain rule) to compute the gradients of the loss with respect to each parameter.
  6. Parameter Update - Adjust weights using gradient descent.
  7. Repeat - Iterate over multiple epochs until convergence.

Gradient Descent is responsible for step 6, which updates the model to reduce error.
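
To make this concrete, here is a minimal sketch of the workflow for a one-feature linear regression; the variable names (w, b, lr) and the toy data are illustrative, not taken from any particular library:

import numpy as np

# 1. Data preparation: toy data generated from y = 3x + 2 plus noise
X = np.linspace(0, 1, 50)
y = 3 * X + 2 + 0.1 * np.random.randn(50)

# 2. Model initialisation
w, b = 0.0, 0.0
lr = 0.1

for epoch in range(1000):                   # 7. Repeat over epochs
    y_pred = w * X + b                      # 3. Forward propagation
    loss = np.mean((y_pred - y) ** 2)       # 4. Loss calculation (MSE)
    grad_w = 2 * np.mean((y_pred - y) * X)  # 5. Gradients of the loss
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w                        # 6. Parameter update (gradient descent)
    b -= lr * grad_b

print(w, b)  # should end up close to 3 and 2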

Why Use Gradient Descent?

🔍 Why Not Just Try Every Possible Value?

In theory, we could:

  • Evaluate the loss function for all possible combinations of parameter values.
  • Pick the combination with the lowest error.

🚫 Why this doesn't work:

  • A modern deep learning model has millions (or billions) of parameters.
  • Even checking a tiny slice of possibilities is computationally infeasible.
  • The number of combinations grows exponentially with more dimensions.

For example, if we tried 1 million (10⁶) values for each of 100 parameters, the number of combinations would be:

(10⁶)¹⁰⁰ = 10⁶⁰⁰

That is far more than the estimated number of atoms in the observable universe (roughly 10⁸⁰).

Even the world's fastest supercomputer can't handle this.
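
You can verify the arithmetic directly in Python, which handles arbitrarily large integers; the counts below are the illustrative ones from above:

values_per_parameter = 10**6
num_parameters = 100
combinations = values_per_parameter ** num_parameters
print(len(str(combinations)) - 1)  # 600, i.e. 10**600 combinations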

Why Gradient Descent Wins

Gradient Descent gives us a smart shortcut:

  • It doesn't search randomly.
  • It uses the gradient (slope) to find the best direction in which to reduce error.
  • It iteratively refines parameters with minimal computation.

You go from being lost in a jungle with no map… to having a compass that always points downhill.

Understanding Convex Functions

What is a Convex Function?

  • A convex function has one global minimum.
  • Think of a bowl-shaped curve.
  • Important because, with a suitably small learning rate, gradient descent is guaranteed to converge to that minimum.

Visual:

Plot a simple convex function:

import numpy as np  
import matplotlib.pyplot as plt  

x = np.linspace(-10, 10, 100)  
y = x**2  

plt.plot(x, y)  
plt.title("Convex Function: y = x²")  
plt.xlabel("x")  
plt.ylabel("y")  
plt.grid(True)  
plt.show()

Plotting Descent Steps

import numpy as np  
import matplotlib.pyplot as plt  

# Function and Gradient  
def f(x): return x**2  
def grad(x): return 2*x  

# Gradient Descent  
x_vals = [8]  # Start from x=8  
alpha = 0.1  

for _ in range(20):  
    x_new = x_vals[-1] - alpha * grad(x_vals[-1])  
    x_vals.append(x_new)  

# Plot descent steps  
x = np.linspace(-10, 10, 100)  
y = f(x)  

plt.plot(x, y, label="y = x²")  
plt.scatter(x_vals, [f(i) for i in x_vals], color='red')  
plt.plot(x_vals, [f(i) for i in x_vals], 'r--', label="Descent Path")  
plt.title("Gradient Descent on Convex Function")  
plt.xlabel("x")  
plt.ylabel("f(x)")  
plt.legend()  
plt.grid(True)  
plt.show()

Key Observations

  • Each red dot is a step downhill.
  • The smaller the learning rate, the slower the descent.
  • Too big a learning rate? It may overshoot or diverge!
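
You can see that overshooting behaviour by rerunning the loop above with larger step sizes. This is a small sketch using the same f(x) = x² example; the alpha values are illustrative:

def grad(x):
    return 2 * x  # gradient of f(x) = x**2

for alpha in (0.1, 0.9, 1.1):
    x = 8.0
    for _ in range(20):
        x = x - alpha * grad(x)
    print(f"alpha = {alpha}: x after 20 steps = {x:.4g}")

# alpha = 0.1 creeps towards 0, alpha = 0.9 oscillates but still shrinks,
# and alpha = 1.1 blows up: each step overshoots further than the last.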

How Do You Know the Result Is Acceptable?

  • Loss is minimised (reaches a plateau)
  • Model performance is good on unseen/validation data
  • Gradients are close to zero (no further updates required)
  • Predictions are stable across epochs

If these aren't true:

  • You might need a lower learning rate.
  • Consider switching to advanced optimisers (Adam, RMSprop).
  • Add regularisation to avoid overfitting.
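
One simple way to encode the first and third checks is a stopping rule on the gradient magnitude. A minimal sketch, reusing the f(x) = x² example; the tolerance and step limit are illustrative choices:

def run_until_converged(x, alpha, grad, max_steps=10_000, tol=1e-8):
    # Stop when the gradient is effectively zero (no further updates needed)
    for step in range(max_steps):
        g = grad(x)
        if abs(g) < tol:
            return x, step
        x = x - alpha * g
    return x, max_steps

x_final, steps = run_until_converged(8.0, 0.1, lambda x: 2 * x)
print(x_final, steps)  # x_final is essentially 0 after roughly a hundred steps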

Fine-tuning Gradient Descent

Key Hyperparameters:

  • Learning Rate (α) - How big each step is
  • Batch Size - Number of samples used to compute the gradient at each step
  • Number of Epochs - Total passes over the dataset

Practical Tips:

  • Use learning rate decay to gradually reduce the step size as training progresses (see the sketch after this list)
  • Track both training and validation losses
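
For instance, a simple exponential decay schedule can be sketched like this; the decay factor is an illustrative choice, not a recommended default:

initial_lr = 0.1
decay = 0.95  # shrink the step size by 5% each epoch

for epoch in range(10):
    lr = initial_lr * decay ** epoch
    print(f"epoch {epoch}: learning rate = {lr:.4f}")
    # ...run one epoch of gradient descent updates using this lr...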

Variants of Gradient Descent

  • Batch GD: Uses all data to compute gradients (stable, slow)
  • Stochastic GD: One sample at a time (noisy, fast)
  • Mini-batch GD: Subsets of data (balanced)
  • Adam: Adaptive Moment Estimation, combining momentum with per-parameter learning rates (recommended for deep nets)
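
To see how the first three variants differ in code, here is a minimal mini-batch sketch built on the same toy linear-regression data as earlier; the batch size of 32 is an illustrative choice:

import numpy as np

# Toy data for the same linear model as before: y ≈ 3x + 2
X = np.linspace(0, 1, 200)
y = 3 * X + 2 + 0.1 * np.random.randn(200)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(100):
    idx = np.random.permutation(len(X))        # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # one mini-batch of indices
        Xb, yb = X[batch], y[batch]
        y_pred = w * Xb + b
        grad_w = 2 * np.mean((y_pred - yb) * Xb)
        grad_b = 2 * np.mean(y_pred - yb)
        w -= lr * grad_w                       # gradient descent update
        b -= lr * grad_b

print(w, b)  # should end up close to 3 and 2
# batch_size = len(X) recovers Batch GD; batch_size = 1 recovers Stochastic GD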

Final Thoughts

  • Gradient descent is fundamental to ML and deep learning.
  • It automates learning by reducing prediction error.
  • Visualising it helps you grasp both the intuition and the underlying mathematics.

Whether you're training a linear model or a deep neural net, understanding how gradient descent adjusts parameters brings transparency and control to your modelling journey.
