Gradient Descent Explained Visually: From Concept to Code

Introduction

"If you dropped a ball on a valley-shaped hill, where would it settle? That's gradient descent in action."

In simple terms:

  • Gradient Descent is an algorithm used to find the minimum of a function.
  • It is used heavily in machine learning to minimise the error/loss of predictions.

The Intuition Behind Gradient Descent

🚶‍♂️ Imagine Walking Downhill

  • You're blindfolded.
  • You can only feel the slope at your feet.
  • You take small steps in the direction of steepest descent.

This is what the algorithm does:

"Take a small step opposite to the gradient."

Where Does Gradient Descent Fit in a Machine Learning Pipeline?

Gradient Descent is the engine that powers the learning process in most supervised ML models. Here's a breakdown:

Typical ML Workflow:

  1. Data Preparation - Input features (X) and target (y) are prepared.
  2. Model Initialisation - Weights and biases (parameters) are set randomly.
  3. Forward Propagation - Predictions are made using current parameters.
  4. Loss Calculation - Compute how far predictions are from actual targets.
  5. Backward Propagation - Use backpropagation (the chain rule) to compute the gradients of the loss with respect to each parameter.
  6. Parameter Update - Adjust weights using gradient descent.
  7. Repeat - Iterate over multiple epochs until convergence.

Gradient Descent is responsible for step 6, which updates the model to reduce error.
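
To make this concrete, here is a minimal sketch of the workflow for a one-feature linear regression; the variable names (w, b, lr) and the toy data are illustrative, not taken from any particular library:

import numpy as np

# 1. Data preparation: toy data generated from y = 3x + 2 plus noise
X = np.linspace(0, 1, 50)
y = 3 * X + 2 + 0.1 * np.random.randn(50)

# 2. Model initialisation
w, b = 0.0, 0.0
lr = 0.1

for epoch in range(1000):                   # 7. Repeat over epochs
    y_pred = w * X + b                      # 3. Forward propagation
    loss = np.mean((y_pred - y) ** 2)       # 4. Loss calculation (MSE)
    grad_w = 2 * np.mean((y_pred - y) * X)  # 5. Gradients of the loss
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w                        # 6. Parameter update (gradient descent)
    b -= lr * grad_b

print(w, b)  # should end up close to 3 and 2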

Why Use Gradient Descent?

🔍 Why Not Just Try Every Possible Value?

In theory, we could:

  • Evaluate the loss function for all possible combinations of parameter values.
  • Pick the combination with the lowest error.

🚫 Why this doesn't work:

  • A modern deep learning model has millions (or billions) of parameters.
  • Even checking a tiny slice of possibilities is computationally infeasible.
  • The number of combinations grows exponentially with more dimensions.

For example, if we tried 1 million (10⁶) values for each of 100 parameters, the number of combinations would be:

(10⁶)¹⁰⁰ = 10⁶⁰⁰

That is far more than the estimated number of atoms in the observable universe (roughly 10⁸⁰).

Even the world's fastest supercomputer can't handle this.
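
You can verify the arithmetic directly in Python, which handles arbitrarily large integers; the counts below are the illustrative ones from above:

values_per_parameter = 10**6
num_parameters = 100
combinations = values_per_parameter ** num_parameters
print(len(str(combinations)) - 1)  # 600, i.e. 10**600 combinations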

Why Gradient Descent Wins

Gradient Descent gives us a smart shortcut:

  • It doesn't search randomly.
  • It uses the gradient (slope) to find the best direction in which to reduce error.
  • It iteratively refines parameters with minimal computation.

You go from being lost in a jungle with no map… to having a compass that always points downhill.

Understanding Convex Functions

What is a Convex Function?

  • A convex function has one global minimum.
  • Think of a bowl-shaped curve.
  • Important because, with a suitably small learning rate, gradient descent is guaranteed to converge to that minimum.

Visual:

Plot a simple convex function:

import numpy as np  
import matplotlib.pyplot as plt  

x = np.linspace(-10, 10, 100)  
y = x**2  

plt.plot(x, y)  
plt.title("Convex Function: y = x²")  
plt.xlabel("x")  
plt.ylabel("y")  
plt.grid(True)  
plt.show()

Plotting Descent Steps

import numpy as np  
import matplotlib.pyplot as plt  

# Function and Gradient  
def f(x): return x**2  
def grad(x): return 2*x  

# Gradient Descent  
x_vals = [8]  # Start from x=8  
alpha = 0.1  

for _ in range(20):  
    x_new = x_vals[-1] - alpha * grad(x_vals[-1])  
    x_vals.append(x_new)  

# Plot descent steps  
x = np.linspace(-10, 10, 100)  
y = f(x)  

plt.plot(x, y, label="y = x²")  
plt.scatter(x_vals, [f(i) for i in x_vals], color='red')  
plt.plot(x_vals, [f(i) for i in x_vals], 'r--', label="Descent Path")  
plt.title("Gradient Descent on Convex Function")  
plt.xlabel("x")  
plt.ylabel("f(x)")  
plt.legend()  
plt.grid(True)  
plt.show()

Key Observations

  • Each red dot is a step downhill.
  • The smaller the learning rate, the slower the descent.
  • Too big a learning rate? It may overshoot or diverge!
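
You can see that overshooting behaviour by rerunning the loop above with larger step sizes. This is a small sketch using the same f(x) = x² example; the alpha values are illustrative:

def grad(x):
    return 2 * x  # gradient of f(x) = x**2

for alpha in (0.1, 0.9, 1.1):
    x = 8.0
    for _ in range(20):
        x = x - alpha * grad(x)
    print(f"alpha = {alpha}: x after 20 steps = {x:.4g}")

# alpha = 0.1 creeps towards 0, alpha = 0.9 oscillates but still shrinks,
# and alpha = 1.1 blows up: each step overshoots further than the last.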

How Do You Know the Result Is Acceptable?

  • Loss is minimised (reaches a plateau)
  • Model performance is good on unseen/validation data
  • Gradients are close to zero (no further updates required)
  • Predictions are stable across epochs

If these aren't true:

  • You might need a lower learning rate.
  • Consider switching to advanced optimisers (Adam, RMSprop).
  • Add regularisation to avoid overfitting.
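
One simple way to encode the first and third checks is a stopping rule on the gradient magnitude. A minimal sketch, reusing the f(x) = x² example; the tolerance and step limit are illustrative choices:

def run_until_converged(x, alpha, grad, max_steps=10_000, tol=1e-8):
    # Stop when the gradient is effectively zero (no further updates needed)
    for step in range(max_steps):
        g = grad(x)
        if abs(g) < tol:
            return x, step
        x = x - alpha * g
    return x, max_steps

x_final, steps = run_until_converged(8.0, 0.1, lambda x: 2 * x)
print(x_final, steps)  # x_final is essentially 0 after roughly a hundred steps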

Fine-tuning Gradient Descent

Key Hyperparameters:

  • Learning Rate (α) - How big each step is
  • Batch Size - Number of samples used to compute the gradient at each step
  • Number of Epochs - Total passes over the dataset

Practical Tips:

  • Use learning rate decay to gradually reduce the step size as training progresses (see the sketch after this list)
  • Track both training and validation losses
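
For instance, a simple exponential decay schedule can be sketched like this; the decay factor is an illustrative choice, not a recommended default:

initial_lr = 0.1
decay = 0.95  # shrink the step size by 5% each epoch

for epoch in range(10):
    lr = initial_lr * decay ** epoch
    print(f"epoch {epoch}: learning rate = {lr:.4f}")
    # ...run one epoch of gradient descent updates using this lr...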

Variants of Gradient Descent

  • Batch GD: Uses all data to compute gradients (stable, slow)
  • Stochastic GD: One sample at a time (noisy, fast)
  • Mini-batch GD: Subsets of data (balanced)
  • Adam: Adaptive Moment Estimation, combining momentum with per-parameter learning rates (recommended for deep nets)
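
To see how the first three variants differ in code, here is a minimal mini-batch sketch built on the same toy linear-regression data as earlier; the batch size of 32 is an illustrative choice:

import numpy as np

# Toy data for the same linear model as before: y ≈ 3x + 2
X = np.linspace(0, 1, 200)
y = 3 * X + 2 + 0.1 * np.random.randn(200)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(100):
    idx = np.random.permutation(len(X))        # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # one mini-batch of indices
        Xb, yb = X[batch], y[batch]
        y_pred = w * Xb + b
        grad_w = 2 * np.mean((y_pred - yb) * Xb)
        grad_b = 2 * np.mean(y_pred - yb)
        w -= lr * grad_w                       # gradient descent update
        b -= lr * grad_b

print(w, b)  # should end up close to 3 and 2
# batch_size = len(X) recovers Batch GD; batch_size = 1 recovers Stochastic GD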

Final Thoughts

  • Gradient descent is fundamental to ML and deep learning.
  • It automates learning by reducing prediction error.
  • Visualising it helps you grasp both the intuition and the underlying mathematics.

Whether you're training a linear model or a deep neural net, understanding how gradient descent adjusts parameters brings transparency and control to your modelling journey.
