Introduction to Gradient Descent: The Foundation of Machine Learning
What is Gradient Descent?
Gradient descent is the workhorse optimization algorithm behind most machine learning models. It's an iterative method for finding a minimum of a function by repeatedly stepping in the direction of steepest descent, i.e., the negative gradient.
The Mathematical Foundation
The core update rule for gradient descent is elegantly simple:
\[\theta = \theta - \alpha \nabla J(\theta)\]
Where:
- $\theta$ represents the model parameters
- $\alpha$ is the learning rate (step size)
- $\nabla J(\theta)$ is the gradient of the cost function with respect to $\theta$
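For example, with a single parameter $\theta = 2$, learning rate $\alpha = 0.1$, and gradient $\nabla J(\theta) = 4$, one update gives $\theta = 2 - 0.1 \times 4 = 1.6$: the parameter moves against the gradient, toward lower cost.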
For simple linear regression, the mean squared error cost function is:
\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\]
Python Implementation
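With the linear hypothesis $h_\theta(x) = \theta^\top x$, differentiating this cost gives the gradient in matrix form:
\[\nabla J(\theta) = \frac{1}{m} X^\top (X\theta - y)\]
This is exactly the quantity the implementation below computes at each step.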
Here’s a minimal NumPy implementation of gradient descent for linear regression:
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    """
    Perform gradient descent to learn theta parameters.

    Args:
        X: Feature matrix (m x n)
        y: Target vector (m x 1)
        learning_rate: Step size for parameter updates
        iterations: Number of optimization steps

    Returns:
        theta: Optimized parameters
        cost_history: Loss at each iteration
    """
    m, n = X.shape
    theta = np.zeros((n, 1))
    cost_history = []

    for i in range(iterations):
        # Compute predictions
        predictions = X.dot(theta)
        # Compute error
        error = predictions - y
        # Compute gradient: (1/m) * X^T (X theta - y)
        gradient = (1 / m) * X.T.dot(error)
        # Update parameters in the negative gradient direction
        theta = theta - learning_rate * gradient
        # Track cost J(theta) for this iteration
        cost = (1 / (2 * m)) * np.sum(error ** 2)
        cost_history.append(cost)

    return theta, cost_history

# Example usage
X = np.random.randn(100, 3)  # 100 samples, 3 features
y = np.random.randn(100, 1)  # 100 target values
theta_optimal, costs = gradient_descent(X, y, learning_rate=0.1, iterations=500)
print(f"Optimal parameters: {theta_optimal.flatten()}")
Training Loop Visualization
The following Mermaid diagram illustrates the typical training loop in machine learning:
graph LR
    A[Load Data] --> B[Initialize Parameters θ]
    B --> C["Forward Pass: Compute Predictions"]
    C --> D["Compute Loss J(θ)"]
    D --> E["Backward Pass: Compute Gradients ∇J"]
    E --> F["Update Parameters: θ = θ - α∇J"]
    F --> G{Converged?}
    G -->|No| C
    G -->|Yes| H[Return Trained Model]
    style D fill:#ff6b6b
    style E fill:#4ecdc4
    style F fill:#95e1d3
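The "Converged?" decision in the diagram is commonly implemented as a tolerance check on the change in loss rather than a fixed iteration count. Here is a minimal sketch of that idea, reusing NumPy from above; the tolerance tol=1e-8 is an illustrative choice, not a universal default:

def gradient_descent_until_converged(X, y, learning_rate=0.01,
                                     max_iterations=10000, tol=1e-8):
    """Stop when the cost improves by less than tol between iterations."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    prev_cost = np.inf
    for _ in range(max_iterations):
        error = X.dot(theta) - y                   # forward pass
        cost = (1 / (2 * m)) * np.sum(error ** 2)  # compute loss
        if abs(prev_cost - cost) < tol:            # Converged? -> Yes
            break
        theta = theta - learning_rate * (1 / m) * X.T.dot(error)  # update
        prev_cost = cost                           # Converged? -> No, loop again
    return theta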
Common Pitfalls
Warning: Learning Rate Selection
Choosing the wrong learning rate is one of the most common mistakes in gradient descent:
- Too large: The algorithm may overshoot the minimum and diverge (loss explodes)
- Too small: Convergence becomes painfully slow, wasting computational resources
Start with $\alpha = 0.01$ and adjust based on your loss curve behavior.
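One practical way to pick a rate is a coarse sweep: run a short training at several candidate values and compare how the loss behaves. A minimal sketch, reusing the gradient_descent function and the X, y example data from above (the candidate rates are illustrative):

for lr in [1.0, 0.1, 0.01, 0.001]:
    _, sweep_costs = gradient_descent(X, y, learning_rate=lr, iterations=100)
    final = sweep_costs[-1]
    # A non-finite or rising cost suggests the rate is too large
    trend = "diverging" if (not np.isfinite(final) or final > sweep_costs[0]) else "decreasing"
    print(f"alpha={lr}: final cost={final:.6f} ({trend})")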
Tip: Monitoring Convergence
Always plot your loss curve (a minimal plotting sketch follows the list below)! A healthy training process shows:
- Monotonically decreasing loss
- Smooth convergence (not oscillating)
- Plateau indicating convergence
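A minimal sketch of such a plot, assuming matplotlib is installed and using the costs list returned by the example above:

import matplotlib.pyplot as plt

plt.plot(costs)  # costs returned by gradient_descent above
plt.xlabel("Iteration")
plt.ylabel("Cost J(theta)")
plt.title("Loss curve: should decrease smoothly and plateau")
plt.show()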
Variants and Extensions
Modern deep learning rarely uses vanilla gradient descent. Common variants include:
- Stochastic Gradient Descent (SGD): Updates parameters using one sample at a time
- Mini-batch GD: Balances computational efficiency and gradient accuracy (a sketch follows this list)
- Adam: Adaptive learning rates with momentum; a common default choice in practice
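To make the mini-batch variant concrete, here is a minimal sketch adapting the implementation above; epochs=50 and batch_size=32 are illustrative defaults, not tuned values:

def minibatch_gradient_descent(X, y, learning_rate=0.01, epochs=50, batch_size=32):
    """Gradient descent using one random mini-batch per parameter update."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    for epoch in range(epochs):
        indices = np.random.permutation(m)  # reshuffle each epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = Xb.dot(theta) - yb
            # Gradient estimated from the mini-batch only
            gradient = (1 / len(batch)) * Xb.T.dot(error)
            theta = theta - learning_rate * gradient
    return theta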
Conclusion
Gradient descent is the foundation upon which modern machine learning is built. Understanding its mechanics, limitations, and variants is essential for any ML practitioner. In future posts, we’ll explore advanced optimization techniques and their applications in deep learning.
Next in this series: Momentum and Adaptive Learning Rates
