Introduction to Gradient Descent: The Foundation of Machine Learning
What is Gradient Descent?
Gradient descent is the workhorse optimization algorithm behind most machine learning models. It's an iterative method for finding a minimum of a function by repeatedly stepping in the direction of steepest descent, i.e., the negative gradient.
The Mathematical Foundation
The core update rule for gradient descent is elegantly simple:
\[\theta = \theta - \alpha \nabla J(\theta)\]
Where:
- $\theta$ represents the model parameters
- $\alpha$ is the learning rate (step size)
- $\nabla J(\theta)$ is the gradient of the cost function with respect to $\theta$
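For example, with a single parameter $\theta = 2$, learning rate $\alpha = 0.1$, and gradient $\nabla J(\theta) = 4$, one update gives $\theta = 2 - 0.1 \times 4 = 1.6$: the parameter moves against the gradient, toward lower cost.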
For simple linear regression, the mean squared error cost function is:
\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\]
Python Implementation
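With the linear hypothesis $h_\theta(x) = \theta^\top x$, differentiating this cost gives the gradient in matrix form:
\[\nabla J(\theta) = \frac{1}{m} X^\top (X\theta - y)\]
This is exactly the quantity the implementation below computes at each step.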
Here’s a minimal NumPy implementation of gradient descent for linear regression:
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    """
    Perform gradient descent to learn theta parameters.

    Args:
        X: Feature matrix (m x n)
        y: Target vector (m x 1)
        learning_rate: Step size for parameter updates
        iterations: Number of optimization steps

    Returns:
        theta: Optimized parameters
        cost_history: Loss at each iteration
    """
    m, n = X.shape
    theta = np.zeros((n, 1))
    cost_history = []

    for i in range(iterations):
        # Compute predictions
        predictions = X.dot(theta)
        # Compute error
        error = predictions - y
        # Compute gradient: (1/m) * X^T (X theta - y)
        gradient = (1 / m) * X.T.dot(error)
        # Update parameters in the negative gradient direction
        theta = theta - learning_rate * gradient
        # Track cost J(theta) for this iteration
        cost = (1 / (2 * m)) * np.sum(error ** 2)
        cost_history.append(cost)

    return theta, cost_history

# Example usage
X = np.random.randn(100, 3)  # 100 samples, 3 features
y = np.random.randn(100, 1)  # 100 target values
theta_optimal, costs = gradient_descent(X, y, learning_rate=0.1, iterations=500)
print(f"Optimal parameters: {theta_optimal.flatten()}")
Training Loop Visualization
The following Mermaid diagram illustrates the typical training loop in machine learning:
graph LR
    A[Load Data] --> B[Initialize Parameters θ]
    B --> C["Forward Pass: Compute Predictions"]
    C --> D["Compute Loss J(θ)"]
    D --> E["Backward Pass: Compute Gradients ∇J"]
    E --> F["Update Parameters: θ = θ - α∇J"]
    F --> G{Converged?}
    G -->|No| C
    G -->|Yes| H[Return Trained Model]
    style D fill:#ff6b6b
    style E fill:#4ecdc4
    style F fill:#95e1d3
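The "Converged?" decision in the diagram is commonly implemented as a tolerance check on the change in loss rather than a fixed iteration count. Here is a minimal sketch of that idea, reusing NumPy from above; the tolerance tol=1e-8 is an illustrative choice, not a universal default:

def gradient_descent_until_converged(X, y, learning_rate=0.01,
                                     max_iterations=10000, tol=1e-8):
    """Stop when the cost improves by less than tol between iterations."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    prev_cost = np.inf
    for _ in range(max_iterations):
        error = X.dot(theta) - y                   # forward pass
        cost = (1 / (2 * m)) * np.sum(error ** 2)  # compute loss
        if abs(prev_cost - cost) < tol:            # Converged? -> Yes
            break
        theta = theta - learning_rate * (1 / m) * X.T.dot(error)  # update
        prev_cost = cost                           # Converged? -> No, loop again
    return theta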
Common Pitfalls
Warning: Learning Rate Selection
Choosing the wrong learning rate is one of the most common mistakes in gradient descent:
- Too large: The algorithm may overshoot the minimum and diverge (loss explodes)
- Too small: Convergence becomes painfully slow, wasting computational resources
Start with $\alpha = 0.01$ and adjust based on your loss curve behavior.
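One practical way to pick a rate is a coarse sweep: run a short training at several candidate values and compare how the loss behaves. A minimal sketch, reusing the gradient_descent function and the X, y example data from above (the candidate rates are illustrative):

for lr in [1.0, 0.1, 0.01, 0.001]:
    _, sweep_costs = gradient_descent(X, y, learning_rate=lr, iterations=100)
    final = sweep_costs[-1]
    # A non-finite or rising cost suggests the rate is too large
    trend = "diverging" if (not np.isfinite(final) or final > sweep_costs[0]) else "decreasing"
    print(f"alpha={lr}: final cost={final:.6f} ({trend})")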
Tip: Monitoring Convergence
Always plot your loss curve (a minimal plotting sketch follows the list below)! A healthy training process shows:
- Monotonically decreasing loss
- Smooth convergence (not oscillating)
- Plateau indicating convergence
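A minimal sketch of such a plot, assuming matplotlib is installed and using the costs list returned by the example above:

import matplotlib.pyplot as plt

plt.plot(costs)  # costs returned by gradient_descent above
plt.xlabel("Iteration")
plt.ylabel("Cost J(theta)")
plt.title("Loss curve: should decrease smoothly and plateau")
plt.show()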
Variants and Extensions
Modern deep learning rarely uses vanilla gradient descent. Common variants include:
- Stochastic Gradient Descent (SGD): Updates parameters using one sample at a time
- Mini-batch GD: Balances computational efficiency and gradient accuracy (a sketch follows this list)
- Adam: Adaptive learning rates with momentum; a common default choice in practice
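To make the mini-batch variant concrete, here is a minimal sketch adapting the implementation above; epochs=50 and batch_size=32 are illustrative defaults, not tuned values:

def minibatch_gradient_descent(X, y, learning_rate=0.01, epochs=50, batch_size=32):
    """Gradient descent using one random mini-batch per parameter update."""
    m, n = X.shape
    theta = np.zeros((n, 1))
    for epoch in range(epochs):
        indices = np.random.permutation(m)  # reshuffle each epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = Xb.dot(theta) - yb
            # Gradient estimated from the mini-batch only
            gradient = (1 / len(batch)) * Xb.T.dot(error)
            theta = theta - learning_rate * gradient
    return theta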
Conclusion
Gradient descent is the foundation upon which modern machine learning is built. Understanding its mechanics, limitations, and variants is essential for any ML practitioner. In future posts, we’ll explore advanced optimization techniques and their applications in deep learning.
Next in this series: Momentum and Adaptive Learning Rates
