Definition
Gradient descent is the fundamental optimization algorithm used to train machine learning models. It works by computing the gradient (the direction of steepest increase) of the loss function with respect to the model parameters, then taking a step in the opposite direction to reduce the loss. Repeated over many iterations, this process finds parameter values that (at least locally) minimize the loss and, with it, the model's prediction errors.
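As a minimal sketch in plain Python (the quadratic loss, starting point, and learning rate are illustrative assumptions):

```python
# Minimize L(theta) = (theta - 3)^2 by gradient descent.
# The gradient is dL/dtheta = 2 * (theta - 3); each step moves against it.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # arbitrary starting point
alpha = 0.1   # learning rate (illustrative value)
for _ in range(100):
    theta -= alpha * grad(theta)

print(theta)  # approaches the minimizer theta = 3
```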
Why it matters
Gradient descent enables neural network training:
- Universal optimizer — works for any differentiable loss function
- Scalable — handles billions of parameters efficiently
- Foundation — the basis for virtually all modern deep learning
- Variants — SGD, Adam, and AdaGrad refine the basic algorithm
- Convergence — mathematically guaranteed under certain conditions
Without gradient descent (or one of its variants), training large language models would be impractical.
How it works
┌────────────────────────────────────────────────────────────┐
│ GRADIENT DESCENT │
├────────────────────────────────────────────────────────────┤
│ │
│ Core update rule: θ_new = θ_old - α × ∇L(θ) │
│ │
│ θ = parameters │
│ α = learning rate │
│ ∇L = gradient of loss │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ VISUALIZATION (2D parameter space): │ │
│ │ │ │
│ │ Loss │ │
│ │ │ ○ Start │ │
│ │ │ \ │ │
│ │ │ \ ← gradient points uphill │ │
│ │ │ ○ │ │
│ │ │ \ │ │
│ │ │ ○ │ │
│ │ │ \ │ │
│ │ │ ★ Minimum (goal) │ │
│ │ └────────────────────────► Parameter │ │
│ │ │ │
│ │ Each step moves opposite to gradient │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ VARIANTS: │
│ ───────── │
│ │
│ ┌─────────────┬────────────────────────────────┐ │
│ │ Batch GD │ Use all data per update │ │
│ │ │ Slow but stable │ │
│ ├─────────────┼────────────────────────────────┤ │
│ │ SGD │ Use one sample per update │ │
│ │ │ Fast but noisy │ │
│ ├─────────────┼────────────────────────────────┤ │
│ │ Mini-batch │ Use batch of samples │ │
│ │ │ Best of both (most common) │ │
│ └─────────────┴────────────────────────────────┘ │
│ │
│ LEARNING RATE EFFECT: │
│ ───────────────────── │
│ │
│ Too small: ....○....○....○....○.... (slow) │
│ Just right: ○---○---○---★ (converges) │
│ Too large: ○ ○ ○ (diverges/oscillates) │
│ \/ \/ │
│ │
└────────────────────────────────────────────────────────────┘
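The update rule and the three variants above can be sketched together in NumPy; the linear-regression loss, batch size, and learning rate below are illustrative assumptions, not recommendations:

```python
import numpy as np

# Mini-batch gradient descent on linear regression with synthetic data.
# Loss: L(w) = mean((X @ w - y)^2); per-batch gradient: 2/B * X_b.T @ (X_b @ w - y_b)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
alpha, batch_size = 0.05, 32          # assumed hyperparameters
for epoch in range(20):
    idx = rng.permutation(len(X))     # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        g = 2.0 / len(b) * X[b].T @ (X[b] @ w - y[b])
        w -= alpha * g                # core update: w <- w - alpha * gradient

print(np.round(w, 2))  # recovers something close to true_w
```

Setting batch_size to len(X) recovers batch GD, and batch_size = 1 recovers SGD.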
Modern optimizers:
| Optimizer | Key feature | When to use |
|---|---|---|
| SGD | Simple; momentum optional | Well-tuned training pipelines |
| Adam | Adaptive per-parameter learning rates | Default choice |
| AdamW | Adam + decoupled weight decay | Transformers, LLMs |
| AdaGrad | Per-parameter rates from accumulated squared gradients | Sparse data |
| RMSprop | Exponential moving average of squared gradients | RNNs |
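In PyTorch, each optimizer in the table is a one-line swap around the same training step; the tiny model and hyperparameter values here are illustrative, not recommendations:

```python
import torch

model = torch.nn.Linear(10, 1)  # any differentiable model works

# The optimizers from the table, as instantiated in PyTorch:
opt_sgd     = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
opt_adam    = torch.optim.Adam(model.parameters(), lr=3e-4)
opt_adamw   = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
opt_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3)

# One training step looks the same regardless of the optimizer chosen:
x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
opt_adamw.zero_grad()
loss.backward()   # backpropagation fills in the gradients
opt_adamw.step()  # applies that optimizer's update rule
```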
Common questions
Q: What’s a good learning rate?
A: It depends on the model and optimizer. For Adam-style optimizers, 1e-4 to 3e-4 is a common starting range (full fine-tuning of large models often uses smaller values, around 1e-5 to 5e-5). Too high causes divergence; too low causes slow training. Learning rate schedules that decay over time often help.
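As a sketch of one such schedule in PyTorch (cosine annealing is a common choice; the 1,000-step horizon and base rate are illustrative assumptions):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Decay the learning rate from 3e-4 toward zero over 1000 steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    x, y = torch.randn(8, 10), torch.randn(8, 1)     # stand-in batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # adjust the learning rate after each optimizer step
```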
Q: What’s the difference between batch, mini-batch, and stochastic gradient descent?
A: Batch GD uses all training data per update (accurate but slow). Stochastic GD uses one sample (fast but noisy). Mini-batch uses a subset (typically 16-128 samples)—the practical default combining speed and stability.
Q: Why might training loss not decrease?
A: Common causes: learning rate too high (diverging), learning rate too low (stuck), bad initialization, vanishing/exploding gradients, or data issues. Try reducing learning rate, gradient clipping, or different initialization.
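Gradient clipping, for example, is one line between backward() and step() in PyTorch; the max norm of 1.0 is a common but illustrative choice:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0,
# which mitigates exploding gradients before the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```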
Q: How does gradient descent handle local minima?
A: In high-dimensional spaces (millions of parameters), local minima are rarely problematic—most critical points are saddle points. Momentum and noise from mini-batching help escape suboptimal regions.
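A minimal sketch of the classical (heavy-ball) momentum update in plain Python; the double-well loss and hyperparameters are illustrative assumptions, with theta = 0 standing in for a flat stationary region:

```python
# Momentum keeps an exponentially weighted running average of past gradients,
# so the iterate keeps moving even where the local gradient is nearly zero.
def grad(theta):              # toy loss: L(theta) = theta**4 - 2 * theta**2
    return 4 * theta**3 - 4 * theta

theta, velocity = 0.01, 0.0   # start just off the stationary point at theta = 0
alpha, beta = 0.01, 0.9       # learning rate and momentum coefficient
for _ in range(500):
    velocity = beta * velocity - alpha * grad(theta)
    theta += velocity

print(theta)  # settles at the minimum theta = 1 instead of stalling near 0
```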
Related terms
- Loss Function — what gradient descent minimizes
- Backpropagation — computes gradients efficiently
- Fine-tuning — applies gradient descent to adapt models
- Neural Network — trained via gradient descent
References
Ruder (2016), “An overview of gradient descent optimization algorithms”, arXiv. [5,000+ citations]
Kingma & Ba (2015), “Adam: A Method for Stochastic Optimization”, ICLR. [100,000+ citations]
Goodfellow et al. (2016), “Deep Learning”, MIT Press. Chapter 8. [20,000+ citations]
Loshchilov & Hutter (2019), “Decoupled Weight Decay Regularization”, ICLR. [5,000+ citations]