Definition
A loss function (or cost function) is a mathematical measure of the difference between a model’s predictions and the actual target values. During training, the model’s parameters are adjusted to minimize this loss, effectively teaching the model to make better predictions. For language models, cross-entropy loss is the standard choice: it measures how much probability the model’s predicted distribution assigns to the true next token.
Why it matters
Loss functions are central to machine learning:
- Training signal — guides parameter updates during optimization
- Model comparison — compare different architectures or hyperparameters
- Progress tracking — monitor if training is improving
- Convergence detection — identify when to stop training
- Quality proxy — lower loss generally indicates better performance
The choice of loss function shapes what the model learns to optimize.
How it works
┌────────────────────────────────────────────────────────────┐
│ LOSS FUNCTION │
├────────────────────────────────────────────────────────────┤
│ │
│ CROSS-ENTROPY LOSS (for language models): │
│ ───────────────────────────────────────── │
│ │
│ True label: "cat" (one-hot: [0, 1, 0, 0]) │
│ Predicted: [0.1, 0.7, 0.15, 0.05] │
│ │
│ Loss = -Σ true_i × log(pred_i) │
│ = -0×log(0.1) - 1×log(0.7) - 0×log(0.15) - ... │
│ = -log(0.7) │
│ = 0.36 │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ LOSS LANDSCAPE VISUALIZATION: │ │
│ │ │ │
│ │ Loss │ │
│ │ │ * │ │
│ │ │ * * * │ │
│ │ │ * * * * │ │
│ │ │ * * * * │ │
│ │ │ * * * * │ │
│ │ │* ** * │ │
│ │ │ ▲ ** │ │
│ │ └───────────┼──────────────► Params │ │
│ │ │ │ │
│ │ Local minimum (goal) │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ COMMON LOSS FUNCTIONS: │
│ ───────────────────── │
│ │
│ Cross-Entropy Classification, LLMs │
│ ───────────────────────────────────── │
│ L = -Σ y_i × log(ŷ_i) │
│ │
│ Mean Squared Error (MSE) Regression │
│ ───────────────────────────────────── │
│ L = 1/n × Σ(y - ŷ)² │
│ │
│ Binary Cross-Entropy Binary classification │
│ ───────────────────────────────────── │
│ L = -[y×log(ŷ) + (1-y)×log(1-ŷ)] │
│ │
└────────────────────────────────────────────────────────────┘
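The three formulas in the box above can be sketched in plain Python. Function names like `cross_entropy` here are illustrative, not from any particular library:

```python
import math

def cross_entropy(y_true, y_pred):
    """Multi-class cross-entropy: -sum(y_i * log(p_i)) for a one-hot y_true."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def mse(y_true, y_pred):
    """Mean squared error for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y, p):
    """Binary cross-entropy for a single 0/1 label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Worked example from the box: true class "cat" at index 1
print(round(cross_entropy([0, 1, 0, 0], [0.1, 0.7, 0.15, 0.05]), 2))  # 0.36
```

Production frameworks compute cross-entropy from raw logits for numerical stability rather than from probabilities as done here.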
Loss functions by task:
| Task | Loss Function | Notes |
|---|---|---|
| Language modeling | Cross-entropy | Predicts next token distribution |
| Classification | Cross-entropy | Multi-class predictions |
| Regression | MSE / MAE | Continuous outputs |
| Contrastive learning | InfoNCE | Embedding similarity |
| Reinforcement learning | Policy gradient | Reward optimization |
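For language modeling, the per-token loss is just -log of the probability the model assigned to each correct next token, averaged over the sequence. A minimal sketch with hypothetical per-token probabilities:

```python
import math

def sequence_cross_entropy(true_token_probs):
    """Average per-token cross-entropy, given the probability the model
    assigned to each correct next token (hypothetical values below)."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

# Hypothetical probabilities for a 4-token continuation
print(round(sequence_cross_entropy([0.5, 0.8, 0.3, 0.9]), 3))  # 0.556
```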
Common questions
Q: Why does loss go down but model quality doesn’t improve?
A: This often indicates overfitting—the model memorizes training data instead of learning generalizable patterns. Monitor validation loss alongside training loss; if training loss drops but validation loss rises, you’re overfitting.
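The monitoring idea above can be sketched as a simple check: flag overfitting when validation loss rises for several consecutive epochs while training loss keeps falling. The `patience` threshold is an illustrative choice, not a standard API:

```python
def detect_overfitting(train_losses, val_losses, patience=2):
    """Return True when validation loss has risen for `patience` consecutive
    epochs while training loss kept falling -- a classic overfitting signal.
    A minimal sketch; real training loops use framework callbacks."""
    rising = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] > val_losses[i - 1] and train_losses[i] < train_losses[i - 1]:
            rising += 1
            if rising >= patience:
                return True
        else:
            rising = 0
    return False

# Train loss falls steadily, but val loss turns upward after epoch 3
print(detect_overfitting([2.1, 1.6, 1.2, 0.9, 0.7],
                         [2.2, 1.8, 1.7, 1.9, 2.0]))  # True
```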
Q: What’s a good loss value?
A: It depends entirely on the task, dataset, and (for language models) tokenizer, so absolute values are not comparable across setups. Focus on whether loss decreases during training and how it correlates with evaluation metrics. For language models, per-token cross-entropy in the low single digits of nats (roughly 1–3) is typical of healthy training, but the number is only meaningful relative to the same data and tokenizer.
Q: What’s the difference between loss and accuracy?
A: Loss is a continuous differentiable function used for optimization; accuracy is a discrete metric for evaluation. A model can have improving loss but stagnant accuracy—training uses loss gradients to adjust weights.
Q: Why use cross-entropy instead of accuracy for training?
A: Cross-entropy provides smooth gradients for optimization. Accuracy is non-differentiable (0 or 1 per sample), so it can’t guide gradient descent. Cross-entropy also penalizes confident wrong predictions more heavily.
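That penalty asymmetry is easy to see numerically: the -log(p) term grows sharply as the probability assigned to the true class approaches zero:

```python
import math

# -log(p) penalty as a function of the probability p assigned to the true class
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p(true class)={p:<5} loss={-math.log(p):5.2f}")
# 0.9 -> 0.11, 0.5 -> 0.69, 0.1 -> 2.30, 0.01 -> 4.61
```

A near-certain wrong prediction (p = 0.01 on the true class) costs about 40 times more than a near-certain right one.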
Related terms
- Gradient Descent — optimization using loss
- Backpropagation — computes loss gradients
- Perplexity — exp(loss) for language models
- Fine-tuning — minimizes loss on new data
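The perplexity relationship listed above is a one-line conversion from cross-entropy loss measured in nats:

```python
import math

def perplexity(loss_nats):
    """Perplexity is exp of the average per-token cross-entropy (in nats)."""
    return math.exp(loss_nats)

print(round(perplexity(2.0), 2))  # e^2 ~= 7.39
```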
References
Goodfellow, Bengio & Courville (2016), “Deep Learning”, MIT Press. Chapter 6.
Murphy (2012), “Machine Learning: A Probabilistic Perspective”, MIT Press.
Bishop (2006), “Pattern Recognition and Machine Learning”, Springer.
Brown et al. (2020), “Language Models are Few-Shot Learners”, NeurIPS.