Definition
RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses reinforcement learning to align language models with human preferences. Instead of optimizing for next-token prediction accuracy alone, RLHF trains models to generate outputs that humans rate as helpful, harmless, and honest. A reward model learns to predict human preferences, and reinforcement learning then optimizes the language model to maximize those predicted preferences.
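In symbols, the RL stage is often summarized as maximizing the learned reward while staying close to the model's starting point (a standard textbook formulation, not tied to any particular implementation):

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \right)
$$

Here $\pi_\theta$ is the language model being tuned, $\pi_{\text{ref}}$ is the frozen reference model (typically the supervised fine-tuned starting point), $r_\phi$ is the reward model, and $\beta$ controls how strongly drift from the reference is penalized.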
Why it matters
RLHF is key to modern AI alignment:
- Beyond prediction — optimizes for what humans actually want, not just accuracy
- Reduces harmful outputs — models learn to avoid toxic, biased, or dangerous content
- Improves helpfulness — responses become more useful and relevant
- Powers ChatGPT — the technique that made conversational AI practical
- Safety foundation — critical step toward aligned, trustworthy AI systems
RLHF transformed language models from text predictors into helpful assistants.
How it works
┌────────────────────────────────────────────────────────────┐
│ RLHF │
├────────────────────────────────────────────────────────────┤
│ │
│ THE THREE STAGES OF RLHF: │
│ ───────────────────────── │
│ │
│ STAGE 1: SUPERVISED FINE-TUNING (SFT) │
│ ────────────────────────────────────── │
│ │
│ Base LLM + Human-written examples ──► SFT Model │
│ │
│ "How do I cook pasta?" │
│ → [Human writes ideal response] │
│ → Model learns to generate similar quality │
│ │
│ STAGE 2: TRAIN REWARD MODEL │
│ ─────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Prompt: "What is machine learning?" │ │
│ │ │ │
│ │ Response A: Response B: │ │
│ │ "Machine learning "ML is basically │ │
│ │ is a subset of just computers │ │
│ │ AI that enables doing stuff │ │
│ │ systems to..." automatically lol" │ │
│ │ │ │
│ │ Human picks: A is better ✓ │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Reward Model learns: Score(A) > Score(B) │
│ │
│ STAGE 3: REINFORCEMENT LEARNING (PPO) │
│ ───────────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ │ │
│ │ SFT Model ──► Generate Response ──► Reward Model│ │
│ │ ↑ │ │ │
│ │ │ │ │ │
│ │ └───── Update weights ◄── Score ◄───┘ │ │
│ │ │ │
│ │ (Using PPO algorithm to optimize) │ │
│ │ (KL penalty prevents too much drift) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ RLHF-aligned Model │
│ (Helpful, Harmless, Honest responses) │
│ │
│ KEY COMPONENTS: │
│ ─────────────── │
│ Reward Model: Predicts human preference scores │
│   PPO: Proximal Policy Optimization (the RL algorithm)     │
│   KL Penalty: Keeps the policy close to the SFT model      │
│ Preference Data: Comparison pairs with human choices │
│ │
└────────────────────────────────────────────────────────────┘
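As a rough illustration of Stage 3, the quantity being maximized is the reward model's score minus a KL penalty against the frozen SFT model. The sketch below assumes PyTorch and uses toy tensors in place of real model outputs; the function name and the `beta` value are illustrative, not any library's API.

```python
# Minimal sketch of the Stage 3 reward signal: the reward model's scalar score
# minus a per-sequence KL penalty against the frozen SFT (reference) model.
# All names and values here are illustrative assumptions.
import torch

def rlhf_reward(reward_model_score: torch.Tensor,
                policy_logprobs: torch.Tensor,
                ref_logprobs: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Combine the reward model score with a KL penalty.

    reward_model_score: (batch,) scalar score per generated response
    policy_logprobs:    (batch, seq_len) log-probs under the policy being trained
    ref_logprobs:       (batch, seq_len) log-probs under the frozen SFT model
    """
    # Per-token log-ratio between policy and reference, summed over the sequence,
    # is a standard estimate of the KL divergence for that response.
    kl_penalty = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Higher reward-model score is good; drifting far from the SFT model is penalized.
    return reward_model_score - beta * kl_penalty

# Toy usage with random tensors standing in for real model outputs.
scores = torch.tensor([1.2, -0.3])
pol_lp = torch.randn(2, 8)
ref_lp = torch.randn(2, 8)
print(rlhf_reward(scores, pol_lp, ref_lp))
```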
RLHF in the training pipeline:
| Stage | Training Signal | Result |
|---|---|---|
| Pretraining | Next token prediction | Raw language ability |
| SFT | Human demonstrations | Follows instructions |
| RLHF | Human preferences | Helpful, safe, aligned |
Common questions
Q: Why is RLHF needed if we have fine-tuning?
A: Fine-tuning teaches models to imitate examples, but doesn’t optimize for nuanced preferences. RLHF can learn subtle distinctions like “polite but not sycophantic” or “detailed but not overwhelming” that are hard to capture in demonstration data alone. It optimizes for human judgment holistically.
Q: What is a reward model?
A: The reward model is a neural network trained to predict human preferences. Given two responses to the same prompt, it learns to assign higher scores to the response humans prefer. This turns subjective human judgment into a differentiable reward signal for RL.
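A minimal sketch of the pairwise loss typically used for this (a Bradley-Terry style logistic loss on the score margin). The scalar scores below are placeholders for a reward model's outputs, and the function name is made up for illustration.

```python
# Pairwise preference loss: the chosen response should score higher than the
# rejected one. Inputs are placeholder scalars, not real reward model outputs.
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(score_chosen - score_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: the loss shrinks as the margin between chosen and rejected grows.
chosen = torch.tensor([2.0, 1.5])
rejected = torch.tensor([0.5, 1.0])
print(preference_loss(chosen, rejected))
```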
Q: What is DPO and how does it relate to RLHF?
A: Direct Preference Optimization (DPO) is a simpler alternative that achieves RLHF-like results without explicitly training a reward model or using RL. It directly optimizes language model weights on preference pairs. Many recent models use DPO because it’s simpler and more stable than PPO-based RLHF.
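For comparison, here is a minimal sketch of the DPO objective described in Rafailov et al. (2023), assuming PyTorch and sequence-level log-probabilities computed elsewhere; the variable names and toy values are illustrative.

```python
# DPO loss: a logistic loss on the difference of policy-vs-reference log-ratios
# for the chosen and rejected responses. Inputs are toy placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs (all inputs shaped (batch,))."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up sequence log-probabilities.
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))
```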
Q: What are RLHF’s limitations?
A: Key challenges include: (1) Reward hacking—models find unintended ways to get high scores from the reward model, (2) Preference quality—human raters may be inconsistent or biased, (3) Scalability—collecting preference data is expensive, (4) Misalignment—the reward model may not capture people's true preferences.
Related terms
- Reinforcement Learning — the underlying paradigm
- Fine-tuning — adapting pretrained models
- LLM — models trained with RLHF
- Instruction Tuning — often precedes RLHF
References
Ouyang et al. (2022), “Training language models to follow instructions with human feedback”, NeurIPS. [InstructGPT paper - applied RLHF to instruction-following LLMs]
Christiano et al. (2017), “Deep reinforcement learning from human preferences”, NeurIPS. [Foundational RLHF paper]
Stiennon et al. (2020), “Learning to summarize from human feedback”, NeurIPS. [Early RLHF for summarization]
Rafailov et al. (2023), “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”, NeurIPS. [DPO - simpler alternative to RLHF]