
Reinforcement Learning

A machine learning approach where agents learn optimal behavior through trial-and-error interactions with an environment.

Also known as: RL, Reward-based learning, Trial-and-error learning, Agent-based learning

Definition

Reinforcement learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and learns a policy that maximizes cumulative reward over time. Unlike supervised learning (which needs labeled examples) or unsupervised learning (which finds patterns), RL learns from the consequences of actions through trial and error.
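
In code, the interaction loop looks roughly like the sketch below. The Environment class here is a made-up two-state toy (not any particular library's API), used only to show the state, action, reward, next-state cycle and the cumulative reward the agent tries to maximize.

import random

# Hypothetical environment for illustration: reset() returns an initial state,
# step(action) returns (next_state, reward, done). Real RL libraries differ.
class Environment:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0   # reward for "matching" the state
        self.state = 1 - self.state                     # environment moves to a new state
        return self.state, reward, False

env = Environment()
state = env.reset()
total_reward = 0.0

for t in range(10):                     # one short episode of interaction
    action = random.choice([0, 1])      # placeholder random policy; learning would improve this
    state, reward, done = env.step(action)
    total_reward += reward              # cumulative reward is what RL maximizes

print("cumulative reward:", total_reward)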

Why it matters

Reinforcement learning enables AI to learn complex behaviors:

  • Learns without labels — only needs a reward signal, not labeled examples
  • Handles sequential decisions — optimizes long-term outcomes, not just immediate rewards
  • Superhuman performance — AlphaGo, game-playing agents, robotics
  • Real-world control — autonomous vehicles, recommendation systems
  • Foundation for RLHF — key technique for aligning LLMs with human preferences

RL bridges the gap between pattern recognition and sequential decision-making, letting AI systems act in complex environments rather than only make predictions.

How it works

┌────────────────────────────────────────────────────────────┐
│                  REINFORCEMENT LEARNING                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  THE RL LOOP:                                              │
│  ────────────                                              │
│                                                            │
│         ┌─────────────────────────────────────┐           │
│         │           ENVIRONMENT               │           │
│         │    (Game, Robot World, Website)     │           │
│         └──────────────┬──────────────────────┘           │
│                        │                                   │
│              State s   │   Reward r                        │
│                ↓       │       ↓                           │
│         ┌──────────────▼───────────────────┐              │
│         │              AGENT               │              │
│         │                                  │              │
│         │  1. Observe state s              │              │
│         │  2. Choose action a (via policy) │              │
│         │  3. Receive reward r             │              │
│         │  4. Update policy to maximize r  │              │
│         │                                  │              │
│         └──────────────┬───────────────────┘              │
│                        │                                   │
│              Action a  ↓                                   │
│         ┌──────────────▼──────────────────────┐           │
│         │           ENVIRONMENT               │           │
│         │    (responds to action, new state)  │           │
│         └─────────────────────────────────────┘           │
│                                                            │
│  KEY CONCEPTS:                                             │
│  ─────────────                                             │
│                                                            │
│  State (s):    Current situation                          │
│  Action (a):   Choice agent makes                         │
│  Reward (r):   Feedback signal (+positive, -negative)     │
│  Policy (π):   Strategy for choosing actions              │
│  Value (V):    Expected future cumulative reward          │
│                                                            │
│  EXPLORATION VS EXPLOITATION:                              │
│  ────────────────────────────                              │
│                                                            │
│  ┌─────────────────┐    ┌─────────────────┐               │
│  │   EXPLORATION   │    │   EXPLOITATION  │               │
│  │                 │    │                 │               │
│  │  Try new things │    │  Use what works │               │
│  │  to discover    │    │  to maximize    │               │
│  │  better options │    │  known rewards  │               │
│  │                 │    │                 │               │
│  │   "Explore"     │    │    "Exploit"    │               │
│  └─────────────────┘    └─────────────────┘               │
│                                                            │
│           Must BALANCE both for optimal learning           │
│                                                            │
│  COMMON RL ALGORITHMS:                                     │
│  ─────────────────────                                     │
│  Q-Learning:    Learn value of state-action pairs         │
│  Policy Gradient: Directly optimize the policy            │
│  Actor-Critic:  Combine value estimation + policy         │
│  PPO:          Stable policy optimization (used in RLHF)  │
│  DQN:          Deep Q-networks for complex states         │
│                                                            │
└────────────────────────────────────────────────────────────┘
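
The loop, the key concepts, and the exploration-exploitation trade-off above can be combined in a few lines of tabular Q-learning. The sketch below uses a made-up 5-cell corridor (reach the rightmost cell for a +1 reward) and arbitrary hyperparameters; it is an illustration of the idea, not a production implementation.

import random

N_STATES = 5                     # cells 0..4; reaching cell 4 ends the episode with +1
ACTIONS = [0, 1]                 # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}   # value of each state-action pair

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def choose_action(state):
    if random.random() < EPSILON:                       # explore: try a random action
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)          # exploit: pick a best-known action
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(500):
    state, done = 0, False
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        # Q-learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy action in every non-terminal cell should be 1 (move right).
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})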

RL paradigms:

Approach       How it learns                         Example
Value-based    Estimate value of states/actions      DQN playing Atari
Policy-based   Directly learn action probabilities   Policy gradient in robotics
Model-based    Learn environment dynamics            Planning in games
Model-free     Learn directly from experience        Most game-playing agents
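
To make the policy-based row above concrete, here is a hedged sketch of REINFORCE on a 3-armed bandit: a softmax policy over arms is adjusted directly so that rewarded arms become more probable. The payout probabilities and learning rate are invented for illustration, and real policy-gradient methods work over multi-step episodes and usually subtract a baseline.

import math
import random

PAYOUT = [0.2, 0.5, 0.8]        # hidden probability that each arm pays a reward of 1
theta = [0.0, 0.0, 0.0]         # policy parameters: one logit per arm
LR = 0.1

def softmax(logits):
    exps = [math.exp(l - max(logits)) for l in logits]
    return [e / sum(exps) for e in exps]

for step in range(5000):
    probs = softmax(theta)
    arm = random.choices(range(3), weights=probs)[0]            # sample an action from the policy
    reward = 1.0 if random.random() < PAYOUT[arm] else 0.0

    # REINFORCE: the gradient of log pi(arm) w.r.t. the logits is (one_hot(arm) - probs);
    # scaling it by the reward makes arms that paid off more likely next time.
    for a in range(3):
        grad_log = (1.0 if a == arm else 0.0) - probs[a]
        theta[a] += LR * reward * grad_log

print([round(p, 2) for p in softmax(theta)])    # most probability mass ends up on the best arm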

Common questions

Q: How is RL different from supervised learning?

A: In supervised learning, you tell the agent the correct answer for each input. In RL, the agent only gets reward signals—it must discover good behavior through exploration. RL handles sequential decisions where actions affect future states; supervised learning typically handles independent predictions.

Q: What is RLHF and how does it relate to RL?

A: RLHF (Reinforcement Learning from Human Feedback) uses RL to fine-tune LLMs. Human preferences become the reward signal—a separate model predicts how much humans would prefer one response over another, and RL optimizes the LLM to generate preferred responses.
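
A hedged sketch of the reward-modeling piece described above (illustrative only, not a specific library's API): the reward model assigns scalar scores to a human-preferred and a human-rejected response, and is trained to rank the preferred one higher via a pairwise, Bradley-Terry style loss. Those learned scores then serve as the reward signal that an RL algorithm such as PPO optimizes the LLM against.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(score_chosen, score_rejected):
    # Pairwise loss: small when the reward model scores the human-preferred
    # response above the rejected one, large when the ranking is wrong.
    return -math.log(sigmoid(score_chosen - score_rejected))

# Made-up scores for illustration:
print(preference_loss(score_chosen=0.2, score_rejected=1.1))   # ~1.24: ranking is wrong, large loss
print(preference_loss(score_chosen=2.0, score_rejected=-0.5))  # ~0.08: ranking is right, small loss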

Q: Why is exploration vs exploitation important?

A: If an agent only exploits known good actions, it might miss better options. If it only explores, it never takes advantage of what it’s learned. Finding the right balance is crucial—too little exploration leads to suboptimal policies, too much wastes time on poor actions.
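
One common way to manage this balance, shown below as a rough sketch with arbitrary numbers, is epsilon-greedy with a decaying epsilon: act randomly with probability epsilon, greedily otherwise, and shrink epsilon over training so early steps favor exploration and later steps favor exploitation.

# Linearly decaying exploration rate; the start/end values and horizon are
# illustrative assumptions, not recommended settings.
def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

for step in (0, 2_000, 10_000, 50_000):
    print(step, round(epsilon_at(step), 2))   # epsilon falls from 1.0 to 0.81 to 0.05 and stays there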

Q: Can RL solve any decision problem?

A: In theory, RL can optimize any system with definable rewards. In practice, RL struggles with sparse rewards (feedback arrives rarely), poor sample efficiency (it needs many trials), credit assignment (working out which action caused a reward), and the difficulty of defining good reward signals.


References

Sutton & Barto (2018), “Reinforcement Learning: An Introduction”, MIT Press. [The foundational RL textbook]

Mnih et al. (2015), “Human-level control through deep reinforcement learning”, Nature. [DQN paper, 20,000+ citations]

Silver et al. (2016), “Mastering the game of Go with deep neural networks and tree search”, Nature. [AlphaGo paper]

Schulman et al. (2017), “Proximal Policy Optimization Algorithms”, arXiv. [PPO - used in RLHF]