Reinforcement Learning

Learning to act through interaction — the study of sequential decision-making under uncertainty.

Landscape

RL studies how an agent learns to maximize cumulative reward through trial and error. It connects to optimization, control theory, game theory, and cognitive science.

Sub-areas

  • Model-free RL — learn policy directly from experience without a world model
    • Policy gradient / actor-critic: REINFORCE, A3C, PPO, SAC
    • Value-based: DQN, Rainbow, C51
  • Model-based RL — use a world model to plan or generate synthetic experience (Dyna, DreamerV3, MuZero)
  • Offline RL — learn from a fixed dataset without environment interaction (CQL, IQL, Decision Transformer)
  • Multi-agent RL — multiple agents interacting (MADDPG, MAPPO, AlphaStar)
  • Hierarchical RL — decompose tasks into sub-goals and sub-policies (Options framework, HAC, FeUdal Networks)
  • RL from Human Feedback (RLHF) — aligning language models with human preferences
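The model-free branch above is easiest to see in its simplest member. A minimal sketch of the REINFORCE policy gradient on a two-armed bandit — the bandit, learning rate, baseline, and episode count are illustrative assumptions, not taken from any of the papers listed:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])  # arm 1 pays more on average
theta = np.zeros(2)                # logits of a softmax policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.1
baseline = 0.0
for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 0.1)        # sampled, noisy reward
    baseline = 0.99 * baseline + 0.01 * r     # running baseline reduces variance
    grad_logpi = -probs
    grad_logpi[a] += 1.0                      # grad of log pi(a) for a softmax policy
    theta += lr * (r - baseline) * grad_logpi # REINFORCE update

print(softmax(theta))  # policy should now strongly prefer arm 1
```

Everything past this one-line update — trust regions in PPO, entropy bonuses in SAC — is variance reduction and stabilization on top of the same core gradient.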

Landmark papers

  • Playing Atari with Deep Reinforcement Learning (DQN) — Mnih et al., 2013.
  • Proximal Policy Optimization (PPO) — Schulman et al., 2017. The workhorse algorithm.
  • Soft Actor-Critic (SAC) — Haarnoja et al., 2018. Maximum entropy RL.
  • Mastering the Game of Go without Human Knowledge (AlphaGo Zero) — Silver et al., 2017.
  • Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero) — Schrittwieser et al., 2019.
  • Decision Transformer — Chen et al., 2021. RL as sequence modeling.

Key figures

Richard Sutton, Andrew Barto (foundations), David Silver, Pieter Abbeel, Sergey Levine, Ilya Kostrikov.


Open Problems

  1. Sample efficiency. Model-free RL requires millions of interactions to learn what humans learn in minutes. Model-based RL helps but doesn't close the gap. The right inductive biases for fast learning are unknown.

  2. Reward specification. Specifying a reward function that captures what you actually want is hard. Reward hacking (optimizing the proxy, not the intent) is a fundamental problem — not just a misalignment problem.

  3. Generalization across environments. RL policies are notoriously brittle to distribution shift. An agent trained in one environment often fails completely in a slightly different one.

  4. Exploration in sparse reward settings. When reward is rare, random exploration fails. Curiosity, count-based exploration, and empowerment help but don't fully solve the problem.
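The count-based family mentioned above reduces to a one-line reward augmentation. A sketch assuming the common bonus form r + β/√N(s) in a tabular setting; the coefficient β is an illustrative choice:

```python
import numpy as np
from collections import defaultdict

visit_counts = defaultdict(int)
beta = 0.5  # illustrative bonus coefficient

def bonus_reward(state, extrinsic_reward):
    """Augment a sparse extrinsic reward with a novelty bonus that
    decays as the state is visited more often."""
    visit_counts[state] += 1
    return extrinsic_reward + beta / np.sqrt(visit_counts[state])

# First visit to a state earns the full bonus, repeats earn less:
print(bonus_reward("s0", 0.0))  # 0.5
print(bonus_reward("s0", 0.0))  # ≈ 0.354
```

Curiosity and empowerment methods replace the count with a learned proxy (prediction error, mutual information), but the structure — extrinsic reward plus a decaying intrinsic term — is the same.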

  5. Offline RL reliability. Learning from fixed datasets without environment interaction is critical for real-world deployment. But current offline RL methods are fragile and sensitive to data quality and coverage.
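The fragility has a concrete shape: off-dataset actions get overestimated Q-values. The conservative penalty at the heart of CQL (cited above) pushes Q down on all actions via a soft maximum while pushing it up on the action actually in the dataset. A sketch for one state with discrete actions; the shapes and alpha are illustrative:

```python
import numpy as np

def cql_penalty(q_values, data_action, alpha=1.0):
    """q_values: Q(s, a) for every discrete action at one state.
    data_action: index of the action seen in the offline dataset."""
    logsumexp_q = np.log(np.sum(np.exp(q_values)))   # soft maximum over actions
    return alpha * (logsumexp_q - q_values[data_action])

q = np.array([1.0, 3.0, 2.0])
print(cql_penalty(q, data_action=1))  # small: dataset action is already the max
print(cql_penalty(q, data_action=0))  # large: penalizes optimism off-dataset
```

Added to the usual TD loss, this biases the learned Q-function toward pessimism exactly where the dataset gives no evidence.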

  6. RL + language. RLHF works empirically but is poorly understood theoretically. The relationship between language model capabilities and RL optimization is not well characterized.


Questions & Ideas

  • Is there a principled way to choose between model-free and model-based RL for a given problem, beyond empirical tuning?
  • Does reward shaping always introduce unintended optima, or can it be done in a way that's provably safe?
  • Can offline RL ever match online RL given sufficient data quality and coverage? What's the theoretical upper bound?
  • Why does PPO work so well in practice despite its relatively weak theoretical guarantees? Is there a better explanation than "it just does"?
  • What is the right inductive bias for exploration in sparse reward settings — curiosity, uncertainty, or something else?
  • If RL from Human Feedback works, what does that say about the structure of human preferences? Are they consistent enough to be modeled as a reward?

My Take

This section evolves as my thinking develops.

The divide between model-free and model-based RL feels increasingly artificial. Algorithms like DreamerV3 and MuZero blur the line — they use learned models but the policy is still trained through a value-based or policy-gradient objective. The real question is: what is the model for? Planning, data augmentation, or representation learning? The answer changes the design.
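The "data augmentation" answer is the oldest one, and tabular Dyna-Q makes it concrete: simulated transitions from the learned model get exactly the same Q-learning update as real ones. A toy sketch — the 5-state chain environment, uniform exploration, and all hyperparameters are illustrative choices:

```python
import random
from collections import defaultdict

random.seed(0)
N, GOAL = 5, 4
Q = defaultdict(float)              # Q[(s, a)]
model = {}                          # model[(s, a)] = (r, s')  (deterministic world)
alpha, gamma, n_planning = 0.5, 0.9, 10

def step(s, a):                     # a: 0 = left, 1 = right
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return (1.0 if s2 == GOAL else 0.0), s2

def q_update(s, a, r, s2):
    best_next = max(Q[(s2, 0)], Q[(s2, 1)])
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

for episode in range(50):
    s = 0
    while s != GOAL:
        a = random.randrange(2)     # uniform exploration; Q-learning is off-policy
        r, s2 = step(s, a)
        q_update(s, a, r, s2)       # learn from real experience
        model[(s, a)] = (r, s2)     # record the transition in the model
        for _ in range(n_planning): # planning: replay transitions from the model
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
        s = s2

# After training, "right" should be greedy along the whole chain:
print([max((0, 1), key=lambda a: Q[(s, a)]) for s in range(GOAL)])
```

DreamerV3 is this pattern scaled up (learn policies inside imagined rollouts); MuZero instead uses the model at decision time, for search. Same component, different answer to "what is the model for."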

The Decision Transformer direction (RL as sequence modeling) is interesting precisely because it sidesteps the credit assignment problem. But it requires good offline data — it doesn't solve exploration. The gap between offline and online RL remains the central unresolved tension in the field.
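The sidestep is visible in the conditioning signal: instead of propagating TD errors, Decision Transformer conditions each timestep on the return-to-go, so policy learning becomes predicting actions consistent with a desired return. A minimal sketch of that computation (the reward sequence is illustrative):

```python
import numpy as np

def returns_to_go(rewards):
    """rtg[t] = sum of rewards from timestep t to the end of the trajectory."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([0.0, 1.0, 0.0, 2.0])
print(returns_to_go(rewards))  # [3. 3. 2. 2.]
```

Credit assignment is replaced by this bookkeeping — which is exactly why the method inherits the dataset's ceiling: the model can only imitate returns it has seen.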


Journal

2026-02-27 — Setting up this area. Background: solid foundation through Sutton & Barto, PPO implementation, basic DQN. The gap: model-based RL (beyond Dreamer), offline RL, and the theory behind policy gradient methods.

The connection I want to understand: how does MuZero's learned model interact with MCTS? The planning-in-model-space question connects directly to the World Models area. Planning to work through these two areas in parallel.