World Models
Learning internal models of how the world works — compressing environment dynamics into something an agent can plan with.
Landscape
A world model is a learned model of environment dynamics: given current state and action, predict next state (and reward). With a good world model, an agent can plan entirely in imagination — no environment interaction needed.
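The definition above can be made concrete with a toy. A minimal sketch assuming linear dynamics and a least-squares fit; the class name, dimensions, and interface are illustrative, not taken from any of the papers below:

```python
# Toy one-step world model: (state, action) -> (next_state, reward).
# Linear dynamics fit by least squares; purely illustrative.
import numpy as np

class LinearWorldModel:
    def __init__(self, state_dim, action_dim):
        # One extra output column for the reward.
        self.W = np.zeros((state_dim + action_dim, state_dim + 1))

    def fit(self, states, actions, next_states, rewards):
        # Stack (s, a) as inputs, (s', r) as targets, solve in closed form.
        X = np.concatenate([states, actions], axis=1)
        Y = np.concatenate([next_states, rewards[:, None]], axis=1)
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def predict(self, state, action):
        out = np.concatenate([state, action]) @ self.W
        return out[:-1], out[-1]  # (next_state, reward)

    def rollout(self, state, actions):
        # "Planning in imagination": iterate the model, never touch the env.
        total = 0.0
        for a in actions:
            state, r = self.predict(state, a)
            total += r
        return state, total
```

The `rollout` method is the whole point: once `fit` has absorbed the dynamics, candidate action sequences can be scored without a single environment step.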
Sub-areas
- Latent space models — learn to predict in compressed latent space rather than pixel space (RSSM, DreamerV1/V2/V3)
- Transformer-based world models — sequential prediction with attention (IRIS, TransDreamer, Genie)
- Video prediction — treating world modeling as conditional video generation (VideoGPT-style generative models applied to RL)
- Physics-informed models — embedding physical priors into the learned model
- Foundation world models — large-scale, general-purpose environment models (GAIA-1, Genie 2)
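To make the first sub-area concrete: the core move in latent-space models is to encode an observation once, roll the dynamics forward in a small latent space, and decode back to pixels only when an image is actually needed. A hedged sketch with untrained random linear maps standing in for the learned encoder, dynamics, and decoder:

```python
# Sketch of the latent-space idea behind RSSM/Dreamer-style models.
# All maps are random placeholders, not trained networks.
import numpy as np

rng = np.random.default_rng(1)
OBS_DIM, LATENT_DIM, ACT_DIM = 64 * 64, 32, 4

E = rng.normal(size=(OBS_DIM, LATENT_DIM)) / np.sqrt(OBS_DIM)     # encoder
F = rng.normal(size=(LATENT_DIM + ACT_DIM, LATENT_DIM)) / 6.0     # latent dynamics
D = rng.normal(size=(LATENT_DIM, OBS_DIM)) / np.sqrt(LATENT_DIM)  # decoder

def imagine(obs, actions):
    """Roll forward in latent space: O(latent) per step, not O(pixels)."""
    z = obs @ E  # encode once
    for a in actions:
        z = np.tanh(np.concatenate([z, a]) @ F)  # predict next latent
    return z @ D  # decode only the final frame

obs = rng.normal(size=OBS_DIM)
acts = [rng.normal(size=ACT_DIM) for _ in range(10)]
frame = imagine(obs, acts)
```

The cost asymmetry is the argument: ten imagined steps here touch a 32-dimensional vector, and the 4096-dimensional pixel space appears only twice, at encode and decode.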
Landmark papers
- World Models — Ha & Schmidhuber, 2018. The paper that named the concept in modern form.
- Dream to Control (DreamerV1) — Hafner et al., 2019.
- Mastering Atari with Discrete World Models (DreamerV2) — Hafner et al., 2020.
- Mastering Diverse Domains through World Models (DreamerV3) — Hafner et al., 2023. Scales across domains with fixed hyperparameters.
- IRIS: Transformers are Sample-Efficient World Models — Micheli et al., 2022.
- Genie: Generative Interactive Environments — Bruce et al., 2024. Generalist world model from video.
Key figures
David Ha, Jürgen Schmidhuber, Danijar Hafner, Yann LeCun (JEPA approach), Timothy Lillicrap.
Open Problems
- What's the right representation space for world models? Pixel space is expensive, latent space requires a good encoder, and abstract state space requires structure we don't know how to induce. The tradeoff is unresolved.
- How do you evaluate world model quality independently of downstream RL performance? Current evaluation is almost entirely through RL metrics, but a world model could be useful without being optimal for RL.
- Can world models generalize across tasks? Most world models are trained per-environment. Foundation world models (Genie 2) are promising but still limited.
- How do world models handle stochasticity? Environments are often partially observable and stochastic. Current models handle this imperfectly — either averaging over futures or being overconfident.
- What's the relationship between world models and language models? LLMs are trained to predict next tokens — they are world models of text. Whether this generalizes to grounded, physical world modeling is deeply unclear.
Questions & Ideas
- What is the minimal information a world model needs to capture for an RL agent to plan optimally? Is there a formal answer?
- How do you evaluate a world model's predictions without running the real environment? What's the right proxy metric?
- Can a world model trained on one agent's trajectories generalize to a different agent's action space?
- Is there a meaningful difference between a language model predicting tokens and a world model predicting states? Where do they diverge?
- How should a world model handle irreversible actions? Does it need a notion of time direction?
- What would it mean for a world model to be causally correct vs. just predictively accurate?
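One concrete candidate for the evaluation question above: n-step open-loop prediction error on held-out trajectories, measured with no RL in the loop. A sketch; `model.predict` is a hypothetical one-step interface, not any library's API:

```python
# Proxy metric sketch: roll the model open-loop for `horizon` steps and
# compare against the held-out trajectory. No policy, no environment.
import numpy as np

def open_loop_error(model, states, actions, horizon):
    """Mean squared n-step prediction error; `model.predict(s, a) -> (s', r)`."""
    errs = []
    for t in range(len(states) - horizon):
        s = states[t]
        for k in range(horizon):
            s, _ = model.predict(s, actions[t + k])  # no correction from data
        errs.append(np.mean((s - states[t + horizon]) ** 2))
    return float(np.mean(errs))
```

Open-loop (rather than one-step) error matters because compounding drift is exactly what breaks planning; a model with tiny one-step error can still be useless at horizon 15.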
My Take
This section evolves as my thinking develops.
LeCun's JEPA (Joint Embedding Predictive Architecture) framing is interesting: predict in abstract representation space rather than pixel space, so the model avoids learning irrelevant details. This connects directly to the representation learning question — a world model's quality is bounded by its representation's quality.
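A rough sketch of my reading of the JEPA objective (not LeCun's exact formulation): the loss lives in representation space, so unpredictable pixel-level detail never enters it. The encoder and predictor here are placeholder linear maps:

```python
# JEPA-style objective sketch: predict the target's *representation*,
# never its raw pixels. Placeholder linear maps, illustrative only.
import numpy as np

rng = np.random.default_rng(2)
D_OBS, D_REP = 256, 16
enc = rng.normal(size=(D_OBS, D_REP)) / np.sqrt(D_OBS)   # shared encoder
pred = rng.normal(size=(D_REP, D_REP)) / np.sqrt(D_REP)  # latent predictor

def jepa_loss(x_context, x_target):
    s_ctx = x_context @ enc  # representation of context
    s_tgt = x_target @ enc   # representation of target (stop-gradient in practice)
    s_hat = s_ctx @ pred     # predict the target representation
    return float(np.mean((s_hat - s_tgt) ** 2))  # error in representation space
```

Contrast with a pixel-space loss `mean((decode(s_hat) - x_target) ** 2)`: there, every unpredictable texture detail contributes gradient; here, the encoder is free to discard it.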
The "foundation world model" direction (Genie, GAIA-1) feels like the near-term version of what AGI needs: a general model of how the world works, grounded in perception and action. The question is whether scaling video prediction gets you there or whether you need something more structured.
Journal
2026-02-27 — Starting this area. Background: read DreamerV3, IRIS, Genie. The Dreamer line is the most rigorous — careful latent space design, proper handling of stochasticity, scales well. IRIS is interesting because it uses a discrete world model with a transformer, separating tokenization from prediction.
The thing I want to understand better: what does planning in latent space actually look like? How does DreamerV3's actor-critic operate in imagination? Starting with that as the next deep dive.
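As a placeholder for that deep dive, here is how I currently picture the imagination loop, Dreamer-style in spirit only. Every component below is a toy stand-in, not DreamerV3's actual machinery:

```python
# Imagination loop sketch: actor proposes actions, the world model generates
# latents and rewards, the critic bootstraps the tail. No real environment.
# All four components are toy placeholders.
import numpy as np

rng = np.random.default_rng(3)
LATENT, HORIZON, GAMMA = 8, 15, 0.99

def dynamics(z, a):   # placeholder latent transition
    return np.tanh(z + 0.1 * a)

def reward_head(z):   # placeholder learned reward predictor
    return float(-np.sum(z ** 2))

def actor(z):         # placeholder policy
    return -0.5 * z

def value(z):         # placeholder critic
    return float(-np.sum(z ** 2) / (1 - GAMMA))

def imagined_return(z):
    """Discounted return over an imagined rollout, critic-bootstrapped."""
    ret, discount = 0.0, 1.0
    for _ in range(HORIZON):
        a = actor(z)
        z = dynamics(z, a)
        ret += discount * reward_head(z)
        discount *= GAMMA
    return ret + discount * value(z)

G = imagined_return(rng.normal(size=LATENT))
```

The real questions start where this sketch stops: in Dreamer the actor and critic are trained by backpropagating through (or along) these imagined trajectories, which is exactly the mechanism I want to trace in the next deep dive.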