February 25, 2026
World Models & Model-Based RL
World Models: From Dreaming Agents to Physical AI
A structured learning path covering world models — from classical model-based RL foundations through modern latent imagination (Dreamer), transformer-based world models (IRIS, DIAMOND), and frontier foundation models (Cosmos, Genie, V-JEPA) — for researchers and engineers targeting MBRL, robotics, autonomous driving, and video generation.
Overview
World models are learned internal simulators that capture environment dynamics, enabling agents to "imagine" future trajectories and plan without exhaustive real-world interaction. This roadmap takes you from classical model-based RL roots through the modern Dreamer lineage, transformer-based approaches, and into the 2025–2026 frontier of foundation-scale world models for physical AI.
Who is this for? ML researchers and engineers with RL fundamentals who want to build, extend, or apply world models across domains — games, robotics, autonomous driving, and generative simulation.
End state: You can read and implement state-of-the-art world model architectures, understand the design tradeoffs (latent vs. pixel prediction, RSSM vs. transformer, decision-coupled vs. general-purpose), and identify open research directions for your own work.
Sequence
1. Foundations — Mental Models & Classical Roots
Build intuition for why learning a model of the world matters and the core abstraction of Dyna-style model-based RL.
- concept Internal models in cognitive science — Kenneth Craik's The Nature of Explanation (1943) introduced the idea that organisms carry "small-scale models" of the world to anticipate events.
- paper Sutton, "Dyna, an Integrated Architecture for Learning, Planning, and Reacting" (1991) — The foundational framework: learn a model, generate imagined transitions, update the policy. Everything that follows is a richer version of this loop.
- paper Schmidhuber, "An On-Line Algorithm for Dynamic Reinforcement Learning and Planning in Reactive Environments" (1990) — Early RNN-based world model + controller training in latent space; the intellectual ancestor of all modern approaches.
- resource Sutton & Barto, Reinforcement Learning: An Introduction — Chapter 8 (Planning and Learning with Tabular Methods) for Dyna-Q and model-based planning fundamentals.
- prereq VAEs (Kingma & Welling, 2013), RNNs/LSTMs, and policy gradient basics.
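The Dyna loop described above (learn a model, generate imagined transitions, update the policy) fits in a few dozen lines. A minimal tabular Dyna-Q sketch on a hypothetical 5-state chain; the environment, hyperparameters, and episode count are illustrative choices, not taken from Sutton's paper:

```python
import random

# Minimal tabular Dyna-Q on a hypothetical 5-state chain: states 0..4,
# actions 0 (left) / 1 (right), reward 1.0 only on reaching state 4.
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, float(s2 == GOAL)

random.seed(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}                      # learned model: (s, a) -> (s', r)
alpha, gamma, eps, n_planning = 0.5, 0.95, 0.1, 20

for _ in range(50):
    s = 0
    while s != GOAL:
        # act epsilon-greedily (random tie-breaking), observe a real transition
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[(s, b)] for b in ACTIONS)
            a = random.choice([b for b in ACTIONS if Q[(s, b)] == best])
        s2, r = step(s, a)
        # direct RL: update from the real transition
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        # model learning: memorize what the environment did
        model[(s, a)] = (s2, r)
        # planning: replay imagined transitions sampled from the learned model
        for _ in range(n_planning):
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in ACTIONS) - Q[(ps, pa)])
        s = s2

# The greedy policy should now move right from every non-goal state.
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

Everything downstream in this roadmap is a richer version of this loop: swap the tabular model for an RSSM or transformer, and the Q-table for an actor-critic trained in imagination.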
2. The Modern Era — Ha & Schmidhuber to Dreamer
The key conceptual leap: VAE encodes pixels → RNN predicts latent dynamics → compact controller acts in dream space.
- paper Ha & Schmidhuber, "World Models" (NeurIPS 2018) — The landmark paper. V(AE)-M(DN-RNN)-C architecture. Agent trained entirely inside its own hallucinated dream. Interactive version at worldmodels.github.io.
- paper Hafner et al., "Learning Latent Dynamics for Planning from Pixels" (PlaNet, 2019) — Introduces the Recurrent State-Space Model (RSSM) with deterministic + stochastic paths. Planning via CEM in latent space.
- paper Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination" (DreamerV1, 2020) — End-to-end differentiable: actor-critic trained entirely inside imagined rollouts from the RSSM.
- paper Hafner et al., "Mastering Atari with Discrete World Models" (DreamerV2, 2021) — Discrete latent representations, KL balancing, surpasses human on Atari.
- paper Hafner et al., "Mastering Diverse Domains through World Models" (DreamerV3, Nature 2025) — The capstone. Single configuration across 150+ tasks. First to collect diamonds in Minecraft from scratch. Study the robustness techniques: symlog, unimix, percentile return normalization.
- build Reproduce the DreamerV3 RSSM on a simple environment (CartPole or DMC Walker). Focus on understanding the imagination rollout and actor-critic optimization inside latent space.
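Two of DreamerV3's robustness techniques are pure input/output transforms that are easy to verify in isolation. A sketch of symlog and its inverse symexp, as defined in the paper:

```python
import math

def symlog(x: float) -> float:
    # DreamerV3's symmetric log transform: compresses large magnitudes
    # while staying near-identity around zero and preserving sign.
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x: float) -> float:
    # Inverse of symlog: symexp(symlog(x)) == x.
    return math.copysign(math.expm1(abs(x)), x)

# Targets (rewards, values) are predicted in symlog space, so a single
# fixed-scale network head can handle returns spanning orders of magnitude.
for v in (-1000.0, -1.0, 0.0, 0.5, 1e6):
    print(v, symlog(v), symexp(symlog(v)))
```

This is why one DreamerV3 configuration works across domains whose reward scales differ by orders of magnitude: the head always predicts values in a compressed, roughly unit-scale space.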
3. Parallel Lineage — Planning with Learned Models (MuZero Family)
World models for search-based planning rather than imagination-based policy optimization.
- paper Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (MuZero, Nature 2020) — Learns dynamics, reward, and value functions without access to environment rules. Plans via MCTS in learned latent space.
- paper Ye et al., "Mastering Atari Games with Limited Data" (EfficientZero, NeurIPS 2021) — Data-efficient MuZero variant achieving superhuman Atari performance from only 2 hours of real-time gameplay.
- exercise Write a one-page comparison memo: Dreamer's explicit latent imagination vs. MuZero's implicit planning via search. When would you choose one over the other?
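To make the contrast concrete for the memo, here is a toy sketch of MuZero's core idea: planning entirely through learned functions (dynamics g and value v) with no access to environment rules. It substitutes exhaustive depth-limited search for MCTS, and the "learned" functions are hand-coded stand-ins on a number line; everything here is illustrative, not MuZero's actual architecture.

```python
# MuZero-style planning operates on abstract latent states: a dynamics model
# g(s, a) -> (s', r) and a value head v(s), never the real environment.
ACTIONS = (-1, +1)

def dynamics(s, a):
    # stand-in for the learned dynamics model: predicts next state and reward
    s2 = s + a
    return s2, 1.0 if s2 == 3 else 0.0   # hypothetical: reward at state 3

def value(s):
    # stand-in for the learned value head: rough estimate of return from s
    return max(0.0, 1.0 - 0.2 * abs(3 - s))

def plan(s, depth, gamma=0.97):
    """Best (return_estimate, first_action) within `depth` steps of lookahead.

    Real MuZero guides this search with MCTS and a learned policy prior;
    exhaustive depth-limited search shows the same structure more plainly.
    """
    if depth == 0:
        return value(s), None   # bootstrap the leaf with the value head
    best_ret, best_a = float("-inf"), None
    for a in ACTIONS:
        s2, r = dynamics(s, a)
        sub_ret, _ = plan(s2, depth - 1, gamma)
        total = r + gamma * sub_ret
        if total > best_ret:
            best_ret, best_a = total, a
    return best_ret, best_a

ret, a = plan(0, depth=4)
print(ret, a)   # from s=0 the planner should head toward the rewarding state
```

The memo contrast in one sentence: Dreamer amortizes this search into a policy trained on imagined rollouts, while MuZero re-runs the search at every decision.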
4. Transformer-Based World Models
Replacing RSSMs with transformers for sequence modeling of world dynamics.
- paper Micheli et al., "Transformers are Sample-Efficient World Models" (IRIS, ICLR 2023) — Discrete autoencoder + autoregressive transformer as the world model. Achieves strong Atari performance with far fewer environment interactions.
- paper Chen et al., "TransDreamer: Reinforcement Learning with Transformer World Models" (2022) — Replaces Dreamer's RSSM with a transformer state-space model.
- paper Alonso et al., "Diffusion for World Modeling: Visual Details Matter in Atari" (DIAMOND, NeurIPS 2024) — Diffusion-based world model that predicts directly in pixel space. Challenges the "latent space is necessary" assumption.
- paper Robine et al., "Transformer-based World Models Are Happy With 100k Interactions" (TWM, ICLR 2023) — Efficient transformer world model with categorical latent codes.
- build Implement a minimal transformer-based world model on a gridworld or simple Atari game. Compare autoregressive token prediction vs. RSSM rollouts.
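As a warm-up for the build above, the autoregressive interface these papers share can be exercised without any transformer at all. In this toy sketch a count-based table plays the role of the sequence model over discrete observation tokens; the environment (a 1-D track of 8 cells, where each cell is its own token) is made up for illustration.

```python
import random
from collections import defaultdict

# IRIS-style world models predict discrete observation tokens autoregressively.
# Here a count-based table conditioned on (previous token, action) stands in
# for the transformer: same interface, trivially small "model".
def env_step(s, a):
    # hypothetical deterministic 1-D track with 8 cells
    return max(0, min(7, s + a))

random.seed(0)
counts = defaultdict(lambda: defaultdict(int))   # (token, action) -> next-token counts

# Collect real experience and fit the model by counting transitions.
s = 0
for _ in range(2000):
    a = random.choice((-1, 1))
    s2 = env_step(s, a)
    counts[(s, a)][s2] += 1
    s = s2

def model_sample(token, action):
    """Autoregressive step: sample the next token from the learned distribution."""
    dist = counts[(token, action)]
    if not dist:
        return token          # unseen (token, action) pair: fall back to staying put
    r = random.uniform(0, sum(dist.values()))
    for tok, c in dist.items():
        r -= c
        if r <= 0:
            return tok
    return tok

# Imagined rollout: the "agent" never touches env_step here.
token = 3
trajectory = [token]
for a in (1, 1, 1, -1):
    token = model_sample(token, a)
    trajectory.append(token)
print(trajectory)
```

The real builds replace the counting table with a VQ-VAE tokenizer plus a transformer, but the rollout loop (feed a token and an action, sample the next token, repeat) is exactly this.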
5. Foundation-Scale World Models & Physical AI (2024–2026 Frontier)
Scaling world models to internet-scale video data, moving from game agents to physical world understanding.
- report NVIDIA, "Cosmos: World Foundation Model Platform for Physical AI" (2025) — Architecture, training, and evaluation metrics for video world models targeting robotics and autonomous driving.
- paper Assran et al., "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning" (Meta, 2025) — Joint embedding predictive architecture; prediction in representation space rather than pixel space. Web-scale video pretraining transfers to robotics.
- blog Google DeepMind, "Genie 3" (2025) — Foundation world model generating real-time interactive 3D environments from text prompts.
- industry OpenAI Sora & Sora 2 (2024–2025) — Video generation as world simulation; debate on whether video generators are "true" world models.
- industry Wayve, "GAIA-2" (2025) — Domain-specific world model for autonomous driving with controllable scenario generation.
- industry World Labs, "Marble" (2025) — Fei-Fei Li's lab; 4D world model for persistent spatial understanding.
- paper Yann LeCun, "A Path Towards Autonomous Machine Intelligence" (2022) — The philosophical position paper arguing world models are essential for AGI. Proposes the JEPA architecture family.
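The representation-space-versus-pixel-space distinction that V-JEPA and LeCun's position paper emphasize can be illustrated with a deliberately simplified numpy sketch. The encoder and data below are made up and bear no relation to the actual architectures; the point is only that an encoder which discards unpredictable detail removes that detail from the training signal.

```python
import numpy as np

# Generative world models pay a loss on every pixel; a JEPA-style model pays
# it only in embedding space, so unpredictable detail need not be modeled.
rng = np.random.default_rng(0)
D_PIX, D_EMB = 64, 4

def encode(x):
    # Hypothetical encoder: keeps the first D_EMB "semantic" coordinates and
    # drops the rest (standing in for texture the encoder learned to ignore).
    return x[:D_EMB]

frame_next = np.zeros(D_PIX)
frame_next[:D_EMB] = 1.0                             # predictable structure
frame_next[D_EMB:] = rng.normal(size=D_PIX - D_EMB)  # unpredictable detail

pred = np.zeros(D_PIX)
pred[:D_EMB] = 1.0                                   # predictor nails the structure

pixel_loss = np.mean((pred - frame_next) ** 2)       # penalized for noise it cannot predict
emb_loss = np.mean((encode(pred) - encode(frame_next)) ** 2)  # zero: structure matched

print(pixel_loss, emb_loss)
```

This is the crux of the "are video generators true world models" debate below: pixel-faithful generation and prediction of task-relevant structure are different objectives.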
6. World Models in Robotics & Embodied AI
Where world models meet the physical world — manipulation, navigation, and sim-to-real transfer.
- paper Physical Intelligence, "π0: A Vision-Language-Action Flow Model for General Robot Control" (2024) — Foundation model approach to robotic control.
- paper Figure AI, "Helix" (2025) — VLA model for humanoid robots.
- paper NVIDIA, "GR00T N1" (2025) — Foundation model for humanoid robot learning.
- paper Google DeepMind, "Gemini Robotics" (2025) — Multimodal foundation model applied to robotic control.
- survey Ding et al., "Understanding World or Predicting Future? A Comprehensive Survey of World Models" (ACM Computing Surveys, 2025) — Excellent taxonomy: understanding-oriented vs. prediction-oriented world models across games, driving, robotics, and social simulacra.
- survey Li et al., "A Comprehensive Survey on World Models for Embodied AI" (2025) — Three-axis taxonomy: functionality, temporal modeling, spatial representation.
7. Implementations & Hands-On Projects
- Build 1 (Beginner): Reproduce Ha & Schmidhuber's World Models on VizDoom or CarRacing-v0. Train the VAE, MDN-RNN, and controller separately. Then train the agent inside its own dream.
- Build 2 (Intermediate): Implement DreamerV3's RSSM with the key robustness tricks (symlog, KL balancing, unimix). Train on DeepMind Control Suite tasks.
- Build 3 (Intermediate): Build a minimal IRIS-style transformer world model with VQ-VAE tokenization on a simple Atari game.
- Build 4 (Advanced): Fine-tune or evaluate a video prediction model (e.g., using the Cosmos codebase) on a custom domain. Measure physical consistency metrics.
- Build 5 (Research): Design a hybrid architecture — e.g., combine Dreamer-style latent imagination with MuZero-style planning, or integrate causal structure into the world model's transition dynamics.
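Relevant to Build 1: the M model emits parameters of a Gaussian mixture over the next latent, and dream rollouts sample from it with a temperature tau. A minimal sampler sketch; the mixture parameters below are random placeholders, and the temperature scheme (divide logits by tau, scale sigma by sqrt(tau)) mirrors the released World Models code but should be treated as an assumption, not gospel.

```python
import numpy as np

rng = np.random.default_rng(0)

def mdn_sample(log_pi, mu, sigma, tau=1.0):
    """Sample a next-latent vector from per-dimension Gaussian mixtures.

    log_pi, mu, sigma: arrays of shape (z_dim, n_mixtures).
    Higher tau flattens the mixture weights and widens the Gaussians,
    making the dream more stochastic and harder for the agent to exploit.
    """
    z_dim, n_mix = mu.shape
    # temperature-adjusted mixture weights (softmax over the mixture axis)
    logits = log_pi / tau
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    z = np.empty(z_dim)
    for d in range(z_dim):
        k = rng.choice(n_mix, p=probs[d])              # pick a mixture component
        z[d] = rng.normal(mu[d, k], sigma[d, k] * np.sqrt(tau))
    return z

# Hypothetical MDN-RNN output for a 3-dim latent with 5 mixture components:
z_dim, n_mix = 3, 5
log_pi = rng.normal(size=(z_dim, n_mix))
mu = rng.normal(size=(z_dim, n_mix))
sigma = np.exp(0.1 * rng.normal(size=(z_dim, n_mix)))
print(mdn_sample(log_pi, mu, sigma, tau=1.15))
```

Ha & Schmidhuber found this knob important: agents trained in a too-deterministic dream learn to exploit the model's errors, and raising tau suppresses that.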
Code repositories
- DreamerV3 official (Danijar Hafner)
- IRIS (Micheli et al.)
- DIAMOND (Alonso et al.)
- worldmodels.github.io (Ha & Schmidhuber)
- Awesome World Models (Tsinghua)
8. Advanced / Going Deeper — Open Research Directions
- Compositionality: Can world models be built from modular, object-centric components rather than monolithic neural networks? See PoE-World and slot-attention approaches.
- Causal world models: Integrating structural causal models with learned dynamics — enabling counterfactual reasoning and intervention planning. A natural entry point for anyone with a causal inference background.
- Multi-agent world models: Modeling other agents' behaviors and intentions within the world model. Connects directly to multi-agent RL research.
- Long-horizon consistency: Current video world models degrade over long rollouts. How do we maintain temporal coherence over hundreds of steps?
- Evaluation & benchmarks: What does it mean for a world model to "understand" physics? Recent benchmarks (2025) report that LLMs score near chance on motion-trajectory prediction tasks.
- World models as the path to AGI? A 2025 DeepMind result shows that any agent that generalizes across a sufficiently broad range of tasks must have implicitly learned a predictive model of its environment; LeCun argues that world models built on JEPA are the missing piece.
- Critique: Read Xing et al., "Critiques of World Models" (2025) for a rigorous counterpoint to the hype.
Suggested 4-Week Study Plan
| Week | Focus | Key Reads | Build |
|---|---|---|---|
| 1 | Foundations + Ha & Schmidhuber | Dyna (1991), World Models (2018) | Recreate V-M-C latent rollout diagram |
| 2 | Dreamer lineage + MuZero | PlaNet, DreamerV1–V3, MuZero | Implement toy RSSM, write Dreamer vs MuZero memo |
| 3 | Transformer & diffusion world models | IRIS, DIAMOND, TWM | Minimal transformer world model on gridworld |
| 4 | Foundation models & frontiers | V-JEPA 2, Cosmos, Genie 3, surveys | Evaluate a video world model; identify your research angle |