February 25, 2026
World Models & Model-Based RL
World Models: From Dreaming Agents to Physical AI
A structured learning path covering world models — from classical model-based RL foundations through modern latent imagination (Dreamer), transformer-based world models (IRIS, DIAMOND), and frontier foundation models (Cosmos, Genie, V-JEPA) — for researchers and engineers targeting MBRL, robotics, autonomous driving, and video generation.
Overview
World models are learned internal simulators that capture environment dynamics, enabling agents to "imagine" future trajectories and plan without exhaustive real-world interaction. This roadmap takes you from classical model-based RL roots through the modern Dreamer lineage, transformer-based approaches, and into the 2025–2026 frontier of foundation-scale world models for physical AI.
Who is this for? ML researchers and engineers with RL fundamentals who want to build, extend, or apply world models across domains — games, robotics, autonomous driving, and generative simulation.
End state: You can read and implement state-of-the-art world model architectures, understand the design tradeoffs (latent vs. pixel prediction, RSSM vs. transformer, decision-coupled vs. general-purpose), and identify open research directions for your own work.
Sequence
1. Foundations — Mental Models & Classical Roots
Build intuition for why learning a model of the world matters and the core abstraction of Dyna-style model-based RL.
- concept Internal models in cognitive science — Kenneth Craik's The Nature of Explanation (1943) introduced the idea that organisms carry "small-scale models" of the world to anticipate events.
- paper Sutton, "Dyna, an Integrated Architecture for Learning, Planning, and Reacting" (1991) — The foundational framework: learn a model, generate imagined transitions, update the policy. Everything that follows is a richer version of this loop.
- paper Schmidhuber, "An On-Line Algorithm for Dynamic Reinforcement Learning and Planning in Reactive Environments" (1990) — Early RNN-based world model + controller training in latent space; the intellectual ancestor of all modern approaches.
- resource Sutton & Barto, Reinforcement Learning: An Introduction — Chapter 8 (Planning and Learning with Tabular Methods) for Dyna-Q and model-based planning fundamentals.
- prereq VAEs (Kingma & Welling, 2013), RNNs/LSTMs, and policy gradient basics.
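The Dyna loop described above (learn a model, generate imagined transitions, update the policy) fits in a few dozen lines. A minimal tabular Dyna-Q sketch on a hypothetical 5-state chain; the environment, hyperparameters, and episode count are illustrative choices, not taken from Sutton's paper:

```python
import random

# Minimal tabular Dyna-Q on a hypothetical 5-state chain: states 0..4,
# actions 0 (left) / 1 (right), reward 1.0 only on reaching state 4.
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, float(s2 == GOAL)

random.seed(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}                      # learned model: (s, a) -> (s', r)
alpha, gamma, eps, n_planning = 0.5, 0.95, 0.1, 20

for _ in range(50):
    s = 0
    while s != GOAL:
        # act epsilon-greedily (random tie-breaking), observe a real transition
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[(s, b)] for b in ACTIONS)
            a = random.choice([b for b in ACTIONS if Q[(s, b)] == best])
        s2, r = step(s, a)
        # direct RL: update from the real transition
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        # model learning: memorize what the environment did
        model[(s, a)] = (s2, r)
        # planning: replay imagined transitions sampled from the learned model
        for _ in range(n_planning):
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in ACTIONS) - Q[(ps, pa)])
        s = s2

# The greedy policy should now move right from every non-goal state.
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

Everything downstream in this roadmap is a richer version of this loop: swap the tabular model for an RSSM or transformer, and the Q-table for an actor-critic trained in imagination.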
2. The Modern Era — Ha & Schmidhuber to Dreamer
The key conceptual leap: VAE encodes pixels → RNN predicts latent dynamics → compact controller acts in dream space.
- paper Ha & Schmidhuber, "World Models" (NeurIPS 2018) — The landmark paper. V(AE)-M(DN-RNN)-C architecture. Agent trained entirely inside its own hallucinated dream. Interactive version at worldmodels.github.io.
- paper Hafner et al., "Learning Latent Dynamics for Planning from Pixels" (PlaNet, 2019) — Introduces the Recurrent State-Space Model (RSSM) with deterministic + stochastic paths. Planning via CEM in latent space.
- paper Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination" (DreamerV1, 2020) — End-to-end differentiable: actor-critic trained entirely inside imagined rollouts from the RSSM.
- paper Hafner et al., "Mastering Atari with Discrete World Models" (DreamerV2, 2021) — Discrete latent representations, KL balancing, surpasses human on Atari.
- paper Hafner et al., "Mastering Diverse Domains through World Models" (DreamerV3, Nature 2025) — The capstone. Single configuration across 150+ tasks. First to collect diamonds in Minecraft from scratch. Study the robustness techniques: symlog, unimix, percentile return normalization.
- build Reproduce the DreamerV3 RSSM on a simple environment (CartPole or DMC Walker). Focus on understanding the imagination rollout and actor-critic optimization inside latent space.
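Two of DreamerV3's robustness techniques are pure input/output transforms that are easy to verify in isolation. A sketch of symlog and its inverse symexp, as defined in the paper:

```python
import math

def symlog(x: float) -> float:
    # DreamerV3's symmetric log transform: compresses large magnitudes
    # while staying near-identity around zero and preserving sign.
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x: float) -> float:
    # Inverse of symlog: symexp(symlog(x)) == x.
    return math.copysign(math.expm1(abs(x)), x)

# Targets (rewards, values) are predicted in symlog space, so a single
# fixed-scale network head can handle returns spanning orders of magnitude.
for v in (-1000.0, -1.0, 0.0, 0.5, 1e6):
    print(v, symlog(v), symexp(symlog(v)))
```

This is why one DreamerV3 configuration works across domains whose reward scales differ by orders of magnitude: the head always predicts values in a compressed, roughly unit-scale space.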
3. Parallel Lineage — Planning with Learned Models (MuZero Family)
World models for search-based planning rather than imagination-based policy optimization.
- paper Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (MuZero, Nature 2020) — Learns dynamics, reward, and value functions without access to environment rules. Plans via MCTS in learned latent space.
- paper Ye et al., "Mastering Atari Games with Limited Data" (EfficientZero, NeurIPS 2021) — Data-efficient MuZero variant achieving superhuman Atari performance from only 2 hours of real-time gameplay.
- exercise Write a one-page comparison memo: Dreamer's explicit latent imagination vs. MuZero's implicit planning via search. When would you choose one over the other?
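To make the contrast concrete for the memo, here is a toy sketch of MuZero's core idea: planning entirely through learned functions (dynamics g and value v) with no access to environment rules. It substitutes exhaustive depth-limited search for MCTS, and the "learned" functions are hand-coded stand-ins on a number line; everything here is illustrative, not MuZero's actual architecture.

```python
# MuZero-style planning operates on abstract latent states: a dynamics model
# g(s, a) -> (s', r) and a value head v(s), never the real environment.
ACTIONS = (-1, +1)

def dynamics(s, a):
    # stand-in for the learned dynamics model: predicts next state and reward
    s2 = s + a
    return s2, 1.0 if s2 == 3 else 0.0   # hypothetical: reward at state 3

def value(s):
    # stand-in for the learned value head: rough estimate of return from s
    return max(0.0, 1.0 - 0.2 * abs(3 - s))

def plan(s, depth, gamma=0.97):
    """Best (return_estimate, first_action) within `depth` steps of lookahead.

    Real MuZero guides this search with MCTS and a learned policy prior;
    exhaustive depth-limited search shows the same structure more plainly.
    """
    if depth == 0:
        return value(s), None   # bootstrap the leaf with the value head
    best_ret, best_a = float("-inf"), None
    for a in ACTIONS:
        s2, r = dynamics(s, a)
        sub_ret, _ = plan(s2, depth - 1, gamma)
        total = r + gamma * sub_ret
        if total > best_ret:
            best_ret, best_a = total, a
    return best_ret, best_a

ret, a = plan(0, depth=4)
print(ret, a)   # from s=0 the planner should head toward the rewarding state
```

The memo contrast in one sentence: Dreamer amortizes this search into a policy trained on imagined rollouts, while MuZero re-runs the search at every decision.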
4. Transformer-Based World Models
Replacing RSSMs with transformers for sequence modeling of world dynamics.
- paper Micheli et al., "Transformers are Sample-Efficient World Models" (IRIS, ICLR 2023) — Discrete autoencoder + autoregressive transformer as the world model. Achieves strong Atari performance with far fewer environment interactions.
- paper Chen et al., "TransDreamer: Reinforcement Learning with Transformer World Models" (2022) — Replaces Dreamer's RSSM with a transformer state-space model.
- paper Alonso et al., "Diffusion for World Modeling: Visual Details Matter in Atari" (DIAMOND, NeurIPS 2024) — Diffusion-based world model that predicts directly in pixel space. Challenges the "latent space is necessary" assumption.
- paper Robine et al., "Transformer-based World Models Are Happy With 100k Interactions" (TWM, ICLR 2023) — Efficient transformer world model with categorical latent codes.
- build Implement a minimal transformer-based world model on a gridworld or simple Atari game. Compare autoregressive token prediction vs. RSSM rollouts.
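As a warm-up for the build above, the autoregressive interface these papers share can be exercised without any transformer at all. In this toy sketch a count-based table plays the role of the sequence model over discrete observation tokens; the environment (a 1-D track of 8 cells, where each cell is its own token) is made up for illustration.

```python
import random
from collections import defaultdict

# IRIS-style world models predict discrete observation tokens autoregressively.
# Here a count-based table conditioned on (previous token, action) stands in
# for the transformer: same interface, trivially small "model".
def env_step(s, a):
    # hypothetical deterministic 1-D track with 8 cells
    return max(0, min(7, s + a))

random.seed(0)
counts = defaultdict(lambda: defaultdict(int))   # (token, action) -> next-token counts

# Collect real experience and fit the model by counting transitions.
s = 0
for _ in range(2000):
    a = random.choice((-1, 1))
    s2 = env_step(s, a)
    counts[(s, a)][s2] += 1
    s = s2

def model_sample(token, action):
    """Autoregressive step: sample the next token from the learned distribution."""
    dist = counts[(token, action)]
    if not dist:
        return token          # unseen (token, action) pair: fall back to staying put
    r = random.uniform(0, sum(dist.values()))
    for tok, c in dist.items():
        r -= c
        if r <= 0:
            return tok
    return tok

# Imagined rollout: the "agent" never touches env_step here.
token = 3
trajectory = [token]
for a in (1, 1, 1, -1):
    token = model_sample(token, a)
    trajectory.append(token)
print(trajectory)
```

The real builds replace the counting table with a VQ-VAE tokenizer plus a transformer, but the rollout loop (feed a token and an action, sample the next token, repeat) is exactly this.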
5. Foundation-Scale World Models & Physical AI (2024–2026 Frontier)
Scaling world models to internet-scale video data, moving from game agents to physical world understanding.
- report NVIDIA, "Cosmos: World Foundation Model Platform for Physical AI" (2025) — Architecture, training, and evaluation metrics for video world models targeting robotics and autonomous driving.
- paper Assran et al., "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning" (Meta, 2025) — Joint embedding predictive architecture; prediction in representation space rather than pixel space. Web-scale video pretraining transfers to robotics.
- blog Google DeepMind, "Genie 3" (2025) — Foundation world model generating real-time interactive 3D environments from text prompts.
- industry OpenAI Sora & Sora 2 (2024–2025) — Video generation as world simulation; debate on whether video generators are "true" world models.
- industry Wayve, "GAIA-2" (2025) — Domain-specific world model for autonomous driving with controllable scenario generation.
- industry World Labs, "Marble" (2025) — Fei-Fei Li's lab; 4D world model for persistent spatial understanding.
- paper Yann LeCun, "A Path Towards Autonomous Machine Intelligence" (2022) — The philosophical position paper arguing world models are essential for AGI. Proposes the JEPA architecture family.
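The representation-space-versus-pixel-space distinction that V-JEPA and LeCun's position paper emphasize can be illustrated with a deliberately simplified numpy sketch. The encoder and data below are made up and bear no relation to the actual architectures; the point is only that an encoder which discards unpredictable detail removes that detail from the training signal.

```python
import numpy as np

# Generative world models pay a loss on every pixel; a JEPA-style model pays
# it only in embedding space, so unpredictable detail need not be modeled.
rng = np.random.default_rng(0)
D_PIX, D_EMB = 64, 4

def encode(x):
    # Hypothetical encoder: keeps the first D_EMB "semantic" coordinates and
    # drops the rest (standing in for texture the encoder learned to ignore).
    return x[:D_EMB]

frame_next = np.zeros(D_PIX)
frame_next[:D_EMB] = 1.0                             # predictable structure
frame_next[D_EMB:] = rng.normal(size=D_PIX - D_EMB)  # unpredictable detail

pred = np.zeros(D_PIX)
pred[:D_EMB] = 1.0                                   # predictor nails the structure

pixel_loss = np.mean((pred - frame_next) ** 2)       # penalized for noise it cannot predict
emb_loss = np.mean((encode(pred) - encode(frame_next)) ** 2)  # zero: structure matched

print(pixel_loss, emb_loss)
```

This is the crux of the "are video generators true world models" debate below: pixel-faithful generation and prediction of task-relevant structure are different objectives.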
6. World Models in Robotics & Embodied AI
Where world models meet the physical world — manipulation, navigation, and sim-to-real transfer.
- paper Physical Intelligence, "π0: A Vision-Language-Action Flow Model for General Robot Control" (2024) — Foundation model approach to robotic control.
- paper Figure AI, "Helix" (2025) — VLA model for humanoid robots.
- paper NVIDIA, "GR00T N1" (2025) — Foundation model for humanoid robot learning.
- paper Google DeepMind, "Gemini Robotics" (2025) — Multimodal foundation model applied to robotic control.
- survey Ding et al., "Understanding World or Predicting Future? A Comprehensive Survey of World Models" (ACM Computing Surveys, 2025) — Excellent taxonomy: understanding-oriented vs. prediction-oriented world models across games, driving, robotics, and social simulacra.
- survey Li et al., "A Comprehensive Survey on World Models for Embodied AI" (2025) — Three-axis taxonomy: functionality, temporal modeling, spatial representation.
7. Implementations & Hands-On Projects
- Build 1 (Beginner): Reproduce Ha & Schmidhuber's World Models on VizDoom or CarRacing-v0. Train the VAE, MDN-RNN, and controller separately. Then train the agent inside its own dream.
- Build 2 (Intermediate): Implement DreamerV3's RSSM with the key robustness tricks (symlog, KL balancing, unimix). Train on DeepMind Control Suite tasks.
- Build 3 (Intermediate): Build a minimal IRIS-style transformer world model with VQ-VAE tokenization on a simple Atari game.
- Build 4 (Advanced): Fine-tune or evaluate a video prediction model (e.g., using the Cosmos codebase) on a custom domain. Measure physical consistency metrics.
- Build 5 (Research): Design a hybrid architecture — e.g., combine Dreamer-style latent imagination with MuZero-style planning, or integrate causal structure into the world model's transition dynamics.
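Relevant to Build 1: the M model emits parameters of a Gaussian mixture over the next latent, and dream rollouts sample from it with a temperature tau. A minimal sampler sketch; the mixture parameters below are random placeholders, and the temperature scheme (divide logits by tau, scale sigma by sqrt(tau)) mirrors the released World Models code but should be treated as an assumption, not gospel.

```python
import numpy as np

rng = np.random.default_rng(0)

def mdn_sample(log_pi, mu, sigma, tau=1.0):
    """Sample a next-latent vector from per-dimension Gaussian mixtures.

    log_pi, mu, sigma: arrays of shape (z_dim, n_mixtures).
    Higher tau flattens the mixture weights and widens the Gaussians,
    making the dream more stochastic and harder for the agent to exploit.
    """
    z_dim, n_mix = mu.shape
    # temperature-adjusted mixture weights (softmax over the mixture axis)
    logits = log_pi / tau
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    z = np.empty(z_dim)
    for d in range(z_dim):
        k = rng.choice(n_mix, p=probs[d])              # pick a mixture component
        z[d] = rng.normal(mu[d, k], sigma[d, k] * np.sqrt(tau))
    return z

# Hypothetical MDN-RNN output for a 3-dim latent with 5 mixture components:
z_dim, n_mix = 3, 5
log_pi = rng.normal(size=(z_dim, n_mix))
mu = rng.normal(size=(z_dim, n_mix))
sigma = np.exp(0.1 * rng.normal(size=(z_dim, n_mix)))
print(mdn_sample(log_pi, mu, sigma, tau=1.15))
```

Ha & Schmidhuber found this knob important: agents trained in a too-deterministic dream learn to exploit the model's errors, and raising tau suppresses that.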
Code repositories
- DreamerV3 official (Danijar Hafner)
- IRIS (Micheli et al.)
- DIAMOND (Alonso et al.)
- worldmodels.github.io (Ha & Schmidhuber)
- Awesome World Models (Tsinghua)
8. Advanced / Going Deeper — Open Research Directions
- Compositionality: Can world models be built from modular, object-centric components rather than monolithic neural networks? See PoE-World and slot-attention approaches.
- Causal world models: Integrating structural causal models with learned dynamics — enabling counterfactual reasoning and intervention planning. A natural entry point for anyone with a causal inference background.
- Multi-agent world models: Modeling other agents' behaviors and intentions within the world model. Connects directly to multi-agent RL research.
- Long-horizon consistency: Current video world models degrade over long rollouts. How do we maintain temporal coherence over hundreds of steps?
- Evaluation & benchmarks: What does it mean for a world model to "understand" physics? Recent benchmarks (2025) report that LLMs score near chance on motion-trajectory prediction tasks.
- World models as the path to AGI? A 2025 DeepMind result shows that any agent that generalizes across a sufficiently broad range of tasks must have implicitly learned a predictive model of its environment; LeCun argues that world models built on JEPA are the missing piece.
- Critique: Read Xing et al., "Critiques of World Models" (2025) for a rigorous counterpoint to the hype.
Suggested 4-Week Study Plan
| Week | Focus | Key Reads | Build |
|---|---|---|---|
| 1 | Foundations + Ha & Schmidhuber | Dyna (1991), World Models (2018) | Recreate V-M-C latent rollout diagram |
| 2 | Dreamer lineage + MuZero | PlaNet, DreamerV1–V3, MuZero | Implement toy RSSM, write Dreamer vs MuZero memo |
| 3 | Transformer & diffusion world models | IRIS, DIAMOND, TWM | Minimal transformer world model on gridworld |
| 4 | Foundation models & frontiers | V-JEPA 2, Cosmos, Genie 3, surveys | Evaluate a video world model; identify your research angle |