What a world model is
A generative language model is, at heart, a one-line program: given everything seen so far, put a distribution over the next symbol. That is enough to write essays — but it is a strange foundation for an agent. An agent does not want to predict the next word a human would type; it wants to know what will happen next in the environment if it acts. A world model is the component that answers exactly that question: a learned simulator of the environment's dynamics that, given a compact summary of the present and a proposed action, predicts the future.
The decisive design choice is where the prediction lives. Predicting the next raw observation (every pixel of the next video frame) is the naïve target, and it is mostly the wrong one: it forces the model to spend capacity on photo-realistic but behaviorally irrelevant detail, such as the exact texture of grass or the flicker of a shadow, when only a small fraction of the information in a frame matters for control. Modern world models instead carry a latent state \(z_t\): a low-dimensional code, produced by an encoder, that compresses an observation down to what is needed to predict the future. Dynamics are then learned in that latent space, and decoded back to pixels only when a human needs to look.
It helps to separate three jobs a world model can do, because different systems emphasise different ones. Prediction: given a state and action, what is the next state? Generation: sample whole plausible futures, or even whole playable environments, from the learned dynamics. Planning: search over action sequences by imagining their consequences and picking the one whose imagined return is highest (§5.5). Ha & Schmidhuber's 2018 "World Models" paper made the point vividly: in its VizDoom experiment, an agent was trained entirely inside its own dream of the game and then transferred to the real environment (the paper's car-racing agent, by contrast, was still trained against the real game). That demonstration set the agenda for everything that followed.
# EQ MM5.1: learn a tiny LINEAR latent-dynamics model, then roll it forward
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d = 3
A_true = np.array([[0.96, 0.04, 0.00],   # the environment's hidden dynamics
                   [0.00, 0.91, 0.09],   # z_{t+1} = A_true @ z_t (+ noise)
                   [0.05, 0.00, 0.93]])
T = 60
z = np.zeros((T, d)); z[0] = [1.0, 0.0, 0.0]
for t in range(T - 1):
    z[t+1] = A_true @ z[t] + rng.normal(0, 0.01, d)   # observed noisy rollout

# lstsq solves z[:-1] @ X ~ z[1:], so X is A transposed; .T recovers A_hat
A_hat = np.linalg.lstsq(z[:-1], z[1:], rcond=None)[0].T   # fit z_{t+1} ~ A z_t
zhat = np.zeros_like(z); zhat[0] = z[0]
for t in range(T - 1):
    zhat[t+1] = A_hat @ zhat[t]   # IMAGINE forward: open loop, no peeking

one_step = np.sqrt((((A_hat @ z[:-1].T).T - z[1:]) ** 2).mean())
rollout = np.sqrt(((zhat - z) ** 2).mean())
print("learned A diagonal :", A_hat.diagonal().round(3), "(truth: 0.96 0.91 0.93)")
print("one-step RMSE :", round(float(one_step), 4))
print("free-rollout RMSE :", round(float(rollout), 4), "<- errors compound over the horizon")

# The original cell called the page's own plot_xy helper; matplotlib is used here as a stand-in.
plt.plot(range(T), np.sqrt(((zhat - z) ** 2).sum(1)))
plt.xlabel("step t"); plt.ylabel("rollout error |zhat_t - z_t|")
plt.show()
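The cell above has no actions, whereas the definition at the start of this section asks for a prediction given a state and a proposed action. A minimal sketch of that action-conditioned case follows, assuming linear dynamics \(z_{t+1} = A z_t + B a_t\). The matrix `B_true`, the random exploratory actions, and the helper `predict` are illustrative assumptions, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 3, 1, 400                      # latent size, action size, steps (illustrative)

# Assumed ground truth: z_{t+1} = A_true @ z_t + B_true @ a_t + noise
A_true = np.array([[0.96, 0.04, 0.00],
                   [0.00, 0.91, 0.09],
                   [0.05, 0.00, 0.93]])
B_true = np.array([[0.10], [0.00], [-0.05]])   # hypothetical action effect

a = rng.uniform(-1, 1, (T - 1, m))       # random exploratory actions
z = np.zeros((T, d)); z[0] = [1.0, 0.0, 0.0]
for t in range(T - 1):
    z[t + 1] = A_true @ z[t] + B_true @ a[t] + rng.normal(0, 0.01, d)

# Fit A and B jointly by regressing z_{t+1} on the stacked input [z_t, a_t]
X = np.hstack([z[:-1], a])                        # shape (T-1, d+m)
W = np.linalg.lstsq(X, z[1:], rcond=None)[0].T    # shape (d, d+m)
A_hat, B_hat = W[:, :d], W[:, d:]

def predict(z_t, a_t):
    """One imagined step of the learned model: state and action in, next state out."""
    return A_hat @ z_t + B_hat @ a_t

print("max |A_hat - A_true|:", float(np.abs(A_hat - A_true).max()))
print("max |B_hat - B_true|:", float(np.abs(B_hat - B_true).max()))
```

Once `predict` exists, the same function serves all three jobs named above: call it once to predict, loop it to generate a rollout, or compare rollouts under different action sequences to plan.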
Latent dynamics — Dreamer
The Dreamer line (DreamerV1 → V3, Hafner et al.) is the most complete worked example of latent-dynamics control. Its world model is a Recurrent State-Space Model (RSSM), which splits the latent into two parts: a deterministic recurrent hidden state \(h_t\) that carries history, and a stochastic state \(z_t\) sampled from a learned distribution. Keeping a stochastic component is what lets the model represent genuine uncertainty about the future rather than a single brittle guess.
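The deterministic/stochastic split can be made concrete with a toy data-flow sketch. Everything here is an assumption for illustration: the weights are random and untrained, a plain `tanh` layer stands in for the recurrent cell, and the stochastic state is Gaussian, whereas published Dreamer versions use a GRU-style cell and (from V2 onward) categorical latents. The names `rssm_step`, `W_h`, `W_prior` and `W_post` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H_DIM, Z_DIM, A_DIM, E_DIM = 8, 4, 2, 6   # sizes of h_t, z_t, action, obs embedding (illustrative)

def dense(n_in, n_out):
    """Random, untrained weights: this sketch only shows the data flow."""
    return rng.normal(0, 0.3, (n_out, n_in))

W_h = dense(H_DIM + Z_DIM + A_DIM, H_DIM)   # deterministic recurrence
W_prior = dense(H_DIM, 2 * Z_DIM)           # prior head:     h_t        -> (mean, log_std)
W_post = dense(H_DIM + E_DIM, 2 * Z_DIM)    # posterior head: (h_t, e_t) -> (mean, log_std)

def sample_gaussian(params):
    mean, log_std = np.split(params, 2)
    return mean + np.exp(log_std) * rng.normal(size=mean.shape)

def rssm_step(h, z, a, obs_embed=None):
    """Advance one step. Without an observation the prior is used (imagination);
    with one, the posterior is used (training on real data)."""
    h_next = np.tanh(W_h @ np.concatenate([h, z, a]))     # deterministic path carries history
    if obs_embed is None:
        params = W_prior @ h_next
    else:
        params = W_post @ np.concatenate([h_next, obs_embed])
    return h_next, sample_gaussian(params)                # stochastic path carries uncertainty

h0, z0, a0 = np.zeros(H_DIM), np.zeros(Z_DIM), np.ones(A_DIM)
e1 = rng.normal(size=E_DIM)                               # stand-in for an encoded observation
h_post, z_post = rssm_step(h0, z0, a0, obs_embed=e1)
h_prior, z_prior = rssm_step(h0, z0, a0)
print(h_post.shape, z_post.shape)
```

The two calls share the same deterministic state but draw different stochastic states, which is the gap the training objective is designed to close.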
The model is trained the way a variational autoencoder is (see Vol II / DEEP LEARNING 05): an encoder proposes a posterior \(z_t\) from the actual observation, a transition proposes a prior \(\hat z_t\) from the recurrent state alone, and the loss pulls them together while a decoder reconstructs the observation. The complete objective sums a reconstruction term, the reward and termination predictions, and a KL term that is the heart of the world model:
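One way to write the objective just described, as a simplified sketch rather than the exact published loss. The symbols \(x_t\) (observation), \(r_t\) (reward), \(c_t\) (continuation flag), \(q_\phi\) (encoder posterior), \(p_\phi\) (prior and prediction heads) and the weight \(\beta\) are notation assumed here; DreamerV3 itself splits the KL into two terms with stop-gradients and applies free bits, which this form omits.

```latex
\[
\mathcal{L}(\phi) \;=\; \mathbb{E}_{q_\phi}\!\left[\, \sum_{t=1}^{T}
  \underbrace{-\ln p_\phi(x_t \mid h_t, z_t)}_{\text{reconstruction}}
  \;\underbrace{-\,\ln p_\phi(r_t \mid h_t, z_t) \;-\; \ln p_\phi(c_t \mid h_t, z_t)}_{\text{reward and termination}}
  \;+\; \beta\,\underbrace{\mathrm{KL}\!\big[\, q_\phi(z_t \mid h_t, x_t) \,\big\|\, p_\phi(\hat z_t \mid h_t) \,\big]}_{\text{posterior vs. prior}}
\right]
\]
```

The KL term is what trains the transition: it is small only when the prior, which sees the recurrent state alone, already anticipates what the encoder will infer from the next observation.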
Once the world model is trained, Dreamer learns its behaviour in imagination rather than by trial and error in the real environment. It rolls the RSSM forward in latent space for a short horizon (typically \(H = 15\) steps) and trains an actor and a critic purely on these imagined trajectories: the policy-gradient and value-learning machinery of RL (RL 04–05) applied to dreamed data. Because a forward pass of the latent transition is far cheaper than stepping a real simulator or robot, and can be batched across many start states, the agent can practise many imagined steps for every real one, which is a large part of why Dreamer is so sample-efficient.
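To make "training a critic on imagined trajectories" concrete: Dreamer-style agents score each imagined trajectory with \(\lambda\)-returns, which blend the predicted rewards along the rollout with the critic's own value estimates. The sketch below is a minimal version under assumed inputs. The function name `lambda_returns` and the values of `gamma` and `lam` are illustrative, and the rewards and values are placeholders for what the world model and critic would predict.

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns over an imagined rollout of length H.

    rewards[t] is the predicted reward at step t, for t = 0..H-1.
    values[t]  is the critic's estimate at step t, for t = 0..H
               (values[H] bootstraps beyond the imagination horizon).
    """
    H = len(rewards)
    returns = np.zeros(H)
    next_return = values[H]
    for t in reversed(range(H)):
        # R_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * R_{t+1})
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns

H = 15                                   # the short imagination horizon from the text
rewards = np.ones(H)                     # placeholder predicted rewards
values = np.zeros(H + 1)                 # placeholder critic outputs
print(lambda_returns(rewards, values)[:3])
```

Setting `lam=0` reduces to one-step bootstrapping and `lam=1` to a discounted sum with a single bootstrap at the horizon; intermediate values trade off reliance on the model's rewards against reliance on the critic.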
DreamerV3's 2023 headline is worth stating precisely, because it is a genuine landmark and also frequently overstated. With a single fixed set of hyperparameters, it set state-of-the-art across a remarkable spread of domains — Atari, continuous control, DMLab — and was the first method to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing open challenge. The honest caveats: the symlog/two-hot tricks that make one configuration work everywhere are engineering, not magic; imagined rollouts still suffer compounding model error past their horizon; and "world model" here means a compact game/control simulator, not a general model of physical reality.
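The symlog and two-hot tricks are small enough to show directly. Symlog compresses large magnitudes symmetrically around zero, and a two-hot target spreads a scalar over its two nearest bins so that the bin-weighted average recovers it exactly. The sketch below is a minimal NumPy version; the bin range and count are illustrative assumptions rather than a claim about the published configuration.

```python
import numpy as np

def symlog(x):
    """Symmetric log: sign(x) * ln(|x| + 1)."""
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return np.sign(x) * np.expm1(np.abs(x))

def two_hot(value, bins):
    """Encode a scalar as weights on its two neighbouring bins (weights sum to 1)."""
    value = np.clip(value, bins[0], bins[-1])
    hi = int(np.clip(np.searchsorted(bins, value, side="left"), 1, len(bins) - 1))
    lo = hi - 1
    w_hi = (value - bins[lo]) / (bins[hi] - bins[lo])
    target = np.zeros(len(bins))
    target[lo], target[hi] = 1.0 - w_hi, w_hi
    return target

bins = np.linspace(-20.0, 20.0, 255)      # bins live in symlog space (illustrative choice)
reward = 1234.5                           # a large raw reward
target = two_hot(symlog(reward), bins)    # what a reward head would be trained to output
decoded = symexp(target @ bins)           # expectation over bins, mapped back to raw scale
print(np.count_nonzero(target), float(decoded))
```

Because the head predicts a distribution over fixed bins in a squashed space, the same loss scale works whether rewards are of order 0.01 or 10,000, which is the practical point behind running one configuration across many domains.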
Why not just predict pixels? Early latent models did include a heavy pixel-reconstruction loss, and it works — but it ties the latent's capacity to visual fidelity. DreamerV3 keeps a decoder for grounding, yet the behaviorally important signal flows through the reward and KL terms. The next section takes the argument to its logical end: drop pixel reconstruction altogether.
JEPA — joint-embedding predictive architectures
Yann LeCun's 2022 position paper, A Path Towards Autonomous Machine Intelligence, makes one architectural commitment the organising principle of the whole programme: do not predict in observation space — predict in representation space. A Joint-Embedding Predictive Architecture (JEPA) encodes both an input \(x\) and a target \(y\) (a masked region, or a future) into embeddings \(s_x, s_y\), and trains a predictor to map the input embedding to the target embedding, never back to pixels.
The argument has teeth beyond philosophy. I-JEPA (images, 2023) and V-JEPA / V-JEPA 2 (video, 2024–2025) showed that embedding-prediction self-supervision learns features competitive with or better than reconstruction-based pretraining (masked autoencoders) and contrastive methods, while training faster and without the heavy augmentation pipelines contrastive learning needs. The predictive framing also connects directly to world models: predict a future embedding instead of a masked one and the same architecture becomes a latent dynamics model — V-JEPA 2 is explicitly pitched as a world model for planning, exactly the §5.5 use.
Two honest caveats keep this from being a clean victory. First, collapse is a real and finicky failure mode; the EMA target, stop-gradient, and variance/covariance regularisers (the VICReg family) are load-bearing, not optional. Second, because there is no decoder, you cannot directly visualise what a JEPA has predicted — you only have an embedding — which makes debugging and human inspection harder than in a Dreamer-style model that can render its dream. JEPA trades interpretability for representational efficiency, and whether that trade is the right path to general intelligence is, as of 2026, an active and genuinely contested research bet rather than settled fact.
# EQ MM5.3: embedding-prediction loss vs pixel-reconstruction loss on a toy
import numpy as np
rng = np.random.default_rng(1)
D = 64 # pixels per "frame"
N = 400 # samples
pos = rng.uniform(-1, 1, N) # the one PREDICTABLE factor (object position)
grid = np.linspace(-1, 1, D)
signal = np.exp(-((grid[None, :] - pos[:, None]) ** 2) / 0.05) # a blob at `pos`
noise = rng.normal(0, 1.0, (N, D)) # UNPREDICTABLE per-pixel texture
frames = signal + 0.8 * noise # what a pixel decoder must reproduce
# encoder: project each frame to a 1-D embedding that tracks position (the predictable part)
w = grid / (grid @ grid) # linear readout along the grid: a scaled center-of-mass estimate, roughly proportional to pos
emb = frames @ w # s_y : embedding of each frame
# a predictor that knows position perfectly (best case) vs the two losses it implies
pixel_loss = ((frames - signal) ** 2).mean() # decoder can't predict the noise
embed_loss = ((emb - pos) ** 2).mean() # the readout averages most of the noise away
print(f"pixel-reconstruction loss : {pixel_loss:.3f} (dominated by texture noise)")
print(f"embedding-prediction loss : {embed_loss:.3f} (keeps only the position)")
print(f"ratio pixel/embedding : {pixel_loss / max(embed_loss, 1e-9):.1f}x")
print("JEPA predicts the embedding -> it never pays for noise it cannot predict.")
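The collapse safeguards named in this section can also be sketched in a few lines. Two are shown: an exponential-moving-average (EMA) update for the target encoder's weights, and a VICReg-style variance hinge that penalises embedding dimensions whose batch standard deviation falls below a threshold. The function names and the values of `tau`, `gamma` and `eps` are illustrative assumptions, and actual JEPA variants differ in which of these safeguards they use.

```python
import numpy as np

def ema_update(target_w, online_w, tau=0.99):
    """Move target weights slowly toward the online weights (no gradient flows here)."""
    return tau * target_w + (1.0 - tau) * online_w

def variance_penalty(emb, gamma=1.0, eps=1e-4):
    """VICReg-style hinge: mean over dimensions of max(0, gamma - std of that dimension)."""
    std = np.sqrt(emb.var(axis=0) + eps)
    return float(np.maximum(0.0, gamma - std).mean())

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, (256, 8))              # embeddings that vary across the batch
collapsed = 0.3 + rng.normal(0.0, 1e-3, (256, 8))     # every input mapped to nearly one point

print("penalty, healthy   :", round(variance_penalty(healthy), 3))
print("penalty, collapsed :", round(variance_penalty(collapsed), 3))
print("EMA step           :", ema_update(np.zeros(2), np.ones(2), tau=0.9))
```

A collapsed encoder makes the prediction loss trivially small, since every target embedding is the same point. The variance term is what makes that solution expensive.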
Genie & learned interactive simulators
Dreamer and JEPA learn dynamics to control. Genie (Bruce et al., DeepMind, 2024) pushes the world model in a different and striking direction: learn dynamics to generate playable worlds. Its training data is internet video of 2D-platformer gameplay (a corpus of 200,000+ hours, which the paper reports filtering down to roughly 30,000 hours), with no action labels at all. From a single prompt image, which can itself come from a text-to-image model, a sketch, or a photograph, Genie produces an environment you can then step through frame by frame with a controller, even though nobody ever told it what the buttons do.
The trick that makes label-free training possible is a latent action model. Genie has three learned pieces: a video tokenizer (compress frames to discrete tokens), an autoregressive dynamics model (predict the next frame's tokens), and — the key idea — a latent-action module trained to infer, for each pair of consecutive frames, a discrete latent action \(a_t\) drawn from a small codebook that best explains the transition.
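The idea of inferring a discrete action from a pair of frames can be illustrated on a toy far simpler than video. Below, a 1-D agent moves left, stays, or moves right; the learner sees only consecutive positions and must recover a three-entry codebook that explains the transitions. This sketch substitutes plain k-means on the displacement for Genie's learned latent-action module, and the variable names and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 600, 3                                        # transitions, codebook size

true_action = rng.integers(0, K, N)                  # hidden labels, never shown to the learner
step = np.array([-1.0, 0.0, 1.0])[true_action]       # left, stay, right
x_t = rng.uniform(-5, 5, N)                          # "frame" at time t (just a position here)
x_next = x_t + step + rng.normal(0, 0.05, N)         # "frame" at time t+1

delta = x_next - x_t                                 # what a frame pair reveals about the action
codebook = np.quantile(delta, [0.0, 0.5, 1.0])       # deterministic initialisation
for _ in range(10):                                  # k-means: assign, then re-centre
    code = np.abs(delta[:, None] - codebook[None, :]).argmin(axis=1)
    codebook = np.array([delta[code == k].mean() for k in range(K)])

# How well do inferred codes line up with the hidden actions (up to relabelling)?
confusion = np.zeros((K, K), dtype=int)
np.add.at(confusion, (code, true_action), 1)
purity = confusion.max(axis=1).sum() / N
print("codebook:", codebook.round(2), " purity:", purity)
```

The codes are consistent but arbitrary: nothing forces code 0 to mean "left". That is the toy version of the caveat that inferred latent actions need not match any human control scheme.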
Why this matters: the binding constraint on training agents has always been the cost of interactive, action-labelled data. Genie loosens it dramatically, because passive video is vastly more abundant than labelled interaction. The follow-up, Genie 2 (late 2024), extended the recipe to action-controllable 3D worlds generated from a single image that stay consistent for up to about a minute, positioning learned simulators as potential training grounds for embodied agents (the bridge to the next chapter). Related lines, including Google's GameNGen reproducing DOOM as a neural simulator and the broad family of video-diffusion-as-world-model systems, point at the same convergence: a sufficiently good action-conditioned video predictor is an interactive environment.
The caveats are real and current. These simulators hallucinate and drift over long horizons; physical consistency (object permanence, conservation) is approximate, not guaranteed; frame rates and resolutions remain well below real-time photorealism for long rollouts; and inferred latent actions are not guaranteed to align with any human control scheme. "Learned interactive simulator" in 2026 means an impressive, improving research artifact — not a drop-in replacement for a physics engine.
World models for planning & RL
The payoff of all this machinery is that a world model turns reinforcement learning from a problem of acting into a problem of imagining. There are two dominant ways to spend a learned model.
Background planning (Dreamer-style). Use imagined rollouts to train a fast reactive policy, then act with the policy alone. This is the actor–critic-in-imagination loop of §5.2: cheap to run at deployment because the world model is only used during training.
Decision-time planning (MuZero / MPC-style). Use the model at the moment of acting to search over action sequences and execute the first action of the best plan. The canonical objective is to choose the action sequence whose imagined return is greatest:
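A standard way to write that objective, with an assumed discount \(\gamma\) and model-predicted rewards \(\hat r\), is \(a^*_{t:t+H-1} = \arg\max_{a_{t:t+H-1}} \mathbb{E}\big[\sum_{k=0}^{H-1} \gamma^k \hat r_{t+k}\big]\). The sketch below approximates the argmax by random shooting, the simplest MPC search: sample candidate action sequences, imagine each one with the model, and execute only the first action of the best. The toy model (`model_step`, `model_reward`) is hypothetical, and MuZero itself uses tree search rather than random shooting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned model: a 1-D latent position nudged by the action,
# with a predicted reward that prefers staying near the origin.
def model_step(z, a):
    return z + 0.5 * a

def model_reward(z):
    return -z ** 2

def plan(z0, horizon=5, n_candidates=256, gamma=0.99):
    """Random-shooting MPC: return the first action of the best imagined sequence."""
    actions = rng.uniform(-1, 1, (n_candidates, horizon))   # candidate action sequences
    z = np.full(n_candidates, z0, dtype=float)
    returns = np.zeros(n_candidates)
    for k in range(horizon):                                # imagine all candidates in parallel
        z = model_step(z, actions[:, k])
        returns += gamma ** k * model_reward(z)
    best = returns.argmax()
    return actions[best, 0], returns[best]

z = 2.0
for t in range(10):
    a, _ = plan(z)               # replan from scratch at every step
    z = model_step(z, a)         # here the "real" environment is the model itself
print("final |z|:", abs(z))
```

Replanning at every step is what keeps this usable with an imperfect model: only the first action of each plan is ever executed, so errors late in the imagined horizon are never acted on directly.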
The recurring tension across §5.1–5.5 is the same one the latent-rollout example in §5.1 showed: compounding error. A one-step prediction can be excellent and an \(H\)-step rollout still useless, because each small error feeds the next step's input. This is why horizons are short (Dreamer's \(H \approx 15\)), why uncertainty-aware models that know when to distrust themselves matter, and why background planning (which only needs the model to be locally right) is often more robust than long decision-time search (which needs it globally right). The frontier question for 2026, pursued by V-JEPA 2, Genie 2 and the broader video-world-model community, is whether a single large pretrained world model can be accurate enough, over long enough horizons, to plan real-world behavior. It is genuinely open.
A world model that can imagine and plan is only half an agent — the other half has a body. Chapter 06 turns to embodied AI: vision-language-action models, sim-to-real transfer, and how the latent dynamics of this chapter become the inner loop of robots that act in the physical world.
References
- Ha, D. & Schmidhuber, J. (2018). World Models.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence.
- Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). Mastering Diverse Domains through World Models (DreamerV3).
- Bruce, J. et al. (2024). Genie: Generative Interactive Environments.
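- Assran, M. et al. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA). arXiv:2301.08243.
- Assran, M. et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.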
- Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero).
- Valevski, D. et al. (2024). Diffusion Models Are Real-Time Game Engines (GameNGen).