World Models — JEPA, Genie & Dreamer

5.1

What a world model is

A generative language model is, at heart, a one-line program: given everything seen so far, put a distribution over the next symbol. That is enough to write essays — but it is a strange foundation for an agent. An agent does not want to predict the next word a human would type; it wants to know what will happen next in the environment if it acts. A world model is the component that answers exactly that question: a learned simulator of the environment's dynamics that, given a compact summary of the present and a proposed action, predicts the future.

The decisive design choice is where the prediction lives. Predicting the next raw observation — every pixel of the next video frame — is the naïve target, and it is mostly the wrong one: it forces the model to spend capacity on photo-realistic but behaviorally irrelevant detail (the exact texture of grass, the flicker of a shadow) while the bits that matter for control are a handful. Modern world models instead carry a latent state $z_t$: a low-dimensional code, produced by an encoder, that compresses an observation down to what is needed to predict the future. Dynamics are then learned in that latent space, and only decoded back to pixels when a human needs to look.

EQ MM5.1 — THE WORLD-MODEL CONTRACT $$ z_t = \mathrm{enc}(o_t), \qquad \hat{z}_{t+1} = f_\theta(z_t,\, a_t), \qquad \hat{r}_{t+1} = g_\theta(z_t,\, a_t) $$

An encoder maps the raw observation $o_t$ to a latent state $z_t$; a learned transition $f_\theta$ predicts the next latent given the current latent and an action $a_t$; an optional reward head $g_\theta$ predicts the scalar payoff. Train these and you can roll the model forward in imagination — feed $\hat z_{t+1}$ back into $f_\theta$ — generating a trajectory of futures without ever touching the real environment. That closed loop, latent in and latent out, is the whole idea; everything in this chapter is a different choice of encoder, transition, and training objective.
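The contract in EQ MM5.1 can be sketched in a few lines. This is a minimal illustration, not a trained model: the weights `W_enc`, `A`, `B` and the goal latent `z_goal` are hypothetical stand-ins for learned parameters, and the reward head is assumed to score how close the next latent lands to that goal.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM = 16, 2

# Hypothetical stand-ins for learned parameters (random or hand-picked here).
W_enc = rng.normal(0.0, 0.3, (LATENT_DIM, OBS_DIM))   # encoder weights
A = np.array([[0.95, 0.10],
              [-0.10, 0.95]])                          # latent transition
B = np.array([[0.0],
              [0.2]])                                  # effect of a 1-D action
z_goal = np.array([1.0, 0.0])                          # reward peaks here (assumed)


def enc(o):
    """z_t = enc(o_t): compress an observation to a latent state."""
    return W_enc @ o


def f(z, a):
    """z_hat_{t+1} = f_theta(z_t, a_t): one imagined step."""
    return A @ z + B @ a


def g(z, a):
    """r_hat_{t+1} = g_theta(z_t, a_t): reward predicted for the step taken."""
    return -float(np.sum((f(z, a) - z_goal) ** 2))


def imagine(o0, actions):
    """Encode once, then roll forward in latent space only."""
    z = enc(o0)
    latents, rewards = [z], []
    for a in actions:
        rewards.append(g(z, a))
        z = f(z, a)                  # feed the prediction back in
        latents.append(z)
    return np.array(latents), np.array(rewards)


o0 = rng.normal(0.0, 1.0, OBS_DIM)   # one raw observation
plan = [np.array([0.5])] * 5         # a candidate action sequence
latents, rewards = imagine(o0, plan)
print("latent trajectory shape:", latents.shape)
print("imagined rewards       :", rewards.round(3))
```

The observation is touched exactly once, in `enc`; every later step consumes only the previous latent and an action, which is the closed loop the paragraph above describes.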

It helps to separate three jobs a world model can do, because different systems emphasise different ones. Prediction: given a state and action, what is the next state? Generation: sample whole plausible futures, or even whole playable environments, from the learned dynamics. Planning: search over action sequences by imagining their consequences and picking the one whose imagined return is highest (§5.5). Ha & Schmidhuber's 2018 "World Models" paper made the point vividly: an agent was trained entirely inside its own dream of a VizDoom level and then transferred to the real game, and that demonstration set the agenda for everything that followed.

True or false: a world model lets an agent plan by imagining rollouts — chaining its learned transition $ \hat z_{t+1} = f_\theta(z_t, a_t) $ forward over a candidate action sequence and scoring the imagined return, all without acting in the real environment. (Answer true or false.)

A world model's transition function predicts the next latent state from the current state and a proposed action. Feeding each predicted state back in produces an entire imagined trajectory for any candidate plan, and the reward head scores it — so the agent can compare plans purely in imagination and act only on the winner. That imagine-then-act loop is planning with a world model. The statement is true.

INSTRUMENT MM5.1 — LATENT-ROLLOUT VISUALIZER · 2-D LATENT · $z_{t+1} = A z_t + B a_t$ · EQ MM5.1

Interactive figure. Controls: dynamics decay (spectral), default 0.92; rotation per step, default 22°; action push, default 0.30. Readouts: horizon steps, final ‖z‖, regime.

The dot is the latent state $z_t$; each step applies $z_{t+1} = A z_t + B a_t$, where $A$ is a rotation scaled by the decay you set and $B a_t$ is a constant action push. Decay < 1 spirals the imagined trajectory inward to a fixed point (a stable, controllable model); decay = 1 holds a circle; decay > 1 spirals outward — the signature of an unstable, compounding-error rollout, the central failure mode of long-horizon imagination. This is the dynamics half of EQ MM5.1 made visible, with no pixels in sight.
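The three regimes can be reproduced offline with a short sketch. It assumes, since the text does not specify them, a start state of $(1, 0)$, a push along the first axis, and 60 steps; `make_A` and `rollout` are illustrative names.

```python
import numpy as np


def make_A(decay, rotation_deg):
    """Rotation matrix scaled by `decay`, so the spectral radius of A equals `decay`."""
    th = np.deg2rad(rotation_deg)
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])
    return decay * R


def rollout(decay, rotation_deg=22.0, push=0.30, steps=60):
    """Iterate z_{t+1} = A z_t + b with a constant push b (direction assumed)."""
    A = make_A(decay, rotation_deg)
    b = np.array([push, 0.0])
    z = np.array([1.0, 0.0])
    traj = [z]
    for _ in range(steps):
        z = A @ z + b
        traj.append(z)
    return np.array(traj)


for decay in (0.92, 1.00, 1.05):
    rho = max(abs(np.linalg.eigvals(make_A(decay, 22.0))))
    final = np.linalg.norm(rollout(decay)[-1])
    print(f"decay={decay:.2f}  spectral radius={rho:.2f}  final |z|={final:.3f}")

# For decay < 1 the rollout converges to the fixed point z* = (I - A)^{-1} b.
A = make_A(0.92, 22.0)
z_star = np.linalg.solve(np.eye(2) - A, np.array([0.30, 0.0]))
print("fixed point for decay 0.92:", z_star.round(3))
```

The spectral radius of $A$ is the single number that separates the inward spiral from the outward one; the push only moves where the spiral is centred.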

PYTHON · RUNNABLE IN-BROWSER

# EQ MM5.1: learn a tiny LINEAR latent-dynamics model, then roll it forward
import numpy as np
rng = np.random.default_rng(0)

d = 3
A_true = np.array([[0.96, 0.04, 0.00],     # the environment's hidden dynamics
                   [0.00, 0.91, 0.09],     # z_{t+1} = A_true @ z_t  (+ noise)
                   [0.05, 0.00, 0.93]])
T = 60
z = np.zeros((T, d)); z[0] = [1.0, 0.0, 0.0]
for t in range(T - 1):
    z[t+1] = A_true @ z[t] + rng.normal(0, 0.01, d)   # observed noisy rollout

A_hat = np.linalg.lstsq(z[:-1], z[1:], rcond=None)[0].T   # fit z_{t+1} ~ A z_t

zhat = np.zeros_like(z); zhat[0] = z[0]
for t in range(T - 1):
    zhat[t+1] = A_hat @ zhat[t]            # IMAGINE forward: open loop, no peeking

one_step = np.sqrt((((A_hat @ z[:-1].T).T - z[1:]) ** 2).mean())
rollout  = np.sqrt(((zhat - z) ** 2).mean())
print("learned A diagonal :", A_hat.diagonal().round(3), "(truth: 0.96 0.91 0.93)")
print("one-step RMSE      :", round(float(one_step), 4))
print("free-rollout RMSE  :", round(float(rollout),  4), "<- errors compound over the horizon")
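try:
    plot_xy(range(T), np.sqrt(((zhat - z) ** 2).sum(1)))   # per-step rollout error
except NameError:                                          # plot_xy exists only in the in-browser runtime
    pass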

5.2

Latent dynamics — Dreamer

The Dreamer line (DreamerV1 → V3, Hafner et al.) is the most complete worked example of latent-dynamics control. Its world model is a Recurrent State-Space Model (RSSM), which splits the latent into two parts: a deterministic recurrent hidden state $h_t$ that carries history, and a stochastic state $z_t$ sampled from a learned distribution. Keeping a stochastic component is what lets the model represent genuine uncertainty about the future rather than a single brittle guess.

The model is trained the way a variational autoencoder is (see Vol II / DEEP LEARNING 05): an encoder proposes a posterior $z_t$ from the actual observation, a transition proposes a prior $\hat z_t$ from the recurrent state alone, and the loss pulls them together while a decoder reconstructs the observation. The complete objective sums a reconstruction term, the reward and termination predictions, and a KL term that is the heart of the world model:

EQ MM5.2 — RSSM PRIOR / POSTERIOR KL $$ h_t = \mathrm{GRU}(h_{t-1},\, z_{t-1},\, a_{t-1}), \qquad \mathcal{L}_{\text{dyn}} = \mathrm{KL}\!\big(\, q(z_t \mid h_t, o_t)\ \big\Vert\ p(z_t \mid h_t)\, \big) $$

The posterior $q$ sees the real observation $o_t$; the prior $p$ (the transition that runs at imagination time) sees only the recurrent state $h_t$. Minimising their KL trains the prior to predict what the posterior knows, i.e. it teaches the transition to anticipate the next latent before the observation arrives. DreamerV3 splits this KL into two stop-gradient terms ("KL balancing"), a dynamics loss that trains the prior and a more lightly weighted representation loss that regularises the posterior, and applies a free-bits floor so neither side collapses. At planning time the observation is gone and only the prior runs, which is exactly why this term, not the pixel reconstruction, is what makes the dream coherent.
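To make the KL term in EQ MM5.2 concrete, here is a minimal numeric sketch. It assumes diagonal-Gaussian latents for readability (DreamerV3 itself uses categorical latents), and the `free_bits=1.0` floor is an illustrative value. NumPy has no autograd, so the stop-gradient part of KL balancing is not shown.

```python
import numpy as np


def kl_diag_gauss(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ), summed over latent dimensions."""
    var_q, var_p = std_q ** 2, std_p ** 2
    per_dim = np.log(std_p / std_q) + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5
    return float(per_dim.sum())


def dyn_loss(mu_q, std_q, mu_p, std_p, free_bits=1.0):
    """KL with a free-bits floor: below the floor the term stops pushing."""
    return max(free_bits, kl_diag_gauss(mu_q, std_q, mu_p, std_p))


# Posterior q(z_t | h_t, o_t): sharp, because it has seen the observation.
mu_q, std_q = np.array([0.8, -0.3]), np.array([0.2, 0.2])
# Prior p(z_t | h_t): a broader guess from the recurrent state alone.
mu_p, std_p = np.array([0.5, 0.0]), np.array([0.6, 0.6])

kl = kl_diag_gauss(mu_q, std_q, mu_p, std_p)
print("KL(q || p)             :", round(kl, 4))
print("KL(q || q)             :", kl_diag_gauss(mu_q, std_q, mu_q, std_q))
print("loss once prior matches:", dyn_loss(mu_q, std_q, mu_q, std_q))
```

Once the prior matches the posterior the raw KL is zero, and the floor keeps the loss from rewarding any further squeezing of the latent.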

With a world model in hand, Dreamer learns its behaviour entirely in imagination. It rolls the RSSM forward in latent space for a short horizon (typically $H = 15$ steps), and trains an actor and a critic purely on these imagined trajectories: the policy-gradient and value-learning machinery of RL (RL 04–05) applied to dreamed data. Because a batched forward pass of the latent transition is usually far cheaper than stepping a real simulator or robot, the agent can practise many imagined steps for every real one, which is a large part of why Dreamer is so sample-efficient.

DreamerV3's 2023 headline is worth stating precisely, because it is a genuine landmark and also frequently overstated. With a single fixed set of hyperparameters, it set state-of-the-art across a remarkable spread of domains — Atari, continuous control, DMLab — and was the first method to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing open challenge. The honest caveats: the symlog/two-hot tricks that make one configuration work everywhere are engineering, not magic; imagined rollouts still suffer compounding model error past their horizon; and "world model" here means a compact game/control simulator, not a general model of physical reality.
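The symlog transform mentioned above is small enough to show in full. This sketch covers only the squashing function and its inverse as defined in the DreamerV3 paper; the two-hot encoding is omitted, and the sample rewards are made up for illustration.

```python
import numpy as np


def symlog(x):
    """sign(x) * ln(|x| + 1): compresses large magnitudes, near-identity around zero."""
    return np.sign(x) * np.log1p(np.abs(x))


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return np.sign(x) * np.expm1(np.abs(x))


rewards = np.array([-1000.0, -1.0, 0.0, 0.5, 20000.0])   # made-up reward scales
squashed = symlog(rewards)
print("raw     :", rewards)
print("symlog  :", squashed.round(3))
print("restored:", symexp(squashed).round(3))
```

Predicting targets in symlog space is one of the ingredients that lets a single configuration cope with domains whose reward scales differ by orders of magnitude.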

A learned scalar latent transition is $ z_{t+1} = a\,z_t + b\,u_t $ with $ a = 0.9 $ and $ b = 0.5 $. The agent is in latent state $ z_t = 2.0 $ and imagines taking the constant action $ u_t = 1 $. What single next latent state $ z_{t+1} $ does the world model predict?

Apply the transition once: $ z_{t+1} = 0.9 \times 2.0 + 0.5 \times 1 = 1.8 + 0.5 = $ 2.3. Chaining this same rule $H$ times — feeding each prediction back in — is one imagined Dreamer rollout; with $a < 1$ the free response decays and the controllable push $b\,u$ is what the actor learns to steer.
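The same arithmetic, chained over a Dreamer-length horizon, looks like this. The horizon of 15 and the constant action are taken from the text; the helper names `step` and `rollout` are illustrative.

```python
a, b = 0.9, 0.5


def step(z, u):
    """One application of the learned transition z_{t+1} = a z_t + b u_t."""
    return a * z + b * u


def rollout(z0, u, H):
    """Feed each prediction back in for H steps under a constant action u."""
    zs = [z0]
    for _ in range(H):
        zs.append(step(zs[-1], u))
    return zs


zs = rollout(2.0, 1.0, 15)
z_fixed = b * 1.0 / (1.0 - a)      # where the rollout settles if u stays at 1
print("z_1     =", round(zs[1], 4))
print("z_15    =", round(zs[-1], 4))
print("z_fixed =", round(z_fixed, 4))
```

Because $a < 1$, the influence of the starting state shrinks by a factor of 0.9 per step, and the rollout drifts toward $b\,u / (1 - a)$, the level set entirely by the action.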

Why not just predict pixels? Early latent models did include a heavy pixel-reconstruction loss, and it works — but it ties the latent's capacity to visual fidelity. DreamerV3 keeps a decoder for grounding, yet the behaviorally important signal flows through the reward and KL terms. The next section takes the argument to its logical end: drop pixel reconstruction altogether.

5.3

JEPA — joint-embedding predictive architectures

Yann LeCun's 2022 position paper, A Path Towards Autonomous Machine Intelligence, makes one architectural commitment the organising principle of the whole programme: do not predict in observation space — predict in representation space. A Joint-Embedding Predictive Architecture (JEPA) encodes both an input $x$ and a target $y$ (a masked region, or a future) into embeddings $s_x, s_y$, and trains a predictor to map the input embedding to the target embedding, never back to pixels.

EQ MM5.3 — JEPA: PREDICT THE EMBEDDING, NOT THE PIXELS $$ s_x = \mathrm{enc}_\theta(x), \quad s_y = \mathrm{enc}_{\bar\theta}(y), \qquad \mathcal{L}_{\text{JEPA}} = \big\Vert\, \mathrm{pred}_\phi(s_x,\, c)\ -\ \mathrm{sg}(s_y)\, \big\Vert^2 $$

A predictor maps the context embedding $s_x$ (plus optional latent variable $c$ for the parts it cannot know) to the target embedding $s_y$. The target encoder $\mathrm{enc}_{\bar\theta}$ is an EMA (exponential moving average) of the online encoder, and $\mathrm{sg}$ is stop-gradient — together they prevent the trivial representation collapse where the encoder maps everything to a constant and the loss hits zero. By predicting an embedding, JEPA is free to discard unpredictable detail (exact textures, leaf positions): the encoder learns to keep what is predictable and throw away what is noise. That is the structural advantage a pixel-reconstruction loss can never have — it is forced to reproduce the noise too.
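A forward-pass sketch of EQ MM5.3 shows both the loss and the collapse it has to guard against. Everything here is a simplifying assumption: linear encoders and predictor, no latent variable $c$, an EMA rate `tau=0.99`, and no autograd, so the stop-gradient appears only as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_EMB = 8, 4

theta = rng.normal(0.0, 0.5, (D_EMB, D_IN))   # online (context) encoder weights
theta_bar = theta.copy()                      # target encoder starts as a copy
phi = np.eye(D_EMB)                           # predictor, linear for this sketch


def jepa_loss(x, y, theta, theta_bar, phi):
    s_x = theta @ x          # context embedding
    s_y = theta_bar @ y      # target embedding, treated as a constant (sg)
    return float(np.sum((phi @ s_x - s_y) ** 2))


def ema_update(theta_bar, theta, tau=0.99):
    """Target weights trail the online weights; no gradient reaches them."""
    return tau * theta_bar + (1.0 - tau) * theta


x = rng.normal(size=D_IN)                # context (e.g. visible patches)
y = x + 0.1 * rng.normal(size=D_IN)      # target (e.g. masked region or future)
loss = jepa_loss(x, y, theta, theta_bar, phi)

# Collapse: an encoder that maps every input to the same vector gets zero loss.
collapsed = np.zeros((D_EMB, D_IN))
collapsed_loss = jepa_loss(x, y, collapsed, collapsed, phi)

print("loss with informative encoder:", round(loss, 4))
print("loss with collapsed encoder  :", collapsed_loss)
```

The collapsed encoder wins on the raw objective, which is why the slowly moving target and the stop-gradient matter: they remove the easy path by which both encoders could slide to a constant together.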

The argument has teeth beyond philosophy. I-JEPA (images, 2023) and V-JEPA / V-JEPA 2 (video, 2024–2025) showed that embedding-prediction self-supervision learns features competitive with or better than reconstruction-based pretraining (masked autoencoders) and contrastive methods, while training faster and without the heavy augmentation pipelines contrastive learning needs. The predictive framing also connects directly to world models: predict a future embedding instead of a masked one and the same architecture becomes a latent dynamics model — V-JEPA 2 is explicitly pitched as a world model for planning, exactly the §5.5 use.

Two honest caveats keep this from being a clean victory. First, collapse is a real and finicky failure mode; the EMA target, stop-gradient, and variance/covariance regularisers (the VICReg family) are load-bearing, not optional. Second, because there is no decoder, you cannot directly visualise what a JEPA has predicted — you only have an embedding — which makes debugging and human inspection harder than in a Dreamer-style model that can render its dream. JEPA trades interpretability for representational efficiency, and whether that trade is the right path to general intelligence is, as of 2026, an active and genuinely contested research bet rather than settled fact.

True or false: a JEPA predicts in embedding (representation) space rather than pixel space — its loss $ \big\Vert \mathrm{pred}_\phi(s_x, c) - \mathrm{sg}(s_y) \big\Vert^2 $ compares a predicted embedding against a target embedding, with no pixel-reconstruction term. (Answer true or false.)

EQ MM5.3 is a squared distance between the predictor's output and the target encoder's embedding $s_y$ — both vectors in representation space. There is no decoder and no pixel target anywhere in the objective; that is the defining JEPA choice and the reason it can ignore unpredictable visual detail. The statement is true.

INSTRUMENT MM5.2 — PIXEL vs EMBEDDING PREDICTION · SAME SCENE · TWO LOSSES · EQ MM5.3

Interactive figure. Controls: unpredictable detail (texture noise), default 0.45; predictable signal (object position), default 0.70. Readouts: pixel-recon loss, embedding loss, wasted on noise.

Two predictors face the same scene: the left bar is a pixel-reconstruction loss, the right is a JEPA embedding loss. The embedding encoder keeps only the predictable signal (where the object is) and drops the unpredictable detail (texture noise) before measuring error — so its loss tracks the signal slider and barely moves with the noise slider. The pixel loss must reproduce everything, so it climbs with noise the model can never predict. Crank the noise: the "wasted on noise" readout is the fraction of the pixel objective spent on bits that carry no behavioral information — capacity a JEPA reclaims.

PYTHON · RUNNABLE IN-BROWSER

# EQ MM5.3: embedding-prediction loss vs pixel-reconstruction loss on a toy
import numpy as np
rng = np.random.default_rng(1)

D = 64                                   # pixels per "frame"
N = 400                                  # samples
pos = rng.uniform(-1, 1, N)             # the one PREDICTABLE factor (object position)
grid = np.linspace(-1, 1, D)

signal = np.exp(-((grid[None, :] - pos[:, None]) ** 2) / 0.05)   # a blob at `pos`
noise  = rng.normal(0, 1.0, (N, D))     # UNPREDICTABLE per-pixel texture
frames = signal + 0.8 * noise           # what a pixel decoder must reproduce

# encoder: a linear readout fitted by least squares (in-sample, on this toy data)
# so that the 1-D embedding recovers position, the predictable part
w = np.linalg.lstsq(frames, pos, rcond=None)[0]   # (D,) weights with frames @ w ~ pos
emb = frames @ w                         # s_y : embedding of each frame

# a predictor that knows position perfectly (best case) vs the two losses it implies
pixel_loss = ((frames - signal) ** 2).mean()       # decoder can't predict the noise
embed_loss = ((emb - pos) ** 2).mean()             # embedding strips the noise away
print(f"pixel-reconstruction loss : {pixel_loss:.3f}  (dominated by texture noise)")
print(f"embedding-prediction loss : {embed_loss:.3f}  (keeps only the position)")
print(f"ratio pixel/embedding     : {pixel_loss / max(embed_loss, 1e-9):.1f}x")
print("JEPA predicts the embedding -> it never pays for noise it cannot predict.")

5.4

Genie & learned interactive simulators

Dreamer and JEPA learn dynamics to control. Genie (Bruce et al., DeepMind, 2024) pushes the world model in a different and striking direction: learn dynamics to generate playable worlds. Trained on roughly 30,000 hours of internet 2D-platformer gameplay video, filtered from a raw collection of 200,000+ hours and carrying no action labels at all, Genie produces, from a single prompt image (a photo, a sketch, or the output of a text-to-image model), an environment you can then step through frame by frame with a controller, even though nobody ever told it what the buttons do.

The trick that makes label-free training possible is a latent action model. Genie has three learned pieces: a video tokenizer (compress frames to discrete tokens), an autoregressive dynamics model (predict the next frame's tokens), and — the key idea — a latent-action module trained to infer, for each pair of consecutive frames, a discrete latent action $a_t$ drawn from a small codebook that best explains the transition.

EQ MM5.4 — GENIE'S LATENT ACTION (INFERRED, NOT LABELLED) $$ a_t = \arg\min_{a \in \mathcal{A}}\ \big\Vert\, x_{t+1} - \mathrm{dyn}_\theta(x_{\le t},\, a)\, \big\Vert, \qquad |\mathcal{A}| = 8 $$

For each transition, the model picks the latent action — from a tiny codebook of just 8 learned actions — that lets the dynamics model best reconstruct the next frame. Because the codebook is small, it is forced to capture controllable, recurring changes (jump, move left, move right) rather than memorise pixels. At training time $a_t$ is inferred from the real next frame; at play time a human supplies $a_t$ and the dynamics model generates the next frame from it. This is how a controllable simulator is learned from passive video with zero action annotations — the single most important idea in the paper.
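The inference rule in EQ MM5.4 can be tried on a toy. This sketch follows the simplified form of the equation rather than Genie's actual training procedure: "frames" are 2-D points instead of video tokens, the 8-entry codebook of displacements is fixed by hand rather than learned, and `dyn` is a hypothetical stand-in for the dynamics model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: 8 latent actions, each a small 2-D displacement.
angles = np.arange(8) * (2 * np.pi / 8)
codebook = 0.1 * np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (8, 2)


def dyn(x_t, a):
    """Toy dynamics model: next frame from the current frame and a latent action id."""
    return x_t + codebook[a]


def infer_latent_action(x_t, x_next):
    """Pick the code whose predicted next frame is closest to the real next frame."""
    errors = [np.linalg.norm(x_next - dyn(x_t, a)) for a in range(len(codebook))]
    return int(np.argmin(errors))


# Unlabelled "video": the action ids that produced it are hidden from inference.
true_actions = rng.integers(0, 8, size=20)
frames = [np.zeros(2)]
for a in true_actions:
    frames.append(dyn(frames[-1], a) + rng.normal(0.0, 0.005, 2))   # small noise

inferred = [infer_latent_action(frames[t], frames[t + 1]) for t in range(20)]
hits = sum(int(i == a) for i, a in zip(inferred, true_actions))
print("recovered", hits, "of 20 hidden actions")

# Play time: a human supplies the action ids and the model generates the frames.
x = frames[0]
for a in (0, 0, 2):
    x = dyn(x, a)
print("frame after pressing 0, 0, 2:", x.round(3))
```

The codebook indices carry no names; which index means "jump" is something a player discovers by pressing it, which is the alignment caveat raised later in this section.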

Why this matters: the binding constraint on training agents has always been the cost of interactive, action-labelled data. Genie loosens it dramatically, because passive video is essentially unlimited. The follow-up, Genie 2 (late 2024), extended the recipe to action-controllable 3D worlds generated from a single image that stay consistent for up to about a minute, positioning learned simulators as potential training grounds for embodied agents (the bridge to the next chapter). Related lines (Google's GameNGen reproducing DOOM as a neural simulator, and the broad family of video-diffusion-as-world-model systems) point at the same convergence: a sufficiently good video predictor is an interactive environment.

The caveats are real and current. These simulators hallucinate and drift over long horizons; physical consistency (object permanence, conservation) is approximate, not guaranteed; frame rates and resolutions remain well below real-time photorealism for long rollouts; and inferred latent actions are not guaranteed to align with any human control scheme. "Learned interactive simulator" in 2026 means an impressive, improving research artifact — not a drop-in replacement for a physics engine.

5.5

World models for planning & RL

The payoff of all this machinery is that a world model turns reinforcement learning from a problem of acting into a problem of imagining. There are two dominant ways to spend a learned model.

Background planning (Dreamer-style). Use imagined rollouts to train a fast reactive policy, then act with the policy alone. This is the actor–critic-in-imagination loop of §5.2: cheap to run at deployment because the world model is only used during training.

Decision-time planning (MuZero / MPC-style). Use the model at the moment of acting to search over action sequences and execute the first action of the best plan. The canonical objective is to choose the action sequence whose imagined return is greatest:

EQ MM5.5 — PLANNING AS IMAGINED-RETURN MAXIMISATION $$ a_{t:t+H-1}^{\star} = \arg\max_{a_{t:t+H-1}}\ \mathbb{E}\!\left[\, \sum_{k=0}^{H-1} \gamma^{k}\, \hat r_{t+k+1} \;+\; \gamma^{H} \hat V(\hat z_{t+H}) \,\right], \qquad \hat z_{t+k+1} = f_\theta(\hat z_{t+k}, a_{t+k}), \quad \hat r_{t+k+1} = g_\theta(\hat z_{t+k}, a_{t+k}) $$

Roll the learned transition $f_\theta$ forward over a horizon $H$, sum the imagined rewards $\hat r$ discounted by $\gamma$, and add a learned value $\hat V$ at the horizon to account for everything past it (the same bootstrapping idea as RL 03). The agent picks the plan with the highest imagined return and executes only its first action, then re-plans — model-predictive control. MuZero is the celebrated instance: it learns $f_\theta, \hat r, \hat V$ and runs Monte-Carlo Tree Search over them, mastering Go, chess, shogi and Atari without being given the rules. The deep caveat: this is only as good as the model — search amplifies model error, so a plan can confidently exploit dynamics the world does not actually have.
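A random-shooting version of this loop fits in a few dozen lines. It is a sketch under loud assumptions: the "learned" transition `f` is a hand-written single integrator, the real environment is taken to be identical to the model, `V_hat` is a crude closed-form stand-in for a learned value, and plans are sampled uniformly rather than searched with MCTS as MuZero does.

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA, H, N_PLANS = 0.95, 5, 256
z_goal = np.array([1.0, 0.5])


def f(z, a):
    """Hypothetical learned transition: the action nudges the latent directly."""
    return z + 0.2 * a


def r_hat(z_next):
    """Imagined reward: negative squared distance to the goal."""
    return -float(np.sum((z_next - z_goal) ** 2))


def V_hat(z):
    """Crude stand-in for a learned value: the return from staying at z forever."""
    return r_hat(z) / (1.0 - GAMMA)


def imagined_return(z, plan):
    """Discounted imagined rewards over the plan plus a bootstrapped tail value."""
    total = 0.0
    for k, a in enumerate(plan):
        z = f(z, a)
        total += GAMMA ** k * r_hat(z)
    return total + GAMMA ** len(plan) * V_hat(z)


def plan_first_action(z):
    """Sample candidate plans, score each in imagination, return the winner's first action."""
    plans = rng.uniform(-1.0, 1.0, size=(N_PLANS, H, 2))
    returns = np.array([imagined_return(z, p) for p in plans])
    return plans[int(np.argmax(returns))][0]


z = np.zeros(2)
start_dist = np.linalg.norm(z - z_goal)
for _ in range(30):
    a = plan_first_action(z)   # plan, execute one action, then re-plan
    z = f(z, a)                # the "real" step; here it equals the model
end_dist = np.linalg.norm(z - z_goal)
print(f"distance to goal: {start_dist:.3f} -> {end_dist:.3f}")
```

Replacing the real step with a transition that differs slightly from `f` is the quickest way to see the caveat above: the planner keeps reporting high imagined returns while the executed trajectory misses.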

The recurring tension across §5.1–5.5 is the same one the latent-rollout instrument showed: compounding error. A one-step prediction can be excellent and an $H$-step rollout still useless, because each small error feeds the next step's input. This is why horizons are short (Dreamer's $H \approx 15$), why uncertainty-aware models that know when to distrust themselves matter, and why background planning (which only needs the model to be locally right) is often more robust than long decision-time search (which needs it globally right). The frontier question for 2026 — pursued by V-JEPA 2, Genie 2 and the broader video-world-model crowd — is whether a single large pretrained world model can be accurate enough, over long enough horizons, to plan real-world behavior. It is genuinely open.
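The compounding-error claim can be made concrete in the linear case. The following is a sketch under assumptions that are not in the text above: noise-free linear latent dynamics $z_{t+1} = A z_t$, a learned model $\hat A = A + \Delta$ rolled out open loop from the same $z_0$, and a common bound $\rho$ on the operator norms of $A$ and $\hat A$.

$$ \hat z_H - z_H = (\hat A^{H} - A^{H})\, z_0 = \sum_{k=0}^{H-1} \hat A^{\,H-1-k}\, \Delta\, A^{k}\, z_0, \qquad \Vert \hat z_H - z_H \Vert \;\le\; H\, \rho^{H-1}\, \Vert \Delta \Vert\, \Vert z_0 \Vert $$

The one-step error is at most $\Vert \Delta \Vert\, \Vert z_0 \Vert$, but the $H$-step bound carries the extra factor $H \rho^{H-1}$: linear in the horizon when $\rho = 1$, geometric when $\rho > 1$, and eventually shrinking when $\rho < 1$, which matches the three regimes of the latent-rollout instrument.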

INSTRUMENT MM5.3 — IMAGINED-TRAJECTORY PLANNER · REACH THE GOAL · SAMPLE PLANS IN IMAGINATION · EQ MM5.5

Interactive figure. Controls: candidate plans sampled, default 24; horizon H, default 10; model error (drift / step), default 0.04. Readouts: best imagined return, chosen plan's miss, plans evaluated.

The agent (mint dot) wants to reach the goal (blue ring). It samples many candidate action sequences, imagines each one forward with its world model (faint grey trajectories), scores them by imagined return — closeness to the goal, EQ MM5.5 — and executes the winner (bright mint). Raise "plans sampled" and the chosen path improves: this is the random-shooting flavour of model-predictive control. Now raise model error and watch the winner's real miss grow even as its imagined return still looks great — search amplifies model error, the core danger of planning in a flawed dream. Hit RE-IMAGINE to resample.

A world model that can imagine and plan is only half an agent — the other half has a body. Chapter 06 turns to embodied AI: vision-language-action models, sim-to-real transfer, and how the latent dynamics of this chapter become the inner loop of robots that act in the physical world.

5.R

References

Ha, D. & Schmidhuber, J. (2018). World Models. arXiv:1803.10122; the NeurIPS 2018 version is titled Recurrent World Models Facilitate Policy Evolution — the foundational demonstration: train an agent inside its own learned dream and transfer to the real environment.
LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview — the JEPA position paper; predict in representation space, not pixel space (EQ MM5.3).
Hafner, D., Pasukonis, J., Ba, J. & Lillicrap, T. (2023). Mastering Diverse Domains through World Models (DreamerV3). arXiv — one fixed hyperparameter set across 150+ tasks; first to mine diamonds in Minecraft from scratch (EQ MM5.2).
Bruce, J. et al. (2024). Genie: Generative Interactive Environments. ICML 2024 — latent-action world model learned from unlabelled gameplay video; playable worlds from one prompt (EQ MM5.4).
Assran, M. et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv — embedding-prediction self-supervision scaled to video as a world model for planning; see also Assran et al., I-JEPA (arXiv:2301.08243).
Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). Nature 2020 — decision-time planning with a learned latent model and MCTS, without being given the rules (EQ MM5.5).
Valevski, D. et al. (2024). Diffusion Models Are Real-Time Game Engines (GameNGen). arXiv — a neural network simulating DOOM interactively; a video predictor used as a playable environment.