MULTIMODAL & WORLD MODELS · CHAPTER 06 / 06

Embodied AI & Robotics

Every modality so far in this volume describes the world: text, images, audio, video. Action is the modality that changes it. Vision-language-action models put a transformer in control of a robot, provided there is enough data to train it. The architecture is the tractable part: a pretrained vision-language model, motor commands written as tokens, and a policy mapping pixels and an instruction to the next move. The constraint is data. The internet holds no demonstrations of folding laundry, and a robot learning from its own mistakes can damage itself in the process.

LEVEL: ADVANCED · READING TIME: ≈ 24 MIN · BUILDS ON: MM 01–05, RL 04 · INSTRUMENTS: ACTION TOKENS, SIM-TO-REAL, IL vs RL
6.1

From perception to action

A language model and a robot policy are the same shape of object. Both consume a context and emit the next symbol — for the LLM a word piece, for the robot a motor command. The difference is the consequence: the LLM's mistake costs a token, the robot's mistake costs a dropped cup or a stripped gear. Formally, control is a partially observed Markov decision process. At each step the agent receives an observation \(o_t\) (camera frames, joint encoders, an instruction), maintains a belief, and emits an action \(a_t\); the environment transitions and pays a reward.

EQ MM6.1 — THE CONTROL POLICY $$ a_t \sim \pi_\theta\!\left(a_t \mid o_{\le t},\, \ell\right), \qquad o_t = (\text{image}_t,\ \text{proprioception}_t), \quad \ell = \text{language instruction} $$
A policy \(\pi_\theta\) maps the history of observations \(o_{\le t}\) and a goal \(\ell\) to a distribution over actions. Swap "next token" for "next action" and a decoder-only transformer is a policy — that single substitution is the whole bet of embodied foundation models. The action \(a_t\) is usually a low-dimensional continuous vector: end-effector pose deltas, gripper open/close, sometimes joint torques.
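The interface in EQ MM6.1 can be sketched as a closed loop. What follows is a toy illustration under stated assumptions, not any real robot stack: `policy`, `step`, the instruction strings in `GOALS`, the gain, and the actuator limit are all hypothetical stand-ins, and the "observation" is just a 1-D position.

```python
# Toy closed-loop rollout of a_t ~ pi(a_t | o_<=t, l).
# Everything here is a hypothetical stand-in: a 1-D "robot" whose
# observation is its own position and whose action is a position delta.

GOALS = {"go left": -1.0, "go right": 1.0}   # assumed instruction set

def policy(obs_history, instruction):
    """Stand-in for pi_theta: a clipped proportional controller."""
    goal = GOALS[instruction]
    position = obs_history[-1]               # this toy policy reads only the latest observation
    action = 0.5 * (goal - position)
    return max(-0.2, min(0.2, action))       # assumed actuator limit on the per-step delta

def step(position, action):
    """Toy transition: the environment simply applies the delta."""
    return position + action

position = 0.0
history = [position]
for t in range(30):
    a = policy(history, "go right")          # observe, then act
    position = step(position, a)             # the action changes the world...
    history.append(position)                 # ...and therefore the next observation

print(f"final position: {position:.4f}")
```

The loop makes the structural point of this section visible: each action feeds back into the next observation, which is what separates control from one-shot prediction.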

Three properties make action harder than text and force everything that follows.

  • The output is continuous. A 7-DoF arm command is seven real numbers, not a choice from a fixed vocabulary. To reuse the cross-entropy machinery of an LLM you must discretize the action into tokens (§6.2) — or replace the head with a continuous generator (a diffusion or flow model).
  • Errors compound. Each action changes the world the next observation is drawn from, so a small per-step mistake drifts the robot into states the policy never trained on. This covariate shift is the central pathology of imitation learning (§6.4).
  • Real data is brutally expensive. A web crawl yields trillions of text tokens for free; a robot demonstration is a human teleoperating a physical arm in real time. The entire field is organized around this scarcity (§6.5).

The reward, transition, and value machinery underneath EQ MM6.1 is the subject of the Reinforcement Learning volume; here we treat the MDP as given and focus on what is unique to embodiment — turning a perception model into something that moves.

6.2

Vision-language-action models (RT-2, π0)

A vision-language-action (VLA) model is a vision-language model whose output space has been extended to include motor commands. The provocation of RT-2 (Brohan et al., 2023) was to make that extension almost free: take a VLM already trained on web images and text, and represent each robot action as a short string of tokens drawn from the model's existing vocabulary. The model then generates an action the exact way it generates a sentence — autoregressively, one token at a time — and is co-trained on web vision-language data and robot trajectories together, so internet-scale semantics leak into the robot's behavior.

Action tokenization

The bridge from continuous control to a token model is discretization. Clip each action dimension to a working range \([\,a_{\min}, a_{\max}]\), split that range into \(B\) uniform bins, and map a value to the index of its bin. RT-2 used \(B = 256\) bins per dimension, repurposing 256 of the language model's least-used token ids as the "action vocabulary".

EQ MM6.2 — ACTION TOKENIZATION (UNIFORM BINNING) $$ \Delta = \frac{a_{\max} - a_{\min}}{B}, \qquad t = \mathrm{clip}\!\left(\left\lfloor \frac{a - a_{\min}}{\Delta} \right\rfloor,\ 0,\ B-1\right), \qquad \hat{a} = a_{\min} + \left(t + \tfrac{1}{2}\right)\Delta $$
Encode a real action \(a\) to a discrete token \(t \in \{0,\dots,B-1\}\); decode by returning the center of bin \(t\). The round-trip is lossy: the worst-case error is half a bin, \(|a - \hat a| \le \Delta/2\). With \(B = 256\) over a normalized range \([-1, 1]\), \(\Delta = 2/256\) and the maximum error is \(\Delta/2 = 1/256 \approx 0.0039\), which is \(1/512\) of the full range of width 2. That is fine for end-effector deltas, but it is the reason fine-grained policies later moved to continuous heads.
An action dimension is normalized to \([-1, 1]\) and tokenized with \(B = 256\) uniform bins (EQ MM6.2). What is the worst-case quantization error \(\Delta/2\) (the largest possible \(|a - \hat a|\))?
Bin width \(\Delta = \dfrac{a_{\max}-a_{\min}}{B} = \dfrac{1-(-1)}{256} = \dfrac{2}{256} = 0.0078125\). Decoding to the bin center, the worst case is half a bin: \(\Delta/2 = 0.0078125 / 2 = \) 0.00390625. That is \(1/256\) in absolute terms, or \(1/512\) of the full range of width 2: the resolution ceiling a 256-bin tokenizer imposes on every action dimension.
True or false: RT-2 outputs robot actions as tokens emitted by a vision-language model — the same model, the same autoregressive decoding, with a slice of the vocabulary repurposed as discretized action bins. (Answer true or false.)
This is RT-2's defining design. Each action dimension is binned (EQ MM6.2) into one of 256 levels, those levels are mapped onto 256 existing token ids, and the VLM generates an action string token-by-token exactly as it would generate a caption. Co-training on web vision-language data and robot trajectories lets internet semantics transfer to control. The statement is true.
PYTHON · RUNNABLE IN-BROWSER
# EQ MM6.2: discretize a continuous action into tokens, then round-trip back
import numpy as np
rng = np.random.default_rng(0)

a_min, a_max, B = -1.0, 1.0, 256          # range and number of bins (RT-2 used 256)
delta = (a_max - a_min) / B               # bin width

def encode(a):                            # continuous -> token id
    a = np.clip(a, a_min, a_max)
    return np.clip(((a - a_min) / delta).astype(int), 0, B - 1)

def decode(t):                            # token id -> bin CENTER
    return a_min + (t + 0.5) * delta

a = rng.uniform(a_min, a_max, 7)          # a 7-DoF end-effector command
tok = encode(a)
a_hat = decode(tok)
err = np.abs(a - a_hat)

np.set_printoptions(precision=4, suppress=True)
print("action   :", a)
print("tokens   :", tok)                  # the 7 ints RT-2 would emit
print("decoded  :", a_hat)
print(f"max error: {err.max():.6f}   (theory bound delta/2 = {delta/2:.6f})")
print("within bound:", bool(err.max() <= delta / 2 + 1e-12))
INSTRUMENT MM6.1: ACTION-TOKENIZATION EXPLAINER · CONTINUOUS → TOKEN → DECODED · EQ MM6.2 (interactive; readouts: bin width Δ, token id t, decoded â (center), quantization error |a−â|)
The bar is one action dimension's range \([-1,1]\) sliced into \(B\) bins; the white tick is your continuous value \(a\), the mint cell is the bin it lands in, and the mint dot is the decoded center \(\hat a\). Drop \(B\) to 4 and the error gets coarse and visible; push \(B\) toward 256 (the slider is in powers of two) and the decoded value snaps onto \(a\) — the round-trip error halves every time you double the bins, the exact \(\Delta/2\) law of EQ MM6.2.
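The halving law described in the caption can be checked numerically. The sketch below sweeps \(B\) over powers of two and measures the worst round-trip error on a dense grid of values; the helper name `round_trip` and the grid density are illustrative choices, not part of the instrument.

```python
import numpy as np

a_min, a_max = -1.0, 1.0
a = np.linspace(a_min, a_max, 10001)          # dense sweep over one action dimension

def round_trip(a, B):
    """Encode to a bin index and decode to the bin center (EQ MM6.2)."""
    delta = (a_max - a_min) / B
    t = np.clip(np.floor((a - a_min) / delta).astype(int), 0, B - 1)
    return a_min + (t + 0.5) * delta

max_err = {}
for B in (4, 8, 16, 32, 64, 128, 256):
    delta = (a_max - a_min) / B
    max_err[B] = np.abs(a - round_trip(a, B)).max()
    print(f"B={B:4d}  delta/2={delta / 2:.6f}  observed max error={max_err[B]:.6f}")
```

The worst case is reached at the edges of a bin (for instance at \(a = a_{\min}\)), so the observed maximum sits on the \(\Delta/2\) bound and halves with each doubling of \(B\).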

π0 and the move to continuous actions

π0 (Black et al., 2024) keeps the VLM backbone but rejects discretization for fine manipulation. Instead of emitting binned tokens, it attaches a separate action expert that produces continuous action chunks — short horizons of future actions — using a flow-matching objective borrowed from modern image and video generators. The policy learns a velocity field that transports noise to an action sequence; sampling integrates that field. This buys two things discretization cannot: smooth, high-frequency control (π0 runs up to ~50 Hz) and the ability to commit to a coherent multi-step motion rather than re-deciding every frame.

EQ MM6.3 — FLOW-MATCHING ACTION HEAD (π0-STYLE) $$ \mathcal{L}(\theta) = \mathbb{E}_{\tau,\, a_0,\, a_1}\Big[\big\lVert\, v_\theta(a_\tau,\, o,\, \ell;\, \tau) - (a_1 - a_0)\,\big\rVert^2\Big], \qquad a_\tau = (1-\tau)\,a_0 + \tau\, a_1 $$
\(a_1\) is the expert action chunk, \(a_0 \sim \mathcal{N}(0, I)\) is noise, and \(a_\tau\) interpolates between them at flow time \(\tau \in [0,1]\). The network \(v_\theta\) is trained to predict the constant velocity \((a_1 - a_0)\) that carries noise to data; at inference you draw \(a_0\) and integrate \(v_\theta\) from \(\tau = 0\) to \(\tau = 1\) to recover a continuous action chunk. There are no bins and no quantization floor; the cost is a small iterative sampler instead of a single argmax. This is the same conditional-flow-matching objective used for image generation, now conditioned on pixels and an instruction.
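The two halves of EQ MM6.3, building a regression target and sampling by integration, can be sketched without a neural network. This is a minimal illustration under a strong assumption: there is a single expert chunk, so the loss minimizer has a closed form that stands in for the trained \(v_\theta\). The chunk shape, the step count `K`, and the name `oracle_velocity` are all hypothetical, and nothing here reflects π0's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 8, 7                                   # assumed chunk: 8 future steps of a 7-D action
a1 = rng.uniform(-1, 1, (H, D))               # stand-in for one expert action chunk
a0 = rng.normal(0, 1, (H, D))                 # noise sample a_0 ~ N(0, I)

def oracle_velocity(a_tau, tau):
    # With a single target chunk, the minimizer of EQ MM6.3 has this closed form
    # (valid for tau < 1). A real policy would use a network v_theta(a_tau, o, l; tau).
    return (a1 - a_tau) / (1.0 - tau)

# Training side: build one regression pair and evaluate the loss for the oracle.
tau = 0.3
a_tau = (1 - tau) * a0 + tau * a1             # interpolant between noise and data
target = a1 - a0                              # constant velocity the network regresses to
loss = np.mean((oracle_velocity(a_tau, tau) - target) ** 2)

# Inference side: Euler-integrate the velocity field from tau = 0 to tau = 1.
K = 10                                        # number of integration steps (assumed)
a = a0.copy()
for k in range(K):
    a = a + (1.0 / K) * oracle_velocity(a, k / K)

print(f"oracle loss at tau={tau}: {loss:.2e}")
print(f"max |sampled - expert| after {K} Euler steps: {np.abs(a - a1).max():.2e}")
```

With many expert chunks the velocity field no longer has this closed form and the sampler lands on a distribution of plausible chunks rather than one target, which is the case a learned network handles.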

The honest caveat. Whether tokenized (RT-2, OpenVLA) or continuous (π0, diffusion policies), VLAs in 2026 are real but narrow: they generalize impressively across objects and phrasing they were broadly exposed to, yet remain brittle to genuinely novel scenes, long horizons, and lighting they have not seen. The benchmarks are not yet standardized, success rates are reported on small task suites, and "zero-shot" claims deserve scrutiny — the field's own researchers say so.

6.3

Sim-to-real transfer

If real demonstrations are scarce (§6.5), simulation is the obvious escape: a physics engine can generate millions of trajectories overnight, with perfect labels and no hardware to break. The catch has a name — the reality gap. A simulator is an approximation: contact dynamics, friction, sensor noise, latency, lighting, and the exact mass of every object differ from the real world. A policy that overfits to the simulator's quirks excels in sim and fails on the robot.

True or false: sim-to-real transfer addresses the reality gap — the mismatch between a simulator's dynamics, sensing, and appearance and those of the physical world. (Answer true or false.)
Yes. Sim-to-real is the discipline of training a policy in simulation and deploying it on hardware despite the reality gap. Every technique below — domain randomization, system identification, real-world fine-tuning — is a way to shrink or paper over that mismatch. The statement is true.

The dominant fix is domain randomization: rather than try to match reality precisely, randomize the simulator's parameters — masses, frictions, textures, lighting, sensor delays, camera pose — so widely that the real world looks like just another sample from the training distribution. If the policy is robust across thousands of simulated "physics", it has no reason to depend on the specific physics it will eventually meet.

EQ MM6.4 — DOMAIN RANDOMIZATION OBJECTIVE $$ \theta^\star = \arg\max_\theta\ \mathbb{E}_{\,\xi \sim p(\xi)}\Big[\, \mathbb{E}_{\tau \sim \pi_\theta,\, \xi}\big[\,R(\tau)\,\big] \Big], \qquad \xi = (\text{mass},\ \text{friction},\ \text{lighting},\ \text{latency},\ \dots) $$
\(\xi\) is a vector of simulator parameters drawn from a chosen distribution \(p(\xi)\); the policy maximizes expected return averaged over the whole family of simulators, not one. The wager: if \(p(\xi)\) is wide enough to contain reality, the reality gap collapses into ordinary in-distribution generalization. Too narrow and the policy still overfits; too wide and it learns an over-conservative average policy that is mediocre everywhere — randomization range is the central knob.

Two complements sharpen this. System identification measures the real robot to center \(p(\xi)\) on the truth, narrowing the randomization to a useful band. Real-world fine-tuning takes the sim-trained policy and adapts it on a small batch of physical episodes, often the highest-leverage hour of the whole pipeline. The trade-off is an inverted U in real-world return as a function of randomization width (equivalently, a U in real-world cost): too little and sim performance does not transfer; too much and the policy sacrifices competence to robustness.
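The width trade-off can be made concrete with a small sweep. The sketch below is a toy under explicit assumptions: a scalar gain, an asymmetric cost in which under-compensating friction is four times worse than over-compensating (chosen so that the average-optimal gain moves with the width), and evenly spaced frictions standing in for \(p(\xi)\). The values `mu_sim = 1.0` and `mu_real = 1.2` are illustrative, not measurements.

```python
import numpy as np

def cost(g, mu):
    # Assumed toy cost: under-compensating friction (g < mu) is 4x worse
    # than over-compensating; it is still minimized when g == mu.
    return np.where(g < mu, 4.0, 1.0) * (g - mu) ** 2

mu_sim, mu_real = 1.0, 1.2                    # nominal sim friction vs. the shifted real one
grid = np.linspace(0.5, 1.5, 1001)            # candidate gains

def robust_gain(width):
    # Frictions spread evenly over [mu_sim - width, mu_sim + width], standing in for p(xi).
    band = np.linspace(mu_sim - width, mu_sim + width, 401)
    avg_cost = cost(grid[:, None], band[None, :]).mean(axis=1)
    return grid[avg_cost.argmin()]            # gain that is best on average over the band

widths = np.linspace(0.0, 1.0, 6)             # 0.0, 0.2, ..., 1.0
gains = np.array([robust_gain(w) for w in widths])
real_cost = cost(gains, mu_real)

for w, g, c in zip(widths, gains, real_cost):
    print(f"width {w:.1f} -> gain {g:.3f} -> real cost {c:.4f}")
best = int(real_cost.argmin())
print(f"best width in this sweep: {widths[best]:.1f}")
```

In this toy the average-optimal gain grows roughly as \(1 + \text{width}/3\), so real-world cost falls until the gain reaches the real friction and rises again once randomization overshoots it: a U in cost, an inverted U in return.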

PYTHON · RUNNABLE IN-BROWSER
# Sim-to-real toy: a sim-tuned gain vs a domain-randomized one, on a shifted "real" plant
import numpy as np
rng = np.random.default_rng(1)

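# A 1-D plant: the best control gain depends on an unknown friction parameter mu.
# Cost of using gain g on a plant with friction mu (still minimized when g == mu).
# Toy assumption: under-compensating friction (g < mu) costs 4x more than
# over-compensating. With a symmetric cost the average-optimal gain is just the
# band mean (about mu_sim), and randomization would buy nothing in this toy.
def cost(g, mu):
    return np.where(g < mu, 4.0, 1.0) * (g - mu) ** 2 + 0.05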

mu_sim = 1.0                                   # the simulator's (wrong) friction
g_naive = mu_sim                               # policy that overfits the single sim

# Domain randomization: train over a band of frictions, keep the gain that is best ON AVERAGE
band = rng.uniform(0.6, 1.4, 400)              # p(xi): randomized frictions
grid = np.linspace(0.5, 1.5, 101)
avg_cost = np.array([cost(g, band).mean() for g in grid])
g_dr = grid[avg_cost.argmin()]                 # the robust gain

# "Reality" is shifted away from the nominal sim:
mu_real = 1.3
print(f"naive  (sim-only) gain {g_naive:.2f} -> real cost {cost(g_naive, mu_real):.4f}")
print(f"domain-randomized gain {g_dr:.2f} -> real cost {cost(g_dr,    mu_real):.4f}")
print("randomization survives the reality gap:",
      cost(g_dr, mu_real) < cost(g_naive, mu_real))
INSTRUMENT MM6.2: SIM-TO-REAL GAP VISUALIZER · RANDOMIZATION WIDTH vs REAL-WORLD RETURN · EQ MM6.4 (interactive; readouts: sim return, real return, sim−real drop)
The curves are real-world return as a function of randomization width, for your chosen reality gap. With zero width the policy is a sim specialist: high sim return, but it falls off a cliff on the shifted real plant. Widen randomization and real return climbs to a peak, then declines as the policy grows over-conservative: an inverted U in return, the width trade-off behind EQ MM6.4. Increase the reality gap and the whole real-return curve sinks and its optimum shifts to wider randomization: the further reality is from your nominal sim, the more you must randomize to cover it.
6.4

Imitation & RL for control

Two recipes turn data into a policy, and they sit at opposite ends of a sample-efficiency / safety trade-off.

Imitation learning (behavior cloning). Collect expert demonstrations \(\{(o_i, a_i)\}\) and fit the policy by supervised learning — minimize the discrepancy between the policy's action and the expert's on the same observations. It is exactly LLM pretraining with actions for tokens: stable, sample-efficient, and the workhorse behind RT-2, π0, and ALOHA's bimanual policies.

EQ MM6.5 — BEHAVIOR CLONING $$ \theta^\star = \arg\min_\theta\ \frac{1}{N}\sum_{i=1}^{N} \big\lVert\, \pi_\theta(o_i) - a_i \,\big\rVert^2 \qquad\text{(continuous actions; cross-entropy for tokenized)} $$
Pure supervised regression of expert actions onto observations. Its fatal flaw is covariate shift: the policy is trained only on states the expert visited, but at test time its own small errors push it into states it never saw, where errors are larger still — a drift that compounds quadratically in the horizon. DAgger patches this by iteratively querying the expert on states the learner actually reaches; action chunking (ACT, π0) reduces the number of decision points and so the number of chances to drift.
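The quadratic growth can be seen in a deliberately simple model. This is an illustration of compounding drift, not the formal analysis of Ross et al.: it assumes the expert holds the state at 0, the cloned policy adds a constant bias `eps` to every action, and the plant integrates actions. The function name and the numbers are hypothetical.

```python
# Toy compounding-error model: per-step action error accumulates in the state,
# and the cost is the total distance from the states the expert demonstrated.

def cumulative_deviation(eps, horizon):
    x, total = 0.0, 0.0
    for _ in range(horizon):
        x += eps            # the per-step action error moves the state off the expert's path
        total += abs(x)     # cost paid for being away from the demonstrated states
    return total

eps = 0.01
for T in (50, 100, 200, 400):
    print(f"T={T:4d}  cumulative deviation={cumulative_deviation(eps, T):8.2f}")
```

The total here is \(\varepsilon\,T(T+1)/2\), so doubling the horizon roughly quadruples it. Acting in chunks of \(H\) steps cuts the number of decision points from \(T\) to about \(T/H\), which is the intuition behind action chunking.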
Behavior-clone a 1-D policy \(a = w\,o\) (no intercept) on four expert pairs \((o,a)\): \((1,1),\ (2,2),\ (3,2),\ (4,3)\). The least-squares slope is \(w = \dfrac{\sum o_i a_i}{\sum o_i^2}\). Compute \(w\) (round to two decimals).
\(\sum o_i a_i = 1\!\cdot\!1 + 2\!\cdot\!2 + 3\!\cdot\!2 + 4\!\cdot\!3 = 1 + 4 + 6 + 12 = 23\). \(\sum o_i^2 = 1 + 4 + 9 + 16 = 30\). So \(w = \dfrac{23}{30} = 0.7\overline{6} \approx\) 0.77 — the cloned policy's slope, recovered from demonstrations alone (EQ MM6.5). The pycell below fits the same kind of line and prints its imitation error.
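The arithmetic in the worked answer can be reproduced in a few lines. This is only a check of the closed-form slope for the four pairs given in the exercise; the variable names are illustrative.

```python
# No-intercept least squares: w = sum(o_i * a_i) / sum(o_i ** 2).
o = [1, 2, 3, 4]
a = [1, 2, 2, 3]

num = sum(oi * ai for oi, ai in zip(o, a))    # 1 + 4 + 6 + 12
den = sum(oi * oi for oi in o)                # 1 + 4 + 9 + 16
w = num / den

# At the least-squares solution the residuals are orthogonal to the observations.
residuals = [ai - w * oi for oi, ai in zip(o, a)]
orthogonality = sum(oi * ri for oi, ri in zip(o, residuals))

print(f"w = {num}/{den} = {w:.2f}")
print(f"sum(o_i * residual_i) = {orthogonality:.1e}")
```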

Reinforcement learning. When you have a reward instead of (or in addition to) demonstrations, RL lets the policy improve past the demonstrator by trial and error, which imitation alone cannot do. The price is sample efficiency: RL can need orders of magnitude more interaction than imitation, and on physical hardware every interaction is slow, costly, and potentially destructive. The pragmatic stack is therefore imitation first, RL second: behavior-clone a competent base policy from demonstrations, then fine-tune with RL (often in simulation, then transferred per §6.3) to gain the last few points of reliability.
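The imitation-first, RL-second recipe can be sketched on a one-parameter policy. This is a toy under loud assumptions: the reward function is invented, the expert is deliberately suboptimal, and the second stage uses random hill climbing as a stand-in for RL (real fine-tuning would use policy gradients or a similar method on rollouts). All names and constants are hypothetical.

```python
import random

random.seed(0)

def reward(w):
    # Assumed task reward, maximized by a slope of 1.0 that the expert does not reach.
    return -(w - 1.0) ** 2

# Stage 1, imitation: clone an expert whose slope is 0.8 (least squares, no intercept).
obs = [0.5, 1.0, 1.5, 2.0]
acts = [0.8 * o for o in obs]
w_bc = sum(o * a for o, a in zip(obs, acts)) / sum(o * o for o in obs)

# Stage 2, reward-driven fine-tuning: keep a random perturbation only if reward improves.
w_rl = w_bc
for _ in range(200):
    candidate = w_rl + random.gauss(0.0, 0.05)
    if reward(candidate) > reward(w_rl):
        w_rl = candidate

print(f"cloned slope     {w_bc:.3f}  reward {reward(w_bc):.4f}")
print(f"fine-tuned slope {w_rl:.3f}  reward {reward(w_rl):.4f}")
```

The cloned policy inherits the expert's ceiling; only the reward signal can move it past that point, and starting from the cloned slope means the search begins close to a good solution instead of from scratch.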

PYTHON · RUNNABLE IN-BROWSER
# Behavior cloning on a toy trajectory: fit a linear policy, print imitation error (EQ MM6.5)
import numpy as np
rng = np.random.default_rng(2)

# An "expert" policy is roughly linear in the observation, with a little noise.
N = 60
o = np.linspace(-2, 2, N)                       # 1-D observation along a trajectory
expert = 0.8 * o + 0.3                          # the true expert action
a = expert + rng.normal(0, 0.05, N)             # noisy demonstrations

# Behavior cloning = least-squares fit of pi(o) = w*o + b to the demos.
X = np.column_stack([o, np.ones(N)])            # design matrix [o, 1]
w, b = np.linalg.lstsq(X, a, rcond=None)[0]     # closed-form BC solution
pred = w * o + b

imit_err = np.sqrt(np.mean((pred - a) ** 2))    # imitation (training) error, RMSE
print(f"fitted policy : a_hat = {w:.3f} * o + {b:.3f}")
print(f"true expert   : a     = 0.800 * o + 0.300")
print(f"imitation error (RMSE vs demos): {imit_err:.4f}")
print("BC recovers the expert's slope and intercept from demonstrations alone.")
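# plot_scatter is a helper supplied by the in-browser runtime; skip the plot elsewhere.
if "plot_scatter" in globals():
    plot_scatter(o, a)                          # demos vs the fitted line's domain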
INSTRUMENT MM6.3: IMITATION vs RL SAMPLE EFFICIENCY · SUCCESS RATE vs EPISODES · DETERMINISTIC (interactive; readouts: imitation (ceiling), pure RL @ budget, IL→RL @ budget)
Three learning curves on the same task. Imitation jumps to a high success rate almost immediately but plateaus at the demonstrator's ceiling — it cannot exceed its teacher. Pure RL starts at zero and crawls up the sample-efficiency curve, eventually surpassing imitation but only after a large exploration budget. IL→RL behavior-clones first, then fine-tunes — inheriting imitation's fast start and RL's higher ceiling. Add more demonstrations and both IL curves rise; cut the RL budget and pure RL never catches up. This is exactly why production robotics warm-starts RL with imitation.
6.5

The data bottleneck in robotics

Every chapter in this volume has ridden the same wave: a modality unlocks once enough paired data exists. Text had the web; images had alt-text and captions; video had YouTube. Robotics has no such corpus. The internet records what the world looks like, never the torques and grasps that act on it. A robot demonstration must be produced in real time, usually by a human teleoperating physical hardware — there is no equivalent of "scrape it for free".

The scale of the gap is stark. Frontier language models train on the order of \(10^{13}\) tokens. The largest open robot dataset to date, Open X-Embodiment (2023), pooled the field's efforts into roughly one million trajectories across 22 robot embodiments, and that pooling was itself the headline contribution. Counted in the action-token currency of §6.2, all of robotics' shared data is several orders of magnitude smaller than a single LLM's pretraining set.

Data source | How it scales | Cost per unit | Catch
Teleoperation | human-hours | minutes of human time per demo | Gold-standard quality; does not scale to internet size.
Simulation | compute | cheap, parallel, perfectly labeled | The reality gap (§6.3): must be bridged, never free.
Cross-embodiment pooling | community | amortizes everyone's collection | Different robots, sensors, action spaces; hard to unify (Open X-Embodiment).
Human / web video | abundant | billions of hours exist | No action labels and an embodiment mismatch; actions must be inferred.
Web VLM pretraining | already done | free semantic prior | Carries no motor knowledge; only the perception/language half transfers.

The strategies that define 2026-era robotics are all responses to this scarcity. Co-training (RT-2, π0) leans on web VLM data so the robot data only has to teach motor skills, not vision and language from scratch. Cross-embodiment training pools data across robot types so a policy learns from arms it will never run on. Learning from human video tries to recover the missing action labels from unlabeled footage. And simulation plus sim-to-real trades the data problem for a transfer problem. None of these has produced a robotics "GPT-3 moment", and whether scaling alone will — or whether embodiment needs a different ingredient — is the field's central open question. The honest summary: the architecture caught up to language models; the data did not.

A robot dataset has \(10^6\) trajectories, each \(100\) timesteps long. How many trajectory-steps of supervision is that — i.e. \(10^6 \times 100\)?
\(10^6 \times 100 = 10^{8} = \) 100000000 decision steps. Even if each step is a \(100\)-token action chunk, that is only \(10^{10}\) action tokens — roughly a thousand times smaller than a frontier LLM's \(\sim\!10^{13}\) text tokens. The data gap, not the model, is the bottleneck.
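The same back-of-envelope comparison in code, with the assumptions exposed as variables. The figures are the order-of-magnitude values used in this section; the 7-tokens-per-step case (one token per dimension of a 7-DoF command, as in §6.2) is an added illustrative scenario, not a measured dataset statistic.

```python
trajectories = 10**6
steps_per_trajectory = 100
llm_tokens = 10**13                           # order-of-magnitude figure quoted in this section

decision_steps = trajectories * steps_per_trajectory

# Generous case from the worked answer: a 100-token action chunk per step.
action_tokens_chunked = decision_steps * 100
gap_chunked = llm_tokens / action_tokens_chunked

# Leaner assumed case: one token per dimension of a 7-DoF command.
action_tokens_7dof = decision_steps * 7
gap_7dof = llm_tokens / action_tokens_7dof

print(f"decision steps: {decision_steps:,}")
print(f"100 tokens/step: {action_tokens_chunked:,} tokens, LLM corpus is {gap_chunked:,.0f}x larger")
print(f"  7 tokens/step: {action_tokens_7dof:,} tokens, LLM corpus is {gap_7dof:,.0f}x larger")
```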
CONTESTED

Will scaling fix robotics? One camp argues VLAs are pre-GPT-3 and only need a robotics-scale data engine; another argues that action is qualitatively harder than perception — closed-loop, safety-critical, embodiment-specific — and that no amount of demonstration data substitutes for better world models, on-robot learning, or new architectures. As of 2026 the question is genuinely open; treat confident predictions in either direction with suspicion.

NEXT

This volume built models from the modalities up; the next asks who gets to build them. Open Models, Chapter 01: open-weight versus closed-API foundation models — the licenses, the economics, and what "open" actually means when the weights ship but the data and training code do not.

6.R

References

  1. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Google DeepMind — action tokenization over a VLM vocabulary and web/robot co-training (§6.2, EQ MM6.2).
  2. Black, K., Brown, N., Driess, D., Esmail, A., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence — flow-matching continuous action chunks at high frequency (§6.2, EQ MM6.3).
  3. Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS 2023 — ACT (action chunking with transformers) and the ALOHA teleoperation platform (§6.4).
  4. Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. CoRL 2024 — an open-weight tokenized VLA, the reproducible counterpart to RT-2 (§6.2).
  5. Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA 2024 — pooling ~1M trajectories across 22 embodiments; the cross-embodiment data effort (§6.5).
  6. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017 — the foundational sim-to-real randomization technique (§6.3, EQ MM6.4).
  7. Ross, S., Gordon, G. & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS 2011 — DAgger and the formal account of covariate shift in behavior cloning (§6.4, EQ MM6.5).