From perception to action
A language model and a robot policy are the same shape of object. Both consume a context and emit the next symbol — for the LLM a word piece, for the robot a motor command. The difference is the consequence: the LLM's mistake costs a token, the robot's mistake costs a dropped cup or a stripped gear. Formally, control is a partially observed Markov decision process. At each step the agent receives an observation \(o_t\) (camera frames, joint encoders, an instruction), maintains a belief, and emits an action \(a_t\); the environment transitions and pays a reward.
Three properties make action harder than text and force everything that follows.
- The output is continuous. A 7-DoF arm command is seven real numbers, not a choice from a fixed vocabulary. To reuse the cross-entropy machinery of an LLM you must discretize the action into tokens (§6.2) — or replace the head with a continuous generator (a diffusion or flow model).
- Errors compound. Each action changes the world the next observation is drawn from, so a small per-step mistake drifts the robot into states the policy never trained on. This covariate shift is the central pathology of imitation learning (§6.4).
- Real data is brutally expensive. A web crawl yields trillions of text tokens for free; a robot demonstration is a human teleoperating a physical arm in real time. The entire field is organized around this scarcity (§6.5).
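The compounding described in the second bullet can be made concrete with a deliberately pessimistic toy. It assumes a 1-D integrator (the state is the running sum of actions) and a cloned policy whose every action is off by a fixed `eps` and that never corrects itself; both are illustrative assumptions, not a model of any real robot. Under them the deviation after step t is eps times t, so the deviation summed over a horizon T grows roughly like eps times T squared, the quadratic-in-horizon worst case discussed by Ross et al. (2011).

```python
# Toy: an uncorrected per-step action error accumulates quadratically in the horizon.
def total_deviation(eps: float, horizon: int) -> float:
    """Sum of |state error| over `horizon` steps of a 1-D integrator
    when every action is off by `eps` and the policy never recovers."""
    x_err = 0.0   # deviation from the expert's state
    total = 0.0   # deviation summed over the whole rollout
    for _ in range(horizon):
        x_err += eps         # each biased action pushes the state further off
        total += abs(x_err)  # cost paid for being off the expert's path
    return total

eps = 0.01
for T in (10, 20, 40, 80):
    print(f"T={T:3d}  final deviation={eps * T:.2f}  "
          f"summed deviation={total_deviation(eps, T):.2f}")
```

Doubling the horizon roughly quadruples the summed deviation, which is why a per-step error that looks negligible in training can dominate a long rollout.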
The reward, transition, and value machinery underneath EQ MM6.1 is the subject of the Reinforcement Learning volume; here we treat the POMDP as given and focus on what is unique to embodiment: turning a perception model into something that moves.
Vision-language-action models (RT-2, π0)
A vision-language-action (VLA) model is a vision-language model whose output space has been extended to include motor commands. The provocation of RT-2 (Brohan et al., 2023) was to make that extension almost free: take a VLM already trained on web images and text, and represent each robot action as a short string of tokens drawn from the model's existing vocabulary. The model then generates an action the exact way it generates a sentence — autoregressively, one token at a time — and is co-trained on web vision-language data and robot trajectories together, so internet-scale semantics leak into the robot's behavior.
Action tokenization
The bridge from continuous control to a token model is discretization. Clip each action dimension to a working range \([\,a_{\min}, a_{\max}]\), split that range into \(B\) uniform bins, and map a value to the index of its bin. RT-2 used \(B = 256\) bins per dimension. Its PaLM-E variant overwrote the 256 least-used token ids to serve as the "action vocabulary", while its PaLI-X variant reused the tokens that already represent integers.
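Written out, the scheme is a clip, a floor, and a bin-center lookup. The symbols \(\Delta\) (bin width), \(t\) (token index), and \(\hat a\) (decoded action) are notation assumed here for illustration; this is one consistent way to write the uniform binning described above, not necessarily the source's exact equation.

\[
\Delta = \frac{a_{\max} - a_{\min}}{B}, \qquad
t = \min\!\left(\left\lfloor \frac{\operatorname{clip}(a,\, a_{\min},\, a_{\max}) - a_{\min}}{\Delta} \right\rfloor,\; B - 1\right), \qquad
\hat a = a_{\min} + \left(t + \tfrac{1}{2}\right)\Delta
\]

Decoding to the bin center keeps the round-trip error of any in-range action at or below \(\Delta/2\); with \(B = 256\) on \([-1, 1]\) that is \(1/256 \approx 0.0039\).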
```python
# EQ MM6.2: discretize a continuous action into tokens, then round-trip back
import numpy as np

rng = np.random.default_rng(0)
a_min, a_max, B = -1.0, 1.0, 256   # range and number of bins (RT-2 used 256)
delta = (a_max - a_min) / B        # bin width

def encode(a):                     # continuous -> token id
    a = np.clip(a, a_min, a_max)
    # truncation acts as floor here (values are >= 0); the outer clip folds a == a_max into the last bin
    return np.clip(((a - a_min) / delta).astype(int), 0, B - 1)

def decode(t):                     # token id -> bin CENTER
    return a_min + (t + 0.5) * delta

a = rng.uniform(a_min, a_max, 7)   # a 7-DoF end-effector command
tok = encode(a)
a_hat = decode(tok)
err = np.abs(a - a_hat)

np.set_printoptions(precision=4, suppress=True)
print("action :", a)
print("tokens :", tok)             # one integer token per action dimension
print("decoded :", a_hat)
print(f"max error: {err.max():.6f} (theory bound delta/2 = {delta/2:.6f})")
print("within bound:", bool(err.max() <= delta / 2 + 1e-12))
```
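A bin index still has to land on a real token id before a language model can emit it. The sketch below shows one plausible mapping, assuming a hypothetical vocabulary of `VOCAB_SIZE = 32000` ids whose last `B` ids are reserved for actions. The vocabulary size, the choice of the tail of the vocabulary, and the helper names are all illustrative assumptions, not RT-2's actual id assignment.

```python
import numpy as np

VOCAB_SIZE = 32_000            # ASSUMPTION: hypothetical vocabulary size
B = 256                        # bins per action dimension, as in the text
ACTION_BASE = VOCAB_SIZE - B   # ASSUMPTION: the last B ids are reserved for actions

def is_action_token(ids):
    """True where a token id falls inside the reserved action range."""
    ids = np.asarray(ids)
    return (ids >= ACTION_BASE) & (ids < VOCAB_SIZE)

def bins_to_token_ids(bins):
    """Shift bin indices 0..B-1 into the reserved id range."""
    bins = np.asarray(bins)
    if np.any((bins < 0) | (bins >= B)):
        raise ValueError("bin index out of range")
    return ACTION_BASE + bins

def token_ids_to_bins(ids):
    """Inverse mapping; rejects ids that are ordinary text tokens."""
    ids = np.asarray(ids)
    if not is_action_token(ids).all():
        raise ValueError("non-action token in an action sequence")
    return ids - ACTION_BASE

bins = np.array([0, 17, 128, 255, 64, 200, 31])   # one bin per action dimension
ids = bins_to_token_ids(bins)
print("token ids:", ids)
print("back to bins:", token_ids_to_bins(ids))
```

Rejecting non-action ids at decode time is a design choice worth keeping in any real system: a generative model can emit a text token where an action was expected, and silently decoding it would command an arbitrary motion.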
π0 and the move to continuous actions
π0 (Black et al., 2024) keeps the VLM backbone but rejects discretization for fine manipulation. Instead of emitting binned tokens, it attaches a separate action expert that produces continuous action chunks — short horizons of future actions — using a flow-matching objective borrowed from modern image and video generators. The policy learns a velocity field that transports noise to an action sequence; sampling integrates that field. This buys two things discretization cannot: smooth, high-frequency control (π0 runs up to ~50 Hz) and the ability to commit to a coherent multi-step motion rather than re-deciding every frame.
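Sampling from a flow model means integrating the velocity field from noise at \(t = 0\) to an action chunk at \(t = 1\). The sketch below uses forward Euler steps and, in place of a trained network, a closed-form stand-in field that points straight at one fixed target chunk. The names `velocity` and `sample_chunk`, the chunk shape, and the step count are assumptions for illustration and do not reproduce π0's action expert.

```python
import numpy as np

H, D = 8, 7   # ASSUMPTION: a chunk of 8 future steps of a 7-dim action

# Stand-in for the data: a single fixed "expert" chunk the flow should reach.
target_chunk = np.tile(np.linspace(-0.5, 0.5, D), (H, 1))

def velocity(x, t):
    """Closed-form stand-in for a learned field: the straight-line
    velocity toward target_chunk, v = (target - x) / (1 - t)."""
    return (target_chunk - x) / (1.0 - t)

def sample_chunk(rng, n_steps=10):
    """Integrate the velocity field from noise (t=0) toward t=1 with forward Euler."""
    x = rng.standard_normal((H, D))   # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                    # t stays below 1, so no division by zero
        x = x + dt * velocity(x, t)   # one Euler step along the field
    return x

rng = np.random.default_rng(0)
chunk = sample_chunk(rng)
print("chunk shape:", chunk.shape)
print("max distance to target:", np.abs(chunk - target_chunk).max())
```

With a learned field the integration loop would stay the same; only `velocity` would become a network conditioned on the observation, and the whole chunk is produced in a handful of steps instead of one token at a time.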
The honest caveat. Whether tokenized (RT-2, OpenVLA) or continuous (π0, diffusion policies), VLAs in 2026 are real but narrow: they generalize impressively across objects and phrasing they were broadly exposed to, yet remain brittle to genuinely novel scenes, long horizons, and lighting they have not seen. The benchmarks are not yet standardized, success rates are reported on small task suites, and "zero-shot" claims deserve scrutiny — the field's own researchers say so.
Sim-to-real transfer
If real demonstrations are scarce (§6.5), simulation is the obvious escape: a physics engine can generate millions of trajectories overnight, with perfect labels and no hardware to break. The catch has a name — the reality gap. A simulator is an approximation: contact dynamics, friction, sensor noise, latency, lighting, and the exact mass of every object differ from the real world. A policy that overfits to the simulator's quirks excels in sim and fails on the robot.
The dominant fix is domain randomization: rather than try to match reality precisely, randomize the simulator's parameters — masses, frictions, textures, lighting, sensor delays, camera pose — so widely that the real world looks like just another sample from the training distribution. If the policy is robust across thousands of simulated "physics", it has no reason to depend on the specific physics it will eventually meet.
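One common way to write this down uses \(\xi\) for the vector of simulator parameters, \(p(\xi)\) for the randomization distribution, \(\pi_\theta\) for the policy, and \(J(\pi_\theta;\, \xi)\) for the expected return in the simulator configured by \(\xi\). All of this notation is assumed here for illustration.

\[
\theta^\star = \arg\max_\theta \; \mathbb{E}_{\xi \sim p(\xi)} \big[\, J(\pi_\theta;\, \xi) \,\big]
\]

Training on a single simulator is the special case where \(p(\xi)\) is a point mass; domain randomization widens it in the hope that the real world's parameters fall inside its support.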
Two complements sharpen this. System identification measures the real robot to center the randomization distribution \(p(\xi)\) over simulator parameters \(\xi\) on the truth, narrowing the randomization to a useful band. Real-world fine-tuning takes the sim-trained policy and adapts it on a small batch of physical episodes, often the highest-leverage hour of the whole pipeline. The trade-off is that real-world error is U-shaped in randomization width: too little and sim performance does not transfer; too much and the policy sacrifices competence to robustness.
```python
# Sim-to-real toy: a sim-tuned gain vs a domain-randomized one, on a shifted "real" plant
import numpy as np

rng = np.random.default_rng(1)

# A 1-D plant: optimal control gain depends on an unknown friction parameter mu.
# Cost of using gain g on a plant with friction mu (minimized when g == mu).
# Toy assumption: under-compensating friction (g < mu) costs 4x more than
# over-compensating. With a symmetric quadratic cost the average-best fixed gain
# is just the band's mean, which sits at mu_sim, so randomization would buy nothing.
def cost(g, mu):
    e = g - mu
    return np.where(e < 0, 4.0, 1.0) * e ** 2 + 0.05

mu_sim = 1.0                        # the simulator's (wrong) friction
g_naive = mu_sim                    # policy that overfits the single sim

# Domain randomization: train over a band of frictions, keep the gain that is best ON AVERAGE
band = rng.uniform(0.6, 1.4, 400)   # p(xi): randomized frictions
grid = np.linspace(0.5, 1.5, 101)
avg_cost = np.array([cost(g, band).mean() for g in grid])
g_dr = grid[avg_cost.argmin()]      # the robust gain

# "Reality" is shifted away from the nominal sim:
mu_real = 1.3
print(f"naive (sim-only) gain {g_naive:.2f} -> real cost {cost(g_naive, mu_real):.4f}")
print(f"domain-randomized gain {g_dr:.2f} -> real cost {cost(g_dr, mu_real):.4f}")
print("randomization survives the reality gap:",
      cost(g_dr, mu_real) < cost(g_naive, mu_real))
```
Imitation & RL for control
Two recipes turn data into a policy, and they sit at opposite ends of a sample-efficiency / safety trade-off.
Imitation learning (behavior cloning). Collect expert demonstrations \(\{(o_i, a_i)\}\) and fit the policy by supervised learning — minimize the discrepancy between the policy's action and the expert's on the same observations. It is exactly LLM pretraining with actions for tokens: stable, sample-efficient, and the workhorse behind RT-2, π0, and ALOHA's bimanual policies.
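With a squared-error discrepancy, which is one common choice for continuous actions (a tokenized policy would use cross-entropy instead), the objective can be written as follows; \(\pi_\theta\) for the policy and \(N\) for the number of demonstration pairs are notation assumed here.

\[
\mathcal{L}_{\mathrm{BC}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big\lVert \pi_\theta(o_i) - a_i \big\rVert^2
\]

Note that the sum runs only over observations the expert visited, which is why a low value of this loss says nothing about states the policy reaches through its own mistakes.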
Reinforcement learning. When you have a reward instead of (or in addition to) demonstrations, RL lets the policy improve past the demonstrator by trial and error — the only route to genuinely superhuman control. The price is sample efficiency: RL can need orders of magnitude more interaction than imitation, and on physical hardware every interaction is slow, costly, and potentially destructive. The pragmatic stack is therefore imitation first, RL second: behavior-clone a competent base policy from demonstrations, then fine-tune with RL (often in simulation, then transferred per §6.3) to squeeze out the last reliability.
```python
# Behavior cloning on a toy trajectory: fit a linear policy, print imitation error (EQ MM6.5)
import numpy as np

rng = np.random.default_rng(2)

# An "expert" policy is roughly linear in the observation, with a little noise.
N = 60
o = np.linspace(-2, 2, N)                     # 1-D observation along a trajectory
expert = 0.8 * o + 0.3                        # the true expert action
a = expert + rng.normal(0, 0.05, N)           # noisy demonstrations

# Behavior cloning = least-squares fit of pi(o) = w*o + b to the demos.
X = np.column_stack([o, np.ones(N)])          # design matrix [o, 1]
w, b = np.linalg.lstsq(X, a, rcond=None)[0]   # closed-form BC solution
pred = w * o + b
imit_err = np.sqrt(np.mean((pred - a) ** 2))  # imitation (training) error, RMSE

print(f"fitted policy : a_hat = {w:.3f} * o + {b:.3f}")
print(f"true expert : a = 0.800 * o + 0.300")
print(f"imitation error (RMSE vs demos): {imit_err:.4f}")
print("BC recovers the expert's slope and intercept from demonstrations alone.")
# plot_scatter(o, a)  # plotting helper from the source page; it is not defined in this listing
```
The data bottleneck in robotics
Every chapter in this volume has ridden the same wave: a modality unlocks once enough paired data exists. Text had the web; images had alt-text and captions; video had YouTube. Robotics has no such corpus. The internet records what the world looks like, never the torques and grasps that act on it. A robot demonstration must be produced in real time, usually by a human teleoperating physical hardware — there is no equivalent of "scrape it for free".
The scale of the gap is stark. Frontier language models train on the order of \(10^{13}\) tokens. The largest open robot dataset to date, Open X-Embodiment (2023), pooled the field's efforts into roughly one million trajectories across 22 robot embodiments — and that pooling was itself the headline contribution. Counted in the action-token currency of §6.2, all of robotics' shared data is many orders of magnitude smaller than a single LLM's pretraining set.
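A back-of-envelope count makes the gap concrete. The trajectory count and the LLM token count come from the paragraph above; the average trajectory length of 100 steps and the 7 tokens per step are assumptions chosen only for illustration, so the result should be read as an order of magnitude, not a measurement.

```python
import math

llm_tokens = 1e13             # order of magnitude quoted in the text
trajectories = 1_000_000      # Open X-Embodiment, roughly, per the text
steps_per_trajectory = 100    # ASSUMPTION: illustrative average, not from the source
tokens_per_step = 7           # ASSUMPTION: one token per dimension of a 7-dim action

robot_tokens = trajectories * steps_per_trajectory * tokens_per_step
gap = math.log10(llm_tokens / robot_tokens)

print(f"robot action tokens ~ {robot_tokens:.1e}")
print(f"gap ~ {gap:.1f} orders of magnitude")
```

Under these assumptions the pooled robot data is about four orders of magnitude smaller than an LLM pretraining set, and a tenfold error in the assumed trajectory length moves that estimate by only one order either way.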
| Data source | How it scales | Cost per unit | Catch |
|---|---|---|---|
| Teleoperation | human-hours | minutes of human time per demo | Gold-standard quality; does not scale to internet size. |
| Simulation | compute | cheap, parallel, perfectly labeled | The reality gap (§6.3) — must be bridged, never free. |
| Cross-embodiment pooling | community | amortizes everyone's collection | Different robots, sensors, action spaces; hard to unify (Open X-Embodiment). |
| Human / web video | abundant | billions of hours exist | No action labels and an embodiment mismatch — actions must be inferred. |
| Web VLM pretraining | already done | free semantic prior | Carries no motor knowledge; only the perception/language half transfers. |
The strategies that define 2026-era robotics are all responses to this scarcity. Co-training (RT-2, π0) leans on web VLM data so the robot data only has to teach motor skills, not vision and language from scratch. Cross-embodiment training pools data across robot types so a policy learns from arms it will never run on. Learning from human video tries to recover the missing action labels from unlabeled footage. And simulation plus sim-to-real trades the data problem for a transfer problem. None of these has produced a robotics "GPT-3 moment", and whether scaling alone will — or whether embodiment needs a different ingredient — is the field's central open question. The honest summary: the architecture caught up to language models; the data did not.
Will scaling fix robotics? One camp argues VLAs are pre-GPT-3 and only need a robotics-scale data engine; another argues that action is qualitatively harder than perception — closed-loop, safety-critical, embodiment-specific — and that no amount of demonstration data substitutes for better world models, on-robot learning, or new architectures. As of 2026 the question is genuinely open; treat confident predictions in either direction with suspicion.
This volume built models from the modalities up; the next asks who gets to build them. Open Models, Chapter 01: open-weight versus closed-API foundation models — the licenses, the economics, and what "open" actually means when the weights ship but the data and training code do not.
References
- Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.
- Black, K., Brown, N., Driess, D., Esmail, A., et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control.
- Zhao, T. Z., Kumar, V., Levine, S. & Finn, C. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.
- Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model.
- Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models.
- Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.
- Ross, S., Gordon, G. & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.