Google RT-2
RT-2 (Robotics Transformer 2) is Google DeepMind's vision-language-action foundation model — a co-fine-tuned PaLI-X / PaLM-E backbone that treats robot actions as text tokens. By training on both internet-scale image–text pairs and real robot trajectories, RT-2 inherits web-scale semantic knowledge while emitting executable end-effector deltas, yielding strong generalisation to novel objects and chain-of-thought-style reasoning on multi-step tasks.
Model specs
- Vendor
- Google DeepMind
- Family
- Robotics Transformer
- Released
- 2023-07
- Context window
- Not publicly specified
- Modalities
- text, vision, robot actions
Strengths
- Transfers web-scale semantics (e.g. 'pick up the extinct animal') to motor control
- Chain-of-thought multi-stage reasoning prior to acting
- Strong generalisation to unseen objects, backgrounds, and instructions
- Unified VLM/action tokenisation simplifies the stack
Limitations
- Not publicly released — weights available only via Google research collaborations
- Low control frequency — unsuitable for high-speed dynamic tasks
- Trained largely on table-top manipulation — limited locomotion skill
- Inference requires TPU-class hardware for real-time use
Use cases
- Mobile manipulation research — pick-and-place with novel objects
- Instruction-following household and warehouse robots
- Benchmarking vision-language-action generalisation
- Co-training recipes for multi-embodiment policy learning
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| Emergent skill generalisation (RT-2 paper) | ≈62% on novel tasks vs 32% RT-1 | 2023-07 |
| Symbol understanding (new objects) | ≈90% success | 2023-07 |
Frequently asked questions
What is RT-2?
RT-2 is a vision-language-action (VLA) model from Google DeepMind that treats 6-DoF robot end-effector commands as tokens in a vision-language model, so the same network can describe an image in text or emit a motor command.
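The action-as-token scheme can be sketched in a few lines: each continuous action dimension is clipped to its range and discretised into uniform bins, and each bin index becomes one token the language model can emit. This is a minimal illustration of the idea, not DeepMind's implementation — the bin count of 256 follows the RT papers, but the function names, the 7-dimensional action layout, and the symmetric [-1, 1] ranges in the usage example are assumptions for clarity.

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Discretise a continuous action vector into integer bin indices,
    one token per dimension (the RT-2-style action-as-token trick)."""
    action = np.clip(action, low, high)
    # Map each dimension linearly onto bins 0 .. n_bins - 1.
    bins = np.round((action - low) / (high - low) * (n_bins - 1)).astype(int)
    return bins.tolist()

def detokenize_action(bins, low, high, n_bins=256):
    """Invert tokenisation: map bin indices back to continuous values."""
    bins = np.asarray(bins, dtype=float)
    return (bins / (n_bins - 1)) * (high - low) + low
```

A round trip on a hypothetical 7-dim end-effector delta (xyz translation, rpy rotation, gripper) recovers the action to within half a bin width, which is the quantisation error the policy accepts in exchange for a purely token-based interface.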
How is RT-2 different from RT-1?
RT-1 was trained only on robot demonstrations. RT-2 co-fine-tunes a PaLI-X / PaLM-E backbone on web image–text data and robot trajectories, letting it inherit semantic priors from the internet — which roughly doubles generalisation to novel objects and instructions.
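The co-fine-tuning recipe above boils down to mixing two data sources into one token-prediction objective. Here is a toy sketch of the data-mixing side only, under assumptions not stated in the source: the mixing ratio, the function name, and the idea that both sources are pre-rendered as (prompt, target-tokens) pairs are illustrative, not DeepMind's actual pipeline.

```python
import random

def mix_batches(web_examples, robot_examples, web_ratio=0.5, seed=0):
    """Yield an endless stream of training examples, drawing each one
    from web image-text data with probability web_ratio and from robot
    trajectories otherwise. Because both sources are rendered as
    (prompt, target-token) pairs, a single seq2seq loss covers
    captioning/VQA and action prediction alike."""
    rng = random.Random(seed)
    while True:
        source = web_examples if rng.random() < web_ratio else robot_examples
        yield rng.choice(source)
```

The design point this illustrates: nothing in the training loop distinguishes "describe the image" from "emit the next action" — keeping web data in the mixture during fine-tuning is what preserves the semantic priors that RT-2 then transfers to manipulation.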
Is RT-2 open-source?
No — RT-2 weights are not publicly released. Related open follow-ups like OpenVLA replicate the recipe with permissive licensing.
Sources
- RT-2 paper (DeepMind, 2023) — accessed 2026-04-20
- Google DeepMind blog — RT-2 — accessed 2026-04-20