Google RT-2
RT-2 (Robotics Transformer 2) is Google DeepMind's vision-language-action foundation model — a co-fine-tuned PaLI-X / PaLM-E backbone that treats robot actions as text tokens. By training on both internet-scale image–text pairs and real robot trajectories, RT-2 inherits web-scale semantic knowledge while emitting executable end-effector deltas, yielding strong generalisation to novel objects and chain-of-thought-style reasoning on multi-step tasks.
Model specs
- Vendor
- Google DeepMind
- Family
- Robotics Transformer
- Released
- 2023-07
- Context window
- Not publicly specified
- Modalities
- text, vision, robot actions
Strengths
- Transfers web-scale semantics (e.g. 'pick up the extinct animal') to motor control
- Chain-of-thought multi-stage reasoning prior to acting
- Strong generalisation to unseen objects, backgrounds, and instructions
- Unified VLM/action tokenisation simplifies the stack
Limitations
- Not publicly released — weights available only via Google research collaborations
- Low control frequency — unsuitable for high-speed dynamic tasks
- Trained largely on table-top manipulation — limited locomotion skill
- Inference requires TPU-class hardware for real-time use
Use cases
- Mobile manipulation research — pick-and-place with novel objects
- Instruction-following household and warehouse robots
- Benchmarking vision-language-action generalisation
- Co-training recipes for multi-embodiment policy learning
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| Emergent skill generalisation (RT-2 paper) | ≈62% on novel tasks vs 32% RT-1 | 2023-07 |
| Symbol understanding (new objects) | ≈90% success | 2023-07 |
Frequently asked questions
What is RT-2?
RT-2 is a vision-language-action (VLA) model from Google DeepMind that treats 6-DoF robot end-effector commands as tokens in a vision-language model, so the same network can describe an image in text or emit a motor command.
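The action-as-token scheme can be sketched in a few lines: each continuous action dimension is clipped to its range and discretised into uniform bins, and each bin index becomes one token the language model can emit. This is a minimal illustration of the idea, not DeepMind's implementation — the bin count of 256 follows the RT papers, but the function names, the 7-dimensional action layout, and the symmetric [-1, 1] ranges in the usage example are assumptions for clarity.

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Discretise a continuous action vector into integer bin indices,
    one token per dimension (the RT-2-style action-as-token trick)."""
    action = np.clip(action, low, high)
    # Map each dimension linearly onto bins 0 .. n_bins - 1.
    bins = np.round((action - low) / (high - low) * (n_bins - 1)).astype(int)
    return bins.tolist()

def detokenize_action(bins, low, high, n_bins=256):
    """Invert tokenisation: map bin indices back to continuous values."""
    bins = np.asarray(bins, dtype=float)
    return (bins / (n_bins - 1)) * (high - low) + low
```

A round trip on a hypothetical 7-dim end-effector delta (xyz translation, rpy rotation, gripper) recovers the action to within half a bin width, which is the quantisation error the policy accepts in exchange for a purely token-based interface.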
How is RT-2 different from RT-1?
RT-1 was trained only on robot demonstrations. RT-2 co-fine-tunes a PaLI-X / PaLM-E backbone on web image–text data and robot trajectories, letting it inherit semantic priors from the internet — which roughly doubles generalisation to novel objects and instructions.
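The co-fine-tuning recipe above boils down to mixing two data sources into one token-prediction objective. Here is a toy sketch of the data-mixing side only, under assumptions not stated in the source: the mixing ratio, the function name, and the idea that both sources are pre-rendered as (prompt, target-tokens) pairs are illustrative, not DeepMind's actual pipeline.

```python
import random

def mix_batches(web_examples, robot_examples, web_ratio=0.5, seed=0):
    """Yield an endless stream of training examples, drawing each one
    from web image-text data with probability web_ratio and from robot
    trajectories otherwise. Because both sources are rendered as
    (prompt, target-token) pairs, a single seq2seq loss covers
    captioning/VQA and action prediction alike."""
    rng = random.Random(seed)
    while True:
        source = web_examples if rng.random() < web_ratio else robot_examples
        yield rng.choice(source)
```

The design point this illustrates: nothing in the training loop distinguishes "describe the image" from "emit the next action" — keeping web data in the mixture during fine-tuning is what preserves the semantic priors that RT-2 then transfers to manipulation.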
Is RT-2 open-source?
No — RT-2 weights are not publicly released. Related open follow-ups like OpenVLA replicate the recipe with permissive licensing.
Sources
- RT-2 paper (DeepMind, 2023) — accessed 2026-04-20
- Google DeepMind blog — RT-2 — accessed 2026-04-20