Google RT-2

RT-2 (Robotics Transformer 2) is Google DeepMind's vision-language-action foundation model — a co-fine-tuned PaLI-X / PaLM-E backbone that treats robot actions as text tokens. By training on both internet-scale image–text pairs and real robot trajectories, RT-2 inherits web-scale semantic knowledge while emitting executable end-effector deltas, yielding strong generalisation to novel objects and multi-step tasks, aided by chain-of-thought-style reasoning.
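The actions-as-tokens scheme can be sketched as uniform discretisation of each continuous action dimension into 256 bins, so a motor command becomes a short token sequence the language model can emit. The 256-bin count follows the RT-2 paper; the 7-D layout (6-DoF pose delta plus gripper) and the [-1, 1] value range are illustrative assumptions, not the exact published encoding:

```python
import numpy as np

N_BINS = 256  # RT-2 discretises each action dimension into 256 bins

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map a continuous action vector to integer bin indices (token ids)."""
    a = np.clip(action, low, high)
    return np.round((a - low) / (high - low) * (N_BINS - 1)).astype(int).tolist()

def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Invert the discretisation: bin indices back to continuous values."""
    bins = np.asarray(tokens, dtype=float)
    return bins / (N_BINS - 1) * (high - low) + low

# Hypothetical 7-D command: xyz delta, roll/pitch/yaw delta, gripper
action = np.array([0.12, -0.40, 0.05, 0.0, 0.3, -0.1, 1.0])
toks = action_to_tokens(action)     # seven integers in 0..255
recon = tokens_to_action(toks)      # recovered within one bin width
```

The quantisation error is bounded by half a bin width (about 0.004 on a [-1, 1] range), which is why a text decoder can drive a low-frequency controller without a separate action head.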

Model specs

Vendor
Google DeepMind
Family
Robotics Transformer
Released
2023-07
Context window
Not publicly disclosed
Modalities
text, vision, action

Strengths

  • Transfers web-scale semantics (e.g. 'pick up the extinct animal') to motor control
  • Chain-of-thought multi-stage reasoning prior to acting
  • Strong generalisation to unseen objects, backgrounds, and instructions
  • Unified VLM/action tokenisation simplifies the stack
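The chain-of-thought behaviour above amounts to emitting a short natural-language plan before the action tokens, all within one decoding pass. A minimal sketch of such a prompt layout — the `Instruction:`/`Plan:`/`Action:` field names are illustrative, not RT-2's exact format:

```python
def build_cot_prompt(instruction, plan=None):
    """Compose an instruction -> plan -> action prompt; the trailing
    'Action:' cue is where the model would emit discretised action tokens."""
    parts = [f"Instruction: {instruction}"]
    if plan is not None:
        parts.append(f"Plan: {plan}")
    parts.append("Action:")
    return "\n".join(parts)

prompt = build_cot_prompt(
    "pick up the extinct animal",
    plan="the dinosaur toy is the extinct animal; move gripper to it",
)
```

Interleaving the plan lets web-derived semantics ("extinct animal" means the dinosaur) resolve the referent before any motor command is produced.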

Limitations

  • Not publicly released — weights available only via Google research collaborations
  • Low control frequency — unsuitable for high-speed dynamic tasks
  • Trained largely on table-top manipulation — limited locomotion skill
  • Inference requires TPU-class hardware for real-time use

Use cases

  • Mobile manipulation research — pick-and-place with novel objects
  • Instruction-following household and warehouse robots
  • Benchmarking vision-language-action generalisation
  • Co-training recipes for multi-embodiment policy learning

Benchmarks

Benchmark                                   | Score                                | As of
Emergent skill generalisation (RT-2 paper)  | ≈62% on novel tasks vs ≈32% for RT-1 | 2023-07
Symbol understanding (new objects)          | ≈90% success                         | 2023-07

Frequently asked questions

What is RT-2?

RT-2 is a vision-language-action (VLA) model from Google DeepMind that treats 6-DoF robot end-effector commands as tokens in a vision-language model, so the same network can describe an image in text or emit a motor command.

How is RT-2 different from RT-1?

RT-1 was trained only on robot demonstrations. RT-2 co-fine-tunes a PaLI-X / PaLM-E backbone on web image–text data and robot trajectories, letting it inherit semantic priors from the internet — which roughly doubles generalisation to novel objects and instructions.
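Co-fine-tuning can be pictured as drawing every training batch from a fixed mixture of web image–text examples and robot trajectories, so the model never forgets its web-scale priors while learning to act. A schematic sketch — the 50/50 ratio and record shapes are assumptions, not the published recipe:

```python
import random

def mixed_batch(web_data, robot_data, batch_size=8, robot_frac=0.5, rng=None):
    """Draw a co-fine-tuning batch: a robot_frac share of robot trajectories,
    the remainder web image-text examples, shuffled together."""
    rng = rng or random.Random(0)
    n_robot = int(batch_size * robot_frac)
    batch = [rng.choice(robot_data) for _ in range(n_robot)]
    batch += [rng.choice(web_data) for _ in range(batch_size - n_robot)]
    rng.shuffle(batch)
    return batch

web = [{"type": "vqa", "id": i} for i in range(100)]
robot = [{"type": "trajectory", "id": i} for i in range(100)]
batch = mixed_batch(web, robot, batch_size=8, robot_frac=0.5)
```

Because both data sources share one tokenisation (text for web answers, discretised bins for actions), the same next-token loss covers the whole mixed batch.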

Is RT-2 open-source?

No — RT-2 weights are not publicly released. Related open follow-ups like OpenVLA replicate the recipe with permissive licensing.

Sources

  1. RT-2 paper (DeepMind, 2023) — accessed 2026-04-20
  2. Google DeepMind blog — RT-2 — accessed 2026-04-20