For decades, helpful robots have been a staple of science fiction, but the technology has remained elusive. Today, Google DeepMind is taking us one step closer to that future with the introduction of Robotics Transformer 2 (RT-2), a vision-language-action (VLA) model that learns from text and images from the web and can directly output robotic actions.
General-purpose robots need to handle complex, abstract tasks in highly variable environments, which requires grounding both in the real world and in what the robot itself is able to do. Unlike chatbots, robots need to recognise objects in context, distinguish them from similar objects, understand what they look like, and know how to interact with them. This has historically required training robots on billions of data points, which is time-consuming and costly.
RT-2 is a Transformer-based model trained on text and images from the web, which lets it transfer what it has learned from web data to robot behaviour. In other words, RT-2 can “speak robot”: it draws on that knowledge to recognise objects and work out how to interact with them.
Previous systems split the work between a high-level reasoning system and a low-level manipulation system, which had to play an imperfect game of telephone to operate the robot. RT-2 removes that complexity: a single model performs the complex reasoning and outputs the robot actions, even for abstract tasks like identifying and disposing of rubbish.
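To make the single-model idea more concrete, here is a minimal Python sketch of how a model's text output might be decoded into a robot command. It assumes, purely for illustration, that the model emits each action as a string of eight integers: a terminate flag followed by seven values in the range 0 to 255 for position, rotation, and gripper. The field layout, the value ranges, and the names `RobotAction`, `bin_to_value`, and `decode_action` are all hypothetical and are not taken from this article.

```python
from dataclasses import dataclass

NUM_BINS = 256  # assumed number of discrete bins per action dimension


@dataclass
class RobotAction:
    """Hypothetical robot command decoded from the model's text output."""
    terminate: bool
    delta_position: tuple  # (x, y, z), each scaled to [-1.0, 1.0]
    delta_rotation: tuple  # (roll, pitch, yaw), each scaled to [-1.0, 1.0]
    gripper: float         # scaled to [-1.0, 1.0]


def bin_to_value(token: int, low: float = -1.0, high: float = 1.0) -> float:
    """Map a bin index in [0, NUM_BINS - 1] linearly onto [low, high]."""
    if not 0 <= token < NUM_BINS:
        raise ValueError(f"token {token} is outside 0..{NUM_BINS - 1}")
    return low + (high - low) * token / (NUM_BINS - 1)


def decode_action(model_output: str) -> RobotAction:
    """Parse a space-separated string of eight integers into a RobotAction."""
    tokens = [int(t) for t in model_output.split()]
    if len(tokens) != 8:
        raise ValueError(f"expected 8 action tokens, got {len(tokens)}")
    terminate = tokens[0] == 1                     # first token is the stop flag
    values = [bin_to_value(t) for t in tokens[1:]]  # remaining seven are binned values
    return RobotAction(
        terminate=terminate,
        delta_position=tuple(values[0:3]),
        delta_rotation=tuple(values[3:6]),
        gripper=values[6],
    )


if __name__ == "__main__":
    # The same model that reads the instruction would emit a string like this one.
    print(decode_action("0 0 255 0 255 0 255 0"))
```

The point of the sketch is that nothing sits between the model and the robot except a small decoding step, so there is no second system that has to reinterpret the first one's output.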
RT-2’s ability to turn information into actions shows promise for robots that adapt more rapidly to novel situations. In testing, RT-2 performed as well as previous models on tasks in its training data and almost doubled their performance on novel, unseen scenarios. This means that robots with RT-2 are able to learn more like we do, transferring learned concepts to new situations.
Google’s RT-2 brings us closer to a future of helpful robots that can interact with the world in complex and abstract ways. Whilst there is still much work to be done, RT-2 shows enormous promise for more general-purpose robots that can learn from their experiences and adapt to novel situations.