Google DeepMind has recently unveiled a groundbreaking AI model, RT-2 (Robotic Transformer 2), which translates vision and language into robotic actions. This development marks a significant milestone in robotics, bringing us closer to a future of helpful robots. RT-2, a Transformer-based model trained on text and images from the web, can directly output robotic actions, effectively enabling it to “speak robot.”
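To make the “speak robot” idea concrete, the sketch below shows one plausible way a model’s text output could be decoded into a low-level command. RT-2 is described as emitting actions as strings of discrete tokens; the eight-value layout, the 256-bin discretization, and the value ranges below are assumptions chosen for illustration, not DeepMind’s published code.

```python
# Illustrative sketch only: decoding a model's text output into a robot command.
# The token layout, bin count, and ranges are assumed for clarity.

def detokenize_action(token_string: str,
                      num_bins: int = 256,
                      translation_range: float = 0.1,  # metres (assumed)
                      rotation_range: float = 0.5):    # radians (assumed)
    """Map a string of integer action tokens to (terminate, dxyz, drpy, gripper)."""
    tokens = [int(t) for t in token_string.split()]
    terminate = bool(tokens[0])

    def bin_to_value(b: int, value_range: float) -> float:
        # Map a discrete bin index back to a continuous value in [-range, +range].
        return (b / (num_bins - 1)) * 2.0 * value_range - value_range

    dxyz = [bin_to_value(b, translation_range) for b in tokens[1:4]]  # end-effector translation
    drpy = [bin_to_value(b, rotation_range) for b in tokens[4:7]]     # end-effector rotation
    gripper = tokens[7] / (num_bins - 1)                              # 0.0 closed, 1.0 open (assumed)
    return terminate, dxyz, drpy, gripper


# A plausible model output: one terminate flag, six motion tokens, one gripper token.
print(detokenize_action("0 128 91 241 5 101 127 255"))
```

The point of the sketch is that an action is just another string the model can emit, which is what lets a model trained on web-scale language and vision data “speak robot.”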
The pursuit of helpful robots has long been a herculean effort, because robots need to handle complex, abstract tasks in highly variable environments. Unlike chatbots, robots need grounding in the real world and an understanding of their own capabilities. This grounding goes beyond learning facts about an object; the robot must also understand how to interact with it in context. RT-2 addresses these challenges by transferring knowledge from web data to inform robot behavior, making it possible for robots to recognize objects in context and understand how to manipulate them.
Recent advancements in robotics, such as improved reasoning and the use of chain-of-thought prompting, have enabled robots to better handle multi-step problems. Vision-language models, such as PaLM-E, have also helped robots make better sense of their surroundings, and RT-1 demonstrated that Transformers could help different types of robots learn from each other. RT-2 builds on these advances by removing the need for separate high-level reasoning and low-level manipulation systems: a single model performs the complex reasoning and outputs the robot actions. This lets robots transfer concepts embedded in their language and vision training data directly into actions, even for abstract tasks like identifying and disposing of trash.
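To illustrate what collapsing those two stacks into a single model might look like at inference time, here is a minimal sketch. The response format (“Plan: … Action: …”) and the helper names are hypothetical stand-ins rather than DeepMind’s published interface; the only point is that one response can carry both the intermediate reasoning and the low-level action tokens.

```python
# Minimal sketch of a single model handling both reasoning and action output.
# The "Plan: ... Action: ..." format and function names are assumptions.

def parse_vla_response(response: str) -> tuple[str, list[int]]:
    """Split a single model response into its reasoning plan and its action tokens."""
    plan_part, action_part = response.split("Action:", maxsplit=1)
    plan = plan_part.replace("Plan:", "").strip()
    action_tokens = [int(t) for t in action_part.split()]
    return plan, action_tokens


# Example: for an instruction like "throw away the trash", one response carries
# both the chain-of-thought step and the discretized action that follows from it.
response = ("Plan: the crumpled wrapper is trash, so move it to the bin. "
            "Action: 0 132 88 240 10 98 130 255")
plan, action = parse_vla_response(response)
print(plan)    # the intermediate reasoning
print(action)  # the action tokens, ready to be decoded and executed on the robot
```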
RT-2’s ability to transfer knowledge into actions shows promise for robots that adapt more rapidly to novel situations and learn from new experiences. In testing, RT-2 models performed as well as the previous RT-1 model on tasks seen in training and almost doubled its performance on novel, unseen scenarios. This improvement indicates that RT-2 enables robots to learn more like humans do, transferring learned concepts to new situations, and it shows enormous promise for more general-purpose robots.
Google’s RT-2 model represents a transformative leap in AI-powered robotics, bringing us closer to a future where robots can effectively interact with the world around them. By enabling robots to transfer concepts from their language and vision training data to direct robot actions, RT-2 opens up new possibilities for robots to learn and adapt to novel situations. As the field of robotics continues to advance, we can look forward to a future where helpful robots are no longer confined to the realm of science fiction.