The Unity of Reasoning and Action: AGIBOT Unveils Genie Operator-2 (GO-2) Next-Gen Embodied Foundation Model

AGIBOT today introduced GO-2, its next-generation foundation model for embodied AI. For the first time, GO-2 bridges the "last mile" from logical reasoning to precise execution within a unified architecture. Leveraging tens of thousands of hours of interaction data, GO-2 has set new SOTA (State-of-the-Art) records across multiple robotic benchmarks, marking a transition from "black-box exploration" to true "Unity of Reasoning and Action."

 

GO-2 introduces a unified architecture that integrates logical reasoning and action execution within a single system, enabling robots not only to plan correctly, but to execute reliably in real-world environments.

 

Core technical contributions of GO-2 have been accepted to the leading conferences CVPR 2026 and ACL 2026, underscoring their significance across both the computer vision and natural language processing communities.

 


 

The Evolution of the GO Series: From Perception to Actuation

 

A year ago, AGIBOT released the Genie Operator-1 (GO-1) foundation model. Featuring the innovative ViLLA architecture, it achieved the first unified modeling of Vision, Language, and Action. It was a landmark breakthrough—GO-1 received the Best Paper Nomination at IROS, was accepted by the top robotics journal TRO, and won the SAIL Star at the World Artificial Intelligence Conference (WAIC). Today, it is integrated into our one-stop embodied development platform, Genie Studio, empowering users to deploy models and validate them in large-scale real-world applications.

 

GO-1 taught robots to "understand." It could interpret instructions, recognize scenes, and plan tasks. However, as systems entered more complex real-world environments, a critical issue emerged: even with a reasonable plan, the robot’s actions did not always strictly adhere to it.

 

This is not a failure of planning; it is a fracture between "Reasoning" and "Execution." The core cause is a long-standing challenge in robotics: the Semantic-Actuation Gap. In traditional VLA models, the high-level reasoning signals and real-world motor commands remain disconnected. During execution, control modules often bypass reasoning signals, leading to accumulated errors in long-horizon tasks and decreased system stability.

GO-2 is designed specifically to bridge this gap. Its goal is clear: to enable robots not just to reason about the world, but to act upon it with consistent stability.

 

Core Philosophy of GO-2: Achieving True "Unity of Reasoning and Action"

To achieve the unity of reasoning and action, a system must solve two key problems simultaneously:

1. How to generate "executable" action plans through deep spatial reasoning;

2. How to ensure "stable execution" of those plans in real environments.

GO-2 addresses these through a comprehensive architecture built on two key innovations:

 

1. Action Chain-of-Thought - Reasoning in Action Space

GO-2 first performs reasoning within the Action Space using Action Chain-of-Thought.

Unlike traditional models that map instructions directly to raw motor commands, GO-2 generates a high-level sequence of action intents as a macro-plan. Similar to how a human mentally simulates the arc of a basketball shot before releasing the ball, GO-2 makes this process explicit. Through Action-level Reasoning, the robot plans a complete behavioral path and executes it step-by-step. Complex tasks are naturally decomposed into ordered stages, ensuring that execution is built upon a foundation of clear, logical reasoning.
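This idea can be illustrated with a minimal sketch. Everything concrete here is hypothetical, not GO-2's actual interface: the `ActionIntent` type, the target poses, and the hard-coded decomposition stand in for what the model generates. The point is the structure: an instruction is first expanded into an ordered macro-plan of action intents, and only afterwards grounded into motor commands step-by-step.

```python
from dataclasses import dataclass

@dataclass
class ActionIntent:
    """One step of the macro-plan, e.g. 'grasp cup' with a target pose."""
    name: str
    target: tuple  # hypothetical (x, y, z) anchor for the low-level controller

def action_chain_of_thought(instruction: str) -> list[ActionIntent]:
    """Toy stand-in for the reasoning model: decompose an instruction into
    an ordered sequence of action intents before any motor command is issued."""
    if instruction == "put the cup on the shelf":
        return [
            ActionIntent("approach cup", (0.4, 0.1, 0.2)),
            ActionIntent("grasp cup", (0.4, 0.1, 0.15)),
            ActionIntent("move to shelf", (0.2, 0.5, 0.6)),
            ActionIntent("release cup", (0.2, 0.5, 0.6)),
        ]
    return []

plan = action_chain_of_thought("put the cup on the shelf")
for i, step in enumerate(plan, 1):
    print(f"{i}. {step.name} -> {step.target}")
```

Each intent in the resulting list becomes one stage of the long-horizon task, which is what makes the subsequent execution auditable step-by-step.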

 

The work underpinning this pivotal capability has been accepted to CVPR 2026, marking a major advancement in embodied AI.

 



2. Asynchronous Dual-System - Low-Frequency Planning, High-Frequency Following

High-level reasoning alone cannot guarantee stable execution in real-world environments filled with noise and disturbances. To solve this, GO-2 introduces an Asynchronous Dual-System architecture to translate high-level reasoning into precise robotic movements.

 

- Semantic Planning Module (System 2): Operates at a lower frequency. Acting as a "General Commander," it generates structured high-level action sequences. These are refined through Progressive Refinement, ensuring that the reasoning itself is inherently "executable" and provides stable geometric anchors for control.

- Action Following Module (System 1): Operates at a higher frequency. Acting as an "Agile Executor," it continuously receives high-level intents and combines them with real-time observations to generate concrete control signals, performing Residual Refinement to compensate for environmental noise.
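The division of labor described above can be illustrated with a toy control loop. All concrete details below are illustrative assumptions, not GO-2's actual design: the 10:1 frequency ratio, the one-dimensional state, and the proportional correction merely stand in for the learned planning and following modules.

```python
import random

PLAN_PERIOD = 10  # assumed ratio: System 2 replans once per 10 System 1 ticks

def system2_plan() -> float:
    """Low-frequency 'General Commander': emits a geometric anchor
    (here, a 1-D target position) for the follower to track."""
    return 1.0

def system1_follow(anchor: float, position: float) -> float:
    """High-frequency 'Agile Executor': residual refinement toward the
    anchor; a simple proportional correction stands in for the learned
    control policy."""
    return 0.2 * (anchor - position)

random.seed(0)
position, anchor = 0.0, 0.0
for tick in range(50):
    if tick % PLAN_PERIOD == 0:
        anchor = system2_plan()                     # slow: refresh the plan
    disturbance = random.uniform(-0.01, 0.01)       # environmental noise
    position += system1_follow(anchor, position) + disturbance  # fast loop
print(f"final position ~ {position:.3f} (anchor {anchor})")
```

Because the fast loop corrects toward the slowly updated anchor on every tick, disturbances are absorbed continuously instead of accumulating across the whole task.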

 



Crucially, these two systems are deeply aligned. To ensure execution strictly adheres to reasoning, GO-2 utilizes a Teacher Forcing mechanism during training, teaching the model to perform robustly even under "approximately correct but imperfect" reasoning conditions.
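One way such a training scheme could look, as a hedged sketch (the noise model, loss, and toy follower below are assumptions, not the published recipe): the follower is conditioned on ground-truth anchors rather than its own rollouts, with perturbations injected so it learns to tolerate imperfect reasoning.

```python
import random

def perturb(anchor: float, sigma: float = 0.05) -> float:
    """Inject Gaussian noise into the teacher-provided anchor so the
    follower learns to act under 'approximately correct but imperfect'
    reasoning (an assumed form of the robustness mechanism)."""
    return anchor + random.gauss(0.0, sigma)

def training_step(teacher_plan, follower) -> float:
    """Teacher forcing: condition the follower on ground-truth anchors
    (not its own rollout), perturbed for robustness; return the mean loss."""
    total = 0.0
    for anchor, target_action in teacher_plan:
        pred = follower(perturb(anchor))
        total += (pred - target_action) ** 2
    return total / len(teacher_plan)

# Toy setup: the follower copies its anchor, and each anchor equals the
# desired action, so all remaining loss comes from the injected noise.
follower = lambda a: a
plan = [(0.2, 0.2), (0.5, 0.5), (0.9, 0.9)]
random.seed(0)
loss = training_step(plan, follower)
print(f"mean squared tracking loss: {loss:.4f}")
```

The small residual loss is exactly the signal that pushes the follower to stay robust when the planner's anchors are slightly off.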

 

The work describing this asynchronous architecture has been accepted to ACL 2026.



 

Performance: State-of-the-Art Across Benchmarks

By bridging "Reasoning" and "Action," GO-2 achieves a paradigm shift in behavioral performance, significantly outperforming current mainstream models like π0.5 and NVIDIA GR00T:

 

- LIBERO Benchmark: GO-2 ranks 1st across the Spatial, Object, Goal, and Long task suites, with an average success rate of 98.5%.

- LIBERO-Plus Benchmark: In environments with various disturbances, GO-2 achieves an 86.6% zero-shot success rate.

- VLABench Benchmark: In rigorous tests of cross-category and texture generalization, GO-2 achieves an average score of 47.4, notably outperforming existing methods on diverse object textures and unseen categories.

- Genie Sim 3.0 (Sim-to-Real): Trained solely on simulation data, GO-2 achieves an 82.9% success rate in real-world testing.



 

From Model to Deployment: Enabling Continuous Learning in the Real World

Beyond model performance, AGIBOT is extending GO-2 into real-world deployment through a paradigm of pre-training, post-training, and a continuous data feedback loop.

 

Integrated with Genie Studio, the system enables:

- Continuous data collection across fleets of robots

- Cloud-based collaborative training

- Online post-training in real-world environments

 

This infrastructure supports large-scale deployment and ongoing improvement:

- Supports thousands of robots in distributed training

- Achieves ~10× improvement in training efficiency

- Reduces task startup time to minutes

- Enables minute-level convergence in industrial tasks

- Improves success rates by 2–4× while reducing data requirements by 50%+

 

This transforms GO-2 from a static model into a continuously evolving embodied system.

 

 

Toward Embodied Agents with Memory 

By combining Action Reasoning, Hierarchical Execution, and Long-term Memory, GO-2 closes a complete intelligent loop: Perception -> Reasoning -> Action -> Memory.

 

From GO-1 to GO-2, AGIBOT has achieved a critical leap: from "understanding the world" to "acting upon the world." The release of GO-2 marks the moment when embodied foundation models truly achieve the Unity of Reasoning and Action.

 

As embodied models continue to evolve, AGIBOT aims to accelerate the transition from research breakthroughs to real-world impact—unlocking the next phase of scalable, intelligent robotics.