SOP: Scaling General-Purpose Robots in the Real World
General-purpose robots are moving beyond static human demonstrations toward autonomous real-world learning. VLA models are no longer frozen after training—they keep improving in deployment. For the first time, real-world experience is a scalable training resource.
For general-purpose robots to operate at scale in the real world, mere task feasibility is far from enough. The real challenge is maintaining high stability and reliability in complex, ever-changing environments while retaining strong generalization across vastly different tasks. Nor should a robot's capabilities be frozen at deployment: it should adapt rapidly to environmental changes and keep learning from real-world physical experience after it is deployed.
We introduce SOP (Scalable Online Post-training), a framework designed to enable online updates of Vision-Language-Action (VLA) models across robot fleets. By shifting the learning paradigm from offline to distributed online training, SOP establishes a mechanism for efficient reuse and rapid iteration of individual experience across the fleet. This framework achieves significant performance improvements on complex tasks within hours, laying the technical foundation for large-scale real-world deployment of general-purpose robots.
Foundation Models for General-Purpose Robots and the Performance Gap
Over the past few years, pre-trained VLA models have given general-purpose robots remarkable generalization. Trained on internet-scale data, they show a degree of general competence across task types, objects, and embodiments. Pre-training alone, however, does not efficiently yield high performance on specific tasks, so post-training has become the primary remedy. In Large Language Models (LLMs), post-training combined with reinforcement learning has been enormously successful: mainstream LLMs now match or exceed human experts on many tasks while retaining their generality.
Such success has not yet materialized in VLA post-training, because post-training in the physical world faces several challenges. First, there is significant distribution shift between high-quality data collected by humans in advance and the states robots actually encounter in deployment. Second, real-robot post-training has not yet reached scale, so the speed of robot exploration and learning remains severely limited. Third, single-task post-training often degrades generalization. There has been considerable progress on each of these issues, but a post-training method that solves all three at once has yet to emerge.
Continual Learning in the Real World with Distributed Robot Fleets
To our knowledge, SOP is the first to integrate online, distributed, and multi-task learning in real-world post-training. Our key finding is that these three components are not independent but complementary: the online mechanism alleviates distribution shift, the distributed architecture enables more efficient exploration of the state space, and multi-task learning effectively preserves generalization. By combining these advantages, SOP enables robots to rapidly improve performance across multiple tasks. SOP changes more than just a specific training technique—it systematically transforms the learning paradigm for general-purpose robots. Under this paradigm, robots can go online with an initially imperfect policy. Deployment is no longer the end of development, but a new starting point for continual learning. As the distributed fleet scales up, we observe near-linear growth in performance.
In SOP, real-world experience becomes a sustainable and scalable training resource. General-purpose robots should not be static products, but systems that continuously improve during operation. After SOP training, our robots operated autonomously on target tasks for over 36 hours without requiring human intervention.
SOP: A Scalable Online Post-training Method
SOP transforms VLA post-training from offline, single-machine, and sequential to online, fleet-based, and parallel. In short, it forms a closed loop: multi-robot "parallel realities" → centralized cloud learning → instant model synchronization.
Multi-robot Parallel Execution. Under this architecture, multiple robots share a single VLA policy while simultaneously handling a wide variety of tasks and instructions. This significantly broadens real-world state-action coverage, exposing the system to a far wider range of scenarios than single-machine online learning can reach.
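To make this concrete, here is a minimal sketch of what one fleet member's loop could look like. It is a toy under our own assumptions: `SharedPolicy`, `toy_env_step`, and the in-process queue are invented stand-ins, whereas the real system runs a VLA policy on physical robots that stream experience to a cloud endpoint.

```python
import queue
import random
import threading

# Invented stand-ins for illustration only; not the SOP implementation.
class SharedPolicy:
    def __init__(self, version=0):
        self.version = version
    def act(self, obs, instruction):
        return random.random()                      # placeholder action

def toy_env_step(obs, action):
    next_obs = obs + action
    reward = -abs(1.0 - next_obs)
    done = next_obs >= 1.0
    return next_obs, reward, done

def run_robot(robot_id, tasks, policy_ref, experience_queue, episodes=3, max_steps=50):
    """One fleet member: execute the latest shared policy on its own tasks
    and stream every trajectory to the central learner."""
    for _ in range(episodes):
        task = random.choice(tasks)                 # instructions vary across the fleet
        policy = policy_ref["current"]              # always the most recently synced policy
        obs, trajectory = 0.0, []
        for _ in range(max_steps):
            action = policy.act(obs, task)
            next_obs, reward, done = toy_env_step(obs, action)
            trajectory.append((obs, action, reward))
            obs = next_obs
            if done:
                break
        experience_queue.put((robot_id, task, policy.version, trajectory))

experience_queue = queue.Queue()
policy_ref = {"current": SharedPolicy()}            # swapped in place when new weights arrive
tasks = ["fold the cloth", "assemble the box", "clear the table"]
robots = [threading.Thread(target=run_robot, args=(i, tasks, policy_ref, experience_queue))
          for i in range(4)]
for r in robots:
    r.start()
for r in robots:
    r.join()
print(f"collected {experience_queue.qsize()} trajectories from {len(robots)} robots")
```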
Centralized Cloud Online Updates. All execution trajectories, reward signals, and human corrections are streamed in real time to cloud GPU clusters, where the policy model is continuously updated online; the optimized parameters are synchronized back to every robot within minutes. Learning is therefore always based on data from (or very close to) the current policy, which keeps online training stable and consistent.
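On the cloud side, the learner could be organized along the lines of the sketch below. The update rule, batch size, and synchronization cadence are placeholders of ours rather than SOP's actual configuration; in practice `update_policy` would be a GPU training step on the streamed trajectories, rewards, and corrections.

```python
import queue

def update_policy(policy_state, batch):
    """Placeholder for one online training step on a cloud GPU cluster;
    here it only bumps the checkpoint version."""
    policy_state["version"] += 1
    return policy_state

def broadcast_weights(policy_state, fleet):
    """Placeholder for pushing the new checkpoint back to every robot."""
    for robot in fleet:
        robot["policy_version"] = policy_state["version"]

def learner_loop(experience_queue, policy_state, fleet, min_batch=8, idle_timeout=60.0):
    """Consume streamed experience, update the policy continuously, and
    synchronize parameters back so robots keep acting near on-policy."""
    buffer = []
    while True:
        try:
            buffer.append(experience_queue.get(timeout=idle_timeout))
        except queue.Empty:
            break                                   # no data for a while: end the sketch
        if len(buffer) >= min_batch:
            policy_state = update_policy(policy_state, buffer)
            broadcast_weights(policy_state, fleet)  # robots see new weights within minutes
            buffer.clear()
    return policy_state
```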
Preserving Generalization While Improving Performance. SOP improves performance without sacrificing the robot's general capabilities. Traditional single-machine online training often degrades the model into a "specialist" that only excels at one task. SOP instead replaces sequentiality in time with parallelism in space, so multi-task learning happens simultaneously across a broader distribution and targeted performance gains do not come at the cost of the VLA's generality.
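One concrete way to read "parallelism in space rather than sequentiality in time" is that every gradient batch mixes experience from all tasks the fleet is currently running, instead of fine-tuning on one task after another. The balanced sampler below is our assumption about how such mixing could be done, not SOP's documented recipe.

```python
import random
from collections import defaultdict

def sample_multitask_batch(experience_by_task, batch_size):
    """Draw a batch balanced across tasks so the update never collapses onto
    whichever task happened to produce the most recent data."""
    active = [task for task, buf in experience_by_task.items() if buf]
    per_task = max(1, batch_size // len(active))
    batch = []
    for task in active:
        buf = experience_by_task[task]
        batch.extend(random.sample(buf, min(per_task, len(buf))))
    random.shuffle(batch)
    return batch

# Toy usage: three tasks contributing very unevenly, one balanced batch.
experience_by_task = defaultdict(list)
experience_by_task["fold cloth"] = list(range(100))
experience_by_task["assemble box"] = list(range(20))
experience_by_task["clear table"] = list(range(5))
batch = sample_multitask_batch(experience_by_task, batch_size=12)   # ~4 samples per task
```

Balancing per task, rather than sampling in proportion to data volume, is just one of several reasonable choices here.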
SOP Performance and Its Relationship with Pre-training
To validate the effectiveness of SOP, we considered three questions:
1. How much does SOP improve the performance of pre-trained VLAs? And how does it compare to previous offline approaches?
2. How does scaling the number of robots in a distributed fleet affect performance?
3. Can SOP provide consistent performance gains across pre-trained models of varying quality?
First, we needed to confirm the effectiveness of SOP. Since SOP focuses on system-level optimization, we selected two representative algorithms for comparison: HG-DAgger and RECAP. In their original implementations, HG-DAgger is limited to single-machine operation, while RECAP is offline. We evaluated the baseline model, then the baseline after iterating it with each of the two algorithms in their original form. We then integrated HG-DAgger and RECAP into SOP, creating two online variants (SOP w/ HG-DAgger and SOP w/ RECAP), and ran the same evaluation. Performance improved consistently across all test scenarios when combined with SOP. We also found that for the cloth-folding and box-assembly tasks, certain recovery behaviors introduced during SOP training significantly improved task throughput.
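To give a feel for how an intervention-based algorithm slots into the online loop: in an HG-DAgger-style variant, the frames where a human supervisor takes over provide expert action labels that can be streamed and used for supervised updates immediately. The field names and data layout below are our illustration, not the exact interface used in these experiments.

```python
def intervention_examples(trajectories):
    """Extract supervised (observation, instruction, human_action) examples
    from the segments where the human supervisor overrode the policy; the
    autonomous portions are kept only for reward/throughput bookkeeping."""
    examples = []
    for trajectory in trajectories:
        for frame in trajectory:
            if frame["intervened"]:                 # human was in control for this step
                examples.append((frame["obs"], frame["instruction"], frame["action"]))
    return examples
```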
For the second question, we compared three fleet sizes (single-, dual-, and quad-robot configurations) under the same total training time. The results show that, for a fixed wall-clock budget, more robots lead to higher performance: with a 3-hour training limit, the 4-robot fleet reached a final success rate of 92.5%, 12 percentage points higher than the single-robot configuration. We believe multi-robot data collection effectively prevents the model from overfitting to the idiosyncrasies of a single machine. SOP also translates hardware scaling into a substantial reduction in learning time: reaching 80% success took 174 minutes with one robot but only 72 minutes with four, a 2.4x speedup.
| Fleet Size | Success Rate (3 h) | Time to 80% Success | Speedup vs. 1 Robot |
|---|---|---|---|
| 1 Robot | 80.5% | 173.6 min | 1.0× |
| 2 Robots | 88.7% | 126.5 min | 1.4× |
| 4 Robots | 92.5% | 71.7 min | 2.4× |
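For reference, the speedup column follows directly from the time-to-80% figures, with the single-robot configuration as the baseline:

```python
time_to_80 = {"1 robot": 173.6, "2 robots": 126.5, "4 robots": 71.7}   # minutes
baseline = time_to_80["1 robot"]
speedup = {fleet: round(baseline / t, 1) for fleet, t in time_to_80.items()}
# -> {'1 robot': 1.0, '2 robots': 1.4, '4 robots': 2.4}
```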
Finally, we explored the relationship between SOP and pre-training data. From 160 hours of multi-task pre-training data we built three subsets (20, 80, and 160 hours), trained a separate base model on each, and then applied SOP to all three. We found that the scale of pre-training determines both the base model's performance and the trajectory of post-training improvements: SOP consistently improved every model, yet final performance remained correlated with pre-training scale. This suggests that online learning after deployment primarily refines the model's existing knowledge rather than replacing large-scale pre-training. At the same time, comparing the 80-hour and 160-hour results shows that on-policy experience delivers far larger marginal gains on specific failure cases: SOP achieved roughly a 30% performance improvement with just three hours of on-policy experience, whereas 80 hours of additional pre-training data yielded only a 4% gain. When pre-training reaches diminishing marginal returns, SOP is clearly the better way to close the remaining performance gap.
Rapid Performance Gains in Novel Real-World Scenarios
At the outset, we discussed our core motivation: scaling general-purpose robots in the real world. Believing that real-world practice is the ultimate test, we deployed our robot fleet into novel environments unseen during pre-training and trained it online with SOP. In an unfamiliar environment, success rates and throughput can drop even on familiar tasks. After just a few hours of learning, the robots' performance improved markedly, and they could robustly execute relatively complex practical tasks. On the path toward large-scale real-world deployment of general-purpose robots, this is a solid step forward.
Towards Large-Scale Real-World Deployment
SOP changes more than a specific technique; it redefines the lifecycle of robotic systems. We believe robots should not be "standardized products with fixed performance," but "living entities that continuously improve in the real world." Deployment is not the end of technical iteration but the starting point for learning at a larger scale. If VLA gave robots their first general understanding and action capabilities, SOP lets the shared experience of many robots collectively drive the rapid growth of their intelligence. Training is no longer locked in the past; intelligence grows in the present. This is a critical step for general-purpose robots toward large-scale real-world deployment.