CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

Source: arXiv:2606.12352 · Published 2026-06-10 · By Ria Doshi, Tian Gao, Annie Chen, Chelsea Finn, Jeannette Bohg

TL;DR

CHORUS addresses the challenge of decentralized multi-robot collaboration without requiring joint observations or inter-robot communication at inference time. Prior centralized methods scale poorly with team size and require communication to gather global state, while decentralized approaches often need per-robot policies, alignment procedures, or shared sensors during inference. The key insight is that pretrained vision-language-action (VLA) models, which encode rich visuomotor priors, can enable fully decentralized collaboration by conditioning a single shared policy on each robot's identity prompt and local observation alone. CHORUS finetunes a pretrained VLA backbone with a novel training procedure using multi-robot demonstration data, enabling deployment of one policy copy per robot running independently. Experimental results on real-world tasks—like tape measurement, book handover, basket lifting, and 3-robot transport—show CHORUS outperforms decentralized from-scratch diffusion policies by 64 percentage points in success rate, improves teammate reactivity by 40% points, and exceeds centralized methods despite lacking joint state access. The shared policy also scales seamlessly to heterogeneous teams with three robots, achieving 90% task success without architectural changes. These findings demonstrate that strong VLA priors and a single shared policy with robot identity conditioning can unlock scalable, reactive decentralized multi-robot collaboration without inter-robot communication or per-robot policy training.

Key findings

CHORUS improves mean task success rate by 64 percentage points over decentralized diffusion policies trained from scratch on multi-robot collaboration tasks (Figure 4).
In teammate reactivity experiments on mobile book handover, CHORUS recovers from teammate perturbations 40% more often than per-robot policies trained without weight sharing (17/20 vs 9/20 success) (Figure 5).
CHORUS outperforms a centralized VLA policy that conditions on the full team observation and outputs joint actions, despite the centralized model having strictly more information (Figure 6).
The single shared CHORUS policy scales to three diverse robots in a real-world basket lifting and transport task, achieving 90% success without architecture modifications.
CHORUS maintains constant context window and parameter count as team size grows, unlike centralized (context scales linearly) and per-robot decentralized (parameters scale linearly) baselines (Table 1).
Decentralized execution with CHORUS tolerates asynchronous sensor rates and variations in control frequency via per-robot action chunking proportional to control rates.
Pretrained VLA priors provide robust visuomotor understanding enabling multi-robot collaboration out-of-distribution from single-robot pretraining, as demonstrated by large performance gaps over from-scratch imitation models.
CHORUS requires only local robot observations plus an identity prompt, enabling fully decentralized inference with no inter-robot communication or shared cameras, improving deployability and scalability.

Threat model

The adversary is an environment or system setting where robots operate independently with access only to their local observations, without inter-robot communication or shared proprioceptive information at inference. Robots must react in real-time to teammates solely based on visual input. There is no assumption that robots can centrally coordinate or share state summaries. Adversarial attacks or malicious interference are not directly modeled here, but challenges include partial observability, asynchronous control rates, and heterogeneous embodiment differences among teammates.

Methodology — deep read

The threat model assumes decentralized multi-robot teams where each robot is controlled independently and only has access to local visual observations; robots cannot share observations, proprioception, or communicate at inference. The adversary is not explicitly modeled but constraints arise from partial observability and heterogeneity of robot embodiments. Data: The team collects multi-robot demonstration data via teleoperation using TidyBot++ controlling Kinova, ARX, and YAM robot arms. Each episode logs synchronized trajectories across all robots: observations or and action sequences Ar for each robot r over time. Data includes 25-45 demos per task for multiple collaborative tasks (e.g., tape measure, book handover, basket lift). Per-robot training tuples (ot_r, At_r, cr) are extracted with a horizon H of future actions At_r and a robot-identifying prompt cr naming the embodiment and role. These tuples contain only the single robot's local observations, no joint observations across robots. The dataset D pools all robot tuples. Policy architecture: CHORUS fine-tunes a pretrained vision-language-action (VLA) model π0.5 as a shared parameter policy πθ(Ar | or, cr) that takes local robot observation or and a robot ID prompt cr and outputs padded action vectors accommodating multiple embodiments. The policy uses a flow-matching loss adapted from the pretrained backbone. Robot ID prompts prepended to the input name the robot type, helping the policy to specialize within one shared network. Actions are predicted as chunks over time aligned with varying control rates of robots. Training uses Low-Rank Adaptation (LoRA) with rank 16 and 32 adapters on the VLM and action decoder respectively and AdamW optimizer under a cosine learning rate schedule. The robot sampler balances sampling frequencies of each robot's data during training to handle differing control rates. Training batches draw single-robot (observation, action chunk, ID) tuples independently from D, so the model never sees joint states or actions during training. Evaluation protocol: Tasks tested include real-world multi-robot collaborations for mobile tape measurement, book handover, laundry basket lifting, and 3-robot moving through doorways. Multiple baselines compare CHORUS to decentralized diffusion policies trained from scratch per robot (MIMIC-D), CHORUS without weight sharing (independent per-robot policies initialized from the same backbone), and a centralized VLA policy conditioned on all robots' joint observations and actions. Metrics include task success rate and reactivity to perturbed teammates. Ablations isolate the value of pretrained VLA priors and weight sharing. Cross-embodiment generalization and scaling to 3 robots assess robustness. Deployment is fully decentralized: at inference each robot runs its own CHORUS copy conditioned only on local observations and ID prompt, no communication or global observation aggregation. The asynchronous control frequencies are handled via chunk size scaling. Reproducibility: The pretrained VLA π0.5 backbone is publicly released (OpenVLA), but the multi-robot demonstration dataset is private. Training details and hyperparameters are in the appendix, and code for CHORUS finetuning is available on the project site. Exact seeds or hardware details are not fully specified in the source, limiting exact reproduction, but the methodology is described in detail to enable replication.

Technical innovations

Demonstration that a single pretrained vision-language-action (VLA) policy can be finetuned to control multi-robot teams in a fully decentralized manner without inter-robot communication at inference time.
Introduction of robot-identifying prompts as a conditioning mechanism that allows one shared policy to handle diverse robot embodiments and roles within the same model.
Use of flow-matching loss and LoRA adapters to efficiently finetune a large pretrained VLA backbone on multi-robot demonstration data while maintaining cross-embodiment generalization.
An action chunking method proportional to each robot’s control frequency enabling asynchronous execution across heterogeneous robots in decentralized deployment.
Empirical finding that the shared weight policy improves teammate reactivity and collaboration success compared to independent per-robot policies trained from the same initialization.

Datasets

Multi-robot demonstration data — approx. 25-45 demos per multi-robot task — collected with human teleoperation via TidyBot++ interface (not public)
Pretrained vision-language-action backbone π0.5 — publicly released as OpenVLA [7]

Baselines vs proposed

Decentralized diffusion policy (MIMIC-D): mean success rate ≈ 0.3 vs CHORUS mean success rate ≈ 0.94 (64 percentage point improvement) on multi-robot tasks
CHORUS (w/o Weight Sharing): success rate ~0.74 vs CHORUS full model ~0.94, demonstrating 40% more recovery in teammate reactivity perturbation test
VLA Centralized policy (condition on full team observation): success rate ~0.75 vs CHORUS ~0.94, indicating CHORUS outperforms despite less input information
Scaling to 3-robot teams: CHORUS achieves 90% task success without architectural changes

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12352.

Fig 1

Fig 1: We introduce CHORUS, a single VLA policy trained for decentralized, multi-embodiment collabo-

Fig 2

Fig 2: CHORUS overview. Training: a single π0.5 VLA is finetuned using LoRA on multi-robot

Fig 3

Fig 3 (page 1).

Fig 4

Fig 4 (page 1).

Fig 5

Fig 5 (page 1).

Fig 6

Fig 6 (page 1).

Fig 7

Fig 7 (page 1).

Fig 3

Fig 3: Evaluation tasks. We evaluate on a suite of multi-embodiment collaboration tasks: basket

Limitations

CHORUS cannot handle tasks requiring strict instantaneous synchronization of actions (e.g., perfectly simultaneous gripper closure), which necessitates centralized control.
The approach relies on demonstration data where the task can be completed in a decentralized manner with sufficiently informative local observations; tasks violating this are out of scope.
The available large-scale collaborative multi-robot datasets are limited; scaling CHORUS to more complex or diverse collaboration will require sustained community data collection efforts.
Centralized policies trained in this setup perform worse, hinting at a distribution shift and input dimensionality challenges when adapting pretrained single-robot VLA models to joint observations.
Although CHORUS tolerates minor desynchronization via chunked actions, large latency discrepancies or major asynchronous behaviors could degrade performance.
Exact reproducibility is limited due to private datasets and incomplete training seed/hardware details.

Open questions / follow-ons

Can CHORUS be extended or modified to handle tasks requiring strict simultaneous synchronized actions across multiple robots?
How would CHORUS performance scale beyond three robots in larger heterogeneous teams with greater diversity in embodiments and sensors?
Can pretrained VLA backbones be further adapted to explicitly model latent teammate states to improve robustness under high-latency or noisy observations?
What is the potential for combining CHORUS decentralized control with high-level multi-robot task decomposition and dynamic role allocation frameworks?

Why it matters for bot defense

For bot-defense or CAPTCHA-related applications, CHORUS demonstrates that strong pretrained vision-language-action priors enable decentralized decision-making from partial observations without communication or centralized control. From a security standpoint, this idea could influence design of distributed bot detection or coordination tasks that must operate under partial observability and asynchronous inputs, avoiding costly global state sharing. The approach emphasizes flexibility and scalability when agents have heterogeneous capabilities or sensing modalities, which parallels real-world multi-agent bot defense contexts. Moreover, the demonstrated ability of a single shared policy conditioned on agent identity to coordinate collaborative behaviors may inspire designs for unified multi-agent defenses that are more sample- and compute-efficient. However, challenges remain in tasks requiring strict synchronization or where adversaries attempt to exploit partial observability gaps, so further work would be needed to secure such decentralized systems under adversarial conditions.

Cite

bibtex

@article{arxiv2606_12352,
  title={ CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy },
  author={ Ria Doshi and Tian Gao and Annie Chen and Chelsea Finn and Jeannette Bohg },
  journal={arXiv preprint arXiv:2606.12352},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.12352}
}

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​