Cross-Modal Navigation with Multi-Agent Reinforcement Learning

Source: arXiv:2605.06595 · Published 2026-05-07 · By Shuo Liu, Xinzichen Li, Christopher Amato

TL;DR

CRONA addresses the challenge of multi-modal embodied navigation by decomposing the problem across multiple specialized agents rather than forcing a single monolithic model to jointly align and reason over heterogeneous sensor streams. The core insight is that modality misalignment in joint optimization often causes dominant modalities to crowd out weaker but informative ones; by assigning each agent a distinct sensory specialization (e.g., one audio-only, one vision-only), the per-agent representation-learning problem is simplified and decentralized execution is preserved. The framework adds two key mechanisms on top of standard CTDE-MARL: auxiliary belief predictors that extract control-relevant signals (target location and category) from noisy audio spectrograms, and a centralized multi-modal critic that consumes joint histories, auxiliary beliefs, and privileged global state during training only.

The work is evaluated on a newly constructed benchmark spanning five Matterport3D scenes of increasing size and complexity (Studio, Corridor, Apartment, Ranch, Maze), with visual perception deliberately degraded (16×16 depth maps, 10° HFoV, 5 m range) to prevent trivially easy visual localization. CRONA is compared against a single-agent monolith and three homogeneous multi-agent baselines (VLA-Collab, ALA-Collab, AVLA-Collab) across task success rate, detection rate, average agent-to-target distance, steps per episode, and timeout rate.

The main result is that no single modality or configuration dominates across all environments. The authors identify five empirical patterns of modality dominance and show that CRONA achieves the best aggregated success rate (Fig. 3f) while being substantially more robust to low visual resolution and small model capacity than homogeneous full-modality baselines. However, in the most complex scene (Maze), the homogeneous full-modality baseline AVLA-Collab outperforms CRONA, revealing a capacity-versus-specialization trade-off that the paper does not fully resolve.

Key findings

CRONA achieves an aggregated success rate that leads all baselines (Fig. 3f), but in the Maze scene specifically, AVLA-Collab achieves 26.16% vs. CRONA's 12.13%, indicating that full-modality homogeneous agents outperform cross-modal specialists in the largest, most complex environment.
In Ranch (cross-modal dominance), CRONA achieves 64.62% success vs. AVLA-Collab's 18.93% and VLA-Collab's 38.97%, a gap of ~45.7 percentage points over the full-modality homogeneous baseline at identical embedding size (100, ~28–38 MiB models).
AVLA-Collab is highly sensitive to embedding size on Ranch: its success rate rises from 0.06% at embedding size 60 to 73.33% at size 180 (+73.27 pp). CRONA rises from 11.38% to 68.75% across the same range (+57.37 pp), but it already reaches 64.62% at size 100 and gains only 4.13 pp from there to size 180, so it needs far less capacity to approach its peak.
Removing global state from the centralized critic is catastrophically harmful: both AVLA-Collab (0.06%) and CRONA (0.13%) nearly fail to learn on Ranch without it, versus 18.93% and 64.62% with state — a drop of ~18.87 pp and ~64.49 pp respectively.
Location belief is the dominant auxiliary component: removing it alone drops CRONA from 64.62% to 26.16% (−38.46 pp), whereas removing only category belief costs just 2.04 pp, suggesting spatial grounding is far more policy-critical than semantic classification of audio targets.
CRONA degrades from 64.62% to 42.76% success when visual resolution is cut from 16×16 to 4×4 pixels (−21.86 pp, retaining about two thirds of its success rate), while VLA-Collab drops from 38.97% to 12.76% (−26.21 pp, retaining about one third). AVLA-Collab loses only 3.50 pp (18.93% to 15.43%), but from a much lower starting point, so CRONA remains the strongest configuration at 4×4 by a wide margin.
In the Corridor scene (audio dominance), at least 87.36% of episodes time out for VLA-Collab and AVLA-Collab, while ALA-Collab achieves 25.31% success with a 74.68% timeout rate, illustrating that visual inputs can actively harm performance when the task geometry is audio-favorable.
The single-agent monolith baseline fails catastrophically on the hardest scene (Maze): 0.00% success rate with only 15.32 average steps, suggesting premature stopping, versus AVLA-Collab's 26.16% success and 624.50 average steps.
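
The percentage-point gaps quoted in the findings above can be re-derived from the reported success rates. The following is a minimal Python sketch of that arithmetic for the Ranch resolution ablation. The helper names (`pp_drop`, `retained`) are illustrative and not from the paper, and the "retained" ratio is an extra view added here, not a metric the authors report.

```python
# Ranch success rates (%) as quoted in the key findings.
ranch_16x16 = {"CRONA": 64.62, "VLA-Collab": 38.97, "AVLA-Collab": 18.93}
ranch_4x4 = {"CRONA": 42.76, "VLA-Collab": 12.76, "AVLA-Collab": 15.43}


def pp_drop(before, after):
    """Absolute change in percentage points."""
    return round(before - after, 2)


def retained(before, after):
    """Share of the original success rate that survives, in percent."""
    return round(100.0 * after / before, 1)


# Map each method to (absolute drop in pp, relative retention in %).
resolution_report = {
    name: (
        pp_drop(ranch_16x16[name], ranch_4x4[name]),
        retained(ranch_16x16[name], ranch_4x4[name]),
    )
    for name in ranch_16x16
}

for name, (drop, kept) in resolution_report.items():
    print(f"{name}: -{drop} pp, {kept}% retained")
```

Looking at both columns together makes the comparison less ambiguous: a small absolute drop can simply reflect a low starting point.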

Methodology — deep read

Threat model and assumptions: This is not a security paper; there is no adversarial threat model in the classical sense. The adversarial pressures are environmental — noisy and stochastic audio, degraded visual inputs, partial observability, and mixed sound sources from multiple targets. Agents have no inter-agent communication during execution; each operates purely on local observation-action history. The global state (all agent/target positions, task completion flags) is treated as privileged information available only during centralized training, consistent with the CTDE paradigm.

Data provenance and benchmark construction: The paper uses Matterport3D [80] as the scene source, with observations simulated via Habitat and libsora. Five scenes are manually selected to span difficulty levels: Studio (1 room, 1 target), Corridor (1 hallway, 1 target), Apartment (3 rooms, 2 targets), Ranch (7 rooms, 2 targets), Maze (large complex layout, 3 targets). Episode horizons H are set to 70, 150, 500, 1000, 1500 respectively, scaled to scene complexity. At episode start, agent positions and orientations are randomly initialized on the navigable mesh. Sound sources are mixed additively and removed once a target is found. Exact dataset sizes (number of episodes per scene) are deferred to Appendix B, which is not included in the truncated text — this is a reproducibility gap. Visual inputs are intentionally restricted to 16×16 depth maps with 10° HFoV and 5 m range to prevent trivially easy visual navigation; this is a deliberate benchmark design choice, not a limitation of the sensor hardware.
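
As a reading aid, the benchmark layout described above can be written down as a small configuration table, together with the additive sound-mixing rule. This is a hedged sketch: the `Scene` class and `mix_active_sources` function are hypothetical names, the waveforms are mono lists of samples for simplicity (the paper uses binaural audio), and nothing here reflects the actual simulator API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    name: str
    layout: str
    num_targets: int
    horizon: int  # maximum steps per episode (H)


# Values as listed in the benchmark description.
SCENES = [
    Scene("Studio", "1 room", 1, 70),
    Scene("Corridor", "1 hallway", 1, 150),
    Scene("Apartment", "3 rooms", 2, 500),
    Scene("Ranch", "7 rooms", 2, 1000),
    Scene("Maze", "large complex layout", 3, 1500),
]


def mix_active_sources(waveforms, found):
    """Sum the waveforms of targets that have not been found yet.

    waveforms: dict mapping target name -> list of samples (equal lengths).
    found: set of target names already reached; their sound is removed.
    """
    length = len(next(iter(waveforms.values()), []))
    mixed = [0.0] * length
    for name, samples in waveforms.items():
        if name in found:
            continue  # a found target no longer emits sound
        for i, sample in enumerate(samples):
            mixed[i] += sample
    return mixed


# Toy two-sample waveforms, invented for illustration only.
sounds = {"camera_shutter": [1.0, 2.0], "silverware": [0.5, -1.0]}
print(mix_active_sources(sounds, found=set()))
print(mix_active_sources(sounds, found={"camera_shutter"}))
```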

Architecture and novel components: Each agent has a modality-specific convolutional encoder. Vision agents use ResNet-18 for RGB-D encoding; audio agents use a shallower CNN applied to binaural magnitude spectrograms computed via STFT (Eq. 1, with left/right channels, hop size δ, FFT size N_fft). Encoded features are concatenated with pose and language goal embedding to form a per-timestep observation embedding. A fixed-size memory cache stores the current and k previous embeddings plus k previous actions; this sequence is processed by transformer blocks (multi-head attention) to produce a compact history representation z^h_{i,t}. The novel auxiliary belief predictor (Section 4.1) is audio-agent-specific: a location head predicts a 2D global sound-source coordinate, which is converted to the agent-relative frame via a 2D rotation matrix (Eq. 2); a category head outputs a multi-label probability vector over C categories. Both predictions are smoothed with an exponential moving average with coefficient α (Eq. 3). The predictor is supervised with an L2 location loss plus a binary cross-entropy category loss (Eq. 4) using privileged training-time labels. The centralized critic (Section 4.3) takes the concatenated joint history z^h_t, all auxiliary beliefs b_t, and the global state s_t as input, producing a joint value estimate. Encoder and transformer weights are shared between actors and critic; only the policy head and value head are separate. This sharing is motivated by computational efficiency and is claimed not to introduce value estimation bias (proved in Appendix A, not in truncated text).
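
The two post-processing steps applied to the location belief (global-to-relative conversion and exponential smoothing) can be illustrated with a short Python sketch. The conventions are assumptions, since Eq. 2 and Eq. 3 are not reproduced in this summary: heading is taken as a counter-clockwise angle from the global +x axis, the relative frame is (forward, left), and α is assumed to weight the previous smoothed value. The function names are hypothetical.

```python
import math


def to_relative(target_xy, agent_xy, agent_heading):
    """Express a global 2D point in the agent's own frame.

    Assumed convention: agent_heading is the counter-clockwise angle
    (radians) of the agent's forward axis measured from global +x.
    Returns (forward, left) offsets.
    """
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    c, s = math.cos(agent_heading), math.sin(agent_heading)
    # Rotate the offset by -heading (transpose of the 2D rotation matrix).
    return (c * dx + s * dy, -s * dx + c * dy)


def ema(previous, current, alpha):
    """Element-wise exponential moving average.

    Assumed form: smoothed = alpha * previous + (1 - alpha) * current.
    """
    return tuple(alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current))


# Toy usage: noisy global predictions of a source near (0, 2),
# seen by an agent at the origin facing global +y.
belief = (0.0, 0.0)
for predicted in [(0.1, 2.2), (-0.2, 1.9), (0.0, 2.0)]:
    relative = to_relative(predicted, agent_xy=(0.0, 0.0), agent_heading=math.pi / 2)
    belief = ema(belief, relative, alpha=0.5)
print(belief)
```

Smoothing in the relative frame is one possible ordering; the summary does not state whether the paper smooths before or after the frame change.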

Training regime: All agents are trained with a CTDE variant of PPO. Policy updates use a clipped surrogate objective (Eq. 7) with entropy regularization coefficient β and clipping range ε. Value updates use a clipped value surrogate (Eq. 6) with clip range ξ. Advantages are computed via GAE (Eq. 5) with discount γ and λ. The shared encoder is updated by a weighted combination of policy gradient and averaged TD loss with weight μ (Eq. 8). Training runs for 0.5 million environment steps (x-axis in Fig. 3). Results are averaged over 5 independent runs with 90% bootstrapped confidence intervals shown as shaded regions. Specific hyperparameter values (ε, ξ, β, γ, λ, μ, α, k, batch size, learning rate, random seeds) are stated to be in Appendix C/G, which are not in the truncated text — exact values are not reproducible from the main paper alone.
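
For readers less familiar with the PPO ingredients named above, here is a minimal pure-Python sketch of GAE, the clipped policy surrogate, and a clipped value loss. It follows the standard textbook forms and is an assumption about what Eqs. 5 to 7 look like, not a transcription of them; the entropy term (coefficient β) and the shared-encoder weighting μ from Eq. 8 are omitted, and all function names are illustrative.

```python
def gae(rewards, values, gamma, lam, last_value=0.0):
    """Generalized advantage estimation over one trajectory.

    values[t] is the critic estimate at step t; last_value bootstraps the
    state after the final step (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def policy_objective(ratios, advantages, eps):
    """Mean clipped surrogate, to be maximized. ratios = pi_new / pi_old."""
    terms = [
        min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


def value_loss(values, old_values, returns, xi):
    """Mean clipped squared error, to be minimized (standard PPO form)."""
    terms = []
    for v, v_old, ret in zip(values, old_values, returns):
        v_clipped = v_old + clip(v - v_old, -xi, xi)
        terms.append(max((v - ret) ** 2, (v_clipped - ret) ** 2))
    return sum(terms) / len(terms)


# Toy numbers, chosen only to exercise the functions.
adv = gae(rewards=[1.0, 1.0], values=[0.0, 0.0], gamma=0.5, lam=1.0)
print(adv, policy_objective([1.5, 0.5], adv, eps=0.2))
```

In a CTDE setup the `values` would come from the centralized critic, while the probability ratios are computed per agent from local histories.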

Evaluation protocol: Metrics include task success rate (all targets found before the horizon), average agent-to-nearest-target distance (Dist, in meters), target detection rate (Detect, %), average steps per episode, and episode timeout rate (%). Baselines: Single-Agent monolith (same modalities, one agent, same horizon), VLA-Collab (two vision+language agents), ALA-Collab (two audio+language agents), AVLA-Collab (two full-modality agents). All baselines use identical hyperparameters and per-agent parameter counts. No held-out test split is described; evaluation appears to occur on the training distribution (random initializations within the same scenes), which is a notable limitation. No statistical significance tests beyond bootstrapped CIs are reported. Ablations (Table 2) vary embedding size (60/100/140/180), visual resolution (4×4, 8×8, 16×16, 32×32), and component removal (no category belief, no location belief, no beliefs at all, critic without global state), all run on the Ranch scene only.
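
The bootstrapped confidence intervals used as the only uncertainty measure can be sketched as follows. This is a generic percentile bootstrap over per-run means, which is an assumption about the procedure (the summary does not specify the variant), and the five success rates below are invented placeholders, not values from the paper.

```python
import random


def bootstrap_ci(samples, confidence=0.90, n_resamples=10_000, seed=0):
    """Percentile bootstrap confidence interval for the mean of samples."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(samples, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo_idx = int(tail * (n_resamples - 1))
    hi_idx = int((1.0 - tail) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical success rates (%) from 5 independent runs; NOT paper data.
runs = [61.0, 66.5, 63.2, 68.1, 64.3]
low, high = bootstrap_ci(runs)
mean = sum(runs) / len(runs)
print(f"mean={mean:.2f}, 90% CI=({low:.2f}, {high:.2f})")
```

With only five runs the resampled means take a limited set of values, so intervals of this kind are coarse and should be read as rough indicators rather than significance tests.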

Concrete end-to-end example: In Ranch, a vision agent (green) and an audio agent (blue) are randomly initialized in the scene. The audio agent computes a binaural spectrogram from the mixed camera-shutter and silverware-dropping sounds, runs it through its CNN encoder, and its belief predictor outputs a smoothed relative direction toward the nearest sound source, in this case the picture emitting the camera-shutter sound. The vision agent receives a 16×16 depth map with 10° HFoV and uses ResNet-18 to encode spatial structure, locating the large table in the open dining room. Each agent maintains a transformer-encoded history of its last k embeddings and actions. The centralized critic, during training, receives both agents' history embeddings, the audio agent's location and category beliefs, and the full global state (both target positions, both agent poses, completion flags) to estimate the joint value. After convergence, the audio agent navigates toward the picture and stops within the required proximity, while the vision agent independently navigates toward the table. Across evaluation episodes, this configuration reaches 64.62% task success on Ranch.
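
The per-agent history that feeds the transformer blocks is a simple bounded buffer. A minimal sketch of such a cache, assuming plain Python objects stand in for embeddings and actions (the class and method names are hypothetical, and the real system would store tensors):

```python
from collections import deque


class HistoryCache:
    """Fixed-size memory: current + k previous embeddings, k previous actions."""

    def __init__(self, k):
        self.embeddings = deque(maxlen=k + 1)  # oldest entries fall off
        self.actions = deque(maxlen=k)

    def observe(self, embedding):
        """Store the observation embedding for the current timestep."""
        self.embeddings.append(embedding)

    def act(self, action):
        """Store the action just taken; it counts as 'previous' next step."""
        self.actions.append(action)

    def sequence(self):
        """Return (embeddings, actions), oldest first, for the history encoder."""
        return list(self.embeddings), list(self.actions)


# Toy rollout with string placeholders instead of real embeddings.
cache = HistoryCache(k=2)
for t in range(5):
    cache.observe(f"e{t}")
    if t < 4:
        cache.act(f"a{t}")
print(cache.sequence())
```

Because each agent keeps its own cache and never reads another agent's, this structure is compatible with the no-communication execution constraint.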

Reproducibility: Code is not mentioned as released in the truncated text. Exact dataset episode counts, most hyperparameters, the bias-free critic proof, and compute resources are all deferred to appendices not available in the provided text. The Matterport3D scenes are identified by their public IDs (e.g., JeFG25nYj2p for Ranch), which allows partial reproduction. Frozen weights are not mentioned.

Technical innovations

Auxiliary belief predictor for audio agents: a jointly trained location head (L2 loss over 2D sound-source coordinates) and category head (binary cross-entropy over C categories), both smoothed via exponential moving average, that extract control-relevant signals from noisy spectrograms without requiring inter-agent communication — distinct from prior work that inputs raw features directly into policy networks (e.g., AV-Nav [15]).
Centralized multi-modal critic with global state augmentation: the critic conditions on joint agent histories, auxiliary beliefs, and privileged global state simultaneously, with a theoretical proof (Appendix A) that this augmentation does not introduce bias into value estimation, extending standard CTDE critics (e.g., MAPPO [78]) that typically condition only on joint observations.
Cross-modal heterogeneous MARL benchmark over Matterport3D: five scenes with deliberate visual degradation (16×16 depth, 10° HFoV) and additive mixed audio from multiple simultaneous sound sources, constructed specifically to force cross-modal complementarity — prior collaborative navigation benchmarks (CoNav [39], CoNavBench [38]) use centralized information or homogeneous agents.
Empirical taxonomy of five modality-dominance patterns (no clear dominance, vision dominance, audio dominance, cross-modal, multi-modal dominance) derived from systematic comparison across scenes, offering a practical framework for predicting when heterogeneous vs. homogeneous agent configurations are preferable.
Shared encoder architecture between decentralized actors and centralized critic: modality-specific encoders and transformer history blocks are jointly optimized via a weighted sum of policy gradient and TD loss (Eq. 8, weight μ), reducing parameter redundancy while maintaining decentralized execution.
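
The supervision signal for the auxiliary belief predictor (first item above) combines a location term and a multi-label category term. The sketch below shows one plausible form, under stated assumptions: the L2 term is taken as squared Euclidean distance, the binary cross-entropy is averaged over the C categories, and the `category_weight` parameter is hypothetical, since the summary does not say how Eq. 4 balances the two terms.

```python
import math


def location_loss(pred_xy, true_xy):
    """Squared Euclidean distance between predicted and true 2D position."""
    return sum((p - t) ** 2 for p, t in zip(pred_xy, true_xy))


def category_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over C independent category labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def belief_loss(pred_xy, true_xy, probs, labels, category_weight=1.0):
    """Location term plus weighted category term (weighting is assumed)."""
    return location_loss(pred_xy, true_xy) + category_weight * category_loss(
        probs, labels
    )


# Toy example: two sound categories, only the first is currently active.
loss = belief_loss(
    pred_xy=(1.0, 0.0), true_xy=(0.0, 0.0), probs=[0.5, 0.5], labels=[1, 0]
)
print(loss)
```

Multi-label (rather than softmax) classification fits the setting because several sound sources can be active at once in the mixed audio.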

Datasets

Matterport3D (5 selected scenes) — episode counts not specified in main text — publicly available scenes (GdvgFV5R1Z5, ac26ZMwG7aT, 17DRP5sb8fy, JeFG25nYj2p, B6ByNegPMKs); observations simulated via Habitat and libsora

Baselines vs proposed

Single-Agent (monolith, all modalities): Ranch success rate = 12.34% vs. CRONA: 64.62%
VLA-Collab (homogeneous vision+language, 2 agents): Ranch success rate = 38.97% vs. CRONA: 64.62%
ALA-Collab (homogeneous audio+language, 2 agents): Ranch success rate = 42.15% vs. CRONA: 64.62%
AVLA-Collab (homogeneous full-modality, 2 agents): Ranch success rate = 18.93% vs. CRONA: 64.62%
AVLA-Collab: Maze success rate = 26.16% vs. CRONA: 12.13% (CRONA loses on hardest scene)
VLA-Collab: Studio success rate = 93.65% vs. CRONA: 95.72%
ALA-Collab: Corridor success rate = 25.31% vs. CRONA: 21.50% (CRONA loses on audio-dominant scene)
VLA-Collab: Apartment success rate = 78.96% vs. CRONA: 68.52% (CRONA loses on vision-dominant scene)
CRONA at 4×4 visual resolution (Ranch): success rate = 42.76% vs. VLA-Collab at 4×4: 12.76%
AVLA-Collab at embedding size 60: success rate = 0.06% vs. CRONA at embedding size 60: 11.38%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06595.

Fig 1: A collaborative navigation task in a Ranch scene

Figs 2–8 (captions not available).

Limitations

No generalization test: evaluation is entirely within the five training scenes using random initializations from the same navigable mesh — there is no held-out scene, held-out object category, or distribution shift evaluation, making it impossible to assess generalization to unseen environments.
Benchmark scale is small: only five hand-selected Matterport3D scenes with episode counts not reported in the main text; all ablations are performed on a single scene (Ranch), raising concerns about cherry-picking and result variance.
CRONA underperforms on three of five individual scenes (Corridor, Apartment, Maze), winning only on Studio and Ranch — the aggregated success rate metric in Fig. 3f obscures this inconsistency and risks overstating the method's generality.
No inter-agent communication is studied: the paper explicitly excludes communication as a design choice, but this means CRONA cannot benefit from coordination mechanisms (e.g., role negotiation, target assignment signaling) that are standard in more capable MARL systems; the framework's performance ceiling under this constraint is unclear.
Auxiliary belief predictor relies on privileged training labels (ground-truth target positions and category labels per timestep) that may not be available in real deployment scenarios; the sim-to-real gap for binaural audio simulation via libsora is not discussed.
Key hyperparameters (ε, ξ, β, γ, λ, μ, α, k, learning rate, batch size, seeds) are deferred entirely to appendices not included in the paper submission reviewed here, making independent reproduction difficult.
The theoretical no-bias proof for the augmented critic (Appendix A) is not available in the truncated text and cannot be verified; it is a non-trivial claim given that auxiliary beliefs are themselves learned and potentially biased estimates.

Open questions / follow-ons

Under what conditions does modality specialization scale to more than two agents and more than two modalities (e.g., RGB + depth + audio + LiDAR + language), and does the five-pattern taxonomy remain stable or collapse into new dominance regimes?
Can the auxiliary belief predictor be replaced by a self-supervised or unsupervised audio representation (e.g., contrastive audio pretraining) to remove the dependency on privileged per-timestep ground-truth location and category labels during training?
How does CRONA perform under deliberate acoustic or visual adversarial perturbation — for example, when an audio agent receives adversarially crafted sound mixtures designed to misdirect its belief predictor — given that the framework has no robustness mechanism against such inputs?
The paper identifies that AVLA-Collab at embedding size 180 outperforms CRONA at the same size on Ranch (73.33% vs. 68.75%): is there a principled criterion for choosing between modality specialization and full-modality homogeneity given a fixed compute/parameter budget, or must this be determined empirically per scene?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the core relevance of this paper is indirect but non-trivial: it demonstrates empirically that decomposing a complex multi-modal sensing problem across specialized agents can be more robust to degraded or noisy individual modalities than a monolithic fusion model. In a bot-detection context, this maps onto the idea of running modality-specialized anomaly detectors (e.g., one model for mouse movement, one for keystroke timing, one for browser fingerprint) in parallel under a centralized training signal, rather than forcing a single model to jointly align and weight heterogeneous behavioral signals. The finding that location belief (spatial grounding) contributes far more than category belief (semantic classification) to policy performance suggests that, in detection pipelines, precise behavioral localization signals (e.g., exact cursor trajectory geometry) may matter more than coarse behavioral category labels (e.g., 'likely bot' vs. 'likely human').

The modality-dominance taxonomy is directly actionable: just as audio dominates in the Corridor scene and vision dominates in the Apartment scene, different bot-attack vectors may be more legible in some behavioral channels than others, and a static multi-modal fusion model may be systematically misled by whichever channel the attacker has learned to spoof. The CRONA finding that mixing unreliable modalities into a capacity-constrained model can hurt performance (AVLA-Collab failing on Ranch at small embedding size) is a concrete warning against naively concatenating all available behavioral signals without considering their noise characteristics per attack scenario. The lack of any adversarial evaluation in the paper, however, means these analogies are speculative — CRONA's robustness under deliberate signal manipulation is entirely untested.

Cite

@article{arxiv2605_06595,
  title={Cross-Modal Navigation with Multi-Agent Reinforcement Learning},
  author={Shuo Liu and Xinzichen Li and Christopher Amato},
  journal={arXiv preprint arXiv:2605.06595},
  year={2026},
  url={https://arxiv.org/abs/2605.06595}
}
