Slow Brain, Fast Planner — Latency-Resilient VLM-Augmented Urban Navigation

Source: arXiv:2606.20458 · Published 2026-06-18 · By Zhenghao "Mark" Peng, Honglin He, Quanyi Li, Yukai Ma, Bolei Zhou

TL;DR

This paper addresses a critical bottleneck in learned local planners for urban sidewalk navigation: their scoring functions often fail to select the best trajectory under semantically challenging conditions. Planners generate diverse, dynamically feasible candidate trajectories quickly (5–20 Hz), but their internal scoring often fails to choose safe, socially compliant paths in complex scenarios such as junctions or pedestrian encounters. The authors identify and quantify this "trajectory scoring gap," showing that the planner's top-ranked trajectory has considerably higher error than the oracle-best candidate within the same proposal set. To bridge this gap without retraining the planner or replacing it with a slow Vision-Language-Action (VLA) model, the paper proposes a modular interface in which an off-the-shelf Vision-Language Model (VLM) selects among the planner's candidates. VLM inference latency (1–3 seconds), however, is incompatible with real-time control loops. The authors therefore develop a latency-resilient, training-free fusion mechanism that continuously biases the planner's real-time scoring toward the stale VLM selection, using geometric similarity weighted by an exponential decay in the selection's age, so the robot is controlled continuously without blocking on the VLM.

Evaluated on ~2,000 challenging real-world sidewalk navigation scenarios, the VLM selector reduces average displacement error (ADE) by about 30% compared to planner-only selection on hard cases, narrowing the gap to the oracle, while remaining competitive on easier routes. Closed-loop simulation with up to 5 s of VLM delay shows that the proposed Score Fusion maintains >80% task success, far above naive stale execution. Real-world deployment on a wheeled campus robot shows that the fusion approach reduces human interventions by roughly 75% compared to planner-only navigation and avoids the safety issues of executing stale VLM trajectories. The work makes VLM-augmented navigation more practical: the VLM adds semantic scoring despite its slow inference, with no retraining and no direct VLM control of the robot.

Key findings

The trajectory scoring gap on ~2,000 real-world hard sidewalk scenarios is quantitatively large: planner argmax ADE = 1.64 m vs. oracle best candidate ADE = 0.39 m, a recoverable gap of 1.25 m.
Off-the-shelf VLM selection without training reduces ADE by ~30% on hard scenarios (1.16 m ADE) but underperforms the planner argmax on normal scenarios (Fig. 2).
Score Fusion enables continuous control under VLM latencies of 1–5 s, maintaining >80% success rate in simulation (Fig. 6), whereas naive stale trajectory execution (VLM Hold) drops below 20% success beyond 3 s latency.
In real-world robot experiments under variable cellular latency (1.5–3 s), Probability Fusion with streaming reduces human interventions by 75% vs. planner-only navigation: 0.87 takeovers per 100 m compared to 3.49 (Tab. 3).
Among candidate generation methods, top-K score filtering outperforms geometric diversification for selecting candidates to prompt the VLM (Fig. 4A).
Increasing candidate set size K improves VLM selection quality up to ~18 candidates, after which benefits plateau (Fig. 4B).
Hiding planner scores and goal info from the VLM improves zero-shot selection quality by encouraging VLM semantic reasoning rather than deference to planner ranking (Fig. 4C).
Among VLM variants, Gemini 3 Flash is reported as the best quality–latency tradeoff overall (1.16 m ADE at 8.1 s latency), while Gemini 2.5 Flash Lite offers 1.21 m ADE at 1.7 s median latency and is preferred for real-time deployment.
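
The scoring gap in the first finding can be illustrated with a small sketch. It assumes ADE is the mean Euclidean distance between matched waypoints of a candidate and the human path (the usual definition); the function names and the toy trajectories are illustrative and do not come from the paper.

```python
import math

def ade(traj, ref):
    """Average displacement error: mean Euclidean distance over matched waypoints."""
    assert len(traj) == len(ref)
    return sum(math.dist(p, q) for p, q in zip(traj, ref)) / len(ref)

def scoring_gap(candidates, scores, ref):
    """Return (planner-argmax ADE, oracle-best ADE, gap) for one snapshot."""
    errors = [ade(c, ref) for c in candidates]
    top = max(range(len(scores)), key=scores.__getitem__)  # planner's own pick
    argmax_ade = errors[top]
    oracle_ade = min(errors)                               # best candidate in the set
    return argmax_ade, oracle_ade, argmax_ade - oracle_ade

# Toy snapshot (made-up numbers): the planner prefers the worse candidate.
ref = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
candidates = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],  # coincides with the human path
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # 1 m lateral offset
]
scores = [0.2, 0.8]
gap_result = scoring_gap(candidates, scores, ref)
```

Averaging the first two returned values over all snapshots would give the kind of dataset-level numbers quoted above (1.64 m vs. 0.39 m on the hard pool).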

Methodology — deep read

Threat Model and Assumptions: The adversary is not explicitly modeled since this is a robustness/performance paper rather than a security one. The system assumes a learned local planner that generates multiple feasible trajectory candidates and an off-the-shelf VLM (e.g., Gemini, GPT-5, Qwen) with high-level semantic understanding but slow (1–3 s) inference. The planner outputs candidate trajectories at 5–20 Hz. The VLM selects a candidate index asynchronously, with latency mismatch requiring a fusion strategy.
Data: The study uses two real-world urban sidewalk datasets: a "normal" pool (~3,000 snapshots) of routine navigation scenarios and a "hard" pool (~2,000 snapshots) curated from university campus logs for semantically challenging factors (junctions, pedestrians, terrain boundaries). Each snapshot contains an RGB camera frame, 64 anchor-based candidate trajectories generated by the planner, the planner's scores, and the ground-truth path executed by a human teleoperator.
Architecture / Algorithm: The system composes two loosely coupled loops:
- The fast loop runs a learned local planner (S2E) producing K candidate trajectories with dynamic feasibility and collision avoidance, each scored internally by the planner.
- The slow loop asynchronously queries an off-the-shelf VLM with a visual prompt overlaying the K candidate trajectories projected onto the current camera image with colored polylines and index labels. The VLM outputs the selected best candidate index in JSON.
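
Since the slow loop returns a candidate index as JSON, the fast loop needs to validate that reply before using it. The sketch below is one way to do that; the field name `selected_index` and the function name are hypothetical, as the reply schema is not specified here.

```python
import json

def parse_selection(reply: str, k: int):
    """Return a valid candidate index in [0, k), or None if the reply is unusable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    idx = data.get("selected_index")  # hypothetical field name
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(idx, bool) or not isinstance(idx, int):
        return None
    return idx if 0 <= idx < k else None
```

Returning `None` on any malformed or out-of-range reply lets the caller fall back to the planner's own ranking rather than act on a bad index.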

Because VLM inference latency is slow (∼1–3 s), the core novelty is a latency-resilient, training-free fusion layer that turns stale VLM selection into a continuous bias on planner scoring. The fusion computes horizon-aware geometric similarity between the current planner candidates and the stale VLM-chosen trajectory, after motion compensation and remaining path alignment. This similarity is exponentially decayed with time elapsed since selection to reduce stale influence. Two fusion methods are proposed: Score Fusion (adds a weighted similarity term to planner scores) and Probability Fusion (mixes planner and VLM probability distributions derived from scores and geometric similarities).
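
The two fusion rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the similarity kernel `exp(-mean distance / sigma)`, the softmax over planner scores, and the parameters `sigma`, `tau`, `lam` and `alpha` are all placeholders, and the stale trajectory is assumed to be already expressed in the current robot frame.

```python
import math

def similarity(cand, stale, sigma=0.5):
    """Assumed kernel: exp(-mean waypoint distance / sigma), in (0, 1]."""
    n = min(len(cand), len(stale))  # compare only the overlapping horizon
    mean_d = sum(math.dist(cand[i], stale[i]) for i in range(n)) / n
    return math.exp(-mean_d / sigma)

def decay(age_s, tau=2.0):
    """Exponential decay in the age of the VLM selection (seconds)."""
    return math.exp(-age_s / tau)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def score_fusion(scores, sims, age_s, lam=1.0, tau=2.0):
    """Add a decayed, weighted similarity bonus to each planner score."""
    w = lam * decay(age_s, tau)
    return [s + w * g for s, g in zip(scores, sims)]

def probability_fusion(scores, sims, age_s, alpha=0.7, tau=2.0):
    """Mix a planner distribution with a similarity-based VLM distribution."""
    w = alpha * decay(age_s, tau)
    p_plan = softmax(scores)
    z = sum(sims)
    p_vlm = [g / z for g in sims]
    return [(1 - w) * p + w * q for p, q in zip(p_plan, p_vlm)]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Toy case: the VLM picked a left-curving path; the planner prefers the right one.
stale_choice = [(1.0, 0.5), (2.0, 1.0)]
candidates = [
    [(1.0, 0.5), (2.0, 1.0)],    # matches the VLM's pick
    [(1.0, -0.5), (2.0, -1.0)],  # mirrored path
]
planner_scores = [0.0, 0.5]
sims = [similarity(c, stale_choice) for c in candidates]
fresh_pick = argmax(score_fusion(planner_scores, sims, age_s=0.0))   # VLM bias wins
stale_pick = argmax(score_fusion(planner_scores, sims, age_s=10.0))  # planner wins
```

With a fresh selection the bias overrides the planner's preference; once the selection is ten seconds old the decayed weight is negligible and the planner's own ranking takes over, which is the fallback behavior the fusion layer is meant to provide.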

Training Regime: The planner (S2E) is pre-trained (details in appendix) on local trajectory prediction. No fine-tuning of any model is done. VLM selection uses zero-shot prompting of off-the-shelf VLMs without any training. This decoupling avoids costly joint optimization.
Evaluation Protocol: Trajectory selection is evaluated offline on the real-world log dataset using the ADE metric against human teleoperator paths. Closed-loop simulation uses a delayed oracle VLM selector and a corrupted planner score to measure success rate against latency (0–5 s), comparing no fusion, simple stale execution, and the fusion policies. Real robot deployment is evaluated on 5 campus sidewalk routes under cellular latency, with safety measured by human operator takeovers per 100 m and related intervention metrics. Ablations explore candidate filtering, prompt engineering, and VLM model variants.
Reproducibility: The paper does not explicitly state whether code and datasets are publicly released; the planner model is standard (S2E). The VLMs are commercial, closed APIs (Gemini, GPT-5). The fusion mechanism is described in enough detail for replication, and exact seeds and hyperparameters for Score Fusion and Probability Fusion are reported in the appendix.

Concrete example: At each 200 ms control tick, the planner generates 64 candidate (x,y) waypoint trajectories. The top K = 18 by planner score are overlaid on the RGB image as colored polylines labeled by index. The VLM is queried asynchronously with this visual prompt and returns a chosen candidate index after ~1.7 seconds. Meanwhile, the planner keeps running. The fusion layer computes the geometric similarity between each fresh candidate and the stale VLM-chosen trajectory (after odometry compensation and restriction to the remaining arc-length segment), applies exponential decay based on the time elapsed since the VLM selection, and biases the planner scores accordingly. The candidate with the highest fused score is executed, so control continues at 5 Hz without blocking on the slow VLM. The VLM thus steers selection at a semantic level, with a safe fallback to the fast planner's own scoring when VLM guidance is unavailable or stale.
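
The pre-processing steps in this example (top-K filtering, odometry compensation, and keeping only the remaining path) can be sketched as below. The helper names are hypothetical, the robot motion is assumed to be a planar `(dx, dy, dtheta)` expressed in the robot frame at query time, and "remaining path" is approximated here as everything from the waypoint nearest the robot onward, which is an assumption rather than the paper's exact rule.

```python
import math

def top_k_indices(scores, k=18):
    """Indices of the k highest planner scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def to_current_frame(traj, dx, dy, dtheta):
    """Re-express waypoints from the query-time robot frame in the current robot frame.

    (dx, dy, dtheta) is the robot's motion since the query, in the query-time frame.
    """
    c, s = math.cos(dtheta), math.sin(dtheta)
    out = []
    for x, y in traj:
        tx, ty = x - dx, y - dy                         # undo the translation
        out.append((c * tx + s * ty, -s * tx + c * ty)) # rotate by -dtheta
    return out

def remaining_path(traj):
    """Keep waypoints from the one closest to the robot (the origin) onward."""
    start = min(range(len(traj)), key=lambda i: math.hypot(*traj[i]))
    return traj[start:]

# The robot drove 1 m straight ahead while the VLM was thinking.
stale = [(0.5, 0.0), (1.0, 0.0), (1.5, 0.0), (2.0, 0.0)]
compensated = to_current_frame(stale, dx=1.0, dy=0.0, dtheta=0.0)
still_ahead = remaining_path(compensated)
```

After compensation, the first waypoint lies behind the robot and is dropped, so the similarity is computed only against the part of the VLM's choice that can still be followed.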

Technical innovations

A training-free interface that uses off-the-shelf VLMs zero-shot to select among planner-generated candidate trajectories via visual overlay prompts.
A latency-resilient trajectory-level fusion layer that geometrically matches stale VLM-selected trajectories to live planner candidates, combining their scores via exponential temporal decay to enable continuous control.
Two fusion strategies: Score Fusion (adding weighted similarity to planner scores) and Probability Fusion (mixing normalized distributions) to balance influence and ensure robustness under varying latency.
A streaming query protocol to keep multiple asynchronous VLM requests in flight, avoiding blocking in the robot’s fast control loop.
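
The streaming idea in the last item can be sketched with `asyncio`. This is an illustrative design, not the authors' protocol: the class name, the `max_in_flight` budget, and the rule of keeping only the answer to the most recent query are assumptions, and the VLM is replaced by a fake coroutine.

```python
import asyncio

class StreamingSelector:
    """Keeps several VLM queries in flight and retains the freshest completed answer."""

    def __init__(self, query_fn, max_in_flight=3):
        self.query_fn = query_fn        # async callable: snapshot -> candidate index
        self.max_in_flight = max_in_flight
        self.in_flight = set()
        self.latest = None              # (query_time, index) of the freshest answer

    def maybe_submit(self, snapshot, now):
        """Called from the control loop inside a running event loop; never blocks."""
        if len(self.in_flight) >= self.max_in_flight:
            return False
        task = asyncio.create_task(self._run(snapshot, now))
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)
        return True

    async def _run(self, snapshot, query_time):
        index = await self.query_fn(snapshot)
        # Replies can arrive out of order; keep only the newest query's answer.
        if self.latest is None or query_time > self.latest[0]:
            self.latest = (query_time, index)

async def demo():
    delays = {"a": 0.03, "b": 0.01}     # "a" is sent first but answers last
    answers = {"a": 4, "b": 9}

    async def fake_vlm(snapshot):       # stand-in for a real VLM call
        await asyncio.sleep(delays[snapshot])
        return answers[snapshot]

    sel = StreamingSelector(fake_vlm, max_in_flight=2)
    sel.maybe_submit("a", now=0.0)
    sel.maybe_submit("b", now=0.2)
    rejected = not sel.maybe_submit("a", now=0.4)  # in-flight budget exhausted
    await asyncio.sleep(0.1)                       # let both replies arrive
    return sel.latest, rejected

result = asyncio.run(demo())
```

The control loop would read `latest` on every tick and pass its age to the fusion layer, so a late reply to an older query never overwrites a fresher selection.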

Datasets

Urban Sidewalk Normal Pool — ~3,000 snapshots — real-world campus and city sidewalks, routine navigation
Urban Sidewalk Hard Pool — ~2,000 snapshots — real-world campus sidewalks, curated for semantic complexity (junctions, pedestrians, terrain boundaries)

Baselines vs proposed

Planner Argmax: Hard scenarios ADE = 1.64 m vs. VLM Selector (Gemini 3 Flash) ADE = 1.16 m
Planner Argmax: Normal scenarios ADE = 0.60 m vs. VLM Selector ADE = 0.70 m (planner better here)
Simulation success rate at 5 s VLM latency: Score Fusion >80% vs. VLM Hold <20%
Real robot takeovers per 100 m: Local-only (planner) = 3.49, VLM Hold = 8.15, VLM Stream = 4.40, Score Fusion (Stream) = 1.31, Prob Fusion (Stream) = 0.87
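
The headline percentages follow directly from the numbers listed above. The short check below recomputes them; only the helper name is introduced here, and all values are the ones quoted in this section.

```python
def relative_reduction(baseline, value):
    """Fractional reduction relative to baseline; negative means an increase."""
    return (baseline - value) / baseline

takeovers_per_100m = {
    "Local-only": 3.49,
    "VLM Hold": 8.15,
    "VLM Stream": 4.40,
    "Score Fusion (Stream)": 1.31,
    "Prob Fusion (Stream)": 0.87,
}
baseline = takeovers_per_100m["Local-only"]
reductions = {
    name: relative_reduction(baseline, value)
    for name, value in takeovers_per_100m.items()
}
hard_ade_reduction = relative_reduction(1.64, 1.16)  # planner argmax vs. VLM selector
```

Probability Fusion comes out at about a 75% reduction and Score Fusion at about 62%, while VLM Hold more than doubles the takeover rate relative to the planner alone; the hard-scenario ADE reduction is about 29%, matching the "~30%" figure.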

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.20458.

Fig 1: Slow Brain, Fast Planner. A fast planner

Fig 2: VLM selection performance.

Figs 3–8 (page 2).

Limitations

The approach is limited by the candidate trajectories generated by the planner; if no good candidate exists, the VLM cannot improve selection.
VLM selection does not consistently outperform the planner in normal, routine scenarios where learned scoring suffices.
Closed-loop simulation uses a delayed oracle VLM that cannot model real-world perception drift or errors.
Latency and network conditions affect VLM query times; extreme delays beyond 5 s are not evaluated in real robot tests.
No adversarial or intentional attack scenarios to stress test robustness are considered.
The system does not adaptively query the VLM only when the planner is uncertain, leading to some unnecessary VLM computation.

Open questions / follow-ons

Can adaptive querying strategies selectively invoke the VLM only during planner uncertainty, improving efficiency?
How can the interaction between VLM and planner be extended beyond trajectory selection to joint planning or trajectory refinement?
Can similar latency-resilient fusion methods be applied to end-to-end vision-language-action models rather than candidate selection?
How does the system perform under more diverse urban environments, weather conditions, or with different robot platforms?

Why it matters for bot defense

This work demonstrates a practical way to enhance learned local navigation planners for real-world robot operation by using off-the-shelf vision-language models despite their inherent latency. For bot-defense and CAPTCHA practitioners, the paper's approach to managing a high-quality but slow semantic reasoning oracle asynchronously, without blocking or retraining, offers useful insight into integrating large foundation models into low-latency control systems. The latency-resilient fusion technique could be applied by analogy to security mechanisms that combine fast heuristic scoring with slower, more expensive semantic or contextual validators. It also illustrates how pairing multi-hypothesis generation with a slow semantic assessor can make decisions more reliable. The method is specific to physical robot navigation, however, and would need careful adaptation before being applied to CAPTCHAs or web bot detection, which involve different modalities and threat dynamics.

Cite

bibtex

@article{arxiv2606_20458,
  title={Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation},
  author={Zhenghao ``Mark'' Peng and Honglin He and Quanyi Li and Yukai Ma and Bolei Zhou},
  journal={arXiv preprint arXiv:2606.20458},
  year={2026},
  url={https://arxiv.org/abs/2606.20458}
}
