StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

1Shanghai AI Laboratory, 2The University of Hong Kong, 3Zhejiang University, 4Shanghai Jiao Tong University,
Project Lead
*Equal Contribution, Corresponding Author

TL;DR

StreamVLN is a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs.

Approach

SlowFast Context Modeling

Overview of the StreamVLN framework.

StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue manner. Built on LLaVA-Video as the foundational Video-LLM, we extend it to interleaved vision, language, and action modeling. To balance effective context modeling over long sequences with efficient computation for real-time interaction, StreamVLN maintains: (1) a fast-streaming dialogue context with a sliding-window KV cache; and (2) a slow-updating memory built via token pruning.
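To make the mechanism concrete, here is a minimal Python sketch of the slow-fast bookkeeping: recent dialogue turns stay at full resolution in a sliding window (the fast context, served by the KV cache), and frames evicted from the window are compressed by token pruning into a persistent memory (the slow context). All names (`SlowFastContext`, `prune_tokens`, the window and budget sizes) are illustrative assumptions, not the actual StreamVLN implementation.

```python
# A minimal sketch of slow-fast context bookkeeping, assuming a window of
# WINDOW_TURNS dialogue turns and a fixed per-frame token budget after pruning.
from collections import deque

WINDOW_TURNS = 8      # assumed sliding-window size (fast context)
MEMORY_BUDGET = 16    # assumed tokens kept per evicted frame (slow memory)

def prune_tokens(frame_tokens, budget):
    """Placeholder pruning: keep the `budget` highest-scoring tokens.
    StreamVLN's actual selection criterion would replace this stub."""
    scored = sorted(frame_tokens, key=lambda t: t["score"], reverse=True)
    return scored[:budget]

class SlowFastContext:
    def __init__(self):
        # Fast context: recent turns kept at full resolution (KV-cached).
        self.window = deque(maxlen=WINDOW_TURNS)
        # Slow memory: pruned tokens of frames that left the window.
        self.memory = []

    def add_turn(self, frame_tokens, action_tokens):
        if len(self.window) == self.window.maxlen:
            # Oldest turn leaves the fast window; compress its visual
            # tokens into the slow memory instead of discarding them.
            old_frame, _ = self.window[0]
            self.memory.extend(prune_tokens(old_frame, MEMORY_BUDGET))
        self.window.append((frame_tokens, action_tokens))

    def model_inputs(self):
        # The LLM attends to slow memory plus the full-resolution window.
        recent = [tok for frame, acts in self.window for tok in frame + acts]
        return self.memory + recent
```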



Data Collection

We incorporate both navigation-specific data samples and general multi-modal data samples in our training data.

For navigation-specific data, we collect 450K video clips from R2R, R2R-EnvDrop, RxR, and ScaleVLN as expert training data, and 240K DAgger data samples as augmentation.
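As a rough illustration of how such DAgger samples can be gathered, the sketch below rolls out the current policy so that its own mistakes appear in the data, while an expert (e.g. a shortest-path follower) supplies the corrective action labels. The `env`, `policy`, and `expert_action` interfaces are hypothetical stand-ins, not the Habitat API.

```python
# A hedged sketch of DAgger-style data collection: the student policy
# drives (exposing its error states), the expert labels every state.
def collect_dagger_episode(env, policy, expert_action, max_steps=200):
    samples = []
    obs = env.reset()
    for _ in range(max_steps):
        a_student = policy.act(obs)          # student acts, so errors occur...
        a_expert = expert_action(env.state)  # ...while the expert supplies the label
        samples.append((obs, a_expert))
        obs, done = env.step(a_student)      # follow the student to visit its mistakes
        if done:
            break
    return samples
```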

For general multi-modal data, we incorporate 248K video-based VQA samples from LLaVA-Video-178K and ScanQA, and 230K interleaved image-text samples from MMC4.

The table below details our data composition.

Our data recipe:

| Data Type              | Source                                   | Samples | Purpose                   |
| ---------------------- | ---------------------------------------- | ------- | ------------------------- |
| Navigation (Oracle)    | R2R, R2R-EnvDrop, RxR, ScaleVLN (subset) | 450K    | General navigation skills |
| Navigation (DAgger)    | -                                        | 240K    | Error correction          |
| Video QA               | LLaVA-Video-178K, ScanQA                 | 248K    | Spatiotemporal reasoning  |
| Interleaved Image-Text | MMC4                                     | 230K    | Multi-turn dialog         |

Evaluation in Simulation

Quantitative Results

Evaluation results on R2R-CE and RxR-CE benchmarks.

We evaluate our method on two public VLN-CE benchmarks collected from Matterport3D scenes in the Habitat simulator: R2R-CE and RxR-CE. StreamVLN achieves state-of-the-art performance among RGB-only methods, both with and without extra navigation datasets, reaching 56.9% SR and 51.9% SPL on R2R-CE (Val-Unseen), and 52.9% SR and 46.0% SPL on RxR-CE (Val-Unseen).
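For reference, SR and SPL follow the standard VLN-CE definitions: SR is the fraction of episodes ending within a success radius of the goal, and SPL weights each success by the ratio of the shortest-path length to the path actually taken. The sketch below assumes a hypothetical `episodes` record format; the success radius and field names are illustrative.

```python
# A minimal sketch of the two reported metrics: Success Rate (SR) and
# Success weighted by Path Length (SPL).
SUCCESS_RADIUS = 3.0  # meters; the usual R2R-CE success threshold

def sr_spl(episodes):
    n = len(episodes)
    # An episode succeeds if the agent stops within the success radius.
    successes = [float(ep["goal_distance"] <= SUCCESS_RADIUS) for ep in episodes]
    sr = sum(successes) / n
    # SPL discounts each success by shortest-path / actual-path length.
    spl = sum(
        s * ep["shortest_path"] / max(ep["agent_path"], ep["shortest_path"])
        for s, ep in zip(successes, episodes)
    ) / n
    return sr, spl
```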


Vision-and-Language Navigation in the Habitat Simulator

Real-World Experiments

Accurate Instruction Following

Living Room

Bedroom


Extreme Long-Horizon VLN Tasks

Office Lobby and Work Zones

Workspace


Generalization across Diverse Scenes

Mall

Outdoor Sidewalk

Outdoor Walkways

Outdoor Lawn and Sidewalk


Visual Question Answering and Navigation


Q: "Describe the picture on the right.

A: "Picture of the Mona Lisa."

Distinguishing between the Mona Lisa and Einstein