StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

1Shanghai AI Laboratory, 2The University of Hong Kong, 3Zhejiang University, 4Shanghai Jiao Tong University
Project Lead, *Equal Contribution, Corresponding Author

TL;DR

StreamVLN is a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs.

Approach

Context Modeling


StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue manner. Built on LLaVA-Video as the base Video-LLM, StreamVLN extends it to interleaved vision, language, and action modeling. To achieve both effective context modeling over long sequences and efficient computation for real-time interaction, StreamVLN combines: (1) a fast-streaming dialogue context maintained with a sliding-window KV cache; and (2) a slow-updating memory built via token pruning.
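To make the two context paths concrete, below is a minimal, hedged sketch of how a sliding-window fast context and a token-pruned slow memory could be combined. The class name `SlowFastContext`, the per-turn top-k pruning rule, and the importance scores are illustrative assumptions; the actual StreamVLN implementation operates on the Video-LLM's KV cache rather than on raw token embeddings as shown here.

```python
# Minimal sketch of the slow-fast context idea; NOT the released implementation.
# Assumption: each dialogue turn provides token embeddings plus importance scores
# (e.g. attention mass) that can be used for pruning.
from collections import deque

import torch


class SlowFastContext:
    """Fast path: a sliding window of recent turns kept verbatim.
    Slow path: a compact memory of pruned tokens from evicted turns."""

    def __init__(self, window_size: int = 8, memory_budget: int = 256):
        self.window_size = window_size      # number of recent turns kept in full
        self.memory_budget = memory_budget  # max tokens kept in long-term memory
        self.window = deque()               # fast context: (tokens, scores) per turn
        self.memory = []                    # slow memory: pruned tokens of older turns

    def add_turn(self, tokens: torch.Tensor, scores: torch.Tensor) -> None:
        """tokens: (n, d) token embeddings for the new turn; scores: (n,) importance."""
        self.window.append((tokens, scores))
        if len(self.window) > self.window_size:
            old_tokens, old_scores = self.window.popleft()
            # Slow update: keep only the most informative tokens of the evicted turn.
            keep = old_scores.topk(min(32, old_tokens.size(0))).indices
            self.memory.append(old_tokens[keep])
            # Enforce the overall memory budget by dropping the oldest entries.
            while sum(m.size(0) for m in self.memory) > self.memory_budget:
                self.memory.pop(0)

    def context_tokens(self) -> torch.Tensor:
        """Concatenate slow memory and the fast window for the next forward pass."""
        parts = self.memory + [tokens for tokens, _ in self.window]
        return torch.cat(parts, dim=0) if parts else torch.empty(0)
```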



Data Collection

We incorporate both navigation-specific samples and general multi-modal samples in our training data.

For navigation-specific data, we collect 450K video clips from R2R, R2R-EnvDrop, RxR and ScaleVLN as expert training data, and 240K DAgger data samples as augmentation.

For general multi-modal data, we incorporate 248K video-based VQA samples from LLaVA-Video-178K and ScanQA, and 230K interleaved image-text samples from MMC4.

The table below shows the details of our data composition.

Data Type              | Source                                   | Samples | Purpose
-----------------------|------------------------------------------|---------|--------------------------
Navigation (Expert)    | R2R, R2R-EnvDrop, RxR, ScaleVLN (subset) | 450K    | General navigation skills
Navigation (DAgger)    | -                                        | 240K    | Error correction
Video QA               | LLaVA-Video-178K, ScanQA                 | 248K    | Spatiotemporal reasoning
Interleaved Image-Text | MMC4                                     | 230K    | Multi-turn dialog
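As a rough illustration, this mixture could be expressed as a data-loading config with per-group sampling weights proportional to sample counts. The dictionary layout and the proportional-weighting rule below are assumptions for illustration, not the project's actual training configuration.

```python
# Hypothetical data-mixture config mirroring the table above; the structure and the
# proportional sampling rule are illustrative, not StreamVLN's actual training setup.
DATA_MIXTURE = {
    "navigation_expert":      {"sources": ["R2R", "R2R-EnvDrop", "RxR", "ScaleVLN"], "samples": 450_000},
    "navigation_dagger":      {"sources": ["DAgger"],                                "samples": 240_000},
    "video_qa":               {"sources": ["LLaVA-Video-178K", "ScanQA"],            "samples": 248_000},
    "interleaved_image_text": {"sources": ["MMC4"],                                  "samples": 230_000},
}


def sampling_weights(mixture):
    """Sample each group proportionally to its size (assumed strategy)."""
    total = sum(group["samples"] for group in mixture.values())
    return {name: group["samples"] / total for name, group in mixture.items()}


# e.g. sampling_weights(DATA_MIXTURE) -> {'navigation_expert': ~0.39, ...}
```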

Evaluation in Simulation

Quantitative Results


We evaluate our method on two public VLN-CE benchmarks collected from Matterport3D scenes in the Habitat simulator: R2R-CE and RxR-CE. StreamVLN achieves state-of-the-art performance among RGB-only methods, both with and without extra navigation datasets, reaching 56.9% SR and 51.9% SPL on R2R-CE (Val-Unseen), and 52.9% SR and 46.0% SPL on RxR-CE (Val-Unseen).
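For reference, SR (Success Rate) is the fraction of episodes in which the agent stops within a fixed distance of the goal, and SPL (Success weighted by Path Length, Anderson et al. 2018) discounts each success by the ratio of the shortest-path length to the length of the path the agent actually took. A minimal sketch of SPL is given below; the episode field names are illustrative, not taken from the StreamVLN evaluation code.

```python
# Success weighted by Path Length (SPL), the standard VLN-CE metric reported above.
# Episode dicts are assumed to carry 'success' (bool), 'shortest_path' (meters),
# and 'agent_path' (meters); these field names are illustrative.
def spl(episodes):
    episodes = list(episodes)
    total = 0.0
    for ep in episodes:
        if ep["success"]:
            l, p = ep["shortest_path"], ep["agent_path"]
            total += l / max(p, l)  # penalize detours relative to the shortest path
    return total / len(episodes) if episodes else 0.0
```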


Vision-Language Navigation in Habitat Simulator

Real-World Experiments

Accurate Instruction Following

Living Room

Bedroom


Extreme Long-Horizon VLN Tasks

Office Lobby and Work Zones

Workspace


Generalization across Diverse Scenes

Mall

Outdoor Sidewalk

Outdoor Walkways

Outdoor Lawn and Sidewalk


Visual Question Answering and Navigation


Distinguish between Mona Lisa and Einstein