StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

1Shanghai AI Laboratory, 2The University of Hong Kong, 3Zhejiang University, 4Shanghai Jiao Tong University,
Project Lead
*Equal Contribution, Corresponding Author

TL;DR

StreamVLN is a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs.

Approach

SlowFast Context Modeling

Overview of the StreamVLN framework.

StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue manner. Built on LLaVA-Video as the foundational Video-LLM, we extend it to interleaved vision, language, and action modeling. To balance effective context modeling over long sequences with efficient computation for real-time interaction, StreamVLN maintains: (1) a fast-streaming dialogue context with a sliding-window KV cache; and (2) a slow-updating memory built via token pruning.
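To make the mechanism concrete, here is a minimal Python sketch of the slow-fast bookkeeping: recent dialogue turns stay at full resolution in a sliding window (the fast context, served by the KV cache), and frames evicted from the window are compressed by token pruning into a persistent memory (the slow context). All names (`SlowFastContext`, `prune_tokens`, the window and budget sizes) are illustrative assumptions, not the actual StreamVLN implementation.

```python
# A minimal sketch of slow-fast context bookkeeping, assuming a window of
# WINDOW_TURNS dialogue turns and a fixed per-frame token budget after pruning.
from collections import deque

WINDOW_TURNS = 8      # assumed sliding-window size (fast context)
MEMORY_BUDGET = 16    # assumed tokens kept per evicted frame (slow memory)

def prune_tokens(frame_tokens, budget):
    """Placeholder pruning: keep the `budget` highest-scoring tokens.
    StreamVLN's actual selection criterion would replace this stub."""
    scored = sorted(frame_tokens, key=lambda t: t["score"], reverse=True)
    return scored[:budget]

class SlowFastContext:
    def __init__(self):
        # Fast context: recent turns kept at full resolution (KV-cached).
        self.window = deque(maxlen=WINDOW_TURNS)
        # Slow memory: pruned tokens of frames that left the window.
        self.memory = []

    def add_turn(self, frame_tokens, action_tokens):
        if len(self.window) == self.window.maxlen:
            # Oldest turn leaves the fast window; compress its visual
            # tokens into the slow memory instead of discarding them.
            old_frame, _ = self.window[0]
            self.memory.extend(prune_tokens(old_frame, MEMORY_BUDGET))
        self.window.append((frame_tokens, action_tokens))

    def model_inputs(self):
        # The LLM attends to slow memory plus the full-resolution window.
        recent = [tok for frame, acts in self.window for tok in frame + acts]
        return self.memory + recent
```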



Data Collection

We incorporate both navigation-specific data samples and general multi-modal data samples in our training data.

For navigation-specific data, we collect 450K video clips from R2R, R2R-EnvDrop, RxR, and ScaleVLN as expert training data, and 240K DAgger data samples as augmentation.
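As a rough illustration of how such DAgger samples can be gathered, the sketch below rolls out the current policy so that its own mistakes appear in the data, while an expert (e.g. a shortest-path follower) supplies the corrective action labels. The `env`, `policy`, and `expert_action` interfaces are hypothetical stand-ins, not the Habitat API.

```python
# A hedged sketch of DAgger-style data collection: the student policy
# drives (exposing its error states), the expert labels every state.
def collect_dagger_episode(env, policy, expert_action, max_steps=200):
    samples = []
    obs = env.reset()
    for _ in range(max_steps):
        a_student = policy.act(obs)          # student acts, so errors occur...
        a_expert = expert_action(env.state)  # ...while the expert supplies the label
        samples.append((obs, a_expert))
        obs, done = env.step(a_student)      # follow the student to visit its mistakes
        if done:
            break
    return samples
```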

For general multi-modal data, we incorporate 248K video-based VQA samples from LLaVA-Video-178K and ScanQA, and 230K interleaved image-text samples from MMC4.

The table below details our data composition.

Our data recipe:

| Data Type              | Source                                   | Samples | Purpose                   |
| ---------------------- | ---------------------------------------- | ------- | ------------------------- |
| Navigation (Oracle)    | R2R, R2R-EnvDrop, RxR, ScaleVLN (subset) | 450K    | General navigation skills |
| Navigation (DAgger)    | -                                        | 240K    | Error correction          |
| Video QA               | LLaVA-Video-178K, ScanQA                 | 248K    | Spatiotemporal reasoning  |
| Interleaved Image-Text | MMC4                                     | 230K    | Multi-turn dialog         |

Evaluation in Simulation

Quantitative Results

Evaluation results on R2R-CE and RxR-CE benchmarks.

We evaluate our method on two public VLN-CE benchmarks collected from Matterport3D scenes in the Habitat simulator: R2R-CE and RxR-CE. StreamVLN achieves state-of-the-art performance among RGB-only methods, both with and without extra navigation datasets, reaching 56.9% SR and 51.9% SPL on R2R-CE (Val-Unseen), and 52.9% SR and 46.0% SPL on RxR-CE (Val-Unseen).
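For reference, SR and SPL follow the standard VLN-CE definitions: SR is the fraction of episodes ending within a success radius of the goal, and SPL weights each success by the ratio of the shortest-path length to the path actually taken. The sketch below assumes a hypothetical `episodes` record format; the success radius and field names are illustrative.

```python
# A minimal sketch of the two reported metrics: Success Rate (SR) and
# Success weighted by Path Length (SPL).
SUCCESS_RADIUS = 3.0  # meters; the usual R2R-CE success threshold

def sr_spl(episodes):
    n = len(episodes)
    # An episode succeeds if the agent stops within the success radius.
    successes = [float(ep["goal_distance"] <= SUCCESS_RADIUS) for ep in episodes]
    sr = sum(successes) / n
    # SPL discounts each success by shortest-path / actual-path length.
    spl = sum(
        s * ep["shortest_path"] / max(ep["agent_path"], ep["shortest_path"])
        for s, ep in zip(successes, episodes)
    ) / n
    return sr, spl
```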


Vision-and-Language Navigation in the Habitat Simulator

Real-World Experiments

Accurate Instruction Following

Living Room

Bedroom


Extreme Long-Horizon VLN Tasks

Office Lobby and Work Zones

Workspace


Generalization across Diverse Scenes

Mall

Outdoor Sidewalk

Outdoor Walkways

Outdoor Lawn and Sidewalk


Visual Question Answering and Navigation


Q: "Describe the picture on the right.

A: "Picture of the Mona Lisa."

Distinguishing between the Mona Lisa and Einstein