StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue manner. Built on LLaVA-Video as the foundational Video-LLM, it is extended to interleaved vision, language, and action modeling. To balance effective context modeling over long sequences with efficient computation for real-time interaction, StreamVLN uses: (1) a fast-streaming dialogue context with a sliding-window KV cache; and (2) a slow-updating memory built via token pruning.
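To make the fast/slow split concrete, here is a minimal Python sketch of the idea: recent turns stay in a sliding window (standing in for the sliding-window KV cache), and tokens from evicted turns are pruned into a compact memory. The class name, window size, and norm-based pruning score are illustrative assumptions, not StreamVLN's actual implementation.

```python
import torch
from collections import deque


class StreamingContext:
    """Toy fast/slow context: a sliding window of recent turns plus a pruned memory."""

    def __init__(self, window_size=8, keep_tokens_per_turn=16):
        self.window_size = window_size        # number of recent turns kept in full
        self.keep_k = keep_tokens_per_turn    # tokens retained per evicted turn
        self.window = deque()                 # fast-streaming dialogue context
        self.memory = []                      # slow-updating memory of pruned tokens

    def add_turn(self, turn_tokens):
        """turn_tokens: [num_tokens, dim] visual tokens for the newest turn."""
        self.window.append(turn_tokens)
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()   # turn leaves the sliding-window cache
            self.memory.append(self._prune(evicted))

    def _prune(self, tokens):
        # Placeholder importance score (token L2 norm); StreamVLN's actual pruning
        # criterion differs -- swap the real one in here.
        scores = tokens.norm(dim=-1)
        k = min(self.keep_k, tokens.shape[0])
        idx = scores.topk(k).indices.sort().values  # keep original token order
        return tokens[idx]

    def context_tokens(self):
        """Concatenate slow memory and the fast window for the next decoding step."""
        parts = self.memory + list(self.window)
        return torch.cat(parts, dim=0) if parts else torch.empty(0, 0)
```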
Our training data incorporates both navigation-specific samples and general multi-modal samples.
For navigation-specific data, we collect 450K video clips from R2R, R2R-EnvDrop, RxR, and ScaleVLN as expert training data, plus 240K DAgger samples as augmentation.
For general multi-modal data, we incorporate 248K video-based VQA samples from LLaVA-Video-178K and ScanQA, and 230K interleaved image-text samples from MMC4. The table below details our data composition.
Data Type | Source | Samples | Purpose |
---|---|---|---|
Navigation (Expert) | R2R, R2R-EnvDrop, RxR, ScaleVLN (subset) | 450K | General navigation skills |
Navigation (DAgger) | - | 240K | Error correction |
Video QA | LLaVA-Video-178K, ScanQA | 248K | Spatiotemporal reasoning |
Interleaved Image-Text | MMC4 | 230K | Multi-turn dialog |
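For reference, this mixture can be written as a simple sampling spec. The dataset names and sample counts follow the table above, but the dictionary layout, field names, and proportional-weighting helper below are assumptions made for illustration.

```python
# Illustrative data-mixture spec; not StreamVLN's actual training config.
DATA_MIXTURE = [
    {"name": "nav_expert",  "sources": ["R2R", "R2R-EnvDrop", "RxR", "ScaleVLN (subset)"], "samples": 450_000},
    {"name": "nav_dagger",  "sources": ["DAgger rollouts"],                                "samples": 240_000},
    {"name": "video_qa",    "sources": ["LLaVA-Video-178K", "ScanQA"],                     "samples": 248_000},
    {"name": "interleaved", "sources": ["MMC4"],                                           "samples": 230_000},
]


def sampling_weights(mixture):
    """Probability of drawing each subset if batches are mixed proportionally to size."""
    total = sum(d["samples"] for d in mixture)
    return {d["name"]: d["samples"] / total for d in mixture}
```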
We evaluate our method on two public VLN-CE benchmarks collected from Matterport3D scenes using the Habitat simulator: R2R-CE and RxR-CE. StreamVLN achieves state-of-the-art performance among RGB-only methods, both with and without extra navigation datasets, reaching 56.9% SR and 51.9% SPL on R2R-CE (Val-Unseen), and 52.9% SR and 46.0% SPL on RxR-CE (Val-Unseen).
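SR and SPL follow the standard VLN definitions: SR is the fraction of episodes that end within a success radius of the goal, and SPL additionally weights each success by the ratio of the shortest-path length to the path actually taken. The sketch below assumes the usual 3 m success radius used in R2R-CE; the function name and input format are illustrative.

```python
# Minimal sketch of the standard SR / SPL metrics reported on VLN-CE benchmarks.
def success_rate_and_spl(episodes, success_radius=3.0):
    """episodes: iterable of (dist_to_goal, geodesic_dist, path_length) in meters."""
    sr_sum, spl_sum, n = 0.0, 0.0, 0
    for dist_to_goal, geodesic_dist, path_length in episodes:
        success = 1.0 if dist_to_goal <= success_radius else 0.0
        sr_sum += success
        # Success weighted by (shortest path) / (actual path), capped at 1.
        spl_sum += success * geodesic_dist / max(path_length, geodesic_dist, 1e-8)
        n += 1
    return sr_sum / n, spl_sum / n
```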
Demo scenarios: Living Room; Bedroom; Office Lobby and Work Zones; Workspace; Mall; Outdoor Sidewalk; Outdoor Walkways; Outdoor Lawn and Sidewalk; Distinguish between Mona Lisa and Einstein.