Think-as-You-See: Streaming Chain-of-Thought Reasoning for LVLMs

Jialiang Zhang1,2*, Junlong Tong1,3*, Junyan Lin1,4*, Hao Wu1,
Yirong Sun1, Yunpu Ma5, Xiaoyu Shen1,6†
1Institute of Digital Twin, Eastern Institute of Technology, Ningbo
2Ocean University of China
3Shanghai Jiao Tong University
4The Hong Kong Polytechnic University
5LMU Munich
6Ningbo Key Laboratory of Spatial Intelligence
*Equal Contribution    †Corresponding Author
[Figure: Video Stream → TaYS → Reasoning Output]

TaYS enables real-time streaming Chain-of-Thought reasoning for large vision-language models.

Abstract

Large Vision-Language Models (LVLMs) have made significant strides in video reasoning, yet most existing systems rely on a batch inference paradigm that processes the entire video before reasoning begins. This "wait-and-see" approach neglects the inherently streaming nature of real-world video, introducing substantial latency and exacerbating temporal drift. In this paper, we propose Think-as-You-See (TaYS), a framework that shifts LVLMs toward a streaming reasoning paradigm, enabling continuous, incremental inference synchronized with the visual stream. We introduce three key innovations: (1) a streaming attention mask to enforce temporal causality; (2) a decoupled positional encoding strategy to resolve cross-modal index conflicts; and (3) a parallel dual KV-cache mechanism that decouples visual encoding from reasoning generation, enabling concurrent frame ingestion and token decoding. Empirical evaluations on the VideoEspresso benchmark using the Qwen2.5-VL family demonstrate that TaYS improves reasoning accuracy by 2.9%, reduces Time-to-First-Token (TTFT) from 10.6s to near-zero, and cuts reasoning-event deviation by 55%. Our results suggest that aligning LVLM reasoning with the streaming nature of video is a vital step toward responsive, real-time multimodal intelligence.

Method

Overview of the TaYS framework. Parallel video reasoning KV caches enable concurrent visual encoding and reasoning generation.

Think-as-You-See (TaYS) is a supervised fine-tuning framework that integrates streaming video CoT generation with streaming training and inference mechanisms. To overcome the intrinsic serialization bottleneck of naive interleaving strategies, it adopts a parallel streaming paradigm built on three components:

  • Streaming Attention Mask: We design a streaming-aware attention mask to enforce temporal causality. A reasoning step at time t strictly attends to visual evidence accumulated up to t, remaining agnostic to future frames.
  • Decoupled Positional Encoding: We propose a modality-decoupled positional indexing scheme to resolve index conflicts arising from the concurrent growth of visual and reasoning streams. This assigns independent positional axes for vision and reasoning tokens.
  • Parallel Dual KV-Cache: We maintain two modality-specific caches: a read-heavy video cache and a dynamic text cache. This decouples visual encoding from reasoning generation, establishing a recursive merge–generate–split loop where perception and reasoning evolve simultaneously.
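The three components above can be illustrated with a toy sketch. The Python below is not the authors' implementation: the `Token` type, the function names, and the rule that visual tokens never attend to reasoning tokens are assumptions made for illustration, and the loop runs sequentially rather than concurrently. It shows vision and reasoning tokens numbered on independent positional axes, a mask rule that hides future frames from a reasoning step at time t, and a merge-generate-split loop over two separate caches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    modality: str  # "vision" or "reason"
    t: int         # frame index at which the token enters the stream
    pos: int       # index on the token's own positional axis


def can_attend(query: Token, key: Token) -> bool:
    """Streaming mask rule for one (query, key) pair."""
    if key.modality == "vision":
        if query.modality == "vision":
            return key.pos <= query.pos  # ordinary causal order among frames
        return key.t <= query.t          # reasoning sees frames up to time t only
    if query.modality == "vision":
        return False  # assumption: visual encoding never reads reasoning tokens
    return key.pos <= query.pos          # causal order among reasoning tokens


def stream(tokens_per_frame: int, reasoning_per_frame: list[int]):
    """Toy merge-generate-split loop over two modality-specific caches."""
    video_cache: list[Token] = []
    text_cache: list[Token] = []
    context_sizes: list[int] = []
    for t, n_reason in enumerate(reasoning_per_frame):
        # Perception: a new frame only ever extends the video cache.
        for _ in range(tokens_per_frame):
            video_cache.append(Token("vision", t, len(video_cache)))
        for _ in range(n_reason):
            # Merge: the decoder reads both caches as one context.
            context = video_cache + text_cache
            # Generate: stand-in for one decoding step of a real model.
            new_token = Token("reason", t, len(text_cache))
            context_sizes.append(sum(can_attend(new_token, k) for k in context))
            # Split: the new token is stored in the text cache only.
            text_cache.append(new_token)
    return video_cache, text_cache, context_sizes


def streaming_mask(tokens: list[Token]) -> list[list[bool]]:
    """Full boolean mask for a sequence presented all at once."""
    return [[can_attend(q, k) for k in tokens] for q in tokens]


# Two frames of two visual tokens each; one reasoning token after the
# first frame and two after the second.
video, text, sizes = stream(tokens_per_frame=2, reasoning_per_frame=[1, 2])
mask = streaming_mask(video + text)
```

In this toy, both position counters start at zero and grow independently, so appending a frame never shifts the index of an existing reasoning token. The first reasoning token's mask row exposes only the first frame's two visual tokens, while the last reasoning token sees all four visual tokens and both earlier reasoning tokens.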

Results

Comparison with baselines on the extended VideoEspresso benchmark.

We instantiate TaYS on the Qwen2.5-VL family and evaluate its efficacy across tasks requiring complex event dynamics and causal reasoning. On the extended VideoEspresso benchmark:

  • Reasoning Quality: TaYS improves reasoning accuracy by 2.9% over batch CoT baselines and achieves a 43.7% win rate in human-aligned GPT-5 evaluations.
  • Latency: TaYS reduces the Time-to-First-Token (TTFT) from 10.6s in batch mode to nearly zero, maintaining a stable end-to-end delay of ~12s across all frame rates.
  • Temporal Grounding: TaYS reduces reasoning-event deviation from 1.52s to 0.69s (a 55% reduction), keeping reasoning tightly synchronized with visual evidence.
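As a quick check of the temporal-grounding figure, the relative reduction can be recomputed from the two reported deviations. The helper name `relative_reduction` is ours, chosen for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Reported reasoning-event deviations, in seconds.
reduction = relative_reduction(before=1.52, after=0.69)
print(f"{reduction:.1%}")
```

The result is roughly 54.6%, which rounds to the reported 55%.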

BibTeX

@article{zhang2026think,
  title={Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models},
  author={Zhang, Jialiang and Tong, Junlong and Lin, Junyan and Wu, Hao and Sun, Yirong and Ma, Yunpu and Shen, Xiaoyu},
  journal={arXiv preprint arXiv:2603.02872},
  year={2026}
}