Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

Zhang, Jialiang; Tong, Junlong; Lin, Junyan; Wu, Hao; Sun, Yirong; Ma, Yunpu; Shen, Xiaoyu


Jialiang Zhang¹,²*, Junlong Tong¹,³*, Junyan Lin¹,⁴*, Hao Wu¹,
Yirong Sun¹, Yunpu Ma⁵, Xiaoyu Shen¹,⁶†

¹Institute of Digital Twin, Eastern Institute of Technology, Ningbo
²Ocean University of China
³Shanghai Jiao Tong University
⁴The Hong Kong Polytechnic University
⁵LMU Munich
⁶Ningbo Key Laboratory of Spatial Intelligence
*Equal Contribution  †Corresponding Author

Abstract

Large Vision-Language Models (LVLMs) have made significant strides in video reasoning, yet most existing systems rely on a batch inference paradigm that processes the entire video before reasoning begins. This "wait-and-see" approach neglects the inherently streaming nature of real-world video, introducing substantial latency and exacerbating temporal drift. In this paper, we propose Think-as-You-See (TaYS), a framework that shifts LVLMs toward a streaming reasoning paradigm, enabling continuous, incremental inference synchronized with the visual stream. We introduce three key innovations: (1) a streaming attention mask to enforce temporal causality; (2) a decoupled positional encoding strategy to resolve cross-modal index conflicts; and (3) a parallel dual KV-cache mechanism that decouples visual encoding from reasoning generation, enabling concurrent frame ingestion and token decoding. Empirical evaluations on the VideoEspresso benchmark using the Qwen2.5-VL family demonstrate that TaYS improves reasoning accuracy by 2.9%, reduces Time-to-First-Token (TTFT) from 10.6s to near-zero, and cuts reasoning-event deviation by 55%. Our results suggest that aligning LVLM reasoning with the streaming nature of video is a vital step toward responsive, real-time multimodal intelligence.

Highlights

🔥 Streaming CoT Paradigm: We introduce a principled streaming reasoning paradigm for LVLMs, enabling incremental, temporally grounded inference aligned with unfolding visual evidence.

⚡ Near-Zero Latency: TaYS reduces Time-to-First-Token (TTFT) from 10.6s in batch mode to near-zero (~10⁻⁶ s) by decoupling visual encoding from reasoning generation.

🧠 Parallel Dual KV-Cache: Efficiently manages visual and textual states independently via a dual-cache system, enabling concurrent frame ingestion and token decoding.

📈 Superior Performance: Improves reasoning accuracy by 2.9% over batch CoT baselines and cuts reasoning-event deviation by 55% (from 1.52s to 0.69s).

Method

Overview of the TaYS framework. Parallel video reasoning KV caches enable concurrent visual encoding and reasoning generation.

Think-as-You-See (TaYS) is a supervised fine-tuning framework that integrates streaming video CoT generation with streaming training and inference mechanisms. To overcome the intrinsic serialization bottleneck of naive interleaving strategies, it adopts a parallel streaming paradigm built on three components:

Streaming Attention Mask: We design a streaming-aware attention mask to enforce temporal causality. A reasoning step at time t strictly attends to visual evidence accumulated up to t, remaining agnostic to future frames.
Decoupled Positional Encoding: We propose a modality-decoupled positional indexing scheme to resolve index conflicts arising from the concurrent growth of visual and reasoning streams. This assigns independent positional axes for vision and reasoning tokens.
Parallel Dual KV-Cache: We maintain two modality-specific caches: a read-heavy video cache and a dynamic text cache. This decouples visual encoding from reasoning generation, establishing a recursive merge–generate–split loop where perception and reasoning evolve simultaneously.
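
The three components above can be illustrated with a small, dependency-free Python sketch. It is a toy under stated assumptions, not the authors' implementation: the names `streaming_masks`, `decoupled_positions`, `DualCache`, and `run` are hypothetical, timestamps are plain integers, cache entries are strings rather than key/value tensors, and the "generate" step is a placeholder instead of a model call.

```python
def streaming_masks(visual_time, reason_time):
    """Return (to_vision, to_text) boolean masks; True means "may attend".

    visual_time[j]: arrival time of visual token j (hypothetical input).
    reason_time[i]: emission time of reasoning token i (hypothetical input).
    """
    # Reasoning token i sees only visual tokens that arrived at or before its time.
    to_vision = [[vt <= rt for vt in visual_time] for rt in reason_time]
    # Among reasoning tokens, ordinary causal (lower-triangular) attention.
    n = len(reason_time)
    to_text = [[k <= i for k in range(n)] for i in range(n)]
    return to_vision, to_text


def decoupled_positions(num_visual, num_reason):
    """Each modality counts on its own axis, so growth in one stream
    never shifts indices already assigned in the other."""
    return list(range(num_visual)), list(range(num_reason))


class DualCache:
    """Toy stand-in for two modality-specific KV caches."""

    def __init__(self):
        self.video = []  # (time, entry) pairs; written by ingestion, only read when decoding
        self.text = []   # entries written by the decoding side

    def ingest(self, t, entry):
        self.video.append((t, entry))

    def merge(self, t):
        # Merge: a read-only view of visual entries up to time t plus all text entries.
        return [e for (ft, e) in self.video if ft <= t] + list(self.text)

    def split(self, new_entry):
        # Split: the newly generated entry is written back to the text cache only.
        self.text.append(new_entry)


def run(frames, steps_per_frame=1):
    """Interleave frame ingestion with merge, generate, split steps (serially, for clarity)."""
    cache = DualCache()
    for t, frame in enumerate(frames):
        cache.ingest(t, f"v{frame}")
        for _ in range(steps_per_frame):
            context = cache.merge(t)                          # merge
            token = f"r{len(cache.text)}|ctx={len(context)}"  # generate (placeholder)
            cache.split(token)                                # split
    return cache


if __name__ == "__main__":
    to_vision, to_text = streaming_masks([0, 0, 1, 2], [0, 1])
    print(to_vision)
    print(decoupled_positions(4, 2))
    print(run(["a", "b"]).text)
```

The sketch runs ingestion and decoding in a single loop for readability, so it does not model the concurrency itself. It only shows why the split matters: the video cache is never modified by decoding, which is what would let a separate encoder keep appending frames while tokens are being generated.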

Results

Comparison with baselines on the extended VideoEspresso benchmark.

We instantiate TaYS on the Qwen2.5-VL family and evaluate its efficacy across tasks requiring complex event dynamics and causal reasoning. On the extended VideoEspresso benchmark:

Reasoning Quality: TaYS improves reasoning accuracy by 2.9% over batch CoT baselines and achieves a 43.7% win rate in human-aligned GPT-5 evaluations.
Latency: TaYS reduces the Time-to-First-Token (TTFT) from 10.6s in batch mode to nearly zero, maintaining a stable end-to-end delay of ~12s across all frame rates.
Temporal Grounding: TaYS improves temporal grounding by reducing reasoning-event deviation from 1.52s to 0.69s (a 55% reduction), ensuring that reasoning is tightly synchronized with visual evidence.
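
As a quick sanity check on the reported numbers, the 55% figure follows directly from the two deviation values quoted above. The helper name `relative_reduction` is only illustrative.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Reasoning-event deviation in seconds, as reported above.
deviation_cut = relative_reduction(1.52, 0.69)
print(f"{deviation_cut:.1%}")  # prints 54.6%, which rounds to the reported 55%
```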

BibTeX

@article{zhang2026think,
  title={Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models},
  author={Zhang, Jialiang and Tong, Junlong and Lin, Junyan and Wu, Hao and Sun, Yirong and Ma, Yunpu and Shen, Xiaoyu},
  journal={arXiv preprint arXiv:2603.02872},
  year={2026}
}