Vidi2.5: Large Multimodal Models for Video Understanding and Creation

Intelligent Creation, ByteDance Inc.
San Jose / Seattle, US

Spatio-temporal grounding and temporal retrieval accuracy on the proposed benchmarks.

Abstract

Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. The Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 identifies not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot understanding and automatic view switching. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers critical improvements over existing STG datasets in video duration, query format, annotation quality, and evaluation metric. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving more balanced duration and query distributions. Remarkably, Vidi2 substantially outperforms leading proprietary systems, such as Gemini 3 Pro Preview and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving results competitive with popular open-source models of similar scale on Video QA benchmarks. The latest Vidi2.5 offers significantly stronger STG capability and slightly better TR and Video QA performance than Vidi2 through RL training. This update also introduces a Vidi2.5-Think model to handle plot understanding tasks that require complex reasoning. To comprehensively evaluate plot understanding, we propose a new benchmark named VUE-PLOT with two tracks, Character and Reasoning. Notably, Vidi2.5-Think outperforms Gemini 3 Pro Preview on fine-grained character understanding and achieves comparable performance on complex plot reasoning. Furthermore, we demonstrate the effectiveness of Vidi2.5 on a challenging real-world application: video editing planning. The post-trained variant Vidi-Edit can generate structured editing plans specifying narrative structure, audio attributes, and visual editing intent, illustrating its strong potential for complex real-world tasks.

Benchmark

VUE-STG (Spatio-Temporal Grounding)

Duration distribution of videos in the proposed VUE-STG evaluation benchmark.


VUE-TR-V2 (Temporal Retrieval)

Duration distribution of videos in the proposed VUE-TR-V2 evaluation benchmark.

The distribution of query modality and format in the VUE-TR-V2 benchmark.


VUE-PLOT (Plot Understanding)

Duration distribution of videos in the proposed VUE-PLOT evaluation benchmark.

The speech duration distribution (left) and the bounding box area distribution (middle) of the Character track of the VUE-PLOT benchmark, and the task type distribution (right) of the Reasoning track.

Qualitative Results

Spatio-Temporal Grounding

(Input videos and result clips with bounding-box tubes are shown as media on the original page; the recoverable columns are reproduced below.)

| Input Query | Duration |
|---|---|
| The man wearing a brown suit who is playing drums in an indoor setting | 00:04:12 |
| The gorilla which is driving with two men. | 00:16:19 |
| a woman in glasses who is walking on street | 00:04:11 |
| The boy who stands outside a charming house with warm lights, beneath a starry night sky featuring a full moon. | 00:02:00 |
| The glowing blue water beads in which the mango seed is placed, with its germination into a root and shoot visualized through a time-lapse sequence against a dark background | 00:15:50 |
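The result tubes above are rendered as video overlays on the original page, and the page does not specify the underlying data format. As a purely illustrative sketch of what an STG result could look like as data, assuming one box per sampled timestamp and normalized coordinates (all names below are hypothetical, not Vidi's actual output schema):

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    """Axis-aligned bounding box; coordinates assumed normalized to [0, 1]."""
    t: float   # timestamp in seconds within the video
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class STGResult:
    """One grounded instance: a time range plus the box tube inside it.

    Hypothetical structure for illustration only.
    """
    query: str
    start: float                                   # range start, seconds
    end: float                                     # range end, seconds
    tube: list[Box] = field(default_factory=list)  # boxes sampled across [start, end]

# Example shaped like the first row of the table above (box values invented).
result = STGResult(
    query="The man wearing a brown suit who is playing drums in an indoor setting",
    start=12.0,
    end=31.5,
    tube=[Box(t=12.0, x1=0.41, y1=0.22, x2=0.68, y2=0.93)],
)
```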

Temporal Retrieval

(Input videos and result clips are shown as media on the original page; the recoverable columns are reproduced below.)

| Input Query | Duration | Ground Truth | Prediction | IoU |
|---|---|---|---|---|
| basketball statue | 00:00:46 | 00:00:13-00:00:30 | 00:00:13-00:00:31 | 0.94 |
| gymnasium | 00:01:29 | 00:00:00-00:01:28 | 00:00:00-00:01:29 | 0.99 |
| people assembling sculptures on beach | 00:01:30 | 00:00:24-00:00:33 | 00:00:24-00:00:33 | 1.00 |
| Euripides, has most surviving work like 'Medea' and 'The Bacchae', debut in 455 BC. He is a corner stone of greek education in the Hellenistic period. | 00:05:10 | 00:04:22-00:04:41 | 00:04:20-00:04:35 | 0.62 |
| Jennifer Nagel self-introduction | 00:10:02 | 00:00:07-00:00:16 | 00:00:07-00:00:17 | 0.90 |
| divine wind | 00:20:18 | 00:05:06-00:05:08 | 00:05:05-00:05:08 | 0.67 |
| FTC resources | 00:59:35 | 00:43:41-00:44:17 | 00:43:31-00:44:15 | 0.74 |
| North Devon Marine Pioneer | 01:24:45 | 00:00:00-00:02:31, 00:03:58-00:06:20, 00:13:32-00:16:01, 00:16:57-00:17:00, 00:47:21-01:03:10 | 00:00:00-00:02:33, 00:46:56-01:03:08 | 0.77 |
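The IoU column is consistent with intersection-over-union computed on unions of time intervals, measured in seconds. A minimal sketch of that reading (our interpretation, not the benchmark's official metric code); it reproduces the 0.77 of the multi-segment "North Devon Marine Pioneer" row:

```python
def merge(intervals):
    """Merge overlapping [start, end] intervals (seconds)."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged

def temporal_iou(gt, pred):
    """IoU between two sets of time ranges, each a list of (start, end) in seconds."""
    gt, pred = merge(gt), merge(pred)
    inter = sum(
        max(0, min(ge, pe) - max(gs, ps))
        for gs, ge in gt
        for ps, pe in pred
    )
    union = sum(e - s for s, e in gt) + sum(e - s for s, e in pred) - inter
    return inter / union if union else 0.0

def hms(t):
    """Convert 'HH:MM:SS' to seconds."""
    h, m, s = map(int, t.split(":"))
    return h * 3600 + m * 60 + s

# Last row of the table above: five ground-truth segments, two predicted segments.
gt = [(hms("00:00:00"), hms("00:02:31")), (hms("00:03:58"), hms("00:06:20")),
      (hms("00:13:32"), hms("00:16:01")), (hms("00:16:57"), hms("00:17:00")),
      (hms("00:47:21"), hms("01:03:10"))]
pred = [(hms("00:00:00"), hms("00:02:33")), (hms("00:46:56"), hms("01:03:08"))]
print(round(temporal_iou(gt, pred), 2))  # 0.77
```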

Applications

Video Editing Planning

Editing plan generation is an application designed to support practical video creation workflows. It takes raw video assets and optional user intents as input and produces a structured editing plan that specifies the narrative structure, voiceover content, audio attributes, and intended visual edits. The plan serves as a high-level, machine-readable specification for downstream video editing and rendering systems.
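The plan schema itself is not shown on this page. As a purely illustrative sketch, a machine-readable plan along the lines described above might look like the following (every field name here is hypothetical):

```python
import json

# Hypothetical editing plan covering the four aspects named above: narrative
# structure, voiceover content, audio attributes, and visual edits.
# Field names are invented for illustration, not Vidi-Edit's actual schema.
plan = {
    "narrative_structure": [
        {"section": "hook", "source_clip": "asset_03.mp4",
         "in": "00:00:12", "out": "00:00:18"},
        {"section": "body", "source_clip": "asset_01.mp4",
         "in": "00:01:05", "out": "00:01:32"},
    ],
    "voiceover": [
        {"section": "hook", "text": "What happens when the tide goes out?"},
    ],
    "audio": {
        "music_mood": "uplifting",
        "music_volume_db": -18,
        "duck_under_voiceover": True,
    },
    "visual_edits": [
        {"section": "hook", "intent": "slow zoom on subject",
         "transition": "hard cut"},
    ],
}

# Serialized form consumed by downstream editing and rendering systems.
print(json.dumps(plan, indent=2))
```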


BibTeX


@article{Vidi2026vidi2.5,
  title={Vidi2.5: Large Multimodal Models for Video Understanding and Creation},
  author={Vidi Team and Chia-Wen Kuo and Chuang Huang and Dawei Du and
          Fan Chen and Fanding Lei and Feng Gao and Guang Chen and
          Haoji Zhang and Haojun Zhao and Jin Liu and Jingjing Zhuge and
          Lili Fang and Lingxi Zhang and Longyin Wen and Lu Guo and
          Lu Xu and Lusha Li and Qihang Fan and Rachel Deng and
          Shaobo Fang and Shu Zhang and Sijie Zhu and Stuart Siew and
          Weiyan Tao and Wen Zhong and Xiaohui Shen and Xin Gu and
          Ye Yuan and Yicheng He and Yiming Cui and Zhenfang Chen and
          Zhihua Wu and Zuhua Lin},
  journal={arXiv preprint arXiv:2511.19529},
  year={2025}
}

@article{Vidi2025vidi,
  title={Vidi: Large Multimodal Models for Video Understanding and Editing},
  author={Vidi Team and Celong Liu and Chia-Wen Kuo and Dawei Du and
          Fan Chen and Guang Chen and Jiamin Yuan and Lingxi Zhang and
          Lu Guo and Lusha Li and Longyin Wen and Qingyu Chen and
          Rachel Deng and Sijie Zhu and Stuart Siew and Tong Jin and
          Wei Lu and Wen Zhong and Xiaohui Shen and Xin Gu and Xing Mei and
          Xueqiong Qu and Zhenfang Chen},
  journal={arXiv preprint arXiv:2504.15681},
  year={2025}
}