Vidi2: Large Multimodal Models for Video Understanding and Creation

Intelligent Creation
ByteDance Inc.

San Jose/Seattle, US

Spatio-temporal grounding and temporal retrieval accuracy on the proposed benchmarks.

Abstract

Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans roughly 10 seconds to 30 minutes, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme for multi-segment spatio-temporal evaluation. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, with a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models of similar scale on video QA benchmarks.

Benchmark

VUE-STG (Spatio-Temporal Grounding)

Duration distribution of videos in the proposed VUE-STG evaluation benchmark.


VUE-TR-V2 (Temporal Retrieval)

Duration distribution of videos in the proposed VUE-TR-V2 evaluation benchmark.

The distribution of query modality and format in the VUE-TR-V2 benchmark.

Qualitative Results

Spatio-Temporal Grounding

Input Video | Input Query | Video Duration | Result Clips with Tubes
(video) | The man wearing a brown suit who is playing drums in an indoor setting | 00:04:12 | (video)
(video) | The gorilla which is driving with two men. | 00:16:19 | (video)
(video) | a woman in glasses who is walking on street | 00:04:11 | (video)
(video) | The boy who stands outside a charming house with warm lights, beneath a starry night sky featuring a full moon. | 00:02:00 | (video)
(video) | The glowing blue water beads in which the mango seed is placed, with its germination into a root and shoot visualized through a time-lapse sequence against a dark background | 00:15:50 | (video)
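The vIoU scores on VUE-STG compare a predicted object tube (a time range plus per-frame boxes) against the manually annotated one. The refined multi-segment vIoU/tIoU/vIoU-Intersection scheme is defined in the paper; the sketch below is only a rough illustration of the commonly used single-tube vIoU, which averages per-frame box IoU over the union of annotated and predicted frames (the box format and function names here are assumptions, not the paper's API).

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def viou(gt_boxes, pred_boxes):
    """Single-tube vIoU: gt_boxes / pred_boxes map frame index -> (x1, y1, x2, y2).

    Per-frame box IoU is summed over frames where both a ground-truth and a
    predicted box exist, then normalized by the union of annotated and
    predicted frames.
    """
    union_frames = set(gt_boxes) | set(pred_boxes)
    inter_frames = set(gt_boxes) & set(pred_boxes)
    if not union_frames:
        return 0.0
    return sum(box_iou(gt_boxes[t], pred_boxes[t]) for t in inter_frames) / len(union_frames)


# Toy example with hypothetical boxes on three annotated and three predicted frames:
gt = {10: (100, 100, 200, 200), 11: (105, 100, 205, 200), 12: (110, 100, 210, 200)}
pred = {11: (105, 100, 205, 200), 12: (100, 100, 200, 200), 13: (100, 100, 200, 200)}
print(round(viou(gt, pred), 3))  # 0.455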

Temporal Retrieval

Input Video | Input Query | Video Duration | Ground Truth | Prediction | IoU | Result Clips
(video) | basketball statue | 00:00:46 | 00:00:13-00:00:30 | 00:00:13-00:00:31 | 0.94 | (video)
(video) | gymnasium | 00:01:29 | 00:00:00-00:01:28 | 00:00:00-00:01:29 | 0.99 | (video)
(video) | people assembling sculptures on beach | 00:01:30 | 00:00:24-00:00:33 | 00:00:24-00:00:33 | 1.00 | (video)
(video) | Euripides, has most surviving work like 'Medea' and 'The Bacchae', debut in 455 BC. He is a corner stone of greek education in the Hellenistic period. | 00:05:10 | 00:04:22-00:04:41 | 00:04:20-00:04:35 | 0.62 | (video)
(video) | Jennifer Nagel self-introduction | 00:10:02 | 00:00:07-00:00:16 | 00:00:07-00:00:17 | 0.90 | (video)
(video) | divine wind | 00:20:18 | 00:05:06-00:05:08 | 00:05:05-00:05:08 | 0.67 | (video)
(video) | FTC resources | 00:59:35 | 00:43:41-00:44:17 | 00:43:31-00:44:15 | 0.74 | (video)
(video) | North Devon Marine Pioneer | 01:24:45 | 00:00:00-00:02:31, 00:03:58-00:06:20, 00:13:32-00:16:01, 00:16:57-00:17:00, 00:47:21-01:03:10 | 00:00:00-00:02:33, 00:46:56-01:03:08 | 0.77 | (video)
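The IoU column above is computed between the (possibly multi-segment) ground-truth and predicted time ranges. The precise tIoU definition is given in the paper; assuming the standard interval-union formulation, the hedged sketch below reproduces the 0.77 score for the North Devon Marine Pioneer row (the timestamp-parsing helpers are illustrative assumptions).

def to_seconds(ts):
    """'HH:MM:SS' -> seconds."""
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s


def parse_ranges(spec):
    """'HH:MM:SS-HH:MM:SS, ...' -> sorted, merged list of [start, end] in seconds."""
    ranges = sorted(
        [to_seconds(t) for t in part.strip().split("-")] for part in spec.split(",")
    )
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def temporal_iou(gt_spec, pred_spec):
    """Interval-union IoU between two sets of time ranges (a common tIoU definition)."""
    gt, pred = parse_ranges(gt_spec), parse_ranges(pred_spec)
    inter = sum(
        max(0, min(e1, e2) - max(s1, s2)) for s1, e1 in gt for s2, e2 in pred
    )
    union = sum(e - s for s, e in gt) + sum(e - s for s, e in pred) - inter
    return inter / union if union > 0 else 0.0


# North Devon Marine Pioneer row from the table above:
gt = ("00:00:00-00:02:31, 00:03:58-00:06:20, 00:13:32-00:16:01, "
      "00:16:57-00:17:00, 00:47:21-01:03:10")
pred = "00:00:00-00:02:33, 00:46:56-01:03:08"
print(round(temporal_iou(gt, pred), 2))  # 0.77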

Applications

BibTeX


        @article{Vidi2025vidi2,
          title={Vidi2: Large Multimodal Models for Video 
                  Understanding and Creation},
          author={{Vidi Team} and Celong Liu and Chia-Wen Kuo and Chuang Huang and
                  Dawei Du and Fan Chen and Guang Chen and Haoji Zhang and
                  Haojun Zhao and Lingxi Zhang and Lu Guo and Lusha Li and
                  Longyin Wen and Qihang Fan and Qingyu Chen and Rachel Deng and
                  Sijie Zhu and Stuart Siew and Tong Jin and Weiyan Tao and
                  Wen Zhong and Xiaohui Shen and Xin Gu and Zhenfang Chen and Zuhua Lin},
          journal={arXiv preprint arXiv:2511.19529},
          year={2025}
        }

        @article{Vidi2025vidi,
          title={Vidi: Large Multimodal Models for Video 
                  Understanding and Editing},
          author={{Vidi Team} and Celong Liu and Chia-Wen Kuo and Dawei Du and
                  Fan Chen and Guang Chen and Jiamin Yuan and Lingxi Zhang and
                  Lu Guo and Lusha Li and Longyin Wen and Qingyu Chen and
                  Rachel Deng and Sijie Zhu and Stuart Siew and Tong Jin and
                  Wei Lu and Wen Zhong and Xiaohui Shen and Xin Gu and Xing Mei and
                  Xueqiong Qu and Zhenfang Chen},
          journal={arXiv preprint arXiv:2504.15681},
          year={2025}
        }