Vidi: Large Multimodal Models for Video Understanding and Editing

Intelligent Creation
ByteDance Inc.

San Jose/Seattle, US

Figure: Temporal retrieval performance of different models on the proposed VUE-TR benchmark.

Abstract

Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality, large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos that correspond to a given text query, which plays a critical role in intelligent editing. The model can process hour-long videos with strong temporal understanding capability, e.g., retrieving the time ranges for specific queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: spans from 20 seconds to over an hour, significantly longer than in existing temporal/moment retrieval datasets. 2) Audio support: includes audio-based queries for temporal retrieval. 3) Query format: accommodates three query lengths/formats, i.e., keyword, phrase, and sentence. 4) Annotation quality: all ground-truth time ranges are manually annotated with high accuracy. 5) Evaluation metric: a refined IoU metric designed to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
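The refined IoU metric is defined in the paper rather than here; as a minimal sketch, assuming it reduces to the standard intersection-over-union computed over the union of predicted and ground-truth time ranges, an evaluation helper could look like the following (function and type names are illustrative, not taken from the Vidi codebase):

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (start_seconds, end_seconds)

def merge(ranges: List[Range]) -> List[Range]:
    """Merge overlapping or touching ranges into a sorted, disjoint list."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def total_length(ranges: List[Range]) -> float:
    return sum(end - start for start, end in ranges)

def intersection_length(a: List[Range], b: List[Range]) -> float:
    """Total temporal overlap between two disjoint range lists."""
    return sum(max(0.0, min(ea, eb) - max(sa, sb)) for sa, ea in a for sb, eb in b)

def multi_range_iou(pred: List[Range], gt: List[Range]) -> float:
    """IoU between two sets of time ranges, each treated as a union of intervals."""
    pred, gt = merge(pred), merge(gt)
    inter = intersection_length(pred, gt)
    union = total_length(pred) + total_length(gt) - inter
    return inter / union if union > 0 else 0.0

# Example (single range): prediction 13-31 s vs. ground truth 13-30 s -> 17/18 ≈ 0.94
print(round(multi_range_iou([(13, 31)], [(13, 30)]), 2))
```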

Benchmark

Figure: Duration distribution of videos in the proposed VUE-TR evaluation benchmark.

Figure: Distribution of query modality and format in the VUE-TR benchmark.

Qualitative Results

| Input Query | Duration | Ground Truth | Prediction | IoU |
| --- | --- | --- | --- | --- |
| basketball statue | 00:00:46 | 00:00:13-00:00:30 | 00:00:13-00:00:31 | 0.94 |
| gymnasium | 00:01:29 | 00:00:00-00:01:28 | 00:00:00-00:01:29 | 0.99 |
| people assembling sculptures on beach | 00:01:30 | 00:00:24-00:00:33 | 00:00:24-00:00:33 | 1.00 |
| Euripides, has most surviving work like 'Medea' and 'The Bacchae', debut in 455 BC. He is a corner stone of greek education in the Hellenistic period. | 00:05:10 | 00:04:22-00:04:41 | 00:04:20-00:04:35 | 0.62 |
| Jennifer Nagel self-introduction | 00:10:02 | 00:00:07-00:00:16 | 00:00:07-00:00:17 | 0.90 |
| divine wind | 00:20:18 | 00:05:06-00:05:08 | 00:05:05-00:05:08 | 0.67 |
| FTC resources | 00:59:35 | 00:43:41-00:44:17 | 00:43:31-00:44:15 | 0.74 |
| North Devon Marine Pioneer | 01:24:45 | 00:00:00-00:02:31, 00:03:58-00:06:20, 00:13:32-00:16:01, 00:16:57-00:17:00, 00:47:21-01:03:10 | 00:00:00-00:02:33, 00:46:56-01:03:08 | 0.77 |

(The Input Video and Result Clips columns contain video content and are omitted here.)
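For reference, the IoU values in the table above are consistent with a plain union-based interval IoU over the listed time ranges. A small self-contained check of the last row (the multi-range "North Devon Marine Pioneer" example), with timestamps converted to seconds and assuming the ranges within each list are disjoint:

```python
def ts(t: str) -> int:
    """Convert an HH:MM:SS timestamp to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return h * 3600 + m * 60 + s

gt = [("00:00:00", "00:02:31"), ("00:03:58", "00:06:20"), ("00:13:32", "00:16:01"),
      ("00:16:57", "00:17:00"), ("00:47:21", "01:03:10")]
pred = [("00:00:00", "00:02:33"), ("00:46:56", "01:03:08")]
gt = [(ts(a), ts(b)) for a, b in gt]
pred = [(ts(a), ts(b)) for a, b in pred]

# Pairwise overlaps are sufficient because the ranges within each list are disjoint.
inter = sum(max(0, min(ea, eb) - max(sa, sb)) for sa, ea in gt for sb, eb in pred)
union = sum(e - s for s, e in gt) + sum(e - s for s, e in pred) - inter
print(round(inter / union, 2))  # 0.77, matching the table
```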

BibTeX


        @article{Vidi2025vidi,
          title={Vidi: Large Multimodal Models for Video
                  Understanding and Editing},
          author={Vidi Team and Celong Liu and Chia-Wen Kuo and Dawei Du and
                  Fan Chen and Guang Chen and Jiamin Yuan and Lingxi Zhang and
                  Lu Guo and Lusha Li and Longyin Wen and Qingyu Chen and
                  Rachel Deng and Sijie Zhu and Stuart Siew and Tong Jin and
                  Wei Lu and Wen Zhong and Xiaohui Shen and Xin Gu and Xing Mei and
                  Xueqiong Qu},
          journal={arXiv preprint arXiv:2504.15681},
          year={2025}
        }