Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10 seconds to 30 minutes, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme for multi-segment spatio-temporal evaluation. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, with a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving results competitive with popular open-source models of similar scale on video QA benchmarks.
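To make the STG evaluation concrete: vIoU scores a predicted spatio-temporal tube against the ground-truth tube by averaging per-frame box IoU over the union of annotated frames, so frames missed by either side pull the score down. The sketch below follows the conventional vIoU definition from the STG literature; the refined multi-segment scheme used in VUE-STG is specified in the paper, and the function and variable names here are illustrative assumptions, not the official evaluation code.

```python
# Conventional vIoU sketch: average per-frame box IoU over the union of
# frames annotated in either tube. Names are illustrative, not Vidi2's
# official evaluation code.

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def v_iou(gt_tube, pred_tube):
    """vIoU of two tubes given as {frame_index: box} dicts.

    Per-frame IoU is summed over frames present in both tubes and
    normalized by the union of annotated frames, so frames missing
    from either tube count as zero overlap.
    """
    union_frames = set(gt_tube) | set(pred_tube)
    if not union_frames:
        return 0.0
    shared = set(gt_tube) & set(pred_tube)
    total = sum(box_iou(gt_tube[t], pred_tube[t]) for t in shared)
    return total / len(union_frames)
```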
Duration distribution of videos in the proposed VUE-STG evaluation benchmark.
Duration distribution of videos in the proposed VUE-TR-V2 evaluation benchmark.
The distribution of query modality and format in the VUE-TR-V2 benchmark.
| Input Video | Input Query | Duration | Result Clips with Tubes |
| --- | --- | --- | --- |
| | The man wearing a brown suit who is playing drums in an indoor setting | 00:04:12 | |
| | The gorilla which is driving with two men. | 00:16:19 | |
| | a woman in glasses who is walking on street | 00:04:11 | |
| | The boy who stands outside a charming house with warm lights, beneath a starry night sky featuring a full moon. | 00:02:00 | |
| | The glowing blue water beads in which the mango seed is placed, with its germination into a root and shoot visualized through a time-lapse sequence against a dark background | 00:15:50 | |
| Input Video | Input Query | Duration | Ground Truth | Prediction | IoU | Result Clips |
| --- | --- | --- | --- | --- | --- | --- |
| | basketball statue | 00:00:46 | 00:00:13-00:00:30 | 00:00:13-00:00:31 | 0.94 | |
| | gymnasium | 00:01:29 | 00:00:00-00:01:28 | 00:00:00-00:01:29 | 0.99 | |
| | people assembling sculptures on beach | 00:01:30 | 00:00:24-00:00:33 | 00:00:24-00:00:33 | 1.00 | |
| | Euripides, has most surviving work like 'Medea' and 'The Bacchae', debut in 455 BC. He is a corner stone of greek education in the Hellenistic period. | 00:05:10 | 00:04:22-00:04:41 | 00:04:20-00:04:35 | 0.62 | |
| | Jennifer Nagel self-introduction | 00:10:02 | 00:00:07-00:00:16 | 00:00:07-00:00:17 | 0.90 | |
| | divine wind | 00:20:18 | 00:05:06-00:05:08 | 00:05:05-00:05:08 | 0.67 | |
| | FTC resources | 00:59:35 | 00:43:41-00:44:17 | 00:43:31-00:44:15 | 0.74 | |
| | North Devon Marine Pioneer | 01:24:45 | 00:00:00-00:02:31, 00:03:58-00:06:20, 00:13:32-00:16:01, 00:16:57-00:17:00, 00:47:21-01:03:10 | 00:00:00-00:02:33, 00:46:56-01:03:08 | 0.77 | |
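The IoU column above is temporal IoU (tIoU) over possibly multi-segment time ranges: the total overlap between the predicted and ground-truth interval sets, divided by the total length of their union. Under this definition the multi-segment "North Devon Marine Pioneer" row evaluates to the listed 0.77 even though three short ground-truth segments are missed. A minimal sketch follows, assuming `HH:MM:SS` timestamps; the helper names are illustrative, not the benchmark's official evaluation code.

```python
# Multi-segment temporal IoU (tIoU) sketch: intersection length of two
# interval sets divided by their union length. Names are illustrative.

def to_seconds(ts):
    """'HH:MM:SS' -> total seconds."""
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

def merge(intervals):
    """Merge overlapping (start, end) intervals into a disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def t_iou(gt, pred):
    gt, pred = merge(gt), merge(pred)
    # Both sets are disjoint after merging, so summing pairwise
    # overlaps gives the total intersection length.
    inter = sum(max(0, min(ge, pe) - max(gs, ps))
                for gs, ge in gt for ps, pe in pred)
    union = (sum(e - s for s, e in gt)
             + sum(e - s for s, e in pred) - inter)
    return inter / union if union > 0 else 0.0

# First row above: GT 00:00:13-00:00:30 vs. prediction 00:00:13-00:00:31.
gt = [(to_seconds("00:00:13"), to_seconds("00:00:30"))]
pred = [(to_seconds("00:00:13"), to_seconds("00:00:31"))]
print(round(t_iou(gt, pred), 2))  # 17s overlap / 18s union -> 0.94
```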
🎬 Smart Split
Automatically clips, reframes, captions, and transcribes long videos into short, TikTok-ready moments. (Inc. Article, Mashable Article)
🧠 AI Outline
Helps creators turn a simple prompt or trending topic into structured titles, hooks, and outlines. (Inc. Article, Mashable Article)
@article{Vidi2025vidi2,
title={Vidi2: Large Multimodal Models for Video
Understanding and Creation},
  author={Vidi Team and Celong Liu and Chia-Wen Kuo and Chuang Huang and
          Dawei Du and Fan Chen and Guang Chen and Haoji Zhang and
          Haojun Zhao and Lingxi Zhang and Lu Guo and Lusha Li and
          Longyin Wen and Qihang Fan and Qingyu Chen and Rachel Deng and
          Sijie Zhu and Stuart Siew and Tong Jin and Weiyan Tao and
          Wen Zhong and Xiaohui Shen and Xin Gu and Zhenfang Chen and
          Zuhua Lin},
journal={arXiv preprint arXiv:2511.19529},
year={2025}
}
@article{Vidi2025vidi,
title={Vidi: Large Multimodal Models for Video
Understanding and Editing},
  author={Vidi Team and Celong Liu and Chia-Wen Kuo and Dawei Du and
          Fan Chen and Guang Chen and Jiamin Yuan and Lingxi Zhang and
          Lu Guo and Lusha Li and Longyin Wen and Qingyu Chen and
          Rachel Deng and Sijie Zhu and Stuart Siew and Tong Jin and
          Wei Lu and Wen Zhong and Xiaohui Shen and Xin Gu and Xing Mei and
          Xueqiong Qu and Zhenfang Chen},
journal={arXiv preprint arXiv:2504.15681},
year={2025}
}