By treating a reference video that carries the desired semantics as a video prompt, and achieving plug-and-play in-context generation through a Mixture-of-Transformers architecture, our model generates videos that are semantically consistent with the reference video.
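To make the architecture concrete, below is a minimal, hypothetical sketch of one Mixture-of-Transformers block: a frozen base branch (standing in for the pretrained video DiT) and a trainable expert branch each project their own tokens with their own weights, while self-attention runs jointly over the concatenated sequence so the generated video can attend to the reference-video prompt in context. Module names, shapes, and the single-block structure are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Sketch of a Mixture-of-Transformers block: two parallel branches
    (frozen base + trainable expert) sharing one joint self-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.base_qkv = nn.Linear(dim, dim * 3)    # would hold frozen video-DiT weights
        self.expert_qkv = nn.Linear(dim, dim * 3)  # trainable copy for reference tokens
        self.base_out = nn.Linear(dim, dim)
        self.expert_out = nn.Linear(dim, dim)

    def forward(self, target_tokens: torch.Tensor, ref_tokens: torch.Tensor):
        # Each branch projects its own tokens with its own weights ...
        q1, k1, v1 = self.base_qkv(target_tokens).chunk(3, dim=-1)
        q2, k2, v2 = self.expert_qkv(ref_tokens).chunk(3, dim=-1)
        # ... but attention runs over the concatenated sequence, so the
        # generated video attends to the reference-video "prompt" in context.
        q = torch.cat([q1, q2], dim=1)
        k = torch.cat([k1, k2], dim=1)
        v = torch.cat([v1, v2], dim=1)
        b, n, d = q.shape
        hd = d // self.heads
        q, k, v = (t.view(b, n, self.heads, hd).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / hd**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        # Split back so each branch keeps its own output projection.
        n_target = target_tokens.shape[1]
        return self.base_out(out[:, :n_target]), self.expert_out(out[:, n_target:])

# usage: blk = MoTBlock(dim=64)
#        out_t, out_r = blk(torch.randn(1, 16, 64), torch.randn(1, 8, 64))
```

Because only the expert branch would be trained while the base branch stays frozen, the pretrained generator is left intact, which is what makes this form of conditioning plug-and-play.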
Our Video-As-Prompt model supports various downstream applications (a hypothetical usage sketch follows this list):
(1) Different reference videos (different semantics) + the same reference image → generate videos aligned with each video's semantics;
(2) Different reference videos (same semantics) + the same reference image → consistently generate videos aligned with the shared semantics;
(3) The same reference video + different reference images → transfer the same semantics (concept/style/motion/camera) to each reference image;
(4) The same reference video and image + a user-modified text prompt → preserve semantics and identity while the prompt adjusts fine-grained attributes.
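As a purely hypothetical usage sketch (the model handle, method, and argument names are assumptions, not a released API), the four applications reduce to varying one of three inputs to a single call:

```python
# Hypothetical interface; all names are illustrative assumptions.
video = model.generate(
    ref_video="ghibli_style.mp4",   # the video prompt carrying the semantics
    ref_image="portrait.png",       # the identity / content to animate
    prompt="same style, golden-hour light",  # optional fine-grained control
)
# (1)/(2) vary ref_video, (3) varies ref_image, (4) varies prompt.
```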
Given different reference videos with different semantics and a reference image, our model can generate videos aligned with the semantics of each reference video.
Given different reference videos sharing the same semantics and a reference image, our model can consistently generate videos aligned with that shared semantics.
Given a reference video, our model can generate new videos from different reference images, each semantically consistent with the reference video.
Given a reference video and a reference image, our model can preserve semantics and identity while using the text prompt to adjust fine-grained attributes.
Given reference videos with unseen semantics, our model can generate semantically consistent videos in a zero-shot manner.
In each example, Video-As-Prompt takes the reference video as the prompt (left) and generates a video that is semantically consistent with it (right). Four semantic categories are covered:
Concept: generates videos that share a high-level concept, such as entity transformation (e.g., the target becomes a Labubu doll or a Minecraft character) or entity interaction (e.g., an AI lover approaches the target; the target is covered in liquid metal).
Style: generates videos in a reference style (e.g., Ghibli, The Simpsons).
Motion: generates videos with a reference motion, including non-human motion (e.g., floating like balloons) and human motion (e.g., dancing in a shaking style).
Camera: generates videos that follow the reference camera motion, from basic movements (up, down, left, right, zoom in, zoom out) to the complex Hitchcock dolly zoom.
@article{video_as_prompt,
  title={Video-As-Prompt: Unified Semantic Control for Video Generation},
  author={Bian, Yuxuan and Chen, Xin and Li, Zenan and Zhi, Tiancheng and Sang, Shen and Luo, Linjie and Xu, Qiang},
  journal={arXiv preprint},
  year={2025}
}