UNO
Less-to-More Generalization:
Unlocking More Controllability by In-Context Generation

Shaojin Wu, Mengqi Huang*, Wenxu Wu, Yufeng Cheng, Fei Ding+, Qian He
Intelligent Creation Team, ByteDance


We introduce UNO, a universal framework that evolves from single-subject to multi-subject customization. UNO demonstrates strong generalization capabilities and is capable of unifying diverse tasks under one model.

What can UNO do?

Abstract

Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still faces challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multi-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, which makes them hard to apply to multi-subject scenarios. In this study, we propose a highly consistent data synthesis pipeline to tackle the first challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. UNO is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method achieves high consistency while ensuring controllability in both single-subject and multi-subject driven generation.

How does it work?

UNO introduces two pivotal enhancements to the model: progressive cross-modal alignment and universal rotary position embedding (UnoPE). Progressive cross-modal alignment is divided into two stages. In Stage I, we use single-subject in-context generated data to finetune the pretrained text-to-image (T2I) model into a subject-to-image (S2I) model. In Stage II, we continue training on generated multi-subject data pairs. UnoPE helps UNO mitigate attribute confusion when scaling the number of visual subject controls.
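To make the position-embedding idea more concrete, here is a minimal Python sketch of one way to keep the position indices of several reference images from overlapping with the target image or with each other. This page does not spell out how UnoPE assigns indices, so the diagonal-offset layout below is an assumption made for illustration, and the names grid_positions and assign_positions are hypothetical helpers, not part of the released code.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def grid_positions(height: int, width: int,
                   row_offset: int = 0, col_offset: int = 0) -> List[Position]:
    """Return (row, col) indices for a height x width token grid, shifted by the offsets."""
    return [(row_offset + r, col_offset + c)
            for r in range(height)
            for c in range(width)]


def assign_positions(target_hw: Tuple[int, int],
                     reference_hws: List[Tuple[int, int]]) -> Dict[str, List[Position]]:
    """Assumed layout: the target grid starts at (0, 0) and each reference grid
    is placed diagonally after the previous grid, so no two grids share an index."""
    target_h, target_w = target_hw
    layout = {"target": grid_positions(target_h, target_w)}
    row_off, col_off = target_h, target_w
    for k, (h, w) in enumerate(reference_hws):
        layout[f"ref_{k}"] = grid_positions(h, w, row_off, col_off)
        # Move the offset past this reference so the next one starts on the diagonal.
        row_off += h
        col_off += w
    return layout


# Example: a 4x4 target latent grid with two reference grids of different sizes.
layout = assign_positions((4, 4), [(2, 2), (2, 3)])
all_positions = [p for grid in layout.values() for p in grid]
```

In this sketch every token receives a unique (row, col) pair, which is the property a rotary position embedding would need in order to tell the target and each reference subject apart. Whether the actual model uses exactly this offset rule should be checked against the paper and the released code.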

Generalization Capabilities

Comparison with State-of-the-Art Methods

Disclaimer

We open-source this project for academic research. The vast majority of images used in this project are either generated or licensed. If you have any concerns, please contact us, and we will promptly remove any inappropriate content. Our code is released under the Apache 2.0 License, while our models are under the CC BY-NC 4.0 License. Any models related to the FLUX.1-dev base model must adhere to the original licensing terms.

This research aims to advance the field of generative AI. Users are free to create images using this tool, provided they comply with local laws and exercise responsible usage. The developers are not liable for any misuse of the tool by users.

BibTeX

@article{wu2025less,
  title={Less-to-More Generalization: Unlocking More Controllability by In-Context Generation},
  author={Wu, Shaojin and Huang, Mengqi and Wu, Wenxu and Cheng, Yufeng and Ding, Fei and He, Qian},
  journal={arXiv preprint arXiv:2504.02160},
  year={2025}
}
