Figure 1: XVerse's capability in single/multi-subject personalization and semantic attribute control (pose, style, lighting). Input conditions are highlighted with red dots.
In the field of text-to-image generation, achieving fine-grained control over multiple subject identities and semantic attributes (such as pose, style, lighting) while maintaining high quality and consistency has been a significant challenge. Existing methods often introduce artifacts or suffer from attribute entanglement issues, especially when handling multiple subjects.
To overcome these challenges, we propose XVerse, a novel multi-subject controllable generation model. By transforming reference images into token-specific text flow modulation offsets, XVerse enables precise and independent control of individual subjects without interfering with the image latents or features, significantly improving personalization and complex scene generation.
At its core, XVerse achieves consistent control over multiple subject identities and semantic attributes by learning offsets for the text flow modulation mechanism of Diffusion Transformers (DiT). Our method consists of four key components, illustrated in Figure 2 and described below.
Figure 2: XVerse framework overview. Reference images are processed through the T-Mod resampler and subsequently injected into each token modulation adapter.
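A plausible form of the per-block offset, assuming the T-Mod resampler maps the reference image and text prompt to one shared offset plus one additional offset per DiT block, is:

$$
\Delta_i = \Delta_{\text{shared}} + \Delta_{\text{block}_i},
\qquad
\left(\Delta_{\text{shared}},\ \{\Delta_{\text{block}_i}\}\right) = \text{T-Mod}\!\left(I_{\text{ref}},\, T\right)
$$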
where $I_{\text{ref}}$ is the reference image, $T$ is the text prompt, $\Delta_{\text{shared}}$ is the shared offset, and $\Delta_{\text{block}_i}$ is the block-specific offset for the $i$-th DiT block.
This mechanism allows XVerse to make fine adjustments to specific subjects while maintaining the overall structure of the image.
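To make the idea concrete, the minimal PyTorch sketch below (not the project's actual code; `TokenModulationAdapter`, `subject_mask`, and the tensor shapes are illustrative assumptions) adds a learned offset to one DiT block's text flow modulation (shift/scale) only at the token positions that describe the referenced subject.

```python
import torch
import torch.nn as nn


class TokenModulationAdapter(nn.Module):
    """Illustrative adapter: maps resampled reference-image features to per-token
    offsets that are added to one DiT block's text-flow modulation parameters."""

    def __init__(self, ref_dim: int, mod_dim: int):
        super().__init__()
        # One linear head produces both a shift offset and a scale offset.
        self.to_offset = nn.Linear(ref_dim, 2 * mod_dim)

    def forward(self, ref_feats, base_shift, base_scale, subject_mask):
        # ref_feats:    (B, T, ref_dim)  reference features aligned to text tokens
        # base_shift:   (B, T, mod_dim)  original text-flow modulation shift
        # base_scale:   (B, T, mod_dim)  original text-flow modulation scale
        # subject_mask: (B, T, 1)        1 at tokens describing the subject, 0 elsewhere
        d_shift, d_scale = self.to_offset(ref_feats).chunk(2, dim=-1)
        # Only subject tokens receive the offset; all other tokens keep their
        # original modulation, so unrelated image content is left untouched.
        shift = base_shift + subject_mask * d_shift
        scale = base_scale + subject_mask * d_scale
        return shift, scale


# Toy usage: one subject described by tokens 3..5 of a 16-token prompt.
B, T, ref_dim, mod_dim = 1, 16, 768, 3072
adapter = TokenModulationAdapter(ref_dim, mod_dim)
subject_mask = torch.zeros(B, T, 1)
subject_mask[:, 3:6] = 1.0
shift, scale = adapter(
    torch.randn(B, T, ref_dim),   # resampled reference features
    torch.zeros(B, T, mod_dim),   # base shift from the DiT modulation pathway
    torch.ones(B, T, mod_dim),    # base scale from the DiT modulation pathway
    subject_mask,
)
print(shift.shape, scale.shape)
```

Because the offset is gated by a per-subject token mask, multiple subjects in the same prompt can receive independent offsets without affecting each other.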
In addition to the modulation offsets, VAE-encoded features of the reference image are injected to enhance detail preservation.
To further improve generation quality and consistency, XVerse introduces two key regularization techniques (a brief loss sketch follows the list):

- Region Preservation Loss: enforces consistency in non-modulated regions by randomly retaining the modulation injection on only one side of the image
- Text-Image Attention Loss: aligns the cross-attention dynamics of the modulated model with those of the reference T2I branch
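A minimal sketch of how such loss terms could look, assuming access to the denoiser outputs from a modulated pass and a non-modulated reference pass plus their text-to-image cross-attention maps; the tensor layout and the MSE/L1 forms are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def region_preservation_loss(pred_mod, pred_ref, preserve_mask):
    # pred_mod:      (B, C, H, W) denoiser output with modulation injected
    # pred_ref:      (B, C, H, W) denoiser output from the non-modulated reference pass
    # preserve_mask: (B, 1, H, W) 1 where modulation was NOT injected (the preserved side)
    # Penalize drift in the regions that modulation should leave untouched.
    diff = (pred_mod - pred_ref) * preserve_mask
    denom = (preserve_mask.sum() * pred_mod.shape[1]).clamp(min=1.0)
    return diff.pow(2).sum() / denom


def text_image_attention_loss(attn_mod, attn_ref):
    # attn_mod, attn_ref: (B, heads, text_tokens, image_tokens) cross-attention maps
    # Align the attention dynamics of the modulated model with the reference T2I branch.
    return F.l1_loss(attn_mod, attn_ref)


# Toy usage: preserve modulation on only one side of the image (here, the left half).
B, C, H, W = 1, 4, 32, 32
preserve_mask = torch.zeros(B, 1, H, W)
preserve_mask[..., :, : W // 2] = 1.0
l_region = region_preservation_loss(torch.randn(B, C, H, W), torch.randn(B, C, H, W), preserve_mask)
l_attn = text_image_attention_loss(torch.rand(B, 8, 16, H * W), torch.rand(B, 8, 16, H * W))
print(float(l_region), float(l_attn))
```

In training, such auxiliary terms would be added to the standard diffusion denoising objective with appropriate weights.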
XVerse's training employs a carefully designed data construction process:
Figure 4: Training data construction process. Using Florence2 for image description and phrase localization, and SAM2 for face extraction.
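As a rough illustration of this kind of pipeline (not the project's released tooling), the sketch below runs Florence-2 via Hugging Face transformers for detailed captioning and phrase grounding; the checkpoint ID, the task prompts, the availability of `post_process_generation` on the remote-code processor, and the `run_florence` helper follow Florence-2's public usage example, and the SAM2 face-extraction step is only indicated in a comment.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # assumed checkpoint; any Florence-2 variant should work
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.float32)


def run_florence(image: Image.Image, task: str, extra_text: str = "") -> dict:
    """Run one Florence-2 task (e.g. captioning or phrase grounding) on an image."""
    inputs = processor(text=task + extra_text, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))


image = Image.open("example.jpg").convert("RGB")  # placeholder input image

# 1) Detailed caption for the whole image.
caption = run_florence(image, "<MORE_DETAILED_CAPTION>")["<MORE_DETAILED_CAPTION>"]

# 2) Ground each phrase of the caption to a bounding box.
grounding = run_florence(image, "<CAPTION_TO_PHRASE_GROUNDING>", caption)

# 3) The grounded boxes could then be passed to SAM2 to segment subjects and extract faces.
print(caption)
print(grounding["<CAPTION_TO_PHRASE_GROUNDING>"])
```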
To comprehensively evaluate multi-subject controlled image generation, we propose the XVerseBench benchmark:
Figure 5: XVerseBench data distribution and sample examples
XVerseBench employs multi-dimensional evaluation metrics (a metric-computation sketch follows the table):
| Evaluation Metric | Description |
|---|---|
| DPG Score | Evaluates the model's editing capability |
| Face ID Similarity | Evaluates the model's ability to maintain human identity |
| DINOv2 Similarity | Evaluates the model's ability to maintain object features |
| Aesthetic Score | Evaluates the aesthetic quality of generated images |
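As an illustration of one of these metrics (the benchmark's exact DINOv2 checkpoint, cropping, and score aggregation are not specified here), the following sketch computes a DINOv2 cosine similarity between a generated subject crop and its reference image using Hugging Face transformers; the file names are placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint; the benchmark's DINOv2 variant may differ.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()


@torch.no_grad()
def dinov2_embedding(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    # Use the CLS token of the last hidden state as a global image descriptor.
    return model(**inputs).last_hidden_state[:, 0]


def dinov2_similarity(generated_crop: Image.Image, reference: Image.Image) -> float:
    a, b = dinov2_embedding(generated_crop), dinov2_embedding(reference)
    return F.cosine_similarity(a, b).item()


score = dinov2_similarity(
    Image.open("generated_subject.png").convert("RGB"),
    Image.open("reference_subject.png").convert("RGB"),
)
print(f"DINOv2 similarity: {score:.3f}")
```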
XVerse significantly outperforms existing methods on XVerseBench:
Figure 6: Qualitative comparison of XVerse with other methods on XVerseBench
Figure 7: Demonstration of the effects of the text flow modulation resampler and VAE-encoded image features
XVerse demonstrates exceptional capability in controlling single-subject identity and semantic attributes. By leveraging the text flow modulation mechanism, our model preserves a subject's identity while independently manipulating attributes such as pose, clothing, and environmental context.
Figure 9: XVerse's single-subject control capabilities, demonstrating precise identity preservation while manipulating various attributes such as pose, clothing, and environmental context.
One of XVerse's most significant innovations is its ability to maintain consistency across multiple subjects in a single generated image:
Figure 10: XVerse's multi-subject control capabilities, showing consistent identity preservation across multiple subjects while maintaining natural interaction and scene coherence.
Beyond subject identity control, XVerse excels in manipulating semantic attributes such as lighting, pose, and style. This capability enables unprecedented creative control:
Figure 11: XVerse's semantic attribute control capabilities, demonstrating fine-grained manipulation of lighting, pose, and style.
If you use XVerse in your research, please cite our paper:
We thank all researchers and engineers who contributed to this project. This research was supported by the ByteDance Intelligent Creation Team.