XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Bowen Chen*, Mengyi Zhao*, Haomiao Sun*, Li Chen*†, Xu Wang, Kang Du, Xinglong Wu
*Equal Contribution    †Project Lead
Intelligent Creation Team, ByteDance
XVerse Demonstration

Figure 1: XVerse's capability in single/multi-subject personalization and semantic attribute control (pose, style, lighting). Input conditions are highlighted with red dots.

📝 Abstract

In the field of text-to-image generation, achieving fine-grained control over multiple subject identities and semantic attributes (such as pose, style, lighting) while maintaining high quality and consistency has been a significant challenge. Existing methods often introduce artifacts or suffer from attribute entanglement issues, especially when handling multiple subjects.

To overcome these challenges, we propose XVerse, a novel model for multi-subject controlled generation. By transforming reference images into token-specific text flow modulation offsets, XVerse enables precise and independent control of individual subjects without disturbing image latents or features. As a result, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over each subject's identity and semantic attributes.

This advancement significantly improves the capability for personalization and complex scene generation.


🔍 Method

The core of XVerse is to achieve consistent control over multiple subject identities and semantic attributes by learning offsets in the text flow modulation mechanism of Diffusion Transformers (DiT). Our method consists of four key components:

XVerse Framework

Figure 2: XVerse framework overview. Reference images are processed by the T-Mod resampler, and the resulting offsets are injected into the token-specific modulation of each DiT block.

1. T-Mod Adapter

  • 🔹 Employs a perceiver resampler as the text flow modulation adapter
  • 🔹 Combines CLIP-encoded image features with text prompt features to generate cross offsets
  • 🔹 Implements token-specific text flow modulation, enabling the model to precisely control multiple subjects

T-Mod(I_ref, T) → Δ_shared, {Δ_block_i}

Where I_ref is the reference image, T is the text prompt, Δ_shared is the shared offset, and Δ_block_i is the specific offset for each DiT block.
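As a rough illustration of this mapping (a sketch under our own assumptions, not the released implementation), the adapter can be written as a small cross-attention resampler in which text tokens query the CLIP image features; the dimensions, module names, and head layout below are hypothetical.

```python
import torch
import torch.nn as nn

class TModAdapter(nn.Module):
    """Hypothetical sketch of a T-Mod-style resampler: text tokens
    cross-attend over CLIP image features, then linear heads emit a
    shared offset plus one offset per DiT block."""

    def __init__(self, dim=768, num_blocks=19, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_shared = nn.Linear(dim, dim)               # head for Δ_shared
        self.to_blocks = nn.Linear(dim, dim * num_blocks)  # heads for Δ_block_i
        self.num_blocks = num_blocks

    def forward(self, clip_img_feats, text_feats):
        # each text token gathers evidence from the reference-image features
        fused, _ = self.cross_attn(text_feats, clip_img_feats, clip_img_feats)
        delta_shared = self.to_shared(fused)                   # (B, T, D)
        per_block = self.to_blocks(fused)                      # (B, T, D*N)
        delta_blocks = per_block.unflatten(-1, (self.num_blocks, -1))
        return delta_shared, delta_blocks.unbind(dim=2)        # Δ_shared, {Δ_block_i}
```

Calling the adapter with CLIP features of shape (B, 257, 768) and text features of shape (B, 77, 768) returns the shared offset and a tuple of per-block offsets, matching the signature above.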

2. Text Flow Modulation Mechanism

  • 🔹 Transforms reference images into text flow modulation offsets
  • 🔹 Adds offsets to the corresponding token embeddings injected into the model
  • 🔹 Adjusts original modulation parameters (scaling and shift parameters) to achieve precise control

This mechanism allows XVerse to make fine adjustments to specific subjects while maintaining the overall structure of the image.
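In FLUX-style DiT blocks, modulation multiplies normalized tokens by a scale and adds a shift. A minimal sketch of adding token-specific offsets on top of those parameters (the masking scheme, names, and shapes are our assumptions) could look like:

```python
import torch

def modulate_with_offsets(x, scale, shift, delta_scale, delta_shift, subject_mask):
    """Apply (scale, shift) modulation, adding T-Mod offsets only on the
    text tokens that describe the referenced subject.

    x:            (B, T, D) token embeddings after layer norm
    scale, shift: (B, 1, D) base modulation from timestep/text conditioning
    delta_*:      (B, T, D) token-specific offsets from the T-Mod adapter
    subject_mask: (B, T, 1) 1.0 on the subject's tokens, 0.0 elsewhere
    """
    eff_scale = scale + subject_mask * delta_scale
    eff_shift = shift + subject_mask * delta_shift
    return x * (1.0 + eff_scale) + eff_shift
```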

3. VAE-Encoded Image Feature Module

  • 🔹 Incorporates VAE-encoded image features as an auxiliary module into a single block of FLUX
  • 🔹 Enhances detail preservation capability, making generated images more realistic
  • 🔹 Minimizes the occurrence of artifacts and distortions
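As an illustrative sketch only (the injection point, shapes, and names are assumptions), attaching VAE-encoded reference features to a single block might amount to appending projected reference tokens for that block's attention and discarding them afterwards:

```python
import torch
import torch.nn as nn

class VAEFeatureInjector(nn.Module):
    """Hypothetical auxiliary path: project flattened VAE latents of the
    reference image and expose them to one DiT block's attention only."""

    def __init__(self, latent_dim=16, dim=3072):
        super().__init__()
        self.proj = nn.Linear(latent_dim, dim)

    def forward(self, block, hidden_states, ref_latents):
        # ref_latents: (B, L, latent_dim) flattened VAE latents of the reference
        ref_tokens = self.proj(ref_latents)
        x = torch.cat([hidden_states, ref_tokens], dim=1)  # block attends to them
        x = block(x)
        return x[:, : hidden_states.shape[1]]  # drop reference tokens afterwards
```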

4. Regularization Techniques

To further improve generation quality and consistency, XVerse introduces two key regularization techniques:

Region Preservation Loss: Randomly retains the modulation injection on one side of the image and enforces consistency in the non-modulated regions

Text-Image Attention Loss: Aligns the cross-attention dynamics between the modulated model and the reference T2I branch
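In spirit (the masking scheme, weighting, and where the attention maps are read out are all assumptions on our part), the two losses could be sketched as:

```python
import torch
import torch.nn.functional as F

def region_preservation_loss(pred_mod, pred_ref, non_mod_mask):
    """Penalize drift in regions that receive no modulation: compare the
    modulated branch against the plain T2I branch outside the subject."""
    return F.l1_loss(pred_mod * non_mod_mask, pred_ref * non_mod_mask)

def text_image_attention_loss(attn_mod, attn_ref):
    """Match text-to-image cross-attention maps of the modulated model
    to those of the reference T2I branch (one map per block)."""
    return torch.stack([F.mse_loss(a, b) for a, b in zip(attn_mod, attn_ref)]).mean()
```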


📊 Training Data

XVerse's training employs a carefully designed data construction process:

Training Data Construction Process

Figure 4: Training data construction process. Using Florence2 for image description and phrase localization, and SAM2 for face extraction.
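For orientation, a pipeline like the one in Figure 4 could be assembled from the public Florence-2 and SAM2 releases roughly as below; the checkpoints, task prompts, and the box-to-mask handoff are illustrative assumptions, not the authors' exact code.

```python
import numpy as np
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Public checkpoints (assumed; the page does not pin exact versions)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
florence = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True).eval()
sam2 = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def run_florence(image, task, extra=""):
    """Run one Florence-2 task (e.g. captioning or phrase grounding)."""
    inputs = processor(text=task + extra, images=image, return_tensors="pt")
    ids = florence.generate(input_ids=inputs["input_ids"],
                            pixel_values=inputs["pixel_values"], max_new_tokens=1024)
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task, image_size=image.size)

image = Image.open("sample.jpg").convert("RGB")
caption = run_florence(image, "<CAPTION>")["<CAPTION>"]
grounding = run_florence(image, "<CAPTION_TO_PHRASE_GROUNDING>", caption)

# Segment each grounded phrase with SAM2, using its box as the prompt
sam2.set_image(np.array(image))
for box in grounding["<CAPTION_TO_PHRASE_GROUNDING>"]["bboxes"]:
    masks, scores, _ = sam2.predict(box=np.array(box))
```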


🧪 XVerseBench

To comprehensively evaluate multi-subject control image generation capabilities, we propose the XVerseBench benchmark:

XVerseBench Data Distribution

Figure 5: XVerseBench data distribution and sample examples

Dataset Characteristics

  • Rich and Diverse Subjects:
    • 👤 20 different human identities
    • 🏺 74 unique objects
    • 🐾 45 different animal species/individuals
    • 📝 300 unique test prompts in total
  • Test Scenarios:
    • Single-subject control
    • Dual-subject control
    • Triple-subject control

Evaluation Methods

XVerseBench employs multi-dimensional evaluation metrics:

Evaluation Metric      Description
DPG Score              Evaluates the model's editing capability
Face ID Similarity     Evaluates the model's ability to maintain human identity
DINOv2 Similarity      Evaluates the model's ability to maintain object features
Aesthetic Score        Evaluates the aesthetic quality of generated images
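As an example of how one of these metrics can be computed, the sketch below scores DINOv2 similarity with the public dinov2_vitb14 checkpoint; the preprocessing and the use of the CLS embedding are our assumptions, not necessarily XVerseBench's exact protocol.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Public DINOv2 backbone from torch.hub (assumed to match the paper's choice)
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def dino_similarity(ref_path: str, gen_path: str) -> float:
    """Cosine similarity between DINOv2 CLS embeddings of two images."""
    feats = [model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
             for p in (ref_path, gen_path)]
    return F.cosine_similarity(feats[0], feats[1]).item()
```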

📈 Experimental Results

Comparison with Existing Methods

XVerse significantly outperforms existing methods on XVerseBench:

Comparison with Existing Methods

Figure 6: Qualitative comparison of XVerse with other methods on XVerseBench

Ablation Study Results

Figure 7: Demonstration of the effects of text flow modulation resampler and VAE-encoded image features


👤 Single-Subject Control Demonstrations

Precise Control of Individual Subject Identity and Attributes

XVerse demonstrates exceptional capability in controlling single-subject identity and semantic attributes. By leveraging the text flow modulation mechanism, our model preserves a subject's identity while attributes such as pose, clothing, and environmental context are freely varied:

Figure 9: XVerse's single-subject control capabilities, demonstrating precise identity preservation while manipulating various attributes such as pose, clothing, and environmental context.


👥 Multi-Subject Control Demonstrations

Consistent Control of Multiple Subjects in Complex Scenes

One of XVerse's most significant innovations is its ability to maintain consistency across multiple subjects in a single generated image:

Figure 10: XVerse's multi-subject control capabilities, showing consistent identity preservation across multiple subjects while maintaining natural interaction and scene coherence.

✨ Semantic Attributes Control Demonstrations

Fine-Grained Control of Lighting, Pose, and Style

Beyond subject identity control, XVerse excels in manipulating semantic attributes such as lighting, pose, and style. This capability enables unprecedented creative control:

Semantic Attributes Control Example

Figure 11: XVerse's semantic attributes control capabilities, demonstrating fine-grained manipulation of lighting, pose, and style.


📄 Citation

If you use XVerse in your research, please cite our paper:


@article{chen2025xverse,
  title={XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation},
  author={Chen, Bowen and Zhao, Mengyi and Sun, Haomiao and Chen, Li and Wang, Xu and Du, Kang and Wu, Xinglong},
  journal={arXiv preprint arXiv:2506.21416},
  year={2025}
}

🙏 Acknowledgements

We thank all researchers and engineers who contributed to this project. This research was supported by the ByteDance Intelligent Creation Team.