XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Bowen Chen*, Mengyi Zhao*, Haomiao Sun*, Li Chen*†, Xu Wang, Kang Du, Xinglong Wu
*Equal Contribution    †Project Lead
Intelligent Creation Team, ByteDance
XVerse Demonstration

Figure 1: XVerse's capability in single/multi-subject personalization and semantic attribute control (pose, style, lighting). Input conditions are highlighted with red dots.

📝 Abstract

In the field of text-to-image generation, achieving fine-grained control over multiple subject identities and semantic attributes (such as pose, style, lighting) while maintaining high quality and consistency has been a significant challenge. Existing methods often introduce artifacts or suffer from attribute entanglement issues, especially when handling multiple subjects.

To overcome these challenges, we propose XVerse, a novel model for multi-subject controlled generation. By transforming reference images into token-specific text flow modulation offsets, XVerse enables precise and independent control of individual subjects without disturbing image latents or features. As a result, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over each subject's identity and semantic attributes.

This advancement significantly improves the capability for personalization and complex scene generation.


🔍 Method

The core of XVerse is to achieve consistent control over multiple subject identities and semantic attributes by learning offsets in the text flow modulation mechanism of Diffusion Transformers (DiT). Our method consists of four key components:

XVerse Framework

Figure 2: XVerse framework overview. Reference images are processed by the T-Mod resampler, and the resulting offsets are injected into the token-specific modulation of each DiT block.

1. T-Mod Adapter

  • 🔹 Employs a perceiver resampler as the text flow modulation adapter
  • 🔹 Combines CLIP-encoded image features with text prompt features to generate cross offsets
  • 🔹 Implements token-specific text flow modulation, enabling the model to precisely control multiple subjects

T-Mod(I_ref, T) → Δ_shared, {Δ_block_i}

Where I_ref is the reference image, T is the text prompt, Δ_shared is the shared offset, and Δ_block_i is the specific offset for each DiT block.
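As a rough illustration of this mapping (a sketch under our own assumptions, not the released implementation), the adapter can be written as a small cross-attention resampler in which text tokens query the CLIP image features; the dimensions, module names, and head layout below are hypothetical.

```python
import torch
import torch.nn as nn

class TModAdapter(nn.Module):
    """Hypothetical sketch of a T-Mod-style resampler: text tokens
    cross-attend over CLIP image features, then linear heads emit a
    shared offset plus one offset per DiT block."""

    def __init__(self, dim=768, num_blocks=19, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_shared = nn.Linear(dim, dim)               # head for Δ_shared
        self.to_blocks = nn.Linear(dim, dim * num_blocks)  # heads for Δ_block_i
        self.num_blocks = num_blocks

    def forward(self, clip_img_feats, text_feats):
        # each text token gathers evidence from the reference-image features
        fused, _ = self.cross_attn(text_feats, clip_img_feats, clip_img_feats)
        delta_shared = self.to_shared(fused)                   # (B, T, D)
        per_block = self.to_blocks(fused)                      # (B, T, D*N)
        delta_blocks = per_block.unflatten(-1, (self.num_blocks, -1))
        return delta_shared, delta_blocks.unbind(dim=2)        # Δ_shared, {Δ_block_i}
```

Calling the adapter with CLIP features of shape (B, 257, 768) and text features of shape (B, 77, 768) returns the shared offset and a tuple of per-block offsets, matching the signature above.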

2. Text Flow Modulation Mechanism

  • 🔹 Transforms reference images into text flow modulation offsets
  • 🔹 Adds offsets to the corresponding token embeddings injected into the model
  • 🔹 Adjusts original modulation parameters (scaling and shift parameters) to achieve precise control

This mechanism allows XVerse to make fine adjustments to specific subjects while maintaining the overall structure of the image.
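In FLUX-style DiT blocks, modulation multiplies normalized tokens by a scale and adds a shift. A minimal sketch of adding token-specific offsets on top of those parameters (the masking scheme, names, and shapes are our assumptions) could look like:

```python
import torch

def modulate_with_offsets(x, scale, shift, delta_scale, delta_shift, subject_mask):
    """Apply (scale, shift) modulation, adding T-Mod offsets only on the
    text tokens that describe the referenced subject.

    x:            (B, T, D) token embeddings after layer norm
    scale, shift: (B, 1, D) base modulation from timestep/text conditioning
    delta_*:      (B, T, D) token-specific offsets from the T-Mod adapter
    subject_mask: (B, T, 1) 1.0 on the subject's tokens, 0.0 elsewhere
    """
    eff_scale = scale + subject_mask * delta_scale
    eff_shift = shift + subject_mask * delta_shift
    return x * (1.0 + eff_scale) + eff_shift
```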

3. VAE-Encoded Image Feature Module

  • 🔹 Incorporates VAE-encoded image features as an auxiliary module into a single block of FLUX
  • 🔹 Enhances detail preservation capability, making generated images more realistic
  • 🔹 Minimizes the occurrence of artifacts and distortions
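As an illustrative sketch only (the injection point, shapes, and names are assumptions), attaching VAE-encoded reference features to a single block might amount to appending projected reference tokens for that block's attention and discarding them afterwards:

```python
import torch
import torch.nn as nn

class VAEFeatureInjector(nn.Module):
    """Hypothetical auxiliary path: project flattened VAE latents of the
    reference image and expose them to one DiT block's attention only."""

    def __init__(self, latent_dim=16, dim=3072):
        super().__init__()
        self.proj = nn.Linear(latent_dim, dim)

    def forward(self, block, hidden_states, ref_latents):
        # ref_latents: (B, L, latent_dim) flattened VAE latents of the reference
        ref_tokens = self.proj(ref_latents)
        x = torch.cat([hidden_states, ref_tokens], dim=1)  # block attends to them
        x = block(x)
        return x[:, : hidden_states.shape[1]]  # drop reference tokens afterwards
```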

4. Regularization Techniques

To further improve generation quality and consistency, XVerse introduces two key regularization techniques:

Region Preservation Loss: Randomly retains the modulation injection on one side of the image and enforces consistency in the non-modulated regions

Text-Image Attention Loss: Aligns the cross-attention dynamics between the modulated model and the reference T2I branch
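In spirit (the masking scheme, weighting, and where the attention maps are read out are all assumptions on our part), the two losses could be sketched as:

```python
import torch
import torch.nn.functional as F

def region_preservation_loss(pred_mod, pred_ref, non_mod_mask):
    """Penalize drift in regions that receive no modulation: compare the
    modulated branch against the plain T2I branch outside the subject."""
    return F.l1_loss(pred_mod * non_mod_mask, pred_ref * non_mod_mask)

def text_image_attention_loss(attn_mod, attn_ref):
    """Match text-to-image cross-attention maps of the modulated model
    to those of the reference T2I branch (one map per block)."""
    return torch.stack([F.mse_loss(a, b) for a, b in zip(attn_mod, attn_ref)]).mean()
```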


📊 Training Data

XVerse's training employs a carefully designed data construction process:

Training Data Construction Process

Figure 4: Training data construction process. Using Florence2 for image description and phrase localization, and SAM2 for face extraction.
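For orientation, a pipeline like the one in Figure 4 could be assembled from the public Florence-2 and SAM2 releases roughly as below; the checkpoints, task prompts, and the box-to-mask handoff are illustrative assumptions, not the authors' exact code.

```python
import numpy as np
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Public checkpoints (assumed; the page does not pin exact versions)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
florence = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True).eval()
sam2 = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

def run_florence(image, task, extra=""):
    """Run one Florence-2 task (e.g. captioning or phrase grounding)."""
    inputs = processor(text=task + extra, images=image, return_tensors="pt")
    ids = florence.generate(input_ids=inputs["input_ids"],
                            pixel_values=inputs["pixel_values"], max_new_tokens=1024)
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task, image_size=image.size)

image = Image.open("sample.jpg").convert("RGB")
caption = run_florence(image, "<CAPTION>")["<CAPTION>"]
grounding = run_florence(image, "<CAPTION_TO_PHRASE_GROUNDING>", caption)

# Segment each grounded phrase with SAM2, using its box as the prompt
sam2.set_image(np.array(image))
for box in grounding["<CAPTION_TO_PHRASE_GROUNDING>"]["bboxes"]:
    masks, scores, _ = sam2.predict(box=np.array(box))
```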


🧪 XVerseBench

To comprehensively evaluate multi-subject control image generation capabilities, we propose the XVerseBench benchmark:

XVerseBench Data Distribution

Figure 5: XVerseBench data distribution and sample examples

Dataset Characteristics

  • Rich and Diverse Subjects:
    • 👤 20 different human identities
    • 🏺 74 unique objects
    • 🐾 45 different animal species/individuals
    • 📝 300 unique test prompts in total
  • Test Scenarios:
    • Single-subject control
    • Dual-subject control
    • Triple-subject control

Evaluation Methods

XVerseBench employs multi-dimensional evaluation metrics:

Evaluation Metric      Description
DPG Score              Evaluates the model's editing capability
Face ID Similarity     Evaluates the model's ability to maintain human identity
DINOv2 Similarity      Evaluates the model's ability to maintain object features
Aesthetic Score        Evaluates the aesthetic quality of generated images
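As an example of how one of these metrics can be computed, the sketch below scores DINOv2 similarity with the public dinov2_vitb14 checkpoint; the preprocessing and the use of the CLS embedding are our assumptions, not necessarily XVerseBench's exact protocol.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Public DINOv2 backbone from torch.hub (assumed to match the paper's choice)
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def dino_similarity(ref_path: str, gen_path: str) -> float:
    """Cosine similarity between DINOv2 CLS embeddings of two images."""
    feats = [model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
             for p in (ref_path, gen_path)]
    return F.cosine_similarity(feats[0], feats[1]).item()
```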

📈 Experimental Results

Comparison with Existing Methods

XVerse significantly outperforms existing methods on XVerseBench:

Comparison with Existing Methods

Figure 6: Qualitative comparison of XVerse with other methods on XVerseBench

Ablation Study Results

Figure 7: Demonstration of the effects of text flow modulation resampler and VAE-encoded image features


👤 Single-Subject Control Demonstrations

Precise Control of Individual Subject Identity and Attributes

XVerse demonstrates exceptional capability in controlling single-subject identity and semantic attributes. By leveraging the text flow modulation mechanism, our model preserves a subject's identity while attributes such as pose, clothing, and environmental context are freely varied:

Figure 9: XVerse's single-subject control capabilities, demonstrating precise identity preservation while manipulating various attributes such as pose, clothing, and environmental context.


👥 Multi-Subject Control Demonstrations

Consistent Control of Multiple Subjects in Complex Scenes

One of XVerse's most significant innovations is its ability to maintain consistency across multiple subjects in a single generated image:

Figure 10: XVerse's multi-subject control capabilities, showing consistent identity preservation across multiple subjects while maintaining natural interaction and scene coherence.

✨ Semantic Attributes Control Demonstrations

Fine-Grained Control of Lighting, Pose, and Style

Beyond subject identity control, XVerse excels in manipulating semantic attributes such as lighting, pose, and style. This capability enables unprecedented creative control:

Semantic Attributes Control Example

Figure 11: XVerse's semantic attributes control capabilities, demonstrating fine-grained manipulation of lighting, pose, and style.


📄 Citation

If you use XVerse in your research, please cite our paper:


@article{chen2025xverse,
  title={XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation},
  author={Chen, Bowen and Zhao, Mengyi and Sun, Haomiao and Chen, Li and Wang, Xu and Du, Kang and Wu, Xinglong},
  journal={arXiv preprint arXiv:2506.21416},
  year={2025}
}

🙏 Acknowledgements

We thank all researchers and engineers who contributed to this project. This research was supported by the ByteDance Intelligent Creation Team.