We propose a simple yet effective zero-shot framework for subject-driven image generation using a vanilla Flux model. By framing the task as grid-based image completion and replicating the subject image(s) in a mosaic layout, we activate strong identity-preserving capabilities—without any additional data, training, or inference-time fine-tuning. This “free lunch” approach is further enhanced by a cascade attention design and meta prompting technique, boosting fidelity and versatility.
Experimental results show that our Latent Unfold outperforms baselines across multiple metrics in benchmarks and human preference studies, with some trade-offs. It supports diverse edits such as logo insertion, virtual try-on, and subject replacement or insertion, demonstrating that a pre-trained text-to-image model can deliver high-quality, resource-efficient customization for downstream applications.
Visual overview of our mosaic-based generation.
Our method, called Latent Unfold, constructs a mosaic-formatted $M \times N$ grid (e.g., $3\times3$) for zero-shot subject-driven generation. One panel is left blank as the target area, while the rest are filled with repeated subject images (as shown conceptually in the teaser and pipeline figures). This design enhances identity consistency.
The reference image is first encoded into a latent code $\mathbf{L}_{r}$. This code is then tiled into the mosaic latent $\mathbf{L}$, where the target panel (e.g., top-left) can be initialized with zeros: $$ \mathbf{L} \;=\; \begin{pmatrix} \mathbf{0} & \mathbf{L}_r & \cdots & \mathbf{L}_r \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{L}_r & \mathbf{L}_r & \cdots & \mathbf{L}_r \end{pmatrix} $$ This Latent Unfold approach is highly flexible, supporting both single-view and multi-view input subject images by incorporating different perspectives into the mosaic.
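In code, this tiling is a single tensor operation. Below is a minimal PyTorch sketch of the equation above; the helper name and the (C, H, W) latent layout are our assumptions:

import torch

def build_mosaic_latent(ref_latent: torch.Tensor, rows: int = 3, cols: int = 3,
                        target: tuple = (0, 0)) -> torch.Tensor:
    """Tile a reference latent L_r into a rows x cols mosaic with a blank target panel."""
    c, h, w = ref_latent.shape
    mosaic = ref_latent.repeat(1, rows, cols)  # fill every panel with L_r
    ti, tj = target                            # e.g. (0, 0) = top-left panel
    mosaic[:, ti * h:(ti + 1) * h, tj * w:(tj + 1) * w] = 0.0  # zero the target panel
    return mosaic

For multi-view inputs, individual panels can be assigned latents of different subject views rather than repeating a single L_r.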
Our goal is to edit the target panel within the mosaic while preserving the reference subject images in the other panels. We adapt a denoiser-based completion pipeline: the reference panels are held to the (noised) reference latent throughout denoising, so that only the target panel is actually synthesized. The key steps are sketched below.
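A minimal sketch of this completion loop, assuming RePaint-style masked latent blending at every denoising step; the denoiser interface below is an illustrative placeholder, not the exact Flux API:

import torch

def latent_unfold_completion(denoiser, mosaic_latent, target_mask, timesteps, prompt_emb):
    """Complete the blank target panel while pinning the reference panels.

    mosaic_latent: clean mosaic latent L with the target panel zeroed.
    target_mask:   1 inside the target panel, 0 elsewhere (broadcastable to L).
    """
    x = torch.randn_like(mosaic_latent)                   # start from pure noise
    for t in timesteps:                                   # T denoising steps
        x = denoiser.step(x, t, prompt_emb)               # one denoising update
        noisy_ref = denoiser.add_noise(mosaic_latent, t)  # references at noise level t
        # keep reference panels on the known trajectory; generate only the target
        x = target_mask * x + (1 - target_mask) * noisy_ref
    return x  # decode with the VAE to obtain the completed mosaic image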
After $T$ steps, decoding $\mathbf{x}_T$ yields the mosaic image with the target panel newly generated and other panels faithful to the subject reference. This process is summarized below:
Latent Unfold Pipeline
To further improve detail preservation and consistency, especially when the model needs to attend to features across multiple reference panels, we introduce Cascade Attention. This mechanism allows the model to capture subject information at multiple scales.
Attention visualization with and without Cascade Attention.
Note the improved detail on the toy's teeth and belly.
Video demonstrating the Cascade Attention mechanism.
We construct pooled (downsampled) versions of the queries ($\mathbf{Q}_i$) and keys ($\mathbf{K}_i$) at scales $i = 2, \dots, I$. The score maps computed from these pooled versions ($\mathbf{S}_i^P$) are upsampled back to full resolution ($\mathbf{S}_i^U$) and aggregated with the fine-scale score map $\mathbf{S}_1^U$ (which requires no pooling or upsampling), restricted to the entries where target-panel queries attend to reference-panel keys: $$ \mathbf{S} = \operatorname{softmax} \left( \mathbf{S}_1^U + \sum_{i=2}^{I} \mathbf{S}_i^U\left[Q_{\mathrm{tgt}},K_{\mathrm{ref}}\right] \right) $$ This coarse-to-fine feedback strengthens the subject identity representation and refines details, yielding sharper and more faithful generations at minimal added computational cost.
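The aggregation can be sketched in PyTorch as follows. This is a minimal illustration of the equation above: the square token grid, the average-pooling choice, and all variable names are our assumptions.

import torch
import torch.nn.functional as F

def cascade_attention_scores(q, k, tgt_idx, ref_idx, side, scales=(2,)):
    """Coarse-to-fine attention score aggregation (sketch).

    q, k:    (heads, n, d) image-token queries/keys from a side x side token grid.
    tgt_idx: LongTensor of flattened token indices in the target panel (Q_tgt).
    ref_idx: LongTensor of flattened token indices in the reference panels (K_ref).
    """
    h, n, d = q.shape
    scores = q @ k.transpose(-1, -2) / d ** 0.5            # fine-scale map S_1^U
    qg = q.transpose(1, 2).reshape(h, d, side, side)       # tokens back on 2-D grid
    kg = k.transpose(1, 2).reshape(h, d, side, side)
    for s in scales:                                       # pooled scales i = 2..I
        gs = side // s                                     # assumes side % s == 0
        q_p = F.avg_pool2d(qg, s).flatten(2).transpose(1, 2)  # pooled queries Q_i
        k_p = F.avg_pool2d(kg, s).flatten(2).transpose(1, 2)  # pooled keys K_i
        s_p = q_p @ k_p.transpose(-1, -2) / d ** 0.5          # pooled score map S_i^P
        # upsample S_i^P to the full n x n map S_i^U by block-repetition
        s_u = s_p.reshape(h, gs, gs, gs, gs)
        for dim in (1, 2, 3, 4):
            s_u = s_u.repeat_interleave(s, dim)
        s_u = s_u.reshape(h, n, n)
        # add coarse feedback only where target queries attend to reference keys
        scores[:, tgt_idx[:, None], ref_idx[None, :]] += s_u[:, tgt_idx[:, None], ref_idx[None, :]]
    return torch.softmax(scores, dim=-1)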
Effective prompting is key for T2I models. We employ Meta Prompting, using an MLLM (e.g., GPT-4o) to translate a user's intent and reference image(s) into detailed and effective prompts for the FLUX model. The MLLM takes a "meta-prompt" and the subject image as input and outputs one or more tailored prompts for generation.
When given an image, you make a mosaic image consisting of a 3x3 grid of sub-images showing the exact same subject. Describe each sub-image's appearance sequentially from top-left to bottom-right. Limit each description to 30 words. Describe details, especially unique appearance such as logos, colors, textures, shape, structure, and material, that can recreate the subject. Refrain from any speculation or guesswork.
Output Format Example:
{
  "row1": {
    "image1": "highlights the sneaker's white laces and textured sole, emphasizing its casual style.",
    "image2": "captures the sneaker's unique color combination and material texture from a slightly angled view.",
    "image3": "displays a close-up of the sneaker's mint green and lavender panels, focusing on the stitching details."
  },
  "row2": {
    "image1": "presents the sneaker's side, showing the yellow stripe and layered design elements.",
    "image2": "showcases the sneaker's rounded toe and smooth material finish, highlighting its modern aesthetic.",
    "image3": "features the sneaker's interior lining and padded collar, emphasizing comfort and design."
  },
  "row3": {
    "image1": "focuses on the sneaker's sole pattern and grip, showcasing its practical features.",
    "image2": "captures the sneaker's overall shape and color scheme, providing a comprehensive view.",
    "image3": "features the structure and texture of the subject."
  },
  "summary": "This set of full-frame photos captures an identical pastel-colored sneaker subject firmly positioned in the real scene, highlighting its unique design, color scheme, and material details from various perspectives (cinematic, epic, 4K, high quality)."
}
For the summary field: it should start with "This set of full-frame photos captures an identical xxx subject" and include the phrase "firmly positioned in the real scene".
This approach significantly enhances the model's ability to follow nuanced instructions and accurately reflect desired attributes in the generated image.
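As one possible implementation, the meta-prompt above can be sent to an MLLM together with the subject image. The sketch below uses the OpenAI Python client with GPT-4o; the helper name and model choice are illustrative assumptions:

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def meta_prompt_subject(image_path: str, meta_prompt: str) -> str:
    """Ask an MLLM (here GPT-4o) to expand a subject image into mosaic prompts."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": meta_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content  # JSON with per-panel descriptions + summary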
We conducted a comprehensive set of experiments to evaluate Latent Unfold. Our evaluation includes performance on standard benchmarks against state-of-the-art methods, human preference studies, detailed ablation analyses of our components, and qualitative demonstrations of various applications.
We compared Latent Unfold with methods categorized as 'Extra Params', 'Extra Data', and 'All-free' (our category). The results, presented in Table 1, show that despite requiring no extra training or specialized data, Latent Unfold achieves competitive performance and outperforms other 'All-free' methods, highlighting its efficiency and effectiveness.
Table 1: Comparison with state-of-the-art methods.
Qualitative example from the DreamBooth benchmark evaluation.
Recognizing the limitations of numerical metrics, we conducted a human preference study comparing Latent Unfold against OminiControl and Diptych (re-implemented). Across 1,500 responses, our method was favored for subject identity preservation and overall image quality, and performed comparably or better in text alignment, as detailed in Table 2. This suggests our mosaic conditioning and Cascade Attention effectively enhance perceptual quality.
Table 2: Human preference comparison.
To understand the impact of different components, we performed ablation studies (Table 3) investigating mosaic grid shape, background removal (BGRM), Cascade Attention, and Meta Prompting. Key findings include: the $3 \times 3$ grid performs well; background removal aids text alignment; Cascade Attention significantly improves identity preservation; and Meta Prompting further boosts identity, with a slight trade-off in text alignment.
Table 3: Ablation studies on method components.
Latent Unfold supports a variety of applications, demonstrating its versatility and robustness in preserving subject identity across diverse scenarios and prompts. The following examples showcase its capabilities in different tasks.
Stable Diffusion 3 can leverage the same $3 \times 3$ Latent Unfold mosaic (no extra data, no fine-tuning) and still effectively preserve subject identity.
@article{kang2025latentunfold,
  title={Flux Already Knows - Activating Subject-Driven Image Generation without Training},
  author={Kang, Hao and Fotiadis, Stathi and Jiang, Liming and Yan, Qing and Jia, Yumin and Liu, Zichuan and Chong, Min Jin and Lu, Xin},
  journal={arXiv preprint arXiv:2504.11478},
  year={2025},
}