Elucidating the Design Space of
Multimodal Protein Language Models

1ByteDance Research    2Nanjing University    3Rutgers University   
Design choices are essential: Our designs enable the 650M multimodal PLM to
outperform 3B-scale baselines and specialized structure folding models.

Abstract

Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations in structural modeling. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements enable finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, outperforming 3B baselines and performing on par with specialized folding models.


Design Space Overview

We systematically explore the key pitfalls and the design space of token-based multimodal protein language models to address their limitations in structural modeling. Building upon DPLM-2, we advance designs spanning improved generative modeling, structure-aware architectures, representation learning, and data exploration. We provide a taxonomy table of our design methods in the paper.

Method overview. Our design space spans improved generative modeling, structure-aware architectures, representation learning, and data exploration.

Major Contributions and Findings

Our main contributions are summarized as follows:

  • We conduct a comprehensive study revealing key pitfalls in structure token-based multimodal protein language models, and systematically elucidate their design space for robust structural modeling.
  • Utilizing improved approaches such as bit-wise discrete modeling offers finer-grained supervision, significantly improving structure generative capability.
  • Introducing representation-level learning and architectural innovations infuses geometric inductive biases and effectively refines generation diversity.
  • We find that multimer and monomer modeling are deeply interconnected, and leveraging multimer data advances structural modeling for both single- and multi-chain proteins.
  • Our design methods allow multimodal PLMs to achieve robust structural understanding, improving the folding RMSD from 5.52 to 2.36 on the PDB date-split test set and outperforming 3B folding baselines with only 650M parameters.

Advancing the Multimodal PLM

Pitfalls of Modeling over Tokenized Structures

We highlight key (O)bservations on the limitations of structure tokenization along with their implications as follows:

(O1): Structure tokenization results in information loss.
Table 1: Effects of feature quantization on structure tokenizer reconstruction.
Latent feature            Struct token type    Reconstruction RMSD ↓    Reconstruction TMscore ↑
z_cont (pre-quantized)    continuous token     1.3127                   0.9733
z_quant (quantized)       discrete token       1.9806                   0.9385

🧪 Analysis. Applying token quantization significantly amplifies reconstruction errors (RMSD 1.31 → 1.98, TMscore 0.97 → 0.93), inevitably resulting in a loss of fidelity and hence of detailed structural accuracy.

📌 Implications: Learning to recover the lost residuals could enhance structure prediction accuracy.
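
To make the lost residual concrete, the following minimal PyTorch sketch shows VQ-style quantization in a structure tokenizer; the codebook size, latent dimension, and names are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

# Minimal sketch of VQ-style structure tokenization (sizes and names are illustrative).
# A continuous per-residue latent is snapped to its nearest codebook entry; the
# difference (residual) is exactly the fine-grained information the discrete token discards.

codebook = nn.Embedding(num_embeddings=8192, embedding_dim=384)  # hypothetical codebook

def quantize(z_cont: torch.Tensor):
    """z_cont: (L, D) continuous structure latents for a length-L protein."""
    dists = torch.cdist(z_cont, codebook.weight)   # Euclidean distance to each code, (L, K)
    index = dists.argmin(dim=-1)                   # discrete structure tokens, (L,)
    z_quant = codebook(index)                      # quantized latents, (L, D)
    residual = z_cont - z_quant                    # what tokenization loses
    return index, z_quant, residual

index, z_quant, residual = quantize(torch.randn(128, 384))
# A residual-prediction head (e.g., in the spirit of ResDiff below) would learn to
# recover `residual`, conditioned on the predicted tokens / quantized latents.
```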

(O2): High reconstruction accuracy does not guarantee better structure generative performance in language models.
Table 2: Tokenizer reconstruction vs. language model generation. Evaluation of folding on CAMEO 2022.
Tokenizer    Reconstruction rRMSD ↓    Reconstruction rTMscore ↑    Generation RMSD ↓    Generation TMscore ↑
DPLM-2       1.9806                    0.9385                       7.7025               0.7936
ESM3         0.7248                    0.9912                       8.4424               0.7924

🧪 Analysis. We select tokenizers from DPLM-2 and ESM3 and train separate DPLM-2 variants using their respective structure token codebooks. Although the ESM3 tokenizer achieves superior reconstruction accuracy (RMSD: 0.72, TMscore: 0.99), the model trained with the DPLM-2 tokenizer's codebook exhibits stronger protein folding performance.

📌 Implications: Improvements in reconstruction do not necessarily translate into better generation, and hence greater emphasis should be placed on improving generative modeling and architectural design.

(O3): Index-based structure tokens? Multimodal PLM gets them miserably wrong in structure prediction.
Table 3: Language model structure token prediction accuracy. Index-based vs. bits-based evaluation on structure folding.
Model                   Testset           Index Acc ↑    Bit Acc ↑    RMSD ↓    TMscore ↑
DPLM-2 (index-based)    CAMEO 2022        0.0864         0.7720       7.7025    0.7936
DPLM-2 (index-based)    PDB date split    0.1188         0.7932       5.3071    0.8306
DPLM-2 (bit-based)      CAMEO 2022        0.1258         0.7958       6.4028    0.8380
DPLM-2 (bit-based)      PDB date split    0.2641         0.8648       3.2213    0.9043

🧪 Analysis. Direct index prediction of structure tokens is highly inaccurate (0.0864 accuracy on CAMEO 2022). The conventional learning process with index-based labels is highly challenging: since each index is derived from multiple quantized bits, even small changes at the bit level can result in drastically different indices, leading to suboptimal generation performance.

📌 Implications: While the model struggles to recover exact indices, it effectively captures structural patterns at the bit level.
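
To illustrate the bit-level view, the sketch below decomposes each codebook index into its binary code and supervises the model per bit; the bit width and loss form are illustrative assumptions, not necessarily our exact training recipe.

```python
import torch
import torch.nn.functional as F

NUM_BITS = 13  # e.g., an 8192-entry codebook corresponds to 2^13 indices (illustrative)

def index_to_bits(index: torch.Tensor) -> torch.Tensor:
    """Decompose integer structure-token indices (L,) into bit labels (L, NUM_BITS)."""
    shifts = torch.arange(NUM_BITS, device=index.device)
    return ((index.unsqueeze(-1) >> shifts) & 1).float()

def bitwise_loss(bit_logits: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    """Per-bit binary cross-entropy instead of a single K-way softmax over indices.

    bit_logits: (L, NUM_BITS) raw logits from the language model's structure head.
    """
    return F.binary_cross_entropy_with_logits(bit_logits, index_to_bits(index))

# A prediction that flips a single bit is penalized only for that bit, whereas
# index-level cross-entropy treats it as a completely different class.
```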

Improved Structure Prediction

In this section, we present several improvements on generative modeling aimed at enhancing the accuracy and detail of protein structure modeling. These approaches build upon the initial structure tokenization and aim to improve predictions by introducing methods for recovering tokenization losses (ResDiff), bridging discrete and continuous tokens (Bit-based), and enabling direct data-space modeling (Hybrid).
These designs significantly improve the structural generative performance of DPLM-2, even outperforming the 3B-scale baseline. We highlight these results below.

Table 4. Evaluation of improved approaches for structure prediction based upon DPLM-2.
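
As one way to picture the data-space modeling idea (and the flow matching referenced in the conclusion), the hedged sketch below trains a small velocity head with a conditional flow-matching objective over continuous structure latents; the architecture, dimensions, and conditioning are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VelocityHead(nn.Module):
    """Tiny velocity network for conditional flow matching (dimensions are illustrative)."""
    def __init__(self, dim: int = 384, cond_dim: int = 1280):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 512), nn.SiLU(), nn.Linear(512, dim)
        )

    def forward(self, x_t, cond, t):
        # x_t: (L, dim) noisy latents; cond: (L, cond_dim) PLM hidden states; t: (L, 1) time
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(head: VelocityHead, z1: torch.Tensor, cond: torch.Tensor):
    """z1: (L, dim) clean continuous structure latents to be recovered."""
    z0 = torch.randn_like(z1)               # noise sample
    t = torch.rand(z1.size(0), 1)           # sampled time in [0, 1)
    x_t = (1 - t) * z0 + t * z1             # linear interpolation path
    target_v = z1 - z0                      # target velocity along that path
    return ((head(x_t, cond, t) - target_v) ** 2).mean()
```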

Geometry-aware Architectures

We introduce geometric modules to capture higher-order relationships between residues beyond simple sequence-based architectures, which is essential for modeling the intricate nature of protein structures, as evidenced in protein folding. Our component-wise analysis reveals a balanced configuration that significantly improves both structural modeling and generation diversity without substantially reducing training efficiency, a common limitation of models with geometric layers.
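
One common way to realize such a geometric inductive bias, sketched below under illustrative assumptions rather than as our exact module, is to bias attention logits with an embedding of binned pairwise Cα distances.

```python
import torch
import torch.nn as nn

class PairBiasAttention(nn.Module):
    """Self-attention whose logits are biased by pairwise-distance features.

    A common way to inject geometric inductive bias into a sequence model;
    shapes and the distance binning are illustrative.
    """
    def __init__(self, dim=384, num_heads=8, num_dist_bins=16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.num_dist_bins = num_dist_bins
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.dist_bias = nn.Embedding(num_dist_bins, num_heads)

    def forward(self, x, ca_coords):
        # x: (L, dim) residue features; ca_coords: (L, 3) Cα coordinates
        L = x.shape[0]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(L, self.num_heads, self.head_dim).transpose(0, 1)  # (H, L, d)
        k = k.view(L, self.num_heads, self.head_dim).transpose(0, 1)
        v = v.view(L, self.num_heads, self.head_dim).transpose(0, 1)

        # Bin pairwise Cα distances (2 Å bins, illustrative) into a per-head additive bias.
        dist = torch.cdist(ca_coords, ca_coords)                      # (L, L)
        bins = (dist / 2.0).long().clamp(max=self.num_dist_bins - 1)
        bias = self.dist_bias(bins).permute(2, 0, 1)                  # (H, L, L)

        attn = (q @ k.transpose(-1, -2)) / self.head_dim ** 0.5 + bias
        out = attn.softmax(dim=-1) @ v                                # (H, L, d)
        return self.out(out.transpose(0, 1).reshape(L, -1))
```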

Training efficiency of geometric designs. Component-wise training efficiency analysis allows us to arrive at a balanced configuration that improves structural modeling and generation diversity without substantially reducing training efficiency.

Structure-aware Representation Alignment

We adopt representation alignment (REPA) to enhance structure generation by addressing two core challenges: the limitations of discrete tokens and the difficulty diffusion models face in learning high-quality representations. Unlike sharp discrete supervision, REPA enables smooth, high-dimensional learning that preserves subtle structural details. To transfer meaningful structural semantics, we align representations from the protein language model with those from a specialized folding model; we use ESMFold for its efficient inference, though other models such as AlphaFold are compatible. Although REPA was originally designed for vision, we demonstrate that it improves the structural diversity of generated proteins and is compatible with both sequence- and geometry-based architectures.
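
The sketch below outlines a REPA-style auxiliary objective, assuming frozen, precomputed ESMFold trunk representations as alignment targets; the projector architecture, layer choice, and loss weighting are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class RepaProjector(nn.Module):
    """Project PLM hidden states into the folding model's representation space."""
    def __init__(self, plm_dim=1280, target_dim=384):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(plm_dim, plm_dim), nn.SiLU(), nn.Linear(plm_dim, target_dim)
        )

    def forward(self, h):
        return self.proj(h)

def repa_loss(plm_hidden, folding_repr, projector):
    """Negative per-residue cosine similarity between projected PLM states and
    frozen folding-model features (e.g., precomputed ESMFold trunk outputs).

    plm_hidden:   (L, plm_dim) hidden states from an intermediate PLM layer
    folding_repr: (L, target_dim) detached target representations
    """
    pred = projector(plm_hidden)
    return -F.cosine_similarity(pred, folding_repr.detach(), dim=-1).mean()

# Used as an auxiliary term: total_loss = generation_loss + lambda_repa * repa_loss(...)
```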

Table 6. Representation alignment improves folding prediction and is compatible with both language model-based architectures and geometric design.
Representation alignment improves structure prediction.
Effects on generation diversity. Representation alignment significantly improves the low generation diversity of the multimodal PLM.

Analysis of Orthogonality

Building on the individual analysis of each design method, we examine the interactions of these designs by combining them in a unified setting. This analysis completes our blueprint, enabling us to recommend a final configuration and discuss the orthogonality between each design.

Table 7. Analysis of orthogonality. We analyze the compatibility of design methods when combined, with the recommended setting highlighted.

Multimer Data Exploration

Multimer (multi-chain protein) data presents diverse structural arrangements and interaction scenarios, which are essential for developing a more general multimodal model. Notably, most existing protein language models have been trained solely on single-chain proteins (monomers). We conduct a series of analyses to examine the relevance and gap between monomer and multimer data. Our findings suggest that multimer and monomer data are deeply interconnected, and incorporating multimer data effectively improves structure folding for both multimer and monomer proteins.

Table 5: Fine-tuning with multimer and monomer data. We evaluate the effects of fine-tuning with PDB-Multimer and Swissprot on structure prediction. Incorporating multimer data improves both monomer and multimer folding.
SFT data                          Folding: PDB-Multimer        Folding: CAMEO 2022
PDB-Multimer    Swissprot         RMSD ↓      TMscore ↑        RMSD ↓      TMscore ↑
                                  17.966      0.771            7.703       0.793
                                  19.615      0.799            6.612       0.823
                                  16.146      0.775            10.989      0.686
                                  16.674      0.798            6.410       0.831
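
For context, the hedged sketch below shows one common way to present multi-chain complexes to a single-sequence model (not necessarily our exact preprocessing): chains are concatenated with a chain-break token, and residue indices are offset between chains so the model can distinguish them.

```python
from typing import List

# Hypothetical special token and index gap; the paper's exact preprocessing may differ.
CHAIN_BREAK_TOKEN = "<chain_break>"
RESIDUE_INDEX_GAP = 200  # large positional offset inserted between chains

def pack_multimer(chains: List[str]):
    """Flatten a multi-chain complex into one token sequence plus residue indices."""
    tokens, residue_index = [], []
    offset = 0
    for i, chain in enumerate(chains):
        if i > 0:
            tokens.append(CHAIN_BREAK_TOKEN)
            residue_index.append(-1)          # placeholder index for the break token
            offset += RESIDUE_INDEX_GAP
        for j, aa in enumerate(chain):
            tokens.append(aa)
            residue_index.append(offset + j)
        offset += len(chain)
    return tokens, residue_index

tokens, residue_index = pack_multimer(["MKTAYIAK", "GSSHHHHH"])
```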

Conclusion

In this work, we identify the limitations of multimodal protein language models in structural modeling and propose an effective design space to bridge the gap. We demonstrate that the quantization loss from tokenization can be effectively mitigated with bit-label supervision and flow matching, which significantly improve structure prediction accuracy. We introduce geometric inductive biases through architectural design and leverage representation learning to refine generation diversity. Building on the strengths of each component, we further investigate their orthogonality, which informs the final recommended setting. Lastly, to tackle the scarcity of structure data, we expand the data coverage to include multimers, ensuring broader 3D structural understanding. Our results show that these effective designs allow multimodal models to achieve on-par or even superior folding accuracy compared to larger, specialized folding models. We believe this work will contribute to advancing the development of more effective multimodal protein language models.

BibTeX


      @article{hsieh2025dplm2-1,
          title={Elucidating the Design Space of Multimodal Protein Language Models},
          author={Cheng-Yen Hsieh and Xinyou Wang and Daiheng Zhang and Dongyu Xue and Fei Ye and Shujian Huang and Zaixiang Zheng and Quanquan Gu},
          journal={arXiv preprint arXiv:2504.11454},
          year={2025},
          url={https://arxiv.org/abs/2504.11454},
      }