Figure: current multimodal PLMs exhibit strong sequence modeling but limited structural modeling.
Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, their reliance on tokenizing 3D structures into discrete tokens causes a substantial loss of fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations in structural modeling. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements move toward finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. These effective designs dramatically improve structure generation diversity and, notably, the folding ability of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, outperforming even 3B baselines and performing on par with specialized folding models.
We systematically explore the key pitfalls and the design space of token-based multimodal protein language models to overcome their limitations in structural modeling. Building upon DPLM-2, we advance designs spanning improved generative modeling, structure-aware architectures, representation learning, and data exploration. We provide a taxonomy table of our design methods in the paper.
Our main contributions are summarized as follows:
We highlight key (O)bservations on the limitations of structure tokenization along with their implications as follows:
| Latent feature | Struct token type | Reconstruction RMSD ↓ | Reconstruction TMscore ↑ |
|---|---|---|---|
| z_cont | continuous token (pre-quantized) | 1.3127 | 0.9733 |
| z_index ↔ z_quant | discrete token (quantized) | 1.9806 | 0.9385 |
🧪 Analysis. Applying token quantization significantly amplifies reconstruction errors (RMSD rises from 1.31 to 1.98 and TMscore drops from 0.97 to 0.94).
📌 Implications: Learning to recover the lost residuals could enhance structure prediction accuracy.
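To make the lost quantity concrete, here is a minimal PyTorch sketch assuming a generic nearest-neighbor VQ-style structure tokenizer (the function and variable names are illustrative, not DPLM-2's actual tokenizer API): the residual discarded by quantization is simply the gap between the continuous latent and its assigned codebook entry.

```python
import torch

def quantize_with_residual(z_cont: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor vector quantization of per-residue latents.

    z_cont:   (L, D) continuous encoder latents for a length-L protein.
    codebook: (K, D) learned structure-token embeddings.
    Returns the token indices, the quantized latents, and the residual
    that is discarded when only discrete tokens are kept.
    """
    dists = torch.cdist(z_cont, codebook) ** 2   # (L, K) squared distances
    indices = dists.argmin(dim=-1)               # (L,) discrete structure tokens
    z_quant = codebook[indices]                  # (L, D) quantized latents
    residual = z_cont - z_quant                  # (L, D) information lost to quantization
    return indices, z_quant, residual

# Toy usage: 128 residues, 384-dim latents, 8192-entry codebook (sizes are illustrative).
z_cont = torch.randn(128, 384)
codebook = torch.randn(8192, 384)
idx, z_quant, res = quantize_with_residual(z_cont, codebook)
```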
| Tokenizer | Reconstruction rRMSD ↓ | Reconstruction rTMscore ↑ | Generation RMSD ↓ | Generation TMscore ↑ |
|---|---|---|---|---|
| DPLM-2 | 1.9806 | 0.9385 | 7.7025 | 0.7936 |
| ESM3 | 0.7248 | 0.9912 | 8.4424 | 0.7924 |
🧪 Analysis. We select tokenizers from DPLM-2 and ESM3, training separate DPLM-2 variants using their respective structure token codebooks. Although the ESM3 tokenizer achieves superior reconstruction accuracy (rRMSD 0.72 vs. 1.98), it does not lead to better generation quality (RMSD 8.44 vs. 7.70).
📌 Implications: Improvements in reconstruction do not necessarily translate into better generation; hence, greater emphasis should be placed on improving generative modeling and architectural design.
| Model | Testset | Struct Token Acc (index) ↑ | Struct Token Acc (bit) ↑ | RMSD ↓ | TMscore ↑ |
|---|---|---|---|---|---|
| DPLM-2 (index-based) | CAMEO 2022 | 0.0864 | 0.7720 | 7.7025 | 0.7936 |
| DPLM-2 (index-based) | PDB date split | 0.1188 | 0.7932 | 5.3071 | 0.8306 |
| DPLM-2 (bit-based) | CAMEO 2022 | 0.1258 | 0.7958 | 6.4028 | 0.8380 |
| DPLM-2 (bit-based) | PDB date split | 0.2641 | 0.8648 | 3.2213 | 0.9043 |
🧪 Analysis. Direct index prediction of structure tokens is highly inaccurate (index accuracy of 0.0864 on CAMEO). The conventional learning process over index-based labels is highly challenging: since each index is derived from multiple quantized bits, even small changes at the bit level can result in drastically different indices, leading to suboptimal generation performance.
📌 Implications: While the model struggles to recover exact indices, it effectively captures structural patterns at the bit level.
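The bit-level alternative can be sketched as follows, assuming the structure tokens come from a binary (LFQ-style) code so that every index maps to a fixed-length bit string; the helper names below are illustrative, and a production loss would also handle masking and padding.

```python
import torch
import torch.nn.functional as F

def indices_to_bits(indices: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Decompose integer structure-token indices into their binary code.

    indices: (B, L) integer tokens in [0, 2**num_bits).
    Returns a float tensor of shape (B, L, num_bits) with entries in {0, 1}.
    """
    shifts = torch.arange(num_bits, device=indices.device)
    return ((indices.unsqueeze(-1) >> shifts) & 1).float()

def bit_label_loss(bit_logits: torch.Tensor, target_indices: torch.Tensor) -> torch.Tensor:
    """Per-bit BCE supervision: a small bit error no longer counts as a
    completely wrong K-way index prediction."""
    num_bits = bit_logits.shape[-1]
    targets = indices_to_bits(target_indices, num_bits)
    return F.binary_cross_entropy_with_logits(bit_logits, targets)

# Toy usage: batch of 2 proteins, 128 residues, a 13-bit (8192-entry) codebook.
bit_logits = torch.randn(2, 128, 13)                # model's per-bit predictions
target_idx = torch.randint(0, 2 ** 13, (2, 128))
loss = bit_label_loss(bit_logits, target_idx)
```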
In this section, we present several improvements in generative modeling aimed at enhancing the accuracy and detail of protein structure modeling. These approaches build upon the initial structure tokenization and improve predictions by recovering tokenization losses (ResDiff), bridging discrete and continuous tokens (Bit-based), and enabling direct data-space modeling (Hybrid). These designs significantly improve the structural generative performance of DPLM-2, even outperforming the 3B-scale baseline. We highlight these results below.
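As a rough illustration of the residual-recovery idea behind ResDiff (a sketch under our own simplifying assumptions, not the exact formulation), a small head conditioned on the PLM's per-residue hidden states can be trained with flow matching to regress the continuous residual left behind by quantization; module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ResidualFlowHead(nn.Module):
    """Tiny conditional flow-matching head that learns to recover the
    continuous residual (z_cont - z_quant) discarded by the tokenizer,
    conditioned on the multimodal PLM's per-residue hidden states."""

    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + latent_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, L, D) noisy residual at time t; t: (B, 1, 1); cond: (B, L, H).
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def flow_matching_loss(head, residual, cond):
    """Linear-interpolation flow matching: regress the constant velocity x1 - x0."""
    x1 = residual                                   # target residual
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1                     # point on the straight path
    v_pred = head(x_t, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()

# Toy usage: 1280-dim hidden states (650M-scale PLM), 384-dim tokenizer latents.
head = ResidualFlowHead(hidden_dim=1280, latent_dim=384)
cond = torch.randn(2, 128, 1280)
residual = torch.randn(2, 128, 384)
loss = flow_matching_loss(head, residual, cond)
```

At inference time, one would integrate the learned velocity field from noise and add the recovered residual back onto the quantized latent before structure decoding.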
We introduce geometric modules to capture higher-order relationships between residues beyond simple sequence-based architectures, which is essential given the intricate nature of protein structures, as evidenced in protein folding. Our component-wise analysis reveals a balanced configuration that significantly improves both structural modeling and generation diversity without substantially reducing training efficiency, a common limitation of models with geometric layers.
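As a simplified stand-in for such a geometric module (not the exact layer used in our architecture), the sketch below biases self-attention with radial-basis features of pairwise Cα distances so that spatially close residues attend to each other more strongly; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class DistanceBiasedAttention(nn.Module):
    """Single-head self-attention with an additive bias computed from
    pairwise C-alpha distances (a simplified example of a geometric layer)."""

    def __init__(self, dim: int, num_rbf: int = 16, max_dist: float = 20.0):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.register_buffer("rbf_centers", torch.linspace(0.0, max_dist, num_rbf))
        self.bias_proj = nn.Linear(num_rbf, 1)
        self.scale = dim ** -0.5

    def forward(self, x, ca_coords):
        # x: (B, L, D) residue features; ca_coords: (B, L, 3) C-alpha coordinates.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.einsum("bid,bjd->bij", q, k) * self.scale            # (B, L, L)

        dists = torch.cdist(ca_coords, ca_coords)                         # (B, L, L)
        rbf = torch.exp(-(dists.unsqueeze(-1) - self.rbf_centers) ** 2)   # (B, L, L, R)
        attn = attn + self.bias_proj(rbf).squeeze(-1)                     # add geometric bias

        weights = attn.softmax(dim=-1)
        return self.out(torch.einsum("bij,bjd->bid", weights, v))

# Toy usage.
layer = DistanceBiasedAttention(dim=256)
x = torch.randn(2, 128, 256)
coords = torch.randn(2, 128, 3) * 10
y = layer(x, coords)
```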
We adopt representation alignment (REPA) to enhance structure generation by addressing two core challenges: the limitations of discrete tokens and the difficulty diffusion models face in learning high-quality representations. Unlike sharp discrete supervision, REPA enables smooth, high-dimensional learning that preserves subtle structural details. To transfer meaningful structural semantics, we align representations from the protein language model with those from a specialized folding model, using ESMFold for its efficient inference, though other models such as AlphaFold are compatible. Although REPA was originally designed for vision, we demonstrate that it improves the structural diversity of generated proteins and is compatible with both sequence- and geometry-based architectures.
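A minimal sketch of the alignment objective, following the general REPA recipe: an MLP projects the PLM's per-residue hidden states into the folding model's feature space, and a cosine-similarity loss pulls them together while the folding model (e.g., ESMFold) stays frozen. The exact layers, feature dimensions, and loss weighting below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaProjector(nn.Module):
    """MLP that maps PLM hidden states into the folding model's feature space."""

    def __init__(self, plm_dim: int, fold_dim: int, hidden: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(plm_dim, hidden), nn.SiLU(), nn.Linear(hidden, fold_dim))

    def forward(self, h):
        return self.mlp(h)

def repa_loss(plm_hidden, fold_repr, projector):
    """Negative cosine similarity between projected PLM states and frozen
    folding-model representations, averaged over residues."""
    proj = projector(plm_hidden)                         # (B, L, F)
    target = fold_repr.detach()                          # folding model stays frozen
    cos = F.cosine_similarity(proj, target, dim=-1)      # (B, L)
    return -cos.mean()

# Toy usage: 1280-dim PLM states aligned to 1024-dim ESMFold-style features.
projector = RepaProjector(plm_dim=1280, fold_dim=1024)
h = torch.randn(2, 128, 1280)
fold_feats = torch.randn(2, 128, 1024)
align_loss = repa_loss(h, fold_feats, projector)
# total_loss = generative_loss + lambda_repa * align_loss   (lambda_repa: weighting hyperparameter)
```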
Building on the individual analysis of each design method, we examine how these designs interact by combining them in a unified setting. This analysis completes our blueprint, enabling us to recommend a final configuration and discuss the orthogonality among the designs.
Multimer (multi-chain protein) data presents diverse structural arrangements and interaction scenarios, which are essential for developing a more general multimodal model. Notably, most existing protein language models have been trained solely on single-chain proteins (monomers). We conduct a series of analyses to examine the relevance and gap between monomer and multimer data. Our findings suggest that multimer and monomer data are deeply interconnected, and that incorporating multimer data effectively improves structure folding for both multimers and monomers.
| Training: PDB-Multimer | Training: Swissprot | SFT | PDB-Multimer RMSD ↓ | PDB-Multimer TMscore ↑ | CAMEO 2022 RMSD ↓ | CAMEO 2022 TMscore ↑ |
|---|---|---|---|---|---|---|
|  | ✓ |  | 17.966 | 0.771 | 7.703 | 0.793 |
|  | ✓ | ✓ | 19.615 | 0.799 | 6.612 | 0.823 |
| ✓ |  | ✓ | 16.146 | 0.775 | 10.989 | 0.686 |
| ✓ | ✓ | ✓ | 16.674 | 0.798 | 6.410 | 0.831 |
In this work, we identify the limitations in structural modeling of multimodal protein language models and propose an effective design space to bridge the gap. We demonstrate that the tokenizer's quantization loss can be effectively mitigated with bit-label supervision and flow matching, which significantly improve structure prediction accuracy. We introduce geometric inductive biases through architectural design and leverage representation learning to refine generation diversity. Building on the strengths of each component, we further investigate their orthogonality, which informs the final recommended setting. Lastly, to tackle the scarcity of structure data, we expand the data coverage to include multimers, ensuring broader 3D structural understanding. Our results show that these effective designs allow multimodal models to achieve folding accuracy on par with, or even superior to, larger, specialized folding models. We believe this work will contribute to advancing the development of more effective multimodal protein language models.
@article{hsieh2025dplm2-1,
title={Elucidating the Design Space of Multimodal Protein Language Models},
author={Cheng-Yen Hsieh and Xinyou Wang and Daiheng Zhang and Dongyu Xue and Fei Ye and Shujian Huang and Zaixiang Zheng and Quanquan Gu},
journal={arXiv preprint arXiv:2504.11454},
year={2025},
url={https://arxiv.org/abs/2504.11454},
}