Bio2Token: All-atom tokenization of any biomolecular structure with Mamba
@FlagshipPioneer
• This paper introduces “Bio2Token”, a method that tokenizes biomolecular structures at an all-atom level using Mamba. Unlike many current approaches that rely on coarse-grained residue-level representations, Bio2Token focuses on a more detailed atomic-level tokenization.
• The key innovation is the use of quantized auto-encoders that learn atom-level representations, reaching reconstruction errors of around 1 Ångström or below.
• Mamba, a state space model, plays a key role by providing efficient and scalable encoding, overcoming computational limitations of traditional transformer-based models. Bio2Token can handle structures up to 95,000 atoms, which is significantly larger than the limit for many transformer models.
• The approach achieves high accuracy while using fewer parameters and training resources than existing methods like AlphaFold-3 and ESM-3.
• Bio2Token demonstrates versatility by tokenizing proteins, RNA, and small molecules, making it a flexible tool for biomolecular structure representation.
• The quantized auto-encoders (QAEs) efficiently transform 3D structures into 1D sequences of discrete tokens, allowing future integration with language models for biomolecular tasks.
• The authors present domain-specific tokenizers (mol2token, protein2token, RNA2token) and a combined tokenizer (bio2token) that generalizes across different types of biomolecules.
• Compared to ESM-3, Bio2Token achieves a lower reconstruction RMSE and superior performance across protein and RNA datasets, demonstrating its potential as a robust tool for accurate structural modeling.
• The combination of Mamba-based architecture and quantized auto-encoders provides a lightweight yet powerful solution, avoiding the quadratic computational cost seen in transformers.
• One limitation is that the chemical validity of reconstructed structures is not guaranteed: even small coordinate deviations can produce unrealistic bonding. Future directions include improving accuracy with more training data and adding post-processing steps that enforce chemical validity.
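To make "3D structure in, 1D discrete tokens out" and "reconstruction RMSE" a bit more concrete, here is a toy Python sketch. It is not the Bio2Token model: there is no Mamba encoder and nothing is learned. It only snaps each atom's coordinates to a fixed grid, packs the three grid indices into one integer token, and measures the round-trip error. The names `tokenize`, `detokenize`, `rmse`, `levels`, and `half_width` are made up for this illustration, and the RMSE here skips any structural alignment step.

```python
import math
import random


def tokenize(coords, levels=64, half_width=32.0):
    """Toy tokenizer: map each atom's (x, y, z) to one integer in [0, levels**3)."""
    n = len(coords)
    # Centre on the centroid so the grid is used symmetrically.
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    tokens = []
    for p in coords:
        digits = []
        for k in range(3):
            # Clip to the box, then snap to one of `levels` grid points per axis.
            c = max(-half_width, min(half_width, p[k] - centroid[k]))
            digits.append(round((c + half_width) / (2 * half_width) * (levels - 1)))
        # Mixed-radix packing: three per-axis digits become one token id.
        tokens.append(digits[0] * levels**2 + digits[1] * levels + digits[2])
    return tokens, centroid


def detokenize(tokens, centroid, levels=64, half_width=32.0):
    """Invert tokenize up to the rounding error of the grid."""
    coords = []
    for t in tokens:
        digits = (t // levels**2, (t // levels) % levels, t % levels)
        coords.append(tuple(
            d / (levels - 1) * 2 * half_width - half_width + centroid[k]
            for k, d in enumerate(digits)
        ))
    return coords


def rmse(a, b):
    """Root-mean-square error over atoms, using per-atom Euclidean distance."""
    total = sum(sum((x - y) ** 2 for x, y in zip(p, q)) for p, q in zip(a, b))
    return math.sqrt(total / len(a))


# Hypothetical "structure": 500 random atoms in a 40 Å box (not real data).
random.seed(0)
atoms = [tuple(random.uniform(-20.0, 20.0) for _ in range(3)) for _ in range(500)]
tokens, centroid = tokenize(atoms)
rebuilt = detokenize(tokens, centroid)
print(len(tokens), "tokens, round-trip RMSE (Å):", rmse(atoms, rebuilt))
```

With a fixed grid, the only way to lower the error is a bigger vocabulary (more `levels`). A learned quantized auto-encoder is meant to do better than that trade-off, which is the point of the bullets above.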
@oliviaviessmann
📜Paper:
arxiv.org/abs/2410.19110
#biomoleculardesign #proteinmodeling #machinelearning #stateSpaceModel #bioinformatics #Mamba #tokenization