fajie yuan (@duguyuan) | nitter

Pinned Tweet

fajie yuan @duguyuan

7 Jan 2025

We release our protein ChatGPT, Evolla! 🌟 chat-protein.com/ Evolla comes in two versions: 10B & 80B. The 80B model has a 1.3B SaProt encoder & a 70B LLaMA3 decoder. Trained on 546M protein question-text pairs with 150 billion word tokens! 💡🔬 biorxiv.org/content/10.1101/…

20

135

615

130,529

fajie yuan @duguyuan

24 Jan 2025

🧬 Design PETase with Pinal: 1️⃣ ChatGPT → Describe PETase 2️⃣ Pinal → Input text then run it 3️⃣ AlphaFold3 → Structure prediction 4️⃣ Evolla → Function validation 🚀 Simple yet powerful! Easy! Try：denovo-pinal.com/ #SyntheticBiology #ProteinDesign #AI4Science

14

89

388

63,918

fajie yuan @duguyuan

26 Nov 2024

🧬✨ Excited to share our online demo "Natural language → De Novo Protein design". Live demo: http://113.45.254.183:8888/ The demo version of Pinal is 1.2B. 🔬 You can try very detailed textual prompts of up to 500 words. biorxiv.org/content/10.1101/…

fajie yuan @duguyuan

2 Aug 2024

Toward De Novo Protein Design from Natural Language: Propose Pinal, a 2-stage generative framework, avoiding end-to-end text-protein generation. Design an optimal sampler to integrate both stages. Outperform ESM3 when prompting with text. #ProteinDesign biorxiv.org/content/10.1101/…

11

50

294

29,002

fajie yuan @duguyuan

2 Aug 2024

Toward De Novo Protein Design from Natural Language: Propose Pinal, a 2-stage generative framework, avoiding end-to-end text-protein generation. Design an optimal sampler to integrate both stages. Outperform ESM3 when prompting with text. #ProteinDesign biorxiv.org/content/10.1101/…

13

53

234

62,526

fajie yuan @duguyuan

2 Apr 2025

🚀 Update! Our latest Pinal bioRxiv preprint now includes wet lab results. More proteins from diverse text prompts are on the way. Design proteins with just text. Everyone can do protein design! Demo: denovo-pinal.com/ Paper: biorxiv.org/content/10.1101/… GitHub: github.com/westlake-repl/Den…

5

46

233

17,037

fajie yuan @duguyuan

19 Jul 2024

Recruited 12 bio students, no coding exp, to use ColabSaprot for re-training, zero-shot mutation, & protein design. They matched AI experts w/o hyper-parameter tuning! With SaprotHub, any biologist can train protein models! @sokrypton @LTEnjoy biorxiv.org/content/10.1101/…

7

33

212

20,484

fajie yuan @duguyuan

15 Sep 2022

With @KevinKaichuang: our paper on deciphering #AlphaFold as a protein function predictor was accepted at #NeurIPS2022. Our first paper since we started working on biological AI last year. arxiv.org/abs/2206.06583 Some updates will be available later.

4

26

137

fajie yuan @duguyuan

3 Oct 2023

SaProt: Protein Language Modeling with Structure-aware Vocabulary. A 650M protein language model trained on 64 80GB A100 GPUs for 3 months. A good alternative to the ESM family. Thanks @sokrypton biorxiv.org/content/10.1101/…

3

32

143

51,235
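
A rough illustration of the "structure-aware vocabulary" idea in the post above: each residue becomes one token that pairs its amino-acid letter with a Foldseek 3Di structure letter (the "AA+3Di" alphabet named in the SaProt repository description). The sketch assumes an uppercase-AA plus lowercase-3Di convention; the helper name `to_sa_tokens`, the `#` placeholder for a hidden structure letter, and the toy strings are illustrative assumptions, not the authors' code.

```python
def to_sa_tokens(aa_seq, di_seq, mask_structure_at=()):
    """Pair each amino acid with its 3Di letter to form structure-aware tokens.

    aa_seq: amino-acid string, one letter per residue.
    di_seq: 3Di string of the same length (assumed to come from Foldseek).
    mask_structure_at: residue indices whose structure letter is hidden with '#'
                       (hypothetical convention for this sketch).
    """
    if len(aa_seq) != len(di_seq):
        raise ValueError("amino-acid and 3Di strings must have equal length")
    hidden = set(mask_structure_at)
    tokens = []
    for i, (aa, di) in enumerate(zip(aa_seq, di_seq)):
        struct = "#" if i in hidden else di.lower()
        tokens.append(aa.upper() + struct)
    return tokens


# Toy input: four residues, with the structure letter of residue 1 hidden.
tokens = to_sa_tokens("MEVQ", "DVPP", mask_structure_at=[1])
print("".join(tokens))
```

The point of the design is that a single token carries both sequence and local structure, so a standard masked-language-model objective can be applied without a separate structure encoder.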

fajie yuan @duguyuan

2 Jan 2025

🧬 ProTrek Major Update: Added: 2B+ marine proteins from GOPC Total: 2.25B+ searchable proteins Natural language protein search powered by trimodal PLM 🔍 Try now: search-protrek.com/ Paper: biorxiv.org/content/10.1101/…

2

25

126

19,457

fajie yuan @duguyuan

9 Feb 2025

ProTrek Update 🚀: search-protrek.com/ We've just added 700M proteins from NCBI! Now ProTrek has 3B proteins from 7 major databases – 10x larger than UniProt. Generating embeddings for 3B proteins on a single A100 GPU takes 3–4 years 😱🔬 #Bioinformatics #Proteomics

fajie yuan @duguyuan

2 Jan 2025

🧬 ProTrek Major Update: Added: 2B+ marine proteins from GOPC Total: 2.25B+ searchable proteins Natural language protein search powered by trimodal PLM 🔍 Try now: search-protrek.com/ Paper: biorxiv.org/content/10.1101/…

2

28

121

12,262
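
As a back-of-the-envelope check on the "3B proteins on a single A100 takes 3–4 years" figure in the ProTrek update above, the implied embedding throughput can be computed directly. This is plain arithmetic on the numbers quoted in the post; the function name is illustrative, and real throughput would depend on sequence length and batch size, which the post does not state.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def proteins_per_second(n_proteins, years):
    """Average throughput needed to embed n_proteins within the given time."""
    return n_proteins / (years * SECONDS_PER_YEAR)


for years in (3, 4):
    rate = proteins_per_second(3_000_000_000, years)
    print(f"{years} years -> about {rate:.1f} proteins per second")
```

Under these assumptions the quoted range corresponds to roughly 24 to 32 proteins per second on one GPU.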

fajie yuan @duguyuan

18 Jan 2024

Our protein language model, SaProt, has been accepted at #ICLR2024 as a spotlight paper! A more biologist-friendly interface will be coming soon. Huge thanks to @sokrypton for all his help! Code: github.com/westlake-repl/SaP… Paper: biorxiv.org/content/10.1101/…

GitHub - westlake-repl/SaProt: Saprot: Protein Language Model with Structural Alphabet (AA+3Di)

3

27

116

9,419

fajie yuan @duguyuan

4 Dec 2024

🚀 Announcing ColabSaprot v2! 🧬 Train your own protein language models instantly - no ML or coding expertise required. Everyone can do it in a few minutes! 📺 Video Tutorial: [piped.video/watch?v=nmLtjlCI…] 💻 Try it: [colab.research.google.com/gi…] 📄Paper: [biorxiv.org/content/10.1101/…]

Sergey Ovchinnikov @sokrypton

28 May 2024

Now everyone can customize/share protein language models for their custom task/dataset via @GoogleColab 🤓 Paper: biorxiv.org/content/10.1101/… Colab: colab.research.google.com/dr… Credit: @LTEnjoy, Zhikai Li, @ChenchenHa42849, @BonnieSwt, Junjie Shan, @XibinBayesZhou, Dacheng Ma, @duguyuan

5

23

116

12,422

fajie yuan @duguyuan

3 Oct 2023

A novel protein language model has outperformed Meta's ESM models in 10 protein function tasks. Huge thanks to @sokrypton for his invaluable help and contribution. Sergey deserves authorship recognition. biorxiv.org/content/10.1101/…

SaProt: Protein Language Modeling with Structure-aware Vocabulary

Large-scale protein language models (PLMs), such as the ESM family, have achieved remarkable performance in various downstream tasks related to protein structure and function by undergoing unsuperv...

3

18

106

13,802

fajie yuan @duguyuan

7 Nov 2024

Excited to share our AI+cryo-EM work! 🧬 🔬 Cryo-IEF: Foundation model trained on 65M particles 🤖 CryoWizard: automated structure pipeline 🎯 Making cryo-EM accessible to more labs Preprint: biorxiv.org/content/10.1101/… Code: github.com/westlake-repl/Cry… #CryoEM #AI #StructuralBiology

5

29

105

13,494

fajie yuan @duguyuan

21 Oct 2024

Update: We've added the OMG database (200M new proteins) to ProTrek!🔎 You can now use ProTrek to search for new proteins that match your research needs. 🌐New link: search-protrek.com/ ⏳If many people are searching at the same time, you may experience waits. @LTEnjoy

fajie yuan @duguyuan

17 Sep 2024

🚀 New Update. The latest version of ProTrek is now available on bioRxiv. 🧬 📑 Read it here: biorxiv.org/content/10.1101/… • Service: huggingface.co/spaces/westla… • Try it on Colab: colab.research.google.com/dr…

2

18

95

12,329

fajie yuan @duguyuan

10 Jan 2025

🚀 Update on Pinal (Natural Language➡️De novo Proteins) Model weights + demo released! 🔗 Demo: denovo-pinal.com/ 🌟 16B parameters, trained on 1.7B text-protein pairs. 📈 Scaling data + model size = 🔥 results! Impressive scaling! 📄 Paper: biorxiv.org/content/10.1101/…

6

18

88

16,977

fajie yuan @duguyuan

4 Jun 2024

Introducing ProTrek, a 3-modal PLM for protein seq, struc, and func: ✨ Trained on 40M protein-text pairs, 100x larger than ProteinCLIP, ProtST, ProteinCLAP 🚀 30x/60x better accuracy than ProtST, ProteinCLAP ⚡ 100x faster than Foldseek, MMseqs2 for similar function searches

fajie yuan @duguyuan

4 Jun 2024

Excited to share ProTrek, a fast & accurate protein search tool! 30x/60x better seq-func/func-seq retrieval. 100x faster than Foldseek & MMseqs2. 9 tasks: seq-struc, seq-func, struc-func, etc. Beats ESM2 in 9/11 tasks. Thanks to @sokrypton @WChentong biorxiv.org/content/10.1101/…

7

29

81

11,001

fajie yuan @duguyuan

10 Jan 2025

I tried another protein design case study: 1) Generated an immunoglobulin protein via Pinal 2) Predicted the structure of the generated sequence using AF3 3) Predicted its function using ProTrek and Evolla. It seems they are consistent? Not sure.

fajie yuan @duguyuan

10 Jan 2025

🚀 Update on Pinal (Natural Language➡️De novo Proteins) Model weights + demo released! 🔗 Demo: denovo-pinal.com/ 🌟 16B parameters, trained on 1.7B text-protein pairs. 📈 Scaling data + model size = 🔥 results! Impressive scaling! 📄 Paper: biorxiv.org/content/10.1101/…

3

13

86

17,009

fajie yuan @duguyuan

31 Oct 2023

We've just released the lightweight 35M version of SaProt for download! For comparison, we independently trained an ESM-2 35M model, achieving results highly similar to the official version developed by Meta. @ebetica @sokrypton See here: github.com/westlake-repl/SaP…

GitHub - westlake-repl/SaProt: Saprot: Protein Language Model with Structural Alphabet (AA+3Di)

fajie yuan @duguyuan

3 Oct 2023

SaProt: Protein Language Modeling with Structure-aware Vocabulary. A 650M protein language model trained on 64 80GB A100 GPUs for 3 months. A good alternative to the ESM family. Thanks @sokrypton biorxiv.org/content/10.1101/…

1

15

73

14,430

fajie yuan @duguyuan

19 Oct 2022

Our final version (Evaluating Alphafold Evoformer for protein function prediction) is online with all related codes & datasets used in the paper. #NeurIPS2022 #AlphaFold #NeurIPS arxiv.org/pdf/2206.06583.pdf

fajie yuan @duguyuan

15 Sep 2022

With @KevinKaichuang: our paper on deciphering #AlphaFold as a protein function predictor was accepted at #NeurIPS2022. Our first paper since we started working on biological AI last year. arxiv.org/abs/2206.06583 Some updates will be available later.

1

11

70

fajie yuan @duguyuan

16 Jan 2025

Randomly chose four prompts used in 310.ai to design proteins using Pinal (denovo-pinal.com). We used AF3 for structure prediction.

fajie yuan @duguyuan

10 Jan 2025

I tried another protein design case study: 1) Generated an immunoglobulin protein via Pinal 2) Predicted the structure of the generated sequence using AF3 3) Predicted its function using ProTrek and Evolla. It seems they are consistent? Not sure.

1

8

69

6,254

fajie yuan @duguyuan

4 Jun 2024

Excited to share ProTrek, a fast & accurate protein search tool! 30x/60x better seq-func/func-seq retrieval. 100x faster than Foldseek & MMseqs2. 9 tasks: seq-struc, seq-func, struc-func, etc. Beats ESM2 in 9/11 tasks. Thanks to @sokrypton @WChentong biorxiv.org/content/10.1101/…

ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning

ProTrek, a tri-modal protein language model, enables contrastive learning of protein sequence, structure, and function (SSF). Through its natural language search interface, users can navigate the...

15

57

13,591
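
The retrieval idea behind a contrastively trained tri-modal model can be sketched as nearest-neighbour search in a shared embedding space: a text query and candidate proteins are embedded as vectors, and candidates are ranked by cosine similarity. The code below is a toy sketch with hand-written 3-dimensional vectors; the names `cosine`, `search`, and the `protein_*` entries are hypothetical and this is not ProTrek's API or its actual embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def search(query_vec, database, top_k=2):
    """Return the names of the top_k database entries most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in database.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy "protein embeddings" and a toy "text query embedding" (made-up numbers).
database = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.1, 0.9, 0.2],
    "protein_C": [0.7, 0.3, 0.1],
}
text_query = [1.0, 0.2, 0.0]
print(search(text_query, database))
```

Because all modalities share one space, the same ranking routine would serve text-to-protein, sequence-to-structure, or any other pairing; at billions of entries an approximate nearest-neighbour index would replace the linear scan.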

fajie yuan @duguyuan

4 Sep 2024

We've released ColabProTrek, the successor to ColabSaprot. 🔬 Try it out: colab.research.google.com/dr… 🆕 We've also expanded ProTrek's search capabilities with additional databases including UniRef50 and PDB. 🧬 Explore: huggingface.co/spaces/westla… Paper: biorxiv.org/content/10.1101/…

1

20

58

5,966

fajie yuan @duguyuan

26 Jun 2024

Love it, congrats! 🎉 Glad to see PLMs with structural tokens trending. Like SaProt & ProTrek, ESM3 finds struct vocab + mask loss more effective & scalable. 🚀 We've ablated this on SaProt. @alexrives @EvoscaleAI @proteinrosh, guys, a little nod to related work would make us happier! 😉🙌

Roshan Rao

@proteinrosh

25 Jun 2024

We have trained ESM3, a generative bidirectional masked language model that reasons over the sequence, structure, and function of proteins. ESM3 is trained at three model scales - 1.4B, 7B, and 98B.

3

5

54

6,471

fajie yuan @duguyuan

16 Feb 2025

🔥 Breaking: @FengyDai launches SaProt-T: Try: http://113.45.254.183:9527/ SaProt-T, key module of Pinal: design proteins by input: • Partial structure • Partial aa seq • text function One can use ProTrek to find desired structures→ then redesign seq using SaProt-T✨

fajie yuan @duguyuan

24 Jan 2025

🧬 Design PETase with Pinal: 1️⃣ ChatGPT → Describe PETase 2️⃣ Pinal → Input text then run it 3️⃣ AlphaFold3 → Structure prediction 4️⃣ Evolla → Function validation 🚀 Simple yet powerful! Easy! Try：denovo-pinal.com/ #SyntheticBiology #ProteinDesign #AI4Science

2

8

56

4,294

fajie yuan @duguyuan

17 Sep 2024

🚀 New Update. The latest version of ProTrek is now available on bioRxiv. 🧬 📑 Read it here: biorxiv.org/content/10.1101/… • Service: huggingface.co/spaces/westla… • Try it on Colab: colab.research.google.com/dr…

1

8

52

16,567

fajie yuan @duguyuan

6 Apr 2025

Our work using ESM-Ezy to mine novel multicopper oxidases: nature.com/articles/s41467-0… ESM-Ezy: a deep learning strategy for the mining of novel multicopper oxidases with superior properties with Qian hui, Yajie Wang, Yuxuan and Xibin.

11

54

2,447

fajie yuan @duguyuan

6 May 2024

My student Jin will present SaProt at #ICLR2024. We're thrilled to share that our SaProt model (checkpoint version from last October) achieved 1st place on the ProteinGym benchmark (github.com/OATML-Markslab/Pr…) last month. Happy to see some new PLMs with a structural alphabet.

[New Model] SaProt implementation by LTEnjoy · Pull Request #24 · OATML-Markslab/ProteinGym

Dear Pascal, I have uploaded the implementation of SaProt, including both 650M and 35M version. The result csv file is also contained in the baselines/saprot for check. To reproduce the result, ple...

Jin Su @LTEnjoy

5 May 2024

#ICLR2024 We'll be at Halle B #33 on 10 May 4:30 p.m. If you are interested in Protein Language Modeling, feel free to reach out! Hope we could have deep communications with all you guys!😆😆😆

3

8

55

9,857

fajie yuan @duguyuan

11 Jul 2024

Interesting work：a novel training strategy that boosts protein language model performance using minimal data. Authors evaluated ESM-2, ESM-1v and SaProt. SaProt gives impressive results. nature.com/articles/s41467-0…

Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning

Nature Communications - In this work, the authors proposed a few-shot learning approach that can efficiently optimize protein language models for fitness prediction. It combines the techniques of...

1

12

53

4,108

fajie yuan @duguyuan

18 Jan 2025

Just for fun, I tried Pinal, AF3, then Evolla. Surprise! Evolla said the designed protein is expressed in the venom gland of the organism Daboia siamensis, aka the Eastern Russell's viper & Daboia russelii siamensis. Pinal: denovo-pinal.com/ Evolla: chat-protein.com/

Nature Biotechnology

@NatureBiotech

16 Jan 2025

Deep learning methods aid in de novo design of proteins to neutralize lethal snake venom toxins in vitro and protect mice from a lethal neurotoxin challenge. nature.com/articles/s41586-0… #NBThighlight

2

13

52

7,791

fajie yuan @duguyuan

7 Nov 2024

New idea: (1) ML approaches try to fit all proteins, limiting accuracy on specific ones. 🔍 (2) Test-time training adapts models to target proteins on the fly! 🧬 TRAINING ON TEST PROTEINS IMPROVES FITNESS, STRUCTURE, AND FUNCTION PREDICTION arxiv.org/pdf/2411.02109

1

11

51

5,577

fajie yuan @duguyuan

29 May 2024

Exciting highlights: 1️⃣ Training is super easy: no ML or coding expertise needed! 2️⃣ Biologists can share models on our community store for others to use or retrain. 3️⃣ Join OPMC as a paper author! More contributions are welcome! FAQs: github.com/westlake-repl/Sap… @GoogleColab #OPMC

GitHub - westlake-repl/SaprotHub: Making Protein Language Modeling Accessible to All Biologists

Sergey Ovchinnikov @sokrypton

28 May 2024

Now everyone can customize/share protein language models for their custom task/dataset via @GoogleColab 🤓 Paper: biorxiv.org/content/10.1101/… Colab: colab.research.google.com/dr… Credit: @LTEnjoy, Zhikai Li, @ChenchenHa42849, @BonnieSwt, Junjie Shan, @XibinBayesZhou, Dacheng Ma, @duguyuan

1

16

43

7,145

fajie yuan @duguyuan

28 Aug 2024

Embeddings of ProTrek & ESM3 etc. were compared. While ProTrek excels in transfer learning, its true power emerges in search capabilities. Leveraging datasets 100x larger, ProTrek dramatically enhances text-protein & protein-text retrieval. Demo: huggingface.co/spaces/westla…

Leo Zang

@LeoTZ03

27 Aug 2024

Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks - Benchmark six tpLMs (OntoProtein, ProteinDT, ProtST, ProteinCLIP, ProTrek, ESM3) against ESM2-3B on six tasks (GB1, GFP, AAV, Location, Meltome, Stability) - No tpLM outperforms consistently, with ProTrek and OntoProtein ranking first 3 and 2 times - Concatenate average embeddings and search for the optimal embedding combination heuristically to achieve the best benchmark performance Preprint: biorxiv.org/content/10.1101/…

1

10

46

5,993

fajie yuan @duguyuan

21 Feb 2024

Our paper on engineering uracil-N-glycosylase using protein language model ESM is now published in Molecular Cell. @XibinBayesZhou Would love to know if replacing ESM with our Saprot (biorxiv.org/content/10.1101/…) would result in better performance. sciencedirect.com/science/ar…

SaProt: Protein Language Modeling with Structure-aware Vocabulary

Large-scale protein language models (PLMs), such as the ESM family, have achieved remarkable performance in various downstream tasks related to protein structure and function by undergoing unsuperv...

19

47

4,325

fajie yuan @duguyuan

18 Sep 2022

Our paper TenRec was accepted at #NeurIPS2022: a large-scale recommender system #Recsys dataset covering 10 recommender tasks, with 4 scenarios & 6 types of user feedback. We released all baseline code and will create a leaderboard for benchmarking RS advances. openreview.net/forum?id=PfuW…

1

8

49

fajie yuan @duguyuan

16 Dec 2024

🚀 SaprotHub Major Updates! • ColabSaprot-v2 released - easier than ever • 2 new wet lab validations added • Release Saprot 1.3B • New tools: ColabProTrek, ColabProtBerts & ColabMETL • New OPMC members 🔥 Train & share your PLMs - open for everyone! piped.video/watch?v=nmLtjlCI…

SaprotHub v2: a basic tutorial

Welcome to SaprotHub! 🚀 Biologists can now easily train & share the...

2

3

35

2,732

fajie yuan @duguyuan

10 May 2024

Just evaluated SaProt and ProstT5 on the protein inverse folding task. Surprisingly, SaProt is also good at the generation task, simply by masking its 3Di token. It is also 20x faster than ProteinMPNN. #ICLR2024 #iclr24

fajie yuan @duguyuan

6 May 2024

My student Jin will present SaProt at #ICLR2024. We're thrilled to share that our SaProt model (checkpoint version from last October) achieved 1st place on the ProteinGym benchmark (github.com/OATML-Markslab/Pr…) last month. Happy to see some new PLMs with a structural alphabet.

1

8

35

3,976

fajie yuan @duguyuan

3 Jul 2024

Zhikai uploaded a 6-min tutorial for SaprotHub! 🚀 Biologists can now easily train & share their protein language models. Join us, be a SaprotHub author! #Bioinformatics #ProteinModeling @LTEnjoy @sokrypton Paper: biorxiv.org/content/10.1101/… Video: piped.video/watch?v=r42z1hvY…

SaprotHub: Making Protein Modeling Accessible to All Biologists

Training and deploying deep learning models pose challenges for users without machine learning (ML) expertise. SaprotHub offers a user-friendly platform that democratizes the training, utilization,...

5

11

35

8,492

fajie yuan @duguyuan

8 Jul 2024

Great news: a wet lab submitted an EYFP fluorescence fitness model to SaprotHub with a Spearman ρ of 0.94, close to wet lab accuracy for double/triple-site mutations. Trained on 100K variants, it's a great 🔧 tool for biologists! @ProteinBoston @ml4proteins @sokrypton @LTEnjoy

fajie yuan @duguyuan

3 Jul 2024

Zhikai uploaded a 6-min tutorial for SaprotHub! 🚀 Biologists can now easily train & share their protein language models. Join us, be a SaprotHub author! #Bioinformatics #ProteinModeling @LTEnjoy @sokrypton Paper: biorxiv.org/content/10.1101/… Video: piped.video/watch?v=r42z1hvY…

2

5

33

5,741
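
For readers unfamiliar with the metric quoted in the EYFP post above, Spearman's ρ is the Pearson correlation computed on ranks, so it measures how well a model orders variants rather than how close its predicted values are. Below is a minimal pure-Python sketch; the `measured` and `predicted` values are made-up toy numbers, not data from the SaprotHub model.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy fitness values for five variants (illustrative only).
measured = [0.10, 0.35, 0.50, 0.80, 0.95]
predicted = [0.20, 0.15, 0.60, 0.55, 0.90]
print(round(spearman(measured, predicted), 3))
```

In practice a library routine such as `scipy.stats.spearmanr` would be the usual choice; the hand-rolled version is shown only to make the definition concrete.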

fajie yuan @duguyuan

25 Oct 2024

Cool！"It first encodes protein structures to be aligned using the 3Di+AA alphab." 3Di+AA tokens could become a new way to represent proteins in the future.

7

22

1,735

fajie yuan @duguyuan

5 Mar 2025

🔥 Our team is recruiting PhD students for 2025 🔥 2 PhD positions for international students at Westlake University, China! We build cutting-edge protein language models (SaProt, ProTrek, Evolla, Pinal) ⏰ Apply now - deadline soon! piped.video/watch?v=fTdRsA4M…

Welcome to Westlake University

Established in 2018, Westlake University is a new type of research ...

3

4

27

2,976

fajie yuan @duguyuan

9 Jul 2023

Will recommender systems still insist on using ID features? A fundamental question. arxiv.org/abs/2303.13835 Many papers have started to explore this question.

Where to Go Next for Recommender Systems? ID- vs. Modality-based...

Recommendation models that utilize unique identities (IDs) to represent distinct users and items have been state-of-the-art (SOTA) and dominated the recommender systems (RS) literature for over a...

3

5

23

1,554

fajie yuan @duguyuan

20 Feb 2025

🧬 Sharing recent wet lab results for ProTrek: search-protrek.com/ ! Our UDG validation shows remarkable success - all ProTrek-identified candidates from OMG database demonstrated effective T-editing, with our top hit outperforming existing published results. #ProTrek

1

1

20

1,453

fajie yuan @duguyuan

21 Mar 2025

Jin recently set up a Slack group for ColabSaprot discussions. Feel free to join here: westlakeai.slack.com/?redir=… We have recently received positive experimental results from over 10 wet labs by using ColabSaprot. Video Tutorial: piped.video/watch?v=nmLtjlCI…

Brian Naughton @btnaughton

18 Dec 2024

ColabSaprot is really very impressive... Fine-tune a state-of-the-art protein language model by just uploading a csv of proteins and values. colab.research.google.com/gi… Or download other people's models from huggingface.co/SaProtHub

1

6

18

2,373

fajie yuan @duguyuan

20 Jun 2022

AlphaFold's structural representation is also useful for predicting function, both for annotation prediction and fitness prediction. We ran experiments for 10 months on 20 A40/A100 GPUs. @KevinKaichuang @DeepMind #alphafold #AlphaFold arxiv.org/pdf/2206.06583.pdf

1

3

16

fajie yuan @duguyuan

20 Jan 2025

Deepseek (latest) as Protein Chat GPT？

1

2

15

1,566

fajie yuan @duguyuan

30 Sep 2023

We provided 4 huge datasets for recommender systems community (Everything is there!) #Recsys #sigir #wsdm #kdd arxiv.org/pdf/2309.15379.pdf… arxiv.org/pdf/2309.06789.pdf… openreview.net/forum?id=PfuW…… arxiv.org/abs/2309.15379

4

14

886

fajie yuan @duguyuan

7 May 2020

#sigir2020 Our full paper: Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation: arxiv.org/pdf/2001.04253.pdf @alexk_z Our findings: heavy TikTok/YouTube watching exposes personal info such as gender, age, job, and marital status. What is needed for privacy protection?

3

15

fajie yuan @duguyuan

31 Jul 2024

🚀 The Chang team at Westlake Uni used ColabSaprot to predict eTDG mutations with great results! 📢 16 prominent scientists have joined OPMC (see github.com/westlake-repl/Sap…). Saprot saw over 10,000 downloads last month on Hugging Face, with its 35M and 650M versions. 🧬 Join us!

GitHub - westlake-repl/SaprotHub: Making Protein Language Modeling Accessible to All Biologists

Jin Su @LTEnjoy

31 Jul 2024

Used SaprotHub to predict mutations for eTDG, a uracil-N-glycosylase variant. 🧬 Lab results: 17 out of top 20 mutations had higher T-to-G editing efficiency than wild type (marked as red), with 3 showing nearly 2x improvement! 🚀

3

15

1,524

fajie yuan @duguyuan

5 Dec 2024

Amazing work on ESM C, congrats @EvoscaleAI! 🌟 Scaling helps structure prediction (70% seq id)! 🚀 What about function prediction? 😔 Sadly, our 1.3B SaProt trained on AFDB shows minimal gains. Maybe 10B AF2 structures would scale better? 🤔 When can we have 10B AF2 structures?

Alex Rives

@alexrives

4 Dec 2024

Introducing ESM Cambrian. Unsupervised learning can invert biology at scale to reveal the hidden structure of the natural world. We’ve scaled up compute and data to train a new generation of protein language models. ESM C defines a new state of the art for protein representation learning.

15

1,988

fajie yuan @duguyuan

1 Jan 2023

Great work！Using a protein language model to discover antibiotic resistance genes (ARGs) and virulence factor genes with ultra-high accuracy. We also designed an adapter mechanism for sharing community efforts! @FengJu2020 @Westlake_Uni @Westlake_SOE @jyang1981

Westlake EMBLab (We-EMBLab)@FengJu2020

31 Dec 2022

Gratifying that the original idea of FunGeneTyper in 2017 is finally realized and online by 2022 thanks to joint efforts from our Westlake students and PIs 👍@Westlake_Uni @Westlake_SOE @duguyuan @jyang1981 . biorxiv.org/content/10.1101/…

3

13

1,018

fajie yuan @duguyuan

26 Jun 2024

like comment! @pranamanam Curious about scalability. In Saprot and SaprotHub, the structural token + mask LM loss scales well on AFDB. However, it's unclear if scaling to larger datasets will improve performance with a larger model. @alexrives @proteinrosh @THayes427

Pranam Chatterjee

@pranamanam

26 Jun 2024

Had a day to reflect on the release of ESM3, and just wanted to share a few thoughts (and a few shameless highlights of my lab's work! 😅). Before that, for the people who know our stuff, you know that I am an ESM evangelist: I think pLMs will be the future of protein design. 🪄 But it's super important for my lab to understand strengths and weaknesses! To the few points:
The Good:
- ESM3 uses progressive unmasking for generation. I know a lot of people are like, why not just do next-token? MLM is a way more natural, representative strategy of nature's evolutionary "generative" process, where mutations arise epistatically to confer higher fitness. We've found significant success ourselves with de novo binder generation via span MLM on ESM-2-650M latents (we didn't find the same success with GPT-like models). Check out our PepMLM model with @LeoTZ03: arxiv.org/abs/2310.03842
- Overall, you should not sleep on BERT-like models: they are great generators in many ways, and the same will probably be true for ESM3 (though GFP is probably not enough for validation). We've explored strategies with ESM-2 to perturb latent embeddings with Gaussian noise and decode back into de novo sequences for binder design (which work amazingly in the lab!). Check out our PepPrCLIP model with @bhat_suhaas and @kalyanmpalepu: biorxiv.org/content/10.1101/…
- With the largest models trained on 2.78 billion proteins on the MLM task, I have no doubt the model should have excellent unconditional generation/representation capabilities for prediction tasks. As academics, we're thankful that ESM3 will release these models for us to play around with (if we have the compute)!
The Not So Good:
- Look, I'm a sequence-only guy. I believe all of the useful information of protein properties should be contained in a good sequence representation. I am quite disappointed that ESM3 went with incorporating structure tokens. No doubt this will improve performance for a lot of representation/design tasks (look at SaProt from @duguyuan!) on structured proteins, but this will likely reduce our ability to model conformationally disordered proteins, i.e. transcription factors, which are the most important from a disease/regulatory perspective. My lab has gone in the opposite direction and regularly fine-tune sequence-only ESM models on more disordered sequences, like fusion oncoproteins, and get strong performance. Check out our FusOn-pLM model with @SophieVincoff: biorxiv.org/cgi/content/shor…
- What about other special tokens? PTMs, chemical modifications, etc. -- these could have been integrated in training as new tokens. We've described new ways to introduce PTM tokens into pLMs like ESM-2. Doing this for ESM3 will be fun (but potentially difficult with the size of the models)! Check out our PTM-Mamba paper with @pengzhangzhi1: biorxiv.org/content/10.1101/…
- Size, size, size. ESM-2-650M is BY FAR the best pLM that balances size and representation capacity. All of our papers (and pretty much every other paper I've read) find this model is optimal for de novo design and downstream prediction tasks, despite being the "medium-sized" ESM. Check out our SaLT&PepPr paper with @garykbrixi: nature.com/articles/s42003-0….
- For academic labs (pretty much the main ones who can use it), it's going to be tough to use the bigger models for optimization, even the open-sourced 1.4B model. Switching away from ESM-2-650M will be a mistake for most applications that don't involve unconditional generation. I hope the ESM3 team will do more ablation studies to prove the model's additional utility! 🥹
The Neutral:
Finally, ESM3 is available with a non-commercial, academic use-only license. I think this is absolutely the right move (similar to AlphaFold3) to protect EvolutionaryScale's commercial interests while still letting academics push the frontiers of research if ESM3 proves to be useful! However, for some of us that use ESM-like models to develop therapeutics, it will be hard for us to get ESM3-assisted designed molecules to market without commercialization capabilities. That's why I would still recommend continued usage of ESM-2-650M for most tasks -- it's such a good model! 😊 Would love to hear the ESM team's thoughts and would be very open to collaboration! 🌟 @alexrives @TomSercu @proteinrosh @denizzokt @ebetica @THayes427

1

14

3,276

fajie yuan @duguyuan

18 Jul 2024

Interesting—We've been using ProTrek to evaluate the matching relation between text and generated proteins, its matching score looks good. 😊 Try it out: huggingface.co/spaces/westla… （Calculate a matching score using ProTrek） paper: biorxiv.org/content/10.1101/…

kooshiar

@kooshiar

16 Jul 2024

First text2protein AI model, compressing billions of years of life. 800+ novel, functional and foldable proteins are discovered by researchers. Whitepaper and repo bit.ly/310paper

2

14

1,694

fajie yuan @duguyuan

9 Jan 2025

Xibin just released the 10B-version weights on our GitHub: github.com/westlake-repl/Evo… Fine-tuning example code coming soon! 🚀 The 80B version is in training and will be released after convergence.

GitHub - westlake-repl/Evolla: Evolla: A frontier protein-language generative model designed to...

Evolla: A frontier protein-language generative model designed to decode the molecular language of proteins. - westlake-repl/Evolla

fajie yuan @duguyuan

7 Jan 2025

We release our protein ChatGPT, Evolla! 🌟 chat-protein.com/ Evolla comes in two versions: 10B & 80B. The 80B model has a 1.3B SaProt encoder & a 70B LLaMA3 decoder. Trained on 546M protein question-text pairs with 150 billion word tokens! 💡🔬 biorxiv.org/content/10.1101/…

2

13

1,254

fajie yuan @duguyuan

20 Oct 2023

A paper accepted at #WSDM2024! It evaluates the use of "Adapter" for multimodal #Recsys models. The paper is titled 'Exploring Adapter-based Transfer Learning for Recommender Systems: Empirical Studies and Practical Insights'. Check out related papers here: github.com/westlake-repl/Rec…

GitHub - westlake-repl/Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review:...

Paper List of Pre-trained Foundation Recommender Models - westlake-repl/Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review

13

692

fajie yuan @duguyuan

27 Apr 2023

Our paper accepted at #SIGIR2023, "Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited", asks a crucial question for Recsys: will the prevailing ID embedding models remain dominant in the future? arxiv.org/pdf/2303.13835.pdf

1

8

757

fajie yuan @duguyuan

18 Nov 2024

Replying to @miangoar

Foldseek is definitely groundbreaking. Have you tried our ProTrek :) ? It finds proteins with similar functions using text/seq/structure inputs, even when their structures differ. Can be used to study convergent evolution. search-protrek.com/

ProTrek

search-protrek.com

1

1

12

1,022

fajie yuan @duguyuan

1 Jul 2022

How can we make a recommender system model general and transferable to various other systems, so as to realize "one model to serve all" like foundation models in NLP? See our recent work TransRec: arxiv.org/pdf/2206.06190.pdf

2

1

11

fajie yuan @duguyuan

15 Dec 2016

received our certificates

3

11

fajie yuan @duguyuan

4 Dec 2024

In the new version (to be released soon), we performed extensive wet-lab validations for ColabSaprot, using both zero-shot methods and supervised training approaches to engineer various proteins. The validated targets include TDG (a uracil-N-glycosylase (UNG) variant), xylanase, and vGFP.

4

10

1,367

fajie yuan @duguyuan

1 Dec 2022

A Large-scale Multipurpose Benchmark Dataset for Recommender Systems #NeurIPS @NeurIPSConf openreview.net/pdf?id=PfuW84… static.qblv.qq.com/qblv/h5/a…

1

11

fajie yuan @duguyuan

16 Jan 2025

Great job, @anthonygitter! ColabMETL is also a member of OPMC: theopmc.github.io/

Anthony Gitter @anthonygitter

18 Mar 2024

Our manuscript "Biophysics-based protein language models for protein engineering" with @romerolab1 is now on bioRxiv. We present Mutational Effect Transfer Learning (METL), a protein language model trained on biophysical simulations, and showcase it for protein engineering. 1/

Mutational Effect Transfer Learning (METL). (a) METL combines sparse experimental protein sequence-function data with dense biophysical simulation data to learn biophysics-informed sequence-function landscapes. (b) The pretraining phase involves generating millions of protein sequence variants and computing biophysical attributes for them with Rosetta, which are then used to pretrain a protein language model. The model is subsequently finetuned with experimental sequence-function data to predict protein properties such as binding, enzyme activity, thermostability, and expression. (c) The METL architecture consists of a transformer encoder with a structure-based relative position embedding. (d) METL-Local and METL-Global differ in the sequences included in the pretraining data. METL-Local trains on the local sequence space around a protein of interest, learning a representation specific to that protein. METL-Global trains on diverse sequences across protein fold space.


1

3

10

1,298

fajie yuan @duguyuan

20 Mar 2024

Utilizing Protein Language Models to Enhance the Development of Novel Base Editors by AI-Advance link.medium.com/xiFlLCbY6Hb

Utilizing Protein Language Models to Enhance the Development of Novel Base Editors


link.medium.com

2

10

372

fajie yuan @duguyuan

24 Jan 2025

Just follow the video tutorial. Generation with Pinal usually takes 1-2 minutes.

10

1,212

fajie yuan @duguyuan

17 Mar 2024

github.com/westlake-repl/Rec… Paper list of foundation models for #Recsys

GitHub - westlake-repl/Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review:...

Paper List of Pre-trained Foundation Recommender Models - westlake-repl/Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review

10

365

fajie yuan @duguyuan

27 Sep 2024

If scaling is not the right way, what is next for pLM? How about ESM-3? Is 100B necessary?

Sarath Chandar

@apsarathchandar

26 Sep 2024

🌟 Excited to announce AMPLIFY, our latest protein language model that challenges the scaling trend! While current models like ESM2 15B rely on billions of parameters, AMPLIFY achieves superior performance with only 350M parameters. 1/7

1

8

1,838

fajie yuan @duguyuan

26 Jun 2024

ProTrek is more like a retrieval model that learns protein sequence, structure, and function (SSF) within a unified architecture using both CE loss and masked language model loss. Checking out ESM3-generated proteins with ProTrek would be interesting. huggingface.co/spaces/westla…

1

7

599

fajie yuan @duguyuan

7 Jan 2025

What is most impressive about Evolla is that it shows comparable results to CLEAN in enzyme EC number prediction. CLEAN is a SOTA classification model trained on the enzyme EC number dataset, while Evolla is a purely generative model trained on diverse protein data. 1)

10

1,614

fajie yuan @duguyuan

4 Oct 2024

Interesting results. SaProt used AFDB structures for training, which, as I remember, indeed exclude viral proteins.

Anthony Gitter @anthonygitter

4 Oct 2024

Although I welcome more discussion of biosafety in AI, I see condensing safety into a single score as an oversimplification of the issues. 1/

ProEdit Figure 1


7

919

fajie yuan @duguyuan

19 Feb 2024

Exciting news! Our paper "NineRec" has been accepted by TPAMI. 🔹10 multi-modal recommendation datasets from 5 RS platforms, featuring text and images. 🔹Evaluate cross-domain recommender models with NineRec. #Recsys #WSDM Paper: arxiv.org/pdf/2309.07705.pdf Code: github.com/westlake-repl/Nin…

2

2

8

672

fajie yuan @duguyuan

23 Jan 2025

Like to see "Scaling"! Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT

Biology+AI Daily @BiologyAIDaily

22 Jan 2025

Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT
1. SPRINT is a novel vector-based method for drug-target interaction (DTI) prediction, offering unparalleled scalability and speed. It can screen the entire human proteome against a library of 6.7 billion compounds in just 16 minutes.
2. Unlike traditional structure-based approaches, SPRINT leverages structure-aware protein language models (PLMs) to create co-embedding spaces for drugs and protein targets. This allows accurate DTI prediction without explicit 3D modeling.
3. The platform achieved state-of-the-art performance on virtual screening benchmarks, DTI classification tasks, and binding affinity predictions. It offers residue-level interpretability through attention maps, aiding mechanistic insights.
4. SPRINT was validated through large-scale applications, including antimicrobial drug discovery and SARS-CoV-2 NSP13 helicase inhibitor identification, showcasing its utility in identifying diverse, high-quality molecular scaffolds.
5. Using a multi-head attention pooling strategy, SPRINT effectively captures sequence-dependent protein representations, overcoming limitations of previous pooling methods like averaging.
6. The method is highly efficient, using Chroma vector search to handle billions of molecules and proteomes. It reduces computational barriers, enabling pan-proteome DTI screens and virtual screening for drug repurposing.
7. SPRINT’s open-source framework supports modular protein and molecule encoder integration, enhancing compatibility with future PLM developments. It also demonstrated synergy with existing molecular fingerprint methods for property prediction tasks.
8. This approach democratizes virtual screening by delivering accurate, interpretable, and large-scale drug discovery capabilities at a fraction of the computational cost of traditional methods.
@david_koes @ericxing @monica_dayao @probablybots 💻Code: github.com/abhinadduri/pansp… 📜Paper: arxiv.org/abs/2411.15418 #DrugDiscovery #VirtualScreening #ProteinLanguageModels #ComputationalBiology

2

7

904

fajie yuan @duguyuan

12 Dec 2024

Replying to @pranamanam @amelie_iska

AA sequences theoretically may encode all information, but explicit features (e.g., structure) often enhance learning - similar to AF2's effective use of MSA. Even a 15B PLM cannot replace a very small Evoformer model. Not easy to do stuff with only AA sequence. :)

2

8

410

fajie yuan @duguyuan

11 Mar 2024

Pre-training and Transfer Learning in Recommender System by WestlakeRPLlab link.medium.com/Fj3RwSBASHb

Pre-training and Transfer Learning in Recommender System


link.medium.com

1

6

318

fajie yuan @duguyuan

27 Sep 2024

Replying to @apsarathchandar

Hi Sarath, excellent work! I completely agree with your point. Have you seen our Saprot paper? We also showed that the 35M version of Saprot is better than ESM2-15B. Take a look: biorxiv.org/content/10.1101/… openreview.net/pdf?id=6MRm3G…

1

7

503

fajie yuan @duguyuan

2 Aug 2024

Pinal has two components: T2struct for translating natural language to protein structure, and SaProt-T for sequence design conditioned on language and structure. Both use discrete Foldseek 3Di tokens for structure. Thanks @thesteinegger for such great work.

6

628

fajie yuan @duguyuan

2 Apr 2025

The enzyme designed by Pinal has been shown to exhibit functional activity.

1

1

7

643

fajie yuan @duguyuan

19 Jul 2024

For the EYFP task, biology researchers fine-tuned a peer-shared model from SaprotHub (huggingface.co/SaProtHub/Mod…), significantly outperforming AI researchers with limited data. For all other tasks, they used exactly the same dataset for evaluation.

SaProtHub/Model-EYFP_100K-650M · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

1

7

849

fajie yuan @duguyuan

19 Apr 2024

NineRec, a transferable RS #recsys dataset suite comprising a large-scale source domain dataset and nine diverse target domain recommendation datasets. Each item in NineRec has a descriptive text and a high-resolution cover image.

fajie yuan @duguyuan

19 Apr 2024

Multimodal Multi-domain Recommendation System DataSet and Benchmark link.medium.com/FlEGaXlgVIb Paper: arxiv.org/pdf/2309.07705.pdf Code: github.com/westlake-repl/Nin…

2

6

532

fajie yuan @duguyuan

9 Feb 2024

Great work! Congrats to @KevinKaichuang

Kevin K. Yang 楊凱筌 @KevinKaichuang

8 Feb 2024

We did 370 experiments to discover that protein language models primarily learn structure and won't scale for protein function prediction. We need new pretraining tasks! Work led by @francescazfl with @avapamini @yisongyue @alexijielu See Alex's thread + the paper for more!

7

774

fajie yuan @duguyuan

24 Jan 2025

Step 5: Wet lab. We welcome potential collaborators interested in testing various predictions in wet lab conditions! Just provide us with your evidence.

7

980

fajie yuan @duguyuan

28 Aug 2024

Glad to see this:

1

7

401

fajie yuan @duguyuan

26 Nov 2024

Model Architecture:
Stage 1: Text to Foldseek 3Di tokens - 1.2B parameters
Stage 2: Text + 3Di to amino acid sequence (Saprot variant) - 0.8B parameters
Total: 2B parameters
@thesteinegger Thanks Martin for your Foldseek.

1

5

620

fajie yuan @duguyuan

18 Dec 2024

Replying to @btnaughton @MartinMayta2

Thx Brian, we provide a toy dataset in the training interface and larger datasets in SaprotHub. For efficiency, SaprotHub processes PDB files into SA token sequences. Users with their own datasets can simply upload PDB structures - we handle it automatically. Following the hints should be enough.

4

169

fajie yuan @duguyuan

8 Jul 2024

Replying to @duguyuan @ProteinBoston @ml4proteins @sokrypton @LTEnjoy

Join us in submitting more pLMs to SaprotHub! Biologists can now train, use, share and co-build protein ML models without coding. ColabSaprot: colab.research.google.com/dr… SaprotHub: huggingface.co/SaProtHub Paper: biorxiv.org/content/10.1101/… Video: piped.video/watch?v=r42z1hvY…

Google Colab Notebook

Run, share, and edit Python notebooks

colab.research.google.com

1

6

412

fajie yuan @duguyuan

15 Jul 2021

Our code and datasets for #SIGIR2021 have been open-sourced: "One Person, One Model, One World: Learning Continual User Representation without Forgetting" and "StackRec: Efficient Training of Very Deep Sequential Recommender Models by Layer Stacking". Feel free to use them.

6

fajie yuan @duguyuan

19 Oct 2022

Our #NeurIPS2022 final version is ready with all related code and datasets. It includes 11 recommendation tasks with various baseline implementations. We hope it will be a reliable benchmark for new and experienced researchers in RS. #recsys2022 #SIGIR2022 #KDD2022

fajie yuan @duguyuan

18 Sep 2022

Our paper TenRec was accepted at #NeurIPS2022: a large-scale recommender system #Recsys dataset covering 10 recommendation tasks, with 4 scenarios & 6 types of user feedback. We released all baseline code and will create a leaderboard for benchmarking RS advances. openreview.net/forum?id=PfuW…

5

fajie yuan @duguyuan

21 Feb 2024

Highlights: - nCas9 with engineered UNGs enable transversion base editing without deamination - PLMs were used to predict enzymatic variant activities - Using the PLMs, an efficient T>S (G or C) base editor, TSBE3, was developed ...

fajie yuan @duguyuan

21 Feb 2024

Our paper on engineering uracil-N-glycosylase using protein language model ESM is now published in Molecular Cell. @XibinBayesZhou Would love to know if replacing ESM with our Saprot (biorxiv.org/content/10.1101/…) would result in better performance. sciencedirect.com/science/ar…

2

5

489

fajie yuan @duguyuan

9 Jan 2025

Replying to @KevinKaichuang

Here is the model checkpoint: github.com/westlake-repl/Evo…

GitHub - westlake-repl/Evolla: Evolla: A frontier protein-language generative model designed to...

Evolla: A frontier protein-language generative model designed to decode the molecular language of proteins. - westlake-repl/Evolla

6

253

fajie yuan @duguyuan

17 Nov 2022

Two nice papers using protein language models to construct MSAs, showing results competitive with HHblits and jackhmmer, but much faster: arxiv.org/pdf/2206.06583.pdf

Kevin K. Yang 楊凱筌 @KevinKaichuang

16 Nov 2022

Use protein language model representations to construct multiple sequence alignments. @clairemcwhite @ProfMonaSingh biorxiv.org/content/10.1101/…

1

6

fajie yuan @duguyuan

19 Jul 2023

Paper List for LLM4Rec, foundation recommendation models, multimodal pre-training and transfer learning for RS. github.com/westlake-repl/Rec…

GitHub - westlake-repl/Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review:...

Paper List of Pre-trained Foundation Recommender Models - westlake-repl/Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review

1

6

621

fajie yuan @duguyuan

9 Feb 2025

8 Ways to Search Proteins: seq2seq: Find sequences with similar function seq2text: Get function from sequence seq2struct ... struct2struct ... struct2text ... struct2seq text2seq: Find sequence by description text2struct: Search structure by function. Also predicts GO annotations and EC numbers.

5

550

fajie yuan @duguyuan

24 Jan 2025

Step 4: Check its function with Evolla: chat-protein.com/

1

7

1,051

fajie yuan @duguyuan

26 Nov 2024

If you don't know if the designed protein is relevant to your text, try our ProTrek: search-protrek.com/ Paper link: biorxiv.org/content/10.1101/…

1

6

745

fajie yuan @duguyuan

24 Jan 2025

Replying to @SynmitoYao

I do not know. Need wet lab

1

6

748

fajie yuan @duguyuan

2 Aug 2024

Pinal demonstrates impressive performance when evaluated using GT-TMscore and ProTrek CLIP score, outperforming ESM-3 on dry-experiment metrics when prompted with keywords. We plan to validate these results with wet experiments.

2

5

707

fajie yuan @duguyuan

1 Oct 2023

NineRec: A Benchmark Dataset Suite for Evaluating Transferable Recommendation arxiv.org/abs/2309.07705

3

308