Assistant Prof at Westlake University

Hangzhou, China
We release our protein chatGPT, Evolla! 🌟 chat-protein.com/ Evolla comes in two versions: 10B & 80B. The 80B model has a 1.3B Saprot encoder & a 70B LLaMA3 decoder. Trained on 546M protein question-answer pairs and 150 billion word tokens! 💡🔬 biorxiv.org/content/10.1101/…
20
135
615
130,529
🧬 Design PETase with Pinal: 1️⃣ ChatGPT → Describe PETase 2️⃣ Pinal → Input text then run it 3️⃣ AlphaFold3 → Structure prediction 4️⃣ Evolla → Function validation 🚀 Simple yet powerful! Easy! Try:denovo-pinal.com/ #SyntheticBiology #ProteinDesign #AI4Science
14
89
388
63,918
🧬✨ Excited to share our online demo, "Natural language → De Novo Protein design". Live demo: http://113.45.254.183:8888/ The demo version of Pinal is 1.2B. 🔬 You can try very detailed text prompts of up to 500 words. biorxiv.org/content/10.1101/…
Toward De Novo Protein Design from Natural Language: Propose Pinal, a 2-stage generative framework, avoiding end-to-end text-protein generation. Design an optimal sampler to integrate both stages. Outperform ESM3 when prompting with text. #ProteinDesign biorxiv.org/content/10.1101/…
11
50
294
29,002
Toward De Novo Protein Design from Natural Language: Propose Pinal, a 2-stage generative framework, avoiding end-to-end text-protein generation. Design an optimal sampler to integrate both stages. Outperform ESM3 when prompting with text. #ProteinDesign biorxiv.org/content/10.1101/…
13
53
234
62,526
🚀 Update! Our latest Pinal bioRxiv now includes wet lab results. More proteins with diverse text prompts are on the way. Design proteins with just text. Everyone can do protein design! Demo: denovo-pinal.com/ paper: biorxiv.org/content/10.1101/… GitHub: github.com/westlake-repl/Den…
5
46
233
17,037
Recruited 12 bio students, no coding exp, to use ColabSaprot for re-training, zero-shot mutation, & protein design. They matched AI experts w/o hyper-parameter tuning! With SaprotHub, any biologist can train protein models! @sokrypton @LTEnjoy biorxiv.org/content/10.1101/…
7
33
212
20,484
With @KevinKaichuang: our paper on deciphering #AlphaFold as a protein function predictor was accepted at #NeurIPS2022. It is our first paper since we started working on biological AI last year. arxiv.org/abs/2206.06583 Some updates will be available later.
4
26
137
SaProt: Protein Language Modeling with Structure-aware Vocabulary A 650M protein language model trained with 64 80G A100 for 3 months. A good alternative to the ESM family. thanks @sokrypton biorxiv.org/content/10.1101/…
3
32
143
51,235
🧬 ProTrek Major Update: Added: 2B+ marine proteins from GOPC Total: 2.25B+ searchable proteins Natural language protein search powered by trimodal PLM 🔍 Try now: search-protrek.com/ Paper: biorxiv.org/content/10.1101/…
2
25
126
19,457
ProTrek Update 🚀: search-protrek.com/ We've just added 700M proteins from NCBI! Now ProTrek has 3B proteins from 7 major databases – 10x larger than UniProt. Generating embeddings for 3B proteins on a single A100 GPU takes 3–4 years 😱🔬 #Bioinformatics #Proteomics
🧬 ProTrek Major Update: Added: 2B+ marine proteins from GOPC Total: 2.25B+ searchable proteins Natural language protein search powered by trimodal PLM 🔍 Try now: search-protrek.com/ Paper: biorxiv.org/content/10.1101/…
2
28
121
12,262
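A rough back-of-the-envelope check of the "3 to 4 years on a single A100" figure in the ProTrek update above. This sketch only turns the numbers quoted in the tweet into an implied throughput; the helper names and the 64-GPU example are hypothetical, and the linear-scaling assumption ignores I/O and scheduling overhead.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_throughput(num_proteins: float, years: float) -> float:
    """Proteins embedded per second if `num_proteins` take `years` on one GPU."""
    return num_proteins / (years * SECONDS_PER_YEAR)


def wall_clock_days(single_gpu_years: float, num_gpus: int) -> float:
    """Wall-clock days on `num_gpus` GPUs, assuming perfectly linear scaling
    (an optimistic assumption, not a measured result)."""
    return single_gpu_years * 365.25 / num_gpus


if __name__ == "__main__":
    # 3B proteins in 3 to 4 years implies roughly 24 to 32 proteins per second.
    print(f"{implied_throughput(3e9, 4):.1f} to {implied_throughput(3e9, 3):.1f} proteins/s")
    # Hypothetical 64-GPU cluster, using 3.5 years as the midpoint of the quoted range.
    print(f"{wall_clock_days(3.5, 64):.1f} days on 64 GPUs")
```

Under these assumptions the job drops from years to a few weeks once it is spread over a few dozen GPUs.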
Our protein language model, SaProt, has been accepted at #ICLR2024 as a spotlight paper! A more biologist-friendly interface will be coming soon. Huge thanks to @sokrypton for all the help! Code: github.com/westlake-repl/SaP… Paper: biorxiv.org/content/10.1101/…
3
27
116
9,419
🚀 Announcing ColabSaprot v2! 🧬 Train your own protein language models instantly - no ML or coding expertise required. Everyone can do it in a few minutes! 📺 Video Tutorial: [piped.video/watch?v=nmLtjlCI…] 💻 Try it: [colab.research.google.com/gi…] 📄Paper: [biorxiv.org/content/10.1101/…]
Now everyone can customize/share protein language models for their custom task/dataset via @GoogleColab 🤓 Paper: biorxiv.org/content/10.1101/… Colab: colab.research.google.com/dr… Credit: @LTEnjoy, Zhikai Li, @ChenchenHa42849, @BonnieSwt, Junjie Shan, @XibinBayesZhou, Dacheng Ma, @duguyuan
5
23
116
12,422
Excited to share our AI+cryo-EM work! 🧬 🔬 Cryo-IEF: Foundation model trained on 65M particles 🤖 CryoWizard: automated structure pipeline 🎯 Making cryo-EM accessible to more labs Preprint: biorxiv.org/content/10.1101/… Code: github.com/westlake-repl/Cry… #CryoEM #AI #StructuralBiology
5
29
105
13,494
Update: We've added the OMG database (200M new proteins) to ProTrek!🔎 You can now use ProTrek to search for new proteins that match your research needs. 🌐New link: search-protrek.com/ ⏳If many people are searching at the same time, you may experience waits. @LTEnjoy
🚀 New Update. The latest version of ProTrek is now available on bioRxiv. 🧬 📑 Read it here: biorxiv.org/content/10.1101/… • Service: huggingface.co/spaces/westla… • Try it on Colab: colab.research.google.com/dr…
2
18
95
12,329
🚀 Update on Pinal (Natural Language➡️De novo Proteins) Model weights + demo released! 🔗 Demo: denovo-pinal.com/ 🌟 16B parameters, trained on 1.7B text-protein pairs. 📈 Scaling data + model size = 🔥 results! Impressive scaling! 📄 Paper: biorxiv.org/content/10.1101/…
6
18
88
16,977
Introducing ProTrek, a 3-modal PLM for protein seq, struc, and func: ✨ Trained on 40M protein-text pairs, 100x larger than ProteinCLIP, ProtST, ProteinCLAP 🚀 30x/60x better accuracy than ProtST, ProteinCLAP ⚡ 100x faster than Foldseek, MMseqs2 for similar function searches
Excited to share ProTrek, a fast & accurate protein search tool! 30x/60x better seq-func/func-seq retrieval 100x faster than Foldseek & MMseqs2 9 tasks: seq-struc, seq-func, struc-func, etc. Beats ESM2 in 9/11 tasks Thanks to @sokrypton @WChentong biorxiv.org/content/10.1101/…
7
29
81
11,001
I tried another protein design case study: 1) Generate an immunoglobulin protein via Pinal 2) Predict the structure of the generated sequence using AF3 3) Predict its function using ProTrek and Evolla. The results seem consistent? Not sure.
🚀 Update on Pinal (Natural Language➡️De novo Proteins) Model weights + demo released! 🔗 Demo: denovo-pinal.com/ 🌟 16B parameters, trained on 1.7B text-protein pairs. 📈 Scaling data + model size = 🔥 results! Impressive scaling! 📄 Paper: biorxiv.org/content/10.1101/…
3
13
86
17,009
We've just released the lightweight 35M version of SaProt for download! For comparison, we independently trained an ESM-2 35M model, achieving results highly similar to the official version developed by Meta. @ebetica @sokrypton See here: github.com/westlake-repl/SaP…
SaProt: Protein Language Modeling with Structure-aware Vocabulary A 650M protein language model trained with 64 80G A100 for 3 months. A good alternative to the ESM family. thanks @sokrypton biorxiv.org/content/10.1101/…
1
15
73
14,430
Our final version (Evaluating Alphafold Evoformer for protein function prediction) is online with all related codes & datasets used in the paper. #NeurIPS2022 #AlphaFold #NeurIPS arxiv.org/pdf/2206.06583.pdf
With @KevinKaichuang: our paper on deciphering #AlphaFold as a protein function predictor was accepted at #NeurIPS2022. It is our first paper since we started working on biological AI last year. arxiv.org/abs/2206.06583 Some updates will be available later.
1
11
70
Randomly chose four prompts used by 310.ai to design proteins with Pinal (denovo-pinal.com). AF3 was used for structure prediction.
I tried another protein design case study: 1) Generate an immunoglobulin protein via Pinal 2) Predict the structure of the generated sequence using AF3 3) Predict its function using ProTrek and Evolla. The results seem consistent? Not sure.
1
8
69
6,254
We've released ColabProTrek, the successor to ColabSaprot. 🔬 Try it out: colab.research.google.com/dr… 🆕 We've also expanded ProTrek's search capabilities with additional databases including UniRef50 and PDB. 🧬 Explore: huggingface.co/spaces/westla… Paper: biorxiv.org/content/10.1101/…
1
20
58
5,966
Love it, congrats! 🎉 Glad to see PLMs with structural tokens trending. Like SaProt & ProTrek, ESM3 finds struct vocab+mask loss more effective & scalable. 🚀 We've ablated this on SaProt. @alexrives @EvoscaleAI @proteinrosh, a little nod to related work would make us happier! 😉🙌
We have trained ESM3, a generative bidirectional masked language model that reasons over the sequence, structure, and function of proteins. ESM3 is trained at three model scales - 1.4B, 7B, and 98B.
3
5
54
6,471
🔥 Breaking: @FengyDai launches SaProt-T: Try: http://113.45.254.183:9527/ SaProt-T, key module of Pinal: design proteins by input: • Partial structure • Partial aa seq • text function One can use ProTrek to find desired structures→ then redesign seq using SaProt-T✨
🧬 Design PETase with Pinal: 1️⃣ ChatGPT → Describe PETase 2️⃣ Pinal → Input text then run it 3️⃣ AlphaFold3 → Structure prediction 4️⃣ Evolla → Function validation 🚀 Simple yet powerful! Easy! Try:denovo-pinal.com/ #SyntheticBiology #ProteinDesign #AI4Science
2
8
56
4,294
🚀 New Update. The latest version of ProTrek is now available on bioRxiv. 🧬 📑 Read it here: biorxiv.org/content/10.1101/… • Service: huggingface.co/spaces/westla… • Try it on Colab: colab.research.google.com/dr…
1
8
52
16,567
Our work using ESM-Ezy to mine novel multicopper oxidases: nature.com/articles/s41467-0… ESM-Ezy: a deep learning strategy for the mining of novel multicopper oxidases with superior properties with Qian hui, Yajie Wang, Yuxuan and Xibin.
11
54
2,447
My student Jin will present Saprot at #ICLR2024. We're thrilled to share that our Saprot model (checkpoint version from last October) achieved 1st place on the ProteinGym benchmark (github.com/OATML-Markslab/Pr…) last month. Happy to see new PLMs with a structural alphabet.
#ICLR2024 We'll be at Halle B #33 on 10 May at 4:30 p.m. If you are interested in protein language modeling, feel free to reach out! We hope to have in-depth discussions with all of you! 😆😆😆
3
8
55
9,857
Just for fun, I tried Pinal, AF3, then Evolla. Surprise! Evolla said the designed protein is expressed in the venom gland of the organism Daboia siamensis, aka the Eastern Russell's viper & Daboia russelii siamensis. Pinal: denovo-pinal.com/ Evolla: chat-protein.com/
Deep learning methods aid in de novo design of proteins to neutralize lethal snake venom toxins in vitro and protect mice from a lethal neurotoxin challenge. nature.com/articles/s41586-0… #NBThighlight
2
13
52
7,791
New idea: (1) ML approaches try to fit all proteins, limiting accuracy on specific ones. 🔍 (2)Test-time training adapts models to target proteins on the fly ! 🧬 TRAINING ON TEST PROTEINS IMPROVES FITNESS, STRUCTURE, AND FUNCTION PREDICTION arxiv.org/pdf/2411.02109
1
11
51
5,577
Exciting highlights: 1️⃣ Training is super easy—no ML or coding expertise needed! 2️⃣ Biologists can share models on our community store for others to use or retrain. 3️⃣ Join OPMC as a paper author! Welcome more contributions! FAQs:github.com/westlake-repl/Sap… @GoogleColab #OPMC
Now everyone can customize/share protein language models for their custom task/dataset via @GoogleColab 🤓 Paper: biorxiv.org/content/10.1101/… Colab: colab.research.google.com/dr… Credit: @LTEnjoy, Zhikai Li, @ChenchenHa42849, @BonnieSwt, Junjie Shan, @XibinBayesZhou, Dacheng Ma, @duguyuan
1
16
43
7,145
Embeddings of ProTrek & ESM3 etc. were compared. While ProTrek excels in transfer learning, its true power emerges in search capabilities. Leveraging datasets 100x larger, ProTrek dramatically enhances text-protein & protein-text retrieval. Demo: huggingface.co/spaces/westla…
Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks - Benchmark six tpLMs (OntoProtein, ProteinDT, ProtST, ProteinCLIP, ProTrek, ESM3) against ESM2-3B on six tasks (GB1, GFP, AAV, Location, Meltome, Stability) - No tpLM outperforms consistently, with ProTrek and OntoProtein ranking first 3 and 2 times - Concatenate average embeddings and search for the optimal embedding combination heuristically to achieve the best benchmark performance Preprint: biorxiv.org/content/10.1101/…
1
10
46
5,993
Our paper TenRec accepted in #NeurIPS2022, a large-recommender system #Recsys dataset, covering 10 recommender tasks, with 4 scenarios & 6 user feedback. We released all baseline codes and will create a leaderboard for benchmarking RS advances. openreview.net/forum?id=PfuW…
1
8
49
🚀 SaprotHub Major Updates! • ColabSaprot-v2 released - easier than ever • 2 new wet lab validations added • Release Saprot 1.3B • New tools: ColabProTrek, ColabProtBerts & ColabMETL • New OPMC members 🔥 Train & share your PLMs - open for everyone! piped.video/watch?v=nmLtjlCI…
2
3
35
2,732
Just evaluated Saprot and ProstT5 on the protein inverse folding task. Surprisingly, Saprot also does well on this generation task by simply masking its 3Di token. It is also 20x faster than ProteinMPNN. #ICLR2024 #iclr24
My student Jin will present Saprot at #ICLR2024. We're thrilled to share that our Saprot model (checkpoint version from last October) achieved 1st place on the ProteinGym benchmark (github.com/OATML-Markslab/Pr…) last month. Happy to see new PLMs with a structural alphabet.
1
8
35
3,976
Great news: a wet lab submitted an EYFP fluorescence fitness model to SaprotHub with a Spearman ρ of 0.94, close to wet lab accuracy for double/triple-site mutations. Trained on 100K variants, it's a great 🔧 tool for biologists! @ProteinBoston @ml4proteins @sokrypton @LTEnjoy
Zhikai uploaded a 6-min tutorial for SaprotHub! 🚀 Biologists can now easily train & share their protein language models. Join us, be a SaprotHub author! #Bioinformatics #ProteinModeling @LTEnjoy @sokrypton Paper: biorxiv.org/content/10.1101/… Video: piped.video/watch?v=r42z1hvY…
2
5
33
5,741
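For readers unfamiliar with the Spearman ρ quoted for the EYFP fitness model above: it is the Pearson correlation computed on ranks, so it measures whether predictions order the variants correctly rather than whether they match measured values exactly. A minimal pure-Python sketch follows; the function names and the toy numbers are illustrative assumptions, not data from SaprotHub.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical measured vs. predicted fitness for five variants.
    measured = [0.10, 0.35, 0.50, 0.80, 0.95]
    predicted = [0.20, 0.30, 0.55, 0.90, 0.85]  # last two are swapped in rank
    print(round(spearman(measured, predicted), 3))
```

In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead of a hand-written version.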
Cool! "It first encodes protein structures to be aligned using the 3Di+AA alphab." 3Di+AA tokens could become a new way to represent proteins in the future.
7
22
1,735
🔥 Our team is recruiting PhD students for 2025 🔥 2 PhD positions for international students at Westlake University, China! We build cutting-edge protein language models (SaProt, ProTrek, Evolla, Pinal) ⏰ Apply now - deadline soon! piped.video/watch?v=fTdRsA4M…
3
4
27
2,976
🧬 Sharing recent wet lab results for ProTrek: search-protrek.com/ ! Our UDG validation shows remarkable success - all ProTrek-identified candidates from OMG database demonstrated effective T-editing, with our top hit outperforming existing published results. #ProTrek
1
1
20
1,453
Jin recently set up a Slack group for ColabSaprot discussions. Feel free to join here: westlakeai.slack.com/?redir=… We have recently received positive experimental results from over 10 wet labs by using ColabSaprot. Video Tutorial: piped.video/watch?v=nmLtjlCI…
ColabSaprot is really very impressive... Fine-tune a state-of-the-art protein language model by just uploading a csv of proteins and values. colab.research.google.com/gi… Or download other people's models from huggingface.co/SaProtHub
1
6
18
2,373
AlphaFold's structural representations are also useful for predicting function, in both annotation prediction and fitness prediction. We ran experiments for 10 months with 20 A40 A100. @KevinKaichuang @DeepMind #alphafold #AlphaFold arxiv.org/pdf/2206.06583.pdf
1
3
16
Deepseek (latest) as Protein Chat GPT?
1
2
15
1,566
We provided 4 huge datasets for the recommender systems community (Everything is there!) #Recsys #sigir #wsdm #kdd arxiv.org/pdf/2309.15379.pdf arxiv.org/pdf/2309.06789.pdf openreview.net/forum?id=PfuW… arxiv.org/abs/2309.15379
4
14
886
#sigir2020 our fp: Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation: arxiv.org/pdf/2001.04253.pdf @alexk_z Our findings: watching TikTok/YouTube heavily exposes personal info: gender, age, job, marriage. What is needed for privacy protection?
3
15
🚀 The Chang team at Westlake Uni used ColabSaprot to predict eTDG mutations with great results! 📢 16 prominent scientists have joined OPMC (see github.com/westlake-repl/Sap…). Saprot saw over 10,000 downloads last month on Hugging Face, with its 35M and 650M versions. 🧬 Join us!
Used SaprotHub to predict mutations for eTDG, a uracil-N-glycosylase variant. 🧬 Lab results: 17 out of top 20 mutations had higher T-to-G editing efficiency than wild type (marked as red), with 3 showing nearly 2x improvement! 🚀
3
15
1,524
Amazing work ESM C, congrats @EvoscaleAI! 🌟 Scaling helps structure prediction (70% seq id)! 🚀 What about function prediction? 😔 Sadly, our 1.3B Saprot trained on AFDB shows minimal gains. Maybe 10B AF2 structures would scale better? 🤔 When can we have 10B AF2 structures?
Introducing ESM Cambrian. Unsupervised learning can invert biology at scale to reveal the hidden structure of the natural world. We’ve scaled up compute and data to train a new generation of protein language models. ESM C defines a new state of the art for protein representation learning.
15
1,988
Great work! Using a protein language model to discover antibiotic resistance genes (ARGs) and virulence factor genes with ultra-high accuracy. We also designed an adapter mechanism for community sharing efforts! @FengJu2020 @Westlake_Uni @Westlake_SOE @jyang1981
Gratifying that the original idea of FunGeneTyper in 2017 is finally realized and online by 2022 thanks to joint efforts from our Westlake students and PIs 👍@Westlake_Uni @Westlake_SOE @duguyuan @jyang1981 . biorxiv.org/content/10.1101/…
3
13
1,018
Like this comment! @pranamanam Curious about scalability. In Saprot and SaprotHub, the structural token + mask LM loss scales well on AFDB. However, it's unclear if scaling to larger datasets will improve performance with a larger model. @alexrives @proteinrosh @THayes427
Had a day to reflect on the release of ESM3, and just wanted to share a few thoughts (and a few shameless highlights of my lab's work! 😅). Before that, for the people who know our stuff, you know that I am an ESM evangelist: I think pLMs will be the future of protein design. 🪄 But it's super important for my lab to understand strengths and weaknesses! To the few points: The Good: -ESM3 uses progressive unmasking for generation. I know a lot of people are like, why not just do next-token? MLM is a way more natural, representative strategy of nature's evolutionary "generative" process, where mutations arise epistatically to confer higher fitness. We've found significant success ourselves with de novo binder generation via span MLM on ESM-2-650M latents (we didn't find the same success with GPT-like models). Check out our PepMLM model with @LeoTZ03: arxiv.org/abs/2310.03842 -Overall, you should not sleep on BERT-like models: they are great generators in many ways, and the same will probably be true for ESM3 (though GFP is probably not enough for validation). We've explored strategies with ESM-2 to perturb latent embeddings with Gaussian noise and decode back into de novo sequences for binder design (which work amazingly in the lab!). Check out our PepPrCLIP model with @bhat_suhaas and @kalyanmpalepu: biorxiv.org/content/10.1101/… -With the largest models trained on 2.78 billion proteins on the MLM task, I have no doubt the model should have excellent unconditional generation/representation capabilities for prediction tasks. As academics, we're thankful that ESM3 will release these models for us to play around with (if we have the compute)! The Not So Good: -Look, I'm a sequence-only guy. I believe all of the useful information of protein properties should be contained in a good sequence representation. I am quite disappointed that ESM3 went with incorporating structure tokens. 
No doubt this will improve performance for a lot of representation/design tasks (look at SaProt from @duguyuan!) on structured proteins, but this will likely reduce our ability to model conformationally disordered proteins, i.e. transcription factors, which are the most important from a disease/regulatory perspective. My lab has gone in the opposite direction and regularly fine-tune sequence-only ESM models on more disordered sequences, like fusion oncoproteins, and get strong performance. Check out our FusOn-pLM model with @SophieVincoff: biorxiv.org/cgi/content/shor… -What about other special tokens? PTMs, chemical modifications, etc. -- these could have been integrated in training as new tokens. We've described new ways to introduce PTM tokens into pLMs like ESM-2. Doing this for ESM3 will be fun (but potentially difficult with the size of the models)! Check out our PTM-Mamba paper with @pengzhangzhi1: biorxiv.org/content/10.1101/… -Size, size, size. ESM-2-650M is BY FAR the best pLM that balances size and representation capacity. All of our papers (and pretty much every other paper I've read) find this model is optimal for de novo design and downstream prediction tasks, despite being the "medium-sized" ESM. Check out our SaLT&PepPr paper with @garykbrixi: nature.com/articles/s42003-0…. -For academic labs (pretty much the main ones who can use it), it's going to be tough to use the bigger models for optimization, even the open-sourced 1.4B model. Switching away from ESM-2-650M will be a mistake for most applications that don't involve unconditional generation. I hope the ESM3 team will do more ablation studies to prove the model's additional utility! 🥹 The Neutral Finally, ESM3 is available with a non-commercial, academic use-only license. I think this is absolutely the right move (similar to AlphaFold3) to protect EvolutionaryScale's commercial interests while still letting academics push the frontiers of research if ESM3 proves to be useful! 
However, for some of us that use ESM-like models to develop therapeutics, it will be hard for us to get ESM3-assisted designed molecules to market without commercialization capabilities. That's why I would still recommend continued usage of ESM-2-650M for most tasks -- it's such a good model! 😊 Would love to hear the ESM team's thoughts and would be very open to collaboration! 🌟 @alexrives @TomSercu @proteinrosh @denizzokt @ebetica @THayes427
1
14
3,276
Interesting—We've been using ProTrek to evaluate the matching relation between text and generated proteins, its matching score looks good. 😊 Try it out: huggingface.co/spaces/westla… (Calculate a matching score using ProTrek) paper: biorxiv.org/content/10.1101/…
First text2protein AI model, compressing billions of years of life. 800+ novel, functional and foldable proteins are discovered by researchers. Whitepaper and repo bit.ly/310paper
2
14
1,694
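One way to picture the text-protein "matching score" mentioned above is as a cosine similarity between a text embedding and a protein embedding in a shared space, in the style of CLIP-like retrieval models. The sketch below is only an assumption about that general technique: it does not call ProTrek, the toy vectors stand in for real embeddings, and ProTrek's actual scoring may add scaling or other details.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(text_vec, protein_vecs):
    """Sort candidate proteins by similarity to the text embedding, best first.

    `protein_vecs` maps a candidate name to its embedding (hypothetical data).
    """
    scored = [(name, cosine_similarity(text_vec, vec)) for name, vec in protein_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real models use hundreds of dimensions.
    prompt = [1.0, 0.0, 0.0]
    designs = {
        "design_A": [0.9, 0.1, 0.0],   # nearly parallel to the prompt
        "design_B": [0.0, 1.0, 0.0],   # orthogonal to the prompt
        "design_C": [-1.0, 0.0, 0.0],  # opposite direction
    }
    for name, score in rank_candidates(prompt, designs):
        print(f"{name}: {score:+.3f}")
```

A higher score means the generated protein sits closer to the prompt in the shared embedding space, which is why such a score can serve as a dry-lab sanity check before wet-lab validation.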
Xibin just released the 10B-version weights on our GitHub: github.com/westlake-repl/Evo… Fine-tuning example code coming soon! 🚀 The 80B version is in training and will be released after convergence.
We release our protein chatGPT, Evolla! 🌟 chat-protein.com/ Evolla comes in two versions: 10B & 80B. The 80B model has a 1.3B Saprot encoder & a 70B LLaMA3 decoder. Trained on 546M protein question-answer pairs and 150 billion word tokens! 💡🔬 biorxiv.org/content/10.1101/…
2
13
1,254
A paper accepted at #WSDM2024! It evaluates the use of "Adapter" for Multimodal #Recsys models. The paper titled 'Exploring Adapter-based Transfer Learning for Recommender Systems: Empirical Studies and Practical Insights' Check out related papers here: github.com/westlake-repl/Rec…
13
692
Our paper accepted at #SIGIR2023, "Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited", asks a crucial question for Recsys: will the prevailing ID embedding models remain dominant in the future? arxiv.org/pdf/2303.13835.pdf
1
8
757
Replying to @miangoar
Foldseek is definitely groundbreaking. Have you tried our ProTrek :) ? It finds proteins with similar functions using text/seq/structure inputs, even when their structures differ. Can be used to study convergent evolution. search-protrek.com/
1
1
12
1,022
How to make a recommender system model general and transferable to various other systems so as to realize "one model to serve all" like foundation models in NLP. See our recent work TransRec: arxiv.org/pdf/2206.06190.pdf
2
1
11
received our certificates
3
11
In the new version (to be released soon), we performed extensive wet-lab validations for ColabSaprot using both zero-shot methods and supervised training approaches to engineer various proteins. The validated targets included TDG (a uracil-N-glycosylase (UNG) variant), xylanase, and vGFP.
4
10
1,367
A Large-scale Multipurpose Benchmark Dataset for Recommender Systems #NeurIPS @NeurIPSConf openreview.net/pdf?id=PfuW84… static.qblv.qq.com/qblv/h5/a…
1
11
Great job, @anthonygitter! ColabMETL is also a member of OPMC: theopmc.github.io/
Our manuscript "Biophysics-based protein language models for protein engineering" with @romerolab1 is now on bioRxiv. We present Mutational Effect Transfer Learning (METL), a protein language model trained on biophysical simulations, and showcase it for protein engineering. 1/
1
3
10
1,298
Just follow the video tutorial. Pinal usually takes 1-2 minutes to generate a protein.
10
1,212
If scaling is not the right way, what is next for pLM? How about ESM-3? Is 100B necessary?
🌟 Excited to announce AMPLIFY, our latest protein language model that challenges the scaling trend! While current models like ESM2 15B rely on billions of parameters, AMPLIFY achieves superior performance with only 350M parameters. 1/7
1
8
1,838
ProTrek is more like a retrieval model that learns protein sequence, structure, and function (SSF) within a unified architecture using both CE loss and masked language model loss. Checking out ESM3-generated proteins with ProTrek would be interesting. huggingface.co/spaces/westla…
1
7
599
What is most impressive about Evolla is that it shows comparable results to CLEAN in enzyme EC number prediction. CLEAN is a SOTA classification model trained on the enzyme EC number dataset, while Evolla is a purely generative model trained on diverse protein data 1)
10
1,614
Interesting results. SaProt used AFDB structures for training, which, as I remember, did indeed exclude virus proteins.
Although I welcome more discussion of biosafety in AI, I see condensing safety into a single score as an oversimplification of the issues. 1/
7
919
Exciting news! Our paper "NineRec" accepted in TPAMI. 🔹10 multi-modal recommendation datasets from 5 RS platforms, featuring text and images. 🔹Evaluate cross-domain recommender models with NineRec. #Recsys #WSDM Paper: arxiv.org/pdf/2309.07705.pdf Code: github.com/westlake-repl/Nin…
2
2
8
672
Like to see "Scaling"! Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT
Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT 1. SPRINT is a novel vector-based method for drug-target interaction (DTI) prediction, offering unparalleled scalability and speed. It can screen the entire human proteome against a library of 6.7 billion compounds in just 16 minutes. 2. Unlike traditional structure-based approaches, SPRINT leverages structure-aware protein language models (PLMs) to create co-embedding spaces for drugs and protein targets. This allows accurate DTI prediction without explicit 3D modeling. 3. The platform achieved state-of-the-art performance on virtual screening benchmarks, DTI classification tasks, and binding affinity predictions. It offers residue-level interpretability through attention maps, aiding mechanistic insights. 4. SPRINT was validated through large-scale applications, including antimicrobial drug discovery and SARS-CoV-2 NSP13 helicase inhibitor identification, showcasing its utility in identifying diverse, high-quality molecular scaffolds. 5. Using a multi-head attention pooling strategy, SPRINT effectively captures sequence-dependent protein representations, overcoming limitations of previous pooling methods like averaging. 6. The method is highly efficient, using Chroma vector search to handle billions of molecules and proteomes. It reduces computational barriers, enabling pan-proteome DTI screens and virtual screening for drug repurposing. 7. SPRINT’s open-source framework supports modular protein and molecule encoder integration, enhancing compatibility with future PLM developments. It also demonstrated synergy with existing molecular fingerprint methods for property prediction tasks. 8. This approach democratizes virtual screening by delivering accurate, interpretable, and large-scale drug discovery capabilities at a fraction of the computational cost of traditional methods. 
@david_koes @ericxing @monica_dayao @probablybots 💻Code: github.com/abhinadduri/pansp… 📜Paper: arxiv.org/abs/2411.15418 #DrugDiscovery #VirtualScreening #ProteinLanguageModels #ComputationalBiology
2
7
904
AA sequences theoretically may encode all information, but explicit features (e.g., structure) often enhance learning - similar to AF2's effective use of MSA. Even a 15B PLM cannot replace a very small Evoformer model. Not easy to do stuff with only AA sequence. :)
2
8
410
Replying to @apsarathchandar
Hi Sarath, excellent work! I completely agree with your point. Have you seen our Saprot paper? We also showed that the 35M version of Saprot is better than ESM2-15B. Take a look: biorxiv.org/content/10.1101/… openreview.net/pdf?id=6MRm3G…
1
7
503
Pinal has two components: T2struct for translating natural language to protein structure, and SaProt-T for sequence design conditioned on language and structure. Both use discrete Foldseek 3Di tokens for structure. thanks @thesteinegger for such great work.
6
628
The enzyme designed by Pinal has been shown to exhibit functional activity.
1
1
7
643
For the EYFP task, biology researchers fine-tuned a peer-shared model from SaprotHub (huggingface.co/SaProtHub/Mod…), significantly outperforming AI researchers with limited data. For all other tasks, they used exactly the same dataset for evaluation.
1
7
849
NineRec, a transferable RS #recsys dataset suite comprising a large-scale source domain dataset and nine diverse target domain recommendation datasets. Each item in NineRec has a descriptive text and a high-resolution cover image.
Multimodal Multi-domain Recommendation System DataSet and Benchmark link.medium.com/FlEGaXlgVIb Paper: arxiv.org/pdf/2309.07705.pdf Code: github.com/westlake-repl/Nin…
2
6
532
Great work! congrats to @KevinKaichuang
We did 370 experiments to discover that protein language models primarily learn structure and won't scale for protein function prediction. We need new pretraining tasks! Work led by @francescazfl with @avapamini @yisongyue @alexijielu See Alex's thread + the paper for more!
7
774
Step 5: Wet lab: We welcome potential collaborators interested in testing various predictions in wet lab conditions! Just provide us your evidence.
7
980
Glad to see this:
1
7
401
Model Architecture: Stage 1: Text to Foldseek 3Di tokens - 1.2B parameters Stage 2: Text + 3Di to amino acid sequence (Saprot variant) - 0.8B parameters Total: 2B parameters @thesteinegger Thanks Martin for your Foldseek.
1
5
620
Thx Brian, we provide a toy dataset in the training interface and larger data in SaprotHub. For efficiency, SaprotHub processes PDB into SA token sequences. Users with their own datasets can simply upload PDB structures - we handle it automatically. Following the hints should be enough.
4
169
Join us in submitting more pLMs to SaprotHub! Biologists can now train, use, share and co-build protein ML models without coding. ColabSaprot: colab.research.google.com/dr… SaprotHub: huggingface.co/SaProtHub Paper: biorxiv.org/content/10.1101/… Video: piped.video/watch?v=r42z1hvY…
1
6
412
Our #SIGIR2021 code and datasets have been open-sourced: "One Person, One Model, One World: Learning Continual User Representation without Forgetting" and "StackRec: Efficient Training of Very Deep Sequential Recommender Models by Layer Stacking". Feel free to use them.
6
Our #NeurIPS2022 final version is ready with all related code and datasets. It includes 11 recommendation tasks with various baseline implementations, hope it will be a reliable benchmark for new + old researchers in RS. #recsys2022 #SIGIR2022 #KDD2022
Our paper TenRec accepted in #NeurIPS2022, a large-recommender system #Recsys dataset, covering 10 recommender tasks, with 4 scenarios & 6 user feedback. We released all baseline codes and will create a leaderboard for benchmarking RS advances. openreview.net/forum?id=PfuW…
5
Highlights: - nCas9 with engineered UNGs enable transversion base editing without deamination - PLMs were used to predict enzymatic variant activities - Using the PLMs, an efficient T>S (G or C) base editor, TSBE3, was developed ...
Our paper on engineering uracil-N-glycosylase using protein language model ESM is now published in Molecular Cell. @XibinBayesZhou Would love to know if replacing ESM with our Saprot (biorxiv.org/content/10.1101/…) would result in better performance. sciencedirect.com/science/ar…
2
5
489
Two nice papers using protein language models to construct MSAs, showing results competitive with HHblits and jackhmmer but much faster: arxiv.org/pdf/2206.06583.pdf
Use protein language model representations to construct multiple sequence alignments. @clairemcwhite @ProfMonaSingh biorxiv.org/content/10.1101/…
1
6
8 Ways to Search Proteins: seq2seq: Find similar function sequences seq2text: Get function from sequence seq2struct ...struct2struct ...struct2text ... struct2seq text2seq: Find sequence by description text2struct: Search structure by function Predict GO annotation, EC number.
5
550
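The eight search modes listed above are the ordered (query, target) pairs over three modalities, minus text-to-text, which the list does not include. A short sketch that enumerates them; the constant and function names are hypothetical, and excluding text2text is inferred from the tweet's list rather than from the ProTrek code.

```python
from itertools import product

# The three modalities named in the ProTrek tweets.
MODALITIES = ("seq", "struct", "text")


def search_modes():
    """Enumerate query2target search modes, skipping text2text.

    3 modalities give 3 x 3 = 9 ordered pairs; dropping text2text leaves 8.
    """
    return [
        f"{query}2{target}"
        for query, target in product(MODALITIES, repeat=2)
        if not (query == "text" and target == "text")
    ]


if __name__ == "__main__":
    modes = search_modes()
    print(len(modes), modes)
```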
Step 4: check its function with Evolla: chat-protein.com/
1
7
1,051
If you don't know if the designed protein is relevant to your text, try our ProTrek: search-protrek.com/ Paper link: biorxiv.org/content/10.1101/…
1
6
745
Replying to @SynmitoYao
I do not know. Need wet lab
1
6
748
Pinal demonstrates impressive performance when evaluated using GT-TMscore and ProTrek CLIP score, outperforming ESM-3 on dry-lab metrics when keywords are used as prompts. We plan to validate these results with wet-lab experiments.
2
5
707
NineRec: A Benchmark Dataset Suite for Evaluating Transferable Recommendation arxiv.org/abs/2309.07705
3
308