Had a day to reflect on the release of ESM3, and just wanted to share a few thoughts (and a few shameless highlights of my lab's work! 😅). Before that: those who know our stuff know that I am an ESM evangelist. I think pLMs will be the future of protein design. 🪄 But it's super important for my lab to understand their strengths and weaknesses! On to a few points:
The Good:
-ESM3 uses progressive unmasking for generation. I know a lot of people are like, why not just do next-token? MLM is a far more natural representation of nature's evolutionary "generative" process, where mutations arise epistatically to confer higher fitness. We've found significant success ourselves with de novo binder generation via span MLM on ESM-2-650M latents (we didn't find the same success with GPT-like models). Check out our PepMLM model with @LeoTZ03: arxiv.org/abs/2310.03842
-Overall, you should not sleep on BERT-like models: they are great generators in many ways, and the same will probably be true for ESM3 (though GFP alone is probably not enough for validation). We've explored strategies with ESM-2 that perturb latent embeddings with Gaussian noise and decode them back into de novo sequences for binder design (which work amazingly in the lab!). Check out our PepPrCLIP model with @bhat_suhaas and @kalyanmpalepu: biorxiv.org/content/10.1101/…
-With the largest models trained on 2.78 billion proteins on the MLM task, I have no doubt they will have excellent unconditional generation capabilities, plus strong representations for prediction tasks. As academics, we're thankful that the ESM3 team will release these models for us to play around with (if we have the compute)!
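For anyone who hasn't seen progressive unmasking written down, here is a toy sketch of the general idea: start fully masked, predict every masked position, commit only the most confident predictions, and repeat. Everything here is an assumption for illustration. `toy_predict` is a hypothetical stand-in that returns random residues and confidences rather than a real pLM, and the "commit the top `per_step` positions" schedule is one common variant, not necessarily what ESM3 or PepMLM does.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "_"


def toy_predict(tokens, rng):
    """Hypothetical stand-in for a masked LM.

    Returns {position: (residue, confidence)} for every masked position.
    A real model would condition on the unmasked context; this one is random.
    """
    return {
        i: (rng.choice(AMINO_ACIDS), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def progressive_unmask(length, per_step=2, seed=0):
    """Generate a sequence by unmasking a few positions per iteration."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    while MASK in tokens:
        preds = toy_predict(tokens, rng)
        # Commit only the most confident predictions; the rest stay masked
        # and get re-predicted next round with more context filled in.
        most_confident = sorted(
            preds, key=lambda i: preds[i][1], reverse=True
        )[:per_step]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return "".join(tokens)


if __name__ == "__main__":
    print(progressive_unmask(12))
```

Unlike next-token decoding, positions get filled in any order, so later choices can condition on residues on both sides.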
The Not So Good:
-Look, I'm a sequence-only guy. I believe all of the useful information about protein properties should be contained in a good sequence representation. I am quite disappointed that ESM3 went with incorporating structure tokens. No doubt this will improve performance on a lot of representation/design tasks (look at SaProt from @duguyuan!) for structured proteins, but it will likely reduce our ability to model conformationally disordered proteins, e.g. transcription factors, which are the most important from a disease/regulatory perspective. My lab has gone in the opposite direction: we regularly fine-tune sequence-only ESM models on more disordered sequences, like fusion oncoproteins, and get strong performance. Check out our FusOn-pLM model with @SophieVincoff: biorxiv.org/cgi/content/shor…
-What about other special tokens? PTMs, chemical modifications, etc. -- these could have been integrated in training as new tokens. We've described new ways to introduce PTM tokens into pLMs like ESM-2. Doing this for ESM3 will be fun (but potentially difficult with the size of the models)! Check out our PTM-Mamba paper with @pengzhangzhi1: biorxiv.org/content/10.1101/…
-Size, size, size. ESM-2-650M is BY FAR the best pLM that balances size and representation capacity. All of our papers (and pretty much every other paper I've read) find this model is optimal for de novo design and downstream prediction tasks, despite being the "medium-sized" ESM. Check out our SaLT&PepPr paper with @garykbrixi: nature.com/articles/s42003-0….
-For academic labs (pretty much the main ones who can use it), it's going to be tough to use the bigger models for optimization, even the open-sourced 1.4B model. Switching away from ESM-2-650M will be a mistake for most applications that don't involve unconditional generation. I hope the ESM3 team will do more ablation studies to prove the model's additional utility! 🥹
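On the special-token point in the list above, here is a minimal sketch of what "adding PTM tokens" means at the vocabulary level. This is an assumption-laden toy, not the PTM-Mamba method: the token names in `PTM_TOKENS`, the bracket syntax, and the `build_vocab` / `tokenize` helpers are all made up for illustration. In a real model, each new id would also need a new embedding row, which then has to be trained.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical PTM token names, invented for this example.
PTM_TOKENS = ["<phospho-S>", "<phospho-T>", "<acetyl-K>"]


def build_vocab(extra_tokens=()):
    """Map tokens to ids; extra tokens are appended so existing ids never shift."""
    tokens = ["<pad>", "<mask>"] + list(AMINO_ACIDS) + list(extra_tokens)
    return {tok: idx for idx, tok in enumerate(tokens)}


def tokenize(sequence, vocab):
    """Bracketed tokens are matched whole; everything else is one residue each.

    Raises KeyError if a token is not in the vocabulary.
    """
    ids, i = [], 0
    while i < len(sequence):
        if sequence[i] == "<":
            j = sequence.index(">", i) + 1  # end of the bracketed token
        else:
            j = i + 1
        ids.append(vocab[sequence[i:j]])
        i = j
    return ids


if __name__ == "__main__":
    vocab = build_vocab(PTM_TOKENS)
    # A phosphoserine in place of the third residue becomes a single token.
    print(tokenize("MK<phospho-S>T", vocab))
```

Appending the new tokens after the original vocabulary keeps every pretrained residue id unchanged, so only the added rows start from scratch.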
The Neutral:
Finally, ESM3 is available under a non-commercial, academic-use-only license. I think this is absolutely the right move (similar to AlphaFold3) to protect EvolutionaryScale's commercial interests while still letting academics push the frontiers of research if ESM3 proves to be useful! However, for those of us who use ESM-like models to develop therapeutics, it will be hard to get ESM3-assisted designed molecules to market without commercial-use rights. That's why I would still recommend continued usage of ESM-2-650M for most tasks -- it's such a good model! 😊
Would love to hear the ESM team's thoughts and would be very open to collaboration! 🌟
@alexrives @TomSercu @proteinrosh @denizzokt @ebetica @THayes427