Session: Parallel session 3 - AI and Bioinformatics

Tesorai Search: Large pretrained model boosts peptide identifications without the need for Percolator

Maximilien BURQ¹, Peter CIMERMANCIC¹, Dejan STEPEC¹, Juan RESTREPO², Jure ZBONTAR¹, Shamil URAZBAKHTIN², Bryan CRAMPTON¹, Shivani TIWARY¹, Melissa MIAO¹, Juergen COX²

¹Tesorai, San Diego, United States
²Max Planck Institute of Biochemistry, Martinsried, Germany

Current search algorithms for mass spectrometry proteomics typically fail to identify up to 75% of spectra. To mitigate the issue, machine learning (ML) has become a staple of modern database search engines for mass spectrometry proteomics. Second-generation software such as Percolator, PeptideProphet, Prosit, AlphaPeptDeep, MSBooster, Peaks, Chimerys and others show impressive increases in the number of peptides and proteins identified when compared to first-generation tools (e.g., Sequest, X!Tandem, Comet, Andromeda, and Sage). However, recent evidence has shown that the increased identifications come at the cost of a significant underestimation of the false discovery rate (FDR). The same evidence indicates that this is inherent to how these tools leverage ML, where a new model is retrained to separate targets from decoys for every run. This approach stands in sharp contrast with the recent trend in other fields of ML, where large models are trained only once and then applied to a wide range of modalities and use cases.
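
For readers less familiar with the FDR estimate discussed above, the following is a minimal sketch of the standard target-decoy estimate, in which the FDR at a score cutoff is approximated as the number of decoy matches divided by the number of target matches above that cutoff. It is illustrative only and is not the Tesorai implementation; the function name, the `(score, is_decoy)` input format, and the example scores are assumptions made for this sketch.

```python
from typing import List, Tuple


def count_targets_at_fdr(psms: List[Tuple[float, bool]], max_fdr: float = 0.01) -> int:
    """Return the largest number of target PSMs accepted at an estimated FDR <= max_fdr.

    psms: hypothetical list of (score, is_decoy) pairs, higher score = better match.
    The FDR at each cutoff is estimated as decoys / targets above the cutoff.
    Score ties are not treated specially in this sketch.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all PSMs scoring at least as well as this one.
        estimated_fdr = decoys / max(targets, 1)
        if estimated_fdr <= max_fdr:
            best = targets
    return best


# Made-up example scores, for illustration only.
example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
print(count_targets_at_fdr(example, max_fdr=0.25))
```

The estimate relies on decoys scoring like incorrect target matches. When a rescoring model is retrained on the same targets and decoys for every run, that assumption can be put under strain, which is the concern raised above.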

We hypothesized that training a single large peptide-spectrum matching (PSM) model can improve upon the gains from using Percolator-like tools while solving the FDR control issue. To that end, we built a novel approach to peptide-spectrum matching, based on a pre-trained large deep learning model, which does not use decoys during training and does not require training a new model for every new sample. We trained the Tesorai model on over 100M real peptide-spectrum pairs and show that the approach performs robustly across a wide range of use cases, including standard trypsin-digested samples, metaproteomics, single-cell, and isobaric-labeled samples.

In addition to providing robust FDR control, our method consistently increases identifications, by up to 170%, when compared to state-of-the-art approaches. We incorporated our model into a cloud-native, end-to-end system that can process 500 samples in less than an hour, and made it publicly available at www.tesorai.com.