An LC-MSMS system extends the peptide mass fingerprinting (PMF) approach used in MALDI-TOF ZooMS by measuring not only the m/z and intensity of the precursor ions, but also a fragmentation pattern for each peptide. To identify these MSMS spectra, most algorithms rely on the target-decoy approach, either for score thresholding, for improving the feature weights used in scoring, or both.
However, this presents paleoproteomics with a conundrum: ancient bone samples generate sparse datasets built from only a few proteins, which precludes the use of target-decoy strategies. In other words, the simple protein composition that enables fast and effective ZooMS through PMF on MALDI is exactly what complicates LC-MSMS data interpretation. Fortunately, probabilistic search engines like Mascot and MaxQuant remain suited to sparse peptide identification and are therefore the most widely used in LC-MSMS species classification.
Still, these peptides do not derive from functionally different proteins of a single organism, the case for which most scoring algorithms were developed. Rather, the data consist of often low-quality MSMS spectra of a few homologous protein targets that could derive from a plethora of organisms, i.e. a long list of very similar orthologous sequences. The net effect is that basic search results display a great deal of ambiguity, which in turn impairs the accuracy of post-processing tools like Unipept.
Therefore, we propose to embrace and even extend this ambiguity using a newly developed “isoBLAST” approach, in order to obtain an exhaustive set of potential peptide candidates (PPCs) for each spectrum. Using only the collagen PPCs, we then walk down the taxonomic tree, retaining unique peptides at every bifurcation, and plot the complete set of considered peptides in a sunburst plot color-coded by an adapted Bray-Curtis dissimilarity score. Using public and in-house data, we demonstrate the performance of the approach on different instruments and search engines.
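To make the tree walk and the dissimilarity score more concrete, the following Python sketch illustrates one possible reading of these two steps. It is not the implementation used in this work: the function names, the toy taxonomy, the placeholder peptide labels, the rule of following the child branch with the most observed branch-unique peptides, and the decision to stop on a tie are all assumptions made for illustration. The score shown is the standard Bray-Curtis dissimilarity, not the adapted variant referred to above.

```python
def bray_curtis(u, v):
    """Standard Bray-Curtis dissimilarity between two count mappings.

    Returns 0.0 for identical profiles and 1.0 for profiles sharing nothing.
    (The adapted score mentioned in the text is not reproduced here.)
    """
    keys = set(u) | set(v)
    num = sum(abs(u.get(k, 0) - v.get(k, 0)) for k in keys)
    den = sum(u.get(k, 0) + v.get(k, 0) for k in keys)
    return num / den if den else 0.0


def subtree_peptides(tree, peptides, node):
    """Union of the reference peptides of all leaves below `node`."""
    children = tree.get(node, [])
    if not children:
        return set(peptides.get(node, ()))
    found = set()
    for child in children:
        found |= subtree_peptides(tree, peptides, child)
    return found


def walk_down(tree, peptides, observed, root):
    """Descend from `root`, following at each bifurcation the child whose
    subtree holds the most observed peptides not shared with its siblings.

    Assumption: the walk stops when no child has such support, or on a tie.
    """
    path, node = [root], root
    while tree.get(node):
        per_child = {c: subtree_peptides(tree, peptides, c) for c in tree[node]}
        support = {}
        for child, peps in per_child.items():
            siblings = set().union(
                *(p for c, p in per_child.items() if c != child)
            )
            # Keep only observed peptides unique to this branch.
            support[child] = (peps - siblings) & observed
        ranked = sorted(support, key=lambda c: len(support[c]), reverse=True)
        best = ranked[0]
        tied = len(ranked) > 1 and len(support[ranked[1]]) == len(support[best])
        if not support[best] or tied:
            break
        node = best
        path.append(node)
    return path


# Hypothetical toy taxonomy with placeholder peptide labels (not real sequences).
toy_tree = {"Bovidae": ["Bos", "Caprinae"], "Caprinae": ["Ovis", "Capra"]}
toy_peptides = {
    "Bos": {"PEP_A", "PEP_B"},
    "Ovis": {"PEP_A", "PEP_C", "PEP_D"},
    "Capra": {"PEP_A", "PEP_C", "PEP_E"},
}
toy_observed = {"PEP_A", "PEP_C", "PEP_D"}

if __name__ == "__main__":
    print(walk_down(toy_tree, toy_peptides, toy_observed, "Bovidae"))
    print(bray_curtis({"PEP_A": 2, "PEP_C": 1}, {"PEP_A": 1, "PEP_C": 1}))
```

In this sketch a peptide shared by all branches (here `PEP_A`) never influences the descent, which mirrors the idea of retaining only unique peptides at each bifurcation.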