High-resolution tandem mass spectrometry is a powerful tool for characterizing how microbiota function by analyzing their protein content. However, accurately assessing their taxonomic composition from peptide sequences alone remains challenging because of sample complexity. To address this, we present LineageFilter (LF), a new Python-based machine-learning software for refined proteotyping of complex samples from raw metaproteomics data.
Given a tentative list of taxa, their abundances, and the scores associated with their identified peptides, LF computes a comprehensive set of features for each identified taxon at every taxonomic rank. Its random forest model uses these features to estimate the likelihood that each taxon is present, effectively filtering out false positives. LF's training and test datasets comprise tandem mass spectrometry data from 16 mixtures, including both previously described and in-house assembled datasets.
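As an illustration of the feature-computation step described above, the following minimal Python sketch aggregates peptide-level scores into one feature row per taxon and rank. The input records, the field names, and the four features (`n_peptides`, `mean_score`, `max_score`, `peptide_fraction`) are hypothetical choices made for this example; they are not LF's actual input format or feature set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: one record per peptide-to-taxon match.
# Taxa, ranks, and scores are made-up illustrative values.
peptide_matches = [
    {"taxon": "Pseudomonas aeruginosa", "rank": "species", "score": 0.9},
    {"taxon": "Pseudomonas aeruginosa", "rank": "species", "score": 0.8},
    {"taxon": "Pseudomonas aeruginosa", "rank": "species", "score": 0.7},
    {"taxon": "Staphylococcus aureus", "rank": "species", "score": 0.6},
    {"taxon": "Pseudomonas", "rank": "genus", "score": 0.9},
    {"taxon": "Staphylococcus", "rank": "genus", "score": 0.6},
]


def taxon_features(matches):
    """Build one feature row per (rank, taxon) from peptide-level scores."""
    # Collect the peptide scores assigned to each taxon at each rank.
    grouped = defaultdict(list)
    for match in matches:
        grouped[(match["rank"], match["taxon"])].append(match["score"])

    # Count peptides per rank so each taxon's share can be computed.
    rank_totals = defaultdict(int)
    for (rank, _taxon), scores in grouped.items():
        rank_totals[rank] += len(scores)

    features = {}
    for (rank, taxon), scores in grouped.items():
        features[(rank, taxon)] = {
            "n_peptides": len(scores),
            "mean_score": mean(scores),
            "max_score": max(scores),
            # Share of this rank's peptides assigned to this taxon.
            "peptide_fraction": len(scores) / rank_totals[rank],
        }
    return features


features = taxon_features(peptide_matches)
for key, row in sorted(features.items()):
    print(key, row)
```

Feature rows of this kind, paired with known presence or absence labels from reference mixtures, are the sort of table a random forest classifier (for example, scikit-learn's `RandomForestClassifier`) could be trained on to score each taxon.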
LF is a robust tool for improving the identification of taxa in complex multispecies samples, paving the way for better protein and functional annotation in metaproteomics. Its versatility is shown by its effectiveness on noisy data, from which it allows more proteins to be validated at a 1% false discovery rate (FDR). LF can be seamlessly integrated into existing tools and pipelines, including those using Unipept, to enhance proteotyping sensitivity (recall). Although the default settings are recommended, LF's parameters can be easily customized.
To demonstrate LF’s ability to enhance both the efficiency and precision of complex-sample characterization, we apply it to profile microbiota from challenging samples such as sputum from cystic fibrosis patients. We show how LF facilitated the elucidation of key microbe-host interactions, improving metaproteomic profiles and their ability to complement clinical data to optimize patient management.