Mass spectrometry (MS)-based proteomics most often relies on database searching to identify peptides and proteins. Ideally, the protein sequences come from public databases such as NCBI or UniProtKB. However, 63.3% of the 165,891 species with sequencing data in the SRA database have no annotation (96.4% for eukaryotic species; March 2024). To overcome this bottleneck and enable proteomics researchers to annotate any species on their own, we developed a bioinformatics pipeline named Brownotate (BR).
BR combines existing tools to process sequencing data, filter out unreliable reads, assemble the retained reads, predict protein-coding genes, translate them into protein sequences, name them, and evaluate completeness with BUSCO (search for species-specific conserved orthologs in OrthoDB). To assess BR performance, BR-derived assemblies and annotations were generated for 27 species belonging to 11 different taxa, then compared to reference (REF) data extracted from NCBI. MS raw data downloaded from PRIDE for these same species were also processed using MaxQuant against either the BR-derived or the REF protein database, and the resulting datasets of identified proteins (FDR 1%) were compared.
REF and BR assemblies are similar in length. Their BUSCO completeness is comparable for small-genome species, but it decreases for BR in large-genome species. On average, BR predicts twice as many proteins as REF, but these are generally shorter. Searching the PRIDE-extracted data against the BR-derived and REF protein databases generally identifies a similar number of protein groups, with 89.6% of protein groups shared by both databases.
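The overlap figure above can be illustrated with a minimal Python sketch. The abstract does not state how the shared percentage was calculated, so this example assumes it is the number of protein groups identified with both databases divided by the number identified with at least one. It also assumes that protein groups from the two searches have already been mapped to a common identifier. The function name `shared_fraction` and the identifiers are hypothetical.

```python
def shared_fraction(br_groups, ref_groups):
    """Fraction of all identified protein groups that were found with both databases.

    Assumption: "shared" means intersection over union of the two result sets.
    """
    br, ref = set(br_groups), set(ref_groups)
    union = br | ref
    if not union:
        return 0.0  # no identifications at all, nothing to share
    return len(br & ref) / len(union)


# Hypothetical identifiers, for illustration only (not data from the study).
br_hits = ["P1", "P2", "P3", "P4"]
ref_hits = ["P2", "P3", "P4", "P5"]

# 3 groups are shared out of 5 identified in total.
print(f"{shared_fraction(br_hits, ref_hits):.1%}")  # prints 60.0%
```

Other definitions are possible, for example dividing by the number of REF identifications only, and they would give a different percentage for the same data.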
BR generates high-quality protein sequence databases that are well suited to MS/MS data interpretation. Thanks to its user-friendly graphical interface, BR should facilitate studies of species for which only sequencing or assembly data are available, as well as applications in personalized medicine.