- Best practices for optimizing a proteomics database search.
- How-to for creating effective focused protein databases
The ideal protein database for Byonic™ or any other proteomics search engine is one that contains all the proteins represented in the tandem mass spectra, but no other proteins, except for decoys used to compute False Discovery Rate (FDR). A search against a too-large protein database takes more time and gives lower sensitivity (fewer true positives), and a search against a too-small protein database gives lower specificity (more false positives), because spectra from missing proteins are likely to find false matches to proteins that are in the database.
Laboratories working with one type of biological material, for example human plasma, would be well advised to put together a long-lasting database customized for their ongoing work; however, laboratories that run many different materials need a quick way to build a “focused” protein database for each new study. Byonic provides just such a method with the Create Focused Database option on the Processing nodes tab under Protein Output Options. Setting Create Focused Database to "Yes" in a first-pass search causes Byonic to write out a protein database in FASTA format in the objs folder inside the Results folder.
The database will have a name similar to Lophiotoma.acuta.focused_with_decoys.fasta; it can of course be renamed and moved to a more convenient folder. The database will have all the proteins found in Byonic’s search, that is, all the proteins (target or decoy) shown in Byonic’s protein list, along with sufficient decoys to balance the numbers of target and decoy proteins. This database can then be used for subsequent wide searches with nonspecific digestion, many known modifications, and/or a wildcard modification, and because it includes decoys the user should set Add Decoys to "No" for any subsequent Byonic search.
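Before reusing a focused database, it can be worth checking that targets and decoys are in fact roughly balanced. The following is a minimal sketch, not part of Byonic: the function name is hypothetical, and the decoy header prefix `>Reverse` is an assumption that should be checked against the headers in your own file.

```python
from typing import Iterable, Tuple


def count_targets_and_decoys(
    lines: Iterable[str], decoy_prefix: str = ">Reverse"
) -> Tuple[int, int]:
    """Count FASTA headers, split into (targets, decoys).

    decoy_prefix is an ASSUMPTION about how decoy entries are labeled;
    adjust it to match the headers in your focused database.
    """
    targets = 0
    decoys = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        if line.startswith(decoy_prefix):
            decoys += 1
        else:
            targets += 1
    return targets, decoys


# Tiny in-memory example with made-up entries (not real proteins).
example = [
    ">sp|P00001|EXAMPLE_A hypothetical target",
    "MKTAYIAKQR",
    ">Reverse >sp|P00001|EXAMPLE_A hypothetical decoy",
    "RQKAIYATKM",
]
print(count_targets_and_decoys(example))
```

To check a real file, pass an open file handle instead of the list, for example `with open(path) as fh: count_targets_and_decoys(fh)`, where `path` points to the focused FASTA file.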
The first-pass search is typically a narrow search, one that is expected to find all the proteins in the sample, but not necessarily all the peptides. For example, in a sample enriched for phosphopeptides, the first-pass search may allow phosphorylations but no other modifications, and subsequent searches may allow many more modifications along with one or two nonspecific peptide termini. There is no need to optimize the focused database precisely; a database double or triple the ideal size may slow down the search but will lose very few true positives, and leaving out some low-abundance proteins will cause very few false positives.
Here we show three searches using data from the ABRF iPRG 2012 study on post-translational modifications:
- Search (1): a one-pass search using a full protein database (the 42,450-protein database supplied by the study organizers) and 24 variable modifications.
- Search (2): a first-pass search using the full database and 4 variable modifications and Protein FDR set to 2% FDR (or 50 reverse count) on the Advanced tab.
- Search (3): a second-pass search using the focused database produced by search (2) with the same 24 variable modifications used in search (1).
Figure 1: Byonic’s Modifications tab for a 24-modification search against the full protein database. Here we are counting each combination of amino acid residue and mass delta as a distinct modification. Note the small trick of setting C[+57] as fixed and C[-57] as variable in order to allow for under-alkylation yet prefer complete alkylation.
All searches used 6 ppm precursor tolerance, 40 ppm fragment tolerance, and considered only fully tryptic peptides with a maximum of 2 missed cleavages. Figure 1 shows Byonic’s Modifications tab for search (1); the inputs for search (3) are identical except for the name of the FASTA file. Search (2) allowed only acetylated protein N-terminus, pyro-Glu on N-terminal E and Q, and under-alkylated C; these modifications do not apply to most peptides and hence barely slow down the search.
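To give a feel for what these tolerances mean in absolute terms, the sketch below converts a ppm tolerance into an m/z window. It assumes a symmetric window around the observed value; the function name and the example m/z of 1000 are illustrative only and do not come from the study.

```python
from typing import Tuple


def ppm_window(mz: float, ppm: float) -> Tuple[float, float]:
    """Return the (low, high) m/z bounds for a symmetric ppm tolerance."""
    delta = mz * ppm / 1e6  # 1 ppm is one millionth of the m/z value
    return mz - delta, mz + delta


# Illustrative m/z value; the tolerances are the ones used in the searches.
precursor_low, precursor_high = ppm_window(1000.0, 6.0)
fragment_low, fragment_high = ppm_window(1000.0, 40.0)
print(f"precursor window: {precursor_low:.3f} to {precursor_high:.3f}")
print(f"fragment window:  {fragment_low:.3f} to {fragment_high:.3f}")
```

At m/z 1000, a 6 ppm tolerance corresponds to plus or minus 0.006, and a 40 ppm tolerance to plus or minus 0.04.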
Comparative results for the three searches are shown in Table 1. Notice that searches (2) and (3) together take about 40 times less time than search (1) alone, yet search (3) gives about 4% more peptide-spectrum matches (PSMs) than search (1) with only a mild increase in FDR. FDR tends to increase in a focused database, because there are fewer “distractor” proteins (targets and decoys) for the spectra to match.
| Search | # PSMs | PSM FDR | # Unique Peptides | # Proteins | Running Time |
|---|---|---|---|---|---|
| One Pass (1) | 7122 | 0.0% | 4528 | 578 | 4 hr 56 min |
| First Pass (2) | 6211 | 0.0% | 4019 | 572 | 3 min 2 sec |
| Second Pass (3) | 7408 | 0.7% | 4685 | 571 | 4 min 12 sec |
Table 1: Byonic results for the three searches. # PSMs means the number of spectra assigned to target proteins when using Byonic’s automatic PSM thresholding; PSM FDR is Byonic’s estimate of the spectrum-level False Discovery Rate on proteins ranked above the top decoy protein; and # Proteins is the number of proteins in a list with 1% protein-level FDR. Both the One Pass and Second Pass results shown above are more sensitive and accurate than any of the 24 submissions in the ABRF study, many of which used multiple search engines.
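The speed-up and PSM gain quoted in the text can be reproduced directly from the Table 1 figures. The short sketch below does only that arithmetic; the helper name `to_seconds` is illustrative.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert a running time to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Running times and PSM counts copied from Table 1.
one_pass = to_seconds(hours=4, minutes=56)
first_pass = to_seconds(minutes=3, seconds=2)
second_pass = to_seconds(minutes=4, seconds=12)

# Two-pass total versus the single one-pass search.
speedup = one_pass / (first_pass + second_pass)

# Relative gain in PSMs of search (3) over search (1).
psm_gain = (7408 - 7122) / 7122

print(f"speed-up: {speedup:.1f}x")
print(f"PSM gain: {psm_gain:.1%}")
```

This gives a speed-up of roughly 41x and a PSM gain of about 4%, consistent with the rounded figures quoted above.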