How should I prepare my data for Byonic?
Nothing is required. Byonic reads raw data in the vendor raw formats from Thermo, Waters, Sciex, Bruker, Shimadzu, and Agilent. Optionally, you may want to run the data through Preview first; if its scatter plots of precursor and fragment mass error vs. m/z reveal systematic m/z measurement errors, ask Preview to recalibrate the data.
Should I de-isotope my data?
No! Byonic handles isotope peaks internally. De-isotoping the spectra beforehand using, for example, Mascot Distiller, destroys valuable information. De-isotoping is an especially bad idea for ETD spectra, which often have peaks (c–1 and z+1 peaks) that lead de-isotoping algorithms astray.
How should I choose search parameters?
Set mass tolerances appropriate for the type of instrument, for example, 10 ppm precursor tolerance for a high resolution instrument and 0.3 Dalton fragment tolerance for ion trap fragmentation. Tolerances can be set in either Da or ppm, as appropriate for the instrument. Preview’s mass error plots can help you choose these tolerances. Preview’s m/z recalibration can remove systematic errors so that data can be run with tighter tolerances, for example, 5 ppm instead of 10 ppm tolerance for a high resolution instrument. Tight tolerances offer a significant advantage for difficult searches, for example, resolving nearly isobaric modifications such as sulfation and phosphorylation, or identifying glycopeptides with poor fragmentation.
Set digestion specificity based on the prevalence of nonspecific digestion and the complexity of the search. If the modification complexity of the search is high, as in wildcard, glycosylation, or oxidative footprinting searches, it is best to avoid the extra complexity of searching for nonspecific digestion, unless the nonspecific digestion rate is high (say, over 20% of all peptides).
Set modifications based upon the prevalence reported by Preview and the goal of the study. If the goal of the study is phosphorylation site identification, enable up to 3 or even 4 phosphorylation sites per peptide, and avoid other modifications unless they are prevalent. If the goal of the study is simply protein identification, it is best to enable only the most common modifications (for example, oxidized methionine and deamidation). Be especially alert to over-alkylation; in some samples, over-alkylation is so common that the majority of peptides carry iodoacetamide artifacts. Some modifications are costly (for example, sodiation on any residue as opposed to just E and D), but others (such as pyro-glu on N-terminal glutamine) barely increase the size of the search space.
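The relationship between ppm and Da tolerances mentioned above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Byonic's implementation; the function names are hypothetical, and the phosphorylation and sulfation mass shifts are standard monoisotopic values assumed here rather than taken from the FAQ.

```python
# Standard monoisotopic mass shifts in Da (assumed values, not from Byonic).
PHOSPHO_DA = 79.96633  # HPO3
SULFO_DA = 79.95682    # SO3


def ppm_to_da(mass: float, ppm: float) -> float:
    """Absolute tolerance (same unit as `mass`) equivalent to `ppm` at that mass."""
    return mass * ppm * 1e-6


def da_to_ppm(mass: float, delta_da: float) -> float:
    """Relative size, in ppm, of an absolute difference `delta_da` at `mass`."""
    return delta_da / mass * 1e6


def within_tolerance(observed: float, theoretical: float,
                     tol: float, unit: str = "ppm") -> bool:
    """True if the observed value is within `tol` (ppm or Da) of the theoretical one."""
    limit = ppm_to_da(theoretical, tol) if unit == "ppm" else tol
    return abs(observed - theoretical) <= limit


if __name__ == "__main__":
    gap_da = PHOSPHO_DA - SULFO_DA
    for peptide_mass in (1000.0, 1500.0, 3000.0):
        gap_ppm = da_to_ppm(peptide_mass, gap_da)
        print(f"{peptide_mass:.0f} Da peptide: phospho/sulfo gap = {gap_ppm:.1f} ppm")
```

Under these assumed masses, the gap between the two modifications is a little over 6 ppm for a 1500 Da peptide, which suggests why a 5 ppm precursor tolerance can separate them where a 10 ppm tolerance may not.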
What is a focused database?
Byonic enables the user to output a small protein database containing only the higher-ranking proteins (whether target or decoy) from a first search, along with appropriate decoys. A database specifically focused on the proteins in the sample under study improves speed and accuracy for subsequent searches. Focused databases are especially useful for wide modification searches, such as glycosylation and wildcard searches. We do not, however, advocate the use of a focused database for every study; they are unnecessary for most searches. Similarly, there is no need to make the first search very narrow; enable common modifications and nonspecific digestion as appropriate.
What is a wildcard modification?
A wildcard modification is nonspecific in both mass and (optionally) residue type; this is Byonic’s version of blind modification search. Adding a wildcard with a mass range of 100 Daltons increases the search time approximately 100-fold, so it is best to add a wildcard only on small searches, for example, fully tryptic searches with few other modifications. A wildcard can, however, be restricted to apply only to specific residues.
When should I use a wildcard modification?
We use a wildcard modification most often in a final clean-up search to be sure we haven’t missed anything interesting. We take out most other modifications and search against a focused database so that the wildcard search does not take too long. Wildcard search can also be used to find sequence variants. On the other hand, wildcard matches tend to be approximate, rather than exact: the wildcard modification is often misplaced, off by one Dalton, or the sum of two closely spaced modifications. Be alert for approximate answers such as EV[–18]PQLEVTK, where –18 almost surely belongs on E not V, and L[–113]EDEFVEVTK, where the right answer is surely EDEFVEVTK. Finally, we use wildcard search to solve mystery spectra; on a well-sequenced organism almost all spectra are at most one wildcard away from a database sequence.
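One way to screen for the second kind of approximate answer above (a wildcard that simply cancels a residue, as in L[–113]EDEFVEVTK) is to compare the wildcard delta with the mass of the residue that carries it. The sketch below is a hypothetical helper, not part of Byonic; it assumes the bracketed-delta notation used in the examples above and standard monoisotopic residue masses.

```python
import re

# Standard monoisotopic residue masses in Da (assumed values, not from Byonic).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# A residue letter followed by a bracketed mass delta, e.g. V[-18].
WILDCARD = re.compile(r"([A-Z])\[([+-]?\d+(?:\.\d+)?)\]")


def parse_wildcard(peptide: str):
    """Return (index, residue, delta) for the first bracketed delta, or None."""
    # Normalize typographic minus signs (en dash, minus sign) to ASCII '-'.
    text = peptide.replace("\u2013", "-").replace("\u2212", "-")
    match = WILDCARD.search(text)
    if match is None:
        return None
    # Index of the modified residue in the unmodified sequence.
    index = len(re.sub(r"\[[^\]]*\]", "", text[:match.start()]))
    return index, match.group(1), float(match.group(2))


def looks_like_deletion(peptide: str, tol_da: float = 1.0) -> bool:
    """True if the wildcard delta roughly cancels the residue it sits on."""
    parsed = parse_wildcard(peptide)
    if parsed is None:
        return False
    _, residue, delta = parsed
    return abs(delta + RESIDUE_MASS[residue]) <= tol_da
```

With these assumptions, `looks_like_deletion("L[–113]EDEFVEVTK")` flags the match as a probable residue deletion, while `EV[–18]PQLEVTK` is not flagged and still needs manual inspection to decide where the –18 belongs.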
Should I search my data more than once?
Probably. It can be a good idea to bracket your search with several settings of the crucial parameters when trying to get the most from the data. Even on data with overall 5 ppm precursor mass accuracy, there will be a few valid identifications with much larger errors, due to interfering MS1 peaks, mixture spectra, and so forth.
Should I combine multiple search engines?
Usually there is little to gain. In our experiments, other search engines find very few valid spectrum identifications missed by Byonic, typically fewer than 1%.
How does Byonic compute p-values?
Byonic computes both peptide-spectrum match (PSM) and protein p-values, assuming simple models of random matches. For PSMs, Byonic assumes that random scores are independent and identically distributed (i.i.d.) draws from a probability distribution with an exponential right-hand tail. This distribution depends only upon the size of the search (number of modifications, digestion specificity, size of the protein database, and so forth), and not upon the spectrum itself. Byonic reports the log base 10 of the p-value, so that a LogProb (log p-value) of –2.0 or better should occur by chance on only about one out of 100 spectra. For proteins, Byonic computes the expected total LogProb of PSMs hitting each protein, assuming that random PSMs are distributed uniformly over the protein database. The protein LogProb is the excess of the total LogProb over the expected amount (that is, how much more negative it is). Proteins are ranked from most confident on down according to LogProb.
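To make the LogProb scale concrete, the following sketch converts a LogProb to a p-value and to an expected number of chance matches, and shows how an exponential right-hand tail turns a raw score into a LogProb. The tail parameters `rate` and `offset` are hypothetical placeholders; the FAQ does not say how Byonic fits its distribution, so this is only an illustration of the model described above.

```python
import math


def logprob_to_p(logprob: float) -> float:
    """Convert a log10 p-value (LogProb) to a p-value."""
    return 10.0 ** logprob


def expected_chance_hits(logprob: float, n_spectra: int) -> float:
    """Expected number of spectra reaching this LogProb purely by chance."""
    return n_spectra * logprob_to_p(logprob)


def tail_logprob(score: float, rate: float, offset: float) -> float:
    """LogProb under an assumed exponential tail P(S >= s) = exp(-rate * (s - offset)).

    `rate` and `offset` are hypothetical parameters that would depend on the
    size of the search; scores at or below `offset` get LogProb 0 (p = 1).
    """
    if score <= offset:
        return 0.0
    return -rate * (score - offset) / math.log(10.0)


if __name__ == "__main__":
    for lp in (-1.0, -2.0, -4.0):
        hits = expected_chance_hits(lp, 10_000)
        print(f"LogProb {lp:+.1f}: p = {logprob_to_p(lp):.4g}, "
              f"about {hits:.3g} chance hits in 10,000 spectra")
```

In this model each additional unit of score above the offset lowers the LogProb by a fixed amount, which is what an exponential tail implies.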
How does Byonic estimate the False Discovery Rate (FDR)?
The False Discovery Rate (FDR) in a list of identifications (either proteins or PSMs) is the number of incorrect identifications divided by the total number of identifications. Byonic estimates PSM FDR using the target/decoy approach, which is the de facto standard for significance testing in proteomics. We have devised a method called two-dimensional FDR (2D FDR; see http://www.ncbi.nlm.nih.gov/pubmed/22010998) that can take into account protein-level information when computing PSM FDR, without biasing the FDR estimate. Two-dimensional FDR gives greater sensitivity and specificity than other methods because it can retain lower scoring PSMs to high-ranking proteins (which are likely to be correct) yet discard higher scoring PSMs to low-ranking proteins (which are likely to be incorrect).
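The basic target/decoy estimate can be sketched as follows. This is a generic one-dimensional version that uses the common decoys-over-targets estimate; it is not Byonic's 2D FDR, and the FAQ does not give the exact formula Byonic applies. The function names and the (score, is_decoy) representation are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

PSM = Tuple[float, bool]  # (score, is_decoy); higher score means more confident


def estimate_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate FDR among target PSMs scoring at or above `threshold`.

    Uses the simple estimate (#decoys / #targets), capped at 1.0, on the
    assumption that decoy hits are about as frequent as incorrect target hits.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def threshold_for_fdr(psms: List[PSM], max_fdr: float) -> Optional[float]:
    """Lowest observed score whose estimated FDR is at most `max_fdr`, or None."""
    best = None
    for candidate in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, candidate) <= max_fdr:
            best = candidate  # keep lowering the cutoff while the estimate allows
    return best
```

A typical use is to call `threshold_for_fdr(psms, 0.01)` and report only target PSMs at or above the returned score. A protein-aware method such as 2D FDR would additionally use the rank of the protein each PSM maps to, which this sketch ignores.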