Inspecting De Novo Sequencing (Supernovo) Results – Protein Metrics

The de novo sequencing workflow in Byos sequences monoclonal antibodies from high-resolution mass spectrometry data without any prior sequence information (reference). A completed run produces two output files: a project file (.blgc), which opens in Byos, and an .html file saved in the same location. Inspect both files when reviewing results.

In the project file, the Protein Coverage view provides the most relevant information:

Depth of coverage in the constant region

We expect to see a high number for "Depth of coverage in the constant region", ideally >20. A lower number means either 1) there aren't enough peptides and fragment ions in the constant region, or 2) there wasn't a good starting template in Supernovo's libraries. The latter can occur if the antibody comes from an uncommon species (i.e., not human, mouse, rat, rabbit, or hamster) or if it is a highly engineered recombinant mAb.

If the number is <20, the determined sequence of the variable region can still be accurate, but it requires careful inspection and validation.

Fragmentation summary

The blue vertical bars between residues represent the total fragment ions observed at each peptide bond. If few or no ions are observed, the residues on both sides of that bond are highlighted yellow (medium confidence) or red (low confidence). Pay particular attention to the CDRs. In the above example, CDR-H3 shows low fragment coverage, so we would click on each peptide covering this region and inspect the MS/MS, MS1, and XIC plots to verify that the determined sequence is correct (in this case, it is). There are no fragment ions at the N-terminus, hence QV is highlighted red. This is expected because of N-terminal Q cyclization (pyroglutamate formation), so it is not a concern.

Digestion summary

The vertical orange bars represent the cumulative peptide cleavages observed at each location. For example, if one of the samples was digested with trypsin, we expect to see a large orange bar at the C-terminal side of K and R. Low bars at expected cleavage sites indicate a likely sample prep problem.
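The bar heights described above can be approximated from a list of identified peptides. The following is a minimal sketch and not Byos functionality: the function names, the toy sequence, and the simple substring search (which only locates the first occurrence of each peptide) are all assumptions made for illustration.

```python
from collections import Counter

def cleavage_counts(protein, peptides):
    """Count observed cleavages after each residue position (0-based).

    A peptide implies one cleavage just before its first residue and one
    just after its last residue, unless it sits at a protein terminus.
    """
    counts = Counter()
    for pep in peptides:
        start = protein.find(pep)  # assumption: first occurrence only
        if start == -1:
            continue  # peptide does not map to this sequence
        end = start + len(pep)  # index just past the last residue
        if start > 0:
            counts[start - 1] += 1  # cleavage after the preceding residue
        if end < len(protein):
            counts[end - 1] += 1  # cleavage after the peptide's last residue
    return counts

def fraction_at_expected_sites(protein, peptides, residues="KR"):
    """Fraction of observed cleavages that fall C-terminal to the given residues."""
    counts = cleavage_counts(protein, peptides)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    expected = sum(n for pos, n in counts.items() if protein[pos] in residues)
    return expected / total

# Toy example (hypothetical sequence and peptides)
protein = "ACDKEFGRHIL"
peptides = ["ACDK", "EFGR", "HIL", "EFG"]
print(fraction_at_expected_sites(protein, peptides))
```

For a tryptic sample, a low fraction from a tally like this would correspond to the low orange bars at K and R that point to a sample prep problem.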

Protein Termini

Both the N- and C-termini of the protein sequence can show low fragment coverage and extensions. Low fragment coverage is often expected here: sequences in these regions are determined by template sequence homology and precursor ion mass evidence rather than by MS/MS fragmentation.

Extensions are also commonly observed. Samples often contain low-abundance species that arise from a partially cleaved signal peptide at the N-terminus, or from little-known phenomena at the C-terminus that result in 1-2 residue extensions. Sensitive instruments easily pick these species up, and de novo sequencing can determine their actual sequences. The user can often safely ignore the extensions. If it is unclear whether the termini have extensions, the alignment in the .html output file helps by showing where the template sequence begins and ends.
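To illustrate the idea of reading extensions off an alignment, here is a rough sketch that compares a determined sequence with a template and reports any overhang at either terminus. It uses Python's general-purpose `difflib.SequenceMatcher` rather than a proper sequence aligner, and the function name and example sequences are hypothetical; it does not reproduce how the .html alignment is generated.

```python
from difflib import SequenceMatcher

def terminal_extensions(determined, template):
    """Return (n_extension, c_extension) of `determined` relative to `template`.

    Returns None if the two sequences share no matching block.
    """
    matcher = SequenceMatcher(None, determined, template, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    first, last = blocks[0], blocks[-1]
    # Residues in `determined` that precede the start of the template
    n_len = max(first.a - first.b, 0)
    # Residues left after the last match, minus what the template still has
    det_tail = len(determined) - (last.a + last.size)
    tmpl_tail = len(template) - (last.b + last.size)
    c_len = max(det_tail - tmpl_tail, 0)
    n_ext = determined[:n_len]
    c_ext = determined[len(determined) - c_len:] if c_len else ""
    return n_ext, c_ext

# Hypothetical example: two extra residues at the N-terminus, one at the C-terminus
print(terminal_extensions("HSQVQLVESPGKA", "QVQLVESPGK"))
```

A non-empty result only flags that the determined sequence runs past the template. Whether the extension reflects a real low-abundance species still has to be judged from the peptide evidence.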

Leucine/Isoleucine Differentiation

Differentiating leucine (L) from isoleucine (I) is difficult in mass spectrometry-based de novo sequencing because the two amino acids are isomers and their residues have the same monoisotopic mass of 113.08406 Da. However, with samples digested by multiple proteases and data acquired on an instrument that can generate EThcD spectra, I and L can be differentiated. Supernovo uses the characteristic EThcD fragment ion peaks, together with the presence or absence of cleavage at the C-terminal side of these residues, to distinguish I from L. For EThcD analyses, the output .html file summarizes the confidence of the I/L differentiation along with other metrics and graphics. In the below figure, the actual sequence (highlighted in grey) and the de novo predicted sequence (highlighted in black) are shown with an additional row giving the I/L confidence. The color of each I/L indicates the confidence: green is high, orange is medium, and red is low. Regions highlighted in yellow are the Complementarity Determining Regions (CDRs) of the antibody.
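As general background on why EThcD helps, and not a description of Supernovo's internal scoring: z-ions that begin with L or I can lose part of the side chain to give w-ions, and the lost radical differs (C3H7, about 43.0548 Da, for leucine; C2H5, about 29.0391 Da, for isoleucine). The sketch below computes the shared residue mass and the expected w-ion positions. The function names and the 0.01 m/z tolerance are assumptions chosen for illustration.

```python
# Monoisotopic atomic masses in Da
C, H, N, O = 12.0, 1.00782503, 14.0030740, 15.9949146

# Leu and Ile share the residue composition C6H11NO, hence the same mass
LEU_ILE_RESIDUE = 6 * C + 11 * H + N + O

# Side-chain radical losses that convert a z-ion into a w-ion
LOSS_LEU = 3 * C + 7 * H  # isopropyl radical, C3H7
LOSS_ILE = 2 * C + 5 * H  # ethyl radical, C2H5

def expected_w_ions(z_ion_mz, charge=1):
    """Expected w-ion m/z for each candidate residue, given the z-ion m/z."""
    return {
        "L": z_ion_mz - LOSS_LEU / charge,
        "I": z_ion_mz - LOSS_ILE / charge,
    }

def call_leu_or_ile(z_ion_mz, observed_mz, tol=0.01, charge=1):
    """Return 'L' or 'I' if exactly one diagnostic w-ion is observed, else None."""
    expected = expected_w_ions(z_ion_mz, charge)
    hits = [
        aa for aa, mz in expected.items()
        if any(abs(mz - obs) <= tol for obs in observed_mz)
    ]
    return hits[0] if len(hits) == 1 else None

print(f"Leu/Ile residue mass: {LEU_ILE_RESIDUE:.5f} Da")
print(call_leu_or_ile(500.0, [456.945, 300.0]))
```

Returning `None` when neither or both diagnostic peaks are present mirrors the low-confidence case, where other evidence such as digestion specificity or the template has to decide.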

In the absence of EThcD data, Supernovo can still differentiate I/L based on digestion specificity and the template sequence. If chymotrypsin and pepsin are used, for example, we expect to see more peptides cleaved at the C-terminal side of L than of I, even though the specificity is low. Validation requires careful inspection of the protein coverage image, including nearby cleavage sites (F, L, W, Y), since cleavage there could produce peptides too small to be detected.

Occasionally the user may encounter a sequence segment that doesn't have enough evidence to make the I/L determination. In these cases, we recommend sticking to the template sequence. It is rare to see a Leu mutated to an Ile or vice versa, although it does happen. If there is no template at all (e.g., CDR-H3), the only option may be to express both versions of the recombinant mAb and make the final determination via an activity assay.

Other Isobaric masses

Some other common mistakes due to isobaric masses are:

N vs GG: The alignment in the .html file shows which one is correct. If there is no fragment ion between the two Gs and the template sequence has N, it is most likely an N.

D vs N[+1]: If all the observed Ns are N[+1], i.e., deamidated, question the source of the antibody and consider its structure as well. The sequence may simply be misdetermined, with the original residue being a D. However, mAbs that have been exposed to high pH, high temperature, long storage, etc., can indeed show 100% deamidation at sites with a high risk of deamidation. NG and NS motifs are the most common, but NN, NH, NA, etc. also occur.

G vs IAM: Overalkylation by iodoacetamide can sometimes be mistaken for a G insertion. Once again, the .html file alignment is helpful. Using iodoacetic acid (C+58) or, even better, an isotopically labeled alkylating agent (C+59) reduces these mistakes. These reagents are also recommended for sequence variant analysis.

Q vs K: These are not exactly isobaric (Δ ≈ 0.0364 Da) and can usually be differentiated by high-resolution MS/MS data.
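The mass coincidences in this list can be checked directly from elemental compositions. The sketch below uses standard monoisotopic atomic masses and residue formulas. The dictionary names and the helper `mass` are illustrative and not part of Byos or Supernovo.

```python
# Monoisotopic atomic masses in Da
ATOM = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

def mass(formula):
    """Monoisotopic mass of an elemental composition given as {element: count}."""
    return sum(ATOM[el] * n for el, n in formula.items())

# Residue compositions (amino acid minus water)
RESIDUE = {
    "G": {"C": 2, "H": 3, "N": 1, "O": 1},
    "N": {"C": 4, "H": 6, "N": 2, "O": 2},
    "D": {"C": 4, "H": 5, "N": 1, "O": 3},
    "Q": {"C": 5, "H": 8, "N": 2, "O": 2},
    "K": {"C": 6, "H": 12, "N": 2, "O": 1},
}

# Modification deltas
DEAMIDATION = {"H": -1, "N": -1, "O": 1}             # N[+1]
CARBAMIDOMETHYL = {"C": 2, "H": 3, "N": 1, "O": 1}   # iodoacetamide, +57
CARBOXYMETHYL = {"C": 2, "H": 2, "O": 2}             # iodoacetic acid, +58

# Mass difference for each pair discussed above
PAIRS = {
    "N vs GG": mass(RESIDUE["N"]) - 2 * mass(RESIDUE["G"]),
    "D vs N[+1]": mass(RESIDUE["D"]) - (mass(RESIDUE["N"]) + mass(DEAMIDATION)),
    "G vs carbamidomethyl": mass(RESIDUE["G"]) - mass(CARBAMIDOMETHYL),
    "G vs carboxymethyl": mass(CARBOXYMETHYL) - mass(RESIDUE["G"]),
    "K vs Q": mass(RESIDUE["K"]) - mass(RESIDUE["Q"]),
}

for name, delta in PAIRS.items():
    print(f"{name:22s} |delta| = {abs(delta):.5f} Da")
```

The first three pairs have identical elemental compositions, so no mass accuracy can separate them and only fragmentation or template evidence helps. Carboxymethylation differs from G by about 0.984 Da, which is why switching the alkylating agent removes the G ambiguity. K and Q differ by about 0.0364 Da, which is within reach of high-resolution data.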