2.1 Prepare MS/MS Identifications

2.1.1 Read MS-GF+ Data

# Read MS-GF+ data
data_package_num <- 3626 # phospho
msnid <- read_msgf_data_from_DMS(data_package_num)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 612667 at 55 % FDR
## #peptides: 396540 at 75 % FDR
## #accessions: 121521 at 98 % FDR

2.1.2 Correct Isotope Selection Error

# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)

2.1.3 Remove Unmodified Peptides

Generally, we remove unmodified peptides before any filtering steps. However, unmodified peptides are removed automatically in Section 2.1.9, so this step can be skipped if we need to tally the number of modified and unmodified peptides toward the end of processing.

In this case, a phosphorylated amino acid is marked by a * immediately after it in the sequence. We can filter out peptides that do not contain this symbol with apply_filter. In regular expressions, * is a special character (a metacharacter), so it must be escaped with a backslash to be matched literally. Each backslash must in turn be doubled at every level of string quoting, and because the pattern sits in a string nested inside another string ("''"), four backslashes are needed in total. Non-metacharacters need no backslashes.

# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 537749 at 57 % FDR
## #peptides: 353634 at 76 % FDR
## #accessions: 118817 at 98 % FDR
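
The escaping rules described above can be checked in base R without PlexedPiper. The sketch below uses two made-up peptide strings (hypothetical values, not taken from this dataset). It mimics the evaluation of a filter string with eval(parse()), which is only an assumption about how such a string is interpreted and not a description of the internals of apply_filter.

```r
# Two hypothetical peptides: the first is phosphorylated, the second is not
peptide <- c("R.AAAS*PGTEK.M", "K.LLDEGR.A")

# Inside a single R string, one level of escaping is enough:
# "\\*" is the two-character regex \* that matches a literal asterisk
direct <- grepl("\\*", peptide)

# fixed = TRUE disables regex interpretation, so no escaping is needed
literal <- grepl("*", peptide, fixed = TRUE)

# A filter written as a string is parsed a second time, so every
# backslash is doubled again: four in the source, two after the first parse
filter_string <- "grepl('\\\\*', peptide)"
cat(filter_string, "\n")
nested <- eval(parse(text = filter_string))

print(direct)
print(nested)
```

All three approaches flag the same peptides, so the choice between them is a matter of which quoting level the pattern has to pass through.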

2.1.4 Remove Contaminants

# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 537572 at 57 % FDR
## #peptides: 353489 at 76 % FDR
## #accessions: 118797 at 98 % FDR

2.1.5 Improve Phosphosite Localization

Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see AScore Job Creation for how to set it up. The fetched object is a data.frame that links datasets, scans, and original PTM localizations to newly suggested locations. Importantly, it contains an AScore column that “measures the probability of correct phosphorylation site localization” (Beausoleil et al., 2006). An AScore > 17 is considered confident.

# Filter PTMs by AScore - only for phospho data
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)
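
To make the AScore > 17 rule of thumb concrete, here is a small base-R sketch on a made-up data.frame. The column names and values are hypothetical and do not reproduce the actual layout returned by get_AScore_results.

```r
# Hypothetical AScore table: one row per candidate phosphosite
toy_ascore <- data.frame(
  Scan    = c(101, 101, 202, 303),
  Peptide = c("AAS*PTK", "AASPT*K", "GS*LLR", "TT*PEK"),
  AScore  = c(52.3, 4.1, 17.5, 6.2),
  stringsAsFactors = FALSE
)

# Flag sites that pass the confidence threshold mentioned in the text
ascore_cutoff <- 17
toy_ascore$confident <- toy_ascore$AScore > ascore_cutoff

# Keep only the confidently localized sites
confident_sites <- toy_ascore[toy_ascore$confident, ]
print(confident_sites)
```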

2.1.6 MS/MS ID Filter: Peptide Level

# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid, level = "peptide", fdr.max = 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 77741 at 0.51 % FDR
## #peptides: 23118 at 1 % FDR
## #accessions: 15964 at 4.8 % FDR

2.1.7 MS/MS ID Filter: Protein Level

This step is skipped for PTM data because the cross-tab is not created at the protein level.

2.1.8 Inference of Parsimonious Protein Set

If a protein was detected in the global proteomics results, we may be more confident that it will appear in the PTM results. We can perform prioritized inference of the protein set to ensure that any protein that is reported in the global cross-tab and is present in the PTM MSnID after filtering will be included in the final PTM MSnID. To do this, we set the proteins from the global cross-tab as the prior. By default, peptides are allowed to match multiple proteins in the prior. If such duplicate matches should not be allowed, we can set the refine_prior argument to TRUE.

# Proteins from global proteomics cross-tab
load("./data/3442_global_proteins.RData")

# Prioritized inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
                                       prior = global_proteins, 
                                       refine_prior = FALSE)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 77738 at 0.51 % FDR
## #peptides: 23117 at 0.99 % FDR
## #accessions: 4419 at 4.8 % FDR
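
The idea of prioritized inference can be illustrated with a toy greedy set-cover in base R. This is only a sketch under simplifying assumptions: the peptide-to-protein map, the helper names greedy_cover and infer_with_prior, and the greedy strategy itself are hypothetical and are not the algorithm used by infer_parsimonious_accessions.

```r
# Hypothetical peptide-to-protein map; shared peptides appear on several rows
pep2prot <- data.frame(
  peptide   = c("p1", "p1", "p2", "p2", "p3", "p4", "p4"),
  accession = c("A",  "B",  "B",  "C",  "C",  "A",  "D"),
  stringsAsFactors = FALSE
)

# Repeatedly pick the protein that explains the most unexplained peptides
greedy_cover <- function(map) {
  chosen <- character(0)
  while (nrow(map) > 0) {
    counts <- table(map$accession)
    best <- names(counts)[which.max(counts)]
    chosen <- c(chosen, best)
    explained <- map$peptide[map$accession == best]
    map <- map[!map$peptide %in% explained, ]
  }
  chosen
}

# Cover peptides with prior proteins first, then handle the remainder
infer_with_prior <- function(map, prior) {
  first <- greedy_cover(map[map$accession %in% prior, ])
  explained <- map$peptide[map$accession %in% first]
  c(first, greedy_cover(map[!map$peptide %in% explained, ]))
}

no_prior   <- greedy_cover(pep2prot)
with_prior <- infer_with_prior(pep2prot, prior = "B")
print(no_prior)
print(with_prior)
```

In this toy example, protein B is not needed to explain the peptides on its own merits, but it is retained once it is supplied as the prior.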


2.1.9 Map Sites to Protein Sequences

MSnID::map_mod_sites creates a number of columns that describe how the modification sites map onto the protein sequences. The most important of these for the user is SiteID. names(fst) must match accessions(msnid); usually, we have to modify the names by removing everything after the first word.

# Create AAStringSet
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst))]
# First 6 names
head(names(fst))
## [1] "NP_783171.2 cathepsin R precursor [Rattus norvegicus]"                    
## [2] "NP_001101862.2 zinc finger protein ZIC 2 [Rattus norvegicus]"             
## [3] "NP_113721.4 UDP-glucuronosyltransferase 2B2 precursor [Rattus norvegicus]"
## [4] "NP_714948.1 Ly-49 stimulatory receptor 3 [Rattus norvegicus]"             
## [5] "NP_001000704.1 olfactory receptor Olr931 [Rattus norvegicus]"             
## [6] "NP_001000638.1 olfactory receptor Olr652 [Rattus norvegicus]"
# Modify names to match accessions(msnid)
# Remove any space followed by any number of characters
names(fst) <- sub(" .*", "", names(fst))
# First 6 names
head(names(fst))
## [1] "NP_783171.2"    "NP_001101862.2" "NP_113721.4"    "NP_714948.1"   
## [5] "NP_001000704.1" "NP_001000638.1"
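
Before mapping, it can help to confirm that every accession has a matching FASTA name. The base-R sketch below works on plain character vectors: the two headers are copied from the output above, while the third accession is a made-up value added to show what a mismatch looks like.

```r
# Two FASTA headers from the output above
fasta_names <- c(
  "NP_783171.2 cathepsin R precursor [Rattus norvegicus]",
  "NP_113721.4 UDP-glucuronosyltransferase 2B2 precursor [Rattus norvegicus]"
)

# Accessions as they might appear in the identifications;
# "XP_000000.1" is hypothetical and has no FASTA entry
psm_accessions <- c("NP_783171.2", "NP_113721.4", "XP_000000.1")

# Keep only the first word of each header
short_names <- sub(" .*", "", fasta_names)

# Accessions without a matching FASTA name
unmatched <- setdiff(psm_accessions, short_names)
print(unmatched)
```

With the real objects, the analogous check would presumably be setdiff(accessions(msnid), names(fst)), which should return an empty vector.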

The names are in the proper format, so we can continue with the main mapping call. This will also remove any unmodified peptides, if Section 2.1.3 was skipped.

# Main mapping call
msnid <- map_mod_sites(object = msnid, 
                       fasta = fst, 
                       accession_col = "accession", 
                       peptide_mod_col = "peptide",
                       mod_char = "*", # asterisk for phosphorylation
                       site_delimiter = ";") # semicolon between multiple sites

Table 2.1 shows the first 6 rows of the processed MS-GF+ output.

Table 2.1: First 6 rows of the processed MS-GF+ results.
Dataset ResultID Scan FragMethod SpecIndex Charge PrecursorMZ DelM DelM_PPM MH OriginalPeptide Protein NTT DeNovoScore MSGFScore MSGFDB_SpecEValue Rank_MSGFDB_SpecEValue EValue QValue PepQValue IsotopeError accession calculatedMassToCharge chargeState experimentalMassToCharge isDecoy spectrumFile spectrumID pepSeq peptide maxAScore msmsScore absParentMassErrorPPM First_AA Last_AA First_AA_First Last_AA_First ProtLen ModShift ModAAs SiteLoc Site SiteCollapsed SiteCollapsedFirst SiteID
MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 12697 27321 HCD 2256 3 1045.124 0.003 0.858 3131.346 A.AAAAAGDS*DS*WDADTFSMEDPVRK.V NP_001071138.1 1 146 58 0 1 0 0 0 2 NP_001071138.1 1044.454 3 1044.455 FALSE MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 27321 AAAAAGDSDSWDADTFSMEDPVRK A.AAAAAGDSDS*WDADT*FSMEDPVRK.V 0.000 Inf 1.057 5 28 5 28 259 9, 14 S, T 14, 19 S14, T19 S14,T19 S14,T19 NP_001071138.1-S14;T19
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 875 23519 HCD 264 3 952.144 0.004 1.538 2854.412 R.AAAASAAEAGIAT*PGTEDSDDALLK.M XP_006232986.1 2 165 129 0 1 0 0 0 0 XP_006232986.1 952.142 3 952.144 FALSE MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 23519 AAAASAAEAGIATPGTEDSDDALLK R.AAAASAAEAGIAT*PGTEDSDDALLK.M 52.349 Inf 1.625 238 262 238 262 377 12 T 250 T250 T250 T250 XP_006232986.1-T250
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 12873 23508 HCD 2213 4 714.360 0.007 2.392 2854.412 R.AAAASAAEAGIAT*PGTEDSDDALLK.M XP_006232986.1 2 122 81 0 1 0 0 0 0 XP_006232986.1 714.358 4 714.360 FALSE MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 23508 AAAASAAEAGIATPGTEDSDDALLK R.AAAASAAEAGIAT*PGTEDSDDALLK.M 17.480 Inf 2.472 238 262 238 262 377 12 T 250 T250 T250 T250 XP_006232986.1-T250
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 1793 23803 HCD 349 3 952.472 -0.015 -5.362 2854.412 R.AAAASAAEAGIAT*PGTEDSDDALLK.M XP_006232986.1 2 146 116 0 1 0 0 0 1 XP_006232986.1 952.142 3 952.137 FALSE MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 23803 AAAASAAEAGIATPGTEDSDDALLK R.AAAASAAEAGIAT*PGTEDSDDALLK.M 52.349 Inf 5.277 238 262 238 262 377 12 T 250 T250 T250 T250 XP_006232986.1-T250
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 2731 23697 HCD 502 4 714.610 0.002 0.706 2854.412 R.AAAASAAEAGIAT*PGTEDSDDALLK.M XP_006232986.1 2 135 104 0 1 0 0 0 1 XP_006232986.1 714.358 4 714.359 FALSE MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 23697 AAAASAAEAGIATPGTEDSDDALLK R.AAAASAAEAGIAT*PGTEDSDDALLK.M 26.295 Inf 0.780 238 262 238 262 377 12 T 250 T250 T250 T250 XP_006232986.1-T250
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 4877 21265 HCD 935 4 800.403 0.006 1.871 3196.577 R.AAAASAAEAGIAT*PGTEGERDSDDALLK.M NP_112621.1 2 194 114 0 1 0 0 0 2 NP_112621.1 799.900 4 799.901 FALSE MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 21265 AAAASAAEAGIATPGTEGERDSDDALLK R.AAAASAAEAGIATPGT*EGERDSDDALLK.M 6.213 Inf 1.902 238 265 238 265 380 15 T 253 T253 T253 T253 NP_112621.1-T253
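
The relationship between the peptide, First_AA, SiteLoc, and SiteID columns in Table 2.1 can be reproduced with a few lines of base R. The helper toy_site_id below is hypothetical and only mirrors the pattern visible in the table (site position = First_AA + number of residues before the modified one); it is not the implementation of map_mod_sites, and it assumes the peptide contains at least one *.

```r
# Hypothetical helper that rebuilds a SiteID from a flanked peptide string
toy_site_id <- function(peptide, first_aa, accession) {
  # Drop the flanking residues, e.g. the leading "R." and trailing ".M"
  core <- sub("^.\\.", "", sub("\\..$", "", peptide))
  # Positions of the asterisks within the modified sequence
  star_pos <- gregexpr("*", core, fixed = TRUE)[[1]]
  # 1-based index of each modified residue once asterisks are removed
  res_idx <- star_pos - seq_along(star_pos)
  clean <- gsub("*", "", core, fixed = TRUE)
  aa <- substring(clean, res_idx, res_idx)
  # Shift peptide-level positions to protein-level positions
  site_loc <- first_aa + res_idx - 1
  paste0(accession, "-", paste0(aa, site_loc, collapse = ";"))
}

# Values taken from the first two distinct peptides in Table 2.1
id_single <- toy_site_id("R.AAAASAAEAGIAT*PGTEDSDDALLK.M", 238, "XP_006232986.1")
id_double <- toy_site_id("A.AAAAAGDSDS*WDADT*FSMEDPVRK.V", 5, "NP_001071138.1")
print(id_single)
print(id_double)
```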


2.1.10 Remove Decoy PSMs

# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 77347 at 0 % FDR
## #peptides: 22890 at 0 % FDR
## #accessions: 4216 at 0 % FDR
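
As a sanity check, the FDR values reported in Section 2.1.8 can be recovered from the counts before and after decoy removal. The sketch below assumes that FDR is estimated as the number of decoys divided by the number of targets, and that decoys are the only entries dropped between the two summaries; both are assumptions that happen to be consistent with the numbers shown here.

```r
# Counts reported in Section 2.1.8 (before) and in this section (after)
before <- c(PSMs = 77738, peptides = 23117, accessions = 4419)
after  <- c(PSMs = 77347, peptides = 22890, accessions = 4216)

# Entries that disappeared are assumed to be decoys
decoys  <- before - after
targets <- after

# Estimated FDR in percent: decoys relative to targets
fdr_pct <- 100 * decoys / targets
print(decoys)
print(round(fdr_pct, 2))
```

Rounded to the precision used by show(msnid), these values match the 0.51 %, 0.99 %, and 4.8 % reported earlier.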

References

Beausoleil, S. A., Villén, J., Gerber, S. A., Rush, J. and Gygi, S. P., A Probability-Based Approach for High-Throughput Protein Phosphorylation Analysis and Site Localization, Nature Biotechnology, vol. 24, no. 10, pp. 1285–92, accessed February 12, 2022, from https://www.nature.com/articles/nbt1240, October 2006. DOI: 10.1038/nbt1240