2.1 Prepare MS/MS Identifications

2.1.1 Read MS-GF+ Data

# Read MS-GF+ data
data_package_num <- 3626 # phospho
msnid <- read_msgf_data_from_DMS(data_package_num)

show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 612667 at 55 % FDR
## #peptides: 396540 at 75 % FDR
## #accessions: 121521 at 98 % FDR

2.1.2 Correct Isotope Selection Error

# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)
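
As a rough illustration of what this correction amounts to, the sketch below uses the PrecursorMZ, Charge and IsotopeError values of four rows of Table 2.1 and shifts each precursor m/z by the stated number of isotope steps. It assumes that the correction subtracts IsotopeError times the 13C-12C mass difference, divided by the charge; this is an assumption about the arithmetic, not a description of the internals of correct_peak_selection. The variable names are made up for the example.

```r
# Mass difference between 13C and 12C, in Da
c13_c12 <- 1.0033548

# Values copied from rows 1, 4, 5 and 6 of Table 2.1
psm <- data.frame(
  PrecursorMZ  = c(1045.124, 952.472, 714.610, 800.403),
  Charge       = c(3, 3, 4, 4),
  IsotopeError = c(2, 1, 1, 2)
)

# experimentalMassToCharge reported for the same rows in Table 2.1
reported_mz <- c(1044.455, 952.137, 714.359, 799.901)

# Assumed correction: shift by the isotope error, scaled by charge
psm$corrected_mz <- psm$PrecursorMZ -
  psm$IsotopeError * c13_c12 / psm$Charge

# Differences from the reported values (inputs are rounded to 3 decimals)
max(abs(psm$corrected_mz - reported_mz))
```

Under this assumption, the corrected values agree with the experimentalMassToCharge column to within the rounding of the table.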

2.1.3 Remove Unmodified Peptides

Generally, we remove unmodified peptides before any filtering steps. However, they are also removed automatically by the mapping step in Section 2.1.9, so this step can be skipped if we need to tally the number of modified and unmodified peptides toward the end of processing.

In this case, a phosphorylated amino acid is marked by a * immediately after it in the sequence. We can filter out peptides that do not contain this symbol with apply_filter. In regular expressions, * is a metacharacter, so it must be escaped with a backslash to match a literal asterisk. In R strings, the backslash itself must also be escaped, and because the pattern is a single-quoted string nested inside a double-quoted string ("''"), the escaping is applied twice, which gives four backslashes. For non-metacharacters, no backslashes are needed.

# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 537749 at 57 % FDR
## #peptides: 353634 at 76 % FDR
## #accessions: 118817 at 98 % FDR
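
The two rounds of escaping can be checked with base R alone. The sketch below assumes that the filter string is parsed and evaluated as R code (which is how it is mimicked here with parse and eval); the peptide vector is a hypothetical stand-in for the peptide column.

```r
# Hypothetical peptides: two phosphorylated, one unmodified
peptide <- c("R.AAS*PGTK.M", "K.LLEDK.A", "R.GT*EGER.D")

# The filter exactly as written in the apply_filter call
filter_string <- "grepl('\\\\*', peptide)"

# First round of unescaping: the string holds two backslashes
cat(filter_string, "\n")

# Second round: parsing the inner string leaves the regex \*
is_phospho <- eval(parse(text = filter_string))
is_phospho

# Alternative without any escaping: match the asterisk as fixed text
is_phospho_fixed <- grepl("*", peptide, fixed = TRUE)
```

Both approaches flag the same peptides. With fixed = TRUE the pattern is not treated as a regular expression, so no backslashes are needed.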

2.1.4 Remove Contaminants

# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 537572 at 57 % FDR
## #peptides: 353489 at 76 % FDR
## #accessions: 118797 at 98 % FDR

2.1.5 Improve Phosphosite Localization

Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see AScore Job Creation for how to set it up. The fetched object is a data.frame that links datasets, scans, and original PTM localizations to newly suggested locations. Importantly, it contains an AScore column that “measures the probability of correct phosphorylation site localization” (Beausoleil et al., 2006). A localization with AScore > 17 is considered confident.

# Select the best PTM location by AScore (phospho data only)
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)
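
To make the cutoff concrete, the sketch below applies AScore > 17 to the maxAScore values of the six rows of Table 2.1. It only flags localizations and does not reproduce best_PTM_location_by_ascore; the variable names are made up for the example.

```r
# maxAScore values copied from the six rows of Table 2.1
max_ascore <- c(0.000, 52.349, 17.480, 52.349, 26.295, 6.213)

# Cutoff stated in the text
ascore_cutoff <- 17

# TRUE where the site localization would be considered confident
confident <- max_ascore > ascore_cutoff
table(confident)
```

Rows with a maxAScore below the cutoff are still present in Table 2.1, so any removal of low-confidence localizations would be a separate, explicit step.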

2.1.6 MS/MS ID Filter: Peptide Level

# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid, level = "peptide", fdr.max = 0.01)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 77741 at 0.51 % FDR
## #peptides: 23118 at 1 % FDR
## #accessions: 15964 at 4.8 % FDR

2.1.7 MS/MS ID Filter: Protein Level

This step is skipped for PTM data, since the cross-tab is not created at the protein level.

2.1.8 Inference of Parsimonious Protein Set

If a protein was detected in the global proteomics results, we may be more confident that it will appear in the PTM results. We can perform prioritized inference of the protein set so that any protein that is reported in the global cross-tab and is still present in the PTM MSnID after filtering will be included in the final PTM MSnID. To do this, we set the proteins from the global cross-tab as the prior. By default, peptides are allowed to match multiple proteins in the prior. If such duplicates are not allowed, we can set the refine_prior argument to TRUE.

# Proteins from global proteomics cross-tab
load("./data/3442_global_proteins.RData")

# Prioritized inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
                                       prior = global_proteins, 
                                       refine_prior = FALSE)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 77738 at 0.51 % FDR
## #peptides: 23117 at 0.99 % FDR
## #accessions: 4419 at 4.8 % FDR
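
One way to picture prioritized inference is as a greedy set cover that is run on the prior proteins first and then on everything else. The sketch below is a toy version under that assumption, with a hypothetical peptide-to-protein mapping and a hypothetical helper greedy_cover; it is not the algorithm used by infer_parsimonious_accessions.

```r
# Hypothetical peptide-to-protein mapping (one row per match)
map <- data.frame(
  peptide   = c("p1", "p1", "p2", "p2", "p3", "p4", "p4"),
  accession = c("A",  "B",  "A",  "C",  "C",  "B",  "D"),
  stringsAsFactors = FALSE
)

greedy_cover <- function(map, candidates = unique(map$accession)) {
  # Repeatedly pick the candidate that explains the most unassigned
  # peptides; ties go to the first accession in alphabetical order
  chosen <- map[0, ]
  remaining <- map
  repeat {
    pool <- remaining[remaining$accession %in% candidates, ]
    if (nrow(pool) == 0) break
    counts <- table(pool$accession)
    best <- names(counts)[which.max(counts)]
    hit <- pool[pool$accession == best, ]
    chosen <- rbind(chosen, hit)
    remaining <- remaining[!remaining$peptide %in% hit$peptide, ]
  }
  list(chosen = chosen, remaining = remaining)
}

# Hypothetical proteins seen in the global cross-tab
prior <- c("B", "C")

# Pass 1 uses only the prior; pass 2 covers whatever is left
pass1 <- greedy_cover(map, candidates = prior)
pass2 <- greedy_cover(pass1$remaining)
with_prior <- rbind(pass1$chosen, pass2$chosen)

# For comparison: the same procedure without a prior
without_prior <- greedy_cover(map)$chosen

sort(unique(with_prior$accession))
sort(unique(without_prior$accession))
```

In this toy mapping, the prior steers the peptides shared between A, B and C toward B and C, whereas the run without a prior also selects A.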

2.1.9 Map Sites to Protein Sequences

MSnID::map_mod_sites creates a number of columns describing the mapping of the modification sites onto the protein sequences. The most important of these for the user is SiteID. names(fst) must match accessions(msnid); usually, we will have to modify names(fst) to keep only the first word (the accession) and remove everything after it.

# Create AAStringSet
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst))]
# First 6 names
head(names(fst))

## [1] "NP_783171.2 cathepsin R precursor [Rattus norvegicus]"                    
## [2] "NP_001101862.2 zinc finger protein ZIC 2 [Rattus norvegicus]"             
## [3] "NP_113721.4 UDP-glucuronosyltransferase 2B2 precursor [Rattus norvegicus]"
## [4] "NP_714948.1 Ly-49 stimulatory receptor 3 [Rattus norvegicus]"             
## [5] "NP_001000704.1 olfactory receptor Olr931 [Rattus norvegicus]"             
## [6] "NP_001000638.1 olfactory receptor Olr652 [Rattus norvegicus]"

# Modify names to match accessions(msnid)
# Remove the first space and everything after it
names(fst) <- sub(" .*", "", names(fst))
# First 6 names
head(names(fst))

## [1] "NP_783171.2"    "NP_001101862.2" "NP_113721.4"    "NP_714948.1"   
## [5] "NP_001000704.1" "NP_001000638.1"
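
Before mapping, it can be useful to confirm that every accession has a matching FASTA entry. The sketch below shows such a check with base R, using two of the FASTA headers printed above and a hypothetical accession vector standing in for accessions(msnid).

```r
# Two of the FASTA headers printed above
fasta_names <- c(
  "NP_783171.2 cathepsin R precursor [Rattus norvegicus]",
  "NP_714948.1 Ly-49 stimulatory receptor 3 [Rattus norvegicus]"
)

# Keep only the first word of each header
short_names <- sub(" .*", "", fasta_names)

# Hypothetical accessions; the last one is made up and has no match
acc <- c("NP_783171.2", "NP_714948.1", "XP_000000.1")

# Accessions without a FASTA entry; ideally this is empty
unmatched <- setdiff(acc, short_names)
unmatched
```

With the real objects, the same check would be setdiff(accessions(msnid), names(fst)).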

The names are in the proper format, so we can continue with the main mapping call. This will also remove any unmodified peptides, if Section 2.1.3 was skipped.

# Main mapping call
msnid <- map_mod_sites(object = msnid, 
                       fasta = fst, 
                       accession_col = "accession", 
                       peptide_mod_col = "peptide",
                       mod_char = "*", # asterisk for phosphorylation
                       site_delimiter = ";") # semicolon between multiple sites

Table 2.1 shows the first 6 rows of the processed MS-GF+ output.

Table 2.1: First 6 rows of the processed MS-GF+ results.
Dataset	ResultID	Scan	FragMethod	SpecIndex	Charge	PrecursorMZ	DelM	DelM_PPM	MH	OriginalPeptide	Protein	NTT	DeNovoScore	MSGFScore	Rank_MSGFDB_SpecEValue	IsotopeError	accession	calculatedMassToCharge	chargeState	experimentalMassToCharge	isDecoy	spectrumFile	spectrumID	pepSeq	peptide	maxAScore	msmsScore	absParentMassErrorPPM	First_AA	Last_AA	First_AA_First	Last_AA_First	ProtLen	ModShift	ModAAs	SiteLoc	Site	SiteCollapsed	SiteCollapsedFirst	SiteID
MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03	12697	27321	HCD	2256	3	1045.124	0.003	0.858	3131.346	A.AAAAAGDSDSWDADTFSMEDPVRK.V	NP_001071138.1	1	146	58	1	2	NP_001071138.1	1044.454	3	1044.455	FALSE	MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03	27321	AAAAAGDSDSWDADTFSMEDPVRK	A.AAAAAGDSDSWDADTFSMEDPVRK.V	0.000	Inf	1.057	5	28	5	28	259	9, 14	S, T	14, 19	S14, T19	S14,T19	S14,T19	NP_001071138.1-S14;T19
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	875	23519	HCD	264	3	952.144	0.004	1.538	2854.412	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	XP_006232986.1	2	165	129	1	0	XP_006232986.1	952.142	3	952.144	FALSE	MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	23519	AAAASAAEAGIATPGTEDSDDALLK	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	52.349	Inf	1.625	238	262	238	262	377	12	T	250	T250	T250	T250	XP_006232986.1-T250
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	12873	23508	HCD	2213	4	714.360	0.007	2.392	2854.412	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	XP_006232986.1	2	122	81	1	0	XP_006232986.1	714.358	4	714.360	FALSE	MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	23508	AAAASAAEAGIATPGTEDSDDALLK	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	17.480	Inf	2.472	238	262	238	262	377	12	T	250	T250	T250	T250	XP_006232986.1-T250
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03	1793	23803	HCD	349	3	952.472	-0.015	-5.362	2854.412	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	XP_006232986.1	2	146	116	1	1	XP_006232986.1	952.142	3	952.137	FALSE	MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03	23803	AAAASAAEAGIATPGTEDSDDALLK	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	52.349	Inf	5.277	238	262	238	262	377	12	T	250	T250	T250	T250	XP_006232986.1-T250
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03	2731	23697	HCD	502	4	714.610	0.002	0.706	2854.412	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	XP_006232986.1	2	135	104	1	1	XP_006232986.1	714.358	4	714.359	FALSE	MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03	23697	AAAASAAEAGIATPGTEDSDDALLK	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	26.295	Inf	0.780	238	262	238	262	377	12	T	250	T250	T250	T250	XP_006232986.1-T250
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	4877	21265	HCD	935	4	800.403	0.006	1.871	3196.577	R.AAAASAAEAGIAT*PGTEGERDSDDALLK.M	NP_112621.1	2	194	114	1	2	NP_112621.1	799.900	4	799.901	FALSE	MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	21265	AAAASAAEAGIATPGTEGERDSDDALLK	R.AAAASAAEAGIATPGT*EGERDSDDALLK.M	6.213	Inf	1.902	238	265	238	265	380	15	T	253	T253	T253	T253	NP_112621.1-T253
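
The relationship between the peptide, First_AA and SiteID columns can be reproduced with a few lines of base R. The helper below, site_id, is hypothetical and written only for illustration; it is not part of MSnID and handles just the simple case of a peptide with flanking residues and a single mapped protein.

```r
site_id <- function(peptide, first_aa, accession,
                    mod_char = "*", site_delimiter = ";") {
  # Drop the flanking residues and dots, e.g. "R." and ".M"
  core <- sub("^.\\.", "", sub("\\..$", "", peptide))
  chars <- strsplit(core, "")[[1]]
  # Positions of the modification marks in the annotated string
  mark_idx <- which(chars == mod_char)
  # Each mark follows its residue, so subtracting the number of marks
  # seen so far gives the residue position in the bare sequence
  mod_pos <- mark_idx - seq_along(mark_idx)
  residues <- chars[chars != mod_char][mod_pos]
  # Shift from peptide coordinates to protein coordinates
  protein_pos <- first_aa + mod_pos - 1
  paste0(accession, "-",
         paste0(residues, protein_pos, collapse = site_delimiter))
}

# Values copied from rows 2 and 6 of Table 2.1
id_row2 <- site_id("R.AAAASAAEAGIAT*PGTEDSDDALLK.M", 238, "XP_006232986.1")
id_row6 <- site_id("R.AAAASAAEAGIATPGT*EGERDSDDALLK.M", 238, "NP_112621.1")
c(id_row2, id_row6)
```

For these two rows, the helper returns the same identifiers as the SiteID column.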

2.1.10 Remove Decoy PSMs

# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 77347 at 0 % FDR
## #peptides: 22890 at 0 % FDR
## #accessions: 4216 at 0 % FDR
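
The counts before and after decoy removal can be used to check the FDR values reported in Section 2.1.8. The sketch below assumes that FDR is estimated as the number of decoys divided by the number of targets, and that the mapping step in Section 2.1.9 did not remove any rows; both are assumptions, although the resulting numbers agree with the show output.

```r
# Counts from show(msnid) in Section 2.1.8 (targets plus decoys)
before <- c(PSMs = 77738, peptides = 23117, accessions = 4419)

# Counts from show(msnid) after removing decoys (targets only)
targets <- c(PSMs = 77347, peptides = 22890, accessions = 4216)

# Decoys are whatever the filter removed
decoys <- before - targets

# Assumed estimate: decoys / targets, expressed as a percentage
fdr_percent <- 100 * decoys / targets
signif(fdr_percent, 2)
```

Once the decoys are removed, this estimate can no longer be computed from the object, which is why show reports 0 % FDR at every level.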

References

Beausoleil, S. A., Villén, J., Gerber, S. A., Rush, J. and Gygi, S. P., A Probability-Based Approach for High-Throughput Protein Phosphorylation Analysis and Site Localization, Nature Biotechnology, vol. 24, no. 10, pp. 1285–1292, October 2006. DOI: 10.1038/nbt1240. Accessed February 12, 2022, from https://www.nature.com/articles/nbt1240