1.2 Phosphoproteomics Data

This pipeline shows how to process phosphoproteomics data from the DMS. Phosphoproteomics deals with proteins that have been phosphorylated. Phosphorylation is a type of reversible post-translational modification (PTM; a protein modification that occurs after protein synthesis) in which a protein kinase attaches a phosphate group to an amino acid residue. This change in a protein’s structure can completely alter aspects such as its biological activity and protein-protein interactions, and “abnormal phosphorylation is now recognized as a cause or consequence of many human diseases” (Cohen, 2002). The most commonly phosphorylated amino acids are serine (S), threonine (T), and tyrosine (Y).

For this section, we will use data package number 3626. We will need the PlexedPiper package for isobaric quantification and PNNL.DMS.utils to interface with the DMS. Also, details will be omitted if they were already provided in Section 1.1.

# Setup
library(PNNL.DMS.utils)
library(PlexedPiper)
library(Biostrings)
library(dplyr) # %>%
library(MSnbase)

1.2.1 Read MS-GF+ Output

# Read MS-GF+ data
data_package_num <- 3626
msnid <- read_msgf_data_from_DMS(data_package_num)

show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 612667 at 55 % FDR
## #peptides: 396540 at 75 % FDR
## #accessions: 121521 at 98 % FDR

1.2.2 Correct Isotope Selection Error

# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)

1.2.3 Remove Non-Phosphorylated Peptides

In this case, the phosphorylation of an amino acid is marked by a * inserted into the sequence after said amino acid. We will not consider unmodified peptides, so we can filter out peptides that do not contain this symbol with apply_filter. The * is a special character that must be escaped with backslashes, and the backslashes must also be escaped, since they are enclosed within a nested string ("''").

# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 537749 at 57 % FDR
## #peptides: 353634 at 76 % FDR
## #accessions: 118817 at 98 % FDR

1.2.4 Remove Contaminants

# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 537572 at 57 % FDR
## #peptides: 353489 at 76 % FDR
## #accessions: 118797 at 98 % FDR

1.2.5 Improve Phosphosite Localization

Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see AScore Job Creation for how to set it up. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly, it contains AScore column that signifies the confidence of PTM assignment. AScore > 17 is considered confident.

# Filter PTMs by Ascore
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)

show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 188791 at 30 % FDR
## #peptides: 101873 at 53 % FDR
## #accessions: 90677 at 93 % FDR

1.2.6 MS/MS ID Filter: Peptide Level

# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid,
                          level = "peptide",
                          fdr.max = 0.01)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 76103 at 0.49 % FDR
## #peptides: 23378 at 1 % FDR
## #accessions: 16090 at 4.7 % FDR

1.2.7 MS/MS ID Filter: Protein Level

# Get path to FASTA file
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)

# Compute number of peptides per 1000 amino acids
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)

# 1% FDR filter at the protein level
msnid <- filter_msgf_data(msnid,
                          level = "accession",
                          fdr.max = 0.01)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 72481 at 0.12 % FDR
## #peptides: 21266 at 0.26 % FDR
## #accessions: 9424 at 0.98 % FDR

1.2.8 Inference of Parsimonious Protein Set

# Load proteins from global crosstab
global_proteins <- readRDS("data/3442_global_protein_names.rds")
# Inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
                                       prior = global_proteins)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 72481 at 0.12 % FDR
## #peptides: 21266 at 0.26 % FDR
## #accessions: 2895 at 1.6 % FDR

1.2.9 Remove Decoy PSMs

# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 72391 at 0 % FDR
## #peptides: 21211 at 0 % FDR
## #accessions: 2849 at 0 % FDR

1.2.10 Map Sites to Protein Sequences

Prepare FASTA to make sure entry names in FASTA file match MSnID accessions. The plan is to make this conversion automatic. map_mod_sites creates a number of columns describing mapping of the sites onto the protein sequences. The most important for the user is SiteID.

# Create AAStringSet
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst)), ]
# First 6 names
head(names(fst))

## [1] "NP_783171.2 cathepsin R precursor [Rattus norvegicus]"                    
## [2] "NP_001101862.2 zinc finger protein ZIC 2 [Rattus norvegicus]"             
## [3] "NP_113721.4 UDP-glucuronosyltransferase 2B2 precursor [Rattus norvegicus]"
## [4] "NP_714948.1 Ly-49 stimulatory receptor 3 [Rattus norvegicus]"             
## [5] "NP_001000704.1 olfactory receptor Olr931 [Rattus norvegicus]"             
## [6] "NP_001000638.1 olfactory receptor Olr652 [Rattus norvegicus]"

# Modify names to match accessions(msnid)
names(fst) <- strsplit(names(fst), split = " ") %>% 
  # Select text before first space
  lapply(function(x) x[1]) %>% 
  unlist()
# First 6 names
head(names(fst))

## [1] "NP_783171.2"    "NP_001101862.2" "NP_113721.4"    "NP_714948.1"   
## [5] "NP_001000704.1" "NP_001000638.1"

# Main mapping call
msnid <- map_mod_sites(object = msnid, fasta = fst, 
                       accession_col = "accession", 
                       peptide_mod_col = "peptide", 
                       mod_char = "*",
                       site_delimiter = "lower")

Table 1.8: First 6 rows of the processed MS-GF+ results.
Dataset	ResultID	Scan	FragMethod	SpecIndex	Charge	PrecursorMZ	DelM	DelM_PPM	MH	OriginalPeptide	Protein	NTT	DeNovoScore	MSGFScore	Rank_MSGFDB_SpecEValue	IsotopeError	accession	calculatedMassToCharge	chargeState	experimentalMassToCharge	isDecoy	spectrumFile	spectrumID	pepSeq	peptide	maxAScore	msmsScore	absParentMassErrorPPM	peptides_per_1000aa	First_AA	Last_AA	First_AA_First	Last_AA_First	ProtLen	ModShift	ModAAs	SiteLoc	Site	SiteCollapsed	SiteCollapsedFirst	SiteID
MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03	12697	27321	HCD	2256	3	1045.124	0.003	0.858	3131.346	A.AAAAAGDSDSWDADTFSMEDPVRK.V	NP_001071138.1	1	146	58	1	2	NP_001071138.1	1044.454	3	1044.455	FALSE	MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03	27321	AAAAAGDSDSWDADTFSMEDPVRK	A.AAAAAGDSDSWDADTFSMEDPVRK.V	0.000	Inf	1.057	7.722	5	28	5	28	259	9, 14	S, T	14, 19	S14, T19	S14,T19	S14,T19	NP_001071138.1-S14sT19t
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	875	23519	HCD	264	3	952.144	0.004	1.538	2854.412	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	XP_006232986.1	2	165	129	1	0	XP_006232986.1	952.142	3	952.144	FALSE	MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	23519	AAAASAAEAGIATPGTEDSDDALLK	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	52.349	Inf	1.625	5.305	238	262	238	262	377	12	T	250	T250	T250	T250	XP_006232986.1-T250t
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	12873	23508	HCD	2213	4	714.360	0.007	2.392	2854.412	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	XP_006232986.1	2	122	81	1	0	XP_006232986.1	714.358	4	714.360	FALSE	MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	23508	AAAASAAEAGIATPGTEDSDDALLK	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	17.480	Inf	2.472	5.305	238	262	238	262	377	12	T	250	T250	T250	T250	XP_006232986.1-T250t
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03	2731	23697	HCD	502	4	714.610	0.002	0.706	2854.412	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	XP_006232986.1	2	135	104	1	1	XP_006232986.1	714.358	4	714.359	FALSE	MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03	23697	AAAASAAEAGIATPGTEDSDDALLK	R.AAAASAAEAGIAT*PGTEDSDDALLK.M	26.295	Inf	0.780	5.305	238	262	238	262	377	12	T	250	T250	T250	T250	XP_006232986.1-T250t
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	4877	21265	HCD	935	4	800.403	0.006	1.871	3196.577	R.AAAASAAEAGIAT*PGTEGERDSDDALLK.M	NP_112621.1	2	194	114	1	2	NP_112621.1	799.900	4	799.901	FALSE	MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	21265	AAAASAAEAGIATPGTEGERDSDDALLK	R.AAAASAAEAGIATPGT*EGERDSDDALLK.M	6.213	Inf	1.902	10.526	238	265	238	265	380	15	T	253	T253	T253	T253	NP_112621.1-T253t
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	6826	21280	HCD	1251	3	1066.532	0.000	-0.095	3196.577	R.AAAASAAEAGIATPGT*EGERDSDDALLK.M	NP_112621.1	2	200	94	1	1	NP_112621.1	1066.197	3	1066.197	FALSE	MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03	21280	AAAASAAEAGIATPGTEGERDSDDALLK	R.AAAASAAEAGIATPGT*EGERDSDDALLK.M	0.000	Inf	0.043	10.526	238	265	238	265	380	15	T	253	T253	T253	T253	NP_112621.1-T253t

Table 1.8 shows the first 6 rows of the processed MS-GF+ output.

1.2.11 Read MASIC Output

# Read MASIC data
masic_data <- read_masic_data_from_DMS(data_package_num, 
                                       interference_score = TRUE)

1.2.12 Filter MASIC Data

# Filter MASIC data
masic_data <- filter_masic_data(masic_data, 
                                interference_score_threshold = 0.5,
                                s2n_threshold = 0)

1.2.13 Create Study Design Tables

If study design tables are on the DMS, they can be accessed in the following way.

# Read study design tables from DMS
study_design <- read_study_design_from_DMS(data_package_num)
fractions <- study_design$fractions
samples <- study_design$samples
references <- study_design$references

While the study design tables are not on the DMS, we already created them in Section 1.1.10. We just need to recreate the fractions table because the dataset names are different.

# Read tables from folder
samples <- read.delim("data/samples.txt")
references <- read.delim("data/references.txt")
fractions <- data.frame(Dataset = unique(masic_data$Dataset)) %>% 
  mutate(PlexID = gsub(".*_P_(S\\d{1})_.*", "\\1", Dataset))

1.2.14 Create Quantitative Cross-tab

# Set aggregation level
aggregation_level <- c("SiteID")
# Create cross-tab
crosstab <- create_crosstab(msnid, masic_data,
                            aggregation_level = aggregation_level,
                            fractions, samples, references)

Table 1.9: First 6 rows of the phospho quantitative cross-tab.
	R_01	R_02	R_03	R_04	R_05	R_06	R_07	R_08	R_09	S_01	S_02	S_03	S_04	S_05	S_06	S_07	S_08	S_09
NP_001000283.1-Y132yS137s	NA	NA	NA	NA	NA	-0.7653521	-0.5428785	-0.2119856	-0.1239261	NA	NA	NA	NA	0.0038421	-0.2425074	-0.7416622	-0.7497171	-0.4879016
NP_001001064.1-Y129y	NA	NA	NA	NA	NA	-1.0408865	-0.2976259	-0.7301538	-0.1385762	NA	NA	NA	NA	0.1594274	-0.6587720	0.1932351	-0.6370354	-0.4461276
NP_001001512.2-S241s	-0.5441083	-0.0291611	-0.5540885	-0.1812456	-0.5675833	NA	NA	NA	NA	-0.0427948	-0.803056	-0.5236890	-1.0398773	NA	NA	NA	NA	NA
NP_001001512.2-S242s	-0.2806018	-0.2939827	-0.3923442	-0.3242804	-0.6512914	-1.3958746	-0.6640721	-0.7046127	-0.3091972	-0.4914234	-1.210593	-0.6421375	-0.8869142	-0.3079991	-0.8072458	-0.4801256	-0.4458455	-0.7588038
NP_001001512.2-S699s	NA	NA	NA	NA	NA	-0.8584401	-0.4918316	-0.2745069	0.2867235	NA	NA	NA	NA	-0.7930334	-1.2173509	-0.4608541	-1.2596689	-0.8004630
NP_001001512.2-S746s	NA	NA	NA	NA	NA	NA	-1.5981389	-2.4732843	-0.4736739	NA	NA	NA	NA	-1.4611181	-2.2975481	-1.6828088	-2.3039510	-1.8632844

We will save the cross-tab for later sections.

# Modify cross-tab for saving
crosstab <- crosstab %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column("SiteID")

# Save cross-tab
write.table(crosstab, file = "data/phosphosite_quant_crosstab.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)

References

Cohen, P., The Origins of Protein Phosphorylation, Nature Cell Biology, vol. 4, no. 5, pp. E127–30, accessed October 21, 2021, from https://www.nature.com/articles/ncb0502-e127, May 2002. DOI: 10.1038/ncb0502-e127