1.2 Phosphoproteomics Data

This pipeline shows how to process phosphoproteomics data from the DMS. Phosphoproteomics deals with proteins that have been phosphorylated. Phosphorylation is a type of reversible post-translational modification (PTM; a protein modification that occurs after protein synthesis) in which a protein kinase attaches a phosphate group to an amino acid residue. This change in a protein’s structure can completely alter aspects such as its biological activity and protein-protein interactions, and “abnormal phosphorylation is now recognized as a cause or consequence of many human diseases” (Cohen, 2002). The most commonly phosphorylated amino acids are serine (S), threonine (T), and tyrosine (Y).

For this section, we will use data package number 3626. We will need the PlexedPiper package for isobaric quantification and PNNL.DMS.utils to interface with the DMS. Also, details will be omitted if they were already provided in Section 1.1.

# Setup
library(PNNL.DMS.utils)
library(PlexedPiper)
library(Biostrings)
library(dplyr) # %>%
library(MSnbase)

1.2.1 Read MS-GF+ Output

# Read MS-GF+ data
data_package_num <- 3626
msnid <- read_msgf_data_from_DMS(data_package_num)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 612667 at 55 % FDR
## #peptides: 396540 at 75 % FDR
## #accessions: 121521 at 98 % FDR

1.2.2 Correct Isotope Selection Error

# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)

1.2.3 Remove Non-Phosphorylated Peptides

In this case, the phosphorylation of an amino acid is marked by a * inserted into the sequence after said amino acid. We will not consider unmodified peptides, so we can filter out peptides that do not contain this symbol with apply_filter. The * is a special character that must be escaped with backslashes, and the backslashes must also be escaped, since they are enclosed within a nested string ("''").

# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 537749 at 57 % FDR
## #peptides: 353634 at 76 % FDR
## #accessions: 118817 at 98 % FDR

1.2.4 Remove Contaminants

# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 537572 at 57 % FDR
## #peptides: 353489 at 76 % FDR
## #accessions: 118797 at 98 % FDR

1.2.5 Improve Phosphosite Localization

Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see AScore Job Creation for how to set it up. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly, it contains AScore column that signifies the confidence of PTM assignment. AScore > 17 is considered confident.

# Filter PTMs by Ascore
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 188791 at 30 % FDR
## #peptides: 101873 at 53 % FDR
## #accessions: 90677 at 93 % FDR

1.2.6 MS/MS ID Filter: Peptide Level

# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid,
                          level = "peptide",
                          fdr.max = 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 76103 at 0.49 % FDR
## #peptides: 23378 at 1 % FDR
## #accessions: 16090 at 4.7 % FDR

1.2.7 MS/MS ID Filter: Protein Level

# Get path to FASTA file
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)

# Compute number of peptides per 1000 amino acids
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)

# 1% FDR filter at the protein level
msnid <- filter_msgf_data(msnid,
                          level = "accession",
                          fdr.max = 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 72481 at 0.12 % FDR
## #peptides: 21266 at 0.26 % FDR
## #accessions: 9424 at 0.98 % FDR

1.2.8 Inference of Parsimonious Protein Set

# Load proteins from global crosstab
global_proteins <- readRDS("data/3442_global_protein_names.rds")
# Inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
                                       prior = global_proteins)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 72481 at 0.12 % FDR
## #peptides: 21266 at 0.26 % FDR
## #accessions: 2895 at 1.6 % FDR

1.2.9 Remove Decoy PSMs

# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  23 
## #PSMs: 72391 at 0 % FDR
## #peptides: 21211 at 0 % FDR
## #accessions: 2849 at 0 % FDR

1.2.10 Map Sites to Protein Sequences

Prepare FASTA to make sure entry names in FASTA file match MSnID accessions. The plan is to make this conversion automatic. map_mod_sites creates a number of columns describing mapping of the sites onto the protein sequences. The most important for the user is SiteID.

# Create AAStringSet
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst)), ]
# First 6 names
head(names(fst))
## [1] "NP_783171.2 cathepsin R precursor [Rattus norvegicus]"                    
## [2] "NP_001101862.2 zinc finger protein ZIC 2 [Rattus norvegicus]"             
## [3] "NP_113721.4 UDP-glucuronosyltransferase 2B2 precursor [Rattus norvegicus]"
## [4] "NP_714948.1 Ly-49 stimulatory receptor 3 [Rattus norvegicus]"             
## [5] "NP_001000704.1 olfactory receptor Olr931 [Rattus norvegicus]"             
## [6] "NP_001000638.1 olfactory receptor Olr652 [Rattus norvegicus]"
# Modify names to match accessions(msnid)
names(fst) <- strsplit(names(fst), split = " ") %>% 
  # Select text before first space
  lapply(function(x) x[1]) %>% 
  unlist()
# First 6 names
head(names(fst))
## [1] "NP_783171.2"    "NP_001101862.2" "NP_113721.4"    "NP_714948.1"   
## [5] "NP_001000704.1" "NP_001000638.1"
# Main mapping call
msnid <- map_mod_sites(object = msnid, fasta = fst, 
                       accession_col = "accession", 
                       peptide_mod_col = "peptide", 
                       mod_char = "*",
                       site_delimiter = "lower")
Table 1.8: First 6 rows of the processed MS-GF+ results.
Dataset ResultID Scan FragMethod SpecIndex Charge PrecursorMZ DelM DelM_PPM MH OriginalPeptide Protein NTT DeNovoScore MSGFScore MSGFDB_SpecEValue Rank_MSGFDB_SpecEValue EValue QValue PepQValue IsotopeError accession calculatedMassToCharge chargeState experimentalMassToCharge isDecoy spectrumFile spectrumID pepSeq peptide maxAScore msmsScore absParentMassErrorPPM peptides_per_1000aa First_AA Last_AA First_AA_First Last_AA_First ProtLen ModShift ModAAs SiteLoc Site SiteCollapsed SiteCollapsedFirst SiteID
MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 12697 27321 HCD 2256 3 1045.124 0.003 0.858 3131.346 A.AAAAAGDS*DS*WDADTFSMEDPVRK.V NP_001071138.1 1 146 58 0 1 0 0 0 2 NP_001071138.1 1044.454 3 1044.455 FALSE MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 27321 AAAAAGDSDSWDADTFSMEDPVRK A.AAAAAGDSDS*WDADT*FSMEDPVRK.V 0.000 Inf 1.057 7.722 5 28 5 28 259 9, 14 S, T 14, 19 S14, T19 S14,T19 S14,T19 NP_001071138.1-S14sT19t
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 875 23519 HCD 264 3 952.144 0.004 1.538 2854.412 R.AAAASAAEAGIAT*PGTEDSDDALLK.M XP_006232986.1 2 165 129 0 1 0 0 0 0 XP_006232986.1 952.142 3 952.144 FALSE MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 23519 AAAASAAEAGIATPGTEDSDDALLK R.AAAASAAEAGIAT*PGTEDSDDALLK.M 52.349 Inf 1.625 5.305 238 262 238 262 377 12 T 250 T250 T250 T250 XP_006232986.1-T250t
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 12873 23508 HCD 2213 4 714.360 0.007 2.392 2854.412 R.AAAASAAEAGIAT*PGTEDSDDALLK.M XP_006232986.1 2 122 81 0 1 0 0 0 0 XP_006232986.1 714.358 4 714.360 FALSE MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 23508 AAAASAAEAGIATPGTEDSDDALLK R.AAAASAAEAGIAT*PGTEDSDDALLK.M 17.480 Inf 2.472 5.305 238 262 238 262 377 12 T 250 T250 T250 T250 XP_006232986.1-T250t
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 2731 23697 HCD 502 4 714.610 0.002 0.706 2854.412 R.AAAASAAEAGIAT*PGTEDSDDALLK.M XP_006232986.1 2 135 104 0 1 0 0 0 1 XP_006232986.1 714.358 4 714.359 FALSE MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 23697 AAAASAAEAGIATPGTEDSDDALLK R.AAAASAAEAGIAT*PGTEDSDDALLK.M 26.295 Inf 0.780 5.305 238 262 238 262 377 12 T 250 T250 T250 T250 XP_006232986.1-T250t
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 4877 21265 HCD 935 4 800.403 0.006 1.871 3196.577 R.AAAASAAEAGIAT*PGTEGERDSDDALLK.M NP_112621.1 2 194 114 0 1 0 0 0 2 NP_112621.1 799.900 4 799.901 FALSE MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 21265 AAAASAAEAGIATPGTEGERDSDDALLK R.AAAASAAEAGIATPGT*EGERDSDDALLK.M 6.213 Inf 1.902 10.526 238 265 238 265 380 15 T 253 T253 T253 T253 NP_112621.1-T253t
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 6826 21280 HCD 1251 3 1066.532 0.000 -0.095 3196.577 R.AAAASAAEAGIATPGT*EGERDSDDALLK.M NP_112621.1 2 200 94 0 1 0 0 0 1 NP_112621.1 1066.197 3 1066.197 FALSE MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 21280 AAAASAAEAGIATPGTEGERDSDDALLK R.AAAASAAEAGIATPGT*EGERDSDDALLK.M 0.000 Inf 0.043 10.526 238 265 238 265 380 15 T 253 T253 T253 T253 NP_112621.1-T253t


Table 1.8 shows the first 6 rows of the processed MS-GF+ output.

1.2.11 Read MASIC Output

# Read MASIC data
masic_data <- read_masic_data_from_DMS(data_package_num, 
                                       interference_score = TRUE)

1.2.12 Filter MASIC Data

# Filter MASIC data
masic_data <- filter_masic_data(masic_data, 
                                interference_score_threshold = 0.5,
                                s2n_threshold = 0)

1.2.13 Create Study Design Tables

If study design tables are on the DMS, they can be accessed in the following way.

# Read study design tables from DMS
study_design <- read_study_design_from_DMS(data_package_num)
fractions <- study_design$fractions
samples <- study_design$samples
references <- study_design$references

While the study design tables are not on the DMS, we already created them in Section 1.1.10. We just need to recreate the fractions table because the dataset names are different.

# Read tables from folder
samples <- read.delim("data/samples.txt")
references <- read.delim("data/references.txt")
fractions <- data.frame(Dataset = unique(masic_data$Dataset)) %>% 
  mutate(PlexID = gsub(".*_P_(S\\d{1})_.*", "\\1", Dataset))

1.2.14 Create Quantitative Cross-tab

# Set aggregation level
aggregation_level <- c("SiteID")
# Create cross-tab
crosstab <- create_crosstab(msnid, masic_data,
                            aggregation_level = aggregation_level,
                            fractions, samples, references)
Table 1.9: First 6 rows of the phospho quantitative cross-tab.
R_01 R_02 R_03 R_04 R_05 R_06 R_07 R_08 R_09 S_01 S_02 S_03 S_04 S_05 S_06 S_07 S_08 S_09
NP_001000283.1-Y132yS137s NA NA NA NA NA -0.7653521 -0.5428785 -0.2119856 -0.1239261 NA NA NA NA 0.0038421 -0.2425074 -0.7416622 -0.7497171 -0.4879016
NP_001001064.1-Y129y NA NA NA NA NA -1.0408865 -0.2976259 -0.7301538 -0.1385762 NA NA NA NA 0.1594274 -0.6587720 0.1932351 -0.6370354 -0.4461276
NP_001001512.2-S241s -0.5441083 -0.0291611 -0.5540885 -0.1812456 -0.5675833 NA NA NA NA -0.0427948 -0.803056 -0.5236890 -1.0398773 NA NA NA NA NA
NP_001001512.2-S242s -0.2806018 -0.2939827 -0.3923442 -0.3242804 -0.6512914 -1.3958746 -0.6640721 -0.7046127 -0.3091972 -0.4914234 -1.210593 -0.6421375 -0.8869142 -0.3079991 -0.8072458 -0.4801256 -0.4458455 -0.7588038
NP_001001512.2-S699s NA NA NA NA NA -0.8584401 -0.4918316 -0.2745069 0.2867235 NA NA NA NA -0.7930334 -1.2173509 -0.4608541 -1.2596689 -0.8004630
NP_001001512.2-S746s NA NA NA NA NA NA -1.5981389 -2.4732843 -0.4736739 NA NA NA NA -1.4611181 -2.2975481 -1.6828088 -2.3039510 -1.8632844


We will save the cross-tab for later sections.

# Modify cross-tab for saving
crosstab <- crosstab %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column("SiteID")

# Save cross-tab
write.table(crosstab, file = "data/phosphosite_quant_crosstab.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)

References

Cohen, P., The Origins of Protein Phosphorylation, Nature Cell Biology, vol. 4, no. 5, pp. E127–30, accessed October 21, 2021, from https://www.nature.com/articles/ncb0502-e127, May 2002. DOI: 10.1038/ncb0502-e127