1.2 Phosphoproteomics Data
This pipeline shows how to process phosphoproteomics data from the DMS. Phosphoproteomics deals with proteins that have been phosphorylated. Phosphorylation is a type of reversible post-translational modification (PTM; a protein modification that occurs after protein synthesis) in which a protein kinase attaches a phosphate group to an amino acid residue. This change in a protein’s structure can completely alter aspects such as its biological activity and protein-protein interactions, and “abnormal phosphorylation is now recognized as a cause or consequence of many human diseases” (Cohen, 2002). The most commonly phosphorylated amino acids are serine (S), threonine (T), and tyrosine (Y).
For this section, we will use data package number 3626. We will need the PlexedPiper package for isobaric quantification and PNNL.DMS.utils to interface with the DMS. Also, details will be omitted if they were already provided in Section 1.1.
# Setup
library(PNNL.DMS.utils)
library(PlexedPiper)
library(Biostrings)
library(dplyr) # %>%
library(MSnbase)1.2.1 Read MS-GF+ Output
# Read MS-GF+ data
data_package_num <- 3626
msnid <- read_msgf_data_from_DMS(data_package_num)show(msnid)## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 612667 at 55 % FDR
## #peptides: 396540 at 75 % FDR
## #accessions: 121521 at 98 % FDR
1.2.2 Correct Isotope Selection Error
# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)1.2.3 Remove Non-Phosphorylated Peptides
In this case, the phosphorylation of an amino acid is marked by a * inserted into the sequence after said amino acid. We will not consider unmodified peptides, so we can filter out peptides that do not contain this symbol with apply_filter. The * is a special character that must be escaped with backslashes, and the backslashes must also be escaped, since they are enclosed within a nested string ("''").
# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 537749 at 57 % FDR
## #peptides: 353634 at 76 % FDR
## #accessions: 118817 at 98 % FDR
1.2.4 Remove Contaminants
# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 537572 at 57 % FDR
## #peptides: 353489 at 76 % FDR
## #accessions: 118797 at 98 % FDR
1.2.5 Improve Phosphosite Localization
Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see AScore Job Creation for how to set it up. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly, it contains AScore column that signifies the confidence of PTM assignment. AScore > 17 is considered confident.
# Filter PTMs by Ascore
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)show(msnid)## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 188791 at 30 % FDR
## #peptides: 101873 at 53 % FDR
## #accessions: 90677 at 93 % FDR
1.2.6 MS/MS ID Filter: Peptide Level
# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid,
level = "peptide",
fdr.max = 0.01)
show(msnid)## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 76103 at 0.49 % FDR
## #peptides: 23378 at 1 % FDR
## #accessions: 16090 at 4.7 % FDR
1.2.7 MS/MS ID Filter: Protein Level
# Get path to FASTA file
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)
# Compute number of peptides per 1000 amino acids
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)
# 1% FDR filter at the protein level
msnid <- filter_msgf_data(msnid,
level = "accession",
fdr.max = 0.01)
show(msnid)## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 72481 at 0.12 % FDR
## #peptides: 21266 at 0.26 % FDR
## #accessions: 9424 at 0.98 % FDR
1.2.8 Inference of Parsimonious Protein Set
# Load proteins from global crosstab
global_proteins <- readRDS("data/3442_global_protein_names.rds")
# Inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
prior = global_proteins)
show(msnid)## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 72481 at 0.12 % FDR
## #peptides: 21266 at 0.26 % FDR
## #accessions: 2895 at 1.6 % FDR
1.2.9 Remove Decoy PSMs
# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 72391 at 0 % FDR
## #peptides: 21211 at 0 % FDR
## #accessions: 2849 at 0 % FDR
1.2.10 Map Sites to Protein Sequences
Prepare FASTA to make sure entry names in FASTA file match MSnID accessions. The plan is to make this conversion automatic. map_mod_sites creates a number of columns describing mapping of the sites onto the protein sequences. The most important for the user is SiteID.
# Create AAStringSet
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst)), ]
# First 6 names
head(names(fst))## [1] "NP_783171.2 cathepsin R precursor [Rattus norvegicus]"
## [2] "NP_001101862.2 zinc finger protein ZIC 2 [Rattus norvegicus]"
## [3] "NP_113721.4 UDP-glucuronosyltransferase 2B2 precursor [Rattus norvegicus]"
## [4] "NP_714948.1 Ly-49 stimulatory receptor 3 [Rattus norvegicus]"
## [5] "NP_001000704.1 olfactory receptor Olr931 [Rattus norvegicus]"
## [6] "NP_001000638.1 olfactory receptor Olr652 [Rattus norvegicus]"
# Modify names to match accessions(msnid)
names(fst) <- strsplit(names(fst), split = " ") %>%
# Select text before first space
lapply(function(x) x[1]) %>%
unlist()
# First 6 names
head(names(fst))## [1] "NP_783171.2" "NP_001101862.2" "NP_113721.4" "NP_714948.1"
## [5] "NP_001000704.1" "NP_001000638.1"
# Main mapping call
msnid <- map_mod_sites(object = msnid, fasta = fst,
accession_col = "accession",
peptide_mod_col = "peptide",
mod_char = "*",
site_delimiter = "lower")| Dataset | ResultID | Scan | FragMethod | SpecIndex | Charge | PrecursorMZ | DelM | DelM_PPM | MH | OriginalPeptide | Protein | NTT | DeNovoScore | MSGFScore | MSGFDB_SpecEValue | Rank_MSGFDB_SpecEValue | EValue | QValue | PepQValue | IsotopeError | accession | calculatedMassToCharge | chargeState | experimentalMassToCharge | isDecoy | spectrumFile | spectrumID | pepSeq | peptide | maxAScore | msmsScore | absParentMassErrorPPM | peptides_per_1000aa | First_AA | Last_AA | First_AA_First | Last_AA_First | ProtLen | ModShift | ModAAs | SiteLoc | Site | SiteCollapsed | SiteCollapsedFirst | SiteID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 | 12697 | 27321 | HCD | 2256 | 3 | 1045.124 | 0.003 | 0.858 | 3131.346 | A.AAAAAGDS*DS*WDADTFSMEDPVRK.V | NP_001071138.1 | 1 | 146 | 58 | 0 | 1 | 0 | 0 | 0 | 2 | NP_001071138.1 | 1044.454 | 3 | 1044.455 | FALSE | MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 | 27321 | AAAAAGDSDSWDADTFSMEDPVRK | A.AAAAAGDSDS*WDADT*FSMEDPVRK.V | 0.000 | Inf | 1.057 | 7.722 | 5 | 28 | 5 | 28 | 259 | 9, 14 | S, T | 14, 19 | S14, T19 | S14,T19 | S14,T19 | NP_001071138.1-S14sT19t |
| MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 875 | 23519 | HCD | 264 | 3 | 952.144 | 0.004 | 1.538 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 165 | 129 | 0 | 1 | 0 | 0 | 0 | 0 | XP_006232986.1 | 952.142 | 3 | 952.144 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 23519 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 52.349 | Inf | 1.625 | 5.305 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250t |
| MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 12873 | 23508 | HCD | 2213 | 4 | 714.360 | 0.007 | 2.392 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 122 | 81 | 0 | 1 | 0 | 0 | 0 | 0 | XP_006232986.1 | 714.358 | 4 | 714.360 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 23508 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 17.480 | Inf | 2.472 | 5.305 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250t |
| MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 | 2731 | 23697 | HCD | 502 | 4 | 714.610 | 0.002 | 0.706 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 135 | 104 | 0 | 1 | 0 | 0 | 0 | 1 | XP_006232986.1 | 714.358 | 4 | 714.359 | FALSE | MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 | 23697 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 26.295 | Inf | 0.780 | 5.305 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250t |
| MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 4877 | 21265 | HCD | 935 | 4 | 800.403 | 0.006 | 1.871 | 3196.577 | R.AAAASAAEAGIAT*PGTEGERDSDDALLK.M | NP_112621.1 | 2 | 194 | 114 | 0 | 1 | 0 | 0 | 0 | 2 | NP_112621.1 | 799.900 | 4 | 799.901 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 21265 | AAAASAAEAGIATPGTEGERDSDDALLK | R.AAAASAAEAGIATPGT*EGERDSDDALLK.M | 6.213 | Inf | 1.902 | 10.526 | 238 | 265 | 238 | 265 | 380 | 15 | T | 253 | T253 | T253 | T253 | NP_112621.1-T253t |
| MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 6826 | 21280 | HCD | 1251 | 3 | 1066.532 | 0.000 | -0.095 | 3196.577 | R.AAAASAAEAGIATPGT*EGERDSDDALLK.M | NP_112621.1 | 2 | 200 | 94 | 0 | 1 | 0 | 0 | 0 | 1 | NP_112621.1 | 1066.197 | 3 | 1066.197 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 21280 | AAAASAAEAGIATPGTEGERDSDDALLK | R.AAAASAAEAGIATPGT*EGERDSDDALLK.M | 0.000 | Inf | 0.043 | 10.526 | 238 | 265 | 238 | 265 | 380 | 15 | T | 253 | T253 | T253 | T253 | NP_112621.1-T253t |
Table 1.8 shows the first 6 rows of the processed MS-GF+ output.
1.2.11 Read MASIC Output
# Read MASIC data
masic_data <- read_masic_data_from_DMS(data_package_num,
interference_score = TRUE)1.2.12 Filter MASIC Data
# Filter MASIC data
masic_data <- filter_masic_data(masic_data,
interference_score_threshold = 0.5,
s2n_threshold = 0)1.2.13 Create Study Design Tables
If study design tables are on the DMS, they can be accessed in the following way.
# Read study design tables from DMS
study_design <- read_study_design_from_DMS(data_package_num)
fractions <- study_design$fractions
samples <- study_design$samples
references <- study_design$referencesWhile the study design tables are not on the DMS, we already created them in Section 1.1.10. We just need to recreate the fractions table because the dataset names are different.
# Read tables from folder
samples <- read.delim("data/samples.txt")
references <- read.delim("data/references.txt")
fractions <- data.frame(Dataset = unique(masic_data$Dataset)) %>%
mutate(PlexID = gsub(".*_P_(S\\d{1})_.*", "\\1", Dataset))1.2.14 Create Quantitative Cross-tab
# Set aggregation level
aggregation_level <- c("SiteID")
# Create cross-tab
crosstab <- create_crosstab(msnid, masic_data,
aggregation_level = aggregation_level,
fractions, samples, references)| R_01 | R_02 | R_03 | R_04 | R_05 | R_06 | R_07 | R_08 | R_09 | S_01 | S_02 | S_03 | S_04 | S_05 | S_06 | S_07 | S_08 | S_09 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NP_001000283.1-Y132yS137s | NA | NA | NA | NA | NA | -0.7653521 | -0.5428785 | -0.2119856 | -0.1239261 | NA | NA | NA | NA | 0.0038421 | -0.2425074 | -0.7416622 | -0.7497171 | -0.4879016 |
| NP_001001064.1-Y129y | NA | NA | NA | NA | NA | -1.0408865 | -0.2976259 | -0.7301538 | -0.1385762 | NA | NA | NA | NA | 0.1594274 | -0.6587720 | 0.1932351 | -0.6370354 | -0.4461276 |
| NP_001001512.2-S241s | -0.5441083 | -0.0291611 | -0.5540885 | -0.1812456 | -0.5675833 | NA | NA | NA | NA | -0.0427948 | -0.803056 | -0.5236890 | -1.0398773 | NA | NA | NA | NA | NA |
| NP_001001512.2-S242s | -0.2806018 | -0.2939827 | -0.3923442 | -0.3242804 | -0.6512914 | -1.3958746 | -0.6640721 | -0.7046127 | -0.3091972 | -0.4914234 | -1.210593 | -0.6421375 | -0.8869142 | -0.3079991 | -0.8072458 | -0.4801256 | -0.4458455 | -0.7588038 |
| NP_001001512.2-S699s | NA | NA | NA | NA | NA | -0.8584401 | -0.4918316 | -0.2745069 | 0.2867235 | NA | NA | NA | NA | -0.7930334 | -1.2173509 | -0.4608541 | -1.2596689 | -0.8004630 |
| NP_001001512.2-S746s | NA | NA | NA | NA | NA | NA | -1.5981389 | -2.4732843 | -0.4736739 | NA | NA | NA | NA | -1.4611181 | -2.2975481 | -1.6828088 | -2.3039510 | -1.8632844 |
We will save the cross-tab for later sections.
# Modify cross-tab for saving
crosstab <- crosstab %>%
as.data.frame() %>%
tibble::rownames_to_column("SiteID")
# Save cross-tab
write.table(crosstab, file = "data/phosphosite_quant_crosstab.txt",
sep = "\t", quote = FALSE, row.names = FALSE)