1.2 Phosphoproteomics Data
This pipeline shows how to process phosphoproteomics data from the DMS. Phosphoproteomics deals with proteins that have been phosphorylated. Phosphorylation is a type of reversible post-translational modification (PTM; a protein modification that occurs after protein synthesis) in which a protein kinase attaches a phosphate group to an amino acid residue. This change in a protein’s structure can completely alter aspects such as its biological activity and protein-protein interactions, and “abnormal phosphorylation is now recognized as a cause or consequence of many human diseases” (Cohen, 2002). The most commonly phosphorylated amino acids are serine (S), threonine (T), and tyrosine (Y).
For this section, we will use data package number 3626
. We will need the PlexedPiper
package for isobaric quantification and PNNL.DMS.utils
to interface with the DMS. Also, details will be omitted if they were already provided in Section 1.1.
# Setup
library(PNNL.DMS.utils)
library(PlexedPiper)
library(Biostrings)
library(dplyr) # %>%
library(MSnbase)
1.2.1 Read MS-GF+ Output
# Read MS-GF+ data
<- 3626
data_package_num <- read_msgf_data_from_DMS(data_package_num) msnid
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 612667 at 55 % FDR
## #peptides: 396540 at 75 % FDR
## #accessions: 121521 at 98 % FDR
1.2.2 Correct Isotope Selection Error
# Correct for isotope selection error
<- correct_peak_selection(msnid) msnid
1.2.3 Remove Non-Phosphorylated Peptides
In this case, the phosphorylation of an amino acid is marked by a *
inserted into the sequence after said amino acid. We will not consider unmodified peptides, so we can filter out peptides that do not contain this symbol with apply_filter
. The *
is a special character that must be escaped with backslashes, and the backslashes must also be escaped, since they are enclosed within a nested string ("''"
).
# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
<- apply_filter(msnid, "grepl('\\\\*', peptide)")
msnid show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 537749 at 57 % FDR
## #peptides: 353634 at 76 % FDR
## #accessions: 118817 at 98 % FDR
1.2.4 Remove Contaminants
# Remove contaminants
<- apply_filter(msnid, "!grepl('Contaminant', accession)")
msnid show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 537572 at 57 % FDR
## #peptides: 353489 at 76 % FDR
## #accessions: 118797 at 98 % FDR
1.2.5 Improve Phosphosite Localization
Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see AScore Job Creation for how to set it up. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly, it contains AScore
column that signifies the confidence of PTM assignment. AScore > 17 is considered confident.
# Filter PTMs by Ascore
<- get_AScore_results(data_package_num)
ascore <- best_PTM_location_by_ascore(msnid, ascore) msnid
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 188791 at 30 % FDR
## #peptides: 101873 at 53 % FDR
## #accessions: 90677 at 93 % FDR
1.2.6 MS/MS ID Filter: Peptide Level
# 1% FDR filter at the peptide level
<- filter_msgf_data(msnid,
msnid level = "peptide",
fdr.max = 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 76103 at 0.49 % FDR
## #peptides: 23378 at 1 % FDR
## #accessions: 16090 at 4.7 % FDR
1.2.7 MS/MS ID Filter: Protein Level
# Get path to FASTA file
<- path_to_FASTA_used_by_DMS(data_package_num)
path_to_FASTA
# Compute number of peptides per 1000 amino acids
<- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)
msnid
# 1% FDR filter at the protein level
<- filter_msgf_data(msnid,
msnid level = "accession",
fdr.max = 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 72481 at 0.12 % FDR
## #peptides: 21266 at 0.26 % FDR
## #accessions: 9424 at 0.98 % FDR
1.2.8 Inference of Parsimonious Protein Set
# Load proteins from global crosstab
<- readRDS("data/3442_global_protein_names.rds")
global_proteins # Inference of parsimonious protein set
<- infer_parsimonious_accessions(msnid, unique_only = FALSE,
msnid prior = global_proteins)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 72481 at 0.12 % FDR
## #peptides: 21266 at 0.26 % FDR
## #accessions: 2895 at 1.6 % FDR
1.2.9 Remove Decoy PSMs
# Remove Decoy PSMs
<- apply_filter(msnid, "!isDecoy")
msnid show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 72391 at 0 % FDR
## #peptides: 21211 at 0 % FDR
## #accessions: 2849 at 0 % FDR
1.2.10 Map Sites to Protein Sequences
Prepare FASTA to make sure entry names in FASTA file match MSnID accessions. The plan is to make this conversion automatic. map_mod_sites
creates a number of columns describing mapping of the sites onto the protein sequences. The most important for the user is SiteID
.
# Create AAStringSet
<- readAAStringSet(path_to_FASTA)
fst # Remove contaminants
<- fst[!grepl("Contaminant", names(fst)), ]
fst # First 6 names
head(names(fst))
## [1] "NP_783171.2 cathepsin R precursor [Rattus norvegicus]"
## [2] "NP_001101862.2 zinc finger protein ZIC 2 [Rattus norvegicus]"
## [3] "NP_113721.4 UDP-glucuronosyltransferase 2B2 precursor [Rattus norvegicus]"
## [4] "NP_714948.1 Ly-49 stimulatory receptor 3 [Rattus norvegicus]"
## [5] "NP_001000704.1 olfactory receptor Olr931 [Rattus norvegicus]"
## [6] "NP_001000638.1 olfactory receptor Olr652 [Rattus norvegicus]"
# Modify names to match accessions(msnid)
names(fst) <- strsplit(names(fst), split = " ") %>%
# Select text before first space
lapply(function(x) x[1]) %>%
unlist()
# First 6 names
head(names(fst))
## [1] "NP_783171.2" "NP_001101862.2" "NP_113721.4" "NP_714948.1"
## [5] "NP_001000704.1" "NP_001000638.1"
# Main mapping call
<- map_mod_sites(object = msnid, fasta = fst,
msnid accession_col = "accession",
peptide_mod_col = "peptide",
mod_char = "*",
site_delimiter = "lower")
Dataset | ResultID | Scan | FragMethod | SpecIndex | Charge | PrecursorMZ | DelM | DelM_PPM | MH | OriginalPeptide | Protein | NTT | DeNovoScore | MSGFScore | MSGFDB_SpecEValue | Rank_MSGFDB_SpecEValue | EValue | QValue | PepQValue | IsotopeError | accession | calculatedMassToCharge | chargeState | experimentalMassToCharge | isDecoy | spectrumFile | spectrumID | pepSeq | peptide | maxAScore | msmsScore | absParentMassErrorPPM | peptides_per_1000aa | First_AA | Last_AA | First_AA_First | Last_AA_First | ProtLen | ModShift | ModAAs | SiteLoc | Site | SiteCollapsed | SiteCollapsedFirst | SiteID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 | 12697 | 27321 | HCD | 2256 | 3 | 1045.124 | 0.003 | 0.858 | 3131.346 | A.AAAAAGDS*DS*WDADTFSMEDPVRK.V | NP_001071138.1 | 1 | 146 | 58 | 0 | 1 | 0 | 0 | 0 | 2 | NP_001071138.1 | 1044.454 | 3 | 1044.455 | FALSE | MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 | 27321 | AAAAAGDSDSWDADTFSMEDPVRK | A.AAAAAGDSDS*WDADT*FSMEDPVRK.V | 0.000 | Inf | 1.057 | 7.722 | 5 | 28 | 5 | 28 | 259 | 9, 14 | S, T | 14, 19 | S14, T19 | S14,T19 | S14,T19 | NP_001071138.1-S14sT19t |
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 875 | 23519 | HCD | 264 | 3 | 952.144 | 0.004 | 1.538 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 165 | 129 | 0 | 1 | 0 | 0 | 0 | 0 | XP_006232986.1 | 952.142 | 3 | 952.144 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 23519 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 52.349 | Inf | 1.625 | 5.305 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250t |
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 12873 | 23508 | HCD | 2213 | 4 | 714.360 | 0.007 | 2.392 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 122 | 81 | 0 | 1 | 0 | 0 | 0 | 0 | XP_006232986.1 | 714.358 | 4 | 714.360 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 23508 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 17.480 | Inf | 2.472 | 5.305 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250t |
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 | 2731 | 23697 | HCD | 502 | 4 | 714.610 | 0.002 | 0.706 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 135 | 104 | 0 | 1 | 0 | 0 | 0 | 1 | XP_006232986.1 | 714.358 | 4 | 714.359 | FALSE | MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 | 23697 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 26.295 | Inf | 0.780 | 5.305 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250t |
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 4877 | 21265 | HCD | 935 | 4 | 800.403 | 0.006 | 1.871 | 3196.577 | R.AAAASAAEAGIAT*PGTEGERDSDDALLK.M | NP_112621.1 | 2 | 194 | 114 | 0 | 1 | 0 | 0 | 0 | 2 | NP_112621.1 | 799.900 | 4 | 799.901 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 21265 | AAAASAAEAGIATPGTEGERDSDDALLK | R.AAAASAAEAGIATPGT*EGERDSDDALLK.M | 6.213 | Inf | 1.902 | 10.526 | 238 | 265 | 238 | 265 | 380 | 15 | T | 253 | T253 | T253 | T253 | NP_112621.1-T253t |
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 6826 | 21280 | HCD | 1251 | 3 | 1066.532 | 0.000 | -0.095 | 3196.577 | R.AAAASAAEAGIATPGT*EGERDSDDALLK.M | NP_112621.1 | 2 | 200 | 94 | 0 | 1 | 0 | 0 | 0 | 1 | NP_112621.1 | 1066.197 | 3 | 1066.197 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 21280 | AAAASAAEAGIATPGTEGERDSDDALLK | R.AAAASAAEAGIATPGT*EGERDSDDALLK.M | 0.000 | Inf | 0.043 | 10.526 | 238 | 265 | 238 | 265 | 380 | 15 | T | 253 | T253 | T253 | T253 | NP_112621.1-T253t |
Table 1.8 shows the first 6 rows of the processed MS-GF+ output.
1.2.11 Read MASIC Output
# Read MASIC data
<- read_masic_data_from_DMS(data_package_num,
masic_data interference_score = TRUE)
1.2.12 Filter MASIC Data
# Filter MASIC data
<- filter_masic_data(masic_data,
masic_data interference_score_threshold = 0.5,
s2n_threshold = 0)
1.2.13 Create Study Design Tables
If study design tables are on the DMS, they can be accessed in the following way.
# Read study design tables from DMS
<- read_study_design_from_DMS(data_package_num)
study_design <- study_design$fractions
fractions <- study_design$samples
samples <- study_design$references references
While the study design tables are not on the DMS, we already created them in Section 1.1.10. We just need to recreate the fractions table because the dataset names are different.
# Read tables from folder
<- read.delim("data/samples.txt")
samples <- read.delim("data/references.txt")
references <- data.frame(Dataset = unique(masic_data$Dataset)) %>%
fractions mutate(PlexID = gsub(".*_P_(S\\d{1})_.*", "\\1", Dataset))
1.2.14 Create Quantitative Cross-tab
# Set aggregation level
<- c("SiteID")
aggregation_level # Create cross-tab
<- create_crosstab(msnid, masic_data,
crosstab aggregation_level = aggregation_level,
fractions, samples, references)
R_01 | R_02 | R_03 | R_04 | R_05 | R_06 | R_07 | R_08 | R_09 | S_01 | S_02 | S_03 | S_04 | S_05 | S_06 | S_07 | S_08 | S_09 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NP_001000283.1-Y132yS137s | NA | NA | NA | NA | NA | -0.7653521 | -0.5428785 | -0.2119856 | -0.1239261 | NA | NA | NA | NA | 0.0038421 | -0.2425074 | -0.7416622 | -0.7497171 | -0.4879016 |
NP_001001064.1-Y129y | NA | NA | NA | NA | NA | -1.0408865 | -0.2976259 | -0.7301538 | -0.1385762 | NA | NA | NA | NA | 0.1594274 | -0.6587720 | 0.1932351 | -0.6370354 | -0.4461276 |
NP_001001512.2-S241s | -0.5441083 | -0.0291611 | -0.5540885 | -0.1812456 | -0.5675833 | NA | NA | NA | NA | -0.0427948 | -0.803056 | -0.5236890 | -1.0398773 | NA | NA | NA | NA | NA |
NP_001001512.2-S242s | -0.2806018 | -0.2939827 | -0.3923442 | -0.3242804 | -0.6512914 | -1.3958746 | -0.6640721 | -0.7046127 | -0.3091972 | -0.4914234 | -1.210593 | -0.6421375 | -0.8869142 | -0.3079991 | -0.8072458 | -0.4801256 | -0.4458455 | -0.7588038 |
NP_001001512.2-S699s | NA | NA | NA | NA | NA | -0.8584401 | -0.4918316 | -0.2745069 | 0.2867235 | NA | NA | NA | NA | -0.7930334 | -1.2173509 | -0.4608541 | -1.2596689 | -0.8004630 |
NP_001001512.2-S746s | NA | NA | NA | NA | NA | NA | -1.5981389 | -2.4732843 | -0.4736739 | NA | NA | NA | NA | -1.4611181 | -2.2975481 | -1.6828088 | -2.3039510 | -1.8632844 |
We will save the cross-tab for later sections.
# Modify cross-tab for saving
<- crosstab %>%
crosstab as.data.frame() %>%
::rownames_to_column("SiteID")
tibble
# Save cross-tab
write.table(crosstab, file = "data/phosphosite_quant_crosstab.txt",
sep = "\t", quote = FALSE, row.names = FALSE)