2.1 Prepare MS/MS Identifications
2.1.1 Read MS-GF+ Data
# Read MS-GF+ data
data_package_num <- 3626 # phospho
msnid <- read_msgf_data_from_DMS(data_package_num)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 612667 at 55 % FDR
## #peptides: 396540 at 75 % FDR
## #accessions: 121521 at 98 % FDR
2.1.2 Correct Isotope Selection Error
# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)
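As a rough illustration of what an isotope selection error means, the sketch below shifts an observed precursor m/z back by the number of mis-selected isotope peaks. It is not the package's implementation: the helper name `correct_mz` and the constant `C13_C12_DIFF` are made up for this example, and the assumption that the correction is one 13C-12C mass difference per isotope step, divided by charge, is inferred from the PrecursorMZ, IsotopeError, Charge, and experimentalMassToCharge columns of Table 2.1.

```r
# Approximate mass difference between 13C and 12C, in Da
C13_C12_DIFF <- 1.0033548

# Hypothetical helper: shift an observed precursor m/z back to the
# monoisotopic peak, given the isotope error and the charge state
correct_mz <- function(precursor_mz, isotope_error, charge) {
  precursor_mz - isotope_error * C13_C12_DIFF / charge
}

# Values copied from rows 1, 5, and 6 of Table 2.1
toy <- data.frame(
  PrecursorMZ  = c(1045.124, 714.610, 800.403),
  IsotopeError = c(2, 1, 2),
  Charge       = c(3, 4, 4)
)
toy$corrected <- correct_mz(toy$PrecursorMZ, toy$IsotopeError, toy$Charge)
print(round(toy$corrected, 3))
```

For these three rows, the rounded values agree with the experimentalMassToCharge column of Table 2.1 (1044.455, 714.359, and 799.901).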
2.1.3 Remove Unmodified Peptides
Generally, we remove unmodified peptides before any filtering steps. However, unmodified peptides are also removed automatically in Section 2.1.9, so this step can be skipped if we need to tally the numbers of modified and unmodified peptides toward the end of processing.
In this case, the phosphorylation of an amino acid is marked by a * immediately after that residue in the sequence. We can filter out peptides that do not contain this symbol with apply_filter. In regular expressions, * is a special character (a metacharacter) that must be escaped with a backslash. In an R string that backslash is itself written as two backslashes, and because the pattern sits inside a nested string ("''"), each of those must be doubled again, which gives four backslashes in total. Non-metacharacters do not need any backslashes.
# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 537749 at 57 % FDR
## #peptides: 353634 at 76 % FDR
## #accessions: 118817 at 98 % FDR
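The escaping rule used in the filter string can be checked on a toy vector with base R alone. The peptide sequences below are made up for illustration, and `eval(parse(...))` merely stands in for the extra round of parsing that a filter supplied as a string goes through; how apply_filter evaluates the string internally is not shown here.

```r
# Toy peptide sequences (hypothetical); * marks a phosphorylated residue
peptide <- c("K.AAS*PEK.L", "R.GGTDEK.M", "K.LLT*PS*R.A")

# Direct call: the pattern is parsed once, so two backslashes are enough
direct <- grepl("\\*", peptide)

# Filter written as a string: the expression is parsed a second time,
# so each backslash has to be doubled again
filter_string <- "grepl('\\\\*', peptide)"
nested <- eval(parse(text = filter_string))

# fixed = TRUE treats the pattern literally, so no escaping is needed
literal <- grepl("*", peptide, fixed = TRUE)

print(data.frame(peptide, direct, nested, literal))
```

All three approaches flag the first and third peptides and reject the second.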
2.1.4 Remove Contaminants
# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 537572 at 57 % FDR
## #peptides: 353489 at 76 % FDR
## #accessions: 118797 at 98 % FDR
2.1.5 Improve Phosphosite Localization
Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see AScore Job Creation for how to set it up. The fetched object is a data.frame that links datasets, scans, and original PTM localizations to newly suggested locations. Importantly, it contains an AScore column that “measures the probability of correct phosphorylation site localization” (Beausoleil et al., 2006). A localization with AScore > 17 is considered confident.
# Filter PTMs by Ascore - only for phospho data
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)
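To make the AScore > 17 rule of thumb concrete, the sketch below flags confident localizations in a small stand-in data frame. The scores are the maxAScore values from the six rows of Table 2.1; the object name `ascore_toy` and the `confident` column are made up for this example, and the code does not reproduce what best_PTM_location_by_ascore does internally.

```r
# Stand-in for an AScore table; the scores are taken from Table 2.1
ascore_toy <- data.frame(
  Scan   = c(27321, 23519, 23508, 23803, 23697, 21265),
  AScore = c(0.000, 52.349, 17.480, 52.349, 26.295, 6.213)
)

# Threshold quoted in the text
ASCORE_CUTOFF <- 17

# Flag localizations that pass the threshold
ascore_toy$confident <- ascore_toy$AScore > ASCORE_CUTOFF

print(ascore_toy)
print(table(ascore_toy$confident))
```

Four of the six example scans pass the threshold, with 17.480 only just clearing it.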
2.1.6 MS/MS ID Filter: Peptide Level
# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid, level = "peptide", fdr.max = 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 77741 at 0.51 % FDR
## #peptides: 23118 at 1 % FDR
## #accessions: 15964 at 4.8 % FDR
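The output above reports a different FDR at each level even though a single filter was applied. A minimal sketch of why this happens is given below, assuming the common target-decoy estimate (number of decoys divided by number of targets); whether this matches the package's internal calculation exactly is an assumption. The toy data and the helper `decoy_fdr` are hypothetical; only the isDecoy column name is taken from Table 2.1.

```r
# Hypothetical PSM table: targets tend to be observed repeatedly,
# while decoy hits are usually one-off matches
psms <- data.frame(
  peptide = c("a", "a", "a", "b", "b", "c", "d"),
  isDecoy = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Target-decoy estimate: decoys / targets
decoy_fdr <- function(is_decoy) sum(is_decoy) / sum(!is_decoy)

# PSM level: every row counts
psm_fdr <- decoy_fdr(psms$isDecoy)

# Peptide level: each peptide counts once
pep_fdr <- decoy_fdr(unique(psms)$isDecoy)

print(c(PSM = psm_fdr, peptide = pep_fdr))
```

Collapsing repeated observations shrinks the target count faster than the decoy count, so in this toy case the peptide-level estimate is higher than the PSM-level one, the same direction as in the filtered output above.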
2.1.7 MS/MS ID Filter: Protein Level
This step is unnecessary for PTM data, since the cross-tab is not created at the protein level, so it is skipped.
2.1.8 Inference of Parsimonious Protein Set
If a protein was detected in the global proteomics results, we may be more confident that it will appear in the PTM results. We can perform prioritized inference of the protein set to ensure that, if a protein is reported in the global cross-tab, and it is present in the PTM MSnID after filtering, it will be included in the final PTM MSnID. We set the proteins from the global cross-tab as the prior. By default, peptides are allowed to match multiple proteins in the prior. If duplicates are not allowed, we can set the refine_prior argument to TRUE.
# Proteins from global proteomics cross-tab
load("./data/3442_global_proteins.RData")
# Prioritized inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
                                       prior = global_proteins,
                                       refine_prior = FALSE)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 23
## #PSMs: 77738 at 0.51 % FDR
## #peptides: 23117 at 0.99 % FDR
## #accessions: 4419 at 4.8 % FDR
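The effect of a prior on parsimonious inference can be illustrated with a simplified greedy set-cover sketch in base R. This is not the algorithm used by infer_parsimonious_accessions. The peptide-to-protein map, the `prior` vector, and the helper `greedy_cover` are all hypothetical and only show why a protein from the prior is retained even when a smaller protein set would explain the same peptides.

```r
# Hypothetical peptide-to-protein map (a peptide may match several proteins)
pep2prot <- data.frame(
  peptide   = c("p1", "p1", "p2", "p3", "p3", "p4"),
  accession = c("A",  "B",  "B",  "B",  "C",  "C"),
  stringsAsFactors = FALSE
)
prior <- c("A")  # stand-in for proteins from the global cross-tab

# Repeatedly pick the protein that explains the most remaining peptides
greedy_cover <- function(map) {
  chosen <- character(0)
  while (nrow(map) > 0) {
    counts  <- table(map$accession)
    best    <- names(counts)[which.max(counts)]
    chosen  <- c(chosen, best)
    covered <- map$peptide[map$accession == best]
    map     <- map[!map$peptide %in% covered, , drop = FALSE]
  }
  chosen
}

# Without a prior: the smallest set that explains every peptide
no_prior <- greedy_cover(pep2prot)

# With a prior: keep prior proteins that are present, drop the peptides
# they explain, then cover whatever is left
in_prior   <- intersect(prior, pep2prot$accession)
explained  <- unique(pep2prot$peptide[pep2prot$accession %in% in_prior])
rest       <- pep2prot[!pep2prot$peptide %in% explained, , drop = FALSE]
with_prior <- c(in_prior, greedy_cover(rest))

print(no_prior)
print(with_prior)
```

Without the prior, proteins B and C suffice. With the prior, A is kept because it is present in the data, and B and C are still needed for the remaining peptides.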
2.1.9 Map Sites to Protein Sequences
MSnID::map_mod_sites creates a number of columns describing the mapping of the modification sites onto the protein sequences. The most important for the user is SiteID. The names of the FASTA sequence set, names(fst), must match accessions(msnid); usually, we will have to modify the names to remove everything after the first word.
# Create AAStringSet
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst)), ]
# First 6 names
head(names(fst))
## [1] "NP_783171.2 cathepsin R precursor [Rattus norvegicus]"
## [2] "NP_001101862.2 zinc finger protein ZIC 2 [Rattus norvegicus]"
## [3] "NP_113721.4 UDP-glucuronosyltransferase 2B2 precursor [Rattus norvegicus]"
## [4] "NP_714948.1 Ly-49 stimulatory receptor 3 [Rattus norvegicus]"
## [5] "NP_001000704.1 olfactory receptor Olr931 [Rattus norvegicus]"
## [6] "NP_001000638.1 olfactory receptor Olr652 [Rattus norvegicus]"
# Modify names to match accessions(msnid)
# Remove any space followed by any number of characters
names(fst) <- sub(" .*", "", names(fst))
# First 6 names
head(names(fst))
## [1] "NP_783171.2" "NP_001101862.2" "NP_113721.4" "NP_714948.1"
## [5] "NP_001000704.1" "NP_001000638.1"
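Before mapping, it can be worth confirming that every accession has a matching sequence name. The sketch below does this with plain character vectors in base R. The FASTA headers are copied from the output above, while `acc` and the accession "XP_000000.1" are hypothetical stand-ins for accessions(msnid).

```r
# Two FASTA headers from the output above
fasta_names <- c(
  "NP_783171.2 cathepsin R precursor [Rattus norvegicus]",
  "NP_001101862.2 zinc finger protein ZIC 2 [Rattus norvegicus]"
)

# Keep only the first word of each header
ids <- sub(" .*", "", fasta_names)

# Hypothetical accessions; the second one has no FASTA entry
acc <- c("NP_783171.2", "XP_000000.1")

# Accessions that would fail to map
unmatched <- setdiff(acc, ids)

print(ids)
print(unmatched)
```

An empty `unmatched` vector would indicate that all accessions can be looked up in the sequence set.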
The names are in the proper format, so we can continue with the main mapping call. This will also remove any unmodified peptides, if Section 2.1.3 was skipped.
# Main mapping call
msnid <- map_mod_sites(object = msnid,
                       fasta = fst,
                       accession_col = "accession",
                       peptide_mod_col = "peptide",
                       mod_char = "*", # asterisk for phosphorylation
                       site_delimiter = ";") # semicolon between multiple sites
Table 2.1 shows the first 6 rows of the processed MS-GF+ output.
Dataset | ResultID | Scan | FragMethod | SpecIndex | Charge | PrecursorMZ | DelM | DelM_PPM | MH | OriginalPeptide | Protein | NTT | DeNovoScore | MSGFScore | MSGFDB_SpecEValue | Rank_MSGFDB_SpecEValue | EValue | QValue | PepQValue | IsotopeError | accession | calculatedMassToCharge | chargeState | experimentalMassToCharge | isDecoy | spectrumFile | spectrumID | pepSeq | peptide | maxAScore | msmsScore | absParentMassErrorPPM | First_AA | Last_AA | First_AA_First | Last_AA_First | ProtLen | ModShift | ModAAs | SiteLoc | Site | SiteCollapsed | SiteCollapsedFirst | SiteID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 | 12697 | 27321 | HCD | 2256 | 3 | 1045.124 | 0.003 | 0.858 | 3131.346 | A.AAAAAGDS*DS*WDADTFSMEDPVRK.V | NP_001071138.1 | 1 | 146 | 58 | 0 | 1 | 0 | 0 | 0 | 2 | NP_001071138.1 | 1044.454 | 3 | 1044.455 | FALSE | MoTrPAC_Pilot_TMT_P_S1_06_DIL_28Oct17_Elm_AQ-17-10-03 | 27321 | AAAAAGDSDSWDADTFSMEDPVRK | A.AAAAAGDSDS*WDADT*FSMEDPVRK.V | 0.000 | Inf | 1.057 | 5 | 28 | 5 | 28 | 259 | 9, 14 | S, T | 14, 19 | S14, T19 | S14,T19 | S14,T19 | NP_001071138.1-S14;T19 |
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 875 | 23519 | HCD | 264 | 3 | 952.144 | 0.004 | 1.538 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 165 | 129 | 0 | 1 | 0 | 0 | 0 | 0 | XP_006232986.1 | 952.142 | 3 | 952.144 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 23519 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 52.349 | Inf | 1.625 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250 |
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 12873 | 23508 | HCD | 2213 | 4 | 714.360 | 0.007 | 2.392 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 122 | 81 | 0 | 1 | 0 | 0 | 0 | 0 | XP_006232986.1 | 714.358 | 4 | 714.360 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 23508 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 17.480 | Inf | 2.472 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250 |
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 | 1793 | 23803 | HCD | 349 | 3 | 952.472 | -0.015 | -5.362 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 146 | 116 | 0 | 1 | 0 | 0 | 0 | 1 | XP_006232986.1 | 952.142 | 3 | 952.137 | FALSE | MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 | 23803 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 52.349 | Inf | 5.277 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250 |
MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 | 2731 | 23697 | HCD | 502 | 4 | 714.610 | 0.002 | 0.706 | 2854.412 | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | XP_006232986.1 | 2 | 135 | 104 | 0 | 1 | 0 | 0 | 0 | 1 | XP_006232986.1 | 714.358 | 4 | 714.359 | FALSE | MoTrPAC_Pilot_TMT_P_S2_07_3Nov17_Elm_AQ-17-10-03 | 23697 | AAAASAAEAGIATPGTEDSDDALLK | R.AAAASAAEAGIAT*PGTEDSDDALLK.M | 26.295 | Inf | 0.780 | 238 | 262 | 238 | 262 | 377 | 12 | T | 250 | T250 | T250 | T250 | XP_006232986.1-T250 |
MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 4877 | 21265 | HCD | 935 | 4 | 800.403 | 0.006 | 1.871 | 3196.577 | R.AAAASAAEAGIAT*PGTEGERDSDDALLK.M | NP_112621.1 | 2 | 194 | 114 | 0 | 1 | 0 | 0 | 0 | 2 | NP_112621.1 | 799.900 | 4 | 799.901 | FALSE | MoTrPAC_Pilot_TMT_P_S1_07_DIL_28Oct17_Elm_AQ-17-10-03 | 21265 | AAAASAAEAGIATPGTEGERDSDDALLK | R.AAAASAAEAGIATPGT*EGERDSDDALLK.M | 6.213 | Inf | 1.902 | 238 | 265 | 238 | 265 | 380 | 15 | T | 253 | T253 | T253 | T253 | NP_112621.1-T253 |
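The relationship between the peptide, First_AA, and SiteID columns in Table 2.1 can be reproduced with a short base R sketch. It is not the implementation of map_mod_sites: the helper `site_id` is hypothetical, and the start position of the peptide in the protein is supplied by hand from the First_AA column, whereas the real function derives it from the FASTA sequences.

```r
# Hypothetical helper: build a SiteID from a modified peptide string and
# the position of the peptide's first residue within the protein
site_id <- function(peptide, accession, first_aa,
                    mod_char = "*", delim = ";") {
  # Strip flanking residues such as "R." and ".M"
  core  <- sub("^.\\.", "", sub("\\..$", "", peptide))
  chars <- strsplit(core, "", fixed = TRUE)[[1]]
  # Each marker follows its residue; subtracting the number of markers
  # seen so far gives the residue index in the unmodified sequence
  mark_idx <- which(chars == mod_char)
  res_idx  <- mark_idx - seq_along(mark_idx)
  clean    <- chars[chars != mod_char]
  sites    <- paste0(clean[res_idx], first_aa + res_idx - 1)
  paste0(accession, "-", paste(sites, collapse = delim))
}

# Rows 1 and 2 of Table 2.1
id1 <- site_id("A.AAAAAGDSDS*WDADT*FSMEDPVRK.V", "NP_001071138.1", 5)
id2 <- site_id("R.AAAASAAEAGIAT*PGTEDSDDALLK.M", "XP_006232986.1", 238)

print(c(id1, id2))
```

Both results match the SiteID column of Table 2.1. The offsets `res_idx - 1` correspond to the ModShift column (9 and 14 for the first row, 12 for the second).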