Suggested TMT processing pipeline for PNNL DMS *phospho* data for MoTrPAC project. Peptide level. • PlexedPiper

PlexedPiper is the main library for the pipeline. MSnID package is a flexible way of handling MS/MS identifications. It handles MS/MS data filtering part.

library(PlexedPiper)
library(PNNL.DMS.utils)
library(MSnID)
library(dplyr)
library(Biostrings)

Data source

This vignette demonstrates processing usign PNNL’s DMS as a data source. PNNL DMS is based on MS SQL Server. Typically data is organized into data packages according to experiments. Data package that contains phospho data mirroring PlexedPiperTestData is number 3626. Smaller version of the data package, suggested for testing purposes is 3625.

data_package_num <- 3625

# if there is no connection to PNNL DMS, vignette is not compiled
if(!is_PNNL_DMS_connection_successful()){
   message("There is no connection to PNNL DMS. This code in this vignette won't be evaluated.")
   knitr::opts_chunk$set(eval = FALSE)
}

Prepare MS/MS IDs

Read the MS-GF+ output

First step is to determine MS-GF+ jobs using the data package number. This simply reads parsed to text output of MS-GF+ search engine. The text files are collated together and the resulting data.frame used to create MSnID object.

msnid <- read_msgf_data_from_DMS(data_package_num)
show(msnid)

AScore

Phospho dataset involve Ascore jobs for improving phospho-site localization. There should be one Ascore job per data package. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly it contains AScore column that signifies the confidence of PTM assingment. AScore > 17 is considered confident.

ascore <- read_AScore_results_from_DMS(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)

Remove non-phospho. Need to be sure that phospho symbol is *!

msnid <- apply_filter(msnid, "grepl(\"\\\\*\", peptide)")

FDR filter

FDR filter at peptide level.

msnid <- filter_msgf_data_peptide_level(msnid, 0.01)

Inference of parsimonius set

msnid <- infer_parsimonious_accessions(msnid)

Mapping sites to protein sequence

Mapping sites to protein sequence. Call creates number of columns describing mapping of the site/s onto the protein sequences. The most important for the user is SiteID.

Prepare FASTA to make sure entry names in FASTA file match MSnID accessions. The plan is to make this conversion automatic. Note, this type of ID extraction will ignore non-RefSeq entries such as contaminants.

fst_dms_pth <- path_to_FASTA_used_by_DMS(data_package_num)
fst <- readAAStringSet(fst_dms_pth)
names(fst) <- gsub(" .*", "", names(fst)) # make names match accessions

Mapping main call.

msnid <- map_mod_sites(msnid, fst, 
                       accession_col = "accession", 
                       peptide_mod_col = "peptide", 
                       mod_char = "*",
                       site_delimiter = "lower")
head(psms(msnid))

Prepare MASIC reporter ion intensities

Fetching and preparing reporter intensities based on MASIC ouput.

masic_data <- read_masic_data_from_DMS(data_package_num, 
                                       interference_score = T)
masic_data <- filter_masic_data(masic_data, 0.5, 0)

Fetch study design tables

study_design_tables <- read_study_design_from_DMS(data_package_num)
fractions <- study_design_tables$fractions
samples <- study_design_tables$samples
references <- study_design_tables$references

Create cross-tab

Aggregation at RefSeq / peptide level.

msnid <- apply_filter(msnid, "!isDecoy")
# aggregation_level <- c("SiteID")
aggregation_level <- c("accession","peptide")
quant_cross_tab <- create_crosstab(msnid, 
                                   masic_data, 
                                   aggregation_level, 
                                   fractions, samples, references)
dim(quant_cross_tab)
head(quant_cross_tab)

Post-processing. Linking with AScores

x_ascore <- 
  psms(msnid) %>% 
  group_by(accession, peptide, SiteID) %>%
  summarise(maxAScore = max(maxAScore))

x <- quant_cross_tab %>%
  as.data.frame() %>%
  tibble::rownames_to_column("ID") %>%
  mutate(RefSeq =  sub("(.*)@(.\\..*\\..)","\\1",ID),
         peptide = sub("(.*)@(.\\..*\\..)","\\2",ID)) %>%
  dplyr::select(-ID) %>%
  inner_join(x_ascore) %>%
  mutate(ConfidentSite = maxAScore > 17,
         peptide = gsub(".\\.(.*)\\..","\\1",peptide, perl=T),
         peptide = gsub("(\\w)\\*","\\L\\1",peptide, perl=T),
         SiteID = sub("-", "_", SiteID),
         ptm_peptide = paste0(SiteID, "-", peptide))  %>%
  dplyr::rename(ptm_id = SiteID,
                ascore = maxAScore,
                confidence = ConfidentSite) %>%
  dplyr::select(-c(accession, RefSeq)) %>%
  dplyr::select(ptm_peptide, ptm_id, peptide, ascore, confidence, everything())
head(x)