Suggested TMT processing pipeline for PNNL DMS *phospho* data. • PlexedPiper

PlexedPiper is the main library for the pipeline. MSnID package is a flexible way of handling MS/MS identifications. It handles MS/MS data filtering part.

library(PlexedPiper)
library(PNNL.DMS.utils)
library(MSnID)
library(dplyr)

Data source

This vignette demonstrates processing usign PNNL’s DMS as a data source. PNNL DMS is based on MS SQL Server. Typically data is organized into data packages according to experiments. Data package that contains phospho data mirroring PlexedPiperTestData is number 3626. Smaller version of the data package, suggested for testing purposes is 3625.

data_package_num <- 3625

# if there is no connection to PNNL DMS, vignette is not compiled
if(!is_PNNL_DMS_connection_successful()){
  message("There is no connection to PNNL DMS. This code in this vignette won't be evaluated.")
  knitr::opts_chunk$set(eval = FALSE)
}

Prepare MS/MS IDs

Read the MS-GF+ output

First step is to determine MS-GF+ jobs using the data package number. This simply reads parsed to text output of MS-GF+ search engine. The text files are collated together and the resulting data.frame used to create MSnID object.

msnid <- read_msgf_data_from_DMS(data_package_num)
show(msnid)

Phospho dataset involve Ascore jobs for improving phospho-site localization. There should be one Ascore job per data package. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly it contains AScore column that signifies the confidence of PTM assingment. AScore > 17 is considered confident.

ascore <- read_AScore_results_from_DMS(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)

Remove non-phospho. Need to be sure that phospho symbol is *!

msnid <- apply_filter(msnid, "grepl(\"\\\\*\", peptide)")

FDR filter at peptide level.

msnid <- filter_msgf_data_peptide_level(msnid, 0.01)

Remapping IDs from RefSeq to gene symbols both MSnID object and the FASTA file use for the search.

conv_tbl <- fetch_conversion_table(organism_name = "Rattus norvegicus", 
                                  from = "REFSEQ", to = "SYMBOL")

msnid <- remap_accessions_refseq_to_gene(msnid,
                                         conversion_table = conv_tbl)

# Remove missing accessions
psms(msnid) <- dplyr::filter(psms(msnid), !is.na(accession))

fst_dms_pth <- path_to_FASTA_used_by_DMS(data_package_num)
fst_dms_pth_2 <- remap_accessions_refseq_to_gene_fasta(
  fst_dms_pth, conversion_table = conv_tbl
)

Double checking FASTA file. Note, in case of multiple RefSeq mapping to the same gene, we picked the RefSeq entry with the longest sequence. Unfortunately, unlike UniProt, RefSeq does not report, which sequence should be considered as canonical. So to resolve ambiguity we resort to protein length.

library(Biostrings)
fst <- readAAStringSet(fst_dms_pth_2)

Mapping sites to protein sequence. In a lot of cases it is easier to handle and interpret for example, Hrc-T379t, rather then peptide sequence D.DDDNDGGST*ENVHQAHR.H itself.

msnid <- map_mod_sites(msnid, fst, 
                       accession_col = "accession", 
                       peptide_mod_col = "peptide", 
                       mod_char = "*",
                       site_delimiter = "lower")
head(psms(msnid))

Inference of parsimonius set

Inference of parsimonius step was skipped in this pipeline. Remapping RefSeq IDs to gene symbols helped the situation of the ambiguous assignment. Demo of inference should be available in other vignettes.

Prepare MASIC reporter ion intensities

Fetching and preparing reporter intensities based on MASIC ouput.

masic_data <- read_masic_data_from_DMS(data_package_num, 
                                       interference_score = T)
masic_data <- filter_masic_data(masic_data, 0.5, 0)

Fetch study design tables

study_design_tables <- read_study_design_from_DMS(data_package_num)
fractions <- study_design_tables$fractions
samples <- study_design_tables$samples
references <- study_design_tables$references

Create cross-tab

msnid <- apply_filter(msnid, "!isDecoy")
aggregation_level <- c("SiteID")
quant_cross_tab <- create_crosstab(msnid, 
                                   masic_data, 
                                   aggregation_level, 
                                   fractions, samples, references)
dim(quant_cross_tab)
head(quant_cross_tab)