Suggested TMT processing pipeline for PNNL DMS *phospho* data.
Source:vignettes/tmt_pipeline_for_PNNL_DMS_phos.Rmd
tmt_pipeline_for_PNNL_DMS_phos.Rmd
PlexedPiper
is the main library for the pipeline.
MSnID
package is a flexible way of handling MS/MS
identifications. It handles MS/MS data filtering part.
library(PlexedPiper)
library(PNNL.DMS.utils)
library(MSnID)
library(dplyr)
Data source
This vignette demonstrates processing usign PNNL’s DMS as a data
source. PNNL DMS is based on MS SQL Server. Typically data is organized
into data packages according to experiments. Data package that contains
phospho data mirroring PlexedPiperTestData
is number
3626
. Smaller version of the data package, suggested for
testing purposes is 3625
.
data_package_num <- 3625
# if there is no connection to PNNL DMS, vignette is not compiled
if(!is_PNNL_DMS_connection_successful()){
message("There is no connection to PNNL DMS. This code in this vignette won't be evaluated.")
knitr::opts_chunk$set(eval = FALSE)
}
Prepare MS/MS IDs
Read the MS-GF+ output
First step is to determine MS-GF+ jobs using the data package number.
This simply reads parsed to text output of MS-GF+ search engine. The
text files are collated together and the resulting
data.frame
used to create MSnID object.
msnid <- read_msgf_data_from_DMS(data_package_num)
show(msnid)
Phospho dataset involve Ascore jobs for improving phospho-site
localization. There should be one Ascore job per data package. The
fetched object is a data.frame
that links datasets, scans
and original PTM localization to newly suggested locations. Importantly
it contains AScore
column that signifies the confidence of
PTM assingment. AScore > 17
is considered confident.
ascore <- read_AScore_results_from_DMS(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)
Remove non-phospho. Need to be sure that phospho symbol is *!
msnid <- apply_filter(msnid, "grepl(\"\\\\*\", peptide)")
FDR filter at peptide level.
msnid <- filter_msgf_data_peptide_level(msnid, 0.01)
Remapping IDs from RefSeq to gene symbols both MSnID
object and the FASTA file use for the search.
conv_tbl <- fetch_conversion_table(organism_name = "Rattus norvegicus",
from = "REFSEQ", to = "SYMBOL")
msnid <- remap_accessions_refseq_to_gene(msnid,
conversion_table = conv_tbl)
# Remove missing accessions
psms(msnid) <- dplyr::filter(psms(msnid), !is.na(accession))
fst_dms_pth <- path_to_FASTA_used_by_DMS(data_package_num)
fst_dms_pth_2 <- remap_accessions_refseq_to_gene_fasta(
fst_dms_pth, conversion_table = conv_tbl
)
Double checking FASTA file. Note, in case of multiple RefSeq mapping to the same gene, we picked the RefSeq entry with the longest sequence. Unfortunately, unlike UniProt, RefSeq does not report, which sequence should be considered as canonical. So to resolve ambiguity we resort to protein length.
library(Biostrings)
fst <- readAAStringSet(fst_dms_pth_2)
Mapping sites to protein sequence. In a lot of cases it is easier to
handle and interpret for example, Hrc-T379t
, rather then
peptide sequence D.DDDNDGGST*ENVHQAHR.H
itself.
msnid <- map_mod_sites(msnid, fst,
accession_col = "accession",
peptide_mod_col = "peptide",
mod_char = "*",
site_delimiter = "lower")
head(psms(msnid))
Prepare MASIC reporter ion intensities
Fetching and preparing reporter intensities based on MASIC ouput.
masic_data <- read_masic_data_from_DMS(data_package_num,
interference_score = T)
masic_data <- filter_masic_data(masic_data, 0.5, 0)
Fetch study design tables
study_design_tables <- read_study_design_from_DMS(data_package_num)
fractions <- study_design_tables$fractions
samples <- study_design_tables$samples
references <- study_design_tables$references
Create cross-tab
msnid <- apply_filter(msnid, "!isDecoy")
aggregation_level <- c("SiteID")
quant_cross_tab <- create_crosstab(msnid,
masic_data,
aggregation_level,
fractions, samples, references)
dim(quant_cross_tab)
head(quant_cross_tab)