Skip to contents

PlexedPiper is the main library for the pipeline. MSnID package is a flexible way of handling MS/MS identifications. It handles MS/MS data filtering part.

Data source

This pipeline shows how to process TMT data that is processed within PNNL DMS. Data in PNNL DMS is organized using MS SQL Server into data packages. The data package number matching PlexedPiperTestData is 3442. Note, this vignette won’t compile if there is now access to PNNL DMS.

Alternative, smaller variant of 3442 package is 3606.

data_package_num <- 3606
# if there is no connection to PNNL DMS, vignette is not compiled
if(!is_PNNL_DMS_connection_successful()){
   message("There is no connection to PNNL DMS. This code in this vignette won't be evaluated.")
   knitr::opts_chunk$set(eval = FALSE)
}

Prepare MS/MS IDs

Read the MS-GF+ output

Fetching MS/MS search results by data package number. The text files are collated together and the resulting data.frame used to create MSnID object.

msnid <- read_msgf_data_from_DMS(data_package_num)
show(msnid)

This MSnID methods corrects for isotope selection error. Occasionally, instrument selects not the lightest or monoisotopic peak, but a peak with +1 or more C13. While MS-GF+ is still capable of correctly identifying those, the downstream calculations of mass measurement error need to be fixed. This method corrects mass measurement errors accounting for selection of other than monoisotopic peak.

msnid <- correct_peak_selection(msnid)

MS/MS ID filter and peptide level

Main MS/MS filtering step. The filter_msgf_data_peptide_level function wraps-up MSnID::optimize_filter method and uses PepQValue from MS-GF+ output and absolute deviation of mass measurement error of parent ions (in ppm).

show(msnid)
msnid <- filter_msgf_data_peptide_level(msnid, 0.01)
show(msnid)

Switching annotation from RefSeq to gene symbols

This RefSeq to gene symbols conversion step can be skipped if the downstream analysis relies on RefSeq IDs.

RefSeq is a very redundant database and contains lots of tentative splice isoforms and unconfirmed ORFs. As a result, a lot of peptides (~60%) are shared across multiple RefSeq accessions. This adds to uncertainity which exactly protein we are quantifying. One solution is to switch to less redundant annotations like gene symbol. Moreover gene symbols are more human-interpretable and quite commonly used in downstream analysis.

msnid <- remap_accessions_refseq_to_gene(msnid, 
                                         organism_name="Rattus norvegicus")

This step converts sequences names in the fasta file used for MS/MS searches from RefSeq to gene symbols. This is necessary if we are going to count the number of peptide identifications per protein or protein sequence coverage with accessions switched from RefSeq to gene symbols.

path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)
path_to_FASTA_gene <- remap_accessions_refseq_to_gene_fasta(
   path_to_FASTA,
   organism_name = "Rattus norvegicus"
)

MS/MS ID filter at protein level

A while ago proteomics field established hard-and-fast two peptide per protein rule. That is if a protein identified with one peptide it is not considered a confident identification. It works OK, but there is a room for improvement. For example this rule really penalizes short protens. Then there are some very long proteins (e.g. Titin 3.8 MDa) that easily have more then two matching peptides even in reversed sequence. Thus, here we propose to normalize number of peptides per protein length and use that as a filtering criterion.

msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA_gene)
show(msnid)
msnid <- filter_msgf_data_protein_level(msnid, 0.01)
show(msnid)

Inference of parsimonious protein set

The situation when sequence of a certain peptide matches multiple proteins adds complication to the downstream quantitative analysis as it is not clear which protein product this peptide is originating from. There are common ways for selecting a protein set. One is simply retain only uniquely matching peptides (unique_only=TRUE). Shared peptides are simply discarded in this case. Alternatively (in case of unique_only=FALSE) the shared peptides assigned to the proteins with larger number of uniquely mapping peptides. If there is a choice between multiple proteins with equal number of uniquely mapping peptides, the shared peptides is assigned to the first protein according to alphanumeric order.

msnid <- infer_parsimonious_accessions(msnid, unique_only=TRUE)
show(msnid)

Remove decoy accessions

msnid <- apply_filter(msnid, "!isDecoy")

Prepare reporter ion intensities

Read MASIC ouput

MASIC is a tool for extracting ion intensities. With proper parameter settings it can be used for extracting TMT (or iTRAQ) reporter ion intensities. In addition it reporter a number of other helpful metrics. Of the most interest are interference score of the parent ion level and S/N at reporter ion level. Interference score reflects what proportion of ion population isolated for fragmentation is due to the targeted ion. The other words 1 - InterferenceScore is due to co-isolated species that have similar elution time and parent ion m/z. S/N at reporter level is the metric that computed by Thermo software.

masic_data <- read_masic_data_from_DMS(data_package_num, 
                                       interference_score=TRUE)
head(masic_data)

Filtering MASIC data

In this case (currently recommended filters) we require that at least 50% of ion population is due to the targeted ion and no filter at S/N level.

nrow(masic_data)
masic_data <- filter_masic_data(masic_data, 0.5, 0)
nrow(masic_data)

Creating cross-tab

Laying out study desing

To convert from PSMs and reporter ion intensities to meaningful quantitative data it is necessary to know what are the samples in the reporter channels and what is the intended reference channel (or combination of channels). The entire study desing captured by three tables - fractions, samples, references.

Fractions

Dataset to fraction assigment can be read from a separate file or inferred from the dataset names. This example shows the latter.

library(dplyr)
fractions <- masic_data %>%
   distinct(Dataset) %>%
   mutate(PlexID =sub("MoTrPAC_Pilot_TMT_W_(S\\d).*","\\1",Dataset))
head(fractions)

Samples

This links plex ID, reporter channel with samples. There is a minor difference between ReporterAlias and MeasurementName. MeasurementName is the sample name that will be finally reported. ReporterAlias is an itermediate between ReporterName and MeasurementName and is used for defining what the reference is. There may be a way to query this from DMS. At this point, we’ll read the file from PlexedPiperTestData package.

library(readr)
samples <- read_tsv(system.file("extdata/study_design/samples.txt", 
                                package = "PlexedPiperTestData"))
head(samples,10)

References

Reference can be a certain channel, average of multiple channels or 1. In general form it is an expression with ReporterAlias names as variables. It is evaluated for each PlexID/QuantBlock combination and applied to divide reporter ion intensities within corresponding PlexID/QuantBlock. QuantBlock can be though of as a way of defining sub-plex. In a typical TMT experiment QuantBlock is always 1. In case of 5 pairwise comparisons within TMT10, there will be 5 QuantBlocks (1..5) with reference for each QuantBlock.

references <- read_tsv(system.file("extdata/study_design/references.txt",
                                   package = "PlexedPiperTestData"))
head(references)

Reading Study Desing Tables from DMS Data Package Locations

By convention, the study design tables should be stored in the data package folder in DMS.

study_design_tables <- read_study_design_from_DMS(data_package_num)
fractions <- study_design_tables$fractions
samples <- study_design_tables$samples
references <- study_design_tables$references

Creating quantitative cross-tab

Final step when MS/MS IDs and reporter ions linked together, aggregated down to peptide or accession (i.e. protein) level. To retain protein IDs while aggregating to peptide level we suggest aggregation_level <- c("accession","peptide")

aggregation_level <- c("accession")
quant_cross_tab <- create_crosstab(msnid, 
                                   masic_data, 
                                   aggregation_level, 
                                   fractions, samples, references)
dim(quant_cross_tab)
head(quant_cross_tab)