Suggested TMT processing pipeline. Variant 1. • PlexedPiper

PlexedPiper is the main library for the pipeline. MSnID package is a flexible way of handling MS/MS identifications. It handles MS/MS data filtering part.

library(PlexedPiper)
library(MSnID)

Data source

This pipeline shows how to process TMT data that is processed outside of PNNL DMS. This implies that raw files processed by MSGF+ and MASIC tools and the processing results are in corresponding directories. For convienience the results of MSGF+ and MASIC processing are provided in a companion PlexedPiperTestData package.

Prepare MS/MS IDs

Read the MS-GF+ output

This simply reads parsed to text output of MS-GF+ search engine. The text files are collated together and the resulting data.frame used to create MSnID object.

path_to_MSGF_results <- system.file(
   "extdata/global/msgf_output",
   package = "PlexedPiperTestData")
msnid <- read_msgf_data(path_to_MSGF_results)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 1156754 at 31 % FDR
## #peptides: 511617 at 61 % FDR
## #accessions: 128378 at 98 % FDR

This MSnID methods corrects for isotope selection error. Occasionally, instrument selects not the lightest or monoisotopic peak, but a peak with +1 or more C13. While MS-GF+ is still capable of correctly identifying those, the downstream calculations of mass measurement error need to be fixed. This method corrects mass measurement errors accounting for selection of other than monoisotopic peak.

msnid <- correct_peak_selection(msnid)

MS/MS ID filter and peptide level

Main MS/MS filtering step. The filter_msgf_data_peptide_level function wraps-up MSnID::optimize_filter method and uses PepQValue from MS-GF+ output and absolute deviation of mass measurement error of parent ions (in ppm).

show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 1156754 at 31 % FDR
## #peptides: 511617 at 61 % FDR
## #accessions: 128378 at 98 % FDR

msnid <- filter_msgf_data_peptide_level(msnid, 0.01)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 480204 at 0.45 % FDR
## #peptides: 96637 at 1 % FDR
## #accessions: 26926 at 8.7 % FDR

Switching annotation from RefSeq to gene symbols

This RefSeq to gene symbols conversion step can be skipped if the downstream analysis relies on RefSeq IDs.

RefSeq is a very redundant database and contains lots of tentative splice isoforms and unconfirmed ORFs. As a result, a lot of peptides (~60%) are shared across multiple RefSeq accessions. This adds to uncertainity which exactly protein we are quantifying. One solution is to switch to less redundant annotations like gene symbol. Moreover gene symbols are more human-interpretable and quite commonly used in downstream analysis.

msnid <- remap_accessions_refseq_to_gene(msnid, 
                                         organism_name="Rattus norvegicus")

This step converts sequences names in the fasta file used for MS/MS searches from RefSeq to gene symbols. This is necessary if we are going to count the number of peptide identifications per protein or protein sequence coverage with accessions switched from RefSeq to gene symbols.

path_to_FASTA <- system.file(
   "extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz",
   package = "PlexedPiperTestData")
temp_dir <- tempdir()
file.copy(path_to_FASTA, temp_dir)

## [1] TRUE

path_to_FASTA <- file.path(temp_dir, basename(path_to_FASTA))
path_to_FASTA_gene <- remap_accessions_refseq_to_gene_fasta(
   path_to_FASTA, organism_name="Rattus norvegicus")

MS/MS ID filter at protein level

A while ago proteomics field established hard-and-fast two peptide per protein rule. That is if a protein identified with one peptide it is not considered a confident identification. It works OK, but there is a room for improvement. For example this rule really penalizes short protens. Then there are some very long proteins (e.g. Titin 3.8 MDa) that easily have more then two matching peptides even in reversed sequence. Thus, here we propose to normalize number of peptides per protein length and use that as a filtering criterion.

msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA_gene)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 480204 at 0.45 % FDR
## #peptides: 96637 at 1 % FDR
## #accessions: 8127 at 7 % FDR

msnid <- filter_msgf_data_protein_level(msnid, 0.01)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 398737 at 0.18 % FDR
## #peptides: 81623 at 0.31 % FDR
## #accessions: 5483 at 0.99 % FDR

Inference of parsimonious protein set

The situation when sequence of a certain peptide matches multiple proteins adds complication to the downstream quantitative analysis as it is not clear which protein product this peptide is originating from. There are common ways for selecting a protein set. One is simply retain only uniquely matching peptides (unique_only=TRUE). Shared peptides are simply discarded in this case. Alternatively (in case of unique_only=FALSE) the shared peptides assigned to the proteins with larger number of uniquely mapping peptides. If there is a choice between multiple proteins with equal number of uniquely mapping peptides, the shared peptides is assigned to the first protein according to alphanumeric order.

msnid <- infer_parsimonious_accessions(msnid, unique_only=TRUE)
show(msnid)

## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 302729 at 0.22 % FDR
## #peptides: 73332 at 0.31 % FDR
## #accessions: 4920 at 0.96 % FDR

Remove decoy accessions

msnid <- apply_filter(msnid, "!isDecoy")

Prepare reporter ion intensities

Read MASIC ouput

MASIC is a tool for extracting ion intensities. With proper parameter settings it can be used for extracting TMT (or iTRAQ) reporter ion intensities. In addition it reporter a number of other helpful metrics. Of the most interest are interference score of the parent ion level and S/N at reporter ion level. Interference score reflects what proportion of ion population isolated for fragmentation is due to the targeted ion. The other words 1 - InterferenceScore is due to co-isolated species that have similar elution time and parent ion m/z. S/N at reporter level is the metric that computed by Thermo software.

path_to_MASIC_results <- system.file("extdata/global/masic_output", package = "PlexedPiperTestData")
masic_data <- read_masic_data(path_to_MASIC_results, interference_score=TRUE)
head(masic_data)

##                                             Dataset ScanNumber Collision.Mode
## 1 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02          2            hcd
## 2 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02          3            hcd
## 3 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02          4            hcd
## 4 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02          5            hcd
## 5 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02          6            hcd
## 6 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02          8            hcd
##   ParentIonMZ BasePeakIntensity BasePeakMZ ReporterIonIntensityMax Ion_126.128
## 1      407.74         227695.44   407.7408                92236.87    70562.39
## 2      396.72          59127.97   529.2943                34294.90    23706.89
## 3      438.23         110444.82   362.2239                14053.40    12459.86
## 4      481.50          37082.72   206.4663                    0.00        0.00
## 5      549.28          21077.05   128.1285                21077.05        0.00
## 6      388.72          40605.85   356.7193                 8087.76     6166.82
##   Ion_127.125 Ion_127.131 Ion_128.128 Ion_128.134 Ion_129.131 Ion_129.138
## 1    24864.62    17165.80    35625.00    92236.87     9640.23     8578.05
## 2    13559.32     5856.83    16322.71    34294.90     4853.11     7938.24
## 3    11785.91    10932.51    10653.32    12328.62     5959.86     9905.82
## 4        0.00        0.00        0.00        0.00        0.00        0.00
## 5    10998.67        0.00    21077.05     2725.50        0.00        0.00
## 6     1371.27     2418.35     8087.76     5485.35        0.00        0.00
##   Ion_130.135 Ion_130.141 Ion_131.138 Weighted.Avg.Pct.Intensity.Correction
## 1     6996.69    11833.07    32281.34                                     0
## 2        0.00     1465.03    18182.27                                     0
## 3     8387.04    11166.70    14053.40                                     0
## 4        0.00        0.00        0.00                                     0
## 5        0.00        0.00     6800.70                                     0
## 6     1543.48     1943.96     7436.60                                     0
##   Ion_126.128_SignalToNoise Ion_127.125_SignalToNoise Ion_127.131_SignalToNoise
## 1                     71.47                     25.17                     17.38
## 2                     26.12                     14.94                      6.45
## 3                     12.40                     11.75                     10.90
## 4                        NA                        NA                        NA
## 5                        NA                      9.19                        NA
## 6                      6.92                      1.54                      2.71
##   Ion_128.128_SignalToNoise Ion_128.134_SignalToNoise Ion_129.131_SignalToNoise
## 1                     36.04                     93.32                      9.75
## 2                     17.97                     37.77                      5.34
## 3                     10.64                     12.31                      5.96
## 4                        NA                        NA                        NA
## 5                     17.57                      2.27                        NA
## 6                      9.04                      6.13                        NA
##   Ion_129.138_SignalToNoise Ion_130.135_SignalToNoise Ion_130.141_SignalToNoise
## 1                      8.67                      7.07                     11.96
## 2                      8.74                        NA                      1.61
## 3                      9.91                      8.40                     11.18
## 4                        NA                        NA                        NA
## 5                        NA                        NA                        NA
## 6                        NA                      1.72                      2.16
##   Ion_131.138_SignalToNoise Ion_126.128_Resolution Ion_127.125_Resolution
## 1                     32.71                  44102                  42700
## 2                     19.93                  42702                  41100
## 3                     14.13                  42006                  40702
## 4                        NA                     NA                     NA
## 5                      5.66                     NA                  40302
## 6                      8.26                  40000                  26400
##   Ion_127.131_Resolution Ion_128.128_Resolution Ion_128.134_Resolution
## 1                  42100                  41800                  44404
## 2                  37000                  40400                  43404
## 3                  41402                  40700                  40400
## 4                     NA                     NA                     NA
## 5                     NA                  42102                  46600
## 6                  30400                  40400                  44300
##   Ion_129.131_Resolution Ion_129.138_Resolution Ion_130.135_Resolution
## 1                  40500                  39500                  36800
## 2                  36400                  39700                     NA
## 3                  38800                  40200                  38900
## 4                     NA                     NA                     NA
## 5                     NA                     NA                     NA
## 6                     NA                     NA                  28800
##   Ion_130.141_Resolution Ion_131.138_Resolution ParentIonIndex       MZ
## 1                  41100                  42302              0 407.7424
## 2                  29800                  41802              1 396.7176
## 3                  40400                  41002              2 438.2270
## 4                     NA                     NA              3 481.5049
## 5                     NA                  40300              4 549.2793
## 6                  28500                  38700              5 388.7199
##   SurveyScanNumber OptimalPeakApexScanNumber PeakApexOverrideParentIonIndex
## 1                1                        12                             -1
## 2                1                        12                             -1
## 3                1                       131                             18
## 4                1                        23                             -1
## 5                1                        15                             -1
## 6                7                        12                             -1
##   CustomSICPeak PeakScanStart PeakScanEnd PeakScanMaxIntensity PeakMaxIntensity
## 1             0             1          19                   12          2901600
## 2             0             1          19                   12          2181900
## 3             0             1         114                  107          8255600
## 4             0             1          52                   23           401824
## 5             0             1          16                   15           363656
## 6             0             1          30                   12           478135
##   PeakSignalToNoiseRatio FWHMInScans  PeakArea ParentIonIntensity
## 1               211.0000           5  50422000            2579600
## 2                19.6900           5  34508000            1690600
## 3                 9.4650          44 444610000             658727
## 4                27.9900          23  16244000             344491
## 5                 0.6998           5   5941200             347071
## 6                28.7100          19  10718000             291189
##   PeakBaselineNoiseLevel PeakBaselineNoiseStDev PeakBaselinePointsUsed
## 1                  13750                  97562                  10113
## 2                 110841                1120000                  10166
## 3                 872195                2620000                  10129
## 4                  14356                  65777                  10109
## 5                 519640                1990000                  10109
## 6                  16653                 142562                  10142
##   StatMomentsArea CenterOfMassScan PeakStDev PeakSkew PeakKSStat
## 1        47031000               11      5.68 -0.18920     0.3528
## 2        31578000               11      5.59 -0.21720     0.3469
## 3       343470000               82     22.80 -0.62610     1.0990
## 4        14899000               26     14.10  0.05387     0.5036
## 5         2528900               15      2.49 -0.58330     0.4302
## 6         9961100               16      8.60 -0.05055     0.2833
##   StatMomentsDataCountUsed InterferenceScore
## 1                       10            0.9961
## 2                       10            0.9925
## 3                       89            1.0000
## 4                       39            1.0000
## 5                        5            1.0000
## 6                       18            0.9688

Filtering MASIC data

In this case (currently recommended filters) we require that at least 50% of ion population is due to the targeted ion and no filter at S/N level.

nrow(masic_data)

## [1] 1598123

masic_data <- filter_masic_data(masic_data, 0.5, 0)
nrow(masic_data)

## [1] 1523732

Creating cross-tab

Laying out study desing

To convert from PSMs and reporter ion intensities to meaningful quantitative data it is necessary to know what are the samples in the reporter channels and what is the intended reference channel (or combination of channels). The entire study desing captured by three tables - fractions, samples, references.

Fractions

This links dataset to original plex.

library(readr)
fractions <- read_tsv(system.file("extdata/study_design/fractions.txt", package = "PlexedPiperTestData"))
head(fractions)

## # A tibble: 6 × 2
##   Dataset                                           PlexID
##   <chr>                                             <chr> 
## 1 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 S1    
## 2 MoTrPAC_Pilot_TMT_W_S1_02_12Oct17_Elm_AQ-17-09-02 S1    
## 3 MoTrPAC_Pilot_TMT_W_S1_05_12Oct17_Elm_AQ-17-09-02 S1    
## 4 MoTrPAC_Pilot_TMT_W_S1_03_12Oct17_Elm_AQ-17-09-02 S1    
## 5 MoTrPAC_Pilot_TMT_W_S1_04_12Oct17_Elm_AQ-17-09-02 S1    
## 6 MoTrPAC_Pilot_TMT_W_S1_06_12Oct17_Elm_AQ-17-09-02 S1

Samples

This links plex ID, reporter channel with samples. There is a minor difference between ReporterAlias and MeasurementName. MeasurementName is the sample name that will be finally reported. ReporterAlias is an itermediate between ReporterName and MeasurementName and is used for defining what the reference is.

samples <- read_tsv(system.file("extdata/study_design/samples.txt", package = "PlexedPiperTestData"))
head(samples,10)

## # A tibble: 10 × 5
##    PlexID QuantBlock ReporterName ReporterAlias MeasurementName
##    <chr>       <dbl> <chr>        <chr>         <chr>          
##  1 S1              1 126          R_01          R_01           
##  2 S1              1 127N         R_03          R_03           
##  3 S1              1 127C         R_02          R_02           
##  4 S1              1 128N         R_05          R_05           
##  5 S1              1 128C         R_04          R_04           
##  6 S1              1 129N         S_02          S_02           
##  7 S1              1 129C         S_01          S_01           
##  8 S1              1 130N         S_04          S_04           
##  9 S1              1 130C         S_03          S_03           
## 10 S1              1 131          ref           NA

References

Reference can be a certain channel, average of multiple channels or 1. In general form it is an expression with ReporterAlias names as variables. It is evaluated for each PlexID/QuantBlock combination and applied to divide reporter ion intensities within corresponding PlexID/QuantBlock. QuantBlock can be though of as a way of defining sub-plex. In a typical TMT experiment QuantBlock is always 1. In case of 5 pairwise comparisons within TMT10, there will be 5 QuantBlocks (1..5) with reference for each QuantBlock.

references <- read_tsv(system.file("extdata/study_design/references.txt", package = "PlexedPiperTestData"))
head(references)

## # A tibble: 2 × 3
##   PlexID QuantBlock Reference
##   <chr>       <dbl> <chr>    
## 1 S1              1 ref      
## 2 S2              1 ref

Creating quantitative cross-tab

Final step when MS/MS IDs and reporter ions linked together, aggregated down to peptide or accession (i.e. protein) level. To retain protein IDs while aggregating to peptide level we suggest aggregation_level <- c("accession","peptide")

aggregation_level <- c("accession")
quant_cross_tab <- create_crosstab(msnid, 
                                   masic_data, 
                                   aggregation_level, 
                                   fractions, samples, references)
dim(quant_cross_tab)

## [1] 4859   18

head(quant_cross_tab)

##              R_01        R_02       R_03         R_04       R_05       R_06
## ATP8  -0.19479833 -0.40351693 -0.7899082 -0.207686309 -0.5735321 -0.6217460
## Aagab -0.33114648 -0.09698122 -0.4610205 -0.204455063 -0.3674259 -0.9622379
## Aak1  -0.12524883 -0.04935325 -0.3961333  0.006220386 -0.1831229 -0.8278655
## Aamdc -0.18262236 -0.14712884 -0.4295235 -0.338216384 -0.5193325 -0.7662433
## Aamp   0.01319057 -0.05586270 -0.4874643 -0.176035341 -0.2877920 -0.9128142
## Aar2  -0.28239827 -0.05731137 -0.4382888 -0.312729380 -0.5601654 -0.7180116
##             R_07       R_08         R_09        S_01       S_02       S_03
## ATP8  -0.6862703 -0.3539212 -0.239588729 -1.23906148 -1.0119484 -0.7436739
## Aagab -0.8219094 -0.6039567 -0.349913889 -0.16655998 -1.0566135 -0.6460940
## Aak1  -0.5477375 -0.4730509  0.032356743 -0.02601848 -0.7284031 -0.5183336
## Aamdc -0.6260984 -0.4093530 -0.004382412 -0.18326078 -0.8802873 -0.5951816
## Aamp  -0.7298059 -0.5725573 -0.127739120  0.03925511 -0.5639567 -0.5788605
## Aar2  -0.6492230 -0.3802387 -0.033934887 -0.27326401 -0.8040756 -0.5338878
##             S_04        S_05       S_06       S_07       S_08       S_09
## ATP8  -0.8203905  0.18483797 -0.1652751 -0.3545360 -0.6421465 -0.4344390
## Aagab -0.9723428 -0.72798837 -0.7156128 -0.1366306 -1.0352613 -0.6912676
## Aak1  -1.0432754 -0.19864328 -0.3533702 -0.3482648 -0.5589898 -0.5497933
## Aamdc -0.9138052 -0.19684213 -0.2303194 -0.2628076 -0.7913746 -0.5035342
## Aamp  -0.9582277 -0.28850123 -0.3668970 -0.5391545 -0.8578482 -0.5955917
## Aar2  -0.9288846 -0.01109715 -0.3226456 -0.2250253 -0.7562436 -0.4057189