Suggested TMT processing pipeline. Variant 1.
Source:vignettes/tmt_pipeline_v1.Rmd
tmt_pipeline_v1.Rmd
PlexedPiper
is the main library for the pipeline.
MSnID
package is a flexible way of handling MS/MS
identifications. It handles MS/MS data filtering part.
library(PlexedPiper)
library(MSnID)
Data source
This pipeline shows how to process TMT data that is processed outside
of PNNL DMS. This implies that raw files processed by MSGF+ and MASIC
tools and the processing results are in corresponding directories. For
convienience the results of MSGF+ and MASIC processing are provided in a
companion PlexedPiperTestData
package.
Prepare MS/MS IDs
Read the MS-GF+ output
This simply reads parsed to text output of MS-GF+ search engine. The
text files are collated together and the resulting
data.frame
used to create MSnID object.
path_to_MSGF_results <- system.file(
"extdata/global/msgf_output",
package = "PlexedPiperTestData")
msnid <- read_msgf_data(path_to_MSGF_results)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 48
## #PSMs: 1156754 at 31 % FDR
## #peptides: 511617 at 61 % FDR
## #accessions: 128378 at 98 % FDR
This MSnID
methods corrects for isotope selection error.
Occasionally, instrument selects not the lightest or monoisotopic peak,
but a peak with +1 or more C13. While MS-GF+ is still capable of
correctly identifying those, the downstream calculations of mass
measurement error need to be fixed. This method corrects mass
measurement errors accounting for selection of other than monoisotopic
peak.
msnid <- correct_peak_selection(msnid)
MS/MS ID filter and peptide level
Main MS/MS filtering step. The
filter_msgf_data_peptide_level
function wraps-up
MSnID::optimize_filter
method and uses
PepQValue
from MS-GF+ output and absolute deviation of mass
measurement error of parent ions (in ppm).
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 48
## #PSMs: 1156754 at 31 % FDR
## #peptides: 511617 at 61 % FDR
## #accessions: 128378 at 98 % FDR
msnid <- filter_msgf_data_peptide_level(msnid, 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 48
## #PSMs: 480204 at 0.45 % FDR
## #peptides: 96637 at 1 % FDR
## #accessions: 26926 at 8.7 % FDR
Switching annotation from RefSeq to gene symbols
This RefSeq to gene symbols conversion step can be skipped if the downstream analysis relies on RefSeq IDs.
RefSeq is a very redundant database and contains lots of tentative splice isoforms and unconfirmed ORFs. As a result, a lot of peptides (~60%) are shared across multiple RefSeq accessions. This adds to uncertainity which exactly protein we are quantifying. One solution is to switch to less redundant annotations like gene symbol. Moreover gene symbols are more human-interpretable and quite commonly used in downstream analysis.
msnid <- remap_accessions_refseq_to_gene(msnid,
organism_name="Rattus norvegicus")
This step converts sequences names in the fasta file used for MS/MS searches from RefSeq to gene symbols. This is necessary if we are going to count the number of peptide identifications per protein or protein sequence coverage with accessions switched from RefSeq to gene symbols.
path_to_FASTA <- system.file(
"extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz",
package = "PlexedPiperTestData")
temp_dir <- tempdir()
file.copy(path_to_FASTA, temp_dir)
## [1] TRUE
path_to_FASTA <- file.path(temp_dir, basename(path_to_FASTA))
path_to_FASTA_gene <- remap_accessions_refseq_to_gene_fasta(
path_to_FASTA, organism_name="Rattus norvegicus")
MS/MS ID filter at protein level
A while ago proteomics field established hard-and-fast two peptide per protein rule. That is if a protein identified with one peptide it is not considered a confident identification. It works OK, but there is a room for improvement. For example this rule really penalizes short protens. Then there are some very long proteins (e.g. Titin 3.8 MDa) that easily have more then two matching peptides even in reversed sequence. Thus, here we propose to normalize number of peptides per protein length and use that as a filtering criterion.
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA_gene)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 48
## #PSMs: 480204 at 0.45 % FDR
## #peptides: 96637 at 1 % FDR
## #accessions: 8127 at 7 % FDR
msnid <- filter_msgf_data_protein_level(msnid, 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 48
## #PSMs: 398737 at 0.18 % FDR
## #peptides: 81623 at 0.31 % FDR
## #accessions: 5483 at 0.99 % FDR
Inference of parsimonious protein set
The situation when sequence of a certain peptide matches multiple
proteins adds complication to the downstream quantitative analysis as it
is not clear which protein product this peptide is originating from.
There are common ways for selecting a protein set. One is simply retain
only uniquely matching peptides (unique_only=TRUE
). Shared
peptides are simply discarded in this case. Alternatively (in case of
unique_only=FALSE
) the shared peptides assigned to the
proteins with larger number of uniquely mapping peptides. If there is a
choice between multiple proteins with equal number of uniquely mapping
peptides, the shared peptides is assigned to the first protein according
to alphanumeric order.
msnid <- infer_parsimonious_accessions(msnid, unique_only=TRUE)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 48
## #PSMs: 302729 at 0.22 % FDR
## #peptides: 73332 at 0.31 % FDR
## #accessions: 4920 at 0.96 % FDR
Remove decoy accessions
msnid <- apply_filter(msnid, "!isDecoy")
Prepare reporter ion intensities
Read MASIC ouput
MASIC is a tool for extracting ion intensities. With proper parameter settings it can be used for extracting TMT (or iTRAQ) reporter ion intensities. In addition it reporter a number of other helpful metrics. Of the most interest are interference score of the parent ion level and S/N at reporter ion level. Interference score reflects what proportion of ion population isolated for fragmentation is due to the targeted ion. The other words 1 - InterferenceScore is due to co-isolated species that have similar elution time and parent ion m/z. S/N at reporter level is the metric that computed by Thermo software.
path_to_MASIC_results <- system.file("extdata/global/masic_output", package = "PlexedPiperTestData")
masic_data <- read_masic_data(path_to_MASIC_results, interference_score=TRUE)
head(masic_data)
## Dataset ScanNumber Collision.Mode
## 1 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 2 hcd
## 2 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 3 hcd
## 3 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 4 hcd
## 4 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 5 hcd
## 5 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 6 hcd
## 6 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 8 hcd
## ParentIonMZ BasePeakIntensity BasePeakMZ ReporterIonIntensityMax Ion_126.128
## 1 407.74 227695.44 407.7408 92236.87 70562.39
## 2 396.72 59127.97 529.2943 34294.90 23706.89
## 3 438.23 110444.82 362.2239 14053.40 12459.86
## 4 481.50 37082.72 206.4663 0.00 0.00
## 5 549.28 21077.05 128.1285 21077.05 0.00
## 6 388.72 40605.85 356.7193 8087.76 6166.82
## Ion_127.125 Ion_127.131 Ion_128.128 Ion_128.134 Ion_129.131 Ion_129.138
## 1 24864.62 17165.80 35625.00 92236.87 9640.23 8578.05
## 2 13559.32 5856.83 16322.71 34294.90 4853.11 7938.24
## 3 11785.91 10932.51 10653.32 12328.62 5959.86 9905.82
## 4 0.00 0.00 0.00 0.00 0.00 0.00
## 5 10998.67 0.00 21077.05 2725.50 0.00 0.00
## 6 1371.27 2418.35 8087.76 5485.35 0.00 0.00
## Ion_130.135 Ion_130.141 Ion_131.138 Weighted.Avg.Pct.Intensity.Correction
## 1 6996.69 11833.07 32281.34 0
## 2 0.00 1465.03 18182.27 0
## 3 8387.04 11166.70 14053.40 0
## 4 0.00 0.00 0.00 0
## 5 0.00 0.00 6800.70 0
## 6 1543.48 1943.96 7436.60 0
## Ion_126.128_SignalToNoise Ion_127.125_SignalToNoise Ion_127.131_SignalToNoise
## 1 71.47 25.17 17.38
## 2 26.12 14.94 6.45
## 3 12.40 11.75 10.90
## 4 NA NA NA
## 5 NA 9.19 NA
## 6 6.92 1.54 2.71
## Ion_128.128_SignalToNoise Ion_128.134_SignalToNoise Ion_129.131_SignalToNoise
## 1 36.04 93.32 9.75
## 2 17.97 37.77 5.34
## 3 10.64 12.31 5.96
## 4 NA NA NA
## 5 17.57 2.27 NA
## 6 9.04 6.13 NA
## Ion_129.138_SignalToNoise Ion_130.135_SignalToNoise Ion_130.141_SignalToNoise
## 1 8.67 7.07 11.96
## 2 8.74 NA 1.61
## 3 9.91 8.40 11.18
## 4 NA NA NA
## 5 NA NA NA
## 6 NA 1.72 2.16
## Ion_131.138_SignalToNoise Ion_126.128_Resolution Ion_127.125_Resolution
## 1 32.71 44102 42700
## 2 19.93 42702 41100
## 3 14.13 42006 40702
## 4 NA NA NA
## 5 5.66 NA 40302
## 6 8.26 40000 26400
## Ion_127.131_Resolution Ion_128.128_Resolution Ion_128.134_Resolution
## 1 42100 41800 44404
## 2 37000 40400 43404
## 3 41402 40700 40400
## 4 NA NA NA
## 5 NA 42102 46600
## 6 30400 40400 44300
## Ion_129.131_Resolution Ion_129.138_Resolution Ion_130.135_Resolution
## 1 40500 39500 36800
## 2 36400 39700 NA
## 3 38800 40200 38900
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA 28800
## Ion_130.141_Resolution Ion_131.138_Resolution ParentIonIndex MZ
## 1 41100 42302 0 407.7424
## 2 29800 41802 1 396.7176
## 3 40400 41002 2 438.2270
## 4 NA NA 3 481.5049
## 5 NA 40300 4 549.2793
## 6 28500 38700 5 388.7199
## SurveyScanNumber OptimalPeakApexScanNumber PeakApexOverrideParentIonIndex
## 1 1 12 -1
## 2 1 12 -1
## 3 1 131 18
## 4 1 23 -1
## 5 1 15 -1
## 6 7 12 -1
## CustomSICPeak PeakScanStart PeakScanEnd PeakScanMaxIntensity PeakMaxIntensity
## 1 0 1 19 12 2901600
## 2 0 1 19 12 2181900
## 3 0 1 114 107 8255600
## 4 0 1 52 23 401824
## 5 0 1 16 15 363656
## 6 0 1 30 12 478135
## PeakSignalToNoiseRatio FWHMInScans PeakArea ParentIonIntensity
## 1 211.0000 5 50422000 2579600
## 2 19.6900 5 34508000 1690600
## 3 9.4650 44 444610000 658727
## 4 27.9900 23 16244000 344491
## 5 0.6998 5 5941200 347071
## 6 28.7100 19 10718000 291189
## PeakBaselineNoiseLevel PeakBaselineNoiseStDev PeakBaselinePointsUsed
## 1 13750 97562 10113
## 2 110841 1120000 10166
## 3 872195 2620000 10129
## 4 14356 65777 10109
## 5 519640 1990000 10109
## 6 16653 142562 10142
## StatMomentsArea CenterOfMassScan PeakStDev PeakSkew PeakKSStat
## 1 47031000 11 5.68 -0.18920 0.3528
## 2 31578000 11 5.59 -0.21720 0.3469
## 3 343470000 82 22.80 -0.62610 1.0990
## 4 14899000 26 14.10 0.05387 0.5036
## 5 2528900 15 2.49 -0.58330 0.4302
## 6 9961100 16 8.60 -0.05055 0.2833
## StatMomentsDataCountUsed InterferenceScore
## 1 10 0.9961
## 2 10 0.9925
## 3 89 1.0000
## 4 39 1.0000
## 5 5 1.0000
## 6 18 0.9688
Filtering MASIC data
In this case (currently recommended filters) we require that at least 50% of ion population is due to the targeted ion and no filter at S/N level.
nrow(masic_data)
## [1] 1598123
masic_data <- filter_masic_data(masic_data, 0.5, 0)
nrow(masic_data)
## [1] 1523732
Creating cross-tab
Laying out study desing
To convert from PSMs and reporter ion intensities to meaningful quantitative data it is necessary to know what are the samples in the reporter channels and what is the intended reference channel (or combination of channels). The entire study desing captured by three tables - fractions, samples, references.
Fractions
This links dataset to original plex.
library(readr)
fractions <- read_tsv(system.file("extdata/study_design/fractions.txt", package = "PlexedPiperTestData"))
head(fractions)
## # A tibble: 6 × 2
## Dataset PlexID
## <chr> <chr>
## 1 MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 S1
## 2 MoTrPAC_Pilot_TMT_W_S1_02_12Oct17_Elm_AQ-17-09-02 S1
## 3 MoTrPAC_Pilot_TMT_W_S1_05_12Oct17_Elm_AQ-17-09-02 S1
## 4 MoTrPAC_Pilot_TMT_W_S1_03_12Oct17_Elm_AQ-17-09-02 S1
## 5 MoTrPAC_Pilot_TMT_W_S1_04_12Oct17_Elm_AQ-17-09-02 S1
## 6 MoTrPAC_Pilot_TMT_W_S1_06_12Oct17_Elm_AQ-17-09-02 S1
Samples
This links plex ID, reporter channel with samples. There is a minor difference between ReporterAlias and MeasurementName. MeasurementName is the sample name that will be finally reported. ReporterAlias is an itermediate between ReporterName and MeasurementName and is used for defining what the reference is.
samples <- read_tsv(system.file("extdata/study_design/samples.txt", package = "PlexedPiperTestData"))
head(samples,10)
## # A tibble: 10 × 5
## PlexID QuantBlock ReporterName ReporterAlias MeasurementName
## <chr> <dbl> <chr> <chr> <chr>
## 1 S1 1 126 R_01 R_01
## 2 S1 1 127N R_03 R_03
## 3 S1 1 127C R_02 R_02
## 4 S1 1 128N R_05 R_05
## 5 S1 1 128C R_04 R_04
## 6 S1 1 129N S_02 S_02
## 7 S1 1 129C S_01 S_01
## 8 S1 1 130N S_04 S_04
## 9 S1 1 130C S_03 S_03
## 10 S1 1 131 ref NA
References
Reference can be a certain channel, average of multiple channels or 1. In general form it is an expression with ReporterAlias names as variables. It is evaluated for each PlexID/QuantBlock combination and applied to divide reporter ion intensities within corresponding PlexID/QuantBlock. QuantBlock can be though of as a way of defining sub-plex. In a typical TMT experiment QuantBlock is always 1. In case of 5 pairwise comparisons within TMT10, there will be 5 QuantBlocks (1..5) with reference for each QuantBlock.
references <- read_tsv(system.file("extdata/study_design/references.txt", package = "PlexedPiperTestData"))
head(references)
## # A tibble: 2 × 3
## PlexID QuantBlock Reference
## <chr> <dbl> <chr>
## 1 S1 1 ref
## 2 S2 1 ref
Creating quantitative cross-tab
Final step when MS/MS IDs and reporter ions linked together,
aggregated down to peptide or accession (i.e. protein) level. To retain
protein IDs while aggregating to peptide level we suggest
aggregation_level <- c("accession","peptide")
aggregation_level <- c("accession")
quant_cross_tab <- create_crosstab(msnid,
masic_data,
aggregation_level,
fractions, samples, references)
dim(quant_cross_tab)
## [1] 4859 18
head(quant_cross_tab)
## R_01 R_02 R_03 R_04 R_05 R_06
## ATP8 -0.19479833 -0.40351693 -0.7899082 -0.207686309 -0.5735321 -0.6217460
## Aagab -0.33114648 -0.09698122 -0.4610205 -0.204455063 -0.3674259 -0.9622379
## Aak1 -0.12524883 -0.04935325 -0.3961333 0.006220386 -0.1831229 -0.8278655
## Aamdc -0.18262236 -0.14712884 -0.4295235 -0.338216384 -0.5193325 -0.7662433
## Aamp 0.01319057 -0.05586270 -0.4874643 -0.176035341 -0.2877920 -0.9128142
## Aar2 -0.28239827 -0.05731137 -0.4382888 -0.312729380 -0.5601654 -0.7180116
## R_07 R_08 R_09 S_01 S_02 S_03
## ATP8 -0.6862703 -0.3539212 -0.239588729 -1.23906148 -1.0119484 -0.7436739
## Aagab -0.8219094 -0.6039567 -0.349913889 -0.16655998 -1.0566135 -0.6460940
## Aak1 -0.5477375 -0.4730509 0.032356743 -0.02601848 -0.7284031 -0.5183336
## Aamdc -0.6260984 -0.4093530 -0.004382412 -0.18326078 -0.8802873 -0.5951816
## Aamp -0.7298059 -0.5725573 -0.127739120 0.03925511 -0.5639567 -0.5788605
## Aar2 -0.6492230 -0.3802387 -0.033934887 -0.27326401 -0.8040756 -0.5338878
## S_04 S_05 S_06 S_07 S_08 S_09
## ATP8 -0.8203905 0.18483797 -0.1652751 -0.3545360 -0.6421465 -0.4344390
## Aagab -0.9723428 -0.72798837 -0.7156128 -0.1366306 -1.0352613 -0.6912676
## Aak1 -1.0432754 -0.19864328 -0.3533702 -0.3482648 -0.5589898 -0.5497933
## Aamdc -0.9138052 -0.19684213 -0.2303194 -0.2628076 -0.7913746 -0.5035342
## Aamp -0.9582277 -0.28850123 -0.3668970 -0.5391545 -0.8578482 -0.5955917
## Aar2 -0.9288846 -0.01109715 -0.3226456 -0.2250253 -0.7562436 -0.4057189