1.1 Global Proteomics Data

This pipeline shows how to process TMT data that is processed outside of PNNL’s DMS. Section 1.2 shows how to process data from the DMS. For convenience, the results of MS-GF+ and MASIC processing are provided in a companion PlexedPiperTestData package. In addition, we will need PlexedPiper for isobaric quantification, dplyr to manipulate data frames, and MSnbase to create MSnSet objects.

# Setup
library(PlexedPiper)
library(PlexedPiperTestData)
library(dplyr)

The pipeline can be broken up into 4 major chunks: prepare MS/MS identifications, prepare reporter ion intensities, create a quantitative cross-tab, and create an MSnSet object.

1.1.1 Read MS-GF+ Data

The first step in the preparation of the MS/MS identifications is to fetch the data. In this case, the data exists in the PlexedPiperTestData package in a local folder, so we use system.file to get the file path and read_msgf_data to read the MS-GF+ output.

# Get file path
path_to_MSGF_results <- system.file("extdata/global/msgf_output", 
                                    package = "PlexedPiperTestData")
# Read MS-GF+ data from path
msnid <- read_msgf_data(path_to_MSGF_results)

Normally, this would display a progress bar in the console as the data is being fetched. However, the output was suppressed to save space. We can view a summary of the MSnID object with the show() function.

show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 1156754 at 31 % FDR
## #peptides: 511617 at 61 % FDR
## #accessions: 128378 at 98 % FDR

This summary tells us that msnid consists of 4 spectrum files (datasets), and contains a total of 1,156,754 peptide-spectrum-matches (PSMs), 511,617 total peptides, and 128,378 total accessions (proteins). The reported FDR is the empirical false-discovery rate, which is calculated as the ratio of the number of false (decoy) PSMs, peptides, or accessions to their true (non-decoy) counterparts. Calculation of these counts and their FDRs is shown below.

# How to calculate the counts and FDRs from the show() output
## Spectrum Files
# Count
psms(msnid) %>% 
  distinct(Dataset) %>%
  nrow() # 48

## PSMs
# Count
psms(msnid) %>% 
  distinct(Dataset, Scan, peptide, isDecoy) %>%
  # Assign intermediate to variable
  assign("x_psm", ., envir = globalenv()) %>% 
  nrow() # 1156754
# FDR
nrow(x_psm[x_psm$isDecoy == TRUE, ]) / 
  nrow(x_psm[x_psm$isDecoy == FALSE, ]) 
# 0.3127463 = 31%

## peptides
# Count
psms(msnid) %>% 
  distinct(peptide, isDecoy) %>% 
  assign("x_peptide", ., envir = globalenv()) %>%
  nrow() # 511617
# FDR
nrow(x_peptide[x_peptide$isDecoy == TRUE, ]) / 
  nrow(x_peptide[x_peptide$isDecoy == FALSE, ])
# 0.611245 = 61%

## accessions
# Count
length(accessions(msnid)) # or
psms(msnid) %>% 
  distinct(accession, isDecoy) %>% 
  assign("x_acc", ., envir = globalenv()) %>%
  nrow() # 128378
# FDR
nrow(x_acc[x_acc$isDecoy == TRUE, ]) / 
  nrow(x_acc[x_acc$isDecoy == FALSE, ])
# 0.9827024 = 98%

Now that we have an MSnID object, we need to process it. We begin by correcting for the isotope selection error.

1.1.2 Correct Isotope Selection Error

Carbon has two stable isotopes: \(^{12}\text{C}\) and \(^{13}\text{C}\), with natural abundances of 98.93% and 1.07%, respectively. That is, we expect that about 1 out of every 100 carbon atoms is naturally going to be a \(^{13}\text{C}\), while the rest are \(^{12}\text{C}\). In larger peptides with many carbon atoms, it is more likely that at least one atom will be a \(^{13}\text{C}\) than all atoms will be \(^{12}\text{C}\). In cases such as these, a non-monoisotopic ion will be selected by the instrument for fragmentation.

MS1 spectra with peak at non-monoisotopic precursor ion.

(#fig:MS1_peak)MS1 spectra with peak at non-monoisotopic precursor ion.

In Figure @ref(fig:MS1_peak), the monoisotopic ion (m/z of 1427.29) is not the most abundant, so it is not selected as the precursor. Instead, the ion with a \(^{13}\text{C}\) in place of a \(^{12}\text{C}\) is selected for fragmentation. We calculate the mass difference between these two ions as the difference between the mass-to-charge ratios multiplied by the ion charge. In this case, the mass difference is 1 Dalton, or about the difference between \(^{13}\text{C}\) and \(^{12}\text{C}\). (More accurately, the difference between these isotopes is 1.0033548378 Da.) While MS-GF+ is still capable of correctly identifying these peptides, the downstream calculations of mass measurement error need to be fixed because they are used for filtering later on (Section 1.1.4). The correct_peak_selection function corrects these mass measurement errors, and Figure 1.1 shows the distribution of the mass measurement errors before and after correction.

# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)
Histogram of mass measurement errors before and after correction.

Figure 1.1: Histogram of mass measurement errors before and after correction.

1.1.3 Remove Contaminants

Now, we will remove contaminants such as the trypsin that was used for protein digestion. We can use grepl to search for all accessions that contain the string "Contaminant". Displaying these contaminants is not necessary during processing. This is just for demonstration purposes to see what will be removed.

# All unique contaminants
accessions(msnid)[grepl("Contaminant", accessions(msnid))]
##  [1] "Contaminant_K2C1_HUMAN"      "Contaminant_K1C9_HUMAN"     
##  [3] "Contaminant_ALBU_HUMAN"      "Contaminant_ALBU_BOVIN"     
##  [5] "Contaminant_TRYP_PIG"        "Contaminant_K1C10_HUMAN"    
##  [7] "XXX_Contaminant_K1C9_HUMAN"  "Contaminant_K22E_HUMAN"     
##  [9] "Contaminant_Trypa3"          "Contaminant_Trypa5"         
## [11] "XXX_Contaminant_K1C10_HUMAN" "XXX_Contaminant_K22E_HUMAN" 
## [13] "XXX_Contaminant_K2C1_HUMAN"  "Contaminant_TRYP_BOVIN"     
## [15] "XXX_Contaminant_ALBU_HUMAN"  "XXX_Contaminant_ALBU_BOVIN" 
## [17] "XXX_Contaminant_TRYP_BOVIN"  "Contaminant_CTRB_BOVIN"     
## [19] "Contaminant_Trypa1"          "Contaminant_Trypa6"         
## [21] "Contaminant_CTRA_BOVIN"      "XXX_Contaminant_TRYP_PIG"   
## [23] "XXX_Contaminant_CTRB_BOVIN"  "XXX_Contaminant_CTRA_BOVIN" 
## [25] "Contaminant_Trypa2"

To remove contaminants, we use apply_filter with an appropriate character string that tells the function what rows to keep. In this case, we keep rows where the accession does not contain “Contaminant.” We will use show to see how the counts change.

# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 1155442 at 31 % FDR
## #peptides: 511196 at 61 % FDR
## #accessions: 128353 at 98 % FDR

We can see that the number of PSMs decreased by about 1300, peptides by ~400, and proteins by 25 (the 25 contaminants that were displayed).

1.1.4 MS/MS ID Filter: Peptide Level

The next step is to filter the MS/MS identifications such that the empirical peptide-level FDR is less than some threshold and the number of MS/MS IDs is maximized. We will use the \(-log_{10}\) of the PepQValue column as one of our filtering criteria and assign it to a new column in psms(msnid) called msmsScore. The PepQValue column is the MS-GF+ Spectrum E-value, which reflects how well the theoretical and experimental fragmentation spectra match; therefore, high values of msmsScore indicate a good match (see Figure 1.2).

Density plot of msmsScore.

Figure 1.2: Density plot of msmsScore.


The other filtering criteria is the absolute deviation of the mass measurement error of the precursor ions in parts-per-million (ppm), which is assigned to the absParentMassErrorPPM column in psms(msnid) (see Figure 1.3).

Density plot of absParentMassErrorPPM.

Figure 1.3: Density plot of absParentMassErrorPPM.


Now, we will filter the PSMs.

# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid,
                          level = "peptide",
                          fdr.max = 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 464474 at 0.45 % FDR
## #peptides: 96485 at 1 % FDR
## #accessions: 27119 at 9.2 % FDR

We can see that filtering drastically reduces the number of PSMs, and the empirical peptide-level FDR is now 1%. However, notice that the empirical protein-level FDR is still fairly high.

1.1.5 MS/MS ID Filter: Protein Level

Now, we need to filter proteins so that the FDR is at most 1%. A while ago, the proteomics field established the hard-and-fast two-peptides-per-protein rule. That is, the confident identification of a protein requires the confident identification of at least 2 peptides. This rule penalizes short proteins and doesn’t consider that there are some very long proteins (e.g. Titin 3.8 MDa) that easily have more then two matching peptides even in the reversed sequence. Thus, we propose to normalize the number of peptides per protein length and use that as a filtering criterion (Figure 1.5).

In order to get the protein lengths, we need the FASTA (pronounced FAST-AYE) file that contains the protein sequences used in the database search. The first three entries of the FASTA file are shown in Figure 1.4.

First three entries of the FASTA file.

Figure 1.4: First three entries of the FASTA file.


For each protein, we divide the number of associated peptides by the length of that protein and multiply this value by 1000. This new peptides_per_1000aa column is used as the filter criteria.

# Get path to FASTA file
path_to_FASTA <- system.file(
  "extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz", 
  package = "PlexedPiperTestData"
)

# Compute number of peptides per 1000 amino acids
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)
Density plot of peptides_per_1000aa. The plot area has been zoomed in.

Figure 1.5: Density plot of peptides_per_1000aa. The plot area has been zoomed in.


Now, we filter the proteins to 1% FDR.

# 1% FDR filter at the protein level
msnid <- filter_msgf_data(msnid,
                          level = "accession",
                          fdr.max = 0.01)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 458090 at 0.16 % FDR
## #peptides: 92036 at 0.32 % FDR
## #accessions: 15630 at 1 % FDR

1.1.6 Inference of Parsimonious Protein Set

The situation when a certain peptide sequence matches multiple proteins adds complication to the downstream quantitative analysis, as it is not clear which protein this peptide is originating from. There are common ways for dealing with this. One is to simply retain uniquely matching peptides and discard shared peptides (unique_only = TRUE). Alternatively, assign the shared peptides to the proteins with the larger number of uniquely mapping peptides (unique_only = FALSE). If there is a choice between multiple proteins with equal numbers of uniquely mapping peptides, the shared peptides are assigned to the first protein according to alphanumeric order (Figure 1.6).

Visual explanation of the inference of the parsimonious protein set.

Figure 1.6: Visual explanation of the inference of the parsimonious protein set.


# Inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 444999 at 0.15 % FDR
## #peptides: 90478 at 0.27 % FDR
## #accessions: 5251 at 1.1 % FDR

Notice that the protein-level FDR increased slightly above the 1% threshold. In this case, the difference isn’t significant, so we can ignore it.

Note:

If the peptide or accession-level FDR increases significantly above 1% after inference of the parsimonious protein set, consider lowering the FDR cutoff (for example, to 0.9%) and redoing the previous processing steps. Filtering at the peptide and accession level should each be done a single time.

1.1.7 Remove Decoy PSMs

The final step in preparing the MS/MS identifications is to remove the decoy PSMs. We use the apply_filter function again and only keep entries where isDecoy is FALSE.

# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files:  48 
## #PSMs: 444338 at 0 % FDR
## #peptides: 90232 at 0 % FDR
## #accessions: 5196 at 0 % FDR

After processing, we are left with 444,345 PSMs, 90,232 peptides, and 5,196 proteins. Table 1.1 shows the first 6 rows of the processed MS-GF+ output.

Table 1.1: First 6 rows of the processed MS-GF+ results.
Dataset ResultID Scan FragMethod SpecIndex Charge PrecursorMZ DelM DelM_PPM MH peptide Protein NTT DeNovoScore MSGFScore MSGFDB_SpecEValue Rank_MSGFDB_SpecEValue EValue QValue PepQValue IsotopeError accession calculatedMassToCharge chargeState experimentalMassToCharge isDecoy spectrumFile spectrumID pepSeq msmsScore absParentMassErrorPPM peptides_per_1000aa
MoTrPAC_Pilot_TMT_W_S1_07_12Oct17_Elm_AQ-17-09-02 1862 27707 HCD 324 2 928.541 -0.001 -0.526 1856.075 R.AAAAAAAAAAAAAAGAAGK.E NP_113986.1 2 285 282 0 1 0 0.000 0.000 0 NP_113986.1 928.541 2 928.541 FALSE MoTrPAC_Pilot_TMT_W_S1_07_12Oct17_Elm_AQ-17-09-02 27707 AAAAAAAAAAAAAAGAAGK Inf 0.589 25.769
MoTrPAC_Pilot_TMT_W_S1_07_12Oct17_Elm_AQ-17-09-02 4192 27684 HCD 906 3 619.363 -0.002 -0.887 1856.075 R.AAAAAAAAAAAAAAGAAGK.E NP_113986.1 2 156 144 0 1 0 0.000 0.000 0 NP_113986.1 619.363 3 619.363 FALSE MoTrPAC_Pilot_TMT_W_S1_07_12Oct17_Elm_AQ-17-09-02 27684 AAAAAAAAAAAAAAGAAGK Inf 0.991 25.769
MoTrPAC_Pilot_TMT_W_S2_06_12Oct17_Elm_AQ-17-09-02 26263 27336 HCD 5187 3 619.363 0.000 0.197 1856.075 R.AAAAAAAAAAAAAAGAAGK.E NP_113986.1 2 118 85 0 1 0 0.000 0.000 0 NP_113986.1 619.363 3 619.363 FALSE MoTrPAC_Pilot_TMT_W_S2_06_12Oct17_Elm_AQ-17-09-02 27336 AAAAAAAAAAAAAAGAAGK Inf 0.091 25.769
MoTrPAC_Pilot_TMT_W_S2_07_12Oct17_Elm_AQ-17-09-02 1471 27096 HCD 415 3 619.363 -0.001 -0.591 1856.075 R.AAAAAAAAAAAAAAGAAGK.E NP_113986.1 2 157 156 0 1 0 0.000 0.000 0 NP_113986.1 619.363 3 619.363 FALSE MoTrPAC_Pilot_TMT_W_S2_07_12Oct17_Elm_AQ-17-09-02 27096 AAAAAAAAAAAAAAGAAGK Inf 0.684 25.769
MoTrPAC_Pilot_TMT_W_S2_05_12Oct17_Elm_AQ-17-09-02 28664 10441 HCD 4849 2 586.832 -0.001 -0.728 1172.659 R.AAAAADLANR.S NP_001007804.1 2 124 124 0 1 0 0.002 0.003 0 NP_001007804.1 586.833 2 586.832 FALSE MoTrPAC_Pilot_TMT_W_S2_05_12Oct17_Elm_AQ-17-09-02 10441 AAAAADLANR 2.480 0.746 34.755
MoTrPAC_Pilot_TMT_W_S1_24_12Oct17_Elm_AQ-17-09-02 41775 8033 HCD 7889 2 831.447 0.000 0.000 1661.886 G.AAAAAEAESGGGGGK.K NP_001128630.1 1 176 76 0 1 0 0.001 0.003 0 NP_001128630.1 831.447 2 831.447 FALSE MoTrPAC_Pilot_TMT_W_S1_24_12Oct17_Elm_AQ-17-09-02 8033 AAAAAEAESGGGGGK 2.583 0.106 580.844


1.1.8 Read MASIC Output

MASIC is a tool for extracting ion intensities. With proper parameter settings, it can be used for extracting TMT (or iTRAQ) reporter ion intensities. In addition, it reports a number of other helpful metrics. Notably, the interference score at the precursor ion level and the signal-to-noise ratio (S/N) at the reporter ion level (computed by Thermo software). The interference score reflects the proportion of the ion population that was isolated for fragmentation that is due to the targeted ion. In other words, 1 - InterferenceScore is due to co-isolated species that have similar elution time and precursor ion m/z. The first step in the preparation of the reporter ion intensity data is to read the MASIC results. We use a local file path and the read_masic_data function. By default, the interference score is not included, so we need to set that argument to TRUE in order to filter the results.

# Path to MASIC data
path_to_MASIC_results <- system.file("extdata/global/masic_output", 
                                     package = "PlexedPiperTestData")

# Read MASIC data
masic_data <- read_masic_data(path_to_MASIC_results, interference_score = TRUE)

Normally, this would display progress bars in the console as the data is being fetched. However, the output was suppressed to save space.

Table 1.2: First 6 rows of the MASIC data.
Dataset ScanNumber Collision.Mode ParentIonMZ BasePeakIntensity BasePeakMZ ReporterIonIntensityMax Ion_126.128 Ion_127.125 Ion_127.131 Ion_128.128 Ion_128.134 Ion_129.131 Ion_129.138 Ion_130.135 Ion_130.141 Ion_131.138 Weighted.Avg.Pct.Intensity.Correction Ion_126.128_SignalToNoise Ion_127.125_SignalToNoise Ion_127.131_SignalToNoise Ion_128.128_SignalToNoise Ion_128.134_SignalToNoise Ion_129.131_SignalToNoise Ion_129.138_SignalToNoise Ion_130.135_SignalToNoise Ion_130.141_SignalToNoise Ion_131.138_SignalToNoise Ion_126.128_Resolution Ion_127.125_Resolution Ion_127.131_Resolution Ion_128.128_Resolution Ion_128.134_Resolution Ion_129.131_Resolution Ion_129.138_Resolution Ion_130.135_Resolution Ion_130.141_Resolution Ion_131.138_Resolution ParentIonIndex MZ SurveyScanNumber OptimalPeakApexScanNumber PeakApexOverrideParentIonIndex CustomSICPeak PeakScanStart PeakScanEnd PeakScanMaxIntensity PeakMaxIntensity PeakSignalToNoiseRatio FWHMInScans PeakArea ParentIonIntensity PeakBaselineNoiseLevel PeakBaselineNoiseStDev PeakBaselinePointsUsed StatMomentsArea CenterOfMassScan PeakStDev PeakSkew PeakKSStat StatMomentsDataCountUsed InterferenceScore
MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 2 hcd 407.74 227695.44 407.741 92236.87 70562.39 24864.62 17165.80 35625.00 92236.87 9640.23 8578.05 6996.69 11833.07 32281.34 0 71.47 25.17 17.38 36.04 93.32 9.75 8.67 7.07 11.96 32.71 44102 42700 42100 41800 44404 40500 39500 36800 41100 42302 0 407.742 1 12 -1 0 1 19 12 2901600 211.000 5 50422000 2579600 13750 97562 10113 47031000 11 5.68 -0.189 0.353 10 0.996
MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 3 hcd 396.72 59127.97 529.294 34294.90 23706.89 13559.32 5856.83 16322.71 34294.90 4853.11 7938.24 0.00 1465.03 18182.27 0 26.12 14.94 6.45 17.97 37.77 5.34 8.74 NA 1.61 19.93 42702 41100 37000 40400 43404 36400 39700 NA 29800 41802 1 396.718 1 12 -1 0 1 19 12 2181900 19.690 5 34508000 1690600 110841 1120000 10166 31578000 11 5.59 -0.217 0.347 10 0.993
MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 4 hcd 438.23 110444.82 362.224 14053.40 12459.86 11785.91 10932.51 10653.32 12328.62 5959.86 9905.82 8387.04 11166.70 14053.40 0 12.40 11.75 10.90 10.64 12.31 5.96 9.91 8.40 11.18 14.13 42006 40702 41402 40700 40400 38800 40200 38900 40400 41002 2 438.227 1 131 18 0 1 114 107 8255600 9.465 44 444610000 658727 872195 2620000 10129 343470000 82 22.80 -0.626 1.099 89 1.000
MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 5 hcd 481.50 37082.72 206.466 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3 481.505 1 23 -1 0 1 52 23 401824 27.990 23 16244000 344491 14356 65777 10109 14899000 26 14.10 0.054 0.504 39 1.000
MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 6 hcd 549.28 21077.05 128.129 21077.05 0.00 10998.67 0.00 21077.05 2725.50 0.00 0.00 0.00 0.00 6800.70 0 NA 9.19 NA 17.57 2.27 NA NA NA NA 5.66 NA 40302 NA 42102 46600 NA NA NA NA 40300 4 549.279 1 15 -1 0 1 16 15 363656 0.700 5 5941200 347071 519640 1990000 10109 2528900 15 2.49 -0.583 0.430 5 1.000
MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 8 hcd 388.72 40605.85 356.719 8087.76 6166.82 1371.27 2418.35 8087.76 5485.35 0.00 0.00 1543.48 1943.96 7436.60 0 6.92 1.54 2.71 9.04 6.13 NA NA 1.72 2.16 8.26 40000 26400 30400 40400 44300 NA NA 28800 28500 38700 5 388.720 7 12 -1 0 1 30 12 478135 28.710 19 10718000 291189 16653 142562 10142 9961100 16 8.60 -0.051 0.283 18 0.969


Table 1.2 shows the first 6 rows of masic_data.

1.1.9 Filter MASIC Data

The only other step in reporter ion intensity data preparation is to filter the results. Currently, we recommend keeping entries where at least 50% of the ion population is due to the targeted ion (interference score \(\geq\) 0.5) and not filtering by S/N.

# Filter MASIC data
masic_data <- filter_masic_data(masic_data, 
                                interference_score_threshold = 0.5,
                                s2n_threshold = 0)

1.1.10 Create Study Design Tables

To convert from PSMs and reporter ion intensities to meaningful quantitative data, it is necessary to know what are the samples in the reporter channels and what is the intended reference channel (or combination of channels). The entire study design is captured by three tables - fractions, samples, references. With newly processed data, these typically do not exist, and must be created. The next sections show how to create these tables.

1.1.11 Fractions

The fractions table consists of two columns: Dataset and PlexID. The Dataset column contains all of the unique datasets from msnid$Dataset or masic_data$Dataset. The PlexID column contains the plex ID associated with each dataset, and is typically an “S” followed by a number (“S1,” “S2,” etc.). We can extract the plex ID from the datasets. In this case, the plex ID always comes after “W,” so we can use a regular expression (regex) to capture it (the first argument of gsub). The regex below says to capture an “S” followed by a single digit that appears after “W” and before an underscore.

# Create fractions table
fractions <- data.frame(Dataset = unique(masic_data$Dataset)) %>% 
  mutate(PlexID = gsub(".*_W_(S\\d{1})_.*", "\\1", Dataset))
Table 1.3: First 10 rows of the fractions table.
Dataset PlexID
MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_02_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_03_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_04_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_05_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_06_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_07_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_08_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_09_12Oct17_Elm_AQ-17-09-02 S1
MoTrPAC_Pilot_TMT_W_S1_10_12Oct17_Elm_AQ-17-09-02 S1


Table 1.3 shows the first 10 rows of fractions.

1.1.12 Samples

The samples table contains columns PlexID, QuantBlock, ReporterName, ReporterAlias, and MeasurementName. The plex ID must be the same as the plex ID in the fractions table. ReporterName is the reporter ion name (“126,” “127N,” “127C,” etc.). ReporterAlias is the intermediate between ReporterName and MeasurementName and is used for defining the reference. MeasurementName determines the column names for the final cross-tab, and must be unique and begin with a letter. Finally, QuantBlock can be thought of as a way of defining sub-plex. In a typical TMT experiment, QuantBlock is always 1. In case of 5 pairwise comparisons within TMT10, there will be 5 QuantBlocks (1-5) with a reference for each QuantBlock.

For this experiment, channel 131 will serve as the reference, so we set MeasurementName to NA when ReporterName is "131". This will make the reference channel absent from the quantitative cross-tab. In cases where reporter ion intensities are not normalized by a reference channel (reference = 1) or they are normalized by the average of select channels, do not set any MeasurementName to NA.

# Create samples table
samples <- read.delim("data/MoTrPAC_pilot_TMT_labeling.txt") %>% 
  dplyr::rename(ReporterName = TMT10_channel, 
                ReporterAlias = sample_ID) %>% 
  mutate(QuantBlock = 1,
         MeasurementName = ifelse(ReporterAlias == "ref", NA, ReporterAlias))
Table 1.4: First 10 rows of the samples table.
ReporterAlias ReporterName PlexID QuantBlock MeasurementName
S_01 129C S1 1 S_01
S_02 129N S1 1 S_02
S_03 130C S1 1 S_03
S_04 130N S1 1 S_04
S_05 126 S2 1 S_05
S_06 127C S2 1 S_06
S_07 127N S2 1 S_07
S_08 128C S2 1 S_08
S_09 128N S2 1 S_09
ref 131 S1 1 NA


Table 1.4 shows the first 10 rows of samples.

1.1.13 References

Reference can be a certain channel, average of multiple channels, or 1. The general form is an expression with ReporterAlias names as variables. It is evaluated for each PlexID/QuantBlock combination and applied to divide reporter ion intensities within corresponding PlexID/QuantBlock.

# Create references table
references <- samples %>% 
  # Filter to reference channel (ReporterName == "131", ReporterAlias == "ref")
  filter(ReporterName == "131") %>% 
  # Select required columns and rename ReporterAlias to Reference
  select(PlexID, Reference = ReporterAlias, QuantBlock)
Table 1.5: References table.
PlexID Reference QuantBlock
S1 ref 1
S2 ref 1


Table 1.5 shows the first 10 rows of references. The code to use the geometric average instead of a single channel as the reference is shown below. The geometric average is the product of the reporter ion channels to the power of (1/number of channels). For each PlexID group, collapse the vector of reporter ion names with *, surround them in parentheses, and raise to the power of (1/number of channels).

# Use geometric average as reference
references <- samples %>%
  group_by(PlexID, QuantBlock) %>%
  summarise(Reference = sprintf("(%s)^(1/%d)", 
                                paste(ReporterAlias, collapse = "*"), n()))

# Do not normalize by reference channel (use 1 as the reference)
references <- samples %>% 
  distinct(PlexID, QuantBlock) %>% 
  mutate(Reference = 1)

Now that we have the three study design tables, we should save them.

# Save study design tables
write.table(fractions, file = "data/fractions.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(samples, file = "data/samples.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(references, file = "data/references.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)

1.1.14 Create Quantitative Cross-tab

This is the step where MS/MS IDs and reporter ions are linked together and aggregated to the peptide or accession (i.e. protein) level. To retain protein IDs while aggregating to peptide level, set aggregation_level <- c("accession","peptide"). The abundances are converted to relative abundances by dividing by the reference and then \(log_2\)-transformed.

# Set aggregation level
aggregation_level <- c("accession")
# Create cross-tab
crosstab <- create_crosstab(msnid, masic_data,
                            aggregation_level = aggregation_level,
                            fractions, samples, references)
Table 1.6: First 6 rows of the global quantitative cross-tab.
R_01 R_02 R_03 R_04 R_05 R_06 R_07 R_08 R_09 S_01 S_02 S_03 S_04 S_05 S_06 S_07 S_08 S_09
AP_004893.1 0.1419768 0.7628195 0.1653552 0.8662554 0.9453172 -0.8794712 -0.1912097 0.3964607 -0.2440478 -0.6460065 -1.9294467 -0.4321433 -1.2831873 -1.0271227 -0.9390945 0.4883309 -1.7148628 -0.7029685
AP_004894.1 0.8092676 -0.0976095 -0.3113350 0.3215692 0.2171255 -1.3515366 -0.7462995 -0.8338103 -0.2227493 -0.3678781 -0.1638689 -0.6696829 -1.2039041 -0.5124954 -0.2364175 -0.4428327 -1.3730408 -0.6711809
AP_004895.1 0.2078433 -0.2867209 -0.6089756 -0.1162062 -0.3840271 -0.6780284 -0.6102404 -0.3896190 -0.1548544 -1.1240967 -0.6908468 -0.6652575 -0.7140383 0.2717217 -0.1448289 -0.1200736 -0.6435709 -0.4287771
AP_004896.1 -0.1494849 -0.3664339 -0.7314368 -0.1742391 -0.5352280 -0.6543311 -0.6741064 -0.3994149 -0.0441485 -1.2945071 -1.0372327 -0.7060783 -0.8299749 0.1939540 -0.2274358 -0.1688422 -0.5251264 -0.4222698
AP_004898.1 0.0362964 0.7497227 0.4252227 0.4913660 1.1580326 -1.8439756 -0.1774225 -1.1083199 -0.4175363 0.1211536 -0.3640632 -0.3019505 -0.8291744 -0.8407749 -0.2796091 -0.4130732 -1.5747761 -0.9449498
AP_004899.1 0.7140968 -0.1781542 -0.3732752 0.3494902 -0.0615626 -1.0482424 -0.8052899 -0.6675429 -0.3959923 -2.1679002 -0.8550940 -0.9026145 -1.4519278 -0.3158081 -0.4056811 -0.4644758 -0.2805080 -0.9023044


Now that we have the cross-tab, we should save it.

# Save cross-tab
write.table(crosstab, file = "data/global_quant_crosstab.txt",
            sep = "\t", quote = FALSE, row.names = TRUE)

We will also save the proteins (row names) of this cross-tab in order to demonstrate prioritized inference in Section 1.2.

# Save global proteins
global_proteins <- rownames(crosstab)
save(global_proteins, file = "data/3442_global_proteins.RData")

1.1.15 Create MSnSet Object

The final step is to create an MSnSet object, which is necessary for downstream data analysis tools such as the plot_pca_* or limma_* functions in MSnSet.utils. An MSnSet combines three tables into one object: exprs, fData, and pData. exprs is a matrix of protein abundance data with proteins as rows and samples as columns. fData is used to convert between different feature IDs such as RefSeq, UniProt accession, Entrez gene ID, gene symbol, etc. It is optional, but the rownames of fData must be the same as the row names of exprs. Lastly, pData contains the metadata with samples as rows, and the row names must be the same as the column names of exprs (same order as well). For this example, we don’t have an actual metadata table. We just know group assignment from the sample names. Samples with an “R” are part of the exeRcised group, while samples with an “S” are part of the Sedentary group. We can make a metadata table with a single column called group.

metadata <- data.frame(group = gsub("([RS]).*", "\\1", colnames(crosstab))) %>% 
  mutate(group = ifelse(group == "R", "exercised", "sedentary"))
rownames(metadata) <- colnames(crosstab)
Table 1.7: Metadata for MSnSet.
group
R_01 exercised
R_02 exercised
R_03 exercised
R_04 exercised
R_05 exercised
R_06 exercised
R_07 exercised
R_08 exercised
R_09 exercised
S_01 sedentary
S_02 sedentary
S_03 sedentary
S_04 sedentary
S_05 sedentary
S_06 sedentary
S_07 sedentary
S_08 sedentary
S_09 sedentary

Since we used the sample data to create the metadata, everything is in the right order; however, we usually need to reorder the rows of the metadata. We can do this with

# Reorder rows of metadata to match sample order in crosstab
# Need to add drop = FALSE if the metadata is a single column
metadata <- metadata[colnames(crosstab), , drop = FALSE]

We have the metadata and protein abundance data, so we can now create the MSnSet. We will call it m1 and we should save it as an .RData file.

# Create MSnSet
m1 <- MSnbase::MSnSet(exprs = as.matrix(crosstab), pData = metadata)
validObject(m1)
## [1] TRUE

If we type the name of the MSnSet in the console, a summary will be displayed. This will show information about each of the tables, and the MSnSet will keep track of all processing done such as filtering at the bottom under “Processing information.”

# Display summary
m1
## MSnSet (storageMode: lockedEnvironment)
## assayData: 5178 features, 18 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: R_01 R_02 ... S_09 (18 total)
##   varLabels: group
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.18.0
# Save unprocessed MSnSet
save(m1, file = "data/global_msnset.RData")