8.3 Gene Set Enrichment Analysis

8.3.1 Overview

(More details to be added at a later date.)

Gene set enrichment analysis (GSEA) is a rank-based approach that determines whether predefined groups of genes/proteins/etc. are primarily up or down in one condition relative to another (Vamsi K. Mootha et al., 2003; Subramanian et al., 2005). It is typically performed as a follow-up to differential analysis, and is preferred to ORA (Section 8.2).

8.3.2 Examples

These examples will show how to run Fast GSEA (FGSEA) in R, which is based on the gene permutation approach (Korotkevich et al., 2016).

The input of FGSEA is a list of gene sets/pathways to check and a uniquely-named vector of ranking metric values sorted in descending order. The ranking metric that we will use is \(-log_{10}(\text{p-value}) * sign(\text{logFC})\), but we could have easily used t-statistics or some other metric. We will perform differential analysis on the cptac_oca data from the MSnSet.utils package and use the results to create the ranking metric vector. We start off by remapping features from RefSeq to Entrez ID.

## Fetch REFSEQ to ENTREZID conversion table
library(MSnID)
conv_tbl <- fetch_conversion_table(organism_name = "Homo sapiens", 
                                   from = "REFSEQ", to = "ENTREZID")
head(conv_tbl)

Now, we will create the differential analysis results and add the ENTREZID column.

library(MSnSet.utils)
data("cptac_oca")
m1 <- oca.set

# Differential analysis
res <- limma_a_b(eset = m1, 
                 model.str = "~ PLATINUM.STATUS",
                 coef.str = "PLATINUM.STATUS")
# table(res$adj.P.Val < 0.05) # 0
# hist(res$P.Value, breaks = seq(0, 1, 0.05)) # looks uniform
head(res)

library(dplyr)
# Add ENTREZID column
res <- res %>% 
  mutate(REFSEQ = sub("\\.\\d+", "", rownames(.))) %>% 
  left_join(conv_tbl, by = "REFSEQ")
head(res)

We need one ranking metric value per Entrez ID, so we will calculate the ranking metric and then take the gene-wise average. Then, we sort the genes in descending order by their ranking metric values and convert to a named vector called geneList.

## Ranking metric vector for GSEA
geneList <- res %>%
  filter(!is.na(ENTREZID), !is.na(logFC)) %>% 
  mutate(ranking_metric = -log10(P.Value)*sign(logFC)) %>% 
  group_by(ENTREZID) %>% 
  summarise(ranking_metric = mean(ranking_metric, na.rm = TRUE)) %>% 
  arrange(-ranking_metric) %>% # sort descending (important!)
  tibble::deframe() # convert to named vector
head(geneList)
tail(geneList)

Now that we have the uniquely-named ranking metric vector, we can proceed with FGSEA. For these examples, we will show how to perform FGSEA with the fgsea, clusterProfiler, and ReactomePA packages. The databases that we will cover are Gene Ontology Biological Processes and Reactome, and we will only consider gene sets/pathways with at least 15 and no more than 500 genes. The maximum gene set size, as well as the number of permutations, affect the normalized enrichment scores and p-values. For details on these different annotation databases, please see Section 8.1 (Annotation Databases).

8.3.2.1 GSEA with `fgsea`

To perform FGSEA with the fgsea package, we need a list of gene sets/pathways and the ranking metric vector. Below is one way to get the gene set list.

# MSigDB R package
library(msigdbr)
msigdbr::msigdbr_collections() # available collections
# Subset to Human GO-BP sets
BP_db <- msigdbr(species = "Homo sapiens", 
                 category = "C5", subcategory = "GO:BP")
head(BP_db)

# Convert to a list of gene sets
BP_conv <- unique(BP_db[, c("entrez_gene", "gs_exact_source")])
BP_list <- split(x = BP_conv$entrez_gene, f = BP_conv$gs_exact_source)
# First ~6 IDs of first 3 terms
lapply(head(BP_list, 3), head)

The fgseaMultilevel function uses the adaptive multilevel split Monte Carlo approach described in the original FGSEA paper (Korotkevich et al., 2016).

## GO-BP FGSEA with fgsea package
library(fgsea)
set.seed(99)
system.time( # keep track of elapsed time
  fgsea_res <- fgseaMultilevel(pathways = BP_list,
                               stats = geneList, 
                               minSize = 15, 
                               maxSize = 500, 
                               eps = 0, 
                               nPermSimple = 10000)
)
# First 6 rows with lowest enrichment p-values
fgsea_res %>% 
  # Add additional columns from BP_db
  left_join(distinct(BP_db, gs_subcat, gs_exact_source, 
                     gs_name, gs_description),
            by = c("pathway" = "gs_exact_source")) %>% 
  # Reformat descriptions
  mutate(gs_name = sub("^GOBP_", "", gs_name),
         gs_name = gsub("_", " ", gs_name)) %>% 
  arrange(padj) %>% 
  head(8)

See ?fgseaMultilevel for a description of the output. The output does not include term descriptions, so we had to add those ourselves.

8.3.2.2 GSEA with `clusterProfiler`/`ReactomePA`

See Section 8.2.4.2 to get a better understanding of these packages. clusterProfiler provides the gseGO and gseKEGG functions (among others) for FGSEA of the GO and KEGG databases, respectively. They are essentially more user-friendly wrapper functions that make use of the fgsea and AnnotationDbi package, but they tend to be much slower.

## GO-BP FGSEA with clusterProfiler package
library(clusterProfiler)
system.time( # keep track of elapsed time
  cgsea_res <- gseGO(geneList = geneList, 
                     ont = "BP", 
                     OrgDb = "org.Hs.eg.db", 
                     minGSSize = 15, 
                     maxGSSize = 500, 
                     eps = 0, 
                     nPermSimple = 10000, 
                     seed = TRUE)
)
# First 8 rows with lowest enrichment p-values
cgsea_res@result %>% 
  arrange(pvalue) %>% 
  head(8)

Notice that the normalized enrichment scores (NES) are not quite the same as what we got when we used fgseaMultiLevel. This has to do with the permutations being different. To increase the precision of the NES and p-values, we can increase nPermSimple, though 10,000 should be more than sufficient.

Now, here is GSEA with the Reactome database using the gsePathway function from the ReactomePA package.

## Reactome FGSEA with ReactomePA package
library(ReactomePA)
fgsea_react <- gsePathway(geneList = geneList, 
                          organism = "human",
                          minGSSize = 15, 
                          maxGSSize = 500, 
                          eps = 0, 
                          nPermSimple = 10000, 
                          seed = TRUE)
# First 6 rows
head(fgsea_react@result)

References

Korotkevich, G., Sukhov, V., Budin, N., Shpak, B., Artyomov, M. N., and Sergushichev, A., Fast Gene Set Enrichment Analysis, Bioinformatics, Jun. 2016.

Mootha, Vamsi K., Lindgren, C. M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., et al., PGC-1α-Responsive Genes Involved in Oxidative Phosphorylation Are Coordinately Downregulated in Human Diabetes, Nature Genetics, vol. 34, no. 3, pp. 267–73, accessed September 19, 2021, from https://www.nature.com/articles/ng1180, July 2003. DOI: 10.1038/ng1180

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., et al., Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles, Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 43, pp. 15545–50, accessed September 7, 2021, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1239896/, October 2005. DOI: 10.1073/pnas.0506580102