8.3 Gene Set Enrichment Analysis
8.3.1 Overview
(More details to be added at a later date.)
Gene set enrichment analysis (GSEA) is a rank-based approach that determines whether predefined groups of genes/proteins/etc. are primarily up or down in one condition relative to another (Vamsi K. Mootha et al., 2003; Subramanian et al., 2005). It is typically performed as a follow-up to differential analysis, and is preferred to ORA (Section 8.2).
8.3.2 Examples
These examples will show how to run Fast GSEA (FGSEA) in R, which is based on the gene permutation approach (Korotkevich et al., 2016).
The input of FGSEA is a list of gene sets/pathways to check and a uniquely-named vector of ranking metric values sorted in descending order. The ranking metric that we will use is \(-log_{10}(\text{p-value}) * sign(\text{logFC})\), but we could have easily used t-statistics or some other metric. We will perform differential analysis on the cptac_oca
data from the MSnSet.utils
package and use the results to create the ranking metric vector. We start off by remapping features from RefSeq to Entrez ID.
## Fetch REFSEQ to ENTREZID conversion table
library(MSnID)
<- fetch_conversion_table(organism_name = "Homo sapiens",
conv_tbl from = "REFSEQ", to = "ENTREZID")
head(conv_tbl)
Now, we will create the differential analysis results and add the ENTREZID column.
library(MSnSet.utils)
data("cptac_oca")
<- oca.set
m1
# Differential analysis
<- limma_a_b(eset = m1,
res model.str = "~ PLATINUM.STATUS",
coef.str = "PLATINUM.STATUS")
# table(res$adj.P.Val < 0.05) # 0
# hist(res$P.Value, breaks = seq(0, 1, 0.05)) # looks uniform
head(res)
library(dplyr)
# Add ENTREZID column
<- res %>%
res mutate(REFSEQ = sub("\\.\\d+", "", rownames(.))) %>%
left_join(conv_tbl, by = "REFSEQ")
head(res)
We need one ranking metric value per Entrez ID, so we will calculate the ranking metric and then take the gene-wise average. Then, we sort the genes in descending order by their ranking metric values and convert to a named vector called geneList
.
## Ranking metric vector for GSEA
<- res %>%
geneList filter(!is.na(ENTREZID), !is.na(logFC)) %>%
mutate(ranking_metric = -log10(P.Value)*sign(logFC)) %>%
group_by(ENTREZID) %>%
summarise(ranking_metric = mean(ranking_metric, na.rm = TRUE)) %>%
arrange(-ranking_metric) %>% # sort descending (important!)
::deframe() # convert to named vector
tibblehead(geneList)
tail(geneList)
Now that we have the uniquely-named ranking metric vector, we can proceed with FGSEA. For these examples, we will show how to perform FGSEA with the fgsea
, clusterProfiler
, and ReactomePA
packages. The databases that we will cover are Gene Ontology Biological Processes and Reactome, and we will only consider gene sets/pathways with at least 15 and no more than 500 genes. The maximum gene set size, as well as the number of permutations, affect the normalized enrichment scores and p-values. For details on these different annotation databases, please see Section 8.1 (Annotation Databases).
8.3.2.1 GSEA with fgsea
To perform FGSEA with the fgsea
package, we need a list of gene sets/pathways and the ranking metric vector. Below is one way to get the gene set list.
# MSigDB R package
library(msigdbr)
::msigdbr_collections() # available collections
msigdbr# Subset to Human GO-BP sets
<- msigdbr(species = "Homo sapiens",
BP_db category = "C5", subcategory = "GO:BP")
head(BP_db)
# Convert to a list of gene sets
<- unique(BP_db[, c("entrez_gene", "gs_exact_source")])
BP_conv <- split(x = BP_conv$entrez_gene, f = BP_conv$gs_exact_source)
BP_list # First ~6 IDs of first 3 terms
lapply(head(BP_list, 3), head)
The fgseaMultilevel
function uses the adaptive multilevel split Monte Carlo approach described in the original FGSEA paper (Korotkevich et al., 2016).
## GO-BP FGSEA with fgsea package
library(fgsea)
set.seed(99)
system.time( # keep track of elapsed time
<- fgseaMultilevel(pathways = BP_list,
fgsea_res stats = geneList,
minSize = 15,
maxSize = 500,
eps = 0,
nPermSimple = 10000)
)# First 6 rows with lowest enrichment p-values
%>%
fgsea_res # Add additional columns from BP_db
left_join(distinct(BP_db, gs_subcat, gs_exact_source,
gs_name, gs_description),by = c("pathway" = "gs_exact_source")) %>%
# Reformat descriptions
mutate(gs_name = sub("^GOBP_", "", gs_name),
gs_name = gsub("_", " ", gs_name)) %>%
arrange(padj) %>%
head(8)
See ?fgseaMultilevel
for a description of the output. The output does not include term descriptions, so we had to add those ourselves.
8.3.2.2 GSEA with clusterProfiler
/ReactomePA
See Section 8.2.4.2 to get a better understanding of these packages. clusterProfiler
provides the gseGO
and gseKEGG
functions (among others) for FGSEA of the GO and KEGG databases, respectively. They are essentially more user-friendly wrapper functions that make use of the fgsea
and AnnotationDbi
package, but they tend to be much slower.
## GO-BP FGSEA with clusterProfiler package
library(clusterProfiler)
system.time( # keep track of elapsed time
<- gseGO(geneList = geneList,
cgsea_res ont = "BP",
OrgDb = "org.Hs.eg.db",
minGSSize = 15,
maxGSSize = 500,
eps = 0,
nPermSimple = 10000,
seed = TRUE)
)# First 8 rows with lowest enrichment p-values
@result %>%
cgsea_resarrange(pvalue) %>%
head(8)
Notice that the normalized enrichment scores (NES) are not quite the same as what we got when we used fgseaMultiLevel
. This has to do with the permutations being different. To increase the precision of the NES and p-values, we can increase nPermSimple
, though 10,000 should be more than sufficient.
Now, here is GSEA with the Reactome database using the gsePathway
function from the ReactomePA
package.
## Reactome FGSEA with ReactomePA package
library(ReactomePA)
<- gsePathway(geneList = geneList,
fgsea_react organism = "human",
minGSSize = 15,
maxGSSize = 500,
eps = 0,
nPermSimple = 10000,
seed = TRUE)
# First 6 rows
head(fgsea_react@result)