8.2 Over-Representation Analysis
8.2.1 Overview
Over-representation analysis (ORA) is used to determine which a priori defined gene sets are more present (over-represented) in a subset of “interesting” genes than what would be expected by chance (Huang et al., 2009).
For example, if 10% of all genes being considered are “interesting” (statistically different between conditions, clustered together, etc.), we expect that about 10% of every gene set will be in this “interesting” group of genes. If any gene sets are significantly more present in this group, we would say they are over-represented. Note that we are not limited to analyzing genes, though that is how most databases are set up. So long as we have a feature-to-feature-set map, we can perform ORA.
I recommend ORA only when GSEA is not appropriate, such as for analyzing clusters (from k-means clustering, WGCNA, Mclust, etc.), where the only information available are the group designations. See Section 8.2.3 (ORA Drawbacks) for details.
8.2.2 Mathematical Details
For each gene set, an enrichment p-value is calculated using the Binomial distribution, Hypergeometric distribution, the Fisher exact test, or the Chi-square test. Although this list is not all-encompassing, these are the most popular statistical methods (Huang et al., 2009). Below is the formula for calculating the enrichment p-value for a particular gene set using the Hypergeometric distribution.
\[ P(X\geq x) = 1 - P(X \leq x-1) = 1 - \sum\limits_{i=0}^{x-1}\frac{\hphantom{}{M \choose i }{N - M \choose n-i}}{N \choose n} \]
In this equation, \(N\) is the number of background genes, \(n\) is the number of “interesting” genes, \(M\) is the number of genes that are annotated to a particular gene set \(S\), and \(x\) is the number of “interesting” genes that are annotated to \(S\). The numerator of the sum is the number of samples of \(n\) genes that can be taken from a population of \(N\) genes where exactly \(i\) of the genes are annotated to \(S\) and \(n-i\) are not annotated to \(S\). The denominator of the sum is the total number of samples of size \(n\) that can be taken from a population of size \(N\).
For example, suppose we have a list of 8000 genes, of which 400 are members of the same cluster \(C\). Also suppose that 100 of the 8000 genes are annotated to a particular gene set \(S\). Of these 100 genes, 20 are members of \(C\). The probability that 20 or more (up to 100) genes annotated to \(S\) are in cluster \(C\) by chance is given by
\[ P(X\geq 20) = 1 - P(X \leq 19) = 1-\sum \limits_{i=0}^{19}\frac{\hphantom{}{100 \choose i}{8000 - 100 \choose 400-i}}{8000 \choose 400} = 7.88 \times 10^{-8} \]
That is, it is extremely unlikely that 20 of the 100 genes from this set are grouped in cluster \(C\) by chance (at least, prior to adjustment for multiple comparisons). The code to calculate this p-value is
phyper(q = 20 - 1, m = 400, n = 8000 - 400, k = 100, lower.tail = FALSE)
After a p-value has been calculated for each of the applicable gene sets, a multiple testing adjustment should be applied.
8.2.3 Drawbacks
ORA is not recommended as a follow-up to differential-expression analysis for the reasons below. Use GSEA instead, if appropriate.
The choice of the threshold for statistical significance and the multiple comparison adjustment method can greatly impact the analysis (Huang et al., 2009).
ORA fails to incorporate direction of change. (Are the genes in a given set mainly up or down-regulated in one condition relative to another?). It is NOT a good idea to split DEA results by the direction of change and apply ORA to the resulting subsets, unless you are specifically asking “which gene sets are over-represented when we only consider genes that are up- or down-regulated?”
If few genes are in the “interesting” group, ORA may not yield useful or reliable results. For example, suppose 30 out of 8000 genes are “interesting”. 100 of the genes are annotated to a particular gene set, of which 3 are “interesting”. The associated Hypergeometric p-value is 0.006, and this set would be considered significantly over-represented at the 0.01 level (at least, prior to p-value adjustment); however, if only 2 of the genes in this set are “interesting”, this p-value increases 10-fold to 0.0536 and is no longer significant even at the 0.05 level.
ORA can not be used if the input contains duplicates. For example, a single feature can not be a member of two or more groups or present multiple times in the same group. This usually happens when attempting to perform gene-level ORA on protein-level differential analysis results, and can lead to artificial over-representation if genes are counted multiple times. Instead, use GSEA and summarize the ranking metric at the gene level (take the average).
8.2.4 Examples
For these examples, we will show how to perform ORA with the fgsea
, clusterProfiler
, ReactomePA
, and GOstats
packages on clustering results. The databases that we will cover are Gene Ontology Biological Processes and Reactome, and we will only consider gene sets/pathways with at least 15 and no more than 300 genes. For details on these different annotation databases, please see Section 8.1 (Annotation Databases).
We will use the gcSample
data from clusterProfiler
and treat the entire list as the gene universe/background. Each gene is represented by a human Entrez gene ID, which is the default keytype used by the clusterProfiler
functions (and the only keytype compatible with ReactomePA::enrichPathway
).
library(clusterProfiler)
data("gcSample") # data for examples
Since the genes in gcSample
are not unique, we will subset to unique genes for the sake of these examples.
# Need to remove duplicates for the examples
<- unlist(gcSample)
all_genes <- all_genes[Biobase::isUnique(all_genes)] # all unique genes
universe
# List with only unique genes
<- lapply(gcSample, function(group_i) {
gcUnique %in% universe]
group_i[group_i })
8.2.4.1 ORA with fgsea
The fgsea
package can be used to perform over-representation analysis with the fora
function, which applies the hypergeometric test. It requires a list of gene sets or pathways, a vector of “interesting” genes to test, and a gene universe vector. For clustering results, we can run this function in lapply
to test each cluster. For this example, we will perform ORA on the non-redundant Gene Ontology biological processes (GO-BP) sets from MSigDB, but any database can be used.
We first need to get the list of gene sets; we do this with the msigdbr
package (Section 8.1.4). We use msigdbr_collections
to determine the category and subcategory and msigdbr
to fetch the data.
# MSigDB R package
library(msigdbr)
::msigdbr_collections() # available collections
msigdbr# Subset to Human GO-BP sets
<- msigdbr(species = "Homo sapiens",
BP_db category = "C5", subcategory = "GO:BP")
head(BP_db)
# Convert to a list of gene sets
<- unique(BP_db[, c("entrez_gene", "gs_exact_source")])
BP_conv <- split(x = BP_conv$entrez_gene, f = BP_conv$gs_exact_source)
BP_list # First ~6 IDs of first 3 terms
lapply(head(BP_list, 3), head)
We have all the required input, so we can move on to ORA. In order to perform ORA on each cluster, we can wrap fora
in lapply
.
## Cluster GO-BP ORA with fgsea package
library(fgsea)
library(dplyr)
# For each cluster i, perform ORA
<- lapply(seq_along(gcUnique), function(i) {
fgsea_ora fora(pathways = BP_list,
genes = gcUnique[[i]], # genes in cluster i
universe = universe, # all genes
minSize = 15,
maxSize = 500) %>%
mutate(cluster = names(gcUnique)[i]) # add cluster column
%>%
}) ::rbindlist() %>% # combine tables
data.tablefilter(padj < 0.05) %>%
arrange(cluster, padj) %>%
# Add additional columns from BP_db
left_join(distinct(BP_db, gs_subcat, gs_exact_source,
gs_name, gs_description),by = c("pathway" = "gs_exact_source")) %>%
# Reformat descriptions
mutate(gs_name = sub("^GOBP_", "", gs_name),
gs_name = gsub("_", " ", gs_name))
# First 6 rows
head(fgsea_ora)
See ?fora
for a description of the output. The base output does not include term descriptions, so we had to add those ourselves.
8.2.4.2 ORA with clusterProfiler
/ReactomePA
clusterProfiler
and ReactomePA
are convenience packages that use fgsea
and AnnotationDbi
as the basis for their pathway analysis functions. The functions are a bit more user-friendly, as the user does not need to fetch the gene lists themselves, but they are much slower, and the resulting objects are not simple data.frames.
clusterProfiler
provides the enrichGO
and enrichKEGG
functions for GO and KEGG ORA, respectively (among others). ReactomePA
provides the enrichPathway
function for Reactome pathway ORA. For other databases, the clusterProfiler::enricher
function can be used (though this is slower than just using fgsea::fora
with a list of gene sets). To perform ORA on clustering results, we use clusterProfiler::compareCluster
and tell it to use enrichGO
as the ORA function.
## Cluster GO-BP ORA with clusterProfiler package
<- compareCluster(
cp_ora geneClusters = gcUnique,
fun = "enrichGO", # ORA function to apply to each cluster
# Arguments below are passed to enrichGO
OrgDb = "org.Hs.eg.db",
keyType = "ENTREZID",
ont = "BP", # BP, CC, MF, or ALL for all ontologies
pvalueCutoff = 0.05,
qvalueCutoff = 1, # do not filter by q-value
pAdjustMethod = "BH", # p-values are adjusted within clusters
universe = universe, # all genes
minGSSize = 15,
maxGSSize = 500
)
# First 6 entries sorted by cluster and p-value
@compareClusterResult %>%
cp_oraarrange(Cluster, pvalue) %>%
head()
Unlike the fgsea::fora
results, these include the description of each term. As for the other columns: GeneRatio
is the same as overlap
(from the fora
results) divided by the cluster size, BgRatio
is the set size divided by the universe size, pvalue
is the raw p-value, p.adjust
is the BH-adjusted p-value, qvalue
is the q-value, geneID
is the same as overlapGenes
from fora
, and Count
is the overlap size.
Below is an example of how to perform Reactome ORA with ReactomePA::enrichPathway
.
## Reactome ORA with ReactomePA package
library(ReactomePA)
<- compareCluster(
react_ora geneClusters = gcUnique,
fun = "enrichPathway", # ORA function to apply to each cluster
# Arguments below are passed to enrichPathway
organism = "human",
pvalueCutoff = 1, # Do not filter by p-value
qvalueCutoff = 1, # Do not filter by q-value
pAdjustMethod = "BH", # p-values are adjusted within clusters
universe = universe, # all genes
minGSSize = 15,
maxGSSize = 500
)# First 6 rows
head(react_ora@compareClusterResult)
8.2.4.3 ORA with GOstats
In the previous ORA examples, the Hypergeometric test is performed independently for each gene set; however, this does not capture the relationship between GO terms (described in Section 8.1.1). Since “each GO term inherits all annotations from its more specific descendants,” results tend to be redundant (except when using MSigDB), as they include directly-related GO terms with a high degree of overlap (S. Falcon et al., 2007). One way to handle this is with a procedure that conditions on the GO structure, like the one described by S. Falcon and R. Gentleman (2007):
Given a subgraph of one of the three GO ontologies [BP, MF, or CC], we test the leaves of the graph, that is, those terms with no child terms. Before testing the terms whose children have already been tested, we remove all genes annotated at significant children [
pvalueCutoff = 0.05
in the code below] from the parent’s gene list. This continues until all terms have been tested.
This approach is implemented in the GOstats
package by setting conditional = TRUE
when creating a new object of class GOHyperGParams
(See help("GOHyperGParams-class", package = "Category")
for more details). Below, we will perform GO-BP ORA just for the first cluster of gcUnique
because it is time-consuming, but this could be wrapped in lapply
to get results for each cluster (like in Section 8.2.4.1).
## Cluster GO-BP ORA with GOstats package
library(GOstats)
library(org.Hs.eg.db)
library(dplyr)
# For cluster 1, perform conditional ORA
<- new(Class = "GOHyperGParams",
gostats_ora ontology = "BP",
geneIds = gcUnique[[1]],
universeGeneIds = universe,
annotation = "org.Hs.eg.db",
pvalueCutoff = 0.05,
testDirection = "over",
conditional = TRUE, # condition on GO structure
minSizeCutoff = 15,
maxSizeCutoff = 500) %>%
hyperGTest() %>% # Hypergeometric testing
summary() %>% # extract results
# adjust p-values
mutate(Padj = p.adjust(Pvalue, method = "BH")) %>%
filter(Padj < 0.05) %>%
arrange(Padj)
# First 6 rows
head(gostats_ora)
If we compare this table to the results from the other packages, we see that only GO:0007600 and GO:0061844 from the top 6 terms made it into the above table.