8.2 Over-Representation Analysis

8.2.1 Overview

Over-representation analysis (ORA) is used to determine which a priori defined gene sets are more present (over-represented) in a subset of “interesting” genes than what would be expected by chance (Huang et al., 2009).

For example, if 10% of all genes being considered are “interesting” (statistically different between conditions, clustered together, etc.), we expect that about 10% of every gene set will be in this “interesting” group of genes. If any gene sets are significantly more present in this group, we would say they are over-represented. Note that we are not limited to analyzing genes, though that is how most databases are set up. So long as we have a feature-to-feature-set map, we can perform ORA.

I recommend ORA only when GSEA is not appropriate, such as for analyzing clusters (from k-means clustering, WGCNA, Mclust, etc.), where the only information available are the group designations. See Section 8.2.3 (ORA Drawbacks) for details.

8.2.2 Mathematical Details

For each gene set, an enrichment p-value is calculated using the Binomial distribution, Hypergeometric distribution, the Fisher exact test, or the Chi-square test. Although this list is not all-encompassing, these are the most popular statistical methods (Huang et al., 2009). Below is the formula for calculating the enrichment p-value for a particular gene set using the Hypergeometric distribution.

\[ P(X\geq x) = 1 - P(X \leq x-1) = 1 - \sum\limits_{i=0}^{x-1}\frac{\hphantom{}{M \choose i }{N - M \choose n-i}}{N \choose n} \]

In this equation, \(N\) is the number of background genes, \(n\) is the number of “interesting” genes, \(M\) is the number of genes that are annotated to a particular gene set \(S\), and \(x\) is the number of “interesting” genes that are annotated to \(S\). The numerator of the sum is the number of samples of \(n\) genes that can be taken from a population of \(N\) genes where exactly \(i\) of the genes are annotated to \(S\) and \(n-i\) are not annotated to \(S\). The denominator of the sum is the total number of samples of size \(n\) that can be taken from a population of size \(N\).

For example, suppose we have a list of 8000 genes, of which 400 are members of the same cluster \(C\). Also suppose that 100 of the 8000 genes are annotated to a particular gene set \(S\). Of these 100 genes, 20 are members of \(C\). The probability that 20 or more (up to 100) genes annotated to \(S\) are in cluster \(C\) by chance is given by

\[ P(X\geq 20) = 1 - P(X \leq 19) = 1-\sum \limits_{i=0}^{19}\frac{\hphantom{}{100 \choose i}{8000 - 100 \choose 400-i}}{8000 \choose 400} = 7.88 \times 10^{-8} \]

That is, it is extremely unlikely that 20 of the 100 genes from this set are grouped in cluster \(C\) by chance (at least, prior to adjustment for multiple comparisons). The code to calculate this p-value is

phyper(q = 20 - 1, m = 400, n = 8000 - 400, k = 100, lower.tail = FALSE)

After a p-value has been calculated for each of the applicable gene sets, a multiple testing adjustment should be applied.

8.2.3 Drawbacks

ORA is not recommended as a follow-up to differential-expression analysis for the reasons below. Use GSEA instead, if appropriate.

  1. The choice of the threshold for statistical significance and the multiple comparison adjustment method can greatly impact the analysis (Huang et al., 2009).

  2. ORA fails to incorporate direction of change. (Are the genes in a given set mainly up or down-regulated in one condition relative to another?). It is NOT a good idea to split DEA results by the direction of change and apply ORA to the resulting subsets, unless you are specifically asking “which gene sets are over-represented when we only consider genes that are up- or down-regulated?”

  3. If few genes are in the “interesting” group, ORA may not yield useful or reliable results. For example, suppose 30 out of 8000 genes are “interesting”. 100 of the genes are annotated to a particular gene set, of which 3 are “interesting”. The associated Hypergeometric p-value is 0.006, and this set would be considered significantly over-represented at the 0.01 level (at least, prior to p-value adjustment); however, if only 2 of the genes in this set are “interesting”, this p-value increases 10-fold to 0.0536 and is no longer significant even at the 0.05 level.

  4. ORA can not be used if the input contains duplicates. For example, a single feature can not be a member of two or more groups or present multiple times in the same group. This usually happens when attempting to perform gene-level ORA on protein-level differential analysis results, and can lead to artificial over-representation if genes are counted multiple times. Instead, use GSEA and summarize the ranking metric at the gene level (take the average).

8.2.4 Examples

For these examples, we will show how to perform ORA with the fgsea, clusterProfiler, ReactomePA, and GOstats packages on clustering results. The databases that we will cover are Gene Ontology Biological Processes and Reactome, and we will only consider gene sets/pathways with at least 15 and no more than 300 genes. For details on these different annotation databases, please see Section 8.1 (Annotation Databases).

We will use the gcSample data from clusterProfiler and treat the entire list as the gene universe/background. Each gene is represented by a human Entrez gene ID, which is the default keytype used by the clusterProfiler functions (and the only keytype compatible with ReactomePA::enrichPathway).

library(clusterProfiler)
data("gcSample") # data for examples

Since the genes in gcSample are not unique, we will subset to unique genes for the sake of these examples.

# Need to remove duplicates for the examples
all_genes <- unlist(gcSample)
universe <- all_genes[Biobase::isUnique(all_genes)] # all unique genes

# List with only unique genes
gcUnique <- lapply(gcSample, function(group_i) {
  group_i[group_i %in% universe]
})

8.2.4.1 ORA with fgsea

The fgsea package can be used to perform over-representation analysis with the fora function, which applies the hypergeometric test. It requires a list of gene sets or pathways, a vector of “interesting” genes to test, and a gene universe vector. For clustering results, we can run this function in lapply to test each cluster. For this example, we will perform ORA on the non-redundant Gene Ontology biological processes (GO-BP) sets from MSigDB, but any database can be used.

We first need to get the list of gene sets; we do this with the msigdbr package (Section 8.1.4). We use msigdbr_collections to determine the category and subcategory and msigdbr to fetch the data.

# MSigDB R package
library(msigdbr)
msigdbr::msigdbr_collections() # available collections
# Subset to Human GO-BP sets
BP_db <- msigdbr(species = "Homo sapiens", 
                 category = "C5", subcategory = "GO:BP")
head(BP_db)
# Convert to a list of gene sets
BP_conv <- unique(BP_db[, c("entrez_gene", "gs_exact_source")])
BP_list <- split(x = BP_conv$entrez_gene, f = BP_conv$gs_exact_source)
# First ~6 IDs of first 3 terms
lapply(head(BP_list, 3), head)

We have all the required input, so we can move on to ORA. In order to perform ORA on each cluster, we can wrap fora in lapply.

## Cluster GO-BP ORA with fgsea package
library(fgsea)
library(dplyr)

# For each cluster i, perform ORA
fgsea_ora <- lapply(seq_along(gcUnique), function(i) {
  fora(pathways = BP_list, 
       genes = gcUnique[[i]], # genes in cluster i
       universe = universe, # all genes
       minSize = 15, 
       maxSize = 500) %>% 
    mutate(cluster = names(gcUnique)[i]) # add cluster column
}) %>% 
  data.table::rbindlist() %>% # combine tables
  filter(padj < 0.05) %>% 
  arrange(cluster, padj) %>% 
  # Add additional columns from BP_db
  left_join(distinct(BP_db, gs_subcat, gs_exact_source, 
                     gs_name, gs_description),
            by = c("pathway" = "gs_exact_source")) %>% 
  # Reformat descriptions
  mutate(gs_name = sub("^GOBP_", "", gs_name),
         gs_name = gsub("_", " ", gs_name))

# First 6 rows
head(fgsea_ora)

See ?fora for a description of the output. The base output does not include term descriptions, so we had to add those ourselves.

8.2.4.2 ORA with clusterProfiler/ReactomePA

clusterProfiler and ReactomePA are convenience packages that use fgsea and AnnotationDbi as the basis for their pathway analysis functions. The functions are a bit more user-friendly, as the user does not need to fetch the gene lists themselves, but they are much slower, and the resulting objects are not simple data.frames.

clusterProfiler provides the enrichGO and enrichKEGG functions for GO and KEGG ORA, respectively (among others). ReactomePA provides the enrichPathway function for Reactome pathway ORA. For other databases, the clusterProfiler::enricher function can be used (though this is slower than just using fgsea::fora with a list of gene sets). To perform ORA on clustering results, we use clusterProfiler::compareCluster and tell it to use enrichGO as the ORA function.

## Cluster GO-BP ORA with clusterProfiler package
cp_ora <- compareCluster(
  geneClusters = gcUnique, 
  fun = "enrichGO", # ORA function to apply to each cluster
  # Arguments below are passed to enrichGO
  OrgDb = "org.Hs.eg.db", 
  keyType = "ENTREZID", 
  ont = "BP", # BP, CC, MF, or ALL for all ontologies
  pvalueCutoff = 0.05,
  qvalueCutoff = 1, # do not filter by q-value
  pAdjustMethod = "BH", # p-values are adjusted within clusters
  universe = universe, # all genes
  minGSSize = 15, 
  maxGSSize = 500
)

# First 6 entries sorted by cluster and p-value
cp_ora@compareClusterResult %>% 
  arrange(Cluster, pvalue) %>% 
  head()

Unlike the fgsea::fora results, these include the description of each term. As for the other columns: GeneRatio is the same as overlap (from the fora results) divided by the cluster size, BgRatio is the set size divided by the universe size, pvalue is the raw p-value, p.adjust is the BH-adjusted p-value, qvalue is the q-value, geneID is the same as overlapGenes from fora, and Count is the overlap size.

Below is an example of how to perform Reactome ORA with ReactomePA::enrichPathway.

## Reactome ORA with ReactomePA package
library(ReactomePA)
react_ora <- compareCluster(
  geneClusters = gcUnique, 
  fun = "enrichPathway", # ORA function to apply to each cluster
  # Arguments below are passed to enrichPathway
  organism = "human",
  pvalueCutoff = 1, # Do not filter by p-value
  qvalueCutoff = 1, # Do not filter by q-value
  pAdjustMethod = "BH", # p-values are adjusted within clusters
  universe = universe, # all genes
  minGSSize = 15, 
  maxGSSize = 500
)
# First 6 rows
head(react_ora@compareClusterResult)

8.2.4.3 ORA with GOstats

In the previous ORA examples, the Hypergeometric test is performed independently for each gene set; however, this does not capture the relationship between GO terms (described in Section 8.1.1). Since “each GO term inherits all annotations from its more specific descendants,” results tend to be redundant (except when using MSigDB), as they include directly-related GO terms with a high degree of overlap (S. Falcon et al., 2007). One way to handle this is with a procedure that conditions on the GO structure, like the one described by S. Falcon and R. Gentleman (2007):

Given a subgraph of one of the three GO ontologies [BP, MF, or CC], we test the leaves of the graph, that is, those terms with no child terms. Before testing the terms whose children have already been tested, we remove all genes annotated at significant children [pvalueCutoff = 0.05 in the code below] from the parent’s gene list. This continues until all terms have been tested.

This approach is implemented in the GOstats package by setting conditional = TRUE when creating a new object of class GOHyperGParams (See help("GOHyperGParams-class", package = "Category") for more details). Below, we will perform GO-BP ORA just for the first cluster of gcUnique because it is time-consuming, but this could be wrapped in lapply to get results for each cluster (like in Section 8.2.4.1).

## Cluster GO-BP ORA with GOstats package
library(GOstats)
library(org.Hs.eg.db)
library(dplyr)
# For cluster 1, perform conditional ORA
gostats_ora <- new(Class = "GOHyperGParams",
                   ontology = "BP",
                   geneIds = gcUnique[[1]],
                   universeGeneIds = universe,
                   annotation = "org.Hs.eg.db",
                   pvalueCutoff = 0.05,
                   testDirection = "over",
                   conditional = TRUE, # condition on GO structure
                   minSizeCutoff = 15,
                   maxSizeCutoff = 500) %>% 
  hyperGTest() %>% # Hypergeometric testing
  summary() %>% # extract results
  # adjust p-values
  mutate(Padj = p.adjust(Pvalue, method = "BH")) %>% 
  filter(Padj < 0.05) %>% 
  arrange(Padj)
# First 6 rows
head(gostats_ora)

If we compare this table to the results from the other packages, we see that only GO:0007600 and GO:0061844 from the top 6 terms made it into the above table.

References

Falcon, S. and Gentleman, R., Using GOstats to Test Gene Lists for GO Term Association, Bioinformatics, vol. 23, no. 2, pp. 257–58, accessed May 5, 2022, from https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btl567, January 2007. DOI: 10.1093/bioinformatics/btl567
Huang, D. W., Sherman, B. T. and Lempicki, R. A., Bioinformatics Enrichment Tools: Paths Toward the Comprehensive Functional Analysis of Large Gene Lists, Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, accessed September 7, 2021, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2615629/, January 2009. DOI: 10.1093/nar/gkn923