8.1 Annotation Databases

In this section, we will explore a few of the common annotation databases used for pathway analysis.

8.1.1 Gene Ontology

The Gene Ontology (GO) database is divided into three separate domains: Biological Process, Cellular Component, and Molecular Function (see the Gene Ontology overview for more details regarding each domain). Each domain is structured as a directed acyclic graph (DAG) where nodes are terms and edges are the relations between the terms (part of, is a, has part, regulates). Nodes can be connected to multiple child and parent nodes, where the group of genes annotated to a child node is a subset of those that are annotated to its parent node(s) (2021; Goeman et al., 2008).

8.1.1.1 Semantic Similarity

Due to the DAG structure of each domain, there is often redundancy in pathway analysis results. For example, suppose terms GO:0006119, GO:0009060, and GO:0046034 are significantly over-represented biological processes. GO:0009060 and GO:0046034 are the parent terms of GO:0006119. Due to this relationship, the terms likely provide much of the same information, so the inclusion of all three terms in the output is unnecessary. In order to resolve this redundancy, we can calculate the semantic similarity between pairs of GO terms, which “assesses the likeness in meaning of two concepts” (Pesquita, 2017). Basically, if two terms are highly related, we can use some other criteria (such as adjusted p-value or level in the DAG) to retain only one of the terms. Below, we use the GOSemSim package to calculate the semantic similarity between the terms.

## Calculate semantic similarity between GO terms
library(GOSemSim)
library(org.Hs.eg.db)

# GO DATA for measuring semantic similarity.
# keytype is "ENTREZID" by default and 
# information content is calculated (computeIC = TRUE)
semData <- godata(OrgDb = "org.Hs.eg.db", ont = "BP")
terms <- c("GO:0006119", "GO:0009060", "GO:0046034")
# measure = "Rel" is the default for clusterProfiler::simplify
# See code for clusterProfiler:::simplify_internal
sim <- mgoSim(GO1 = terms, GO2 = terms, semData = semData, 
              measure = "Rel", combine = NULL)

If measure is "Lin", "Jiang", or "Wang", the semantic similarity of a term with itself will be 1. This is not true for the other methods.

We can see from Table ?? that GO:0009060 and GO:0046034 have low semantic similarity, while GO:0006119 is highly similar to its parent terms. This makes sense because the parent terms are not related/connected in the DAG.

Now that we have the semantic similarities, we can remove redundant terms. clusterProfiler has a function called simplify that will calculate semantic similarity and remove terms. By default, if there are two terms with a Wang semantic similarity greater than 0.7, simplify retains the term with the lowest adjusted p-value. See this post by Guangchuang Yu for more details on clusterProfiler::simplify.

8.1.1.2 GO Subsets/Slims

Another way to handle the redundancy of GO terms is to use a GO slim, which is a subset of more general or research-relevant terms from the GO. GO slims can be downloaded or the biomaRt package can be used to access GO slim accessions.

## Create human GO slim
library(biomaRt)
library(clusterProfiler) # gcSample data
library(dplyr)

mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                dataset = "hsapiens_gene_ensembl")
# Uncomment to determine which attributes to select in getBM()
# View(listAttributes(mart)) 

# The GO slim columns are goslim_goa_accession and goslim_goa_description.
# We will map from the Entrez IDs in gcSample to these attributes.
data(gcSample)
universe <- unique(unlist(gcSample))
GO_slim <- getBM(filters = "entrezgene_id",
                 attributes = c("entrezgene_id",
                                "goslim_goa_accession",
                                "goslim_goa_description"),
                 values = universe, # Subset to these Entrez IDs
                 mart = mart) %>% 
  # Convert entrezgene_id from integer to character
  mutate_all(as.character)

Unfortunately, not every GO accession maps to a domain when we use biomaRt (unsure why this is the case), so we won’t be able to separate the terms. However, there are two ways that we can still use these GO slim accessions. Either follow the steps for using fgsea::fora (Section 8.2.4) with a gene set list that has been subset to the GO slim accessions, or remove any non GO slim accessions from the final results and readjust the remaining p-values.

8.1.2 Reactome

Home - Reactome Pathway Database

Reactome is a manually-curated database for biological processes and pathways. As of version 80 (April 2021), it contains data on 15 species (notably, H. sapiens, M. musculus, and R. norvegicus). H. sapiens is the most highly-annotated organism with 2580 pathways.

8.1.3 KEGG

The Kyoto Encyclopedia of Genes and Genomes (KEGG)

KEGG: Kyoto Encyclopedia of Genes and Genomes

8.1.4 MSigDB

The Molecular Signatures Database (MSigDB) is a comprehensive resource of manually-curated gene sets divided into nine collections, as of v7.5.1 (Liberzon et al., 2015). Importantly, it contains non-redundant versions of the most up-to-date Gene Ontology, Reactome, and KEGG databases. The method for eliminating term redundancy is described in the v7.0 Release Notes:

We computed Jaccard coefficients for each pair of sets, and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then clustered highly similar sets into “chunks” using the hclust function from the R stats package according to their GO terms and applied two rounds of filtering for every “chunk”. First, we kept the largest set in the “chunk” and discarded the smaller sets. This left “chunks” of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).

Note: the Jaccard coefficient is used to measure the similarity of two sets, \(A\) and \(B\), and is calculated as \(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\).

References

Goeman, J. J. and Mansmann, U., Multiple Testing on the Directed Acyclic Graph of Gene Ontology, Bioinformatics, vol. 24, no. 4, pp. 537–44, accessed October 11, 2021, from https://doi.org/10.1093/bioinformatics/btm628, February 2008. DOI: 10.1093/bioinformatics/btm628

Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P. and Tamayo, P., The Molecular Signatures Database Hallmark Gene Set Collection, Cell Systems, vol. 1, no. 6, pp. 417–25, accessed May 6, 2022, from https://linkinghub.elsevier.com/retrieve/pii/S2405471215002185, December 2015. DOI: 10.1016/j.cels.2015.12.004

Pesquita, C., Semantic Similarity in the Gene Ontology, in The Gene Ontology Handbook, C. Dessimoz and N. Škunca, Eds., New York, NY: Springer, accessed October 7, 2021, from https://doi.org/10.1007/978-1-4939-3743-1_12, pp. 161–73, 2017.

Relations in the Gene Ontology, Gene Ontology Resource, October 2021.