8.1 Annotation Databases
In this section, we will explore a few of the common annotation databases used for pathway analysis.
8.1.1 Gene Ontology
The Gene Ontology (GO) database is divided into three separate domains: Biological Process, Cellular Component, and Molecular Function (see the Gene Ontology overview for more details regarding each domain). Each domain is structured as a directed acyclic graph (DAG) where nodes are terms and edges are the relations between the terms (part of, is a, has part, regulates). Nodes can be connected to multiple child and parent nodes, where the group of genes annotated to a child node is a subset of those that are annotated to its parent node(s) (2021; Goeman et al., 2008).
8.1.1.1 Semantic Similarity
Due to the DAG structure of each domain, there is often redundancy in pathway analysis results. For example, suppose terms GO:0006119, GO:0009060, and GO:0046034 are significantly over-represented biological processes. GO:0009060 and GO:0046034 are the parent terms of GO:0006119. Due to this relationship, the terms likely provide much of the same information, so the inclusion of all three terms in the output is unnecessary. In order to resolve this redundancy, we can calculate the semantic similarity between pairs of GO terms, which “assesses the likeness in meaning of two concepts” (Pesquita, 2017). Basically, if two terms are highly related, we can use some other criteria (such as adjusted p-value or level in the DAG) to retain only one of the terms. Below, we use the GOSemSim
package to calculate the semantic similarity between the terms.
## Calculate semantic similarity between GO terms
library(GOSemSim)
library(org.Hs.eg.db)
# GO DATA for measuring semantic similarity.
# keytype is "ENTREZID" by default and
# information content is calculated (computeIC = TRUE)
<- godata(OrgDb = "org.Hs.eg.db", ont = "BP")
semData <- c("GO:0006119", "GO:0009060", "GO:0046034")
terms # measure = "Rel" is the default for clusterProfiler::simplify
# See code for clusterProfiler:::simplify_internal
<- mgoSim(GO1 = terms, GO2 = terms, semData = semData,
sim measure = "Rel", combine = NULL)
If measure
is "Lin"
, "Jiang"
, or "Wang"
, the semantic similarity of a term with itself will be 1. This is not true for the other methods.
We can see from Table ?? that GO:0009060 and GO:0046034 have low semantic similarity, while GO:0006119 is highly similar to its parent terms. This makes sense because the parent terms are not related/connected in the DAG.
Now that we have the semantic similarities, we can remove redundant terms. clusterProfiler
has a function called simplify
that will calculate semantic similarity and remove terms. By default, if there are two terms with a Wang semantic similarity greater than 0.7, simplify
retains the term with the lowest adjusted p-value. See this post by Guangchuang Yu for more details on clusterProfiler::simplify
.
8.1.1.2 GO Subsets/Slims
Another way to handle the redundancy of GO terms is to use a GO slim, which is a subset of more general or research-relevant terms from the GO. GO slims can be downloaded or the biomaRt
package can be used to access GO slim accessions.
## Create human GO slim
library(biomaRt)
library(clusterProfiler) # gcSample data
library(dplyr)
<- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
mart dataset = "hsapiens_gene_ensembl")
# Uncomment to determine which attributes to select in getBM()
# View(listAttributes(mart))
# The GO slim columns are goslim_goa_accession and goslim_goa_description.
# We will map from the Entrez IDs in gcSample to these attributes.
data(gcSample)
<- unique(unlist(gcSample))
universe <- getBM(filters = "entrezgene_id",
GO_slim attributes = c("entrezgene_id",
"goslim_goa_accession",
"goslim_goa_description"),
values = universe, # Subset to these Entrez IDs
mart = mart) %>%
# Convert entrezgene_id from integer to character
mutate_all(as.character)
Unfortunately, not every GO accession maps to a domain when we use biomaRt
(unsure why this is the case), so we won’t be able to separate the terms. However, there are two ways that we can still use these GO slim accessions. Either follow the steps for using fgsea::fora
(Section 8.2.4) with a gene set list that has been subset to the GO slim accessions, or remove any non GO slim accessions from the final results and readjust the remaining p-values.
8.1.2 Reactome
Home - Reactome Pathway Database
Reactome is a manually-curated database for biological processes and pathways. As of version 80 (April 2021), it contains data on 15 species (notably, H. sapiens, M. musculus, and R. norvegicus). H. sapiens is the most highly-annotated organism with 2580 pathways.
8.1.4 MSigDB
The Molecular Signatures Database (MSigDB) is a comprehensive resource of manually-curated gene sets divided into nine collections, as of v7.5.1 (Liberzon et al., 2015). Importantly, it contains non-redundant versions of the most up-to-date Gene Ontology, Reactome, and KEGG databases. The method for eliminating term redundancy is described in the v7.0 Release Notes:
We computed Jaccard coefficients for each pair of sets, and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then clustered highly similar sets into “chunks” using the
hclust
function from the Rstats
package according to their GO terms and applied two rounds of filtering for every “chunk”. First, we kept the largest set in the “chunk” and discarded the smaller sets. This left “chunks” of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).
Note: the Jaccard coefficient is used to measure the similarity of two sets, \(A\) and \(B\), and is calculated as \(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\).