Author: Srikeerthana Kuchi

Contact: Srikeerthana.Kuchi@glasgow.ac.uk

Introduction

Gene Ontology

In simple terms, gene ontology (GO) refers to a set of biological terms (formal vocabulary) generated by the GO consortium, that are used to describe molecular function (MF), biological process (BP) and cellular component (CC) of genes and gene products. The terms are in the form of DAGs (Directed Acyclic Graphs), with edges representing one or more ‘parent’ and ‘child’ relationships. Each term has a unique ID, is well-defined, and has a source from which the term has been obtained.

Annotations

Genes are linked/associated with GO terms/ontologies supported by evidence from literature references, other databases or computational analyses. Downloadable ontologies, definitions, gene-product annotations and much more information can be obtained from the GO website. AmiGO can be used to search both the GO ontology, the GO annotations and details about gene products described in the GO knowledgebase.

The three ontologies

Ontology	Description	Examples	Number of terms (as of June 2024)
Molecular function	What a gene product does	TLR	1154075
Biological process	A biological objective to which the gene product contributes	Cell division	1177586
Cellular component	Where a gene product is found in the cell	Nucleus	1192877

Pathways

Biological pathways are a series of interactions among genes/proteins occurring within a cell/tissue of an organism, describing a biological mechanism and its dependencies, whereas a gene set is simply an unordered set of genes derived based on cell location/associated with a specific biological process/arbitrarily grouped. Pathway analysis provides information about gene interactions, the magnitude and direction of the dysregulated pathways specific to a biological dataset, in this case an RNA-seq experiment.

Common Pathway databases

KEGG pathway database

KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database is a resource consisting of manually curated pathway maps representing molecular interactions, reaction and relation networks derived from various sources. Each pathway has a unique ID providing information about the source and is seamlessly integrated with other databases of KEGG such as Module consisting of gene sets and Network consisting of disease and drug based information.

MSigDb database

The molecular signatures database MSigDb has many annotated genesets such as Hallmark gene sets, canonical pathways etc,. that represent well-defined biological processes, generated both manually and computationally by using various sources of information such as literature, expression signatures, information from other pathway databases, etc,.

Reactome database

Reactome is an open-source manually curated and peer-reviewed pathway database supporting many model organisms such as human, yeast, fruit fly etc,.

Enrichment analysis methods

ORA (over-representation analysis) vs GSEA (Gene-set enrichment analysis)

ORA uses hypergeometric tests (Fisher’s exact test) to determine whether members from any of the pre-determined gene sets (a KEGG or a Reactome pathway) are significantly enriched in an input list of genes specific to a biological experiment (e.g. a differentially expressed (DE) list of genes derived from a RNA-seq experiment). GSEA takes a ranked gene list as an input to compare gene expression profiles to assess whether interdependent genes are significantly dysregulated w.r.t to a phenotype e.g. a diseased transcriptome, by aggregating gene-level summary statistics across genes within a geneset.

Enrichment analysis in R

We will be using clusterProfiler R package to perform gene functional annotation using different sources. It provides a tidy interface to access, manipulate and visualize ORA and GSEA enrichment results for multiple datasets in a single run.

Other R packages such as GSEA, topGO, fgsea can perform ORA and GSEA analysis and might differ in underlying algorithms and statistical tests performed.

GO ORA analysis

We load the necessary packages into R console and use enrichGO() function of clusterProfiler package to perform GO enrichment analysis, which provides a list of FDR corrected statistically significant enriched GO categories in your dataset. The input for this analysis is a simple list of genes.

# Copy the edgeR output to the current working directory
cp /home4/VBG_data/RNASeq/DEG_edgeR.csv .
# Copy cpm data into the current working directory
cp /home4/VBG_data/RNASeq/cpm.csv .

# Load R
/software/R-4.4.1/bin/R

# Load the necessary libraries in R
library(clusterProfiler)
library(org.Hs.eg.db)
library(dplyr)
library(tibble)
library(ggplot2)
library(enrichplot)
library(DOSE)
library(msigdbr)
library(cowplot)
library(paletteer)
library(RColorBrewer)
library(ComplexHeatmap)
library(InteractiveComplexHeatmap)
library(plotly)
library(DESeq2)
library(reshape2)
library(tidyverse)
library(pathview) # BiocManager::install("pathview")

# Read the edgeR output into R
df<-read.table("DEG_edgeR.csv",sep=",",row.names = 1,header=T)
head(df)

# Count the number of DEG
dim(df)

# Extract the first column (gene names) 
genes<-rownames(df)

# Convert from ENSEMBL IDs to ENTREZ IDs
entrez_conv <- data.frame(mapIds(org.Hs.eg.db,
  keys = genes,
  keytype = "ENSEMBL",
  column = "ENTREZID",
  multiVals = "first"),
  stringsAsFactors = FALSE)

# Change rownames into first column and change names of the columns 
entrez_conv<-tibble::rownames_to_column(entrez_conv, "ensembl_id")
names(entrez_conv)[2]<-"entrez_id"

# Remove unmapped ensembl IDs from further analysis
entrez_conv<-entrez_conv %>% dplyr::filter(!is.na(entrez_id))

# Run GO ORA
ego <- enrichGO(gene = entrez_conv$entrez_id,
  OrgDb         = org.Hs.eg.db,
  ont           = "MF",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.01,
  qvalueCutoff  = 0.01,
  readable      = TRUE)

# Visualization
# Dot plot
p1<-dotplot(ego, showCategory=20) + ggtitle("dotplot for ORA")
pdf("ORA_dotplot.pdf",10,10)
p1
dev.off()

# Bar Plot
p2<-barplot(ego, showCategory=20) 
pdf("ORA_barplot.pdf",10,10)
p2
dev.off()

GO GSEA analysis

gseGO() function from the clusterProfiler package can be used to perform GO GSEA analysis. The input must be a named, ranked, ordered (sorted) unique gene list, e.g. a fold-change sorted gene list from a DE analysis of an RNA-seq experiment.

# Add logFC values from the 'df' dataframe to the 'entrez_conv' dataframe
entrez_conv$logFC<-df$logFC[match(entrez_conv$ensembl_id,rownames(df))]

# Create a named, ranked, ordered gene list
geneList1<-as.numeric(entrez_conv$logFC)
names(geneList1)<-as.character(entrez_conv$entrez_id)
geneList1 = sort(geneList1, decreasing = TRUE)

# Run GO GSEA
ego2 <- gseGO(geneList     = geneList1,
              OrgDb        = org.Hs.eg.db,
              ont          = "BP",
              minGSSize    = 10,
              maxGSSize    = 500,
              pvalueCutoff = 0.01,
              nPermSimple = 100000,
              verbose      = FALSE,
              eps = 0)

# Visualisation
dotplot(ego2, showCategory=20) + ggtitle("dotplot for ORA")

# Another way to rank genes
entrez_conv$PValue<-df$PValue[match(entrez_conv$ensembl_id,rownames(df))]
entrez_conv$FCP<- entrez_conv$PValue*sign(entrez_conv$logFC)

geneList2<-as.numeric(entrez_conv$FCP)
names(geneList2)<-as.character(entrez_conv$entrez_id)
geneList2 = sort(geneList2, decreasing = TRUE)

# Run GO GSEA
ego3 <- gseGO(geneList     = geneList2,
              OrgDb        = org.Hs.eg.db,
              ont          = "BP",
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.01,
              nPermSimple = 100000,
              verbose      = FALSE)
              
# Visualization
dotplot(ego3, showCategory=20) + ggtitle("dotplot for ORA")

KEGG/Reactome/MSigDb/other GSEA

clusterProfiler package supports various pathway databases such as KEGG, Reactome, MSigDb and other user-defined gene sets to be used as background information, with various functions within the package providing different hypothesis testing options to perform functional annotation analyses. Please refer to https://yulab-smu.top/biomedical-knowledge-mining-book/index.html for a more detailed explanation of all the available options and other geneset information.

# Run KEGG GSEA
kk1 <- gseKEGG(geneList     = geneList1,
              organism    = 'hsa',
              minGSSize    = 30,
              pvalueCutoff = 0.01,
              eps=0)
    
# Visualise a pathway in a browser
#browseKEGG(kk1,'hsa05164')

# Visualise a pathway locally
# BiocManager::install("pathview")
pathview(gene.data = geneList1, 
         pathway.id = "hsa05164", 
         species = "hsa", 
         limit = list(gene=20, cpd=1))
              
# Run MSigDb GSEA
msig_h <- msigdbr(species = "Homo sapiens", category = "H") %>%
  dplyr::select(gs_name, entrez_gene) %>%
  dplyr::rename(ont = gs_name, gene = entrez_gene)

msig_h

msig_gsea1 <- GSEA(geneList1, 
  TERM2GENE = msig_h,
  eps=0,
  nPermSimple = 100000)
  
# ES plot
p1 <- gseaplot2(msig_gsea1, geneSetID = 1:5,pvalue_table = TRUE)
p1

Visualisation of enrichment results

To visualize the magnitude/direction of the genes involved in the significantly enriched terms of the functional annotation results, we will be using cnetplot() and heatplot() functions of clusterProfiler, which links genes with ontologies to be visualized as networks for easy differences between biologically different samples.

# Top 5 positive & negatively enriched molecular signatures 
msig_gsea2 <- arrange(msig_gsea1, abs(NES)) %>% 
        group_by(sign(NES)) %>% 
        slice(1:5)
        
p2<-ggplot(msig_gsea2, 
  aes(NES, fct_reorder(Description, NES), fill=qvalue,
  text=paste("Enrichment Score: " ,NES, "<br>", "Signature: " , fct_reorder(Description, NES), "<br>", "FDR: " , qvalue, "<br>", sep="")), 
  showCategory=10) + 
  geom_col(orientation='y') + 
  scale_fill_continuous(low='red', high='blue') +  
  theme_minimal() + ylab(NULL) 
    
    
p3<-ggplotly(p2,tooltip = c("text")) 

htmlwidgets::saveWidget(
  widget = ggplotly(p3,tooltip = c("text")), #the plotly object
  file = "interactive_ES_plot.html", #the path & file name
  selfcontained = FALSE #creates a html file
)

# cnet plot
options(ggrepel.max.overlaps = 22)
kk2 <- setReadable(kk1, 'org.Hs.eg.db', 'ENTREZID')
## categorySize can be scaled by 'pvalue' or 'geneNum'
p1 <- cnetplot(kk2, categorySize="pvalue", color.params = list(foldChange = geneList1))
p2 <- cnetplot(kk2, color.params = list(foldChange = geneList1, edge = T), circular = TRUE) 

pdf("cnetplots.pdf",width=40,30)
cowplot::plot_grid(p1, p2, ncol=2, labels=LETTERS[1:2], rel_widths=c(.8, 1.2))
dev.off()

#heatplot
p3<-heatplot(kk2, foldChange=geneList1, showCategory=5)
p3

# Interactive visualisation of DE results

# Read the cpm values of all genes into R
df1<-read.table("cpm.csv",sep=",",row.names = 1,header=T)
dim(df1)

# Extract DE genes from df1
df2<-df1[match(rownames(df),rownames(df1)),]

# Heatmap visualisation
heatmap(as.matrix(df2))

# A more refined heatmap
# Create annotation for the heatmap
metadata=data.frame(colnames(df2),gsub("..$","",colnames(df2)))
names(metadata)<-c("samplename","sampletype")
rownames(metadata)<-metadata$samplename

# Specify colors for the annotation 
my_colors<-as.character(paletteer::paletteer_d("RColorBrewer::Paired",n=length(levels(factor(metadata$sampletype)))))
names(my_colors) <- levels(factor(metadata$sampletype))

metadata$sampletype<-factor(metadata$sampletype, levels = c("Mock", "IFNb"))

annoCol <- list(sampletype = my_colors)

col_ha1=HeatmapAnnotation(show_annotation_name =F, sampletype=metadata$sampletype, col=annoCol) 

p1<-ComplexHeatmap::pheatmap(as.matrix(df2),
  name= "CPM",
  top_annotation=col_ha1,
  scale = "row",
  color = colorRampPalette(c("darkgreen", "white", "brown"))(40),
  show_rownames = T,
  fontsize = 10,
  fontsize_row = 10,
  cluster_rows = TRUE,
  cluster_cols=FALSE, 
  show_row_dend = FALSE,
  show_column_dend = FALSE,
  column_split = metadata$sampletype,
  cluster_column_slices = TRUE,
  column_title_rot = 90,
  heatmap_legend_param = list(
  legend_direction = "horizontal",
  legend_width = unit(6, "cm")))

ht=draw(p1,heatmap_legend_side="bottom")

pdf("heatmap.pdf",width=5,height=500)
ht
dev.off()

# Interactive heatmap 
#htShiny(ht)

Practicals

Create a new directory “RNA-seq” in your home directory.
In the “RNA-seq” directory, generate a soft link to the following files:

	fastq files - /home4/VBG_data/SUBSAMPLED
	genome - /home4/VBG_data/RNASeq/Human.fa
	GTF - /home4/VBG_data/RNASeq/Homo_sapiens.GRCh38.107.gtf

Perform QC, adapter-trimming, read mapping and generate bam files Generate count data from bam files
Perform DE analysis using edgeR (check whether the order of the samples match the edgeR script, header is absent)
Perform GO (ORA and GSEA) and pathway analysis (MSigDb GSEA)

7th Annual training course on Viral Bioinformatics and Genomics (24 - 28 June 2024)

McCall computer cluster, Garscube campus, University of Glasgow, United Kingdom