vignettes/eRegulon-inference.Rmd
eRegulon-inference.Rmd
In this tutorial, we will guide you through the process of inferring enhancer regulons, known as eRegulons, and enhancer-driven gene regulatory networks or eGRNs, utilizing the STREAM tool. Our demonstration leverages a select subset of data from the publicly accessible 10x Genomics Multiome dataset, specifically focusing on human Peripheral Blood Mononuclear Cells (PBMCs). For those interested in the comprehensive dataset, the raw data captured by the 10X Genomics Multiome ATAC+GEX can be downloaded directly from 10x Genomics. As for our tutorial, the subsetted data we employ, stored in a Seurat object format, is available for download on Figshare.
Initially, we ran STREAM on the PBMC data to infer eRegulons, each of which includes its regulating TF, its included genes and enhancers (and their relations), as well as the eRegulon-active cells. We evaluated these eRegulons from three perspectives, i.e., pathway enrichment of eRegulon genes, overlap of eRegulon enhancers against chromatin regions curated in databases (e.g., histone mark ChIP-seq peaks), and intersection of enhancer-gene relations against relations curated in databases (e.g., EnhancerAtlas 2.0 and scEnhancer). Then, we performed hyper-geometric test for eRegulon-active cells against each cell type. If the eRegulon-active cells are significantly enriched in a cell type (e.g., CD4 TCM), we denoted the corresponding eRegulon as cell-type-active eRegulon (CD4 TCM-active eRegulon). For each cell type, we merged all cell-type-active eRegulons, leading to an eGRN. Finally, by splitting the eGRN into sub-networks according to the regulating TFs, we obtained cell-type-specific eRegulons.
dyn.load(x = "/users/PAS1475/liyang/libs/hdf5_1.10.6/lib/libhdf5_hl.so.100")
suppressWarnings(invisible(library(stream2) ) )
Load a Seurat
object containing jointly profiled
scRNA-seq and scATAC-seq assays from human PBMCs.
## An object of class Seurat
## 6193 features across 1000 samples within 3 assays
## Active assay: peaks (3991 features, 3991 variable features)
## 2 other assays present: RNA, SCT
## 3 dimensional reductions calculated: pca, lsi, umap
Run STREAM on the Seurat object with default parameters.
In the function run_stream
, variable annotations are
presented as follows:
obj
: Seurat
object composed of both RNA
and ATAC assays
qubic.path
: File path of the QUBIC2
biclustering program
peak.assay
: Name of the ATAC assay
var.genes
: Number of highly variable genes for LTMG
modeling
top.peaks
: Number of top-ranked peaks to build
heterogeneous graph
min.cells
: Minimum number of cells for quality
control
org
: Organism assembly version
c.cutoff
: Cutoff of expression consistency among
genes in the core part of hybrid biclusters
distance
: Size of window to connect co-accessible
peaks
BlockOverlap
: Threshold of overlap between prior
identified hybrid biclusters and the current seed
Extension
: Cutoff of expression consistency among
genes in the dual part of hybrid biclusters
When it comes to jointly profiled RNA+ATAC data via different techniques, we suggest different strategies. For datasets with low scRNA-seq depth (e.g., SHARE-seq), we recommend imputation using mainstream approaches like MAGIC, SAVER, and scImpute. For datasets with low coverage in both modalities (e.g., SNARE-seq and PAIRED-seq), we suggest using SEACells to build metacells by pooling similar cells. For PAIRED-seq, which has even lower coverage than SNARE-seq, more stringent SEACells parameters should be used.
# This process can be resource-intensive. We advise against executing it on a personal computer.
time1 <- system.time( en.regs <- run_stream(obj = pbmc,
qubic.path = "/users/PAS1475/liyang/software/QUBIC2/qubic",
peak.assay = "peaks",
var.genes = 3000,
top.peaks = 3000,
min.cells = 10,
org = "hg38",
c.cutoff = 1.0,
distance = 5e+05,
BlockOverlap = 0.50,
Extension = 1.0
) )
## number of iterations= 236
## Progress:0%
## Progress:10%
## Progress:20%
## Progress:30%
## Progress:40%
## Progress:50%
## Progress:60%
## Progress:70%
## Progress:80%
## Progress:90%
## Progress:100%
##
|
| | 0%
|
| | 1%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 9%
|
|======= | 10%
|
|======= | 11%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 15%
|
|=========== | 16%
|
|============ | 17%
|
|============ | 18%
|
|============= | 18%
|
|============= | 19%
|
|============== | 19%
|
|============== | 20%
|
|============== | 21%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 22%
|
|================ | 23%
|
|================= | 24%
|
|================= | 25%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 27%
|
|=================== | 28%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 29%
|
|===================== | 30%
|
|===================== | 31%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 35%
|
|========================= | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 39%
|
|============================ | 40%
|
|============================ | 41%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 43%
|
|=============================== | 44%
|
|=============================== | 45%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================= | 48%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 49%
|
|=================================== | 50%
|
|=================================== | 51%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 55%
|
|======================================= | 56%
|
|======================================== | 57%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 59%
|
|========================================== | 60%
|
|========================================== | 61%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 69%
|
|================================================= | 70%
|
|================================================= | 71%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|==================================================== | 74%
|
|==================================================== | 75%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 79%
|
|======================================================== | 80%
|
|======================================================== | 81%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 87%
|
|============================================================= | 88%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 89%
|
|=============================================================== | 90%
|
|=============================================================== | 91%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 97%
|
|==================================================================== | 98%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 99%
|
|======================================================================| 100%
## [1] "Starting Cicero"
## [1] "Calculating distance_parameter value"
## [1] "Running models"
## [1] "Assembling connections"
## [1] "Successful cicero models: 1469"
## [1] "Other models: "
##
## Zero or one element in range
## 12116
## [1] "Models with errors: 0"
## [1] "Done"
##
|
| | 0%
|
|= | 2%
|
|== | 3%
|
|== | 5%
|
|=== | 7%
|
|==== | 8%
|
|===== | 10%
|
|====== | 12%
|
|======= | 13%
|
|======== | 15%
|
|======== | 17%
|
|========= | 18%
|
|========== | 20%
|
|=========== | 22%
|
|============ | 23%
|
|============ | 25%
|
|============= | 27%
|
|============== | 28%
|
|=============== | 30%
|
|================ | 32%
|
|================= | 33%
|
|================== | 35%
|
|================== | 37%
|
|=================== | 38%
|
|==================== | 40%
|
|===================== | 42%
|
|====================== | 43%
|
|====================== | 45%
|
|======================= | 47%
|
|======================== | 48%
|
|========================= | 50%
|
|========================== | 52%
|
|=========================== | 53%
|
|============================ | 55%
|
|============================ | 57%
|
|============================= | 58%
|
|============================== | 60%
|
|=============================== | 62%
|
|================================ | 63%
|
|================================ | 65%
|
|================================= | 67%
|
|================================== | 68%
|
|=================================== | 70%
|
|==================================== | 72%
|
|===================================== | 73%
|
|====================================== | 75%
|
|====================================== | 77%
|
|======================================= | 78%
|
|======================================== | 80%
|
|========================================= | 82%
|
|========================================== | 83%
|
|========================================== | 85%
|
|=========================================== | 87%
|
|============================================ | 88%
|
|============================================= | 90%
|
|============================================== | 92%
|
|=============================================== | 93%
|
|================================================ | 95%
|
|================================================ | 97%
|
|================================================= | 98%
|
|==================================================| 100%
## [1] "terminal" "Tier" "TF" "genes" "peaks"
## [6] "cells" "atac.ratio" "score" "weight" "links"
## [11] "seed"
head(en.regs[[1]]$genes)
## [1] "MAN1C1" "STMN1" "EIF2B3" "ZSWIM5" "JUN" "EVI5"
head(en.regs[[1]]$peaks)
## [1] "chr1-25904959-25907184" "chr1-45302959-45304454"
## [3] "chr1-58783212-58785307" "chr1-93179195-93181524"
## [5] "chr1-108660816-108661818" "chr1-150578547-150580018"
head(en.regs[[1]]$links)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 25904959-25907184 * | MAN1C1
## [2] chr1 25904959-25907184 * | STMN1
## [3] chr1 45302959-45304454 * | EIF2B3
## [4] chr1 45302959-45304454 * | ZSWIM5
## [5] chr1 58783212-58785307 * | JUN
## [6] chr1 93179195-93181524 * | EVI5
## -------
## seqinfo: 19 sequences from an unspecified genome; no seqlengths
message ("The running time of eRegulon prediction is: ", time1["elapsed"])
## The running time of eRegulon prediction is: 564.544
Construct eGRN in each cell type.
In the function get_cts_en_GRNs
, variabled are defined
below:
obj
: Seurat
object composed of both RNA
and ATAC assays
celltype
: Name of the metadata column indicating
cell types
en.regs
: List of predicted eRegulons
peak.assay
: Name of the ATAC assay
rna.dims
: Number of RNA dimensions for multimodal
clustering using Seurat
atac.dims
: Number of ATAC dimensions for multimodal
clustering using Seurat
padj.cutoff
: Threshold of adjusted P-values when
mapping eRegulons onto cell types
out.dir
: Directory to save the final result and
intermediate results
time2 <- system.time( en.grns <- get_cts_en_GRNs(obj = pbmc, celltype = "predicted.id",
en.regs = en.regs, peak.assay = "peaks",
rna.dims = 50, atac.dims = 50,
padj.cutoff = 0.05,
out.dir = "./") )
qs::qsave(en.grns, paste0(work.dir, "eGRNs.qsave"))
names(en.grns[[1]])
## [1] "links" "cell.type" "cells"
head(en.grns[[1]]$links)
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | gene TF
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 25904959-25907184 * | MAN1C1 TCF12
## [2] chr1 25904959-25907184 * | STMN1 TCF12
## [3] chr1 45302959-45304454 * | EIF2B3 TCF12
## [4] chr1 45302959-45304454 * | ZSWIM5 TCF12
## [5] chr1 58783212-58785307 * | JUN TCF12
## [6] chr1 93179195-93181524 * | EVI5 TCF12
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
message ("The running time of eGRN construction is: ", time2["elapsed"])
## The running time of eGRN construction is: 1.28699999999992
Construct cell-type-specific eRegulons.
In the function get_cts_en_regs
, variables are
introduced as follows:
obj
: Seurat
object composed of both RNA
and ATAC assays
peak.assay
: Name of the ATAC assay
de.genes
: List of differentially expressed genes
(DEGs)
cts.en.grns
: Cell-type-specific eGRNs saved in
GRanges object, the name of each of which is cell type, and the GRanges
object contains metadata columns “gene” and “TF”.
out.dir
: Directory to save the final result and
intermediate results
celltype
: Name of the metadata column indicating
cell types
min.pct
: Cutoff of overlap percentage in calculating
DEGs
logfc.threshold
: Threshold of log-fold change in
predicting DEGs
padj.cutoff
: Cutoff of adjusted P-values in
identifying DEGs
time3 <- system.time( cts.en.regs <- get_cts_en_regs(obj = pbmc, peak.assay = "peaks", de.genes = NULL,
cts.en.grns = en.grns, out.dir = "./", celltype = "predicted.id",
min.pct = 0.25, logfc.threshold = 0.25, padj.cutoff = 0.05) )
qs::qsave(cts.en.regs, paste0(work.dir, "cell-type-specific-eRegulons.qsave"))
names(cts.en.regs[[1]])
## [1] "TF" "celltype" "cells" "genes" "enhancers" "links"
head(cts.en.regs[[1]]$genes)
## [1] "ADAMTSL4-AS1" "TSPAN5" "ADAM19" "ATXN1" "EEF1A1"
## [6] "TNFAIP3"
head(cts.en.regs[[1]]$enhancers)
## [1] "chr1-150320808-150322695" "chr4-98928092-98930218"
## [3] "chr5-157252607-157253720" "chr6-16759215-16763347"
## [5] "chr6-73653070-73655111" "chr6-137707572-137708859"
head(cts.en.regs[[1]]$links)
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | gene TF
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 150320808-150322695 * | ADAMTSL4-AS1 CEBPD
## [2] chr4 98928092-98930218 * | TSPAN5 CEBPD
## [3] chr5 157252607-157253720 * | ADAM19 CEBPD
## [4] chr6 16759215-16763347 * | ATXN1 CEBPD
## [5] chr6 73653070-73655111 * | EEF1A1 CEBPD
## [6] chr6 137707572-137708859 * | TNFAIP3 CEBPD
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
message ("The running time of cell-type-specific eRegulon discovery is: ", time3["elapsed"])
## The running time of cell-type-specific eRegulon discovery is: 0.673000000000002
Firstly, we performed pathway enrichment analysis for gene lists in cell-type-specific eRegulons against Gene Ontology (GO) or KEGG.
In the function enrich_genes
, variables are defined as
follows:
regs
: The list of enhancer regulons (eRegulons) or
cell-type-specific eRegulons.
dbs
: The list of databases to run enrichment
analysis, c(“GO”, “KEGG”) by default.
library(pbapply)
time4 <- system.time( pathways <- enrich_genes(regs = cts.en.regs, dbs = "GO") )
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2018... Done.
## Querying GO_Cellular_Component_2018... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2018... Done.
## Querying GO_Cellular_Component_2018... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2018... Done.
## Querying GO_Cellular_Component_2018... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2018... Done.
## Querying GO_Cellular_Component_2018... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2018... Done.
## Querying GO_Cellular_Component_2018... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
sig.pathways <- pbapply::pblapply(pathways, function(x) {
x[x$Adjusted.P.value < 0.05,, drop = FALSE]
})
qs::qsave(sig.pathways, paste0(work.dir, "enriched_pathways.qsave"))
head(sig.pathways[[1]])
## Id Term Overlap P.value
## 1 1 poly(G) binding (GO:0034046) 1/7 0.003495224
## 2 1 phospholipase inhibitor activity (GO:0004859) 1/8 0.003993646
## 3 1 Tat protein binding (GO:0030957) 1/10 0.004989818
## 4 1 Lys63-specific deubiquitinase activity (GO:0061578) 1/11 0.005487568
## 5 1 nitric-oxide synthase binding (GO:0050998) 1/11 0.005487568
## 6 1 translation elongation factor activity (GO:0003746) 1/13 0.006482395
## Adjusted.P.value Old.P.value Old.Adjusted.P.value Odds.Ratio Combined.Score
## 1 0.04006892 0 0 370.0741 2093.2713
## 2 0.04006892 0 0 317.1905 1751.8591
## 3 0.04006892 0 0 246.6790 1307.4865
## 4 0.04006892 0 0 222.0000 1155.5700
## 5 0.04006892 0 0 222.0000 1155.5700
## 6 0.04006892 0 0 184.9815 932.0597
## Genes
## 1 ATXN1
## 2 ANXA1
## 3 ACTB
## 4 TNFAIP3
## 5 ACTB
## 6 EEF1A1
message ("The running time of pathway enrichment analysis for eRegulons is: ", time4["elapsed"])
## The running time of pathway enrichment analysis for eRegulons is: 2.81799999999998
Then, we intersected the enhancer list of a cell-type-specific eRegulon against H3K27ac ChIP-seq peaks in CD4+ T cells with ENCODE accession code: ENCFF498QFU.
In the function intersect_peaks
, variables are described
as follows:
x
: The first GRanges object or data.frame.
y
: The first GRanges object or data.frame.
library(dplyr)
library(Signac)
chipseq.df <- read.table(paste0(work.dir, "ENCFF498QFU.bed"), sep = "\t") %>% dplyr::select(V1, V2, V3)
chipseq.peaks <- StringToGRanges(paste0(chipseq.df[, 1], "-", chipseq.df[, 2], "-", chipseq.df[, 3]))
time5 <- system.time( permTest.res <- intersect_peaks(x = Signac::StringToGRanges(cts.en.regs[[1]]$enhancers),
y = chipseq.peaks) )
permTest.res$numOverlaps
## P-value: 0.0099009900990099
## Z-score: 14.9248
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 11
## Evaluation function: numOverlaps
## Randomization function: randomizeRegions
message ("The running time of evaluating TF-enhancer relations in eRegulons is: ", time5['elapsed'])
## The running time of evaluating TF-enhancer relations in eRegulons is: 7.50099999999998
Finally, we overlapped the enhancer-gene relations in a cell-type-specific eRegulon against enhancer-target pairs in CD4 TCM cells curated by scEnhancer.
In the function intersect_enhancer_gene_relations
,
variables are introduced as follows:
x
: The first GRanges object saving enhancer-gene
relations with gene symbols saved in “gene” meta column
y
: The second GRanges object saving enhancer-gene
relations with gene symbols saved in “gene” meta column
# Preprocess the peak-gene interaction pairs in scEnhancer database
enhGene.df <- read.table(paste0(work.dir, "Memory_CD4_T_interaction.txt"), sep = "\t") %>% dplyr::select(V1, V2)
enhGene.pairs <- Signac::StringToGRanges(strsplit(enhGene.df$V1, split = "\\|") %>% sapply(., "[[", 1), sep = c(":", "-"))
mcols(enhGene.pairs)$gene <- strsplit(enhGene.df$V2, split = "[\\||\\:]") %>% sapply(., "[[", 3)
head(enhGene.pairs)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene
## <Rle> <IRanges> <Rle> | <character>
## [1] chr10 100227139-100227639 * | HPS1
## [2] chr10 100227139-100227639 * | HPS1
## [3] chr10 100227139-100227639 * | HPS1
## [4] chr10 100227139-100227639 * | HPS1
## [5] chr10 101629999-101630499 * | DNMBP
## [6] chr10 101629999-101630499 * | DNMBP
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
# Calculate the overlaps between enhancer-gene relations in a cell-type-specific eRegulon
time6 <- system.time(overlaps <- intersect_enhancer_gene_relations(x = resize(cts.en.regs[[1]]$links, width = 50000, fix = "center"),
y = resize(enhGene.pairs, width = 50000, fix = "center") ) )
overlaps
## x.peak y.peak gene
## 1: chr6-16736281-16786280 chr6-16686955-16736954 ATXN1
## 2: chr6-16736281-16786280 chr6-16744524-16794523 ATXN1
## 3: chr6-16736281-16786280 chr6-16746364-16796363 ATXN1
## 4: chr7-5372117-5422116 chr7-5412511-5462510 ACTB
message ("The running time of assessing enhancer-gene relations in eRegulons is: ", time6["elapsed"], "\n\n",
"The total time for the implementation and evaluation of eRegulon/eGRN prediction is: ",
time1["elapsed"] + time2["elapsed"] + time3["elapsed"] + time4["elapsed"] + time5["elapsed"] + time6["elapsed"])
## The running time of assessing enhancer-gene relations in eRegulons is: 0.0580000000001064
##
## The total time for the implementation and evaluation of eRegulon/eGRN prediction is: 576.881