From Simulation to Insight: Inferring eRegulons and eGRNs using STREAM

In this tutorial, we will generate a simulated dataset that jointly profiles scRNA-seq and scATAC-seq, drawing upon a Seurat object composed of both scRNA-seq and scATAC-seq assays, DoRothEA — a gene regulatory network (GRN) enriched with signed transcription factor (TF), and JASPAR — a collection of curated, non-redundant TF binding profiles to target gene interactions. Using this simulated dataset, we will subsequently infer enhancer regulons (eRegulons) and derive enhancer-driven GRNs (eGRNs).

Initially, we selected a subset of TFs at random from the JASPAR 2022 database, along with their curated binding sites. Utilizing these TF binding sites, we identified peaks associated with these TFs. From the regulon genes in DoRothEA, we randomly chose certain genes as target genes for these TFs. For each TF, we associated every target gene with a set of TF binding peaks that are located within a predetermined distance from the gene’s transcription start site (TSS). We consider these eRegulons peaks as enhancers.

Upon determining the TFs, genes, and enhancers for the eRegulon, we randomly sampled cell subsets where the eRegulons are active, denoted as eRegulon-active cells. For every eRegulon gene, its expression value in eRegulon-active cells was set to its highest expression value across all cells, multiplied by a factor, e.g., one hundred. In a similar fashion, we increased the chromatin accessibility value for each enhancer within eRegulon-active cells.

Subsequently, we selected a distinct set of genes, peaks, and cells – ones that were not considered as eRegulon genes, peaks, or eRegulon-active cells – to serve as background genes, peaks, and cells. In the end, we crafted scRNA-seq and scATAC-seq matrices by tailoring the original matrices. These new matrices incorporate both eRegulon and background genes/peaks as rows, and a combination of eRegulon-active and background cells as columns.

Execute the commands below to simulate a dataset. The scRNA-seq matrix has dimensions 1000 x 1000, while the scATAC-seq matrix is sized at 3000 x 1000. The dataset encompasses five eRegulons. Each eRegulon contains around 100 target genes, and each gene is modulated by a transcription factor via roughly two enhancers.

Run the following commands to load the STREAM library:

dyn.load(x = "/users/PAS1475/liyang/libs/hdf5_1.10.6/lib/libhdf5_hl.so.100")
library(stream2)

Data simulation

Load a Seurat object that contains jointly profiled scRNA-seq and scATAC-seq assays derived from human PBMCs, and generate simulated data based on this object. Construct a simulated Seurat object comprising 1,000 genes, 3,000 peaks, and 1,000 cells. This object should feature five eRegulons regulated by five different TFs, with each eRegulon linked to approximately fifty target genes. Each target gene should be associated with roughly two enhancers. These eRegulons should be active in 100 cells.

In the function create_rna_atac, variable annotations are given as follows:

obj: Seurat object composed of both RNA and ATAC assays
ntfs: Number of eRegulons/TFs to be implanted
ngenes: Average number of genes in eRegulons to be implanted
ncells: Average number of cells in eRegulons to be implanted
all.genes: Number of genes in dataset to be simulated
all.enhs: Number of peaks in dataset to be simulated
all.cells: Number of cells in dataset to be simulated
org: Organism assembly version
atac.assay: Name of the ATAC assay
gene.links: Average number of enhancers linked to each gene

pbmc <- qs::qread(paste0(work.dir, "tutorial_pbmc.qsave"))
pbmc

## An object of class Seurat 
## 6193 features across 1000 samples within 3 assays 
## Active assay: peaks (3991 features, 3991 variable features)
##  2 other assays present: RNA, SCT
##  3 dimensional reductions calculated: pca, lsi, umap

simul <- create_rna_atac(obj = pbmc, ntfs = 5, ngenes = 100,
                         ncells = 100, all.genes = 1000, all.enhs = 3000, all.cells = 1000,
                         org = "hg38", atac.assay = "peaks", gene.links = 2
                         )

## 
  |                                                        
  |                                                  |   0%
  |                                                        
  |==========                                        |  20%
  |                                                        
  |====================                              |  40%
  |                                                        
  |==============================                    |  60%
  |                                                        
  |========================================          |  80%
  |                                                        
  |==================================================| 100%

simul$Seurat

## An object of class Seurat 
## 2811 features across 1000 samples within 2 assays 
## Active assay: RNA (1000 features, 0 variable features)
##  1 other assay present: ATAC

head(simul$HBCs[[1]]$genes)

## [1] "ITGB1"   "ANK3"    "PLEKHG1" "DUSP6"   "ZDHHC14" "ARMC4"

head(simul$HBCs[[1]]$enhancers)

## [1] "chr10-33134015-33134576"  "chr10-33264223-33264655" 
## [3] "chr10-62050806-62051384"  "chr10-62120709-62121465" 
## [5] "chr6-150622501-150623717" "chr12-89768730-89769666"

head(simul$HBCs[[1]]$links)

## GRanges object with 6 ranges and 1 metadata column:
##       seqnames              ranges strand |        gene
##          <Rle>           <IRanges>  <Rle> | <character>
##   [1]    chr10   33134015-33134576      * |       ITGB1
##   [2]    chr10   33264223-33264655      * |       ITGB1
##   [3]    chr10   62050806-62051384      * |        ANK3
##   [4]    chr10   62120709-62121465      * |        ANK3
##   [5]     chr6 150622501-150623717      * |     PLEKHG1
##   [6]    chr12   89768730-89769666      * |       DUSP6
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

sapply(simul$HBCs, "[[", "TF")

## [1] "BATF3" "KLF15" "BATF"  "CEBPA" "TCF12"

Compiled: June 16, 2024

Data simulation