vignettes/data-simulation.Rmd
data-simulation.Rmd
In this tutorial, we will generate a simulated dataset that jointly profiles scRNA-seq and scATAC-seq, drawing upon a Seurat object composed of both scRNA-seq and scATAC-seq assays, DoRothEA — a gene regulatory network (GRN) enriched with signed transcription factor (TF), and JASPAR — a collection of curated, non-redundant TF binding profiles to target gene interactions. Using this simulated dataset, we will subsequently infer enhancer regulons (eRegulons) and derive enhancer-driven GRNs (eGRNs).
Initially, we selected a subset of TFs at random from the JASPAR 2022
database, along with their curated binding sites. Utilizing these TF
binding sites, we identified peaks associated with these TFs. From the
regulon genes in DoRothEA
, we randomly chose certain genes
as target genes for these TFs. For each TF, we associated every target
gene with a set of TF binding peaks that are located within a
predetermined distance from the gene’s transcription start site (TSS).
We consider these eRegulons peaks as enhancers.
Upon determining the TFs, genes, and enhancers for the eRegulon, we randomly sampled cell subsets where the eRegulons are active, denoted as eRegulon-active cells. For every eRegulon gene, its expression value in eRegulon-active cells was set to its highest expression value across all cells, multiplied by a factor, e.g., one hundred. In a similar fashion, we increased the chromatin accessibility value for each enhancer within eRegulon-active cells.
Subsequently, we selected a distinct set of genes, peaks, and cells – ones that were not considered as eRegulon genes, peaks, or eRegulon-active cells – to serve as background genes, peaks, and cells. In the end, we crafted scRNA-seq and scATAC-seq matrices by tailoring the original matrices. These new matrices incorporate both eRegulon and background genes/peaks as rows, and a combination of eRegulon-active and background cells as columns.
Execute the commands below to simulate a dataset. The scRNA-seq matrix has dimensions 1000 x 1000, while the scATAC-seq matrix is sized at 3000 x 1000. The dataset encompasses five eRegulons. Each eRegulon contains around 100 target genes, and each gene is modulated by a transcription factor via roughly two enhancers.
Run the following commands to load the STREAM
library:
Load a Seurat object that contains jointly profiled scRNA-seq and scATAC-seq assays derived from human PBMCs, and generate simulated data based on this object. Construct a simulated Seurat object comprising 1,000 genes, 3,000 peaks, and 1,000 cells. This object should feature five eRegulons regulated by five different TFs, with each eRegulon linked to approximately fifty target genes. Each target gene should be associated with roughly two enhancers. These eRegulons should be active in 100 cells.
In the function create_rna_atac
, variable annotations
are given as follows:
obj
: Seurat
object composed of both RNA
and ATAC assays
ntfs
: Number of eRegulons/TFs to be
implanted
ngenes
: Average number of genes in eRegulons to be
implanted
ncells
: Average number of cells in eRegulons to be
implanted
all.genes
: Number of genes in dataset to be
simulated
all.enhs
: Number of peaks in dataset to be
simulated
all.cells
: Number of cells in dataset to be
simulated
org
: Organism assembly version
atac.assay
: Name of the ATAC assay
gene.links
: Average number of enhancers linked to
each gene
## An object of class Seurat
## 6193 features across 1000 samples within 3 assays
## Active assay: peaks (3991 features, 3991 variable features)
## 2 other assays present: RNA, SCT
## 3 dimensional reductions calculated: pca, lsi, umap
simul <- create_rna_atac(obj = pbmc, ntfs = 5, ngenes = 100,
ncells = 100, all.genes = 1000, all.enhs = 3000, all.cells = 1000,
org = "hg38", atac.assay = "peaks", gene.links = 2
)
##
|
| | 0%
|
|========== | 20%
|
|==================== | 40%
|
|============================== | 60%
|
|======================================== | 80%
|
|==================================================| 100%
simul$Seurat
## An object of class Seurat
## 2811 features across 1000 samples within 2 assays
## Active assay: RNA (1000 features, 0 variable features)
## 1 other assay present: ATAC
head(simul$HBCs[[1]]$genes)
## [1] "ITGB1" "ANK3" "PLEKHG1" "DUSP6" "ZDHHC14" "ARMC4"
head(simul$HBCs[[1]]$enhancers)
## [1] "chr10-33134015-33134576" "chr10-33264223-33264655"
## [3] "chr10-62050806-62051384" "chr10-62120709-62121465"
## [5] "chr6-150622501-150623717" "chr12-89768730-89769666"
head(simul$HBCs[[1]]$links)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene
## <Rle> <IRanges> <Rle> | <character>
## [1] chr10 33134015-33134576 * | ITGB1
## [2] chr10 33264223-33264655 * | ITGB1
## [3] chr10 62050806-62051384 * | ANK3
## [4] chr10 62120709-62121465 * | ANK3
## [5] chr6 150622501-150623717 * | PLEKHG1
## [6] chr12 89768730-89769666 * | DUSP6
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
sapply(simul$HBCs, "[[", "TF")
## [1] "BATF3" "KLF15" "BATF" "CEBPA" "TCF12"