Last updated: 2023-06-03
Checks: 7 0
Knit directory: SCREE/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20210907)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version dc5b2b2. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Untracked files:
Untracked: data/workflow.png
Untracked: img/
Untracked: output/workflow.png
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/SCREE.Rmd
) and HTML (docs/SCREE.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 2c16336 | HailinWei98 | 2021-09-23 | Build site. |
Rmd | a2a5b0a | HailinWei98 | 2021-09-23 | Publish the initial files for myproject |
html | eeeebf3 | HailinWei98 | 2021-09-23 | Build site. |
Rmd | 94450d1 | HailinWei98 | 2021-09-23 | Publish the initial files for myproject |
html | 94450d1 | HailinWei98 | 2021-09-23 | Publish the initial files for myproject |
html | 518a5ca | HailinWei98 | 2021-09-22 | Build site. |
Rmd | ab04ccf | HailinWei98 | 2021-09-22 | Publish the initial files for myproject |
html | 855bd74 | HailinWei98 | 2021-09-22 | Build site. |
Rmd | e582e5c | HailinWei98 | 2021-09-22 | Publish the initial files for myproject |
In this page, we provide the code for testing GSFA(v0.2.8) and SCEPTRE(V0.1.0). To estimate the time usage, we record the time before and after running the major model. To record the memory usage, we create a python script as follows and include the R code for running into the robjects.r.
from memory_profiler import profile
import rpy2.robjects as robjects
import os
@profile
def run():
robjects.r('''{R code}''')
if __name__ == '__main__':
run()
For GSFA testing, we find that its iteration step stop after finishing 800 iteration when we using a dataset of ~30K cells, 1.2K genes and 249 perturbed genes as input.
library(Seurat)
library(GSFA)
#data preparation
mtx <- readRDS("perturb-CITE/control/RNA_QC.rds")
sg_lib <- read.table("perturb-CITE/control/sg_lib_all.txt", header = T)
sg_lib <- subset(sg_lib, cell %in% colnames(mtx))
scale.data <- GetAssayData(object = mtx, slot = "scale.data")
#define the function to convert the data frame to index matrix, derived from scmageck
frame2indmatrix <- function(bc_d, targetobj) {
rnm = colnames(targetobj)
cnm = unique(bc_d$gene)
scalef = targetobj
message(paste(length(unique(bc_d$cell)), "..."))
message(paste(ncol(scalef), "..."))
#rnm = rnm[!is.na(rnm)] #remove NA
#rnm = rnm[rnm %in% colnames(scalef)]
if (length(rnm) == 0) {
stop("Cell names do not match in expression matrix and barcode.")
}
cnm = cnm[!is.na(cnm)]#remove NA
ind_matrix = matrix(rep(0, length(rnm) * length(cnm)), nrow = length(rnm))
rownames(ind_matrix) = rnm
colnames(ind_matrix) = cnm
row <- bc_d[, 'cell']
col <- bc_d[, 'gene']
test <- (row %in% rnm) & (col %in% cnm)
idx <- as.matrix(data.frame(row[test], col[test]))
ind_matrix[idx] <- 1
return(ind_matrix)
}
G_mat <- frame2indmatrix(sg_lib, scale.data)
scale.data <- t(scale.data)
rm(list = c("mtx", "sg_lib"))
index <- which(colnames(G_mat) == "NTC")
#running GSFA
st<- Sys.time()
set.seed(50000)
fit0 <- fit_gsfa_multivar(Y = scale.data, G = G_mat, K = 20,
use_neg_control = TRUE,
neg_control_index = index,
prior_type = "mixture_normal",
init.method = "svd",
niter = 1000, used_niter = 500,
verbose = T, return_samples = T)
set.seed(50000)
fit0 <- fit_gsfa_multivar(Y = scale.data, G = G_mat,
use_neg_control = TRUE,
neg_control_index = index,
prior_type = "mixture_normal", fit0 = fit0,
init.method = "svd",
niter = 1000, used_niter = 500,
verbose = T, return_samples = T)
et<- Sys.time()
print(et - st)
For SCEPTRE testing, we execute the code with the parameter 'parallel = TRUE' to minimize the running time. However, we find that it will raise the warning message as follows when we test a dataset of ~15K cells, 1.2K genes and 4 perturbed genes.
In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, : scheduled cores 80, 87, 97, 99, 103, 107, 108, 109, 111, 120, 126, 129, 132, 151, 154, 159, 176 did not deliver results, all values of the jobs will be affected.
Due to this message, we don't test SCEPTRE with datasets including more cells. Additionally, while we execute the code with the parameter 'parallel = FALSE' recently, we find that the dataset used before can be analyzed by SCEPTRE without any warning message. We also test the new function in SCEPTRE v0.2.0 and find that the function can analyze datasets including more than 75K cells with less memory usage and more time usage, compared to the function in SCREE. We show the code for testing SCEPTRE v0.1.0 below.
To make sure the dimension of datasets for all methods is the same, we remove the build-in quality control step in the SCEPTRE function. The new functions are as follows.
run_sceptre_high_moi2 <- function(gene_matrix, combined_perturbation_matrix, covariate_matrix, gene_gRNA_group_pairs, side = "both", storage_dir = tempdir(), regularization_amount = 0.1, B = 1000, full_output = FALSE, parallel = TRUE, seed = 4) {
##################
# DEFINE CONSTANTS
##################
cat(paste0("Check ", storage_dir, "/logs for more detailed status updates.\n"))
#############################
# BASIC PROCESSING AND CHECKS
#############################
cat("Running checks and setting up directory structure. ")
# 0. Set up parallel, fst, offsite directory structure, pod sizes
dirs <- initialize_directories(storage_location = storage_dir)
if (parallel) {
n_cores <- parallel::detectCores()
doParallel::registerDoParallel(cores = n_cores)
foreach_funct <- foreach::`%dopar%`
} else {
n_cores <- 1L
foreach_funct <- foreach::`%do%`
}
# 1. threshold gRNA_matrix (using threshold = 3, for now) if necessary
gRNA_matrix <- combined_perturbation_matrix
# 2. Ensure that cell barcodes coincide across gene and perturbation matrices
if (!identical(colnames(gRNA_matrix), colnames(gene_matrix)) || !identical(row.names(covariate_matrix), colnames(gene_matrix))) {
stop("The column names of `gene_matrix` and `gRNA_matrix` and the row names of `covariate matrix` should be the cell barcodes. Ensure that the cell barcodes are present and identical across these matrices/data frames.")
}
if (is.null(row.names(gRNA_matrix)) || is.null(row.names(gene_matrix))) {
stop("The row names of `gene_matrix` and `gRNA_matrix` should contain the gene IDs and gRNA IDs, respectively.")
}
# 4. Make sure genes/gRNAs in the data frame are actually a part of the expression matrices; ensure all rows distinct
if ("gRNA_group" %in% colnames(gene_gRNA_group_pairs)) {
gene_gRNA_group_pairs <- dplyr::rename(gene_gRNA_group_pairs, gRNA_id = gRNA_group)
}
if (!all(c("gene_id", "gRNA_id") %in% colnames(gene_gRNA_group_pairs))) stop("The columns `gene_id` and `gRNA_id` must be present in the `gene_gRNA_group_pairs` data frame.")
abs_genes <- gene_gRNA_group_pairs$gene_id[!(gene_gRNA_group_pairs$gene_id %in% row.names(gene_matrix))]
abs_gRNAs <- gene_gRNA_group_pairs$gRNA_id[!(gene_gRNA_group_pairs$gRNA_id %in% row.names(gRNA_matrix))]
if (length(abs_genes) >= 1) {
msg <- paste0("The genes `", paste0(abs_genes, collapse = ", "), "' are present in the `gene_gRNA_group_pairs` data frame but not in the `gene_matrix` matrix. Either remove these genes from `gene_gRNA_group_pairs,` or add these genes to `gene_matrix`.")
stop(msg)
}
if (length(abs_gRNAs) >= 1) {
msg <- paste0("The perturbations `", paste0(abs_gRNAs, collapse = ", "), "' are present in the `gene_gRNA_group_pairs` data frame but not in the `gRNA_matrix` matrix. Either remove these perturbations from `gene_gRNA_group_pairs,` or add these perturbations to `gRNA_matrix`.")
stop(msg)
}
gene_gRNA_group_pairs <- gene_gRNA_group_pairs %>% dplyr::distinct()
# 5. Set the pods
n_genes <- length(unique(gene_gRNA_group_pairs$gene_id))
n_gRNAs <- length(unique(gene_gRNA_group_pairs$gRNA_id))
n_pairs <- nrow(gene_gRNA_group_pairs)
pod_sizes <- c(gene = ceiling(n_genes/(2 * n_cores)),
gRNA = ceiling(n_gRNAs/(2 * n_cores)),
pair = ceiling(n_pairs/(2 * n_cores)))
cat(crayon::green(' \u2713\n'))
# 6. If number of genes is small, set regularization_amount to 0
if (n_genes < 50) regularization_amount <- 0
##############
# METHOD START
##############
# create file dictionaries
dicts <- suppressMessages(create_and_store_dictionaries(gene_gRNA_group_pairs,
dirs[["gene_precomp_dir"]],
dirs[["gRNA_precomp_dir"]],
dirs[["results_dir"]],
pod_sizes))
cat("Running gene precomputations. ")
# run first round of gene precomputations
foreach_funct(foreach::foreach(pod_id = seq(1, dicts[["gene"]])),
run_gene_precomputation_at_scale_round_1(pod_id = pod_id,
gene_precomp_dir = dirs[["gene_precomp_dir"]],
gene_matrix = gene_matrix,
covariate_matrix = covariate_matrix,
regularization_amount = regularization_amount,
log_dir = dirs[["log_dir"]]))
cat(crayon::green(' \u2713\n'))
# regularize thetas
regularize_gene_sizes_at_scale(gene_precomp_dir = dirs[["gene_precomp_dir"]],
regularization_amount = regularization_amount,
log_dir = dirs[["log_dir"]])
# run second round of gene precomputations
foreach_funct(foreach::foreach(pod_id = seq(1, dicts[["gene"]])),
run_gene_precomputation_at_scale_round_2(pod_id = pod_id,
gene_precomp_dir = dirs[["gene_precomp_dir"]],
gene_matrix = gene_matrix,
covariate_matrix = covariate_matrix,
regularization_amount = regularization_amount,
log_dir = dirs[["log_dir"]]))
# run gRNA precomputations
cat("Running perturbation precomputations. ")
foreach_funct(foreach::foreach(pod_id = seq(1, dicts[["gRNA"]])),
run_gRNA_precomputation_at_scale(pod_id = pod_id,
gRNA_precomp_dir = dirs[["gRNA_precomp_dir"]],
combined_perturbation_matrix = combined_perturbation_matrix,
covariate_matrix = covariate_matrix,
log_dir = dirs[["log_dir"]],
B = B,
seed = seed))
cat(crayon::green(' \u2713\n'))
# run at-scale analysis
cat("Running perturbation-to-gene association tests.")
foreach_funct(foreach::foreach(pod_id = seq(1, dicts[["pairs"]])),
run_gRNA_gene_pair_analysis_at_scale(pod_id = pod_id,
gene_precomp_dir = dirs[["gene_precomp_dir"]],
gRNA_precomp_dir = dirs[["gRNA_precomp_dir"]],
results_dir = dirs[["results_dir"]],
log_dir = dirs[["log_dir"]],
gene_matrix = gene_matrix,
combined_perturbation_matrix = combined_perturbation_matrix,
covariate_matrix = covariate_matrix,
regularization_amount = regularization_amount,
side = side,
B = B,
full_output))
cat(crayon::green(' \u2713\n'))
# collect results
cat("Collecting and returning results. ")
results <- collect_results(results_dir = dirs[["results_dir"]], gene_gRNA_group_pairs)
cat(crayon::green(' \u2713\n'))
return(results)
}
The code for SCEPTRE testing. The function frame2indmatrix
is the same as the function mentioned in GSFA testing part.
library(dplyr)
library(sceptre)
library(Seurat)
#data preparation
sg_lib<- read.table("GSE90546/GSM2406677_10X005/sg_lib_all.txt", header=T)
data<- readRDS("GSE90546/GSM2406677_10X005/RNA.rds")
#quality control
data<- subset(data, sgRNA_num != 0 &
nFeature_RNA >= 200 &
nCount_RNA >= 1000 &
percent.mt <= 10)
data_qc <- CreateSeuratObject(count = data@assays$RNA@counts, min.cells = 0.01 * ncol(data))
data_qc[["percent.mt"]] <- PercentageFeatureSet(data_qc, pattern = "^MT-")
sg_num <- subset(data$sgRNA_num, names(data$sgRNA_num) %in% colnames(data_qc))
perturbations<- subset(data$perturbations, names(data$perturbations) %in% colnames(data_qc))
data_qc[["log_UMI"]]<- log(data_qc$nCount_RNA)
sg_lib <- subset(sg_lib, cell %in% colnames(data_qc))
sg_lib$gene <- gsub("NTC", "non-targeting", sg_lib$gene)
sg_lib$barcode <- gsub("NTC_", "non-targeting_", sg_lib$barcode)
#Create gene-sgRNA pairs
l <- length(unique(sg_lib$gene))
gene_id<- rep(rownames(data_qc), times = l)
gRNA_id<- rep(unique(sg_lib$gene), each = nrow(data_qc))
gene_gRNA_pairs<- data.frame(gene_id = gene_id, gRNA_group = gRNA_id)
#Generate perturbation index matrix
perturbation_matrix <- t(frame2indmatrix(sg_lib, data_qc))
#Generate covariate matrix
covariate_matrix <- data_qc[[c("log_UMI", "percent.mt")]]
#Get sparse matrix
expression_matrix <- data_qc@assays$RNA@counts
sg_lib <- unique(sg_lib[, c("barcode", "gene")])
grna_group_df <- data.frame(grna_id = sg_lib$barcode, grna_group = sg_lib$gene)
rm(names = "l","sg_lib","data","data_qc","sg_num","perturbations")
dir.create("comparison_results/sceptre_results")
#Running SCEPTRE
st<- Sys.time()
result <- run_sceptre_high_moi2(storage_dir = "comparison_results/sceptre_results",
gene_matrix = expression_matrix,
combined_perturbation_matrix = perturbation_matrix,
covariate_matrix = covariate_matrix,
gene_gRNA_group_pairs = gene_gRNA_pairs,
side = "both",
B = 2000,
parallel = T)
et<- Sys.time()
print(et-st)
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_1.6.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 whisker_0.4 knitr_1.33 magrittr_2.0.1
[5] R6_2.5.0 rlang_0.4.11 fansi_0.5.0 stringr_1.4.0
[9] tools_4.0.2 xfun_0.25 utf8_1.2.2 git2r_0.28.0
[13] jquerylib_0.1.4 htmltools_0.5.1.1 ellipsis_0.3.2 rprojroot_2.0.2
[17] yaml_2.2.1 digest_0.6.27 tibble_3.1.3 lifecycle_1.0.0
[21] crayon_1.4.1 later_1.2.0 sass_0.4.0 vctrs_0.3.8
[25] promises_1.2.0.1 fs_1.5.0 glue_1.4.2 evaluate_0.14
[29] rmarkdown_2.10 stringi_1.7.3 bslib_0.2.5.1 compiler_4.0.2
[33] pillar_1.6.2 jsonlite_1.7.2 httpuv_1.6.1 pkgconfig_2.0.3