SCREE

Last updated: 2023-06-03

Checks: 7 0

Knit directory: SCREE/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20210907)

The command set.seed(20210907) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: dc5b2b2

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version dc5b2b2. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory

Untracked files:
    Untracked:  data/workflow.png
    Untracked:  img/
    Untracked:  output/workflow.png

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/SCREE.Rmd) and HTML (docs/SCREE.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	2c16336	HailinWei98	2021-09-23	Build site.
Rmd	a2a5b0a	HailinWei98	2021-09-23	Publish the initial files for myproject
html	eeeebf3	HailinWei98	2021-09-23	Build site.
Rmd	94450d1	HailinWei98	2021-09-23	Publish the initial files for myproject
html	94450d1	HailinWei98	2021-09-23	Publish the initial files for myproject
html	518a5ca	HailinWei98	2021-09-22	Build site.
Rmd	ab04ccf	HailinWei98	2021-09-22	Publish the initial files for myproject
html	855bd74	HailinWei98	2021-09-22	Build site.
Rmd	e582e5c	HailinWei98	2021-09-22	Publish the initial files for myproject

In this page, we provide the code for testing GSFA(v0.2.8) and SCEPTRE(V0.1.0). To estimate the time usage, we record the time before and after running the major model. To record the memory usage, we create a python script as follows and include the R code for running into the robjects.r.

from memory_profiler import profile
import rpy2.robjects as robjects
import os
    
@profile
def run():
    robjects.r('''{R code}''')

if __name__ == '__main__':
        run()

GSFA Testing

For GSFA testing, we find that its iteration step stop after finishing 800 iteration when we using a dataset of ~30K cells, 1.2K genes and 249 perturbed genes as input.

library(Seurat)
library(GSFA)

#data preparation

mtx <- readRDS("perturb-CITE/control/RNA_QC.rds")
sg_lib <- read.table("perturb-CITE/control/sg_lib_all.txt", header = T)
sg_lib <- subset(sg_lib, cell %in% colnames(mtx))
scale.data <- GetAssayData(object = mtx, slot = "scale.data")

#define the function to convert the data frame to index matrix, derived from scmageck

frame2indmatrix <- function(bc_d, targetobj) {
    rnm = colnames(targetobj)
    cnm = unique(bc_d$gene)
    scalef = targetobj
    message(paste(length(unique(bc_d$cell)), "..."))
    message(paste(ncol(scalef), "..."))
    #rnm = rnm[!is.na(rnm)] #remove NA
    #rnm = rnm[rnm %in% colnames(scalef)]
    if (length(rnm) == 0) {
        stop("Cell names do not match in expression matrix and barcode.")
    }
    cnm = cnm[!is.na(cnm)]#remove NA
    ind_matrix = matrix(rep(0, length(rnm) * length(cnm)), nrow = length(rnm))
    rownames(ind_matrix) = rnm
    colnames(ind_matrix) = cnm
    row <- bc_d[, 'cell']
    col <- bc_d[, 'gene']
    test <- (row %in% rnm) & (col %in% cnm)
    idx <- as.matrix(data.frame(row[test], col[test]))
    ind_matrix[idx] <- 1
    return(ind_matrix)
}

G_mat <- frame2indmatrix(sg_lib, scale.data)

scale.data <- t(scale.data)
rm(list = c("mtx", "sg_lib"))


index <- which(colnames(G_mat) == "NTC")

#running GSFA

st<- Sys.time()
set.seed(50000)
fit0 <- fit_gsfa_multivar(Y = scale.data, G = G_mat, K = 20,
                          use_neg_control = TRUE,
                          neg_control_index = index,
                          prior_type = "mixture_normal", 
                          init.method = "svd",
                          niter = 1000, used_niter = 500,
                          verbose = T, return_samples = T)

set.seed(50000)
fit0 <- fit_gsfa_multivar(Y = scale.data, G = G_mat,
                          use_neg_control = TRUE,
                          neg_control_index = index,
                          prior_type = "mixture_normal", fit0 = fit0,
                          init.method = "svd",
                          niter = 1000, used_niter = 500,
                          verbose = T, return_samples = T)
et<- Sys.time()
print(et - st)

SCEPTRE Testing

For SCEPTRE testing, we execute the code with the parameter 'parallel = TRUE' to minimize the running time. However, we find that it will raise the warning message as follows when we test a dataset of ~15K cells, 1.2K genes and 4 perturbed genes.

In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed,  : scheduled cores 80, 87, 97, 99, 103, 107, 108, 109, 111, 120, 126, 129, 132, 151, 154, 159, 176 did not deliver results, all values of the jobs will be affected.

Due to this message, we don't test SCEPTRE with datasets including more cells. Additionally, while we execute the code with the parameter 'parallel = FALSE' recently, we find that the dataset used before can be analyzed by SCEPTRE without any warning message. We also test the new function in SCEPTRE v0.2.0 and find that the function can analyze datasets including more than 75K cells with less memory usage and more time usage, compared to the function in SCREE. We show the code for testing SCEPTRE v0.1.0 below.

To make sure the dimension of datasets for all methods is the same, we remove the build-in quality control step in the SCEPTRE function. The new functions are as follows.

run_sceptre_high_moi2 <- function(gene_matrix, combined_perturbation_matrix, covariate_matrix, gene_gRNA_group_pairs, side = "both", storage_dir = tempdir(), regularization_amount = 0.1, B = 1000, full_output = FALSE, parallel = TRUE, seed = 4) {
    ##################
    # DEFINE CONSTANTS
    ##################
    cat(paste0("Check ", storage_dir, "/logs for more detailed status updates.\n"))
  
    #############################
    # BASIC PROCESSING AND CHECKS
    #############################
    cat("Running checks and setting up directory structure. ")
  
    # 0. Set up parallel, fst, offsite directory structure, pod sizes
    dirs <- initialize_directories(storage_location = storage_dir)
    if (parallel) {
      n_cores <- parallel::detectCores()
      doParallel::registerDoParallel(cores = n_cores)
      foreach_funct <- foreach::`%dopar%`
    } else {
      n_cores <- 1L
      foreach_funct <- foreach::`%do%`
    }
    # 1. threshold gRNA_matrix (using threshold = 3, for now) if necessary
    gRNA_matrix <- combined_perturbation_matrix
  
    # 2. Ensure that cell barcodes coincide across gene and perturbation matrices
    if  (!identical(colnames(gRNA_matrix), colnames(gene_matrix)) || !identical(row.names(covariate_matrix), colnames(gene_matrix))) {
      stop("The column names of `gene_matrix` and `gRNA_matrix` and the row names of `covariate matrix` should be the cell barcodes. Ensure that the cell barcodes are present and identical across these matrices/data frames.")
    }
    if (is.null(row.names(gRNA_matrix)) || is.null(row.names(gene_matrix))) {
      stop("The row names of `gene_matrix` and `gRNA_matrix` should contain the gene IDs and gRNA IDs, respectively.")
    }
  
    # 4. Make sure genes/gRNAs in the data frame are actually a part of the expression matrices; ensure all rows distinct
    if ("gRNA_group" %in% colnames(gene_gRNA_group_pairs)) {
      gene_gRNA_group_pairs <- dplyr::rename(gene_gRNA_group_pairs, gRNA_id = gRNA_group)
    }
    if (!all(c("gene_id", "gRNA_id") %in% colnames(gene_gRNA_group_pairs))) stop("The columns `gene_id` and `gRNA_id` must be present in the `gene_gRNA_group_pairs` data frame.")
    abs_genes <- gene_gRNA_group_pairs$gene_id[!(gene_gRNA_group_pairs$gene_id %in% row.names(gene_matrix))]
    abs_gRNAs <- gene_gRNA_group_pairs$gRNA_id[!(gene_gRNA_group_pairs$gRNA_id %in% row.names(gRNA_matrix))]
    if (length(abs_genes) >= 1) {
      msg <- paste0("The genes `", paste0(abs_genes, collapse = ", "), "' are present in the `gene_gRNA_group_pairs` data frame but not in the `gene_matrix` matrix. Either remove these genes from `gene_gRNA_group_pairs,` or add these genes to `gene_matrix`.")
      stop(msg)
    }
    if (length(abs_gRNAs) >= 1) {
      msg <- paste0("The perturbations `", paste0(abs_gRNAs, collapse = ", "), "' are present in the `gene_gRNA_group_pairs` data frame but not in the `gRNA_matrix` matrix. Either remove these perturbations from `gene_gRNA_group_pairs,` or add these perturbations to `gRNA_matrix`.")
      stop(msg)
    }
    gene_gRNA_group_pairs <- gene_gRNA_group_pairs %>% dplyr::distinct()
    # 5. Set the pods
    n_genes <- length(unique(gene_gRNA_group_pairs$gene_id))
    n_gRNAs <- length(unique(gene_gRNA_group_pairs$gRNA_id))
    n_pairs <- nrow(gene_gRNA_group_pairs)
    pod_sizes <- c(gene = ceiling(n_genes/(2 * n_cores)),
                   gRNA = ceiling(n_gRNAs/(2 * n_cores)),
                   pair = ceiling(n_pairs/(2 * n_cores)))
    cat(crayon::green(' \u2713\n'))
    # 6. If number of genes is small, set regularization_amount to 0
    if (n_genes < 50) regularization_amount <- 0
  
    ##############
    # METHOD START
    ##############
    # create file dictionaries
    dicts <- suppressMessages(create_and_store_dictionaries(gene_gRNA_group_pairs,
                                           dirs[["gene_precomp_dir"]],
                                           dirs[["gRNA_precomp_dir"]],
                                           dirs[["results_dir"]],
                                           pod_sizes))
    cat("Running gene precomputations. ")
    # run first round of gene precomputations
    foreach_funct(foreach::foreach(pod_id = seq(1, dicts[["gene"]])),
                  run_gene_precomputation_at_scale_round_1(pod_id = pod_id,
                                                           gene_precomp_dir = dirs[["gene_precomp_dir"]],
                                                           gene_matrix = gene_matrix,
                                                           covariate_matrix = covariate_matrix,
                                                           regularization_amount = regularization_amount,
                                                           log_dir = dirs[["log_dir"]]))
    cat(crayon::green(' \u2713\n'))
  
    # regularize thetas
    regularize_gene_sizes_at_scale(gene_precomp_dir = dirs[["gene_precomp_dir"]],
                                   regularization_amount = regularization_amount,
                                   log_dir = dirs[["log_dir"]])
  
    # run second round of gene precomputations
    foreach_funct(foreach::foreach(pod_id = seq(1, dicts[["gene"]])),
                  run_gene_precomputation_at_scale_round_2(pod_id = pod_id,
                                                           gene_precomp_dir = dirs[["gene_precomp_dir"]],
                                                           gene_matrix = gene_matrix,
                                                           covariate_matrix = covariate_matrix,
                                                           regularization_amount = regularization_amount,
                                                           log_dir = dirs[["log_dir"]]))
  
    # run gRNA precomputations
    cat("Running perturbation precomputations. ")
    foreach_funct(foreach::foreach(pod_id = seq(1, dicts[["gRNA"]])),
                  run_gRNA_precomputation_at_scale(pod_id = pod_id,
                                                   gRNA_precomp_dir = dirs[["gRNA_precomp_dir"]],
                                                   combined_perturbation_matrix = combined_perturbation_matrix,
                                                   covariate_matrix = covariate_matrix,
                                                   log_dir = dirs[["log_dir"]],
                                                   B = B,
                                                   seed = seed))
    cat(crayon::green(' \u2713\n'))
  
    # run at-scale analysis
    cat("Running perturbation-to-gene association tests.")
    foreach_funct(foreach::foreach(pod_id = seq(1, dicts[["pairs"]])),
                  run_gRNA_gene_pair_analysis_at_scale(pod_id = pod_id,
                                                       gene_precomp_dir = dirs[["gene_precomp_dir"]],
                                                       gRNA_precomp_dir = dirs[["gRNA_precomp_dir"]],
                                                       results_dir = dirs[["results_dir"]],
                                                       log_dir = dirs[["log_dir"]],
                                                       gene_matrix = gene_matrix,
                                                       combined_perturbation_matrix = combined_perturbation_matrix,
                                                       covariate_matrix = covariate_matrix,
                                                       regularization_amount = regularization_amount,
                                                       side = side,
                                                       B = B,
                                                       full_output))
    cat(crayon::green(' \u2713\n'))
    # collect results
    cat("Collecting and returning results. ")
    results <- collect_results(results_dir = dirs[["results_dir"]], gene_gRNA_group_pairs)
    cat(crayon::green(' \u2713\n'))
    return(results)
  }

The code for SCEPTRE testing. The function frame2indmatrix is the same as the function mentioned in GSFA testing part.

library(dplyr)
library(sceptre)
library(Seurat)

#data preparation

sg_lib<- read.table("GSE90546/GSM2406677_10X005/sg_lib_all.txt", header=T)
data<- readRDS("GSE90546/GSM2406677_10X005/RNA.rds")

#quality control

data<- subset(data, sgRNA_num != 0 & 
                nFeature_RNA >= 200 & 
                nCount_RNA >= 1000 &
                percent.mt <= 10)
data_qc <- CreateSeuratObject(count = data@assays$RNA@counts, min.cells = 0.01 * ncol(data))
data_qc[["percent.mt"]] <- PercentageFeatureSet(data_qc, pattern = "^MT-")
sg_num <- subset(data$sgRNA_num, names(data$sgRNA_num) %in% colnames(data_qc))
perturbations<- subset(data$perturbations, names(data$perturbations) %in% colnames(data_qc))

data_qc[["log_UMI"]]<- log(data_qc$nCount_RNA)

sg_lib <- subset(sg_lib, cell %in% colnames(data_qc))
sg_lib$gene <- gsub("NTC", "non-targeting", sg_lib$gene)
sg_lib$barcode <- gsub("NTC_", "non-targeting_", sg_lib$barcode)

#Create gene-sgRNA pairs

l <- length(unique(sg_lib$gene))
gene_id<- rep(rownames(data_qc), times = l)
gRNA_id<- rep(unique(sg_lib$gene), each = nrow(data_qc))
gene_gRNA_pairs<- data.frame(gene_id = gene_id, gRNA_group = gRNA_id)

#Generate perturbation index matrix

perturbation_matrix <- t(frame2indmatrix(sg_lib, data_qc))

#Generate covariate matrix

covariate_matrix <- data_qc[[c("log_UMI", "percent.mt")]]

#Get sparse matrix

expression_matrix <- data_qc@assays$RNA@counts

sg_lib <- unique(sg_lib[, c("barcode", "gene")])
grna_group_df <- data.frame(grna_id = sg_lib$barcode, grna_group = sg_lib$gene)

rm(names = "l","sg_lib","data","data_qc","sg_num","perturbations")

dir.create("comparison_results/sceptre_results")

#Running SCEPTRE

st<- Sys.time()
result <- run_sceptre_high_moi2(storage_dir = "comparison_results/sceptre_results",
                                gene_matrix = expression_matrix,
                                combined_perturbation_matrix = perturbation_matrix,
                                covariate_matrix = covariate_matrix,
                                gene_gRNA_group_pairs = gene_gRNA_pairs,
                                side = "both", 
                                B = 2000,
                                parallel = T)
et<- Sys.time()
print(et-st)

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        whisker_0.4       knitr_1.33        magrittr_2.0.1   
 [5] R6_2.5.0          rlang_0.4.11      fansi_0.5.0       stringr_1.4.0    
 [9] tools_4.0.2       xfun_0.25         utf8_1.2.2        git2r_0.28.0     
[13] jquerylib_0.1.4   htmltools_0.5.1.1 ellipsis_0.3.2    rprojroot_2.0.2  
[17] yaml_2.2.1        digest_0.6.27     tibble_3.1.3      lifecycle_1.0.0  
[21] crayon_1.4.1      later_1.2.0       sass_0.4.0        vctrs_0.3.8      
[25] promises_1.2.0.1  fs_1.5.0          glue_1.4.2        evaluate_0.14    
[29] rmarkdown_2.10    stringi_1.7.3     bslib_0.2.5.1     compiler_4.0.2   
[33] pillar_1.6.2      jsonlite_1.7.2    httpuv_1.6.1      pkgconfig_2.0.3

Benchmark

GSFA Testing

SCEPTRE Testing