CRISPR KO Analysis based on Genomic Editing data: Toward personalized sgRNA design in heterogeneous experimental conditions
CAGE is a CRISPR-Cas9 based genome-editing data analysis resource and platform for analyzing indels and microhomology patterns, identifying personalized features correlated with sgRNA KO efficiency under heterogeneous experimental conditions, and evaluating sgRNA KO efficiency from CRISPR-Cas9 KO NGS data or sgRNA KO assay data.
The ultimate goals of CAGE:
- CAGE provides a standard CROWDSOURCING platform for users to share CRISPR-Cas9 based gene KO data.
- CAGE provides an efficient interface to analyze and visualize CRISPR-based KO NGS data.
- CAGE provides a robust learning pipeline to derive sequence determinants from heterogeneous genome editing data.
- CAGE provides a personalized evaluation framework for on-target sgRNA design, based on the derived sequence determinants, toward specific cell types or organisms.

CAGE currently records the optimal sgRNA KO efficiency prediction models and the personalized evaluation models for sgRNA design for the cell types listed below. The optimal results for new cell types, as well as for the current ones, will be updated regularly. Users can select an existing evaluation model for a specific cell type, or use their own sgRNA KO data to generate a new personalized evaluation model for a new cell type for further sgRNA KO efficiency evaluation.

Evaluation Model | Species | Cell Type | KO Efficiency Measurement | Data Type | Learning Model | Performance | Actual sgRNA Library Size | Accession |
---|---|---|---|---|---|---|---|---|
a375_1 | Homo sapiens | A375 | See Doench et al. | numerical | LASSO | r2=0.504 | 1248 | - |
el4_1 | Mus musculus | EL4 | See Doench et al. | numerical | LASSO | r2=0.508 | 858 | - |
mesc_1 | Mus musculus | mESC | OTF Ratio | numerical | LASSO | r2=0.72 | 99 | ERP003292 |
rn2c_1 | Mus musculus | RN2c | OTF Ratio | numerical | LASSO | r2=0.89 | 26 | SRP057117 |
hela_1 | Homo sapiens | HeLa | OTF Ratio | numerical | LASSO | r2=0.87 | 68 | SRP042061 |
dr_1 | Danio rerio | *AB/Tu | OTF Ratio | numerical | LASSO | r2=0.91 | 47 | SRP052749 |
hl60_nonribo | Homo sapiens | HL60 | See Xu et al. | categorical | Logistic | AUC=0.76 | 908 | - |
hl60_ribo | Homo sapiens | HL60 | See Xu et al. | categorical | Logistic | AUC=0.79 | 373 | - |
mesc_2 | Mus musculus | mESC | See Xu et al. | categorical | Logistic | AUC=0.81 | 1028 | - |
hek293t_1 | Homo sapiens | HEK293T | See Chari et al. | categorical | Logistic | AUC=0.77 | 279 | - |

CAGE depends on the following software:

- Python 2.7
- Numpy >= 1.9.2
- Scipy >= 0.15.1
- Pandas >= 0.16.0
- scikit-learn >= 0.16.1
- lxml >= 3.4.4
- pyfasta >= 0.5.2
- bwa >= 0.7.12
- samtools >= 0.1.19
- bedtools >= 2.23.0
- pyslep (for multi-task group lasso)
- LaTeX (for visualization)

Perform this presetting carefully, because a correct reference setting is essential.
For simplicity, we use hg19 as the example.
- Download the hg19 genome (FASTA file) from UCSC, put it in a directory, name it `hg19.fa`, and set the directory path as `$FASTADB`.
- Generate `bwa` index files from `hg19.fa`, put them in a directory, and set the directory path as `$BWADB`.
- (Optional) Download the hg19 gene annotation files from UCSC, convert them to `bed-6` format with the 4th column being the gene name, put them in a directory, and set the directory path as `$BEDDB`. The renamed files are:

File | Standard | Requirement |
---|---|---|
hg19ref.bed | Refseq | required |
hg19ucsc.bed | UCSC Gene | optional |
hg19gencode.bed | GENCODE | optional |
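
The presetting above can be sanity-checked before running any command. The following is a minimal sketch and not part of CAGE: the function name `check_reference_setup` is hypothetical, and it assumes the index files in `$BWADB` carry the standard `bwa index` suffixes (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`) without assuming any particular file prefix.

```python
import os

# Suffixes of the files produced by `bwa index`.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def check_reference_setup(genome="hg19", env=None):
    """Return a list of problems; an empty list means the setup looks complete."""
    env = os.environ if env is None else env
    problems = []

    # $FASTADB must contain <genome>.fa
    fasta_dir = env.get("FASTADB")
    if not fasta_dir:
        problems.append("FASTADB is not set")
    elif not os.path.isfile(os.path.join(fasta_dir, genome + ".fa")):
        problems.append("missing %s.fa in FASTADB" % genome)

    # $BWADB must contain at least one file per bwa index suffix
    bwa_dir = env.get("BWADB")
    if not bwa_dir:
        problems.append("BWADB is not set")
    elif not os.path.isdir(bwa_dir):
        problems.append("BWADB is not a directory")
    else:
        names = os.listdir(bwa_dir)
        for suffix in BWA_INDEX_SUFFIXES:
            if not any(name.endswith(suffix) for name in names):
                problems.append("no *%s index file in BWADB" % suffix)

    # $BEDDB is optional; when it is set, <genome>ref.bed is the required file
    bed_dir = env.get("BEDDB")
    if bed_dir and not os.path.isfile(os.path.join(bed_dir, genome + "ref.bed")):
        problems.append("missing %sref.bed in BEDDB" % genome)

    return problems
```

Calling `check_reference_setup()` with no arguments reads the variables from the current environment; passing a dictionary as `env` makes the check easy to try on a scratch directory.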

    git clone https://github.com/bm2-lab/cage-dev.git

To perform multi-task group lasso, pyslep is required:

    cd pyslep
    sh setup_pyslep.sh

Usage:

    python cage.py <command> [option] ...

- `sg`: Process sgRNA sequences into sgRNA information table
- `prep`: Process NGS data
- `mh`: Microhomology detection
- `indel`: Indel frameshifting paradigm analysis
- `fs`: Feature selection and model prediction on clearly defined sgRNA KO efficiency
- `mt`: Feature selection with multi-task group LASSO on clearly defined sgRNA KO efficiency for cross-platform data
- `eval`: sgRNA KO efficiency evaluation and scanning of a given genome region for sgRNA design
- `vis`: Visualization of feature selection result

File Type | Suffix | Usage |
---|---|---|
sg file | .sg | sgRNA information table |
samind file | .samind | reads mapping result |
mnst file | .mnst | microhomology information table |
iost file | .iost | sgRNA-indel table |
seq file | .seq | original sequence feature table |
fesrep file | _fesrep.xml | feature selection and model prediction report |
pkl file | .pkl | evaluation model file |
st file | .st | evaluation result file |
label file | (arbitrary) | user-customized evaluation file for feature selection and model prediction |

##### Note
For the label file, the first and last columns are treated as sgID and score, respectively. A header row is required, and the header of the first column must be sgID. See the following example.

sgID | ... | score |
---|---|---|
sg1 | ... | 0.1 |
sg2 | ... | 0.2 |
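
The label file rules above can be checked with a short script before running `fs` or `mt`. This is a minimal sketch, not CAGE's own parser: the function name `read_label_file` is hypothetical, and the tab delimiter is an assumption, since the delimiter is not stated here.

```python
import csv


def read_label_file(path, delimiter="\t"):
    """Parse a label file into (sgID, score) pairs.

    Assumption: fields are tab-separated; adjust `delimiter` if your file differs.
    """
    with open(path) as handle:
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    if not rows:
        raise ValueError("label file is empty; a header row is required")
    header = rows[0]
    if header[0] != "sgID":
        raise ValueError("first header column must be 'sgID', got %r" % header[0])
    # First column is the sgRNA ID, last column is the score;
    # any columns in between are ignored here.
    return [(row[0], float(row[-1])) for row in rows[1:]]
```

A non-numeric score raises `ValueError` from `float`, which points at malformed rows early.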

Generate sgRNA Information Table (sg file)

    python cage.py sg -s <sgRNA.fq>
                      -o <output directory>
                      -g <reference genome> (e.g. hg19)
                      -t <bwa threads> (default 1)
                      -a (annotate, optional)

For more detail on the options, see `python cage.py sg -h`.
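
The `-s` input of `sg` is named `<sgRNA.fq>`, which suggests a FASTQ file. When sgRNA sequences are only available as a plain list, one way to produce such a file is sketched below. The helper name `write_sgrna_fastq` and the constant placeholder quality character are assumptions made for illustration; whether CAGE uses the quality line at all is not documented here.

```python
def write_sgrna_fastq(sgrnas, path, quality_char="I"):
    """Write (sgID, sequence) pairs as FASTQ records with a placeholder quality."""
    with open(path, "w") as handle:
        for sg_id, seq in sgrnas:
            seq = seq.upper()
            if set(seq) - set("ACGTN"):
                raise ValueError("unexpected base in %s: %s" % (sg_id, seq))
            # A FASTQ record has four lines: @id, sequence, '+',
            # and one quality character per base.
            handle.write("@%s\n%s\n+\n%s\n" % (sg_id, seq, quality_char * len(seq)))
```

For example, `write_sgrna_fastq([("sg1", "ACGTACGTACGTACGTACGT")], "sgRNA.fq")` writes a single four-line record.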

- Single-end

      python cage.py prep -s <sg file>
                          -f <reads.fq>
                          -o <output directory>
                          -g <reference genome>
                          -t <bwa threads>

- Paired-end

      python cage.py prep -s <sg file>
                          -f <reads_1.fq>
                          -r <reads_2.fq>
                          -o <output directory>
                          -g <reference genome>
                          -t <bwa threads>

For more detail on the options, see `python cage.py prep -h`.

    python cage.py mh -i <samind file>
                      -o <output directory>
                      -g <reference genome>

For more detail on the options, see `python cage.py mh -h`.
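
To make the notion of microhomology concrete, the sketch below measures the repeated sequence flanking a deletion junction. It uses one common definition and is not taken from CAGE's source, so the `mh` command may compute it differently; the function name and the 0-based coordinates are assumptions.

```python
def microhomology_length(ref, del_start, del_len):
    """Length of the microhomology flanking a deletion.

    ref       -- reference sequence (string)
    del_start -- 0-based index of the first deleted base
    del_len   -- number of deleted bases
    """
    del_end = del_start + del_len  # index of the first base after the deletion

    # Extend to the right: compare the start of the deleted segment
    # with the sequence immediately after the deletion.
    right = 0
    while (right < del_len and del_end + right < len(ref)
           and ref[del_start + right] == ref[del_end + right]):
        right += 1

    # Extend to the left: compare the sequence immediately before the
    # deletion with the end of the deleted segment.
    left = 0
    while (left < del_len - right and del_start - left - 1 >= 0
           and ref[del_start - left - 1] == ref[del_end - left - 1]):
        left += 1

    # The total is capped at del_len by the loop conditions.
    return left + right
```

Because both directions are counted, the result does not depend on where an aligner places an ambiguous deletion: in `GGACGTTTTACGCC`, deleting 7 bases from index 2 or from index 3 describes the same edited sequence, and both give a microhomology of 3 (`ACG`).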

Generate sgRNA-indel table (iost file)

    python cage.py indel -i <samind file>
                         -s <sg file>
                         -o <output directory>
                         -g <reference genome>
                         -t <read-count cutoff> (default 0)

For more detail on the options, see `python cage.py indel -h`.
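
The idea behind a frameshift analysis can be illustrated with a small sketch: an indel whose net length is not a multiple of 3 shifts the reading frame. This is an illustration only, not CAGE's implementation. The function names are hypothetical, and treating the cutoff as "keep indels with at least this many reads" is an assumption about the meaning of `-t`.

```python
def classify_indel(length):
    """Classify an indel by net length (positive = insertion, negative = deletion)."""
    if length == 0:
        return "no_indel"
    return "in_frame" if abs(length) % 3 == 0 else "frameshift"


def frameshift_fraction(indel_counts, read_cutoff=0):
    """Fraction of indel-bearing reads that shift the reading frame.

    indel_counts -- dict mapping net indel length to read count
    read_cutoff  -- indels with fewer reads than this are ignored (assumed semantics)
    """
    kept = {length: n for length, n in indel_counts.items()
            if length != 0 and n >= read_cutoff}
    total = sum(kept.values())
    if total == 0:
        return None  # no indel reads left after filtering
    shifted = sum(n for length, n in kept.items()
                  if classify_indel(length) == "frameshift")
    return float(shifted) / total
```

For instance, with 30 reads carrying a 1-bp deletion, 10 reads a 3-bp deletion, and 10 reads a 2-bp insertion, 40 of the 50 indel reads are frameshifting.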

- Manual

      python cage.py fs -i <label file>
                        -s <sg file>
                        -o <output directory>
                        -g <reference genome>
                        -t <reads cutoff> (default 0)
                        -u <upstream region length> (default 30)
                        -w <downstream region length> (without PAM, default 27)
                        -c <cross-validation folds> (default 5)
                        -j <number of CPU cores used> (default 1)
                        -m <lasso|logit>

- Auto

      python cage.py fs -i <label file>
                        -s <sg file>
                        -o <output directory>
                        -g <reference genome>
                        -t <reads cutoff> (default 0)
                        -a (auto detection for sequence region)
                        --init-radius <init radius> (default 0)
                        -r <radius> (default 200)
                        --step <detection step> (default 5)
                        -c <cross-validation folds> (default 5)
                        -j <number of CPU cores used> (default 1)
                        -m <lasso|logit> (method)

For more detail on the options, see `python cage.py fs -h`.
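
In the evaluation model table, numerical KO efficiency data are paired with LASSO and categorical data with logistic regression. A small helper can suggest the `-m` value from the scores of a label file. This is a sketch with a hypothetical function name, and it assumes categorical labels are encoded as 0/1, which is not stated here.

```python
def suggest_method(scores):
    """Suggest a value for `-m` from the label scores.

    Assumption: categorical labels are encoded as 0/1;
    any other set of values is treated as numerical.
    """
    values = set(float(s) for s in scores)
    if not values:
        raise ValueError("no scores given")
    return "logit" if values <= {0.0, 1.0} else "lasso"
```

The result is only a suggestion: a numerical data set that happens to contain nothing but 0 and 1 would also be reported as `logit`.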

Feature selection with multi-task group LASSO on clearly defined sgRNA KO efficiency for cross-platform data

    python cage.py mt -i [<label file> [<label file> ...]]
                      -s [<sg file> [<sg file> ...]]
                      -o <output directory>
                      -g [<ref genome> [<ref genome> ...]]
                      -u <upstream region length> (default 30)
                      -w <downstream region length> (without PAM, default 27)
                      -d <selection strictness>

- Evaluation with Genome Scanner

      python cage.py eval -c <target chromosome>
                          -b <start coordinate>
                          -e <end coordinate>
                          -m <pkl file>
                          -o <output directory>
                          -g <reference genome>
                          -d <two-sided|pos|neg> (scan direction)
                          -t <bwa threads>

- Evaluation with sgRNA Information Table

      python cage.py eval -s <sg file>
                          -m <pkl file>
                          -o <output directory>
                          -g <reference genome>

For more detail on the options, see `python cage.py eval -h`.
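
Scanning a genome region for candidate sgRNAs can be pictured with the sketch below, which lists every 20-nt protospacer followed by an NGG PAM on either strand. It is an illustration under stated assumptions, not CAGE's scanner: the SpCas9 NGG rule, the 20-nt protospacer length, and the mapping of `pos`/`neg` to the forward/reverse strand are all assumed, and the function names are hypothetical.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (A, C, G, T, N only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def scan_region(seq, direction="two-sided", protospacer_len=20):
    """Return sorted (start, strand, protospacer, pam) tuples for each NGG site.

    `start` is the 0-based forward-strand index of the leftmost base
    covered by protospacer + PAM.
    """
    seq = seq.upper()
    site_len = protospacer_len + 3  # protospacer plus 3-nt PAM
    hits = []
    for i in range(len(seq) - site_len + 1):
        window = seq[i:i + site_len]
        candidates = []
        if direction in ("two-sided", "pos"):
            candidates.append(("+", window))
        if direction in ("two-sided", "neg"):
            candidates.append(("-", reverse_complement(window)))
        for strand, site in candidates:
            # Read 5'->3' on its own strand, a site must end in NGG.
            if site.endswith("GG"):
                hits.append((i, strand, site[:protospacer_len], site[protospacer_len:]))
    return sorted(hits)
```

Each candidate found this way would then be scored with an evaluation model; that scoring step is what the `.pkl` file passed via `-m` provides and is not reproduced here.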

    python cage.py vis -f <feature report file>
                       -o <output directory>

For more detail on the options, see `python cage.py vis -h`.

To run the examples, `cd example` first, then execute the following commands.
- `sg`: `sh exam.sh sg`
- `prep` for single-end: `sh exam.sh prep_se`
- `prep` for paired-end: `sh exam.sh prep_pe`
- `mh`: `sh exam.sh mh`
- `indel`: `sh exam.sh indel`
- `fs` using LASSO without auto detection: `sh exam.sh fs_las`
- `fs` using LASSO with auto detection: `sh exam.sh fs_las_a`
- `fs` using Logistic Regression without auto detection: `sh exam.sh fs_log`
- `fs` using Logistic Regression with auto detection: `sh exam.sh fs_log_a`
- `mt`: `sh exam.sh mt`
- `eval` using genome scanner: `sh exam.sh eval`
- `eval` using sgRNA information table: `sh exam.sh eval_sg`
- `vis`: `sh exam.sh vis`