AtacWorks is a deep learning toolkit for coverage track denoising and peak calling from low-coverage or low-quality ATAC-Seq data.
This will clone the repo to the `master` branch, which contains the code for the latest released version and hot-fixes.
git clone --recursive -b master https://github.com/clara-genomics/AtacWorks.git
This will clone the repo to the default branch, which is set to be the latest development branch. This branch is subject to change frequently as features and bug fixes are pushed.
git clone --recursive https://github.com/clara-genomics/AtacWorks.git
- Ubuntu 16.04+
- CUDA 9.0+
- Python 3.6.7+
- GCC 5+
- (Optional) A conda or virtualenv setup
- Any NVIDIA GPU. AtacWorks training and inference currently do not run on CPU.
- Download `bedGraphToBigWig` and `bigWigToBedGraph` binaries and add them to your $PATH

      rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedGraphToBigWig <custom_path>
      rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bigWigToBedGraph <custom_path>
      export PATH="$PATH:<custom_path>"
      sudo apt-get install hdf5-tools

- Install pip dependencies

      pip install -r requirements-base.txt && pip install -r requirements-macs2.txt

- Install atacworks

      pip install .

Note: The above non-standard installation is necessary to ensure the requirements for macs2 are installed before macs2 itself.
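Before running the tests, it can help to confirm that the prerequisites above are visible from Python. The following is a minimal sketch, not part of AtacWorks: the helper names `missing_tools` and `python_is_supported` are hypothetical, and the tool names and minimum Python version are taken from the lists above.

```python
import shutil
import sys

# Tool names taken from the installation steps above.
REQUIRED_TOOLS = ("bedGraphToBigWig", "bigWigToBedGraph")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_supported(minimum=(3, 6, 7)):
    """Check the running interpreter against the minimum version listed above."""
    return sys.version_info[:3] >= minimum


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"Not found on $PATH: {tool}")
    if not python_is_supported():
        print("Python 3.6.7 or newer is required")
```

If the script prints nothing, both UCSC binaries were found and the interpreter meets the stated minimum. It does not check CUDA, GCC or the GPU.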
Run unit tests:
```
python -m pytest tests/
```
#### Running CI Tests Locally

Please note: your git repository will be mounted into the container, and any untracked files will be removed from it. Before running the CI locally, stash untracked files or add them to the index.
Requirements:
- docker (https://docs.docker.com/install/linux/docker-ce/ubuntu/)
- nvidia-docker (https://github.com/NVIDIA/nvidia-docker)
- nvidia-container-runtime (https://github.com/NVIDIA/nvidia-container-runtime)
Run the following command to execute the CI build steps inside a container locally:
bash ci/local/build.sh -r <Atacworks repo path>
The `ci/local/build.sh` script was adapted from rapidsai/cudf.
The default docker image is clara-genomics-base:cuda10.1-ubuntu16.04-gcc5-py3.6. Other images from the gpuci/clara-genomics-base repository can be used instead by passing the `-i` argument:
bash ci/local/build.sh -r <Atacworks repo path> -i gpuci/clara-genomics-base:cuda10.0-ubuntu18.04-gcc7-py3.6
AtacWorks trains a deep neural network to learn a mapping between noisy (low coverage/low quality) ATAC-Seq data and matching clean (high coverage/high quality) ATAC-Seq data from the same cell type. Once this mapping is learned, the trained model can be applied to improve other noisy ATAC-Seq datasets.
To train an AtacWorks model, you need a pair of ATAC-Seq datasets from the same cell type, where one dataset has lower coverage or lower quality than the other. You can also use multiple such pairs of datasets. For each such pair of datasets, AtacWorks requires three input files:
1. A coverage track representing the number of sequencing reads mapped to each position on the genome in the low-coverage or low-quality dataset. This may be smoothed or processed. Format: bigWig
2. A coverage track representing the number of sequencing reads mapped to each position on the genome in the high-coverage or high-quality dataset. This may be smoothed or processed in the same way as the previous track. Format: bigWig
3. The genomic positions of peaks called on the high-coverage or high-quality dataset. These can be obtained by using MACS2 or any other peak caller. Format: either BED or the narrowPeak format produced by MACS2.

The model learns a mapping from (1) to both (2) and (3); in other words, from the noisy coverage track, it learns to predict both the clean coverage track, and the positions of peaks in the clean dataset.
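To make the two prediction targets concrete, the sketch below turns peak intervals into per-base 0/1 labels alongside toy coverage values. It is a conceptual illustration only: the function `peaks_to_labels`, the half-open coordinate convention and all numbers are assumptions, and this is not how AtacWorks encodes its training data internally.

```python
def peaks_to_labels(peaks, region_start, region_end):
    """Return one 0/1 label per base in [region_start, region_end).

    `peaks` is an iterable of (start, end) pairs, assumed to use
    BED-style half-open coordinates.
    """
    labels = [0] * (region_end - region_start)
    for start, end in peaks:
        # Clip each peak to the region before marking its bases.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            labels[pos - region_start] = 1
    return labels


# Made-up values for an 8 bp region starting at position 100.
noisy_coverage = [0, 1, 0, 2, 3, 1, 0, 0]    # input (1)
clean_coverage = [0, 2, 4, 9, 11, 6, 1, 0]   # target (2)
peak_labels = peaks_to_labels([(102, 106)], 100, 108)  # target (3)
print(peak_labels)  # [0, 0, 1, 1, 1, 1, 0, 0]
```

In this picture, the model receives `noisy_coverage` and is trained to output something close to both `clean_coverage` (a regression target) and `peak_labels` (a per-base classification target).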
See Tutorial 1 for a workflow detailing the steps of data processing, encoding and model training and how to modify the parameters used in these steps.
All models described in Lal & Chiang, et al. (2019) are available for download and use at https://atacworks-paper.s3.us-east-2.amazonaws.com.
See pre-trained denoising models for a list of the available pre-trained denoising models.
Before using one of these models, please read the description of how the training datasets for these models were preprocessed, in Lal & Chiang, et al. (2019), Methods section, paragraph 1. If your data processing pipeline is different, it is advisable to train a new model using the instructions above.
See below for instructions to use our pre-trained models or your own trained models.
To denoise and call peaks from low-coverage/low-quality ATAC-seq data, you need three input files:
- A trained AtacWorks model file with extension `.pth.tar`.
- A coverage track representing the number of sequencing reads mapped to each position on the genome in the low-coverage or low-quality dataset. This may be smoothed or processed in the same way as the files used for training the model. Format: bigWig
- Chromosome sizes file: a tab-separated text file containing the names and sizes of chromosomes in the genome.

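The chromosome sizes file in the list above has a simple layout: one chromosome per line, with the name and the size separated by a tab. A minimal sketch for reading or sanity-checking such a file follows; `read_chrom_sizes` is a hypothetical helper, not an AtacWorks function, and the sizes are example values.

```python
import io


def read_chrom_sizes(handle):
    """Parse a tab-separated chromosome sizes file into a {name: size} dict."""
    sizes = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        name, size = line.split("\t")[:2]
        sizes[name] = int(size)
    return sizes


# In practice, pass an open file instead: open("<path to chromosome sizes file>")
example = io.StringIO("chr1\t248956422\nchr2\t242193529\n")
print(read_chrom_sizes(example))
```

A line that is not tab-separated, or whose second column is not an integer, raises an error here, which makes this a quick format check before starting a long inference run.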
bash AtacWorks/scripts/run_inference.sh -bw <path to bigWig file with test ATAC-seq data> -m <path to model file> -f <path to chromosome sizes file> -o <output directory> -c <path to folder containing config files (optional)>
This command produces a folder containing several files:
- `_infer_results.track.bw`: A bigWig file containing the denoised ATAC-seq coverage track.
- `infer_results_peaks.bed`: A BED file containing the peaks called from the denoised ATAC-seq track. This file has 8 columns, in order:
  - chromosome
  - peak start position
  - peak end position
  - peak length (bp)
  - mean coverage over peak
  - maximum coverage in peak
  - position of summit (relative to start)
  - position of summit (absolute)
- `_infer_results.peaks.bw`: The same peak calls, in the form of a bigWig track for genome browser visualization.

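For downstream analysis, the 8-column peaks file described above can be loaded with the Python standard library. This is a sketch under assumptions: the file is taken to be tab-separated with no header row, and the `Peak` field names are hypothetical labels for the columns listed above, not names defined by AtacWorks.

```python
import csv
import io
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    length: int
    mean_coverage: float
    max_coverage: float
    summit_offset: int     # summit position relative to start
    summit_position: int   # absolute summit position


def read_peaks(handle):
    """Parse rows of the 8-column peaks file into Peak records."""
    peaks = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 8:
            continue  # skip blank or incomplete rows
        peaks.append(Peak(row[0], int(row[1]), int(row[2]), int(row[3]),
                          float(row[4]), float(row[5]), int(row[6]), int(row[7])))
    return peaks


# A made-up row for illustration; real values come from the output file.
example = io.StringIO("chr1\t1000\t1200\t200\t5.5\t12.0\t80\t1080\n")
for peak in read_peaks(example):
    print(peak.chrom, peak.start, peak.end, peak.summit_position)
```

In the made-up row, the absolute summit equals the start plus the relative summit, which is how the two summit columns are described above.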
`run_inference.sh` optionally takes a folder containing config files. Specifically, this folder needs to contain two files: `infer_config.yaml`, which specifies parameters for inference, and `model_structure.yaml`, which specifies the structure of the deep learning model. If no folder containing config files is supplied, the folder `AtacWorks/configs`, which contains default parameter values, will be used.
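When inference is run from a pipeline, the command line can be assembled programmatically. The sketch below only builds the argument list from the flags shown in the command above (`-bw`, `-m`, `-f`, `-o` and the optional `-c`). The helper `build_inference_command`, the default script path and the file names are assumptions for illustration; adjust the script path to your checkout.

```python
import shlex


def build_inference_command(bigwig, model, chrom_sizes, out_dir, config_dir=None,
                            script="AtacWorks/scripts/run_inference.sh"):
    """Return the run_inference.sh invocation as a list of arguments."""
    cmd = ["bash", script, "-bw", bigwig, "-m", model,
           "-f", chrom_sizes, "-o", out_dir]
    if config_dir is not None:
        # Only pass -c when a custom config folder is supplied;
        # otherwise the default configs are used.
        cmd += ["-c", config_dir]
    return cmd


# Hypothetical file names for illustration.
cmd = build_inference_command("noisy.bw", "model.pth.tar", "genome.chrom.sizes", "out")
print(" ".join(shlex.quote(part) for part in cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with paths that contain spaces.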
In order to vary output file names or formats, or inference parameters, you can change the arguments supplied in `infer_config.yaml`. Type `python AtacWorks/scripts/main.py infer --help` to understand which arguments to change.
In particular, the threshold for peak calling is controlled by the `infer_threshold` parameter in `infer_config.yaml`. By default, this is set to 0.5. If `infer_threshold` is set to "None" in the config file, `run_inference.sh` will instead produce a bigWig file in which each base is labeled with the probability (between 0 and 1) that it is part of a peak.
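If you export per-base probabilities and want to explore different thresholds yourself, the idea of thresholding can be sketched as below. This is not the AtacWorks implementation: the function `call_peaks`, the use of `>=` at the threshold and the half-open interval convention are all assumptions made for illustration.

```python
def call_peaks(probabilities, threshold=0.5, offset=0):
    """Return half-open (start, end) intervals where probability >= threshold.

    `offset` is the genomic position of the first element of `probabilities`.
    """
    peaks = []
    start = None
    for i, p in enumerate(probabilities):
        if p >= threshold and start is None:
            start = i  # a new peak begins here
        elif p < threshold and start is not None:
            peaks.append((offset + start, offset + i))  # the peak ended before i
            start = None
    if start is not None:
        # The last peak runs to the end of the region.
        peaks.append((offset + start, offset + len(probabilities)))
    return peaks


probs = [0.1, 0.7, 0.9, 0.4, 0.2, 0.6, 0.8]
print(call_peaks(probs))                 # [(1, 3), (5, 7)]
print(call_peaks(probs, threshold=0.75)) # [(2, 3), (6, 7)]
```

Raising the threshold yields fewer and narrower peaks, while lowering it yields more and wider ones.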
See Tutorial 2 for an advanced workflow detailing the individual steps of data processing, encoding and prediction using a trained model, and how to modify the parameters used in these steps.
- What's the preferred way for setting up the environment?
A virtual environment or conda installation is preferred. You can follow conda installation instructions on their website and then follow the instructions in the README.
Please cite AtacWorks as follows:
Lal, A., Chiang, Z.D., Yakovenko, N., Duarte, F.M., Israeli, J. and Buenrostro, J.D., 2019. AtacWorks: A deep convolutional neural network toolkit for epigenomics. BioRxiv, p.829481.