AtacWorks

AtacWorks is a deep learning toolkit for coverage track denoising and peak calling from low-coverage or low-quality ATAC-Seq data.

Installation

1. Clone repository

Latest released version

This clones the master branch, which contains the code for the latest released version plus hot-fixes.

git clone --recursive -b master https://github.com/clara-genomics/AtacWorks.git

Latest development version

This clones the default branch, which is the latest development branch. This branch changes frequently as features and bug fixes are pushed.

git clone --recursive https://github.com/clara-genomics/AtacWorks.git

2. System Setup

System requirements

Ubuntu 16.04+
CUDA 9.0+
Python 3.6.7+
GCC 5+
(Optional) A conda or virtualenv setup
Any NVIDIA GPU. AtacWorks training and inference currently do not run on CPU.

Install dependencies

Download the bedGraphToBigWig and bigWigToBedGraph binaries, add them to your $PATH, and install hdf5-tools:

rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedGraphToBigWig <custom_path>
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bigWigToBedGraph <custom_path>
export PATH="$PATH:<custom_path>"
sudo apt-get install hdf5-tools
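
After the steps above, it can help to confirm that the two UCSC binaries are actually visible on your `$PATH` and that your Python version meets the stated minimum. The following is a minimal sketch using only the standard library; the function names are hypothetical and are not part of AtacWorks.

```python
import shutil
import sys

# Tool names and minimum version are taken from the requirements listed above.
REQUIRED_TOOLS = ("bedGraphToBigWig", "bigWigToBedGraph")
MIN_PYTHON = (3, 6, 7)


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on $PATH."""
    return [name for name in tools if shutil.which(name) is None]


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) Python version is new enough."""
    if version is None:
        version = sys.version_info[:3]
    return tuple(version) >= tuple(minimum)


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on $PATH:", ", ".join(missing))
    else:
        print("All required binaries found.")
    print("Python version supported:", python_is_supported())
```

This only checks that the executables can be located; it does not verify the GPU, CUDA or GCC requirements.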

Install pip dependencies

pip install -r requirements-base.txt && pip install -r requirements-macs2.txt

Install atacworks
```
pip install .
```

Note: The above non-standard installation is necessary to ensure the requirements for macs2 are installed before macs2 itself.

3. Tests

Run unit tests:

```
python -m pytest tests/
```

#### Running CI Tests Locally

Please note: your git repository will be mounted to the container, and any untracked files will be removed from it. Before executing the CI locally, stash untracked files or add them to the index.

Requirements:

docker (https://docs.docker.com/install/linux/docker-ce/ubuntu/)
nvidia-docker (https://github.com/NVIDIA/nvidia-docker)
nvidia-container-runtime (https://github.com/NVIDIA/nvidia-container-runtime)

Run the following command to execute the CI build steps inside a container locally:

bash ci/local/build.sh -r <Atacworks repo path>

The ci/local/build.sh script was adapted from rapidsai/cudf.

The default docker image is clara-genomics-base:cuda10.1-ubuntu16.04-gcc5-py3.6. Other images from the gpuci/clara-genomics-base repository can be used instead by passing the -i argument:

bash ci/local/build.sh -r <Atacworks repo path> -i gpuci/clara-genomics-base:cuda10.0-ubuntu18.04-gcc7-py3.6

Workflow

AtacWorks trains a deep neural network to learn a mapping between noisy (low coverage/low quality) ATAC-Seq data and matching clean (high coverage/high quality) ATAC-Seq data from the same cell type. Once this mapping is learned, the trained model can be applied to improve other noisy ATAC-Seq datasets.

1. Training an AtacWorks model

Input files

To train an AtacWorks model, you need a pair of ATAC-Seq datasets from the same cell type, where one dataset has lower coverage or lower quality than the other. You can also use multiple such pairs of datasets. For each such pair of datasets, AtacWorks requires three input files:

1. A coverage track representing the number of sequencing reads mapped to each position on the genome in the low-coverage or low-quality dataset. This may be smoothed or processed. Format: bigWig
2. A coverage track representing the number of sequencing reads mapped to each position on the genome in the high-coverage or high-quality dataset. This may be smoothed or processed in the same way as the previous track. Format: bigWig
3. The genomic positions of peaks called on the high-coverage or high-quality dataset. These can be obtained by using MACS2 or any other peak caller. Format: either BED or the narrowPeak format produced by MACS2.

The model learns a mapping from (1) to both (2) and (3); in other words, from the noisy coverage track, it learns to predict both the clean coverage track, and the positions of peaks in the clean dataset.
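
To keep the three files for each dataset pair together, something like the following sketch may be useful. It groups the files and checks their extensions before any processing starts. The class, the function and the accepted extensions (`.bw`, `.bigwig`, `.bed`, `.narrowPeak`) are assumptions made for illustration; AtacWorks itself does not define them.

```python
from pathlib import Path
from typing import NamedTuple

# Assumed conventional extensions; adjust to match your own file naming.
BIGWIG_EXTENSIONS = {".bw", ".bigwig"}
PEAK_EXTENSIONS = {".bed", ".narrowpeak"}


class TrainingPair(NamedTuple):
    """The three input files for one noisy/clean dataset pair."""
    noisy_coverage: str  # file (1): low-coverage or low-quality track
    clean_coverage: str  # file (2): high-coverage or high-quality track
    clean_peaks: str     # file (3): peaks called on the clean dataset


def check_training_pair(pair):
    """Return a list of extension problems; an empty list means none found."""
    problems = []
    for label, path in (("noisy coverage", pair.noisy_coverage),
                        ("clean coverage", pair.clean_coverage)):
        if Path(path).suffix.lower() not in BIGWIG_EXTENSIONS:
            problems.append(f"{label}: expected a bigWig file, got {path}")
    if Path(pair.clean_peaks).suffix.lower() not in PEAK_EXTENSIONS:
        problems.append(f"clean peaks: expected BED or narrowPeak, got {pair.clean_peaks}")
    return problems
```

With multiple dataset pairs, a list of `TrainingPair` objects can be checked in a loop before starting the training workflow.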

Tutorial

See Tutorial 1 for a workflow detailing the steps of data processing, encoding and model training and how to modify the parameters used in these steps.

2. Denoising and peak calling using a trained AtacWorks model

Downloading pre-trained models

All models described in Lal & Chiang, et al. (2019) are available for download and use at https://atacworks-paper.s3.us-east-2.amazonaws.com.

See pre-trained denoising models for a list of the available pre-trained denoising models.

Before using one of these models, please read the description of how the training datasets for these models were preprocessed, in Lal & Chiang, et al. (2019), Methods section, paragraph 1. If your data processing pipeline is different, it is advisable to train a new model using the instructions above.

See below for instructions to use our pre-trained models or your own trained models.

Input files

To denoise and call peaks from low-coverage/low-quality ATAC-seq data, you need three input files:

A trained AtacWorks model file with extension .pth.tar.
A coverage track representing the number of sequencing reads mapped to each position on the genome in the low-coverage or low-quality dataset. This may be smoothed or processed in the same way as the files used for training the model. Format: bigWig
Chromosome sizes file - a tab-separated text file containing the names and sizes of chromosomes in the genome.
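
As an illustration of the chromosome sizes format, here is a minimal sketch of a parser. It assumes one chromosome per line, with the name in the first column and the size in the second. The function name is hypothetical and the example values in the usage note are illustrative only.

```python
def parse_chrom_sizes(lines):
    """Parse tab-separated 'name<TAB>size' lines into a dict of name -> size."""
    sizes = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected at least 2 tab-separated fields")
        sizes[fields[0]] = int(fields[1])
    return sizes
```

To read a file, pass an open file handle, for example `with open("chrom.sizes") as handle: sizes = parse_chrom_sizes(handle)`, where `chrom.sizes` is a placeholder file name.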

One step denoising + peak calling command

bash AtacWorks/scripts/run_inference.sh -bw <path to bigWig file with test ATAC-seq data> -m <path to model file> -f <path to chromosome sizes file> -o <output directory> -c <path to folder containing config files (optional)>
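
If you want to launch this command from a Python pipeline, the argument list can be assembled as in the sketch below. The flags (`-bw`, `-m`, `-f`, `-o`, `-c`) are those shown in the command above. The helper function and its parameter names are hypothetical, and `repo_dir` is assumed to be the path to your local clone of the repository.

```python
import os


def build_inference_command(repo_dir, bigwig, model, chrom_sizes, out_dir,
                            config_dir=None):
    """Build the argument list for scripts/run_inference.sh in the clone."""
    script = os.path.join(repo_dir, "scripts", "run_inference.sh")
    command = ["bash", script,
               "-bw", bigwig,
               "-m", model,
               "-f", chrom_sizes,
               "-o", out_dir]
    if config_dir is not None:
        # The config folder is optional; omit the flag to use the defaults.
        command += ["-c", config_dir]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.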

This command produces a folder containing several files:

_infer_results.track.bw: A bigWig file containing the denoised ATAC-seq coverage track.
infer_results_peaks.bed: A BED file containing the peaks called from the denoised ATAC-seq track. This file has 8 columns, in order:

chromosome
peak start position
peak end position
peak length (bp)
Mean coverage over peak
Maximum coverage in peak
Position of summit (relative to start)
Position of summit (absolute).

_infer_results.peaks.bw: The same peak calls, in the form of a bigWig track for genome browser visualization.
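
For downstream analysis, the 8-column peak file can be read with a few lines of Python. The sketch below assumes the columns are tab-separated, appear in the order listed above, and that positions and lengths are integers. The `Peak` type and `parse_peak_line` function are hypothetical names, not part of AtacWorks.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    length: int            # peak length in bp
    mean_coverage: float   # mean coverage over the peak
    max_coverage: float    # maximum coverage in the peak
    summit_offset: int     # summit position relative to start
    summit_position: int   # absolute summit position


def parse_peak_line(line):
    """Parse one line of the 8-column peak BED file into a Peak."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated columns, got {len(fields)}")
    return Peak(fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
                float(fields[4]), float(fields[5]), int(fields[6]), int(fields[7]))
```

Each line of the file can then be converted with `parse_peak_line`, for example to filter peaks by `max_coverage` or to collect summit positions.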

run_inference.sh optionally takes a folder containing config files - specifically, this folder needs to contain two files, infer_config.yaml which specifies parameters for inference, and model_structure.yaml which specifies the structure of the deep learning model. If no folder containing config files is supplied, the folder AtacWorks/configs containing default parameter values will be used.

To vary output file names or formats, or inference parameters, change the arguments supplied in infer_config.yaml. Run python AtacWorks/scripts/main.py infer --help to see which arguments are available.

In particular, the threshold for peak calling is controlled by the infer_threshold parameter in infer_config.yaml. By default, this is set to 0.5. If infer_threshold is set to "None" in the config file, run_inference.sh will instead produce a bigWig file in which each base is labeled with the probability (between 0 and 1) that it is part of a peak.
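
To illustrate what a threshold on per-base probabilities means, the sketch below converts a list of probabilities into peak intervals. This is not the AtacWorks implementation. It assumes a base belongs to a peak when its probability is strictly greater than the threshold, and it reports half-open, 0-based intervals in the style of BED coordinates.

```python
def call_peak_intervals(probabilities, threshold=0.5):
    """Return (start, end) intervals of consecutive bases above the threshold."""
    intervals = []
    start = None
    for position, probability in enumerate(probabilities):
        if probability > threshold:
            if start is None:
                start = position  # a new peak begins here
        elif start is not None:
            intervals.append((start, position))  # the peak ended at the previous base
            start = None
    if start is not None:
        # The last peak runs to the end of the sequence.
        intervals.append((start, len(probabilities)))
    return intervals
```

If you set infer_threshold to "None" and work from the probability track, a function like this lets you try several thresholds without rerunning inference.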

Advanced usage: step-by-step denoising + peak calling with subcommands

See Tutorial 2 for an advanced workflow detailing the individual steps of data processing, encoding and prediction using a trained model, and how to modify the parameters used in these steps.

FAQ

What's the preferred way for setting up the environment?

A virtual environment or conda installation is preferred. You can follow the installation instructions on the conda website and then follow the instructions in this README.

Citation

Please cite AtacWorks as follows:

Lal, A., Chiang, Z.D., Yakovenko, N., Duarte, F.M., Israeli, J. and Buenrostro, J.D., 2019. AtacWorks: A deep convolutional neural network toolkit for epigenomics. BioRxiv, p.829481.

Link: https://www.biorxiv.org/content/10.1101/829481v2
