Deep learning based processing of Atac-seq data

AtacWorks

AtacWorks is a deep learning toolkit for coverage track denoising and peak calling from low-coverage or low-quality ATAC-Seq data.

Installation

1. Clone repository

Latest released version

This will clone the repo to the master branch, which contains the code for the latest released version and hot-fixes.

git clone --recursive -b master https://github.com/clara-genomics/AtacWorks.git

Latest development version

This will clone the repo to the default branch, which is set to be the latest development branch. This branch is subject to change frequently as features and bug fixes are pushed.

git clone --recursive https://github.com/clara-genomics/AtacWorks.git

2. System Setup

System requirements

  • Ubuntu 16.04+
  • CUDA 9.0+
  • Python 3.6.7+
  • GCC 5+
  • (Optional) A conda or virtualenv setup
  • Any NVIDIA GPU. AtacWorks training and inference do not currently run on CPU.

Install dependencies

  • Download bedGraphToBigWig and bigWigToBedGraph binaries and add to your $PATH

    rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedGraphToBigWig <custom_path>
    rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bigWigToBedGraph <custom_path>
    export PATH="$PATH:<custom_path>"
    sudo apt-get install hdf5-tools
    
  • Install pip dependencies

    pip install -r requirements-base.txt && pip install -r requirements-macs2.txt
    
  • Install atacworks

    pip install .
    

Note: The above non-standard installation is necessary to ensure the requirements for macs2 are installed before macs2 itself.
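Before running AtacWorks, it can help to confirm that the interpreter and the UCSC binaries from the steps above are visible to Python. The following is a minimal sketch of such a check, not part of AtacWorks itself; the helper names `missing_tools` and `python_is_supported` are hypothetical, and only the tool names and the minimum Python version come from this README.

```python
import shutil
import sys

# Tool names and minimum version are taken from the setup steps above.
REQUIRED_TOOLS = ("bedGraphToBigWig", "bigWigToBedGraph")
MIN_PYTHON = (3, 6, 7)


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) Python version meets the minimum."""
    if version is None:
        version = tuple(sys.version_info[:3])
    return tuple(version) >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python %d.%d.%d or newer is required" % MIN_PYTHON)
    for tool in missing_tools():
        print("Not found on $PATH:", tool)
```

If the script prints nothing, both binaries were found and the Python version meets the stated requirement. It does not check CUDA, GCC, or the GPU.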

3. Tests

Run unit tests:

```
python -m pytest tests/
```

Running CI Tests Locally

Please note: your git repository will be mounted into the container, and any untracked files will be removed from it. Before running the CI locally, stash untracked files or add them to the index.

Requirements:

  1. docker (https://docs.docker.com/install/linux/docker-ce/ubuntu/)
  2. nvidia-docker (https://github.com/NVIDIA/nvidia-docker)
  3. nvidia-container-runtime (https://github.com/NVIDIA/nvidia-container-runtime)

Run the following command to execute the CI build steps inside a container locally:

bash ci/local/build.sh -r <Atacworks repo path>

The ci/local/build.sh script was adapted from rapidsai/cudf.

The default docker image is clara-genomics-base:cuda10.1-ubuntu16.04-gcc5-py3.6. Other images from the gpuci/clara-genomics-base repository can be used instead by passing the -i argument:

bash ci/local/build.sh -r <Atacworks repo path> -i gpuci/clara-genomics-base:cuda10.0-ubuntu18.04-gcc7-py3.6

Workflow

AtacWorks trains a deep neural network to learn a mapping between noisy (low coverage/low quality) ATAC-Seq data and matching clean (high coverage/high quality) ATAC-Seq data from the same cell type. Once this mapping is learned, the trained model can be applied to improve other noisy ATAC-Seq datasets.

1. Training an AtacWorks model

Input files

To train an AtacWorks model, you need a pair of ATAC-Seq datasets from the same cell type, where one dataset has lower coverage or lower quality than the other. You can also use multiple such pairs of datasets. For each such pair of datasets, AtacWorks requires three input files:

  1. A coverage track representing the number of sequencing reads mapped to each position on the genome in the low-coverage or low-quality dataset. This may be smoothed or processed. Format: bigWig

  2. A coverage track representing the number of sequencing reads mapped to each position on the genome in the high-coverage or high-quality dataset. This may be smoothed or processed in the same way as the previous track. Format: bigWig

  3. The genomic positions of peaks called on the high-coverage or high-quality dataset. These can be obtained by using MACS2 or any other peak caller. Format: either BED or the narrowPeak format produced by MACS2.

The model learns a mapping from (1) to both (2) and (3); in other words, from the noisy coverage track, it learns to predict both the clean coverage track, and the positions of peaks in the clean dataset.
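To make the relationship between the three inputs concrete, here is a small sketch in plain Python. It is an illustration only and not how AtacWorks encodes its data; the function names and the toy values are hypothetical. It assumes peaks are given as BED-style half-open intervals (0-based start, exclusive end).

```python
def peaks_to_labels(peaks, length):
    """Convert half-open (start, end) peak intervals into per-base 0/1 labels."""
    labels = [0] * length
    for start, end in peaks:
        # Clip each interval to the region covered by the track.
        for position in range(max(start, 0), min(end, length)):
            labels[position] = 1
    return labels


def make_training_example(noisy, clean, peaks):
    """Pair the noisy track (input) with the two targets described above."""
    if len(noisy) != len(clean):
        raise ValueError("noisy and clean tracks must cover the same positions")
    return {
        "input": list(noisy),                                # file (1)
        "target_coverage": list(clean),                      # file (2)
        "target_peaks": peaks_to_labels(peaks, len(noisy)),  # file (3)
    }


# Toy six-base region with one peak covering positions 1 to 4.
example = make_training_example(
    noisy=[0, 1, 0, 2, 1, 0],
    clean=[0, 4, 6, 9, 5, 1],
    peaks=[(1, 5)],
)
print(example["target_peaks"])  # [0, 1, 1, 1, 1, 0]
```

The clean coverage track is a regression target (one value per base), while the peak labels form a classification target (peak or not, per base).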

Tutorial

See Tutorial 1 for a workflow detailing the steps of data processing, encoding and model training and how to modify the parameters used in these steps.

2. Denoising and peak calling using a trained AtacWorks model

Downloading pre-trained models

All models described in Lal & Chiang, et al. (2019) are available for download and use at https://atacworks-paper.s3.us-east-2.amazonaws.com.

See pre-trained denoising models for a list of the available pre-trained denoising models.

Before using one of these models, please read the description of how the training datasets for these models were preprocessed, in Lal & Chiang, et al. (2019), Methods section, paragraph 1. If your data processing pipeline is different, it is advisable to train a new model using the instructions above.

See below for instructions to use our pre-trained models or your own trained models.

Input files

To denoise and call peaks from low-coverage/low-quality ATAC-seq data, you need three input files:

  1. A trained AtacWorks model file with extension .pth.tar.

  2. A coverage track representing the number of sequencing reads mapped to each position on the genome in the low-coverage or low-quality dataset. This may be smoothed or processed in the same way as the files used for training the model. Format: bigWig

  3. Chromosome sizes file - a tab-separated text file containing the names and sizes of chromosomes in the genome.
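A chromosome sizes file is easy to get subtly wrong (spaces instead of tabs, extra columns). The sketch below shows one way to validate such a file before running inference; `parse_chrom_sizes` is a hypothetical helper, not an AtacWorks function, and the sample values are illustrative.

```python
def parse_chrom_sizes(text):
    """Parse tab-separated 'name<TAB>size' lines into a dict of name -> size."""
    sizes = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 2 tab-separated fields, got %d"
                % (line_number, len(fields))
            )
        name, size = fields
        sizes[name] = int(size)  # raises ValueError if the size is not an integer
    return sizes


sample = "chr1\t248956422\nchr2\t242193529\n"
print(parse_chrom_sizes(sample))
```

To check a file on disk, read it first, for example `parse_chrom_sizes(open(path).read())`.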

One step denoising + peak calling command

bash AtacWorks/scripts/run_inference.sh -bw <path to bigWig file with test ATAC-seq data> -m <path to model file> -f <path to chromosome sizes file> -o <output directory> -c <path to folder containing config files (optional)>
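If you launch inference from a Python pipeline, the command above can be assembled as an argument list. This is a sketch under assumptions: `build_inference_command` is a hypothetical helper, the file names in the example are placeholders, and only the flags (-bw, -m, -f, -o, -c) come from the command above.

```python
import shlex


def build_inference_command(script, bigwig, model, chrom_sizes, out_dir, config_dir=None):
    """Build the run_inference.sh argument list; -c is added only when given."""
    command = [
        "bash", script,
        "-bw", bigwig,
        "-m", model,
        "-f", chrom_sizes,
        "-o", out_dir,
    ]
    if config_dir is not None:
        command += ["-c", config_dir]
    return command


# Placeholder paths; adjust the script path to where the repository was cloned.
command = build_inference_command(
    script="scripts/run_inference.sh",
    bigwig="noisy.bw",
    model="model.pth.tar",
    chrom_sizes="genome.chrom.sizes",
    out_dir="inference_output",
)
print(" ".join(shlex.quote(argument) for argument in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.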

This command produces a folder containing several files:

  1. _infer_results.track.bw: A bigWig file containing the denoised ATAC-seq coverage track.
  2. infer_results_peaks.bed: A BED file containing the peaks called from the denoised ATAC-seq track. This file has 8 columns, in order:
    • chromosome
    • peak start position
    • peak end position
    • peak length (bp)
    • mean coverage over peak
    • maximum coverage in peak
    • position of summit (relative to start)
    • position of summit (absolute)
  3. _infer_results.peaks.bw: The same peak calls, in the form of a bigWig track for genome browser visualization.
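The 8-column peak file can be loaded into named records for downstream analysis. The sketch below assumes the columns are tab-separated (as is usual for BED), that positions and lengths are written as integers, and that coverage values parse as floats; the field names and the sample line are hypothetical and chosen only to mirror the column list above.

```python
from collections import namedtuple

# Field names mirror the eight columns listed above, in order.
Peak = namedtuple(
    "Peak",
    [
        "chrom", "start", "end", "length",
        "mean_coverage", "max_coverage",
        "summit_offset", "summit_position",
    ],
)


def parse_peak_line(line):
    """Parse one tab-separated line of the peak file into a Peak record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got %d" % len(fields))
    return Peak(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        length=int(fields[3]),
        mean_coverage=float(fields[4]),
        max_coverage=float(fields[5]),
        summit_offset=int(fields[6]),    # relative to start
        summit_position=int(fields[7]),  # absolute
    )


# Made-up line for illustration only.
peak = parse_peak_line("chr1\t100\t300\t200\t5.5\t12.0\t80\t180\n")
print(peak.chrom, peak.start, peak.end, peak.summit_position)
```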

run_inference.sh optionally takes a folder containing config files - specifically, this folder needs to contain two files, infer_config.yaml which specifies parameters for inference, and model_structure.yaml which specifies the structure of the deep learning model. If no folder containing config files is supplied, the folder AtacWorks/configs containing default parameter values will be used.
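When supplying your own config folder, a quick check that both required files are present can save a failed run. This is a minimal sketch; `missing_config_files` is a hypothetical helper, and only the two file names come from the description above.

```python
from pathlib import Path

# The two files that a custom config folder needs to contain.
REQUIRED_CONFIG_FILES = ("infer_config.yaml", "model_structure.yaml")


def missing_config_files(config_dir):
    """Return the required config file names that are absent from config_dir."""
    folder = Path(config_dir)
    return [name for name in REQUIRED_CONFIG_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    # "my_configs" is a placeholder folder name.
    for name in missing_config_files("my_configs"):
        print("Missing config file:", name)
```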

To change output file names or formats, or inference parameters, edit the arguments supplied in infer_config.yaml. Run python AtacWorks/scripts/main.py infer --help to see which arguments are available.

In particular, the threshold for peak calling is controlled by the infer_threshold parameter in infer_config.yaml. By default, this is set to 0.5. If infer_threshold is set to "None" in the config file, run_inference.sh will instead produce a bigWig file in which each base is labeled with the probability (between 0 and 1) that it is part of a peak.
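To illustrate what a threshold on per-base peak probabilities does, the sketch below turns a list of probabilities into contiguous peak intervals. It is not AtacWorks's implementation: the function name is hypothetical, and treating a base as part of a peak when its probability is greater than or equal to the threshold (rather than strictly greater) is an assumption.

```python
def call_peaks(probabilities, threshold=0.5):
    """Return half-open (start, end) intervals where probability >= threshold."""
    peaks = []
    start = None
    for position, probability in enumerate(probabilities):
        if probability >= threshold:
            if start is None:
                start = position  # a new peak begins here
        elif start is not None:
            peaks.append((start, position))  # the current peak ends before this base
            start = None
    if start is not None:
        # The last peak runs to the end of the region.
        peaks.append((start, len(probabilities)))
    return peaks


probabilities = [0.1, 0.7, 0.9, 0.4, 0.5, 0.6, 0.2]
print(call_peaks(probabilities))                 # [(1, 3), (4, 6)]
print(call_peaks(probabilities, threshold=0.8))  # [(2, 3)]
```

Raising the threshold yields fewer, narrower peaks; lowering it yields more, wider ones. Working from the probability track (infer_threshold set to "None") lets you try several thresholds without rerunning inference.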

Advanced usage: step-by-step denoising + peak calling with subcommands

See Tutorial 2 for an advanced workflow detailing the individual steps of data processing, encoding and prediction using a trained model, and how to modify the parameters used in these steps.

FAQ

  1. What's the preferred way for setting up the environment?

    A virtual environment or conda installation is preferred. You can follow the installation instructions on the conda website and then follow the instructions in this README.

Citation

Please cite AtacWorks as follows:

Lal, A., Chiang, Z.D., Yakovenko, N., Duarte, F.M., Israeli, J. and Buenrostro, J.D., 2019. AtacWorks: A deep convolutional neural network toolkit for epigenomics. BioRxiv, p.829481.

Link: https://www.biorxiv.org/content/10.1101/829481v2
