JASPAR UCSC tracks

This repository contains the data and code used to generate the JASPAR UCSC Genome Browser track data hub.
For more information visit the JASPAR website.

News

01/07/2018 To speed up TFBS predictions, we switched from MEME and the Perl TFBS package to PWMScan.

Content

The folder genomes contains scripts to download and process different genome assemblies.
The folder profiles contains the output of the script get_profiles.py, which downloads JASPAR CORE profiles for different taxa.
The script scan_sequence.py takes as input the profiles folder and a nucleotide sequence in FASTA format (e.g. a genome), and provides TFBS predictions.
The script scans2bigBed creates a bigBed track file from TFBS predictions.
The file environment.yml contains the conda environment (see Installation) used to generate the genomic tracks for JASPAR 2020.

The original scripts used for the publication of JASPAR 2018 have been placed in the folder version-1.0.

Dependencies

GNU parallel
Python 3.7 with the Biopython (<1.74), NumPy, pyfaidx and tqdm libraries
PWMScan
UCSC binaries for standalone command-line use

Note that for running scan_sequence.py, only the Python dependencies and PWMScan are required.
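
Before running anything, it can help to confirm that the dependencies listed above are visible from the current environment. The following is a minimal sketch, not part of this repository. The executable names are assumptions: parallel for GNU parallel, bedToBigBed for the UCSC binaries, and matrix_scan for PWMScan. Adjust them to match your installation.

```python
import importlib.util
import shutil

# Import names of the Python libraries listed above (Biopython imports as "Bio").
PYTHON_MODULES = ["Bio", "numpy", "pyfaidx", "tqdm"]

# Assumed executable names; they are not confirmed by this README.
EXECUTABLES = ["parallel", "bedToBigBed", "matrix_scan"]


def find_missing(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return (missing_modules, missing_executables) for the current environment."""
    # find_spec returns None when a top-level module cannot be located.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


if __name__ == "__main__":
    mods, exes = find_missing()
    print("Missing Python modules:", ", ".join(mods) or "none")
    print("Missing executables:", ", ".join(exes) or "none")
```

If only scan_sequence.py is needed, the check can be restricted to the Python modules and the PWMScan executable.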

Installation

Except for PWMScan, which has to be downloaded, installed, and appended to your PATH manually, the remaining dependencies can be installed through the conda package manager:

conda env create -f ./environment.yml

Availability

Genomic tracks and TFBS predictions for human and 6 other model organisms are available online:

http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/

Usage

To illustrate the generation of genomic tracks, we provide an example for the baker's yeast genome:

Download the genome sequence and chromosome sizes (automated in this script)
Scan the genome sequence using all fungi profiles from the JASPAR CORE

./scan_sequence.py --fasta-file ./genomes/sacCer3/sacCer3.fa --profiles-dir ./profiles/ \
    --output-dir ./tracks/sacCer3/ --threads 4 --latest --taxon fungi

For this example, this step should take no longer than a minute. For human and similarly sized genomes, it should complete within a few hours; the exact time depends on the number of --threads specified.
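
Since the running time depends on --threads, one option is to derive the thread count from the machine instead of hard-coding it. The sketch below only assembles the argument list from the flags shown in the command above. The helper build_scan_command is hypothetical and not part of this repository.

```python
import os


def build_scan_command(fasta_file, profiles_dir, output_dir, taxon, threads=None):
    """Build the scan_sequence.py argument list using the flags from the example."""
    if threads is None:
        # os.cpu_count() can return None, so fall back to a single thread.
        threads = os.cpu_count() or 1
    return [
        "./scan_sequence.py",
        "--fasta-file", fasta_file,
        "--profiles-dir", profiles_dir,
        "--output-dir", output_dir,
        "--threads", str(threads),
        "--latest",
        "--taxon", taxon,
    ]


if __name__ == "__main__":
    cmd = build_scan_command(
        "./genomes/sacCer3/sacCer3.fa", "./profiles/", "./tracks/sacCer3/", "fungi"
    )
    # Print rather than execute; pass cmd to subprocess.run(cmd, check=True) to run it.
    print(" ".join(cmd))
```

Using every available core is not always desirable on a shared machine, so passing an explicit threads value may be preferable there.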

Create the genomic track

./scans2bigBed -c ./genomes/sacCer3/sacCer3.chrom.sizes -i ./tracks/sacCer3/ -o ./tracks/sacCer3.bb -t 4

TFBS predictions from the previous step are merged into a bigBed track file. As scores (column 5), we use p-values from PWMScan scaled to the range 0-1000, where 0 corresponds to a p-value of 1 and 1000 to a p-value ≤ 10^-10. This allows prediction confidence to be compared across TFBSs. Again, for this example, this step should be completed within a few minutes, while for larger genomes it can take a few hours.
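
To make the score scaling concrete, here is a minimal sketch of one mapping that matches the two endpoints stated above (p-value = 1 gives 0, p-value ≤ 10^-10 gives 1000). It assumes the scaling is linear in -log10(p-value) between those endpoints; the README states only the endpoints, so the exact formula used by scans2bigBed may differ. The function name pvalue_to_score is hypothetical.

```python
import math


def pvalue_to_score(p_value, max_score=1000, min_p_value=1e-10):
    """Map a p-value in (0, 1] to an integer score in [0, max_score].

    Assumption: the score is linear in -log10(p-value) and capped at min_p_value.
    """
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in the interval (0, 1]")
    # p-values below the cap all receive the maximum score.
    capped = max(p_value, min_p_value)
    fraction = math.log10(capped) / math.log10(min_p_value)
    return int(round(max_score * fraction))


if __name__ == "__main__":
    for p in (1.0, 1e-3, 1e-5, 1e-10, 1e-20):
        print(p, pvalue_to_score(p))
```

Under this assumption, each tenfold decrease in p-value adds 100 to the score until the cap is reached.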

Important note: both disk space and memory requirements for large genomes (i.e. danRer11, hg19, hg38 and mm10) are substantial. In these cases, we highly recommend allocating at least 1 TB of disk space and 512 GB of RAM.
