Transcription Factor Binding Site Data

A collection of Python scripts for retrieving and cleaning transcription factor binding site data from several gene regulation databases.

This repository contains data from the following databases:

CollecTF (prokaryotic-wide)
MtbRegList (Mycobacterium tuberculosis)
RegTransBase (prokaryotic-wide)
RegulonDB (Escherichia coli)
DBTBS (Bacillus subtilis)
CoryneRegNet (Corynebacterium species)

In addition to the database-specific csv/tsv files, the data is merged into a single tsv file with the following fields:

column	description
genome_accession	NCBI RefSeq genome accession number
TF	transcription factor
TF_accession	transcription factor accession number
site_start	site start position (0-indexed, inclusive)
site_end	site end position (0-indexed, exclusive)
site_strand	site strand, {+, -}
left_flanking	left flanking region of the site (100bp)
site_sequence	site sequence [ACTG]+
right_flanking	right flanking region of the site (100bp)
regulated_operon	the regulated genes/operon (as originally reported)
mode	TF mode {activator, repressor, dual, undefined}
evidence	list of techniques or PMIDs
database	the source of the site
alternative_database_id	original id used in the source database
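
As an illustration of how the fields above might be consumed, here is a minimal sketch that reads tab-separated records and converts the coordinate columns to integers. It assumes the file starts with a header row using the column names listed above, which this readme does not confirm; `read_sites` and the accession `NC_000000.0` are hypothetical names used only for the example.

```python
import csv
import io

# Columns that hold genomic coordinates and should be parsed as integers.
INT_FIELDS = ("site_start", "site_end")


def read_sites(handle):
    """Yield one dict per record from a tab-separated file-like object.

    Assumes the first line is a header row with the documented column names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for field in INT_FIELDS:
            row[field] = int(row[field])
        yield row


# Tiny in-memory stand-in for merged_data.tsv (placeholder values only).
example = io.StringIO(
    "genome_accession\tTF\tsite_start\tsite_end\tsite_strand\n"
    "NC_000000.0\texampleTF\t10\t30\t+\n"
)
sites = list(read_sites(example))
```

For the real file, the same function could be called as `read_sites(open("merged_data.tsv", newline=""))`, provided the header assumption holds.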

Site locations are 0-indexed: the start index is inclusive and the end index is exclusive. For each record,

if site_strand == '+':
  site_sequence = genome_sequence[site_start:site_end]
else:
  site_sequence = reverse_complement(genome_sequence[site_start:site_end])

The left and right flanking regions are the 100 bp sequences on either side of the site. When they are concatenated with the site (i.e., left_flanking + site_sequence + right_flanking), the joined sequence should be present on one of the two strands of the genome.
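
The coordinate and flanking conventions above can be checked with a short sketch like the following. It assumes the genome is available as a plain uppercase string over A, C, G, T; the function names are illustrative and are not taken from the repository's scripts.

```python
# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_site(genome_sequence, site_start, site_end, site_strand):
    """Recover site_sequence from 0-indexed, half-open coordinates."""
    region = genome_sequence[site_start:site_end]
    return region if site_strand == "+" else reverse_complement(region)


def flanks_are_consistent(genome_sequence, left_flanking, site_sequence,
                          right_flanking):
    """Check that left + site + right occurs on either strand of the genome."""
    joined = left_flanking + site_sequence + right_flanking
    # A match on the reverse strand is equivalent to the reverse complement
    # of the joined sequence matching the forward strand.
    return (joined in genome_sequence
            or reverse_complement(joined) in genome_sequence)
```

This sketch does not handle sites that wrap around the origin of a circular genome; those would need the genome string to be extended before checking.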

Concatenated data

The csv files (one from each database) are concatenated into merged_data.tsv (see reformat.py). RegTransBase is not included in the merged file, as most of its sites don’t have any associated experimental evidence (see the regtransbase readme file).

Removing duplicates

The same binding site may be present in multiple databases at slightly different genomic locations (e.g. [x, x+19] vs. [x-1, x+19]). Such duplicates are removed from the final merged data.

If two sites for the same TF in the same genome overlap by more than 75% of their combined length, one of them is kept and the other is discarded. The choice of which to keep is arbitrary.
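
A minimal sketch of this rule follows. It is not the code in reformat.py: it assumes "combined length" means the span covered by the two sites together (their union), treats coordinates as 0-indexed and half-open, and resolves the arbitrary choice by keeping whichever site comes first. The function names are hypothetical.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Overlap length divided by the length of the union of two intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    combined = max(a_end, b_end) - min(a_start, b_start)
    return overlap / combined


def remove_duplicates(sites, threshold=0.75):
    """Keep the first of any group of near-identical sites.

    Two sites are treated as duplicates when they share genome_accession
    and TF and their overlap fraction exceeds the threshold.
    """
    kept = []
    for site in sites:
        is_duplicate = any(
            site["genome_accession"] == other["genome_accession"]
            and site["TF"] == other["TF"]
            and overlap_fraction(site["site_start"], site["site_end"],
                                 other["site_start"], other["site_end"])
            > threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(site)
    return kept
```

The pairwise comparison is quadratic in the number of sites; for large inputs, grouping by genome and TF and sorting by start position first would reduce the work.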
