Collection of Python scripts for getting and cleaning transcription factor binding site data from a collection of gene regulation databases.
This repository contains data from following databases.
- CollecTF (prokaryotic-wide)
- MtbRegList (Mycobacterium tuberculosis)
- RegTransBase (prokaryotic-wide)
- RegulonDB (Escherichia coli)
- DBTBS (Bacillus subtilis)
- CoryneRegNet (Corynebacterium species)
In addition to database-specific csv/tsv files, the data is merged into one tsv file having following fields
column | description |
---|---|
genome_accession | NCBI RefSeq genome accession number |
TF | transcription factor |
TF_accession | transcription factor accession number |
site_start | site start position (0-indexed, inclusive) |
site_end | site end position (0-indexed, exclusive) |
site_strand | site strand, {+, -} |
left_flanking | left flanking region of the site (100bp) |
site_sequence | site sequence [ACTG]+ |
right_flanking | right flanking region of the site (100bp) |
regulated_operon | the regulated genes/operon (as originally reported) |
mode | TF mode {activator, repressor, dual, undefined} |
evidence | list of techniques or PMIDs |
database | the source of the site |
alternative_database_id | original id used in the source database |
The site location is 0 indexed. Start index is inclusive and end index is exclusive. For each record,
if site_strand == +1:
site_sequence = genome_sequence[start:end]
else:
site_sequence = reverse_complement(genome_sequence[start:end])
Left and right flanking regions are 100 bp sequences on both ends of the
site. When they are concatenated with the site (i.e left_flanking +
site_sequence + right_flanking
), the joined sequence should be present in
either strand of the genome.
The csv files (one from each database) are concatenated into
merged_data.tsv
(see reformat.py
). The RegTransBase is not included
in the merged file as most of its sites don’t have any associated experimental
evidence (see regtransbase
readme file).
It is possible that the same binding site may be present in multiple databases in slightly different genomic location (e.g. [x, x+19] vs. [x-1, x+19]). Such duplicates are removed from the final merged data.
If two sites, from the same TF and same genome, overlap more than 75% of the combined length, one of such sites is selected and the other one is discarded. The selection of one over the other is arbitrary.