Getting and cleaning bacterial transcription factor binding site data from several databases

ErillLab/TFBS_data

Transcription Factor Binding Site Data

Collection of Python scripts for getting and cleaning transcription factor binding site data from a collection of gene regulation databases.

This repository contains data from the following databases.

In addition to the database-specific csv/tsv files, the data is merged into a single tsv file with the following fields:

| column | description |
| --- | --- |
| genome_accession | NCBI RefSeq genome accession number |
| TF | transcription factor |
| TF_accession | transcription factor accession number |
| site_start | site start position (0-indexed, inclusive) |
| site_end | site end position (0-indexed, exclusive) |
| site_strand | site strand, {+, -} |
| left_flanking | left flanking region of the site (100bp) |
| site_sequence | site sequence [ACTG]+ |
| right_flanking | right flanking region of the site (100bp) |
| regulated_operon | the regulated genes/operon (as originally reported) |
| mode | TF mode {activator, repressor, dual, undefined} |
| evidence | list of techniques or PMIDs |
| database | the source of the site |
| alternative_database_id | original id used in the source database |
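
As a rough illustration of how these fields might be consumed, the sketch below reads a tab-separated file with Python's standard csv module and converts the coordinates to integers. It assumes the merged file starts with a header row carrying exactly the column names listed above, which this README does not state explicitly. `read_sites` is a hypothetical helper, and the record is made-up toy data with shortened flanks.

```python
import csv
import io

# Column names as listed in the field table above.
FIELDS = [
    "genome_accession", "TF", "TF_accession", "site_start", "site_end",
    "site_strand", "left_flanking", "site_sequence", "right_flanking",
    "regulated_operon", "mode", "evidence", "database",
    "alternative_database_id",
]

def read_sites(handle):
    """Yield one dict per record, with coordinates converted to int.

    Assumption: the file has a header row using the column names above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        row["site_start"] = int(row["site_start"])
        row["site_end"] = int(row["site_end"])
        yield row

# Toy, made-up record (4 bp flanks instead of 100 bp) to exercise the parser.
toy_values = ["NC_000000.0", "ExampleTF", "XX_000000.0", "10", "14", "+",
              "AAAA", "ACGT", "TTTT", "geneA", "repressor", "toy",
              "toy_db", "id1"]
toy_tsv = "\t".join(FIELDS) + "\n" + "\t".join(toy_values) + "\n"

sites = list(read_sites(io.StringIO(toy_tsv)))
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, for example `open("merged_data.tsv", newline="")`.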

Site coordinates are 0-indexed: the start index is inclusive and the end index is exclusive. For each record,

# site_strand is '+' or '-' (see the site_strand field above)
if site_strand == '+':
    site_sequence = genome_sequence[site_start:site_end]
else:
    site_sequence = reverse_complement(genome_sequence[site_start:site_end])

The left and right flanking regions are the 100 bp sequences on either side of the site. When they are concatenated with the site (i.e., left_flanking + site_sequence + right_flanking), the joined sequence should occur on one of the two strands of the genome.
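
The coordinate and flanking conventions above can be checked with a few lines of Python. This is a minimal sketch, not code from this repository: `reverse_complement`, `extract_site`, and `flanks_consistent` are hypothetical helpers, the strand is assumed to be stored as `+` or `-` as in the field table, and the genome is a short made-up string with 3 bp flanks instead of 100 bp.

```python
# Translation table for complementing DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_site(genome_sequence, site_start, site_end, site_strand):
    """Slice the 0-indexed, half-open interval and orient it by strand."""
    segment = genome_sequence[site_start:site_end]
    return segment if site_strand == "+" else reverse_complement(segment)

def flanks_consistent(genome_sequence, left_flanking, site_sequence,
                      right_flanking):
    """Check that flank + site + flank occurs on either genome strand."""
    joined = left_flanking + site_sequence + right_flanking
    return (joined in genome_sequence
            or reverse_complement(joined) in genome_sequence)

# Toy genome, made up for illustration only.
genome = "TTGACCATGGCATTACGGA"

plus_site = extract_site(genome, 6, 10, "+")
minus_site = extract_site(genome, 6, 10, "-")

# Forward-strand record: flanks read directly from the genome.
plus_ok = flanks_consistent(genome, "ACC", plus_site, "CAT")
# Reverse-strand record: the joined sequence matches only after
# reverse complementing, so the second branch of the check is used.
minus_ok = flanks_consistent(genome, "ATG", minus_site, "GGT")
```

Searching a whole genome with `in` is adequate for a spot check. Validating many records would be faster by comparing slices at the recorded coordinates instead.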

Concatenated data

The csv files (one from each database) are concatenated into merged_data.tsv (see reformat.py). RegTransBase is not included in the merged file because most of its sites have no associated experimental evidence (see the RegTransBase readme file).
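
The concatenation step could look roughly like the sketch below. It is not the code in reformat.py. It assumes each per-database file is comma-separated with a header row, and that the `database` column is filled in from a label chosen by the caller. `concatenate` and the toy inputs are hypothetical. A database such as RegTransBase is left out simply by not passing it in.

```python
import csv
import io

def concatenate(sources, out_handle, fieldnames):
    """Write rows from several comma-separated sources into one TSV.

    `sources` maps a database label to an open text handle. Columns not
    listed in `fieldnames` are dropped; missing ones are left empty.
    """
    writer = csv.DictWriter(out_handle, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n",
                            extrasaction="ignore")
    writer.writeheader()
    for label, handle in sources.items():
        for row in csv.DictReader(handle):
            row["database"] = label  # record where the site came from
            writer.writerow(row)

# Toy inputs with a reduced set of columns, made up for illustration.
db_a = "TF,site_start,site_end\nExampleTF,10,14\n"
db_b = "TF,site_start,site_end\nOtherTF,3,9\n"

out = io.StringIO()
concatenate(
    {"toy_db_a": io.StringIO(db_a), "toy_db_b": io.StringIO(db_b)},
    out,
    ["TF", "site_start", "site_end", "database"],
)
merged = out.getvalue()
```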

Removing duplicates

The same binding site may be present in multiple databases at slightly different genomic locations (e.g. [x, x+19] vs. [x-1, x+19]). Such duplicates are removed from the final merged data.

If two sites for the same TF in the same genome overlap by more than 75% of their combined length, one of them is kept and the other is discarded. The choice of which site to keep is arbitrary.
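
One way to express this rule in Python is sketched below. It is an illustration, not the repository's implementation. "Combined length" is interpreted here as the length of the union of the two intervals, which is an assumption. The sketch keeps whichever site it encounters first, one arbitrary choice among many. `overlap_fraction` and `remove_duplicates` are hypothetical names, and the records are toy data.

```python
from collections import defaultdict

def overlap_fraction(a, b):
    """Overlap length divided by union length of two half-open intervals.

    Assumption: 'combined length' means the length of the union.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - overlap
    return overlap / union if union else 0.0

def remove_duplicates(sites, threshold=0.75):
    """Keep the first site seen among near-identical ones per (genome, TF)."""
    kept_by_group = defaultdict(list)
    kept = []
    for site in sites:
        group = kept_by_group[(site["genome_accession"], site["TF"])]
        interval = (site["site_start"], site["site_end"])
        is_duplicate = any(
            overlap_fraction(interval, (k["site_start"], k["site_end"]))
            > threshold
            for k in group
        )
        if not is_duplicate:
            group.append(site)
            kept.append(site)
    return kept

def toy_site(tf, start, end, database):
    """Build a minimal made-up record for the example."""
    return {"genome_accession": "toy_genome", "TF": tf,
            "site_start": start, "site_end": end, "database": database}

toy_sites = [
    toy_site("ExampleTF", 100, 119, "toy_db_a"),
    toy_site("ExampleTF", 99, 119, "toy_db_b"),   # shifted by one: duplicate
    toy_site("ExampleTF", 110, 129, "toy_db_b"),  # partial overlap: kept
    toy_site("OtherTF", 100, 119, "toy_db_a"),    # different TF: kept
]
unique_sites = remove_duplicates(toy_sites)
```

Each site is compared against every kept site in its group, so the cost grows quadratically with group size. Sorting by start position would allow an early exit if that becomes a bottleneck.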
