Functional prediction of transcription factors driving differential gene expression

i-strielkov/tfdrive

TFdrive

TFdrive is an algorithm for functional prediction of transcription factors (TFs) driving differential gene expression. It takes a list of differentially expressed genes (identified, e.g., by RNA-seq or microarray analysis) and returns a list of TFs ranked by probability score. Currently, the TFdrive library contains 173 human TFs.

Method Details

Accurate prediction of the TFs driving changes in gene expression remains a challenging task. Current methods are predominantly based on available TF binding site data. However, the significance of a given TF binding site often depends heavily on cell type, experimental conditions, and the activity of other TFs. As a result, such methods tend to produce many false positives.
In contrast, TFdrive predictions are based on the overlap of KEGG pathways/GO terms between differentially expressed genes (DEGs) and a TF. The algorithm takes into account relative pathway/term importance and the number of genes related to each of the overlapping pathways/terms. These data are further supplemented with an analysis of gene-TF association tables obtained from two manually curated databases, TRANSFAC and ChEA. These associations are predominantly inferred from TF binding site data. Because several TFs from the same family are often able to bind the same motif, the ratio between gene-TF association frequencies among DEGs vs. non-DEGs is calculated for each TF family using the TRANSFAC and ChEA tables. Family-based scores appear to have more predictive value in this context than scores calculated for individual TFs.
Next, probability scores are calculated using a previously built random forest model. Finally, all of the scores mentioned above are combined into the final probability score using logistic regression. Both the random forest and logistic regression models were trained on the results of 443 human TF knockout/knockdown and overexpression experiments available on the ChEA3 website. TFs that were among the DEGs were considered to belong to the positive class.
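The family-level ratio described above can be illustrated with a minimal sketch. This is not the TFdrive implementation: the function name `family_ratios`, the dictionary layout of the association table, the placeholder TF and family names, and the pseudocount used to avoid division by zero are all assumptions made for illustration.

```python
from collections import defaultdict


def family_ratios(associations, tf_family, degs, background, pseudocount=1.0):
    """Ratio of gene-TF association frequency among DEGs vs. non-DEGs per TF family.

    associations: dict, TF name -> set of associated gene IDs (hypothetical layout)
    tf_family:    dict, TF name -> family name
    degs:         iterable of differentially expressed gene IDs
    background:   iterable of all gene IDs considered (DEGs and non-DEGs)
    pseudocount:  smoothing term (an assumption, not part of the source description)
    """
    background = set(background)
    degs = set(degs) & background
    non_degs = background - degs
    if not degs or not non_degs:
        raise ValueError("need at least one DEG and one non-DEG")

    # Pool the target genes of all TFs that belong to the same family.
    family_targets = defaultdict(set)
    for tf, genes in associations.items():
        family_targets[tf_family[tf]] |= set(genes)

    ratios = {}
    for family, targets in family_targets.items():
        deg_freq = (len(targets & degs) + pseudocount) / (len(degs) + pseudocount)
        non_freq = (len(targets & non_degs) + pseudocount) / (len(non_degs) + pseudocount)
        ratios[family] = deg_freq / non_freq
    return ratios


# Toy data with made-up gene IDs and placeholder TF/family names.
example_ratios = family_ratios(
    associations={"TF_A": {1, 2, 7}, "TF_B": {3, 8}, "TF_C": {5, 9, 10}},
    tf_family={"TF_A": "FAM1", "TF_B": "FAM1", "TF_C": "FAM2"},
    degs={1, 2, 3, 4},
    background=range(1, 11),
)
print(example_ratios)
```

A ratio above 1 means the family's targets are over-represented among the DEGs; a ratio below 1 means they are under-represented.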
The data from GSE50588 are used here to demonstrate TFdrive's predictive performance. This dataset contains the microarray results of 59 human transcription factor knockdowns. Differentially expressed TFs were assigned to the positive class. Although they presumably represent only a fraction of the actual positives, the number of correctly identified TFs is expected to correlate with the number of all TFs involved in differential gene expression. Therefore, the results of such an analysis may give a general impression of the algorithm's predictive accuracy. A similar approach to model evaluation was previously employed in the development of ChEA3 (see this publication for details). Note that, for testing purposes, KEGG pathway/GO term information attributed to differentially expressed TFs was excluded from the analysis to avoid data leakage.

The R version of the algorithm has similar predictive efficiency in this test (ROC AUC = 0.868, PR AUC = 0.861; see the Jupyter Notebook file here for details). However, since it relies on a different logistic regression library (glmnet) and currently does not use random forest, its results can differ slightly from those of the Python library.
In summary, although TFdrive is not designed to discover new TF-gene interactions, it can identify, with a high level of precision, the major players driving differential gene expression among known TFs in new experimental data.
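For readers unfamiliar with the ROC AUC metric quoted above, the following dependency-free sketch shows one way to compute it from class labels and probability scores. It is only an illustration with made-up numbers, not the evaluation code used for TFdrive; in practice, scikit-learn's `roc_auc_score` would be the usual choice. The function name `roc_auc` is an assumption.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative
    (ties count as one half)."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = TF among DEGs, i.e. positive class) and scores.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of TFs, which is acceptable for a library of a few hundred TFs but not for large datasets.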

Dependencies

Python

TFdrive supports Python 3.6+. Usage requires numpy, scipy, pandas, pyarrow, scikit-learn, and joblib.

R

TFdrive works with R 3.6.1+ and uses the glmnet library to calculate the probability scores.

Installation

Python

Download tfdrive_files.7z from here (210 MB) and unpack tfdrive.py and the tfdrive_data folder to your project folder.

R

Use the devtools library for installation:

devtools::install_github("i-strielkov/tfdrive/r")

Usage

After importing tfdrive, call tfpred with a list of gene Entrez IDs as the argument. The values may be either integers or strings. Currently, the method works only with human genes. tfpred returns a data frame containing a ranked list of TFs with their associated probability scores. The higher the score, the more likely the TF is to be involved in the observed changes in gene expression. Note that TFs from the library that are found among the DEGs are excluded from the final results.
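A minimal Python sketch of this call is shown below. The `prepare_entrez_ids` helper is hypothetical and optional (tfpred accepts integers or strings directly); it only removes duplicates and stray whitespace. The gene IDs are illustrative and not taken from a real experiment, and the `.head(10)` call assumes the returned object is a pandas DataFrame. The import is guarded so the snippet still runs when tfdrive.py and the tfdrive_data folder are not present in the project folder.

```python
def prepare_entrez_ids(ids):
    """Convert Entrez IDs given as ints or digit strings into a
    de-duplicated list of ints, preserving the original order."""
    seen = set()
    cleaned = []
    for raw in ids:
        gene_id = int(str(raw).strip())
        if gene_id not in seen:
            seen.add(gene_id)
            cleaned.append(gene_id)
    return cleaned


# Illustrative Entrez IDs only; replace with your own list of DEGs.
deg_ids = prepare_entrez_ids([7157, "1026", " 4193 ", 7157])

try:
    import tfdrive  # expects tfdrive.py and tfdrive_data in the project folder
except ImportError:
    tfdrive = None

if tfdrive is not None:
    ranked_tfs = tfdrive.tfpred(deg_ids)
    # Assumes a pandas DataFrame; show the top-ranked TFs.
    print(ranked_tfs.head(10))
```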
To get a general idea of how the probability scores relate to the actual probability of TF involvement, consider their distribution in the example shown above:



As you can see, TFs with probability scores above roughly 0.45 to 0.5 are much more likely to belong to the positive class than to the negative class. The distribution of probability scores for the R library is expected to be slightly different.

Future plans

  • Increasing TF library and training dataset
  • A model for prediction of mouse TFs
  • Other tweaks to the algorithm

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
