Please note: this repository is a work in progress (WIP); each folder marked WIP will be updated soon.
All protein domain analyses are based on data from InterPro version 75.0. All associated data can be found on the FTP site for version 75.0, accessible from the general download site. The full analysis is depicted in the following image:
Summary of the approach and code, divided into four parts: building two forms of domain architectures, training domain embeddings, and performing intrinsic and extrinsic evaluation of the embeddings.
The code was run in a conda environment; the full list of dependencies is in conda_env_dependencies.txt.
The main dependencies are listed below:
- Python 3.7.6
- BioPython 1.74
- Gensim 3.8.0
- Pytorch 1.2.0
- Torchtext 0.4.0
- Numpy 1.18.1
- Pandas 1.0.1
- Scikit-learn 0.22.1
- Matplotlib 3.1.1
- Intervaltree 3.0.2
- Treelib 1.5.5
- Data acquisition:
- For InterPro version 75.0, download the files:
- match_complete.xml.gz
- protein2ipr.dat.gz
- Get protein lengths by parsing match_complete.xml:
- change the folder and file paths appropriately in proteinXMLHandler_run.py
- run proteinXMLHandler_run.py
- a prot_id_len tabular file will be created; a sample of the first 100 lines of the full file is provided as a sample file
- Get the domains and evidence database IDs per protein:
- select the output domain annotation type: overlapping, non-overlapping or non-redundant. Then set whether the GAP domain is also added to the annotations. Change the folder and file paths appropriately and uncomment the first section in main.py
- parse the domain hits per protein by running main.py
- an id_domains_type.tab file will be created; a sample of the first 100 lines of the full file, for non-overlapping with GAP, is provided as a sample file
- Get the domain architecture corpus:
- change the folder and file paths appropriately and uncomment the first section in main.py
- run main.py
- a domains_corpus_type.txt file will be created; a sample of the first 100 lines of the full file, for non-overlapping with GAP, is provided as a sample file
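As an illustration of the annotation and corpus steps above, the following minimal sketch turns the domain hits of one protein into a non-overlapping architecture line with optional GAP tokens. It is not the code in main.py: the function name non_overlapping_architecture, the greedy overlap rule (keep the first hit by start position), the min_gap threshold of 30 residues and the example hits are all assumptions made for illustration.

```python
from typing import List, Tuple

# One domain hit: (InterPro id, start, end), 1-based inclusive positions.
Hit = Tuple[str, int, int]


def non_overlapping_architecture(hits: List[Hit], protein_len: int,
                                 add_gap: bool = True,
                                 min_gap: int = 30) -> List[str]:
    """Return domain ids ordered along the sequence, with optional GAP tokens.

    Assumed rules (hypothetical, for illustration only):
    - hits are visited by start position and a hit is dropped if it
      overlaps the previously kept hit;
    - an unannotated stretch of at least `min_gap` residues becomes "GAP".
    """
    tokens: List[str] = []
    last_end = 0  # end of the last kept hit; 0 means "before the sequence"
    for dom_id, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if start <= last_end:
            continue  # overlaps the previously kept hit, skip it
        if add_gap and start - last_end - 1 >= min_gap:
            tokens.append("GAP")
        tokens.append(dom_id)
        last_end = end
    # Unannotated tail of the protein.
    if add_gap and protein_len - last_end >= min_gap:
        tokens.append("GAP")
    return tokens


# Made-up hits for a 300-residue protein.
hits = [("IPR000001", 10, 80), ("IPR000002", 60, 120), ("IPR000003", 200, 260)]
arch = non_overlapping_architecture(hits, protein_len=300)
print(" ".join(arch))  # one corpus line: domain ids separated by spaces
```

Writing one such line per protein gives a corpus in which each protein is a "sentence" of domain tokens, which is the input shape a word2vec-style model expects.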
- Needed data:
- the domains_corpus_type.txt file from the last step
- Train a word2vec model on the domain architecture corpus:
- change the folder and file paths appropriately in word2vec_run.py
- change the paths and the training parameters in the provided bash script run_embs.sh
- run run_embs.sh
- the word2vec embedding file(s) in standard txt format will be created
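The embedding files from the last step can be inspected without any third-party library. The sketch below assumes the files follow the usual word2vec text layout: a header line with the vocabulary size and the vector dimension, then one line per token followed by its vector components. The function name load_word2vec_txt and the toy vectors are hypothetical and not part of this repository.

```python
import io
from typing import Dict, List, TextIO


def load_word2vec_txt(handle: TextIO) -> Dict[str, List[float]]:
    """Read embeddings from a word2vec-style text stream (assumed layout)."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values for token {parts[0]!r}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("vocabulary size does not match the header")
    return vectors


# Toy content standing in for a real embedding file (made-up values).
toy_file = io.StringIO(
    "3 2\n"
    "IPR000001 0.1 0.2\n"
    "GAP -0.3 0.4\n"
    "IPR000003 0.5 -0.6\n"
)
embeddings = load_word2vec_txt(toy_file)
print(sorted(embeddings))
```

For a real file, open it with open(path, encoding="utf-8") and pass the handle to the same function.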
Data and examples for running the experiments on:
- Domain hierarchy
- Data acquisition:
- For InterPro version 75.0, download the ParentChildTreeFile.txt file
- Parse the parent-child relations:
- uncomment the domain hierarchy section in intrinsic_eval_run.py
- parse the parent-child relations using parse_parent_child_file()
- interpro_parsed_tree.txt will be created; the first 3 InterPro parents of the full parsed tree are provided as a sample file
- Run evaluation:
- run the evaluation with the rest of the section, using the looped get_nn_calculate_precision_recall_atN()
- the outputs will be: the average recall value, a recall histogram (png), and a diagnostic histogram for parents with recall 0 (if the parameter is selected)
- example outputs can be found in Table 1 and Figures S1 and S2, respectively, of the bioRxiv manuscript below
- SCOPe and EC
- Data acquisition:
- For InterPro version 75.0, download and decompress the interpro.xml.gz file
- Parse interpro.xml:
- uncomment the EC & SCOPe section in intrinsic_eval_run.py
- parse the XML to get the available SCOPe and EC labels per domain using parse_and_save_EC_SCOP()
- interpro2EC_SCOPe.tab will be created; a sample of the first 100 lines of the full file is provided as a sample file
- Run evaluation:
- initialize the EC_SCOP_Evaluate() class for evaluation using EC or SCOPe
- run the evaluation with the rest of the section, using the looped run_classification()
- the average test accuracy over 5-fold cross-validation will be printed; example values can be found in Tables 2 and 3 of the bioRxiv manuscript below
- GO molecular function
- Data acquisition:
- For InterPro version 75.0, download the interpro2go file and add the suffix .txt
For each organism (malaria, ecolik12, yeast, human), follow these steps:
- Parse interpro2go.txt:
- uncomment the GOEvaluate section in intrinsic_eval_run.py
- parse the txt file using convert_go_labels(), producing:
- interpro2go_organism_MF.tab, containing the unprocessed available GO MF labels per domain; a sample of the first 100 lines of the full file for yeast is provided as a sample file
- interpro2go_yeast_MF_labels.csv, containing the GO MF labels after abstracting them; a sample of the first 100 lines of the full file for yeast is provided as a sample file
- Run evaluation:
- initialize the GOEvaluate() class for evaluation in the selected organism
- run the evaluation with the rest of the section, using the looped run_classification()
- the average test accuracy over 5-fold cross-validation will be printed; an example can be found in Table 4 of the bioRxiv manuscript below
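To make the domain hierarchy evaluation above more concrete, here is a minimal sketch of a recall-at-N computation: for one parent domain, retrieve its N nearest neighbours in embedding space and measure which fraction of its known children is among them. This is not the repository's get_nn_calculate_precision_recall_atN(); the function name recall_at_n, the use of cosine similarity, the choice of N and the toy vectors are assumptions for illustration.

```python
import math
from typing import Dict, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_n(parent: str, children: Set[str],
                emb: Dict[str, List[float]], n: int) -> float:
    """Fraction of the parent's embedded children among its n nearest neighbours."""
    known = {c for c in children if c in emb}
    if parent not in emb or not known:
        raise ValueError("parent or all children are missing from the embeddings")
    # Rank every other domain by similarity to the parent, most similar first.
    ranked = sorted((d for d in emb if d != parent),
                    key=lambda d: cosine(emb[parent], emb[d]),
                    reverse=True)
    return len(known & set(ranked[:n])) / len(known)


# Toy 2-dimensional embeddings (made-up ids and values).
emb = {
    "P": [1.0, 0.0],
    "C1": [0.9, 0.1],
    "C2": [0.0, 1.0],
    "X": [0.8, 0.3],
}
print(recall_at_n("P", {"C1", "C2"}, emb, n=2))
```

Averaging this value over all parents in the parsed tree would give one summary number per embedding model, comparable across training settings.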
Data and examples for running cross-validation and performance experiments on three data sets:
- TargetP
- Toxin
- NEW
This repository is the implementation of the bioRxiv research paper: