bio-ontology-research-group/pgsim

Evaluation of Semantic Similarity Measures

This repository contains the Python and Groovy scripts used in a project that evaluates semantic similarity measures for their sensitivity to annotation size and difference.

Scripts for data generation

  1. annotations.py - This script reformats the database annotations. It reads gene_association files in GAF version 2.0 format and creates an annotations file in which each line represents a gene and its annotations, separated by tabs.

  2. gen_annotations.py - This script generates random annotations of the same size as those in the files produced by annotations.py.
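
As a rough illustration of the two steps above, the following sketch collapses GAF 2.0 lines into one tab-separated line per gene and then draws random annotation sets of the same size. It is not the code from annotations.py or gen_annotations.py. The function names, the choice of the DB Object ID column as the gene identifier, the lack of any filtering by qualifier or evidence code, and the uniform sampling from a term pool are all assumptions made for the example.

```python
import random
from collections import defaultdict


def gaf_to_annotations(gaf_lines):
    """Collect the set of GO IDs per gene from GAF 2.0 lines."""
    annotations = defaultdict(set)
    for line in gaf_lines:
        # GAF header and comment lines start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip malformed rows
        # GAF 2.0: column 2 is the DB Object ID, column 5 is the GO ID.
        gene_id, go_id = cols[1], cols[4]
        annotations[gene_id].add(go_id)
    return annotations


def format_annotations(annotations):
    """One line per gene: gene ID followed by its terms, tab-separated."""
    return ["\t".join([gene] + sorted(terms))
            for gene, terms in sorted(annotations.items())]


def randomize_annotations(annotations, term_pool, seed=0):
    """Give each gene a random term set of the same size as its real one.

    Assumes term_pool holds at least as many terms as the largest set.
    """
    rng = random.Random(seed)
    pool = sorted(term_pool)
    return {gene: set(rng.sample(pool, len(terms)))
            for gene, terms in annotations.items()}
```

With real data, `gaf_lines` would be an open gene_association file, and the lines returned by `format_annotations` would be written to the annotations file.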

Scripts for computing similarity measures

  1. Sim.groovy, SimPairwise.groovy - These Groovy scripts compute groupwise and pairwise similarity measures for a given annotations file. They require the Gene Ontology file in OBO format. Each outputs a file with the similarity values of every entry against all other entries.
  2. SimHP.groovy, SimHPPairwise.groovy - These Groovy scripts compute groupwise and pairwise similarity measures for a given annotations file. They require the Human Phenotype Ontology file in OBO format. Each outputs a file with the similarity values of every entry against all other entries.
  3. SimGDPairwise.groovy - This Groovy script computes pairwise similarity measures between gene and disease annotations. It requires the Human Phenotype Ontology file in OBO format. It outputs a file with the similarity values of every gene against all diseases.
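
To make the "every entry against all other entries" output concrete, here is a small sketch of one groupwise measure, written in Python rather than Groovy. It computes the Jaccard index over ancestor-closed term sets. This measure is chosen only for illustration and is not necessarily one that the scripts above compute. The `parents` mapping stands in for an ontology parsed from an OBO file, and every name here is hypothetical.

```python
def ancestors_closure(terms, parents):
    """Return the terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def jaccard_groupwise(terms_a, terms_b, parents):
    """Jaccard index of two annotation sets after ancestor closure."""
    a = ancestors_closure(terms_a, parents)
    b = ancestors_closure(terms_b, parents)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def all_vs_all(annotations, parents):
    """Similarity of every entry against every entry, itself included."""
    genes = sorted(annotations)
    return {(g, h): jaccard_groupwise(annotations[g], annotations[h], parents)
            for g in genes for h in genes}
```

A pairwise measure would instead score individual term pairs and then combine those scores, for example by taking the maximum or the average.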

Scripts for evaluating the similarities

  1. correlation.py - This script computes Spearman and Pearson correlations between similarity values and annotation size. It requires the annotations file and the file with similarity values.
  2. interactions.py - This script computes ROC AUC for protein-protein interaction predictions. Similarity values are used as prediction scores and BioGRID interaction data as the test data. BioGRID Tab 2.0 formatted files are used.
  3. gene_disease.py - This script is used for evaluating similarity measures on gene-disease association predictions.
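
The statistics named above can be sketched with the standard library alone. This is a minimal illustration, not the code from correlation.py or interactions.py. The function names are hypothetical, Spearman is computed as Pearson on average ranks, and ROC AUC uses the rank-sum (Mann-Whitney) formulation with labels given as 1 for an interacting pair and 0 otherwise.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def roc_auc(scores, labels):
    """ROC AUC from the rank sum of the positive examples."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For the correlation, `xs` would hold annotation sizes and `ys` the matching similarity values. For the interaction evaluation, `scores` would hold similarity values for protein pairs and `labels` would mark the pairs present in the BioGRID data.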

Scripts for generating plots

  1. plot_figures.py, plot_figures_pairwise.py - These scripts generate plots from the similarity values. They require the annotations file and the similarity values.

The repository also includes shell scripts for running some of the scripts on multiple files.

Generated similarity values and plots can be found here: http://www.cbrc.kaust.edu.sa/onto/sim-eval/