Code related to the Git Repositories track from Hercules challenge
- data: Directory with the input dataset for this track.
- notebooks: Jupyter notebooks with the process taken for the creation and training of the topic extraction system.
- results: Output of the notebooks and scripts from the system. This folder contains a variety of files (pickled models, dataframes, track output...).
- scripts: Scripts provided to run the final system and obtain predictions for the track.
- src: Track-specific source code regarding the parsing and handling of repositories.
In order to run the code from this repository, Python 3.7 or greater is required. Experiments were executed in Python 3.7.8, and that is the preferred version for the execution of the models.
Instructions to install Python 3.7.8 are available at the official website.
Once Python has been installed, it is preferrable to create a [environment] before installing the dependencies. To create a new python environment, the following command can be used:
python -m venv edma_env
This environment can be then used with the following command:
souce edma_env/bin/activate
Finally, we can install the dependencies of the system with pip:
pip install -r requirements.txt
In the notebooks directory we provide a series of Jupyter notebooks that can be executed to explore how the systems were created and get more information about them or finetune their hyperparameters. In this section we will explain how to run those notebooks and provide some advice on how they should be executed.
If you have followed the steps from the previous section to install Python and the project dependencies, the jupyter package should already be installed. In order to run the Jupyter client, go to the notebooks directory and run the jupyter notebook command:
cd notebooks
jupyter notebook
This will open a new tab in your browser with the Jupyter explorer where the different files can be explored:
If the browser was not automatically opened, you can connect to the Jupyter client through localhost on port 8888 (localhost:8888).
Now you can click on any of the notebooks to explore its content or even rerun and modify the cells. Instructions on how to do this are provided in the official Notebook docs.
It is recommended to run the notebooks in a sequential manner, in the order indicated by their first filename number (i.e. notebook 1_data_fetching.ipynb should be run before 2_Topic_Modelling.ipynb and so on). Non-sequential execution is not recommended and should be avoided, since the execution of one notebook may depend on outputs produced by the previous ones.
Note: In order to run the systems you must first obtain the models used to perform the topic extraction. Due to size constraints, they are not included in these repositories. There are two main alternatives to obtain the models: the first one is the execution of every notebook to retrain and build the systems from scratch, but this may take some time; the second alternative is to go to the complete_system directory and follow the instructions to download the trained models. Several scripts are provided in the scripts folder to execute the systems and reproduce the results obtained for this track. In the following sections we will explain the main functionality of each script and how they can be executed.
The script run_track_predictions.py can be used to obtain at once all the topics assigned to every repository from the dataset. The following parameters can be passed to the script:
Name | Description | Compulsory | Allowed Values |
---|---|---|---|
-f --format | Output format of the results. If no output format is specified, results are returned in JSON by default. | No | One of csv, json, jsonld, n3, rdf/xml or ttl |
-o --output | Name of the file where the results will be saved. If no output file is specified, results will be written to the console instead. | No | Any valid filename. |
For more additional information about how to run the script, you can execute the following command:
python scripts/run_track_predictions.py -h
In the following example, we will be running the script twice. The first execution will print the results in console and in json format (default values). The second one will save the results to the file results.ttl in the turtle format:
python scripts/run_track_predictions.py
python scripts/run_track_predictions.py -o results.ttl -f ttl
The script predict_repo.py can be used to obtain the topics for a given protocol or list of protocols. The following parameters can be passed to the string:
Name | Description | Compulsory | Allowed Values |
---|---|---|---|
input | Url of the GitHub repository to extract the topics from. If the --file flag is set, file with the urls of the GitHub repositories. | Yes | Any GitHub url or file. |
--isFile | If present, this flag indicates that the input passed to the script is a file with the urls of each repository delimited by newlines. | No | True or False |
--token | itHub token used to fetch information about the input repositories before the topic extraction steps. | No | Github tokens |
-f --format | Output format of the results. If no output format is specified, results are returned in JSON by default. | No | One of csv, json, jsonld, n3 or ttl |
-o --output | Name of the file where the results will be saved. If no output file is specified, results will be written to the console instead. | No | Any valid filename. |
For more additional information about how to run the script, you can execute the following command:
python scripts/predict_repo.py -h
In the following example, we will be running the script twice. The first execution will print the results in console and in json format (default values). The second one we will use the list of repositories urls from the script directory to predict the topics for those repositories. After that, we will save the results to the file results.ttl in the turtle format:
python scripts/predict_protocol.py https://github.com/cmungall/LIRICAL/
python scripts/predict_protocol.py scripts/test.txt --isFile -o results.ttl -f ttl
An API has been deployed at http://edma-challenge.compute.weso.network/ where the different functionality of the system can be tested out without needing to manually run the scripts with Python.
For the Git track, we provide the api/git/topics GET endpoint to predict the topics of a given GitHub repository. The following parameters can be sent in the JSON body:
Name | Description | Compulsory | Allowed Values |
---|---|---|---|
input | Url of the GitHub repository to extract the topics from. | Yes | Any GitHub url. |
token | itHub token used to fetch information about the input repositories before the topic extraction steps. | No | Github tokens |
format | Output format of the results. If no output format is specified, results are returned in JSON by default. | No | One of json, jsonld, n3 or ttl |
An example body passed to the API could be as follows:
{
"input": "https://github.com/cmungall/LIRICAL/",
"token": "GITHUB_TOKEN_HERE",
"format": "ttl"
}
The response will be as follows:
{
"task_id": "YOUR_TASK_ID"
}
A task identifier will be returned. We can query the api/prediction/<task_id> endpoint to get the status of our task. When the task is finished, we will get a result like the following one:
{
"result":"{'216602979': {'source_url': 'https://www.github.com/cmungall/LIRICAL', 'author': 'cmungall', 'name': 'LIRICAL', 'languages': {'Java': 492423, 'FreeMarker': 13149, 'Python': 849}, 'topics': [{'labels': {'en': 'computer science', 'es': 'ciencias de la computación'}, 'external_ids': ['https://www.wikidata.org/w/Q21198', 'https://freebase.toolforge.org//m/01mkq', 'https://www.jstor.org/topic/computer-science-education', 'https://academic.microsoft.com/v2/detail41008148', 'http://vocabularies.unesco.org/thesaurus/concept450'], 'descriptions': {'en': 'study of the theoretical foundations of computation', 'es': 'ciencia dedicada a la computación y al proceso de información'}, 'score': 0.21895332390381894}, {'labels': {'en': 'software', 'es': 'software'}, 'external_ids': ['https://www.wikidata.org/w/Q7397', 'https://id.ndl.go.jp/auth/ndlsh/00684642', 'https://freebase.toolforge.org//m/01mf0', 'https://www.jstor.org/topic/computer-software', 'https://meshb.nlm.nih.gov/record/ui?ui=D012984', 'https://academic.microsoft.com/v2/detail2777904410', 'http://vocabularies.unesco.org/thesaurus/concept6081'], 'descriptions': {'en': 'non-tangible executable component of a computer', 'es': 'equipamiento lógico o soporte lógico de un sistema informático'}, 'score': 0.2180281690140845}, {'labels': {'en': 'artificial intelligence', 'es': 'inteligencia artificial'}, 'external_ids': ['https://www.wikidata.org/w/Q11660', 'https://id.ndl.go.jp/auth/ndlsh/00574798', 'https://freebase.toolforge.org//m/0mkz', 'https://meshb.nlm.nih.gov/record/ui?ui=D001185', 'https://www.jstor.org/topic/artificial-intelligence', 'http://vocabularies.unesco.org/thesaurus/concept3052', 'http://id.nlm.nih.gov/mesh/G17.035.250', 'http://id.nlm.nih.gov/mesh/L01.224.050.375', 'https://academic.microsoft.com/v2/detail154945302'], 'descriptions': {'en': 'intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals', 'es': 'rama de la informática que desarrolla máquinas y software con inteligencia similar a la humana'}, 'score': 0.21228743828853539}, {'labels': {'en': 'interaction science', 'es': ''}, 'external_ids': ['https://www.wikidata.org/w/Q97008347'], 'descriptions': {'en': 'scientific discipline', 'es': ''}, 'score': 0.21164889253486463}, {'labels': {'en': 'engineering', 'es': 'ingeniería'}, 'external_ids': ['https://www.wikidata.org/w/Q11023', 'https://id.ndl.go.jp/auth/ndlsh/00566144', 'https://freebase.toolforge.org//m/02ky346', 'https://www.jstor.org/topic/engineering', 'http://vocabularies.unesco.org/thesaurus/concept623', 'http://uri.gbv.de/terminology/bk/5', 'https://meshb.nlm.nih.gov/record/ui?ui=D004738', 'http://id.nlm.nih.gov/mesh/J01.293', 'https://academic.microsoft.com/v2/detail127413603'], 'descriptions': {'en': 'applied science', 'es': 'ciencia aplicada'}, 'score': 0.21032608695652175}, {'labels': {'en': 'automation', 'es': 'automatización'}, 'external_ids': ['https://www.wikidata.org/w/Q184199', 'https://freebase.toolforge.org//m/017cmr', 'http://vocabularies.unesco.org/thesaurus/concept3401', 'https://meshb.nlm.nih.gov/record/ui?ui=D001331', 'https://academic.microsoft.com/v2/detail115901376', 'https://academic.microsoft.com/v2/detail167123822'], 'descriptions': {'en': 'use of various control systems for operating equipment', 'es': 'uso de sistemas o elementos computerizados y electromecánicos para controlar maquinarias o procesos industriales'}, 'score': 0.2096424702058505}, {'labels': {'en': 'statistics', 'es': 'estadística'}, 'external_ids': ['https://www.wikidata.org/w/Q12483', 'https://id.ndl.go.jp/auth/ndlsh/00573173', 'https://freebase.toolforge.org//m/06mnr', 'https://www.jstor.org/topic/statistics', 'https://meshb.nlm.nih.gov/record/ui?ui=D020500', 'https://meshb.nlm.nih.gov/record/ui?ui=D013223', 'http://id.nlm.nih.gov/mesh/V02.925', 'http://id.nlm.nih.gov/mesh/E05.318.740', 'http://id.nlm.nih.gov/mesh/H01.548.832', 'http://id.nlm.nih.gov/mesh/N05.715.360.750', 'http://id.nlm.nih.gov/mesh/N06.850.520.830', 'https://academic.microsoft.com/v2/detail105795698', 'http://vocabularies.unesco.org/thesaurus/concept119', 'http://vocabularies.unesco.org/thesaurus/concept2238'], 'descriptions': {'en': 'study of the collection, organization, analysis, interpretation, and presentation of data', 'es': 'estudio de la recolección, organización, análisis, interpretación y presentación de los datos'}, 'score': 0.2094722598105548}]}}",
"task_status":"Finished"
}
The results obtained for the track dataset can be found in the script_results folder. These results are provided in multiple formats (.csv, .json, .jsonld, and .ttl).