This project provides the backend for the GRITS diagnostic dashboard and a variety of other EHA projects. The main API it furnishes, accessible at `/diagnose`, takes an incoming document and returns a differential disease diagnosis along with numerous features extracted from that document.
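For illustration, a call to the diagnosis endpoint might look like the sketch below. The host, port, and the `content` field name are assumptions for illustration (check server.py for the actual address and parameter names), so only the request-building helper is meant to be taken literally.

```python
import json
import urllib.request

API_URL = "http://localhost:5000/diagnose"  # assumed local dev address

def build_diagnose_request(article_text):
    """Build a JSON POST request for the /diagnose endpoint.

    The 'content' field name is an assumption; consult server.py for
    the parameters the live API actually accepts.
    """
    payload = json.dumps({"content": article_text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_diagnose_request("An outbreak of cholera was reported in ...")
# response = urllib.request.urlopen(req)  # requires a running server
# diagnosis = json.load(response)         # differential diagnosis + extracted features
```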
This project also provides resources for training the classifier model used to make disease predictions, and for managing long-running classification tasks over large corpora.
Aside from the requirements noted in requirements.txt, which may be installed as usual with `pip install -r requirements.txt`, this project also relies on the annotation library EpiTator.
These instructions will get `grits-api` working under a Python virtualenv.
Next, start mongo on port 27017 by running `mongod`, and restore the girder database:

```
mongorestore --host=127.0.0.1 --port=27017 -d girder PATH/TO/item.bson
```
Clone `grits-api`:

```
git clone git@github.com:ecohealthalliance/grits-api.git
cd grits-api
```
Get a copy of `config.py` from someone at EHA (it contains sensitive AWS authentication information) or create your own from `config.sample.py`.
If you do not have `virtualenv`, first install it globally:

```
sudo pip install virtualenv
```
Now create and enter the virtual environment. All `pip` and `python` commands from here on should be run from within the environment. Leave the environment with the `deactivate` command.

```
virtualenv venv
source venv/bin/activate
```
Install the `grits-api` dependencies and `nose`:

```
pip install -r requirements.txt
pip install EpiTator
pip install nose
```

If lxml fails to install, run (in bash): `STATIC_DEPS=true pip install lxml`
Download the GRITS classifier data:

```
aws s3 sync s3://classifier-data/classifiers/1456399096 current_classifier
```
Download the EpiTator data dependencies:

```
python -m spacy download en_core_web_md
python -m epitator.importers.import_all
```
Start a celery worker:

There are 3 celery task queues: `priority`, `process` and `diagnose`. The process queue is for scraping and extracting articles prior to diagnosis. We recommend running a single-threaded worker on the process queue because it primarily makes HTTP requests, so it spends most of its time idling. The diagnose queue should have several worker processes, as it is very CPU intensive. The priority queue is for both processing and diagnosing articles and should have a dedicated worker process for immediately diagnosing individual articles.

```
celery worker -A tasks -Q priority --loglevel=INFO --concurrency=2
```
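The queue layout described above can be captured in celery's route-configuration format, a plain dict mapping task names to queues. The task names below are illustrative placeholders, not the real ones (see tasks.py for those):

```python
# Celery-style route configuration mapping task names to queues.
# Task names here are hypothetical; the real names live in tasks.py.
CELERY_ROUTES = {
    "tasks.process_text": {"queue": "process"},        # scraping/extraction: mostly I/O
    "tasks.diagnose": {"queue": "diagnose"},           # classification: CPU-bound
    "tasks.diagnose_priority": {"queue": "priority"},  # interactive, low-latency requests
}

def queue_for(task_name):
    """Return the queue a task would be routed to (celery's default queue otherwise)."""
    return CELERY_ROUTES.get(task_name, {"queue": "celery"})["queue"]
```

A worker for each queue is then started with `celery worker -A tasks -Q <queue>`, using low concurrency for `process` and higher concurrency for `diagnose`, per the recommendations above.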
Start the server:

```
# The -debug flag will run a celery worker synchronously in the same process,
# so you can debug without starting a separate worker process.
python server.py
```
We have created a script for deploying the grits API and diagnostic dashboard to Ubuntu AWS instances. You will need to edit inventory.ini if you are not deploying to our server. Furthermore, if you are not an EHA employee, you will need to replace the my_secure.yml file with one of your own that defines any missing variables.
The following commands invoke the build and deployment scripts:

```
ansible-playbook provision-instance-and-build.yml --extra-vars "image_name=grits"
ansible-playbook deploy-apps.yml --private-key ~/.ssh/id_rsa --tags deploy-grits
```
To run the tests:

```
git clone -b fetch_4-18-2014 git@github.com:ecohealthalliance/corpora.git
cd test
python -m unittest discover
```
Many tests are based on the comments in this document: https://docs.google.com/document/d/12N6hIDiX6pvIBfr78BAK_btFxqHepxbrPDEWTCOwqXk/edit
A corpus of HealthMap articles in the girder database is used to train the classifier. It must be manually downloaded and restored to the db. The database collection can be obtained from S3, in the bucket girder-data/proddump/girder. One additional file, ontologies.p, is required to operate the classifier. By default it will be downloaded from our S3 bucket; however, that bucket might not be available to you, or it might no longer exist. In that case, ontologies.p can be generated by running the mine_ontologies.py script. The HealthMap data, however, can no longer be generated from a script.
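Since ontologies.p is a Python pickle, regenerating and loading it is an ordinary pickle round-trip. The dictionary below is a placeholder, not the real structure produced by mine_ontologies.py:

```python
import os
import pickle
import tempfile

# Placeholder data standing in for whatever mine_ontologies.py actually produces.
ontologies = {"keywords": ["cholera", "avian influenza"]}

# Write the pickle (mine_ontologies.py would do the equivalent for ontologies.p).
path = os.path.join(tempfile.mkdtemp(), "ontologies.p")
with open(path, "wb") as f:
    pickle.dump(ontologies, f)

# The classifier can then load it back at startup.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```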
The corpora directory includes code for iterating over HealthMap data stored in a girder database, scraping and cleaning the content of the linked source articles, and generating pickles from it.
First, get a copy of the Girder data (backed up in S3 - the bucket is girder-data/proddump/girder). This will give you the file item.bson.
New data may be generated using the `train.py` script:

```
$ python train.py
```
This script relies on having the HealthMap articles available in the girder database.
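To make the training-data flow concrete, the sketch below fits a deliberately naive keyword-count classifier on two hand-written articles. The real train.py reads labeled HealthMap articles from girder and trains a far more capable model; the field names and labels here are made up for illustration.

```python
from collections import Counter, defaultdict

# Stand-ins for labeled HealthMap articles (the real ones come from girder).
articles = [
    {"content": "cholera outbreak reported after flooding", "disease": "Cholera"},
    {"content": "new measles cases among unvaccinated children", "disease": "Measles"},
]

def train(articles):
    """Count word occurrences per disease label."""
    counts = defaultdict(Counter)
    for a in articles:
        counts[a["disease"]].update(a["content"].lower().split())
    return counts

def predict(model, text):
    """Score each disease by overlapping word counts; return the best match."""
    words = text.lower().split()
    return max(model, key=lambda d: sum(model[d][w] for w in words))

model = train(articles)
print(predict(model, "suspected cholera cases reported"))  # prints "Cholera"
```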
Copyright 2016 EcoHealth Alliance
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.