
grits-api

This project provides the backend for the GRITS diagnostic dashboard and a variety of other EHA projects. Its main API endpoint, /diagnose, takes an incoming document and returns a differential disease diagnosis along with numerous extracted features for that document.
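Once the server and workers are running, a diagnosis request might look like the following sketch. The port and the content parameter name are assumptions; check server.py for the actual values.

# Minimal sketch of a diagnosis request; the port and "content" field name
# are assumptions -- see server.py for the real parameter names.
curl -X POST http://localhost:5000/diagnose \
    --data-urlencode "content=An outbreak of dengue fever has been reported in Brazil."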

This project also provides resources for training the classifier model used to make disease predictions, and for managing long-running classification tasks over large corpora.

Dependencies

Aside from the requirements noted in requirements.txt, which may be installed as usual with pip install -r requirements.txt, this project also relies on the annotation library EpiTator.

Installation and set-up

Full setup with virtualenv

These instructions will get grits-api working under a Python virtualenv.

First, start MongoDB on port 27017 by running mongod, then restore the girder database:

mongorestore --host=127.0.0.1 --port=27017 -d girder PATH/TO/item.bson

Clone grits-api

git clone git@github.com:ecohealthalliance/grits-api.git
cd grits-api

Get a copy of config.py from someone at EHA (this contains sensitive AWS authentication information) or create your own from config.sample.py.

If you do not have virtualenv, first install it globally.

sudo pip install virtualenv

Now create and enter the virtual environment. All pip and python commands from here should be run from within the environment. Leave the environment with the deactivate command.

virtualenv venv
source venv/bin/activate

Install grits-api dependencies and nose.

pip install -r requirements.txt
pip install EpiTator
pip install nose

If lxml fails to install, run (in bash):

STATIC_DEPS=true pip install lxml

Download the GRITS classifier data:

aws s3 sync s3://classifier-data/classifiers/1456399096 current_classifier

Download the EpiTator data dependencies:

python -m spacy download en_core_web_md
python -m epitator.importers.import_all

Start a celery worker:

There are three celery task queues: priority, process, and diagnose. The process queue is for scraping and extracting articles prior to diagnosis. We recommend running a single-threaded worker on the process queue because it primarily makes HTTP requests, so it spends most of its time idling. The diagnose queue should have several worker processes, as it is very CPU intensive. The priority queue is for both processing and diagnosing articles and should have a dedicated worker process so individual articles can be diagnosed immediately. A command for the priority queue follows; example commands for the other two queues appear after it.

celery worker -A tasks -Q priority --loglevel=INFO --concurrency=2
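
Workers for the process and diagnose queues can be started the same way. The concurrency values below follow the guidance above but are only suggestions, not settings mandated by the project:

celery worker -A tasks -Q process --loglevel=INFO --concurrency=1
celery worker -A tasks -Q diagnose --loglevel=INFO --concurrency=4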

Start the server:

# The -debug flag will run a celery worker synchronously in the same process,
# so you can debug without starting a separate worker process:
# python server.py -debug
python server.py

Deployment

We have created Ansible playbooks for deploying the grits API and diagnostic dashboard to Ubuntu AWS instances.

You will need to edit inventory.ini if you are not deploying to our server. Furthermore, if you are not an EHA employee, you will need to replace the my_secure.yml file with one of your own that defines any missing variables.

The following commands invoke the build and deployment scripts:

ansible-playbook provision-instance-and-build.yml --extra-vars "image_name=grits"
ansible-playbook deploy-apps.yml --private-key ~/.ssh/id_rsa --tags deploy-grits

Testing

To run the tests:

git clone -b fetch_4-18-2014 git@github.com:ecohealthalliance/corpora.git
cd test
python -m unittest discover

Many tests are based on the comments in this document: https://docs.google.com/document/d/12N6hIDiX6pvIBfr78BAK_btFxqHepxbrPDEWTCOwqXk/edit

Classifier Data

Using existing classifier data

A corpus of HealthMap articles in the girder database is used to train the classifier. It must be manually downloaded and restored to the database. The collection can be obtained from S3 at girder-data/proddump/girder. One additional file is required to operate the classifier: ontologies.p. It will be downloaded from our S3 bucket by default; however, that bucket might not be available to you, or it might no longer exist. In that case, ontologies.p can be generated by running the mine_ontologies.py script (see the command below). The HealthMap data, however, can no longer be generated from a script.
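
Assuming mine_ontologies.py takes no required arguments (check the script itself to confirm), regenerating ontologies.p is a single command:

python mine_ontologies.py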

The corpora directory includes code for iterating over HealthMap data stored in a girder database, scraping and cleaning the content of the linked source articles, and generating pickles from it.

Training the classifier

First, get a copy of the Girder data, which is backed up in S3 at girder-data/proddump/girder. This will give you the file item.bson.

New classifier data may be generated using the train.py script:

python train.py

This script relies on having the HealthMap articles available in the girder database.

License

Copyright 2016 EcoHealth Alliance

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.