Wellcome Reach

Wellcome Reach is an open source service for discovering how research publications are being cited in global policy documents, including those produced by policy organizations such as the WHO, MSF, and the UK government. Key parts of it include:

  1. Web scrapers for pulling PDF "policy documents" from policy organizations,
  2. A reference parser for extracting references from these documents,
  3. A task for sourcing publications from Europe PMC (EPMC),
  4. A task for matching policy document references to EPMC publications (a toy sketch follows this list),
  5. An Airflow installation for automating the above tasks, and
  6. A web application for searching and retrieving data from the datasets produced above.
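
For a feel of what the matching step (part 4 above) does, here is a toy sketch that fuzzy-matches a parsed reference title against candidate EPMC publication titles. The function, data shapes, and threshold are assumptions for illustration only, not the project's actual matching logic:

    # Illustrative only: a naive title matcher, not the project's real algorithm.
    # The data shapes and the 0.85 threshold are assumptions made for this sketch.
    from difflib import SequenceMatcher

    def best_epmc_match(reference_title, epmc_publications, threshold=0.85):
        """Return the EPMC publication whose title most resembles the parsed
        reference title, or None if no candidate clears the threshold."""
        best_score, best_pub = 0.0, None
        for pub in epmc_publications:
            score = SequenceMatcher(None, reference_title.lower(),
                                    pub["title"].lower()).ratio()
            if score > best_score:
                best_score, best_pub = score, pub
        return best_pub if best_score >= threshold else None

    print(best_epmc_match(
        "Global tuberculosis report 2017",
        [{"title": "Global Tuberculosis Report 2017", "pmid": "12345"}],
    ))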

Wellcome Reach is written in Python and developed using docker-compose.

Although parts of Wellcome Reach have been in use at Wellcome since mid-2018, the project only went open source in March 2019. Given these early days, please be patient as various parts of it are made accessible to external users. All issues and pull requests are welcome. Contributing guidelines can be found in CONTRIBUTING.md.

Development

Dependencies

To develop for this project, you will need:

  1. Python 3.6+, plus pip and virtualenv
  2. Docker and docker-compose
  3. AWS credentials with read/write S3 permissions.
  4. A clean JSON file containing reference sections (TODO: remove this by pulling from a public S3 bucket by default)
  5. A clean CSV file containing all your references (TODO: remove this by pulling from a public S3 bucket by default); see the loading sketch after this list
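
As a rough idea of how these two input files might be read, here is a minimal sketch; the file names and field layout are assumptions, not a schema the project requires:

    # Illustrative sketch only: file names and field layout are assumptions,
    # not a schema required by the project.
    import csv
    import json

    with open("reference_sections.json") as f:     # the "clean JSON file"
        reference_sections = json.load(f)

    with open("references.csv", newline="") as f:  # the "clean CSV file"
        references = list(csv.DictReader(f))

    print(len(reference_sections), "sections;", len(references), "references")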

docker-compose

To bring up the development environment using Docker:

  1. Set your AWS credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) in your environment.
  2. If you don't have Wellcome IAM creds, run: (TODO: remove this step)
    export DIMENSIONS_USERNAME="" \
        DIMENSIONS_PASSWORD="" \
        AIRFLOW_FERNET_KEY=""
    
  3. If you do, simply run:
    eval $(./export_env.py)
    
  4. Build and start the env with:
    make docker-build
    docker-compose up -d
    
  5. Verify the services came up with:
    docker-compose ps
    

Once up, you'll be able to access the services defined in docker-compose.yml on their mapped ports.

virtualenv

For local development outside of Airflow or other services, use the project's virtualenv:

make virtualenv
source build/virtualenv/bin/activate

Testing

To run all tests for the project using the official Python version and other dependencies, run:

make docker-test

You can also run tests locally using the project's virtualenv, with

make test

or using the appropriate pytest command, as documented in the Makefile.
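
If you'd rather invoke pytest programmatically from within the virtualenv, a minimal equivalent might look like the sketch below; the test path is an assumption, and the Makefile documents the exact invocation:

    # Assumes pytest is installed in the virtualenv and that the tests live
    # under the policytool package; see the Makefile for the exact command.
    import sys

    import pytest

    sys.exit(pytest.main(["policytool", "-v"]))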

Airflow

Wellcome Reach uses Apache Airflow to automate running its data pipelines. Specifically, we've broken down the batch pipeline into a series of dependent steps, all part of a Directed Acyclic Graph (DAG).
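
As a minimal sketch of what such a DAG looks like (the DAG id, task names, and callables below are hypothetical, not the project's actual pipeline definition):

    # Hypothetical example using the Airflow 1.x API; names are illustrative only.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def scrape():
        """Placeholder for the policy-document scraping step."""

    def parse_references():
        """Placeholder for the reference-parsing step."""

    dag = DAG(
        dag_id="example_policy_pipeline",
        start_date=datetime(2018, 11, 2),
        schedule_interval=None,
    )

    scrape_task = PythonOperator(task_id="scrape", python_callable=scrape, dag=dag)
    parse_task = PythonOperator(
        task_id="parse_references", python_callable=parse_references, dag=dag
    )

    # Downstream dependency: parsing runs only after scraping succeeds.
    scrape_task >> parse_task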

Running a task in airflow

It's quite common to want to run a single task in Airflow without having to click through the UI, not least because all logging messages then go to the console. To do this, from the top of the project directory:

  1. Bring up the stack with docker-compose as shown above, and
  2. Run the following command, substituting for DAG_NAME, TASK_NAME, and JSON_PARAMS:
    ./docker_exec.sh airflow test \
        ${DAG_NAME} ${TASK_NAME} \
        2018-11-02 -tp '${JSON_PARAMS}'
    

Deployment

For production, a typical deployment uses:

  • a Kubernetes cluster that supports persistent volumes
  • a PostgreSQL or MySQL database for Airflow to use
  • a distributed storage service such as S3
  • an ElasticSearch cluster for searching documents

Evaluating each component of the algorithm

We have devised evaluation data in order to evaluate five steps of the model. The results can be calculated by first installing poppler (e.g. with Homebrew on macOS)

brew install poppler

and then downloading the evaluation data from here and storing it in algo_evaluation/data_evaluate/, which can be done on the command line by running

aws s3 cp --recursive s3://datalabs-data/policy_tool_tests ./policytool/refparse/algo_evaluation/data_evaluate/

and then running

python evaluate_algo.py

(or set the verbose argument to False, i.e. python evaluate_algo.py --verbose False, if you want less information about the evaluation printed).

You can read more about how we got the evaluation data and what the evaluation results mean here.

Further reading

Contributing

See the Contributing guidelines in CONTRIBUTING.md.
