CoronaCentral Machine Learning

This repository contains the code for text mining the coronavirus literature for CoronaCentral. It manages the download, clean-up, categorization (using deep learning), and the many other steps needed to process the coronavirus literature. The output is then uploaded to the CoronaCentral website. The web interface of the website is kept in a separate GitHub repo.

This README covers the three main steps:

Downloading coronavirus literature
Running the full pipeline
Uploading to a database

Download the coronavirus literature

Scripts in the data/ directory manage the download of the literature from PubMed and CORD-19. These two sources are then combined into one file and fed through the pipeline, where they are cleaned up.
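
As an illustration of the "combined into one file" step, here is a minimal Python sketch. It is not the code in data/: the function names, the `source` field, and the JSON output format are all assumptions made for this example, and documents are represented as plain dictionaries.

```python
import json
from pathlib import Path


def combine_sources(pubmed_docs, cord19_docs):
    """Concatenate documents from both sources, tagging each with its origin."""
    combined = []
    for source, docs in (("pubmed", pubmed_docs), ("cord19", cord19_docs)):
        for doc in docs:
            tagged = dict(doc)         # copy so the caller's data is not mutated
            tagged["source"] = source  # hypothetical field name
            combined.append(tagged)
    return combined


def write_combined(docs, path):
    """Write the combined documents to a single JSON file."""
    Path(path).write_text(json.dumps(docs, indent=2), encoding="utf-8")
```

The sketch deliberately does not deduplicate: merging and cleaning are described as later pipeline steps, so the combined file here simply keeps every record along with where it came from.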

Full Pipeline

The full pipeline in the pipeline/ directory takes in documents from PubMed and CORD-19 and cleans, merges, and categorizes them, along with the further steps outlined below. These steps are managed by a Snakemake script.

Word-lists for entities are sourced from WikiData
Spotfixes are applied to manually clean up some documents
Web data is pulled to get metadata tags
Web data is integrated with the documents and used to infer some article types
Documents are further cleaned by a number of rules, including steps to normalize journal names
Documents are parsed and named entity recognition is applied to find mentions of different entities, including viruses, drugs, locations, etc.
Categories are predicted using BERT (using scripts in category_prediction/)
Additional categories are identified with rules
A final filter tidies up and checks the documents
Document annotations are prepared for upload to a database
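
To make the shape of these steps concrete, the sketch below chains three of them (journal normalization, word-list entity matching, and a rule-based category) as plain Python functions applied in order. It is only an illustration: the real pipeline is driven by Snakemake and uses BERT for most categories, and the word-list entries, journal alias table, field names, and the "Review" rule here are hypothetical.

```python
import re

# Hypothetical word-list: surface form -> (entity type, canonical name).
WORDLIST = {
    "sars-cov-2": ("Virus", "SARS-CoV-2"),
    "remdesivir": ("Drug", "Remdesivir"),
}

# Hypothetical alias table used to normalize journal names.
JOURNAL_ALIASES = {"n engl j med": "New England Journal of Medicine"}


def normalize_journal(doc):
    """Map a known journal abbreviation to its full name."""
    key = re.sub(r"[^a-z0-9 ]", "", doc.get("journal", "").lower())
    key = " ".join(key.split())  # collapse repeated whitespace
    return {**doc, "journal": JOURNAL_ALIASES.get(key, doc.get("journal", ""))}


def tag_entities(doc):
    """Find word-list terms in the title and abstract (case-insensitive)."""
    text = f"{doc.get('title', '')} {doc.get('abstract', '')}".lower()
    found = []
    for term, (entity_type, name) in WORDLIST.items():
        # Lookarounds stop a term from matching inside a longer word.
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text):
            found.append({"type": entity_type, "name": name})
    return {**doc, "entities": found}


def add_rule_categories(doc):
    """Add a category from a simple title rule (hypothetical rule)."""
    categories = list(doc.get("categories", []))
    if "review" in doc.get("title", "").lower() and "Review" not in categories:
        categories.append("Review")
    return {**doc, "categories": categories}


STAGES = [normalize_journal, tag_entities, add_rule_categories]


def run_stages(doc):
    """Apply each stage in order, passing the document along."""
    for stage in STAGES:
        doc = stage(doc)
    return doc
```

Each stage takes a document and returns a new one, so stages can be reordered or tested in isolation. Snakemake plays a similar role at the file level, running each step once its inputs exist.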

Database

The database/ directory contains scripts for creating and managing a MySQL database containing documents and annotations.
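
As a rough picture of what "documents and annotations" could look like in relational form, here is a small Python sketch. It uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table and column names are assumptions made for illustration, not the schema used by the scripts in database/.

```python
import sqlite3

# Hypothetical two-table layout: one row per document, many annotations per document.
SCHEMA = """
CREATE TABLE documents (
    document_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    journal TEXT
);
CREATE TABLE annotations (
    annotation_id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(document_id),
    entity_type TEXT NOT NULL,
    entity_name TEXT NOT NULL
);
"""


def create_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_document(conn, title, journal, annotations):
    """Insert one document and its (entity_type, entity_name) annotations."""
    cur = conn.execute(
        "INSERT INTO documents (title, journal) VALUES (?, ?)", (title, journal)
    )
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO annotations (document_id, entity_type, entity_name) "
        "VALUES (?, ?, ?)",
        [(doc_id, etype, name) for etype, name in annotations],
    )
    conn.commit()
    return doc_id
```

Keeping annotations in their own table lets a document carry any number of entity mentions or categories and keeps lookups such as "all documents mentioning a given drug" to a single indexed query.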

Data

If using data from this project, please cite this work along with the CORD-19 dataset.
