
CoronaWhy CORD-19 Common Module for preprocessing tasks and metadata conversion.


CoronaWhy/cord19-metadata

 
 


CORD-19 Common Preprocessing Module

This module synchronizes CORD-19 papers into MongoDB and Elasticsearch and converts their metadata to bibliographic standards such as MARC21.
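To give a feel for the metadata conversion mentioned above, here is a minimal sketch that maps one CORD-19 metadata row to a handful of MARC21 fields. It is not this module's converter: the function name `to_marc_fields` and the plain `(tag, subfields)` representation are made up for illustration, and the input keys (`title`, `doi`, `authors`, `publish_time`, `abstract`) are assumed to follow the column names of the CORD-19 `metadata.csv` file.

```python
def to_marc_fields(record):
    """Map a CORD-19 metadata row (a dict) to a list of (tag, subfields) pairs.

    Hypothetical helper for illustration only; indicators and most MARC21
    fields are left out to keep the sketch short.
    """
    # Assumption: authors arrive as one semicolon-separated string.
    authors = [a.strip() for a in (record.get("authors") or "").split(";") if a.strip()]

    fields = []
    if record.get("doi"):
        # 024: other standard identifier, with the source named in subfield 2
        fields.append(("024", {"a": record["doi"], "2": "doi"}))
    if authors:
        # 100: main entry, personal name (first author)
        fields.append(("100", {"a": authors[0]}))
    if record.get("title"):
        # 245: title statement
        fields.append(("245", {"a": record["title"]}))
    if record.get("publish_time"):
        # 260 subfield c: date of publication
        fields.append(("260", {"c": record["publish_time"]}))
    if record.get("abstract"):
        # 520: summary
        fields.append(("520", {"a": record["abstract"]}))
    for name in authors[1:]:
        # 700: added entry, personal name (remaining authors)
        fields.append(("700", {"a": name}))
    return fields
```

A real conversion would also need a leader, control fields and indicators, which a dedicated MARC library handles better than hand-built tuples.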

Install Python 3 dependencies

pip3 install -r requirements.txt

CORD-19 collection

  • Download the original collection from Kaggle or directly from Ai2
  • Unzip the archive into a folder on your hard drive, for example /corddata
  • Edit api/config.py: set "maindir" to that folder and "cordversion" to the current CORD-19 version from Kaggle (v38 at the time of writing)
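The two settings named in the last step could look roughly like the sketch below. The rest of api/config.py is not shown in this README, so treat the layout as an assumption: only the names `maindir` and `cordversion` come from the text above, and `validate_config` is a hypothetical helper added to catch typos before a long import run.

```python
import re
from pathlib import Path

# Values taken from the examples in this README; adjust them to your setup.
maindir = "/corddata"   # folder where the CORD-19 archive was unzipped
cordversion = "v38"     # CORD-19 release label used by the import


def validate_config(maindir, cordversion):
    """Return a list of human-readable problems (empty if the values look sane).

    Hypothetical helper, not part of this module.
    """
    problems = []
    if not Path(maindir).is_dir():
        problems.append(f"maindir is not an existing folder: {maindir}")
    # Assumption: version labels have the form "v" followed by digits, e.g. v38.
    if not re.fullmatch(r"v\d+", cordversion):
        problems.append(f"cordversion should look like 'v38', got: {cordversion}")
    return problems
```

Calling `validate_config(maindir, cordversion)` before the ingest and printing any returned messages is one cheap way to fail early.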

Set up MongoDB locally and create a user to access it, for example:

mongo admin
db.createUser({user: "coronawhyguest" , pwd: "coro901na", roles: [  "readWriteAnyDatabase" ]});

Run the ingest process to load the CORD-19 collection into MongoDB

python3 ./start.py
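The internals of start.py are not described here, so the following is only a rough sketch of what a batched metadata import could look like, not the module's actual code. It assumes the release ships a `metadata.csv` file with a header row, and that `collection` behaves like a PyMongo collection (only `insert_many` is used). The names `iter_batches` and `ingest_metadata` are hypothetical.

```python
import csv
from itertools import islice


def iter_batches(rows, batch_size=1000):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest_metadata(csv_lines, collection, batch_size=1000):
    """Insert every CSV row into the collection and return the number of rows.

    csv_lines can be an open text file or any iterable of lines;
    the first line must be the header.
    """
    reader = csv.DictReader(csv_lines)
    total = 0
    for batch in iter_batches(reader, batch_size):
        # One round trip per batch instead of one per paper.
        collection.insert_many(batch)
        total += len(batch)
    return total
```

With PyMongo installed, this could be called as `ingest_metadata(open("/corddata/metadata.csv", newline="", encoding="utf-8"), client["cord19"]["v38"])`, where the file location inside the archive is again an assumption.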

Check CORD-19 papers import

Log in to MongoDB and count the imported CORD-19 metadata records

mongo -u coronawhyguest -p coro901na --authenticationDatabase admin cord19
db.v38.find().count()
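The same check can be scripted from Python. The sketch below assumes PyMongo is installed (`pip3 install pymongo`), that MongoDB listens on localhost:27017, and that the user was created in the admin database as in the setup step, hence `authSource=admin`. The helper names are hypothetical.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host="localhost", port=27017, database="cord19"):
    """Build a MongoDB connection URI, percent-encoding the credentials."""
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?authSource=admin"
    )


def count_papers(db, cordversion="v38"):
    """Count documents in the collection named after the CORD-19 version."""
    # count_documents({}) is PyMongo's counterpart to find().count() in the shell.
    return db[cordversion].count_documents({})


def connect_and_count(user="coronawhyguest", password="coro901na"):
    """Connect with the example credentials from this README and return the count."""
    from pymongo import MongoClient  # third-party; imported lazily on purpose

    client = MongoClient(build_mongo_uri(user, password))
    return count_papers(client["cord19"])
```

Running `print(connect_and_count())` after the ingest should report the same number as the shell command above.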
