ehenneken/ADSimportpipeline

ADSimportpipeline

Overview

Coordinates ingest of a full ADS record.

  1. Parses the "classic" bibcodes files defined in settings.py.
  2. Selects every bibcode whose "timestamp" differs from the corresponding "JSON_fingerprint" field in mongodb.
  3. Uses ads.ADSExports.ADSRecords to consolidate data from classic for the bibcodes selected in step 2.
  4. Parses the resulting XML object into a Python dict via xmltodict.py.
  5. Enforces type=list on any potentially repeated entries.
  6. Merges any repeated blocks that have the same @type attribute.
  7. Inserts the data into mongodb (upsert=True).

Step 1 is initiated by invoking run.py.
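
Steps 2, 5 and 6 can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function names (find_stale, ensure_list, merge_by_type) are hypothetical, the bibcodes files are assumed to yield (bibcode, timestamp) pairs, and the merge policy shown (collecting values of colliding keys into a list) is just one plausible choice.

```python
def find_stale(classic, stored):
    """Step 2 (sketch): pick the records whose timestamp changed.

    classic: iterable of (bibcode, timestamp) pairs from the bibcodes files
             (assumed shape).
    stored:  dict mapping bibcode -> JSON_fingerprint currently in mongodb
             (assumed to have been fetched beforehand).
    """
    # A bibcode missing from `stored` yields None, so new records are selected too.
    return [(bibcode, ts) for bibcode, ts in classic if stored.get(bibcode) != ts]


def ensure_list(value):
    """Step 5 (sketch): wrap a single entry so repeated fields are always lists."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def merge_by_type(blocks):
    """Step 6 (sketch): merge dict blocks that share the same '@type' attribute."""
    merged = {}
    for block in ensure_list(blocks):
        target = merged.setdefault(block.get("@type"), {})
        for key, value in block.items():
            if key == "@type":
                target[key] = value
            elif key in target:
                # Hypothetical policy: keep both values by collecting them in a list.
                target[key] = ensure_list(target[key]) + ensure_list(value)
            else:
                target[key] = value
    return list(merged.values())
```

For example, find_stale([("A", "1"), ("B", "2")], {"A": "1", "B": "0"}) selects only ("B", "2"), because the stored fingerprint for "A" already matches.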

Async workflow with rabbitMQ

  • Invoking run.py --async publishes the [(bibcode, fingerprint),...] records to rabbitmq.
  • Workers that consume these messages are defined in pipeline/psettings.py and pipeline/workers.py.
  • Workers are controlled via a master process in pipeline/ADSimportpipeline.py.
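
The publishing side of the async workflow might look something like the sketch below. It is an assumption-laden illustration rather than what run.py actually does: the function name publish_records, the routing key "records", the JSON serialization and the batch size are all hypothetical. The only library call relied on is the channel's basic_publish(exchange, routing_key, body), which pika channels provide.

```python
import json


def publish_records(channel, records, exchange="", routing_key="records",
                    batch_size=100):
    """Publish (bibcode, fingerprint) pairs to a queue in JSON-encoded batches.

    channel: any object with a pika-style basic_publish method.
    Returns the number of messages published.
    """
    records = list(records)
    sent = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # JSON has no tuple type, so each pair arrives as a two-element list.
        channel.basic_publish(exchange=exchange,
                              routing_key=routing_key,
                              body=json.dumps(batch))
        sent += 1
    return sent
```

With pika installed, a channel can be obtained via pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel(). Batching keeps individual messages small, which matters given the frame size limit noted under Requirements.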

Requirements (version numbers will come at release time)

  • pika
  • rabbitmq
  • ADSExports
  • pymongo + mongo
  • Note: The rabbitmq server should be configured with frame_max=512000
  • Note: pika should also be configured with frame_max=512000 (this seemingly must be changed in spec.py in addition to the normal connection definition)
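
On the client side, the frame_max note could be applied along these lines. This is a minimal sketch: the helper names and the "localhost" default are hypothetical, and it covers only the connection definition, not the spec.py change mentioned above. pika.ConnectionParameters does accept a frame_max argument.

```python
FRAME_MAX = 512000  # value taken from the notes above


def connection_kwargs(host="localhost"):
    """Keyword arguments for the connection definition (host is an assumption)."""
    return {"host": host, "frame_max": FRAME_MAX}


def make_parameters(host="localhost"):
    """Build pika connection parameters; pika is imported lazily."""
    import pika
    return pika.ConnectionParameters(**connection_kwargs(host))
```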
