The DPLA Ingestion

Documentation

To install or upgrade the ingest subsystem, first install the necessary components;

$ pip install --no-deps --ignore-installed -r requirements.txt

Configure an akara.ini file appropriately for your environment;

[Akara]
Port=<port for Akara to run on>

[Bing]
ApiKey=<your Bing Maps API key>

[CouchDb]
Url=<URL to CouchDB instance>
Username=<CouchDB username>
Password=<CouchDB password>

[Geonames]
Username=<Geonames username>
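
Before generating akara.conf, it can help to confirm that every placeholder has been filled in. The following is a minimal Python 3 sketch, not part of this repository; the helper name `missing_settings` is hypothetical, and the section and key names are taken from the layout above.

```python
import configparser

# Sections and keys taken from the akara.ini layout shown above.
REQUIRED = {
    "Akara": ["Port"],
    "Bing": ["ApiKey"],
    "CouchDb": ["Url", "Username", "Password"],
    "Geonames": ["Username"],
}


def missing_settings(ini_text):
    """Return 'Section.Key' names that are absent, empty, or still placeholders."""
    # interpolation=None keeps '%' characters in passwords from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            value = parser.get(section, key, fallback="").strip()
            # An unfilled template value looks like "<port for Akara to run on>".
            if not value or (value.startswith("<") and value.endswith(">")):
                missing.append(f"{section}.{key}")
    return missing
```

Calling `missing_settings(open("akara.ini").read())` should return an empty list once the file is fully configured.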

The akara.conf.template and akara.ini files are merged to generate the akara.conf file by running;

$ python setup.py install 

Then set up and start the (Akara) server;

$ akara -f akara.conf setup
$ akara -f akara.conf start

You can test it with this set description from Clemson (the examples that follow assume Akara is listening on port 8889);

$ curl "http://localhost:8889/oai.listrecords.json?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe&oaiset=jfb&limit=10" 

If you have the endpoint URL but not a set id, there's a separate service for listing the sets;

$ curl "http://localhost:8889/oai.listsets.json?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe&limit=10"
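
The same service URLs can be built from Python, which avoids quoting mistakes when the endpoint itself contains query characters. This is a sketch only; `oai_service_url` and `AKARA_BASE` are hypothetical names, and the base URL assumes Akara is listening on port 8889 as in the curl examples.

```python
from urllib.parse import urlencode

AKARA_BASE = "http://localhost:8889"  # assumed: matches Port in akara.ini


def oai_service_url(service, endpoint, oaiset=None, limit=None, base=AKARA_BASE):
    """Build a URL for an Akara OAI service such as 'oai.listrecords.json'."""
    params = {"endpoint": endpoint}
    if oaiset is not None:
        params["oaiset"] = oaiset
    if limit is not None:
        params["limit"] = limit
    # urlencode percent-encodes the nested endpoint URL so its own
    # characters cannot be confused with the outer query string.
    return f"{base}/{service}?{urlencode(params)}"
```

For example, `oai_service_url("oai.listrecords.json", "http://repository.clemson.edu/cgi-bin/oai.exe", oaiset="jfb", limit=10)` gives a percent-encoded form of the first curl URL above.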

To run the ingest process, manually configure akara.conf to point to a CouchDB database, then install the script and feed it a source profile description;

$ python setup.py install
$ mkdir profiles && mkdir data
$ cat <<DONE  >profiles/myprofile.pjs
{"name":"clemsontest",
 "subresources":["gmb","ctm"],
 "endpoint_URL":"http://localhost:8889/dpla-list-records?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe&oaiset=",
 "enrichments_coll": ["http://localhost:8889/oai-set-name?sets_service=http://localhost:8889/oai.listsets.json?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe"],
 "enrichments_rec": ["http://localhost:8889/geocode?prop=coverage&newprop=coverage_geo","http://localhost:8889/shred?prop=subject&delim=%3b","http://localhost:8889/oai-to-dpla"]
}
DONE
$ poll_profiles profiles/myprofile.pjs http://localhost:8889/enrich

Source profiles are represented as JSON objects. Their properties include;

  • endpoint_URL; the Akara-wrapped URL from which JSON representations are retrieved.
  • subresources; for OAI, the names of individual sets in an OAI store. When used, endpoint_URL should end with "&oaiset=", as in the example profile (this may change)
  • last_checked; read-only timestamp indicating the last time this source was polled
  • enrichments_coll; ordered list of Akara enrichment services for collections, including any service-specific query parameters
  • enrichments_rec; ordered list of Akara enrichment services for records, including any service-specific query parameters
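
To make the relationship between endpoint_URL and subresources concrete, here is a small Python 3 sketch. It is illustrative only and is not how poll_profiles is implemented; the helper `set_urls` is hypothetical, and it assumes each set name is appended directly to the endpoint URL.

```python
import json


def set_urls(profile):
    """Return one harvest URL per subresource, or just endpoint_URL if none are listed."""
    endpoint = profile["endpoint_URL"]
    subresources = profile.get("subresources", [])
    if not subresources:
        return [endpoint]
    if not endpoint.endswith("&oaiset="):
        raise ValueError('endpoint_URL must end with "&oaiset=" when subresources are used')
    # Assumption: each set name is appended directly to the endpoint URL.
    return [endpoint + name for name in subresources]


profile = {
    "name": "clemsontest",
    "subresources": ["gmb", "ctm"],
    "endpoint_URL": "http://localhost:8889/dpla-list-records?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe&oaiset=",
    "enrichments_coll": [],
    "enrichments_rec": ["http://localhost:8889/oai-to-dpla"],
}

# json.dumps produces text that could be saved as a profile file.
profile_text = json.dumps(profile, indent=1)
```

Here `set_urls(profile)` yields two URLs, one ending in "&oaiset=gmb" and one ending in "&oaiset=ctm".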

Enrichment pipelines are implemented through a central enrichment service, which reads the list of services to apply from a "Pipeline" HTTP header on a POST request. For example, given a data.sjs data document, the following request will send that data through the provided pipeline;

$ curl -X POST -d @data.sjs -H "Pipeline: http://localhost:8889/geocode?prop=location" http://localhost:8889/enrich
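
The same kind of request can be prepared from Python. The sketch below uses only the standard library; `build_enrich_request` is a hypothetical helper, the service URLs are the ones from the example profile, and joining several services with commas in the header value is an assumption rather than something this README states.

```python
import urllib.request


def build_enrich_request(data, services, enrich_url="http://localhost:8889/enrich"):
    """Prepare (but do not send) a POST to the central enrichment service."""
    request = urllib.request.Request(enrich_url, data=data, method="POST")
    # Assumption: multiple services are joined with commas in one header value.
    request.add_header("Pipeline", ",".join(services))
    return request


services = [
    "http://localhost:8889/shred?prop=subject&delim=%3b",
    "http://localhost:8889/geocode?prop=coverage&newprop=coverage_geo",
]
request = build_enrich_request(b'{"subject": "a;b"}', services)
# To send it against a running Akara server:
#     with urllib.request.urlopen(request) as response:
#         enriched = response.read()
```

Keeping the request construction separate from sending it makes the header easy to inspect before any data reaches the server.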

The provided enrichment services include;

  • shred/unshred; splits a ','-delimited string into a list (shred) or joins a list back into a string (unshred). The "prop" parameter specifies which property is to be shredded/unshredded (multiple properties are supported using a period delimiter)
  • geocode; creates a new property containing the latitude/longitude of the location found in the property identified by the "prop" parameter. NOTE; to use geo lookups, the geonames sqlite file has to be created using the instructions and stored in the "caches" directory below the home directory of akara.conf
  • select-id; creates or updates an "id" property to the value of the property named by the "prop" parameter

License

This application is released under an AGPLv3 license.

Copyright President and Fellows of Harvard College, 2013
