A containerized Python application for importing data from multiple source projects and transforming this data into a unified format that can be accessed via an API (which powers Digital Research Books Beta). The application runs a set of individual processes that can be orchestrated with AWS ECS or Kubernetes, or run as standalone processes.
The overall goal of this project is to provide access to the universe of open source and public domain monographs through a single portal, making it much easier for researchers, students, and others to discover obscure works and newly digitized materials that they may otherwise be unaware of.
This ETL pipeline operates in several phases to progressively enhance the data that is received from the source projects. This allows us to both normalize data from a wide range of sources (which naturally exists in numerous formats) and to enhance this data in an additive way, presenting the resulting records to users.
The objective is to produce a database of "FRBRized" records. In these records each source record is represented as an `Item` (something that can actually be read online). Items are grouped into `Edition`s (e.g. the 1917 edition of X), which are in turn grouped into `Work`s (e.g. Moby Dick, or, The Whale). Through this a user can search for and find a single `Work` record and see all editions of that `Work` and all of its options for reading online.
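The Work / Edition / Item hierarchy can be sketched with plain Python dataclasses. This is only an illustration of the grouping described above, not the project's actual data model: the field names (`source`, `url`, `publication_year`) and the example URLs are hypothetical, and dataclasses require Python 3.7 or newer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """A single readable copy, e.g. a link to an ePub or a scanned PDF."""
    source: str  # hypothetical field: which project supplied the copy
    url: str     # hypothetical field: where the copy can be read online


@dataclass
class Edition:
    """One published edition, grouping every Item that reproduces it."""
    publication_year: Optional[int] = None
    items: List[Item] = field(default_factory=list)


@dataclass
class Work:
    """The abstract work, grouping all of its Editions."""
    title: str
    editions: List[Edition] = field(default_factory=list)

    def reading_options(self) -> List[Item]:
        """Flatten every Item across every Edition of this Work."""
        return [item for edition in self.editions for item in edition.items]


# Placeholder data: the URLs below are not real records.
moby_dick = Work(
    title="Moby Dick, or, The Whale",
    editions=[
        Edition(1917, [Item("hathitrust", "https://example.org/hathi/1")]),
        Edition(1851, [Item("gutenberg", "https://example.org/pg/1"),
                       Item("nypl", "https://example.org/nypl/1")]),
    ],
)
```

Calling `moby_dick.reading_options()` returns the three placeholder Items, which mirrors how a single Work record exposes every option for reading online.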
The first step of this work is to gather all source records into the "Dublin Core Data Warehouse (DCDW)". This is a simple data store (currently a flat file in a PostgreSQL database) that normalizes data (from CSVs, MARC records, XML documents and more) into a simple Dublin Core representation. This representation uses the flexibility of DC to allow comparison across these different files while tolerating different formats and missing fields: because all DC fields are optional, we can create valid DC records regardless of the source. By applying some additional formatting rules (description TK) within each field, we also avoid losing fidelity from these records.
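As a rough sketch of that normalization step, the function below maps one source row onto the fifteen Dublin Core elements and leaves anything missing as `None`. The column mapping and the sample row are hypothetical, and the per-field formatting rules mentioned above are not modeled here.

```python
from typing import Dict, Optional

# The 15 elements of the Dublin Core Metadata Element Set.
DC_FIELDS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_dublin_core(row: Dict[str, str],
                   mapping: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map one source row onto DC fields; fields with no value stay None."""
    record: Dict[str, Optional[str]] = {name: None for name in DC_FIELDS}
    for source_column, dc_field in mapping.items():
        if dc_field not in record:
            raise ValueError(f"{dc_field!r} is not a Dublin Core element")
        value = (row.get(source_column) or "").strip()
        if value:
            record[dc_field] = value
    return record


# Hypothetical mapping for a CSV source; real sources will differ.
CSV_MAPPING = {"Title": "title", "Author": "creator", "Year": "date"}

row = {"Title": "Moby Dick", "Author": "Melville, Herman", "Year": ""}
record = to_dublin_core(row, CSV_MAPPING)
```

Because every element is optional, the empty `Year` column simply leaves `record["date"]` as `None` and the record is still usable.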
Once stored in the DCDW these records are used to generate "clustered" work records in the FRBRized BIBFRAME model described above. This is done by using the source DCDW records as "seed" records to fetch additional metadata from the OCLC catalog, utilizing the OCLC Classify service to initially FRBRize these records and retrieve additional MARC records for the work.
The retrieved records, together with matched records from the DCDW, form a corpus that is passed into a relatively simple Machine Learning algorithm to identify which records represent single editions. This produces the data model, which is stored in a PostgreSQL database and indexed in ElasticSearch.
This application is built as a monorepo, which can be built as a single Docker container. This container can be run to execute different processes, which either execute discrete tasks or start a persistent service (such as a Flask API instance). The monorepo structure allows for a high degree of code reuse and makes extending existing services/adding new services easier as they can be based on existing patterns. Many of the modules include abstract base classes that define the mandatory methods for each service.
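The abstract-base-class pattern mentioned above might look roughly like the following. The class and method names (`BaseProcess`, `run_process`, `available_processes`) are hypothetical and are not taken from the codebase; the sketch only shows how a mandatory method is enforced and how concrete classes could be looked up by name.

```python
import inspect
from abc import ABC, abstractmethod
from typing import Dict, Type


class BaseProcess(ABC):
    """Hypothetical base class: every process must implement run_process()."""

    def __init__(self, ingest_type: str = "daily"):
        self.ingest_type = ingest_type

    @abstractmethod
    def run_process(self) -> str:
        """Execute the task; subclasses that omit this cannot be instantiated."""


class ExampleImportProcess(BaseProcess):
    """A concrete process that fulfils the contract."""

    def run_process(self) -> str:
        return f"ran {type(self).__name__} ({self.ingest_type})"


class IncompleteProcess(BaseProcess):
    """Forgets run_process(), so instantiating it raises TypeError."""


def available_processes() -> Dict[str, Type[BaseProcess]]:
    """Map class names to direct subclasses that implement every abstract method."""
    return {
        cls.__name__: cls
        for cls in BaseProcess.__subclasses__()
        if not inspect.isabstract(cls)
    }
```

Looking up `available_processes()["ExampleImportProcess"]` and calling `run_process()` on an instance runs the task, while `IncompleteProcess` is filtered out because it is still abstract.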
Locally these services can be run in two modes:
- As a local docker image, which replicates the deployed version of any component process. This allows for confidence that locally developed code will function properly in the QA and production environments.
- As individual services on the host machine with local PostgreSQL and ElasticSearch instances. This is the primary mode for developing new services as it allows for instantaneous debugging without the need to rebuild or restart a virtual environment or container image.
Local development requires that the following services be available. They do not need to be running locally, but for development purposes this is probably easiest. These should be installed by whatever means is easiest (on macOS this is generally `brew`, or your package manager of choice). These dependencies are:
- PostgreSQL@10
- ElasticSearch@7.10>
- RabbitMQ
- Redis
- XCode Command Line Tools
This is a Python application and requires Python >= 3.6. It is recommended that a virtual environment be set up for the application (again use the virtual environment tool of your choice).
The steps to install the application are:
- Install dependencies, including Python >= 3.6, if not already installed
- Set up virtual environment
- Clone this repository
- Run `pip install -r requirements.txt` from the root directory
- Configure environment variables per instructions below
- Run `DevelopmentSetupProcess` per instructions below
Docker/Docker Desktop must be installed locally to set up the local development environment described in this section. Further details on using Docker with this codebase are given in the next section.
All services share a single entry point in the `main.py` file. This script dynamically imports available processes from the `processes` directory and executes the selected process. This script accepts the following arguments (these can also be displayed by running `python main.py --help`):
- `--process` The name of the process to execute. This should be the name of the process class
- `--environment` The environment in which to execute the process. This controls which set of environment variables is loaded from the `config` directory, and should be set to `local` for local development
- `--ingestType` Applicable for processes that fetch records from external sources. Generally three settings are available (see individual processes for their own settings): `daily`, `complete` and `custom`
- `--inputFile` Used with the `custom` ingest setting; provides a local file of records to import
- `--startDate` Also used with the `custom` ingest setting; sets a start point for a period to query or ingest records
- `--limit` Limits the total number of rows imported in a single process
- `--offset` Skips the first `n` rows of an import process
- `--singleRecord` Accepts a single record identifier for the current process and imports that record only. Setting this will ignore `ingestType`, `limit` and `offset`.
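The arguments listed above could be declared with `argparse` along these lines. This is a sketch rather than the contents of `main.py`: the short flags `-p` and `-e` are taken from the `docker run` example later in this README, while the defaults, the integer types for `--limit` and `--offset`, and the way `--singleRecord` clears the other options are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run a single pipeline process")
    parser.add_argument("-p", "--process", required=True,
                        help="Name of the process class to execute")
    parser.add_argument("-e", "--environment", default="local",
                        help="Which set of environment variables to load")
    # Individual processes may accept other ingest settings, so no
    # fixed list of choices is enforced here.
    parser.add_argument("--ingestType", default="daily",
                        help="Usually daily, complete or custom")
    parser.add_argument("--inputFile", help="Local file used with custom ingest")
    parser.add_argument("--startDate", help="Start point used with custom ingest")
    parser.add_argument("--limit", type=int, help="Maximum rows to import")
    parser.add_argument("--offset", type=int, help="Rows to skip before importing")
    parser.add_argument("--singleRecord", help="Import only this record identifier")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments; a single record makes the bulk options irrelevant."""
    args = build_parser().parse_args(argv)
    if args.singleRecord:
        # Assumed behavior: clear the options that singleRecord overrides.
        args.ingestType = None
        args.limit = None
        args.offset = None
    return args
```

For example, `parse_args(["-p", "HathiTrustProcess", "--limit", "10", "--singleRecord", "abc"])` keeps the record identifier and discards the limit.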
To set up a local environment there is a special process to initialize a database and search cluster, the `DevelopmentSetupProcess`. However, it's recommended to run the `DevelopmentSetupProcess` and `APIProcess` at the same time to build the most efficient local environment. Before running a command, set these config variables in the sample-compose.yaml file:
- `HATHI_API_KEY`
- `HATHI_API_SECRET`
- `OCLC_API_KEY`
You can obtain values for these variables from the HathiTrust website (https://babel.hathitrust.org/cgi/kgs/request) and the OCLC website (https://www.oclc.org/developer/api/keys.en.html), or ask other developers for help obtaining them.
With the configuration set, run one of these commands: `make up` or `docker compose up`. Both run the docker-compose file in the codebase, which is why Docker/Docker Desktop must be installed locally. A short import process then populates the database with some sample data and starts the API locally. You can then query the API at `localhost:5050` and the ElasticSearch cluster at `localhost:9200`.
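A quick way to confirm that both services came up is to probe the two ports from the paragraph above. The helper below is a minimal sketch using only the standard library; the function names are hypothetical, and it checks only that something answers HTTP at each address, not that the data was imported correctly.

```python
import urllib.error
import urllib.request
from typing import Dict

# Ports taken from the paragraph above.
LOCAL_SERVICES = {
    "api": "http://localhost:5050",
    "elasticsearch": "http://localhost:9200",
}


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if anything answers HTTP at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or host not resolvable.
        return False


def service_status(services: Dict[str, str] = LOCAL_SERVICES) -> Dict[str, bool]:
    """Probe every configured service and report which ones responded."""
    return {name: is_reachable(url) for name, url in services.items()}
```

Calling `service_status()` after `make up` finishes returns a dictionary of service names to booleans, which is enough to tell a failed container start from a problem in your own code.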
The docker compose file uses the sample-compose.yaml file in the `config` directory. Additional configurations and dependencies can be added to this file to build upon your local environment.
To run the processes individually the command should be in this format: `python main.py --process APIProcess`.
The currently available processes are:
- `DevelopmentSetupProcess` Initialize a testing/development database
- `APIProcess` Run the DRB API
- `HathiTrustProcess` Run an import job on HathiTrust records
- `CatalogProcess` Retrieve all OCLC Catalog records for identifiers in the queue
- `ClassifyProcess` Classify (FRBRize) records newly imported into the DCDW
- `ClusterProcess` Group records that have been FRBRized into editions via a Machine Learning algorithm
- `S3Process` Fetch files (e.g. ePubs, cover images, etc.) associated with Item and Edition records and store them in AWS S3
- `NYPLProcess` Fetch files from the NYPL catalog (specifically Bib records) and import them
- `GutenbergProcess` Fetch updated files from Project Gutenberg and import them
- `MUSEProcess` Fetch open access books from Project MUSE and import them
- `DOABProcess` Fetch open access books from the Directory of Open Access Books
- `CoverProcess` Fetch covers for edition records
To run these processes as a containerized process you must have Docker Desktop installed.
Building the container is a standard process as the container provides an `ENTRYPOINT` that accepts all arguments that can be passed to `main.py`, which control the specific process invoked.
To build the container run the following command from the project root: `docker build -t drb-etl-pipeline .`
This will place an image `drb-etl-pipeline:latest` in your local docker instance. To run a process with the containerized application (in this instance the Flask API) execute the following command: `docker run drb-etl-pipeline -p APIProcess -e YOUR_ENV_FILE`. The `ENTRYPOINT` will accept the same arguments as invoking the process via the CLI.
When running a Docker image locally that interacts with other resources running on `localhost` it is necessary to supply a special URL to access them. Due to this it is generally helpful to define a unique `config` file for local docker testing. You may not wish to commit this file to git as it may contain secrets.
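On Docker Desktop, the hostname `host.docker.internal` resolves to the host machine from inside a container. Assuming that is the special URL meant here, a Docker-specific config could be derived from a local one by rewriting loopback hosts, as in the sketch below. The function name and the example connection strings are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostname Docker Desktop provides for reaching the host from a container.
DOCKER_HOST_ALIAS = "host.docker.internal"
LOOPBACK_HOSTS = {"localhost", "127.0.0.1"}


def for_container(url: str) -> str:
    """Rewrite a loopback URL so it reaches the host from inside a container."""
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return url  # already points somewhere other than the local machine
    # Preserve any "user:password@" prefix and the port, swap only the host.
    userinfo, at, _ = parts.netloc.rpartition("@")
    host = DOCKER_HOST_ALIAS
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=f"{userinfo}{at}{host}"))
```

For instance, `for_container("http://localhost:9200")` yields `http://host.docker.internal:9200`, while a URL that already points at a remote host is returned unchanged.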
To keep sensitive settings out of git, some secrets configuration must be done to run the cluster. To set up for running on your local machine, copy the `config/example.yaml` file and provide the necessary configuration (ask a colleague if you need some of the keys required there). Then provide the name of this file as the `--environment` argument when you run scripts.
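To illustrate how an `--environment` name could select a file in `config` and turn it into environment variables, here is a deliberately small sketch. It assumes the file holds only flat `KEY: value` lines, which may not match the real `config/example.yaml`; it does not parse full YAML, and the function names are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, MutableMapping


def config_path(environment: str, config_dir: str = "config") -> Path:
    """Resolve an --environment name to a file in the config directory."""
    return Path(config_dir) / f"{environment}.yaml"


def load_flat_config(path: Path) -> Dict[str, str]:
    """Read flat KEY: value lines; comments and blank lines are skipped."""
    settings: Dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition(":")
        if not separator:
            continue  # not a KEY: value line
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def apply_to_environment(settings: Dict[str, str],
                         environ: MutableMapping[str, str] = os.environ) -> None:
    """Export settings, leaving variables that are already set untouched."""
    for key, value in settings.items():
        environ.setdefault(key, value)
```

Using `setdefault` means a value already exported in the shell wins over the file, which is one way to let secrets come from outside the file; whether the project resolves conflicts this way is not stated in this README.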
Any file that contains sensitive details should not be committed to git. These values can be loaded via the AWS Parameter Store or AWS ECS. Speak to a NYPL engineer for access to these secrets and information on configuring your local setup to use them.
This application is deployed via GitHub Actions to an ECS cluster. Once merged into QA, changes are deployed to the DRB QA Instance. Production deployments are made when releases are cut against `main`.
We use git tags to tag releases and GitHub's release feature to deploy. The steps are as follows:
- Decide on a new version number (assume 0.12.0 for the following steps)
- Make sure your local `main` branch is up to date
- Update the `CHANGELOG.md` 'unreleased version' header with the current date and new version number, e.g. '2023-04-03 -- v0.12.0'
- Commit your change and push straight to `main`: `git push origin main`
- Create a new tag and name it after the new version number: `git tag -a v0.12.0`
- Push your tag to GitHub: `git push origin v0.12.0`
- In GitHub, navigate to the 'Releases' tab and click on 'Draft a new release'
- Choose your new tag from the dropdown, set `main` as the target, and name your release after the new version number ('v0.12.0')
- Add a quick 1-2 sentence summary, make sure 'Set as the latest release' is enabled and hit 'Publish release'
- Check the repo's `Actions` tab to observe the progress of the deployment to production
  - Note that the deployment job merely kicks off an ECS service update. To fully verify success, you'll need to check the `Deployments` tab for the relevant service / cluster in the ECS console.
- Send a quick message to `#researchnow_aka_sfr` in Slack to notify folks of the newest release
And you're done!
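The naming conventions in the release steps are easy to mistype, so a small helper can generate them from a version number. This is only a convenience sketch: the function names are hypothetical, and the strict `major.minor.patch` check is an assumption based on the `0.12.0` example.

```python
import re
from datetime import date
from typing import List

# Assumed version format, based on the 0.12.0 example in the steps above.
VERSION_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")


def changelog_header(version: str, released: date) -> str:
    """Build the CHANGELOG.md header, e.g. '2023-04-03 -- v0.12.0'."""
    return f"{released.isoformat()} -- v{version}"


def release_commands(version: str) -> List[str]:
    """Return the git commands from the release steps, in order."""
    if not VERSION_PATTERN.match(version):
        raise ValueError(f"expected major.minor.patch, got {version!r}")
    tag = f"v{version}"
    return [
        "git push origin main",
        f"git tag -a {tag}",
        f"git push origin {tag}",
    ]
```

The helper only builds strings. Running the commands and publishing the release in GitHub remain manual steps.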
- Improve this README
- Add the following data ingest processes:
  - NYPL Catalog
  - Project Gutenberg
  - DOAB
  - Project MUSE
  - MET Exhibition Catalogs
- Add centralized logging process
- Add commenting/documentation strings
- Generate C4 diagrams for application
- Integrate ePub processor into standard processing flow
- Add cover fetching process
- Create test suite, including:
  - Unit tests for all components
  - Functional tests for each process
  - Integration tests for the full cluster