forked from torresal/gcis-isbn-validation
dexterzzlab/gcis-isbn-validation
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
GCIS ISBN Validation Purpose This set of scripts is designed to validate the isbn information we have for all the books in GCIS Process 1. Use 'python get_books.py [GCIS book JSON endpoint]' to retrieve json dumps for all books from the GCIS book JSON endpoint 2. Use 'python normalize_isbn.py [top level GCIS script-created directory]' to normalize all the JSON files to ISBN13 if an ISBN13 version exists 3. Use 'python get_loc_xml.py [top level GCIS script-created directory]' to check existing isbn entries in json files against Library of Congress Database At this point, the following directories will be created: /gcis-[DATE]T[TIME] (This is in ISO8601 spec) --> book --> book-loc --> book-normalized --> book-to-fix --> book-to-ingest These folders contain the following: [ book ] : Original JSON files from data.globalchange.gov [ book-loc ] : XML files for matched ISBNs from Library of Congress (Separated into isbn-13-match, isbn-other-match, and isbn-problem subfolders) [ book-normalized ] : Normalized JSON files (Separated into isbn-13, isbn-none, and isbn-other subfolders) [ book-to-fix ] : Books where ISBNs did not match Library of Congress database. These need to be manually handled. [ book-to-ingest ] : JSON files of the cleaned/normalized and verified books that are ready to ingest into GCIS To ingest this data into GCIS elastic search, you simply run the following: 'python gcis_es_crawler.py gcis-[DATE]T[TIME]/book-to-ingest' Notes: Looks like there are only 3 isbn files missing
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- Python 100.0%