This repository is a collection of software and tools put together over the course of the NeSI eResearch HathiTrust pilot project. They are provided as is: the command line tools are examples of working with HathiTrust data, and can hopefully be reused in part or in full for projects down the road.
An example tool built using the hathitrust-api Data API to retrieve HathiTrust aggregate resources. It is limited to retrieving public domain documents, and requires an OAuth keyset to use; see `oauth_keys.py.template` for information about how to set up the `oauth_keys.py` file.
Usage:
python getdocs.py [-h] TARGETDIR [IDFILE]
An interactive document retriever for the HathiTrust Data API.
positional arguments:
TARGETDIR Retrieved files are stored in this directory.
IDFILE Path to a file of HathiTrust identifiers.
optional arguments:
-h, --help show this help message and exit
`TARGETDIR` specifies the directory into which to save downloaded resources. `IDFILE` is an optional argument, specifying the path to a file containing HathiTrust document identifiers, one per line. If `IDFILE` is not specified, the program runs under an interactive prompt.
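For reference, a file of identifiers in the one-per-line format described above can be read with a few lines of Python. This is a minimal sketch, not the code used by `getdocs.py`; the function name `read_htids` and the decision to skip blank lines are assumptions made for illustration.

```python
def read_htids(path):
    """Return HathiTrust identifiers from a file with one id per line."""
    with open(path, encoding="utf-8") as handle:
        # Strip surrounding whitespace and drop blank lines. Skipping blanks
        # is an assumption; the real tool may treat them differently.
        return [line.strip() for line in handle if line.strip()]
```

The resulting list can then be looped over in place of the interactive prompt.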
With `IDFILE`:
python getdocs.py . target_ids.txt
Interactive:
python getdocs.py .
Enter target htid >> dul1.ark:/13960/t0000xw9z
loc.ark+=13960=t01z49n4f.zip saved to .
Enter target htid >>
A more useful example using the [hathitrust-api][ht api] Solr API, `solrquery.py` is a command line interface to the HTRC's Solr index, allowing document searches and MARC retrieval.
Usage:
python solrquery.py [-h] [-f [FIELD [FIELD ...]]] [-o OUTFILE] [-n] [-i]
[-m MARCFILE]
QUERY
A command line tool for the HTRC Solr Proxy.
positional arguments:
QUERY A Solr query string. See http://wiki.htrc.illinois.edu
/display/COM/Solr+Proxy+API+User+Guide for details.
optional arguments:
-h, --help show this help message and exit
-f [FIELD [FIELD ...]], --fields [FIELD [FIELD ...]]
A subset of index fields to include with the results.
-o OUTFILE, --outfile OUTFILE
Use --outfile to specify an optional output file.
-n, --numfound Print the number of results matching QUERY.
-i, --ids Return a stream of document identifiers only.
-m MARCFILE, --marc MARCFILE
Retrieve MARC records for all documents matching QUERY
and write a zip archive to MARCFILE.
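The options above line up with standard Solr query parameters: `q` for the query string, `fl` for the list of fields to return, and `rows` for the number of results. The sketch below only builds a query URL from those parameters. It is not how `solrquery.py` or hathitrust-api work internally, and both the base URL and the function name `build_solr_url` are placeholders.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the Solr proxy URL you actually use.
SOLR_SELECT_URL = "http://example.org/solr/select"


def build_solr_url(query, fields=None, rows=10, base_url=SOLR_SELECT_URL):
    """Build a Solr select URL for a query and an optional field subset."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if fields:
        # Solr expects the field list as a single comma-separated value.
        params["fl"] = ",".join(fields)
    # urlencode percent-escapes characters such as ':' and ','.
    return base_url + "?" + urlencode(params)


print(build_solr_url("title:whale", fields=["id", "title"]))
```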
Tool for converting a large HathiTrust XML file to a manageable SQLite database format, accessible through the class `marc.MarcSQLite`.
Usage:
python marcdatabase.py [-h] SOURCE_XML TARGET_DB
Command line tool to parse a HathiTrust MarcXML file into a SQLite database.
positional arguments:
SOURCE_XML A multi-record MarcXML file.
TARGET_DB Name of database to create.
optional arguments:
-h, --help show this help message and exit
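To give a feel for what such a conversion involves, here is a rough standard-library sketch that walks a multi-record MarcXML file and stores each record as a row in SQLite. It is not the implementation in `marcdatabase.py`, and the two-column `records` table is an invented schema, not the one used by `marc.MarcSQLite`. The function name is hypothetical as well.

```python
import sqlite3
import xml.etree.ElementTree as ET


def _local(tag):
    """Return an element tag with any '{namespace}' prefix removed."""
    return tag.rsplit("}", 1)[-1]


def marcxml_to_sqlite(source_xml, target_db):
    """Store each <record> as one (id, xml) row; return the record count."""
    conn = sqlite3.connect(target_db)
    # Hypothetical schema, chosen only for this illustration.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, xml TEXT)"
    )
    count = 0
    # iterparse hands back each element as soon as its end tag is read,
    # so records are handled one at a time.
    for _event, elem in ET.iterparse(source_xml, events=("end",)):
        if _local(elem.tag) != "record":
            continue
        record_id = None
        for child in elem:
            # Control field 001 carries the record's control number.
            if _local(child.tag) == "controlfield" and child.get("tag") == "001":
                record_id = child.text
                break
        conn.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?)",
            (record_id, ET.tostring(elem, encoding="unicode")),
        )
        elem.clear()  # drop the record's children once it has been stored
        count += 1
    conn.commit()
    conn.close()
    return count
```

Keeping the raw XML alongside the id means individual records can be re-parsed later without touching the original file.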
Various analysis functions over the records in a MarcSQLite database.
python analyze.py [-h] [--json JSON_OUT] [--id-file ID_FILE]
{years,subjects} DATABASE CSV_OUT
positional arguments:
{years,subjects} Type of analysis to perform/information to extract.
'years' tallies the publication years of all
documents. 'subjects' accumulates the subjects of the
documents.
DATABASE MarcSQLite record database from which to pull records.
CSV_OUT File for writing CSV output.
optional arguments:
-h, --help show this help message and exit
--json JSON_OUT, -j JSON_OUT
Output a JSON result file in addition to the default
csv file.
--id-file ID_FILE, -i ID_FILE
Analyze the ids contained in ID_FILE rather than the
entire database.
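The `years` analysis boils down to normalizing messy publication-year strings, counting them, and writing the counts as CSV. The sketch below shows one way to do that. It is not the code in `analyze.py` or the normalization function in the `marc` module: the regular expression, the accepted year range (1000 to 2099), and all function names are assumptions.

```python
import csv
import re
from collections import Counter

# A four-digit year from 1000 to 2099 that is not part of a longer number.
_YEAR = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def normalize_year(raw):
    """Return the first plausible four-digit year in raw, or None."""
    match = _YEAR.search(raw or "")
    return int(match.group(1)) if match else None


def tally_years(raw_years):
    """Count how often each normalized year occurs; unparseable values are skipped."""
    counts = Counter()
    for raw in raw_years:
        year = normalize_year(raw)
        if year is not None:
            counts[year] += 1
    return counts


def write_year_counts(counts, handle):
    """Write 'year,count' rows to an open text file, sorted by year."""
    writer = csv.writer(handle)
    writer.writerow(["year", "count"])
    for year in sorted(counts):
        writer.writerow([year, counts[year]])
```

Values such as `c1895.` or `[1901]` reduce to a bare year, while incomplete dates like `18--?` are dropped rather than guessed.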
Tool for identifying documents in a MarcSQLite database via metadata features and keywords.
Usage:
python identify.py [-h] MARCDB OUTFILE TERM [TERM ...]
A quick and dirty script for searching for keywords in a HathiTrust MARC
database.
positional arguments:
MARCDB A HathiTrust MarcSQLite database file from which to retrieve
records.
OUTFILE File to write output to.
TERM Search keywords.
optional arguments:
-h, --help show this help message and exit
A command line wrapper around Ted Underwood's document collation scripts.
Usage:
python3 collate.py [-h] [--rewrite-existing] [--no-divs] [--skip SKIP]
COLLECTION [ID_FILE]
A command line wrapper around Ted Underwood's collation package.
positional arguments:
COLLECTION Specifies the root directory of a HathiTrust collection.
ID_FILE File of HathiTrust ids to collate; defaults to the
entire collection.
optional arguments:
-h, --help show this help message and exit
--rewrite-existing Overwrite existing collated documents.
--no-divs If specified, do not write page or header divisions to
the collation.
--skip SKIP Number of lines in the id file to skip; eg after an
interrupted collate.
Command line version of Ted Underwood's OCR evaluation scripts.
Usage:
python3 ocreval.py [-h] COLLECTION OUTFILE [IDFILE]
positional arguments:
COLLECTION Path to a HathiTrust collection.
OUTFILE Destination file for CSV output.
IDFILE Optional file of HathiTrust identifiers to evaluate. Defaults to
the entire collection.
optional arguments:
-h, --help show this help message and exit
Bits and pieces for working with HathiTrust MARC XML records. Check the docstrings for more usage information.
Class for storing HathiTrust MARC records in a SQLite database schema.
Main function for parsing a HathiTrust MARC record to a MarcSQLite accessible database.
Function to normalize a MARC year field (accessible through pymarc.Record.pubyear()).
Function to normalize formatting of a MARC subject field.
Class for navigating the HathiTrust pairtree collection structure.
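As a rough illustration of the layout such a class has to navigate, the sketch below maps a HathiTrust identifier to a pairtree directory. It assumes the usual pairtree character substitutions (`/` to `=`, `:` to `+`, `.` to `,`) and a `<namespace>/pairtree_root/<two-character segments>/<cleaned id>` layout. It does not reproduce the class in this package, the function names are hypothetical, and the hex-encoding step of the full Pairtree specification is left out.

```python
from pathlib import PurePosixPath

# Single-character substitutions from the Pairtree identifier-cleaning step.
_PAIRTREE_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_local_id(local_id):
    """Apply the pairtree character substitutions to a local identifier."""
    return local_id.translate(_PAIRTREE_MAP)


def pairtree_path(htid):
    """Return the assumed pairtree directory for an id like 'loc.ark:/13960/...'."""
    # Everything before the first '.' is the namespace; the rest is the local id.
    namespace, local_id = htid.split(".", 1)
    cleaned = clean_local_id(local_id)
    # Split the cleaned id into two-character directory names.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return PurePosixPath(namespace, "pairtree_root", *pairs, cleaned)
```

This substitution matches the file name in the interactive `getdocs.py` example, where the local part `ark:/13960/t01z49n4f` appears as `ark+=13960=t01z49n4f`.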
Code in this package depends on the following third party libraries:
They should all be installable with a `pip install <dependency>` command. There may still be an issue with the requests-oauthlib version in PyPI. If you have issues using `hathitrust_api.DataAPI`, install it from the source.
#### Submodules:
Because I needed to hack some Python 2 code to make it compatible with Python 3, I've included several packages as submodules to ease the pain of setting up a bunch of dependencies. If you do a `git clone --recursive`, these will all be included: