
afcarl/HathiGenreTrainingset

 
 


HathiTrust training data

This repo currently includes Python scripts that I am using to munge page-level training data for a project, "Understanding Genre in a Collection of a Million Volumes."

The actual classification scripts (in Java) are in a different repo, intuitively named pages.

The subdirectory /olddata also includes older training data I used for an earlier volume-level classification project.

Scripts

I can't write an account of every single Python script in the repo; a lot of them are one-offs. Here are the most significant.

Evaluate.py - Primary script I'm using to assess accuracy of a single model.

Coalescer.py - Module that smooths predictions as part of Evaluate.

Ensemble.py - Combines multiple models into an ensemble and assesses collective accuracy.

JsonEnsemble.py - Runs the ensemble evaluation in folders where predictions are stored as JSON files.

MetadataFeatures.py - Script that adds global "metadata features" to the pagefeatures files.

SelectFeatures.py - Script that I used to generate vocabularies.

SonicScrewdriver.py - A collection of utilities.
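To make the idea of combining models and assessing collective accuracy more concrete, here is a minimal sketch of a plurality-vote ensemble over page-level predictions. It is not the code in Ensemble.py: the voting rule, the tie-breaking rule, the function names (`majority_vote`, `accuracy`), and the genre labels are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-page labels from several models by plurality vote.

    predictions_per_model: a list with one entry per model; each entry is a
    list of genre labels, one per page, and all entries have the same length.
    Ties go to the label proposed by the earliest model in the list
    (an arbitrary choice for this sketch).
    """
    if not predictions_per_model:
        return []
    n_pages = len(predictions_per_model[0])
    if any(len(p) != n_pages for p in predictions_per_model):
        raise ValueError("every model must predict the same number of pages")

    combined = []
    for page_votes in zip(*predictions_per_model):
        counts = Counter(page_votes)
        top = max(counts.values())
        # First label, in model order, that reaches the highest vote count.
        winner = next(label for label in page_votes if counts[label] == top)
        combined.append(winner)
    return combined


def accuracy(predicted, actual):
    """Fraction of pages whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical labels for a four-page volume, from three hypothetical models.
model_a = ["front", "fic", "fic", "back"]
model_b = ["front", "fic", "poe", "back"]
model_c = ["fic", "fic", "poe", "ads"]
truth = ["front", "fic", "fic", "back"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble, accuracy(ensemble, truth))
```

Plurality voting is only one way to combine models; weighting each model's vote by its predicted probability would be another reasonable choice.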

Triads

The scripts in this subdirectory represent a mostly failed experiment to improve my approach to smoothing by training models on a lot of additional data. If you wanted to glorify it, you could call it a quasi-, semi-Conditional Random Field approach. In practice, however, it didn't produce better results than the naive ad hoc rules embodied in Coalescer, so this is now a dead end.
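For readers unfamiliar with what an ad hoc smoothing rule over page predictions can look like, here is a minimal sketch of one such rule. It is not taken from Coalescer, whose actual rules are not described here. The rule (relabel a single page when both of its neighbors agree with each other but not with it), the function name `smooth_isolated_pages`, and the genre labels are assumptions made for illustration.

```python
def smooth_isolated_pages(labels):
    """Relabel isolated pages that disagree with two agreeing neighbors.

    labels: list of genre labels, one per page, in page order.
    Decisions are based on the original labels only, so a correction made
    to one page never influences the decision for the next page.
    The first and last pages are left unchanged.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        before, after = labels[i - 1], labels[i + 1]
        if before == after and labels[i] != before:
            smoothed[i] = before
    return smoothed


# Hypothetical predictions: one stray "poe" page inside a run of fiction.
raw = ["fic", "fic", "poe", "fic", "back", "back"]
print(smooth_isolated_pages(raw))
```

A rule like this uses no training data at all, which is the contrast with the Triads experiment: the trained smoothing models had to beat rules this simple to be worth keeping.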
