HathiTrust training data

This repo currently includes Python scripts that I am using to munge page-level training data for a project, "Understanding Genre in a Collection of a Million Volumes."

The actual classification scripts (in Java) are in a different repo, intuitively named pages.

The subdirectory /olddata also includes older training data I used for an earlier volume-level classification project.

Scripts

I can't write an account of every single Python script in the repo; a lot of them are one-offs. Here are the most significant.

Evaluate.py - Primary script I'm using to assess accuracy of a single model.

Coalescer.py - Module that smooths predictions as part of Evaluate.

Ensemble.py - Combines multiple models into an ensemble and assesses collective accuracy.

JsonEnsemble.py - Runs the ensemble evaluation in folders where predictions are stored as JSON files.

MetadataFeatures.py - Script that adds global "metadata features" to the pagefeatures files.

SelectFeatures.py - Script that I used to generate vocabularies.

SonicScrewdriver.py - A collection of utilities.
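
To give a rough idea of what "smoothing" and "ensemble" mean at the page level, here is a minimal sketch. It is not the code in Coalescer.py or Ensemble.py. The function names, the single smoothing rule, and the genre labels are all assumptions made up for illustration. It assumes each model emits one genre label per page, in page order.

```python
from collections import Counter


def smooth_isolated_pages(labels):
    """Relabel a page when both neighbours agree with each other but not with it.

    Hypothetical rule for illustration only; decisions are made against the
    original sequence, so one correction never cascades into the next page.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        before, after = labels[i - 1], labels[i + 1]
        if before == after and labels[i] != before:
            smoothed[i] = before
    return smoothed


def majority_vote(predictions_per_model):
    """Combine per-page label sequences from several models by plurality vote.

    Ties go to the label that appears first among the models, which is how
    Counter.most_common orders equal counts.
    """
    combined = []
    for page_labels in zip(*predictions_per_model):
        combined.append(Counter(page_labels).most_common(1)[0][0])
    return combined


if __name__ == "__main__":
    # Made-up labels for a six-page volume.
    pages = ["front", "fiction", "fiction", "poetry", "fiction", "back"]
    print(smooth_isolated_pages(pages))

    # Three hypothetical models voting on three pages.
    models = [
        ["fic", "fic", "poe"],
        ["fic", "poe", "poe"],
        ["dra", "poe", "fic"],
    ]
    print(majority_vote(models))
```

A sensible order would be to vote first and then smooth the combined sequence, but that ordering is also an assumption rather than a description of what the scripts here do.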

Triads

The scripts in this subdirectory represent a mostly-failed experiment to improve my approach to smoothing by training models using a lot of additional data. If you wanted to glorify it, you could call it a quasi- semi- Conditional Random Field approach. However, in practice, it didn't produce better results than the naive ad hoc rules embodied in Coalescer, so this is now a dead end.

afcarl/HathiGenreTrainingset