
htrc/Z604-Project

 
 


Temporal Classification of HathiTrust OCRed Texts

This is a course project for Z604 (Big Data Analytics for Web and Text), taught by Xiaozhong Liu and Miao Chen in Spring 2014.

Abstract

We investigate the temporal resolution of texts in an effort to determine their date of publication and classify each into discrete temporal intervals (chronons). We describe and evaluate experiments that incorporate three kinds of features: temporal cues (i.e., explicit dates), the pervasiveness of OCR errors, and document-chronon distances based on N-gram text cues. Three distance metrics (Cosine Similarity, Kullback-Leibler Divergence, and Normalized Log-Likelihood Ratio, or NLLR) and three classifiers (logistic regression, decision tree, and support vector machine) are evaluated using different feature sets. Our results indicate that a logistic regression classifier combined with the NLLR metric achieves the highest performance, and that document-chronon distances computed from higher-order N-grams (bigrams and trigrams) are the most effective features.
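
To make the document-chronon distance idea concrete, here is a minimal sketch of the three metrics named above, computed over N-gram distributions. It is an illustration and not the project's actual code: the function names, the add-one smoothing, the toy chronon texts, and the use of the pooled chronon text as the background model are all assumptions. NLLR is written in its commonly used form, the sum over N-grams of P(g|document) * log(P(g|chronon) / P(g|background)).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distribution(tokens, n, vocab, alpha=1.0):
    """Smoothed n-gram distribution over a fixed vocabulary.

    Additive (add-alpha) smoothing is an assumption made here so that
    no probability is zero and the logarithms below are defined.
    """
    counts = Counter(ngrams(tokens, n))
    total = sum(counts[g] for g in vocab) + alpha * len(vocab)
    return {g: (counts[g] + alpha) / total for g in vocab}


def cosine_similarity(p, q):
    """Cosine of the angle between two distributions seen as vectors."""
    dot = sum(p[g] * q[g] for g in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); lower means closer."""
    return sum(p[g] * math.log(p[g] / q[g]) for g in p)


def nllr(doc, chronon, background):
    """Normalized log-likelihood ratio; higher means a better match."""
    return sum(doc[g] * math.log(chronon[g] / background[g]) for g in doc)


# Hypothetical toy data: two chronons and one document to date.
chronon_tokens = {
    "1800s": "thou art the king and thou hast the crown".split(),
    "1900s": "the radio and the telephone changed the city".split(),
}
doc_tokens = "thou hast the crown".split()
n = 1  # unigrams keep the toy example small; bigrams or trigrams use n = 2 or 3

pooled = [t for toks in chronon_tokens.values() for t in toks]
vocab = set(ngrams(pooled, n)) | set(ngrams(doc_tokens, n))

doc_dist = distribution(doc_tokens, n, vocab)
background = distribution(pooled, n, vocab)
chronon_dists = {c: distribution(t, n, vocab) for c, t in chronon_tokens.items()}

# One NLLR score per chronon; the document is assigned to the best-scoring one.
scores = {c: nllr(doc_dist, d, background) for c, d in chronon_dists.items()}
best = max(scores, key=scores.get)
```

In a setup like the one the abstract describes, scores of this kind (one per chronon and per N-gram order) would serve as features for a classifier rather than as a final decision, but how the project wires them together is not shown here.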

Paper

Draft paper is available here.

Team

  • Siyuan Guo @zachguo
  • Bin Dai @bindai
  • Trevor Edelblute @tedelblu
  • Zhichao Huo @zhhuo
  • Pallavi Murthy @PallaviMurthy

About

Temporal classification of HathiTrust texts


Languages

  • Python 97.0%
  • Shell 3.0%