This is a course project for Z604 (Big Data Analytics for Web and Text), offered by Xiaozhong Liu & Miao Chen in Spring 2014.
We investigate the temporal resolution of texts in an effort to determine their date of publication and classify each into discrete temporal intervals (chronons). We describe and evaluate experiments that incorporate temporal cues such as explicit dates, the pervasiveness of OCR errors, and document-chronon distances computed from N-gram text cues. Three distance metrics (Cosine Similarity, Kullback-Leibler Divergence, and Normalized Log-Likelihood Ratio, or NLLR) and three classifiers (logistic regression, decision tree, and support vector machine) are evaluated using different feature sets. Our results indicate that the logistic regression classifier combined with the NLLR metric achieves the highest performance, and that document-chronon distances computed from higher-order N-grams (bigrams and trigrams) are the most effective features.
Draft paper is available here.
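
To make the idea of a document-chronon distance concrete, here is a minimal sketch of the three metrics named above. It is not the project's code. The toy chronon texts, the helper names, the add-one smoothing, and the exact NLLR formulation (document probability times the log ratio of chronon probability to corpus probability) are all assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distribution(counts, vocab, alpha=1.0):
    """Add-alpha smoothed probabilities over a shared vocabulary (assumed smoothing)."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {g: (counts.get(g, 0) + alpha) / total for g in vocab}

def cosine_similarity(a, b):
    """Cosine of the angle between two raw count vectors."""
    dot = sum(v * b.get(g, 0) for g, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[g] * math.log(p[g] / q[g]) for g in p)

def nllr(doc, chronon, corpus):
    """Assumed NLLR form: sum of P(g|doc) * log(P(g|chronon) / P(g|corpus))."""
    return sum(doc[g] * math.log(chronon[g] / corpus[g]) for g in doc)

# Hypothetical toy data: two chronons and one document to date.
chronon_texts = {
    "1800s": "thou art the king and thou shalt reign".split(),
    "1900s": "the radio and the telephone are new".split(),
}
doc_tokens = "thou art the king".split()
n = 1  # use 2 or 3 for bigrams or trigrams

chronon_counts = {c: Counter(ngrams(t, n)) for c, t in chronon_texts.items()}
doc_counts = Counter(ngrams(doc_tokens, n))
corpus_counts = sum(chronon_counts.values(), Counter())
vocab = set(corpus_counts) | set(doc_counts)

doc_dist = distribution(doc_counts, vocab)
corpus_dist = distribution(corpus_counts, vocab)
chronon_dists = {c: distribution(k, vocab) for c, k in chronon_counts.items()}

# One score per chronon for each metric.
cos = {c: cosine_similarity(doc_counts, k) for c, k in chronon_counts.items()}
kl = {c: kl_divergence(doc_dist, d) for c, d in chronon_dists.items()}
llr = {c: nllr(doc_dist, d, corpus_dist) for c, d in chronon_dists.items()}

print(max(cos, key=cos.get), min(kl, key=kl.get), max(llr, key=llr.get))
```

In this sketch, higher is better for cosine similarity and NLLR, while lower is better for KL divergence. The per-chronon scores could then serve as features for a classifier such as logistic regression.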
- Siyuan Guo @zachguo
- Bin Dai @bindai
- Trevor Edelblute @tedelblu
- Zhichao Huo @zhhuo
- Pallavi Murthy @PallaviMurthy