messner1/htrc-yule

htrc-yule

Yule's K computed for the pre-generated version of the HTRC genre-separated corpus (https://sharc.hathitrust.org/genre).
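For readers unfamiliar with the measure, here is a minimal sketch of Yule's K using its standard definition, K = 10^4 * (sum over i of i^2 * V(i) - N) / N^2, where N is the total number of tokens and V(i) is the number of distinct words that occur exactly i times. This is not the code from yule_htrc.py; the function name `yules_k` and the plain token-list input are assumptions made for illustration.

```python
from collections import Counter


def yules_k(tokens):
    """Illustrative Yule's K; not the implementation used in this repository."""
    # Frequency of each distinct word
    freqs = Counter(tokens)
    # N: total number of tokens
    n = sum(freqs.values())
    if n == 0:
        raise ValueError("cannot compute Yule's K for an empty token list")
    # V(i): number of distinct words that occur exactly i times
    spectrum = Counter(freqs.values())
    # Sum of i^2 * V(i) over all observed frequencies i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 10_000 * (s2 - n) / (n * n)
```

A text in which every word occurs exactly once yields K = 0; heavier repetition of the same words pushes K upward.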

Each CSV lists the HTID of a volume followed by the K calculated for that volume. A single quote (') was used as the text separator.
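A minimal sketch of loading one of these CSVs might look like the following. It assumes the columns are comma-delimited and that the single quote acts as the quote character around text fields; neither detail is confirmed here, so check a file before relying on it. The function name `read_k_values` is hypothetical.

```python
import csv


def read_k_values(csv_path):
    """Return {htid: k} from a two-column CSV (assumed layout: HTID, K)."""
    values = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Assumption: comma delimiter, single quote as the quote character
        reader = csv.reader(handle, quotechar="'")
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or incomplete rows
            htid, k = row[0], row[1]
            try:
                values[htid] = float(k)
            except ValueError:
                continue  # skip a header row or a malformed K value
    return values
```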

The code is messy but fairly straightforward. The basic process is to call one of the three functions in yule_htrc.py: non_fuzzy, as_single_corpus, or fuzzy_restrictions. The differences among the three are briefly covered in the accompanying blog post. Each takes the path to the data to be used (.tsv files; child folders are searched as well), the path to the metadata CSV, and the path to the contextual correction CSV.
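The recursive gathering of .tsv files described above can be sketched as follows. This is an illustration of the idea, not the code in yule_htrc.py, and `gather_tsvs` is a hypothetical helper name.

```python
from pathlib import Path


def gather_tsvs(data_dir):
    """Collect every .tsv file under data_dir, including child folders."""
    # rglob walks the directory tree recursively; sorting gives a stable order
    return sorted(p for p in Path(data_dir).rglob("*.tsv") if p.is_file())
```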

The motivation for this work is explained in this blog post: http://cmessner.com/blog/?p=127

This repository now also contains code for deduplicating the dataset using K. The HTID clumps produced by deduplicating the fiction dataset can be found in the fiction_duplicated_clumps CSV. Taking one entry from each clump will result in a deduplicated dataset.
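As a rough sketch of "taking one entry from each clump", assume the clumps have already been loaded into memory as a list of HTID lists (the on-disk layout of the clumps file is not described here, so loading is left out). The function name `pick_representatives` is hypothetical, and keeping the first entry is an arbitrary choice; any entry from a clump would do.

```python
def pick_representatives(clumps):
    """Keep one HTID per clump; here, simply the first one listed."""
    kept = []
    for clump in clumps:
        htids = [h for h in clump if h]  # drop empty entries
        if htids:
            kept.append(htids[0])
    return kept
```

Volumes that do not appear in any clump have no detected duplicates and would be kept as they are.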

More on this here: http://cmessner.com/blog/?p=209
