Different approaches to computing document similarity

wangjiaqiys/reuters-docsim

reuters-docsim

Different approaches to computing document similarity, compared quantitatively using the Reuters-21578 corpus. This blog post has more details:

Document Similarity using various Text Vectorizing Strategies

Running the code

  • Make a data folder under the project directory.
  • Download and expand the Reuters-21578 corpus into this folder. This will create a data/reuters-21578 folder under the project directory.
  • Run the parse-input.py script. This parses the corpus data and produces two flat files in the data directory: text.tsv for the text and tags.tsv for the tags.
  • Generate vectors for the tags by running the tag-sims.py script. This will generate a tag-vecs.tsv file in the data directory.
  • Generate text vectors with one of the vectorizers by running the matching *-sims.py script. Each script writes a corresponding *-vecs.csv file if the generated vectors are dense, or a *-vecs.mtx file if they are sparse.
  • Compute the Pearson correlation coefficient between the tag vectors and the text vectors by running the calc-pearson.py script.
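
The final comparison step can be sketched roughly as follows. This is not the code in calc-pearson.py; it is a minimal pure-Python illustration that assumes the comparison works by computing cosine similarity for every pair of documents in each vector space and then correlating the two lists of similarities. The function names (`cosine`, `pairwise_sims`, `pearson`) and the toy vectors are hypothetical and stand in for the contents of the tag and text vector files.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def pairwise_sims(vectors):
    """Cosine similarity for every unordered pair of documents."""
    pairs = combinations(range(len(vectors)), 2)
    return [cosine(vectors[i], vectors[j]) for i, j in pairs]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Toy stand-ins (assumed shapes): one row per document, in the same order.
tag_vecs = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
text_vecs = [
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.3, 0.1, 0.0],
    [0.0, 0.2, 0.9, 0.7],
    [0.1, 0.6, 0.5, 0.8],
]

score = pearson(pairwise_sims(tag_vecs), pairwise_sims(text_vecs))
print(f"Pearson correlation between tag and text similarities: {score:.3f}")
```

Under this reading, a score closer to 1 would suggest that the vectorizer places documents near each other when their tags also overlap, which is what makes the coefficient usable for comparing vectorizers against one another.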

Languages

  • Python 100.0%