Latent Semantic Analysis in Python

In this project we perform latent semantic analysis (LSA) of large document sets.

We first build a document-term matrix and then compute its singular value decomposition (SVD).

The document-term matrix is weighted with tf-idf.
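
The pipeline roughly looks like the sketch below. This is a minimal illustration using scikit-learn's TfidfVectorizer and TruncatedSVD rather than this repository's own implementation; the corpus and parameter values are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Placeholder corpus; the real project loads ~200,000 Jeopardy documents.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "latent semantic analysis uncovers hidden topic structure",
    "singular value decomposition factorizes the document-term matrix",
]

# Build the tf-idf weighted document-term matrix (documents x terms, sparse).
vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(docs)

# Reduce to k latent dimensions with a truncated SVD (k=2 for this toy corpus).
svd = TruncatedSVD(n_components=2)
doc_vectors = svd.fit_transform(dtm)   # documents in the latent space
term_vectors = svd.components_.T       # terms in the latent space

print(doc_vectors.shape, term_vectors.shape)
```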

To run: set your working directory to scripts/ and run the script located there.

Notes to @rrish:

  • This actually works for the entire Jeopardy dataset, with all 200,000 documents and roughly 100,000 unique words. Be warned that running it on the full dataset needs about 2GB of memory to store everything.
  • The global WORKERS variable sets how many worker processes to create. Feel free to experiment with it for performance (I haven't yet); see the sketch after this list.
  • In terms of timing, it can currently analyze all 200,000 documents and build the document-term matrix in about 45-50 seconds on my machine (mileage may vary depending on the number of cores, etc.).
  • It currently uses basic tf-idf weighting. We may wish to adjust this later.
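
As a rough illustration of how a WORKERS-style setting can be used, the sketch below splits the corpus across a multiprocessing pool to tokenize and count terms in parallel. The function names and the tokenization here are hypothetical and are not taken from the repository's scripts.

```python
from collections import Counter
from multiprocessing import Pool

WORKERS = 4  # hypothetical counterpart of the repo's global WORKERS setting


def count_terms(document):
    """Tokenize one document and count its terms (naive whitespace split)."""
    return Counter(document.lower().split())


def build_term_counts(documents):
    """Count terms in every document using a pool of WORKERS processes."""
    with Pool(processes=WORKERS) as pool:
        return pool.map(count_terms, documents)


if __name__ == "__main__":
    docs = ["This is a document", "Another document about LSA"]
    print(build_term_counts(docs))
```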

The SVD_using_LSA.m file is a MATLAB implementation of the latter half of the LSA algorithm, run once the document-term matrix has been constructed and its SVD has been calculated. It computes the new word matrix and document matrix, then takes a query and calculates the cosine distance between the query and each of the documents (the columns of the document matrix, saved into a new array called "docs"). Finally, it ranks the documents according to their relevance to the query word or words.
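
For reference, that query-ranking step can be sketched in Python as well. Assuming a terms-by-documents tf-idf matrix A with truncated SVD A ≈ U_k S_k V_k^T, a query vector is folded into the latent space and compared to each document by cosine similarity. This is an illustrative sketch under those assumptions, not a translation of SVD_using_LSA.m; the function and variable names are placeholders.

```python
import numpy as np


def rank_documents(A, query_vec, k):
    """Rank documents by cosine similarity to a query in a k-dimensional LSA space.

    A:         terms x documents tf-idf matrix
    query_vec: terms-long tf-idf vector for the query
    k:         number of latent dimensions to keep
    """
    # Truncated SVD: A ~= U_k @ diag(s_k) @ Vt_k
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    docs = Vt_k.T * s_k            # documents in the latent space (docs x k)
    q = query_vec @ U_k / s_k      # fold the query into the same space

    # Cosine similarity between the query and every document column.
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(sims)[::-1]  # document indices, most relevant first
```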
