Latent Semantic Analysis in Python

In this project we perform latent semantic analysis (LSA) of large document sets.

We first build a document-term matrix and then compute its singular value decomposition (SVD).

The document-term matrix is weighted with tf-idf.
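
The pipeline roughly looks like the sketch below. This is a minimal illustration using scikit-learn's TfidfVectorizer and TruncatedSVD rather than this repository's own implementation; the corpus and parameter values are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Placeholder corpus; the real project loads ~200,000 Jeopardy documents.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "latent semantic analysis uncovers hidden topic structure",
    "singular value decomposition factorizes the document-term matrix",
]

# Build the tf-idf weighted document-term matrix (documents x terms, sparse).
vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(docs)

# Reduce to k latent dimensions with a truncated SVD (k=2 for this toy corpus).
svd = TruncatedSVD(n_components=2)
doc_vectors = svd.fit_transform(dtm)   # documents in the latent space
term_vectors = svd.components_.T       # terms in the latent space

print(doc_vectors.shape, term_vectors.shape)
```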

To run: set your working directory to scripts/ and run the script located there.

Notes to @rrish:

  • This actually works for the entire Jeopardy dataset, with all 200,000 documents and roughly 100,000 unique words. Be warned that running it on the full dataset needs about 2GB of memory to store everything.
  • The global WORKERS variable sets how many worker processes to create. Feel free to experiment with it for performance (I haven't yet); see the sketch after this list.
  • In terms of timing, it can currently analyze all 200,000 documents and build the document-term matrix in about 45-50 seconds on my machine (mileage may vary depending on the number of cores, etc.).
  • It currently uses basic tf-idf weighting. We may wish to adjust this later.
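
As a rough illustration of how a WORKERS-style setting can be used, the sketch below splits the corpus across a multiprocessing pool to tokenize and count terms in parallel. The function names and the tokenization here are hypothetical and are not taken from the repository's scripts.

```python
from collections import Counter
from multiprocessing import Pool

WORKERS = 4  # hypothetical counterpart of the repo's global WORKERS setting


def count_terms(document):
    """Tokenize one document and count its terms (naive whitespace split)."""
    return Counter(document.lower().split())


def build_term_counts(documents):
    """Count terms in every document using a pool of WORKERS processes."""
    with Pool(processes=WORKERS) as pool:
        return pool.map(count_terms, documents)


if __name__ == "__main__":
    docs = ["This is a document", "Another document about LSA"]
    print(build_term_counts(docs))
```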

The SVD_using_LSA.m file is a MATLAB implementation of the latter half of the LSA algorithm, run once the document-term matrix has been constructed and its SVD has been calculated. It computes the new word matrix and document matrix, then takes a query and calculates the cosine distance between the query and each of the documents (the columns of the document matrix, saved into a new array called "docs"). Finally, it ranks the documents according to their relevance to the query word or words.
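
For reference, that query-ranking step can be sketched in Python as well. Assuming a terms-by-documents tf-idf matrix A with truncated SVD A ≈ U_k S_k V_k^T, a query vector is folded into the latent space and compared to each document by cosine similarity. This is an illustrative sketch under those assumptions, not a translation of SVD_using_LSA.m; the function and variable names are placeholders.

```python
import numpy as np


def rank_documents(A, query_vec, k):
    """Rank documents by cosine similarity to a query in a k-dimensional LSA space.

    A:         terms x documents tf-idf matrix
    query_vec: terms-long tf-idf vector for the query
    k:         number of latent dimensions to keep
    """
    # Truncated SVD: A ~= U_k @ diag(s_k) @ Vt_k
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    docs = Vt_k.T * s_k            # documents in the latent space (docs x k)
    q = query_vec @ U_k / s_k      # fold the query into the same space

    # Cosine similarity between the query and every document column.
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(sims)[::-1]  # document indices, most relevant first
```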
