reuters-nlp

Learning basic natural language processing and topic modelling techniques with NLTK and Gensim.

Document tokenising and normalising
Stemming
Removing stopwords and words that are very rare or very common in a corpus
Bag-of-words vectors
TF-IDF
Latent Semantic Indexing
Similar document retrieval
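
The first few steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not the code in this repository: it does not use NLTK or Gensim, it skips stemming, and the stopword list, the sample texts, the function names and the `min_docs` / `max_fraction` thresholds are all assumptions made up for the example. The TF-IDF weighting shown (raw count times natural log of inverse document frequency) is one common variant and may differ from the one Gensim applies.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; NLTK ships a fuller one.
STOPWORDS = {"a", "and", "by", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def document_frequencies(docs):
    """Count how many documents each token appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def filter_extremes(docs, df, min_docs, max_fraction):
    """Drop tokens found in fewer than min_docs documents or in more
    than max_fraction of all documents."""
    n_docs = len(docs)
    keep = {t for t, c in df.items()
            if c >= min_docs and c / n_docs <= max_fraction}
    return [[t for t in tokens if t in keep] for tokens in docs]


def tfidf(bow, df, n_docs):
    """Weight each term count by log(N / document frequency)."""
    return {t: count * math.log(n_docs / df[t]) for t, count in bow.items()}


# Hypothetical sample texts, loosely based on the headlines shown further down.
texts = [
    "Uganda disappointed by coffee talks failure",
    "Latin coffee producers to meet",
    "Copper price rises in April",
]

docs = [preprocess(t) for t in texts]
df = document_frequencies(docs)

# Bag-of-words vectors: a mapping from token to its count in the document.
bows = [Counter(tokens) for tokens in docs]

# TF-IDF: terms shared by many documents receive a lower weight.
weights = [tfidf(bow, df, len(docs)) for bow in bows]

# Filtering: "coffee" appears in 2 of 3 documents, so it is dropped here.
filtered = filter_extremes(docs, df, min_docs=1, max_fraction=0.5)
```

In this toy corpus "coffee" occurs in two of the three documents, so its TF-IDF weight in the first document is lower than that of "uganda", which occurs in only one.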

Install

$ pip install nltk
$ pip install gensim

Prepare data

$ python preprocess.py
$ python build_corpus.py
$ python train.py

Find documents similar to search terms

$ python query.py

Enter a query document:
> africa coffee
The closes matching document:
training/3034 - 0.804
UGANDA DISAPPOINTED BY COFFEE TALKS FAILURE
  Uganda, Africa's second largest coffee producer [...]

The most significant topic included these stems:
gencorp, april, group, cent, price, offer, china, crown, american, industri, 
week, six, gener, save, januari, taiwan, day, baker, deficit, copper
-------------------------------------------------------------------------------
> american coffee
The closes matching document:
training/6595 - 0.675
N.Y. TRADERS SAY LATIN COFFEE PRODUCERS TO MEET
  Several traders and analysts here told Reuters Latin American coffee 
  producers will meet [...]  

The most significant topic included these stems:
co, quarterli, corp, sugar, april, set, corn, gencorp, japan, bill, cyclop, 
qtli, march, system, merger, februari, reserv, o, taiwan, offer
-------------------------------------------------------------------------------
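
The number printed next to each matching document looks like a similarity score between the query and that document. A minimal sketch of how such a lookup can work, assuming the score is a cosine similarity (the README does not state this), is given below. It compares sparse term-weight vectors directly, whereas the session above compares vectors in the Latent Semantic Indexing topic space. The document ids, the weights and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(query, corpus):
    """Return the (document id, score) pair with the highest similarity."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in corpus.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical corpus: document id -> term weights.
corpus = {
    "doc-a": {"africa": 1.0, "coffee": 1.0, "uganda": 1.0},
    "doc-b": {"american": 1.0, "coffee": 1.0, "latin": 1.0},
    "doc-c": {"copper": 1.0, "price": 1.0},
}

query = {"africa": 1.0, "coffee": 1.0}
best_id, score = most_similar(query, corpus)
print(f"{best_id} - {score:.3f}")
```

Because cosine similarity normalises by vector length, a short query can still score highly against a much longer document, as long as they share their most heavily weighted terms.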

TODO:

HTTP API for results visualisation and analysis
Improved tokenisation of in-word punctuation
More similarity metrics

