stemtokstop

stemtokstop = stemmer + tokenizer + stop-word removal. It's nothing special, just a simple application of NLTK wrapped with Flask.
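The whole pipeline fits in a single request handler. Below is a minimal sketch of the idea, assuming a hypothetical /process endpoint and an English-only pipeline; the actual routes and parameter names live in stemtokstop.py.

from flask import Flask, request, jsonify
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

app = Flask(__name__)
stemmer = SnowballStemmer('english')
stop = set(stopwords.words('english'))

# Hypothetical endpoint: tokenize, drop stop words, stem.
@app.route('/process', methods=['POST'])
def process():
    tokens = word_tokenize(request.form['text'])
    kept = [t for t in tokens if t.lower() not in stop]
    return jsonify({'stems': [stemmer.stem(t) for t in kept]})

if __name__ == '__main__':
    app.run()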

INSTALL

$ pip install nltk
$ pip install snowballstemmer  # 1.2.0 supports Turkish
$ python
>>> import nltk
>>> nltk.download()  # in the downloader that opens, download everything
>>> ^D
$ python stemtokstop.py

For Japanese segmentation, I chose Masato Hagiwara's [TinySegmenter](https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py). It is released under the BSD License, so I keep a copy here.
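TinySegmenter needs no dictionary and works out of the box. A quick sanity check, assuming the bundled copy is importable as tinysegmenter:

from tinysegmenter import TinySegmenter

segmenter = TinySegmenter()
print(segmenter.tokenize(u'私の名前は中野です'))
# -> ['私', 'の', '名前', 'は', '中野', 'です']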

If you'd like more precise results in Japanese, install [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). Build it UTF-8 only. stemtokstop will then use MeCab in place of TinySegmenter.
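For reference, MeCab's Python binding produces the same kind of token list in wakati (word-splitting) mode; this is a sketch of the binding, not necessarily how stemtokstop calls it:

import MeCab

tagger = MeCab.Tagger('-Owakati')  # output space-separated surface forms
print(tagger.parse(u'私の名前は中野です').split())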

TEST

Run stemtokstop.py in one terminal, and run test.py in another:

$ python test.py

You should see Sent: and Recv: lines. Use your linguistic knowledge to judge whether the results are satisfactory.
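If you want to poke the server by hand, something along these lines works; the URL, port, and field name below are assumptions, and the real ones are whatever test.py uses:

import requests

payload = {'text': 'The stemmers are running.'}
print('Sent:', payload)
resp = requests.post('http://localhost:5000/process', data=payload)  # hypothetical endpoint
print('Recv:', resp.text)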

WHAT IF I DON'T LIKE IT?

Open an issue, or better, submit a pull request.

I'm not satisfied with the current output: I'd like noun stems in dictionary form rather than the truncated stems a stemmer produces. For example, europe, not europ, for Europe. It should be possible to find another stemmer that does this.
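To see the difference, compare NLTK's Snowball stemmer with its WordNet lemmatizer, which returns dictionary forms (the lemmatizer is my suggestion here, not what stemtokstop currently uses):

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

print(SnowballStemmer('english').stem('Europe'))     # europ (truncated stem)
print(WordNetLemmatizer().lemmatize('europe', 'n'))  # europe (dictionary form)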

As for Japanese stop words, I enumerate them with a heuristic. Japanese, like Chinese, needs part-of-speech tagging to reach high accuracy; arbitrarily removing stop words, as the implementation here does, loses meaningful words.
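Concretely, the heuristic amounts to filtering tokens against a fixed list. The list below is a made-up fragment for illustration, not the one shipped here:

# Made-up fragment of a Japanese stop-word list.
JA_STOPWORDS = set([u'の', u'は', u'が', u'を', u'に', u'です'])

tokens = [u'私', u'の', u'名前', u'は', u'中野', u'です']
print([t for t in tokens if t not in JA_STOPWORDS])
# -> ['私', '名前', '中野']
# Without POS tags, a particle-looking token is dropped even when
# it carries meaning in context.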

LICENSE

Apache License 2.0. Please refer to LICENSE.
