
HashingTfidfVectorizer

A fast TF-IDF vectorizer built around the hashing trick, batched iteration, and parallel fitting.

Features

  • batched data iteration
  • hashing trick instead of a stored vocabulary (see the sketch after this list)
  • parallel computing (a sketch of the pattern follows below)
  • a fast tokenizer built on spaCy
  • SQLite iterators (optional, but handy if your texts already live in a SQLite database; see the sketch at the end of the Usage section)
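
The "hashing trick" point is the core of the package: a token's column index is computed by hashing the token into a fixed feature space, so no token-to-index dictionary ever has to be built or kept in memory. A minimal sketch of the idea (the helper names and the 2**20-column space below are illustrative, not the package's actual internals):

import zlib

N_FEATURES = 2 ** 20  # fixed number of columns; hash collisions are accepted as noise

def token_to_column(token):
    # Fold a stable hash of the token into the fixed feature space.
    # zlib.crc32 is used purely for illustration; any stable hash works.
    return zlib.crc32(token.encode('utf-8')) % N_FEATURES

def term_counts(tokens):
    # Term-frequency counts keyed by hashed column index;
    # no vocabulary dictionary is built or stored.
    counts = {}
    for token in tokens:
        col = token_to_column(token)
        counts[col] = counts.get(col, 0) + 1
    return counts

term_counts(['fry', 'mushrooms', 'fry'])  # two distinct columns, with 'fry' counted twice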

I'm still working on improving the parallel computing part, though.
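
The usual pattern for a parallel fit is to hand each worker process a share of the batches, let it count document frequencies locally, and merge the partial counts at the end. A rough sketch of that pattern with the standard library's multiprocessing (the function names are illustrative; the package's fit_parallel may be organized differently):

from collections import Counter
from multiprocessing import Pool

def count_batch(batch):
    # Per-worker job: document frequencies for one batch of tokenized docs.
    df = Counter()
    for tokens in batch:
        df.update(set(tokens))  # each document contributes at most 1 per term
    return df

def parallel_document_frequencies(batches, n_jobs=7):
    # Fan the batches out to n_jobs processes, then merge the partial counters.
    # The merged document frequencies are what IDF weights are computed from.
    with Pool(n_jobs) as pool:
        partials = pool.map(count_batch, batches)
    total = Counter()
    for df in partials:
        total.update(df)
    return total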

Installation

pip install -r requirements.txt
python -m spacy download en

Usage

import time

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # .stop_words was removed in newer scikit-learn

from tokenizers.simple_iterator import SimpleIterator
from tokenizers.simple_tokenizer import SimpleTokenizer  # module path assumed by analogy with simple_iterator
from vectorizer import HashingTfIdfVectorizer

DATA = ["I think it's better to fry mushrooms.",
        "Oh, this senseless life of ours!"] * 20000

iterator = SimpleIterator(DATA, batch_size=1000)
vectorizer = HashingTfIdfVectorizer(iterator,
                                    tokenizer=SimpleTokenizer(ngram_range=(1, 2),
                                                              stopwords=ENGLISH_STOP_WORDS))

t01 = time.time()
vectorizer.fit_parallel(n_jobs=7)
t1 = time.time() - t01

t02 = time.time()
vectorizer.fit()
t2 = time.time() - t02


print(
    'Process time for parallel fit, {} docs: {} s.'.format(len(iterator.doc_index), t1))

print(
    'Process time for non parallel fit, {} docs: {} s.'.format(len(iterator.doc_index), t2))
Output:

Process time for parallel fit, 40000 docs: 9.25651478767395 s.
Process time for non parallel fit, 40000 docs: 12.76369833946228 s.
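
On this toy corpus the parallel fit with 7 workers comes out roughly 1.4x faster than the sequential one.

The SQLite iterator from the feature list plays the same role as SimpleIterator when the corpus sits in a database: it pages through a table in batches instead of loading everything into memory. A minimal sketch with the standard library's sqlite3, assuming a hypothetical docs(id, text) table (the class, table, and column names are placeholders, not the package's actual API):

import sqlite3

class SQLiteIterator:
    # Yields batches of texts from a SQLite table, mirroring SimpleIterator.

    def __init__(self, db_path, batch_size=1000):
        self.db_path = db_path
        self.batch_size = batch_size

    def __iter__(self):
        conn = sqlite3.connect(self.db_path)
        try:
            cursor = conn.execute('SELECT text FROM docs ORDER BY id')
            while True:
                rows = cursor.fetchmany(self.batch_size)
                if not rows:
                    break
                yield [text for (text,) in rows]
        finally:
            conn.close()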
