neshkatrapati/w2v-mmap

What?

Gensim has a word2vec interface that can load and train word2vec models, but it does not work well with very large models such as GoogleNewsVectors. This package is an interface for such large models. Instead of reading the whole file into RAM, it indexes the model file and fetches vectors by random access. It cannot train word2vec models.

How?

Convert binary into txt

If your pre-trained vectors come in binary format, or you have stored them in binary format, you must first convert them to txt format. convert.c does exactly this.

    # Compile
    $ gcc convert.c -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -o convert -g3
    # Convert
    $ ./convert {filename}.bin > {filename}.txt

PS: This might take a while.
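
For readers who want to see what such a conversion involves, here is a minimal Python sketch. It is not a port of convert.c. It assumes the standard word2vec binary layout (a text header `<vocab_size> <dimensions>`, then for each entry the word, a space, and the vector as little-endian 32-bit floats). The function name `convert_bin_to_txt` and the output layout (one `word v1 v2 ...` line per entry, no header) are assumptions made for illustration.

```python
import struct


def convert_bin_to_txt(bin_path, txt_path):
    """Write one 'word v1 v2 ... vN' line per entry of a word2vec .bin file.

    Hypothetical helper: the output layout is an assumption, not
    necessarily what convert.c produces.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        # Header line: "<vocab_size> <dimensions>\n"
        vocab_size, dim = map(int, src.readline().split())
        for _ in range(vocab_size):
            # The word is a run of bytes terminated by a single space.
            word = bytearray()
            while True:
                ch = src.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline that may trail the previous vector
                    word.extend(ch)
            # The vector is `dim` little-endian 32-bit floats.
            values = struct.unpack("<%df" % dim, src.read(4 * dim))
            dst.write(word.decode("utf-8", errors="replace"))
            dst.write(" " + " ".join("%f" % v for v in values) + "\n")
```

Reading one entry at a time keeps memory use flat regardless of model size, which is the same concern that motivates the rest of this package.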

Prerequisites

    $ sudo apt-get install marisa
    $ sudo pip install marisa-trie

Make Key-Index and Trie

    $ python make_index.py {filename}

This creates {filename}.kidx (a key-to-line-number index) and {filename}.trie (a trie built from that index file).

Make index of the W2V file

    $ python index_w2vfile.py {filename} 

This creates {filename}.idx
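
The general idea behind indexing a text file for random access can be sketched as follows. This is only an illustration. The on-disk format of {filename}.idx is not described here, so the sketch keeps the offsets in a Python list, and the names `build_offset_index` and `read_vector` are hypothetical rather than part of this package.

```python
def build_offset_index(txt_path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(txt_path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():  # empty bytes means end of file
                break
            offsets.append(pos)
    return offsets


def read_vector(txt_path, offsets, line_number):
    """Seek straight to one line and parse it as (word, [floats])."""
    with open(txt_path, "rb") as f:
        f.seek(offsets[line_number])
        parts = f.readline().decode("utf-8").split()
    return parts[0], [float(x) for x in parts[1:]]
```

With a key-to-line-number mapping such as the one in {filename}.kidx, a lookup then becomes: key, then line number, then byte offset, then a single seek and read, without loading the rest of the file.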

Accessing

About

Python Package for accessing a large word2vec model.
