corcra/word_reps

word_reps

Scripts pertaining to processing/clustering vector representations of words.

The scripts roughly correspond to the following pipeline:

  • Get a binary file of vector representations for 3 million words (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM).
    These representations were obtained by training word2vec (https://code.google.com/p/word2vec/) on part of a Google News dataset (~100 billion words).
  • Convert this to plaintext using bin_to_plain.c.
    This is a modified version of word2vec's distance.c, because I couldn't figure out how else to decode the binary file.
    Note: the resulting plaintext file needs to be (or at least should be) gzipped.
  • Given a .pk list of desired words, extract their representations (get_rep.py).
    This essentially searches the reference Google file, with some processing.
  • Run k-means clustering on the result (get_clusters.py).
    Alternative clustering methods are currently being investigated.
  • Analyse the cluster results (analyse_clusters.r).
    The 'best' clusters are assumed to be those with the lowest mean distance-to-centroid, in the hope that these correspond to semantically similar words.
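
For readers who would rather not compile C for the conversion step, the sketch below decodes a word2vec-style binary file in Python. It is not bin_to_plain.c. It assumes the layout that word2vec's own tools appear to write: an ASCII header line with the vocabulary size and the dimensionality, then for each entry the word, a single space, and that many little-endian 32-bit floats, optionally followed by a newline. The function names read_word2vec_binary and write_plaintext are hypothetical.

```python
import struct


def read_word2vec_binary(stream):
    """Yield (word, vector) pairs from a binary stream in the assumed layout."""
    # Header line: "<vocab_size> <dim>\n" as ASCII text.
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(x) for x in header.split())
    for _ in range(vocab_size):
        # The word is a run of bytes terminated by a single space.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("stream ended inside a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that may separate entries
                word.extend(ch)
        # The vector is `dim` little-endian 32-bit floats.
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("stream ended inside a vector")
        vector = struct.unpack("<%df" % dim, raw)
        yield word.decode("utf-8", errors="replace"), vector


def write_plaintext(pairs, out):
    """Write one 'word v1 v2 ...' line per pair to a text stream."""
    for word, vector in pairs:
        out.write(word + " " + " ".join("%.6f" % v for v in vector) + "\n")
```

Opening the input with open(path, "rb") and the output with gzip.open(path, "wt") from the standard library would produce the gzipped plaintext directly. Because the reader is a generator, the 3 million vectors never have to be held in memory at once.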

There's also a script (get_wordcloud.py) that creates a wordcloud from a cluster, but it has several dependencies.
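
The clustering and analysis steps of the pipeline can be illustrated with a small standard-library sketch: plain k-means (Lloyd's algorithm), then ranking clusters by mean distance-to-centroid, lowest first. This is not the code in get_clusters.py or analyse_clusters.r. Euclidean distance, the deterministic initialisation from the first k points, and all function names are assumptions made for illustration.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; centroids start at the first k points (assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = list(centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                updated[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, assign(points, centroids)


def rank_clusters(points, centroids, labels):
    """Return (cluster, mean distance-to-centroid) pairs, tightest first."""
    totals = {}
    for p, lab in zip(points, labels):
        dist_sum, count = totals.get(lab, (0.0, 0))
        totals[lab] = (dist_sum + math.dist(p, centroids[lab]), count + 1)
    scores = {lab: s / n for lab, (s, n) in totals.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

With real word vectors, the points would be the representations extracted by get_rep.py, and the top-ranked clusters would be the candidates to inspect for semantic similarity. A cluster containing a single word has a mean distance of zero and so ranks first, which may be worth filtering out.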
