FML-FA16-Project

Foundations of Machine Learning class final project

Project Abstract

Improving the quality of feature vectors in word embedding models by using synsets

Methodology:

  1. We use different NLP tools to generate synsets from a corpus (text8, Wiki500M-2016-dump).

  2. The generated synset corpus is used in word2vec to obtain synset embedding vectors.

  3. We evaluate the synset model's accuracy using a synset version of Google's questions-words.txt analogy set (19,558 questions); see the sketch below.
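
As a rough sketch of step 3 (not the project's actual evaluation code), the analogy test can be run with plain vector arithmetic, assuming a dict `embeddings` that maps tokens (words or sensekeys) to unit-normalized numpy vectors:

```python
import numpy as np

def evaluate_analogies(embeddings, path):
    """Accuracy on four-token analogy lines, e.g. 'athens greece baghdad iraq'."""
    vocab = list(embeddings)
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.stack([embeddings[w] for w in vocab])   # (V, dim), rows unit-norm
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            if line.startswith(':'):                    # category header line
                continue
            a, b, c, d = line.lower().split()
            if any(w not in index for w in (a, b, c, d)):
                continue                                # skip out-of-vocabulary questions
            # Predict d as the nearest neighbour of b - a + c by cosine similarity.
            query = embeddings[b] - embeddings[a] + embeddings[c]
            sims = matrix.dot(query / np.linalg.norm(query))
            for w in (a, b, c):                         # exclude the question words
                sims[index[w]] = -np.inf
            correct += vocab[int(np.argmax(sims))] == d
            total += 1
    return float(correct) / total if total else 0.0
```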

Setup:

  • The Python script wordnet_utils.py, located in the syn2vec folder, performs the text processing. It provides two modes (the first sketch after this list illustrates the underlying lemma/POS step):

    • SynsetStreamGenerator: processes a stream of words, such as the text8 file [1]: http://mattmahoney.net/dc/text8.zip, to generate a stream of (lemma, part-of-speech) pairs.
    • SynsetLineGenerator: processes the input file line by line (questions-words.txt, present in the gold-data folder) to generate a new file of (lemma, part-of-speech) pairs. The lemmatizer is WordNetLemmatizer from nltk [2]: http://www.nltk.org/. The default tagger is the Perceptron tagger [3]: http://spacy.io/blog/part-of-speech-POS-tagger-in-python/, but it can easily be switched to other taggers from the nltk library.
  • The Java program WSD provides word sense disambiguation. It uses the library DKPro WSD [4]: https://dkpro.github.io/dkpro-wsd/, which returns the WordNet sensekey when disambiguating a (lemma, POS) pair (the second sketch after this list illustrates the sensekey output). Similarly to wordnet_utils, WSD can process:

    • a stream of tuples
    • tuples line by line
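
As a rough illustration of the first step, here is a minimal sketch of (lemma, POS) pair generation using nltk's default perceptron tagger and WordNetLemmatizer, as named above. The function names are illustrative, not the actual wordnet_utils.py API, and the nltk data packages (averaged_perceptron_tagger, wordnet) are assumed to be installed:

```python
# Illustrative sketch, not the actual wordnet_utils.py API: turn a token
# stream into (lemma, POS) pairs with nltk's perceptron tagger and
# WordNetLemmatizer.
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def treebank_to_wordnet(tag):
    # Collapse Penn Treebank tags to WordNet's four POS classes.
    return {'J': wn.ADJ, 'V': wn.VERB, 'N': wn.NOUN, 'R': wn.ADV}.get(tag[0], wn.NOUN)

def lemma_pos_pairs(tokens):
    for word, tag in pos_tag(tokens):
        yield lemmatizer.lemmatize(word, treebank_to_wordnet(tag)), tag

print(list(lemma_pos_pairs('the cats were running'.split())))
# -> [('the', 'DT'), ('cat', 'NNS'), ('be', 'VBD'), ('run', 'VBG')]
```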
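And a hypothetical stand-in for the DKPro WSD step, shown only to illustrate the (lemma, POS) → sensekey mapping: it uses nltk's WordNet interface and the naive most-frequent-sense baseline, not DKPro's actual disambiguation algorithms.

```python
# Hypothetical stand-in for the DKPro WSD step: map a (lemma, POS) pair to a
# WordNet sensekey using the most-frequent-sense baseline via nltk's WordNet
# interface (NOT DKPro's disambiguation algorithms).
from nltk.corpus import wordnet as wn

POS_MAP = {'NN': wn.NOUN, 'VB': wn.VERB, 'JJ': wn.ADJ, 'RB': wn.ADV}

def to_sensekey(lemma, treebank_pos):
    wn_pos = POS_MAP.get(treebank_pos[:2])
    if wn_pos is None:
        return None                       # no WordNet class for this tag
    candidates = wn.lemmas(lemma, pos=wn_pos)
    # WordNet orders senses by frequency, so the first candidate is the
    # most-frequent-sense baseline.
    return candidates[0].key() if candidates else None

print(to_sensekey('bank', 'NN'))          # prints the sensekey of the most
                                          # frequent noun sense of 'bank'
```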

Training of the word and synset models is provided by two Python scripts in the syn2vec folder:

  • syn2vec.py, a wrapper around word2vec, offering the following features:

    • CBOW and skip-gram architectures
    • nce_loss and sampled_softmax_loss loss functions
    • Adagrad, Adam, and stochastic gradient descent optimizers
    • loading and saving a model
    • t-SNE plotting
  • word2vec_optimized.py, a slightly customized version of the TensorFlow code available at [5]: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_optimized.py; the customization adds saving of correct and incorrect predictions, among other minor features. A minimal sketch of the training objective follows this list.
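
For orientation, here is a minimal sketch of the skip-gram/NCE objective in the style of the TensorFlow word2vec tutorial these scripts build on (TensorFlow 1.x API; the sizes and variable names are illustrative, not syn2vec.py's actual interface):

```python
import math
import tensorflow as tf

vocab_size, embed_dim, num_sampled, batch = 50000, 128, 64, 128

inputs = tf.placeholder(tf.int32, shape=[batch])        # center-word ids
labels = tf.placeholder(tf.int32, shape=[batch, 1])     # context-word ids

embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, inputs)

nce_w = tf.Variable(tf.truncated_normal([vocab_size, embed_dim],
                                        stddev=1.0 / math.sqrt(embed_dim)))
nce_b = tf.Variable(tf.zeros([vocab_size]))

# NCE loss: learn to distinguish true context words from num_sampled noise words.
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_w, biases=nce_b,
                                     labels=labels, inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocab_size))

# syn2vec.py also offers Adagrad and Adam; plain SGD is shown here.
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
```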

Other important folders:

  • gold-data: contains the evaluation data and the overall and per-category results. Files of interest:

    • 20160-12-14-global-results.txt
    • 2016-12-14-categories-results.txt
    • words-nearby.txt
    • words-synsets.csv
    • synsets-nearby.tx
    • synsets-words.csv
  • deliverables: contains the project abstract, the report, the queries used against the WordNet database, and plots.

  • scripts: the scripts that transform the initial corpus, train the models, and evaluate them:

    • generate-streamfiles.sh, generate-linefiles.sh: transform the initial corpus into (lemma, POS) pairs
    • map-streamfiles.sh, map-linefiles.sh: disambiguate the (lemma, POS) pairs into sensekeys
    • train-words.sh, train-synsets.sh: train the word or synset models
    • eval-words-quest-words.sh, eval-words-categories.sh, eval-synsets-quest-words.sh, eval-synsets-categories.sh: evaluation scripts against Google's questions-words.txt in gold-data

Due to GitHub's file-size limits on free accounts, the following files are not present in the repository:

  • text8, text8.zip, text8-l-pos.tx, 2016-12-07-text8-synsets.txt
  • any trained models (the models will have to be regenerated using the scripts above)

In addition to Python 2.7 and the libraries used in the Python scripts (numpy, pandas, nltk; installing Anaconda is recommended), the following must also be installed: the TensorFlow framework, Lua, and Torch.
