T-Vecs

Prerequisites

  • Python 2.7 installed and set up
  • pip installed and set up
  • All dependencies listed in requirements.txt installed
  • nltk_data downloaded using nltk.download() (only the tokenizers are required)
  • Corpus downloaded and extracted into the specified directory
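
The tokenizer download step above can also be scripted instead of using the interactive downloader. The following is a minimal sketch, not part of the project: it assumes that "tokenizers" refers to the NLTK package named punkt (the README does not name the package), and the helper name download_tokenizers and its download parameter are hypothetical.

```python
# Sketch: fetch only the NLTK tokenizer data needed by the prerequisites.
# Assumption: "tokenizers" means the "punkt" package; adjust if the
# project needs a different one.
TOKENIZER_PACKAGES = ["punkt"]


def download_tokenizers(download=None):
    """Download each tokenizer package and return {name: success flag}."""
    if download is None:
        # Imported lazily so this file can be loaded without NLTK installed.
        import nltk
        download = nltk.download
    # nltk.download(name) returns True when the package is available locally.
    return {name: bool(download(name)) for name in TOKENIZER_PACKAGES}
```

Calling download_tokenizers() with no arguments uses nltk.download and needs network access the first time; passing a custom callable is only there to make the helper easy to test.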

Setup Development Environment

git clone https://github.com/KshitijKarthick/tvecs.git
cd tvecs
pip install -r requirements.txt
# Only the model needs to be downloaded and extracted into the t-vex directory

Install as a Package

# Install package
pip install git+https://github.com/KshitijKarthick/tvecs.git

# Usage from cmd line without recommendations menu
tvecs -c ./config.json

# Usage from cmd line with recommendations menu
tvecs -c ./config.json -r

# Usage without config file, with models, without recommendations menu
tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-models -m2 ./data/models/t-vex-hindi-models

# Usage without config file, with models, with recommendations menu
tvecs -r -l1 english -l2 hindi -m1 ./data/models/t-vex-english-models -m2 ./data/models/t-vex-hindi-models

# Usage from inside python as a library
import tvecs.vector_space_mapper.vector_space_mapper as vm

Data

Corpus Download details

We currently focus on English and Hindi. Kannada and Tamil are possible future additions.

Sources

Bilingual Dictionary details

Provided in the repository under data/bilingual_dictionary. Compiled from the following sources.

Credits

Evaluation Dataset details

Human relatedness judgement score datasets are provided in data/evaluate.

Credits
  • wordsim_relatedness_goldstandard
  • MEN_dataset_natural_form_full
  • Mturk_287
  • Mturk_771
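
Datasets of human relatedness judgements like the ones listed above are commonly used by comparing a model's similarity scores for word pairs against the human scores with a rank correlation. The README does not say which measure the project uses, so the following is only a sketch under the assumption of Spearman correlation; the function names are hypothetical and not taken from the tvecs code.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = float(len(rx))
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        # Undefined when one list is constant; this sketch returns 0.0.
        return 0.0
    return cov / (var_x * var_y) ** 0.5
```

Here xs would hold the human scores for a list of word pairs and ys the model's similarity scores for the same pairs, in the same order.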

Ensure the model is downloaded and extracted into the t-vex directory

  • data/corpus -> corpus
  • data/models -> models

Usage Details

T-Vecs Driver Module Cmd Line Args

$ python -m tvecs --help

usage: __main__.py [-h] [-v] [-s] [-i ITER] [-m1 MODEL1] [-m2 MODEL2]
               [-l1 LANGUAGE1] [-l2 LANGUAGE2] [-c CONFIG]
               [-b BILINGUAL_DICT] [-r]

Script used to generate models

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  -s, --silent          silence all logging
  -i ITER, --iter ITER  number of Word2Vec iterations
  -m1 MODEL1, --model1 MODEL1
                        pre-computed model file path
  -m2 MODEL2, --model2 MODEL2
                        pre-computed model file path
  -l1 LANGUAGE1, --language1 LANGUAGE1
                        language name of model 1/ text 1
  -l2 LANGUAGE2, --l2 LANGUAGE2
                        language name of model 2/ text 2
  -c CONFIG, --config CONFIG
                        config file path
  -b BILINGUAL_DICT, --bilingual BILINGUAL_DICT
                        bilingual dictionary path
  -r, --recommendations
                        provide recommendations
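
For readers who want to script against the same options, the help text above can be mirrored with argparse. This is a reconstruction from the printed help only, not the project's actual source: the function name build_parser, the destination names, and the integer type for ITER are assumptions.

```python
import argparse


def build_parser():
    """Rebuild the command-line interface described by the help text."""
    parser = argparse.ArgumentParser(
        prog="tvecs", description="Script used to generate models"
    )
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="increase output verbosity")
    parser.add_argument("-s", "--silent", action="store_true",
                        help="silence all logging")
    # Assumption: the iteration count is parsed as an integer.
    parser.add_argument("-i", "--iter", type=int,
                        help="number of Word2Vec iterations")
    parser.add_argument("-m1", "--model1",
                        help="pre-computed model file path")
    parser.add_argument("-m2", "--model2",
                        help="pre-computed model file path")
    parser.add_argument("-l1", "--language1",
                        help="language name of model 1/ text 1")
    # The help text prints the long form as --l2, so it is kept as shown.
    parser.add_argument("-l2", "--l2", dest="language2",
                        help="language name of model 2/ text 2")
    parser.add_argument("-c", "--config", help="config file path")
    parser.add_argument("-b", "--bilingual", dest="bilingual_dict",
                        help="bilingual dictionary path")
    parser.add_argument("-r", "--recommendations", action="store_true",
                        help="provide recommendations")
    return parser
```

For example, build_parser().parse_args(["-c", "./config.json", "-r"]) yields a namespace with config set to "./config.json" and recommendations set to True.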

Config File Format

  • See config.json in the repository for example.

Execution & Building

# Preprocessing, model generation, bilingual dictionary generation, and vector space mapping between English and Hindi from the corpus, using the config file

python -im tvecs -c config.json

# [ The -i flag leaves Python in interactive mode; use the dictionary tvex_calls, which contains the results of every step performed ]

# Bilingual dictionary generation and vector space mapping between English and Hindi, providing pre-computed models

python -im tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-model -m2 ./data/models/t-vex-hindi-model -b ./data/bilingual_dictionary/english_hindi_train

python -im tvecs -c config.json

# [ The -i flag leaves Python in interactive mode; use the dictionary tvex_calls, which contains the results of every step performed ]

Obtain Recommendations

# Provide Recommendations using config file
python -m tvecs -c ./config.json -r

# Provide Recommendations using cmd line params
python2 -m tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-model -m2 ./data/models/t-vex-hindi-model -b ./data/bilingual_dictionary/english_hindi_train_bd -r

# Output for recommendations

Enter your Choice:
1> Recommendation
2> Exit

Choice: 1
Enter word in Language english: examination

Word    =>  Score

जाँच    =>  0.643208742142
नियुक्ति    =>  0.640852451324
जांच    =>  0.638412773609
अध्ययन  =>  0.638307392597
विवेचना =>  0.638229370117
मंत्रणा =>  0.634038448334
पुनर्मूल्यांकन  =>  0.627283990383
अध्‍ययन =>  0.624040842056
निरीक्षण    =>  0.623490035534
जाच =>  0.619904220104
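
The output above pairs each candidate word with a score and lists the highest score first. The README does not state how the score is computed; the sketch below assumes it is a cosine similarity between the query word's vector (after mapping into the target language's space) and each candidate vector. All names and the toy vectors are hypothetical and only illustrate the ranking step.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query_vector, target_vectors, topn=10):
    """Return the topn (word, score) pairs, highest score first."""
    scored = [(word, cosine_similarity(query_vector, vec))
              for word, vec in target_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-d vectors, invented purely for illustration.
toy_target_space = {
    "w1": [1.0, 0.0],
    "w2": [0.0, 1.0],
    "w3": [1.0, 1.0],
}
ranking = recommend([1.0, 0.0], toy_target_space, topn=2)
for word, score in ranking:
    print("%s\t=>\t%s" % (word, score))
```

With real models the query vector would come from the source-language model and the candidates from the target-language model; only the ranking logic is shown here.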

Visualisation of vector space

python -m tvecs.visualization.server
[ Open browser to localhost:5000 for visualization ]
[ Ensure model generation is completed before running visualization ]

Execution of Individual Modules

# bilingual dictionary generation -> clustering vectors from trained model
python -m tvecs.bilingual_generator.clustering

# model generation
python -m tvecs.model_generator.model_generation

# vector space mapping [ use the object vm to obtain recommendations ]
python -m tvecs.vector_space_mapper.vector_space_mapper

Execution of Unit Tests

# Run all unit tests
py.test

# Run individual module tests separately
py.test tests/test_emille_preprocessor.py
py.test tests/test_leipzig_preprocessor.py
py.test tests/test_hccorpus_preprocessor.py

Generate Documentation

# Generate HTML Documentation
make html
cd documentation/html && python -m SimpleHTTPServer
# [ Open browser to localhost:8000 to view the documentation ]

# Generate Man Pages
make man
cd documentation/man && man -l tvecs.1


# Other Makefile options
make

Please use `make <target>' where <target> is one of
html       to make standalone HTML files
dirhtml    to make HTML files named index.html in directories
singlehtml to make a single large HTML file
pickle     to make pickle files
json       to make JSON files
htmlhelp   to make HTML files and a HTML help project
qthelp     to make HTML files and a qthelp project
applehelp  to make an Apple Help Book
devhelp    to make HTML files and a Devhelp project
epub       to make an epub
epub3      to make an epub3
latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter
latexpdf   to make LaTeX files and run them through pdflatex
latexpdfja to make LaTeX files and run them through platex/dvipdfmx
text       to make text files
man        to make manual pages
texinfo    to make Texinfo files
info       to make Texinfo files and run them through makeinfo
gettext    to make PO message catalogs
changes    to make an overview of all changed/added/deprecated items
xml        to make Docutils-native XML files
pseudoxml  to make pseudoxml-XML files for display purposes
linkcheck  to check all external links for integrity
doctest    to run all doctests embedded in the documentation (if enabled)
coverage   to run coverage check of the documentation (if enabled)