Word Sense Embeddings

Problem formulation

  • Given the context of words surrounding a target word, the model has to predict the correct word sense in the form of <lemma>_<synset>.
  • The output is sense embeddings, not word embeddings.

  • For a detailed overview, please refer to the report.
  • My sense embeddings for 7 randomly selected keys
  • Sense embeddings in 2D

Approach

  • To run the code:

    python code/main.py --window 10 --epochs 50 --sample 1e-6 --grid_search True

Parsing dataset

  • All needed files/resources are set up to be fetched from directory_vars.directory_variables()

  • Parsing of the datasets (progress shown via tqdm):

    1. EuroSense
    2. Train-O-Matic
    3. Semantically Enriched Wikipedia (SEW)
    • Parsing SEW can exhaust memory, so it was parsed into multiple files of 5 million sentences each.
    • Text preprocessing: after parsing, the datasets are stored in a file as a list of lists, lowercased, with numbers, stopwords, and punctuation removed (a sketch follows this list).
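
A minimal sketch of that preprocessing step, assuming NLTK's tokenizer and English stopword list (the repository's actual code may differ in the details):

    import string

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    STOPWORDS = set(stopwords.words('english'))

    def preprocess(sentence):
        """Lowercase, tokenize, and drop numbers, stopwords, and punctuation."""
        tokens = word_tokenize(sentence.lower())
        return [t for t in tokens
                if t not in STOPWORDS
                and t not in string.punctuation
                and not t.isdigit()]

    # After parsing, each dataset is a list of token lists:
    # corpus = [preprocess(sentence) for sentence in sentences]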

Model

  • Build the model with gensim.models.Word2Vec, using the defaults as a first trial, and call model.build_vocab() to pass the vocabulary to the model.
  • Then train the model on all the cores of my PC except one, so I can keep using the PC freely.
  • Train the model for 30 or 50 epochs, logging the loss after each epoch in order to get an overview of how training is progressing (see the sketch after this list).
  • After training, the model is evaluated against a gold set, using:
    • Cosine similarity (model.similarity())
    • Weighted cosine similarity
    • Tanimoto similarity
  • The best score achieved was 0.5954, using the following parameters:

    Parameter             Value
    Alpha                 0.001
    Skip-gram             0
    Sample                0.0001
    Hierarchical softmax  1
    Window                10
    Negative              5
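
A sketch of that training loop with per-epoch loss logging, assuming gensim 4's API (older gensim versions use iter instead of epochs) and the corpus token lists from the parsing step:

    import multiprocessing

    from gensim.models import Word2Vec
    from gensim.models.callbacks import CallbackAny2Vec

    class EpochLossLogger(CallbackAny2Vec):
        """Log the per-epoch loss (gensim reports a cumulative total)."""

        def __init__(self):
            self.epoch, self.previous = 0, 0.0

        def on_epoch_end(self, model):
            cumulative = model.get_latest_training_loss()
            print(f'epoch {self.epoch}: loss {cumulative - self.previous:.0f}')
            self.previous, self.epoch = cumulative, self.epoch + 1

    # Parameters from the best run above; leave one core free.
    model = Word2Vec(alpha=0.001, sg=0, sample=1e-4, hs=1, window=10,
                     negative=5, workers=multiprocessing.cpu_count() - 1)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=50,
                compute_loss=True, callbacks=[EpochLossLogger()])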

Grid Search

  • In order to fine-tune the model, one can either search manually or use GridSearchCV, but the latter raises errors with gensim models, so I had to build my own; refer to code/grid_search_model.py (a sketch follows this list).
  • The grid-search results were plotted as well, to visualize how the different parameters correlate with the model's score (the correlation between the model's output and the gold test set).
  • My grid-search results
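
A minimal sketch of a hand-rolled grid search along those lines; evaluate() here is a hypothetical placeholder for the correlation scoring against the gold set, and the repository's real implementation is code/grid_search_model.py:

    import itertools

    from gensim.models import Word2Vec

    param_grid = {
        'alpha': [0.001, 0.01, 0.025],
        'sg': [0, 1],
        'sample': [1e-4, 1e-6],
        'hs': [0, 1],
        'window': [5, 10],
        'negative': [5, 10],
    }

    best_score, best_params = float('-inf'), None
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = Word2Vec(sentences=corpus, workers=3, **params)
        score = evaluate(model)  # hypothetical: correlation with the gold set
        if score > best_score:
            best_score, best_params = score, params
    print(best_score, best_params)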

Plotting

  • The data must be visualized in order to interpret the underlying relations between the sense embeddings; this was done using t-SNE and PCA, as sketched below.
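
A sketch of the 2D t-SNE plot, assuming gensim 4's keyed vectors and scikit-learn (PCA is analogous via sklearn.decomposition.PCA); the subset size of 200 is an arbitrary choice for readability:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.manifold import TSNE

    senses = model.wv.index_to_key[:200]      # a manageable subset of senses
    vectors = np.array([model.wv[s] for s in senses])

    points = TSNE(n_components=2, random_state=42).fit_transform(vectors)

    plt.figure(figsize=(12, 8))
    plt.scatter(points[:, 0], points[:, 1], s=8)
    for (x, y), label in zip(points, senses):
        plt.annotate(label, (x, y), fontsize=6)
    plt.title('Sense embeddings in 2D (t-SNE)')
    plt.show()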

Improvements

  • Use ELMo; this link explains how to formulate and prepare the dataset for use with ELMo.

Roadmap I followed

  1. Parse corpus to extract info needed for training

    1. Download the high-precision dataset (it is smaller, but should be more reliable, having been double-checked).

      wget 'http://lcl.uniroma1.it/eurosense/data/eurosense.v1.0.high-precision.tar.gz'

    2. For every sentence tag:
      1. Choose only the text with the attribute lang=en.
      2. Tokenize every sentence using nltk.tokenize.word_tokenize(s), which returns a list.
      3. Replace every anchor in the tokenized list (step #2) with its lemma_synset.
    3. Save the corpus output in a text file (a parsing sketch follows this roadmap).
    4. Parse the Train-O-Matic dataset:
      1. Parse each file.
      2. Get the context and the lemma.
      3. Replace the target word in the context with its lemma_synset pair.
    5. Parse the SEW dataset:
      • only the first 50 million sentences
    • In all datasets, all characters were lowercased and punctuation was removed.
  2. Restrict sense embeddings to senses present in WordNet.

    1. The file bn2wn_mapping.txt contains the mapping from BabelNet synset ids to WordNet offset ids.
  3. Train a Word2Vec model to create the sense embeddings.

    1. Save the output in the form of embeddings.vec.
  4. Test the sense embeddings in a word-similarity task.

    1. Download "wordsim353" (wget 'http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.zip') and use the combined.tab version.
    2. Cosine similarity (see the evaluation sketch after this roadmap)
      • still needs to be finalized; no clear plan for implementing it yet.
  5. Visualize the results using t-SNE in 2D & 3D.

    1. Sketchy implementation of the plot method.
    2. Test the implementation.
  6. Pipeline

  • Best results acquired:
    • weighted cosine similarity = 0.5954 on approximately 45 million sentences
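
A streaming sketch of the EuroSense parsing step (roadmap step 1). The sentence/text/annotation tag names and the anchor/lemma attributes reflect my reading of the EuroSense dump and are assumptions; multi-word anchors would need extra handling, and the repository's parser is the reference:

    from xml.etree import ElementTree as ET

    from nltk.tokenize import word_tokenize

    def parse_eurosense(xml_path):
        """Yield English sentences as token lists, with each annotated
        anchor token replaced by its lemma_synset."""
        for _, elem in ET.iterparse(xml_path, events=('end',)):
            if elem.tag != 'sentence':
                continue
            text = next((t.text for t in elem.iter('text')
                         if t.get('lang') == 'en' and t.text), None)
            if text is not None:
                tokens = word_tokenize(text)
                for ann in elem.iter('annotation'):
                    if ann.get('lang') != 'en':
                        continue
                    anchor, lemma, synset = (ann.get('anchor'),
                                             ann.get('lemma'), ann.text)
                    tokens = [f'{lemma}_{synset}' if tok == anchor else tok
                              for tok in tokens]
                yield tokens
            elem.clear()  # keep memory flat while streaming the XML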
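
And a sketch of the word-similarity evaluation (roadmap step 4), scoring each wordsim353 pair by the maximum cosine similarity over the matching sense embeddings and comparing against the gold scores with Spearman correlation; the weighted cosine and Tanimoto variants would replace the plain similarity call here:

    import csv
    from collections import defaultdict
    from itertools import product

    from scipy.stats import spearmanr

    # Index sense keys (lemma_synset) by their lemma part.
    lemma_senses = defaultdict(list)
    for key in model.wv.index_to_key:
        lemma_senses[key.rsplit('_', 1)[0]].append(key)

    def pair_similarity(w1, w2):
        """Max cosine similarity over all sense pairs; 0 if a word is unknown."""
        s1, s2 = lemma_senses.get(w1, []), lemma_senses.get(w2, [])
        if not s1 or not s2:
            return 0.0
        return max(model.wv.similarity(a, b) for a, b in product(s1, s2))

    gold, predicted = [], []
    with open('combined.tab') as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)  # skip the header row
        for w1, w2, score in reader:
            gold.append(float(score))
            predicted.append(pair_similarity(w1.lower(), w2.lower()))

    print('Spearman correlation:', spearmanr(gold, predicted).correlation)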
