Description: Final project in "Intro. to NLP" course by Albert Shalumov and Kobi Bodek
Due date | Task | Status | Completed |
---|---|---|---|
22/7 | Finish script to convert Hebrew to transliteration characters | ✔️ | 19/7 |
26/7 | Annotate some data for HMM work | ✔️ | 24/7 |
2/8 | Unigram HMM code finished | ✔️ | 24/7 |
2/8 | Bigram, Trigram HMM code finished | ✔️ | 25/7 |
2/8 | Metrics for HMM finished | ✔️ | 25/7 |
2/8 | Division to syllables finished | ✔️ | 7/8 |
9/8 | Conversion to English letters finished | ✔️ | 7/8 |
15/8 | CRF code finished | ✔️ | 28/7 |
8/9 | Final data version - no more annotation from this point | ✔️ | |
9/8 | Implement edit distance metric | ✔️ | 7/8 |
16/8 | Finish NN | ✔️ | 15/8 |
19/9 | Finished project | ✔️ | |
21/9 | Verified project | | |
Each model script must be run with exactly one command-line argument: search or seeds.
search - train on every candidate configuration and calculate the accuracy measures.
seeds - train with the selected configuration over several random seeds and calculate the accuracy.
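
The two modes can be pictured with a minimal sketch of a command-line entry point. This is an illustration only, not the code in the model scripts: `SEARCH_SPACE`, `BEST_CONFIG`, `SEEDS` and `train_and_score` are hypothetical names, and the "training" is a placeholder that returns a pseudo-random score.

```python
import argparse
import itertools
import random
import statistics

# Hypothetical values; each real model defines its own parameters.
SEARCH_SPACE = {"ngram": [1, 2, 3], "smoothing": [0.1, 1.0]}
BEST_CONFIG = {"ngram": 2, "smoothing": 0.1}
SEEDS = [0, 1, 2]


def train_and_score(config, seed):
    """Placeholder for training: returns a pseudo-random accuracy in [0, 1)."""
    rng = random.Random(f"{sorted(config.items())}-{seed}")
    return rng.random()


def run_search():
    """Score every combination of values in SEARCH_SPACE."""
    keys = sorted(SEARCH_SPACE)
    results = []
    for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
        config = dict(zip(keys, values))
        results.append((config, train_and_score(config, seed=0)))
    return results


def run_seeds():
    """Score the selected configuration once per seed and average."""
    scores = [train_and_score(BEST_CONFIG, seed) for seed in SEEDS]
    return statistics.mean(scores), scores


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run a model in one mode.")
    parser.add_argument("mode", choices=["search", "seeds"])
    args = parser.parse_args(argv)
    if args.mode == "search":
        for config, accuracy in run_search():
            print(config, round(accuracy, 3))
    else:
        mean_accuracy, scores = run_seeds()
        print("mean accuracy:", round(mean_accuracy, 3), "over", len(scores), "seeds")
    return args.mode


# Demo call; a real script would call main() so the mode comes from sys.argv.
main(["seeds"])
```

With `argparse` restricting the argument to the two choices, any other value (or a missing argument) stops the script with a usage message before training starts.
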
- crf_sentence.py - CRF model for word and sentence features
- crf_word.py - CRF model for word only features
- embedding_mds.py - Create embedding matrix using MDS
- embedding_nn.py - Create embedding matrix using NN
- hmm.py - HMM model
- memm.py - MEMM model
- rnn.py - RNN model
- rnn_model.bin - Trained RNN model
- emb_model_mds.npy - MDS embedding matrix
- emb_model_nn.npy - NN embedding matrix
- post_proc\syllabification.py - Syllabification
- post_proc\post_processing.py - Romanization
- metrics.py - Accuracy measures
- test.py - Executes all models with the best configuration
- input_proc\utils.py - Convert and prepare MILA dataset for annotation
- input_proc\verifier.py - Validate annotation
- crf_sentence_res.csv - Results of CRF sentence search
- crf_word_res.csv - Results of CRF word search
- hmm_res.csv - Results of HMM search
- memm_res.csv - Results of MEMM search
- rnn_res.csv - Results of RNN search
- cots\used_packages.txt - Used packages
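
The task table lists an edit distance metric among the accuracy measures. As a rough illustration of what such a metric computes, here is a minimal sketch of the standard Levenshtein distance between two strings. It is an assumption that the project uses this variant; the function name `edit_distance` is hypothetical and is not taken from metrics.py.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("shalom", "salom"))
```

Keeping only two rows of the dynamic-programming table is enough because each cell depends on the current and previous row alone.
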