Dependencies:
- see requirements.txt for the required packages
- download the Google pretrained word2vec model from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM and unzip it into the data/ directory
Data:
- train.txt (test.txt): one sentence per line
- train_label.txt (test_label.txt): the corresponding label for each sentence in train.txt (test.txt)
- *.p files are generated by the programs
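
A minimal sketch of reading the two parallel files described above. The helper name `load_labeled_sentences` is hypothetical, and it assumes that line N of the label file belongs to line N of the sentence file. Labels are kept as strings because their format is not specified here. The demo writes two tiny temporary files so that it runs on its own.

```python
import os
import tempfile


def load_labeled_sentences(sentence_path, label_path):
    """Pair each sentence with the label on the same line number (assumed layout)."""
    with open(sentence_path, encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f]
    with open(label_path, encoding="utf-8") as f:
        labels = [line.strip() for line in f]
    if len(sentences) != len(labels):
        raise ValueError("sentence file and label file have different line counts")
    return list(zip(sentences, labels))


# Demo with temporary stand-ins for train.txt and train_label.txt.
with tempfile.TemporaryDirectory() as tmp:
    sentence_path = os.path.join(tmp, "train.txt")
    label_path = os.path.join(tmp, "train_label.txt")
    with open(sentence_path, "w", encoding="utf-8") as f:
        f.write("the cat sat\nthe dog barked\n")
    with open(label_path, "w", encoding="utf-8") as f:
        f.write("0\n1\n")
    pairs = load_labeled_sentences(sentence_path, label_path)
```

Failing early on a line-count mismatch catches a truncated label file before training starts.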
Files:
-
preprocess_data.py:
- generate a corpus from train.txt and test.txt
- generate an index for each word in the corpus
- transform each word in a sentence to its corresponding index
- output file: corpus.p
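
The steps above can be sketched roughly as follows. This is not the code in preprocess_data.py: the function names, the whitespace tokenization, and the dictionary layout are assumptions made for illustration. The `.p` extension suggests pickle, so the sketch serializes with `pickle`, which is also an assumption.

```python
import pickle


def build_corpus(sentences):
    """Give every distinct token an integer index, in order of first appearance."""
    word_to_index = {}
    for sentence in sentences:
        for word in sentence.split():  # assumption: whitespace tokenization
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)
    return word_to_index


def to_indices(sentence, word_to_index):
    """Replace each word in a sentence with its corpus index."""
    return [word_to_index[word] for word in sentence.split()]


sentences = ["the cat sat", "the dog sat down"]
word_to_index = build_corpus(sentences)
indexed = [to_indices(s, word_to_index) for s in sentences]

# Hypothetical payload layout; the real contents of corpus.p may differ.
payload = pickle.dumps({"word_to_index": word_to_index, "sentences": indexed})
restored = pickle.loads(payload)
```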
-
word_embed.py:
- use the Google pretrained word2vec model
- find the word embedding vector for each word in the corpus.p file; if no vector is found, randomly generate a vector for that word
- change the key of the word embedding mapping from the word to its index in the corpus.p file
- output file: word2vec.p (this word embedding model only contains words that appear in corpus.p)
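
A small sketch of the lookup-with-fallback idea described above. A plain dictionary stands in for the pretrained model so that the example runs without the download. The function name, the 3-dimensional vectors, and the uniform range of [-0.25, 0.25] for random vectors are all assumptions, not values taken from word_embed.py.

```python
import random


def build_index_embeddings(word_to_index, pretrained, dim, seed=0):
    """Map each corpus index to a vector, falling back to a random one."""
    rng = random.Random(seed)
    index_to_vector = {}
    for word, index in word_to_index.items():
        if word in pretrained:
            # Known word: copy the pretrained vector.
            index_to_vector[index] = list(pretrained[word])
        else:
            # Unknown word: draw a random vector (range is an assumption).
            index_to_vector[index] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return index_to_vector


# Stand-in for the pretrained word2vec model (toy 3-dimensional vectors).
pretrained = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
word_to_index = {"cat": 0, "dog": 1, "zyzzyva": 2}
embeddings = build_index_embeddings(word_to_index, pretrained, dim=3)
```

Keying the result by index rather than by word lets a model look up a vector directly from the indexed sentences.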
-
basic_rnn.py:
- basic RNN class
- based on the RNN model in https://github.com/dennybritz/rnn-tutorial-rnnlm, with several changes
-
vanilla_rnn.py:
- based on basic_rnn.py and the vanilla RNN model from https://github.com/gwtaylor/theano-rnn
- extend the BasicRNN model from basic_rnn.py with the following changes:
- change the parameter update approach to momentum
- add L1 and L2 regularization to the cost function
- add a bias term to the layer function
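
To illustrate the first two changes, here is a plain-Python sketch of a classical momentum update and an L1/L2-regularized cost. It is a generic illustration, not the Theano code in vanilla_rnn.py: the function names, the hyperparameter values, and the exact form of the momentum rule are assumptions.

```python
def regularized_cost(data_cost, weights, l1=0.0, l2=0.0):
    """Add L1 (sum of |w|) and L2 (sum of w^2) penalties to a data cost."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return data_cost + l1 * l1_term + l2 * l2_term


def momentum_step(weights, grads, velocity, learning_rate=0.1, momentum=0.9):
    """One classical momentum update: v <- mu * v - lr * g, then w <- w + v."""
    new_velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


weights = [1.0, -2.0]
grads = [0.5, -0.5]
velocity = [0.0, 0.0]

cost = regularized_cost(1.0, weights, l1=0.01, l2=0.001)
weights, velocity = momentum_step(weights, grads, velocity)
```

The velocity carries over between steps, so repeated gradients in the same direction produce progressively larger updates.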
-
basicRNN_w2v.py:
- an example of training the basic RNN model and saving the trained model under the ./data directory
-
model_test.py:
- an example of loading a pretrained RNN model and testing it with the test data from the ./data directory
- generate an evaluation matrix for performance evaluation
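
One common form of evaluation matrix is a confusion matrix. The sketch below assumes that reading, along with integer class labels starting at 0. Neither assumption is confirmed by model_test.py, and the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


true_labels = [0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1]
matrix = confusion_matrix(true_labels, predicted_labels, num_classes=2)
acc = accuracy(matrix)
```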
-
gru_rnn.py:
- GRU model
- supports mini-batch training
- note: when loading data in mini-batches, take care with the last batch, which may be smaller than the assigned batch size
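
The last-batch caveat can be seen in a short sketch. The generator below is a hypothetical helper, not the loader in gru_rnn.py. It slices the data in steps of `batch_size`, so the final slice simply holds whatever is left.

```python
def iter_minibatches(samples, labels, batch_size):
    """Yield (samples, labels) slices; the final slice may be shorter."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        end = start + batch_size
        yield samples[start:end], labels[start:end]


samples = list(range(10))
labels = [i % 2 for i in samples]
batch_sizes = [len(batch) for batch, _ in iter_minibatches(samples, labels, 4)]
```

With 10 samples and a batch size of 4, the batches have sizes 4, 4, and 2, so any code that assumes a fixed batch dimension needs to read the size from the batch itself.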
-
gruRNN_w2v.py:
- an example of training the GRU RNN model with or without mini-batches