nyu-dl/gated_word_char_rlm

Gated Word-Character Recurrent Language Model

Code for the experiments in the paper "Gated Word-Character Recurrent Language Model". The code is based on the dl4mt-tutorial: https://github.com/nyu-dl/dl4mt-tutorial.

Required packages

  • Theano
  • numpy
  • scipy
  • sklearn
  • pyyaml (install with: pip install pyyaml)

Model files

  • word_char_lm.py - Takes both word-level and character-level inputs. Choose how they are combined ("gate" or "concat") in the config file.
  • char_lm.py - Takes character-level input only.
  • word_lm.py - Takes word-level input only.
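
To make the "gate" and "concat" options more concrete, the following is a minimal NumPy sketch of two ways a word vector and a character-derived vector could be combined. The gate form (a scalar gate per word, computed from the word embedding) follows the general idea described in the paper; the function names, array shapes, and the parameters v_g and b_g are illustrative assumptions and are not taken from word_char_lm.py.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def combine_gate(x_word, x_char, v_g, b_g):
    """Mix word and character vectors with one scalar gate per word.

    x_word, x_char: arrays of shape (n_words, dim)
    v_g: hypothetical gate weight vector of shape (dim,)
    b_g: hypothetical scalar gate bias
    """
    # One gate value in (0, 1) per word, computed from the word vector.
    g = sigmoid(np.dot(x_word, v_g) + b_g)[:, None]  # shape (n_words, 1)
    # g near 0 keeps the word vector, g near 1 keeps the character vector.
    return (1.0 - g) * x_word + g * x_char


def combine_concat(x_word, x_char):
    """Concatenate the two vectors along the feature axis."""
    return np.concatenate([x_word, x_char], axis=-1)
```

Note that the gated output keeps the same dimensionality as each input, while concatenation doubles it, so the layer that consumes the combined vector needs a matching input size.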

Data / Preprocessing

Input data should be a text file. Each line contains one tokenized sentence. The Penn Treebank dataset preprocessed by Tomas Mikolov et al. (2010) is available as an example.

If you use your own dataset, tokenize the sentences and split the data into training, validation, and test sets. Then create word and character dictionaries using scripts like tools/build_dictionary_word.py and tools/build_dictionary_char.py. Specify the paths to the data and dictionary files in the config file (.yaml).
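
As a rough illustration of what building word and character dictionaries involves, here is a small self-contained sketch. It is not the code in the tools/ scripts: the function name build_dictionaries, the reserved tokens "eos" and "UNK", and the frequency-based ordering are all assumptions made for this example, and the real scripts may use a different format.

```python
from collections import Counter


def build_dictionaries(lines, specials=("eos", "UNK")):
    """Build word-to-index and character-to-index dictionaries.

    lines: iterable of tokenized sentences (tokens separated by spaces).
    specials: hypothetical reserved tokens placed at the lowest indices.
    """
    word_counts = Counter()
    char_counts = Counter()
    for line in lines:
        words = line.split()
        word_counts.update(words)
        for word in words:
            char_counts.update(word)  # counts each character of the word

    def to_index(counts):
        # Most frequent first; ties broken alphabetically for determinism.
        ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
        vocab = {tok: i for i, tok in enumerate(specials)}
        for tok in ordered:
            if tok not in vocab:
                vocab[tok] = len(vocab)
        return vocab

    return to_index(word_counts), to_index(char_counts)
```

For a real corpus, the lines argument could simply be an open file handle for the training text, and the resulting dictionaries could be saved in whatever format the config file expects.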

Run code

First, clone the repository:

git clone https://github.com/Yasumasa/gated_word_char_rlm.git

To run the training pipeline, make sure the required packages are installed, then run the following commands from the root directory of this repository.

If you run on GPU (recommended):

cd gated_word_char_rlm

# gated word & char with pretraining
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_char_lm.py ./config_files/gate_word_char_pretrain.yaml

# gated word & char
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_char_lm.py ./config_files/gate_word_char.yaml

# concat word & char with pretraining
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_char_lm.py ./config_files/concat_word_char_pretrain.yaml

# concat word & char
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_char_lm.py ./config_files/concat_word_char.yaml

# char only
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python char_lm.py ./config_files/char_only.yaml

# word only
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python word_lm.py ./config_files/word_only.yaml

If you run on CPU:

cd gated_word_char_rlm
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python -u [model name]_lm.py ./config_files/[config name].yaml
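
If you want to launch several of the configurations above in sequence, a small wrapper along the following lines may help. This is only a sketch: the helper names (theano_flags, build_command, run, run_all) are hypothetical, and the script and config pairs are copied from the commands listed above. It assumes it is started from the root directory of the repository.

```python
import os
import subprocess

# (script, config) pairs taken from the commands listed above.
RUNS = [
    ("word_char_lm.py", "./config_files/gate_word_char_pretrain.yaml"),
    ("word_char_lm.py", "./config_files/gate_word_char.yaml"),
    ("word_char_lm.py", "./config_files/concat_word_char_pretrain.yaml"),
    ("word_char_lm.py", "./config_files/concat_word_char.yaml"),
    ("char_lm.py", "./config_files/char_only.yaml"),
    ("word_lm.py", "./config_files/word_only.yaml"),
]


def theano_flags(device="gpu"):
    """Return the THEANO_FLAGS value used above for the given device."""
    return "mode=FAST_RUN,device={},floatX=float32".format(device)


def build_command(script, config):
    """Return the argument list for one training run."""
    return ["python", "-u", script, config]


def run(script, config, device="gpu"):
    """Run one training job and return its exit code."""
    env = dict(os.environ, THEANO_FLAGS=theano_flags(device))
    return subprocess.call(build_command(script, config), env=env)


def run_all(device="gpu"):
    """Run every configuration in RUNS, one after another."""
    return [run(script, config, device) for script, config in RUNS]
```

Calling run_all("cpu") would run all six configurations on CPU; passing a single pair to run() launches just one of them.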
