This work was done as part of an assignment for E1 246: Natural Language Understanding (2019). The accompanying report is included as report.pdf.
NLTK's treebank dataset is used to train the PCFG.
Project layout
data/
EVALB/
results/
ckpt/
requirements.txt
config.py
data_handler.py
parser.py
utils.py
driver.py
init.sh
report.pdf
-- config.py --
# train and test split of file ids
train_set = 'data/train.txt'
test_set = 'data/test.txt'
# to save during training or load during testing or parsing
model_path = 'ckpt/model.pt'
# folder to store gold and generated parses
target_folder = 'results'
# number of processes to spawn during test time
processes = 4
# smoothing type prob/add_one
smoothing = 'prob'
Make sure config.py has the right values set before running the program.
The data folder contains train.txt and test.txt, which hold comma-separated file ids split for training and testing.
Note: You can use your own splits either by updating train.txt and test.txt or by adding new files and updating config.py.
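A split file of this form could be read as follows. This is a minimal sketch, not the code in data_handler.py: the helper name `read_fileids` is hypothetical, and it assumes the ids are separated by commas with optional whitespace or newlines around them.

```python
from pathlib import Path


def read_fileids(path):
    """Return the comma-separated file ids stored in `path` (hypothetical helper)."""
    text = Path(path).read_text()
    # Split on commas, then drop surrounding whitespace and empty entries.
    return [fid.strip() for fid in text.split(",") if fid.strip()]
```

The resulting list can be passed to NLTK, for example `nltk.corpus.treebank.parsed_sents(fileids)`, provided the treebank corpus has been downloaded.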
EVALB is the program used to compute the evaluation scores.
The gold.txt and result.txt files are stored in the results folder after the testing phase. These files are used for evaluation.
Note: The results path can be changed by updating target_folder in config.py.
The ckpt folder stores the checkpoint, which contains the production rules and their counts.
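How rule probabilities can be derived from stored counts is sketched below. This is an illustration under assumptions, not the project's checkpoint format: rules are keyed as hypothetical `(lhs, rhs)` tuples, and add-one smoothing is read here as adding one to every observed rule count before normalising. Only relative-frequency estimation and this add-one variant are shown.

```python
from collections import Counter, defaultdict


def rule_probabilities(rule_counts, add_one=False):
    """Turn {(lhs, rhs): count} into {(lhs, rhs): probability}.

    With add_one=True, one is added to every observed rule count before
    normalising (an assumed reading of add-one smoothing).
    """
    k = 1 if add_one else 0
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count + k
    return {
        (lhs, rhs): (count + k) / lhs_totals[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Toy counts, invented for illustration only.
counts = Counter({
    ("NP", ("DT", "NN")): 3,
    ("NP", ("NN",)): 1,
    ("VP", ("VBZ", "NP")): 2,
})
probs = rule_probabilities(counts)
smoothed = rule_probabilities(counts, add_one=True)
```

In both cases the probabilities of all rules sharing a left-hand side sum to one, which is the property a PCFG requires.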
data_handler.py is responsible for reading the training or test file named in config.py and generating sentences.
parser.py contains the core implementation of CYKParser.
utils.py contains helper methods used by the other files.
driver.py contains the main entry point.
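For orientation, probabilistic CYK parsing can be sketched as below. This is a minimal illustration on a toy grammar in Chomsky normal form, not the CYKParser implementation in parser.py; the function name `cky_parse`, the grammar dictionaries, and the tuple tree format are all hypothetical.

```python
import math


def cky_parse(words, lexical, binary, start="S"):
    """Return (log_prob, tree) of the best parse, or None if none exists.

    lexical: {(tag, word): prob}            for rules  tag -> word
    binary:  {(parent, left, right): prob}  for rules  parent -> left right
    """
    n = len(words)
    # best[(i, j)][label] = (log_prob, tree) for the span words[i:j]
    best = {}
    for i, word in enumerate(words):
        cell = {}
        for (tag, w), p in lexical.items():
            if w == word:
                cell[tag] = (math.log(p), (tag, word))
        best[(i, i + 1)] = cell

    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = {}
            for k in range(i + 1, j):  # split point between the two children
                left_cell, right_cell = best[(i, k)], best[(k, j)]
                for (parent, left, right), p in binary.items():
                    if left in left_cell and right in right_cell:
                        l_score, l_tree = left_cell[left]
                        r_score, r_tree = right_cell[right]
                        score = math.log(p) + l_score + r_score
                        # Keep only the highest-scoring analysis per label.
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (parent, l_tree, r_tree))
            best[(i, j)] = cell

    return best.get((0, n), {}).get(start)


# Toy grammar, invented for illustration only.
lexical = {
    ("DT", "the"): 1.0,
    ("NN", "dog"): 0.5,
    ("NN", "cat"): 0.5,
    ("VBZ", "sees"): 1.0,
}
binary = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "DT", "NN"): 1.0,
    ("VP", "VBZ", "NP"): 1.0,
}
result = cky_parse("the dog sees the cat".split(), lexical, binary)
```

Log probabilities are used so that long sentences do not underflow. A grammar read off a treebank is not in Chomsky normal form, so a real parser would also need to binarise rules and handle unary productions and unseen words.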
Install the requirements and initialize:
pip3 install -r requirements.txt
./init.sh
Note: In all cases below, the model path can be set either in config.py or through the model path command-line argument.
(1) parse sentence from existing model
python driver.py --mode parse --sent "This sentence will be parsed ."
(2) train the model
python driver.py --mode train
(3) test the model
python driver.py --mode test
./eval -p param results/gold.txt results/result.txt
Note: The results directory, which contains the gold and generated parses, is set by target_folder in config.py.
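The bracket-based scores that EVALB reports (labeled precision, recall, and F1) can be approximated in a few lines. The sketch below is a simplification for intuition only, not a replacement for EVALB: trees are hypothetical nested tuples, part-of-speech brackets are excluded, and none of the options from EVALB's parameter file are applied.

```python
from collections import Counter


def brackets(tree, start=0):
    """Return (labeled spans, end index) for a tree given as nested tuples.

    A tree is (label, child, ...); a preterminal is (tag, word).
    """
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return Counter(), start + 1  # preterminal: covers one word, no bracket
    spans, pos = Counter(), start
    for child in children:
        child_spans, pos = brackets(child, pos)
        spans += child_spans
    spans[(label, start, pos)] += 1
    return spans, pos


def bracket_scores(gold, predicted):
    """Labeled precision, recall, and F1 between two trees."""
    gold_spans, _ = brackets(gold)
    pred_spans, _ = brackets(predicted)
    # Multiset intersection counts each matching bracket once.
    matched = sum((gold_spans & pred_spans).values())
    precision = matched / max(sum(pred_spans.values()), 1)
    recall = matched / max(sum(gold_spans.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1
```

EVALB aggregates matched brackets over the whole test set before computing the final scores, so averaging per-sentence values from this sketch would not reproduce its summary numbers.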