This repository contains the source code for canonical morphological segmentation as presented in: Tatyana Ruzsics and Tanja Samardzic. "Neural Sequence-to-sequence Learning of Internal Word Structure". In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada.
The code uses the SGNMT framework and depends on the Blocks and srilm-swig libraries. Follow the SGNMT instructions to install these dependencies. The implementation also relies on an adapted version of Z-MERT. After installation, update the environment variables LD_LIBRARY_PATH, PYTHONPATH, and PATH in the header of the main executable, Main.sh, with the locations of swig and SRILM.
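As a rough sketch of what such a header could look like: the two install directories below and the variable names SRILM_HOME and SWIG_SRILM_HOME are placeholders chosen for illustration, not values taken from Main.sh, and the SRILM binaries may sit in a machine-specific subdirectory of bin on your system.

```shell
# Placeholder install locations (assumptions); replace with your own paths.
SRILM_HOME="/path/to/srilm"
SWIG_SRILM_HOME="/path/to/swig-srilm"

# Prepend the new locations so they take precedence over existing entries.
# ${VAR:+:$VAR} appends the old value only if it is set, avoiding a stray colon.
export LD_LIBRARY_PATH="$SWIG_SRILM_HOME${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PYTHONPATH="$SWIG_SRILM_HOME${PYTHONPATH:+:$PYTHONPATH}"
export PATH="$SRILM_HOME/bin:$PATH"
```

Prepending rather than overwriting keeps any entries that are already set in your shell.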
The main executable is Main.sh:
Main.sh AbsolutePATHtoDATA AbsolutePATHtoWorkingDir ResultsFolderName NMT_ENSEMBLES BEAM USE_LENGTH_CONTROL
The data folder contains the datasets for canonical segmentation. Example runs for each dataset:
Main.sh "Absolute path to /data/canonical-segmentation/indonesian/" "Absolute path to a working dir" results 5 12 -l
Main.sh "Absolute path to /data/canonical-segmentation/german/" "Absolute path to a working dir" results 5 12 -l
Main.sh "Absolute path to /data/canonical-segmentation/english/" "Absolute path to a working dir" results 5 12 -l
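The three runs above differ only in the language folder, so they could be wrapped in a small loop. The following is a minimal sketch, not part of the repository: the function name run_all_languages, the MAIN_SH variable, and the repo_root and work_dir arguments are hypothetical names introduced here, and the sketch assumes Main.sh is reachable at the path held in MAIN_SH.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): run Main.sh once per language
# with the same settings as the example commands.
MAIN_SH="${MAIN_SH:-./Main.sh}"   # assumed location of Main.sh; override if needed

run_all_languages() {
    repo_root=$1   # absolute path to the repository checkout
    work_dir=$2    # absolute path to a working directory
    for lang in indonesian german english; do
        # Positional arguments follow the usage line:
        # data path, working dir, results folder name, NMT_ENSEMBLES=5, BEAM=12, -l
        "$MAIN_SH" "$repo_root/data/canonical-segmentation/$lang/" \
                   "$work_dir" results 5 12 -l || return 1
    done
}
```

It would be called as `run_all_languages /abs/path/to/repo /abs/path/to/workdir`. The sketch reuses one working directory and one results folder name for all three languages, as the example commands do; whether runs can safely share them is not stated here, so pass a separate working directory per language if in doubt.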