This project is based on Rashmina's project. However, Rashmina's project cannot be run on the HPC because of some limitations, and its classifiers cannot be trained in parallel, so training takes a long time to complete.
This project contains two CSV files:
- label.csv is the original data used by Rashmina's project.
- label3.csv is the data file without duplicated samples.
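As a rough illustration of what "without duplicated samples" could mean, the sketch below drops rows that are exact copies of an earlier row. This is an assumption: the actual rule used to build label3.csv is not documented here, and the function name `drop_duplicate_rows` and the sample rows are made up for the example.

```python
import csv
import io


def drop_duplicate_rows(src, dst):
    """Copy CSV rows from src to dst, keeping only the first copy of each row.

    Assumption: a "duplicated sample" is a row identical to an earlier one.
    """
    seen = set()
    kept = 0
    writer = csv.writer(dst)
    for row in csv.reader(src):
        key = tuple(row)
        if key in seen:
            continue  # an identical row was already written
        seen.add(key)
        writer.writerow(row)
        kept += 1
    return kept


# Made-up in-memory data; real use would pass open file handles instead.
sample = io.StringIO("a,b,1\nc,d,0\na,b,1\n")
out = io.StringIO()
kept = drop_duplicate_rows(sample, out)
```

With real files, the same function could be called with `open("label.csv", newline="")` as the source and a writable handle for the deduplicated output.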
The file label.csv is too large to upload to GitHub. You can regenerate it on the server by running:
python Data.py
The model we trained in Spring 2016 is based on label.csv, but finding a way to train the model with label3.csv should save a lot of time.
- gbc.py: the classifier has been changed; it now trains only one classifier for a given label.
- classify_gbc.py: the function sequence2 should give the score of two aligned words.
- generate_pbs.py: generates the script used on the HPC.
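Because gbc.py trains one classifier per label, each label can be submitted to the HPC as its own job. The sketch below shows one way a script could build a PBS job file per label. It is not the contents of generate_pbs.py: the label names, the resource requests, the output file names, and the `python gbc.py <label>` command line are all assumptions made for illustration.

```python
def make_pbs_script(label, walltime="24:00:00"):
    """Return the text of a PBS job script that trains one label's classifier."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N gbc_{label}",          # job name
        "#PBS -l nodes=1:ppn=1",         # assumed resource request
        f"#PBS -l walltime={walltime}",  # assumed time limit
        "cd $PBS_O_WORKDIR",             # run from the submission directory
        f"python gbc.py {label}",        # assumed command-line interface
    ]
    return "\n".join(lines) + "\n"


# Hypothetical label names; the real ones come from the data file.
labels = ["label_a", "label_b"]
scripts = {f"train_{label}.pbs": make_pbs_script(label) for label in labels}
```

Each entry in `scripts` could be written to disk and submitted with `qsub`, so the scheduler runs the per-label jobs independently instead of training them one after another.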