Learning domain-specific word embeddings for synonym discovery in the software domain.
create an Anaconda environment according to utils/requirments.txt
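a minimal environment-setup sketch, assuming conda is available; the environment name and Python version below are illustrative assumptions, not taken from the repo:

# create and activate a fresh conda environment (name and Python version are assumptions)
conda create -n synonym-emb python=3.7
conda activate synonym-emb
# install the pinned dependencies
pip install -r utils/requirments.txt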
download or prepare your own training and evaluation datasets, and save them in the folders trainData and evalData
download pretrained models and save them in the folder models
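for the BERT experiments, one option is Google's standard uncased base checkpoint; whether this is the checkpoint the experiments expect is an assumption, so substitute the pretrained models the project actually uses:

# download and extract BERT-Base uncased into models/ (choice of checkpoint is an assumption)
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip -d models/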
modify the hyperparameters in utils/constants.py before training
python main.py --train_ft_all --eval_sim --eval_syns_CV 4
--eval_sim: evaluation method 1, synonym discovery;
--eval_syns_CV 4: evaluation method 2, synonym pair prediction, with 4-fold cross-validation
python main.py --c_train_ft_all --eval_sim --eval_syns_CV 4
download and extract Google's BERT project: https://github.com/google-research/bert
modify the paths in the .sh files before running the experiments
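the scripts typically point to the pretrained checkpoint and the data and output directories through variables near the top; the variable names below follow Google's BERT conventions and are assumptions, so match them to what the .sh files actually define:

# illustrative path variables (names are assumptions, adapt to the scripts)
export BERT_BASE_DIR=models/uncased_L-12_H-768_A-12   # pretrained BERT checkpoint
export DATA_DIR=trainData                             # training data
export OUTPUT_DIR=models/bert                         # where fine-tuned checkpoints are written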
bash bert_perpare_data.sh
bash bert_train.sh
bash bert_c_train.sh
bash bert_eval.sh
or run the evaluation directly:
python main.py --modelName Cbert_github_ts200 --modelPath models/bert/Cbert_github_ts200/model.ckpt-200 --eval_sim_bert --eval_syns_CV_bert 4
before evaluation, copy the files vocab.txt and bert_config.json from the pretrained BERT model to the model's folder
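a sketch of that copy step for the model evaluated above; the pretrained-model path is an assumption:

# copy vocabulary and config from the pretrained BERT model into the fine-tuned model's folder
cp models/uncased_L-12_H-768_A-12/vocab.txt models/bert/Cbert_github_ts200/
cp models/uncased_L-12_H-768_A-12/bert_config.json models/bert/Cbert_github_ts200/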
--modelName: name to give the model;
--modelPath: path to the last checkpoint of the BERT model
evaluation metrics are saved in evaluation_results.xlsx
only the best models are saved in models/