Learning domain-specific word embeddings for synonym discovery in the software domain.
create an Anaconda environment according to utils/requirments.txt
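a minimal environment-setup sketch, assuming conda is available; the environment name and Python version below are illustrative assumptions, not taken from the repo:

# create and activate a fresh conda environment (name and Python version are assumptions)
conda create -n synonym-emb python=3.7
conda activate synonym-emb
# install the pinned dependencies
pip install -r utils/requirments.txt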
download or prepare your own training and evaluation datasets, and save them in the folders trainData and evalData
download pretrained models and save them in the folder models
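for the BERT experiments, one option is Google's standard uncased base checkpoint; whether this is the checkpoint the experiments expect is an assumption, so substitute the pretrained models the project actually uses:

# download and extract BERT-Base uncased into models/ (choice of checkpoint is an assumption)
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip -d models/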
modify the hyperparameters in utils/constants.py before training
python main.py --train_ft_all --eval_sim --eval_syns_CV 4
--eval_sim: evaluation method 1, synonym discovery;
--eval_syns_CV 4: evaluation method 2, synonym pair prediction, with 4-fold cross-validation
python main.py --c_train_ft_all --eval_sim --eval_syns_CV 4
download and extract Google's BERT project: https://github.com/google-research/bert
modify the paths in the .sh files before running the experiments
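the scripts typically point to the pretrained checkpoint and the data and output directories through variables near the top; the variable names below follow Google's BERT conventions and are assumptions, so match them to what the .sh files actually define:

# illustrative path variables (names are assumptions, adapt to the scripts)
export BERT_BASE_DIR=models/uncased_L-12_H-768_A-12   # pretrained BERT checkpoint
export DATA_DIR=trainData                             # training data
export OUTPUT_DIR=models/bert                         # where fine-tuned checkpoints are written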
bash bert_perpare_data.sh
bash bert_train.sh
bash bert_c_train.sh
bash bert_eval.sh
or run the evaluation directly:
python main.py --modelName Cbert_github_ts200 --modelPath models/bert/Cbert_github_ts200/model.ckpt-200 --eval_sim_bert --eval_syns_CV_bert 4
before evaluation, copy the files vocab.txt and bert_config.json from the pretrained BERT model to the model's folder
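a sketch of that copy step for the model evaluated above; the pretrained-model path is an assumption:

# copy vocabulary and config from the pretrained BERT model into the fine-tuned model's folder
cp models/uncased_L-12_H-768_A-12/vocab.txt models/bert/Cbert_github_ts200/
cp models/uncased_L-12_H-768_A-12/bert_config.json models/bert/Cbert_github_ts200/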
--modelName: name to give the model;
--modelPath: path to the last checkpoint of the BERT model
evaluation metrics are saved in evaluation_results.xlsx
only the best models are saved in models/