Learning domain-specific word embeddings for synonym discovery in software domain.

QianRuan/SynonymsDiscoveryinSoftwareDomain

Synonyms-Discovery-in-Software-Domain

Setup:

create an Anaconda environment with the packages listed in utils/requirments.txt

Data and pretrained models:

download or prepare your own training and evaluation datasets and save them in the folders trainData and evalData
download the pretrained models and save them in the folder models

Training and evaluation: FastText models

modify the hyperparameters in utils/constants.py before training

approach 1: training word embeddings (WE) from scratch on the domain-specific training corpus

python main.py --train_ft_all --eval_sim --eval_syns_CV 4

--eval_sim: evaluation method 1, synonym discovery;
--eval_syns_CV 4: evaluation method 2, synonym pair prediction, using cross-validation with 4 folds
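
To make evaluation method 1 more concrete, here is a minimal sketch of synonym discovery as a nearest-neighbour search over word vectors. It is not the code behind --eval_sim: the function names and the 3-dimensional toy vectors are made up for illustration, and ranking by cosine similarity is an assumption about how candidate synonyms are scored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_terms(query, embeddings, top_k=3):
    """Rank every other term in `embeddings` by cosine similarity to `query`."""
    query_vec = embeddings[query]
    scored = [
        (term, cosine(query_vec, vec))
        for term, vec in embeddings.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors; a trained model would supply real ones.
toy_embeddings = {
    "repo": [0.9, 0.1, 0.0],
    "repository": [0.8, 0.2, 0.1],
    "bug": [0.0, 0.9, 0.3],
    "defect": [0.1, 0.8, 0.4],
}

candidates = nearest_terms("repo", toy_embeddings, top_k=2)
for term, score in candidates:
    print(f"{term}\t{score:.3f}")
```

With real embeddings, the top-ranked terms for a query would be compared against the known synonyms in the evaluation data.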

approach 2: domain adaptation based on the pretrained model

python main.py --c_train_ft_all --eval_sim --eval_syns_CV 4
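
Both training commands pass --eval_syns_CV 4, which requests cross-validation with 4 folds. The sketch below only illustrates what a 4-fold split of labelled synonym pairs looks like. The helper k_fold_splits and the example pairs are hypothetical, and the repository may shuffle and partition its data differently.

```python
import random


def k_fold_splits(items, k=4, seed=0):
    """Yield (train, test) lists so that each item is in the test set exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to k folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Illustrative synonym pairs, not taken from the evaluation data.
pairs = [
    ("repo", "repository"),
    ("bug", "defect"),
    ("function", "method"),
    ("directory", "folder"),
    ("param", "parameter"),
    ("config", "configuration"),
    ("lib", "library"),
    ("app", "application"),
]

splits = list(k_fold_splits(pairs, k=4))
for fold_no, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train)} training pairs, {len(test)} test pairs")
```

A model would be fitted on each training part and scored on the matching test part, and the per-fold scores would then be averaged.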

Training and evaluation: BERT models

download and extract Google's BERT project: https://github.com/google-research/bert
modify the paths in the .sh files before running experiments

prepare training data for BERT

bash bert_perpare_data.sh
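
Google's BERT pretraining tools read plain text with one sentence per line and an empty line between documents. What bert_perpare_data.sh does internally is not described here, so the following is only a sketch of producing a corpus file in that layout. The function name, the example sentences and the output file name are hypothetical.

```python
import os
import tempfile


def format_pretraining_text(documents):
    """Render documents (each a list of sentences) as one sentence per line,
    with an empty line separating consecutive documents."""
    blocks = []
    for sentences in documents:
        # Collapse internal whitespace and drop empty sentences.
        lines = [" ".join(s.split()) for s in sentences]
        lines = [s for s in lines if s]
        if lines:
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical mini corpus: two documents from a software-related source.
docs = [
    ["Clone the repo first.", "Then   run the tests."],
    ["Fix the bug in the parser.", ""],
]

text = format_pretraining_text(docs)

# Write to a temporary folder here; a real run would target the training data folder.
out_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text)
print(text, end="")
```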

approach 1: training word embeddings (WE) from scratch on the domain-specific training corpus

bash bert_train.sh

approach 2: domain adaptation based on the pretrained model

bash bert_c_train.sh

evaluation

bash bert_eval.sh

or run main.py directly:

python main.py --modelName Cbert_github_ts200 --modelPath models/bert/Cbert_github_ts200/model.ckpt-200 --eval_sim_bert --eval_syns_CV_bert 4

before evaluation, copy the files vocab.txt and bert_config.json from the pretrained BERT model's folder to the folder of the model being evaluated
--modelName: name of the model;
--modelPath: path to the last checkpoint of the BERT model
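
The copy step before evaluation can also be scripted. This is a minimal sketch using only the standard library. The helper copy_bert_support_files is hypothetical and not part of the repository, and the folder of the pretrained model is left as a placeholder because it depends on which model was downloaded.

```python
import shutil
from pathlib import Path


def copy_bert_support_files(pretrained_dir, model_dir,
                            names=("vocab.txt", "bert_config.json")):
    """Copy the vocabulary and config files next to a trained checkpoint.

    Raises FileNotFoundError if one of the files is missing in pretrained_dir.
    """
    pretrained_dir = Path(pretrained_dir)
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = pretrained_dir / name
        if not src.is_file():
            raise FileNotFoundError(f"missing {name} in {pretrained_dir}")
        shutil.copy2(src, model_dir / name)
        copied.append(model_dir / name)
    return copied


# Example call (left commented out; replace the placeholder with your own folder):
# copy_bert_support_files("<pretrained BERT folder>", "models/bert/Cbert_github_ts200")
```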

Experiment results

evaluation metrics are saved in evaluation_results.xlxs
only best models are saved in models/
