This repository contains the code needed to replicate the results of the paper "A Sketch-Based Neural Model for Generating Commit Messages from Diffs".
- Datasets cleaned (datasets_cleaned)
- Datasets original (datasets_original)
- Models
- NNGen
- Predictions
- seq2seq
- utils
Contains the cleaned dataset (the Liu et al. dataset) and the datasets derived from it.
- all - contains the cleaned dataset
- gitignore - contains the dataset with gitignore files
- gradle - contains the dataset with gradle files
- java - contains the dataset with java files
- java_template - contains the dataset with java template files
- md - contains the dataset with md files
- others_v1 - contains the dataset with files which are not gitrepo, gradle, java, txt and xml
- others_v2 - contains the dataset with files which are not gitrepo, gitignore, gradle, java, md, properties, txt, xml and yml
- properties - contains the dataset with properties files
- txt - contains the dataset with txt files
- xml - contains the dataset with xml files
- yml - contains the dataset with yml files
Contains the original dataset (the Jiang et al. dataset) and the datasets derived from it.
- all - contains the original dataset
- gitignore - contains the dataset with gitignore files
- gitrepo - contains the dataset with gitrepo files
- gradle - contains the dataset with gradle files
- java - contains the dataset with java files
- java_template - contains the dataset with java template files
- md - contains the dataset with md files
- others_v1 - contains the dataset with files which are not gitrepo, gradle, java, txt and xml
- others_v2 - contains the dataset with files which are not gitrepo, gitignore, gradle, java, md, properties, txt, xml and yml
- properties - contains the dataset with properties files
- txt - contains the dataset with txt files
- xml - contains the dataset with xml files
- yml - contains the dataset with yml files
distributions_plot.py - Plots the word distributions over the diffs and messages for the gitrepo, java and xml files.
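The core of such a distribution plot is a token-frequency count over each corpus. A minimal sketch of that counting step (whitespace tokenization is an assumption; see distributions_plot.py for the actual tokenization and plotting):

```python
from collections import Counter

def word_distribution(lines):
    """Count token frequencies across a corpus of diffs or messages."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # assumed whitespace tokenization
    return counts
```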
- nmt2.yml - contains the model with two layers
- nmt4.yml - contains the model with four layers and residual connections
- nmt8.yml - contains the model with eight layers and residual connections
- predict-beam5.sh - runs prediction with beam search with width 5
- predict-beam10-pen1-replace-unk.sh - runs prediction with beam search with width 10, length penalty 1, and the copying mechanism
- predict-beam10-pen1.sh - runs prediction with beam search with width 10 and length penalty 1
- predict-beam10-replace-unk.sh - runs prediction with beam search with width 10 and the copying mechanism
- predict-beam10.sh - runs prediction with beam search with width 10
- predict-normal.sh - runs prediction without beam search or the copying mechanism
- predict-replace-unk.sh - runs prediction with copying mechanism
- predict.sh - runs all the prediction scripts
- text_metrics.yml - contains the metrics for training
- train_seq2seq.yml - sets the training bucket sizes
- train.sh - runs the training
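For reference, the length penalty used by the beam-search scripts most likely follows the GNMT-style normalization, where a hypothesis's total log-probability is divided by ((5 + |Y|) / 6)^α, with a penalty of 1 corresponding to α = 1. A minimal rescoring sketch (illustrative only; the actual logic lives inside seq2seq):

```python
def length_penalty(length, alpha=1.0):
    # GNMT-style length normalization: ((5 + |Y|) / 6) ** alpha
    return ((5 + length) / 6) ** alpha

def rescore(hypotheses, alpha=1.0):
    """hypotheses: list of (token_list, total_log_prob) pairs.
    Returns them sorted by length-normalized score, best first."""
    return sorted(hypotheses,
                  key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
                  reverse=True)
```

With α = 0 the penalty is 1 for every length and the raw log-probabilities decide; with α = 1 longer hypotheses are penalized less harshly, which typically favors longer commit messages.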
Our implementation of the NNGen algorithm introduced by Liu et al.
- main.py - contains the implementation
- run.sh - runs the implementation on all datasets in the datasets_original folder
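The idea behind NNGen (Liu et al.) can be sketched as follows: represent each diff as a bag-of-words vector, retrieve the top-k most similar training diffs by cosine similarity, then reuse the message of the candidate whose diff scores highest against the query under a BLEU-like measure. Names and the simplified similarity below are illustrative, not the repository's exact implementation (see main.py):

```python
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bleu_like(ref_tokens, hyp_tokens):
    # Simplified unigram precision, standing in for sentence-level BLEU.
    ref_counts, hyp_counts = Counter(ref_tokens), Counter(hyp_tokens)
    overlap = sum(min(c, ref_counts[t]) for t, c in hyp_counts.items())
    return overlap / max(len(hyp_tokens), 1)

def nngen(query_diff, train_diffs, train_msgs, k=5):
    query = Counter(query_diff.split())
    vecs = [Counter(d.split()) for d in train_diffs]
    top_k = sorted(range(len(train_diffs)),
                   key=lambda i: cosine(query, vecs[i]), reverse=True)[:k]
    best = max(top_k, key=lambda i: bleu_like(query_diff.split(),
                                              train_diffs[i].split()))
    return train_msgs[best]
```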
The predictions folder contains two folders (original, cleaned), each containing three files:
- nmt8-ft.txt - predictions of the nmt8-ft ensemble
- nmt8-ft-jt.txt - predictions of the nmt8-ft-jt ensemble
- target_for_nmt_ensemble.msg - target messages reordered based on the file type
A modified version of Google's seq2seq that supports beam search with a copying mechanism.
- calculate_bleu.sh - calculates the ensemble BLEU score from the per-dataset predictions
- create_dataset_by_file_type.py - generates the gitignore, gitrepo, gradle, java, md, properties, txt, xml and yml datasets
- create_dataset_without_file_types.py - generates the others_v1 and others_v2 datasets
- find_top_k_file_types.py - calculates the top 10 file types found in the datasets
- generate_vocabs.py - generates the reduced vocabulary for each file type
- prepare_diffs.py - generates the template java diff and saves the constant, class, function and variable tokens in a mapper
- prepare_msgs.py - replaces tokens in the messages based on the tokens found in the mapper
- replace.py - replaces the tokens in the predicted java template messages with the tokens found in the mapper
- utils.py
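Together, prepare_diffs.py, prepare_msgs.py and replace.py implement the sketch pipeline: identifiers in a Java diff are swapped for placeholder tokens, a mapper records which placeholder stands for which identifier, and after prediction the placeholders in the generated message are replaced back. A hedged round-trip sketch (the `VAR0`-style placeholder scheme is an assumption; the repository distinguishes constants, classes, functions and variables):

```python
import re

def templatize(diff, identifiers):
    """Replace known identifiers with placeholders; return (template, mapper)."""
    mapper = {}
    out = diff
    for i, ident in enumerate(identifiers):
        placeholder = f"VAR{i}"  # assumed naming scheme, for illustration
        mapper[placeholder] = ident
        out = re.sub(rf"\b{re.escape(ident)}\b", placeholder, out)
    return out, mapper

def detemplatize(msg, mapper):
    """Restore the original identifiers in a predicted template message."""
    for placeholder, ident in mapper.items():
        msg = msg.replace(placeholder, ident)
    return msg
```

A usage example: templatizing `"int userCount = getUsers().size();"` with identifiers `["userCount", "getUsers"]` yields `"int VAR0 = VAR1().size();"`, and the mapper then restores both names in any predicted message that mentions `VAR0` or `VAR1`.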