In this work, we investigate the capabilities of the Text-To-Text Transfer Transformer (T5) to support code review.
This GCS bucket will hold all the data needed for setting up, pre-training, fine-tuning, and testing our T5 model. To set up a new GCS bucket, please follow the original guide provided by Google.
You need to have this folder in your GCS bucket. It contains all of our data and some utilities needed to replicate our results.
In particular, you will find:
- Pre-training dataset: obtained by mining Stack Overflow and CodeSearchNet data.
- Fine-tuning datasets: we fine-tune our T5 small model on different datasets obtained by mining code review data from Gerrit and GitHub repositories.
  - Fine-tuning dataset v1 (Small): the same dataset used by Tufano et al., with non-abstracted code and raw comments.
  - Fine-tuning dataset v2 (Small): the same dataset used by Tufano et al., with non-abstracted code and cleaned comments.
  - Fine-tuning dataset (Large): our new large-scale dataset.
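For intuition on how such code-and-comment pairs can be serialized into T5's text-to-text format, here is a minimal sketch. The `<code>`/`<comment>` tag layout below is a hypothetical serialization chosen for illustration; the released datasets may use a different one.

```python
def build_example(code: str, comment: str, revised_code: str) -> dict:
    """Serialize one code-review instance into a text-to-text pair.

    NOTE: the "<code> ... <comment> ..." layout is an assumption made
    for illustration, not necessarily the exact dataset format.
    """
    source = f"<code> {code.strip()} <comment> {comment.strip()}"
    target = revised_code.strip()
    return {"source": source, "target": target}

example = build_example(
    code="public int sum(int a, int b) { return a - b; }",
    comment="should add, not subtract",
    revised_code="public int sum(int a, int b) { return a + b; }",
)
```

The model is then trained to map each `source` string to its `target` string, which is what makes a single text-to-text architecture usable for all of the tasks above.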
All our datasets are already processed, and everything is set up to start pre-training and fine-tuning the models.
However, if you want to replicate our pre-processing steps, you just need to follow this Colab notebook. There we clean the raw datasets and train the SentencePiece model to accommodate the expanded vocabulary introduced by the pre-training dataset.
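The comment cleaning done in the notebook can be sketched as follows; the concrete rules are defined in the notebook itself, so treat this as an illustrative assumption rather than our exact pipeline:

```python
import re

def clean_comment(comment: str) -> str:
    """Illustrative comment cleaning. The exact rules live in the
    pre-processing notebook; this sketch only shows the general idea."""
    text = comment.strip()
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop user mentions
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_comment("@alice  please see https://example.com   and fix\nthe null check"))
```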
To pre-train and then fine-tune T5, please follow the Colab notebooks provided:
We generate results with different beam sizes by converting the model to PyTorch. If you want to generate predictions using a beam size of 1, you can directly use the fine-tuning Colab notebook linked above: once the model is fine-tuned, you can generate custom predictions. To convert the model, use this Colab notebook, which also provides all the functionality to compute perfect predictions, almost-perfect predictions, CodeBLEU, and BLEU.
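As a reference point, "perfect predictions" are typically exact matches between prediction and target; the notebook linked above is the authoritative implementation, but a minimal sketch (assuming whitespace-insensitive matching, which is our own illustrative choice) looks like this:

```python
def normalize(code: str) -> str:
    # collapse all whitespace so formatting differences don't count
    return " ".join(code.split())

def perfect_prediction_rate(predictions, targets):
    """Fraction of predictions that exactly match their target after
    whitespace normalization (illustrative definition)."""
    assert len(predictions) == len(targets)
    hits = sum(normalize(p) == normalize(t)
               for p, t in zip(predictions, targets))
    return hits / len(targets)

preds   = ["int x = 1;", "return   a + b;", "foo()"]
targets = ["int x = 1;", "return a + b;",   "bar()"]
rate = perfect_prediction_rate(preds, targets)  # 2 of 3 match
```

BLEU and CodeBLEU, by contrast, credit partial overlaps between prediction and target, which is why we report them alongside the stricter exact-match metric.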
Here you can see our results.