In this work, we investigate the capabilities of the Text-To-Text Transfer Transformer (T5) to support code review.
This GCS bucket will hold all the data needed for setting up, pre-training, fine-tuning, and testing our T5 model. To set up a new GCS bucket, please follow the original guide provided by Google.
You need to have this folder in your GCS bucket. It contains all of our data and some utilities needed to replicate our results.
In particular, you will find:
- Pre-training dataset: obtained by mining Stack Overflow and CodeSearchNet data.
- Fine-tuning datasets: we fine-tune our T5 small model on different datasets obtained by mining code review data from Gerrit and GitHub repositories.
  - Fine-tuning dataset v1 (Small): the same dataset used by Tufano et al., with non-abstracted code and raw comments.
  - Fine-tuning dataset v2 (Small): the same dataset used by Tufano et al., with non-abstracted code and cleaned comments.
  - Fine-tuning dataset (Large): our new large-scale dataset.
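For intuition on how such code-and-comment pairs can be serialized into T5's text-to-text format, here is a minimal sketch. The `<code>`/`<comment>` tag layout below is a hypothetical serialization chosen for illustration; the released datasets may use a different one.

```python
def build_example(code: str, comment: str, revised_code: str) -> dict:
    """Serialize one code-review instance into a text-to-text pair.

    NOTE: the "<code> ... <comment> ..." layout is an assumption made
    for illustration, not necessarily the exact dataset format.
    """
    source = f"<code> {code.strip()} <comment> {comment.strip()}"
    target = revised_code.strip()
    return {"source": source, "target": target}

example = build_example(
    code="public int sum(int a, int b) { return a - b; }",
    comment="should add, not subtract",
    revised_code="public int sum(int a, int b) { return a + b; }",
)
```

The model is then trained to map each `source` string to its `target` string, which is what makes a single text-to-text architecture usable for all of the tasks above.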
All our datasets are already processed, and everything is set up to start pre-training and fine-tuning the models.
However, if you want to replicate our pre-processing steps, you just need to follow this Colab notebook. There we clean the raw datasets and train the SentencePiece model to accommodate the expanded vocabulary introduced by the pre-training dataset.
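The comment cleaning done in the notebook can be sketched as follows; the concrete rules are defined in the notebook itself, so treat this as an illustrative assumption rather than our exact pipeline:

```python
import re

def clean_comment(comment: str) -> str:
    """Illustrative comment cleaning. The exact rules live in the
    pre-processing notebook; this sketch only shows the general idea."""
    text = comment.strip()
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop user mentions
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_comment("@alice  please see https://example.com   and fix\nthe null check"))
```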
To pre-train and then fine-tune T5, please follow the Colab notebooks provided:
We generate results with different beam sizes by converting the model to PyTorch. If you want to generate predictions using a beam size of 1, you can directly use the fine-tuning Colab notebook linked above: once the model is fine-tuned, you can generate custom predictions. To convert the model, use this Colab notebook, which also provides all the functionality to compute perfect predictions, almost-perfect predictions, CodeBLEU, and BLEU.
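As a reference point, "perfect predictions" are typically exact matches between prediction and target; the notebook linked above is the authoritative implementation, but a minimal sketch (assuming whitespace-insensitive matching, which is our own illustrative choice) looks like this:

```python
def normalize(code: str) -> str:
    # collapse all whitespace so formatting differences don't count
    return " ".join(code.split())

def perfect_prediction_rate(predictions, targets):
    """Fraction of predictions that exactly match their target after
    whitespace normalization (illustrative definition)."""
    assert len(predictions) == len(targets)
    hits = sum(normalize(p) == normalize(t)
               for p, t in zip(predictions, targets))
    return hits / len(targets)

preds   = ["int x = 1;", "return   a + b;", "foo()"]
targets = ["int x = 1;", "return a + b;",   "bar()"]
rate = perfect_prediction_rate(preds, targets)  # 2 of 3 match
```

BLEU and CodeBLEU, by contrast, credit partial overlaps between prediction and target, which is why we report them alongside the stricter exact-match metric.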
Here you can see our results.