Source code plagiarism detection that combines machine learning with attribute-counting and structure-based methods to analyse files accurately. It utilises the Rabin–Karp algorithm and abstract syntax trees (ASTs) for improved performance.
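As a rough illustration of the two ideas named above, here is a minimal sketch in Python. It is not the code used in this repository: the function names (`kgram_fingerprints`, `jaccard`, `node_type_counts`), the window size `k`, and the hash constants are all assumptions chosen for the example. The first function fingerprints every k-character window with a Rabin–Karp rolling hash, and the last one does simple attribute counting over the node types of a Python AST.

```python
import ast
from collections import Counter

# Hypothetical constants for the polynomial rolling hash (not taken from the project).
BASE = 257
MOD = (1 << 61) - 1


def kgram_fingerprints(text: str, k: int = 5) -> set:
    """Hash every k-character window of `text` with a Rabin-Karp rolling hash."""
    if k <= 0 or len(text) < k:
        return set()
    high = pow(BASE, k - 1, MOD)  # weight of the character that leaves the window
    h = 0
    for ch in text[:k]:
        h = (h * BASE + ord(ch)) % MOD
    fingerprints = {h}
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % MOD  # drop the oldest character
        h = (h * BASE + ord(text[i])) % MOD      # append the newest character
        fingerprints.add(h)
    return fingerprints


def jaccard(a: set, b: set) -> float:
    """Share of fingerprints that two files have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def node_type_counts(source: str) -> Counter:
    """Count AST node types; renaming identifiers does not change the result."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
```

Comparing `jaccard(kgram_fingerprints(a), kgram_fingerprints(b))` gives a text-level similarity score, while comparing `node_type_counts` results is insensitive to renamed variables. Either value could plausibly serve as a feature for a machine learning model, though how this project actually combines them is not shown here.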
- The data used to train this model comes from the PAN 2014 dataset. The dataset is not included in this repository, but details about it can be found here
- The actual data can be found here
- You may have to request access to the data from the PAN organisers
- If you're not able to obtain the data from there, please contact me and I'll share the data I have with you
- The entry point is gui.py
- To start the application, run these commands:
# Run these from the base directory, where the poetry.lock file is
poetry install
poetry shell
python scp/gui.py
- Note that if you're running this over SSH or WSL, you will need extra steps to set up the GUI so that it displays properly