A Plagiarism Detection System on Hybrid Cloud

Overview

This project is a model for a plagiarism detection system whose functions are distributed over a combination of public and private clouds, known as a hybrid cloud. The user interface is built on a Platform as a Service (PaaS), Google App Engine (GAE), with Google Datastore as the backend. The plagiarism detection logic is distributed over a private Infrastructure as a Service (IaaS), OpenStack, and a public IaaS, Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) as intermediate storage. The dataset is a subset of files from the PAN 2011 corpus, which provides source and suspicious text documents for plagiarism detection research.

The approach uses Amazon EC2 infrastructure to produce the source document index, and OpenStack infrastructure to match a suspicious document against that index to detect text similarity. Each document is split into groups of words of the given windowsize that overlap by the given overlapsize, written one group per line, and similarity is detected by comparing these lines. A web interface, developed with Python and Jinja2 on Google App Engine, takes user input for windowsize, overlapsize and the number of instances to use for plagiarism detection.
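The windowing step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `word_windows` is hypothetical, and it assumes that overlapsize is the number of words shared by two consecutive windows.

```python
def word_windows(text, windowsize, overlapsize):
    """Split text into overlapping word windows, one string per window.

    Assumption: consecutive windows share `overlapsize` words, so the
    window start advances by (windowsize - overlapsize) words each time.
    """
    if windowsize < 1:
        raise ValueError("windowsize must be at least 1")
    if not 0 <= overlapsize < windowsize:
        raise ValueError("overlapsize must satisfy 0 <= overlapsize < windowsize")

    words = text.split()
    step = windowsize - overlapsize
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + windowsize]))
        # Stop once a window has reached the end of the document.
        if start + windowsize >= len(words):
            break
    return windows


if __name__ == "__main__":
    # Each window would be written on its own line of the output file.
    for line in word_windows("a b c d e f g", windowsize=3, overlapsize=1):
        print(line)
```

A larger overlapsize produces more windows per document, which makes matching less sensitive to where a copied passage starts, at the cost of a larger index.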

Contents

FilesOnEC2 contains the Python and bash scripts that went into the AWS EC2 AMI:
- createindex.py: the main script, which creates the source document index
- createsourcefolders.py: divides the source documents into folders based on the number of instances
- runthisfile.sh: reads the AMI launch index and instance number from the instance metadata and saves them to text files for the other Python scripts to use
- sourcefiles.py: creates a text file for each source document, with word groups of the given windowsize and overlapsize on separate lines
- sourceindex.py: for each file created by sourcefiles.py, hashes each line and adds the hash, with its location, to an index file
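The EC2 steps above can be sketched roughly as below. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, MD5 stands in for whatever hash the scripts use, round-robin assignment stands in for the actual folder split, and lowercasing and whitespace normalization before hashing are added assumptions.

```python
import hashlib
from collections import defaultdict


def partition_files(filenames, num_instances):
    """Assign source files to instances round-robin (assumed strategy)."""
    buckets = [[] for _ in range(num_instances)]
    for i, name in enumerate(sorted(filenames)):
        buckets[i % num_instances].append(name)
    return buckets


def line_hash(line):
    """Hash one window line; the normalization is an assumption."""
    normalized = " ".join(line.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_source_index(docs):
    """Map each line hash to the (document, line number) pairs where it occurs.

    `docs` maps a document name to its list of window lines.
    """
    index = defaultdict(list)
    for name, lines in docs.items():
        for line_no, line in enumerate(lines):
            index[line_hash(line)].append((name, line_no))
    return dict(index)


if __name__ == "__main__":
    print(partition_files(["src3.txt", "src1.txt", "src2.txt"], 2))
    index = build_source_index({"src1.txt": ["the quick brown", "brown fox jumps"]})
    print(index[line_hash("the quick brown")])
```

Under this sketch, each instance would process only the bucket matching its own launch index, so the index is built in parallel across instances.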

FilesOnGAE contains the files deployed on GAE to build the web app for the plagiarism detection system:
- app.yaml: specifies the code version, the location of index.py, the static files, the libraries to use, and so on
- boto: library files required to connect to S3
- index.py: the main file, which has the handlers for all HTML pages
- runec2instance.py: code to start the required EC2 instances
- static: the CSS file used to style all pages
- templates: all HTML pages used in this application
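As a rough sketch of how the web app might validate the user input and start EC2 instances, consider the following. The form field names, function names, AMI ID and default instance type are hypothetical. The `conn` argument is assumed to be a boto 2 EC2 connection (for example from `boto.ec2.connect_to_region`), whose `run_instances` method accepts `min_count`, `max_count` and `instance_type`; it is passed in so the sketch can run without AWS credentials.

```python
def parse_job_form(form):
    """Validate the three user inputs; the field names are hypothetical."""
    try:
        windowsize = int(form["windowsize"])
        overlapsize = int(form["overlapsize"])
        instances = int(form["instances"])
    except (KeyError, ValueError):
        raise ValueError("windowsize, overlapsize and instances must be integers")
    if windowsize < 1 or not 0 <= overlapsize < windowsize or instances < 1:
        raise ValueError(
            "expected windowsize >= 1, 0 <= overlapsize < windowsize, instances >= 1"
        )
    return {
        "windowsize": windowsize,
        "overlapsize": overlapsize,
        "instances": instances,
    }


def launch_workers(conn, ami_id, job, instance_type="t1.micro"):
    """Start one EC2 instance per requested worker and return their IDs.

    `conn` is assumed to behave like a boto 2 EC2Connection.
    """
    reservation = conn.run_instances(
        ami_id,
        min_count=job["instances"],
        max_count=job["instances"],
        instance_type=instance_type,
    )
    return [instance.id for instance in reservation.instances]


if __name__ == "__main__":
    job = parse_job_form({"windowsize": "5", "overlapsize": "2", "instances": "3"})
    print(job)
```

Validating overlapsize against windowsize before launching avoids paying for instances that would fail on an unusable configuration.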

FilesOnOpenStack contains the Python and bash scripts that went into the OpenStack image:
- dowork.py: the main script, which fetches work from GAE and calls the methods needed to compute matches
- every-300-seconds.sh: runs continuously in the background and executes dowork.py every 300 seconds
- fetchS3files.py: fetches the source index and suspicious files from the S3 bucket to work on
- matches_openstack.py: computes the matches between the source index and the suspicious index
- suspiciousindex.py: creates an index of the suspicious file to match against the source indexes
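The matching step on the OpenStack side can be sketched as a hash lookup. This is a minimal illustration with hypothetical function names: it assumes the source index maps a line hash to a list of (document, line number) locations, and that both sides hash lines the same way (MD5 over lowercased, whitespace-normalized text is an assumption here).

```python
import hashlib


def line_hash(line):
    """Hash one window line; must match the hashing used for the source index."""
    normalized = " ".join(line.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_matches(source_index, suspicious_lines):
    """Return (suspicious line number, source document, source line number) triples.

    A suspicious window counts as a match when its hash appears in the
    source index; no fuzzy comparison is attempted in this sketch.
    """
    matches = []
    for line_no, line in enumerate(suspicious_lines):
        for src_doc, src_line in source_index.get(line_hash(line), []):
            matches.append((line_no, src_doc, src_line))
    return matches


if __name__ == "__main__":
    source_index = {line_hash("the quick brown"): [("src1.txt", 0)]}
    suspicious = ["a lazy dog", "The quick brown"]
    print(find_matches(source_index, suspicious))
```

Because only hashes are compared, each suspicious window costs one dictionary lookup regardless of how many source documents were indexed.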
