sappia/plagiarism_detection

A Plagiarism Detection System on Hybrid Cloud

Overview

This project is a model of a plagiarism detection system whose functions are distributed over a combination of public and private clouds, that is, a hybrid cloud. The user interface is built on a Platform as a Service (PaaS), Google App Engine (GAE), with Google Datastore as the backend. The plagiarism detection logic is split between a private Infrastructure as a Service (IaaS), OpenStack, and a public IaaS, Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) used as intermediate storage. The dataset is a subset of files from the PAN 2011 corpus, which provides source and suspicious text documents for plagiarism detection research.

Amazon EC2 instances produce the source document index, and OpenStack instances match a suspicious document against that index to detect text similarity. Each document is split into groups of words according to the supplied windowsize and overlapsize, written one group per line, and the lines are hashed so that the suspicious document's index can be compared with the source index. A web interface, developed with Python and jinja2 on Google App Engine, takes the windowsize, overlapsize and number of instances as user input for the plagiarism detection run.
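The windowing step can be illustrated with a minimal sketch. It assumes that windowsize is the number of words per line and that overlapsize is the number of words shared by consecutive lines; the repository's scripts may interpret these parameters differently. The function name `word_windows` is hypothetical and does not come from the project.

```python
def word_windows(text, windowsize, overlapsize):
    """Return a list of lines, each holding up to `windowsize` words.

    Assumption: consecutive lines share `overlapsize` words.
    """
    if windowsize <= 0:
        raise ValueError("windowsize must be positive")
    if not 0 <= overlapsize < windowsize:
        raise ValueError("overlapsize must satisfy 0 <= overlapsize < windowsize")

    words = text.split()
    step = windowsize - overlapsize  # how far the window advances each time
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + windowsize]))
        # Stop once a window has reached the end of the text.
        if start + windowsize >= len(words):
            break
    return windows


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    # One window per line, as the windowed text files are described.
    print("\n".join(word_windows(sample, windowsize=4, overlapsize=2)))
```

A larger overlapsize produces more lines per document, which makes matching less sensitive to where a copied passage starts, at the cost of a bigger index.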

Contents

This project consists of 3 folders:

  • FilesOnEC2: the Python and bash scripts that went into the AWS EC2 AMI
    • createindex.py: the main script, which creates the source document index
    • createsourcefolders.py: divides the source documents into folders based on the number of instances
    • runthisfile.sh: gets the AMI launch index ID and instance number from the instance metadata and saves them in text files for the other Python scripts to use
    • sourcefiles.py: creates a text file for each source document, with words grouped by the given windowsize and overlapsize on separate lines
    • sourceindex.py: for each file created by sourcefiles.py, hashes each line and adds the hash to an index file together with its location
  • FilesOnGAE: the files deployed on GAE to build the web app for the plagiarism detection system
    • app.yaml: specifies the code version, the location of index.py, the static files, the libraries to use, and so on
    • boto: library files required to connect to S3
    • index.py: the main file, with handlers for all HTML pages
    • runec2instance.py: code to start the required EC2 instances
    • static: the CSS file used to style all pages
    • templates: all HTML pages used in this application
  • FilesOnOpenstack: the Python and bash scripts that went into the OpenStack image
    • dowork.py: the main script, which fetches work from GAE and calls the methods needed to compute matches
    • every-300-seconds.sh: runs continuously in the background and invokes dowork.py every 300 seconds
    • fetchS3files.py: fetches the source index and suspicious files from the S3 bucket
    • matches_openstack.py: computes the matches between the source index and the suspicious index
    • suspiciousindex.py: creates an index of the suspicious file so it can be matched against the source indexes
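
The three scripted stages listed above (dividing source documents among instances, hashing windowed lines into an index, and matching a suspicious document against that index) can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, the round-robin split, the SHA-256 hash and the lower-casing of lines are choices made for the example, and the index is kept in memory rather than written to files or S3.

```python
import hashlib
from collections import defaultdict


def partition_documents(doc_names, num_instances):
    """Split document names into one group per instance (round-robin, an assumed scheme)."""
    groups = [[] for _ in range(num_instances)]
    for i, name in enumerate(sorted(doc_names)):
        groups[i % num_instances].append(name)
    return groups


def line_hash(line):
    """Hash a lower-cased, whitespace-normalised line (SHA-256 is an illustrative choice)."""
    normalised = " ".join(line.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def build_index(windowed_docs):
    """Map each line hash to the (document name, line number) places where it occurs.

    `windowed_docs` maps a document name to its list of windowed lines.
    """
    index = defaultdict(list)
    for name, lines in windowed_docs.items():
        for line_no, line in enumerate(lines):
            index[line_hash(line)].append((name, line_no))
    return index


def find_matches(source_index, suspicious_lines):
    """Return (suspicious line number, source document, source line number) triples."""
    matches = []
    for line_no, line in enumerate(suspicious_lines):
        for src_name, src_line_no in source_index.get(line_hash(line), []):
            matches.append((line_no, src_name, src_line_no))
    return matches


if __name__ == "__main__":
    sources = {
        "source1.txt": ["the quick brown fox", "brown fox jumps over"],
        "source2.txt": ["lorem ipsum dolor sit"],
    }
    index = build_index(sources)
    suspicious = ["some unrelated words here", "Brown fox jumps over"]
    print(find_matches(index, suspicious))
```

Because only exact line hashes are compared, a match requires a whole window of words to be identical after normalisation; smaller windowsize values therefore catch shorter copied fragments but also raise the chance of coincidental matches.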
