ICT1002 Python Project. A web application built in pure Python, using the Dash framework, that allows the user to generate meaningful visualizations of email records, and predict whether a given email is spam.

zeyu2001/ICT1002-Python

Spam or Ham

About

Spam, also known as junk email, refers to unwanted or unsolicited messages sent in bulk to users’ accounts. Such emails clog up inboxes, take up disk space, and generally degrade the experience of their recipients. Most major email service providers implement some form of spam filter that automatically diverts spam to a junk folder, keeping it from affecting their users.

Using a spam classification dataset from Kaggle, we developed a data visualization and machine learning solution. The end product is a web application built in pure Python with the Dash framework. It lets the user generate meaningful visualizations of email records and predict whether a given email is spam. The spam classifier, a bi-directional LSTM network, achieved 99% accuracy on test data.

Usage

Running in a Development Environment

Install requirements first: pip install -r requirements.txt

To run the Dash server:

$ python3 runserver.py
Dash is running on http://127.0.0.1:8050/

 * Serving Flask app "app" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:8050/ (Press CTRL+C to quit)

After running runserver.py, navigate to http://127.0.0.1:8050/ (localhost, port 8050) to view the application.

Debug Mode

By default, debug mode is turned off. To run the Dash server with debug mode turned on, use the --d or --debug option.

$ python3 runserver.py --d
Dash is running on http://127.0.0.1:8050/

 * Serving Flask app "app" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: on

Running in debug mode turns on Dash DevTools, giving you access to tools such as callback graphs, hot reloading, and in-app error reporting.

Use this for debugging and development purposes.
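The flag handling described above could be sketched as follows. This is an illustration only, not the contents of runserver.py: the function names `parse_args` and `main`, and the import path `dash_app.app`, are assumptions made for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; --d and --debug both set the same flag."""
    parser = argparse.ArgumentParser(description="Run the Dash development server.")
    parser.add_argument(
        "--d", "--debug",
        dest="debug",
        action="store_true",
        help="turn on debug mode (Dash DevTools, hot reloading)",
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Start the server, passing the debug flag through to Dash."""
    args = parse_args(argv)
    # Assumed import path; imported lazily so parse_args can be used without Dash.
    from dash_app.app import app
    # run_server is the entry point in the Dash releases current when this
    # project was written; newer releases may expose it as app.run instead.
    app.run_server(debug=args.debug)
```

With this layout, `parse_args([])` leaves `debug` set to `False`, and `parse_args(["--d"])` or `parse_args(["--debug"])` sets it to `True`.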

Deploying to a Production Environment

In a development environment, the Dash app can be easily accessed by running runserver.py and navigating to localhost. In a production environment, the Dash app must be deployed to a server.

Dash is built on top of Flask, so a Dash app is deployed in the same way as a Flask app. Refer to Flask's Deployment Guide for more details.

Note the following:

While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well.

A WSGI server should be used instead. Simple-to-use, affordable solutions include PythonAnywhere and Heroku.
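To make the WSGI point concrete, the sketch below shows the kind of callable a WSGI server imports and invokes for every request. It uses only the standard library and a placeholder response; in a Dash project the equivalent object is the Flask instance that Dash exposes as `app.server`. The names `application` and `call_app` are hypothetical.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Minimal WSGI callable, standing in for the Flask instance behind a Dash app."""
    body = b"Spam or Ham placeholder response\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI callable once, the way a WSGI server would for one request."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

For the real application, one possible arrangement (an assumption, not something this repository is known to contain) is a small module such as wsgi.py that defines `server = app.server`, which a WSGI server like Gunicorn can then be pointed at using its `module:callable` syntax, for example `gunicorn wsgi:server`.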

Project Structure

.
├── LICENSE
├── README.md
├── classifier
│   ├── data
│   │   ├── x_test.npy
│   │   ├── x_train.npy
│   │   ├── y_test.npy
│   │   └── y_train.npy
│   ├── emails.csv
│   ├── exec.py
│   ├── metrics
│   │   └── 20200925163152_plot.png
│   ├── models
│   │   └── 20200925163152_spam_classifier.h5
│   ├── predict_input.py
│   └── process_data.py
├── dash_app
│   ├── app.py
│   ├── assets
│   │   └── main.css
│   ├── bm_alg.py
│   ├── callbacks.py
│   ├── data.py
│   ├── emails.csv
│   ├── index.py
│   ├── predict.py
│   ├── routes.py
│   ├── stats.py
│   └── temp
│       └── app_files
│           ├── output.csv
│           └── stats_output.html
├── requirements.txt
└── runserver.py

Classifier

  • process_data.py: Processes the dataset. It removes irrelevant content from the email text (punctuation, stop words, hyperlinks, etc.) and represents the result as a feature matrix from which the model can learn the relationship between each token sequence and its label.
  • exec.py: Trains and saves the classifier model.
  • predict_input.py: Integration with the Dash Web GUI. Given a user input, predict whether the email is spam.
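The cleaning and encoding steps described for process_data.py can be illustrated with a minimal standard-library sketch. It is not the project's implementation: the stop-word list, the vocabulary, the reserved ids (0 for padding, 1 for unknown tokens), and the choice to pad at the end are all assumptions made for the example.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "for", "at", "your"}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, strip hyperlinks and punctuation, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove hyperlinks before punctuation
    text = text.translate(PUNCT_TABLE)  # remove ASCII punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def encode(tokens, vocab, max_len):
    """Map tokens to integer ids and pad/truncate to a fixed length.

    Id 0 is reserved for padding and id 1 for out-of-vocabulary tokens.
    """
    ids = [vocab.get(tok, 1) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


tokens = clean_text("Claim your FREE prize at http://example.com/win now!!!")
vocab = {"claim": 2, "free": 3, "prize": 4}  # hypothetical vocabulary
row = encode(tokens, vocab, max_len=6)
# tokens == ['claim', 'free', 'prize', 'now']
# row    == [2, 3, 4, 1, 0, 0]
```

Stacking one such row per email gives a fixed-width integer matrix of the kind a sequence model such as an LSTM can consume.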

Dash App

  • app.py: Defines the Dash application.
  • runserver.py: Runs the application defined above. Integrates all routes and callbacks for the Dash application.
  • routes.py: Specifies the routes (URLs) of the application. The application is multi-paged, but the browser does not need to refresh. The content is dynamically updated here. Also defines the functions and request handlers to serve local files, allowing the user to download exported results.
  • index.py: Layout for '/' (homepage)
  • stats.py: Layout for '/stats'
  • predict.py: Layout for '/predict'
  • callbacks.py: Defines callback functions for the Dash app. This is how the application is able to dynamically update its content (tables, graphs, etc.) based on the user input (search bar, dropdown, etc.).
  • data.py: Extracts data from the dataset and exports it, using pandas.
  • bm_alg.py: The search algorithm. We use the Boyer-Moore algorithm. The precomputation time complexity is O(m+k), where m is the length of the search pattern and k is the size of the alphabet. The time complexity of the searching phase is O(n), where n is the length of the text being searched.
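As an illustration of the search technique named above, here is a simplified Boyer-Moore sketch that uses only the bad-character rule. It is not the code in bm_alg.py, and the function names are hypothetical. Because it stores the shift table in a dict rather than an alphabet-sized array, its preprocessing is O(m); and because it omits the good-suffix rule, its worst-case search time is O(mn) rather than linear.

```python
def build_bad_char_table(pattern):
    """Map each character to the index of its last occurrence in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}


def bm_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = build_bad_char_table(pattern)
    matches = []
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        # Compare right to left, as Boyer-Moore does.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)
            s += 1  # conservative shift so overlapping matches are found
        else:
            # Align the mismatched text character with its last occurrence
            # in the pattern, always moving forward by at least one position.
            s += max(1, j - last.get(text[s + j], -1))
    return matches


print(bm_search("spam and ham and spam", "spam"))  # [0, 17]
```

The right-to-left comparison is what lets the pattern jump several positions after a mismatch, which is why the algorithm often inspects fewer than n characters on typical text.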

Contributors

  • Zhang Zeyu
  • Jared Marc Song Kye-Jet
  • Lee Zhan Hong
  • Ivan Ng Say Mun
  • Bill Eng De Xian
  • Nicholas Ooi Jun Wei

License

Use of this project is governed by the MIT License.

Plagiarism

This project is an assignment submission in partial fulfillment of the Singapore Institute of Technology (SIT) module ICT1002 Programming Fundamentals.

The University's policy on copying does not allow students to copy software as well as assessment solutions from another person.
