This repository has been archived by the owner on Nov 1, 2022. It is now read-only.


spamfilter


unmaintained: Please go to spammy for the latest development version, which is pip-installable


spamfilter is our machine learning project, in which we build a custom Naive Bayes classifier to classify email as ham or spam.

Trained on close to 33,000 emails

Feature sets

  • CAPSLOCK usage
  • Attachments
  • Numbers
  • Links
  • Words in the text
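The feature sets above can be sketched as a toy extractor. This is an illustrative example, not the repository's actual code; the function name and feature keys here are made up:

```python
import re

def extract_features(text):
    """Toy feature extractor mirroring the feature sets listed above.

    Illustrative sketch only; the repository's real implementation
    may extract these features differently."""
    words = re.findall(r"[A-Za-z']+", text)
    features = {
        # CAPSLOCK: fraction of words written entirely in capitals
        "capslock_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        # numbers: does the message contain any digit runs?
        "has_numbers": bool(re.search(r"\d+", text)),
        # links: does the message contain a URL?
        "has_link": bool(re.search(r"https?://", text)),
    }
    # words in text: a presence feature per lowercased word
    for w in words:
        features["contains({})".format(w.lower())] = True
    return features
```

For example, `extract_features("WIN money now at http://spam.example 100")` flags the capitalised word, the number, and the link.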

You can use the pickled classifier objects to classify mail as spam or ham. (Refer to the demo and the API usage guide for details.)
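A minimal sketch of that pickle round-trip, using a stand-in class (the real pickle holds the trained Naive Bayes classifier; `ToyClassifier` and the file path below are assumptions for illustration only):

```python
import os
import pickle
import tempfile

class ToyClassifier:
    """Stand-in for the trained classifier stored in the pickle.

    The real object exposes classify(features); this toy mimics
    that interface with a trivial rule."""
    def classify(self, features):
        return "spam" if features.get("has_link") else "ham"

# Persist the classifier object to disk, then reload it,
# the same general pattern as loading a saved classifier.
path = os.path.join(tempfile.mkdtemp(), "spam_classifier.pickle")
with open(path, "wb") as f:
    pickle.dump(ToyClassifier(), f)

with open(path, "rb") as f:
    classifier = pickle.load(f)

print(classifier.classify({"has_link": True}))  # spam
```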


Development

⬆️ Back to top

Installing the dependencies

I prefer using virtualenvs to keep the global Python interpreter clutter-free, but you are free to install the dependencies system-wide.

$ git clone https://github.com/prodicus/spamfilter/ && cd spamfilter
$ pip install -r requirements.txt

Downloading the NLTK corpora

>>> import nltk
>>> nltk.download('stopwords')

Check whether you have everything set up

>>> from termcolor import colored
>>> import bs4
>>> from nltk.corpus import stopwords
>>> from nltk import stem
>>>

If the above imports work without giving you an error, you are good to go!

Running the classifier

⬆️ Back to top

After installing the dependencies, make sure that you have make installed on your system.

Loading the saved classifier

A classifier object trained on the full_corpus dataset (close to 33,000 emails) can be loaded and used for classification.

$ make pickle_run

Sit back and watch!

Manually running and training the classifier

$ make run

This will ask you which dataset to train the classifier on and, once it is trained, which dataset to test it on.

NOTE: If you don't have make installed, run:

  • $ python test.py instead of $ make run
  • $ python test_classifier_pickle.py instead of $ make pickle_run

API usage

⬆️ Back to top

Custom NB classifier API

Refer API usage for the custom classifier (wiki) for implementation details

Textblob API

Refer API usage for the textblob classifier (wiki) for implementation details

Training the classifier on your own dataset

⬆️ Back to top

You can train the classifier on your own dataset!

Step 1

Put your dataset folder (e.g. data_foo) inside the data folder

$ tree data/data_foo/ -L 1
data/data_foo/
├── ham
└── spam
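Loading such a ham/spam folder layout could look like the following (an illustrative sketch; `load_dataset` is a hypothetical helper, not the repository's actual loader):

```python
import os

def load_dataset(root):
    """Collect (text, label) pairs from a root/{ham,spam} layout.

    Hypothetical helper for illustration; the repository's training
    code may read the corpus differently."""
    samples = []
    for label in ("ham", "spam"):
        folder = os.path.join(root, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), errors="ignore") as f:
                samples.append((f.read(), label))
    return samples
```

Each email file under ham/ or spam/ becomes one labeled sample, ready to feed to a feature extractor and classifier.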

Step 2

Step 3

$ make pickle

boom

FAQ

⬆️ Back to top

Accuracy of the classifier

Apparently I ran it one too many times, and the accuracy generally falls in these ranges:

Class   Accuracy
Spam    80 to 94%
Ham     70 to 80%
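Per-class accuracies like those above can be computed with a small helper (hypothetical, not from the repository):

```python
def per_class_accuracy(pairs):
    """Compute accuracy separately per gold label (e.g. spam vs. ham).

    pairs: iterable of (gold_label, predicted_label) tuples.
    Illustrative helper for this example only."""
    totals, hits = {}, {}
    for gold, pred in pairs:
        totals[gold] = totals.get(gold, 0) + 1
        hits[gold] = hits.get(gold, 0) + (gold == pred)
    return {label: hits[label] / totals[label] for label in totals}

# One spam email misclassified as ham, everything else correct
results = [("spam", "spam"), ("spam", "ham"), ("ham", "ham"), ("ham", "ham")]
print(per_class_accuracy(results))  # {'spam': 0.5, 'ham': 1.0}
```

Scoring spam and ham separately, as the table does, surfaces class imbalance that a single overall accuracy number would hide.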

Watch the classifier in action here

Regarding the dataset

The dataset used is the Enron dataset.

We trained our spam_classifier.pickle classifier object on the full_corpus dataset and then cross-validated the pickled classifier against the datasets present in the data directory

Read more about the directory structure here

To the contributors

⬆️ Back to top

Refer to CONTRIBUTING.md for more details

Ideas

⬆️ Back to top

  • Deploy a full-blown app to Heroku
  • Build a voting system that takes the best out of all the classifiers (the aim is to increase accuracy)
  • Try out textblob and see how it performs against our classifier
  • Decide whether to use clint or termcolor (resolved: using colorama, as explained in commit 89da4cd)
  • Try implementing some of the algorithms using scikit-learn

Legal Stuff

⬆️ Back to top

Open sourced under GPLv3

spamfilter
Copyright (C) 2016  Tasdik Rahman(prodicus@outlook.com)

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

You can find a copy of the LICENSE file HERE