GitHub - tasdikrahman/spamfilter: DEPRECATED: Go to https://github.com/prodicus/spammy for DEV version

spamfilter

Unmaintained: please go to spammy for the latest development version, which is pip-installable.

spamfilter is our Machine Learning project, where we build a custom Naive Bayes classifier to classify email into ham or spam.
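To give a feel for the technique, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the code in classifier.py: the TinyNaiveBayes class, its train and classify methods, and the toy messages are all made up for illustration, and it only looks at words, not at the other features.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Illustrative multinomial Naive Bayes over word tokens (hypothetical class)."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training documents
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.vocab = set()                       # every word seen during training

    def train(self, text, label):
        words = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Start from the log prior P(label)
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the product
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = TinyNaiveBayes()
nb.train("win free money now", "spam")
nb.train("free prize claim now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")

print(nb.classify("free money"))      # spam
print(nb.classify("monday meeting"))  # ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.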

Trained on close to 33,000 training emails

Feature sets

CAPSLOCK
attachments
numbers
Links
Words in text
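The feature sets above could be pulled out of a message roughly like this. This is a sketch under assumptions, not the extraction code used in this repository: the extract_features function, its has_attachment parameter, and the feature names in the returned dict are hypothetical.

```python
import re

LINK_PATTERN = re.compile(r"https?://\S+")


def extract_features(text, has_attachment=False):
    """Return a dict of simple spam-related features (hypothetical feature names)."""
    link_count = len(LINK_PATTERN.findall(text))
    # Drop links first so their pieces are not counted as words or numbers
    body = LINK_PATTERN.sub(" ", text)
    words = re.findall(r"[A-Za-z']+", body)
    caps_words = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "caps_ratio": len(caps_words) / len(words) if words else 0.0,
        "has_attachment": bool(has_attachment),
        "number_count": len(re.findall(r"\d+", body)),
        "link_count": link_count,
        "words": sorted({w.lower() for w in words}),
    }


features = extract_features("WIN 1000 dollars NOW at http://example.com")
print(features)
```

Whether a mail has an attachment is passed in as a flag here because detecting it depends on how the mail is parsed.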

You can use the pickled classifier objects to classify mail into spam or ham. (Refer to the DEMO and API usage guide for details.)

Index

Development
Running the classifier
- Loading the saved classifier
- Manually running and training the classifier
API usage
- Custom NB classifier API
- Textblob API
Training classifier on your own dataset
FAQ
- Accuracy of the classifier
- Regarding the dataset
To the contributors
- Ideas
Legal stuff

Development


Installing the dependencies

I prefer to use virtualenvs to keep the global Python interpreter clutter-free, but you are free to do a system-wide install of the dependencies.

$ git clone https://github.com/prodicus/spamfilter/ && cd spamfilter
$ pip install -r requirements.txt

Downloading the NLTK corpora

>>> import nltk
>>> nltk.download('stopwords')

Check whether you have everything set up

>>> from termcolor import colored
>>> import bs4
>>> from nltk.corpus import stopwords
>>> from nltk import stem
>>>

If the above imports work without giving you an error, you are good to go!
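If you would rather run the check as a script, something along these lines reports which packages are missing instead of stopping at the first failed import. The missing_modules helper is a made-up name. The module list is taken from the imports above. Note that it only checks that the packages can be found, not that the stopwords corpus has been downloaded.

```python
import importlib.util

# Top-level modules used by the imports in the check above
REQUIRED_MODULES = ["termcolor", "bs4", "nltk"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```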

Running the classifier


After installing the dependencies, make sure that you have make installed on your system.

Loading the saved classifier

A classifier object trained on the full_corpus dataset (close to 33,000 emails) can be loaded and used for classifying.

$ make pickle_run

Sit back and watch!
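Loading a saved classifier by hand boils down to unpickling it. A minimal sketch, assuming the classifier was written with Python's pickle module: the load_classifier helper is a made-up name, and the demo uses a stand-in dict rather than a real classifier. For a real pickle you would pass its path (for example a file under saved_classifiers), and the class it was created from must be importable at load time.

```python
import pickle
import tempfile
from pathlib import Path


def load_classifier(pickle_path):
    """Unpickle and return a saved classifier object (hypothetical helper)."""
    with Path(pickle_path).open("rb") as handle:
        return pickle.load(handle)


# Round-trip demo with a stand-in object, so the snippet runs without the repo's data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_classifier.pickle"
    demo_path.write_bytes(pickle.dumps({"kind": "stand-in classifier"}))
    loaded = load_classifier(demo_path)

print(loaded)
```

Only unpickle files you trust, since unpickling can execute arbitrary code.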

Manually running and training the classifier

$ make run

This asks you which dataset to train the classifier on and, once training is done, which dataset to test it on.

NOTE: If you do not have make installed, run the scripts directly:

$ python test.py                      # equivalent to make run
$ python test_classifier_pickle.py    # equivalent to make pickle_run

API usage


Custom NB classifier API

Refer to the API usage for the custom classifier (wiki) for implementation details.

Textblob API

Refer to the API usage for the textblob classifier (wiki) for implementation details.

Training classifier on your own dataset

You can train the classifier on your own dataset!

Step 1

Put your dataset folder (eg: data_foo) inside the data folder

$ tree data/data_foo/ -L 1
data/data_foo/
├── ham
└── spam
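A folder laid out like this can be read into labelled samples with a few lines of Python. This is a sketch, not the loader used by train.py: load_dataset and its limit parameter are hypothetical names, and the demo builds a throwaway folder with three tiny files so that it runs on its own.

```python
import tempfile
from pathlib import Path


def load_dataset(dataset_dir, limit=None):
    """Return (text, label) pairs from <dataset_dir>/ham and <dataset_dir>/spam."""
    samples = []
    for label in ("ham", "spam"):
        files = sorted(p for p in (Path(dataset_dir) / label).iterdir() if p.is_file())
        for path in files[:limit]:  # limit=None keeps every file
            # Real mail often has mixed encodings, so undecodable bytes are skipped
            text = path.read_text(encoding="utf-8", errors="ignore")
            samples.append((text, label))
    return samples


# Demo on a temporary folder that mimics the layout shown above
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data_foo"
    for label, name, body in [("ham", "a.txt", "see you monday"),
                              ("spam", "b.txt", "win money now"),
                              ("spam", "c.txt", "free prize")]:
        (root / label).mkdir(parents=True, exist_ok=True)
        (root / label / name).write_text(body, encoding="utf-8")
    everything = load_dataset(root)
    one_per_class = load_dataset(root, limit=1)

print(everything)
```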

Step 2

Specify the folder name of your newly added dataset and the name of the pickle file to be created in create_pickle.py.
Choose the number of files to train the classifier against, also in create_pickle.py.

Step 3

$ make pickle
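Conceptually, this step trains a classifier and writes it to disk. The saving half can be sketched as below, assuming the standard pickle module is used: save_classifier is a hypothetical helper (not a function from create_pickle.py), and a word Counter stands in for the trained classifier.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def save_classifier(classifier, pickle_path):
    """Pickle a trained object to pickle_path, creating parent folders as needed."""
    path = Path(pickle_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(classifier, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


# A Counter stands in for a trained classifier in this demo
stand_in = Counter("free money free prize".split())
with tempfile.TemporaryDirectory() as tmp:
    saved = save_classifier(stand_in, Path(tmp) / "saved_classifiers" / "demo.pickle")
    restored = pickle.loads(saved.read_bytes())

print(restored == stand_in)
```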

FAQ


Accuracy of the classifier

I ran it many times, and the accuracy generally falls within these ranges:

Class	Accuracy
Spam	80 to 94%
Ham	70 to 80%
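Per-class figures like these can be computed as the share of each class's emails that were labelled correctly. The sketch below assumes that definition. The per_class_accuracy helper and the label lists are made up for illustration and are not results from this project.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: percentage of that label's items predicted correctly}."""
    totals, correct = Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: 100.0 * correct[label] / totals[label] for label in totals}


# Toy labels, not real results
truth = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
guess = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
print(per_class_accuracy(truth, guess))  # {'spam': 75.0, 'ham': 75.0}
```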

Watch the classifier in action here

Regarding the dataset

The dataset used is the Enron dataset.

We trained our spam_classifier.pickle classifier object against the full_corpus dataset and then cross-validated the pickled classifier with any of the datasets present in the data directory.

Read more about the directory structure here

To the contributors


Refer to CONTRIBUTING.md for more details.

Ideas


Deploying a full-blown app to Heroku
~~To make a voting system which will take the best out of all the classifiers (increasing the accuracy is the aim)~~
Try out textblob and see how it performs with our classifier
~~To decide on whether to use clint or termcolor~~ Using colorama as explained in commit 89da4cd
Try implementing some of the algorithms using scikit-learn

Legal Stuff


Open sourced under GPLv3

spamfilter
Copyright (C) 2016  Tasdik Rahman(prodicus@outlook.com)

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

You can find a copy of the LICENSE file HERE

