Unmaintained: please go to spammy for the latest development version, which is pip installable.
spamfilter is our Machine Learning project, where we build a custom Naive Bayes classifier to classify email into ham or spam.
Trained on close to 33,000 training emails
Feature sets
- CAPSLOCK
- attachments
- numbers
- Links
- Words in text
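To make the feature sets above concrete, here is a minimal sketch of how such features could be pulled out of an email body. The function name `extract_features`, the feature keys, and the regular expressions are assumptions for illustration only; the project's own extraction code may work quite differently.

```python
import re

# Hypothetical patterns: illustrative only, not taken from the project.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
WORD_PATTERN = re.compile(r"[A-Za-z']+")


def extract_features(text, attachment_count=0):
    """Map an email body to a dict of simple features."""
    words = WORD_PATTERN.findall(text)
    features = {
        # Fully upper-case words longer than one letter count as CAPSLOCK.
        "capslock_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        # The attachment count is passed in, since it comes from the
        # message headers rather than from the body text.
        "attachments": attachment_count,
        "numbers": len(re.findall(r"\d+", text)),
        "links": len(URL_PATTERN.findall(text)),
    }
    # Bag-of-words presence features for the lower-cased tokens.
    for w in words:
        features["word(%s)" % w.lower()] = True
    return features
```

A dict like this is the kind of input a Naive Bayes classifier can be trained on, with one dict per email paired with its ham or spam label.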
You can use the pickled classifier objects to classify mail into spam or ham. (Refer to the DEMO and API usage guide for details.)
- Development
- Running the classifier
- API usage
- Training classifier on your own dataset
- FAQ
- To the contributors
- Legal stuff
I prefer to use virtualenvs to keep the global Python interpreter clutter free, but you are free to do a system-wide install of the dependencies.
$ git clone https://github.com/prodicus/spamfilter/ && cd spamfilter
$ pip install -r requirements.txt
>>> import nltk
>>> nltk.download('stopwords')
>>> from termcolor import colored
>>> import bs4
>>> from nltk.corpus import stopwords
>>> from nltk import stem
>>>
If the above imports work without giving you an error, you are good to go!
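If you would rather check the dependencies from a script than from the interactive prompt, something like the following sketch could be used. The helper name `missing_modules` is made up for this example. It only checks that the top-level packages can be found; it does not verify that the nltk stopwords corpus has been downloaded.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level packages behind the imports shown above.
    required = ["nltk", "termcolor", "bs4"]
    missing = missing_modules(required)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All packages found, you are good to go!")
```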
After installing the dependencies, make sure that you have make installed on your system.
A trained classifier object, trained on the full_corpus dataset (close to 33,000 emails), can be loaded and used for classifying.
$ make pickle_run
Sit back and watch!
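For readers who want to load a pickled classifier from their own code, a minimal sketch using the standard library could look like this. The helper name `load_classifier` is an assumption, and so is any path you pass to it: check the repository for where the pickle files actually live. The methods available on the loaded object are described in the API usage guide and are not assumed here.

```python
import pickle


def load_classifier(path):
    """Load a pickled classifier object from `path`."""
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Note that unpickling only works when the class of the pickled object can be imported, so this would need to run from inside the project (or with the project on the Python path).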
$ make run
This asks you which dataset to train the classifier on and, once it is trained, which dataset to test it on.
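The train-on-one-dataset, test-on-another flow can be illustrated with a tiny stand-alone Naive Bayes classifier. This is a sketch under simplifying assumptions (whitespace tokenisation, word counts only, add-one smoothing), and the names `TinyNaiveBayes` and `accuracy` are invented for the example. It is not the project's classifier.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            # log P(label) + sum of log P(word | label), with smoothing.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(classifier, labelled_docs):
    """Fraction of (text, label) pairs the classifier labels correctly."""
    correct = sum(1 for text, label in labelled_docs
                  if classifier.classify(text) == label)
    return correct / len(labelled_docs)
```

Usage follows the same two steps: call `train` with (text, label) pairs from the training dataset, then call `accuracy` with pairs from the test dataset.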
NOTE: If you do not have make installed, run the scripts directly. Instead of $ make run, use
$ python test.py
and instead of $ make pickle_run, use
$ python test_classifier_pickle.py
Refer to API usage for the custom classifier (wiki) for implementation details.
Refer to API usage for the textblob classifier (wiki) for implementation details.
## Training classifier on your own dataset
You can train the classifier on your own dataset!
Put your dataset folder (e.g. data_foo) inside the data folder
$ tree data/data_foo/ -L 1
data/data_foo/
├── ham
└── spam
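Before training, it can help to confirm that a new dataset folder really has the ham and spam subfolders shown above. The following sketch does that and counts the files in each. The function name `check_dataset_layout` is hypothetical and not part of the project.

```python
import os


def check_dataset_layout(dataset_dir):
    """Return the number of files in the ham and spam subfolders.

    Raises ValueError if either subfolder is missing.
    """
    counts = {}
    for label in ("ham", "spam"):
        folder = os.path.join(dataset_dir, label)
        if not os.path.isdir(folder):
            raise ValueError("missing subfolder: " + folder)
        # Count regular files only, ignoring nested directories.
        counts[label] = sum(
            1 for name in os.listdir(folder)
            if os.path.isfile(os.path.join(folder, name))
        )
    return counts
```

For example, `check_dataset_layout("data/data_foo")` would return a dict with the ham and spam file counts for the example folder above.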
- Specify the folder name of your newly added dataset and the name of the pickle file to be created in the file create_pickle.py
- Choose the number of files to train the classifier against, also in the file create_pickle.py
$ make pickle
Over many runs, the accuracy generally falls in these ranges:
Class | Accuracy |
---|---|
Spam | 80 to 94% |
Ham | 70 to 80% |
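The README does not say exactly how these figures were computed. Assuming they are per-class accuracies, that is, the share of spam (or ham) test emails that received the correct label, they could be computed along these lines. The function name `per_class_accuracy` is made up for this sketch.

```python
def per_class_accuracy(gold_labels, predicted_labels):
    """Return {label: fraction of that label's emails classified correctly}."""
    totals, correct = {}, {}
    for gold, predicted in zip(gold_labels, predicted_labels):
        totals[gold] = totals.get(gold, 0) + 1
        if gold == predicted:
            correct[gold] = correct.get(gold, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}
```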
Watch the classifier in action here
The dataset used is the Enron dataset.
We trained our spam_classifier.pickle classifier object against the full_corpus dataset and then cross-validated the pickled classifier with any of the datasets present in the data directory.
Read more about the directory structure here
Refer to CONTRIBUTING.md for more details
- Deploy a full-blown app to Heroku
- Build a voting system that takes the best out of all the classifiers (the aim is to increase accuracy)
- Try out textblob and see how it performs with our classifier
- Decide whether to use clint or termcolor (now using colorama, as explained in commit 89da4cd)
- Try implementing some of the algorithms using scikit-learn
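As a starting point for the voting item above, one simple option is a plain majority vote over the labels predicted by each classifier. This is only one possible design, and the function name `majority_vote` is an assumption; nothing like it exists in the project yet.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label among the classifiers' predictions."""
    if not labels:
        raise ValueError("need at least one prediction")
    # On Python 3.7+, most_common orders equal counts by first appearance,
    # so a tie goes to the label from the earliest classifier in the list.
    return Counter(labels).most_common(1)[0][0]
```

With an odd number of classifiers and two labels (ham and spam), ties cannot occur, which keeps the behaviour easy to reason about.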
Open sourced under GPLv3
spamfilter
Copyright (C) 2016 Tasdik Rahman(prodicus@outlook.com)
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
You can find a copy of the LICENSE file here.