txtcat-nb

Description:

txtcat-nb is a simple text file classifier based on naive Bayes. A somewhat rudimentary description of the math involved is in techdoc.tex; better descriptions of the technique exist elsewhere.

Data:

Training data is assumed to be in a single directory; the subdirectories within the data directory are used as labels.

For example,

ls data/
data/a data/b
ls data/*
data/a/1.txt data/a/2.txt data/b/3.txt data/b/4.txt

would correspond to a two class dataset with labels a and b, each with two data files.
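The layout above can be read into a label-to-files mapping with a few lines of Python. This is only a sketch of the directory convention, not the code in nbio.py; the function name `list_labeled_files` is hypothetical.

```python
import os


def list_labeled_files(datadir):
    """Map each label (subdirectory name) to a sorted list of its file paths.

    Hypothetical helper illustrating the directory convention only.
    """
    labeled = {}
    for label in sorted(os.listdir(datadir)):
        label_dir = os.path.join(datadir, label)
        if not os.path.isdir(label_dir):
            continue  # ignore stray files at the top level of the data directory
        paths = [os.path.join(label_dir, name)
                 for name in sorted(os.listdir(label_dir))]
        # keep regular files only; nested directories are skipped
        labeled[label] = [p for p in paths if os.path.isfile(p)]
    return labeled
```

Calling `list_labeled_files("data/")` on the example directory would give one entry for label a and one for label b, each holding two file paths.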

Requirements:

Python >= 2.7 (needed for collections.Counter and json). Also works with Python 3 (tested with 3.1.2).
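As a rough illustration of how the two modules fit together: word counts can be kept in a Counter per label and round-tripped through JSON. The actual layout of model.json is not documented here, so the structure below (a plain label-to-counts mapping) and the function names are assumptions.

```python
import json
from collections import Counter


def count_words(text):
    """Count whitespace-separated words in a string."""
    return Counter(text.split())


def counts_to_json(counts_by_label):
    """Serialize {label: Counter} to a JSON string (assumed layout)."""
    plain = dict((label, dict(c)) for label, c in counts_by_label.items())
    return json.dumps(plain, sort_keys=True)


def counts_from_json(payload):
    """Rebuild {label: Counter} from a string made by counts_to_json."""
    return dict((label, Counter(c)) for label, c in json.loads(payload).items())
```

A Counter returns 0 for words it has never seen, which is convenient when looking up words that do not occur in a given label.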

Usage:

Training usage:

For training data laid out as above, the following stores the trained classifier in model.json and uses 30% of the training files for each label for statistics:

train.py -d data/ -o model.json -p 0.3

or

train.py --datadir=data/ --output=model.json --pct=0.3
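How train.py chooses the files covered by the percentage is not spelled out above. One plausible reading, sketched here as an assumption, is a random per-label subset; `split_by_fraction` and its `seed` parameter are hypothetical names, not options of train.py.

```python
import random


def split_by_fraction(labeled_files, pct, seed=0):
    """For each label, set aside about pct of its files.

    Returns (selected, rest), two {label: [paths]} mappings.
    Assumption: the subset is drawn at random, independently per label.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    selected, rest = {}, {}
    for label, files in labeled_files.items():
        shuffled = list(files)
        rng.shuffle(shuffled)
        k = int(round(pct * len(shuffled)))  # number of files to set aside
        selected[label] = shuffled[:k]
        rest[label] = shuffled[k:]
    return selected, rest
```

Splitting per label rather than over the whole file list keeps the proportion the same for every class, even when the classes differ in size.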

Classification usage:

classify.py -m model.json -d unlabeled-data/

or

classify.py --model=model.json --datadir=unlabeled-data/

The smoothing constant can be changed if desired; the default of 1 seems to work reasonably well.
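To show where the smoothing constant enters, here is a minimal multinomial naive Bayes sketch with additive smoothing. It is not the code in NaiveBayes.py; the function names, the model layout, and the choice to skip words never seen in training are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(docs_by_label):
    """docs_by_label: {label: [list of word lists]} -> model dict.

    Assumes every label has at least one document.
    """
    word_counts, doc_counts, vocab = {}, {}, set()
    for label, docs in docs_by_label.items():
        counts = Counter()
        for words in docs:
            counts.update(words)
        word_counts[label] = counts
        doc_counts[label] = len(docs)
        vocab.update(counts)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def log_posteriors(model, words, alpha=1.0):
    """Unnormalized log P(label | words) with smoothing constant alpha."""
    total_docs = float(sum(model["doc_counts"].values()))
    vocab_size = len(model["vocab"])
    scores = {}
    for label, counts in model["word_counts"].items():
        score = math.log(model["doc_counts"][label] / total_docs)  # log prior
        denom = sum(counts.values()) + alpha * vocab_size
        for w in words:
            if w not in model["vocab"]:
                continue  # assumption: ignore words absent from training data
            # alpha keeps the probability non-zero for words unseen in this label
            score += math.log((counts[w] + alpha) / denom)
        scores[label] = score
    return scores


def classify(model, words, alpha=1.0):
    """Return the label with the highest log posterior."""
    scores = log_posteriors(model, words, alpha)
    return max(scores, key=scores.get)
```

With alpha set to 0, a word that never occurs under a label would give that label a probability of zero (and a math error in the logarithm here), which is why some positive constant is needed.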

Known Issues:

Doesn't currently support full cross-validation in training.
No preprocessing of words yet - accuracy for my current test cases is sufficient without it.
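For reference, a very basic form of word preprocessing could look like the sketch below. It is not part of txtcat-nb; the `preprocess` name and the choice of lowercasing plus alphanumeric tokenizing are assumptions.

```python
import re

# runs of lowercase ASCII letters and digits count as one token
_TOKEN = re.compile(r"[a-z0-9]+")


def preprocess(text):
    """Lowercase the text and return its alphanumeric tokens in order."""
    return _TOKEN.findall(text.lower())
```

Note that this splits contractions and hyphenated words at the punctuation, which may or may not be acceptable for a given dataset.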
