About

This is the result of my graduation thesis in Electrical Engineering. It is a simple text classification system with the following components:

Naive Bayes classifier: the algorithms are modified versions of those in Manning et al.
Document Frequency (DF) feature selection, following Yiming Yang
Web scraping framework (built on Scrapy) that uses Document Frequency feature selection
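
As a rough illustration of Document Frequency feature selection, here is a minimal sketch. It assumes documents are already tokenized into lists of words and keeps the `k` terms that appear in the most documents; the function names and the top-`k` cutoff are assumptions made for this example and are not taken from the thesis code.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        # set(): a term counts once per document, however often it repeats
        df.update(set(tokens))
    return df


def select_features(docs, k):
    """Return the k terms with the highest document frequency."""
    df = document_frequency(docs)
    # Sort by descending DF, then alphabetically so ties are deterministic
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus (hypothetical data, for illustration only)
docs = [
    ["power", "grid", "control"],
    ["power", "control", "control"],
    ["protein", "cell", "power"],
]
vocabulary = select_features(docs, 2)
print(vocabulary)
```

Increasing `k` enlarges the vocabulary passed to the classifier, which is the quantity varied in the experiment described below.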

Objective

The objective of this classification system is to classify a thesis into its respective field of knowledge.
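
To make the classification step concrete, the sketch below implements a standard multinomial Naive Bayes classifier with add-one smoothing. It is the plain textbook formulation, not the modified algorithms used in the thesis; the function names, the labels, and the toy data are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs, vocabulary):
    """Estimate log priors and smoothed log likelihoods from (label, tokens) pairs."""
    vocab = set(vocabulary)
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for label, tokens in labeled_docs:
        class_docs[label] += 1
        # Only terms in the selected vocabulary are counted
        term_counts[label].update(t for t in tokens if t in vocab)

    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}

    log_likelihood = {}
    for c in class_docs:
        # Add-one (Laplace) smoothing over the vocabulary
        denom = sum(term_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + 1) / denom) for t in vocab
        }
    return log_prior, log_likelihood


def classify(model, tokens):
    """Return the class with the highest posterior log score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in tokens:
            # Terms outside the vocabulary are ignored
            if t in log_likelihood[c]:
                score += log_likelihood[c][t]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training set (hypothetical fields and words)
training = [
    ("engineering", ["circuit", "power", "grid"]),
    ("engineering", ["power", "control"]),
    ("biology", ["cell", "protein", "gene"]),
]
vocabulary = ["circuit", "power", "grid", "control", "cell", "protein", "gene"]
model = train(training, vocabulary)
print(classify(model, ["power", "circuit", "noise"]))
```

Working in log space avoids floating-point underflow when a long document contributes thousands of term probabilities to the product.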

Experimental Setup

The system was subjected to the following experiment:

647 theses were downloaded from Digital Library - USP, the thesis database of the Universidade de São Paulo
Courses were chosen at random
75% of the theses were used for training (chosen at random)
The remaining 25% were used for testing
Objective: observe the relationship between the number of features and the output global accuracy
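
The split and the accuracy measure can be sketched as follows. This is only an illustration: the fixed seed, the rounding of the 75% cut, and the definition of global accuracy as the fraction of correctly classified test documents are assumptions, and the exact procedure used in the thesis may differ.

```python
import random


def split_train_test(items, train_fraction=0.75, seed=0):
    """Shuffle items and split them into training and test subsets."""
    shuffled = list(items)
    # A seeded generator makes the random split repeatable (seed is an assumption)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def global_accuracy(predicted, actual):
    """Fraction of documents whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# 647 document ids stand in for the downloaded theses
train_set, test_set = split_train_test(range(647))
print(len(train_set), len(test_set))
```

Repeating the train and test steps for several feature counts, and recording `global_accuracy` each time, gives the accuracy-versus-features relationship the experiment set out to observe.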

Results

As expected, accuracy increased monotonically with the number of features. The full results are in the final document (tcc.pdf, written in pt-BR).

The system achieved 84.66% global accuracy when using 12359 features. Even with this large number of features and a relatively large document space, it remained reasonably fast: feature extraction took 40 seconds, training 50 seconds, and classification 20 seconds (110 seconds in total for all steps).

In terms of processing throughput, training and classification ran at 280k and 1.6k words per second, respectively.
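
For readers who want to take similar measurements, the arithmetic is simple. The sketch below adds up the stage times reported above and defines a hypothetical `words_per_second` helper; the word count in the usage line is made up for illustration and is not a figure from the thesis.

```python
# Stage durations in seconds, as reported in the results above
stage_seconds = {
    "feature extraction": 40,
    "training": 50,
    "classification": 20,
}
total_seconds = sum(stage_seconds.values())


def words_per_second(word_count, seconds):
    """Throughput of a stage: words processed divided by elapsed seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return word_count / seconds


print(total_seconds)
# Hypothetical example: one million words processed in four seconds
print(words_per_second(1_000_000, 4.0))
```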

Repeating the Experiment

If you're interested in repeating it or learning from it, feel free to contact me.

Big-Data/pattern-recognition-for-text-documents-classification
