This is the result of my graduation thesis in Electrical Engineering. It is a simple classification system with the following specs:
- Naive Bayes classifier - algorithms are modified versions of those in Manning et al.
- Document Frequency (DF) feature selection, following Yiming Yang
- Web scraping framework (built on Scrapy) that uses Document Frequency feature selection.
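As an illustration of the Document Frequency step listed above, here is a minimal sketch that keeps the terms appearing in the most documents. It is not the code used in the thesis: the function name `select_by_document_frequency`, the top-k cutoff, and the alphabetical tie-breaking are assumptions made for this example, and the toy documents are invented.

```python
from collections import Counter


def select_by_document_frequency(documents, num_features):
    """Return the num_features terms that occur in the most documents.

    `documents` is an iterable of token lists. This helper is hypothetical
    and only illustrates the idea of DF-based feature selection.
    """
    df = Counter()
    for tokens in documents:
        # A term counts once per document, however often it repeats inside it.
        df.update(set(tokens))
    # Highest DF first; ties are broken alphabetically so the result is stable.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:num_features]]


# Toy corpus (invented for the example).
docs = [
    ["circuit", "voltage", "circuit"],
    ["voltage", "signal"],
    ["protein", "cell", "voltage"],
]
top_terms = select_by_document_frequency(docs, 2)
print(top_terms)
```

Here "voltage" appears in all three documents and is selected first; the remaining terms all have a document frequency of 1, so the tie-break decides which one fills the second slot.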
The objective of this classification system is to classify a thesis into its respective field of knowledge.
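To give an idea of how a Naive Bayes classifier assigns a field of knowledge to a document, here is a minimal sketch of the textbook multinomial variant with add-one smoothing. The thesis uses modified versions of the algorithms, which are not reproduced here; the function names, the toy training data, and the class labels below are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(labeled_docs, vocabulary):
    """Estimate log priors and smoothed log likelihoods (hypothetical helper)."""
    vocab = set(vocabulary)
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        # Only terms kept by feature selection contribute to the model.
        term_counts[label].update(t for t in tokens if t in vocab)
    total_docs = sum(class_doc_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Add-one (Laplace) smoothing avoids zero probabilities.
        denom = sum(term_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + 1) / denom) for t in vocab
        }
    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, prior in log_prior.items():
        # Terms outside the vocabulary are ignored.
        scores[c] = prior + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy training set (invented for the example).
training = [
    (["circuit", "voltage", "circuit"], "engineering"),
    (["voltage", "signal"], "engineering"),
    (["protein", "cell"], "biology"),
    (["cell", "membrane", "protein"], "biology"),
]
vocabulary = ["circuit", "voltage", "signal", "protein", "cell", "membrane"]
log_prior, log_likelihood = train_multinomial_nb(training, vocabulary)
predicted = classify(["voltage", "circuit", "noise"], log_prior, log_likelihood)
print(predicted)
```

Working in log space keeps the scores numerically stable when a document contains many terms, which matters once the vocabulary grows to thousands of features.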
The system was subject to the following experiment:
- 647 theses were downloaded from Digital Library - USP, which is a thesis database for the Universidade de São Paulo
- Courses were chosen at random
- 75% used for training (chosen at random)
- 25% used for testing
- Objective: observe the relationship between the number of features and the output global accuracy
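The experimental protocol above can be sketched as a random 75%/25% split followed by a global accuracy computation (correct predictions divided by the number of test documents). This is only an illustration: `split_train_test`, `global_accuracy`, the fixed seed, and the rounding of the split point are assumptions, and the exact train/test counts used in the thesis are not stated here.

```python
import random


def split_train_test(items, train_fraction=0.75, seed=0):
    """Shuffle a copy of `items` and split it (hypothetical helper)."""
    shuffled = list(items)
    # A seeded generator makes the random split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def global_accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# 647 placeholder document ids stand in for the downloaded theses.
theses = list(range(647))
train_set, test_set = split_train_test(theses)
print(len(train_set), len(test_set))

# Invented labels, only to show how the metric is computed.
print(global_accuracy(["a", "b", "c", "d"], ["a", "b", "c", "x"]))
```

With this rounding rule, 647 documents split into 485 for training and 162 for testing.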
Accuracy was observed to increase monotonically with the number of features, as expected. The full results are shown in the final document (tcc.pdf, in pt-BR).
This system achieved 84.66% global accuracy when using 12359 features. Even with this large number of features and a relatively large document space, it remained reasonably fast: training took 50 s, classification 20 s, and feature extraction 40 s (110 s in total for all steps).
In terms of processing throughput, training and classification ran at 280k and 1.6k words per second, respectively.
If you're interested in reproducing the experiment or learning from it, feel free to contact me.