
Big-Data/pattern-recognition-for-text-documents-classification

 
 


About

This repository contains the result of my undergraduate thesis in Electrical Engineering. It is a simple text classification system with the following components:

  • Naive Bayes classifier: the algorithms are modified versions of those in Manning et al.
  • Document Frequency (DF) feature selection, following Yiming Yang
  • Web scraping framework (built on Scrapy) which uses Document Frequency feature selection
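To illustrate the Document Frequency idea, here is a minimal sketch, not the code from this repository. It assumes documents are already tokenized into lists of words and keeps the terms that appear in the largest number of documents. The function name `select_by_document_frequency` and the sample documents are hypothetical.

```python
from collections import Counter


def select_by_document_frequency(tokenized_docs, num_features):
    """Return the num_features terms that occur in the most documents."""
    df = Counter()
    for tokens in tokenized_docs:
        # A term counts once per document, however often it repeats inside it.
        df.update(set(tokens))
    # Rank by descending document frequency; ties are broken alphabetically
    # so the result is deterministic.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:num_features]]


# Hypothetical toy corpus, for illustration only.
docs = [
    ["circuit", "voltage", "current"],
    ["circuit", "signal", "filter"],
    ["protein", "cell", "circuit", "signal"],
]
print(select_by_document_frequency(docs, 2))  # ['circuit', 'signal']
```

Raising `num_features` enlarges the vocabulary handed to the classifier, which is the variable studied in the experiment.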

Objective

The objective of this classification system is to classify a thesis into its respective field of knowledge.
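As a rough illustration of how a Naive Bayes classifier assigns a document to a field, the sketch below follows the textbook multinomial form with add-one smoothing. It is an assumption-laden example, not the modified algorithm used in this project. The function names, labels, and sample data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(labeled_docs, vocabulary):
    """labeled_docs is a list of (label, tokens) pairs.

    Returns (log_prior, log_likelihood) using add-one smoothing.
    """
    vocab = set(vocabulary)
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        # Only terms kept by feature selection contribute to the model.
        term_counts[label].update(t for t in tokens if t in vocab)

    total_docs = sum(doc_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in doc_counts.items()}

    log_likelihood = {}
    for c in doc_counts:
        denom = sum(term_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + 1) / denom) for t in vocab
        }
    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Return the label with the highest log-posterior score."""
    scores = {}
    for c, prior in log_prior.items():
        # Terms outside the vocabulary are ignored.
        scores[c] = prior + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
training = [
    ("engineering", ["circuit", "voltage", "circuit"]),
    ("biology", ["cell", "protein", "cell"]),
]
vocab = ["circuit", "voltage", "cell", "protein"]
prior, likelihood = train_multinomial_nb(training, vocab)
print(classify(["circuit", "cell", "circuit"], prior, likelihood))  # engineering
```

Working in log space avoids numeric underflow when a long document multiplies many small probabilities.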

Experimental Setup

The system was subjected to the following experiment:

  • 647 theses were downloaded from Digital Library - USP, the thesis database of the Universidade de São Paulo
  • Courses were chosen at random
  • 75% of the theses were used for training (chosen at random)
  • The remaining 25% were used for testing
  • Objective: observe the relationship between the number of features and the global accuracy
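The random 75/25 split and the global accuracy metric can be sketched as follows. This is a minimal example under assumptions: the function names, the fixed seed, and the rounding rule are hypothetical choices, and global accuracy is taken here to mean the fraction of test documents classified correctly.

```python
import random


def split_train_test(items, train_fraction=0.75, seed=0):
    """Shuffle items and split them into (train, test) lists."""
    shuffled = list(items)
    # A seeded generator makes the split reproducible (seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def global_accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Document ids stand in for the 647 downloaded theses.
train, test = split_train_test(range(647))
print(len(train), len(test))
```

With 647 items this sketch yields 485 training and 162 test documents; the exact counts used in the thesis are not stated here. Repeating the train and evaluate steps for several feature counts gives the accuracy curve the experiment set out to observe.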

Results

As expected, accuracy increased monotonically as the number of features increased. The full results are in the final document (tcc.pdf, written in pt-BR).

The system achieved 84.66% global accuracy when using 12359 features. Even with this large number of features and a relatively large document space, it remained reasonably fast: feature extraction took 40 s, training 50 s, and classification 20 s (110 s in total).

In terms of processing throughput, training ran at 280k words per second and classification at 1.6k words per second.

Repeating the Experiment

If you're interested in repeating it or learning from it, feel free to contact me.

Requirements
