
Daanielvb/text-extractor


Requirements

  • Python 3.x and pip
  • Gensim, NumPy, NLTK, NLTK Trainer, spaCy, scikit-learn, Pandas, Pyphen, Pyspellchecker

Configuration

  • Creating a virtualenv before installing the dependencies is highly recommended

  • Dependencies

pip3 install virtualenv
virtualenv <YOU_NAME_IT>
source <THE_NAME_ABOVE>/bin/activate
pip install -r requirements.txt
sh setup.sh
  • NLTK setup (within a Python interpreter)
import nltk
nltk.download('punkt')
nltk.download('mac_morpho')
nltk.download('stopwords')

The step above should download the NLTK resources into your nltk_data folder (~/nltk_data).
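To confirm the downloads landed where expected, a small standard-library check along these lines may help. It is only a sketch: it assumes NLTK's usual layout (tokenizers/punkt, corpora/mac_morpho, corpora/stopwords, each either unpacked or kept as a .zip), and `missing_nltk_resources` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# Assumed locations under nltk_data for the three resources downloaded above.
EXPECTED_RESOURCES = (
    "tokenizers/punkt",
    "corpora/mac_morpho",
    "corpora/stopwords",
)


def missing_nltk_resources(root=None):
    """Return the expected resources that are not found under the nltk_data folder."""
    # Default to ~/nltk_data when no folder is given.
    base = Path(root) if root is not None else Path.home() / "nltk_data"
    missing = []
    for resource in EXPECTED_RESOURCES:
        unpacked = base / resource
        zipped = base / (resource + ".zip")
        # A resource counts as present if either form exists.
        if not unpacked.exists() and not zipped.exists():
            missing.append(resource)
    return missing
```

Calling `missing_nltk_resources()` with no argument checks ~/nltk_data; an empty list means all three resources were found.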

Usage

  • TBD

ML text-extractor

  • Extracts textual content from different document sources (PDF, Docs and text files)
  • Converts text documents into stylometric features
  • Contains Random Forest and simple neural network classifiers for the data described in the next section
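To give a rough idea of what "stylometric features" can look like, here is a minimal sketch that computes three common ones using only the standard library. The function name, the feature names and the regex-based splitting are assumptions made for illustration; the project's own feature set (~50 features) and tokenization are likely different.

```python
import re


def stylometric_features(text):
    """Compute a few simple stylometric features for a piece of text."""
    # Naive sentence split on ., ! and ?; a real pipeline would likely use a tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens (runs of word characters).
    words = re.findall(r"\w+", text.lower())
    if not words:
        return {
            "avg_word_length": 0.0,
            "avg_sentence_length": 0.0,
            "type_token_ratio": 0.0,
        }
    return {
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Distinct words divided by total words (vocabulary richness).
        "type_token_ratio": len(set(words)) / len(words),
    }
```

Each document then becomes one row of numbers, which is the kind of input a Random Forest or neural network classifier can work with.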

The data

  • There are two main types of data set inside the data/parsed-data folder:
      -- Regular data files, with textual content and masked author names
      -- Stylometric data files, which represent the raw text converted into stylometric features (~50)

PS: Each data set comes in two versions: 'selected', in which authors with fewer than 3 samples were removed, and 'data', the complete data set with no exclusions.
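The filtering behind the 'selected' version could be sketched as follows. This assumes samples are held as (author, text) pairs, and `select_samples` is a hypothetical helper; the repository may well do this differently, for example with Pandas.

```python
from collections import Counter


def select_samples(samples, min_per_author=3):
    """Keep only samples whose author has at least `min_per_author` samples."""
    # Count how many samples each author contributes.
    counts = Counter(author for author, _ in samples)
    # Drop every sample belonging to an under-represented author.
    return [
        (author, text)
        for author, text in samples
        if counts[author] >= min_per_author
    ]
```

With the default threshold of 3, an author with only two samples is dropped entirely, while authors with three or more keep all of their samples.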

About

Textual document extractor with machine learning features
