Requirements

Python 3.x and pip
Gensim, NumPy, NLTK, NLTK Trainer, spaCy, scikit-learn, pandas, Pyphen, pyspellchecker

Configuration

Creating a virtualenv before installing the dependencies is highly recommended.

Dependencies

pip3 install virtualenv
virtualenv <YOU_NAME_IT>
source <THE_NAME_ABOVE>/bin/activate
pip install -r requirements.txt
sh setup.sh
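
If it is unclear whether the environment was activated before running pip, a quick check from Python can help. This is a minimal sketch and not part of the repository; the helper name `in_virtualenv` is hypothetical. It relies only on the standard `sys.prefix` and `sys.base_prefix` attributes, plus the `sys.real_prefix` attribute that older virtualenv releases set.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside an environment, sys.prefix points at the environment directory
    # while sys.base_prefix keeps pointing at the system Python.
    base = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases exposed sys.real_prefix instead.
    return hasattr(sys, "real_prefix") or sys.prefix != base


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
```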

NLTK setup (within a Python interpreter)

import nltk
nltk.download('punkt')
nltk.download('mac_morpho')
nltk.download('stopwords')

The step above should download these resources into your nltk_data folder (~/nltk_data).
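
The same setup can be scripted so that only missing resources are downloaded. The sketch below is an illustration, not code from the repository: `ensure_nltk_resources` and its `find`/`download` parameters are hypothetical names, and the resource paths assume NLTK's usual layout under nltk_data (`tokenizers/` for punkt, `corpora/` for the other two). It uses `nltk.data.find`, which raises `LookupError` when a resource is absent, and `nltk.download`.

```python
# Resource paths under nltk_data (assumed layout), mapped to download ids.
RESOURCES = {
    "tokenizers/punkt": "punkt",
    "corpora/mac_morpho": "mac_morpho",
    "corpora/stopwords": "stopwords",
}


def ensure_nltk_resources(resources=RESOURCES, find=None, download=None):
    """Download only the resources not installed yet; return their ids."""
    if find is None or download is None:
        # Imported lazily so the helper can be exercised with stand-ins.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download
    fetched = []
    for path, package in resources.items():
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched
```

Calling `ensure_nltk_resources()` with no arguments uses the real NLTK functions; the optional parameters exist only so the logic can be checked without network access.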

Usage

TBD

ML text-extractor

Extracts text content from different sources (PDF, Docs and text files)
Converts text documents into stylometric features
Contains Random Forest and simple neural network classifiers applied to the data described in the next section
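
To give an idea of what converting text into stylometric features can look like, here is a minimal sketch using only the standard library. It is not the repository's implementation: the function name `stylometric_features` and the five features shown are illustrative assumptions, and the project's own feature set (about 50 features) may be defined differently.

```python
import re
from collections import Counter


def stylometric_features(text: str) -> dict:
    """Compute a few simple, illustrative stylometric features."""
    # Lower-cased word tokens; \w is Unicode-aware for str patterns.
    words = re.findall(r"\w+", text.lower())
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    counts = Counter(words)

    def per_word(value):
        # Guard against division by zero on empty input.
        return value / n_words if n_words else 0.0

    return {
        "avg_word_length": per_word(sum(len(w) for w in words)),
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "type_token_ratio": per_word(len(counts)),
        "hapax_ratio": per_word(sum(1 for c in counts.values() if c == 1)),
        "comma_rate": per_word(text.count(",")),
    }
```

Each document becomes a fixed-length numeric vector, which is the shape of input that classifiers such as a Random Forest expect.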

The data

There are two main types of data set inside the data/parsed-data folder:
-- Regular data files, with textual content and masked author name
-- Stylometric data files, that represent the conversion of the raw text into stylometric features (~50)

PS: Each data set comes in two versions: 'selected', from which authors with fewer than 3 samples were removed, and 'data', the complete data set with no exclusions.
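
The filtering rule behind the 'selected' version can be sketched as follows. This is an assumption about how such a filter might be written, not the repository's code: the function name `select_frequent_authors`, the `author` key, and the list-of-dicts layout are hypothetical.

```python
from collections import Counter


def select_frequent_authors(samples, min_samples=3, author_key="author"):
    """Keep only samples whose author has at least `min_samples` samples."""
    # Count how many samples each (masked) author contributes.
    counts = Counter(sample[author_key] for sample in samples)
    # Drop every sample belonging to an under-represented author.
    return [s for s in samples if counts[s[author_key]] >= min_samples]


if __name__ == "__main__":
    data = [
        {"author": "A1", "text": "first"},
        {"author": "A1", "text": "second"},
        {"author": "A1", "text": "third"},
        {"author": "A2", "text": "only one"},
    ]
    selected = select_frequent_authors(data)
    print(len(data), "samples in total,", len(selected), "after selection")
```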


Daanielvb/text-extractor
