
0-1-0/authorship


Setup

  1. The most convenient way is just to run the ./setup script. If you have any trouble with it, here is the full list of manual operations to complete (collected into a single shell session after this list):
  • Install Python 2.7
  • Create a virtual environment: virtualenv venv --no-site-packages
  • Activate it: source venv/bin/activate
  • Install Python setuptools / easy_install (instructions: https://pypi.python.org/pypi/setuptools)
  • Install pip: sudo easy_install pip
  • Install scikit-learn, scipy, numpy, and spacy: pip install -U numpy scipy scikit-learn spacy && python -m spacy.en.download all
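
For reference, here are the manual steps above collected into one shell session (a sketch of what ./setup is meant to automate; it assumes Python 2.7 and easy_install are already present on your system):

```
virtualenv venv --no-site-packages    # isolated environment with local packages only
source venv/bin/activate              # use the local interpreter and libraries
sudo easy_install pip                 # install pip if it is not available yet
pip install -U numpy scipy scikit-learn spacy
python -m spacy.en.download all       # download the spaCy English models and data
```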

Usage

First, you want to activate the virtual environment created in the previous step (to use the local Python libraries rather than the global ones). You can do this with source venv/bin/activate, which modifies your PATH for this purpose. When you are done, you can leave the venv with the deactivate command.

Second, there are three main utilities: train, predict, and check_accuracy. With train.py you train a model on your data, and then predict authorship probabilities for anonymous texts with predict.py. check_accuracy.py performs cross-validation on the training data, so you can test and compare the performance of different approaches.
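
A rough end-to-end session might look like this (train.txt and test.txt are the example file names used in the binary-distribution section below; predictions.txt is just an illustrative output name):

```
source venv/bin/activate
python check_accuracy.py -i train.txt                            # compare approaches via cross-validation
python train.py -i train.txt -o model.pkl                        # fit a model and save it
python predict.py -i test.txt -m model.pkl -o predictions.txt    # predict authorship for anonymous texts
deactivate
```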

Here is the full list of command-line options for each of them, with an example invocation after each list:

CLI options for check_accuracy:

  1. -i, --input - path to the file with training data. The data has to be in the same format as example_train.py

  2. -s, --selection [chi2, logreg, svd1000, pca500] - feature selection method.

    • chi2 - chi2-based statistical test
    • logreg - most important features selected by L1-regularized logistic regression
    • svd1000 - Singular Value Decomposition; 1000 (or any other number) specifies the final dimensionality
    • pca500 - Principal Component Analysis; 500 (or any other number) specifies the dimensionality, as with svd
  3. -c, --cls [logreg, svc, rf100] - classifier type:

    • logreg - logistic regression with L2 regularization
    • svc - SVM classifier.
    • rf100 - Random Forest classifier with 100 estimators. You could also use rf500, rf1000, and so on.
  4. -n - number of samples to use (for a quick accuracy check you may want to use a small number of samples)

  5. -v, --vectorizer [bow, word2vec, word2vec2, combined] - text vectorizer type

    • bow - Bag of Words - vector representation of texts based on frequencies of n-grams, POS tags, and some other additional statistics
    • word2vec - straightforward sum of the word2vec vectors for the words in a paragraph
    • word2vec2 - TF-IDF-weighted sum of the word2vec vectors in a text
    • combined - frequency features + sum of word vectors
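
Putting these options together, a possible check_accuracy run (the input file name, sample count, and the particular combination of options are only illustrative):

```
python check_accuracy.py -i train.txt -s chi2 -c rf100 -v bow -n 1000
```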

CLI options for train:

  1. -i (same as check_accuracy)
  2. -s (same as check_accuracy)
  3. -c (same as check_accuracy)
  4. -o, --output - output file for the model. Default is model.pkl
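
For example (the file name follows the examples used elsewhere in this README; the feature selection and classifier choices are illustrative):

```
python train.py -i train.txt -s chi2 -c logreg -o model.pkl
```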

CLI options for predict:

  1. -i, --input - location of the file with test data; the file has to be in the same format as example_test.txt
  2. -m, --model - location of the model file
  3. -o, --output - name of the file where predictions will be stored
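
For example (the output file name is illustrative):

```
python predict.py -i test.txt -m model.pkl -o predictions.txt
```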

Binary distributions:

As an alternative to launching the Python scripts, you can use the binary distributions. How to do that (a consolidated session follows the list):

  • Unzip dist.zip
  • Unzip data.zip into the dist folder
  • Run the train and predict binaries with ./dist/train -i train.txt and ./dist/predict -i test.txt
  • Check accuracy with ./dist/check_accuracy -i enrone.txt
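
As a single session (assuming dist.zip and data.zip sit in the repository root):

```
unzip dist.zip
unzip data.zip -d dist                 # the data has to end up inside the dist folder
./dist/train -i train.txt
./dist/predict -i test.txt
./dist/check_accuracy -i enrone.txt
```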

How to create binaries:

  1. Install cx_Freeze (http://cx-freeze.sourceforge.net)
  2. Run cxfreeze, for example cxfreeze train.py
  3. By default, binaries are saved in ./dist. You can also specify another directory (see the cx_Freeze documentation)
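
A possible build sequence for all three utilities (assuming cx_Freeze installs cleanly via pip on your platform; the original instructions point to the sourceforge page instead):

```
pip install cx_Freeze        # inside the activated virtualenv
cxfreeze train.py            # binaries are written to ./dist by default
cxfreeze predict.py
cxfreeze check_accuracy.py
```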

TODO:

  1. Try an approach with cumulative TF-IDF-weighted word2vec semantic vectors
  2. Provide command-line options for: a) feature generation algorithm (-f) [ngrams, word2vec]
  3. Provide options for logreg, rf, and svc parameters as part of their names

Future:

  1. Use the full Reddit corpus for training? https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/ (42 GB compressed)
