stylometry

A parallel stylometric framework in Python for processing big data on clusters

Features

- Parallelized (thus fast)
- Intended to integrate with a database-based corpus
- A variety of feature-generation techniques:
  - byte-ngrams
  - word-ngrams
  - readability metrics
  - simple statistics
  - part-of-speech tagging
  - part-of-speech ngrams
  - word/pos hybrids
- Plugs into a variety of stylometric techniques:
  - ppm-c (compression)
  - dmc (compression)
  - gvc (spam-filter)
  - sofia-ml (machine learning)
- Some graphing utilities to show performance
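
To give a feel for the byte-ngram and word-ngram features listed above, here is a minimal sketch of how such counts could be computed and spread across worker processes. It is not the framework's own code: the function names (`byte_ngrams`, `word_ngrams`, `extract_features`), the whitespace tokenizer, and the use of `multiprocessing.Pool` are all assumptions made for illustration.

```python
from collections import Counter
from multiprocessing import Pool


def byte_ngrams(text, n=3):
    """Count overlapping byte n-grams of the UTF-8 encoding of `text`."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping word n-grams after naive whitespace tokenization."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def extract_features(documents, extractor=byte_ngrams, processes=2):
    """Apply `extractor` to every document across a pool of worker processes.

    Hypothetical helper: the real framework may distribute work differently.
    """
    with Pool(processes=processes) as pool:
        return pool.map(extractor, documents)


if __name__ == "__main__":
    docs = ["the cat sat on the mat", "the dog sat on the log"]
    for counts in extract_features(docs, extractor=word_ngrams):
        print(counts.most_common(3))
```

Each document is processed independently, which is what makes this kind of feature generation easy to parallelize: more cores simply means more documents in flight at once.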

We also provide plugs to transform existing corpora into the database format and to export features in the SVM-light sparse data format.
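
For readers unfamiliar with the SVM-light sparse format, each line holds a label followed by `index:value` pairs in ascending index order, with zero-valued features omitted. The sketch below shows one way to write such lines; the helper names (`build_vocabulary`, `to_svmlight_line`) and the sorted 1-based indexing scheme are assumptions for illustration, not the export plug shipped with this repository.

```python
def build_vocabulary(feature_dicts):
    """Assign each distinct feature name a 1-based index, in sorted order."""
    names = sorted({name for feats in feature_dicts for name in feats})
    return {name: i for i, name in enumerate(names, start=1)}


def to_svmlight_line(label, features, vocabulary):
    """Format one example as '<label> <index>:<value> ...', indices ascending."""
    pairs = sorted((vocabulary[name], value)
                   for name, value in features.items() if value != 0)
    return " ".join([str(label)] + [f"{i}:{v:g}" for i, v in pairs])


if __name__ == "__main__":
    # Toy data: (label, feature-count dictionary) pairs.
    examples = [(1, {"the": 2, "cat": 1}), (-1, {"the": 1, "dog": 3})]
    vocab = build_vocabulary(feats for _, feats in examples)
    for label, feats in examples:
        print(to_svmlight_line(label, feats, vocab))
```

Because only non-zero features are written, the format stays compact even when the vocabulary of n-grams is very large.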

Assumptions

We assume you have lots of RAM, lots of time, lots of CPU cores, or all three.

Haphazard off-the-cuff observed metrics

- 30 million comments generally take about a day to process for 1 type of feature
- 3 million posts generally take about an hour to process for 1 type of feature
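
Converted to items per second, the rough figures above work out as follows. This is plain arithmetic on the numbers quoted, assuming "a day" means 24 hours; the function name `items_per_second` is only illustrative.

```python
def items_per_second(items, seconds):
    """Average throughput for `items` processed in `seconds`."""
    return items / seconds


comments_rate = items_per_second(30_000_000, 24 * 60 * 60)  # about a day
posts_rate = items_per_second(3_000_000, 60 * 60)           # about an hour

print(f"comments: ~{comments_rate:.0f}/s, posts: ~{posts_rate:.0f}/s")
# prints "comments: ~347/s, posts: ~833/s"
```

The two rates are not directly comparable, since they were observed on different corpora and possibly different hardware.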
