pombredanne/stylometry

 
 

stylometry

A parallel stylometric framework in Python for big data on clusters

Features

  • Parallelized (thus fast)
  • Intended to integrate with a database-backed corpus
  • A variety of feature-generation techniques:
    • byte-ngrams
    • word-ngrams
    • readability metrics
    • simple statistics
    • part-of-speech tagging
    • part-of-speech ngrams
    • word/pos hybrids
  • Plugs into a variety of stylometric techniques:
    • ppm-c (compression)
    • dmc (compression)
    • gvc (spam-filter)
    • sofia-ml (machine learning)
  • Some graphing utilities to show performance
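
As an illustration of the first two feature types and of the parallel style, here is a minimal sketch that counts byte n-grams and word n-grams per document and maps the work across a process pool. The function names (`byte_ngrams`, `word_ngrams`, `extract_features`, `extract_corpus`) are hypothetical and are not taken from this project's code; the whitespace tokenization and lowercasing are simplifying assumptions.

```python
from collections import Counter
from multiprocessing import Pool


def byte_ngrams(text, n=3):
    """Count overlapping byte n-grams of the UTF-8 encoding of `text`."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping word n-grams (lowercased, split on whitespace)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def extract_features(text):
    """Hypothetical per-document feature bundle (not the project's API)."""
    return {"byte3": byte_ngrams(text, 3), "word2": word_ngrams(text, 2)}


def extract_corpus(texts, workers=2):
    """Run extract_features over all documents in a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(extract_features, texts)


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog sat on the log"]
    for features in extract_corpus(corpus):
        print(features["word2"].most_common(2))
```

Each document is processed independently, so the work splits cleanly across cores; in a real deployment the list of texts would presumably be read from the database-backed corpus rather than held in memory.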

We also provide some plugs to transform existing corpora into database format and to export features into SVM-light sparse data format.
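
For reference, SVM-light's sparse format stores one example per line as `<label> <index>:<value> ...`, with feature indices that are positive integers in increasing order. The sketch below shows one way such an export could look; `build_vocabulary` and `to_svmlight_line` are hypothetical helpers written for this example, not the project's exporter.

```python
def build_vocabulary(documents):
    """Assign each feature name a 1-based index, in sorted name order."""
    names = sorted({name for features in documents for name in features})
    return {name: index for index, name in enumerate(names, start=1)}


def to_svmlight_line(label, features, vocabulary):
    """Format one document as an SVM-light line.

    Features missing from `vocabulary` and zero-valued features are skipped,
    which is what keeps the representation sparse.
    """
    pairs = sorted(
        (vocabulary[name], value)
        for name, value in features.items()
        if name in vocabulary and value != 0
    )
    return " ".join([str(label)] + [f"{index}:{value:g}" for index, value in pairs])


if __name__ == "__main__":
    docs = [{"the": 2, "cat": 1}, {"the": 1, "dog": 3}]
    labels = [1, -1]  # e.g. author A vs. not author A (illustrative labels)
    vocab = build_vocabulary(docs)
    for label, features in zip(labels, docs):
        print(to_svmlight_line(label, features, vocab))
```

Sorting the pairs by index matters because SVM-light expects indices in ascending order within a line.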

Assumptions

We assume you have lots of RAM, lots of time, lots of CPU cores, or all three.

Haphazard off-the-cuff observed metrics

  • Processing one type of feature over 30 million comments generally takes about a day
  • Processing one type of feature over 3 million posts generally takes about an hour
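
To turn these rough figures into a planning estimate, the short sketch below converts them into documents per second and projects a run time for a corpus of a given size. The helper names are made up for this example, and the result is only as reliable as the off-the-cuff numbers above.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def docs_per_second(documents, seconds):
    """Average throughput for one feature type."""
    return documents / seconds


def estimated_hours(documents, rate):
    """Projected run time in hours at a given documents-per-second rate."""
    return documents / rate / SECONDS_PER_HOUR


# Rates implied by the two observations above.
comments_rate = docs_per_second(30_000_000, SECONDS_PER_DAY)  # about 347 per second
posts_rate = docs_per_second(3_000_000, SECONDS_PER_HOUR)     # about 833 per second

if __name__ == "__main__":
    print(f"comments: {comments_rate:.0f}/s, posts: {posts_rate:.0f}/s")
    # Example projection: 10 million comments at the observed comment rate.
    print(f"{estimated_hours(10_000_000, comments_rate):.1f} hours")
```

The two observations imply different rates, so a projection should use the figure for the kind of document being processed.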
