
AuthorDetector is a framework to quickly develop, experiment and prototype author identification models.


AmineMab/author-detector

 
 


AUTHORDETECTOR README

OVERVIEW

AuthorDetector is a framework to quickly develop and test author identification algorithms.

The project aims to be:

  • Modular: you can create new algorithms and modules without knowing how the other modules work.
  • Reusable: there is no need to recode preprocessors, readers, featuresextractor and postprocessors; you can focus on your new algorithm.
  • Flexible: you can create new modules, replace any module in the workflow, or even change the entire workflow by creating new module categories.
  • Low overhead: you don't have to care about the usual repetitive tasks, such as how to set a new variable in the config or how the learned parameters are saved and reloaded; most of this "administrative" work is automated.
  • User-friendly: you can define new config variables with myvar = self.config.get("myvar"), without defining anything outside the scope of your own module. There is also a GUI based on IPython Notebook, and the application can be fully scripted from a user's Python script.
  • Simple and tiny: the core framework consists of only a few Python scripts with few methods and functions, which are easy to understand and to extend if needed.
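
The self.config.get("myvar") pattern from the list above can be sketched as follows. This is only an illustration, not the framework's actual base class: WordLengthFilter, its "min_length" variable and the plain dict standing in for the config object are all hypothetical. The only element taken from the project is the self.config.get(...) call.

```python
# Hypothetical module: the class name, the "min_length" variable and the
# dict-based config are assumptions made for illustration only.
class WordLengthFilter(object):
    def __init__(self, config):
        # In this sketch the config is any object exposing .get(), such as a dict.
        self.config = config

    def process(self, words):
        # Read the module's own variable; fall back to a default when it is unset.
        min_length = self.config.get("min_length")
        if min_length is None:
            min_length = 3
        return [w for w in words if len(w) >= min_length]


config = {"min_length": 4}
kept = WordLengthFilter(config).process(["the", "author", "of", "this", "text"])
print(kept)
```

The point of the pattern is that the variable is declared where it is used: nothing outside the module needs to know that "min_length" exists.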

For more information on development and usage, read the developer's documentation inside the "doc" folder, as well as the PDF introduction.

COLLABORATIVE SETTING

This application was conceived both as a flexible and powerful development framework and as an easy-to-use prototype application.

As such, the ideal collaborative team setting to get the most from this software would be composed of:

  • at least one algorithm developer, whose goal is to extend the functionality of the application. The developer team would then send this software, together with the modules they developed and perhaps an example config file as a quick start, to the other teams (see below).
  • one or several scientists from other fields (e.g. linguistics, literature) who would use this software primarily to conduct experiments on texts. This "experiment" team would mostly work by tweaking the config files and comparing the results of experiments. Some Python knowledge would help to take the experiments further using the IPython Notebook GUI.
  • finally, if the final model is meant to be shipped to a third party (e.g. content publishers, libraries), the software could serve as a first hands-on prototype by packaging it with predefined config files and a set of texts. The recipients would then be able to try the model for themselves (either on the included texts or on others) by simply launching the software from the command line (a bash script can be included to make usage quick and simple).

INSTALL

You will need Python 2.7 and a few common scientific libraries. See the "install.md" file for more info on how to proceed.

Note: TreeTagger for Windows and MacOSX is supplied in authordetector/lib/treetagger/TreeTagger. If you need Linux support or want to update the version, you can replace the files inside that folder. Doing so overwrites the MacOSX support, since the Mac and Linux files share the same names.

USAGE

First, configure the software using a main config file and then a text config file.

Use the sample "config.json" as a starting point for the main config file, and "textconfig.json" and "textconfig_detection.json" as starting points for the learning text config and the identification text config, respectively.

Then you can use the software in several ways:

  • Launch the GUI using IPython Notebook with the file "ipynb/AuthorDetector.ipynb".

  • Use the command line. To launch the learning phase:

    python authordetector.py --learn -c config.json --textconfig textconfig.json -p parameters.txt

For the detection/identification phase:

python authordetector.py -c config.json -p parameters.txt --textconfig_detection textconfig_detection.json

Type python authordetector.py --help to get more info on possible arguments.
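
For batch experiments, the two phases can be chained from a small driver script. The sketch below only assembles the argument lists from the flags shown above. It assumes that the learning phase is selected with --learn followed by -c, and that the script is run from the folder containing authordetector.py. build_commands and run_both are hypothetical helpers, not part of AuthorDetector.

```python
import subprocess


def build_commands(config="config.json", parameters="parameters.txt",
                   textconfig="textconfig.json",
                   textconfig_detection="textconfig_detection.json"):
    """Return the learning and identification command lines as lists."""
    base = ["python", "authordetector.py"]
    learn = base + ["--learn", "-c", config,
                    "--textconfig", textconfig, "-p", parameters]
    detect = base + ["-c", config, "-p", parameters,
                     "--textconfig_detection", textconfig_detection]
    return learn, detect


def run_both(**kwargs):
    """Run learning, then identification; stop as soon as a phase fails."""
    for command in build_commands(**kwargs):
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(command)


# Print the commands instead of running them, so the sketch has no side effects.
for command in build_commands():
    print(" ".join(command))
```

Calling run_both() executes the two commands in order; keyword arguments such as config="myconfig.json" replace the sample file names.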

  • Use it as a Python module in your own program (Python, but also Java, C++, etc. through an embedded Python interpreter) with the following snippet:

    import authordetector.main
    runner = authordetector.main.main(['--script'])

Here, the '--script' command-line argument tells the AuthorDetector application that it is being imported as a module. You can then use:

runner.learn() # to launch the learning phase
runner.run() # to launch the identification phase
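
In a user script, the two calls above are typically run in sequence. The sketch below wraps them in a hypothetical helper, learn_then_identify, and exercises it with a stand-in runner so that it executes without AuthorDetector installed. What learn() and run() return is not documented here, so the helper simply passes the values through.

```python
def learn_then_identify(runner):
    """Run the learning phase, then the identification phase.

    `runner` is expected to expose learn() and run(), like the object
    returned by authordetector.main.main(['--script']).
    """
    learned = runner.learn()
    result = runner.run()
    return learned, result


class FakeRunner(object):
    """Stand-in used only to make this sketch runnable on its own."""

    def __init__(self):
        self.calls = []

    def learn(self):
        self.calls.append("learn")

    def run(self):
        self.calls.append("run")


fake = FakeRunner()
learn_then_identify(fake)
print(fake.calls)  # ['learn', 'run']
```

With the framework installed, the stand-in would be replaced by the object returned by authordetector.main.main(['--script']).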

DOCUMENTATION

  • The developer's documentation, generated by Doxygen, is available in doc/index.html
  • Presentation slides (in French) are available inside doc/slides

AUTHOR

The framework was originally developed by Stephen Larroque < l r q 3 0 0 0 a t g m a i l d o t c o m> under the supervision of Pr. Jean-Gabriel Ganascia at Labex OBVIL and LIP6.

THIRD-PARTY LICENSING

This project makes heavy use of third-party libraries, mainly:

  • The TreeTagger project, which is included as the default featuresextractor in this framework.
  • IPython
  • Pandas
  • NumPy

Please refer to their licensing information if you want to reuse these libraries.
