AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. Autocorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.

Sandy4321/AutoCorpus

 
 

INTRODUCTION

Autocorpus is a set of utilities that enable automatic extraction of
language corpora and language models from publicly available
datasets. For example, it provides the full set of tools to convert
the entire English Wikipedia from a 30+GB XML file into a clean n-gram
language model, all within a few hours.
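
To make "n-gram" concrete, the sketch below counts bigrams (2-grams)
in plain text using only standard Unix tools. It is an illustration
of the idea, not an Autocorpus command: the function name
count_bigrams is hypothetical, and the Autocorpus tools themselves
are documented in the man pages.

```shell
#!/bin/sh
# Illustrative only: count_bigrams is a hypothetical helper built from
# standard tools (tr, awk, sort, uniq). It is not part of Autocorpus.
count_bigrams() {
    tr -cs '[:alpha:]' '\n' |       # split input into one word per line
    tr '[:upper:]' '[:lower:]' |    # normalize case
    awk 'NF { if (prev != "") print prev, $0; prev = $0 }' |  # pair each word with its predecessor
    sort | uniq -c | sort -rn       # count pairs, most frequent first
}

printf 'the cat sat on the mat. The cat slept.\n' | count_bigrams
```

The most frequent pair in this input is "the cat", which occurs
twice. Note that this naive version pairs words across sentence
boundaries ("mat the"), which a real corpus pipeline would normally
avoid by splitting sentences first.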



BUILDING

Before building Autocorpus, make sure you have the required
dependencies. These are:

    - Python 2.7.1+
    - g++ 4.6.1
    - libpcre3-dev
    - libboost-dev 1.46
    - libboost-thread-dev 1.46

Older versions *might* work, but have not been tested.
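
One way to check the toolchain before building is sketched below.
This script is not shipped with Autocorpus; the helper names
version_ge and check_tool are hypothetical, and the version
comparison assumes a sort that supports -V (GNU coreutils). It only
checks that make, python and g++ are on the PATH and compares the g++
version against 4.6.1. It does not check the Python version or the
libpcre3-dev and libboost packages.

```shell
#!/bin/sh
# Hypothetical pre-build check; not part of the Autocorpus distribution.

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
    fi
}

check_tool make
check_tool python
check_tool g++

# Compare the installed g++ version with the one listed above.
if command -v g++ >/dev/null 2>&1; then
    gxx_version=$(g++ -dumpversion)
    if version_ge "$gxx_version" 4.6.1; then
        echo "g++ $gxx_version is at least 4.6.1"
    else
        echo "g++ $gxx_version is older than 4.6.1 (untested)"
    fi
fi
```

On Debian-style systems, "dpkg -s libpcre3-dev" can be used in the
same spirit to see whether a library package is installed.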

Once you've verified that you have the prerequisites, build Autocorpus
by running make:

    $ make

The binaries will be placed in the 'bin' directory.


INSTALLING

To install Autocorpus, build it first using the instructions in the
previous section, then type "make install". Note that you need to be
root for the installation to succeed, which on most desktop Linux
distributions means you need to run "sudo make install".
 


USING AUTOCORPUS

Assuming you have properly installed the documentation from the 'man'
directory, you can get a quick overview of how to use Autocorpus by
typing:

    $ man 7 autocorpus

This manpage can also be viewed at 
http://mpacula.com/autocorpus/1.0/man/autocorpus.7.html

Man pages are also available for individual tools, both locally and online 
at http://mpacula.com/autocorpus/1.0/man



PROJECT WEBSITE

The project's website is http://mpacula.com/autocorpus. Use it to
download new releases and submit bug reports. 



AUTHOR & LICENSING

Autocorpus was written by Maciej Pacula (maciej.pacula@gmail.com) 
and is distributed as free software under the terms of the AGPL v3
license. See the file COPYING for details.

If you would like to incorporate one or more Autocorpus tools in a
proprietary product, please contact the author and inquire about a
commercial license.

Wikipedia-based corpora are distributed under the "Creative Commons
Attribution-ShareAlike 3.0 Unported License". The full text of this
license can be found at:

http://en.wikipedia.org/wiki/Wikipedia:CC-BY-SA
