ptorrestr/clean_text

Text cleaner

Remove stopwords and perform stemming
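
As a rough illustration of what "remove stopwords and perform stemming" means, here is a minimal, self-contained sketch. It is not the project's code: the names `clean`, `DEMO_STOPWORDS` and `demo_stem` are hypothetical, the stopword list is a tiny stand-in, and the suffix stripper is only a placeholder for a real stemmer.

```python
import re


def clean(text, stopwords, stem):
    """Lowercase the text, split it into tokens, drop stopwords, stem the rest."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [stem(tok) for tok in tokens if tok not in stopwords]


# Stand-in stopword list so the sketch runs without any NLTK data (assumption).
DEMO_STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "over"}


def demo_stem(word):
    """Very crude suffix stripper; a placeholder, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    print(clean("The cats are jumping over the fences", DEMO_STOPWORDS, demo_stem))
```

With NLTK and its data installed, the same function could be called with `set(nltk.corpus.stopwords.words("english"))` and `nltk.stem.PorterStemmer().stem` instead of the stand-ins. Which stopword list and stemmer clean_text actually uses is not stated here.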

Installation

OpenBLAS: It is fairly easy to build your own optimized version of OpenBLAS. First get the code and compile it as usual. Then set up the environment and tell the numpy build to link against the OpenBLAS library. See here for more info: http://osdf.github.io/blog/numpyscipy-with-openblas-for-ubuntu-1204-second-try.html

Lapack: Because clean_text is based on nltk, you need to install the BLAS and LAPACK libraries. It is recommended to install an optimized version of both. These libraries are available through the package manager of your Linux distribution (Debian: liblapack-dev). If you think it is worthwhile, you can build your own optimized version of the library. This tutorial explains the necessary steps: http://theoryno3.blogspot.ie/2010/12/compiling-lapack-as-shared-library-in.html

It is recommended to use virtualenv to avoid package conflicts:

  • virtualenv /SOME/PATH -p python3
  • source /SOME/PATH/bin/activate

t2db_objects:

  • pip install git+https://github.com/ptorrestr/t2db_objects.git

Manual install:

  • git clone CLEAN_TEXT_URL
  • cd clean_text; python setup.py install

Numpy over OpenBLAS

  • go to the virtualenv folder
  • add source /opt/env/c++/openblas_default to the bin/activate file
  • source /virtualenv/folder/bin/activate
  • mkdir download
  • mkdir build
  • pip install -d download numpy (recent pip versions dropped -d from pip install; there, pip download --no-binary :all: -d download numpy is the closest equivalent)
  • tar -xvf download/numpy*.tar.gz -C build/
  • create the file build/numpy*/site.cfg with the following content:

    [default]
    library_dirs = /opt/usr/local/openblas/lib

    [openblas]
    libraries = openblas
    library_dirs = /opt/usr/local/openblas/lib
    include_dirs = /opt/usr/local/openblas/include

    [atlas]
    atlas_libs = openblas
    library_dirs = /opt/usr/local/openblas/lib

    [lapack]
    lapack_libs = openblas
    library_dirs = /opt/usr/local/openblas/lib

  • cd build/numpy*/; python setup.py build; python setup.py install
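
A wrong path in site.cfg can leave numpy linked against a different BLAS than the one intended. The sketch below is one way to catch that before starting the build: it parses the site.cfg text and lists every `*_dirs` entry that does not exist on disk. The helper name `missing_paths` is hypothetical, and the check is an assumption about what is worth verifying, not part of the numpy build itself.

```python
import configparser
import os


def missing_paths(site_cfg_text):
    """Return (section, key, path) for every *_dirs entry that is not a directory."""
    # default_section is set to an unused name so that a [DEFAULT] or [default]
    # section is treated like any other section; interpolation is disabled so
    # that '%' characters in paths are read literally.
    parser = configparser.ConfigParser(
        default_section="__unused__", interpolation=None
    )
    parser.read_string(site_cfg_text)
    missing = []
    for section in parser.sections():
        for key, value in parser.items(section):
            if not key.endswith("_dirs"):
                continue
            # Several directories may be listed, separated by os.pathsep.
            for path in value.split(os.pathsep):
                path = path.strip()
                if path and not os.path.isdir(path):
                    missing.append((section, key, path))
    return missing


if __name__ == "__main__":
    import glob

    # Looks for the site.cfg created in the step above, relative to the
    # current directory.
    for cfg in glob.glob("build/numpy*/site.cfg"):
        with open(cfg) as handle:
            for section, key, path in missing_paths(handle.read()):
                print(f"{cfg}: [{section}] {key} -> {path} does not exist")
```

Once numpy is installed, `python -c "import numpy; numpy.show_config()"` prints the BLAS and LAPACK libraries the build actually picked up.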

Dependencies:

  • nltk
  • numpy
  • t2db_objects
  • sphinx
  • pyyaml

Configuration

You need to install the NLTK data: python -m nltk.downloader all

To configure this project, please see the example configuration file (etc/example.config).

Execution

Run clean_text -c CONFIG_FILE -o OUTPUT_FILE INPUT_FILE

Where:

  • CONFIG_FILE The path to the configuration file
  • OUTPUT_FILE The path to the output file (if it doesn't exist, it will be created)
  • INPUT_FILE The path to the input file
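
When several files need cleaning, the command above can be driven from a small script. The following is a minimal sketch under stated assumptions: the helper names `build_command` and `run_clean_text` and the `in.txt` / `out.txt` paths are hypothetical, and the `clean_text` executable is assumed to be on the PATH (for example inside the activated virtualenv).

```python
import subprocess


def build_command(config_file, output_file, input_file):
    """Build the argument list for: clean_text -c CONFIG_FILE -o OUTPUT_FILE INPUT_FILE."""
    return ["clean_text", "-c", config_file, "-o", output_file, input_file]


def run_clean_text(config_file, output_file, input_file):
    """Run clean_text once; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(config_file, output_file, input_file), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    run_clean_text("etc/example.config", "out.txt", "in.txt")
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.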

Documentation

To build the HTML documentation: cd docs; make html