GitHub - elishowk/TinasoftPytextminer: A python text-mining module producing semantic network graphs

Thanks for using Tinasoft Pytextminer

Pytextminer is part of a larger piece of software, Tinasoft Desktop, which you can find at http://github.com/moma/tinasoft.desktop/

A text-mining Python 2.6 module producing bottom-up semantic networks.
It uses:
- NLTK, the natural language processing toolkit (http://www.nltk.org/),
- SQLite3, the embedded database library (http://sqlite.org),
- Twisted web server (http://twistedmatrix) with jsonpickle as a serializer (http://jsonpickle.github.com),
- Numpy for n-dimensional array processing (http://numpy.scipy.org/),
- pyTenjin for graph gexf files export (http://www.kuwata-lab.com/tenjin/)

Typical tasks are:
- multiple kinds of source file support
- extraction of key-phrases (ngrams) using various simple Natural Language Processing methods (stopwords, part-of-speech tagging, stemming, etc)
- creation of document/corpus/ngram graphs databases
- calculation of key-phrase co-occurrences on a per-corpus basis
- production of graphs of multiple entities and multiple relations (hybrid storage into GEXF files, http://gexf.net, and into sqlite database)
- an httpserver exposing the API, sending json results
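
As an illustration of the co-occurrence and graph export tasks listed above, here is a minimal sketch. It is not Pytextminer's own code: the function names `cooccurrences` and `to_gexf` are hypothetical, the sample key-phrases are made up, and the GEXF 1.2 namespace is an assumption about the target format version. It is written for a current Python 3 interpreter and uses only the standard library.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import Counter


def cooccurrences(docs):
    """Count, for each unordered pair of key-phrases, the documents containing both."""
    counts = Counter()
    for phrases in docs:
        # set() ignores repeats inside a document, sorted() gives each pair one canonical order
        for pair in itertools.combinations(sorted(set(phrases)), 2):
            counts[pair] += 1
    return counts


def to_gexf(counts):
    """Serialize the weighted pairs as a minimal undirected GEXF graph (assumed version 1.2)."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    labels = sorted(set(label for pair in counts for label in pair))
    ids = dict((label, str(i)) for i, label in enumerate(labels))
    for label in labels:
        ET.SubElement(nodes, "node", id=ids[label], label=label)
    for i, ((a, b), weight) in enumerate(sorted(counts.items())):
        ET.SubElement(edges, "edge", id=str(i), source=ids[a],
                      target=ids[b], weight=str(weight))
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical corpus: one list of extracted key-phrases per document
docs = [
    ["semantic network", "text mining"],
    ["text mining", "semantic network", "graph"],
    ["graph", "text mining"],
]
counts = cooccurrences(docs)
gexf_xml = to_gexf(counts)
```

Each edge weight is the number of documents in which both key-phrases appear. The real software also stores these graphs in its SQLite database, which this sketch leaves out.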

This software is part of TINA, a European Union FP7 coordination action - FP7-ICT-2009-C :
 - http://tinasoft.eu/
The software implements scientific results by David Chavalarias (CREA lab; CNRS/Ecole Polytechnique UMR 7656, http://chavalarias.com) and Jean-Philippe Cointet (INRA SENS, http://jph.cointet.free.fr).

SOURCE CODE REPOSITORY

    https://forge.iscpif.fr/projects/tinasoft-pytextminer
    http://github.com/moma/TinasoftPytextminer

AUTHORS

- Researchers and engineers at CREA lab (UMR 7656, CNRS, Ecole Polytechnique, France)
    julian bilcke <julian.bilcke (at) iscpif (dot) fr>
    david chavalarias <david.chavalarias (at) polytechnique (dot) edu>
    jean philippe cointet <jphcoi (at) yahoo (dot) fr>
    elias showk <elishowk (at) nonutc (dot) fr>

MAINTAINER

    elias showk <elishowk (at) nonutc (dot) fr>

DOCUMENTATION, SUPPORT AND FEEDBACK

    http://tinasoft.eu/ (project homepage)
    https://forge.iscpif.fr/projects/tinasoft-pytextminer (software development)

PYTEXTMINER AS A USER

    Download standalone packages from http://tinasoft.eu

    USER DOCUMENTATION

        http://tina.csregistry.org/tinauserdoc

PYTEXTMINER AS A DEVELOPER


    * we provide an HTTP server exposing the main API of the TinaApp class
    * alternatively, the apitests.py script provides examples of how to use the TinaApp class methods properly

    - get the source code :

    https://forge.iscpif.fr/projects/tinasoft-pytextminer/repository
    OR
    git clone https://sources.iscpif.fr/tinasoft.pytextminer.git

    PYTHON : you'll need a Python 2.6 interpreter : http://python.org/

    INSTALL THE PYTHON PACKAGE

        $ sudo python setup.py install
        or
        $ sudo python setup.py develop

    Dependencies should be checked : numpy, nltk, twisted, jsonpickle, tenjin, pyyaml

    OTHERWISE MANUALLY INSTALL PYTHON DEPENDENCIES

        - they're listed in setup.py

    DOWNLOAD NLTK DATA

    You'll need to install the required nltk corpus data manually
        $ export NLTK_DATA="your/path/to/TinasoftPytextminer/shared/nltk_data"
        $ python
        >>> import nltk
        >>> nltk.download()
        Downloader> d punkt
        Downloader> d brown
        Downloader> d conll2000

    on MS WINDOWS:

            $ set NLTK_DATA="TinasoftPytextminer\shared\nltk_data"
            $ PATH C:\Python26;%PATH%
            $ python apitests.py ... (see usage)

        - finally open your web browser at http://localhost:8888 (no internet connection needed)

    GNU/LINUX (and probably other UNIX-like systems)
        - use the standalone frozen httpserver software

            $ export NLTK_DATA=shared/nltk_data
            $ python apitests.py ... (see usage)

    DEVELOPER DOCUMENTATION

        http://tina.csregistry.org/tinadevdoc

CONFIGURATION

    config_*.yaml are YAML configuration files.
    The main application (TinaApp class) loads one of them during initialization; its path is a required parameter

    GUIDELINES

    - declare each column name of your csv file into the corresponding field name of the configuration file
    - undeclared columns will be ignored by the software
    - here are the required and optional entries :

        #### REQUIRED
        titleField: document title
        contentField: document content
        authorField: document acronym
        corpusNumberField: corpus number
        docNumberField: document number
        ##### optional
        index1Field: document index 1
        index2Field: document index 2
        dateField: document publication date
        keywordsField: document keywords

    - check the format of your csv file (encoding, delimiter, quoting character) and write these values into the fields "locale", "delimiter" and "quotechar"
    - "minSize" and "maxSize" set the minimum and maximum length of the extracted n-grams
    - all other fields are the script configuration, or the default values for testing purpose
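
    To illustrate what "minSize" and "maxSize" control, here is a minimal sketch of n-gram extraction. It assumes that both bounds are inclusive word counts and that candidates starting or ending with a stopword are dropped. The function name `extract_ngrams` and the stopword rule are hypothetical, and the part-of-speech tagging and stemming done by the real software are not reproduced.

```python
import re


def extract_ngrams(text, min_size, max_size, stopwords=frozenset()):
    """Return every word n-gram whose length is between min_size and max_size (inclusive)."""
    # Crude tokenizer for illustration only: lowercase alphanumeric runs
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ngrams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            window = tokens[start:start + size]
            # Assumed rule: skip candidates that begin or end with a stopword
            if window[0] in stopwords or window[-1] in stopwords:
                continue
            ngrams.append(" ".join(window))
    return ngrams


ngrams = extract_ngrams("graphs of semantic networks", 1, 2, stopwords={"of"})
# ngrams == ['graphs', 'semantic', 'networks', 'semantic networks']
```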

    WARNING : YAML indentation must use spaces, never tab characters, and all string values must be quoted (e.g. 'prop_title'). Further information at http://en.wikipedia.org/wiki/YAML
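
    The column mapping described in the guidelines can be sketched as follows. This is only an illustration built on Python's standard csv module, not the importer shipped with Pytextminer: the `read_documents` helper, the sample file and every column name except 'prop_title' are hypothetical, and the configuration is written as a plain dictionary instead of being loaded from a config_*.yaml file.

```python
import csv
import io

# Hypothetical configuration values, using the field names listed above
config = {
    "titleField": "prop_title",
    "contentField": "abstract",
    "authorField": "acronym",
    "corpusNumberField": "corp_num",
    "docNumberField": "doc_num",
    "delimiter": ",",
    "quotechar": '"',
}


def read_documents(stream, config):
    """Yield one dict per csv row, keeping only the columns declared in config."""
    fields = dict((key, column) for key, column in config.items()
                  if key.endswith("Field"))
    reader = csv.DictReader(stream, delimiter=config["delimiter"],
                            quotechar=config["quotechar"])
    for row in reader:
        # Undeclared columns (here "budget") are simply never read
        yield dict((key, row[column]) for key, column in fields.items())


sample = io.StringIO(
    'doc_num,corp_num,prop_title,abstract,acronym,budget\n'
    '1,2009,"Mining, graphs","Some text",TINA,1000\n'
)
documents = list(read_documents(sample, config))
```

    The quoted title contains the delimiter, which is why "quotechar" has to match the file: with the wrong quoting character the row would be split into too many columns.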

SOURCE FILES DIRECTORY

    - "source_files" is dedicated to the storage of your source files
    - these files are used during the indexing and extraction steps of the workflow
    - given an existing file name in this directory, the software will be able to read it

TESTED OPERATING SYSTEMS

    Tinasoft Pytextminer was tested on the following platforms:

        GNU/Linux (amd64, i386) with Python 2.6
        Windows XP (32bit) with Python 2.6
        Mac OS X >= 10.6

COPYRIGHT AND LICENSE

Copyright (C) 2009-2011 CREA Lab, CNRS/Ecole Polytechnique UMR 7656 (Fr)

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/gpl.html>.