libfnl is an API and CLI that facilitates data and text mining by providing a collection of easy-to-use tools. The library is designed to work with Python 3 only. It is specifically tuned towards mining biomedical and scientific texts, but can be used in other contexts, too. It is a complementary piece in the collection of tools that also includes the gnamed gene name repository daemon and the medic PubMed mirroring tool. In addition, an (orphan) couchpy repository could provide a document storage facility.
The library contains the following packages:
fnl.nlp
tools to linguistically analyze text (tokenization, PoS tagging, phrase chunking, entity detection); modules to segment sentences (based on NLTK) and to map text (strings) to entries in dictionaries. This includes a Python wrapper for the GENIA Tagger, a Python wrapper for the NER Suite, and a handler for the GENIA corpus. Furthermore, a Maximum Entropy classifier is available via NLTK's wrapper for MegaM
fnl.stat
a module to evaluate inter-rater Kappa scores and a module to develop text classifiers based on Scikit-Learn
fnl.text
wrappers to work with text data (strings, tokens, segments, annotations, etc.)
fnl.utils
additional utilities and tools (currently, just for handling JSON)
scripts
the CLI scripts to manage data/text, representing the main value provided by this collection
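To illustrate the kind of inter-rater Kappa score mentioned for fnl.stat, here is a minimal sketch of Cohen's kappa for two raters in plain Python. It is not taken from libfnl: the function name cohen_kappa and its signature are assumptions made for this example, and the library's own module may use a different variant or interface.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equally long sequences of labels.

    Hypothetical helper for illustration; not part of the libfnl API.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # observed agreement: fraction of items both raters labeled identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement, from each rater's marginal label frequencies
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    a = ["gene", "gene", "other", "other"]
    b = ["gene", "other", "other", "other"]
    print(cohen_kappa(a, b))  # prints 0.5
```

In this example the raters agree on 3 of 4 items (0.75 observed) while chance agreement is 0.5, which gives a kappa of 0.5.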
The script directory provides the following command-line interfaces:
fnlclassi
generate a classifier for [NER-tagged] text using Scikit-Learn
fnlcorpus
store corpora in JSON format in a CouchDB
fnldgrep
"grep" for tokens using a dictionary
fnldictag
tag semantic tokens from a dictionary in linguistically annotated text
fnlgpcounter
count gene/protein symbols in MEDLINE
fnlkappa
calculate inter-rater agreement scores
fnlsegment
segment text into sentences using NLTK (PunktSentenceTokenizer)
fnlsegtrain
train a nltk.punkt.PunktSentenceTokenizer
fnltok
a fast, pure-Python, Unicode-aware string tokenizer
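As a rough idea of what a pure-Python, Unicode-aware string tokenizer can look like, the sketch below splits text into runs of letters, runs of digits, and single symbols, using the Unicode character categories from the standard library. This is an assumption-laden illustration only: the names char_class and tokenize are hypothetical, and the actual fnltok script may follow different rules and produce different output.

```python
import unicodedata


def char_class(char):
    """Map a character to a coarse class based on its Unicode category."""
    category = unicodedata.category(char)
    if category[0] in "LM":  # letters and combining marks
        return "letter"
    if category[0] == "N":  # any kind of number character
        return "digit"
    if category[0] == "Z" or char.isspace():
        return "space"
    return "symbol"


def tokenize(text):
    """Yield (start, end, token) tuples; whitespace is skipped.

    Hypothetical helper for illustration; not the fnltok implementation.
    """
    start = 0
    while start < len(text):
        cls = char_class(text[start])
        end = start + 1
        # extend runs of letters or digits; symbols stay single characters
        if cls in ("letter", "digit"):
            while end < len(text) and char_class(text[end]) == cls:
                end += 1
        if cls != "space":
            yield start, end, text[start:end]
        start = end


if __name__ == "__main__":
    print([token for _, _, token in tokenize("IL-2 β-catenin")])
    # prints ['IL', '-', '2', 'β', '-', 'catenin']
```

Keeping the character offsets alongside each token makes it possible to map annotations back onto the original string.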
Warning
This project is under "continuous development", so it is best to take your own snapshot.
- Python 3.2+
- Numpy, SciPy, and Scikit-Learn 0.14+ (for fnlclassi)
- NLTK 3.0+ (for the sentence segmenting tools fnlseg*)
- DAWG (for fnlgpcounter; see Installation below)
Optional projects that work together with this project:
- GENIA Tagger (optional, latest version)
- NER Suite (optional, latest version, in turn requires CRF Suite)
- MegaM - a MaxEnt classifier for NLTK with a (fast) L-BFGS optimizer
- gnamed for creating gene/protein name repositories
- medic for mirroring and handling PubMed citations
- txtfnnl natural language processing tools based on Apache OpenNLP and UIMA
Installation into a Python 3 virtual environment:
pip install virtualenv # if virtualenv is not yet installed
git clone git://github.com/fnl/libfnl.git libfnl
virtualenv libfnl
cd libfnl
. bin/activate
pip install argparse # for python3 < 3.2
pip install numpy # because installing scipy fails if numpy isn't installed already
pip install -e . # installs all other dependencies
# if you prefer to install all other dependencies manually
# and/or prefer to use setup.py instead of pip:
# python setup.py install
pip install sqlalchemy
pip install scikit-learn # the "sklearn" name on PyPI is deprecated
pip install matplotlib
pip install nltk --pre # to get 3.0
# if you want to install the test environment:
pip install pytest
# special steps to install DAWG
git clone git@github.com:fnl/DAWG.git
cd DAWG
python setup.py install
cd ..
All parts of this library are licensed under the GNU Affero GPL v3.
See the attached LICENSE.txt file.
© 2006-2014 Florian Leitner. All rights reserved.