Abydos NLP/IR library
Copyright 2014-2018 by Chris Little
Abydos is a library of phonetic algorithms, string distance metrics, stemmers, and keyers, including:
- Phonetic algorithms
  - Robert C. Russell's Index
  - American Soundex
  - Refined Soundex
  - Daitch-Mokotoff Soundex
  - Kölner Phonetik
  - NYSIIS
  - Match Rating Algorithm
  - Metaphone
  - Double Metaphone
  - Caverphone
  - Alpha Search Inquiry System
  - Fuzzy Soundex
  - Phonex
  - Phonem
  - Phonix
  - SfinxBis
  - phonet
  - Standardized Phonetic Frequency Code
  - Statistics Canada
  - Lein
  - Roger Root
  - Beider-Morse Phonetic Matching
- String distance metrics
  - Levenshtein distance (incl. a [0, 1] normalized variant)
  - Optimal String Alignment distance (incl. a [0, 1] normalized variant)
  - Damerau-Levenshtein distance (incl. a [0, 1] normalized variant)
  - Hamming distance (incl. a [0, 1] normalized variant)
  - Tversky index
  - Sørensen–Dice coefficient & distance
  - Jaccard similarity coefficient & distance
  - overlap similarity & distance
  - Tanimoto coefficient & distance
  - Minkowski distance & similarity (incl. a [0, 1] normalized option)
  - Manhattan distance & similarity (incl. a [0, 1] normalized option)
  - Euclidean distance & similarity (incl. a [0, 1] normalized option)
  - Chebyshev distance & similarity (incl. a [0, 1] normalized option)
  - cosine similarity & distance
  - Jaro distance
  - Jaro-Winkler distance (incl. the strcmp95 algorithm variant)
  - Longest common substring
  - Ratcliff-Obershelp similarity & distance
  - Match Rating Algorithm similarity
  - Normalized Compression Distance (NCD) & similarity
  - Monge-Elkan similarity & distance
  - Matrix similarity
  - Needleman-Wunsch score
  - Smith-Waterman score
  - Gotoh score
  - Length similarity
  - Prefix, Suffix, and Identity similarity & distance
  - Modified Language-Independent Product Name Search (MLIPNS) similarity & distance
  - Bag distance (incl. a [0, 1] normalized variant)
  - Editex distance (incl. a [0, 1] normalized variant)
- Stemmers
  - the Lovins stemmer
  - the Porter and Porter2 (Snowball English) stemmers
  - Snowball stemmers for German, Dutch, Norwegian, Swedish, and Danish
  - CLEF German, German plus, and Swedish stemmers
  - Caumanns' German stemmer
- Keyers
  - string fingerprint
  - q-gram fingerprint
  - phonetic fingerprint
  - skeleton key
  - omission key
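To make the "[0, 1] normalized variant" wording above concrete, here is a minimal standalone sketch of unit-cost Levenshtein distance and one common normalization (dividing by the length of the longer string). It is an illustration only and does not use or reproduce Abydos's own code: the function names `levenshtein` and `normalized_levenshtein` are chosen for this example, and the library's actual signatures, cost options, and normalization may differ.

```python
def levenshtein(src, tar):
    """Return the unit-cost Levenshtein edit distance between two strings."""
    if src == tar:
        return 0
    if not src:
        return len(tar)
    if not tar:
        return len(src)

    # previous[j] is the distance between the src prefix processed so far
    # and the first j characters of tar.
    previous = list(range(len(tar) + 1))
    for i, src_char in enumerate(src, start=1):
        current = [i]
        for j, tar_char in enumerate(tar, start=1):
            substitution = previous[j - 1] + (src_char != tar_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_levenshtein(src, tar):
    """Scale the distance into [0, 1] using the longer string's length.

    Assumption for this sketch: unit costs, so the distance can never
    exceed max(len(src), len(tar)).
    """
    if src == tar:
        return 0.0
    return levenshtein(src, tar) / float(max(len(src), len(tar)))
```

With unit costs, 0.0 means the strings are identical and 1.0 means every position of the longer string had to be edited, so scores for pairs of different lengths become comparable.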
Required:
- NumPy
Recommended:
- PylibLZMA (Python 2 only; used for the LZMA compression string distance metric)
Suggested for development, testing, & QA:
- Nose (for unit testing)
- coverage.py (for code coverage checking)
- Pylint (for code quality checking)
- PEP8 (for code quality checking)
To install Abydos from PyPI using pip:
pip install abydos
Abydos should run on Python 2.7 and Python 3.3+.
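After installing, one quick way to confirm that the package is visible to a given interpreter is to try importing it. The sketch below uses only the standard library; the helper name `is_importable` is hypothetical and not part of Abydos.

```python
import importlib


def is_importable(module_name):
    """Return True if module_name can be imported by the running interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Works the same way under Python 2.7 and Python 3.
    print("abydos importable: {}".format(is_importable("abydos")))
```

Running the script with both `python` and `python3` shows which interpreters the install actually reached.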
To build/install/unittest from source in Python 2:
sudo python setup.py install
nosetests -v --with-coverage --cover-erase --cover-html --cover-branches --cover-package=abydos .
To build/install/unittest from source in Python 3:
sudo python3 setup.py install
nosetests3 -v --with-coverage --cover-erase --cover-html --cover-branches --cover-package=abydos .
For pylint testing, run:
pylint --rcfile=pylint.rc abydos > pylint.log
A simple shell script is also included. It builds, installs, tests, and code-quality checks (with Pylint & PEP8) the package, and builds the documentation. To run it, execute:
./btest.sh