DIRT

What is DIRT?

DIRT (Dynamic Identification of Reused Text) aims to help users, primarily academics, find passages shared by pairs of documents within a corpus. Users will be able to view a pair of documents alongside their common passages, and to see which documents in the corpus share passages with one chosen document, known as the focus document.
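
To make the idea of a shared passage concrete, here is a minimal sketch that uses Python's standard `difflib` module to list the runs of text two documents have in common. It is not DIRT's matching algorithm; the function name `common_passages` and the `min_length` parameter are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def common_passages(text_a, text_b, min_length=4):
    """Return runs of at least min_length characters found in both texts.

    Illustrative only: this is exact matching via difflib, not DIRT's matcher.
    """
    # autojunk=False stops difflib from discarding frequent characters
    # in long documents.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        # The last block is a zero-length sentinel; the size check skips it.
        if block.size >= min_length:
            passages.append(text_a[block.a:block.a + block.size])
    return passages


focus = "the quick brown fox jumps"
other = "a quick brown dog jumps"
shared = [p.strip() for p in common_passages(focus, other, min_length=5)]
print(shared)
```

Here `focus` plays the role of the focus document. Comparing it against every other document in a corpus in the same way would show which documents share passages with it.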

DIRT also aims to be extensible to other languages, although ancient Chinese is the focus for the prototype. DIRT should be able to find matches in a UTF-8 encoded corpus in any language, with a language-specific module making matching more permissive.
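
One way a language-specific module could make matching more permissive is to normalize variant characters before texts are compared. The sketch below is an assumption about how such a hook might look, not DIRT's actual interface: `VARIANTS` is a hypothetical two-entry table and `normalize` is a hypothetical function name.

```python
# -*- coding: utf-8 -*-

# Hypothetical table mapping simplified characters to traditional forms.
# A real language module would need a complete mapping.
VARIANTS = {
    ord(u"国"): u"國",
    ord(u"学"): u"學",
}


def normalize(text, table=VARIANTS):
    """Replace each character listed in table with its canonical form."""
    return text.translate(table)


# Two spellings of the same phrase compare equal once normalized.
simplified = u"中国学"
traditional = u"中國學"
same = normalize(simplified) == normalize(traditional)
print(same)
```

With normalization applied to both documents first, a passage written in simplified characters can match the same passage written in traditional characters, which plain UTF-8 comparison would miss.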

Install Dependencies

Dependencies can be installed with

pip install --allow-external jianfan --allow-unverified jianfan -r requirements.txt

Running Tests

Tests can be run from the root directory with

nosetests

Coverage can be checked using

./check_test_coverage.sh

Contributing

Python code should follow PEP 8 and have tests before you open a pull request or merge into develop.