Dedupe Python Library

A free Python library for accurate and scalable deduplication and entity resolution.

Based on Mikhail Yuryevich Bilenko's Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

For a more detailed overview, read the wiki
Join our Google group for updates
See our presentation at ChiPy

Python Dependencies

This library requires numpy, which can be complicated to install. If you are installing numpy for the first time, follow these instructions.

After numpy is set up, install the remaining dependencies using one of the methods under Installation.

Installation

Using pip:

pip install numpy
pip install -r requirements.txt
python setup.py install

Using easy_install:

easy_install numpy
easy_install fastcluster
easy_install hcluster
easy_install networkx
python setup.py install
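After either install path, it can help to confirm that the dependencies are importable before running the examples. The following is a minimal sketch, assuming Python 3.4 or later and that each package is imported under the same name used in the easy_install steps above; `missing_dependencies` is a hypothetical helper and not part of dedupe.

```python
import importlib.util

# Module names taken from the easy_install steps above.
REQUIRED_MODULES = ["numpy", "fastcluster", "hcluster", "networkx"]


def missing_dependencies(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies(REQUIRED_MODULES)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

The check only locates each module without importing it, so it reports a missing package without triggering import-time errors.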

Usage examples

Dedupe is a library, not a stand-alone command-line tool. To demonstrate its usage, we have put together a few example recipes for datasets of different sizes.

CSV example (<10,000 rows)

python examples/csv_example/csv_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

Annotated source code of csv_example
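To give a feel for the labeling step, here is a rough sketch of a console loop driven by the 'y', 'n', 'u' and 'f' keys. It is not dedupe's implementation: `label_pairs` and its return format are hypothetical, and it assumes 'u' means "unsure" (the pair is skipped) while 'f' ends the session.

```python
def label_pairs(pairs, read_key=input):
    """Collect y/n/u/f answers for candidate record pairs.

    Returns a dict with 'match' and 'distinct' lists. 'u' skips the pair
    (assumed here to mean "unsure") and 'f' ends the session early.
    """
    labels = {"match": [], "distinct": []}
    for pair in pairs:
        # Keep prompting until one of the four recognised keys is entered.
        while True:
            key = read_key("%r\nDuplicate? (y/n/u/f) " % (pair,)).strip().lower()
            if key in ("y", "n", "u", "f"):
                break
        if key == "f":
            break
        if key == "y":
            labels["match"].append(pair)
        elif key == "n":
            labels["distinct"].append(pair)
    return labels
```

Passing a custom `read_key` callable in place of `input` makes the loop scriptable, which is convenient for replaying a saved labeling session.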

sqlite example (10,000 - 1,000,000 rows)

python examples/sqlite_example/init_db.py
python examples/sqlite_example/sqlite_blocking.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

python examples/sqlite_example/sqlite_clustering.py

Please note: We have been having performance problems with sqlite on some machines, particularly when writing the blocking map table. If sqlite_blocking.py doesn't complete within eight hours, it will probably take days to finish on your machine.

We are not sure whether this is (A) a problem with how we are using sqlite, (B) a problem with using sqlite with this much data, or (C) a problem we will have with any database engine. We will implement a version using MySQL soon to try to narrow down the problem. In the meantime, if you are an sqlite guru, we could use your eyeballs.
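One factor that often dominates sqlite write time in general is committing after every row and maintaining an index during the load. The sketch below is not a diagnosis of the problem described above, only an illustration of a bulk-load pattern worth ruling out. The table layout (`block_key`, `record_id`), the index name and the `write_blocking_map` helper are assumptions made for illustration and may not match what sqlite_blocking.py does.

```python
import sqlite3


def write_blocking_map(conn, rows, batch_size=10000):
    """Bulk-insert (block_key, record_id) rows, then build the index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocking_map (block_key TEXT, record_id INTEGER)"
    )
    batch = []
    with conn:  # one transaction for the whole load; commits on success
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO blocking_map VALUES (?, ?)", batch)
                batch = []
        if batch:
            conn.executemany("INSERT INTO blocking_map VALUES (?, ?)", batch)
    # Indexing after the load avoids updating the index on every insert.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS blocking_map_key_idx ON blocking_map (block_key)"
    )
```

For example, `write_blocking_map(sqlite3.connect(":memory:"), [("smith:606", 1), ("smith:606", 2)])` loads two rows that share a block key. If timing stays poor even with this pattern, that would point away from option (A).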

Testing

Unit tests of core dedupe functions

python test/test_dedupe.py

Test using canonical dataset from Bilenko's research

Using random sample data for training

python test/canonical_test.py

Using active learning for training

python test/canonical_test.py --active True

Team

Errors / Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Send us a pull request. Bonus points for topic branches.

Copyright

See LICENSE for details
