A free Python library for accurate and scalable deduplication and entity resolution.
Based on Mikhail Yuryevich Bilenko's Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.
Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.
- For more detail and an overview, read the wiki
- Join our Google group for updates
- See our presentation at ChiPy
This library requires numpy, which can be complicated to install. If you are installing numpy for the first time, follow these instructions.
After numpy is set up, install the following:
Using pip:
pip install numpy
pip install -r requirements.txt
python setup.py install
Using easy_install:
easy_install numpy
easy_install fastcluster
easy_install hcluster
easy_install networkx
python setup.py install
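To confirm that the dependencies are importable before running the examples, a small check like the one below can help. It is a sketch, not part of dedupe: it assumes Python 3.4 or later, and it assumes each package's import name matches the name used in the install commands above.

```python
import importlib.util

# Import names assumed to match the package names in the install commands.
REQUIRED = ["numpy", "fastcluster", "hcluster", "networkx"]


def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

`missing_modules` is a hypothetical helper name. The check only locates each module and does not import it, so it will not catch a broken numpy build.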
Dedupe is a library, not a stand-alone command line tool. To demonstrate its usage, we have put together a few example recipes for datasets of different sizes.
python examples/csv_example/csv_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)
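The labeling prompt can be pictured roughly as follows. This is an illustrative sketch only, not dedupe's code: `label_pairs` and its return format are hypothetical, and reading 'u' as "unsure" is an assumption. Real active learning also chooses which pair to show next, which this sketch leaves out.

```python
def label_pairs(pairs, ask=input):
    """Collect labels for candidate pairs until they run out or 'f' is entered.

    `ask` is any callable that takes a prompt and returns the user's answer;
    it defaults to the built-in input().
    """
    labels = {"match": [], "distinct": [], "unsure": []}
    for pair in pairs:
        answer = ""
        # Keep asking until one of the four recognized keys is entered.
        while answer not in ("y", "n", "u", "f"):
            answer = ask("Duplicates? %r (y/n/u/f): " % (pair,)).strip().lower()
        if answer == "f":
            break  # finished labeling
        key = {"y": "match", "n": "distinct", "u": "unsure"}[answer]
        labels[key].append(pair)
    return labels
```

Passing a custom `ask` function makes it possible to replay recorded answers without typing them.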
python examples/sqlite_example/init_db.py
python examples/sqlite_example/sqlite_blocking.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)
python examples/sqlite_example/sqlite_clustering.py
Please note: We have been having performance problems with sqlite on some machines, particularly in writing the blocking map table. If sqlite_blocking.py doesn't complete within eight hours, it will probably take days to finish on your machine.
We are not sure if this is A. a problem with how we are using sqlite, B. a problem with using sqlite with this much data, or C. a problem we will have with any database engine. We will implement a version using MySQL soon to try to narrow down the problem. In the meantime, if you are an sqlite guru, we could use your eyeballs.
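For readers unfamiliar with the idea, here is a minimal sketch of what a blocking map table can look like in sqlite. It is not the code in sqlite_blocking.py: the schema, the `blocking_keys` rule, and all function names are hypothetical. It writes every row inside one transaction, because committing after each row is a commonly cited cause of slow sqlite inserts. Whether that is the cause of the slowdown described above is unknown.

```python
import sqlite3


def blocking_keys(record):
    """Hypothetical blocking rule: first three letters of the name, and the zip."""
    yield "name:" + record["name"][:3].lower()
    yield "zip:" + record["zip"]


def build_blocking_map(conn, records):
    """Write (block_key, record_id) rows in a single transaction."""
    conn.execute("CREATE TABLE blocking_map (block_key TEXT, record_id INTEGER)")
    rows = ((key, record_id)
            for record_id, record in records.items()
            for key in blocking_keys(record))
    with conn:  # commits once at the end instead of once per row
        conn.executemany("INSERT INTO blocking_map VALUES (?, ?)", rows)
        conn.execute("CREATE INDEX block_key_idx ON blocking_map (block_key)")


def candidate_pairs(conn):
    """Return pairs of record ids that share at least one block key."""
    query = """SELECT DISTINCT a.record_id, b.record_id
               FROM blocking_map a JOIN blocking_map b
                 ON a.block_key = b.block_key AND a.record_id < b.record_id
               ORDER BY a.record_id, b.record_id"""
    return conn.execute(query).fetchall()


records = {
    1: {"name": "Acme Corp", "zip": "60601"},
    2: {"name": "ACME Corporation", "zip": "60601"},
    3: {"name": "Zenith LLC", "zip": "60614"},
}
conn = sqlite3.connect(":memory:")
build_blocking_map(conn, records)
pairs = candidate_pairs(conn)
```

Only records that share a block key are paired for comparison, so records 1 and 2 are paired here while record 3 is never compared with either.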
Unit tests of core dedupe functions
python test/test_dedupe.py
Test using canonical dataset from Bilenko's research
Using random sample data for training
python test/canonical_test.py
Using active learning for training
python test/canonical_test.py --active True
If something is not behaving intuitively, it is a bug, and should be reported. Report it here
- Fork the project.
- Make your feature addition or bug fix.
- Send us a pull request. Bonus points for topic branches.
Copyright (c) 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.