JeffDonovan/dedupe


Dedupe Python Library

A free Python library for accurate and scalable deduplication and entity resolution.

Based on Mikhail Yuryevich Bilenko's Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

Python Dependencies

This library requires numpy, which can be complicated to install. If you are installing numpy for the first time, follow these instructions.

After numpy is set up, install the remaining dependencies:

Installation

Using pip:

pip install numpy
pip install -r requirements.txt
python setup.py install

Using easy_install:

easy_install numpy
easy_install fastcluster
easy_install hcluster
easy_install networkx
python setup.py install
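
Before running the examples, it can help to confirm that the dependencies are importable. The snippet below is a minimal sketch, not part of dedupe: it assumes the import names match the package names in the easy_install list above, and the helper name missing_packages is made up for illustration.

```python
import importlib.util

# Package names taken from the easy_install list above (assumed to match
# the names used in import statements).
REQUIRED = ["numpy", "fastcluster", "hcluster", "networkx"]


def missing_packages(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

If anything is reported missing, install it with pip or easy_install as described above.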

Usage examples

Dedupe is a library, not a stand-alone command-line tool. To demonstrate its usage, we have put together a few example recipes for datasets of different sizes.

CSV example (<10,000 rows)

python examples/csv_example/csv_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)
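
To give a feel for the labeling step, here is a rough sketch of a prompt loop that handles the four keys. It is an illustration only and does not reproduce the code in csv_example.py: the function name label_pairs, the shape of the returned dictionary, and the reading of 'u' as "unsure, record nothing" are all assumptions.

```python
def label_pairs(pairs, read_key=input):
    """Ask the user to label candidate pairs until they press 'f'.

    Hypothetical helper: 'y' marks a pair as a duplicate, 'n' as distinct,
    'u' (assumed to mean "unsure") records nothing, 'f' stops labeling.
    """
    labels = {"match": [], "distinct": []}
    for pair in pairs:
        # Keep asking until one of the four recognized keys is entered.
        while True:
            key = read_key("Duplicate? (y)es / (n)o / (u)nsure / (f)inished: ")
            key = key.strip().lower()
            if key in ("y", "n", "u", "f"):
                break
        if key == "f":
            break
        if key == "y":
            labels["match"].append(pair)
        elif key == "n":
            labels["distinct"].append(pair)
        # 'u' falls through: the pair is skipped without a label.
    return labels
```

Passing read_key as a parameter keeps the loop easy to exercise without a terminal, for example by feeding it a scripted sequence of keys.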

sqlite example (10,000 - 1,000,000 rows)

python examples/sqlite_example/init_db.py
python examples/sqlite_example/sqlite_blocking.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

python examples/sqlite_example/sqlite_clustering.py

Please note: We have been having performance problems with sqlite on some machines, particularly when writing the blocking map table. If sqlite_blocking.py doesn't complete within eight hours, it will probably take days to finish on your machine.

We are not sure whether this is (a) a problem with how we are using sqlite, (b) a problem with using sqlite with this much data, or (c) a problem we will have with any database engine. We will implement a version using MySQL soon to try to narrow down the problem. In the meantime, if you are an sqlite guru, we could use your eyeballs.
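
For anyone investigating, one common sqlite pattern worth comparing against is to load rows in batches inside a single transaction and build the index only after the load. The sketch below uses Python's standard sqlite3 module. It is not the code in sqlite_blocking.py, and the table name blocking_map, its two columns, and the function name write_blocking_map are assumptions made for illustration. Whether this addresses the slowdown described above is untested.

```python
import sqlite3


def write_blocking_map(conn, rows, batch_size=10000):
    """Insert (block_key, record_id) rows in batches, then index them.

    Hypothetical schema: the real blocking map table may differ.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocking_map "
        "(block_key TEXT, record_id INTEGER)"
    )
    batch = []
    # 'with conn' wraps all inserts in one transaction and commits on exit,
    # which avoids a separate commit per row.
    with conn:
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO blocking_map VALUES (?, ?)", batch)
                batch = []
        if batch:
            conn.executemany("INSERT INTO blocking_map VALUES (?, ?)", batch)
    # Build the index after loading so it is not updated on every insert.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS blocking_map_key_idx "
        "ON blocking_map (block_key)"
    )
    conn.commit()
```

Timing this against an in-memory database (sqlite3.connect(":memory:")) and against a file on disk may help separate disk I/O costs from query costs.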

Testing

Unit tests of core dedupe functions

python test/test_dedupe.py

Test using canonical dataset from Bilenko's research

Using random sample data for training

python test/canonical_test.py

Using active learning for training

python test/canonical_test.py --active True

Team

Errors / Bugs

If something is not behaving intuitively, it is a bug, and should be reported. Report it here

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.

See LICENSE for details

