
derwiki/dedupe


Dedupe Python Library

A free Python library for accurate and scalable deduplication and entity resolution.

Based on Mikhail Yuryevich Bilenko's Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

Python Dependencies

Usage

python setup.py install

python examples/csv_example.py

(Use the 'y', 'n' and 'u' keys to flag duplicates for active learning, and 'f' when you are finished.)
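The key-driven labeling described above can be pictured with a small sketch. This is not the code in examples/csv_example.py, and it does not use the dedupe API: the function name `collect_labels`, the `read_key` parameter, and the label names are all hypothetical. It also assumes that 'y' means "duplicate", 'n' means "not a duplicate" and 'u' means "unsure", which the usage note implies but does not spell out.

```python
def collect_labels(pairs, read_key=input):
    """Ask about each candidate pair until pairs run out or 'f' is pressed.

    Hypothetical helper for illustration only; not part of dedupe.
    Assumed key meanings: y = duplicate, n = not a duplicate,
    u = unsure, f = finished.
    """
    buckets = {"y": "match", "n": "distinct", "u": "unsure"}
    labels = {"match": [], "distinct": [], "unsure": []}

    for pair in pairs:
        # Re-prompt until one of the four recognised keys is entered.
        while True:
            prompt = "{!r} vs {!r}: duplicate? (y/n/u/f) ".format(pair[0], pair[1])
            key = read_key(prompt).strip().lower()
            if key in ("y", "n", "u", "f"):
                break
        if key == "f":
            # The user is finished; remaining pairs stay unlabeled.
            break
        labels[buckets[key]].append(pair)

    return labels


# Scripted, non-interactive demo: the keys are fed from a list
# instead of the keyboard, so the block runs without user input.
demo_pairs = [("Bob Smith", "Robert Smith"), ("Ann Lee", "Dan Cho")]
demo_keys = iter(["y", "n"])
demo_labels = collect_labels(demo_pairs, read_key=lambda prompt: next(demo_keys))
print(demo_labels)
```

Passing the key reader as a parameter keeps the loop testable; calling `collect_labels(pairs)` with the default falls back to the built-in `input` for interactive use.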

Testing

Unit tests of core dedupe functions

python test/test_dedupe.py

Test using canonical dataset from Bilenko's research

Using random sample data for training

python test/canonical_test.py

Using active learning for training

python test/canonical_test.py --active True

Team

Errors / Bugs

If something does not behave intuitively, it is a bug and should be reported. Report it here

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.

See LICENSE for details

About

A free Python library for accurate and scalable deduplication and entity resolution. *Under construction*
