Dedupe Python Library

A free Python library for accurate and scalable deduplication and entity resolution.

<img src="https://travis-ci.org/open-city/dedupe.png" />

Based on Mikhail Yuryevich Bilenko’s Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

Python Dependencies

Usage

> python setup.py install
> python examples/csv_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)
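
The labeling prompt in the example can be pictured roughly as the loop below. This is a minimal sketch of a console labeling loop, not dedupe's actual code: the function name `label_pairs`, its parameters, the `match`/`distinct` label names, and the sample records are all assumptions made for illustration. It shows one plausible reading of the keys: 'y' marks a pair as a duplicate, 'n' marks it as distinct, 'u' skips a pair you are unsure about, and 'f' stops labeling.

```python
def label_pairs(candidate_pairs, read_key=input, write=print):
    """Collect duplicate / distinct labels for record pairs from the console.

    Hypothetical helper for illustration only; it is not part of dedupe's API.
    """
    labels = {"match": [], "distinct": []}
    for left, right in candidate_pairs:
        write(left)
        write(right)
        # Keep asking until one of the four recognised keys is entered.
        while True:
            key = read_key("Duplicate? (y)es / (n)o / (u)nsure / (f)inished: ")
            key = key.strip().lower()
            if key in ("y", "n", "u", "f"):
                break
            write("Please answer y, n, u or f.")
        if key == "f":
            break  # stop labeling; remaining pairs stay unlabeled
        if key == "y":
            labels["match"].append((left, right))
        elif key == "n":
            labels["distinct"].append((left, right))
        # 'u' (unsure) records nothing and moves on to the next pair.
    return labels


# Non-interactive demo: scripted answers stand in for keyboard input.
sample_pairs = [
    ({"name": "Acme Corp", "city": "Chicago"},
     {"name": "ACME Corporation", "city": "Chicago"}),
    ({"name": "Acme Corp", "city": "Chicago"},
     {"name": "Apex Ltd", "city": "Boston"}),
]
scripted = iter(["y", "n"])
demo_labels = label_pairs(sample_pairs, read_key=lambda prompt: next(scripted))
print(len(demo_labels["match"]), "match,", len(demo_labels["distinct"]), "distinct")
```

Passing `read_key` and `write` as parameters keeps the loop testable without a keyboard. In an interactive session the defaults (`input` and `print`) apply.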

Testing

Unit tests of core dedupe functions

> python tests/test_dedupe.py

Test using canonical dataset from Bilenko’s research

Using random sample data for training

> python tests/canonical_test.py

Using active learning for training

> python tests/canonical_test.py --active True

Team

Errors / Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here: github.com/open-city/dedupe/issues

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Send us a pull request. Bonus points for topic branches.

Copyright © 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.

See LICENSE for details: github.com/open-city/dedupe/wiki/License
