Skip to content

yoyossy/open-city__dedupe

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Dedupe Python Library

A free python library for accurate and scaleable deduplication and entity-resolution.

Based on Mikhail Yuryevich Bilenko’s Ph. D dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

Python Dependencies

Team

Usage

> python setup.py install
> cd examples
> python active_canonical_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning)

Example datasets

As we continue to refine this library, we have added several datasets to test against. These can all be executed from the examples/ directory.

The following use human input to flag duplicates:

  • active_canonical_example.py - 864 rows. canonical restaurant dataset from Bilenko’s research

  • early_childhood.py - 3,720 rows. compilation of 9 datasets containing locations for early childhood education in Chicago

  • tech_locator.py - 852 rows. compilation of 2 lists of locations of technology resources in the City of Chicago.

The following do not use human input:

  • canonical_example.py - loads in canonical restaurant test data and trains based on provided known duplicates. outputs precision and recall values

Errors / Bugs

If something is not behaving intuitively, it is a bug, and should be reported. Report it here: github.com/open-city/dedupe/issues

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Send us a pull request. Bonus points for topic branches.

Copyright © 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.

See LICENSE for details github.com/open-city/dedupe/wiki/License

About

A free python library for accurate and scalelable deduplication and entity-resolution. *Under construction*

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C 73.3%
  • Python 26.7%