
Dedupe Python Library

Deduplication, entity resolution, record linkage, author disambiguation, and others ...

As different research communities encountered this problem, they each gave it a new name but, ultimately, it's all about trying to figure out which records refer to the same thing.

Dedupe is an open source Python library that quickly de-duplicates large sets of data.

Features

  • machine learning - reads in human-labeled data to automatically create optimum weights and blocking rules
  • runs on a laptop - makes intelligent comparisons so you don't need a powerful server to run it
  • built as a library - so it can be integrated into your applications or import scripts
  • extensible - supports adding custom data types, string comparators and blocking rules
  • open source - anyone can use, modify or add to it
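
To make the "blocking rules" and "intelligent comparisons" in the list above more concrete, here is a minimal sketch in plain Python. It does not use dedupe's API and is not how the library is implemented: the function names, the three-letter blocking key, the sample records, and the 0.8 threshold are all assumptions chosen for illustration. It only shows the general idea that records are compared only when they fall into the same block, which avoids comparing every record with every other one.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    # Hypothetical blocking rule: first three letters of the lowercased name.
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Yield pairs of record ids that share a block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            yield pair


def similarity(a, b):
    # Plain string similarity in [0, 1]. A learned model would instead
    # combine several fields using weights derived from labeled data.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


# Made-up sample data for the sketch.
records = {
    1: {"name": "Acme Widgets Inc"},
    2: {"name": "ACME Widgets, Inc."},
    3: {"name": "Zenith Tools"},
    4: {"name": "Acorn Bakery"},
}

pairs = list(candidate_pairs(records))
duplicates = [(i, j) for i, j in pairs if similarity(records[i], records[j]) >= 0.8]
print(pairs, duplicates)
```

With four records there are six possible pairs, but only records 1 and 2 share a block here, so a single comparison is made.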

How it works

Community

Installation and dependencies

Dedupe requires numpy, which can be complicated to install. If you are installing numpy for the first time, follow these instructions. You'll need version 1.6 of numpy or higher.

After numpy is set up, install dedupe using one of the following methods:

Using pip:

git clone https://github.com/datamade/dedupe.git
cd dedupe
pip install "numpy>=1.6"
# for python 2.7
pip install -r requirements.txt
# OR for python 2.6
pip install -r py26_requirements.txt
python setup.py install

Using easy_install:

git clone https://github.com/datamade/dedupe.git
cd dedupe
easy_install "numpy>=1.6"
easy_install "fastcluster>=1.1.8"
easy_install "hcluster>=0.2.0"
easy_install networkx
easy_install zope.interface
easy_install zope.index
python setup.py install

Usage examples

Dedupe is a library and not a stand-alone command line tool. To demonstrate its usage, we have come up with a few example recipes for different sized datasets.

CSV example (<10,000 rows)

cd examples/csv_example
python csv_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

To see how you might use dedupe with smallish data, see the annotated source code for csv_example.py.
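
The 'y', 'n', 'u' and 'f' keys mentioned above drive a labeling loop. The following is a rough sketch of what such a loop could look like, written for Python 3. It is not taken from csv_example.py or from dedupe itself: `collect_labels`, the `read_key` parameter, and the "match"/"distinct" result keys are hypothetical names used only for this illustration.

```python
def collect_labels(pairs, read_key=input):
    """Ask about each candidate pair until the pairs run out or 'f' is pressed."""
    labels = {"match": [], "distinct": []}
    for pair in pairs:
        answer = ""
        # Keep asking until one of the four recognized keys is given.
        while answer not in ("y", "n", "u", "f"):
            prompt = "%s / %s - same thing? (y/n/u/f) " % pair
            answer = read_key(prompt).strip().lower()
        if answer == "f":
            break
        if answer == "y":
            labels["match"].append(pair)
        elif answer == "n":
            labels["distinct"].append(pair)
        # 'u' (unsure) records nothing and moves on to the next pair.
    return labels


# Scripted answers stand in for keyboard input so the sketch runs unattended.
scripted = iter(["y", "u", "n", "f"])
pairs = [
    ("Bob Smith", "Robert Smith"),
    ("A. Lee", "Al Lee"),
    ("Ann Ray", "Tom Ray"),
    ("X Corp", "Y Corp"),
]
labels = collect_labels(pairs, read_key=lambda prompt: next(scripted))
print(labels)
```

Calling `collect_labels(pairs)` without `read_key` would prompt on the terminal instead of using the scripted answers.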

MySQL example (10,000 - 1,000,000+ rows)

This can take a few hours and will noticeably tax your laptop. You might want to run it overnight.

To follow this example, you need to:

  • Create a MySQL database called 'contributions'
  • Copy examples/mysql_example/mysql.cnf_LOCAL to examples/mysql_example/mysql.cnf
  • Update examples/mysql_example/mysql.cnf with your MySQL username and password
  • Install MySQL-python with either easy_install MySQL-python or pip install MySQL-python

Once that's all done you can run the example:

cd examples/mysql_example
python mysql_init_db.py 
python mysql_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

To see how you might use dedupe with biggish data, see the annotated source code for mysql_example.py.

We are trying to figure out a range of typical runtimes for different hardware. Please let us know your runtime for the MySQL example.

The record linkage example links two datasets, where each dataset individually has no duplicates.

python examples/record_linkage_example/record_linkage_example.py 

To see how you might use dedupe for linking datasets, see the annotated source code for record_linkage_example.py.
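
Because neither dataset contains duplicates of its own, a record on one side should match at most one record on the other. The sketch below illustrates that one-to-one constraint with a simple greedy matcher in plain Python. It is an illustration under stated assumptions, not dedupe's algorithm or API: `score`, `link`, the sample names, and the 0.8 threshold are all made up for this example, and plain string similarity stands in for a learned comparison.

```python
from difflib import SequenceMatcher


def score(a, b):
    # Plain string similarity in [0, 1], used here in place of a learned model.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(left, right, threshold=0.8):
    """Greedy one-to-one linking: best-scoring pairs first, each record used once."""
    scored = sorted(
        ((score(a, b), i, j) for i, a in enumerate(left) for j, b in enumerate(right)),
        reverse=True,
    )
    used_left, used_right, links = set(), set(), []
    for s, i, j in scored:
        if s < threshold:
            break  # remaining pairs score even lower
        if i not in used_left and j not in used_right:
            links.append((left[i], right[j]))
            used_left.add(i)
            used_right.add(j)
    return links


# Made-up sample data: two lists that each contain no internal duplicates.
left = ["Acme Widgets Inc", "Zenith Tools"]
right = ["Zenith Tools Ltd", "ACME Widgets, Inc.", "Acorn Bakery"]
links = link(left, right)
print(links)
```

This version compares every left record with every right record, which is only practical for small inputs. It also makes no attempt at finding a globally optimal assignment.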

Documentation

The documentation for the dedupe library is on our wiki.

Testing

Build extensions in place

python setup.py build_ext --inplace

Unit tests of core dedupe functions

nosetests

Test using canonical dataset from Bilenko's research

Using Deduplication

python tests/canonical_test.py

Using Record Linkage

python tests/canonical_test_matching.py

Team

Credits

Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Errors / Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2013 Forest Gregg and Derek Eder. Released under the MIT License.

See LICENSE for details

Third-party copyright in this distribution is noted where applicable.

Citing Dedupe

If you use Dedupe in an academic work, please give this citation:

Gregg, Forest, and Derek Eder. 2013. Dedupe. https://github.com/datamade/dedupe.
