Dedupe Python Library

Deduplication, entity resolution, record linkage, author disambiguation, and others ...

As different research communities encountered this problem, they each gave it a new name but, ultimately, it's all about trying to figure out which records refer to the same thing.

Dedupe is an open source Python library that quickly de-duplicates large sets of data.

Features

  • machine learning - reads in human-labeled data to automatically create optimum weights and blocking rules
  • runs on a laptop - makes intelligent comparisons so you don't need a powerful server to run it
  • built as a library - so it can be integrated into your applications or import scripts
  • extensible - supports adding custom data types, string comparators and blocking rules
  • open source - anyone can use, modify or add to it
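
To make the idea of blocking rules concrete, here is a minimal sketch of what a blocking rule does in general: it groups records by a cheap key so that only records within the same group are compared, instead of every possible pair. This is an illustration only and does not use dedupe's API; the `block_key` rule, the `candidate_pairs` helper, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record):
    # Hypothetical blocking rule: first three letters of the lowercased name.
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Return sorted pairs of record ids that share a block key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)

    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Made-up sample data for illustration.
records = {
    1: {"name": "Acme Corp"},
    2: {"name": "ACME Corporation"},
    3: {"name": "Zenith Ltd"},
    4: {"name": "Zenith Limited"},
    5: {"name": "Bolt Inc"},
}

pairs = candidate_pairs(records)
print(pairs)  # [(1, 2), (3, 4)]
```

With five records there are ten possible pairs, but this rule leaves only two to compare in detail. A real blocking rule has to balance that saving against the risk of separating true duplicates into different blocks.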

How it works

Community

Installation and dependencies

Dedupe requires numpy, which can be complicated to install. If you are installing numpy for the first time, follow these instructions. You'll need version 1.6 of numpy or higher.

After numpy is set up, install dedupe using one of the following methods:
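
If you are unsure which numpy version is already installed, a quick check along these lines may help. It is a small sketch, not part of dedupe: the `meets_minimum` helper is hypothetical, and it only looks at the major and minor version numbers.

```python
import re


def meets_minimum(version, minimum=(1, 6)):
    """Return True if a 'major.minor...' version string is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return (int(match.group(1)), int(match.group(2))) >= minimum


try:
    import numpy
except ImportError:
    print("numpy is not installed")
else:
    status = "ok" if meets_minimum(numpy.__version__) else "too old"
    print("numpy %s: %s" % (numpy.__version__, status))
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating "1.10" as older than "1.6".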

Using pip:

git clone https://github.com/open-city/dedupe.git
cd dedupe
pip install "numpy>=1.6"
pip install -r requirements.txt
python setup.py install

Using easy_install:

git clone https://github.com/open-city/dedupe.git
cd dedupe
easy_install "numpy>=1.6"
easy_install fastcluster
easy_install hcluster
easy_install networkx
python setup.py install

Usage examples

Dedupe is a library, not a stand-alone command-line tool. To demonstrate its usage, we have come up with a few example recipes for datasets of different sizes.

CSV example (<10,000 rows)

python examples/csv_example/csv_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)
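
The labeling prompt can be pictured as a simple loop over candidate pairs. The sketch below is not the code from csv_example.py; the `label_pairs` function, its `read_key` parameter, and the label names are hypothetical, and it assumes 'u' means "unsure" so that the pair is skipped.

```python
def label_pairs(pairs, read_key=input):
    """Collect y/n/u/f answers for candidate pairs (illustrative only)."""
    labels = {"match": [], "distinct": []}
    for pair in pairs:
        key = read_key("Do these records refer to the same thing? (y/n/u/f) ")
        key = key.strip().lower()
        if key == "f":
            # Finished: stop asking and keep what has been labeled so far.
            break
        if key == "y":
            labels["match"].append(pair)
        elif key == "n":
            labels["distinct"].append(pair)
        # 'u' (assumed to mean unsure) or any other key: leave the pair unlabeled.
    return labels


# Scripted answers stand in for keyboard input so the demo runs unattended.
answers = iter(["y", "u", "n", "f"])
demo = label_pairs(
    [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")],
    read_key=lambda prompt: next(answers),
)
print(demo)  # {'match': [('a', 'b')], 'distinct': [('e', 'f')]}
```

Passing the input function as a parameter keeps the loop testable without a keyboard.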

To see how you might use dedupe with smallish data, see the annotated source code for csv_example.py.

MySQL example (10,000 - 1,000,000+ rows)

This can take a few hours and will noticeably tax your laptop. You might want to run it overnight.

To follow this example you need to:

  • Create a MySQL database called 'contributions'
  • Copy examples/mysql_example/mysql.cnf_LOCAL to examples/mysql_example/mysql.cnf
  • Update examples/mysql_example/mysql.cnf with your MySQL username and password
  • Install MySQL-python with easy_install MySQL-python or pip install MySQL-python

Once that's all done you can run the example:

python examples/mysql_example/mysql_init_db.py 
python examples/mysql_example/mysql_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

To see how you might use dedupe with biggish data, see the annotated source code for mysql_example.

We are trying to figure out a range of typical runtimes for different hardware. Please let us know your runtime for the MySQL example.

Testing

Unit tests of core dedupe functions

python test/test_dedupe.py

Test using canonical dataset from Bilenko's research

Using random sample data for training

python test/canonical_test.py

Using active learning for training

python test/canonical_test.py --active True

Team

Credits

Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Errors / Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2013 Forest Gregg and Derek Eder. Released under the MIT License.

See LICENSE for details.

Citing Dedupe

If you use Dedupe in an academic work, please give this citation:

Gregg, Forest, and Derek Eder. 2013. Dedupe. https://github.com/open-city/dedupe.


About

A free Python library for accurate and scalable deduplication and entity resolution.
