Dedupe Python Library

Deduplication, entity resolution, record linkage, author disambiguation, and others ...

As different research communities encountered this problem, they each gave it a new name but, ultimately, it's all about figuring out which records refer to the same thing.

Dedupe is an open source Python library that quickly de-duplicates large sets of data.

Features

machine learning - reads in human-labeled data to automatically create optimum weights and blocking rules
runs on a laptop - makes intelligent comparisons so you don't need a powerful server to run it
built as a library - so it can be integrated into your applications or import scripts
extensible - supports adding custom data types, string comparators and blocking rules
open source - anyone can use, modify or add to it
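
To give a feel for why blocking rules keep the number of comparisons manageable, here is a minimal sketch in plain Python. It is not dedupe's implementation: the records, the field names and the block_key rule are all made up for illustration, and dedupe learns its blocking rules from labeled data rather than hard-coding one.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; the field names are invented for this illustration.
records = {
    1: {"name": "Acme Corp", "zip": "60601"},
    2: {"name": "ACME Corporation", "zip": "60601"},
    3: {"name": "Bolt Ltd", "zip": "60614"},
    4: {"name": "Bolt Limited", "zip": "60614"},
    5: {"name": "Cedar Inc", "zip": "60622"},
}


def block_key(record):
    """Hypothetical blocking rule: first three letters of the name plus zip."""
    return (record["name"][:3].lower(), record["zip"])


def candidate_pairs(records):
    """Yield only the pairs of record ids that share a block key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            yield pair


# Without blocking, every record is compared with every other record.
all_pairs = list(combinations(sorted(records), 2))
# With blocking, only records in the same block are compared.
blocked_pairs = sorted(candidate_pairs(records))

print(len(all_pairs))   # 10
print(blocked_pairs)    # [(1, 2), (3, 4)]
```

With five records the saving is small, but the number of all-pairs comparisons grows quadratically with the number of records, while a good blocking rule keeps the candidate set close to the number of true duplicates.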

How it works

Community

Dedupe Google group
ChiPy presentation
IRC channel, #dedupe on irc.freenode.net

Installation and dependencies

Dedupe requires numpy, which can be complicated to install. If you are installing numpy for the first time, follow these instructions. You'll need version 1.6 of numpy or higher.
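
If you are not sure which numpy you already have, a quick check along these lines may help. This is only a sketch: version_tuple and numpy_is_recent_enough are helper names invented here, not part of dedupe or numpy. The only numpy feature it relies on is the numpy.__version__ string.

```python
def version_tuple(version):
    """Turn a version string such as '1.6.2' into a tuple of ints, (1, 6, 2).

    Each dotted piece is cut at its first non-digit, so '1.8.0rc1'
    becomes (1, 8, 0).
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def numpy_is_recent_enough(minimum=(1, 6)):
    """Return True if numpy can be imported and is at least `minimum`."""
    try:
        import numpy
    except ImportError:
        return False
    return version_tuple(numpy.__version__) >= minimum


print("numpy >= 1.6 available:", numpy_is_recent_enough())
```

Comparing tuples of integers rather than raw strings matters here: as strings, "1.10.0" sorts before "1.6", but as tuples (1, 10, 0) is correctly newer than (1, 6).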

After numpy is set up, install dedupe and its remaining dependencies:

Using pip:

git clone git://github.com/datamade/dedupe.git
cd dedupe
pip install "numpy>=1.6"
# for python 2.7
pip install -r requirements.txt
# OR for python 2.6
pip install -r py26_requirements.txt
python setup.py install

Using easy_install:

git clone git://github.com/datamade/dedupe.git
cd dedupe
easy_install "numpy>=1.6"
easy_install "fastcluster>=1.1.8"
easy_install "hcluster>=0.2.0"
easy_install networkx
easy_install zope.interface
easy_install zope.index
python setup.py install

Usage examples

Dedupe is a library and not a stand-alone command line tool. To demonstrate its usage, we have come up with a few example recipes for different-sized datasets.

CSV example (<10,000 rows)

cd examples/csv_example
python csv_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

To see how you might use dedupe with smallish data, see the annotated source code for csv_example.py.
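
The key prompts above can be pictured with a small stand-alone sketch. It is an illustration only, not the code in csv_example.py: collect_labels and the label names "duplicate" and "distinct" are hypothetical, and the sketch leaves out the active learning part, where the library chooses which pair to show you next.

```python
def collect_labels(candidate_pairs, responses):
    """Pair each candidate with a key press until 'f' is seen.

    'y' marks a duplicate, 'n' marks a distinct pair, 'u' (unsure)
    leaves the pair unlabeled, and 'f' finishes the session.
    """
    labels = {"duplicate": [], "distinct": []}
    for pair, key in zip(candidate_pairs, responses):
        if key == "f":
            break
        if key == "y":
            labels["duplicate"].append(pair)
        elif key == "n":
            labels["distinct"].append(pair)
        # 'u' or any other key: skip the pair without labeling it
    return labels


# Made-up pairs and a scripted sequence of key presses.
pairs = [
    ("Acme Corp", "ACME Corporation"),
    ("Acme Corp", "Bolt Ltd"),
    ("Bolt Ltd", "Bolt Limited"),
    ("Cedar Inc", "Cedar Co"),
]
labels = collect_labels(pairs, ["y", "n", "u", "f"])
print(labels)
```

Here the first pair is recorded as a duplicate, the second as distinct, the third is skipped, and 'f' ends the session before the fourth pair is labeled.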

MySQL example (10,000 - 1,000,000+ rows)

This can take a few hours and will noticeably tax your laptop. You might want to run it overnight.

To follow this example you need to:

Create a MySQL database called 'contributions'
Copy examples/mysql_example/mysql.cnf_LOCAL to examples/mysql_example/mysql.cnf
Update examples/mysql_example/mysql.cnf with your MySQL username and password
Install MySQL-python with either easy_install MySQL-python or pip install MySQL-python

Once that's all done you can run the example:

cd examples/mysql_example
python mysql_init_db.py 
python mysql_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

To see how you might use dedupe with biggish data, see the annotated source code for mysql_example.

We are trying to figure out a range of typical runtimes for different hardware. Please let us know your run time for the MySQL example.

Record Linkage example

This example links two datasets, each of which has no duplicates on its own.

python examples/record_linkage_example/record_linkage_example.py

To see how you might use dedupe for linking datasets, see the annotated source code for record_linkage_example.py.
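
The difference between deduplication and record linkage shows up in which pairs are worth comparing. The sketch below uses only the standard library and made-up record ids. It is not how record_linkage_example.py is written, just a way to see the idea.

```python
from itertools import combinations, product

# Made-up record ids for two datasets that are each free of duplicates.
dataset_a = ["a1", "a2", "a3"]
dataset_b = ["b1", "b2"]

# Deduplication: any two records in a single pool may be duplicates.
dedupe_pairs = list(combinations(dataset_a + dataset_b, 2))

# Record linkage: only pairs with one record from each dataset are
# candidates, because neither dataset contains internal duplicates.
linkage_pairs = list(product(dataset_a, dataset_b))

print(len(dedupe_pairs))   # 10
print(len(linkage_pairs))  # 6
```

Because within-dataset pairs are ruled out up front, linkage considers fewer candidates than deduplicating the two datasets pooled together.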

Documentation

The documentation for the dedupe library is on our wiki.

Testing

Build extensions in place

python setup.py build_ext --inplace

Unit tests of core dedupe functions

nosetests

Test using canonical dataset from Bilenko's research

Using Deduplication

python tests/canonical_test.py

Using Record Linkage

python tests/canonical_test_matching.py

Team

Credits

Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Errors / Bugs

If something is not behaving intuitively, it is a bug, and should be reported. Report it here

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Send us a pull request. Bonus points for topic branches.

Copyright

See LICENSE for details

Third-party copyright in this distribution is noted where applicable.

Citing Dedupe

If you use Dedupe in an academic work, please give this citation:

Gregg, Forest, and Derek Eder. 2013. Dedupe. https://github.com/datamade/dedupe.
