Dedupe Python Library

Deduplication, entity resolution, record linkage, author disambiguation, and others ...

As different research communities encountered this problem, they each gave it a new name but, ultimately, it's all about trying to figure out which records refer to the same thing.

Dedupe is an open source Python library that quickly de-duplicates large sets of data.

Features

machine learning - reads in human-labeled data to automatically create optimum weights and blocking rules
runs on a laptop - makes intelligent comparisons so you don't need a powerful server to run it
built as a library - so it can be integrated into your applications or import scripts
extensible - supports adding custom data types, string comparators and blocking rules
open source - anyone can use, modify or add to it
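
To make "weights" and "blocking rules" concrete, here is a minimal, self-contained sketch of the general idea. It is not dedupe's API: the records, the field names, the block_key rule, the hand-picked weights and the threshold are all hypothetical, and difflib stands in for dedupe's own string comparators. Dedupe learns its weights and blocking rules from labeled data rather than hard-coding them. The sketch assumes Python 3.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; the field names are illustrative only.
records = {
    1: {"name": "Acme Corp", "city": "Chicago"},
    2: {"name": "ACME Corporation", "city": "Chicago"},
    3: {"name": "Zenith Ltd", "city": "Boston"},
}

# Hand-picked field weights (an assumption). Dedupe learns these instead.
weights = {"name": 0.7, "city": 0.3}


def block_key(record):
    """Toy blocking rule: the first three letters of the name, lowercased."""
    return record["name"][:3].lower()


def similarity(a, b):
    """Weighted sum of per-field string similarity, each field in [0, 1]."""
    return sum(
        weight * SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        for field, weight in weights.items()
    )


def candidate_pairs(records):
    """Yield only pairs of record ids that fall in the same block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def find_duplicates(records, threshold=0.7):
    """Return candidate pairs whose weighted similarity meets the threshold."""
    return [
        (i, j)
        for i, j in candidate_pairs(records)
        if similarity(records[i], records[j]) >= threshold
    ]
```

Blocking is what keeps the work small enough for a laptop: only records that share a block key are ever compared, so the number of comparisons grows with block sizes rather than with every possible pair in the dataset.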

How it works

Community

Dedupe Google group
ChiPy presentation
IRC channel, #dedupe on irc.freenode.net

Installation and dependencies

Dedupe requires numpy, which can be complicated to install. If you are installing numpy for the first time, follow these instructions. You'll need version 1.6 of numpy or higher.
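
If you are unsure which numpy version is already installed, a quick check along these lines may help. This is a sketch only: the helper names version_tuple and numpy_is_new_enough are hypothetical and not part of dedupe, and the only numpy feature used is its __version__ attribute.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.6.2' into (1, 6) for comparison."""
    major, minor = re.match(r"(\d+)\.(\d+)", version).groups()
    return int(major), int(minor)


def numpy_is_new_enough(version, minimum=(1, 6)):
    """Return True if the given version string is at least `minimum`."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        import numpy
    except ImportError:
        print("numpy is not installed")
    else:
        if numpy_is_new_enough(numpy.__version__):
            print("numpy %s is new enough" % numpy.__version__)
        else:
            print("numpy %s is too old; 1.6 or higher is needed" % numpy.__version__)
```

Comparing integer tuples rather than raw strings matters here, because as plain strings "1.10" would sort before "1.6".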

After numpy is set up, install the following:

Using pip:

git clone git://github.com/open-city/dedupe.git
cd dedupe
pip install "numpy>=1.6"
pip install -r requirements.txt
python setup.py install

Using easy_install:

git clone git://github.com/open-city/dedupe.git
cd dedupe
easy_install "numpy>=1.6"
easy_install fastcluster
easy_install hcluster
easy_install networkx
python setup.py install

Usage examples

Dedupe is a library and not a stand-alone command line tool. To demonstrate its usage, we have come up with a few example recipes for different sized datasets.

CSV example (<10,000 rows)

python examples/csv_example/csv_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

To see how you might use dedupe with smallish data, see the annotated source code for csv_example.py.

MySQL example (10,000 - 1,000,000+ rows)

This can take a few hours and will noticeably tax your laptop. You might want to run it overnight.

To follow this example you need to:

Create a MySQL database called 'contributions'
Copy examples/mysql_example/mysql.cnf_LOCAL to examples/mysql_example/mysql.cnf
Update examples/mysql_example/mysql.cnf with your MySQL username and password
Install MySQL-python with easy_install MySQL-python or pip install MySQL-python

Once that's all done you can run the example:

python examples/mysql_example/mysql_init_db.py 
python examples/mysql_example/mysql_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

To see how you might use dedupe with biggish data, see the annotated source code for mysql_example.

We are trying to figure out a range of typical run times for different hardware. Please let us know your run time for the MySQL example.
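
One way to measure the run time is a small wrapper like the sketch below. It is not part of the dedupe repository: time_command and format_duration are hypothetical helpers, and the sketch assumes Python 3 and that it is run from the repository root. Because the example asks for labels interactively, the measured time includes however long you spend labeling.

```python
import subprocess
import sys
import time


def time_command(args):
    """Run a command to completion and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    completed = subprocess.run(args)
    return time.perf_counter() - start, completed.returncode


def format_duration(seconds):
    """Format a duration in seconds as hours, minutes and seconds."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return "%dh %02dm %02ds" % (hours, minutes, secs)


if __name__ == "__main__":
    # Path taken from the instructions above, relative to the repository root.
    script = "examples/mysql_example/mysql_example.py"
    elapsed, exit_code = time_command([sys.executable, script])
    print("exit code %d after %s" % (exit_code, format_duration(elapsed)))
```

When reporting a run time, it helps to include your CPU, memory and operating system alongside the duration.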

Testing

Unit tests of core dedupe functions

python test/test_dedupe.py

Test using canonical dataset from Bilenko's research

Using random sample data for training

python test/canonical_test.py

Using active learning for training

python test/canonical_test.py --active True

Team

Credits

Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Errors / Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here.

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Send us a pull request. Bonus points for topic branches.

Copyright

See LICENSE for details

Citing Dedupe

If you use Dedupe in an academic work, please give this citation:

Gregg, Forest, and Derek Eder. 2013. Dedupe. https://github.com/open-city/dedupe.
