Simple Deduplication

Given some image database linked to a users database, identify potential duplicate users and expose them via API.

Getting Started

Docker

Have Docker
docker build -t dedupe .
docker run -ti --rm -p 5000:5000 dedupe
In another prompt curl localhost:5000/duplicates to build the match index.

No Docker (highly recommend using Docker though)

Have python
pip install -r requirements.txt
python main.py
In another prompt curl localhost:5000/duplicates to build the match index.

Database

Replicate a simple database just using a JSON

Image Database

JSON mapping imagefiles to user unique IDs

API

/duplicates (GET): Ask for any potential duplicates
/duplicate/<id> (GET): Ask if user with id is or has any potential duplicates
/users (GET): Get all users
/user/<id> (GET): Get a user by an id

What is really happening

The users.json and images.json simulate user and image databases (maybe a pg table and a mongodb, respectively). When you run the container (or execute main.py), a flask API powered by gunicorn is started. Immediately, one should request localhost:5000/duplicates to construct the SIFT matches index. In a production environment, this would likely be continuously evaluated with a hadoop cluster against millions of users and millions of images.

Once you have the index, queries will be much faster as the API caches the index on the filesystem. You can also request if a specific user is a possible duplicate by id (hint, users 4 and 7 are duplicates, they have the same driver's license).

The notion of deduplication centers upon creating a similarity threshhold for aliases and corresponding metadata across a domain of users. In this simple example, only one piece of information is considered: the "similarity" of users identifying documents to one another. One powerful option is Scale-Invariant Feature Extraction. SIFT is a powerful algorithmic form of feature engineering on a given corpus, and excellent at comparisons in low-cpu environments. Essentially, SIFT determines key points and a standard description of said key points for a given image (learn more about how SIFT works here). Then, a pattern matcher examines the number of strong matches (i.e. low distance) each image has to one another, and computes an image-size adjusted index for fair comparison across all images regardless of resolution. Thus, more matches on the index between a given two images, the greater liklihood that they are either pictures taken of the same document, or the same document uploaded twice (the latter case applies to ids 4 and 7.)

This is just one aspect of deduplication given a domain of users. Distance algorithms should be applied to all text data in the users database (literal pattern matches to find name + dob + ssn + location dupes). OCR and entity extraction should be performed on all identifying documents to yield additional text-based duplication estimates.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
images		images
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
images.json		images.json
main.py		main.py
requirements.txt		requirements.txt
users.json		users.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

images

images

.gitignore

.gitignore

Dockerfile

Dockerfile

README.md

README.md

images.json

images.json

main.py

main.py

requirements.txt

requirements.txt

users.json

users.json

Repository files navigation

Simple Deduplication

Getting Started

Docker

No Docker (highly recommend using Docker though)

Database

Image Database

API

What is really happening

About

Releases

Packages

Languages

TuringFantasy/simple_dedupe

Folders and files

Latest commit

History

Repository files navigation

Simple Deduplication

Getting Started

Docker

No Docker (highly recommend using Docker though)

Database

Image Database

API

What is really happening

About

Topics

Resources

Stars

Watchers

Forks

Languages