Big Data ER Analysis

This report briefly discusses three blocking methods for entity resolution (ER): token blocking, attribute clustering blocking, and meta-blocking. The algorithms are implemented in Python; the tools used were PyCharm, Jupyter Notebook, and Emacs. Each program is run with the command `python filename.py`.
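To make the idea of token blocking concrete, here is a minimal sketch for the Clean-Clean setting. It is not the code in the repository's scripts: the helper names (`tokenize`, `token_blocking`, `candidate_pairs`) and the two tiny example datasets are hypothetical, and only the attribute names are taken from the dataset description. Every token that appears in any attribute value becomes a block key, and a block is kept only if it contains entities from both datasets.

```python
import re
from collections import defaultdict


def tokenize(value):
    """Split an attribute value into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", str(value).lower()))


def token_blocking(d1, d2, id1="dataset1id", id2="dataset2id"):
    """Build one block per token; keep blocks that span both datasets."""
    blocks = defaultdict(lambda: (set(), set()))
    for side, (dataset, id_key) in enumerate(((d1, id1), (d2, id2))):
        for entity in dataset:
            for attr, value in entity.items():
                if attr == id_key:
                    continue  # the identifier itself is not used as a blocking key
                for token in tokenize(value):
                    blocks[token][side].add(entity[id_key])
    # A block with entities from only one dataset yields no comparisons.
    return {t: (left, right) for t, (left, right) in blocks.items() if left and right}


def candidate_pairs(blocks):
    """Collect the distinct cross-dataset pairs suggested by the blocks."""
    return {(a, b) for left, right in blocks.values() for a in left for b in right}


# Hypothetical toy records, not taken from the real datasets.
d1 = [
    {"dataset1id": 1, "name": "Jane Doe", "birthYear": 1970},
    {"dataset1id": 2, "name": "John Smith", "birthYear": 1965},
]
d2 = [
    {"dataset2id": 10, "tag": "Doe, Jane", "yearOfBirth": 1970},
    {"dataset2id": 11, "tag": "Alan Roe", "yearOfBirth": 1980},
]

blocks = token_blocking(d1, d2)
pairs = candidate_pairs(blocks)
print(sorted(blocks))  # tokens shared across both datasets
print(pairs)           # candidate comparisons
```

Because the attribute names differ between the two datasets (`name` vs. `tag`), the sketch ignores attribute names entirely and blocks on values only, which is the property that makes token blocking schema-agnostic.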

The datasets are derived from IMDb (https://datasets.imdbws.com/). IMDb is an online database of information about films, television programs, home videos, video games, and online streaming content, including cast, production crew, and personal biographies. We created two Clean-Clean datasets from a list of directors, meaning that neither dataset contains duplicates within itself. Dataset D1 contains 100 entities, each described by an identifier (dataset1id) and four attributes (name, birthYear, movieTitles, year). Dataset D2 contains 105 entities, each described by an identifier (dataset2id) and four attributes (tag, yearOfBirth, knownFor, year). Here name and tag hold the director's name, birthYear and yearOfBirth hold the director's year of birth, movieTitles and knownFor hold a list of IDs of the movies the director is known for, and year holds the year of release. Each entity has around 3 to 4 attributes. There are around 43 true matches between the two datasets.

We added an evaluation step that computes three performance metrics for the blocking methods. These metrics require that the true match status of record pairs is known for the test datasets.

The first metric is the reduction ratio (RR), defined as $RR = 1 - s/N$, where $s$ is the number of record pairs produced by a blocking method for comparison and $N$ is the number of possible record pairs across the two datasets. When linking two datasets with $n$ records each, $N = n \times n$; for our datasets of 100 and 105 entities, $N = 100 \times 105 = 10{,}500$. RR is the relative reduction in the number of record pairs to be compared.
The second metric is pairs completeness (PC), defined as $PC = S_M / N_M$, where $S_M$ is the number of true match record pairs among the pairs produced for comparison by the blocking method and $N_M$ is the total number of true match record pairs in the entire data.
The third metric is pairs quality (PQ), defined as $PQ = S_M / s$, where $S_M$ and $s$ are as defined above.
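The three definitions above translate directly into a few lines of Python. The following is a minimal sketch, not the evaluation code from the repository: the function name `blocking_metrics`, its parameters, and the toy numbers are all hypothetical. It assumes candidate pairs and true matches are given as sets of (D1 id, D2 id) tuples.

```python
def blocking_metrics(candidates, true_matches, n1, n2):
    """Compute RR, PC and PQ for a set of candidate pairs.

    candidates   -- pairs produced by the blocking method (s = their count)
    true_matches -- ground-truth matching pairs (N_M = their count)
    n1, n2       -- number of entities in each dataset (N = n1 * n2)
    """
    candidates = set(candidates)
    true_matches = set(true_matches)

    s = len(candidates)
    total_pairs = n1 * n2                    # N
    s_m = len(candidates & true_matches)     # S_M: true matches that survived blocking
    n_m = len(true_matches)                  # N_M

    return {
        "RR": 1 - s / total_pairs,
        "PC": s_m / n_m if n_m else 0.0,     # guard against empty ground truth
        "PQ": s_m / s if s else 0.0,         # guard against no candidates
    }


# Hypothetical toy example: 4 x 5 = 20 possible pairs, 5 candidates, 4 true matches.
candidates = {(1, 10), (2, 11), (3, 12), (4, 10), (1, 11)}
true_matches = {(1, 10), (2, 11), (3, 13), (4, 14)}

metrics = blocking_metrics(candidates, true_matches, n1=4, n2=5)
print(metrics)
```

In this toy example two of the four true matches are among the five candidates, so RR is 0.75, PC is 0.5, and PQ is 0.4.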

Results

We described two blocking methods for entity resolution: token blocking and attribute clustering blocking. We also discussed how these methods can be optimized with meta-blocking schemes, which restructure the block collection to improve efficiency while trying to keep effectiveness intact. In our experiments, meta-blocking improved efficiency without reducing effectiveness at all. We attribute this to the simplicity of our datasets, which are not very heterogeneous. With other datasets, the pruning steps in the various meta-blocking schemes could lose some legitimate matches.
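The report does not say which meta-blocking schemes were used, so the following is only an illustrative sketch of one common combination: weighting each candidate pair by the number of blocks it shares and then pruning pairs whose weight falls below the average. The function names and the toy block collection are hypothetical. It shows how restructuring a block collection can remove comparisons, and also why pruning can in principle drop a legitimate match that shares only a few blocks.

```python
from collections import Counter


def build_blocking_graph(blocks):
    """Weight each cross-dataset pair by the number of blocks it co-occurs in."""
    weights = Counter()
    for left, right in blocks.values():
        for a in left:
            for b in right:
                weights[(a, b)] += 1
    return weights


def weight_edge_pruning(weights):
    """Keep only the pairs whose weight is at least the mean edge weight."""
    if not weights:
        return set()
    threshold = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= threshold}


# Hypothetical block collection: token -> (ids from D1, ids from D2).
blocks = {
    "jane": ({1}, {10}),
    "doe": ({1}, {10, 11}),
    "1970": ({1, 2}, {10}),
}

weights = build_blocking_graph(blocks)
kept = weight_edge_pruning(weights)
print(dict(weights))  # three distinct pairs before pruning
print(kept)           # pairs that survive pruning
```

Here the pair (1, 10) shares three blocks while the other two pairs share one each, so only (1, 10) remains after pruning and the number of comparisons drops from three to one.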
