Data integration is a notoriously difficult and heuristic-driven process, especially when ground-truth data are not readily available. This paper presents a measure of uncertainty in two-table, one-to-many data integration workflows. Users can use these query results to guide a search through different matching parameters, similarity metrics, and constraints. Although there are exponentially many such matchings, we show that, in appropriately constrained circumstances, this result range can be computed in polynomial time with the Blossom algorithm, a generalization of the Hungarian algorithm used in bipartite graph matching. We evaluate this on 3 real-world datasets and synthetic datasets, and find that uncertainty estimates are more robust when a graph-matching-based approach is used for data integration.
Build a bipartite matching algorithm in Python: maximum bipartite matching.
Steps:
- Define a bipartite graph structure
- One set of keys from table A; one set of keys from table B
Notes:
- The edge set would describe the relationship between the tables (a weighted relationship).
- Any possible match can be an edge in this graph.
- Allow an arbitrary weight function, to handle both maximum and minimum matchings.
- Allow the algorithm to be given two dataframes and a function that gives the correspondence.
- Look for areas where the bipartite matching code can be further generalized.
- Check the accuracy of the bipartite matching against the ground-truth dataset.
  Note: the dataset is not added to this repository due to size constraints. Please put the dataset in the same folder as the notebook file when running the code.
  Link for the dataset: https://www.openicpsr.org/openicpsr/project/100843/version/V2/view
- Use the "_1" and "_2" labels so that the code can be further generalized to different matchings.
- Use a built-in similarity metric library implemented in C/C++, to eliminate any slowness stemming from the similarity metrics I hand-coded myself.
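The interface sketched in the notes above (two dataframes plus an arbitrary weight function) can be prototyped with SciPy's assignment solver. This is a minimal sketch, not the repository's implementation; the `match_tables` name and the toy character-overlap similarity are assumptions made for illustration.

```python
# Minimal sketch: weighted bipartite matching between the key columns of
# two dataframes, with a user-supplied weight function. Assumes SciPy's
# linear_sum_assignment as the solver; not the repository's actual code.
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

def match_tables(df_a, df_b, key_a, key_b, similarity, maximize=True):
    """Match rows of df_a to rows of df_b.

    `similarity` is an arbitrary weight function over a pair of key values;
    `maximize` flips between maximum- and minimum-weight matching.
    """
    keys_a = df_a[key_a].tolist()
    keys_b = df_b[key_b].tolist()
    # Build the dense weight matrix: every possible match is an edge.
    weights = np.array([[similarity(a, b) for b in keys_b] for a in keys_a])
    rows, cols = linear_sum_assignment(weights, maximize=maximize)
    return [(keys_a[i], keys_b[j]) for i, j in zip(rows, cols)]

df_a = pd.DataFrame({"country": ["USA", "CHINA"]})
df_b = pd.DataFrame({"country": ["CHNA", "US"]})

def similarity(a, b):
    # Toy weight: size of the shared character set.
    return len(set(a) & set(b))

print(match_tables(df_a, df_b, "country", "country", similarity))
# [('USA', 'US'), ('CHINA', 'CHNA')]
```

Passing `maximize=False` reuses the same code path for minimum-cost matching, which is what the "arbitrary weight function to handle both maximum and minimum" note asks for.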
- Jupyter did not recognize some of the already installed libraries (the libraries I tried to install from PyPI's similarity metric packages). Running Jupyter in a conda environment solved the problem. To run this Jupyter notebook in a conda environment, follow these steps:
  - Make sure you have Anaconda installed: https://docs.continuum.io/anaconda/install/#
  - Run `git pull` on this repository to get the latest commits
  - Run `conda env create -f environment.yml`
  - Run `conda activate datares1-env`
  - Run `jupyter notebook` or `jupyter lab`
The code structure is now as follows:
- `one_to_one.py` includes all the essential functions that are contained in the 1-1 bipartite matching algorithm
- Tests use the `one_to_one` module in the `src` folder. The tests can be found in the `tests/` folder
- There are 2 test files:
  - The `one_to_one_basic.py` file contains tests that check whether the functions in the core module are working correctly.
  - The `one_to_one_advanced.py` file contains tests that evaluate the precision and accuracy of the matchings against the perfect mapping that is provided as an example in this file.
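The precision check described for `one_to_one_advanced.py` can be pictured with a small sketch. The helper name and the toy data below are hypothetical, not the repository's actual test code.

```python
# Hypothetical sketch of the kind of check one_to_one_advanced.py performs:
# compare a produced matching against a known perfect mapping.
def matching_precision(predicted, perfect):
    """Fraction of predicted pairs that also appear in the perfect mapping."""
    predicted, perfect = set(predicted), set(perfect)
    return len(predicted & perfect) / len(predicted)

perfect = {("USA", "US"), ("CHINA", "CHNA")}
predicted = [("USA", "US"), ("CHINA", "CANADA")]
print(matching_precision(predicted, perfect))  # 0.5
```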
- Duplicate the "1" table to have a user-inputted number of duplicates
- Do the bipartite matching on that duplicated "1" table. The resulting output would look like this:

  Resulting matching excerpt:

      USA_1 ------> US
      USA_2 ------> U.S.A

- De-duplicate / collapse the results to have a 1-n matching that looks as follows:

  Collapsing the duplicates:

      USA ---> US, U.S.A
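The duplicate-then-collapse steps above can be sketched as follows. `duplicate_keys` and `collapse` are illustrative helper names, not the repository's API.

```python
# Sketch of the duplicate-then-collapse approach to 1-n matching:
# replicate each key n times with "_i" suffixes, match, then regroup.
from collections import defaultdict

def duplicate_keys(keys, n):
    # Duplicate every key n times, tagging copies "_1", "_2", ...
    return [f"{k}_{i}" for k in keys for i in range(1, n + 1)]

def collapse(matching):
    # Strip the "_i" suffix and group matches per original key.
    grouped = defaultdict(list)
    for left, right in matching:
        original = left.rsplit("_", 1)[0]
        grouped[original].append(right)
    return dict(grouped)

dup = duplicate_keys(["USA"], 2)          # ['USA_1', 'USA_2']
matching = [("USA_1", "US"), ("USA_2", "U.S.A")]
print(collapse(matching))                 # {'USA': ['US', 'U.S.A']}
```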
The updated code structure is now as follows:
- `one_to_n.py` includes all the essential functions that are contained in the 1-n bipartite matching algorithm
- Tests use the `one_to_n` module in the `src` folder. The tests can be found in the `tests/` folder
- There are 2 test files:
  - The `one_to_n_basic.py` file contains tests that check whether the functions in the 1-n module are working correctly.
  - The `one_to_n_advanced.py` file contains tests that evaluate the efficiency of the 1-n matching on the large dataset example, the `DBLP-ACM` dataset.
- The source file for the work is aggregated in `src/transitive_closure.py`, and the tests are aggregated in `tests/transitive_basic.py` and `tests/transitive_advanced.py`. The latter file is still in progress, with the hope of finishing it once the problem is solved using the small-scale data that covers all the edge cases.
Problem Statement: Assume that the same entities appear in different representations within a single column of a table. We want to find a way to indicate that they are the same entities by mapping them to one another.
For example, assume the following table:

    Table A
    ---------
    Countries
    ---------
    USA
    US
    CHINA
    CHNA
    CANADA
The matching outcome we want is the following:

    Table A          (Duplicated Table A)
    USA    ----------  US
    US     ----------  USA
    CHINA  ----------  CHNA
    CHNA   ----------  CHINA
    CANADA ----X-----  CANADA

The ultimate de-duplicated outcome we want is the following:

    (USA, US), (CHINA, CHNA)
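Under the duplicated-table framing above, the within-column deduplication can be sketched with an assignment solver that forbids self-matches and drops low-similarity pairs, so that CANADA stays unmatched. The character-Jaccard similarity and the threshold value are assumptions made for illustration, not the repository's method.

```python
# Sketch of within-column deduplication: match the column against a copy
# of itself, forbid self-matches, and keep each surviving pair once.
# The similarity function and threshold are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def char_jaccard(a, b):
    # Toy similarity: Jaccard overlap of the character sets.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def dedupe_column(values, similarity, threshold):
    weights = np.array([[similarity(a, b) for b in values] for a in values])
    np.fill_diagonal(weights, -1e9)          # a value may not match itself
    rows, cols = linear_sum_assignment(weights, maximize=True)
    pairs = set()
    for i, j in zip(rows, cols):
        if weights[i, j] >= threshold:       # weak matches stay unmatched
            pairs.add(tuple(sorted((values[i], values[j]))))
    return sorted(pairs)

values = ["USA", "US", "CHINA", "CHNA", "CANADA"]
print(dedupe_column(values, char_jaccard, threshold=0.65))
# [('CHINA', 'CHNA'), ('US', 'USA')] -- CANADA is left unmatched
```

Sorting each pair before collecting it de-duplicates the symmetric matches (USA→US and US→USA become a single pair), mirroring the collapsed outcome shown above.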