Sample AWS Glue script for data linking

A first attempt at a Glue job that performs data deduplication.

High level overview of the job (use VS Code Markdown Preview Enhanced to view).

graph TD
A[Ground truth dataset] -->|from fake data| B
B[Train test split] --> C
C[Tokenise records] -->|Concat all columns we're using for matching and split into array of tokens| C2
C2[Compute lookup table containing relative frequency of each token] -->D
D[Apply Blocking rules] -->|Apply a series of OR rules like 'firstname' and 'surname' or 'firstname' and 'dob' to produce|E
E[Dataset of potentially matching pairs] --> F
F[Compute features] -->G
F -->H
G[Edit distance] -->J
H[Probability score] -->|Look up each matching token in the token frequency table and multiply together to produce score|J
J[Train logit model] --> K
K[Apply trained model to test data] --> L
L[Compute accuracy statistics on test data]
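
The first half of the graph (tokenise, token frequency lookup, blocking rules, probability score) can be sketched in plain Python. This is a minimal illustration, not the Glue job's code: the sample records, the column names in `MATCH_COLUMNS`, and every function name are assumptions made for the example. A real job would express blocking as Spark joins rather than comparing all pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample records; column names are assumptions for illustration.
records = [
    {"id": 1, "firstname": "john", "surname": "smith", "dob": "1980-01-01"},
    {"id": 2, "firstname": "john", "surname": "smith", "dob": "1980-01-02"},
    {"id": 3, "firstname": "jane", "surname": "jones", "dob": "1975-05-05"},
    {"id": 4, "firstname": "john", "surname": "jones", "dob": "1980-01-01"},
]
MATCH_COLUMNS = ["firstname", "surname", "dob"]

# Each blocking rule is a tuple of columns that must all agree (AND);
# a pair is kept if it satisfies any one rule (OR).
BLOCKING_RULES = [("firstname", "surname"), ("firstname", "dob")]


def tokenise(record):
    """Concatenate the matching columns and split into a list of tokens."""
    return " ".join(str(record[c]) for c in MATCH_COLUMNS).split()


def token_frequencies(recs):
    """Relative frequency of each token across all records."""
    counts = Counter(tok for r in recs for tok in tokenise(r))
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def candidate_pairs(recs):
    """Pairs of record ids that satisfy at least one blocking rule."""
    pairs = set()
    for a, b in combinations(recs, 2):
        if any(all(a[c] == b[c] for c in rule) for rule in BLOCKING_RULES):
            pairs.add((a["id"], b["id"]))
    return pairs


def probability_score(rec_a, rec_b, freqs):
    """Multiply the relative frequencies of the tokens both records share."""
    shared = set(tokenise(rec_a)) & set(tokenise(rec_b))
    score = 1.0
    for tok in shared:
        score *= freqs[tok]
    return score


freqs = token_frequencies(records)
pairs = candidate_pairs(records)
```

A lower score means the two records share rarer tokens, so agreement on an unusual surname counts for more than agreement on a common first name.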

Further details

You can find a full example with output dataframes at each stage here
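
As a rough illustration of the later stages in the graph (edit distance feature, logit model, accuracy on test data), here is a self-contained sketch in plain Python. The feature values and labels are invented, and `train_logit`, `predict` and `accuracy` are hypothetical helpers written for this example. The job itself may well rely on a library implementation instead.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logit(xs, ys, lr=0.1, epochs=5000):
    """Fit p(match) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def predict(w, b, x):
    """Classify a pair as a match (1) when the modelled probability is >= 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


def accuracy(w, b, xs, ys):
    """Share of pairs whose predicted label equals the true label."""
    hits = sum(predict(w, b, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


# Invented training data: edit distance per candidate pair, 1 = true match.
train_x = [0, 1, 1, 5, 6, 7]
train_y = [1, 1, 1, 0, 0, 0]
# Invented held-out pairs standing in for the test split.
test_x = [1, 0, 6, 5]
test_y = [1, 1, 0, 0]

w, b = train_logit(train_x, train_y)
test_accuracy = accuracy(w, b, test_x, test_y)
```

Only the edit distance feature is used here to keep the sketch short. The graph feeds the probability score into the model as a second feature.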

RobinL/data_linking_example
