yuguang/sortable
This repository contains my submission for the Sortable coding challenge.

Quick Start Instructions

Assuming Java is already installed, these instructions run the matching program on a single-node Spark cluster.

  1. Clone this repository with git clone https://github.com/yuguang/sortable.git

  2. Hadoop is a dependency for Spark. Install it, replacing ubuntu with your username:

     wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
     sudo tar zxvf hadoop-* -C /usr/local
     sudo mv /usr/local/hadoop-* /usr/local/hadoop
     sudo chown -R ubuntu /usr/local/hadoop
    
  3. Install Spark to /usr/local, again replacing ubuntu with your username:

     wget http://apache.mirror.gtcomm.net/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
     sudo tar zxvf spark-1.6.2-bin-hadoop2.6.tgz -C /usr/local
     sudo mv /usr/local/spark-* /usr/local/spark
     sudo chown -R ubuntu /usr/local/spark
    
  4. Run cd sortable to move into the project's root directory

  5. Install dependencies by running ./setup-virtualenv.sh

  6. Run ./run.sh

  7. Results will be saved in data_output/matchings.txt.

  8. Optionally, run the unit tests on the core classes:

     cd src
     python3 model.py
    

Methodology

Before starting this project, I researched the existing literature on Record Linkage and Entity Resolution. The problem of determining whether two pieces of information from disparate data sources reference the same product is well studied, with papers and libraries going back several years. I decided to use the dedupe Python library because it implements algorithms from recent literature. However, using such a library instead of rolling my own with Spark MLlib comes with drawbacks, the most obvious being reduced performance and the need to convert between data formats.

To fit the dedupe library into the Spark framework, I use Spark for preprocessing and postprocessing around the Entity Resolution step. Spark reads in JSON lines, maps them to Listing and Product objects, then converts them to dictionaries with fields for the dedupe library. I train the model for a many-to-one mapping of listings to products, then use the model to identify records that all refer to the same entity. Because the dedupe library returns matched product-listing pairs, I postprocess the results to group listings under their matching products.
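
Condensed into a sketch, the pipeline looks roughly like this (assuming dedupe 1.x's RecordLink API; the training file name and the match threshold here are illustrative, not necessarily what run.sh uses):

     import json
     from collections import defaultdict

     import dedupe
     from pyspark import SparkContext

     sc = SparkContext(appName='sortable-matching')

     # Preprocessing in Spark: parse the JSON lines (the real code maps
     # them through the Listing and Product classes first) and key each
     # record by an integer id, as dedupe expects
     listings = sc.textFile('data_input/listings.txt').map(json.loads).collect()
     products = sc.textFile('data_input/products.txt').map(json.loads).collect()
     listing_dict = dict(enumerate(listings))
     product_dict = dict(enumerate(products))

     # Field definitions are shown under Data Cleaning below
     fields = [{'field': 'manufacturer', 'type': 'String'},
               {'field': 'name', 'type': 'String'}]
     linker = dedupe.RecordLink(fields)

     # Train from labelled examples, then link the two datasets
     linker.sample(product_dict, listing_dict)
     with open('data_training/labelled.json') as f:
         linker.readTraining(f)
     linker.train()

     # match() yields ((product_id, listing_id), confidence) pairs;
     # postprocessing groups the listings under their matched product
     matches = defaultdict(list)
     for (product_id, listing_id), score in linker.match(
             product_dict, listing_dict, threshold=0.5):
         matches[product_id].append(listings[listing_id])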

Training Data

For training the record linkage algorithm, I created the training datasets data_training/listings.txt and data_training/products.txt. The lines in these files are taken directly from their respective files in data_input. The script process_training_data.py converts this raw training data into a JSON format readable by the dedupe library, reusing the product and listing classes to turn each record into a dictionary.
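
The conversion amounts to something like the following sketch (to_dict is a hypothetical helper; Listing and populate are described under Class Structure below):

     import json

     from model import Listing

     records = []
     with open('data_training/listings.txt') as f:
         for line in f:
             # populate normally takes a Spark DataFrame row; a parsed
             # JSON dict supports the same key access for this sketch
             listing = Listing()
             listing.populate(json.loads(line))
             records.append(listing.to_dict())  # hypothetical helper

     # Write the flattened records in a JSON format dedupe can read
     with open('data_training/listings.json', 'w') as out:
         json.dump(records, out)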

Data Cleaning

For record linkage algorithms to work more effectively, I split listing data into separate fields: manufacturer, family, name, and model. The major advantage of comparing field by field is that the algorithm does not weight the string distance of every field equally; behind the scenes, record linkage algorithms learn a different weight for each field.
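
With dedupe, the split fields would be declared roughly like this (a sketch; the per-field weights themselves are learned during training, not specified here):

     fields = [
         {'field': 'manufacturer', 'type': 'String'},
         {'field': 'family', 'type': 'String', 'has missing': True},
         {'field': 'model', 'type': 'String'},
         {'field': 'name', 'type': 'String'},
     ]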

Listings have inconsistent capitalization, so my product and listing classes return lowercased values for the manufacturer, family, and model fields. For product_name and title, which are stored as the name field in their respective classes, I also remove hyphens and punctuation.
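
A minimal sketch of that normalization (the function name is illustrative):

     import re
     import string

     def clean_name(raw):
         """Lowercase a product_name or title and replace hyphens and
         other punctuation with spaces before comparison."""
         lowered = raw.lower()
         return re.sub('[%s]' % re.escape(string.punctuation), ' ', lowered).strip()

     clean_name('Cyber-shot DSC-W310')  # -> 'cyber shot dsc w310'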

Class Structure

I have two classes, Product and Listing, both inheriting from AbstractProduct, an abstract base class. Product and Listing are the concrete implementations; each implements the populate method, which takes a Spark DataFrame row and sets the corresponding attributes. Because listings only carry the manufacturer field directly in their JSON, the two classes need separate populate implementations.
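
Condensed, the hierarchy looks roughly like this (attribute handling is simplified relative to src/model.py, and clean_name is the illustrative helper from the Data Cleaning section):

     from abc import ABC, abstractmethod

     class AbstractProduct(ABC):
         @abstractmethod
         def populate(self, row):
             """Set this object's attributes from a Spark DataFrame row."""

     class Product(AbstractProduct):
         def populate(self, row):
             # Products carry manufacturer, family, and model directly
             self.manufacturer = row['manufacturer'].lower()
             self.family = (row['family'] or '').lower()
             self.model = row['model'].lower()
             self.name = clean_name(row['product_name'])

     class Listing(AbstractProduct):
         def populate(self, row):
             # Listings only carry manufacturer; the title becomes name
             self.manufacturer = row['manufacturer'].lower()
             self.name = clean_name(row['title'])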

[Figure: UML class diagram of AbstractProduct, Product, and Listing]
