PRIDA: Pruning Irrelevant Datasets for Data Augmentation

Let Q be an input (query) dataset, t a target variable from Q, and M a machine learning model that uses Q to predict t. Given a set C of datasets that can be used to augment Q, the goal of this project is to prune the candidate datasets that are unlikely to improve the performance of M through data augmentation.

The main steps are:

1. Find Candidate Datasets. Given Q, the first step efficiently retrieves a set C of candidate datasets that can be used to augment Q. For now, we focus on augmentation by joins. Efficient data structures and algorithms, such as Lazo, have recently been proposed to tackle this problem.
2. Predict Performance Improvements from Candidate Augmentations. Given the set C of candidate datasets from step 1, the second step predicts whether each dataset is likely to improve M and prunes accordingly. To do that, we create a metamodel that classifies each candidate dataset c from C as relevant or irrelevant for augmentation, without performing the augmentation or re-training M.
3. Generate Training Data. To train, validate, and test the metamodel from step 2, we need to generate training (ground-truth) data, composed of different Q and C together with their corresponding labels (relevant or irrelevant) after augmentation.
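
The pruning idea in step 2 can be illustrated with a minimal sketch. It is not the code in this repository: the two features (key containment and number of new columns), the helper names, and the threshold rule standing in for the trained metamodel are all assumptions made for illustration. The point it shows is that each candidate is classified from cheap features of (Q, c) alone, without performing the join or re-training M.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]
Table = List[Row]


def key_containment(query: Table, candidate: Table, key: str) -> float:
    """Fraction of distinct query key values that also appear in the candidate."""
    q_keys = {row[key] for row in query if key in row}
    c_keys = {row[key] for row in candidate if key in row}
    if not q_keys:
        return 0.0
    return len(q_keys & c_keys) / len(q_keys)


def extract_features(query: Table, candidate: Table, key: str) -> List[float]:
    """Hypothetical feature vector: [key containment, number of new columns]."""
    q_cols = {col for row in query for col in row}
    c_cols = {col for row in candidate for col in row}
    return [key_containment(query, candidate, key), float(len(c_cols - q_cols))]


def prune(
    query: Table,
    candidates: Dict[str, Table],
    key: str,
    is_relevant: Callable[[Sequence[float]], bool],
) -> List[str]:
    """Return the names of candidates the classifier keeps; the rest are pruned."""
    return [
        name
        for name, cand in candidates.items()
        if is_relevant(extract_features(query, cand, key))
    ]


def threshold_rule(features: Sequence[float]) -> bool:
    """Stand-in for a trained metamodel (assumption): a fixed threshold rule."""
    containment, n_new_columns = features
    return containment >= 0.5 and n_new_columns >= 1


if __name__ == "__main__":
    Q = [{"id": i, "t": 10 * i} for i in range(1, 5)]
    C = {
        "weather": [{"id": i, "temp": 5 + i} for i in range(1, 4)],
        "unrelated": [{"id": 99, "x": 1}],
    }
    print(prune(Q, C, key="id", is_relevant=threshold_rule))
```

In practice `is_relevant` would wrap a classifier trained on the ground-truth data from step 3; the threshold rule only keeps the example self-contained.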

Predicting Performance Improvements and Classifying Candidates

The code to predict performance improvement is available in the improvement-prediction directory.

Generating Training Data

To generate ground-truth data for training and testing, we use datasets from the D3M project and from OpenML. The main idea is to break each dataset into different query and candidate datasets, randomly choosing the columns. We also randomly remove records from query and candidate datasets, to avoid perfect joins.
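
The splitting procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the repository's generator: it assumes each source dataset has a key column shared by both halves (for example, a row identifier), and the function name `split_dataset`, the `drop_fraction` parameter, and the list-of-dicts table representation are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, object]
Table = List[Row]


def split_dataset(
    rows: Table,
    key: str,
    target: str,
    drop_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[Table, Table]:
    """Split one dataset into a (query, candidate) pair.

    The query keeps the key, the target, and a random subset of the other
    columns; the candidate keeps the key and the remaining columns. Records
    are dropped at random from each side so the join is not perfect.
    """
    if not rows:
        raise ValueError("dataset is empty")
    rng = random.Random(seed)

    feature_cols = [c for c in rows[0] if c not in (key, target)]
    if len(feature_cols) < 2:
        raise ValueError("need at least two non-key, non-target columns")

    # Randomly choose which columns go to the query and which to the candidate.
    rng.shuffle(feature_cols)
    cut = rng.randint(1, len(feature_cols) - 1)
    query_cols = [key, target] + feature_cols[:cut]
    candidate_cols = [key] + feature_cols[cut:]

    def project_and_sample(cols: List[str]) -> Table:
        # Each record is kept independently with probability 1 - drop_fraction.
        kept = [r for r in rows if rng.random() >= drop_fraction]
        return [{c: r[c] for c in cols} for r in kept]

    return project_and_sample(query_cols), project_and_sample(candidate_cols)


if __name__ == "__main__":
    data = [{"id": i, "a": i, "b": 2 * i, "c": 3 * i, "y": i % 2} for i in range(10)]
    query, candidate = split_dataset(data, key="id", target="y")
    print(len(query), len(candidate))
```

Labeling each generated pair as relevant or irrelevant still requires joining the two halves and comparing the performance of M before and after augmentation; that step is omitted here.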

The code to generate training data is available in the data-generation directory.

VIDA-NYU/prida

Folders and files

Latest commit

History

Repository files navigation

PRIDA: Pruning Irrelevant Datasets for Data Augmentation

Predicting Performance Improvements and Classifying Candidates

Generating Training Data

About

Resources

License

Stars

Watchers

Forks

Languages