Optimized Conversion of Categorical and Numerical Features in Machine Learning Models

This study surveys categorical conversion methods for machine learning programs. Six datasets were provided, the largest containing over 40,000,000 tuples and 20 features. The task was to explore different strategies for converting categorical features into numerical features to be used as inputs to supervised learning algorithms. The goal was to determine which encoding techniques are the most effective and why. Methods were evaluated by the accuracy of predictive models, the area under the receiver operating characteristic (ROC) curve, and the computation time of the conversion process.

The problem was provided by Adobe Research.

Paper Abstract

While some data have an explicit, numerical form, many other data, such as gender or nationality, do not typically use numbers and are referred to as categorical data. Thus, machine learning algorithms need a way of representing categorical information numerically in order to be able to analyze them. Our project specifically focuses on optimizing the conversion of categorical features to a numerical form in order to maximize the effectiveness of various machine learning models. Of the methods we used, we found that Wide & Deep is the most effective model for datasets that contain high-cardinality features, as opposed to learned embedding and one-hot encoding.

Background of Problem

Supervised learning models are the cornerstone of many of the machine learning systems we encounter every day. They start with a pair of values (x, y), where x is the vector of features and y is the label. For a mathematical model that maps x to y, we need x to be a vector of numbers. Unfortunately, for many problems of interest, the inputs are not numeric. For example, a person’s gender may take one of the following values: Male, Female, Other, or Missing (Facebook allows 56 possible values for a person’s gender). Such features, with no inherent ordering of the values, are called categorical features. It is not always clear how to convert such a feature into a numeric value.
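To make the conversion problem concrete, here is a minimal sketch of one-hot encoding, one of the methods named in the abstract, applied to the gender example above. It uses only the Python standard library; the helper names `build_vocabulary` and `one_hot_encode` are hypothetical and are not taken from the project's sample code.

```python
from typing import Dict, List, Sequence


def build_vocabulary(values: Sequence[str]) -> Dict[str, int]:
    """Map each distinct category to a column index, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def one_hot_encode(values: Sequence[str], vocabulary: Dict[str, int]) -> List[List[int]]:
    """Return one row per value, with a single 1 in the category's column.

    Categories that are absent from the vocabulary become all-zero rows.
    """
    rows = []
    for value in values:
        row = [0] * len(vocabulary)
        if value in vocabulary:
            row[vocabulary[value]] = 1
        rows.append(row)
    return rows


# Toy column built from the example values in the text (illustrative only).
gender = ["Male", "Female", "Other", "Missing", "Female"]
vocabulary = build_vocabulary(gender)
encoded = one_hot_encode(gender, vocabulary)

for value, row in zip(gender, encoded):
    print(value, row)
```

Each category gets its own column, so the encoded width equals the number of distinct values. That growth is one reason high-cardinality features are harder to handle with this encoding.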

Research Goal

Using a number of real datasets, explore different strategies for converting categorical features into numeric features. All the datasets have a set of input features/covariates (categorical and numerical). The comparisons should be made in terms of the following metrics: (1) accuracy of a predictive model, (2) area under the ROC curve (AUC), and (3) compute time of the conversion process.
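The three metrics can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical. The AUC is computed from its pairwise definition (the probability that a randomly chosen positive example is scored above a randomly chosen negative one), which is quadratic in the number of rows and only practical for small inputs.

```python
import time
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of rows whose thresholded score matches the 0/1 label."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def timed(convert: Callable[[T], R], data: T) -> Tuple[R, float]:
    """Run a conversion function and return its result and elapsed seconds."""
    start = time.perf_counter()
    result = convert(data)
    return result, time.perf_counter() - start


# Toy labels and model scores (made up for illustration).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("accuracy:", accuracy(labels, scores))
print("AUC:", roc_auc(labels, scores))

# Time a stand-in conversion step (here, just upper-casing strings).
converted, seconds = timed(lambda column: [v.upper() for v in column], ["a", "b"])
print("conversion time (s):", seconds)
```

For datasets of the sizes listed below, a sort-based AUC or an established library implementation would be the more realistic choice.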

Datasets

Name of Data Set	Size	Training Size	Testing Size	Features	Prediction Task	Comments
Criteo Conversion	15,898,883	70%	30%	9 numerical + 9 categorical	Click	-
Amazon Employee Access	32,769	70%	30%	9 categorical	Is access appropriate for an employee?	-
Avazu Click-Through Rate Prediction	40,428,968	50%	50%	20 categorical	Click on advertisement	Mobile app advertisement data
KDD 2009	50,000	70%	30%	189 categorical + 20 continuous	Two responses: churn (16% positive) and appetency (2% positive)	The original data has 15K variables; all available categorical variables were kept, along with only the top 20 continuous variables by absolute correlation, abs(cor)
US Census 1990	2,458,285	70%	30%	67 categorical	Artificial task created to predict if a person is married	This is US census data that has been obfuscated. A number of interesting variables are available. The task is a concocted one
Adult	48,842	67%	33%	8 categorical	Predict if one's income is > 50k	-
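The training and testing percentages in the table can be reproduced with a simple split. The sketch below assumes a seeded random shuffle; the source does not say whether the splits were random or sequential, and the function name `train_test_split` and the seed value are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Row = TypeVar("Row")


def train_test_split(
    rows: Sequence[Row], train_fraction: float, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Shuffle a copy of the rows and cut it at the requested fraction."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten stand-in rows split 70% / 30%, as for most datasets in the table.
rows = list(range(10))
train, test = train_test_split(rows, train_fraction=0.7)
print(len(train), len(test))
```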

Authors: Wren Paris-Moe, Thomas Butler, Emily Liang, Andrea Stine

Note: only 2 of 6 datasets were under the size limit for uploading to GitHub

wrenparismoe/Categorical-Feature-Conversion
