Skip to content

Given an online ad, this application uses SVM, Naive Bayes, and Logistic Regression to predict whether the ad will be clicked by a user or not. The application works on Apache Spark and is designed to work on a large distributed dataset on a cluster.

Notifications You must be signed in to change notification settings

Abhishek-Arora/Ad-Click-Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Ad Click Prediction

Given an online ad, this application uses Apache Spark's MLlib library to implement SVM, Naive Bayes, and Logistic Regression to predict whether the ad will be clicked by a user or not.

Data

Data was taken from Criteo Labs and is sample of Kaggle Display Advertising Challenge Dataset. It can be downloaded after you accept the agreement http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/.

It is structured as lines of observations where first is click or no click(1,0) and rest are features.

The columns are tab separeted with the following schema: <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>

When a value is missing, the field is just empty. There is no label field in the test set.

The training dataset consists of a portion of Criteo's traffic over a period of 7 days. Each row corresponds to a display ad served by Criteo and the first column is indicates whether this ad has been clicked or not. The positive (clicked) and negatives (non-clicked) examples have both been subsampled (but at different rates) in order to reduce the dataset size.

There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes. The semantic of these features is undisclosed. Some features may have missing values.

The rows are chronologically ordered.

Process

  1. Sample is first parsed and loaded in context.
  2. Transformed so it can be used in support vector machines, logistic regression, and naive bayes
  3. Model created from the training data after making a 70-15-15 split for the training, validation, and testing set, respectively.
  4. Iterations are performed on each three algorithms to generate the model with best parameters.

About

Given an online ad, this application uses SVM, Naive Bayes, and Logistic Regression to predict whether the ad will be clicked by a user or not. The application works on Apache Spark and is designed to work on a large distributed dataset on a cluster.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages