enchantedToMeetYou/Spark_Linear_Regression
title: README
author: Dusan Grubjesic email: grubjesic.dusan@gmail.com
date: August 11, 2015
output: html_document

Click rate prediction algorithm

This is a click rate prediction algorithm built on Apache Spark and written with pyspark, the Python API of Spark.

Data

The data comes from Criteo Labs and is a sample of the Kaggle Display Advertising Challenge dataset. It can be downloaded after you accept the agreement at http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/.

It is structured as lines of observations: the first field of each line is the label (1 for click, 0 for no click) and the remaining fields are features.
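To make the line layout concrete, here is a minimal sketch of splitting one observation into its label and features. The function name `parse_line` is hypothetical and is not taken from ClickRate.py, and the tab separator and the empty-string encoding of missing values are assumptions about the sample file.

```python
def parse_line(line, sep="\t"):
    """Split one raw observation into (label, features).

    Assumes fields are separated by `sep` (a tab here, which is an
    assumption about the sample file) and that the first field is the
    click label, 1 or 0.
    """
    fields = line.rstrip("\n").split(sep)
    label = int(fields[0])    # 1 = click, 0 = no click
    features = fields[1:]     # kept as raw strings; empty string = missing
    return label, features


if __name__ == "__main__":
    # A made-up line, not real data from the sample.
    label, features = parse_line("1\t5\t\t68fd1e64\n")
    print(label, features)
```

Keeping the features as raw strings at this stage leaves the choice of encoding to the later transformation step.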

Before you start

You must have Apache Spark and Python installed. In ClickRate.py, change the location of the sample to wherever you downloaded it, and change the Spark context if you want to run on a cluster instead of locally. The .sh file only makes starting the job simpler; if you want to use it, adjust it to your own settings.

I used Apache Spark pre-built for Hadoop 2.6, Python 3.4 and the numpy package.
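One possible way to avoid editing the script each time is to read the master URL and the sample path from the environment. This is only a sketch, not how ClickRate.py does it: the variable names `CLICKRATE_MASTER` and `CLICKRATE_SAMPLE`, the default file name and the helper functions are all hypothetical.

```python
import os


def resolve_master(default="local[*]"):
    """Return the Spark master URL, preferring an environment override.

    CLICKRATE_MASTER is a hypothetical variable name used for illustration.
    """
    return os.environ.get("CLICKRATE_MASTER", default)


def resolve_sample_path(default="dac_sample.txt"):
    """Return the sample location; the default file name is an assumption."""
    return os.environ.get("CLICKRATE_SAMPLE", default)


def make_context(app_name="ClickRate"):
    """Create a SparkContext for the resolved master URL."""
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(resolve_master())
    return SparkContext(conf=conf)


if __name__ == "__main__":
    sc = make_context()
    raw = sc.textFile(resolve_sample_path())  # one RDD element per line
    print(raw.count())
    sc.stop()
```

With this layout, switching from local mode to a cluster would only require setting `CLICKRATE_MASTER` to the cluster's master URL before starting the job.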

Process

  1. The sample is parsed and loaded into the Spark context.
  2. The data is transformed so that it can be used in logistic regression.
  3. A model is created from the training data.
  4. The model is validated with a set of log loss checks.
  5. Logistic regression is rerun iteratively to find the best hyperparameters.

Additional explanations are in the code.
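Steps 4 and 5 can be illustrated without Spark. The sketch below computes log loss for predicted click probabilities and picks the hyperparameter combination with the lowest validation loss. All names are hypothetical, the clipping constant `eps` is an assumption, and `evaluate` stands in for whatever trains a model and scores it on validation data in ClickRate.py.

```python
import math
from itertools import product


def log_loss(p, y, eps=1e-11):
    """Log loss of one prediction: p = predicted P(click), y = 0 or 1."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -math.log(p) if y == 1 else -math.log(1 - p)


def mean_log_loss(predictions):
    """Average log loss over an iterable of (p, y) pairs."""
    losses = [log_loss(p, y) for p, y in predictions]
    return sum(losses) / len(losses)


def grid(step_sizes, reg_params):
    """All combinations of the two hyperparameter lists, as dicts."""
    return [{"step": s, "reg_param": r}
            for s, r in product(step_sizes, reg_params)]


def best_by_log_loss(candidates, evaluate):
    """Return the candidate for which evaluate(candidate) is lowest.

    `evaluate` is a caller-supplied function that trains a model with the
    given hyperparameters and returns its validation log loss.
    """
    return min(candidates, key=evaluate)


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    print(mean_log_loss([(0.9, 1), (0.2, 0), (0.6, 1)]))
```

A useful baseline for step 4 is the log loss obtained by always predicting the fraction of clicks in the training data; a trained model should score below it.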
