danielnazareth89/603-Masters-Project-Apache-Spark-

603-Masters-Project-Apache-Spark-

Abstract: Online advertising is a billion-dollar industry with three major players: publishers such as the New York Times and ESPN, which make money by displaying ads on their websites; advertisers, typically product-based companies, which pay to have their products displayed on the publishers' pages; and matchmakers such as Google, Yahoo, and Microsoft, which decide dynamically which ads to display on search and other pages and earn revenue based on how often users click. Since user engagement can easily fall as low as 1%, the click-through rate (CTR) prediction problem aims to estimate the conditional probability that a user will click on an ad, given a massive dataset of predictive features such as ad content, historical performance, and, wherever available, user- and publisher-specific information. We address the problem using logistic regression and analyze the scalability and efficiency of our solution on a dataset of approximately 40 million rows of anonymized user-ad interaction data.
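To make the modelling idea concrete, here is a minimal sketch of logistic regression estimating a click probability, written in plain Python rather than Spark. It is not the project's code: the single binary feature, the toy rows, the learning rate, and the function names `sigmoid`, `predict_ctr`, and `train` are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function: maps any real z into (0, 1).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_ctr(weights, bias, features):
    # Estimated P(click | features) under the logistic model.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def train(rows, labels, lr=0.5, epochs=500):
    # Batch gradient descent on the average log loss.
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_ctr(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical toy data: one binary feature (for example, "ad matches the
# page topic"); label 1 means the ad was clicked.
rows = [[1.0], [1.0], [1.0], [1.0], [0.0], [0.0], [0.0], [0.0]]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

weights, bias = train(rows, labels)
p_hi = predict_ctr(weights, bias, [1.0])  # estimated CTR when the feature is set
p_lo = predict_ctr(weights, bias, [0.0])  # estimated CTR when it is not
```

On real data the same model would be fitted over many sparse, high-dimensional features, which is where a distributed engine such as Spark becomes relevant.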

Spark can be run under various environment configurations. The parameters that can be varied are the number of nodes, the number of executors, the cores per executor, and the memory per executor. We show that performance is highly dependent on the Spark configuration, and we have tried to find the optimal values of these parameters for logistic regression.
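One way such a parameter sweep could be organized is sketched below. It only builds `spark-submit` argument lists and does not launch anything. The flags `--num-executors`, `--executor-cores`, and `--executor-memory` are standard `spark-submit` options (`--num-executors` applies on cluster managers such as YARN). The script name `ctr_logreg.py`, the helper names, and the candidate values are hypothetical and not taken from this project. The number of nodes is a property of the cluster itself and is therefore not part of the command line here.

```python
from itertools import product

def config_grid(executor_counts, core_counts, memory_sizes):
    # Cartesian product of the candidate values, one dict per configuration.
    return [
        {"num_executors": n, "executor_cores": c, "executor_memory": m}
        for n, c, m in product(executor_counts, core_counts, memory_sizes)
    ]

def spark_submit_args(app, num_executors, executor_cores, executor_memory):
    # Argument list suitable for subprocess.run(...); nothing is executed here.
    return [
        "spark-submit",
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        app,
    ]

# Illustrative candidate values only (assumed, not the values used in the project).
grid = config_grid([2, 4], [1, 2], ["2g", "4g"])
commands = [spark_submit_args("ctr_logreg.py", **cfg) for cfg in grid]
```

Each command could then be timed on the same input to compare configurations, keeping the dataset and the model settings fixed so that only the Spark parameters vary.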
