Abstract: Online advertising is a billion-dollar industry with three major players: publishers, such as the New York Times and ESPN, which earn money by displaying ads on their websites; advertisers, typically product-based companies, which pay to have their products displayed on publishers' pages; and matchmakers, such as Google, Yahoo, and Microsoft, which decide dynamically which ads to display on search and other pages and earn revenue based on how often users click. Because user engagement can easily be as low as 1%, the click-through rate (CTR) prediction problem aims to estimate the conditional probability that a user will click on an ad, using a massive dataset of predictive features such as ad content, historical performance, and, wherever available, user- and publisher-specific information, among others. We address the problem using logistic regression and analyze the scalability and efficiency of our solution on a dataset of approximately 40 million rows of anonymized user-ad interaction data.
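For reference, the conditional click probability estimated by a standard logistic regression model can be written as below. The notation is assumed here for illustration and is not taken from the paper: x is the feature vector for a user-ad impression, y is the click label (1 for a click, 0 otherwise), w and b are the learned weights and bias, and N is the number of training rows.

```latex
\[
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
\[
\mathcal{L}(\mathbf{w}, b)
  = -\frac{1}{N}\sum_{i=1}^{N}
    \Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr],
\qquad
p_i = \sigma(\mathbf{w}^{\top}\mathbf{x}_i + b)
\]
```

The second expression is the usual log loss minimized during training; any regularization term the authors may have used is not shown.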
Spark can be run under various environment configurations. The parameters that can be varied are the number of nodes, the number of executors, the cores per executor, and the memory per executor. We show that performance depends strongly on the Spark configuration, and we attempt to identify the optimal values of these parameters for logistic regression.
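As a minimal sketch of how such a parameter sweep might be organized, the Python snippet below enumerates combinations of executor count, cores per executor, and memory per executor, and builds the matching `spark-submit` command for each. The flags `--num-executors`, `--executor-cores`, and `--executor-memory` are standard `spark-submit` options (with `--num-executors` applying on YARN). The application name `ctr_logreg.py`, the helper names, and the grid values are hypothetical and are not the settings used in this work. The number of nodes is a property of the cluster itself and is not set here.

```python
from itertools import product


def spark_submit_args(num_executors, executor_cores, executor_memory_gb,
                      app="ctr_logreg.py"):
    """Build the spark-submit argument list for one configuration.

    The default application name is a hypothetical placeholder.
    """
    return [
        "spark-submit",
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", f"{executor_memory_gb}g",
        app,
    ]


def config_grid(executors, cores, memory_gb):
    """Yield every combination of the three tunable executor parameters."""
    for n, c, m in product(executors, cores, memory_gb):
        yield {
            "num_executors": n,
            "executor_cores": c,
            "executor_memory_gb": m,
        }


if __name__ == "__main__":
    # Illustrative values only; not the configurations evaluated in the paper.
    for cfg in config_grid(executors=[2, 4], cores=[2, 4], memory_gb=[4, 8]):
        print(" ".join(spark_submit_args(**cfg)))
```

Each printed command could then be launched and timed separately, so that training time can be compared across configurations.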