Analysing Semi-Structured Data with Apache Spark
Log data comes from many sources, such as web, file, and compute servers, application logs, and user-generated content. It can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, detecting fraud, and much more. Here, Spark is used to explore and mine real Apache web server log files. The data set comes from the NASA Kennedy Space Center web server in Florida. The full data set is freely available at http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html and contains all HTTP requests for two months. We use a subset that contains only several days' worth of requests.
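Before any exploration, each raw log line has to be turned into named fields. The following is a minimal sketch of that step in plain Python, assuming the logs follow the Apache Common Log Format. The names LOG_PATTERN, parse_log_line, and SAMPLE_LINE are hypothetical, and the sample line is an illustrative example written for this sketch, not a record taken from the NASA data set.

```python
import re
from datetime import datetime

# Assumed layout (Apache Common Log Format):
#   host identity user [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\d+|-)$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    host, _identity, _user, ts, method, path, protocol, status, size = match.groups()
    return {
        "host": host,
        # Month abbreviations such as "Aug" are parsed with the current locale.
        "timestamp": datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z"),
        "method": method,
        "path": path,
        "protocol": protocol,
        "status": int(status),
        # A size of "-" means no content was returned; treat it as 0 bytes.
        "size": 0 if size == "-" else int(size),
    }


# Illustrative line only (hypothetical host and path).
SAMPLE_LINE = (
    'host.example.com - - [01/Aug/1995:00:00:01 -0400] '
    '"GET /images/logo.gif HTTP/1.0" 200 1839'
)
record = parse_log_line(SAMPLE_LINE)
```

Because the parser is an ordinary function, it could be applied to every line of a log file in Spark, for example by mapping it over the RDD returned by SparkContext.textFile and filtering out the None results. Returning None for unparseable lines, rather than raising, makes it easy to count and inspect malformed records separately.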