
BssMsi/PySpark-Training


Assignment for the PySpark Training

Extracts insights from Apache access logs by transforming them and visualizing the results as plots.
View the main notebook, ProcessLogs.ipynb, to render the dynamic map.

Dataset

The data consists of Apache access logs obtained from freely available open sources.
The access logs contain more than 6 million rows in total.
The data has not been uploaded to GitHub due to size constraints; the source URLs are listed in the "GetData.ipynb" notebook.
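
As a rough illustration of what one row of such a log looks like once parsed, here is a minimal sketch in plain Python. It assumes the logs follow the Apache Common Log Format; the actual format of this dataset is not stated here, and the sample line is the example from the Apache documentation, not a row from the dataset. The names `LOG_PATTERN` and `parse_line` are hypothetical and do not come from the notebooks.

```python
import re
from datetime import datetime

# Assumption: Common Log Format, i.e.
#   host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Parse one access-log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    # A well-formed request line is "METHOD PATH PROTOCOL".
    parts = fields["request"].split()
    well_formed = len(parts) == 3
    fields["method"] = parts[0] if well_formed else None
    fields["path"] = parts[1] if well_formed else None
    return fields


# Example line taken from the Apache HTTP Server documentation.
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_line(sample)
print(record["host"], record["method"], record["path"], record["status"], record["size"])
```

In PySpark, a similar pattern could be applied per column with `pyspark.sql.functions.regexp_extract`, using numbered groups instead of Python's named groups; whether the notebooks take that approach is not confirmed here.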

Guide to files

  1. ProcessLogs.ipynb - Main notebook containing all the transformations and plots
  2. ProcessRDDLogs.ipynb - Attempt to transform the same data using RDDs; it took much more time and was therefore abandoned
  3. iplocation/ - Contains the Python code for the web crawler built on Scrapy
  4. GetData.ipynb - Helper notebook to download and process the data; also includes some R&D
  5. locations.json - Contains location information for each IP, obtained by crawling the web with the Scrapy-based crawler in iplocation/
