azuranop/Cloudera-Data-Scientist-Challenge-3

This is the submission package for Doug Needham

Data Science Challenge 3

The Cloudera Data Science Challenge 3 Description

The full write-up for this solution is in this directory as Doug_Needham_DSC3_Write_Up.pdf.

This code assumes that it runs under the user ID "dln". The required HDFS directory structure is:

  • /user/dln/problem1
  • /user/dln/problem1/driver
  • /user/dln/problem1/svm
  • /user/dln/problem2
  • /user/dln/problem3
  • /user/dln/problem3/inGraph
  • /user/dln/problem3/OutGraph

The shell script "setup.sh" performs the appropriate hadoop fs -mkdir -p commands to create the directories.
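As a rough illustration of what such a setup script could look like, here is a minimal sketch. It is not the actual setup.sh from this repository: the function name `setup_commands`, the `RUN` switch, and the dry-run default are assumptions made for the example. Only the directory list and the `hadoop fs -mkdir -p` command come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a setup script; the real setup.sh may differ.
# TGT_DATA is the HDFS root for this submission (default taken from the list above).
TGT_DATA="${TGT_DATA:-/user/dln}"

# Print one "hadoop fs -mkdir -p" command per required directory.
setup_commands() {
  for dir in problem1 problem1/driver problem1/svm \
             problem2 \
             problem3 problem3/inGraph problem3/OutGraph; do
    printf 'hadoop fs -mkdir -p %s/%s\n' "$TGT_DATA" "$dir"
  done
}

# Dry run by default (an assumption of this sketch): print the commands only.
# Set RUN=1 to execute them against the cluster.
if [ "${RUN:-0}" = "1" ]; then
  setup_commands | sh
else
  setup_commands
fi
```

Printing the commands first makes it easy to confirm the target paths before anything is created in HDFS.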

As for the source data, all of the code that follows assumes the data for the challenge is in the following locations:

  • /user/dsc/famous/spam.log
  • /user/dsc/famous/web.log
  • /user/dsc/winklr/Winklr-network.csv
  • /user/dsc/winklr/Winklr-topClickPairs.csv
  • /user/dsc/smartfly/smartfly_historic.csv
  • /user/dsc/smartfly/smartfly_scheduled.csv

Both of the previous assumptions are used to set these environment variables in the individual shell scripts:

  • SRC_DATA=/user/dsc
  • TGT_DATA=/user/dln
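Because every script depends on these two variables, a quick pre-flight check can catch a missing input before a long job starts. The sketch below is illustrative only and is not part of the submitted scripts: the function names `source_paths` and `check_sources` are hypothetical. It relies on `hadoop fs -test -e`, which exits with status 0 when the path exists.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not taken from the repository's scripts.
SRC_DATA="${SRC_DATA:-/user/dsc}"
TGT_DATA="${TGT_DATA:-/user/dln}"

# Print the full HDFS path of every expected source file.
source_paths() {
  for rel in famous/spam.log famous/web.log \
             winklr/Winklr-network.csv winklr/Winklr-topClickPairs.csv \
             smartfly/smartfly_historic.csv smartfly/smartfly_scheduled.csv; do
    printf '%s/%s\n' "$SRC_DATA" "$rel"
  done
}

# Return 0 if every source file exists in HDFS, 1 otherwise.
# Each missing path is reported on stderr.
check_sources() {
  missing=0
  for path in $(source_paths); do
    if ! hadoop fs -test -e "$path"; then
      echo "missing: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A problem script could call `check_sources || exit 1` near the top, assuming the `hadoop` client is on the PATH.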

The three requested deliverables are under the directory named "answer". These are the "master" answers; no automation is used to copy the files from the individual code directories to the answer directory:

  • answer/
  • answer/problem1.csv
  • answer/problem2.json
  • answer/problem3.csv

The structure of the directories for the code is as follows (the output directories created by sbt are omitted for brevity):

  • answer
  • problem1
    • analysis
    • data
    • log
    • PredictFlights
  • problem2
    • data
    • json
    • log
  • problem3
    • AnalyzeGraph
    • data
    • final
    • inGraph
    • OutGraph
    • log

The shell script to run each problem is in the individual problem directory.

  • problem1/problem1.sh
  • problem2/problem2.sh
  • problem3/problem3.sh

Each of these can be run as a background process, for example with problem1.sh &, since logging within each shell script goes to the log directory.

problem1.sh and problem3.sh can each be run with a single command-line argument. Both of these scripts are data driven, in that each has a file that drives the process. For problem1 it is a list of airports; for problem3 it is a list of originating vertices. The command-line argument "throttles" the process so that it runs only a certain number of airports (problem1) or from-vertices (problem3).
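The throttling idea can be sketched as follows. This is an illustration under assumptions, not the code in problem1.sh or problem3.sh: the function names `throttle`, `process_entry`, and `run_driver` are hypothetical, and treating a missing argument as "process every entry" is an assumption of the sketch.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a data-driven, throttled loop.

# Emit the first LIMIT lines of a driver file, or every line when
# LIMIT is empty (assumed behavior when no argument is given).
throttle() {
  driver_file=$1
  limit=${2:-}
  if [ -n "$limit" ]; then
    head -n "$limit" "$driver_file"
  else
    cat "$driver_file"
  fi
}

# Placeholder for the per-airport (problem1) or per-vertex (problem3) work.
process_entry() {
  echo "processing $1"
}

# Run process_entry once for each entry that passes the throttle.
run_driver() {
  throttle "$1" "${2:-}" | while IFS= read -r entry; do
    process_entry "$entry"
  done
}
```

With this shape, a script would call something like `run_driver "$DRIVER_FILE" "$1"`, where `DRIVER_FILE` is a hypothetical variable pointing at the airport or vertex list, and a throttled background run would look like `./problem1.sh 5 &`.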

Thank you,

Doug Needham

dougthedataguy@gmail.com

About

My solution for the Cloudera Data Science Challenge 3. Spark MLlib for Smartfly. Spark GraphX for Winklr. Python Streaming for web log analysis
