Data Algorithm Practice

This project is for practicing the development of data algorithms with Hadoop MapReduce and Spark.

Directory Structure

root
 |─ etc
 |    |─ hadoop     # hadoop configuration files; copy them to $HADOOP_HOME/etc/hadoop/
 |─ input   # store input data
 |─ output  # output data will come here
 |─ src
 |    |─ main
 |    |    |─ java
 |    |    |     |─org.dataalgorithms.border.mapreduce      # Border Crossing Entry
 |    |    |     |─org.dataalgorithms.mag                   # Open Academic Graph
 |    |    |     |─org.dataalgorithms.netflix               # Process data from Netflix Prize Data
 |    |    |     |─org.dataalgorithms.stock.mapreduce       # Stock moving average
 |    |    |     |─org.dataalgorithms.wordcount.mapreduce   # Tokenize and count words
 |    |    |
 |    |    └─ resources  # resource files
 |    |─ test  # all tests for java
 |    └─ python
 |         |─ main
 |         |     |─ settings
 |         |     |─ border  # border crossing entry with Hadoop streaming and pyspark
 |         |     |─ netflix # calculate content similarity
 |         |     |─ stock   # Stock moving average with pyspark
 |         |     |─ base    # helper classes for MapReduce
 |         |
 |         |─ setup.py
 |
 |─ pom.xml    # project build settings
 └─ README.md  # this file

Setting up

Set up a single-node cluster for standalone and pseudo-distributed operation: link

Versions

# Java (Maven)
hadoop 2.7.7
spark 2.4.5

# Python
python 3.7
pyspark 2.4.5

Set up data in HDFS

# run hadoop cluster (format HDFS, then start the daemons; Hadoop 2.x scripts)
$ hdfs namenode -format
$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start datanode
$ yarn-daemon.sh start resourcemanager
$ yarn-daemon.sh start nodemanager

# send data to hdfs
# (change the path in hdfs accordingly)
$ hadoop fs -mkdir -p /user/hdfs/input
$ hadoop fs -put /path/to/dataset input
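
To confirm the upload, the data can be read back from HDFS, for example with PySpark. A minimal sketch (the path matches the directory created above and may need to be adjusted):

# sketch: read the uploaded dataset back from HDFS to confirm it is accessible
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-check").getOrCreate()

df = spark.read.csv("hdfs:///user/hdfs/input", header=True)
df.show(5)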

Border Crossing Entry data processing

Directory paths

# Hadoop MapReduce
src/main/java/org/dataalgorithms/border

# python scripts (Hadoop streaming and PySpark)
src/python/main/border

Data link: Border Crossing Entry Data

The target output is the same as the one described in this link.

Execute code

MapReduce with Hadoop

# Copy data into hdfs
hdfs dfs -put /path/to/data/* input

# run mapreduce to group by date, border, measure
# the result will be saved as `report.csv`
hadoop jar /path/to/jar org.dataalgorithms.border.mapreduce.Executor input output

# run mapreduce to get top N data from processed data (report.csv)
# (argument -n is optional, default value is 10)
hadoop jar /path/to/jar org.dataalgorithms.border.mapreduce.TopNExtractor -i output/report.csv -o output/topN -n 10

Hadoop Streaming

# run mapreduce by hadoop streaming
# (note: currently this only groups records and sorts them in ascending
#        date order (old -> new), so that aggregation and averaging can be done later)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
    -input <input> \
    -output <output> \
    -mapper /path/to/main/border/mapper.py \
    -reducer /path/to/main/border/reducer.py \
    -file /path/to/`wheel-file-name`.whl
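
The streaming scripts live under src/python/main/border and are shipped together with the helper wheel via -file. As a rough sketch of the streaming contract only (the real mapper.py and reducer.py differ, and the column indices below are assumptions about the CSV layout):

#!/usr/bin/env python
# mapper.py (sketch): read CSV rows from stdin and emit tab-separated key/value
# pairs keyed by (date, border, measure) so the shuffle groups and sorts them
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if not fields or fields[0] == "Port Name":   # skip the header row (assumed layout)
        continue
    border, date, measure, value = fields[3], fields[4], fields[5], fields[6]
    print(f"{date},{border},{measure}\t{value}")

#!/usr/bin/env python
# reducer.py (sketch): records arrive grouped and sorted by key; they are simply
# re-emitted here, leaving aggregation and averaging to a later step
import sys

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    print(f"{key}\t{value}")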

Spark (PySpark)

# run with spark on local machine (pyspark)
spark-submit \
    --master local[*] \
    /path/to/spark.py

# if run on a cluster
spark-submit \
    --master <master url> \
    /path/to/dir/main/border/spark.py
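
spark.py reads the Border Crossing Entry data and produces the grouped report. A minimal PySpark DataFrame sketch of the same idea (column names and the input path are assumptions about the dataset; the actual script differs):

# sketch: group border crossing entries by (Date, Border, Measure) and sum the values
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("border-crossing").getOrCreate()

# hypothetical input path and columns based on the public Border Crossing Entry dataset
df = spark.read.csv("input/border_crossing.csv", header=True, inferSchema=True)

report = (df.groupBy("Date", "Border", "Measure")
            .agg(F.sum("Value").alias("Value"))
            .orderBy("Date", "Border", "Measure"))

report.write.csv("output/report", header=True)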

Netflix Prize Data

Data link: Netflix Prize Data (kaggle)

Detail link: doc

Directory paths

# Hadoop MapReduce
src/main/java/org/dataalgorithms/netflix

# Spark app for data analysis
src/python/main/netflix

Huge stock market dataset

Data link: Huge stock market dataset (kaggle)

Purpose

Calculate moving average of stock market price.

Input data structure:
$ head -n 5 aadr.us.txt
Date,Open,High,Low,Close,Volume,OpenInt
2010-07-21,24.333,24.333,23.946,23.946,43321,0
2010-07-22,24.644,24.644,24.362,24.487,18031,0
2010-07-23,24.759,24.759,24.314,24.507,8897,0
2010-07-26,24.624,24.624,24.449,24.595,19443,0
Target data structure:
Code    Date    MovingAverage

# Parameters
#   Code: company code extracted from input file name
#   Date: Latest date in the window of the moving average
#       e.g. range of window is 2010-01-01 - 2010-01-05 => Date: 2010-01-05
#   MovingAverage: double value of average

Each code represents a company, and the moving average is calculated over the close price. (The window size is currently set to 5.)
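
As a rough sketch of the calculation (not the repo's implementation), the per-company moving average can be computed in PySpark with a window function; the column names follow the input structure above, and the Code column is assumed to be derived from the file name:

# sketch: 5-row moving average of the close price per company code
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

WINDOW_SIZE = 5

spark = SparkSession.builder.appName("stock-moving-average").getOrCreate()

# hypothetical input: one CSV file as shown above, with Code taken from the file name
df = (spark.read.csv("input/aadr.us.txt", header=True, inferSchema=True)
           .withColumn("Code", F.lit("aadr")))

window = (Window.partitionBy("Code")
                .orderBy("Date")
                .rowsBetween(-(WINDOW_SIZE - 1), 0))

result = (df.withColumn("MovingAverage", F.avg("Close").over(window))
            # keep only rows covered by a full window
            .withColumn("rows_in_window", F.count("Close").over(window))
            .where(F.col("rows_in_window") == WINDOW_SIZE)
            .select("Code", "Date", "MovingAverage"))

result.show(5)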

Directory paths

# Hadoop MapReduce
src/main/java/org/dataalgorithms/stock

# Spark (pyspark)
src/python/main/stock

Execute code

# execute
# hadoop
hadoop jar /path/to/jar org.dataalgorithms.stock.mapreduce.StockDriver <input> <output> [-n <window size>]

# pyspark
spark-submit \
    --master local[*] \
    /path/to/dir/main/stock/app.py -i <input> -o <output> [-n <window size>]

Word count

Count words following the tutorial, sort them by count, and return the top N words with their counts.

# Hadoop MapReduce
src/main/java/org/dataalgorithms/wordcount

# execute
hadoop jar /path/to/jar org.dataalgorithms.wordcount.mapreduce.WordCounter <input> <output> [-n 10]
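
The Java implementation lives under src/main/java/org/dataalgorithms/wordcount. As an illustrative PySpark sketch of the same top-N word count (not part of the repo; the input path and N are hypothetical):

# sketch: tokenize, count words, and take the top N by count
from operator import add
from pyspark.sql import SparkSession

TOP_N = 10

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("input/words.txt")
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(add))

# sort by count (descending) and keep the top N words
for word, count in counts.takeOrdered(TOP_N, key=lambda kv: -kv[1]):
    print(word, count)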

Open Academic Graph

Process data from the Open Academic Graph. OAG v1 is used to obtain the full data.

This is used for data analysis practice.
