This project is for practicing the development of Hadoop and Spark data algorithms.
root
|─ etc
| |─ hadoop # Hadoop configuration files; copy them to $HADOOP_HOME/etc/hadoop/
|─ input # stores input data
|─ output # output data is written here
|─ src
| |─ main
| | |─ java
| | | |─ org.dataalgorithms.border.mapreduce # Border Crossing Entry
| | | |─ org.dataalgorithms.mag # Open Academic Graph
| | | |─ org.dataalgorithms.netflix # process data from Netflix Prize Data
| | | |─ org.dataalgorithms.stock.mapreduce # stock moving average
| | | |─ org.dataalgorithms.wordcount.mapreduce # tokenize and count words
| | |
| | └─ resources # resource files
| |─ test # all Java tests
| └─ python
| |─ main
| | |─ settings
| | |─ border # Border Crossing Entry with Hadoop streaming and PySpark
| | |─ netflix # calculate content similarity
| | |─ stock # stock moving average with PySpark
| | |─ base # helper classes for MapReduce
| |
| |─ setup.py
|
|─ pom.xml # project build settings
└─ README.md # this file
Set up a single-node cluster for standalone and pseudo-distributed operation: link
# Java (Maven)
hadoop 2.7.7
spark 2.4.5
# Python
python 3.7
pyspark 2.4.5
## Run the Hadoop cluster
$ hdfs namenode -format
$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start datanode
$ yarn-daemon.sh start resourcemanager
$ yarn-daemon.sh start nodemanager
# Send data to HDFS
# (change the HDFS path accordingly)
$ hadoop fs -mkdir -p /user/hdfs/input
$ hadoop fs -put /path/to/dataset input
# Hadoop MapReduce
src/main/java/org/dataalgorithms/border
# Python scripts (Hadoop streaming and PySpark)
src/python/main/border
Data link: Border Crossing Entry Data
The target output is the same as link.
# Copy data into HDFS
hdfs dfs -put /path/to/data/* input
# Run MapReduce to group by date, border, and measure.
# The result will be saved as `report.csv`.
hadoop jar /path/to/jar org.dataalgorithms.border.mapreduce.Executor input output
# Run MapReduce to get the top N entries from the processed data (report.csv).
# (The -n argument is optional; the default value is 10.)
hadoop jar /path/to/jar org.dataalgorithms.border.mapreduce.TopNExtractor -i output/report.csv -o output/topN -n 10
# Run MapReduce via Hadoop streaming.
# (Note: currently this only groups the data and sorts it in ascending
# order (old -> new), for later aggregation and averaging.)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
-input <input> \
-output <output> \
-mapper /path/to/main/border/mapper.py \
-reducer /path/to/main/border/reducer.py \
-file /path/to/`wheel-file-name`.whl
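For orientation, here is a minimal sketch of what a streaming mapper/reducer pair for this grouping could look like. It is a hypothetical illustration, not the repository's scripts (those live in src/python/main/border), and the column positions are assumptions based on the public Border Crossing Entry Data layout:

```python
#!/usr/bin/env python3
# mapper.py -- hypothetical sketch, not the repository's script.
# Assumed CSV layout: Port Name, State, Port Code, Border, Date, Measure, Value, Location
import csv
import sys

for row in csv.reader(sys.stdin):
    if not row or row[0] == "Port Name":  # skip blanks and the header row
        continue
    border, date, measure, value = row[3], row[4], row[5], row[6]
    # Emit a composite key; streaming treats text before the first tab as the key
    print(f"{border},{date},{measure}\t{value}")
```

```python
#!/usr/bin/env python3
# reducer.py -- hypothetical sketch: sums values per (border, date, measure) key.
# Streaming delivers lines sorted by key, so tracking key changes is enough.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

The composite (border, date, measure) key is emitted as a single comma-joined field because, by default, Hadoop streaming uses everything before the first tab as the shuffle key.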
# Run with Spark on the local machine (PySpark)
spark-submit \
--master local[*] \
/path/to/spark.py
# If running on a cluster
spark-submit \
--master <master-url> \
/path/to/dir/main/border/spark.py
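For reference, a minimal PySpark sketch of the same grouping job. The input path and column names below are assumptions taken from the public Border Crossing Entry Data schema, not the exact contents of src/python/main/border/spark.py:

```python
# Hypothetical sketch of the grouping step; see src/python/main/border/spark.py
# for the real implementation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BorderCrossing").getOrCreate()

# Column names assumed from the Border Crossing Entry dataset
df = spark.read.csv("input/Border_Crossing_Entry_Data.csv",
                    header=True, inferSchema=True)

report = (df.groupBy("Date", "Border", "Measure")
            .agg(F.sum("Value").alias("Value"))
            .orderBy("Date"))  # ascending: old -> new

report.write.csv("output/report", header=True)
spark.stop()
```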
Data link: Netflix Prize Data (kaggle)
Detail link: doc
# Hadoop MapReduce
src/main/java/org/dataalgorithms/netflix
# Spark app for data analysis
src/python/main/netflix
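The directory tree above notes that the Netflix app calculates content similarity. As one hedged illustration of what such a computation can look like in PySpark (the schema, file name, and cosine-over-co-ratings approach are all assumptions; the actual analysis in src/python/main/netflix may differ):

```python
# Hypothetical sketch of a rating-based similarity computation; the actual
# analysis in src/python/main/netflix may differ.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("NetflixSimilarity").getOrCreate()

# Assumed schema (parsed from the Prize files): movie_id, user_id, rating
ratings = spark.read.csv("input/ratings.csv", header=True, inferSchema=True)

# Pair up movies rated by the same user
r1 = ratings.select("user_id",
                    F.col("movie_id").alias("m1"),
                    F.col("rating").alias("r1"))
r2 = ratings.select("user_id",
                    F.col("movie_id").alias("m2"),
                    F.col("rating").alias("r2"))
pairs = r1.join(r2, "user_id").where(F.col("m1") < F.col("m2"))

# Cosine similarity over co-rated users for each movie pair
similarity = pairs.groupBy("m1", "m2").agg(
    (F.sum(F.col("r1") * F.col("r2")) /
     (F.sqrt(F.sum(F.col("r1") ** 2)) *
      F.sqrt(F.sum(F.col("r2") ** 2)))).alias("cosine"))

similarity.orderBy(F.desc("cosine")).show()
```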
Data link: Huge stock market dataset (kaggle)
Calculate the moving average of stock market prices.
$ head -n 5 aadr.us.txt
Date,Open,High,Low,Close,Volume,OpenInt
2010-07-21,24.333,24.333,23.946,23.946,43321,0
2010-07-22,24.644,24.644,24.362,24.487,18031,0
2010-07-23,24.759,24.759,24.314,24.507,8897,0
2010-07-26,24.624,24.624,24.449,24.595,19443,0
Code Date MovingAverage
# Fields
# Code: company code, extracted from the input file name
# Date: the latest date in the moving-average window
#   e.g. a window spanning 2010-01-01 - 2010-01-05 => Date: 2010-01-05
# MovingAverage: the average as a double value
Each output row carries the company code, and the moving average is calculated over the Close price. (The window size is currently set to 5.)
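As a sketch of the computation, a PySpark window function can produce the same (Code, Date, MovingAverage) records. The file name, column names, and partial-window handling here are illustrative assumptions, not the repository's exact code:

```python
# Hypothetical sketch of the 5-row moving average over the Close price;
# the repository's jobs are in src/main/java/org/dataalgorithms/stock
# and src/python/main/stock and may handle edge cases differently.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StockMovingAverage").getOrCreate()

df = spark.read.csv("input/aadr.us.txt", header=True, inferSchema=True)
code = "aadr"  # company code extracted from the input file name

# A window of the current row plus the 4 preceding rows, ordered by date.
# Note: the first 4 rows average over fewer than 5 prices.
w = Window.orderBy("Date").rowsBetween(-4, 0)

result = (df.withColumn("MovingAverage", F.avg("Close").over(w))
            .withColumn("Code", F.lit(code))
            .select("Code", "Date", "MovingAverage"))

result.show(5)
spark.stop()
```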
# Hadoop MapReduce
src/main/java/org/dataalgorithms/stock
# Spark (pyspark)
src/python/main/stock
# Execute
# Hadoop
hadoop jar /path/to/jar org.dataalgorithms.stock.mapreduce.StockDriver <input> <output> [-n <window size>]
# PySpark
spark-submit \
--master local[*] \
/path/to/dir/main/stock/app.py -i <input> -o <output> [-n <window size>]
Count words based on the tutorial, sort them by count, and return the top N words with their counts.
# Hadoop MapReduce
src/main/java/org/dataalgorithms/wordcount
# execute
hadoop jar /path/to/jar org.dataalgorithms.wordcount.mapreduce.WordCounter <input> <output> [-n 10]
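For illustration, the core counting and top-N logic condenses to a few lines of plain Python. This is a hypothetical sketch, not the repository's Java MapReduce implementation:

```python
# Hypothetical Python illustration of the word-count + top-N logic; the
# repository implements this as the Java MapReduce job shown above.
import re
from collections import Counter

def top_n_words(text, n=10):
    # Tokenize into lowercase words and count occurrences
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

with open("input/sample.txt") as f:
    for word, count in top_n_words(f.read(), n=10):
        print(f"{word}\t{count}")
```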
Process data from the Open Academic Graph. To get the full dataset, OAG v1 was used.
This is used for data analysis practice.