Version 2.0
- Start Spark from the command line with parameters; the input parameters are the input/output file paths.
- Use Spark to train a machine learning model and make predictions.
- Save results as JSON to the output path.
- Supported models (a minimal end-to-end sketch follows this list):
  - LinearRegression
  - NaiveBayes
  - RandomForest
  - KMeans
  - LogisticRegression
  - SVM
  - Decision Tree
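The team's actual PySpark.py is not reproduced here; below is a minimal sketch of the intended flow (paths taken from argv, one MLlib model trained, predictions written as JSON lines). The app name, the choice of LogisticRegressionWithLBFGS, and numClasses=3 are illustrative assumptions, not taken from the real script.

```python
# Minimal sketch, not the actual PySpark.py: train one MLlib model and
# write predictions as JSON lines to the output path.
import sys
import json

from pyspark import SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

if __name__ == "__main__":
    train_path, test_path, output_path = sys.argv[1], sys.argv[2], sys.argv[3]
    sc = SparkContext(appName="SparkTeamML")  # app name is an assumption

    # Load LibSVM-formatted data as RDDs of LabeledPoint
    train = MLUtils.loadLibSVMFile(sc, train_path)
    test = MLUtils.loadLibSVMFile(sc, test_path)

    # Train one of the supported models; the sample multiclass dataset has 3 classes
    model = LogisticRegressionWithLBFGS.train(train, numClasses=3)

    # Predict on the test set and save each record as one JSON line
    results = test.map(lambda p: json.dumps(
        {"label": p.label, "prediction": model.predict(p.features)}))
    results.saveAsTextFile(output_path)

    sc.stop()
```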
- Download Spark from the official website (latest version: 1.6.0). Choose the package type appropriate for your Hadoop version.
- To build Spark and its example programs, run:
build/mvn -DskipTests clean package
- Install Python 2.7 (Python 3 should also be supported).
- Log in to the cluster:
ssh honeycomb@128.2.7.38
(password: ask teammates)
- Copy files to the cluster's local filesystem:
scp SOURCE_FILE_PATH honeycomb@128.2.7.38:/home/honeycomb/SparkTeam
e.g.:
scp /Users/jacobliu/PySpark.py honeycomb@128.2.7.38:/home/honeycomb/SparkTeam
- Put files into HDFS:
HADOOP_USER_NAME=hdfs hdfs dfs -put LOCAL_FILE_PATH HDFS_FILE_PATH
e.g.:
HADOOP_USER_NAME=hdfs hdfs dfs -put /home/honeycomb/SparkTeam/sample_multiclass_classification_data_test.txt /user/spark/input
- Put the train/test datasets into HDFS (see above), then submit PySpark.py:
YOUR_SPARK_PATH/bin/spark-submit YOUR_SCRIPT_PATH/PySpark.py YOUR_TRAIN_DATA_HDFS_PATH YOUR_TEST_DATA_HDFS_PATH YOUR_OUTPUT_HDFS_PATH
e.g.:
/bin/spark-submit /home/honeycomb/SparkTeam/PySpark.py /user/spark/input/sample_multiclass_classification_data.txt /user/spark/input/sample_multiclass_classification_data_test.txt /user/spark/out/
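Once the job finishes, the output directory holds the JSON lines in part files. A quick, hypothetical check from the pyspark shell (where sc is predefined), assuming the JSON-lines output sketched above:

```python
# Read the job's output back and show a few records (path is the example above)
import json
results = sc.textFile("/user/spark/out/").map(json.loads)
print(results.take(5))  # e.g. [{"label": 0.0, "prediction": 0}, ...]
```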
- Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
- Hadoop Version: http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
- Python API Docs: https://spark.apache.org/docs/1.5.2/api/python/index.html
- Machine Learning Library (MLlib) Guide: http://spark.apache.org/docs/latest/mllib-guide.html