About Me

Hello, my name is Sophie Wang and I'm pursuing a Master's degree in Data Science at the University of San Francisco. This is the individual project I completed for the Distributed Computing course.

MSDS694-IoT-sensor-project

Activity Recognition using smartphone and smartwatch data in Apache Spark

In this repository, you will find all my Python scripts as well as the corresponding Jupyter notebooks, where you can see all the intermediate results printed.

Project Description

The dataset is the 'WISDM Smartphone and Smartwatch Activity and Biometrics' dataset from the UCI repository, which contains readings collected by the gyroscopes and accelerometers of smartphones and smartwatches. The goal is to classify human activities by applying machine learning techniques in a distributed computing setting (Spark ML and Spark + H2O).

The project consists of six parts (including EDA and machine learning):

Part 1

Load all data from the subfolders at once as RDDs
Remove all null values
Convert the RDDs to Spark DataFrames
Join the activity code DataFrame with the sensor DataFrame
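
The Part 1 steps can be sketched roughly as follows. This is a minimal illustration, not the code in part_1.py: it assumes each raw line looks like `user,activity_code,timestamp,x,y,z;` and each activity key line looks like `name = code`, and the function names, column names, and path arguments are all hypothetical.

```python
def parse_sensor_line(line):
    """Parse one raw reading; return None for blank or malformed lines.

    Assumed line format: user,activity_code,timestamp,x,y,z;
    """
    fields = line.strip().rstrip(";").split(",")
    if len(fields) != 6 or any(f.strip() == "" for f in fields):
        return None
    try:
        return (int(fields[0]), fields[1].strip(), int(fields[2]),
                float(fields[3]), float(fields[4]), float(fields[5]))
    except ValueError:
        return None


def parse_activity_key(line):
    """Parse an assumed 'name = code' line into (code, name), or None."""
    name, sep, code = line.partition("=")
    if not sep or not name.strip() or not code.strip():
        return None
    return (code.strip(), name.strip())


def load_joined(spark, data_glob, key_path):
    """Load readings and activity keys as RDDs, drop nulls, and join them.

    `spark` is an existing SparkSession. `data_glob` is a glob such as
    "raw/*/*/*.txt" (hypothetical) so that all subfolders load at once.
    Device and sensor columns are omitted here to keep the sketch short.
    """
    sc = spark.sparkContext
    readings = (sc.textFile(data_glob)
                .map(parse_sensor_line)
                .filter(lambda row: row is not None))
    keys = (sc.textFile(key_path)
            .map(parse_activity_key)
            .filter(lambda row: row is not None))
    readings_df = spark.createDataFrame(
        readings, ["user_id", "activity_code", "timestamp", "x", "y", "z"])
    keys_df = spark.createDataFrame(keys, ["activity_code", "activity"])
    return readings_df.join(keys_df, on="activity_code", how="inner")
```

Keeping the parsing in plain functions means malformed lines are dropped before the RDDs become DataFrames, and the helpers can be checked without a Spark cluster.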

Part 2

Identify which activities are related to eating
Check the number of activity types for each device, sensor and user
Check the min, max, standard deviation and percentiles of the gyroscope and accelerometer readings

Part 3

Encode the categorical columns by first applying StringIndexer and then OneHotEncoder
Combine all the feature columns using VectorAssembler
Scale the assembled features with StandardScaler
Divide the dataset into a training set (80%) and a test set (20%)
Fit a logistic regression model with cross validation
Check the evaluation metric areaUnderROC on the test set (0.611)
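
The Part 3 steps above could be wired together along these lines. This is a sketch under assumptions, not the contents of part_3.py: the column names, the `_idx`/`_vec` suffixes, the regularization grid, and the fold count are illustrative, the multi-column `OneHotEncoder` assumes Spark 3.x, and the label is assumed to be binary because areaUnderROC is a binary metric.

```python
def feature_column_plan(categorical_cols, numeric_cols):
    """Return (indexed, encoded, assembled) column names for the stages."""
    indexed = [f"{c}_idx" for c in categorical_cols]
    encoded = [f"{c}_vec" for c in categorical_cols]
    return indexed, encoded, encoded + list(numeric_cols)


def build_logistic_cv(categorical_cols, numeric_cols, label_col="label"):
    """Build a cross-validated logistic regression pipeline.

    Imports are deferred so the helper above works without PySpark.
    An active SparkSession is required when this function is called.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import (OneHotEncoder, StandardScaler,
                                    StringIndexer, VectorAssembler)
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    indexed, encoded, assembled = feature_column_plan(
        categorical_cols, numeric_cols)
    indexers = [StringIndexer(inputCol=c, outputCol=i)
                for c, i in zip(categorical_cols, indexed)]
    encoder = OneHotEncoder(inputCols=indexed, outputCols=encoded)
    assembler = VectorAssembler(inputCols=assembled, outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    pipeline = Pipeline(stages=indexers + [encoder, assembler, scaler, lr])

    # Hypothetical grid; the values actually tuned are not stated here.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    evaluator = BinaryClassificationEvaluator(
        labelCol=label_col, metricName="areaUnderROC")
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    return cv, evaluator
```

A typical use would be `train, test = df.randomSplit([0.8, 0.2], seed=42)`, then `model = cv.fit(train)` and `evaluator.evaluate(model.transform(test))` to obtain the test-set areaUnderROC.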

Part 4

Fit a random forest classifier model with cross validation
Check the evaluation metric areaUnderROC on the test set (0.803, much better than the logistic regression)
Fit a gradient boosted tree classifier model with cross validation
Check the evaluation metric areaUnderROC on the test set (0.933, better than the random forest classifier)

Part 5

(Note: Parts 3-4 use Spark ML; Parts 5-6 use Sparkling Water, i.e. H2O with Spark.)

Fit an H2O gradient boosted tree classifier model with cross validation
Check the evaluation metric areaUnderROC on the test set (0.866)
Fit an H2O deep learning model with cross validation
Check the evaluation metric areaUnderROC on the test set (0.945)

Part 6

Apply AutoML on the dataset and return the leaderboard
Fit the leader model from the leaderboard and check the evaluation metric areaUnderROC on the test set (0.9597, the highest score so far!)

Comparing Spark ML and H2O, I found H2O much easier to use since it takes care of the data-preprocessing steps automatically (StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler). AutoML in the H2O package is even simpler since it automatically searches for the best-performing algorithm and provides many model interpretation visualizations (feature importance, for example). Note: If you want to use H2O instead of Spark ML, you have to convert the Spark DataFrame (row-based) to an H2OFrame (column-based).
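
The conversion and AutoML steps might look like the sketch below. It is an illustration under assumptions rather than the code in part_6.py: it assumes Sparkling Water (`pysparkling`) and `h2o` are installed, that a SparkSession is already running, and that the label column is binary; the function names, `max_models`, and the seed are hypothetical.

```python
def predictor_columns(all_columns, label_col):
    """Every column except the label is used as a predictor."""
    return [c for c in all_columns if c != label_col]


def run_automl(spark_df, label_col="label", max_models=10, seed=42):
    """Convert a Spark DataFrame to an H2OFrame and run H2O AutoML.

    Imports are deferred so the helper above works without H2O installed.
    Older Sparkling Water releases expect the SparkSession to be passed
    to H2OContext.getOrCreate; recent ones take no argument.
    """
    from pysparkling import H2OContext
    from h2o.automl import H2OAutoML

    hc = H2OContext.getOrCreate()
    frame = hc.asH2OFrame(spark_df)  # row-based Spark -> column-based H2O
    # H2O treats the task as classification only if the label is a factor.
    frame[label_col] = frame[label_col].asfactor()
    train, test = frame.split_frame(ratios=[0.8], seed=seed)

    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=predictor_columns(frame.columns, label_col),
              y=label_col, training_frame=train)
    test_auc = aml.leader.model_performance(test).auc()
    return aml.leaderboard, test_auc
```

Note that no indexing, encoding, assembling, or scaling stages appear here, which is the convenience described above.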

