predict simple human activity from smartphone accelerometer data

bfetler/human_activity

Human Activity Prediction Using Smartphones Data

Can we predict human physical activity from smartphone accelerometer and gyroscope measurements? The Human Activity Recognition Using Smartphones Data Set, published by the researchers behind the original study and available from the UC Irvine Machine Learning Repository, helps address this question.

Accelerometer and gyroscope data from a Samsung Galaxy S II smartphone was measured for 30 subjects performing six activities. The smartphone data contained 561 variable columns, plus one column for the subject. Most of the accelerometer and gyroscope data is reported as seemingly random values in the range -1.0 to +1.0. Each subject repeated the activities over 50 times, resulting in over 10,000 rows of data. The data was split into 21 subjects for training and 9 subjects for testing. The six activities were:

  • WALKING
  • WALKING_UPSTAIRS
  • WALKING_DOWNSTAIRS
  • SITTING
  • STANDING
  • LAYING
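A subject-wise split like the 21/9 division described above keeps all rows from one person on the same side of the split. The following is a minimal sketch of how such a split could be reproduced with scikit-learn's `GroupShuffleSplit`. The data is a randomly generated stand-in, and the array names and sizes are assumptions for illustration, not the repository's own loading code.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 30 subjects, 20 rows each, 5 feature columns
# in the range [-1, 1]. The real data set has 561 feature columns.
n_subjects, rows_per_subject, n_features = 30, 20, 5
subject = np.repeat(np.arange(1, n_subjects + 1), rows_per_subject)
X = rng.uniform(-1.0, 1.0, size=(subject.size, n_features))
y = rng.integers(0, 6, size=subject.size)  # six activity codes

# An integer test_size holds out that many whole groups (subjects),
# so 9 subjects go to the test set and 21 remain for training.
splitter = GroupShuffleSplit(n_splits=1, test_size=9, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject))

train_subjects = np.unique(subject[train_idx])
test_subjects = np.unique(subject[test_idx])
print(len(train_subjects), "training subjects,", len(test_subjects), "test subjects")
```

Splitting by subject rather than by row avoids scoring a model on people it has already seen during training.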

The set of activity categories is fairly simple, but may serve as a template for more complex behavior. Using the Python scikit-learn library, we were able to predict these activities with roughly 80% to 90% accuracy on test data.

Prediction Methods

Two approaches were tried: Random Forest, and PCA followed by SVM or Logistic Regression. Both suggest that only 20 to 30 variables out of 500+ are needed to classify behavior.

Random Forest Optimization

Data exploration of the training data is given in read_clean_data.py, with script output in read_clean_data.txt and plots in human_activity_plots/. Several column labels appeared to be duplicated or redundant, so the data was reduced to 478 columns.

A grid search over the maximum number of features per split (max_features) and the number of estimators, or trees (n_estimators), gave 80% to 90% accuracy on the training set under cross-validation. Each parameter set was cross-validated three times, showing some variation by max_features. However, variation due to max_features was often not statistically significant (t-test p-value greater than 0.05), and score variation due to n_estimators was also not significant. Boxplots were created from the cross-validation scores to show the range of variation within the data.
[Figure: Random Forest Score by Max Features at each split (max_features)]
[Figure: Random Forest Score by Number of Estimators (n_estimators)]
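The grid search and significance check described above can be sketched as follows. This is an illustration under stated assumptions, not the code in read_clean_data.py: the data is synthetic (`make_classification`), the grid values are placeholders, "cross-validated three times" is read as three-fold cross-validation, and the helper `fold_scores` is a hypothetical name introduced here.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic six-class stand-in for the training data (assumption).
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=10,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# Placeholder grid; the values searched in the repository may differ.
param_grid = {"n_estimators": [10, 50], "max_features": ["sqrt", "log2"]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)


def fold_scores(results, **params):
    """Return the three per-fold test scores for one parameter combination."""
    idx = next(i for i, p in enumerate(results["params"]) if p == params)
    return np.array([results[f"split{k}_test_score"][idx] for k in range(3)])


results = search.cv_results_
sqrt_scores = fold_scores(results, n_estimators=50, max_features="sqrt")
log2_scores = fold_scores(results, n_estimators=50, max_features="log2")

# Welch's t-test on the fold scores; a p-value above 0.05 means the
# difference between the two settings is not statistically significant.
t_stat, p_value = stats.ttest_ind(sqrt_scores, log2_scores, equal_var=False)
print("best parameters:", search.best_params_)
print(f"sqrt vs log2 p-value: {p_value:.3f}")
```

With only three scores per setting the test has little power, which is consistent with differences often failing to reach significance.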

Near-optimum parameters were estimated at {n_estimators: 100, max_features: 'sqrt'}. Prediction on test data with these parameters gave 90% accuracy.

Test data accuracy for each activity is shown by classification report below. Laying may be more separable from the others due to inactivity, but a score of 100% is probably not statistically reliable.

| Activity | Walking | Walking Upstairs | Walking Downstairs | Sitting | Standing | Laying |
|---|---|---|---|---|---|---|
| Precision | 0.83 | 0.89 | 0.95 | 0.95 | 0.92 | 1.00 |
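Per-activity precision of this kind can be computed with scikit-learn's metrics functions. The sketch below uses the near-optimum parameters quoted above but runs on synthetic data with a random train/test split, both of which are assumptions, so its numbers will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_score
from sklearn.model_selection import train_test_split

ACTIVITIES = [
    "WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
    "SITTING", "STANDING", "LAYING",
]

# Synthetic stand-in data; class k is treated as ACTIVITIES[k] (assumption).
X, y = make_classification(
    n_samples=900, n_features=40, n_informative=10,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# average=None returns one precision value per class, in label order.
per_class_precision = precision_score(y_test, y_pred, average=None)
print(classification_report(y_test, y_pred, target_names=ACTIVITIES))
```

The `support` column of the report gives the number of test rows per class, which helps judge how much weight a perfect score deserves.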

Random Forest Prediction

Further prediction was done with the full set of columns, given in clean_predict_allvar.py, with script output in clean_predict_allvar.txt and plots in human_activity_plots/. Train, validation and test data were reduced to a smaller set of subjects. Classification parameters were explored more thoroughly in a grid search with cross-validation, with similar results: up to 90% prediction accuracy on validation data. Variation is shown in a boxplot. Near-optimum parameters {n_estimators: 50, max_features: 'log2'} were used to predict test data with 80% accuracy.

A cross-validation table and confusion matrix plot show a clear distinction between active (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS) and sedentary (SITTING, STANDING, LAYING) activities, with some correlation within each group. Prediction accuracy of each activity is between 60% and 90%.

[Figure: Confusion Matrix]
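The active versus sedentary separation can be read directly from a confusion matrix by summing its blocks. The sketch below uses a small set of hand-written labels, which are hypothetical and chosen only to illustrate the block structure. They are not predictions from the repository.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = [
    "WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",  # active
    "SITTING", "STANDING", "LAYING",                      # sedentary
]

# Hypothetical true and predicted labels, two rows per activity.
y_true = [
    "WALKING", "WALKING", "WALKING_UPSTAIRS", "WALKING_UPSTAIRS",
    "WALKING_DOWNSTAIRS", "WALKING_DOWNSTAIRS",
    "SITTING", "SITTING", "STANDING", "STANDING", "LAYING", "LAYING",
]
y_pred = [
    "WALKING", "WALKING_UPSTAIRS", "WALKING_UPSTAIRS", "WALKING_UPSTAIRS",
    "WALKING_DOWNSTAIRS", "WALKING",
    "SITTING", "STANDING", "STANDING", "SITTING", "LAYING", "LAYING",
]

# Rows are true activities, columns are predicted activities.
cm = confusion_matrix(y_true, y_pred, labels=LABELS)

within_active = cm[:3, :3].sum()     # active rows predicted as active
within_sedentary = cm[3:, 3:].sum()  # sedentary rows predicted as sedentary
across_groups = cm[:3, 3:].sum() + cm[3:, :3].sum()

# Fraction of each activity's rows that were predicted correctly.
per_activity_accuracy = cm.diagonal() / cm.sum(axis=1)
print(cm)
print("errors across groups:", across_groups)
```

In this toy example every error stays inside its own group, which is the pattern the confusion matrix plot is described as showing.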

The top ten columns account for 19% of total feature importance, and their membership varies from one repetition to the next. They typically include:

  • tGravityAcc_min_X
  • tGravityAcc_Mean_X
  • tGravityAcc_max_X
  • angle_X_gravityMean
  • tGravityAcc_energy_X
  • tGravityAcc_energy_Z
  • tGravityAcc_max_Z
  • angle_Z_gravityMean
  • tGravityAcc_Mean_Z
  • angle_Y_gravityMean
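A ranking like this can be obtained from a fitted forest's `feature_importances_` attribute. The following sketch runs on synthetic data with placeholder column names (`feature_0`, `feature_1`, ...), both assumptions, so the names and the share it prints will differ from the list above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names (assumption).
X, y = make_classification(
    n_samples=600, n_features=60, n_informative=12,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
columns = [f"feature_{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=50, max_features="log2", random_state=0)
clf.fit(X, y)

# Importances are normalized to sum to 1 across all columns.
importances = clf.feature_importances_
top_ten = np.argsort(importances)[::-1][:10]
top_ten_share = importances[top_ten].sum()

for i in top_ten:
    print(f"{columns[i]:12s} {importances[i]:.3f}")
print(f"top ten share of total importance: {top_ten_share:.1%}")
```

Refitting with a different `random_state` is one way to see how much the ranking changes between repetitions.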

Prediction: PCA with SVM and Logistic Regression

Prediction using PCA as input to classifiers is given in pca_clf.py, with output in pca_clf_output.txt and plots in human_activity_pca_plots/. PCA dimensionality reduction was performed using all 562 variable columns. Just the first 10 principal components account for 91% of the explained variance, while 100 components account for 98%. The first 30 components, equal in number to about 5% of the 562 columns, account for 95% of the explained variance, which seems a reasonable size for classifier input.
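The component counts quoted above come from the cumulative explained variance ratio. A minimal sketch of that calculation follows, using synthetic low-rank data as a stand-in (an assumption), so the resulting counts will not match the README's figures.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 60 correlated columns driven by 8 latent factors.
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 60))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 60))

pca = PCA().fit(X)  # keep all components so the ratios sum to 1

# cumulative[k - 1] is the variance fraction explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% of variance:", n_for_95)
```

scikit-learn's `PCA` can also make this selection itself: passing a float such as `n_components=0.95` with `svd_solver="full"` keeps just enough components to reach that variance fraction.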

Using PCA as input to Logistic Regression to fit training data gives reasonable accuracy (85%) with only 10 components, increasing with the number of PCA components. Using 30 PCA components gives 89% ± 5% training data accuracy across all activities, with the error estimated from the variation across 10-fold cross-validation. Test data prediction accuracy is within this margin of error.
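The PCA plus Logistic Regression procedure can be expressed as a scikit-learn pipeline scored with 10-fold cross-validation. This is a sketch under assumptions rather than the code in pca_clf.py: the data is synthetic, and the spread is reported as the standard deviation across folds, which may not be exactly how the README's error was calculated.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic six-class stand-in for the training data (assumption).
X, y = make_classification(
    n_samples=600, n_features=60, n_informative=15,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

results = {}
for n_components in (10, 30):
    # Putting PCA inside the pipeline refits it on each training fold,
    # so the held-out fold never influences the projection.
    model = make_pipeline(
        PCA(n_components=n_components),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, X, y, cv=10)
    results[n_components] = (scores.mean(), scores.std())
    print(f"{n_components} components: {scores.mean():.2f} +/- {scores.std():.2f}")
```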

[Figure: Logistic Regression Score with Varying Number of PCA Components]

Test data accuracy for each activity is between 85% and 93%, as shown by classification report below. Again, laying may be more separable from the others due to inactivity, but a score of 100% is probably not reliable.

| Activity | Walking | Walking Upstairs | Walking Downstairs | Sitting | Standing | Laying |
|---|---|---|---|---|---|---|
| Precision | 0.89 | 0.93 | 0.92 | 0.91 | 0.85 | 1.00 |

Repeating the procedure using PCA as input to LinearSVC gives results similar to Logistic Regression.
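Swapping the classifier while keeping the PCA step fixed makes this comparison direct. The sketch below assumes 30 components and synthetic data, and the `max_iter` values are chosen only to help convergence. Its scores are illustrative and say nothing about the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic six-class stand-in for the training data (assumption).
X, y = make_classification(
    n_samples=600, n_features=60, n_informative=15,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(max_iter=5000),
}

mean_scores = {}
for name, clf in classifiers.items():
    # Same 30-component PCA front end for both classifiers.
    model = make_pipeline(PCA(n_components=30), clf)
    mean_scores[name] = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: {mean_scores[name]:.2f}")
```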

PCA with Logistic Regression or SVM seems to be faster and more accurate than Random Forest alone.

Conclusion:

Can we predict human physical activity using smartphone data? Yes, within reasonable bounds for some simple activities.
