We analyze the Weight Lifting Exercises Dataset, which gathers data from accelerometers on the belt, forearm, arm, and dumbbell of 6 people performing barbell lifts correctly and incorrectly in 5 different ways. After fitting a predictive model for activity recognition to these data, we use it to predict the correctness of the exercises performed by new users.
We will use the dplyr package for working with data frames, the ggplot2 package for graphs, and the caret package for training and prediction.
library(dplyr); library(ggplot2); library(caret); set.seed(0)
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the webpage of the project (see the section on the Weight Lifting Exercise Dataset).
The data set originally comes from the Weight Lifting Exercises Dataset (see the section on the Weight Lifting Exercise Dataset on the webpage of the project; an archived copy is available at web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har). The training data for this project are also available at:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are also available at:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
We downloaded the data into a data/ folder as train.csv and test.csv.
raw_training <- read.csv("data/train.csv")
raw_testing <- read.csv("data/test.csv")
We are dealing with a 19622 x 160 training data set and a 20 x 160 test data set. Let us inspect the training data and leave the test data aside. Details are suppressed for brevity, but we explicitly show the names of the variables in the data set for the convenience of the reader.
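A quick check of the dimensions confirms the figures quoted above:
dim(raw_training)
## [1] 19622 160
dim(raw_testing)
## [1] 20 160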
head(raw_training)     # first rows (output suppressed)
tail(raw_training)     # last rows (output suppressed)
str(raw_training)      # structure of the variables (output suppressed)
summary(raw_training)  # summary statistics (output suppressed)
names(raw_training)    # variable names (shown below)
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt"
## [9] "pitch_belt" "yaw_belt"
## [11] "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt"
## [15] "skewness_roll_belt" "skewness_roll_belt.1"
## [17] "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
## [21] "min_roll_belt" "min_pitch_belt"
## [23] "min_yaw_belt" "amplitude_roll_belt"
## [25] "amplitude_pitch_belt" "amplitude_yaw_belt"
## [27] "var_total_accel_belt" "avg_roll_belt"
## [29] "stddev_roll_belt" "var_roll_belt"
## [31] "avg_pitch_belt" "stddev_pitch_belt"
## [33] "var_pitch_belt" "avg_yaw_belt"
## [35] "stddev_yaw_belt" "var_yaw_belt"
## [37] "gyros_belt_x" "gyros_belt_y"
## [39] "gyros_belt_z" "accel_belt_x"
## [41] "accel_belt_y" "accel_belt_z"
## [43] "magnet_belt_x" "magnet_belt_y"
## [45] "magnet_belt_z" "roll_arm"
## [47] "pitch_arm" "yaw_arm"
## [49] "total_accel_arm" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "gyros_arm_x"
## [61] "gyros_arm_y" "gyros_arm_z"
## [63] "accel_arm_x" "accel_arm_y"
## [65] "accel_arm_z" "magnet_arm_x"
## [67] "magnet_arm_y" "magnet_arm_z"
## [69] "kurtosis_roll_arm" "kurtosis_picth_arm"
## [71] "kurtosis_yaw_arm" "skewness_roll_arm"
## [73] "skewness_pitch_arm" "skewness_yaw_arm"
## [75] "max_roll_arm" "max_picth_arm"
## [77] "max_yaw_arm" "min_roll_arm"
## [79] "min_pitch_arm" "min_yaw_arm"
## [81] "amplitude_roll_arm" "amplitude_pitch_arm"
## [83] "amplitude_yaw_arm" "roll_dumbbell"
## [85] "pitch_dumbbell" "yaw_dumbbell"
## [87] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [89] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [91] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [93] "max_roll_dumbbell" "max_picth_dumbbell"
## [95] "max_yaw_dumbbell" "min_roll_dumbbell"
## [97] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [99] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell" "total_accel_dumbbell"
## [103] "var_accel_dumbbell" "avg_roll_dumbbell"
## [105] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [107] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [109] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [111] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [113] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [115] "gyros_dumbbell_z" "accel_dumbbell_x"
## [117] "accel_dumbbell_y" "accel_dumbbell_z"
## [119] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [121] "magnet_dumbbell_z" "roll_forearm"
## [123] "pitch_forearm" "yaw_forearm"
## [125] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [127] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [129] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [131] "max_roll_forearm" "max_picth_forearm"
## [133] "max_yaw_forearm" "min_roll_forearm"
## [135] "min_pitch_forearm" "min_yaw_forearm"
## [137] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [139] "amplitude_yaw_forearm" "total_accel_forearm"
## [141] "var_accel_forearm" "avg_roll_forearm"
## [143] "stddev_roll_forearm" "var_roll_forearm"
## [145] "avg_pitch_forearm" "stddev_pitch_forearm"
## [147] "var_pitch_forearm" "avg_yaw_forearm"
## [149] "stddev_yaw_forearm" "var_yaw_forearm"
## [151] "gyros_forearm_x" "gyros_forearm_y"
## [153] "gyros_forearm_z" "accel_forearm_x"
## [155] "accel_forearm_y" "accel_forearm_z"
## [157] "magnet_forearm_x" "magnet_forearm_y"
## [159] "magnet_forearm_z" "classe"
The first 7 columns contain identifiers and timestamps that are not useful for our purposes, hence we drop them. We also convert to numeric those columns that are erroneously read as integers or characters. Then we drop the columns that are mostly (at least 90%) NAs. Finally, we check whether any columns have near-zero variance; none do, so no further filtering is needed.
training <- raw_training %>%
  # drop the identifier, timestamp, and window columns
  select(!c(X, user_name,
            raw_timestamp_part_1,
            raw_timestamp_part_2,
            cvtd_timestamp,
            new_window,
            num_window)) %>%
  # convert predictors to numeric, the outcome to a factor
  mutate(across(!classe, as.numeric),
         classe = factor(classe)) %>%
  # keep only columns that are less than 90% NA
  select(where(function(X) { mean(is.na(X)) < 0.9 }))
nzv <- nearZeroVar(training)
nzv
## integer(0)
# no near-zero-variance columns were found, so this step is not needed:
# training <- select(training, !all_of(nzv))
dim(training)
## [1] 19622 53
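As a sanity check, we can count how many columns the cleaning removed and verify that no missing values remain; in this data set the derived summary columns are either almost entirely NA or complete, so the NA count should be zero:
ncol(raw_training) - ncol(training)  # columns dropped by the cleaning
## [1] 107
sum(is.na(training))                 # remaining missing values
## [1] 0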
In order to select a model, we compare the accuracy of three models seen in the course: a decision tree, a random forest, and a gradient-boosted trees model (GBM).
We split our training data into a set for training the algorithms (70% of the data) and a set for validating the results and comparing their accuracy (30% of the data).
inTrain <- createDataPartition(y = training$classe,
p = 0.7,
list = FALSE)
trainSet <- training[inTrain,]
validationSet <- training[-inTrain,]
We also decide to perform 3-fold cross-validation in the training phase, to limit overfitting and obtain more reliable estimates of each model's performance.
control <- trainControl(method = "cv", number = 3)
We start by training a single decision tree:
Tree <- train(classe ~ ., data = trainSet,
method = "rpart", trControl = control)
rattle::fancyRpartPlot(Tree$finalModel)  # plot the fitted tree (requires the rattle package)
We estimate the accuracy on the training data:
confusionMatrix(predict(Tree, trainSet), trainSet$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.4973429 0.3438398 0.4889467 0.5057403 0.2843416
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
We then predict on the validation set and check the out-of-sample performance:
predTree <- predict(Tree, validationSet)
CnfMtxTree <- confusionMatrix(predTree, validationSet$classe)
CnfMtxTree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1519 484 475 422 171
## B 42 394 32 184 147
## C 109 261 519 358 278
## D 0 0 0 0 0
## E 4 0 0 0 486
##
## Overall Statistics
##
## Accuracy : 0.4958
## 95% CI : (0.483, 0.5087)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3408
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9074 0.34592 0.50585 0.0000 0.44917
## Specificity 0.6314 0.91466 0.79296 1.0000 0.99917
## Pos Pred Value 0.4946 0.49312 0.34033 NaN 0.99184
## Neg Pred Value 0.9449 0.85352 0.88372 0.8362 0.88953
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2581 0.06695 0.08819 0.0000 0.08258
## Detection Prevalence 0.5218 0.13577 0.25913 0.0000 0.08326
## Balanced Accuracy 0.7694 0.63029 0.64940 0.5000 0.72417
We conclude that this single decision tree has an out-of-sample accuracy of only 0.4958369.
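This figure can also be read directly off the stored confusion-matrix object:
CnfMtxTree$overall["Accuracy"]
##  Accuracy
## 0.4958369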
As a second option, we train a random forest:
Forest <- train(classe ~ ., data = trainSet,
method = "rf", trControl = control)
We estimate the accuracy on the training data:
confusionMatrix(predict(Forest, trainSet), trainSet$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9985441 0.9981584 0.9977523 0.9991105 0.2843416
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
We then predict on the validation set and check the out-of-sample performance:
predRF <- predict(Forest, validationSet)
CnfMtxRF <- confusionMatrix(predRF, validationSet$classe)
CnfMtxRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 2 0 0 0
## B 0 1134 3 0 0
## C 0 3 1023 2 0
## D 0 0 0 961 0
## E 0 0 0 1 1082
##
## Overall Statistics
##
## Accuracy : 0.9981
## 95% CI : (0.9967, 0.9991)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9976
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9956 0.9971 0.9969 1.0000
## Specificity 0.9995 0.9994 0.9990 1.0000 0.9998
## Pos Pred Value 0.9988 0.9974 0.9951 1.0000 0.9991
## Neg Pred Value 1.0000 0.9989 0.9994 0.9994 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1927 0.1738 0.1633 0.1839
## Detection Prevalence 0.2848 0.1932 0.1747 0.1633 0.1840
## Balanced Accuracy 0.9998 0.9975 0.9980 0.9984 0.9999
We conclude that this random forest model has an out-of-sample accuracy of 0.9981308.
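As a cross-check, printing the final model returned by caret shows the underlying randomForest fit, including its out-of-bag (OOB) error estimate, which should be broadly consistent with the validation error above:
Forest$finalModel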
Finally, we perform gradient boosting with decision trees as base learners, training a GBM:
GradBoostTrees <- train(classe ~ ., data = trainSet, method = "gbm",
verbose = FALSE, trControl = control)
We estimate the accuracy on the training data:
confusionMatrix(predict(GradBoostTrees, trainSet), trainSet$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 9.716095e-01 9.640846e-01 9.686939e-01 9.743230e-01 2.843416e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 2.954127e-14
We then predict on the validation set and check the out-of-sample performance:
predGBM <- predict(GradBoostTrees, validationSet)
CnfMtxGBM <- confusionMatrix(predGBM, validationSet$classe)
CnfMtxGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1648 31 0 2 3
## B 14 1080 20 1 7
## C 9 27 992 24 8
## D 2 1 13 933 10
## E 1 0 1 4 1054
##
## Overall Statistics
##
## Accuracy : 0.9698
## 95% CI : (0.9651, 0.974)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9617
##
## Mcnemar's Test P-Value : 9.289e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9845 0.9482 0.9669 0.9678 0.9741
## Specificity 0.9915 0.9912 0.9860 0.9947 0.9988
## Pos Pred Value 0.9786 0.9626 0.9358 0.9729 0.9943
## Neg Pred Value 0.9938 0.9876 0.9930 0.9937 0.9942
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2800 0.1835 0.1686 0.1585 0.1791
## Detection Prevalence 0.2862 0.1907 0.1801 0.1630 0.1801
## Balanced Accuracy 0.9880 0.9697 0.9764 0.9813 0.9864
We conclude that this gradient-boosted trees model has an out-of-sample accuracy of 0.9697536. The table below summarizes the results for the three models:
| Model | In-sample accuracy | In-sample error | Out-of-sample accuracy | Out-of-sample error | 95% CI (out-of-sample accuracy) |
|---|---|---|---|---|---|
| Decision Tree | 0.4973 | 0.5027 | 0.4958 | 0.5042 | [0.483,0.5087] |
| Random Forest | 0.9985 | 0.0015 | 0.9981 | 0.0019 | [0.9967,0.9991] |
| Gradient-boosted Trees | 0.9716 | 0.0284 | 0.9698 | 0.0302 | [0.9651,0.974] |
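The error columns are simply one minus the corresponding accuracies; for instance, the out-of-sample errors can be recomputed from the stored confusion matrices:
1 - CnfMtxTree$overall["Accuracy"]  # decision tree: ~0.5042
1 - CnfMtxRF$overall["Accuracy"]    # random forest: ~0.0019
1 - CnfMtxGBM$overall["Accuracy"]   # GBM: ~0.0302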
Therefore we select the random forest model, since it performed best on the validation data, with a 99.81% out-of-sample accuracy (an estimated out-of-sample error of 0.19%).
Since we opted for the random forest model, we can extract the variable importance from it. This tells us how much each variable contributed to the model's predictions: the higher the importance, the more useful the variable is to the model.
var_imp <- varImp(Forest, scale = FALSE)$importance
var_imp <- data.frame(variables = row.names(var_imp),
                      importance = var_imp$Overall)
ggplot(var_imp, aes(x = reorder(variables, importance), y = importance)) +
  geom_col(fill = 'navy') +
  coord_flip() +
  labs(x = 'Variables',
       y = 'Importance',
       title = 'Random forest variable importance') +
  theme(axis.text = element_text(size = 7),
        axis.title = element_text(size = 15),
        plot.title = element_text(size = 20))
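For a text-only view, the top predictors can also be listed by sorting the var_imp data frame built above:
head(var_imp[order(var_imp$importance, decreasing = TRUE), ], 10)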
We may finally apply our model to the test data, keeping only the 52 predictor columns used in training:
testing <- raw_testing %>%
  select(all_of(names(training)[-53]))  # the 52 predictor columns (classe is column 53)
dim(testing)
## [1] 20 52
TestRF <- predict(Forest, testing)
TestRF
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
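To pair each predicted class with its test case, we can tabulate the predictions against the problem_id column of the raw test file (in the original pml-testing.csv this column replaces classe and indexes the 20 cases):
data.frame(problem_id = raw_testing$problem_id, prediction = TestRF)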