MthStat 568/768 Multivariate Statistical Analysis Spring 2025
Homework 5
Due Wednesday, April 9
1. Consider the spambase data set, where emails are classified as spam or not, and 57 feature variables are measured on each of them (see full description on p. 259 of the book).
(a) Split the data set into training and test sets (roughly a 70/30 split). Compute a logistic classifier using the training data. (There might be perfect separation between the groups, but that should not matter as long as you don't get NA coefficients.)
(b) Find the misclassification table for the test data and compute the misclassification rate.
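The bookkeeping in parts (a) and (b) can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed solution: the helper names (train_test_split_indices, misclassification_table, misclassification_rate) are hypothetical, the labels are made-up toy values rather than spambase data, and fitting the logistic classifier itself is left to whatever software you use.

```python
import random
from collections import Counter


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle the row indices 0..n-1 and split them roughly 70/30."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    # First n_test shuffled indices form the test set, the rest the training set.
    return idx[n_test:], idx[:n_test]


def misclassification_table(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


def misclassification_rate(y_true, y_pred):
    """Fraction of test cases whose predicted label differs from the true one."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Toy labels (assumed, for illustration only): 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

train_idx, test_idx = train_test_split_indices(len(y_true))
table = misclassification_table(y_true, y_pred)
rate = misclassification_rate(y_true, y_pred)

for (t, p), count in sorted(table.items()):
    print(f"true={t} predicted={p}: {count}")
print("misclassification rate:", rate)
```

The off-diagonal cells of the table (true label different from predicted label) are the misclassified cases, so the rate is their total divided by the test-set size.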
2. Consider the pendigits data set, which consists of samples of handwritten digits 0, 1, ..., 9. The feature variables in this case are the (x, y) coordinates of the pen tip, discretized at eight time points (see section 7.2.1 of the book for more details).
(a) Split the data set into training and test sets (roughly a 70/30 split). Compute the multinomial logistic classifier using the training data.
(b) Construct the misclassification table for the test data and compute the misclassification rate. Which digit seems to be the hardest to classify correctly?
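One possible way to read "hardest to classify" is the digit with the largest per-class error rate, that is, the fraction of test cases of that digit that are assigned to some other digit. The Python sketch below illustrates that reading under stated assumptions: the function name per_class_error_rates is hypothetical, and the labels are made-up toy values, not pendigits results.

```python
from collections import defaultdict


def per_class_error_rates(y_true, y_pred):
    """For each true class, the fraction of its cases that were misclassified."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t != p:
            errors[t] += 1
    return {c: errors[c] / totals[c] for c in totals}


# Toy digit labels (assumed, for illustration only).
y_true = [0, 0, 1, 1, 1, 7, 7, 7, 7, 9]
y_pred = [0, 0, 1, 7, 1, 7, 1, 1, 7, 9]

rates = per_class_error_rates(y_true, y_pred)
# The class with the largest error rate is the hardest one under this criterion.
hardest = max(rates, key=rates.get)

for digit in sorted(rates):
    print(f"digit {digit}: error rate {rates[digit]:.3f}")
print("hardest digit in this toy example:", hardest)
```

Each per-class rate corresponds to one row of the misclassification table: the off-diagonal counts in that row divided by the row total. Comparing rates rather than raw error counts keeps the comparison fair if the digits are not equally frequent in the test set.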