Statistical Learning

Course material for BST 263

Instructor: Jeff Miller
Spring 2019
Harvard T.H. Chan School of Public Health
Department of Biostatistics

Synopsis

Statistical learning is a collection of flexible tools and techniques for using data to construct prediction algorithms and perform exploratory analysis. This course will introduce students to the theory and application of methods for supervised learning (classification and regression) and unsupervised learning (dimension reduction and clustering). Students will learn the mathematical foundations underlying the methods, as well as how and when to apply different methods. Topics will include the bias-variance tradeoff, cross-validation, linear regression, logistic regression, KNN, LDA/QDA, variable selection, penalized regression, generalized additive models, CART, random forests, gradient boosting, kernels, SVMs, PCA, and K-means. Homework will involve mathematical and programming exercises, and exams will contain conceptual and mathematical problems. Programming in R will be used throughout the course to provide hands-on training and practical examples.

General information

Syllabus
Textbooks:

An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

Lecture notes

1. Introduction (Course overview, Choosing among methods)
2. Probability and linear algebra basics
3. Measuring performance (K-nearest neighbors, MSE, Bias-variance, Classification error rate, Bayes optimal)

knn.r (R code for KNN regression, MSE, Bias-variance tradeoff)
knn-classifier.r (R code for KNN classifier, Error rate, Bayes optimal classifier)

4. Lab on KNN and measuring performance
5. Linear regression (Probabilistic model, Basis functions, Estimation, Uncertainty quantification)
6. Lab on linear regression
7. Classification (Loss functions, Confusion matrix, ROC curve, Logistic regression, LDA/QDA)

classification.r (R code for classification topics)

9. Cross-validation (k-fold CV, Choosing model settings with CV, Choosing the number of folds)

cv.r (R code for cross-validation topics)

11. Penalized regression (Subset selection, Model selection, Ridge, Lasso, Elastic net)
12. Lab on penalized regression
13. Principal components analysis (Intuition, Covariance method, SVD method, Principal components regression)
14. Lab on PCA
(In progress)

Homework assignments

Homework 1 (Probability and linear algebra basics)
Homework 2 (Measuring performance, Bias-variance)
Homework 3 (Linear regression)
Homework 4 (Classification)
Homework 5 (Cross-validation)
Homework 6 (Penalized regression)