Machine learning course for VIVE

Session: Model and hyperparameter selection

In this exercise set, you will be introduced to cross validation for model and hyperparameter selection, allowing us to tackle over- and underfitting. The models used will be regularized linear models, and we will also look at how the two canonical models, the Ridge and the Lasso, compare to each other.

The structure of this notebook is as follows:
1. The holdout method
2. Cross validation and pipelines

Packages

First, we need to import our standard stuff. Notice that we are not interested in seeing the convergence warnings from scikit-learn, so we suppress them for now.

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

%matplotlib inline

Part 1: The holdout method

To evaluate out-of-sample performance, we use the holdout method. The holdout method entails splitting the data into two parts: one for training/development of your model, and one for testing it. In this first part, we will look at the simplest holdout method, splitting just once into training and test sets, to get a feel for the approach.

To do this, we will try to predict house prices using a large number of covariates (or features, as they are called in machine learning). We are going to work with the California housing dataset, which is bundled with scikit-learn.

Ex. 1.1: Load the California housing data with scikit-learn using the code below. Now:
1. Inspect cal_house. How are the data stored?
2. Create a pandas DataFrame called X, using data. Name the columns using feature_names.
3. Create a pandas Series called y using target.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

cal_house = fetch_california_housing()    
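Below is a minimal sketch of one way to solve Ex. 1.1. cal_house is a scikit-learn Bunch, which behaves like a dictionary with (among others) the keys data, target and feature_names.

X = pd.DataFrame(cal_house.data, columns=cal_house.feature_names)
y = pd.Series(cal_house.target)

X.head()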

Ex. 1.2: Make a for loop with 10 iterations where you:
1. Split the input data into train (also known as development) and test data, where the test sample should be one third of the data. Set a new random state for each iteration of the loop, so each iteration makes a different split.
2. Further split the training (aka development) data into two equally sized parts; the first part is for training models and the other is for validating them. These data sets are therefore often called the training and validation sets.
3. Train a linear regression model on the sub-training data. Compute the RMSE of the out-of-sample predictions for both the test data and the validation data. Save the RMSEs.

A sketch of one way to set up the loop is shown after the hint below.

You should now have a 10x2 DataFrame with 10 RMSE values from both the test data and the validation data. Compute descriptive statistics of the RMSE for the out-of-sample predictions on the test and validation data. Are they similar?
They hopefully are pretty similar. This shows us that we can split off part of the training data and use it to evaluate the model.

Hint: DataFrames have a method called describe, which is handy for computing summary statistics
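A sketch of one way to set up the loop in Ex. 1.2 could look as follows (assuming the X and y created in Ex. 1.1):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rmses = []
for random_state in range(10):
    # Hold out one third of the data as the test set; a new random state each iteration.
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=1/3, random_state=random_state)
    # Split the development data evenly into training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.5, random_state=random_state)

    reg = LinearRegression().fit(X_train, y_train)
    rmse_test = np.sqrt(mean_squared_error(y_test, reg.predict(X_test)))
    rmse_val = np.sqrt(mean_squared_error(y_val, reg.predict(X_val)))
    rmses.append((rmse_test, rmse_val))

rmse_df = pd.DataFrame(rmses, columns=['test', 'validation'])
rmse_df.describe()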

Having now (hopefully) convinced you that the holdout method works, we return to the full dataset. We will now look more closely at preprocessing and at how we can achieve the best out-of-sample performance using the Lasso.

Ex. 1.3: Split the dataset into a train and test set of equal sizes

Hint: Try importing train_test_split from sklearn.model_selection

Now we have split the data into train/development data and test data, and are ready to start preprocessing our data.

Ex. 1.4: Generate interactions between all features up to the third degree (make sure you exclude the bias/intercept term). How many variables are there? Will OLS fail? After making the interactions, rescale the features to have zero mean and unit standard deviation. Should you use the distribution of the training data to rescale the test data?

Hint 1: Try importing PolynomialFeatures from sklearn.preprocessing

Hint 2: If in doubt about which distribution to use for scaling, you may read this post.
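A sketch covering Ex. 1.3 and Ex. 1.4 could look like this (the random state is an arbitrary choice):

from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Ex. 1.3: split the data 50/50 into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# Ex. 1.4: polynomial features and interactions up to degree 3, without the intercept column.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Rescale using the distribution of the *training* data only, then apply it to the test data.
scaler = StandardScaler().fit(X_train_poly)
X_train_scaled = scaler.transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)

print(X_train_poly.shape[1])  # number of generated features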

With the data preprocessed, we can now estimate our model, namely the Lasso.

Ex. 1.5: Estimate the Lasso model on the rescaled train data set, using values of \(\lambda\) in the range from \(10^{-4}\) to \(10^4\). For each \(\lambda\) calculate and save the Root Mean Squared Error (RMSE) for the rescaled test and train data. Take a look at the fitted coefficients for different sizes of \(\lambda\). What happens when \(\lambda\) increases? Why?

Hint 1: use logspace in numpy to create the range.

Hint 2: read about the coef_ attribute here.
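A sketch of the loop over \(\lambda\) values, assuming the rescaled data from Ex. 1.4 (note that scikit-learn calls the regularization parameter alpha):

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

lambdas = np.logspace(-4, 4, 17)  # an arbitrary grid from 10^-4 to 10^4

rmse_train, rmse_test, coefs = [], [], []
for lambda_ in lambdas:
    lasso = Lasso(alpha=lambda_)
    lasso.fit(X_train_scaled, y_train)
    rmse_train.append(np.sqrt(mean_squared_error(y_train, lasso.predict(X_train_scaled))))
    rmse_test.append(np.sqrt(mean_squared_error(y_test, lasso.predict(X_test_scaled))))
    coefs.append(lasso.coef_)  # inspect these for different lambdas

rmse_df = pd.DataFrame({'train': rmse_train, 'test': rmse_test}, index=lambdas)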

(OPTIONAL) Ex. 1.6: Make a plot with \(\lambda\) on the x-axis and the RMSE measures on the y-axis. What happens to RMSE for train and test data as \(\lambda\) increases? The x-axis should be log scaled. Which one are we interested in minimizing?

Bonus: Can you find the lambda that gives the lowest MSE-test score?

Many different models exist, and trying out several and selecting the best is common practice. Here we implement a Ridge model as well, which requires the same preprocessing.

Ex. 1.7: Repeat the two previous exercises, now estimating the Ridge model instead. Consider the following:
1. How do the fitted coefficients differ between the two models?
2. Which model performs better?
3. Are you happy with the specified hyperparameter space?

As we saw in the geometric interpretation of the minimization objectives, the weights of the two models behave differently as a function of \(\lambda\).

(OPTIONAL) Ex. 1.8: Create two plots, one for the Lasso and one for the Ridge, where you plot the individual weights as a function of lambda. Does this confirm your earlier conclusions?
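A sketch for the Lasso, assuming the lambdas and the list of coef_ arrays (coefs) saved in Ex. 1.5; the Ridge plot is analogous, using the coefficients saved in Ex. 1.7:

fig, ax = plt.subplots()
ax.plot(lambdas, np.array(coefs))  # one line per weight
ax.set_xscale('log')
ax.set_xlabel(r'$\lambda$')
ax.set_ylabel('weight')
ax.set_title(r'Lasso weights as a function of $\lambda$')
plt.show()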

Part 2: Cross validation and pipelines

In machine learning, we have two types of parameters: those that are learned from the training data, for example, the weights in linear regression, and the parameters of a learning algorithm that are optimized separately. The latter are the tuning parameters, also called hyperparameters, of a model. These could for example be the regularization parameter in a regularized linear regression, but also the depth parameter of a decision tree, which we will look into later.

Below, we investigate how we can choose optimal hyperparameters using cross validation and pipelines.

In what follows, we will use the “train” (aka development, non-test) data for two purposes.
- First, we are interested in getting a credible measure of model performance under different hyperparameters in order to perform model selection.
- Then, with the selected model, we estimate/train it on all the training data.

A powerful tool for building and applying models is the pipeline, which allows you to combine different preprocessing and modelling steps into one object. This has many advantages: the main one is that it is safer, and it has the added side effect of being more code-efficient.

Ex. 2.1: Construct a model building pipeline which:
1. adds polynomial features of degree 3 without bias;
2. scales the features to mean zero and unit std.

Hint: a modelling pipeline can be constructed with Pipeline from sklearn.pipeline.

If we know which model we want to implement, we can also include it in our pipeline. Try it out!

Ex. 2.2: Construct a model building pipeline which:
1. adds polynomial features of degree 3 without bias;
2. scales the features to mean zero and unit std.;
3. estimates a Lasso model.
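A sketch of such a pipeline (the step names 'poly', 'scale' and 'lasso' are our own choices); dropping the last step gives the preprocessing-only pipeline from Ex. 2.1:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

pipe_lasso = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('scale', StandardScaler()),
    ('lasso', Lasso()),
])

# The whole pipeline can then be fitted and used for prediction in one go, e.g.
# pipe_lasso.fit(X_train, y_train) and pipe_lasso.predict(X_test).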

K-fold cross validation

The simple validation procedure that we used above has one disadvantage: it only uses part of the development data for validation. To avoid this issue, we can use K-fold cross validation.

When we want to optimize over both normal parameters and hyperparameters, we do this using nested loops (two-layered cross validation). In the outer loop, we vary the hyperparameters, and in the inner loop, we do cross validation for the model with that specific choice of hyperparameters. This way, we can find the model with the lowest mean MSE.

(OPTIONAL) Ex. 2.3: Run a Lasso regression using the Pipeline from Ex. 2.2. In the outer loop, search through the lambdas specified below. In the inner loop, make 5-fold cross validation on the selected model and store the average MSE for each fold. Which lambda, from the selection below, gives the lowest test MSE?

lambdas = np.logspace(-4, 4, 10)

Hint: KFold in sklearn.model_selection may be useful.
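A sketch of the nested loops, assuming pipe_lasso from Ex. 2.2 and the raw (untransformed) X_train and y_train from Ex. 1.3 as the development data:

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

lambdas = np.logspace(-4, 4, 10)
kfold = KFold(n_splits=5)

mean_mse = []
for lambda_ in lambdas:                                  # outer loop: hyperparameters
    pipe_lasso.set_params(lasso__alpha=lambda_)          # set the Lasso step's alpha
    fold_mse = []
    for train_idx, val_idx in kfold.split(X_train):      # inner loop: 5-fold CV
        pipe_lasso.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        pred = pipe_lasso.predict(X_train.iloc[val_idx])
        fold_mse.append(mean_squared_error(y_train.iloc[val_idx], pred))
    mean_mse.append(np.mean(fold_mse))

best_lambda = lambdas[np.argmin(mean_mse)]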

When you have more than one hyperparameter, you will want to fit the model to all the possible combinations of hyperparameters. This is done in an approach called Grid Search, which is implemented in sklearn.model_selection as GridSearchCV.

However, this is also very useful when you only have one hyperparameter, as it removes a lot of the boilerplate code.

Ex. 2.4: To get to know Grid Search, we want to implement it in one dimension. Using GridSearchCV, implement the Lasso pipeline, with the same lambdas as before (lambdas = np.logspace(-4, 4, 10)), 5-fold CV and (negative) mean squared error as the scoring variable. Which value of \(\lambda\) gives the lowest test error?
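A sketch using GridSearchCV, again assuming the pipe_lasso pipeline from Ex. 2.2 with a step named 'lasso':

from sklearn.model_selection import GridSearchCV

lambdas = np.logspace(-4, 4, 10)

gs = GridSearchCV(
    estimator=pipe_lasso,
    param_grid={'lasso__alpha': lambdas},
    scoring='neg_mean_squared_error',
    cv=5,
)
gs = gs.fit(X_train, y_train)

print(gs.best_params_)   # the lambda (alpha) with the lowest average MSE
print(-gs.best_score_)   # the corresponding mean MSE across the folds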

(OPTIONAL) Ex. 2.5: Now set lambdas = np.logspace(-4, 4, 100), and repeat the previous exercise, now with RandomizedSearchCV with n_iter=12. What’s the difference between the two search strategies?

We will now use the search functions with more than one hyperparameter, displaying their flexibility and power.

To do this, we need a model with more than one hyperparameter. The Elastic Net is one such example, as it has two hyperparameters. The first hyperparameter determines how much to regularize, and the second determines how to weigh between Lasso and Ridge regularization.

(OPTIONAL) Ex. 2.6: Implement an Elastic Net using RandomizedSearchCV with n_iter=10 and the previous lambda values.

Hints:
- Try using np.linspace to create linearly spaced hyperparameters.
- Try importing ElasticNet from sklearn.linear_model.
- The documentation for ElasticNet has information on the hyperparameters and their exact names.
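A sketch of one possible Elastic Net search; the step name 'enet' and the hyperparameter grids are our own choices:

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe_enet = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('scale', StandardScaler()),
    ('enet', ElasticNet()),
])

param_distributions = {
    'enet__alpha': np.logspace(-4, 4, 10),     # overall regularization strength
    'enet__l1_ratio': np.linspace(0, 1, 10),   # 0 = pure Ridge penalty, 1 = pure Lasso penalty
}

rs = RandomizedSearchCV(
    pipe_enet,
    param_distributions=param_distributions,
    n_iter=10,
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=1,
)
rs = rs.fit(X_train, y_train)
print(rs.best_params_)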

Tools for model selection

Below we review two useful tools for performing model selection. The first tool, the learning curve, can be used to assess whether there is over- and underfitting.

(OPTIONAL) Ex. 2.7 Learning curves

Create a learning curve using 5 fold cross validation and the \(\lambda\) found in exercise 2.4. What does it tell you about over- and underfitting?

Hint: Try importing learning_curve from sklearn.model_selection.
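A sketch, assuming the pipe_lasso pipeline from Ex. 2.2 and the best_lambda found in Ex. 2.4:

from sklearn.model_selection import learning_curve

pipe_lasso.set_params(lasso__alpha=best_lambda)

train_sizes, train_scores, val_scores = learning_curve(
    pipe_lasso, X_train, y_train,
    cv=5,
    scoring='neg_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 5),
)

plt.plot(train_sizes, -train_scores.mean(axis=1), label='train MSE')
plt.plot(train_sizes, -val_scores.mean(axis=1), label='validation MSE')
plt.xlabel('number of training observations')
plt.ylabel('MSE')
plt.legend()
plt.show()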

(OPTIONAL) Ex. 2.8: Automated cross validation in one dimension
When you are doing cross validation with one hyperparameter, you can automate the process by using validation_curve from sklearn.model_selection and easily plot validation curves afterwards. Use this function to search through the values of lambdas, and find the value of lambda which gives the lowest test error.
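A sketch using validation_curve, again assuming the pipe_lasso pipeline with a step named 'lasso':

from sklearn.model_selection import validation_curve

lambdas = np.logspace(-4, 4, 10)

train_scores, val_scores = validation_curve(
    pipe_lasso, X_train, y_train,
    param_name='lasso__alpha',
    param_range=lambdas,
    cv=5,
    scoring='neg_mean_squared_error',
)

# Average MSE across folds for each lambda; the best lambda minimizes the validation MSE.
mse_val = -val_scores.mean(axis=1)
best_lambda = lambdas[np.argmin(mse_val)]

The averaged train and validation MSEs can then be plotted against the lambdas, which is what Ex. 2.9 asks for.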

(OPTIONAL) Ex. 2.9: Plot the average MSE-test and MSE-train (validation curve) against the different values of lambda. Does this differ from the one in exercise 1.6? If yes, why?

Hints:
- Use logarithmic axes, and lambda as index.
- Have you done the same sample splitting in this and exercise 1.6?