import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
Session: Model and hyperparameter selection
In this exercise set, you will be introduced to cross validation for model and hyperparameter selection, which allows us to tackle over- and underfitting. The models used will be regularized linear models, where we will also look at how the two canonical models, the Ridge and the Lasso, compare to each other.
The structure of this notebook is as follows: 1. The holdout method 2. Cross validation and pipelines
Packages
First, we need to import our standard stuff. Notice that we are not interested in seeing the convergence warnings from scikit-learn, so we suppress them for now.
Part 1: The holdout method
To evaluate out-of-sample performance, we use the holdout method. The holdout method entails splitting the data into two parts: one for training/development of your model, and one for testing it. In this first part, we will look into the simplest holdout method, splitting just once into training and test sets, to get a feel for the approach.
To do this, we will try to predict house prices using a large set of covariates (or features, as they are called in machine learning). We are going to work with Kaggle’s dataset on house prices, see information here. Kaggle is an organization that hosts competitions in building predictive models.
Ex. 1.1: Load the California housing data with scikit-learn using the code below. Now: 1. Inspect `cal_house`. How are the data stored? 2. Create a pandas DataFrame called `X`, using `data`. Name the columns using `feature_names`. 3. Create a pandas Series called `y` using `target`.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
cal_house = fetch_california_housing()
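As a minimal sketch of steps 2 and 3 (assuming the cell above has been run), the loaded object behaves like a dictionary:

```python
# Inspect the returned object: a dictionary-like Bunch with
# 'data', 'feature_names' and 'target' entries (among others).
print(cal_house.keys())

# Steps 2 and 3: features as a named DataFrame, target as a Series.
X = pd.DataFrame(data=cal_house.data, columns=cal_house.feature_names)
y = pd.Series(cal_house.target)
```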
Ex. 1.2: Make a for loop with 10 iterations where you: 1. Split the input data into train (also known as development) and test sets, where the test sample should be one third. Set a new random state for each iteration of the loop, so each iteration makes a different split. 2. Further split the training (aka development) data into two evenly sized bins; the first bin is for training models and the other is for validating them, so these data sets are often called training and validation. 3. Train a linear regression model on the sub-training data. Compute the RMSE for out-of-sample predictions on both the test data and the validation data. Save the RMSEs.
You should now have a 10x2 DataFrame with 10 RMSEs from both the test data set and the validation data set. Compute descriptive statistics of the RMSEs for the out-of-sample predictions on the test and validation data. Are they similar?
They hopefully are pretty similar. This shows us that we can split the training data and use part of it to fit and evaluate the model. Hint: DataFrames have a method called `describe`, which is handy for computing summary statistics.
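Below is a minimal sketch of one way to structure the loop, assuming `X` and `y` from Ex. 1.1; the variable names are only illustrative:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rmse_list = []
for i in range(10):
    # Outer split: hold out one third of the data as the test set.
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=1/3, random_state=i)
    # Inner split: half of the development data for training, half for validation.
    X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.5, random_state=i)

    reg = LinearRegression().fit(X_train, y_train)
    rmse_test = np.sqrt(mean_squared_error(y_test, reg.predict(X_test)))
    rmse_val = np.sqrt(mean_squared_error(y_val, reg.predict(X_val)))
    rmse_list.append((rmse_test, rmse_val))

rmse_df = pd.DataFrame(rmse_list, columns=['test', 'validation'])
print(rmse_df.describe())
```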
Having now (hopefully) convinced you that the holdout method works, we return to the full dataset. We will now look closer at preprocessing and how we can achieve the best out-of-sample performance using the Lasso.
Ex. 1.3: Split the dataset into a train and test set of equal sizes
Hint: Try importing `train_test_split` from `sklearn.model_selection`.
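For instance, assuming `X` and `y` from Ex. 1.1, an equal split could look like this (the `random_state` value is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Equal-sized development and test sets.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
```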
Now we have split the data into train/development data and test data, and are ready to start preprocessing our data.
Ex. 1.4: Generate interactions between all features up to the third degree (make sure you exclude the bias/intercept term). How many variables are there? Will OLS fail? After making the interactions, rescale the features to have zero mean and unit standard deviation. Should you use the distribution of the training data to rescale the test data?
Hint 1: Try importing `PolynomialFeatures` from `sklearn.preprocessing`.
Hint 2: If in doubt about which distribution to use when rescaling, you may read this post.
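A sketch of the preprocessing, assuming the `X_dev`/`X_test` split from the previous sketch; note that the scaler is fit on the training data only and then applied to both sets:

```python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Polynomials and interactions up to degree 3, without the constant column.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_dev_poly = poly.fit_transform(X_dev)
X_test_poly = poly.transform(X_test)
print(X_dev_poly.shape)  # how many features did we end up with?

# Rescale using the training distribution only, then apply to both sets.
scaler = StandardScaler().fit(X_dev_poly)
X_dev_scaled = scaler.transform(X_dev_poly)
X_test_scaled = scaler.transform(X_test_poly)
```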
With the data preprocessed, we can now estimate our model, namely the Lasso.
Ex. 1.5: Estimate the Lasso model on the rescaled train data set, using values of \(\lambda\) in the range from \(10^{-4}\) to \(10^4\). For each \(\lambda\) calculate and save the Root Mean Squared Error (RMSE) for the rescaled test and train data. Take a look at the fitted coefficients for different sizes of \(\lambda\). What happens when \(\lambda\) increases? Why?
Hint 1: use `logspace` in numpy to create the range.
Hint 2: read about the `coef_` attribute here.
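A sketch of the loop over \(\lambda\), assuming the scaled arrays from the Ex. 1.4 sketch; note that scikit-learn calls the regularization strength `alpha`, and the number of grid points below is an arbitrary choice:

```python
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

lambdas = np.logspace(-4, 4, 17)
rows = []
for lambda_ in lambdas:
    lasso = Lasso(alpha=lambda_)
    lasso.fit(X_dev_scaled, y_dev)
    rmse_train = np.sqrt(mean_squared_error(y_dev, lasso.predict(X_dev_scaled)))
    rmse_test = np.sqrt(mean_squared_error(y_test, lasso.predict(X_test_scaled)))
    # lasso.coef_ holds the fitted weights; more of them hit exactly zero as lambda grows.
    rows.append((lambda_, rmse_train, rmse_test))

lasso_rmse = pd.DataFrame(rows, columns=['lambda', 'rmse_train', 'rmse_test']).set_index('lambda')
```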
(OPTIONAL) Ex. 1.6: Make a plot with \(\lambda\) on the x-axis and the RMSE measures on the y-axis. What happens to RMSE for train and test data as \(\lambda\) increases? The x-axis should be log scaled. Which one are we interested in minimizing?
Bonus: Can you find the lambda that gives the lowest test RMSE?
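One way to make the plot, assuming the `lasso_rmse` DataFrame from the sketch above:

```python
fig, ax = plt.subplots()
lasso_rmse.plot(ax=ax, logx=True)  # lambda is the index, so it ends up on the x-axis
ax.set_xlabel('lambda')
ax.set_ylabel('RMSE')
```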
Many different models exist, and trying out multiple models and selecting the best is a common procedure. Here we implement a Ridge model as well, which requires the same preprocessing.
Ex. 1.7: Repeat the two previous exercises, now estimating the Ridge model instead. Consider the following: 1) How do the fitted coefficients differ between the two models? 2) Which model performs better? 3) Are you happy with the specified hyperparameter space?
As we saw in the geometric interpretation of the minimization objectives, the way the weights behave as a function of \(\lambda\) differs between the two models.
(OPTIONAL) Ex. 1.8: Create two plots, one for the Lasso and one for the Ridge, where you plot the individual weights as a function of \(\lambda\). Does this confirm your earlier conclusions?
Part 2: Cross validation and pipelines
In machine learning, we have two types of parameters: those that are learned from the training data, for example, the weights in linear regression, and the parameters of a learning algorithm that are optimized separately. The latter are the tuning parameters, also called hyperparameters, of a model. These could for example be the regularization parameter in a regularized linear regression, but also the depth parameter of a decision tree, which we will look into later.
Below, we investigate how we can choose optimal hyperparameters using cross validation and pipelines.
In what follows, we will use the “train” (aka development, non-test) data for two purposes. - First, we are interested in getting a credible measure of model performance under different hyperparameters, in order to perform model selection. - Then, with the selected model, we estimate/train it on all the training data.
A powerful tool for building and applying models is the pipeline, which allows us to combine different preprocessing and model procedures into one. This has many advantages, the main one being safety, with the added side effect of being more code-efficient.
Ex. 2.1: Construct a model building pipeline which: 1. adds polynomial features of degree 3 without bias; 2. scales the features to mean zero and unit std.
Hint: a modelling pipeline can be constructed with `Pipeline` from `sklearn.pipeline`.
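A minimal sketch of such a preprocessing pipeline (the step names are arbitrary):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe_prep = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),  # step 1
    ('scaler', StandardScaler()),                                # step 2
])
```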
If we know which model we want to implement, we can also include it in our pipeline. Try it out!
Ex. 2.2: Construct a model building pipeline which: 1. adds polynomial features of degree 3 without bias; 2. scales the features to mean zero and unit std.; 3. estimates a Lasso model.
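The same idea with an estimator appended as the final step; the `alpha` value below is just a placeholder, since it is the hyperparameter we tune later:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe_lasso = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=1.0)),  # placeholder alpha, tuned in the exercises below
])
```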
K fold cross validation
The simple validation procedure that we used above has one disadvantage: it only uses parts of the development data for validation. To avoid this issue, we can utilize K fold cross validation.
When we want to optimize over both normal parameters and hyperparameters, we do this using nested loops (two-layered cross validation). In the outer loop, we vary the hyperparameters, and then in the inner loop, we do cross validation for the model with the specific selection of hyperparameters. This way, we can find the model with the lowest mean MSE.
(OPTIONAL) Ex. 2.3: Run a Lasso regression using the Pipeline from Ex 2.2. In the outer loop, search through the lambdas specified below. In the inner loop, make 5-fold cross validation on the selected model and store the average MSE across the folds. Which lambda, from the selection below, gives the lowest test MSE?

```python
lambdas = np.logspace(-4, 4, 10)
```

Hint: `KFold` in `sklearn.model_selection` may be useful.
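A sketch of the two loops, assuming `pipe_lasso` from the Ex. 2.2 sketch and the development data `X_dev`, `y_dev` from Part 1; note how the step name `lasso` lets us set the hyperparameter via `lasso__alpha`:

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

lambdas = np.logspace(-4, 4, 10)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

mean_mse = []
for lambda_ in lambdas:                          # outer loop: hyperparameter values
    pipe_lasso.set_params(lasso__alpha=lambda_)  # 'lasso__alpha' targets the Lasso step
    fold_mse = []
    for train_idx, val_idx in kf.split(X_dev):   # inner loop: 5-fold cross validation
        X_tr, X_val = X_dev.iloc[train_idx], X_dev.iloc[val_idx]
        y_tr, y_val = y_dev.iloc[train_idx], y_dev.iloc[val_idx]
        pipe_lasso.fit(X_tr, y_tr)
        fold_mse.append(mean_squared_error(y_val, pipe_lasso.predict(X_val)))
    mean_mse.append(np.mean(fold_mse))

best_lambda = lambdas[int(np.argmin(mean_mse))]
```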
When you have more than one hyperparameter, you will want to fit the model to all the possible combinations of hyperparameters. This is done in an approach called Grid Search, which is implemented in `sklearn.model_selection` as `GridSearchCV`.
However, this is also very useful when you only have one hyperparameter, as it removes a lot of the boilerplate code.
Ex. 2.4: To get to know Grid Search, we want to implement it in one dimension. Using `GridSearchCV`, implement the Lasso pipeline with the same lambdas as before (`lambdas = np.logspace(-4, 4, 10)`), 5-fold CV and (negative) mean squared error as the scoring variable. Which value of \(\lambda\) gives the lowest test error?
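A sketch with `GridSearchCV`, again assuming `pipe_lasso` and the development data from earlier sketches:

```python
from sklearn.model_selection import GridSearchCV

lambdas = np.logspace(-4, 4, 10)

gs = GridSearchCV(estimator=pipe_lasso,
                  param_grid={'lasso__alpha': lambdas},
                  scoring='neg_mean_squared_error',
                  cv=5)
gs.fit(X_dev, y_dev)

best_lambda = gs.best_params_['lasso__alpha']  # the lambda with the lowest CV MSE
```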
(OPTIONAL) Ex. 2.5: Now set `lambdas = np.logspace(-4, 4, 100)`, and repeat the previous exercise, now with `RandomizedSearchCV` and `n_iter=12`. What’s the difference between the two search approaches?
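A sketch with `RandomizedSearchCV`, which samples `n_iter` candidates from the grid instead of trying all of them:

```python
from sklearn.model_selection import RandomizedSearchCV

lambdas = np.logspace(-4, 4, 100)

rs = RandomizedSearchCV(estimator=pipe_lasso,
                        param_distributions={'lasso__alpha': lambdas},
                        n_iter=12,                       # only 12 of the 100 candidates are tried
                        scoring='neg_mean_squared_error',
                        cv=5,
                        random_state=1)
rs.fit(X_dev, y_dev)
print(rs.best_params_)
```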
We will now use the search functions with more than one hyperparameter, displaying their flexibility and power.
To do this, we need a model with more than one hyperparameter. The Elastic Net is one such example: it has two hyperparameters, where the first determines how much to regularize, and the second determines how to weigh between Lasso and Ridge regularization.
(OPTIONAL) Ex. 2.6: Implement an Elastic Net using `RandomizedSearchCV` with `n_iter=10` and the previous lambda values. > Hints: - Try using `np.linspace` to create linearly spaced hyperparameters. - Try importing `ElasticNet` from `sklearn.linear_model`. - The documentation for `ElasticNet` has information on the hyperparameters and their exact names.
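A sketch of one way to set this up; in scikit-learn’s `ElasticNet` the two hyperparameters are called `alpha` (how much to regularize) and `l1_ratio` (the Lasso/Ridge mix), and the grids below are arbitrary choices:

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe_enet = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('scaler', StandardScaler()),
    ('enet', ElasticNet()),
])

param_distributions = {
    'enet__alpha': np.logspace(-4, 4, 100),       # overall regularization strength
    'enet__l1_ratio': np.linspace(0.05, 1.0, 20), # closer to 1 = more Lasso-like
}

rs_enet = RandomizedSearchCV(pipe_enet,
                             param_distributions,
                             n_iter=10,
                             scoring='neg_mean_squared_error',
                             cv=5,
                             random_state=1)
rs_enet.fit(X_dev, y_dev)
print(rs_enet.best_params_)
```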
Tools for model selection
Below we review two useful tools for performing model selection. The first tool, the learning curve, can be used to assess whether there is over- and underfitting.
(OPTIONAL) Ex. 2.7 Learning curves
Create a learning curve using 5 fold cross validation and the \(\lambda\) found in exercise 2.4. What does it tell you about over- and underfitting?
Hint: Try importing `learning_curve` from `sklearn.model_selection`.
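A sketch using `learning_curve`, assuming `pipe_lasso` and a `best_lambda` found in Ex. 2.4:

```python
from sklearn.model_selection import learning_curve

pipe_lasso.set_params(lasso__alpha=best_lambda)

train_sizes, train_scores, val_scores = learning_curve(
    estimator=pipe_lasso,
    X=X_dev, y=y_dev,
    cv=5,
    scoring='neg_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Plot mean scores: a persistent gap between the curves suggests overfitting,
# while two poor, converging curves suggest underfitting.
plt.plot(train_sizes, -train_scores.mean(axis=1), label='train MSE')
plt.plot(train_sizes, -val_scores.mean(axis=1), label='validation MSE')
plt.legend()
```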
(OPTIONAL) Ex. 2.8: Automated Cross Validation in one dimension
When you are doing cross validation with one hyperparameter, you can automate the process by using `validation_curve` from `sklearn.model_selection` and easily plot validation curves afterwards. Use this function to search through the lambda values and find the value of lambda which gives the lowest test error.
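A sketch using `validation_curve`, again assuming `pipe_lasso` and the development data; it evaluates the pipeline for every candidate value of the named parameter:

```python
from sklearn.model_selection import validation_curve

lambdas = np.logspace(-4, 4, 100)

train_scores, val_scores = validation_curve(
    estimator=pipe_lasso,
    X=X_dev, y=y_dev,
    param_name='lasso__alpha',
    param_range=lambdas,
    cv=5,
    scoring='neg_mean_squared_error',
)

# Scores are negative MSE, so flip the sign and pick the lambda with the lowest mean MSE.
best_lambda = lambdas[np.argmin(-val_scores.mean(axis=1))]
```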
(OPTIONAL) Ex. 2.9: Plot the average MSE-test and MSE-train (validation curve) against the different values of lambda. Does this differ from the one in exercise 1.6? If yes, why?
Hints: - Use logarithmic axes, and lambda as index - Have you done the same sample splitting in this and exercise 1.6?