Magnus Nielsen, SODAS, UCPH
Parameters are the same as weights
Intercept is the same as bias
Covariates are the same as features
\[y = X w + \epsilon\] \[y = X \beta + \epsilon\]
Notation is a bit different, but it’s the same
Interested in minimizing the expected loss, \(E_{(x, y)} [L(\hat f(x), y)]\), on unseen data
Sadly, we have no unseen data, so how do we find the best model?
We only train on some fraction of the data, leaving the rest of the sample for model evaluation!
We split the data into two parts:
The testing data must only be used ONCE at the end to evaluate the models.
The development dataset can be used as one sees fit
Source: Raschka & Mirjalili, 2019, ch. 6
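A minimal sketch of such a split using scikit-learn's train_test_split (the toy data, split fraction and random_state below are illustrative assumptions, not fixed choices):
import numpy as np
from sklearn.model_selection import train_test_split
# Toy data purely for illustration
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)
# Hold out a test set; the 75/25 split and random_state=0 are arbitrary choices here
X_dev_toy, X_test_toy, y_dev_toy, y_test_toy = train_test_split(
    X_toy, y_toy, train_size=0.75, random_state=0)
print(X_dev_toy.shape, X_test_toy.shape)  # (7, 2) (3, 2)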
Using K-fold cross-validation, we split the development data into \(K\) folds, train on \(K-1\) of them and validate on the remaining one, repeating \(K\) times
Source: Raschka & Mirjalili, 2019, ch. 6
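A small illustrative sketch of how scikit-learn's KFold generates the folds (10 toy observations and 5 splits are chosen purely for demonstration):
import numpy as np
from sklearn.model_selection import KFold
X_demo = np.arange(10).reshape(-1, 1)  # 10 toy observations
kf_demo = KFold(n_splits=5)
# Each iteration yields row indices for the K-1 training folds and the 1 validation fold
for train_idx, val_idx in kf_demo.split(X_demo):
    print(train_idx, val_idx)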
Repeated K-fold is also popular, and other methods exist as well, most commonly motivated by computational cost:
More splits require more computation!
There exists no unbiased estimator of the variance of K-fold cross-validation (Bengio & Grandvalet, 2003), so care is needed when drawing inference from cross-validation scores (see e.g. Nadeau & Bengio, 1999)
We will not cover this and use point estimates only
OLS is a minimization problem
\[\hat w = \underset{w}{\text{argmin}} \left\{\frac{1}{N}|| y - Xw||^2_2 \right\}\]
Where \(||\cdot||_2\) is the Euclidean norm
It also has a very familiar closed form solution
\[\hat w = (X'X)^{-1}(X'y)\]
In general, we let the people who implement the models worry about how to solve these problems, so I won’t spend much time on it
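Still, as a quick sketch, the closed-form solution can be computed directly with numpy (a toy example with made-up data, purely for illustration):
import numpy as np
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_toy = X_toy @ w_true + rng.normal(scale=0.1, size=100)
# Closed-form OLS: solve (X'X) w = X'y rather than inverting X'X explicitly
w_hat = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)
print(w_hat)  # close to w_true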
We don’t control the bias-variance (over-/underfitting) trade-off directly, as OLS is unbiased
It can only be influenced by changing the input complexity (polynomials, interactions, etc.)
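For instance, PolynomialFeatures can generate such polynomial and interaction terms (a small sketch; get_feature_names_out assumes a reasonably recent scikit-learn version):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_small = np.array([[1.0, 2.0], [3.0, 4.0]])
pf = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = pf.fit_transform(X_small)
# Two inputs become five features: x0, x1, x0^2, x0*x1, x1^2
print(pf.get_feature_names_out())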
We can add a term to the minimization problem which penalizes model complexity
\[\hat w = \underset{w}{\text{argmin}}\left\{\frac{1}{N} || y - Xw||^2_2 + \lambda f(w)\right\}, \lambda \geq 0\]
\(\lambda\) is a so-called hyperparameter, and the values of hyperparameters are chosen by us
Regularization happens implicitly or explicitly in all machine learning, as this allows us to control the bias-variance trade-off
The penalty could be the sum of the absolute size of the weights, as in the Lasso introduced by Tibshirani (1996) \[\hat w = \underset{w}{\text{argmin}} \left\{\frac{1}{N}|| y - Xw||_2^2 + \lambda ||w||_1\right\}, \lambda \geq 0\]
where \(||\cdot||_1\) is the L1 or Taxicab norm, corresponding to \(\sum_{i=1}^k |w_i|\)
Another penalty could be the sum of the squared weights, as in the Ridge introduced by Hoerl & Kennard (1970) \[\hat w = \underset{w}{\text{argmin}} \left\{\frac{1}{N}|| y - Xw||_2^2 + \lambda ||w||_2^2\right\}, \lambda \geq 0\]
where \(||\cdot||_2\) again is the L2 or Euclidean norm, corresponding to \(\sqrt{\sum_{i=1}^k w_i^2}\)
Source: Raschka & Mirjalili, 2019, ch. 4
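To see the two penalties in action, here is a small sketch comparing fitted weights (the simulated data and the penalty strength are made-up assumptions, chosen only for illustration):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)
# The same (arbitrary) penalty strength for both models
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
print(np.sum(lasso.coef_ == 0))  # L1 typically sets some weights exactly to zero
print(np.sum(ridge.coef_ == 0))  # L2 shrinks weights, but typically none become exactly zero
The L1 penalty tends to produce sparse weight vectors, while the L2 penalty shrinks all weights towards zero without typically setting any of them exactly to zero.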
As we penalize weights based on their size, the scale of each covariate matters!
Therefore we scale all inputs before they go into the model
It is common to z-standardize (StandardScaler), which corresponds to subtracting the mean and dividing by the standard deviation
How to encode dummies?
How to include an intercept?
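One common approach, sketched below as an assumption rather than a prescription: one-hot encode with a dropped reference category, and rely on the estimator’s built-in intercept instead of adding and scaling a constant column.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Lasso
# Hypothetical categorical column, purely for illustration
cities = np.array([['Copenhagen'], ['Aarhus'], ['Odense'], ['Copenhagen']])
# drop='first' leaves out a reference category, avoiding perfect collinearity
encoder = OneHotEncoder(drop='first')
dummies = encoder.fit_transform(cities).toarray()
print(dummies)
# Most scikit-learn linear models fit the intercept themselves (fit_intercept=True),
# so no constant column has to be added to X (and hence none has to be scaled)
regr = Lasso(alpha=0.1, fit_intercept=True)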
We fit on our training data and transform all data
If we wanted to make polynomial interactions, it would be the same procedure!
If we want to make predictions, we use predict instead of transform
scaler = StandardScaler()
polfeats = PolynomialFeatures(degree=3, include_bias=False)
regr = Lasso(alpha=lambda_value)  # lambda_value: some chosen penalty strength
# Polynomial features: fit on train, transform train and test
polfeats.fit(X_train)
X_train_pol = polfeats.transform(X_train)
X_test_pol = polfeats.transform(X_test)
# Scaling: fit on train, transform train and test
scaler.fit(X_train_pol)
X_train_std = scaler.transform(X_train_pol)
X_test_std = scaler.transform(X_test_pol)
# Model: fit on train, predict on train and test
regr.fit(X_train_std, y_train)
predicted_y_train = regr.predict(X_train_std)
predicted_y_test = regr.predict(X_test_std)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=10000,
n_features=15,
n_informative=3,
noise=10,
n_targets=1,
bias=2,
random_state=1)
y = y + 6 * (X[:,0] * X[:,1]) - 5 * (X[:,2] * X[:,3] * X[:,4]) + 25 * (X[:,5] * X[:,6] * X[:,7])
from sklearn.model_selection import KFold, GridSearchCV, train_test_split, validation_curve, learning_curve
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
# Use mean_squared_error, Lasso, KFold, train_test_split, PolynomialFeatures and StandardScaler in example
# Development and test data
X_dev, X_test, y_dev, y_test = train_test_split(X, y, train_size=0.75, random_state=73)
To reiterate, here’s what’s going to happen:
for each lambda:
    for each fold:
        fit preprocess on training data
        transform training data
        transform validation data
        fit model on transformed training data
        predict on transformed validation data
        get score on validation data
        save score
    save mean score for all folds
# Splits data into 5
kf = KFold(n_splits=5)
# Hyperparameterspace
lambdas = np.logspace(-4,3,10)
# Mean MSE for each lambda
mean_MSE_train = []
mean_MSE_val = []
# For each hyperparameter...
for lambda_ in lambdas:
    MSE_val_list = []
    MSE_train_list = []
    # For each fold...
    for train_index, test_index in kf.split(X_dev):
        # SPLIT INTO TRAIN AND VALIDATION
        X_train, X_val = X_dev[train_index], X_dev[test_index]
        y_train, y_val = y_dev[train_index], y_dev[test_index]
        # CREATE INSTANCES
        pol_features = PolynomialFeatures(degree=3, include_bias=False)
        scaler = StandardScaler()
        regr = Lasso(alpha=lambda_, random_state=1)
        ### PREPROCESSING ###
        # FIT
        pol_features.fit(X_train)
        # TRANSFORM
        X_train_pol = pol_features.transform(X_train)
        X_val_pol = pol_features.transform(X_val)
        # FIT
        scaler.fit(X_train_pol)
        # TRANSFORM
        X_train_std = scaler.transform(X_train_pol)
        X_val_std = scaler.transform(X_val_pol)
        ### PREDICTION ###
        # FIT
        regr.fit(X_train_std, y_train)
        # PREDICT
        y_train_hat = regr.predict(X_train_std)
        y_val_hat = regr.predict(X_val_std)
        # SAVE STATISTICS
        mse_train = mean_squared_error(y_train, y_train_hat)
        mse_val = mean_squared_error(y_val, y_val_hat)
        MSE_train_list.append(mse_train)
        MSE_val_list.append(mse_val)
    mean_MSE_val.append(np.mean(MSE_val_list))
    mean_MSE_train.append(np.mean(MSE_train_list))
lambda_df = pd.DataFrame({'Lambda':lambdas, 'MSE Training':mean_MSE_train, 'MSE Validation':mean_MSE_val})
|   | Lambda | MSE Training | MSE Validation |
|---|---|---|---|
| 0 | 0.000100 | 86.010738 | 121.000621 |
| 1 | 0.000599 | 86.011340 | 120.800925 |
| 2 | 0.003594 | 86.032232 | 119.655928 |
| 3 | 0.021544 | 86.648761 | 114.128397 |
| 4 | 0.129155 | 93.965888 | 102.462882 |
| 5 | 0.774264 | 102.925777 | 103.214830 |
| 6 | 4.641589 | 227.640247 | 228.388808 |
| 7 | 27.825594 | 2978.498180 | 2980.176269 |
| 8 | 166.810054 | 11912.563899 | 11917.745461 |
| 9 | 1000.000000 | 11912.563899 | 11917.745461 |
That was a lot of code for finding the best hyperparameter!
You asked for a recipe:
We can do better!
There was a whole lot of fit, transform, fit, transform in there…
We must be able to remove all this boilerplate!
This is exactly what the pipeline does:
It chains fit and transform steps, and (can) finish with a fit and predict step, i.e. a regression or classification model
Why?
Source: Raschka & Mirjalili, 2019, ch. 6
# Splits data into 5
kf = KFold(n_splits=5)
# Mean MSE for each lambda
mean_MSE_train = []
mean_MSE_val = []
# Hyperparameterspace
lambdas = np.logspace(-4,3,10)
# For each hyperparameter...
for lambda_ in lambdas:
    MSE_val_list = []
    MSE_train_list = []
    # For each fold...
    for train_index, test_index in kf.split(X_dev):
        # SPLIT INTO TRAIN AND VALIDATION
        X_train, X_val = X_dev[train_index], X_dev[test_index]
        y_train, y_val = y_dev[train_index], y_dev[test_index]
        pipeline = Pipeline(
            [
                ('pol_feats', PolynomialFeatures(degree=3, include_bias=False)),
                ('scaler', StandardScaler()),
                ('lasso', Lasso(alpha=lambda_, random_state=1))
            ]
        )
        # FIT
        pipeline.fit(X_train, y_train)
        # PREDICT
        y_train_hat = pipeline.predict(X_train)
        y_val_hat = pipeline.predict(X_val)
        # SAVE STATISTICS
        mse_train = mean_squared_error(y_train, y_train_hat)
        mse_val = mean_squared_error(y_val, y_val_hat)
        MSE_train_list.append(mse_train)
        MSE_val_list.append(mse_val)
    mean_MSE_val.append(np.mean(MSE_val_list))
    mean_MSE_train.append(np.mean(MSE_train_list))
lambda_df_pipe = pd.DataFrame({'Lambda':lambdas, 'MSE Training':mean_MSE_train, 'MSE Validation':mean_MSE_val})
|   | Lambda | MSE Training | MSE Validation |
|---|---|---|---|
| 0 | 0.000100 | 86.010738 | 121.000621 |
| 1 | 0.000599 | 86.011340 | 120.800925 |
| 2 | 0.003594 | 86.032232 | 119.655928 |
| 3 | 0.021544 | 86.648761 | 114.128397 |
| 4 | 0.129155 | 93.965888 | 102.462882 |
| 5 | 0.774264 | 102.925777 | 103.214830 |
| 6 | 4.641589 | 227.640247 | 228.388808 |
| 7 | 27.825594 | 2978.498180 | 2980.176269 |
| 8 | 166.810054 | 11912.563899 | 11917.745461 |
| 9 | 1000.000000 | 11912.563899 | 11917.745461 |
We found the best model, but haven’t tested on our test data yet!
# GET BEST LAMBDA
idx_min = lambda_df['MSE Validation'].idxmin()
best_lambda = lambda_df.iloc[idx_min, 0]
# MAKE PIPELINE WITH BEST LAMBDA
pipeline = Pipeline(
[
('pol_feats', PolynomialFeatures(degree=3, include_bias=False)),
('scaler', StandardScaler()),
('lasso', Lasso(alpha=best_lambda, random_state=1))
]
)
# FIT ON DEVELOPMENT
pipeline.fit(X_dev, y_dev)
# PREDICT ON TEST
y_test_hat = pipeline.predict(X_test)
print(f"MSE on test: {mean_squared_error(y_test, y_test_hat):.2f}")
MSE on test: 100.76
pipeline = Pipeline(
[
('pol_feats', PolynomialFeatures(degree=3, include_bias=False)),
('scaler', StandardScaler()),
('ols', LinearRegression())
]
)
# FIT ON DEVELOPMENT
pipeline.fit(X_dev, y_dev)
# PREDICT ON TEST
y_test_hat = pipeline.predict(X_test)
print(f"MSE on test: {mean_squared_error(y_test, y_test_hat):.2f}")
MSE on test: 112.55
Thank god!
There was a whole lot of for loops in there, iterating over both folds and hyperparameter values
This seems repetitive and verbose
This is exactly what the grid search (GridSearchCV) does:
# Hyperparameterspace
lambdas = np.logspace(-4,3,10)
# Pipeline
pipeline = Pipeline(
[
('pol_feats', PolynomialFeatures(degree=3, include_bias=False)),
('scaler', StandardScaler()),
('lasso', Lasso(random_state=1)) # No lambda
]
)
# Gridsearch
gs = GridSearchCV(estimator=pipeline,
param_grid=[{'lasso__alpha':lambdas}],
scoring='neg_mean_squared_error',
cv=5,
n_jobs=-1)
# Fit
gs = gs.fit(X_dev, y_dev)
print(gs.best_params_)
# Predict on test
y_test_hat = gs.predict(X_test)
print(f"MSE on test: {mean_squared_error(y_test, y_test_hat):.2f}")
{'lasso__alpha': 0.1291549665014884}
MSE on test: 100.76
Learning and validation curves tell us how our model performs as a function of either the amount of training data (learning curve) or the value of a hyperparameter (validation curve)
Not an important output per se, but very helpful while building models!
sklearn has a short write-up with code, and Raschka & Mirjalili write about them in chapter 6.
Source: Raschka & Mirjalili, 2019, ch. 6
pipeline = Pipeline(
[
('pol_feats', PolynomialFeatures(degree=3, include_bias=False)),
('scaler', StandardScaler()),
('lasso', Lasso(alpha=best_lambda, random_state=1))  # Best lambda found earlier
]
)
n_obs, train_scores, test_scores = \
learning_curve(estimator=pipeline,
X=X_dev,
y=y_dev,
train_sizes=np.linspace(0.1,1,10),
scoring='neg_mean_squared_error',
cv=5)
mean_values = pd.concat({'train': pd.DataFrame(-train_scores).mean(1),
'test': pd.DataFrame(-test_scores).mean(1),
'n_obs': pd.DataFrame(n_obs).mean(1)}, axis =1)
pd.concat({'train': pd.DataFrame(-train_scores).mean(1),
'test': pd.DataFrame(-test_scores).mean(1)},
axis=1)\
.pipe(np.sqrt)\
.set_index(pd.Index(n_obs, name='# obs'))\
.plot(logy=True)
plt.show()
Source: Raschka & Mirjalili, 2019, ch. 6
pipeline = Pipeline(
[
('pol_feats', PolynomialFeatures(degree=3, include_bias=False)),
('scaler', StandardScaler()),
('lasso', Lasso(random_state=1)) # No lambda
]
)
train_scores, test_scores = \
validation_curve(estimator=pipeline,
X=X_dev,
y=y_dev,
param_name='lasso__alpha',
param_range=lambdas,
scoring='neg_mean_squared_error',
cv=5)
# plot curves
pd.concat({'train': pd.DataFrame(-train_scores).mean(1),
'test': pd.DataFrame(-test_scores).mean(1)},
axis=1)\
.pipe(np.sqrt)\
.set_index(pd.Index(lambdas, name='lambda'))\
.plot(logx=True, logy=True)
plt.show()
Bengio, Y., & Grandvalet, Y. (2003). No unbiased estimator of the variance of k-fold cross-validation. Advances in Neural Information Processing Systems, 16.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.
Nadeau, C., & Bengio, Y. (1999). Inference for the generalization error. Advances in Neural Information Processing Systems, 12.
Raschka, S., & Mirjalili, V. (2019). Python machine learning: Machine learning and deep learning with Python, scikit-learn, and TensorFlow 2. Packt Publishing Ltd.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.