# Suppress convergence warnings if they appear
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

# Actual code to load
import pandas as pd
df = pd.read_csv('adult_preprocessed.csv')
df.describe()
Exercise Set 4: Supervised learning
In this exercise set, we will mainly be looking at different supervised learning algorithms, both tinkering around with them and seeing how the models perform for a given dataset. We will look at:
- Logistic regression
- Decision tree
- Ensemble methods
  - Random Forest
  - AdaBoost
- Neural network
If you need more information about the models or how to tune their hyperparameters, try looking up the documentation or googling "hyperparameter tuning" + the model name.
Throughout your career, you’ve probably worked with many problems. Some problems can easily be formulated as a regression problem, whereas others are easily formulated as a classification problem.
Exercise 1.1
Name three different problems which you’ve worked with where the outcome of interest was:
- Continuous (regression)
- Categorical (classification)
Have you encountered problems where the outcome of interest could be both continuous and categorical? Would being able to predict these outcomes of interest be valuable?
For this session, I invite you to use a dataset of your own, as different models work best for different problems:
- This can be either a regression problem or a classification problem.
- Feel free to preprocess it in another program and export it as a csv file or another format of your choosing.
The exercises are designed with a classification problem in mind, but all exercises except the ones about confusion matrices can be exchanged for regression problems by changing from a Classifier to a Regressor model.
The dataset I’ve decided upon is one regarding classification of high-income people, namely the Census Income Data Set from the UCI Machine Learning Repository. I’ve reduced the number of features and the sample size from the full sample, as well as done a little bit of cleaning, to reduce computation time. All the categorical features are one-hot encoded.
Exercise 1.2
Which column in the DataFrame is the target of interest? Subset this as a Series called y, and the rest of the columns as a DataFrame called X.
Hints:
y = df['column_name'] subsets a column as a Series.
X = df.drop(columns='column_name') drops a column in a DataFrame.
Exercise 1.3
As a first step, you should split the data into a development and test set. Make a development and test split with 80% of the data in the development set. Name them X_dev, X_test, y_dev and y_test.
Hints:
Try importing train_test_split from sklearn.model_selection
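A minimal sketch of how this could look (the 80/20 split follows the exercise; the random_state value is my own choice for reproducibility):

from sklearn.model_selection import train_test_split

# 80% development, 20% test
X_dev, X_test, y_dev, y_test = train_test_split(X, y,
                                                test_size=0.2,
                                                random_state=0)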
Validation curves
Last week, you were introduced to validation curves. This is a way of getting an understanding of how a single hyperparameter changes the performance of a model on both seen and unseen data. We will be using this tool throughout these exercises to probe the models and see how the hyperparameters change the performance of the model.
Below I’ve created a snippet of code, which you can copy and use to create the validation curves. This is essentially a function, but I’ve refrained from creating a function so you can easily change it around.
To use it, we need to define four things:
- A modelling pipeline, e.g. a Pipeline with PolynomialFeatures, StandardScaler then Lasso
- A scoring method, e.g. neg_mean_squared_error or accuracy, see this list for potential candidates
- A hyperparameter range, e.g. np.logspace(-4, 4, 10)
- The name of the modelling step and the hyperparameter name, e.g. lasso__alpha
Also make sure that your development data is called X_dev and y_dev.
Note that you can change the scale (normal vs log) by changing logx to False.
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
import pandas as pd
import matplotlib.pyplot as plt

# Modelling pipeline we want to use
pipeline = # FILL IN

# The measure we want to evaluate our model against
score_type = # FILL IN

# A range of hyperparameter values we want to examine
param_range = # FILL IN

# The name of the step in the pipeline and the name of the hyperparameter
param_name = # FILL IN

# Calculate train and test scores using 5 fold cross validation
train_scores, test_scores = \
    validation_curve(estimator=pipeline,
                     X=X_dev,
                     y=y_dev,
                     scoring=score_type,
                     param_name=param_name,
                     param_range=param_range,
                     cv=5)

# Convert train and test scores into a DataFrame
score_df = pd.DataFrame({'Train': train_scores.mean(axis=1),
                         'Validation': test_scores.mean(axis=1),
                         param_name: param_range})

# Plot the scores as a function of the hyperparameter
f, ax = plt.subplots()
score_df.set_index(param_name).plot(logx=True, ax=ax)
Logistic Regression
Here I give an example with LogisticRegression, as this is the only model we are going to be examining today which only supports classification.
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Additional imports
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Pipeline with StandardScaler and LogisticRegression (could add PolynomialFeatures etc.)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logit', LogisticRegression())
])

# I want to evaluate the hyperparameter with accuracy
score_type = 'accuracy'

# Logarithmically spaced between 10^-2 and 10^2
param_range = np.logspace(-2, 2, 20)

# Model step is called 'logit', hyperparameter is called 'C'
param_name = 'logit__C'  # Remember two underscores

# Calculate train and test scores using 5 fold cross validation
train_scores, test_scores = \
    validation_curve(estimator=pipeline,
                     X=X_dev,
                     y=y_dev,
                     scoring=score_type,
                     param_name=param_name,
                     param_range=param_range,
                     cv=5)

# Convert train and test scores into a DataFrame
score_df = pd.DataFrame({'Train': train_scores.mean(axis=1),
                         'Validation': test_scores.mean(axis=1),
                         param_name: param_range})

# Plot the scores as a function of the hyperparameter
f, ax = plt.subplots()
score_df.set_index(param_name).plot(logx=True, ax=ax)
ax.set_ylabel(score_type)
plt.show()
As expected, we find that lower values of C correspond to higher regularization, which causes the model to underfit on both the training and test data. For higher values of C the model starts to overfit, and we see a gap between the train and validation scores.
score_df
Exercise 1.4
Having now examined how the logistic regression behaves, we want to see how it performs on the test data. Create a pipeline with the best hyperparameter found before, fit on the development data and calculate the accuracy on the test data.
Hints:
Try importing accuracy_score from sklearn.metrics
best_param = score_df.iloc[score_df['Validation'].idxmax()][param_name] gets the hyperparameter for the highest validation score
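A hedged sketch of one way to do this, assuming the pipeline setup and the score_df from the logistic regression example above:

from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hyperparameter value with the highest validation score
best_param = score_df.iloc[score_df['Validation'].idxmax()][param_name]

# Rebuild the pipeline with the best C and fit on the development data
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logit', LogisticRegression(C=best_param))
])
best_pipeline.fit(X_dev, y_dev)

# Accuracy on the held-out test data
print(accuracy_score(y_test, best_pipeline.predict(X_test)))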
Exercise 1.5
Plot the confusion matrix for the pipeline from the last exercise on the test set. Has the model learnt anything useful?
Hints:
Try importing ConfusionMatrixDisplay from sklearn.metrics
If this fails, there also exists a deprecated function sklearn.metrics.plot_confusion_matrix, which is available in previous versions of sklearn
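A sketch using the from_estimator helper available in recent sklearn versions, assuming the fitted best_pipeline from the previous exercise:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Confusion matrix of the fitted pipeline on the test set
ConfusionMatrixDisplay.from_estimator(best_pipeline, X_test, y_test)
plt.show()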
Exercise 1.6
If you’re using the dataset I gave you, you might have noticed that the class distribution is not completely equal, which can be seen both using summary statistics and in the confusion matrix. In this setting, a baseline model becomes even more important, as a model which guesses the majority class all the time might perform quite well if the data is imbalanced enough.
Create a pipeline with a baseline model that always guesses the majority class
Hints:
Try importing DummyClassifier from sklearn.dummy
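A minimal sketch of such a baseline (the step name 'dummy' is my own choice):

from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline

# Baseline that always predicts the majority class of the training data
baseline = Pipeline([
    ('dummy', DummyClassifier(strategy='most_frequent'))
])
baseline.fit(X_dev, y_dev)
print(baseline.score(X_test, y_test))  # accuracy of always guessing the majority class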
Exercise 1.7 (OPTIONAL)
What would the confusion matrix for this dummy classifier look like? Try plotting it. Was your intuition correct?
Hints:
Try importing ConfusionMatrixDisplay from sklearn.metrics
If this fails, there also exists a deprecated function sklearn.metrics.plot_confusion_matrix, which is available in previous versions of sklearn
Decision Tree
Having now examined a logistic regression, baseline models and the confusion matrix, we turn to the more exotic models you were introduced to today, starting with the decision tree.
Exercise 2.1
What does the max_depth parameter in a decision tree do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of max_depth. Use the values np.unique(np.logspace(0, 4, 10).astype(int)), which returns integers that are evenly spaced on a log scale. Why should they be converted to integers?
Hints:
Try importing DecisionTreeClassifier or DecisionTreeRegressor from sklearn.tree
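One way the validation-curve snippet might be filled in for this exercise (the step name 'tree' is my own choice):

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
import numpy as np

pipeline = Pipeline([
    ('tree', DecisionTreeClassifier())
])
score_type = 'accuracy'
param_range = np.unique(np.logspace(0, 4, 10).astype(int))  # integer depths on a log scale
param_name = 'tree__max_depth'
# ... then run the validation-curve snippet from above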
Exercise 2.2
What does the min_samples_split parameter in a decision tree do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of min_samples_split. Use the values np.arange(0.05, 1.05, 0.05), which returns fractions from 0.05 to 1, spaced 0.05 apart. What do these fractions mean?
Exercise 2.3
What does the min_samples_leaf parameter in a decision tree do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of min_samples_leaf. Use the values np.arange(2, 50, 2).
Exercise 2.4
To find the best hyperparameter values, implement a randomized search (RandomizedSearchCV) using the previous hyperparameter ranges. Use n_iter = 25. If your model takes too long to run, you can change this parameter – should you increase it or lower it to reduce running time? What are the best hyperparameters?
Hints:
Look at exercise 2.6 from exercise session 3 for inspiration
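A hedged sketch of how the randomized search could be set up; the parameter distributions simply reuse the ranges from the previous exercises, and the step name 'tree', scoring, cv and random_state are my own choices:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
import numpy as np

pipeline = Pipeline([
    ('tree', DecisionTreeClassifier())
])

param_distributions = {
    'tree__max_depth': np.unique(np.logspace(0, 4, 10).astype(int)),
    'tree__min_samples_split': np.arange(0.05, 1.05, 0.05),
    'tree__min_samples_leaf': np.arange(2, 50, 2),
}

# Sample 25 random hyperparameter combinations, evaluated with 5 fold cross validation
search = RandomizedSearchCV(pipeline,
                            param_distributions,
                            n_iter=25,
                            scoring='accuracy',
                            cv=5,
                            random_state=0)
search.fit(X_dev, y_dev)
print(search.best_params_)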
Exercise 2.5
Calculate the accuracy of your model with the best hyperparameters. Is it better than the baseline?
Hints:
If you are using regression data, you can compare to a baseline with DummyRegressor from sklearn.dummy
Feel free to plot the confusion matrix as well
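A sketch of the comparison, assuming the fitted search from the previous exercise and the dummy baseline pipeline from earlier:

from sklearn.metrics import accuracy_score

# best_estimator_ is the pipeline refit on all development data with the best hyperparameters
y_pred = search.best_estimator_.predict(X_test)
print('Tuned tree:', accuracy_score(y_test, y_pred))
print('Baseline:  ', baseline.score(X_test, y_test))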
Ensemble Model
As covered in the lectures, there exist two overarching ensemble methods, bagging and boosting.
- For bagging, we use bootstrap aggregation to train many models, averaging their predictions afterwards.
- For boosting, we sequentially train models, optimizing them to aid each other in the prediction task.
As examples of these two approaches, the lecture covered Random Forests, a bagging algorithm, and AdaBoost, a boosting algorithm, which we examine in the next two sections.
Random Forest (Bagging)
Exercise 3.1
The Random Forest has all the same hyperparameters as the decision tree, but also a few new ones. For each point below, explain what the hyperparameter pertaining to sklearn.ensemble.RandomForestClassifier controls, and how setting it either too low or too high (or True/False) might hurt model performance:
1. n_estimators
2. max_depth
3. max_features
4. bootstrap
Exercise 3.2
For n_estimators > 1, how should one set the hyperparameters max_features and bootstrap so that all the trees in the ensemble end up identical?
Exercise 3.3
Create a validation plot with values of n_estimators. Use the values np.unique(np.logspace(0, 3, 25).astype(int)). How does it influence the train and validation scores?
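As a sketch, the relevant pieces of the validation-curve snippet might be filled in like this (the step name 'forest' is my own choice):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import numpy as np

pipeline = Pipeline([
    ('forest', RandomForestClassifier())
])
score_type = 'accuracy'
param_range = np.unique(np.logspace(0, 3, 25).astype(int))
param_name = 'forest__n_estimators'
# ... then run the validation-curve snippet from above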
Exercise 3.4
What does the max_features parameter in a Random Forest do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of max_features. Use the values np.arange(0.1, 1.01, 0.1). Does it influence the train and validation scores?
Exercise 3.5 (OPTIONAL)
To find the best hyperparameter values, implement a randomized search (RandomizedSearchCV) using the previous hyperparameter ranges, including those from the decision tree section. Use n_iter = 10. If your model takes too long to run, you can change this parameter – should you increase it or lower it to reduce running time? What are the best hyperparameters? How does the model perform on the test set?
Hints:
Look at exercise 2.6 from exercise session 3 for inspiration
AdaBoost (Boosting)
Exercise 4.1
What does the n_estimators parameter in AdaBoost do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of n_estimators. Use the values [int(x) for x in np.linspace(start = 1, stop = 500, num = 10)]
Hints:
Try importing AdaBoostClassifier from sklearn.ensemble
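A sketch of the pieces to plug into the validation-curve snippet (the step name 'adaboost' matches the grid used further below):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
import numpy as np

pipeline = Pipeline([
    ('adaboost', AdaBoostClassifier())
])
score_type = 'accuracy'
param_range = [int(x) for x in np.linspace(start=1, stop=500, num=10)]
param_name = 'adaboost__n_estimators'
# ... then run the validation-curve snippet from above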
Exercise 4.2
As AdaBoost is a boosting algorithm, it is designed to use weak learners. What does this imply for the hyperparameter space you should search over?
Exercise 4.3 (OPTIONAL)
Iterate over the hyperparameter grid given below using RandomizedSearchCV with n_iter = 10. Are there any new hyperparameters that you haven’t seen before? Consider whether you are getting any corner solutions. What does this imply for your hyperparameter search?
Note how I specify hyperparameters in the decision tree using __ twice, first to access base_estimator and then the base estimator’s hyperparameters.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ('adaboost', AdaBoostClassifier(base_estimator=DecisionTreeClassifier()))
])

param_grid = [{
    'adaboost__n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=4)],
    'adaboost__learning_rate': [0.01, 0.1, 0.5, 1],
    'adaboost__base_estimator__max_depth': [1, 5, 9],
    'adaboost__base_estimator__min_samples_split': [2, 5, 9],
    'adaboost__base_estimator__min_samples_leaf': [1, 3, 5],
    'adaboost__base_estimator__max_leaf_nodes': [2, 5, 9],
}]
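A hedged sketch of running the randomized search over this grid; the scoring, cv and random_state values are my own choices:

from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(pipeline,
                            param_grid,
                            n_iter=10,
                            scoring='accuracy',
                            cv=5,
                            random_state=0)
search.fit(X_dev, y_dev)
print(search.best_params_)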
Gradient Boosting
As a small aside, there exists a subset of boosting models called Gradient Boosting models. These models are very powerful, and you should be aware that they exist. In essence, instead of re-weighting the samples, each new model is trained to minimize the residuals of the previous ones.
Two examples from sklearn are GradientBoostingClassifier (see documentation here) and HistGradientBoostingClassifier (see documentation here), which also have Regressor counterparts.
Perhaps the most popular is XGBoost. It is not implemented in sklearn, but it uses the same interface, so the process is exactly the same with fit and predict. See the documentation here. The source is Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
Other boosting algorithms are LightGBM, for efficient training, and CatBoost, for datasets with many categorical features.
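Since these models follow the usual sklearn interface, a minimal usage sketch could look like this (the hyperparameter values are illustrative only, not tuned for this dataset):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_dev, y_dev)
print(gb.score(X_test, y_test))  # accuracy on the test set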
Neural network
A visual inspection of neural networks
Instead of diving into code, it’s more important that our intuition about what neural networks are doing is as good as possible. The best (and most fun) way to do that is to play around with things a bit, so go familiarize yourself with the Tensorflow Playground, slide some knobs and pull some levers. The example in the lecture uses the same idea for demonstrating the intuition of neural networks.
Exercise 5.1
Using the dataset with the two point clouds, create the minimal neural network that separates the clusters. You can share your answer with a link (the URL on playground.tensorflow.org changes as you update the network, so at any time you can use the link to show others what you have created).
Exercise 5.2
Using the dataset with the two circular clusters (one inner and one outer), create the minimal neural network that separates the clusters.
Exercise 5.3 (OPTIONAL)
See if you can create a network that performs well on the dataset with the intertwined spirals. Can you do it with only \(x_1\) and \(x_2\)?
Hints:
Try experimenting with the depth of the network, regularization and possibly the activation function
Having now slid some knobs and pulled some levers to get some intuition for how the neural networks operate, we turn to the Multilayer Perceptron in sklearn.
Exercise 5.4 (OPTIONAL)
Try to create a neural network which performs better than the best model on the test data. You may want to consider looking at different strengths of regularization (alpha, perhaps using np.logspace) and different numbers of hidden layers and hidden neurons. At this point, even a semi-exhaustive search of hyperparameters becomes computationally infeasible, and machine learning turns to art.
Note: It is not given that a neural network performs best for the given problem, and even if such a model exists, it may be hard to find the right architecture. I have not succeeded.
Hints:
It may be time-consuming to do k fold cross validation. Splitting your development data into a train and validation set a single time is also a possibility. The only rule is that you don’t use the test data for model selection!
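A hedged sketch of one possible starting point, assuming MLPClassifier from sklearn.neural_network and a single train/validation split as suggested in the hint; the architecture, alpha grid and other values are my own guesses, not a recommended configuration:

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Single split of the development data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev,
                                                  test_size=0.2,
                                                  random_state=0)

best_score, best_model = -1, None
for alpha in np.logspace(-4, 0, 5):
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('mlp', MLPClassifier(hidden_layer_sizes=(50, 50),
                              alpha=alpha,
                              max_iter=500,
                              random_state=0))
    ])
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model

# Only after model selection do we look at the test data
print(best_model.score(X_test, y_test))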