# Suppress convergence warnings if they appear
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

# Actual code to load
import pandas as pd
df = pd.read_csv('adult_preprocessed.csv')
df.describe()
Exercise Set 4: Supervised learning
In this exercise set, we will mainly be looking at different supervised learning algorithms, both tinkering around with them and seeing how the models perform for a given dataset. We will look at:
- Logistic regression
- Decision tree
- Ensemble methods
  - Random Forest
  - AdaBoost
- Neural network
If you need more information about the models or how to tune their hyperparameters, try looking up the documentation or googling "hyperparameter tuning" + the model name.
Throughout your career, you’ve probably worked with many problems. Some problems can easily be formulated as a regression problem, whereas others are easily formulated as a classification problem.
Exercise 1.1
Name three different problems which you’ve worked with where the outcome of interest was:
- Continuous (regression)
- Categorical (classification)
Have you encountered problems where the outcome of interest could be both continuous and categorical? Would being able to predict these outcomes of interest be valuable?
For this session, I invite you to use a dataset of your own, as different models work best for different problems:
- This can be either a regression problem or a classification problem.
- Feel free to preprocess it in another program and export it as a csv file or another format of your choosing.
The exercises are designed with a classification problem in mind, but all exercises except the ones about confusion matrices can be exchanged for regression problems by changing from a Classifier to a Regressor model.
The dataset I’ve decided upon is one regarding classification of high-income people, namely the Census Income Data Set from the UCI Machine Learning Repository. I’ve reduced the number of features and the sample size from the full sample, as well as done a little bit of cleaning, to reduce computation time. All the categorical features are one-hot encoded.
Exercise 1.2
Which column in the DataFrame is the target of interest? Subset this as a Series called y, and the rest of the columns as a DataFrame called X.
Hints:
y = df['column_name'] subsets a column as a Series.
X = df.drop(columns='column_name') drops a column in a DataFrame.
Exercise 1.3
As a first step, you should split the data into a development and test set. Make a development and test split with 80% of the data in the development set. Name them X_dev, X_test, y_dev and y_test.
Hints:
Try importing train_test_split from sklearn.model_selection
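A minimal sketch of how this could look (the 80/20 split follows the exercise; the random_state value is my own choice for reproducibility):

from sklearn.model_selection import train_test_split

# 80% development, 20% test
X_dev, X_test, y_dev, y_test = train_test_split(X, y,
                                                test_size=0.2,
                                                random_state=0)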
Validation curves
Last week, you were introduced to validation curves. This is a way of getting an understanding of how a single hyperparameter changes the performance of a model on both seen and unseen data. We will be using this tool throughout these exercises to probe the models and see how the hyperparameters change the performance of the model.
Below I’ve created a snippet of code, which you can copy and use to create the validation curves. This is essentially a function, but I’ve refrained from creating a function so you can easily change it around.
To use it, we need to define four things:
- A modelling pipeline, e.g. a Pipeline with PolynomialFeatures, StandardScaler then Lasso
- A scoring method, e.g. neg_mean_squared_error or accuracy, see this list for potential candidates
- A hyperparameter range, e.g. np.logspace(-4, 4, 10)
- The name of the modelling step and the hyperparameter name, e.g. lasso__alpha
Also make sure that your development data is called X_dev and y_dev.
Note that you can change the scale (normal vs log) by changing logx to False.
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
import pandas as pd
import matplotlib.pyplot as plt

# Modelling pipeline we want to use
pipeline = # FILL IN

# The measure we want to evaluate our model against
score_type = # FILL IN

# A range of hyperparameter values we want to examine
param_range = # FILL IN

# The name of the step in the pipeline and the name of the hyperparameter
param_name = # FILL IN

# Calculate train and test scores using 5 fold cross validation
train_scores, test_scores = \
    validation_curve(estimator=pipeline,
                     X=X_dev,
                     y=y_dev,
                     scoring=score_type,
                     param_name=param_name,
                     param_range=param_range,
                     cv=5)

# Convert train and test scores into a DataFrame
score_df = pd.DataFrame({'Train': train_scores.mean(axis=1),
                         'Validation': test_scores.mean(axis=1),
                         param_name: param_range})

# Plot the scores as a function of the hyperparameter
f, ax = plt.subplots()
score_df.set_index(param_name).plot(logx=True, ax=ax)
Logistic Regression
Here I give an example with LogisticRegression, as this is the only model we are going to be examining today which only supports classification.
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Additional imports
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Pipeline with StandardScaler and LogisticRegression (could add PolynomialFeatures etc.)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logit', LogisticRegression())
])

# I want to evaluate the hyperparameter with accuracy
score_type = 'accuracy'

# Logarithmically spaced between 10^-2 and 10^2
param_range = np.logspace(-2, 2, 20)

# Model step is called 'logit', hyperparameter is called 'C'
param_name = 'logit__C'  # Remember two underscores

# Calculate train and test scores using 5 fold cross validation
train_scores, test_scores = \
    validation_curve(estimator=pipeline,
                     X=X_dev,
                     y=y_dev,
                     scoring=score_type,
                     param_name=param_name,
                     param_range=param_range,
                     cv=5)

# Convert train and test scores into a DataFrame
score_df = pd.DataFrame({'Train': train_scores.mean(axis=1),
                         'Validation': test_scores.mean(axis=1),
                         param_name: param_range})

# Plot the scores as a function of the hyperparameter
f, ax = plt.subplots()
score_df.set_index(param_name).plot(logx=True, ax=ax)
ax.set_ylabel(score_type)
plt.show()
As expected, we find that lower values of C correspond to higher regularization, which causes the model to underfit on both the training and test data. For higher values of C the model starts to overfit, and we see a gap between the train and validation scores.
score_df
Exercise 1.4
Having now examined how the logistic regression behaves, we want to see how it performs on the test data. Create a pipeline with the best hyperparameter found before, fit on the development data and calculate the accuracy on the test data.
Hints:
Try importing accuracy_score from sklearn.metrics
best_param = score_df.iloc[score_df['Validation'].idxmax()][param_name] gets the hyperparameter for the highest validation score
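A hedged sketch of one way to do this, assuming the pipeline setup and the score_df from the logistic regression example above:

from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hyperparameter value with the highest validation score
best_param = score_df.iloc[score_df['Validation'].idxmax()][param_name]

# Rebuild the pipeline with the best C and fit on the development data
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logit', LogisticRegression(C=best_param))
])
best_pipeline.fit(X_dev, y_dev)

# Accuracy on the held-out test data
print(accuracy_score(y_test, best_pipeline.predict(X_test)))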
Exercise 1.5
Plot the confusion matrix for the pipeline from the last exercise on the test set. Has the model learnt anything useful?
Hints:
Try importing ConfusionMatrixDisplay from sklearn.metrics
If this fails, there also exists a deprecated function sklearn.metrics.plot_confusion_matrix, which is available in previous versions of sklearn
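A sketch using the from_estimator helper available in recent sklearn versions, assuming the fitted best_pipeline from the previous exercise:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Confusion matrix of the fitted pipeline on the test set
ConfusionMatrixDisplay.from_estimator(best_pipeline, X_test, y_test)
plt.show()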
Exercise 1.6
If you’re using the dataset I gave you, you might have noticed that the class distribution is not completely equal, which can be seen both using summary statistics and in the confusion matrix. In this setting, a baseline model becomes even more important, as a model which guesses the majority class all the time might perform quite well if the data is imbalanced enough.
Create a pipeline with a baseline model that always guesses the majority class
Hints:
Try importing DummyClassifier from sklearn.dummy
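A minimal sketch of such a baseline (the step name 'dummy' is my own choice):

from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline

# Baseline that always predicts the majority class of the training data
baseline = Pipeline([
    ('dummy', DummyClassifier(strategy='most_frequent'))
])
baseline.fit(X_dev, y_dev)
print(baseline.score(X_test, y_test))  # accuracy of always guessing the majority class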
Exercise 1.7 (OPTIONAL)
What would the confusion matrix for this dummy classifier look like? Try plotting it. Was your intuition correct?
Hints:
Try importing ConfusionMatrixDisplay from sklearn.metrics
If this fails, there also exists a deprecated function sklearn.metrics.plot_confusion_matrix, which is available in previous versions of sklearn
Decision Tree
Having now examined a logistic regression, baseline models and the confusion matrix, we turn to the more exotic models you were introduced to today, starting with the decision tree.
Exercise 2.1
What does the max_depth parameter in a decision tree do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of max_depth. Use the values np.unique(np.logspace(0, 4, 10).astype(int)), which returns integers that are evenly spaced on a log scale. Why should they be converted to integers?
Hints:
Try importing DecisionTreeClassifier or DecisionTreeRegressor from sklearn.tree
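One way the validation-curve snippet might be filled in for this exercise (the step name 'tree' is my own choice):

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
import numpy as np

pipeline = Pipeline([
    ('tree', DecisionTreeClassifier())
])
score_type = 'accuracy'
param_range = np.unique(np.logspace(0, 4, 10).astype(int))  # integer depths on a log scale
param_name = 'tree__max_depth'
# ... then run the validation-curve snippet from above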
Exercise 2.2
What does the min_samples_split parameter in a decision tree do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of min_samples_split. Use the values np.arange(0.05, 1.05, 0.05), which returns fractions from 0.05 to 1, spaced 0.05 apart. What do these fractions mean?
Exercise 2.3
What does the min_samples_leaf parameter in a decision tree do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of min_samples_leaf. Use the values np.arange(2, 50, 2).
Exercise 2.4
To find the best hyperparameter values, implement a randomized search (RandomizedSearchCV) using the previous hyperparameter ranges. Use n_iter = 25. If your model takes too long to run, you can change this parameter – should you increase it or lower it to reduce running time? What are the best hyperparameters?
Hints:
Look at exercise 2.6 from exercise session 3 for inspiration
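A hedged sketch of how the randomized search could be set up; the parameter distributions simply reuse the ranges from the previous exercises, and the step name 'tree', scoring, cv and random_state are my own choices:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
import numpy as np

pipeline = Pipeline([
    ('tree', DecisionTreeClassifier())
])

param_distributions = {
    'tree__max_depth': np.unique(np.logspace(0, 4, 10).astype(int)),
    'tree__min_samples_split': np.arange(0.05, 1.05, 0.05),
    'tree__min_samples_leaf': np.arange(2, 50, 2),
}

# Sample 25 random hyperparameter combinations, evaluated with 5 fold cross validation
search = RandomizedSearchCV(pipeline,
                            param_distributions,
                            n_iter=25,
                            scoring='accuracy',
                            cv=5,
                            random_state=0)
search.fit(X_dev, y_dev)
print(search.best_params_)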
Exercise 2.5
Calculate the accuracy of your model with the best hyperparameters. Is it better than the baseline?
Hints:
If you are using regression data, you can compare to a baseline with DummyRegressor from sklearn.dummy
Feel free to plot the confusion matrix as well
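A sketch of the comparison, assuming the fitted search from the previous exercise and the dummy baseline pipeline from earlier:

from sklearn.metrics import accuracy_score

# best_estimator_ is the pipeline refit on all development data with the best hyperparameters
y_pred = search.best_estimator_.predict(X_test)
print('Tuned tree:', accuracy_score(y_test, y_pred))
print('Baseline:  ', baseline.score(X_test, y_test))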
Ensemble Model
As covered in the lectures, there exist two overarching ensemble methods, bagging and boosting.
- For bagging, we use bootstrap aggregation to train many models, averaging their predictions afterwards.
- For boosting, we sequentially train models, optimizing them to aid each other in the prediction task.
As examples of these two approaches, the lecture covered Random Forests, a bagging algorithm, and AdaBoost, a boosting algorithm, which we examine in the next two sections.
Random Forest (Bagging)
Exercise 3.1
The Random Forest has all the same hyperparameters as the decision tree, but also a few new ones. For each point below, explain what the hyperparameter pertaining to sklearn.ensemble.RandomForestClassifier controls, and how setting it either too low or too high (or True/False) might hurt model performance:
1. n_estimators
2. max_depth
3. max_features
4. bootstrap
Exercise 3.2
For n_estimators > 1, how should one set the hyperparameters max_features and bootstrap so that all the trees in the ensemble end up identical?
Exercise 3.3
Create a validation plot with values of n_estimators. Use the values np.unique(np.logspace(0, 3, 25).astype(int)). How does it influence the train and validation scores?
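As a sketch, the relevant pieces of the validation-curve snippet might be filled in like this (the step name 'forest' is my own choice):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import numpy as np

pipeline = Pipeline([
    ('forest', RandomForestClassifier())
])
score_type = 'accuracy'
param_range = np.unique(np.logspace(0, 3, 25).astype(int))
param_name = 'forest__n_estimators'
# ... then run the validation-curve snippet from above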
Exercise 3.4
What does the max_features parameter in a Random Forest do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of max_features. Use the values np.arange(0.1, 1.01, 0.1). Does it influence the train and validation scores?
Exercise 3.5 (OPTIONAL)
To find the best hyperparameter values, implement a randomized search (RandomizedSearchCV) using the previous hyperparameter ranges, including those from the decision tree section. Use n_iter = 10. If your model takes too long to run, you can change this parameter – should you increase it or lower it to reduce running time? What are the best hyperparameters? How does the model perform on the test set?
Hints:
Look at exercise 2.6 from exercise session 3 for inspiration
AdaBoost (Boosting)
Exercise 4.1
What does the n_estimators parameter in AdaBoost do? Does the model overfit more or less if you increase this value?
Create a validation plot with values of n_estimators. Use the values [int(x) for x in np.linspace(start = 1, stop = 500, num = 10)]
Hints:
Try importing AdaBoostClassifier from sklearn.ensemble
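A sketch of the pieces to plug into the validation-curve snippet (the step name 'adaboost' matches the grid used further below):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
import numpy as np

pipeline = Pipeline([
    ('adaboost', AdaBoostClassifier())
])
score_type = 'accuracy'
param_range = [int(x) for x in np.linspace(start=1, stop=500, num=10)]
param_name = 'adaboost__n_estimators'
# ... then run the validation-curve snippet from above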
Exercise 4.2
As AdaBoost is a boosting algorithm, it is designed to use weak learners. What does this imply for the hyperparameter space you should search over?
Exercise 4.3 (OPTIONAL)
Iterate over the hyperparameter grid given below using RandomizedSearchCV with n_iter = 10. Are there any new hyperparameters that you haven’t seen before? Consider whether you are getting any corner solutions. What does this imply for your hyperparameter search?
Note how I specify hyperparameters in the decision tree using __ twice, first to access base_estimator and then the base estimator’s hyperparameters.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ('adaboost', AdaBoostClassifier(base_estimator=DecisionTreeClassifier()))
])

param_grid = [{
    'adaboost__n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=4)],
    'adaboost__learning_rate': [0.01, 0.1, 0.5, 1],
    'adaboost__base_estimator__max_depth': [1, 5, 9],
    'adaboost__base_estimator__min_samples_split': [2, 5, 9],
    'adaboost__base_estimator__min_samples_leaf': [1, 3, 5],
    'adaboost__base_estimator__max_leaf_nodes': [2, 5, 9],
}]
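A hedged sketch of running the randomized search over this grid; the scoring, cv and random_state values are my own choices:

from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(pipeline,
                            param_grid,
                            n_iter=10,
                            scoring='accuracy',
                            cv=5,
                            random_state=0)
search.fit(X_dev, y_dev)
print(search.best_params_)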
Gradient Boosting
As a small aside, there exists a subset of boosting models called Gradient Boosting models. These models are very powerful, and you should be aware that they exist. In essence, instead of re-weighting the samples, each new model is trained to minimize the residuals of the previous ones.
Two examples from sklearn are GradientBoostingClassifier (see documentation here) and HistGradientBoostingClassifier (see documentation here), which also have Regressor counterparts.
Perhaps the most popular is XGBoost. It is not implemented in sklearn, but it uses the same interface, so the process is exactly the same with fit and predict. See the documentation here. The source is Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
Other boosting algorithms are LightGBM, for efficient training, and CatBoost, for datasets with many categorical features.
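Since these models follow the usual sklearn interface, a minimal usage sketch could look like this (the hyperparameter values are illustrative only, not tuned for this dataset):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_dev, y_dev)
print(gb.score(X_test, y_test))  # accuracy on the test set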
Neural network
A visual inspection of neural networks
Instead of diving into code, it’s more important that our intuition about what neural networks are doing is as good as possible. The best (and most fun) way to do that is to play around with things a bit, so go familiarize yourself with the Tensorflow Playground, slide some knobs and pull some levers. The example in the lecture uses the same idea for demonstrating the intuition of neural networks.
Exercise 5.1
Using the dataset with the two point clouds, create the minimal neural network that separates the clusters. You can share your answer with a link (the URL on playground.tensorflow.org changes as you update the network, so at any time you can use the link to show others what you have created).
Exercise 5.2
Using the dataset with the two circular clusters (one inner and one outer), create the minimal neural network that separates the clusters.
Exercise 5.3 (OPTIONAL)
See if you can create a network that performs well on the dataset with the intertwined spirals. Can you do it with only \(x_1\) and \(x_2\)?
Hints:
Try experimenting with the depth of the network, regularization and possibly the activation function
Having now slid some knobs and pulled some levers to get some intuition for how the neural networks operate, we turn to the Multilayer Perceptron in sklearn.
Exercise 5.4 (OPTIONAL)
Try to create a neural network which performs better than the best model on the test data. You may want to consider looking at different strengths of regularization (alpha, perhaps using np.logspace) and different numbers of hidden layers and hidden neurons. At this point, even a semi-exhaustive search of hyperparameters becomes computationally infeasible, and machine learning turns to art.
Note: It is not given that a neural network performs best for the given problem, and even if such a model exists, it may be hard to find the right architecture. I have not succeeded.
Hints:
It may be time-consuming to do k fold cross validation. Splitting your development data into a train and validation set a single time is also a possibility. The only rule is that you don’t use the test data for model selection!
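A hedged sketch of one possible starting point, assuming MLPClassifier from sklearn.neural_network and a single train/validation split as suggested in the hint; the architecture, alpha grid and other values are my own guesses, not a recommended configuration:

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Single split of the development data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev,
                                                  test_size=0.2,
                                                  random_state=0)

best_score, best_model = -1, None
for alpha in np.logspace(-4, 0, 5):
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('mlp', MLPClassifier(hidden_layer_sizes=(50, 50),
                              alpha=alpha,
                              max_iter=500,
                              random_state=0))
    ])
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model

# Only after model selection do we look at the test data
print(best_model.score(X_test, y_test))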