Magnus Nielsen, SODAS, UCPH
Supervised learning: has a given target and uses structured datasets
Focus on two things:
I will give a tips-and-tricks summary for each model
I won’t go much into the math
What’s the difference between regression & classification?
Target is a continuous value
Common in many situations:
You probably know a lot of them already:
The choice of metric often changes how models weigh extreme values (see the sketch below)
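As a quick illustration, here is a minimal sketch (the toy numbers are made up) comparing mean squared error and mean absolute error when the predictions miss a single large outlier:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy data: the last observation is a large outlier
y_true = [1.0, 2.0, 3.0, 4.0, 20.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 5.0]

# Squaring lets the single outlier dominate the metric,
# while absolute errors weigh all residuals linearly
print(mean_squared_error(y_true, y_pred))   # 45.0
print(mean_absolute_error(y_true, y_pred))  # 3.0
```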
Categorical outcomes, which can be both binary and multiclass
Common in many situations:
An intuitive starting point would be the accuracy of the classifier, i.e. what percentage of observations were correctly classified
\[\text{Accuracy} = \frac{\text{True}}{\text{True} + \text{False}}\]
We need to operationalize this mathematically
Source: Raschka & Mirjalili, 2022, ch. 6
Rewriting accuracy using the confusion matrix
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
This is a fraction between 0 and 1
Use precision if it is important not to classify negative labels as positive
\[\text{Precision} = \frac{TP}{TP + FP}\]
Interpretation: ‘How many of the predicted positive labels are correct’
Use recall if it is important not to miss positive labels
\[\text{Recall} = \frac{TP}{TP + FN}\]
Interpretation: ‘How many of the actual positive labels did we correctly classify’
sklearn has a whole list of performance metrics you can easily use; other common ones are the F1 score and the ROC AUC (AUROC)
The results can be very dependent on the metric chosen, especially if classes are imbalanced
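To see how the metrics can diverge under class imbalance, here is a minimal sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Imbalanced toy labels: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one FP, one FN

print(accuracy_score(y_true, y_pred))   # 0.8 -- looks decent
print(precision_score(y_true, y_pred))  # 0.5 -- half the predicted positives are wrong
print(recall_score(y_true, y_pred))     # 0.5 -- half the actual positives are missed
```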
Tips:
Linear combination with a logistic/sigmoid activation function to turn it into a probability
\[p(X) = \sigma(f(X, w)) = \frac{1}{1 + e^{-(Xw)}}\]
Regularized just like Lasso or Ridge models
Regularization implies we need to…
sklearn.linear_model.LogisticRegression for classification
penalty specifies the type of regularization: l1 for absolute values, l2 for squared values
C specifies the (inverse) regularization strength
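A minimal usage sketch (the dataset and parameter values are illustrative, not tuned; the scaling step reflects standard practice for regularized models):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Note that C is the INVERSE regularization strength:
# smaller C means a stronger penalty
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set
```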
Split the data up into different groups one or more times, ending up with a flowchart!
Why?
Why not?
To make an algorithm, we need a measure of ‘goodness’ of fit
For classification we use separation of classes as a measure, e.g. Gini impurity or entropy (sketched below)
For regression, we use metrics such as mean squared error or mean absolute error
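As a sketch of what separation of classes means, here is Gini impurity, the default criterion in sklearn's classification trees (the gini helper below is illustrative, not sklearn's internal code):

```python
import numpy as np

def gini(labels):
    """Gini impurity: chance of mislabeling a random draw from the node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))  # 0.5 -- maximally mixed node
print(gini([0, 0, 0, 0]))  # 0.0 -- pure node, nothing left to separate
```

A split is good if the child nodes have lower (weighted) impurity than the parent.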
sklearn.tree.DecisionTreeClassifier for classification, sklearn.tree.DecisionTreeRegressor for regression
max_depth specifies how deep the tree can grow
min_samples_split specifies how many observations need to be in a node for it to allow a split
min_samples_leaf specifies how many observations need to be in an end node/leaf for it to be allowed
sklearn has a write-up on decision trees with more math, tips for practical use and more
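A minimal usage sketch (dataset and parameter values are illustrative, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# All three parameters restrict how far the flowchart can grow,
# which is the main lever against overfitting
tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```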
In bagging, we do bootstrap aggregation
In boosting, we ‘boost’ the models sequentially
The random forest is a bagging method using many decision trees
The new thing is two sources of randomness: each tree is grown on a bootstrapped sample, and only a random subset of features is considered at each split
You need at least one of these two sources of randomness to avoid growing identical trees
This reduces overfitting, as averaging many decorrelated trees lowers the variance of the predictions
sklearn.ensemble.RandomForestClassifier for classification, sklearn.ensemble.RandomForestRegressor for regression
n_estimators controls how many trees are grown
max_features controls how many features are considered at each split
Random forests are random, so remember to set a seed (random_state)!
sklearn has a short write-up on random forests
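A minimal usage sketch (dataset and parameter values are illustrative, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees grown
    max_features="sqrt",   # features considered at each split
    random_state=0,        # the seed -- forests are random!
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```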
Use ‘weak learners’ (often decision trees of relatively shallow depth) which are built to correct each other’s mistakes
The AdaBoost (adaptive boosting) model does this by weighting misclassified observations (or, in regression, observations with high error) more heavily when fitting the next weak learner
sklearn.ensemble.AdaBoostClassifier for classification, sklearn.ensemble.AdaBoostRegressor for regression
n_estimators controls how many trees are grown
learning_rate controls how much weight is given to misclassified observations
sklearn has a short write-up on AdaBoost
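A minimal usage sketch (dataset and parameter values are illustrative; by default the weak learners are depth-1 decision trees):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(
    n_estimators=100,    # number of sequential weak learners
    learning_rate=0.5,   # shrinks each learner's contribution
    random_state=0,
)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```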
Neural networks can be extremely complicated
They can be both supervised and unsupervised
As an introduction, we focus on feed forward neural networks
Different models are created for and excel in different areas
In essence, feed forward neural networks are nested logistic regressions (roughly)
Linear combination with an activation function (sigmoid/logistic function, \(\sigma\)) \[\sigma(f(X, w)) = \frac{1}{1 + e^{-(Xw)}}\]
On its own, a logistic regression is not good at non-linear relationships
Would require input transformations
What if we limited ourselves to inputting just \(X_1\) and \(X_2\)?
How many logistic regressions do you think it would take to approximately separate the two groups?
A fully connected feed forward neural network is called a Multilayer Perceptron (MLP)
By omitting the last logistic activation, the model is turned into a regression model
The sklearn implementation is regularized (L2), so we should…
sklearn.neural_network.MLPClassifier for classification, sklearn.neural_network.MLPRegressor for regression
alpha controls the amount of regularization
hidden_layer_sizes is a tuple which controls the number of hidden layers and the number of neurons in each
sklearn has a write-up on MLP with more math, tips for practical use and more
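A minimal usage sketch (dataset and parameter values are illustrative, not tuned; the scaling step reflects that the implementation is regularized):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(32, 16),  # two hidden layers: 32 and 16 neurons
        alpha=1e-3,                   # L2 regularization strength
        max_iter=1000,
        random_state=0,
    ),
)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```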
Raschka, S., Liu, Y. H., Mirjalili, V., & Dzhulgakov, D. (2022). Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd.