Unsupervised learning

Magnus Nielsen, SODAS, UCPH

Agenda

  • Survey
  • Unsupervised learning
    • Dimensionality reduction
    • Clustering
  • Text as data
    • Dictionary based methods
    • Bag-of-words

Survey

Heterogeneous student body

  • Both across and within Aarhus and Copenhagen
  • Primary job responsibilities are of course main priority

You want more lecturing and less coding assistance

Solution…?

  • Less focus on Python in lectures
  • More code given in exercises

Unsupervised learning

We do not have/use a given target

  • Different models ‘create’ their own target and structure

Structure not always necessary

  • Depends more on the domain than on whether the learning is supervised or not
  • Utilize data sources such as text and images in novel ways

Dimensionality reduction

Question

Have you worked with dimensionality reduction before?

What methods did you use?

Why?

Reducing dimensionality of the data is done primarily for three reasons:

  1. It reduces computation time for later models
  2. It can increase performance (the curse of dimensionality)
  3. It makes visualizations easier

PCA

The most common method for dimensionality reduction is (probably) Principal Component Analysis

Main idea is to create new variables:

  • Which are orthogonal
  • Each capturing as much variance as possible

Creates linear ‘latent variables’

The method ‘intuitively’

  • Find the direction of max variance in data
    • This is the first component
  • Subtract the first component
  • Find the direction of max variance in data
    • This is the second component
  • Subtract, and create additional components until no more variance in data

Earlier components will explain more variance!

An illustration

Principal components in 2D

Source: Raschka & Mirjalili, 2022, ch. 6

The method ‘mathematically’

  1. Standardize the features
  2. Calculate the correlation matrix
  3. Perform an eigenvector decomposition of the correlation matrix
  4. Order eigenvectors by the size of the eigenvalues
  5. Transform original data into \(K\)-dimensional space using \(K\) first eigenvectors

\(K\) is a hyperparameter that we choose - More on how to choose later
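
A minimal numpy sketch of these steps, using simulated data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # any raw feature matrix

# 1. Standardize the features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Correlation matrix of the features
corr = np.corrcoef(X_std, rowvar=False)

# 3.-4. Eigendecomposition, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# 5. Project onto the first K eigenvectors
K = 2
X_pca = X_std @ eigvecs[:, :K]
```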

The method ‘sklearnically’

PCA from sklearn.decomposition supports fit and transform

  • Can be used in pipelines just like the other methods
  • Centers, but does not scale
    • Generally preceded by StandardScaler

n_components is the parameter of interest, and can be set in two ways:

  • A positive integer, corresponding to \(K\)
    • i.e. ‘I would like \(K=5\) components’
  • A float between 0 and 1, corresponding to the amount of variance that is retained
    • i.e. ‘I would like enough components to keep 95% of the variance’
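
A minimal sketch of both settings inside a pipeline (the iris data is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # any numeric feature matrix will do

# Keep a fixed number of components, K = 2
pipe_k = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipe_k.fit_transform(X)

# Or keep enough components to retain 95% of the variance
pipe_var = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_95 = pipe_var.fit_transform(X)

print(X_2d.shape, X_95.shape)
```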

How to evaluate

  • Downstream performance
    • Treat K as any other hyperparameter
  • Computational limits
    • Use as many dimensions as feasible
  • Explained variance plots
    • Stop when adding more components doesn’t add much information

Explained variance as a guide

Each principal component explains part of the variance in the original data

If some variables covary, a single principal component can explain much of the variance in the data

Most often, a relatively small number of principal components explains a lot of the variance

  • We can look for an ‘elbow’ in a plot of the explained variance to guide us
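
A sketch of how such a plot could be produced from explained_variance_ratio_ (the iris data is purely illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# explained_variance_ratio_ holds each component's share of the variance,
# sorted in decreasing order
pca = PCA().fit(X_std)
ratios = pca.explained_variance_ratio_

plt.bar(range(1, len(ratios) + 1), ratios, label='Individual')
plt.step(range(1, len(ratios) + 1), np.cumsum(ratios), where='mid', label='Cumulative')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.legend()
plt.show()
```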

Introducing the scree plot

Scree plot

Source: Raschka & Mirjalili, 2022, ch. 5

Interpretation

As PCA is a linear transformation, we can look at how the variables are transformed

  • Available in pca.components_

A way of interpreting the new variables

  • Remember that variables are standardized

Not done that often

  • Everything loads on everything due to no restrictions
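
A small sketch of how the loadings could be inspected, again with iris data purely for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_std = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_std)

# Rows are components, columns are the original (standardized) features;
# each entry is the weight of that feature in the component
loadings = pd.DataFrame(pca.components_,
                        columns=data.feature_names,
                        index=['PC1', 'PC2'])
print(loadings.round(2))
```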

Non-linearity

If we want to incorporate non-linearity, one of two things are commonly done:

  1. Do a polynomial transformation of the input before doing PCA
  2. Use a kernel (KernelPCA)
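
A hedged sketch of both options in sklearn (the moons data, the rbf kernel and the degree/gamma values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Option 1: expand the inputs polynomially, then run ordinary PCA
poly_pca = make_pipeline(PolynomialFeatures(degree=2),
                         StandardScaler(),
                         PCA(n_components=2))
X_poly = poly_pca.fit_transform(X)

# Option 2: kernel PCA, here with an RBF kernel
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
```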

Clustering

Question

Have you worked with clustering before?

What methods did you use?

Clustering

Find unknown groups within populations

  • Observations within clusters are alike
  • Observations between clusters are different

What constitutes different and alike?

A broad introduction

Different ways of formulating this, with three broad categories:

  • Hierarchical clustering
  • Prototype-based clustering
  • Density-based clustering

Some methods fall into more than one category and there are other clustering methods (e.g. graph-based)

Hierarchical clustering

Clusters are formed by sequentially merging or dividing clusters

  • Agglomerative, start with n clusters and sequentially join clusters
  • Divisive, start with one cluster and sequentially divide into n clusters
  • Main hyperparameters: Distance measure (no. of clusters)
  • Creates neat dendrograms
  • No notion of noise, and all points will belong to a cluster
  • An example: AgglomerativeClustering

Prototype-based clustering

A cluster is represented by a prototype (centroid)

  • Main hyperparameter: Number of clusters
  • Clusters will generally be convex (circular or ellipsoidal)
  • No notion of noise, and all points will belong to a cluster
  • An example: KMeans

Density-based clustering

A cluster is represented by a dense area of points

  • Main hyperparameters: What constitutes ‘dense’?
  • Clusters can be any shape
  • Distant points are labeled as outliers
  • An example: DBSCAN
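
A sketch of how the three example estimators could be called in sklearn (the blob data and the hyperparameter values are arbitrary):

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_std = StandardScaler().fit_transform(X)

# Hierarchical (agglomerative): merge clusters until 3 remain
hier = AgglomerativeClustering(n_clusters=3).fit_predict(X_std)

# Prototype-based: 3 centroids
proto = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

# Density-based: 'dense' is defined by eps and min_samples; label -1 means noise
dens = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_std)
```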

K-means

K-means aims to minimize the within-cluster sum-of-squares:

\[\sum_{i=0}^n \min_{\mu_j \in C}(||x_i - \mu_j||^2)\]

Where \(\mu_j\) are the centroids

This corresponds to a Euclidean distance, and thus we (often) standard scale the features first

Iterative updating

This is done by:

  • Initializing the centroids
  • Assigning each point to the nearest centroid
  • Moving each centroid to the mean of its assigned samples

Repeating the procedure until convergence
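
In sklearn this whole loop happens inside fit; a small sketch on simulated data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_std = StandardScaler().fit_transform(X)  # K-means uses Euclidean distances

km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=0)
labels = km.fit_predict(X_std)

print(km.cluster_centers_)  # final centroid positions
print(km.inertia_)          # within-cluster sum of squares at convergence
```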

Illustrations are nice

The easiest way to get to know K-means is to look at the process

The following figures are from a cool tool made by brookslybrand

Note that this is the best case scenario:

  • Three clusters of data and three centroids
  • Convex clusters

Initialize centroids

Centroids are randomly placed

Assign points

All points are assigned to the nearest centroid

Update centroids

Centroids are updated to the mean of assigned points

Assign points

All points are assigned to the nearest centroid after update

Update centroids

Centroids are updated to the mean of assigned points

Assign points

All points are assigned to the nearest centroid after update

Update centroids

At this point, we’ve converged

Assumptions

Violated assumptions

Source: sklearn examples, ‘Demonstration of k-means assumptions’

There are other methods

Methods implemented in sklearn

Source: sklearn User Guide, ‘Overview of clustering methods’

Evaluation

It is generally not easy to evaluate clustering methods, but methods exist

These can be used to evaluate hyperparameter choice or method choice

  • Visualize it
  • Domain knowledge
  • Elbow plots
  • Silhouette coefficient

If ground truth is known, there are more methods

Choosing the right number of clusters

Models optimize some sort of metric

  • For K-means: Sum of squared distances

Generally, more clusters yield a better fit

When this additional increase in fit becomes small, we have ‘enough’ clusters

  • Back to elbow plots!

If different models optimize different metrics, this cannot be used across models
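
A sketch of how such an elbow plot could be produced for K-means via the inertia_ attribute (the data and the range of cluster counts are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to nearest centroid

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances (inertia)')
plt.show()
```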

Example data

Clustered data

Source: Raschka & Mirjalili, 2022, ch. 11

K-means elbow plot

Sum of squared distances to the nearest centroid as a function of the number of clusters

Source: Raschka & Mirjalili, 2022, ch. 11

Silhouette scores

Based on the mean intra-cluster distance, \(a_i\) (cohesion), and the mean nearest-cluster distance, \(b_i\) (separation), for an observation \(i\):

\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}\]

Bounded between -1 and 1:

  • Higher values are preferred
  • Values of 0 indicate overlapping clusters
    • Cohesion equals separation

These can be plotted!
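
sklearn.metrics provides both the average score and the per-observation values; a sketch on simulated data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))   # mean s_i across all observations
s_i = silhouette_samples(X, labels)  # one value per observation, for plotting
```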

A good number

Three clusters

Silhouette plot for three clusters

Source: Raschka & Mirjalili, 2022, ch. 6

A not so good number

Two clusters

Silhouette plot for two clusters

Source: Raschka & Mirjalili, 2022, ch. 6

Convexity

Silhouette scores are based on the idea that we want samples to be

  • Near their cluster companions
  • Far from the nearest neighboring cluster

Works across models, but favors models that create convex clusters

Text as data

Question

Have any of you used text as data in an analysis?

Was it done using qualitative methods or quantitative?

Aim today

Natural Language Processing (NLP) is a huge field

Aim today is to get you thinking about text as data

  • How do I start working with text?
  • What do people do?

Some terminology

A word is the basic unit of discrete data

  • E.g. an actual word, but can also be parts of words

A document is a collection of words

  • E.g. a sentence, a paragraph or a full document

A corpus is a collection of documents

  • I.e. a dataset

How to use text as data?

There are many different ways of using text as data

  • There is no ‘correct’ way
  • The way in which you use text should be guided by what you’re interested in

The most basic way to use text is to read it

  • Does not scale very well, to say the least…

Today we will focus on dictionary methods and bag-of-words models

Dictionary based methods

Words to values

The key in this method is a dictionary which includes:

  • Words of interest
  • Scores associated with these words

Using these dictionaries, we can

  • Go through sentences and give each word a score
  • Summarize these scores

How to get these dictionaries?

Dictionaries are generally either:

  • General
    • Often a scale of an emotion (sentiment analysis)
    • More thoroughly tested and verified
  • Handcrafted for the purpose
    • Subject specific and often very niche
    • Little to no testing

Sentiment analysis

One well known dictionary is VADER (Hutto & Gilbert, 2014), Valence Aware Dictionary and sEntiment Reasoner

  • Encodes sentiment as both polarity and intensity

Verified through experiments and benchmarks

  • A benefit over handcrafted dictionaries

Scores every sentence according to how positive or negative it is (-1 to 1), along with the proportions of negative, neutral and positive words
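
A sketch using the VADER implementation shipped with nltk (the standalone vaderSentiment package offers the same interface); the lexicon has to be downloaded once:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-off download of the dictionary

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores('VADER is smart, handsome, and funny.')
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```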

Examples

Text                                          Pos    Neu    Neg     Comp
VADER is smart, handsome, and funny.          0.746  0.254  0.0     0.8316
VADER is smart, handsome, and funny!          0.752  0.248  0.0     0.8439
VADER is very smart, handsome, and funny.     0.701  0.299  0.0     0.8545
VADER is VERY SMART, handsome, and FUNNY.     0.754  0.246  0.0     0.9227
VADER is VERY SMART, handsome, and FUNNY!!!   0.767  0.233  0.0     0.9342
VADER is not smart, handsome, nor funny.      0.0    0.354  0.646   -0.7424
Make sure you :) or :D today!                 0.706  0.294  0.0     0.8633
Catch utf-8 emoji such as 💘 and 💋 and 😁      0.279  0.721  0.0     0.7003

One small problem

It’s not Danish

Here we can use AFINN instead

  • Less nuanced
  • Multiple languages
  • Positive and negative words on a scale from -5 to 5
  • Mean calculated at the sentence level
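
A sketch using the afinn Python package, where the Danish word list is selected via the language argument (the example sentences are made up):

```python
from afinn import Afinn

afinn = Afinn(language='da')  # Danish AFINN word list

# Valence score for each sentence, based on the AFINN word scores
print(afinn.score('Det er fantastisk!'))
print(afinn.score('Det er forfærdeligt.'))
```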

The difference

‘VADER is not smart, handsome, nor funny.’ is classified as positive by AFINN

  • Picks up on smart, handsome and funny
  • Does not capture the negation

You can play around with it at this website, both in Danish and English

Use case: Racial animus in politics

Stephens-Davidowitz (2014) examines whether prejudice against Black Americans remained a potent factor in American politics at the time

People sometimes lie on surveys

  • Could be a problem when asked about racial animus

A handcrafted dictionary

To examine this, he uses Google searches in the US up to the 2008 election, split by media market, and examines the ‘racially charged search rate’, which is a fraction of the type

\[\frac{\text{Google searches including some word(s)}}{\text{All Google searches}}\]

Essentially counting

Question

What word(s) would you include in this dictionary?

Simplicity at its best

Stephens-Davidowitz (2014) uses just a single derogatory term

  • Can be found in the appendix for the curious

Using this measure, he finds that it is a robust negative predictor of Obama’s performance

  • Racial animus cost Obama about 4 percentage points in the 2008 and 2012 elections

Downsides of dictionary based methods

Requires a dictionary

  • Getting an appropriate one can be hard

Sometimes too simple

  • Struggles with subtleties, e.g. the difference between ‘I hate this’ and ‘I do not hate this’

Bag-of-words

Question

How could one represent sentences numerically?

Counting words systematically

We can treat words as unique tokens

  • Count how often each word appears in a sentence

  • Much like categorical variables

High-dimensional and sparse

  • Good that we know both how to regularize and reduce dimensionality!

An example

Sentence         dogs  like  i  bananas
Dogs like dogs   2     1     0  0
I like dogs      1     1     1  0
I like bananas   0     1     1  1
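
A sketch reproducing a count matrix like the one above with sklearn’s CountVectorizer (the default token pattern drops one-letter words such as ‘i’, so it is relaxed here):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Dogs like dogs', 'I like dogs', 'I like bananas']

# Default token_pattern requires two or more characters; keep 'i' as well
vec = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vec.fit_transform(corpus)  # sparse document-term matrix

print(pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out(), index=corpus))
```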

This allows us to go from sentences to a tabular format

  • Use as covariates
  • All the machine learning methods
    • Prediction, clustering, dimensionality reduction
  • All the causal inference methods

A variant: tf-idf

A common variant is a term frequency-inverse document frequency (tf-idf) matrix

  • Here we reweight word counts by how many documents the word appears in
  • The intuition is that words which appear in only a few documents are more informative about those documents than words which appear in many documents
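
The same interface exists for tf-idf; a small sketch on the toy corpus from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Dogs like dogs', 'I like dogs', 'I like bananas']

# Same document-term layout as CountVectorizer, but counts are reweighted by
# inverse document frequency, so corpus-wide words like 'like' get less weight
tfidf = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```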

A variant: n-grams

N-gram models are another common variant

  • Extend the simple unigram bag-of-words to consecutive words
    • e.g. Bigram (\(n=2\)) models would also include all pairs of words in order
    • 'Dogs like dogs' would become {'dogs':2, 'like':1, 'dogs like':1, 'like dogs':1}
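
With CountVectorizer this is just the ngram_range argument; a sketch producing unigrams and bigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps both single words and consecutive word pairs
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r'(?u)\b\w+\b')
X = vec.fit_transform(['Dogs like dogs'])

print(dict(zip(vec.get_feature_names_out(), X.toarray()[0])))
# e.g. {'dogs': 2, 'dogs like': 1, 'like': 1, 'like dogs': 1}
```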

Common preprocessing

You should consider whether you want to:

  • Lowercase everything
  • Remove stopwords
    • See e.g. nltk.corpus.stopwords.words('danish') or spacy.lang.da.stop_words.STOP_WORDS
  • Remove punctuation
  • Either stem or lemmatize words
    • Stemming removes endings based on some fixed rules, see e.g. nltk.stem.snowball.DanishStemmer
    • Lemmatization aims to obtain the grammatically correct form, see e.g. lemmy
    • Reduces dimensionality
  • Remove rare and/or common words
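
A sketch of a few of these steps with nltk (the stopword corpus needs a one-off download, and the example sentence is made up):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import DanishStemmer

nltk.download('stopwords')  # one-off download of the stopword lists

text = 'Hundene løber hurtigt, og kattene sover!'

# Lowercase and strip punctuation
cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))

# Remove stopwords and stem what remains
stop = set(stopwords.words('danish'))
stemmer = DanishStemmer()
tokens = [stemmer.stem(w) for w in cleaned.split() if w not in stop]
print(tokens)
```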

You’re the experts

In the end, it’s text

  • Read it, get a feel for it!
  • Why are you interested in this text?

Room for subject-specific preprocessing

  • What to do with links? Hashtags? Emojis?
  • Specific terms such as names
  • regex is a nice tool for working with text
    • Can be daunting

Many further avenues

  • Deep learning, e.g. transformers
    • Can do translation, prediction, generation and more
    • HuggingFace is a repository of (relatively) easily accessible models
  • Topic modelling, e.g. Latent Dirichlet Allocation
    • Find topics in a corpus and assign words and documents to topics
    • Focus on interpretability

References

Hutto, C., & Gilbert, E. (2014, May). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media (Vol. 8, No. 1, pp. 216-225).

Raschka, S., Liu, Y. H., Mirjalili, V., & Dzhulgakov, D. (2022). Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd.

Stephens-Davidowitz, S. (2014). The cost of racial animus on a black candidate: Evidence using Google search data. Journal of Public Economics, 118, 26-40.

To the exercises!