Magnus Nielsen, SODAS, UCPH
Heterogeneous student body
You want more lecturing, less assistance for coding
Solution…?
We do not have/use a given target
Structure not always necessary
Have you worked with dimensionality reduction before?
What methods did you use?
Reducing dimensionality of the data is done primarily for three reasons:
The most common method for dimensionality reduction is (probably) Principal Component Analysis
Main idea is to create new variables:
Creates linear ‘latent variables’
Earlier components will explain more variance!
Source: Raschka & Mirjalili, 2022, ch. 6
\(K\) is a hyperparameter that we choose; more on how to choose it later
PCA from sklearn.decomposition supports fit and transform
Standardize first, e.g. with StandardScaler
n_components is the parameter of interest, and can be set in two ways:
An integer K, as any other hyperparameter
A float between 0 and 1, the share of variance to be explained
Each principal component explains part of the variance in the original data
If some variables covary, a single principal component can explain much of the variance in the data
Most often, a relatively low amount of principal components explain a lot of the variance
Source: Raschka & Mirjalili, 2022, ch. 5
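A minimal sketch of this workflow in scikit-learn; the wine data from sklearn.datasets is just a stand-in for your own data:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Standardize so all variables are on the same scale before PCA
X_std = StandardScaler().fit_transform(X)

# n_components as an integer K ...
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)   # share of variance explained per component

# ... or as a share of variance to be explained
pca_95 = PCA(n_components=0.95).fit(X_std)
print(pca_95.n_components_)            # number of components needed for 95% of the variance
```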
As PCA is a linear transformation, we can look at how the variables are transformed
pca.components_
A way of interpreting the new variables
Not done that often
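A minimal sketch of inspecting the loadings, assuming a PCA fitted on the standardized wine data as above; pandas is only used for readable printing:

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=2).fit(X_std)

# Rows are components, columns are the original variables (the 'loadings')
loadings = pd.DataFrame(pca.components_,
                        columns=wine.feature_names,
                        index=['PC1', 'PC2'])
print(loadings.round(2))
```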
If we want to incorporate non-linearity, one of two things is commonly done:
KernelPCA
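A minimal sketch using KernelPCA from sklearn.decomposition with an RBF kernel; the half-moon toy data and the gamma value are purely illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

# Two half-moons: a classic example of non-linear structure
X, _ = make_moons(n_samples=200, random_state=0)
X_std = StandardScaler().fit_transform(X)

# gamma is an extra hyperparameter of the RBF kernel
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X_std)
```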
Have you worked with clustering before?
What methods did you use?
Find unknown groups within populations
What constitutes different and alike?
Different ways of formulating this, with three broad categories:
Some methods fall into more than one category and there are other clustering methods (e.g. graph-based)
Clusters are based on sequential merging or division of clusters
Agglomerative: start with n clusters and sequentially join clusters
Divisive: start with one cluster and sequentially divide until n clusters
AgglomerativeClustering
A cluster is represented by a prototype (centroid)
KMeans
A cluster is represented by a dense area of points
DBSCAN
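A minimal sketch with one scikit-learn estimator from each family; the toy data and hyperparameter values are illustrative, but all three share the fit_predict interface:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Toy data with three blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

models = {
    'hierarchical': AgglomerativeClustering(n_clusters=3),
    'prototype': KMeans(n_clusters=3, n_init=10, random_state=0),
    'density': DBSCAN(eps=0.5, min_samples=5),   # DBSCAN labels noise points as -1
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, np.unique(labels))
```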
K-means aims to minimize the within-cluster sum-of-squares:
\[\sum_{i=0}^n \min_{\mu_j \in C}(||x_i - \mu_j||^2)\]
Where \(\mu_j\) are the centroids
This corresponds to a Euclidean distance, and thus we (often) standard scale the data
This is done by:
Randomly placing K initial centroids
Assigning each observation to its nearest centroid
Updating each centroid to the mean of its assigned observations
Repeating the procedure until convergence
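A minimal NumPy sketch of these steps (Lloyd's algorithm); it ignores edge cases such as empty clusters and is only meant to illustrate the procedure:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(0).normal(size=(100, 2))
labels, centroids = kmeans(X, k=3)
```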
The easiest way to get to know K-means is to look at the process
The following figures are from a cool tool made by brookslybrand
Note that this is the best case scenario:
Source: sklearn examples, ‘Demonstration of k-means assumptions’
Source: sklearn User Guide, ‘Overview of clustering methods’
It is generally not easy to evaluate clustering methods, but methods exist
These can be used to evaluate hyperparameter choice or method choice
If ground truth is known, there are more methods
Models optimize some sort of metric
Generally, more clusters lead to a better fit
When this additional increase in fit becomes small, we have ‘enough’ clusters
If different models optimize different metrics, this cannot be used across models
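A minimal sketch of this 'elbow' heuristic with scikit-learn's KMeans, whose inertia_ attribute holds the within-cluster sum-of-squares; the toy data is illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit improves with K; look for the point where the improvement levels off
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```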
Source: Raschka & Mirjalili, 2022, ch. 11
Based on the mean intra-cluster distance, \(a_i\) (cohesion), and the mean nearest-cluster distance, \(b_i\) (separation), for an observation \(i\):
\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}\]
Bounded between -1 and 1, where higher values indicate better-defined clusters
These can be plotted!
Source: Raschka & Mirjalili, 2022, ch. 6
Silhouette scores are based on the idea that we want samples to be close to their own cluster and far from other clusters
Works across models, but favors models that create convex clusters
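A minimal sketch using silhouette_score (the mean \(s_i\)) and silhouette_samples (per-observation values) from sklearn.metrics; the toy data and the choice of K-means are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Mean silhouette score for different K; works for any model that outputs labels
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

# Per-observation silhouette values can be used for silhouette plots
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
s_i = silhouette_samples(X, labels)
```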
Have any of you used text as data in an analysis?
Was it done using qualitative methods or quantitative?
Natural Language Processing (NLP) is a huge field
Aim today is to get you thinking about text as data
A word is the basic unit of discrete data
A document is a collection of words
A corpus is a collection of documents
There are many different ways of using text as data
The most basic way to use text is to read it
Today we will focus on dictionary methods and bag-of-word models
The key in this method is a dictionary which includes words of interest and their associated scores or categories
Using these dictionaries, we can score or classify documents, e.g. by sentiment
Dictionaries are generally either:
One well known dictionary is VADER (Hutto & Gilbert, 2014), Valence Aware Dictionary and sEntiment Reasoner
Verified through experiments and benchmarks
Scores every sentence according to how positive or negative it is (a compound score from -1 to 1), along with the proportion of negative, neutral and positive words
Text | Pos | Neu | Neg | Comp |
---|---|---|---|---|
VADER is smart, handsome, and funny. | 0.746 | 0.254 | 0.0 | 0.8316 |
VADER is smart, handsome, and funny! | 0.752 | 0.248 | 0.0 | 0.8439 |
VADER is very smart, handsome, and funny. | 0.701 | 0.299 | 0.0 | 0.8545 |
VADER is VERY SMART, handsome, and FUNNY. | 0.754 | 0.246 | 0.0 | 0.9227 |
VADER is VERY SMART, handsome, and FUNNY!!! | 0.767 | 0.233 | 0.0 | 0.9342 |
VADER is not smart, handsome, nor funny. | 0.0 | 0.354 | 0.646 | -0.7424 |
Make sure you :) or :D today! | 0.706 | 0.294 | 0.0 | 0.8633 |
Catch utf-8 emoji such as 💘 and 💋 and 😁 | 0.279 | 0.721 | 0.0 | 0.7003 |
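A minimal sketch using the vaderSentiment package (pip install vaderSentiment), which exposes the dictionary through SentimentIntensityAnalyzer:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Returns the proportions of negative/neutral/positive words and the compound score
print(analyzer.polarity_scores("VADER is smart, handsome, and funny."))
# e.g. {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
```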
It’s not Danish
Here we can use AFINN instead
'VADER is not smart, handsome, nor funny.' is classified as positive, since the negation is not handled
You can play around with it at this website, both in Danish and English
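A minimal sketch using the afinn package (pip install afinn); the example sentences are illustrative, assuming 'godt' is scored positively in the Danish word list:

```python
from afinn import Afinn

afinn = Afinn(language='da')             # Danish word list
print(afinn.score('Det er godt'))        # positive total score
print(afinn.score('Det er ikke godt'))   # negation is not handled, so still positive
```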
Stephens-Davidowitz (2014) examines whether prejudice against Black Americans remained a potent factor in American politics at the time
People sometimes lie on surveys
To examine this, he uses Google searches up to the 2008 US election, split by media market, and computes the 'racially charged search rate', a fraction of the form
\[\frac{\text{Google searches including some word(s)}}{\text{All Google searches}}\]
Essentially, this is just counting searches
What word(s) would you include in this dictionary?
Stephens-Davidowitz (2014) uses just a single derogatory term
Using this measure, he finds that it is a robust negative predictor of Obama's performance
Requires a dictionary
Sometimes too simple
How could one represent sentences numerically?
We can treat words as unique tokens
Count how often each word appears in a sentence
Much like categorical variables
High-dimensional and sparse
Sentence | dogs | like | i | bananas |
---|---|---|---|---|
Dogs like dogs | 2 | 1 | 0 | 0 |
I like dogs | 1 | 1 | 1 | 0 |
I like bananas | 0 | 1 | 1 | 1 |
This allows us to go from sentences to a tabular format
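A minimal sketch of this representation with scikit-learn's CountVectorizer; token_pattern is changed from the default so that single-character tokens such as 'i' are kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dogs like dogs", "I like dogs", "I like bananas"]

# The default token pattern drops one-character tokens such as 'i'
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['bananas' 'dogs' 'i' 'like']
print(X.toarray())                          # counts per sentence, as in the table above
```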
A common variant is a term frequency-inverse document frequency (tf-idf) matrix
N-gram models are another common variant
'Dogs like dogs' would become {'dogs': 2, 'like': 1, 'dogs like': 1, 'like dogs': 1}
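A minimal sketch of both variants with scikit-learn's vectorizers; the token_pattern again just keeps single-character tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Dogs like dogs", "I like dogs", "I like bananas"]

# Unigrams + bigrams: 'dogs like' and 'like dogs' become separate features
bigram_vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
print(bigram_vec.fit_transform(docs).toarray())
print(bigram_vec.get_feature_names_out())

# tf-idf: term counts reweighted by how rare a term is across documents
tfidf_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
print(tfidf_vec.fit_transform(docs).toarray().round(2))
```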
You should consider whether you want to:
Remove stop words, e.g. nltk.corpus.stopwords.words('danish') or spacy.lang.da.stop_words.STOP_WORDS
Stem words, e.g. nltk.stem.snowball.DanishStemmer
Lemmatize words, e.g. lemmy
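A minimal sketch of Danish stop word removal and stemming with NLTK (assumes the stop word corpus has been downloaded via nltk.download('stopwords')); the example sentence is illustrative:

```python
from nltk.corpus import stopwords
from nltk.stem.snowball import DanishStemmer

danish_stopwords = set(stopwords.words('danish'))
stemmer = DanishStemmer()

def preprocess(text):
    # Lowercase, remove stop words and stem the remaining tokens
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in danish_stopwords]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Hundene kan godt lide andre hunde"))
```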
In the end, it’s text
Room for subject-specific preprocessing
regex is a nice tool for working with text
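A minimal sketch using Python's built-in re module; the cleaning steps and patterns are just examples of subject-specific preprocessing:

```python
import re

text = "VADER is VERY SMART, handsome, and FUNNY!!! Visit https://example.com :)"

text = re.sub(r"https?://\S+", " ", text)   # remove URLs
text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation and emoticons
text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
print(text.lower())
```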
Hutto, C., & Gilbert, E. (2014, May). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media (Vol. 8, No. 1, pp. 216-225).
Raschka, S., Liu, Y. H., Mirjalili, V., & Dzhulgakov, D. (2022). Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd.
Stephens-Davidowitz, S. (2014). The cost of racial animus on a black candidate: Evidence using Google search data. Journal of Public Economics, 118, 26-40.