import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
# Get wine data
data_wine = load_wine(as_frame=True)
X = data_wine.data
y = data_wine.target
Exercise Set 5: Unsupervised learning & Text as data
In this exercise set, we will be looking at:
- Unsupervised learning, focusing on the canonical Principal Component Analysis and K-means for dimensionality reduction and clustering, respectively
- Text as data, focusing on VADER and bag-of-words models
The focus in the first part is on implementing the methods using sklearn and then on how we can use and evaluate these methods. In the second part, we see how we can use text as unsupervised input to dictionary-based methods, but also how the more general bag-of-words models allow us to use text as regular tabular input.
Unsupervised learning
The dataset we will be looking at this time is the UCI ML Wine recognition dataset. It features an analysis of 178 wines from three different cultivators, and as it is often used you will be able to find examples analyzing it online. Furthermore, this means that we have a ground truth for our clustering algorithms, which is nice to have when getting started with clustering. As last time, you're welcome to use a dataset of your own.
Load data
Here we load our input data into a DataFrame called X and our target data into a Series called y.
Here we describe the data using both the documentation which came with the data and by computing summary statistics for the input data and value counts for the target.
Consider whether the input features are measured on the same scale and whether the classes are heavily skewed.
print(data_wine.DESCR)
X.describe()
y.value_counts()
Dimensionality reduction
As we saw, the data has 13 dimensions, and the goal of this section is to reduce this to a lower number of dimensions.
This can be done for many reasons, including:
- Reduce computation time
- Performance increases
- Visualization
We will do this using principal component analysis. Everything regarding data leakage from train to test data carries over from supervised learning, but we will disregard this aspect and use all the data at once for simplicity. Later on, PCA can be used as a step in your pipelines, where it will only learn from the training data, as sketched below.
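To preview what that could look like (a sketch only, not part of the exercises; the logistic regression at the end is just a placeholder for whatever supervised model you might use), the scaler and the PCA can be added as pipeline steps so that they are fit on the training data alone:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sketch only: scaling and PCA as pipeline steps, fit on the training data alone
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression()),  # placeholder supervised model
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=73)
pipe.fit(X_train, y_train)  # the scaler and the PCA only see X_train here
print(pipe.score(X_test, y_test))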
Exercise 1.1
Fill in the missing code to perform a principal component analysis using
sklearn
Hints: > Were all the variables on the same scale?
from sklearn.preprocessing import # FILL IN
from sklearn.decomposition import # FILL IN
# Step one
sc = # FILL IN
sc.fit(X)
X_std = sc.transform(X)

# Step two
pca = # FILL IN
pca.fit(X_std)
X_pca = pca.transform(X_std)
Exercise 1.2
- What are the dimensions of X_pca?
- Have you reduced the dimensionality?
Hints: > The shape of an array can be determined using .shape
# Your code
Exercise 1.3
Plot the two first principal components in a scatter plot by filling in the missing code
Hints: > When subsetting arrays, the first input determines the rows and the second determines the columns > > The two inputs are separated by a comma > > The input : corresponds to all > > Python is zero-indexed, i.e. 0 corresponds to the first element
# Plot
# Missing code
plt.scatter(X_pca[FILL IN], X_pca[FILL IN])
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.show()
Exercise 1.4
Reuse the code from before, but add colors by adding the option c=y to the scatter plot. Can we see a difference between the three wine cultivators?
Hints: > This colors the plot according to the class of the observation
# Your code
Now we have chosen two dimensions for visualisation, but sometimes we might want to make a more informed choice about the number of dimensions based on the variance kept or lost. This information can be obtained using a scree plot.
To create the scree plot, we need to calculate the explained variance ratio for each principal component.
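Purely as an illustration (a sketch, assuming X_std and X_pca from exercise 1.1 are still in memory), this ratio could be computed by hand as the variance of each principal component score divided by the total variance:
# Sketch only: explained variance ratio computed by hand
pc_variances = np.var(X_pca, axis=0, ddof=1)           # variance of each principal component
total_variance = np.var(X_std, axis=0, ddof=1).sum()   # total variance of the standardized data
print((pc_variances / total_variance).round(3))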
Implementing things on your own might entail minor bugs and errors. Perhaps sklearn has an implementation for us?
Exercise 1.5
Look at the documentation for the PCA function.
- Does it have a feature/attribute which calculates it for us?
- How would we access this feature?
Hints: > Look under Attributes
# Your code
Exercise 1.6
1. Extract the explained variance ratio
2. Calculate the cumulative explained variance ratio
Hints: > Attributes can be accessed using a period (.) > > numpy has a function for calculating cumulative sums
# Your code
Exercise 1.7
Create a scree plot using the code below, inserting the appropriate x and y variables
Hints: > PC_values is an array that goes from 1 to 13, which corresponds to the number of principal components
PC_values = np.arange(pca.n_components_) + 1
plt.bar(FILL IN)
plt.step(FILL IN)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.show()
There are many ways to decide on the number of dimensions, most often through cross validation, compute constraints, or a heuristic such as the elbow method.
However, as we are going to continue plotting the data in a 2-dimensional space, we only need two principal components.
It seems superfluous to return all the principal components, doesn’t it?
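As an aside (a sketch only, not one of the exercises): sklearn's PCA also accepts a float between 0 and 1 for n_components, in which case it keeps just enough components to explain that share of the variance.
# Sketch only: keep enough components to explain 95% of the variance
from sklearn.decomposition import PCA

pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_std)
print(pca_95.n_components_)  # number of components actually kept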
Exercise 1.8
Change the code from exercise 1.1 to only return the first two components
Call the transformed data X_pca_2
Hints: > PCA has an input which decides the number of components.
# Your code
Clustering
Having now performed dimensionality reduction, we will use the K-means algorithm to cluster the data. In this case, we know that three classes exist, but K-means will not use this information.
First we implement the method, and then we continue to look at how one can evaluate the method and choose the number of clusters.
There are many other clustering methods, and if you want to use other methods, a starting point could be the clustering section in sklearn.
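To illustrate how interchangeable these methods are in sklearn (a sketch only, using AgglomerativeClustering as an arbitrary example and assuming X_pca_2 from exercise 1.8 is in memory), another clusterer can be swapped in with the same fit/predict pattern:
# Sketch only: an alternative clustering method with the same interface
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=3)
labels_agg = agg.fit_predict(X_pca_2)
print(labels_agg[:10])  # cluster label of the first ten wines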
Exercise 2.1
Fill in the missing code such that you implement a K-means clustering algorithm with three clusters. For replicability, you should also set a random state
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import FILL IN

# fit the pca and get the two first components
X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_std)

# apply the K-means algorithm
kmeans = FILL IN
kmeans.fit(X_pca)
y_kmeans = kmeans.predict(X_pca)
The code below visualizes the clusters found in the previous exercise.
Exercise 2.2
Explain the code by filling in the missing comments, one at each #
#
X_kmeans = pd.DataFrame(X_pca)
X_kmeans['cluster_id'] = y_kmeans

#
unique_cluster_ids = X_kmeans['cluster_id'].unique()

#
for cluster_id in unique_cluster_ids:
    #
    cluster_subset = X_kmeans.loc[X_kmeans.cluster_id == cluster_id]
    #
    plt.scatter(cluster_subset[0], cluster_subset[1])

#
centroids = kmeans.cluster_centers_

#
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=80)
plt.show()
So far we have chosen three clusters because I told you to, but usually you would have to decide upon this yourself, a downside of K-means.
To assist us, we can look for elbows in what the model optimizes.
Exercise 2.3
The K-means algorithm minimizes the sum of squared distances to the nearest centroid. This is available through the KMeans object. Look through the documentation to find out how to extract this information. Using this knowledge, fill in the missing code to plot the sum of squared distances for 1 to 10 clusters. > Hints: > > Try looking under Attributes
cluster_range = range(1, FILL IN)
sum_squared_distances_list = []

# For each cluster, calculate sum of squared distances
for no_clusters in cluster_range:
    kmeans = KMeans(n_clusters=no_clusters, random_state=73)
    kmeans.fit(X_pca)
    sum_squared_distances_list.append(FILL IN)

# Plot the sum of squared distances as a function of cluster range
plt.plot(cluster_range, sum_squared_distances_list, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.show()
However, there are many different metrics to evaluate a clustering algorithm. A list of those implemented in sklearn can be found in their user guide, which also includes pros and cons of each metric.
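As a sketch of one such metric (illustration only, not required for the exercises): since this dataset happens to come with ground-truth labels, a supervised metric such as the adjusted Rand index can also be computed, here assuming y_kmeans from exercise 2.1 is in memory.
# Sketch only: compare the K-means labels to the known classes
from sklearn.metrics import adjusted_rand_score

print(f"Adjusted Rand index: {adjusted_rand_score(y, y_kmeans):.2f}")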
Exercise 2.4
The code below calculates the average silhouette coefficient, see documentation here.
- What is the range of values, and what values are preferred?
- Should one be wary of using this method to compare across models from the three broad categories introduced in the lecture? > Hints: > > Think about convexity
from sklearn.metrics import silhouette_score
clusterer = KMeans(n_clusters=3, random_state=73)
cluster_labels = clusterer.fit_predict(X_pca)
silhouette_avg = silhouette_score(X_pca, cluster_labels)
print(f"Average silhouette coefficient: {silhouette_avg:.2f}")
Having now seen how to calculate the silhouette coefficient, we want to look at how it varies with up to ten clusters.
Exercise 2.5
Fill in the missing code to calculate the average silhouette coefficients
Hints:
How many clusters are needed to calculate the silhouette coefficient?
from sklearn.metrics import silhouette_score
# Specify range of clusters
cluster_range_silhouette = range(FILL IN)
avg_silhouette_list = []

# Calculate the average silhouette coefficient
for no_clusters in cluster_range_silhouette:
    kmeans = KMeans(n_clusters=no_clusters, random_state=73)
    kmeans.fit(X_pca)
    cluster_labels = kmeans.predict(X_pca)
    silhouette_avg = silhouette_score(X_pca, cluster_labels)
    avg_silhouette_list.append(silhouette_avg)

# Plot average silhouette coefficients
plt.plot(FILL IN, FILL IN, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Average silhouette coefficient')
plt.show()
We can also make silhouette plots, although they are a bit tedious to produce. Code to produce them using just sklearn can be found online, but there also exist packages to do it for us! yellowbrick is one such package, and it even uses the same syntax as sklearn. As a general rule, it's always a good idea to check if there exists a package which does what you want to do, ideally before you spend too much time implementing things yourself.
Exercise 2.6
Install the package yellowbrick to plot the silhouette plot using the code below.
Bonus: Try plotting different numbers of clusters. Which number do you prefer?
Hints:
Installing with pip follows standard naming conventions, but otherwise installation instructions can be found on their website
from yellowbrick.cluster import SilhouetteVisualizer
# Model we want to evaluate
kmeans = KMeans(n_clusters=3, random_state=73)

# The visualizer
visualizer = SilhouetteVisualizer(kmeans)

# Fit the data to the visualizer
visualizer.fit(X_pca)

# Show the plot
visualizer.show()
plt.show()
Text as data
The dataset we will be looking at to get used to working with text as data is the IMDB Dataset downloaded from Kaggle, but originally from Stanford and created for the paper Maas, Andrew, et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 2011.
The dataset consists of 50,000 movie reviews, which have been classified by humans as either positive or negative (25,000 of each).
Load data
Here we load our data into a DataFrame called df. Furthermore, we map the classes into a binary vector which indicates whether the review was positive (1) or negative (0).
# Import data
df = pd.read_csv('movie_data.csv.zip', encoding='utf-8', compression='zip')
df['positive'] = df['sentiment'].map({'positive': 1, 'negative': 0})
A sensible first thing to do is to read some of the text. The code below enables you to do this, printing the first two positive and negative reviews.
Exercise 3.1
Are there any weird artifacts in the text? If there are any, can you guess why they’re there?
print("Positive")
print()
for i in df.loc[df.sentiment == 'positive'].review[:2]:
    print(i)
    print()

print("Negative")
print()
for i in df.loc[df.sentiment == 'negative'].review[:2]:
    print(i)
    print()
Getting a labelled dataset is not always easy. If we had no labels but were still interested in the sentiment of the reviews, one way to go about this would be to use a dictionary-based method.
In this example, we will use the VADER sentiment analyser to get the sentiment of the reviews.
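As a quick illustration of what the analyser returns (a sketch on a toy sentence; it assumes the VADER lexicon has been downloaded via nltk):
# Sketch only: VADER scores for a toy sentence
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # only needed once
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This movie was GREAT!"))
# returns a dict with 'neg', 'neu', 'pos' and 'compound' scores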
Exercise 3.2
Explain what happens in each of the four steps by commenting the code.
Hints:
.apply applies a function to the column
lambda functions are anonymous functions which are defined in place. In this situation, they are applied to each row in the column.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
#
sia = SentimentIntensityAnalyzer()

#
df['scores'] = df['review'].apply(lambda review: sia.polarity_scores(review))

#
df['compound'] = df['scores'].apply(lambda scores: scores['compound'])

#
df['comp_score'] = df['compound'].apply(lambda comp_score: 1 if comp_score >= 0 else 0)
As we are so lucky to have a labelled dataset, we can see how our unsupervised method did!
Exercise 3.3
Calculate the accuracy of the predicted comp_score (compound scores) > Hints: > > Try importing accuracy_score from sklearn.metrics
# Your code
VADER is relatively advanced, and uses information about whether the text is capitalized and whether it uses exclamation marks. However, for bag-of-words models and other text models, it is common to preprocess the data to reduce the complexity.
In the following code, I give you some examples of how one could preprocess the data. One of the common tools used is Regular Expressions, shortened re. I do not expect you to know it, but it's a neat tool for capturing text and either storing it or replacing it with other text. You can play around with it at RegExr.com, should you wish.
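For instance (a toy sketch, unrelated to the exercise data), re.sub replaces every match of a pattern with other text; the two patterns below are the same ones used in the cleaning functions that follow:
# Sketch only: a toy example of re.sub
import re

text = "Great movie!!! <br /><br />Would watch again."
print(re.sub(r'<[^>]*>', ' ', text))  # replace HTML tags with a space
print(re.sub(r'[^\w\s]', '', text))   # remove non-alphanumeric characters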
Exercise 3.4
Look at the reviews after each cleaning example. What's the difference between the two preprocessing methods? Is the text better represented than before we preprocessed it? Some things you could consider:
- Does it make the text more readable for you? What about for an algorithm?
- Have we removed the weird artifacts you (perhaps) found earlier?
- Have we introduced any new weird artifacts?
import re
# Clean reviews
def cleaner(document):
    document = document.lower() # To lower case
    document = re.sub(r'<[^>]*>', ' ', document) # Remove HTML
    document = re.sub(r'[^\w\s]', '', document) # Remove non-alphanumeric characters
    return document

df['review_clean'] = df['review'].apply(cleaner)
print("Positive")
print()
for i in df.loc[df.sentiment == 'positive'].review_clean[:2]:
    print(i)
    print()

print("Negative")
print()
for i in df.loc[df.sentiment == 'negative'].review_clean[:2]:
    print(i)
    print()
# Import stopwords
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')

# Extended cleaning function
def extended_cleaner(document, stopwords_list=english_stopwords):
    document = document.lower() # To lower case
    document = re.sub(r'<[^>]*>', ' ', document) # Remove HTML
    document = re.sub(r'[^\w\s]', '', document) # Remove non-alphanumeric characters
    text = ' '.join(x for x in document.split(' ') if x not in stopwords_list) # Remove stopwords
    return text

# Clean reviews
df['review_extended_clean'] = df['review'].apply(extended_cleaner)
print("Positive")
print()
for i in df.loc[df.sentiment == 'positive'].review_extended_clean[:2]:
    print(i)
    print()

print("Negative")
print()
for i in df.loc[df.sentiment == 'negative'].review_extended_clean[:2]:
    print(i)
    print()
Having now preprocessed the text, we want to implement a bag-of-words model.
Exercise 3.5
Implement a model that counts the number of unique words in each sentence by filling in the missing code > Hints: > > Try importing CountVectorizer > > It has a method which both fits and transforms the data in one go.
from sklearn.feature_extraction.text import FILL IN
vectorizer = FILL IN

X = df.review_extended_clean

X_bag = FILL IN
Exercise 3.6
We have now vectorized the text, and have a variable called X_bag.
- What is the type of X_bag?
- What is the dimensionality of X_bag?
- Could we use simple unregularized linear regression with this input?
Hints:
How many samples compared to variables do we have?
# Your code
Having now seen the workings of the CountVectorizer, we're going to implement it in a pipeline so it can be used for supervised learning as we have seen, whilst avoiding data leakage. We do not perform cross validation, to reduce the time it takes to run.
Exercise 3.7
Fill in the missing code such that we implement a CountVectorizer followed by a LogisticRegression.
Does it perform better than VADER? > Hints: > > We have previously looked at pipelines and data splitting. Try looking at last session's exercises.
from sklearn.linear_model import FILL IN
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

y = df.positive
X = df.review_extended_clean

X_train, X_test, y_train, y_test = train_test_split(FILL IN, test_size=0.3, random_state=73)

tf_clf = Pipeline([FILL IN])
tf_clf.fit(X_train, y_train)
tf_acc = tf_clf.score(X_test, y_test)
print(f"Accuracy: {tf_acc:.2f}")
Exercise 3.8
Change the vectorizer from the previous exercise to a tf-idf vectorizer followed by a LogisticRegression.
Does the model perform better? > Hints: > > Try googling sklearn tfidf
# Your code
We have now looked at some ways of working with text. You could also look into:
- Stemming and lemmatization
- N-gram models (both vectorizers support them)
- Changing the minimum or maximum frequency with which words need to appear (see the sketch after this list)
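As a sketch of the last two points (illustration only; the parameter values are arbitrary), both are arguments to the vectorizers:
# Sketch only: unigrams and bigrams, ignoring words that appear in fewer
# than 5 reviews or in more than 50% of the reviews
from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.5)
X_ngram = ngram_vectorizer.fit_transform(df.review_extended_clean)
print(X_ngram.shape)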
Another class of models to look into that is not too computationally demanding is topic models.
A cool application of topic models can be seen in Transparency and Deliberation within the FOMC: A Computational Linguistics Approach, with the most information about the text analysis in section IV.
sklearn has an implementation of an LDA topic model (sklearn.decomposition.LatentDirichletAllocation), although it is my impression that it is most commonly done using gensim, see their website here.
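To give a flavour of what that looks like (a sketch only, using sklearn's implementation with an arbitrary choice of five topics; get_feature_names_out requires a recent sklearn version, and fitting on all 50,000 reviews can take a while):
# Sketch only: a small LDA topic model on the cleaned reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

count_vec = CountVectorizer(max_df=0.5, min_df=5, stop_words='english')
X_counts = count_vec.fit_transform(df.review_extended_clean)

lda = LatentDirichletAllocation(n_components=5, random_state=73)
lda.fit(X_counts)

# Print the ten highest-weighted words for each topic
words = count_vec.get_feature_names_out()
for topic_id, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-10:]]
    print(f"Topic {topic_id}: {' '.join(top_words)}")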