Magnus Nielsen, SODAS, UCPH
Copenhagen Center for Social Data Science
Session # | Topic |
---|---|
1 | Introduction to the course and ML |
2 | Introduction to Python |
3 | Model and hyperparameter selection |
4 | Supervised ML |
5 | Unsupervised ML |
6 | Model interpretation |
7 | Algorithmic audits |
8 | Causality: tree-based models |
9 | Causality: double machine learning |
Slides and exercises (+ possible reading list) will be available through the course website
The course will be mostly hands-on, and reading is not necessary!
Some of you might find Python Machine Learning useful if you want a book that introduces machine learning in Python
What do you expect from this course?
Source: Google Trends
But why is it popular?
Source: Google Trends
Popularity is, once again, not the be-all and end-all, however…
R (and Stata) are probably better in some settings (e.g. statistics)
Who here has experience with:
Breiman (2001) postulates that there exist two cultures (within statistics): the data modeling culture and the algorithmic modeling culture
In which culture would you place yourself?
The first culture still dominates economics and social sciences (Athey & Imbens, 2019; Verhagen, 2022)
Breiman (2001) writes that “our goal as a field is to use data to solve problems;” (referencing statistics)
Should we as social scientists limit ourselves to just one culture?
Causal inference centers around treatment effects
Average treatment effect \[\tau = E[Y_i(1) - Y_i(0)]\]
Subgroup treatment effect \[\tau_g = E[Y_i(1) - Y_i(0) |G_i = g]\]
Conditional treatment effect \[\tau(x) = E[Y_i(1) - Y_i(0)| X_i = x]\]
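The definitions above can be illustrated with a small simulation. This is a minimal sketch under a strong assumption: both potential outcomes are observed for every unit, which never happens in real data. The data-generating process (effect of 1.0 in group "a" and 3.0 in group "b") and all names are hypothetical.

```python
import random
from statistics import mean

random.seed(0)

# Simulated units with BOTH potential outcomes observed (never true in real data).
# Hypothetical process: the treatment effect is 1.0 in group "a" and 3.0 in group "b".
units = []
for _ in range(10_000):
    g = random.choice(["a", "b"])
    y0 = random.gauss(0.0, 1.0)            # outcome without treatment, Y_i(0)
    y1 = y0 + (1.0 if g == "a" else 3.0)   # outcome with treatment, Y_i(1)
    units.append({"g": g, "y0": y0, "y1": y1})

# Average treatment effect: tau = E[Y_i(1) - Y_i(0)]
ate = mean(u["y1"] - u["y0"] for u in units)


def subgroup_effect(group):
    """Subgroup treatment effect: tau_g = E[Y_i(1) - Y_i(0) | G_i = g]."""
    return mean(u["y1"] - u["y0"] for u in units if u["g"] == group)


print(f"ATE: {ate:.2f}")
print(f"tau_a: {subgroup_effect('a'):.2f}, tau_b: {subgroup_effect('b'):.2f}")
```

The average effect lies between the two subgroup effects because it averages over both groups.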
Requires estimators with known statistical properties, many of them large-sample: unbiasedness, consistency, asymptotic normality and efficiency
Prediction centers around minimizing loss functions, \(L(\hat y, y)\)
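Classification: 0-1 loss (its average is the misclassification rate, i.e. one minus accuracy) \[L(\hat y, y) = I[\hat y \neq y] \]
Regression: Squared error (its average is the mean squared error) \[L(\hat y, y) = (\hat y - y)^2 \]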
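Both loss functions are defined per observation and are then averaged over a dataset. The indicator loss is 1 for a wrong prediction, so its average is the error rate, and accuracy is one minus that. A minimal sketch with hypothetical labels and predictions:

```python
from statistics import mean


def zero_one_loss(y_hat, y):
    """I[y_hat != y]: 1 for a wrong prediction, 0 for a correct one."""
    return int(y_hat != y)


def squared_error(y_hat, y):
    """(y_hat - y)^2 for a single observation."""
    return (y_hat - y) ** 2


# Hypothetical classification labels and predictions
y_class = [1, 0, 1, 1]
y_class_hat = [1, 1, 1, 0]
error_rate = mean(zero_one_loss(p, t) for p, t in zip(y_class_hat, y_class))
accuracy = 1 - error_rate

# Hypothetical regression targets and predictions
y_reg = [2.0, 0.0, 1.0]
y_reg_hat = [1.0, 0.0, 3.0]
mse = mean(squared_error(p, t) for p, t in zip(y_reg_hat, y_reg))

print("error rate:", error_rate, "accuracy:", accuracy)
print("mean squared error:", mse)
```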
Has a given target and uses structured datasets
So… like OLS?
We control the bias-variance trade-off!
This is done using model and hyperparameter selection
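One way to see the idea is a sketch of hyperparameter selection with a hold-out validation set. It assumes a one-feature ridge regression without intercept, where the penalty `lam` is the hyperparameter: `lam = 0` is plain OLS, and larger values trade more bias for less variance. The simulated data and all function names are hypothetical, and this is only one of several possible selection schemes.

```python
import random

random.seed(1)


def simulate(n, beta=2.0, noise=1.0):
    """Hypothetical data: y = beta * x + Gaussian noise."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    ys = [beta * x + random.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_ridge(xs, ys, lam):
    """One-feature ridge regression without intercept (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(beta_hat, xs, ys):
    """Mean squared error of the predictions beta_hat * x."""
    return sum((beta_hat * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


x_train, y_train = simulate(20)    # small training set
x_val, y_val = simulate(200)       # separate validation set

# Candidate hyperparameter values; lam = 0 is plain OLS
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
val_errors = {
    lam: mse(fit_ridge(x_train, y_train, lam), x_val, y_val) for lam in candidates
}
best_lam = min(val_errors, key=val_errors.get)

for lam, err in val_errors.items():
    print(f"lambda={lam:>6}: validation MSE = {err:.3f}")
print("selected lambda:", best_lam)
```

The model is fitted on the training data only, and the hyperparameter is chosen by the error on data the model has not seen.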
When is this useful?
Prediction policy problems (Kleinberg et al., 2015)
Inferring data to enhance datasets (Salganik, 2019)
No given target and structure not necessary
Different models ‘create’ their own target and structure
Utilize data sources such as text and images in novel ways
Generate new data, such as text or images
We are able to reduce the dimensionality of high-dimensional inputs, enabling new uses of old methods
Transforming text into useful variables for further analysis
Aid in interpretation of text itself (Nelson, 2020)
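As an illustration of turning text into variables, here is a minimal sketch of a bag-of-words representation, one common and simple choice among many. The mini-corpus is hypothetical, and the tokenization (lowercasing and splitting on whitespace) is deliberately naive.

```python
from collections import Counter

# Hypothetical mini-corpus
docs = [
    "taxes are too high",
    "lower taxes and lower spending",
    "schools need more spending",
]

# Naive tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Document-term matrix: one row per document, one column per vocabulary word
doc_term_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    doc_term_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

Each row can then be used as a set of numeric variables in further analysis.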
Source: YouGov, 2018
Who has a working Python installation (can print ‘hello world’)?
Scripts (.py) and notebooks (.ipynb) are not the same
Scripts are plain-text files
Notebooks are interactive computational environments
Feel free to use whichever you prefer
PyCharm (natively in Professional, with a plug-in in Community Edition) supports something in between, where .py scripts can be executed in blocks (cells)
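A minimal sketch of what such a script can look like. It assumes the `# %%` cell marker used by PyCharm Professional's scientific mode; a Community Edition plug-in may expect a different marker, so check its documentation. The file is still an ordinary .py script and runs top to bottom as usual.

```python
# %% Cell 1: set up some data (hypothetical example values)
numbers = [1, 2, 3, 4]

# %% Cell 2: compute something from the data
total = sum(numbers)

# %% Cell 3: inspect the result
print("total:", total)
```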
You work in projects and both Python and installed packages have many versions
Stata: ssc install package_name
R: install.packages("package_name")
Python: conda install package_name
or pip install package_name
For replicability (and to avoid breaking your own code), it is important to keep these fixed!
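To see which versions you are actually running, you can ask Python itself. A minimal sketch using only the standard library; the helper `installed_version` and the package names in the loop are just examples.

```python
import sys
from importlib import metadata

print("Python:", sys.version.split()[0])


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Example package names; any of them may be missing in your environment
for name in ["pip", "numpy", "flask"]:
    print(name, installed_version(name))
```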
An environment contains information regarding the Python version and the installed packages and their versions
In essence the same workflow as without PyCharm, but each environment is associated with a project
PyCharm has an introduction to environments with conda
Anaconda has an introduction to environments with PyCharm
(to be skipped, for the curious)
In the Anaconda Prompt:
Creating an environment
conda create -n my_env
Creating an environment with a specific Python version (e.g. 3.10)
conda create -n my_env python=3.10
To activate an environment
conda activate my_env
That’s it!
To deactivate an active environment
conda deactivate
When starting a new project, create an environment
When working on a project, activate the environment before launching your IDE or Python
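If you are unsure which environment your Python session is using, a short check from inside Python can help. A minimal sketch; it relies on conda setting the `CONDA_DEFAULT_ENV` environment variable on activation, so the value is `None` when no conda environment is active.

```python
import os
import sys

# Path of the Python interpreter that is running this code
print("interpreter:", sys.executable)

# Root directory of the environment the interpreter belongs to
print("environment root:", sys.prefix)

# Set by conda when an environment is activated; absent otherwise
active_env = os.environ.get("CONDA_DEFAULT_ENV")
print("active conda environment:", active_env)
```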
Someone else needs to work on the project. What to do?
While the environment is active, export the environment to a YAML file
conda env export > filename.yml
Create an environment from the YAML file
conda env create -f filename.yml
Ex. 1: Make sure you can print('hello world') in PyCharm
Ex. 2: Install the PyCharm Cell Mode plugin
Ex. 3: Create two projects with different environments
One where you run import flask and succeed
One where you run import flask and fail
Ex. 4: (Optional) An important part of coding is version control. The cost of adopting it is lowest when you are already changing workflow and program
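For Ex. 3, a short check like the following can be run in each project to confirm whether flask is available in that project's environment. A minimal sketch using the standard library; the helper name `can_import` is made up for illustration.

```python
import importlib.util


def can_import(module_name):
    """True if the module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if can_import("flask"):
    print("flask is available in this environment")
else:
    print("flask is NOT available in this environment")
```

Unlike a bare import flask, this does not raise an error when the package is missing, which makes the two projects easy to compare.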
Athey, S., & Imbens, G. W. (2019). Machine learning methods that economists should know about. Annual Review of Economics, 11, 685-725.
Athey, S., Mobius, M., & Pal, J. (2021). The impact of aggregators on internet news consumption (No. w28746). National Bureau of Economic Research.
Bjerre-Nielsen, A., Kassarnig, V., Lassen, D. D., & Lehmann, S. (2021). Task-specific information outperforms surveillance-style big data in predictive analytics. Proceedings of the National Academy of Sciences, 118(14), e2020258118.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199-231.
Kleinberg, J., Ludwig, J., Mullainathan, S., & Obermeyer, Z. (2015). Prediction policy problems. American Economic Review, 105(5), 491-495.
Mullainathan, S., & Spiess, J. (2017). Machine learning: an applied econometric approach. Journal of Economic Perspectives, 31(2), 87-106.
Nelson, L. K. (2020). Computational grounded theory: A methodological framework. Sociological Methods & Research, 49(1), 3-42.
Salganik, M. J. (2019). Bit by bit: Social research in the digital age. Princeton University Press.
Verhagen, M. D. (2022). A pragmatist’s guide to using prediction in the social sciences. Socius, 8, 23780231221081702.