hello world
Magnus Nielsen, SODAS, UCPH
Python can do stuff that Stata, R and SAS can do:
Python can do stuff that Matlab can do:
numpy
Python is a general-purpose programming language:
Python has a huge community!
With a huge community comes a huge amount of resources
Sites such as:
Especially Python for Data Analysis may be interesting for this course
But all you really need is Google
Go through the Python tutorial at W3Schools
Just kidding, but there really are a ton of great guides out there!
The only way to get good at programming is simply to program!
The amount of information in the presentation and exercises might be overwhelming
If you wish, you can continue preprocessing data in your favourite program and import the data into Python and go straight to machine learning
When using Python, I will try to include both source code and the output
You can copy the code in the upper left corner
The source code might be hidden – but it’s still there
Python makes heavy use of assigning variables
A variable is created when you assign a value to it using =
# Lines can be commented out with a #
# Variables can be assigned with =
var_1 = 'Example 1'
# Variables can be printed with the print() function
print(var_1)
Example 1
Python is case-sensitive
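A minimal sketch of what case-sensitivity means in practice. The variable names are purely illustrative: `var_1` and `Var_1` are two different variables.

```python
# Two variables whose names differ only in capitalisation
var_1 = 'Example 1'
Var_1 = 'Example 2'

# Python treats them as separate variables
print(var_1)
print(Var_1)
print(var_1 == Var_1)
```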
help() gives information about objects
Help on built-in function print in module builtins:
print(...)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
In PyCharm, hovering over an object also gives this information
In Jupyter Notebook, pressing shift+tab while inside parentheses also gives this information
When you use a . in your code to call a method, PyCharm will suggest methods. To prompt this in Jupyter Notebook, press tab
String
Numeric
Boolean
Strings are defined with '' or ""
- Multiline, raw and formatted strings also exist
Numeric values are defined as numbers, with the type depending on the decimal point (2 is an int, 2.5 is a float)
Booleans are defined as True or False
# strings
a_string = "I'm a string"
another_string = '2.5'
# numerical
an_int = 2
a_float = 2.5
# boolean
a_boolean = True
#confusion
print(another_string, a_float)
2.5 2.5
Strings that look like a float/int can cause confusion
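The multiline, raw and formatted strings mentioned above could look roughly like this. The file path and the variable names are made up for illustration.

```python
# Multiline string: triple quotes keep the line break
multiline = """first line
second line"""

# Raw string: backslashes are kept as-is (the path is hypothetical)
raw = r"C:\new_folder\data.csv"

# Formatted string (f-string): expressions inside {} are inserted into the text
package = "pandas"
formatted = f"We will use {package} today"

print(multiline)
print(raw)
print(formatted)
```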
You can convert between different types with int(), float(), str(), bool()
You can check a type with type()
a_float = 2.5
a_string = str(a_float)
an_int = int(a_float)
a_boolean = bool(a_float)
print(a_float, type(a_float), a_string, type(a_string), an_int, type(an_int), a_boolean, type(a_boolean))
2.5 <class 'float'> 2.5 <class 'str'> 2 <class 'int'> True <class 'bool'>
Some conversions are a bit odd, e.g. bool(), see more here
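A short sketch of why bool() can feel odd: zero and empty values become False, and everything else becomes True, including the string 'False'. The variable names are illustrative.

```python
# Zero and empty values convert to False
falsy = [bool(0), bool(0.0), bool('')]

# Everything else converts to True, even the non-empty string 'False'
truthy = [bool(-1), bool('False'), bool(' ')]

print(falsy)
print(truthy)
```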
Some things are not possible, and give an error
ValueError: invalid literal for int() with base 10: 'Error string'
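An error like the one above can be reproduced with a string that does not look like a number. The sketch below uses try/except, which these slides do not otherwise cover, so that the error is caught and the example still runs to the end. The variable names are illustrative.

```python
# A string that cannot be interpreted as an integer
text = 'Error string'

try:
    number = int(text)
except ValueError as error:
    # int() raises ValueError when the string is not a valid integer
    number = None
    print('Conversion failed:', error)

print(number)
```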
The most important part of an error message (or stack trace) is usually the bottom (what went wrong) and the top (what part of the code started this)
What do we do with this error message?
Before asking for help, try it out!
Some basic operators are:
+
*
-
/
**
Python also supports comparisons, such as:
==
!=
<=
>=
These return boolean values (or errors)
Boolean values can be combined using:
and operator - equivalent to &
or operator - equivalent to |
And can be negated with not
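The operators, comparisons and boolean combinations above can be tried out as follows. This is a small sketch, and the numbers are arbitrary.

```python
a = 7
b = 2

# Basic operators
print(a + b)   # addition
print(a * b)   # multiplication
print(a - b)   # subtraction
print(a / b)   # division always returns a float
print(a ** b)  # exponentiation

# Comparisons return boolean values
is_equal = a == b
is_different = a != b
is_smaller_or_equal = a <= b
print(is_equal, is_different, is_smaller_or_equal)

# Boolean values can be combined and negated
both = is_different and is_smaller_or_equal
either = is_different or is_smaller_or_equal
negated = not is_equal
print(both, either, negated)
```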
Three of the most fundamental composite data types are the list, the tuple and the dictionary
The list and tuple are accessed with numerical indices
The dictionary is accessed with indices chosen by the programmer (consists of key:value pairs)
These composite data types can contain other variables
Numerical indices can be accessed using slices, as described here:
a[start:stop] # items start through stop-1
a[start:] # items start through the rest of the array
a[:stop] # items from the beginning through stop-1
a[start:stop:step] # start through not past stop, by step
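As a sketch of how the three composite types and the slice notation above behave in practice (all values and keys are made up for illustration):

```python
a_list = [10, 20, 30, 40, 50]            # list: can be changed after creation
a_tuple = (10, 'twenty', 30.0)           # tuple: cannot be changed after creation
a_dict = {'mpg': 18.0, 'origin': 'usa'}  # dictionary: key:value pairs

# Numerical indices start at 0; negative indices count from the end
first = a_list[0]
last = a_list[-1]

# Slices follow a[start:stop:step]
middle = a_list[1:3]        # items 1 through 2
beginning = a_list[:2]      # items from the beginning through 1
every_second = a_list[::2]  # every second item

# Dictionaries are accessed with their keys
origin = a_dict['origin']

# Composite types can contain other composite types
nested = {'values': a_list, 'pair': a_tuple}
tail = nested['values'][2:]

print(first, last, middle, beginning, every_second, origin, tail)
```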
Control flow means writing code that controls the way data or information flows through the program
In Python, this is (mainly) done using either
Essentially: If something is true, do something
Pseudo-code
if statement is true:
do something
In the example above, the block (do something) is run if the condition (statement) is True (the boolean value)
Python is designed to look like pseudo-code
We introduce an alternative!
if statement is true:
do something
else:
do something else
Which again looks similar in Python
Python also supports elif (else-if)
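In actual Python, the if / elif / else pattern could look like the sketch below. The temperature value and the thresholds are arbitrary illustrations.

```python
temperature = 12  # hypothetical value

if temperature > 20:
    message = 'warm'
elif temperature > 10:
    # only checked if the first condition was False
    message = 'mild'
else:
    # runs if none of the conditions above were True
    message = 'cold'

print(message)
```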
When you want to do the same thing multiple times, loops are your best friend
Two types:
Do the same thing for each element in an iterable (e.g. a list)
for each element in iterable:
do something
Once again, very similar
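A minimal for loop in Python, sketched with an illustrative list of numbers:

```python
numbers = [1, 2, 3]
squares = []

# The indented block runs once for each element in the list
for number in numbers:
    squares.append(number ** 2)

print(squares)
```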
Do something while a statement holds
while statement is true:
do something
Commonly done with a counting variable, but not necessarily
Make sure it terminates!
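A sketch of a while loop with a counting variable. The names are illustrative, and the update of the counter is what makes the loop terminate.

```python
counter = 0
total = 0

# The block runs as long as the condition is True
while counter < 5:
    total += counter
    counter += 1  # without this line the loop would never terminate

print(counter, total)
```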
Reuse your own code
Reuse others’ code
Done using functions, which can be thought of as a recipe
You define:
Extremely powerful!
The scaffold is as follows
def function_name(input_1, input_2, ..., input_k):
something = do_something()
return something
An example
def func_name(input_1, input_2):
temporary_var = (input_1 + input_2)*2
return temporary_var
func_name(2,3)
10
Python supports an arbitrary number of inputs, default values and much, much more
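As a sketch of these two features: `*values` collects any number of inputs, and `factor=2` is a default value that can be overridden. The function name and parameters are hypothetical.

```python
def scaled_sum(*values, factor=2):
    """Sum any number of inputs and multiply the result by factor."""
    return sum(values) * factor

two_inputs = scaled_sum(2, 3)               # uses the default factor
three_inputs = scaled_sum(2, 3, 5)          # any number of inputs
custom_factor = scaled_sum(2, 3, factor=1)  # overrides the default

print(two_inputs, three_inputs, custom_factor)
```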
Built-in functions
Packages
Do you know any built-in Python functions?
Our dear friend print()!
But so many more:
print('len is',len([1,2,3]))
print('sum is',sum([1,2,3]))
print('max is',max([1,2,3]))
print('abs is',abs(-1))
len is 3
sum is 6
max is 3
abs is 1
You won’t be able to remember everything, and once again Google is your best friend
Reusing other people’s code is perhaps the most important part of Python!
Corresponds to reg, fixest, etable and so on
Usually installed through conda or pip
If you need a specific module, Google “install module_name python”, e.g. for pandas it’s conda install pandas %, see here
In PyCharm, there’s a package manager window where you can search for packages
This will depend on the field you’re operating in
We will focus on pandas (this session) and sklearn (later sessions) due to time constraints
I will, however, briefly introduce the different modules
First import
Most basic element is a Series (list / column)
Series can be combined into DataFrames
The DataFrames are the main object in pandas
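A small sketch of the import, two Series and a DataFrame built from them. The toy data is made up, and `pd.concat` is just one of several ways to combine Series.

```python
import pandas as pd

# Two Series (columns) with hypothetical toy data
mpg = pd.Series([18.0, 15.0, 24.0], name='mpg')
origin = pd.Series(['usa', 'usa', 'japan'], name='origin')

# Combine the Series column-wise into a DataFrame
toy_df = pd.concat([mpg, origin], axis=1)
print(toy_df)

# Columns are accessed by name, and rows can be filtered with a condition
mean_mpg = toy_df['mpg'].mean()
usa_rows = toy_df[toy_df['origin'] == 'usa']
print(mean_mpg)
print(usa_rows)
```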
Usually loaded using pd.read_csv (dependent upon format, see list), but pandas also offers support for dta or SAS7BDAT files: pd.read_stata and pd.read_sas
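A hedged sketch of loading data with pd.read_csv. To keep the example self-contained, it reads from an in-memory text buffer with made-up content instead of a file on disk. With a real file you would pass the file path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; normally this would be a file on disk
csv_text = "mpg,origin\n18.0,usa\n24.0,japan\n"

# read_csv accepts a file path or a file-like object
loaded = pd.read_csv(io.StringIO(csv_text))

print(loaded.dtypes)
print(loaded.head())
```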
You will have time to work with pandas during the exercises
There are lots of guides online, e.g. in the documentation
If you want to work with vectors and matrices, numpy is your friend!
import numpy as np
array_1 = np.array([1,2,3])
array_2 = np.array([3,2,1])
matrix = np.array([array_1, array_2])
print('matrix:')
print(matrix)
print('slice:', matrix[0,:]) # supports slicing
print('dot + transpose:', array_1 @ array_2.T) # and dot products, transpose and more
matrix:
[[1 2 3]
[3 2 1]]
slice: [1 2 3]
dot + transpose: 10
Only numeric data! Most matrix calculations are done under the hood (thank god!), so you probably won’t need this much
Not always very intuitive (MATLAB-like syntax), but very flexible
# import matplotlib
import matplotlib.pyplot as plt
# load data
import seaborn as sns
df = sns.load_dataset('mpg')
# create plot
f, ax = plt.subplots(1,2)
# subfigure 1
ax[0].hist(df['mpg'])
ax[0].set_title('MPG histogram')
ax[0].set_xlabel('MPG')
ax[0].set_ylabel('Count')
# subfigure 2
ax[1].scatter(df['model_year'], df['horsepower'])
ax[1].set_title('Horsepower and year scatter')
ax[1].set_xlabel('Model year')
ax[1].set_ylabel('Horsepower')
# supertitle for the whole figure
f.suptitle('Two plots', fontsize=16)
plt.show()
The figure (f) is the whole plot, whereas the axis (ax) contains the subplots, accessed through indices
Built on top of matplotlib – lots of powerful premade plots
The most powerful ones (like pairplot()) are not easy to post-process
# create plot
f, ax = plt.subplots(1,2)
# subfigure 1
sns.histplot(data = df, hue='origin', x= 'mpg', kde=True, ax=ax[0])
# subfigure 2
sns.scatterplot(data=df, x='model_year', y='horsepower', hue='origin', ax=ax[1])
# supertitle for the whole figure
f.suptitle('Two fancy plots', fontsize=16)
plt.show()
A large number of examples with code can be found online, e.g. here
To be continued.. 👉👉