A = 1.5
Overview of notebook
This notebook consists of two independent parts. The first covers basic Python, where you get familiar with the most important concepts and tools. The second part is a short introduction to pandas, which is a tool for structuring data in Python, and numpy, which is a tool for matrix calculations.
Overview of Python content
In this integrated assignment and teaching module, we will learn the following things about basic Python:
- Fundamental data types: numeric, string and boolean
- Operators: numerical and logical
- Conditional logic
- Containers with indices
- Loops: for and while
- Reusable code: functions, classes and modules
Additional sources:
As always, there are many sources out there: Google is your best friend. However, here are some recommendations
A book: Python for Data Analysis
Videos: pythonprogramming.net fundamental (basics and intermediate)
A tutorial website: The official python 3 tutorial (sections 3, 4 and 5)
1 Fundamentals of Python
Elementary Data Types
Examples with data types
Execute the code below to create a variable A as a float equal to 1.5:
Execute the code below to convert the variable A to an integer by typing:
int(A)  # truncates toward zero (the same as floor for positive numbers)
1
We can do the same for converting to float, str, bool. Note some may at first come across as slightly odd:
bool(A)
True
While some are simply not allowed:
#float('A') # Attempt at converting the string (text) 'A' to a number
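The conversions above can be collected in one small sketch. The variable names other than A are made up for illustration, and the try/except construct is only used here to show the error without stopping the program.

```python
# Sketch: converting between the basic types, including one conversion that fails.
A = 1.5

as_int = int(A)            # drops the decimal part
as_str = str(A)            # text representation of the number
as_bool = bool(A)          # any non-zero number converts to True
zero_as_bool = bool(0.0)   # zero converts to False

# float('A') is not allowed: the text 'A' is not a number.
try:
    float('A')
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print('Conversion failed:', error)

print(as_int, as_str, as_bool, zero_as_bool)
```

Running it shows why bool(A) is True: only zero counts as False among the numbers.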
Printing Stuff
An essential procedure in Python is print. Try executing some code - in this case, printing a string of text:
my_str = 'I can do in Python, whatever I want'  # define a string of text
print(my_str)  # print it
I can do in Python, whatever I want
We can also print several objects at the same time. Try it!
my_var1 = 33
my_var2 = 2
print(my_var1, my_var2)
33 2
Why do we print?
- It allows us to inspect values
- For instance when trying to understand why a piece of code does not give us what we expect (i.e. debugging)
- In particular helpful when inspecting intermediate output (within a function, loops etc.).
Numeric Operators
What numeric computations can python do?
An operator in Python manipulates various data types.
We have seen addition +. Other basic numeric operators:
- multiplication *;
- subtraction -;
- division /;
- power, **
Try executing the code below. Explain in words what the expression does.
2**4
16
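As a quick illustration, the sketch below applies each of the numeric operators listed above to the same two numbers. The values 7 and 2 and the variable names are arbitrary choices.

```python
# Sketch: the basic numeric operators applied to two integers.
a = 7
b = 2

total = a + b        # addition
difference = a - b   # subtraction
product = a * b      # multiplication
quotient = a / b     # division always gives a float
power = a ** b       # a raised to the power b

print(total, difference, product, quotient, power)
```

Note that division returns a float even when both inputs are integers.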
Problem 1.1
Having seen all the information, you are now ready for the first exercise. The exercise is found below in the indented text.
> Ex. 1.1: Add the two integers 3 and 5
### BEGIN SOLUTION
answer_11 = 3 + 5
### END SOLUTION
Problem 1.2
Python also has a built-in data type called a string. This is simply a sequence of letters/characters and can thus contain names, sentences etc. To define a string in python, you need to wrap your sentence in either double or single quotation marks. For example you could write "Hello world!".
Ex. 1.2: In Python the + is not only used for numbers. Use the + to add together the three strings "VIVE", "Machine" and "Learning". What is the result?
### BEGIN SOLUTION
answer_12 = 'VIVE' + 'Machine' + 'Learning'
### END SOLUTION
Boolean Operators
Helpful advice: If you are not certain what a boolean value is, try and go back to Fundamental Data Types
What else can operators do?
We can check the validity of a statement - using the equal operator, ==, or not equal operator !=. Try the examples below:
3 == (2 + 1)
True
3 != (2 + 1)
False
11 != 2 * 5
True
In all these cases, the outcomes were boolean.
We can also do numeric comparisons, such as greater than >, greater than or equal >=, etc.:
11 <= 2 * 5
False
How can we manipulate boolean values?
Combining boolean values can be done using:
- the and operator - for two boolean values it gives the same result as &
- the or operator - for two boolean values it gives the same result as |
Let’s try this!
print(True | False)
print(True & False)
True
False
What other things can we do?
We can negate/reverse the statement with the not operator:
not (True and False)
True
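One detail worth knowing is that & and | bind more tightly than comparisons such as > and <, so they only behave like and/or when each comparison is wrapped in parentheses. The sketch below illustrates this; the value x = 10 and the variable names are arbitrary.

```python
# Sketch: `and` versus `&` when combined with comparisons.
x = 10

with_and = x > 1 and x < 5             # (x > 1) and (x < 5)
with_parentheses = (x > 1) & (x < 5)   # same result as `and`
without_parentheses = x > 1 & x < 5    # read as x > (1 & x) < 5, a different question

print(with_and, with_parentheses, without_parentheses)
```

When in doubt, use and/or for plain boolean values and always parenthesise comparisons combined with & or |.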
Problem 1.3
Above you added two integers together, and got a result of 8. Python separates numbers in two classes, the integers \(...,-1,0,1,2,...\) and the floats, which are an approximation of the real numbers \(\mathbb{R}\) (exactly how floats differ from reals is taught in introductory computer science courses).
Ex. 1.3:
* Add 1.7 to 4
* What type is 0.667 * 100 in Python?
### BEGIN SOLUTION
answer_131 = 1.7 + 4
answer_132 = 0.667 * 100
### END SOLUTION
Containers
What is a composite data type?
A data type that can contain more than one entry of data, e.g. multiple numbers.
What are the most fundamental composite data types?
Three of the most fundamental composite data types are the tuple, the list and the dictionary.
The tuple is declared with round parentheses, e.g. (1, 2, 3). Each element in the tuple is separated by a comma. Once you have declared a tuple you cannot change its content without making a copy of the tuple first (you will read that the tuple is an immutable data type).
The list is almost identical to the tuple. It is declared using square brackets, e.g. [1, 2, 3]. Unlike the tuple, a list can be changed after definition, by adding, changing or removing elements. This is called a mutable data type.
The dictionary or simply dict is also a mutable data type. Unlike the other data types the dict works like a lookup table, where each element of data stored in the dictionary is associated with a name. To look up an item in the dictionary you don’t need to know its position in the dictionary, only its name. The dict is defined with curly braces and a colon to separate the name from the value, e.g. {'name_of_first_entry': 1, 'name_of_second_entry': 2}.
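The difference between mutable and immutable containers can be seen in a short sketch. The variable names and the third dictionary entry are made up for illustration, and try/except is used only to show the error without stopping the program.

```python
# Sketch: a list and a dict can be changed after creation, a tuple cannot.
my_tuple = (1, 2, 3)
my_list = [1, 2, 3]
my_dict = {'name_of_first_entry': 1, 'name_of_second_entry': 2}

my_list[0] = 10      # change an element
my_list.append(4)    # add an element

try:
    my_tuple[0] = 10     # not allowed: tuples are immutable
    tuple_changed = True
except TypeError:
    tuple_changed = False

my_dict['name_of_third_entry'] = 3           # add a new named entry
looked_up = my_dict['name_of_second_entry']  # look up by name, not position

print(my_list, my_tuple, tuple_changed, looked_up)
```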
Problem 1.4
Ex. 1.4: Define the variable y as a list containing the elements 'k', 2, 'b', 9. Also define a variable z which is a tuple containing the same elements. Try to access the 0th element of y (python is 0-indexed) and the 1st element of z.
Hint: To access the n’th element of a list/tuple write myData[n], for example y[0] gets the 0th element of y.
### BEGIN SOLUTION
y = ['k', 2, 'b', 9]
z = ('k', 2, 'b', 9)

answer_14_y0 = y[0]
answer_14_z1 = z[1]
### END SOLUTION
2 Control Flow
If-then syntax
Control flow means writing code that controls the way data or information flows through the program. The concepts of control flow should be recognizable outside of coding as well. For example, when you go shopping you might want to buy koldskål, but only if the kammerjunker are on sale; otherwise you will buy ice cream. These kinds of logic come up everywhere in coding: self-driving cars should go forward only if the light is green, items should be listed for sale in a web shop only if they are in stock, stars should be put on the estimates if they are significant etc.
Another kind of control flow deals with doing things repeatedly. For example dishes should be done while there are still dirty dishes to wash, for each student in a course a grade should be given, etc.
In the following problems you will work with both kinds of control flow.
How can we activate code based on data in Python?
In Python, the syntax is easy with the if syntax.
if statement:
    code
In the example above, the block called code is run if the condition called statement (either a variable or an expression) is true.
Examples using if
Try to run the examples:
my_statement = (4 == 4)
if my_statement:
    print("I'm being executed, yay!")
I'm being executed, yay!
Introducing an alternative
If the statement in our condition is false, then we can execute other code with the else statement. Try the example below - and change the boolean value of my_statement.
my_statement = False
if my_statement:
    print("I'm being executed, yay!")
else:
    print("Shoot! I'm still being executed!")
Shoot! I'm still being executed!
Optional material
We have not covered the statements break and continue, or try and except, which are also control flow statements. These are slightly more advanced, but it can be a good idea to look them up yourself.
In Python the if/else logic consists of three keywords: if, elif (else if) and else. The if and elif keywords should be followed by a logical statement that is either True or False. The code that should be executed if the logic is True is written on the lines below the if, and should be indented one TAB (or 4 spaces). Finally all control flow statements must end with a colon.
Ex. 2.1: Read the code in the cell below. Assign a value to the variable x that makes the code print “Good job!”
### BEGIN SOLUTION
x = 4
### END SOLUTION
if x > 5:
    print("x is too large")
elif x >= 3 and x <= 5:
    print("Good job!")
else:
    print("x is too small")
Good job!
Above we used three different comparison operators: >, >= and <=. To compare two values and check whether they are equal, python uses double equal signs == (remember a single = was used to assign values to a variable).
Ex. 2.2: The code below draws a random number between 0 and 1 and stores it in the variable randnum. Write an if/else block that defines a new variable which is equal to 1 if randnum <= 0.1 and is 0 if randnum > 0.1.
import random
randnum = random.uniform(0,1)

### BEGIN SOLUTION
if randnum <= 0.1:
    answer_22 = 1
else:
    answer_22 = 0
### END SOLUTION
Loops
For loops
Control flow that does the same thing repeatedly is called a loop. In python you can loop through anything that is iterable, i.e. anything where it makes sense to say “for each element in this item, do whatever.”
Lists, tuples and dictionaries are all iterable, and can thus be looped over. This kind of loop is called a for loop. The basic syntax is
for element in some_iterable:
    do_something(element)
where element is a temporary name given to the current element reached in the loop, and do_something can be any valid python function applied to element.
Example - try the following code:
A = []

for i in [1, 3, 5]:
    i_squared = i ** 2
    A.append(i_squared)

print(A)
[1, 9, 25]
For loops are useful when, for example, iterating over files in a directory or over a specific set of columns.
Quiz: How does Python know where the code inside the loop begins and ends?
Answer: By indentation: the lines inside the loop are indented by four spaces, see the example above. This is the same as for if statements.
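Since dictionaries are iterable too, it may help to see how looping over one works. In the sketch below the dictionary prices and its contents are made up for illustration. Looping over a dict directly gives its names (keys), while the .items() method gives name and value together.

```python
# Sketch: looping over a dictionary (hypothetical example data).
prices = {'apples': 20, 'bananas': 15, 'cherries': 30}

names = []
for name in prices:            # iterates over the keys
    names.append(name)

total = 0
for name, price in prices.items():   # iterates over (key, value) pairs
    print(name, 'costs', price)
    total = total + price            # indented: runs once per element

print(names, total)                  # not indented: runs once, after the loop
```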
Ex. 2.3: Begin by initializing an empty list in the variable answer_23 (simply write answer_23 = []). Then loop through the list y that you defined in problem 1.4. For each element in y, multiply that element by 7 and append it to answer_23. (You can finish off by showing the content of answer_23 after the loop has run.)
Hint: To append data to a list you can write answer_23.append(new_element) where new_element is the new data that you want to append.
### BEGIN SOLUTION
y = ['k', 2, 'b', 9]
answer_23 = []
for element in y:
    answer_23.append(7 * element)
print(answer_23)
### END SOLUTION
['kkkkkkk', 14, 'bbbbbbb', 63]
While loops
The other kind of loop in Python is the while loop. Instead of looping over an iterable, the while loop continues going as long as a supplied logical condition is True.
Most commonly, the while loop is combined with a counting variable that keeps track of how many times the loop has been run.
One specific application where a while loop can be useful is data collection on the internet (scraping) which is often open ended. Another application is when we are computing something that we do not know how long will take to compute, e.g. when a model is being estimated.
The basic syntax is seen in the example below. This code will run 100 times before stopping. At each iteration, it checks that i is smaller than 100. If it is, it does something and adds 1 to the variable i before repeating.
i = 0
while i < 100:
    do_something()
    i = i + 1
In the example below, we provide an example of what do_something() can be. Try the code below and explain why it outputs what it does.
i = 0
L = []
while (i < 5):
    L.append(i * 3)
    i += 1

print(L)
[0, 3, 6, 9, 12]
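A while loop is most natural when the number of iterations is not decided up front. The sketch below, with made-up names and numbers, keeps halving a value until it drops to 1 or below and counts how many steps that took.

```python
# Sketch: a while loop where the stopping point depends on the data.
value = 100.0
steps = 0

while value > 1:
    value = value / 2   # halve the value
    steps += 1          # count the iterations

print(steps, value)
```

Here the counter is a by-product of the loop rather than the stopping condition itself.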
Problem 2.4
Ex. 2.4: Begin by defining an empty list. Write a while loop that runs from \(i=0\) up to but not including \(i=1500\). In each loop, it should determine whether the current value of i is a multiple of 19. If it is, append the number to the list. (Recall that \(i\) is divisible by \(a\) if \(i \text{ mod } a = 0\). The modulo operator in python is %.)
Hint: The if statement does not need to be followed by an else. You can simply code the if part and python will automatically skip it and continue if the logical condition is False.
Hint: Remember to increment i in each iteration. Otherwise the program will run forever. If this happens, press kernel > interrupt in the menu bar.
i = 0
answer_24 = []
### BEGIN SOLUTION
while i < 1500:
    if i % 19 == 0:
        answer_24.append(i)
    i += 1
### END SOLUTION
3 Reusable Code
Functions
If you have never programmed in anything but statistical software such as Stata or SAS, the concept of functions might be new to you. In python, a function is simply a “recipe” that is first written, and then later used to compute something.
Conceptually, functions in programming are similar to functions in math. They have between \(0\) and “\(\infty\)” inputs, do some calculation using their inputs and then return between 1 and “\(\infty\)” outputs.
By making these recipes, we can save time by making a concise function that undertakes exactly the task that we want to complete.
Python contains a large number of built-in functions. Below, you are given examples of how to use the most commonly used built-ins. You should make yourself comfortable using each of the functions shown below.
# Setup for the examples. We define two lists to show you the built-in functions.
l1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
l2 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
# The len(x) function gives you the length of the input
len(l2)
10
# The abs(x) function returns the absolute value of x
abs(-5)
5
# The min(x) and max(x) functions return the minimum and maximum of the input.
min(l1), max(l1)
(0, 90)
# The map(function, Iterable) function applies the supplied function to each element in Iterable:
# Note that the list() call just converts the result to a list
list(map(len, l2))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# The range([start], stop, [step]) function returns a range of numbers from `start` to `stop`, in increments of `step`.
# The values in [] are optional.
# If no start value is set, it defaults to 0.
# If no step value is set it defaults to 1.
# A stop value must always be set.
print("Range from 0 to 100, step=1:", range(100))
print("Range from 0 to 100, step=2:", range(0, 100, 2))
print("Range from 10 to 65, step=3:", range(10, 65, 3))
Range from 0 to 100, step=1: range(0, 100)
Range from 0 to 100, step=2: range(0, 100, 2)
Range from 10 to 65, step=3: range(10, 65, 3)
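As the output shows, printing a range only displays its definition, not the numbers it contains. A small sketch of how to see the actual values is to wrap the range in list(); the variable names and the chosen start, stop and step values are arbitrary.

```python
# Sketch: converting ranges to lists to inspect the numbers they produce.
first_five = list(range(5))          # start defaults to 0, step defaults to 1
evens = list(range(0, 10, 2))        # every second number; stop is not included
by_three = list(range(10, 25, 3))    # from 10 in steps of 3, stopping before 25

print(first_five)
print(evens)
print(by_three)
```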
# The reversed(x) function reverses the input.
# We can then loop through it backwards
l1_reverse = reversed(l1)

for e in l1_reverse:
    print(e)
90
80
70
60
50
40
30
20
10
0
# The enumerate(x) function returns the index of the item as well as the item itself in sequence.
# With it, you can loop through things while keeping track of their position:
l2_enumerate = enumerate(l2)

for index, element in l2_enumerate:
    print(index, element)
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
# The zip(x,y,...) function "zips" together two or more iterables allowing you to loop through them pairwise:
l1l2_zip = zip(l1, l2)

for e1, e2 in l1l2_zip:
    print(e1, e2)
0 a
10 b
20 c
30 d
40 e
50 f
60 g
70 h
80 i
90 j
The how
You can also write your own python functions. A python function is defined with the def keyword, followed by a user-defined name of the function, the inputs to the function and a colon. On the following lines, the function body is written, indented by one TAB.
Functions use the keyword return to signal what values the function should return after doing its calculations on the inputs. For example, we can define a function named my_first_function seen in the cell below. Run the code below and explain the printed output.
def my_first_function(x):  # takes input x
    x_squared = x ** 2  # x squared
    return x_squared + 1

print('Output for input of 0: ', my_first_function(0))
print('Output for input of 1: ', my_first_function(1))
print('Output for input of 2: ', my_first_function(2))
print('Output for input of 3: ', my_first_function(3))
Output for input of 0: 1
Output for input of 1: 2
Output for input of 2: 5
Output for input of 3: 10
We can also make more complex functions. The function below, named my_second_function, takes two inputs a and b that are used to compute the values \(a^b\) (written in python as a ** b) and \(b^a\), and returns the larger of the two.
Provide the function below with different inputs of a and b. Explain the output to yourself.
def my_second_function(a, b):
    v1 = a ** b
    v2 = b ** a

    if v1 > v2:
        return v1
    else:
        return v2
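A few example calls may make the behaviour concrete. The sketch below repeats the function so that it runs on its own; the inputs and the names result_1 to result_3 are arbitrary choices.

```python
# Sketch: calling my_second_function with a few different inputs.
def my_second_function(a, b):
    v1 = a ** b
    v2 = b ** a

    if v1 > v2:
        return v1
    else:
        return v2

result_1 = my_second_function(2, 3)   # compares 2**3 = 8 with 3**2 = 9
result_2 = my_second_function(2, 5)   # compares 2**5 = 32 with 5**2 = 25
result_3 = my_second_function(2, 4)   # both are 16, so the else branch returns v2

print(result_1, result_2, result_3)
```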
Problem 3.1
Ex. 3.1: Write a function called minimum that takes as input a list of numbers, and returns the index and value of the minimum number as a tuple. Use your function to calculate the index and value of the minimum number in the list [-342, 195, 573, -234, 762, -175, 847, -882, 153, -22].
Hint: A “pythonic” way to keep count of the index of the minimum value would be to loop over the list of numbers by using the enumerate function on the list of numbers.
### BEGIN SOLUTION
def minimum(numbers):
    min_num_index, min_num = float('inf'), float('inf')
    for (number_index, value) in enumerate(numbers):
        if value < min_num:
            min_num_index = number_index
            min_num = value
    return min_num_index, min_num

# # Alternative solution:
# def minimum(numbers):
#     min_value = min(numbers)
#     idx_min_value = numbers.index(min_value)
#     return idx_min_value, min_value

numbers = [-342, 195, 573, -234, 762, -175, 847, -882, 153, -22]
answer_31 = minimum(numbers)
### END SOLUTION
Problem 3.2
Ex. 3.2: Write a function called average that takes as input a list of numbers, and returns the average of the values in the list. Use your function to calculate the average of the values [-1, 2, -3, 4, 0, -4, 3, -2, 1]
### BEGIN SOLUTION
def average(num_list):
    return sum(num_list) / len(num_list)

answer_32 = average([-1, 2, -3, 4, 0, -4, 3, -2, 1])
### END SOLUTION
Problem 3.3 (OPTIONAL)
Recall that the exponential function \(e^{x}\), where \(e\) is Euler's number, can be calculated as \[ e^{x}=\lim_{n\rightarrow \infty}\left(1+\frac{x}{n}\right)^{n} \] (setting \(x=1\) gives \(e\) itself). Of course we cannot compute the limit on a finite memory computer. Instead we can calculate approximations by taking \(n\) large enough.
Ex. 3.3: Write a function named eulers_e that takes two inputs x and n, calculates \[ \left(1+\frac{x}{n}\right)^{n} \] and returns this value. Use your function to calculate eulers_e(1, 5) and store this value in the variable answer_33.
### BEGIN SOLUTION
def eulers_e(x, n):
    return (1 + x / n) ** n

answer_33 = eulers_e(1, 5)
### END SOLUTION
Problem 3.4 (OPTIONAL)
The inverse of the exponential is the logarithm. Like the exponential function, there are limit definitions of the logarithm. One of these is \[ \log(x) = 2 \cdot \sum_{k=0}^{\infty} \frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1} \]
where \(\sum_{k=0}^{\infty}\) signifies the sum of infinitely many elements, starting from \(k=0\). Each element in the sum takes the value \(\frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}\) for some \(k\). As before, we must approximate this with a finite sum.
Ex. 3.4: Define another function called natural_logarithm which takes two inputs x and k_max. In the function body calculate \[ 2 \cdot \sum_{k=0}^{k\_max} \frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1} \] and return this value.
Hint: to calculate the sum, first initialize a value total = 0, loop through \(k\in \{0, 1, \ldots, k\_max\}\) and compute \(\frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}\). Add the computed value to your total in each step of the loop. After finalizing the loop you can then multiply the total by 2 and return the result.
### BEGIN SOLUTION
def natural_logarithm(x, k_max):
    return 2 * sum([1 / (2 * k + 1) * ((x - 1)/(x + 1)) ** (2 * k + 1)
                    for k in range(k_max + 1)])

print(natural_logarithm(1,100))
### END SOLUTION
0.0
Problem 3.5 (OPTIONAL)
Just like numbers and strings, python treats functions as objects. This means you can write functions that take a function as input, and functions that return functions after being executed. This is sometimes a useful tool to have when you need to add extra functionality to an already existing function, or if you need to write function factories.
Ex. 3.5: Write a function called exponentiate that takes one input named func. In the body of exponentiate define a nested function (i.e. a function within a function) called with_exp that takes two inputs x and k. The nested function should return func(e, k) where e = eulers_e(x, k). The outer function should return the nested function with_exp, i.e. write something like
def exponentiate(func):
    def with_exp(x, k):
        e = eulers_e(x, k)
        value = #[FILL IN]
        return value
    return with_exp
Call the exponentiate function on natural_logarithm and store the result in a new variable called logexp.
Hint: You will not get exactly the same result as you put in due to approximations and numerical accuracy.
### BEGIN SOLUTION
def exponentiate(func):
    def with_exp(x, k):
        e = eulers_e(x, k)
        value = func(e, k)
        return value
    return with_exp

logexp = exponentiate(natural_logarithm)
print(logexp(1, 100))
### END SOLUTION
0.9950330853168091
Getting More General
Modules
Whatever we attempt in programming, it is likely nowadays that someone has done it before us. Therefore, we can reuse code, which allows us to 1. save time by using others’ code, and 2. learn from others’ code.
Moreover, often the code implemented by someone with more experience is likely to work better and faster than what we can come up with! That’s why we introduce modules. These are packages of Python code that we can load - and by doing that, we get access to powerful tools.
Let’s see how modules work. Run the code below to load a module called numpy which allows us to work with linear algebra and other numeric tools.
import numpy as np
Let’s create an array with numpy.
row1 = [1, 2]
row2 = [3, 4]
table = [row1, row2]

my_array = np.array(table)
my_array
array([[1, 2],
[3, 4]])
What is a numpy array?
An n-dimensional container that can store specific data types, e.g. bool and float. The arrays come with certain available methods and tools. E.g. a 2-d array can act like a matrix, and a 3-d array can act like a tensor.
Objects can have useful attributes and methods that are built-in. These are accessed using "."
For example, an array can be transposed as follows:
my_array.T
array([[1, 3],
[2, 4]])
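Beyond the transpose, a few other attributes, methods and operators are handy to know. The sketch below assumes numpy is installed and rebuilds the same 2-by-2 array; the variable names are arbitrary.

```python
# Sketch: some attributes, methods and operators of a numpy array.
import numpy as np

my_array = np.array([[1, 2], [3, 4]])

shape = my_array.shape            # attribute: (rows, columns)
total = my_array.sum()            # method: sum of all elements
doubled = my_array * 2            # operators work element by element
product = my_array @ my_array.T   # @ is matrix multiplication

print(shape, total)
print(doubled)
print(product)
```

Note the difference between * (elementwise) and @ (matrix product).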
(Optional) Classes
In Python, we can also define our own types of objects, each of which is known as a class. Each class contains rules and properties that govern how objects of the class will behave. If you are curious and want to learn, which is totally optional, then read more here (note: quite technical). Otherwise move on.
4 Pandas for data structuring
You may ask yourself: Why do we need to learn data structuring?
Data never comes in the form of our model (unless you or someone else has done it in another program, which is perfectly fine). We need to ‘wrangle’ our data. As of right now, even the most advanced techniques need data in a structured format to work with it.
An Overview
Tabular data is like the table below. Each row is an observation which consists of two entries, one for each of the columns/fields, i.e. Animal and Date.
index | Animal | Date |
---|---|---|
Observation 1 | Elk | July 1, 2019 |
Observation 2 | Pig | July 3, 2019 |
What pandas provides is a smart way of structuring data. It has two fundamental data types, see below. These are essentially just containers but come with a lot of extra functionality for structuring data and performing analysis.
- Series: tabular data with a single column (field)
  - akin to a vector in mathematics
- DataFrame: tabular data that allows for more than one column (multiple fields)
  - akin to a matrix in mathematics
  - has labelled columns (e.g. Animal and Date above) and named rows, called indices.
Run the code below to make your first pandas dataframe. Try to print it and explain the content it shows.
import pandas as pd
df1 = pd.DataFrame(data=[[1, 2],[3, 4],[5, 6],[7, 8]],
                   index=['i', 'ii','iii','iv'],
                   columns=['A', 'B'])
The code below makes a series from a list. We can see that it contains all the four fundamental data types!
L = [1, 1.2, 'abc', True]
ser1 = pd.Series(L)
Now you may ask yourself: why don’t we just use numpy?
There are many reasons. pandas is easier for loading, structuring and making simple analysis of tabular data. However, in many cases, if you are working with custom data or need to perform fast and complex array computations, then numpy is a better option. If you are interested see discussion here.
Switching Among Python, Numpy and Pandas
Pandas dataframes can be thought of as numpy arrays with some additional stuff. Note that columns can have different datatypes!
Most functions from numpy can be applied directly to Pandas. We can convert a DataFrame to a numpy array with the values attribute:
df1.values
array([[1, 2],
[3, 4],
[5, 6],
[7, 8]], dtype=int64)
In Python, we can describe it as a list of lists.
df1.values.tolist()
[[1, 2], [3, 4], [5, 6], [7, 8]]
Both dataframes and series have indices, which are both a blessing and a curse. These indices mean that we can often convert a Series into a dictionary:
ser1.to_dict()
{0: 1, 1: 1.2, 2: 'abc', 3: True}
WARNING!: Series indices are NOT unique thus we may lose data if we convert to a dict which requires unique keys.
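The warning can be illustrated with a small sketch, assuming pandas is installed. The series ser_dup and its values are made up; the index label 'a' is deliberately used twice.

```python
# Sketch: converting a Series with a repeated index label to a dict loses data.
import pandas as pd

ser_dup = pd.Series([10, 20, 30], index=['a', 'b', 'a'])  # hypothetical data

index_is_unique = ser_dup.index.is_unique   # False: 'a' appears twice
as_dict = ser_dup.to_dict()                 # dict keys must be unique

print(len(ser_dup), len(as_dict))
print(as_dict)
```

The series has three entries but the dictionary only two, because only one of the two 'a' entries survives the conversion. Checking index.is_unique first is one way to guard against this.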
Inspection
Often we want to see what our dataframe contains. This can be done by putting the dataframe at the end of our cell, then it will automatically be printed.
The example below consists of 100 rows, with 5 columns of random data. We see that putting the dataframe at the end of the cell prints the dataframe.
df2 = pd.DataFrame(data=np.random.rand(100, 5),
                   columns=['A','B','C','D','E'])
df2
| | A | B | C | D | E |
---|---|---|---|---|---|
0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 | 0.374320 |
1 | 0.933523 | 0.830997 | 0.430150 | 0.235283 | 0.129003 |
2 | 0.900874 | 0.708393 | 0.950499 | 0.171770 | 0.503687 |
3 | 0.214144 | 0.735157 | 0.651842 | 0.580469 | 0.448282 |
4 | 0.756690 | 0.119340 | 0.269215 | 0.099179 | 0.411532 |
... | ... | ... | ... | ... | ... |
95 | 0.728482 | 0.232860 | 0.854766 | 0.784101 | 0.711444 |
96 | 0.706587 | 0.819365 | 0.090774 | 0.303287 | 0.224769 |
97 | 0.796380 | 0.783840 | 0.740566 | 0.747527 | 0.969443 |
98 | 0.433955 | 0.938853 | 0.932820 | 0.845110 | 0.583784 |
99 | 0.658342 | 0.699536 | 0.337664 | 0.424492 | 0.458236 |
100 rows × 5 columns
We can also use the head and tail methods, which select the first and last observations in a DataFrame, respectively. The code below prints the first four rows.
df3 = df2.head(n=4)
df3
| | A | B | C | D | E |
---|---|---|---|---|---|
0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 | 0.374320 |
1 | 0.933523 | 0.830997 | 0.430150 | 0.235283 | 0.129003 |
2 | 0.900874 | 0.708393 | 0.950499 | 0.171770 | 0.503687 |
3 | 0.214144 | 0.735157 | 0.651842 | 0.580469 | 0.448282 |
Input-output
We can load and save dataframes from our computer or the internet. Try the code below to save our dataframe as a CSV file called my_data.csv. If you are unsure what a CSV file is then check the Wikipedia description.
df3.to_csv('my_data.csv')
Loading data is just as easy. Some data sources are open and easy to collect data from. They do not require formatting as they come in a table format. The code below loads a CSV file on school test data from NYC.
my_url = 'https://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv'
my_df = pd.read_csv(my_url)

my_df.head(10)
| | DBN | School Name | Number of Test Takers | Critical Reading Mean | Mathematics Mean | Writing Mean |
---|---|---|---|---|---|---|
0 | 01M292 | Henry Street School for International Studies | 31.0 | 391.0 | 425.0 | 385.0 |
1 | 01M448 | University Neighborhood High School | 60.0 | 394.0 | 419.0 | 387.0 |
2 | 01M450 | East Side Community High School | 69.0 | 418.0 | 431.0 | 402.0 |
3 | 01M458 | SATELLITE ACADEMY FORSYTH ST | 26.0 | 385.0 | 370.0 | 378.0 |
4 | 01M509 | CMSP HIGH SCHOOL | NaN | NaN | NaN | NaN |
5 | 01M515 | Lower East Side Preparatory High School | 154.0 | 314.0 | 532.0 | 314.0 |
6 | 01M539 | New Explorations into Sci, Tech and Math HS | 47.0 | 568.0 | 583.0 | 568.0 |
7 | 01M650 | CASCADES HIGH SCHOOL | 35.0 | 411.0 | 401.0 | 401.0 |
8 | 01M696 | BARD HIGH SCHOOL EARLY COLLEGE | 138.0 | 630.0 | 608.0 | 630.0 |
9 | 02M047 | AMERICAN SIGN LANG ENG DUAL | 11.0 | 405.0 | 415.0 | 385.0 |
Working with weather data
We will now work with a dataset regarding weather. Our source will be the National Oceanic and Atmospheric Administration (NOAA), which has a global data collection going back a couple of centuries. This collection is called the Global Historical Climatology Network (GHCN). The data contains daily weather recorded at the weather stations. A description of GHCN can be found here.
Problem 4.1
Ex. 4.1: Use Pandas’ CSV reader to fetch daily weather data from 1863 for various stations - available somewhere on your common drive. If you cannot find it, it can also be found at this website.
Hint: you will need to give read_csv some keywords. Here are some suggestions
- Specify the path, using either a string or through the pathlib module, see documentation (nice for interoperability between macOS + Windows and relative paths).
- for compressed files you may need to specify the keyword compression when calling the .read_csv method.
- header can be specified as the CSV has no column names.
import pandas as pd
### BEGIN SOLUTION
# using online url
path = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/1863.csv.gz"

# ASSUMING data is in a folder called 'data':
# using string
#path = 'data/1863.csv.gz'
# using pathlib
#from pathlib import Path
#cwd = Path.cwd()
#path = cwd / 'data' / '1863.csv.gz'

df_weather = pd.read_csv(
    path,
    compression='gzip',  # decompress gzip
    header=None  # use no header information from the csv
)
### END SOLUTION
Selecting Rows and Columns
In pandas there are two canonical ways of accessing subsets of a dataframe.
- The iloc attribute: access rows and columns using integer indices (like a list).
- The loc attribute: access rows and columns using immutable keys, e.g. numbers, strings (like a dictionary).
In what follows we will describe some different ways of selecting using .iloc and .loc, as well as a simpler way of accessing the dataframe directly using []. The different ways are meant to give you an overview.
Using list of keys/indices
Below is an example of using the iloc attribute to select specific rows:
# show df1 before indexing it with .iloc[]
df1
| | A | B |
---|---|---|
i | 1 | 2 |
ii | 3 | 4 |
iii | 5 | 6 |
iv | 7 | 8 |
my_irows = [0, 3]
df1.iloc[my_irows]
| | A | B |
---|---|---|
i | 1 | 2 |
iv | 7 | 8 |
We can select columns and rows simultaneously. Below is an example of using the loc attribute, which does that:
my_rows = ['i', 'iii']
my_cols = ['A']
df1.loc[my_rows, my_cols]
| | A |
---|---|
i | 1 |
iii | 5 |
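
To see how iloc and loc relate, the sketch below selects the same cells in both ways, assuming pandas is installed. It rebuilds df1 so that it runs on its own; the names by_position, by_label, same and single_value are made up.

```python
# Sketch: the same selection by integer position (iloc) and by label (loc).
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8]],
                   index=['i', 'ii', 'iii', 'iv'],
                   columns=['A', 'B'])

by_position = df1.iloc[[0, 2], [0]]        # rows 0 and 2, column 0
by_label = df1.loc[['i', 'iii'], ['A']]    # rows 'i' and 'iii', column 'A'

same = by_position.equals(by_label)        # both give the same dataframe
single_value = df1.loc['iv', 'B']          # a single row and column label gives one value

print(by_position)
print(same, single_value)
```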
Using thresholds
We can also use iloc and loc for selecting rows and/or columns below or above some threshold, see below. Note that whether the : is placed before or after the number determines whether we select what is below or above it.
df2.iloc[:3, :4]
| | A | B | C | D |
---|---|---|---|---|
0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 |
1 | 0.933523 | 0.830997 | 0.430150 | 0.235283 |
2 | 0.900874 | 0.708393 | 0.950499 | 0.171770 |
Using boolean data
If we provide the dataframe with a list of booleans, it will select the rows where the value is True (this also works with iloc and loc). We will see soon that this is an extremely useful way of selecting certain rows.
df3[[True, False, False, True]]
| | A | B | C | D | E |
---|---|---|---|---|---|
0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 | 0.374320 |
3 | 0.214144 | 0.735157 | 0.651842 | 0.580469 | 0.448282 |
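
In practice the booleans are rarely typed by hand; they usually come from a comparison on the data itself. The sketch below, assuming pandas is installed, uses a small made-up dataframe df with columns A and D to show the idea.

```python
# Sketch: building the boolean selection from a comparison (hypothetical data).
import pandas as pd

df = pd.DataFrame({'A': [0.2, 0.9, 0.5], 'D': [0.7, 0.1, 0.5]})

mask = df['A'] > df['D']   # a boolean Series, one value per row
selected = df[mask]        # keeps only the rows where mask is True

print(mask.tolist())
print(selected)
```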
Selecting columns
Often we need to select specific columns. If we provide the dataframe with a list of column names, it will return a dataframe that keeps only these columns:
df3[['B', 'D']]
| | B | D |
---|---|---|
0 | 0.094683 | 0.911622 |
1 | 0.830997 | 0.235283 |
2 | 0.708393 | 0.171770 |
3 | 0.735157 | 0.580469 |
Problem 4.2
Ex 4.2: Select the four left-most columns which contain: station identifier, date, observation type, observation value. Rename them as ‘station’, ‘datetime’, ‘obs_type’, ‘obs_value’.
Hint: Renaming can be done with df.columns = cols where cols is a list of column names.
### BEGIN SOLUTION
df_weather = df_weather.iloc[:, :4] # select only first four columns

column_names = ['station', 'datetime', 'obs_type', 'obs_value']
df_weather.columns = column_names # set column names
### END SOLUTION
Basic Operations
How do we perform elementary operations like we learned for basic Python? E.g. numeric operations such as summation (+) or logical operations such as greater than (>). Actually we are in luck - they are exactly the same.
Let’s see how it works for numeric data using a numpy array (works the same way as Pandas).
my_arr1 = np.array([2, 3, 2, 1, 1])
my_arr2 = my_arr1 ** 2
my_arr2
array([4, 9, 4, 1, 1])
Can we do the same with two vectors? Yes, we can also do elementwise addition, multiplication, subtractions etc. of series. Example:
my_arr1 + my_arr2
array([ 6, 12, 6, 2, 2])
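The same operators can be tried on pandas Series. The sketch below mirrors the numpy example with hypothetical variable names (s1, s2 and so on are not from the notebook) and also shows a logical comparison.

```python
import numpy as np
import pandas as pd

# Same numbers as in the numpy example, now stored in a pandas Series
s1 = pd.Series([2, 3, 2, 1, 1])
s2 = s1 ** 2          # elementwise power
total = s1 + s2       # elementwise addition
is_large = s2 > 3     # elementwise comparison gives a boolean Series

# The numpy computation yields the same values
arr_total = np.array([2, 3, 2, 1, 1]) + np.array([2, 3, 2, 1, 1]) ** 2

print(total.tolist())
print(is_large.tolist())
```

One difference to keep in mind: operations between two Series are matched on index labels, not on position, so Series with different indices may produce missing values.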
Changing and Copying Data
Everything in the dataframe can be changed. For instance, we can update our dataframe with new values, e.g. by making new variables or overwriting existing ones. In the example below we add a new column to a DataFrame.
df2['F'] = df2['A'] > df2['D']
df2.head(10)
| | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 | 0.374320 | False |
1 | 0.933523 | 0.830997 | 0.430150 | 0.235283 | 0.129003 | True |
2 | 0.900874 | 0.708393 | 0.950499 | 0.171770 | 0.503687 | True |
3 | 0.214144 | 0.735157 | 0.651842 | 0.580469 | 0.448282 | False |
4 | 0.756690 | 0.119340 | 0.269215 | 0.099179 | 0.411532 | True |
5 | 0.634309 | 0.958614 | 0.330676 | 0.454304 | 0.098996 | True |
6 | 0.327120 | 0.263946 | 0.884487 | 0.238092 | 0.283622 | True |
7 | 0.180478 | 0.433104 | 0.719118 | 0.188784 | 0.674121 | False |
8 | 0.645979 | 0.667443 | 0.978808 | 0.531604 | 0.241179 | True |
9 | 0.940160 | 0.744014 | 0.657913 | 0.348178 | 0.940021 | True |
WARNING!: If you work on a subset of data from another dataframe, then this subset may be what is known as a view! In that case, changes made in the view will also be made in the original version.
In the example below, we try to change the dataframe df3, which is a view of df2, and we get a warning. Thus, changes to df3 also happen in df2. Notice that we can also use loc for changing the data.
df3.loc[:,'D'] = df3['A'] - df3['E']
print(df2['D'].head(3), '\n')
print(df3['D'].head(3))
0 -0.083219
1 0.804520
2 0.397187
Name: D, dtype: float64
0 -0.083219
1 0.804520
2 0.397187
Name: D, dtype: float64
C:\Users\wkg579\AppData\Local\Temp\ipykernel_18024\2462267356.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df3.loc[:,'D'] = df3['A'] - df3['E']
To avoid the problem of having a view, we can instead copy the data as in the example below. Try to verify that if you change things in df4, things do not change in df2.
df4 = df2.copy()
# Verify that the code from above doesn't throw the same "SettingWithCopyWarning"
# when using the copied dataframe, df4, instead of df3.
df4.loc[:,'D'] = df4['A'] - df4['E']
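The verification can be sketched on a tiny dataframe. The names original and independent are hypothetical and only serve this illustration: after copy(), overwriting a column in the copy leaves the source untouched.

```python
import pandas as pd

# Hypothetical example data
original = pd.DataFrame({"A": [1.0, 2.0, 3.0], "D": [10.0, 20.0, 30.0]})

# copy() makes a new dataframe with its own data
independent = original.copy()

# Overwrite a column in the copy only
independent.loc[:, "D"] = 0.0

print(original["D"].tolist())      # values of the source dataframe
print(independent["D"].tolist())   # values of the modified copy
```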
Problem 4.3
Ex. 4.3: Further, select the subset of data for the station UK000056225 and only observations for maximal temperature. Make a copy of the DataFrame and store this in the variable df_select. Explain in one or two sentences how copying works. Write your answer in a multi-line comment like """ Your answer here """.
Hint: The & operator works elementwise on boolean series (like and in core Python). This allows you to combine conditions for selections.
### BEGIN SOLUTION
select_stat = df_weather.station == 'UK000056225' # boolean: first weather station
select_tmax = df_weather.obs_type == 'TMAX' # boolean: maximal temp.

select_rows = select_stat & select_tmax # row selection - require both conditions

df_select = df_weather[select_rows].copy() # apply selection and copy

explanation = """Copying of the dataframe breaks the dependency with original DataFrame `df_weather`.
If dependency is not broken, then changing values in one of the two dataframes
would imply changes in the other."""
print(explanation)
### END SOLUTION
Copying of the dataframe breaks the dependency with original DataFrame `df_weather`.
If dependency is not broken, then changing values in one of the two dataframes
would imply changes in the other.
Problem 4.4
Ex 4.4: Make sure that max temperature is correctly formatted (how many decimals should we add? one? Look through this .txt file for an answer: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt). Make a new column called TMAX_F where you have converted the temperature variables to Fahrenheit.
Hint: Conversion is \(F = 32 + 1.8*C\) where \(F\) is Fahrenheit and \(C\) is Celsius.
### BEGIN SOLUTION
# In the readme.txt file downloaded from the link given in the exercise text,
# it can be seen, in the section 'III. FORMAT OF DATA FILES', that
# TMAX = Maximum temperature (tenths of degrees C).
# Therefore, we convert to degrees Celsius by dividing by 10.
df_select['obs_value'] = df_select['obs_value'] / 10
df_select['TMAX_F'] = 32 + 1.8 * df_select['obs_value']
### END SOLUTION
Changing and Rearranging Indices
In addition to replacing values of our data, we can also rearrange the order of variables and rows as well as make new ones. We have already seen how to change column names, but we can also reset the index, as seen below. Alternatively, we can set our own custom index using set_index, e.g. with temporal data, which provides the DataFrame with new functionality.
df1_new_index = df1.reset_index(drop=True)
df1_new_index
| | A | B |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
2 | 5 | 6 |
3 | 7 | 8 |
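As an illustration of the custom index mentioned above, here is a minimal sketch with made-up data (df_demo and its columns are hypothetical). Setting a datetime column as the index allows, for example, selecting a whole month by a partial date string.

```python
import pandas as pd

# Hypothetical example data with a datetime column
df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-02-01"]),
    "value": [1, 2, 3],
})

# Use the date column as the index
df_by_date = df_demo.set_index("date")

# With a datetime index, loc accepts a partial date string such as a month
january = df_by_date.loc["2020-01"]

print(january)
```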
A powerful tool for re-organizing the data is to sort the data. That is, we can re-organize rows (or columns) such that they are ascending or descending according to one or more columns.
df3_sorted = df3.sort_values(by=['A','B'], ascending=True)
df3_sorted
| | A | B | C | D | E |
---|---|---|---|---|---|
3 | 0.214144 | 0.735157 | 0.651842 | -0.234138 | 0.448282 |
0 | 0.291100 | 0.094683 | 0.550356 | -0.083219 | 0.374320 |
2 | 0.900874 | 0.708393 | 0.950499 | 0.397187 | 0.503687 |
1 | 0.933523 | 0.830997 | 0.430150 | 0.804520 | 0.129003 |
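Sorting on several columns can also mix directions. The following sketch uses a hypothetical dataframe df_demo and passes a list to ascending, one entry per sort column: ties in A are broken by B in descending order.

```python
import pandas as pd

# Hypothetical example data with ties in column A
df_demo = pd.DataFrame({"A": [2, 1, 2, 1], "B": [0.3, 0.9, 0.1, 0.5]})

# Sort by A ascending, then by B descending within equal values of A
sorted_mixed = df_demo.sort_values(by=["A", "B"], ascending=[True, False])

# The original row labels travel with the rows; reset them if a clean 0,1,2,... index is wanted
sorted_clean = sorted_mixed.reset_index(drop=True)

print(sorted_mixed)
```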
Problem 4.5
Ex 4.5: Inspect the indices in df_select. Are they following the sequence of natural numbers, 0,1,2,…? If not, reset the index and make sure to drop the old.
### BEGIN SOLUTION
df_select = df_select.reset_index(drop=True)
### END SOLUTION
Problem 4.6
Ex 4.6: Make a new DataFrame df_sorted where you have sorted by the maximum temperature. What is the date for the first and last observations?
### BEGIN SOLUTION
df_sorted = df_select.sort_values(by=['obs_value'])
print(
    f"Date for the min temp: {df_sorted['datetime'].iloc[0]}",
    f"Date for the max temp: {df_sorted['datetime'].iloc[-1]}",
    sep="\n"
)
### END SOLUTION
Date for the min temp: 18631231
Date for the max temp: 18630714
5 Pandas with datetimes and aggregations (OPTIONAL)
Pandas supports many more functions, many of which are covered in the Python for Data Analysis (PDA) book. These include further data cleaning (PDA chapter 7), merging and joining (PDA chapter 8), groupby functionality (PDA chapter 10), datetimes (PDA chapter 11) and more.
Problem 5.1
When working with datetimes, it is common to get them as pure strings from the data source. In the weather data, the date is a string of the format YYYYMMDD, which can be converted to a date that pandas understands using the pandas function to_datetime(), which is described in the pandas documentation.
Ex 5.1: Convert the string date to a pandas date and add this to a new column called datetime_dt.
Hint: When converting string dates to pandas dates, it is always wise to specify the format. PDA has a table with format information.
### BEGIN SOLUTION
datetime_dt = pd.to_datetime(df_select['datetime'], format = '%Y%m%d')
df_select['datetime_dt'] = datetime_dt
### END SOLUTION
Problem 5.2
Ex 5.2: Create a new column with the month of the observation
Hint: If a Series/column has a date in it, the datetime functionality can be accessed by calling .dt on it, which can be followed by further commands.
### BEGIN SOLUTION
month = datetime_dt.dt.month
df_select['month'] = month
### END SOLUTION
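The two steps above can be tried in isolation on a couple of date strings. This is only a sketch: the Series raw is hypothetical input in the YYYYMMDD format, and the variable names are made up for the illustration.

```python
import pandas as pd

# Hypothetical raw dates in YYYYMMDD format, stored as strings
raw = pd.Series(["18630714", "18631231"])

# Parse with an explicit format so pandas does not have to guess
dates = pd.to_datetime(raw, format="%Y%m%d")

# The .dt accessor exposes datetime components for the whole Series
months = dates.dt.month
years = dates.dt.year

print(months.tolist())
print(years.tolist())
```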
Problem 5.3
A very powerful method to analyse data is the split-apply-combine method. In pandas this corresponds to the groupby functionality.
Ex 5.3: Compute the mean and median maximum daily temperature for each month on the dataframe df_select using the split-apply-combine procedure. Store the results in new columns tmax_mean and tmax_median.
Hint: The groupby functionality can be ‘unwrapped’ using the transform method, such that it retains the original length. This is very handy when trying to create new columns, and not reporting statistics.
### BEGIN SOLUTION
df_select['tmax_mean'] = df_select.groupby(['month'])['obs_value'].transform('mean')
df_select['tmax_median'] = df_select.groupby(['month'])['obs_value'].transform('median')
### END SOLUTION
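To see what the hint about transform means, it may help to compare it with a plain aggregation on a toy dataframe. The data in df_demo below is made up; only the column names follow the exercise.

```python
import pandas as pd

# Hypothetical example data: two observations in month 1, three in month 2
df_demo = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "obs_value": [2.0, 4.0, 10.0, 20.0, 30.0],
})

# Aggregation: one value per group, so the result has one row per month
per_month = df_demo.groupby("month")["obs_value"].mean()

# Transform: the group mean is repeated for every row of the group,
# so the result has the same length as the original dataframe
df_demo["tmax_mean"] = df_demo.groupby("month")["obs_value"].transform("mean")

print(per_month)
print(df_demo)
```

The aggregated version is what you would report as a table of statistics, while the transformed version is what fits into a new column.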
6 Linear regression with numpy (OPTIONAL)
NOTE: If you previously skipped 3.3, 3.4 or 3.5, you might benefit from completing these before continuing with this numpy section.
Python supports all of the regular matrix computations, if one wishes to implement a predictor or estimator on their own.
To showcase this, you will in this example be tasked to convert a subset of a DataFrame into a numpy array. Based on this, you can implement estimators such as the ordinary least squares estimator:
\[\hat \beta = (X'X)^{-1}(X'y)\]
To test this out, we will estimate how age, passenger class and fare influenced chance of survival for the passengers of Titanic.
Problem 6.1
Ex 6.1: Load the titanic dataset from seaborn using the load_dataset function. Remove any rows with missing values.
Hint:
- The dataset is aptly named titanic.
- pandas has a built-in function called dropna.
### BEGIN SOLUTION
import seaborn as sns
df_titanic = sns.load_dataset('titanic')
df_titanic = df_titanic.dropna()
### END SOLUTION
Problem 6.2
Ex 6.2: Convert the columns age, pclass and fare to an array with dimensions N*3 and the column survived to an array with dimensions N*1.
Hint: Try subsetting the data in the DataFrame and then converting it to an array.
### BEGIN SOLUTION
# Using pandas method
X = df_titanic[['age','pclass','fare']].to_numpy()
y = df_titanic[['survived']].to_numpy()

# Using as input to np.array
X_alt = np.array(df_titanic[['age','pclass','fare']])
y_alt = np.array(df_titanic[['survived']])

# equivalence
assert (X == X_alt).all()
assert (y == y_alt).all()
### END SOLUTION
Problem 6.3
Ex 6.3: Implement the ordinary least squares estimator with no intercept using numpy.
Hint:
- numpy offers a lot of methods for arrays
- numpy.linalg offers a lot of functionality for linear algebra
- @ calculates a matrix product
- inv inverts a matrix
- If you’ve imported numpy as np, these functions can be accessed as np.linalg.function
- You can also import specific functions as from numpy.linalg import function
- This can reduce the clutter in your code (e.g. np.linalg.inv(X) versus inv(X))
### BEGIN SOLUTION
from numpy.linalg import inv
beta = inv(X.T@X)@(X.T@y)

# equivalence with sklearn method
from sklearn.linear_model import LinearRegression
OLS = LinearRegression(fit_intercept=False)

OLS.fit(X, y)
# np.isclose instead of == due to numerical inaccuracies
assert np.isclose(OLS.coef_, beta.T).all()
### END SOLUTION
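The estimator can also be checked without the Titanic data. The sketch below is an illustration on synthetic, noise-free data (X_demo, y_demo and beta_true are made up), so the estimate should recover the coefficients used to generate y. It also shows two alternatives to the explicit inverse, np.linalg.solve and np.linalg.lstsq, which are commonly preferred for numerical stability; they are not part of the exercise solution.

```python
import numpy as np

# Synthetic data: 50 observations, 3 regressors, known coefficients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
beta_true = np.array([[1.0], [-2.0], [0.5]])
y_demo = X_demo @ beta_true   # no noise, so OLS should recover beta_true

# Textbook formula: (X'X)^{-1} (X'y)
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ (X_demo.T @ y_demo)

# Solve the normal equations (X'X) beta = X'y without forming the inverse
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver working directly on X and y
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_true))
print(np.allclose(beta_solve, beta_lstsq))
```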