Overview of notebook
This notebook consists of two independent parts. The first part covers basic Python, where you get familiar with the most important concepts and tools. The second part is a short introduction to pandas, which is a tool for structuring data in Python, and numpy, which is a tool for matrix calculations.
Overview of Python content
In this integrated assignment and teaching module, we will learn the following things about basic Python:
- Fundamental data types: numeric, string and boolean
- Operators: numerical and logical
- Conditional logic
- Containers with indices
- Loops: for and while
- Reusable code: functions, classes and modules
Additional sources:
As always, there are many sources out there: Google is your best friend. However, here are some recommendations:
A book: Python for Data Analysis
Videos: pythonprogramming.net fundamental (basics and intermediate)
A tutorial website: The official python 3 tutorial (sections 3, 4 and 5)
1 Fundamentals of Python
Elementary Data Types
Examples with data types
Execute the code below to create a variable A as a float equal to 1.5:
A = 1.5
Execute the code below to convert the variable A to an integer:
int(A) # truncates toward zero; for positive numbers this is the same as rounding down
1
We can do the same for converting to float, str, bool. Note some may at first come across as slightly odd:
bool(A)
True
While some are simply not allowed:
#float('A') # Attempt at converting the string (text) 'A' to a number
Printing Stuff
An essential procedure in Python is print. Try executing some code - in this case, printing a string of text:
my_str = 'I can do in Python, whatever I want' # define a string of text
print(my_str) # print it
I can do in Python, whatever I want
We can also print several objects at the same time. Try it!
my_var1 = 33
my_var2 = 2
print(my_var1, my_var2)
33 2
Why do we print?
- It allows us to inspect values.
- For instance, when trying to understand why a piece of code does not give us what we expect (i.e. debugging).
- It is particularly helpful when inspecting intermediate output (within functions, loops etc.).
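As a minimal sketch of printing intermediate values, the cell below computes a total in two steps and prints the value after each step. The variable names and the 25% markup are made up for illustration only.

```python
# Hypothetical numbers, used only to illustrate printing intermediate values
price = 20
quantity = 3

subtotal = price * quantity
print('subtotal:', subtotal)   # inspect the intermediate value

total = subtotal * 1.25        # add a hypothetical 25% markup
print('total:', total)         # inspect the final value
```

If the final value looks wrong, the first print tells you whether the error happened before or after the markup was applied.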
Numeric Operators
What numeric computations can python do?
An operator in Python manipulates various data types.
We have seen addition +. Other basic numeric operators:
- multiplication *
- subtraction -
- division /
- power **
Try executing the code below. Explain in words what the expression does.
2**4
16
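Besides the operators listed above, Python also has floor division // and the modulo operator % (the latter is used in a later exercise). The sketch below shows how they relate to ordinary division; the variable names are chosen here for illustration.

```python
true_division = 7 / 2     # ordinary division always gives a float
floor_division = 7 // 2   # floor division drops the fractional part
remainder = 7 % 2         # modulo gives the remainder of the division

print(true_division, floor_division, remainder)
```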
Problem 1.1
Having seen all the information, you are now ready for the first exercise. The exercise is found below in the indented text.
> Ex. 1.1: Add the two integers 3 and 5
### BEGIN SOLUTION
answer_11 = 3 + 5
### END SOLUTION
Problem 1.2
Python also has a built in data type called a string. This is simply a sequence of letters/characters and can thus contain names, sentences etc. To define a string in python, you need to wrap your sentence in either double or single quotation marks. For example you could write "Hello world!".
Ex. 1.2: In Python the + is not only used for numbers. Use the + to add together the three strings "VIVE", "Machine" and "Learning". What is the result?
### BEGIN SOLUTION
answer_12 = 'VIVE' + 'Machine' + 'Learning'
### END SOLUTION
Boolean Operators
Helpful advice: If you are not certain what a boolean value is, go back to Elementary Data Types.
What else can operators do?
We can check the validity of a statement - using the equal operator, ==, or not equal operator !=. Try the examples below:
3 == (2 + 1)
True
3 != (2 + 1)
False
11 != 2 * 5
True
In all these cases, the outcomes were boolean.
We can also do numeric comparisons, such as greater than >, greater than or equal >=, etc.:
11 <= 2 * 5
False
How can we manipulate boolean values?
Combining boolean values can be done using:
- the and operator (for plain boolean values, & gives the same result)
- the or operator (for plain boolean values, | gives the same result)
Let’s try this!
print(True | False)
print(True & False)
True
False
What other things can we do?
We can negate/reverse the statement with the not operator:
not (True and False)
True
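One practical difference between and and & is operator precedence: & binds tighter than comparisons, so comparisons combined with & need parentheses. The sketch below uses a made-up value a = 6 to show a case where forgetting the parentheses changes the answer.

```python
a = 6  # hypothetical value chosen so that the two spellings disagree

with_and = a > 1 and a < 5        # comparisons are evaluated first
with_parens = (a > 1) & (a < 5)   # parentheses force the same grouping
without_parens = a > 1 & a < 5    # parsed as a > (1 & a) < 5

print(with_and, with_parens, without_parens)
```

The first two expressions agree, while the last one asks a different question altogether. When in doubt, use and/or for plain boolean values and add parentheses when using & and |.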
Problem 1.3
Above you added two integers together, and got a result of 8. Python separates numbers into two main classes, the integers \(...,-1,0,1,2,...\) and the floats, which are an approximation of the real numbers \(\mathbb{R}\) (exactly how floats differ from reals is taught in introductory computer science courses).
Ex. 1.3:
* Add 1.7 to 4
* What type is 0.667 * 100 in Python?
### BEGIN SOLUTION
answer_131 = 1.7 + 4
answer_132 = 0.667 * 100 # a float; inspect it with type(answer_132)
### END SOLUTION
Containers
What is a composite data type?
A data type that can contain more than one entry of data, e.g. multiple numbers.
What are the most fundamental composite data types?
Three of the most fundamental composite data types are the tuple, the list and the dictionary.
The tuple is declared with round parentheses, e.g. (1, 2, 3). Each element in the tuple is separated by a comma. Once you have declared a tuple you cannot change its contents without making a copy of the tuple first (you will read that the tuple is an immutable data type).
The list is almost identical to the tuple. It is declared using square brackets, e.g. [1, 2, 3]. Unlike the tuple, a list can be changed after definition, by adding, changing or removing elements. This is called a mutable data type.
The dictionary or simply dict is also a mutable data type. Unlike the other data types the dict works like a lookup table, where each element of data stored in the dictionary is associated with a name. To look up an item in the dictionary you don’t need to know its position in the dictionary, only its name. The dict is defined with curly braces and a colon to separate the name from the value, e.g. {'name_of_first_entry': 1, 'name_of_second_entry': 2}.
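The difference between mutable and immutable containers can be sketched as follows. The variable names and values are illustrative only; the try/except construction used to catch the error is slightly more advanced and is mentioned again under control flow.

```python
my_tuple = (1, 2, 3)
my_list = [1, 2, 3]
my_dict = {'first': 1, 'second': 2}

my_list[0] = 10         # lists can be changed in place
my_list.append(4)       # ... and extended
my_dict['third'] = 3    # dicts can get new entries, looked up by name

try:
    my_tuple[0] = 10    # tuples cannot be changed after creation
except TypeError as err:
    print('Tuple error:', err)

print(my_list)
print(my_dict['second'])
```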
Problem 1.4
Ex. 1.4: Define the variable y as a list containing the elements 'k', 2, 'b', 9. Also define a variable z which is a tuple containing the same elements. Try to access the 0th element of y (python is 0-indexed) and the 1st element of z.
Hint: To access the n’th element of a list/tuple write myData[n], for example y[0] gets the 0th element of y.
### BEGIN SOLUTION
y = ['k', 2, 'b', 9]
z = ('k', 2, 'b', 9)
answer_14_y0 = y[0]
answer_14_z1 = z[1]
### END SOLUTION
2 Control Flow
If-then syntax
Control flow means writing code that controls the way data or information flows through the program. The concepts of control flow should be recognizable outside of coding as well. For example, when you go shopping you might want to buy koldskål, but only if the kammerjunker are on sale. Else you will buy ice cream. These kinds of logic come up everywhere in coding: self-driving cars should go forward only if the light is green, items should be listed for sale in a web shop only if they are in stock, stars should be put on the estimates if they are significant, etc.
Another kind of control flow deals with doing things repeatedly. For example dishes should be done while there are still dirty dishes to wash, for each student in a course a grade should be given, etc.
In the following problems you will work with both kinds of control flow.
How can we activate code based on data in Python?
In Python, this is done with the if syntax.
if statement:
    code
In the example above, the block called code is run if the condition called statement (either a variable or an expression) is true.
Examples using if
Try to run the examples:
my_statement = (4 == 4)
if my_statement:
    print("I'm being executed, yay!")
I'm being executed, yay!
Introducing an alternative
If the statement in our condition is false, then we can execute other code with the else statement. Try the example below - and change the boolean value of my_statement.
my_statement = False
if my_statement:
    print("I'm being executed, yay!")
else:
    print("Shoot! I'm still being executed!")
Shoot! I'm still being executed!
Optional material
We have not covered the statements break and continue, or try and except which are also control flow statements. These are slightly more advanced, but it can be a good idea to look them up yourself.
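For the curious, here is a minimal sketch of break, continue and try/except working together. It uses a for loop, which is introduced further below, and a made-up list of strings; none of the names come from the exercises.

```python
values = ['4', 'abc', '7', 'stop', '9']  # hypothetical raw input
numbers = []

for v in values:
    if v == 'stop':
        break                    # leave the loop entirely
    try:
        numbers.append(int(v))   # may fail for non-numeric text
    except ValueError:
        continue                 # skip this entry and go to the next one

print(numbers)
```

The entry 'abc' is skipped because int('abc') raises a ValueError, and '9' is never reached because the loop stops at 'stop'.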
In Python the if/else logic consists of three keywords: if, elif (else if) and else. The if and elif keywords should be followed by a logical statement that is either True or False. The code that should be executed if the logic is True is written on the lines below the if, and should be indented one TAB (or 4 spaces). Finally all control flow statements must end with a colon.
Ex. 2.1: Read the code in the cell below. Assign a value to the variable x that makes the code print “Good job!”
### BEGIN SOLUTION
x = 4
### END SOLUTION
if x > 5:
    print("x is too large")
elif x >= 3 and x <= 5:
    print("Good job!")
else:
    print("x is too small")
Good job!
Above we used three different comparison operators: >, >= and <=. To compare two values and check whether they are equal, python uses double equal signs == (remember that a single = is used to assign values to a variable).
Ex. 2.2: The code below draws a random number between 0 and 1 and stores it in the variable randnum. Write an if/else block that defines a new variable which is equal to 1 if randnum <= 0.1 and is 0 if randnum > 0.1.
import random
randnum = random.uniform(0,1)
### BEGIN SOLUTION
if randnum <= 0.1:
    answer_22 = 1
else:
    answer_22 = 0
### END SOLUTION
Loops
For loops
Control flow that does the same thing repeatedly is called a loop. In python you can loop through anything that is iterable, i.e. anything where it makes sense to say “for each element in this item, do whatever.”
Lists, tuples and dictionaries are all iterable, and can thus be looped over. This kind of loop is called a for loop. The basic syntax is
for element in some_iterable:
    do_something(element)
where element is a temporary name given to the current element reached in the loop, and do_something can be any valid python function applied to element.
Example - try the following code:
A = []
for i in [1, 3, 5]:
    i_squared = i ** 2
    A.append(i_squared)
print(A)
[1, 9, 25]
For loops are handy when: iterating over files in a directory; iterating over a specific set of columns.
Quiz: How does Python know where the code inside the loop begins and ends?
Answer: By indentation. The lines inside the loop are indented by four spaces, see the example above. This is the same as for if statements.
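Dictionaries can be looped over as well. As a small sketch with a made-up price list: iterating over the dict itself gives the names (keys), while the items() method gives name and value together.

```python
# Hypothetical price list, used only for illustration
prices = {'koldskaal': 15, 'kammerjunker': 10, 'icecream': 25}

names = []
for name in prices:                   # looping over a dict yields its keys
    names.append(name)

total = 0
for name, price in prices.items():    # items() yields (key, value) pairs
    print(name, price)
    total = total + price

print(names, total)
```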
Ex. 2.3: Begin by initializing an empty list in the variable answer_23 (simply write answer_23 = []). Then loop through the list y that you defined in problem 1.4. For each element in y, multiply that element by 7 and append it to answer_23. (You can finish off by showing the content of answer_23 after the loop has run.)
Hint: To append data to a list you can write answer_23.append(new_element) where new_element is the new data that you want to append.
### BEGIN SOLUTION
y = ['k', 2, 'b', 9]
answer_23 = []
for element in y:
    answer_23.append(7 * element)
print(answer_23)
### END SOLUTION
['kkkkkkk', 14, 'bbbbbbb', 63]
While loops
The other kind of loop in Python is the while loop. Instead of looping over an iterable, the while loop continues going as long as a supplied logical condition is True.
Most commonly, the while loop is combined with a counting variable that keeps track of how many times the loop has been run.
One specific application where a while loop can be useful is data collection on the internet (scraping) which is often open ended. Another application is when we are computing something that we do not know how long will take to compute, e.g. when a model is being estimated.
The basic syntax is seen in the example below. This code will run 100 times before stopping. At each iteration, it checks that i is smaller than 100. If it is, it does something and adds 1 to the variable i before repeating.
i = 0
while i < 100:
    do_something()
    i = i + 1
In the example below, we provide an example of what do_something() can be. Try the code below and explain why it outputs what it does.
i = 0
L = []
while (i < 5):
    L.append(i * 3)
    i += 1
print(L)
[0, 3, 6, 9, 12]
Problem 2.4
Ex. 2.4: Begin by defining an empty list. Write a while loop that runs from \(i=0\) up to but not including \(i=1500\). In each loop, it should determine whether the current value of i is a multiple of 19. If it is, append the number to the list. (Recall that \(i\) is divisible by \(a\) if \(i \text{ mod } a = 0\). The modulo operator in python is %.)
Hint: The if statement does not need to be followed by an else. You can simply code the if part and python will automatically skip it and continue if the logical condition is False.
Hint: Remember to increment i in each iteration. Otherwise the program will run forever. If this happens, press kernel > interrupt in the menu bar.
i = 0
answer_24 = []
### BEGIN SOLUTION
while i < 1500:
    if i % 19 == 0:
        answer_24.append(i)
    i += 1
### END SOLUTION
3 Reusable Code
Functions
If you have never programmed in anything but statistical software such as Stata or SAS, the concept of functions might be new to you. In python, a function is simply a “recipe” that is first written, and then later used to compute something.
Conceptually, functions in programming are similar to functions in math. They have between \(0\) and “\(\infty\)” inputs, do some calculation using their inputs and then return between 1 and “\(\infty\)” outputs.
By making these recipes, we can save time by making a concise function that undertakes exactly the task that we want to complete.
Python contains a large number of built-in functions. Below, you are given examples of how to use the most commonly used built-ins. You should make yourself comfortable using each of the functions shown below.
# Setup for the examples. We define two lists to show you the built-in functions.
l1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
l2 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
# The len(x) function gives you the length of the input
len(l2)
10
# The abs(x) function returns the absolute value of x
abs(-5)
5
# The min(x) and max(x) functions return the minimum and maximum of the input.
min(l1), max(l1)
(0, 90)
# The map(function, Iterable) function applies the supplied function to each element in Iterable:
# Note that the list() call just converts the result to a list
list(map(len, l2))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# The range([start], stop, [step]) function returns a range of numbers from `start` to `stop`, in increments of `step`.
# The values in [] are optional.
# If no start value is set, it defaults to 0.
# If no step value is set it defaults to 1.
# A stop value must always be set.
print("Range from 0 to 100, step=1:", range(100))
print("Range from 0 to 100, step=2:", range(0, 100, 2))
print("Range from 10 to 65, step=3:", range(10, 65, 3))
Range from 0 to 100, step=1: range(0, 100)
Range from 0 to 100, step=2: range(0, 100, 2)
Range from 10 to 65, step=3: range(10, 65, 3)
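As the output above shows, printing a range only displays its definition, not the numbers it contains. A small sketch of how to see the actual values is to wrap the range in list(); the variable names below are chosen for illustration.

```python
evens = list(range(0, 10, 2))       # start at 0, stop before 10, step 2
countdown = list(range(5, 0, -1))   # a negative step counts downwards

print(evens)
print(countdown)
```

Note that the stop value itself is never included.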
# The reversed(x) function reverses the input.
# We can then loop through it backwards
l1_reverse = reversed(l1)
for e in l1_reverse:
    print(e)
90
80
70
60
50
40
30
20
10
0
# The enumerate(x) function returns the index of the item as well as the item itself in sequence.
# With it, you can loop through things while keeping track of their position:
l2_enumerate = enumerate(l2)
for index, element in l2_enumerate:
    print(index, element)
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
# The zip(x,y,...) function "zips" together two or more iterables allowing you to loop through them pairwise:
l1l2_zip = zip(l1, l2)
for e1, e2 in l1l2_zip:
    print(e1, e2)
0 a
10 b
20 c
30 d
40 e
50 f
60 g
70 h
80 i
90 j
The how
You can also write your own python functions. A python function is defined with the def keyword, followed by a user-defined name of the function, the inputs to the function and a colon. On the following lines, the function body is written, indented by one TAB.
Functions use the keyword return to signal what values the function should return after doing its calculations on the inputs. For example, we can define a function named my_first_function seen in the cell below. Run the code below and explain the printed output.
def my_first_function(x): # takes input x
    x_squared = x ** 2 # x squared
    return x_squared + 1
print('Output for input of 0: ', my_first_function(0))
print('Output for input of 1: ', my_first_function(1))
print('Output for input of 2: ', my_first_function(2))
print('Output for input of 3: ', my_first_function(3))
Output for input of 0: 1
Output for input of 1: 2
Output for input of 2: 5
Output for input of 3: 10
We can also make more complex functions. The function below, named my_second_function, takes two inputs a and b that are used to compute the values \(a^b\) (written in python as a ** b) and \(b^a\), and returns the larger of the two.
Provide the function below with different inputs of a and b. Explain the output to yourself.
def my_second_function(a, b):
    v1 = a ** b
    v2 = b ** a
    if v1 > v2:
        return v1
    else:
        return v2
Problem 3.1
Ex. 3.1: Write a function called minimum that takes as input a list of numbers, and returns the index and value of the minimum number as a tuple. Use your function to calculate the index and value of the minimum number in the list [-342, 195, 573, -234, 762, -175, 847, -882, 153, -22].
Hint: A “pythonic” way to keep count of the index of the minimum value would be to loop over the list of numbers by using the enumerate function on the list of numbers.
### BEGIN SOLUTION
def minimum(numbers):
    min_num_index, min_num = float('inf'), float('inf')
    for (number_index, value) in enumerate(numbers):
        if value < min_num:
            min_num_index = number_index
            min_num = value
    return min_num_index, min_num
# # Alternative solution:
# def minimum(numbers):
#     min_value = min(numbers)
#     idx_min_value = numbers.index(min_value)
#     return idx_min_value, min_value
numbers = [-342, 195, 573, -234, 762, -175, 847, -882, 153, -22]
answer_31 = minimum(numbers)
### END SOLUTION
Problem 3.2
Ex. 3.2: Write a function called average that takes as input a list of numbers, and returns the average of the values in the list. Use your function to calculate the average of the values [-1, 2, -3, 4, 0, -4, 3, -2, 1]
### BEGIN SOLUTION
def average(num_list):
    return sum(num_list) / len(num_list)
answer_32 = average([-1, 2, -3, 4, 0, -4, 3, -2, 1])
### END SOLUTION
Problem 3.3 (OPTIONAL)
Recall that the exponential function \(e^{x}\), where \(e\) is Euler’s number, can be calculated as \[ e^{x}=\lim_{n\rightarrow \infty}\left(1+\frac{x}{n}\right)^{n} \] Of course we cannot compute the limit on a finite memory computer. Instead we can calculate approximations by taking \(n\) large enough. Setting \(x=1\) gives an approximation of \(e\) itself.
Ex. 3.3: Write a function named eulers_e that takes two inputs x and n, calculates \[ \left(1+\frac{x}{n}\right)^{n} \] and returns this value. Use your function to calculate eulers_e(1, 5) and store this value in the variable answer_33.
### BEGIN SOLUTION
def eulers_e(x, n):
    return (1 + x / n) ** n
answer_33 = eulers_e(1, 5)
### END SOLUTION
Problem 3.4 (OPTIONAL)
The inverse of the exponential is the logarithm. Like the exponential function, there are limit definitions of the logarithm. One of these is \[ \log(x) = 2 \cdot \sum_{k=0}^{\infty} \frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1} \]
where \(\sum_{k=0}^{\infty}\) signifies the sum of infinitely many elements, starting from \(k=0\). Each element in the sum takes the value \(\frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}\) for some \(k\). As before, we must approximate this with a finite sum.
Ex. 3.4: Define another function called natural_logarithm which takes two inputs x and k_max. In the function body calculate \[ 2 \cdot \sum_{k=0}^{k\_max} \frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1} \] and return this value.
Hint: to calculate the sum, first initialize a value total = 0, loop through \(k\in \{0, 1, \ldots, k\_max\}\) and compute \(\frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}\). Add the computed value to your total in each step of the loop. After finalizing the loop you can then multiply the total by 2 and return the result.
### BEGIN SOLUTION
def natural_logarithm(x, k_max):
    return 2 * sum([1 / (2 * k + 1) * ((x - 1)/(x + 1)) ** (2 * k + 1)
                    for k in range(k_max + 1)])
print(natural_logarithm(1,100))
### END SOLUTION
0.0
Problem 3.5 (OPTIONAL)
Just like numbers, strings and other data types, python treats functions as objects. This means you can write functions that take a function as input, and functions that return functions after being executed. This is sometimes a useful tool to have when you need to add extra functionality to an already existing function, or if you need to write function factories.
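As a minimal sketch of both ideas, the cell below defines a function factory and a function that takes another function as input. The names make_multiplier and apply_twice are hypothetical and not part of the exercises.

```python
def make_multiplier(factor):
    # Function factory: builds and returns a new function
    def multiply(x):
        return x * factor   # the inner function remembers `factor`
    return multiply

def apply_twice(func, x):
    # Takes a function as input and applies it two times
    return func(func(x))

double = make_multiplier(2)
triple = make_multiplier(3)

print(double(5))
print(apply_twice(triple, 2))
```

Here double and triple are ordinary functions that can be called, stored in variables or passed on to other functions such as apply_twice.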
Ex. 3.5: Write a function called exponentiate that takes one input named func. In the body of exponentiate define a nested function (i.e. a function within a function) called with_exp that takes two inputs x and k. The nested function should return func(e, k) where e = eulers_e(x, k). The outer function should return the nested function with_exp, i.e. write something like
def exponentiate(func):
    def with_exp(x, k):
        e = eulers_e(x, k)
        value = #[FILL IN]
        return value
    return with_exp
Call the exponentiate function on natural_logarithm and store the result in a new variable called logexp.
Hint: You will not get exactly the same result as you put in due to approximations and numerical accuracy.
### BEGIN SOLUTION
def exponentiate(func):
    def with_exp(x, k):
        e = eulers_e(x, k)
        value = func(e, k)
        return value
    return with_exp
logexp = exponentiate(natural_logarithm)
print(logexp(1, 100))
### END SOLUTION
0.9950330853168091
Getting More General
Modules
Whatever we attempt in programming, it is likely nowadays that someone has done it before us. Therefore, we can reuse code, which allows us to 1. save time by using others’ code, and 2. learn from others’ code.
Moreover, often the code implemented by someone with more experience is likely to work better and faster than what we can come up with! That’s why we introduce modules. These are packages of Python code that we can load - and by doing that, we get access to powerful tools.
Let’s see how modules work. Run the code below to load a module called numpy which allows us to work with linear algebra and other numeric tools.
import numpy as np
Let’s create an array with numpy.
row1 = [1, 2]
row2 = [3, 4]
table = [row1, row2]
my_array = np.array(table)
my_array
array([[1, 2],
       [3, 4]])
What is a numpy array?
An n-dimensional container that can store specific data types, e.g. bool and float. The arrays come with certain available methods and tools. E.g. a 2-d array can act like a matrix, and a 3-d array can act like a tensor.
Objects can have useful attributes and methods that are built-in. These are accessed using a dot ("."). For example, an array can be transposed as follows:
my_array.T
array([[1, 3],
       [2, 4]])
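A few other attributes and methods can be sketched in the same way. The cell below rebuilds the array from above so that it can run on its own; the variable name product is chosen for illustration.

```python
import numpy as np

my_array = np.array([[1, 2], [3, 4]])

print(my_array.shape)    # attribute: number of rows and columns
print(my_array.dtype)    # attribute: the data type stored in the array
print(my_array.sum())    # method: sum of all entries

product = my_array @ my_array.T   # matrix product with the transpose
print(product)
```

Attributes such as shape are accessed without parentheses, while methods such as sum() are called with parentheses.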
(Optional) Classes
In Python, we can also define our own types of objects, which are known as classes. Each class contains rules and properties that govern how objects of the class will behave. If you are curious and want to learn, which is totally optional, then read more here (note: quite technical). Otherwise move on.
4 Pandas for data structuring
You may ask yourself: Why do we need to learn data structuring?
Data never comes in the form of our model (unless you or someone else has done it in another program, which is perfectly fine). We need to ‘wrangle’ our data. As of right now, even the most advanced techniques need data in a structured format to work with it.
An Overview
Tabular data is like the table below. Each row is an observation which consists of two entries, one for each of the columns/fields, i.e. Animal and Date.
| index | Animal | Date |
|---|---|---|
| Observation 1 | Elk | July 1, 2019 |
| Observation 2 | Pig | July 3, 2019 |
What pandas provides is a smart way of structuring data. It has two fundamental data types, see below. These are essentially just containers, but they come with a lot of extra functionality for structuring data and performing analysis.
- Series: tabular data with a single column (field)
  - akin to a vector in mathematics
- DataFrame: tabular data that allows for more than one column (multiple fields)
  - akin to a matrix in mathematics
  - has labelled columns (e.g. Animal and Date above) and named rows, called indices.
Run the code below to make your first pandas dataframe. Try to print it and explain the content it shows.
import pandas as pd
df1 = pd.DataFrame(data=[[1, 2],[3, 4],[5, 6],[7, 8]],
                   index=['i', 'ii','iii','iv'],
                   columns=['A', 'B'])
The code below makes a series from a list. We can see that it contains all four fundamental data types!
L = [1, 1.2, 'abc', True]
ser1 = pd.Series(L)
Now you may ask yourself: why don’t we just use numpy?
There are many reasons. pandas is easier for loading, structuring and making simple analyses of tabular data. However, in many cases, if you are working with custom data or need to perform fast and complex array computations, then numpy is a better option. If you are interested see discussion here.
Switching Among Python, Numpy and Pandas
Pandas dataframes can be thought of as numpy arrays with some additional stuff. Note that columns can have different datatypes!
Most functions from numpy can be applied directly to Pandas. We can convert a DataFrame to a numpy array with the values attribute:
df1.values
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]], dtype=int64)
In Python, we can describe it as a list of lists.
df1.values.tolist()
[[1, 2], [3, 4], [5, 6], [7, 8]]
Both dataframes and series have indices, which are both a blessing and a curse. These indices mean that we can often convert a Series into a dictionary:
ser1.to_dict()
{0: 1, 1: 1.2, 2: 'abc', 3: True}
WARNING!: Series indices are NOT necessarily unique, thus we may lose data if we convert to a dict, which requires unique keys.
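The warning can be illustrated with a small made-up series in which the index label 'a' appears twice. This is only a sketch; the names ser_dup and as_dict are not used elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical series with a repeated index label
ser_dup = pd.Series([10, 20, 30], index=['a', 'a', 'b'])

print(ser_dup.index.is_unique)   # check for duplicates before converting

as_dict = ser_dup.to_dict()      # a dict can hold each key only once
print(as_dict)
```

The series has three entries but the dictionary only has two, because the later value for 'a' overwrites the earlier one.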
Inspection
Often we want to see what our dataframe contains. This can be done by putting the dataframe at the end of our cell; it will then automatically be displayed.
The example below consists of 100 rows, with 5 columns of random data. We see that putting the dataframe at the end of the cell displays the dataframe.
df2 = pd.DataFrame(data=np.random.rand(100, 5),
                   columns=['A','B','C','D','E'])
df2
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 | 0.374320 |
| 1 | 0.933523 | 0.830997 | 0.430150 | 0.235283 | 0.129003 |
| 2 | 0.900874 | 0.708393 | 0.950499 | 0.171770 | 0.503687 |
| 3 | 0.214144 | 0.735157 | 0.651842 | 0.580469 | 0.448282 |
| 4 | 0.756690 | 0.119340 | 0.269215 | 0.099179 | 0.411532 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0.728482 | 0.232860 | 0.854766 | 0.784101 | 0.711444 |
| 96 | 0.706587 | 0.819365 | 0.090774 | 0.303287 | 0.224769 |
| 97 | 0.796380 | 0.783840 | 0.740566 | 0.747527 | 0.969443 |
| 98 | 0.433955 | 0.938853 | 0.932820 | 0.845110 | 0.583784 |
| 99 | 0.658342 | 0.699536 | 0.337664 | 0.424492 | 0.458236 |
100 rows × 5 columns
We can also use the head and tail methods, which select the first and last observations in a DataFrame, respectively. The code below prints the first four rows.
df3 = df2.head(n=4)
df3
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 | 0.374320 |
| 1 | 0.933523 | 0.830997 | 0.430150 | 0.235283 | 0.129003 |
| 2 | 0.900874 | 0.708393 | 0.950499 | 0.171770 | 0.503687 |
| 3 | 0.214144 | 0.735157 | 0.651842 | 0.580469 | 0.448282 |
Input-output
We can load and save dataframes from our computer or the internet. Try the code below to save our dataframe as a CSV file called my_data.csv. If you are unsure what a CSV file is then check the Wikipedia description.
df3.to_csv('my_data.csv')
Loading data is just as easy. Some data sources are open and easy to collect data from. They do not require formatting as they come in a table format. The code below loads a CSV file on school test data from NYC.
my_url = 'https://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv'
my_df = pd.read_csv(my_url)
my_df.head(10)
| | DBN | School Name | Number of Test Takers | Critical Reading Mean | Mathematics Mean | Writing Mean |
|---|---|---|---|---|---|---|
| 0 | 01M292 | Henry Street School for International Studies | 31.0 | 391.0 | 425.0 | 385.0 |
| 1 | 01M448 | University Neighborhood High School | 60.0 | 394.0 | 419.0 | 387.0 |
| 2 | 01M450 | East Side Community High School | 69.0 | 418.0 | 431.0 | 402.0 |
| 3 | 01M458 | SATELLITE ACADEMY FORSYTH ST | 26.0 | 385.0 | 370.0 | 378.0 |
| 4 | 01M509 | CMSP HIGH SCHOOL | NaN | NaN | NaN | NaN |
| 5 | 01M515 | Lower East Side Preparatory High School | 154.0 | 314.0 | 532.0 | 314.0 |
| 6 | 01M539 | New Explorations into Sci, Tech and Math HS | 47.0 | 568.0 | 583.0 | 568.0 |
| 7 | 01M650 | CASCADES HIGH SCHOOL | 35.0 | 411.0 | 401.0 | 401.0 |
| 8 | 01M696 | BARD HIGH SCHOOL EARLY COLLEGE | 138.0 | 630.0 | 608.0 | 630.0 |
| 9 | 02M047 | AMERICAN SIGN LANG ENG DUAL | 11.0 | 405.0 | 415.0 | 385.0 |
Working with weather data
We will now work with a dataset regarding weather. Our source will be the National Oceanic and Atmospheric Administration (NOAA), which has a global data collection going back a couple of centuries. This collection is called the Global Historical Climatology Network (GHCN). The data contains daily weather recorded at the weather stations. A description of GHCN can be found here.
Problem 4.1
Ex. 4.1: Use Pandas’ CSV reader to fetch daily weather data from 1863 for various stations - available somewhere on your common drive. If you cannot find it, it can also be found at this website.
Hint: you will need to give read_csv some keywords. Here are some suggestions:
- Specify the path, using either a string or through the pathlib module, see documentation (nice for interoperability between macOS + Windows and relative paths).
- For compressed files you may need to specify the keyword compression when calling the read_csv function.
- header can be specified as the CSV has no column names.
import pandas as pd
### BEGIN SOLUTION
# using online url
path = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/1863.csv.gz"
# ASSUMING data is in a folder called 'data':
# using string
#path = 'data/1863.csv.gz'
# using pathlib
#from pathlib import Path
#cwd = Path.cwd()
#path = cwd / 'data' / '1863.csv.gz'
df_weather = pd.read_csv(
    path,
    compression='gzip', # decompress gzip
    header=None # use no header information from the csv
)
### END SOLUTION
Selecting Rows and Columns
In pandas there are two canonical ways of accessing subsets of a dataframe.
- The iloc attribute: access rows and columns using integer positions (like a list).
- The loc attribute: access rows and columns using immutable keys, e.g. numbers, strings (like a dictionary).
In what follows we will describe some different ways of selecting data using .iloc and .loc, as well as a simpler way of accessing the dataframe directly using []. The different ways are meant to give you an overview.
Using list of keys/indices
Below is an example of using the iloc attribute to select specific rows:
df1 # show df1 before indexing it with .iloc[]
| | A | B |
|---|---|---|
| i | 1 | 2 |
| ii | 3 | 4 |
| iii | 5 | 6 |
| iv | 7 | 8 |
my_irows = [0, 3]
df1.iloc[my_irows]
| | A | B |
|---|---|---|
| i | 1 | 2 |
| iv | 7 | 8 |
We can select columns and rows simultaneously. Below is an example of using the loc attribute, which does that:
my_rows = ['i', 'iii']
my_cols = ['A']
df1.loc[my_rows, my_cols]
| | A |
|---|---|
| i | 1 |
| iii | 5 |
Using thresholds
We can also use iloc and loc for selecting rows and/or columns below or above some threshold, see below. Note that the position of the : determines the direction: :3 selects everything before position 3, while 3: selects everything from position 3 onwards.
df2.iloc[:3, :4]
| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 |
| 1 | 0.933523 | 0.830997 | 0.430150 | 0.235283 |
| 2 | 0.900874 | 0.708393 | 0.950499 | 0.171770 |
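One detail worth knowing is that slicing works slightly differently for iloc and loc: with iloc the end position is excluded, while with loc the end label is included. The sketch below uses a small made-up dataframe (df_demo is a hypothetical name) so that it can run on its own.

```python
import pandas as pd

# Small illustrative dataframe with the same layout as df1
df_demo = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8]},
                       index=['i', 'ii', 'iii', 'iv'])

first_two = df_demo.iloc[:2]      # positions 0 and 1; position 2 is excluded
up_to_ii = df_demo.loc[:'ii']     # labels 'i' and 'ii'; the end label is included
from_iii = df_demo.iloc[2:]       # position 2 and onwards

print(first_two)
print(up_to_ii)
print(from_iii)
```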
Using boolean data
If we provide the dataframe with a list of boolean values, one per row, it will select the rows where the value is True (this also works with iloc and loc). We will soon see that this is an extremely useful way of selecting certain rows.
df3[[True, False, False, True]]
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 | 0.374320 |
| 3 | 0.214144 | 0.735157 | 0.651842 | 0.580469 | 0.448282 |
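In practice the boolean values are rarely typed by hand. A common pattern, sketched below with a small made-up dataframe (df_demo, mask and the 0.4 cutoff are illustrative), is to create them from a comparison on a column.

```python
import pandas as pd

# Hypothetical data used only for illustration
df_demo = pd.DataFrame({'A': [0.2, 0.9, 0.5], 'B': [1, 2, 3]})

mask = df_demo['A'] > 0.4           # one True/False value per row
selected = df_demo[mask]            # keep only rows where the mask is True
only_b = df_demo.loc[mask, ['B']]   # combine with column selection via loc

print(mask.tolist())
print(selected)
print(only_b)
```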
Selecting columns
Often we need to select specific columns. If we provide the dataframe with a list of column names it will make a dataframe keep only these columns:
df3[['B', 'D']]

| | B | D |
|---|---|---|
| 0 | 0.094683 | 0.911622 |
| 1 | 0.830997 | 0.235283 |
| 2 | 0.708393 | 0.171770 |
| 3 | 0.735157 | 0.580469 |
Problem 4.2
Ex 4.2: Select the four left-most columns which contain: station identifier, date, observation type, observation value. Rename them as ‘station’, ‘datetime’, ‘obs_type’, ‘obs_value’.
Hint: Renaming can be done with df.columns = cols, where cols is a list of column names.
### BEGIN SOLUTION
df_weather = df_weather.iloc[:, :4] # select only first four columns
column_names = ['station', 'datetime', 'obs_type', 'obs_value']
df_weather.columns = column_names # set column names
### END SOLUTION
Basic Operations
How do we perform elementary operations like we learned for basic Python? E.g. numeric operations such as summation (+) or logical operations such as greater than (>). Actually we are in luck - they are exactly the same.
Let’s see how it works for numeric data using a numpy array (works the same way as Pandas).
my_arr1 = np.array([2, 3, 2, 1, 1])
my_arr2 = my_arr1 ** 2
my_arr2
array([4, 9, 4, 1, 1])
Can we do the same with two vectors? Yes, we can also do elementwise addition, multiplication, subtraction etc. of arrays and series. Example:
my_arr1 + my_arr2
array([ 6, 12, 6, 2, 2])
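The same operations can be sketched with pandas Series, here reusing the numbers from the array example above. The names `s1`, `s2`, `total` and `is_larger` are only illustrative.

```python
import pandas as pd

s1 = pd.Series([2, 3, 2, 1, 1])
s2 = s1 ** 2        # elementwise power
total = s1 + s2     # elementwise addition, aligned on the index
is_larger = s2 > s1 # elementwise comparison gives a boolean Series

print(total.tolist())
print(is_larger.tolist())
```

Note that pandas aligns the two Series on their index labels before operating, which matters once the indices differ.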
Changing and Copying Data
Everything in the dataframe can be changed. For instance, we can update our dataframe with new values, e.g. by making new variables or overwriting existing ones. In the example below we add a new column to a DataFrame.
df2['F'] = df2['A'] > df2['D']
df2.head(10)

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 0.291100 | 0.094683 | 0.550356 | 0.911622 | 0.374320 | False |
| 1 | 0.933523 | 0.830997 | 0.430150 | 0.235283 | 0.129003 | True |
| 2 | 0.900874 | 0.708393 | 0.950499 | 0.171770 | 0.503687 | True |
| 3 | 0.214144 | 0.735157 | 0.651842 | 0.580469 | 0.448282 | False |
| 4 | 0.756690 | 0.119340 | 0.269215 | 0.099179 | 0.411532 | True |
| 5 | 0.634309 | 0.958614 | 0.330676 | 0.454304 | 0.098996 | True |
| 6 | 0.327120 | 0.263946 | 0.884487 | 0.238092 | 0.283622 | True |
| 7 | 0.180478 | 0.433104 | 0.719118 | 0.188784 | 0.674121 | False |
| 8 | 0.645979 | 0.667443 | 0.978808 | 0.531604 | 0.241179 | True |
| 9 | 0.940160 | 0.744014 | 0.657913 | 0.348178 | 0.940021 | True |
WARNING!: If you work on a subset of data from another dataframe, then this subset may be what is known as a view of the original! Changes made in the view can then also be made in the original version.
In the example below, we change the dataframe df3, which is a view of df2, and we get a warning. As the output shows, the changes to df3 also appear in df2. Notice that we can also use loc for changing the data.
df3.loc[:,'D'] = df3['A'] - df3['E']
print(df2['D'].head(3), '\n')
print(df3['D'].head(3))
0 -0.083219
1 0.804520
2 0.397187
Name: D, dtype: float64
0 -0.083219
1 0.804520
2 0.397187
Name: D, dtype: float64
C:\Users\wkg579\AppData\Local\Temp\ipykernel_18024\2462267356.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df3.loc[:,'D'] = df3['A'] - df3['E']
To avoid the problem of having a view, we can instead copy the data as in the example below. Try to verify that if you change things in df4 things do not change in df2.
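A check along these lines can be sketched on a small hypothetical dataframe. The names `df_orig` and `df_copy` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data for illustration
df_orig = pd.DataFrame({"A": [1.0, 2.0, 3.0], "E": [0.5, 0.5, 0.5]})
df_copy = df_orig.copy()           # independent copy of the data

df_copy["A"] = df_copy["A"] * 100  # modify the copy only

print(df_orig["A"].tolist())       # the original is unchanged
print(df_copy["A"].tolist())
```

Because `.copy()` duplicates the underlying data, the modification is visible only in `df_copy`.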
df4 = df2.copy()
# Verify that the code from above doesn't throw the same "SettingWithCopyWarning"
# when using the copied dataframe, df4, instead of df3.
df4.loc[:, 'D'] = df4['A'] - df4['E']
Problem 4.3
Ex. 4.3: Further, select the subset of data for the station UK000056225 and only observations for maximal temperature. Make a copy of the DataFrame and store this in the variable df_select. Explain in one or two sentences how copying works. Write your answer in a multi-line comment like """ Your answer here """.
Hint: The & operator works elementwise on boolean series (like and in core Python). This allows you to combine conditions for selections.
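As a sketch of the hint, the example below combines two conditions on a hypothetical dataframe; `df_demo` and its columns are made up for illustration. When the conditions are written inline, each comparison needs its own parentheses, because & binds more tightly than ==.

```python
import pandas as pd

# Hypothetical toy data for illustration
df_demo = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "kind": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "value": [12, 3, 25, 14],
})

is_oslo = df_demo["city"] == "Oslo"
is_tmax = df_demo["kind"] == "TMAX"
both = is_oslo & is_tmax  # True only where both conditions hold

# Equivalent inline form: note the parentheses around each comparison
inline = df_demo[(df_demo["city"] == "Oslo") & (df_demo["kind"] == "TMAX")]

print(both.tolist())
print(inline)
```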
### BEGIN SOLUTION
select_stat = df_weather.station == 'UK000056225' # boolean: first weather station
select_tmax = df_weather.obs_type == 'TMAX' # boolean: maximal temp.
select_rows = select_stat & select_tmax # row selection - require both conditions
df_select = df_weather[select_rows].copy() # apply selection and copy
explanation = """Copying of the dataframe breaks the dependency with original DataFrame `df_weather`.
If dependency is not broken, then changing values in one of the two dataframes
would imply changes in the other."""
print(explanation)
### END SOLUTION
Copying of the dataframe breaks the dependency with original DataFrame `df_weather`.
If dependency is not broken, then changing values in one of the two dataframes
would imply changes in the other.
Problem 4.4
Ex 4.4: Make sure that max temperature is correctly formatted (how many decimals should we add? one? Look through this .txt file for an answer: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt). Make a new column called TMAX_F where you have converted the temperature values to Fahrenheit.
Hint: Conversion is \(F = 32 + 1.8*C\) where \(F\) is Fahrenheit and \(C\) is Celsius.
### BEGIN SOLUTION
# In the readme.txt file downloaded from the link given in the exercise text,
# it can be seen, in the section 'III. FORMAT OF DATA FILES', that
# TMAX = Maximum temperature (tenths of degrees C).
# Therefore, we convert to degrees Celsius by dividing by 10.
df_select['obs_value'] = df_select['obs_value'] / 10
df_select['TMAX_F'] = 32 + 1.8 * df_select['obs_value']
### END SOLUTION
Changing and Rearranging Indices
In addition to replacing values of our data, we can also rearrange the order of variables and rows as well as make new ones. We have already seen how to change column names, but we can also reset the index, as seen below. Alternatively, we can set our own custom index using set_index, e.g. an index of dates for temporal data, which provides the DataFrame with new functionality.
df1_new_index = df1.reset_index(drop=True)
df1_new_index

| | A | B |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
| 2 | 5 | 6 |
| 3 | 7 | 8 |
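To illustrate the kind of functionality a custom index can add, here is a minimal sketch with a date index. The dataframe `df_demo` is hypothetical. Once the dates form the index, loc accepts partial date strings such as a year and month.

```python
import pandas as pd

# Hypothetical toy data for illustration
df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-02-10", "2020-02-20", "2021-01-05"]),
    "value": [1, 2, 3, 4],
})

df_by_date = df_demo.set_index("date")  # use the dates as the index
feb_2020 = df_by_date.loc["2020-02"]    # all rows in February 2020

print(feb_2020)
```

Calling `reset_index()` on `df_by_date` would move the dates back into an ordinary column.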
A powerful tool for re-organizing the data is to sort the data. That is, we can re-organize rows (or columns) such that they are ascending or descending according to one or more columns.
df3_sorted = df3.sort_values(by=['A','B'], ascending=True)
df3_sorted

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 3 | 0.214144 | 0.735157 | 0.651842 | -0.234138 | 0.448282 |
| 0 | 0.291100 | 0.094683 | 0.550356 | -0.083219 | 0.374320 |
| 2 | 0.900874 | 0.708393 | 0.950499 | 0.397187 | 0.503687 |
| 1 | 0.933523 | 0.830997 | 0.430150 | 0.804520 | 0.129003 |
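The sort direction can also differ per column by passing a list to `ascending`. A small sketch on a hypothetical dataframe `df_demo`, sorting A ascending and, within ties in A, B descending:

```python
import pandas as pd

# Hypothetical toy data for illustration
df_demo = pd.DataFrame(
    {"A": [1, 2, 1, 2], "B": [5, 6, 7, 8]},
    index=["w", "x", "y", "z"],
)

# A ascending; ties in A are broken by B descending
sorted_mixed = df_demo.sort_values(by=["A", "B"], ascending=[True, False])
restored = sorted_mixed.sort_index()  # sort by the index labels instead

print(sorted_mixed)
```

As in the example above, the original index labels travel with their rows, so the index is no longer in order after sorting.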
Problem 4.5
Ex 4.5: Inspect the indices in df_select. Are they following the sequence of natural numbers, 0,1,2,…? If not, reset the index and make sure to drop the old.
### BEGIN SOLUTION
df_select = df_select.reset_index(drop=True)
### END SOLUTION
Problem 4.6
Ex 4.6: Make a new DataFrame df_sorted where you have sorted by the maximum temperature. What is the date for the first and last observations?
### BEGIN SOLUTION
df_sorted = df_select.sort_values(by=['obs_value'])
print(
f"Date for the min temp: {df_sorted['datetime'].iloc[0]}",
f"Date for the max temp: {df_sorted['datetime'].iloc[-1]}",
sep="\n"
)
### END SOLUTION
Date for the min temp: 18631231
Date for the max temp: 18630714
5 Pandas with datetimes and aggregations (OPTIONAL)
Pandas supports many more functions, many of which are covered in the Python for Data Analysis (PDA) book. These include more data cleaning (PDA chapter 7), merging and joining (PDA chapter 8), groupby functionality (PDA chapter 10), datetimes (PDA chapter 11) and more.
Problem 5.1
When working with datetimes, it is common to get them as pure strings from the data source. In the weather data, it is a string of the format YYYYMMDD, which can be converted to a date pandas understands using the pandas function to_datetime() (see its entry in the pandas documentation).
Ex 5.1: Convert the string date to a pandas date and add this to a new column called datetime_dt.
Hint: When converting string dates to pandas dates, it is always wise to specify the format. PDA has a table with format information
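As a hedged illustration of the hint, the sketch below parses a few hypothetical YYYYMMDD strings with an explicit format. It also shows the optional `errors="coerce"` argument, which turns unparseable entries into missing values (NaT) instead of raising an error.

```python
import pandas as pd

# Hypothetical date strings in the YYYYMMDD format
raw = pd.Series(["18630101", "18630714", "18631231"])
parsed = pd.to_datetime(raw, format="%Y%m%d")

# With errors="coerce", entries that do not match the format become NaT
raw_bad = pd.Series(["18630101", "not a date"])
parsed_bad = pd.to_datetime(raw_bad, format="%Y%m%d", errors="coerce")

print(parsed)
print(parsed_bad.isna().tolist())
```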
### BEGIN SOLUTION
datetime_dt = pd.to_datetime(df_select['datetime'], format = '%Y%m%d')
df_select['datetime_dt'] = datetime_dt
### END SOLUTION
Problem 5.2
Ex 5.2: Create a new column with the month of the observation
Hint: If a Series/column has a date in it, the datetime functionality can be accessed by calling .dt on it, which can be followed by further commands.
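A short sketch of the .dt accessor on a hypothetical Series of dates; the names `dates`, `years`, `days` and `weekdays` are illustrative. Attributes such as year and day are read directly, while some pieces of information, such as the weekday name, are methods.

```python
import pandas as pd

# Hypothetical dates for illustration
dates = pd.Series(pd.to_datetime(["2021-03-01", "2021-12-24"]))

years = dates.dt.year            # attribute access
days = dates.dt.day
weekdays = dates.dt.day_name()   # method call

print(years.tolist(), days.tolist(), weekdays.tolist())
```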
### BEGIN SOLUTION
month = datetime_dt.dt.month
df_select['month'] = month
### END SOLUTION
Problem 5.3
A very powerful method to analyse data is the split-apply-combine method. In pandas this corresponds to the groupby functionality.
Ex 5.3: Compute the mean and median maximum daily temperature for each month on the dataframe df_select using the split-apply-combine procedure. Store the results in new columns tmax_mean and tmax_median.
Hint: The groupby functionality can be ‘unwrapped’ using the transform method, such that the result retains the original length. This is very handy when the goal is to create new columns rather than to report summary statistics.
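The difference between an ordinary aggregation and transform can be sketched as follows, on a hypothetical dataframe `df_demo` with a made-up grouping column:

```python
import pandas as pd

# Hypothetical toy data for illustration
df_demo = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Aggregation: one value per group
per_group = df_demo.groupby("group")["value"].mean()

# Transform: the group mean repeated for every original row
per_row = df_demo.groupby("group")["value"].transform("mean")

print(per_group)
print(per_row.tolist())
```

Since `per_row` has one entry per original row and shares the original index, it can be assigned directly as a new column.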
### BEGIN SOLUTION
df_select['tmax_mean'] = df_select.groupby(['month'])['obs_value'].transform('mean')
df_select['tmax_median'] = df_select.groupby(['month'])['obs_value'].transform('median')
### END SOLUTION
6 Linear regression with numpy (OPTIONAL)
NOTE: If you previously skipped 3.3, 3.4 or 3.5, you might benefit from completing these before continuing with this numpy section.
Python supports all of the regular matrix computations, if one wishes to implement a predictor or estimator on their own.
To showcase this, you will in this example be tasked to convert a subset of a DataFrame into a numpy array. Based on this, you can implement estimators such as the ordinary least squares estimator:
\[\hat \beta = (X'X)^{-1}(X'y)\]
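For readers who want to see where this expression comes from, a standard derivation can be sketched as follows. It assumes that \(X'X\) is invertible, and \(S(\beta)\) is our own name for the sum of squared residuals:

\[S(\beta) = (y - X\beta)'(y - X\beta)\]

Setting the derivative with respect to \(\beta\) to zero gives the first-order condition

\[-2X'(y - X\hat \beta) = 0 \quad\Rightarrow\quad X'X\hat \beta = X'y \quad\Rightarrow\quad \hat \beta = (X'X)^{-1}(X'y)\]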
To test this out, we will estimate how age, passenger class and fare influenced chance of survival for the passengers of Titanic.
Problem 6.1
Ex 6.1: Load the titanic dataset from seaborn using the load_dataset function. Remove any rows with missing values.
Hint:
- The dataset is aptly named titanic.
- pandas has a built-in function called dropna.
### BEGIN SOLUTION
import seaborn as sns
df_titanic = sns.load_dataset('titanic')
df_titanic = df_titanic.dropna()
### END SOLUTION
Problem 6.2
Ex 6.2: Convert the columns age, pclass and fare to an array with dimensions N*3 and the column survived to an array with dimensions N*1.
Hint: Try subsetting the data in the DataFrame and then converting it to an array.
### BEGIN SOLUTION
# Using pandas method
X = df_titanic[['age','pclass','fare']].to_numpy()
y = df_titanic[['survived']].to_numpy()
# Using as input to np.array
X_alt = np.array(df_titanic[['age','pclass','fare']])
y_alt = np.array(df_titanic[['survived']])
# equivalence
assert (X == X_alt).all()
assert (y == y_alt).all()
### END SOLUTION
Problem 6.3
Ex 6.3: Implement the ordinary least squares estimator with no intercept using numpy.
Hint: numpy offers a lot of methods for arrays:
- numpy.linalg offers a lot of functionality for linear algebra
- @ performs matrix multiplication (for two vectors this is the dot product)
- inv inverts a matrix
- If you’ve imported numpy as np, these functions can be accessed as np.linalg.function
- You can also import specific functions as from numpy.linalg import function
- This can reduce the clutter in your code (e.g. np.linalg.inv(X) versus inv(X))
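The building blocks from the hint can be tried out on a small example first. The matrix `M` and vector `v` below are hypothetical. The sketch also shows np.linalg.solve, an alternative to forming the inverse explicitly that is not required for the exercise.

```python
import numpy as np
from numpy.linalg import inv

# Hypothetical 2x2 matrix and column vector for illustration
M = np.array([[2.0, 1.0], [1.0, 3.0]])
v = np.array([[1.0], [2.0]])

product = M @ v                  # matrix multiplication, shape (2, 1)
identity = M @ inv(M)            # approximately the identity matrix
solved = np.linalg.solve(M, v)   # solves M x = v without forming inv(M)

print(product)
print(np.round(identity, 10))
```

Comparisons involving inverses are best done with np.allclose or np.isclose rather than ==, because of floating-point rounding.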
### BEGIN SOLUTION
from numpy.linalg import inv
beta = inv(X.T@X)@(X.T@y)
# equivalence with sklearn method
from sklearn.linear_model import LinearRegression
OLS = LinearRegression(fit_intercept=False)
OLS.fit(X, y)
# np.isclose instead of == due to numerical inaccuracies
assert np.isclose(OLS.coef_, beta.T).all()
### END SOLUTION