Overview of notebook

This notebook consists of two independent parts. The first part covers basic Python, where you get familiar with the most important concepts and tools. The second part is a short introduction to pandas, a tool for structuring data in Python, and numpy, a tool for matrix calculations.

Overview of Python content

In this integrated assignment and teaching module, we will learn the following things about basic Python:
- Fundamental data types: numeric, string and boolean
- Operators: numerical and logical
- Conditional logic
- Containers with indices
- Loops: for and while
- Reusable code: functions, classes and modules

Additional sources:

As always, there are many sources out there, and Google is your best friend. However, here are some recommendations:

A book: Python for Data Analysis
Videos: pythonprogramming.net fundamental (basics and intermediate)
A tutorial website: The official python 3 tutorial (sections 3, 4 and 5)

1 Fundamentals of Python

Elementary Data Types

Examples with data types

Execute the code below to create a variable A as a float equal to 1.5:

A = 1.5

Execute the code below to convert the variable A to an integer by typing:

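int(A) # truncates toward zero; for positive numbers this equals rounding down (floor)

We can convert to float, str and bool in the same way. Note that some conversions may seem slightly odd at first (any non-zero number converts to True):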

bool(A)

True

While some are simply not allowed:

#float('A') # Attempt at converting the string (text) 'A' to a number
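
To see the failed conversion without stopping the notebook, one option is to wrap it in try/except (a statement this notebook does not cover in detail). The following is a minimal sketch; the variable names as_int, as_str, as_bool and conversion_failed are only chosen for illustration.

```python
A = 1.5

# Conversions that work
as_int = int(A)      # 1, the decimal part is dropped
as_str = str(A)      # '1.5'
as_bool = bool(A)    # True, since A is non-zero

# A conversion that fails: the text 'A' is not a number
try:
    float('A')
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(as_int, as_str, as_bool, conversion_failed)
```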

Printing Stuff

An essential function in Python is print. Try executing some code - in this case, printing a string of text:

my_str = 'I can do in Python, whatever I want' # define a string of text
print(my_str) # print it

I can do in Python, whatever I want

We can also print several objects at the same time. Try it!

my_var1 = 33
my_var2 = 2
print(my_var1, my_var2)

33 2

Why do we print?
- It allows us to inspect values.
- For instance, when trying to understand why a piece of code does not give us what we expect (i.e. debugging).
- It is particularly helpful when inspecting intermediate output (within functions, loops etc.).

Numeric Operators

What numeric computations can python do?

An operator in Python takes one or more values and computes a new value from them.

We have seen addition +. Other basic numeric operators:
- multiplication *
- subtraction -
- division /
- power **
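
As a small illustration of the operators listed above, here is a sketch with two made-up numbers a and b. Note that division with / gives a float even when both inputs are integers.

```python
a = 7
b = 2

product = a * b      # 14
difference = a - b   # 5
quotient = a / b     # 3.5, division gives a float
power = a ** b       # 49

print(product, difference, quotient, power)
```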

Try executing the code below. Explain in words what the expression does.

2**4

Problem 1.1

Having seen all the information, you are now ready for the first exercise. The exercise is found below.

Ex. 1.1: Add the two integers 3 and 5

### BEGIN SOLUTION

answer_11 = 3 + 5

### END SOLUTION

Problem 1.2

Python also has a built in data type called a string. This is simply a sequence of letters/characters and can thus contain names, sentences etc. To define a string in python, you need to wrap your sentence in either double or single quotation marks. For example you could write "Hello world!".

Ex. 1.2: In Python the + is not only used for numbers. Use the + to add together the three strings "VIVE", "Machine" and "Learning". What is the result?

### BEGIN SOLUTION

answer_12 = 'VIVE' + 'Machine' + 'Learning'

### END SOLUTION

Boolean Operators

Helpful advice: If you are not certain what a boolean value is, go back to Elementary Data Types.

What else can operators do?

We can check the validity of a statement - using the equal operator, ==, or not equal operator !=. Try the examples below:

3 == (2 + 1)

True

3 != (2 + 1)

False

11 != 2 * 5

True

In all these cases, the outcomes were boolean.

We can also do numeric comparisons, such as greater than >, greater than or equal >=, less than or equal <=, etc.:

11 <= 2 * 5

False

How can we manipulate boolean values?

Combining boolean values can be done using:

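the and operator - for boolean values it gives the same result as &
the or operator - for boolean values it gives the same result as |

Note that & and | bind more tightly than comparisons such as ==, so comparisons should be wrapped in parentheses when they are combined with & or |.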

Let’s try this!

print(True | False)
print(True & False)

True
False
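
The keywords and/or and the symbols &/| agree on plain boolean values, but they are not interchangeable in every expression, because & and | are evaluated before comparisons. The sketch below uses a made-up variable a to show where the results differ.

```python
a = 2

with_or = a == 2 or a == 3                # True
with_pipe_parens = (a == 2) | (a == 3)    # True
without_parens = a == 2 | a == 3          # False: read as a == (2 | a) == 3

print(with_or, with_pipe_parens, without_parens)
```

When in doubt, using and/or for single boolean values and adding parentheses around comparisons avoids this surprise.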

What other things can we do?

We can negate/reverse the statement with the not operator:

not (True and False)

True

Problem 1.3

Above you added two integers together, and got a result of 8. Python separates numbers in two classes, the integers \(...,-1,0,1,2,...\) and the floats, which are an approximation of the real numbers \(\mathbb{R}\) (exactly how floats differ from reals is taught in introductory computer science courses).

Ex. 1.3:
- Add 1.7 to 4
- What type is 0.667 * 100 in Python?

### BEGIN SOLUTION
answer_131 = 1.7 + 4
answer_132 = 0.667 * 100 # a float; check with type(answer_132)
### END SOLUTION
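
One visible consequence of floats being an approximation of the real numbers is that exact equality checks can fail. The sketch below illustrates this with the standard library function math.isclose, which compares with a small tolerance; the variable names are only for illustration.

```python
import math

total = 0.1 + 0.2

exactly_equal = (total == 0.3)             # False, due to rounding in the float representation
close_enough = math.isclose(total, 0.3)    # True

print(total, exactly_equal, close_enough)
```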

Containers

What is a composite data type?

A data type that can contain more than one entry of data, e.g. multiple numbers.

What are the most fundamental composite data types?

Three of the most fundamental composite data types are the tuple, the list and the dictionary.

The tuple is declared with round parentheses, e.g. (1, 2, 3), where each element in the tuple is separated by a comma. Once you have declared a tuple you cannot change its content without making a copy of the tuple first (you will read that the tuple is an immutable data type).
The list is almost identical to the tuple. It is declared using square brackets, e.g. [1, 2, 3]. Unlike the tuple, a list can be changed after definition, by adding, changing or removing elements. This is called a mutable data type.
The dictionary or simply dict is also a mutable data type. Unlike the other data types the dict works like a lookup table, where each element of data stored in the dictionary is associated with a name. To look up an item in the dictionary you don’t need to know its position in the dictionary, only its name. The dict is defined with curly braces and a colon to separate the name from the value, e.g. {'name_of_first_entry': 1, 'name_of_second_entry': 2}.
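
The difference between mutable and immutable containers described above can be sketched as follows. The variable names are made up for illustration, and try/except is only used so that the failed tuple assignment does not stop the cell.

```python
my_tuple = (1, 2, 3)
my_list = [1, 2, 3]
my_dict = {'name_of_first_entry': 1, 'name_of_second_entry': 2}

# Lists are mutable: we can change and add elements
my_list[0] = 10
my_list.append(4)

# Dictionaries are mutable and are looked up by name (key)
my_dict['name_of_third_entry'] = 3
second = my_dict['name_of_second_entry']

# Tuples are immutable: assigning to an element raises a TypeError
try:
    my_tuple[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(my_list, second, tuple_changed)
```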

Problem 1.4

Ex. 1.4: Define the variable y as a list containing the elements 'k', 2, 'b', 9. Also define a variable z which is a tuple containing the same elements. Try to access the 0th element of y (python is 0-indexed) and the 1st element of z.

Hint: To access the n’th element of a list/tuple write myData[n], for example y[0] gets the 0th element of y.

### BEGIN SOLUTION

y = ['k', 2, 'b', 9]
z = ('k', 2, 'b', 9)

answer_14_y0 = y[0]
answer_14_z1 = z[1]

### END SOLUTION

2 Control Flow

If-then syntax

Control flow means writing code that controls the way data or information flows through the program. The concepts of control flow should be recognizable outside of coding as well. For example, when you go shopping you might want to buy koldskål, but only if the kammerjunker are on sale. Otherwise you will buy ice cream. These kinds of logic come up everywhere in coding: self-driving cars should go forward only if the light is green, items should be listed for sale in a web shop only if they are in stock, stars should be put on the estimates if they are significant, etc.

Another kind of control flow deals with doing things repeatedly. For example dishes should be done while there are still dirty dishes to wash, for each student in a course a grade should be given, etc.

In the following problems you will work with both kinds of control flow.

How can we activate code based on data in Python?

In Python, code is run conditionally with the if statement:

if statement:  
    code

In the example above, the block called code runs only if the condition called statement (either a variable or an expression) is true.

Examples using if

Try to run the examples:

my_statement = (4 == 4)
if my_statement:
    print("I'm being executed, yay!")

I'm being executed, yay!

Introducing an alternative

If the statement in our condition is false, then we can execute other code with the else statement. Try the example below - and change the boolean value of my_statement.

my_statement = False
if my_statement:
    print("I'm being executed, yay!")
else:
    print("Shoot! I'm still being executed!")

Shoot! I'm still being executed!

Optional material

We have not covered the statements break and continue, or try and except which are also control flow statements. These are slightly more advanced, but it can be a good idea to look them up yourself.
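
For the curious, the following is a minimal sketch of how break, continue and try/except can work together. It uses a for loop, which is introduced in the Loops section, and a made-up list of values: text cannot be divided and is skipped, and the loop stops entirely at the first zero.

```python
collected = []

for value in [1, 'two', 3, 0, 5]:
    try:
        inverse = 1 / value      # fails for text and for zero
    except TypeError:
        continue                 # skip this element and go on with the next one
    except ZeroDivisionError:
        break                    # stop the loop entirely
    collected.append(inverse)

print(collected)
```

Only the inverses of 1 and 3 are collected: 'two' is skipped, and the loop ends before reaching 5.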

In Python the if/else logic consists of three keywords: if, elif (else if) and else. The if and elif keywords should be followed by a logical statement that is either True or False. The code that should be executed if the logic is True is written on the lines below the if, and should be indented one TAB (or 4 spaces). Finally all control flow statements must end with a colon.

Ex. 2.1: Read the code in the cell below. Assign a value to the variable x that makes the code print “Good job!”

### BEGIN SOLUTION
x = 4
### END SOLUTION

if x > 5:
    print("x is too large")

elif x >= 3 and x <= 5:
    print("Good job!")

else:   
    print("x is too small")

Good job!

Above we used three different comparison operators: >, >= and <=. To compare two values and check whether they are equal, python uses double equal signs == (remember a single = was used to assign values to a variable).

Ex. 2.2: The code below draws a random number between 0 and 1 and stores it in the variable randnum. Write an if/else block that defines a new variable which is equal to 1 if randnum <= 0.1 and is 0 if randnum > 0.1.

import random
randnum = random.uniform(0,1)

### BEGIN SOLUTION
if randnum <= 0.1:
    answer_22 = 1
else:
    answer_22 = 0
### END SOLUTION

Loops

For loops

Control flow that does the same thing repeatedly is called a loop. In python you can loop through anything that is iterable, i.e. anything where it makes sense to say “for each element in this item, do whatever.”

Lists, tuples and dictionaries are all iterable, and can thus be looped over. This kind of loop is called a for loop. The basic syntax is

for element in some_iterable:
    do_something(element)

where element is a temporary name given to the current element reached in the loop, and do_something can be any valid python function applied to element.

Example - try the following code:

A = []

for i in [1, 3, 5]:
    i_squared = i ** 2
    A.append(i_squared)
    
print(A)

[1, 9, 25]

For loops are useful when, for example, iterating over the files in a directory or over a specific set of columns.
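
Dictionaries can be looped over as well. The sketch below uses a made-up dictionary of prices: looping over the dictionary itself gives the names (keys), while the items method gives the name and the value together.

```python
prices = {'koldskål': 15, 'kammerjunker': 12}   # made-up example values

# Looping over a dictionary gives its names (keys)
names = []
for name in prices:
    names.append(name)

# .items() gives both the name and the value in each step
total = 0
for name, price in prices.items():
    total = total + price

print(names, total)
```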

Quiz: How does Python know which code belongs inside the loop?

Answer: By indentation. The lines inside the loop are indented with four spaces, see the example above. This works the same way as for if statements.

Ex. 2.3: Begin by initializing an empty list in the variable answer_23 (simply write answer_23 = []). Then loop through the list y that you defined in problem 1.4. For each element in y, multiply that element by 7 and append it to answer_23. (You can finish off by showing the content of answer_23 after the loop has run.)

Hint: To append data to a list you can write answer_23.append(new_element) where new_element is the new data that you want to append.

### BEGIN SOLUTION
y = ['k', 2, 'b', 9]
answer_23 = [] 
for element in y:
    answer_23.append(7 * element)
print(answer_23)
### END SOLUTION

['kkkkkkk', 14, 'bbbbbbb', 63]

While loops

The other kind of loop in Python is the while loop. Instead of looping over an iterable, the while loop continues going as long as a supplied logical condition is True.

Most commonly, the while loop is combined with a counting variable that keeps track of how many times the loop has been run.

One specific application where a while loop can be useful is data collection on the internet (scraping) which is often open ended. Another application is when we are computing something that we do not know how long will take to compute, e.g. when a model is being estimated.

The basic syntax is seen in the example below. This code will run 100 times before stopping. At each iteration, it checks that i is smaller than 100. If it is, it does something and adds 1 to the variable i before repeating.


i = 0
while i < 100:
    do_something()    
    i = i + 1

In the example below, we provide an example of what do_something() can be. Try the code below and explain why it outputs what it does.

i = 0
L = []
while (i < 5):
    L.append(i * 3)
    i += 1

print(L)

[0, 3, 6, 9, 12]
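
The while loop is most useful when the number of iterations is not known in advance. As a small sketch of that idea, the code below keeps halving a made-up starting value until it drops below 1 and counts how many steps that takes.

```python
value = 100
steps = 0

while value >= 1:
    value = value / 2    # halve the value
    steps += 1           # count the iteration

print(steps, value)
```

Here the loop stops after 7 halvings, when the value has reached 0.78125.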

Problem 2.4

Ex. 2.4: Begin by defining an empty list. Write a while loop that runs from \(i=0\) up to but not including \(i=1500\). In each loop, it should determine whether the current value of i is a multiple of 19. If it is, append the number to the list. (recall that \(i\) is divisible by \(a\) if \(i \text{ mod } a = 0\). The modulo operator in python is %)

Hint: The if statement does not need to be followed by an else. You can simply code the if part and python will automatically skip it and continue if the logical condition is False.

Hint: Remember to increment i in each iteration. Otherwise the program will run forever. If this happens, press kernel > interrupt in the menu bar.

i = 0
answer_24 = []
### BEGIN SOLUTION
while i < 1500:
    if i % 19 == 0:
        answer_24.append(i)
    i += 1
### END SOLUTION

3 Reusable Code

Functions

If you have never programmed in anything but statistical software such as Stata or SAS, the concept of functions might be new to you. In python, a function is simply a “recipe” that is first written, and then later used to compute something.

Conceptually, functions in programming are similar to functions in math. They have between \(0\) and “\(\infty\)” inputs, do some calculation using their inputs and then return between 1 and “\(\infty\)” outputs.

By making these recipes, we can save time by making a concise function that undertakes exactly the task that we want to complete.

Python contains a large number of built-in functions. Below, you are given examples of how to use the most commonly used built-ins. You should make yourself comfortable using each of the functions shown below.

# Setup for the examples. We define two lists to show you the built-in functions.
l1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
l2 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# The len(x) function gives you the length of the input
len(l2)

# The abs(x) function returns the absolute value of x
abs(-5)

# The min(x) and max(x) functions return the minimum and maximum of the input.
min(l1), max(l1)

(0, 90)

# The map(function, Iterable) function applies the supplied function to each element in Iterable:
# Note that the list() call just converts the result to a list
list(map(len, l2))

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# The range([start], stop, [step]) function returns a range of numbers from `start` up to, but not including, `stop`, in increments of `step`.
# The values in [] are optional.
# If no start value is set, it defaults to 0.
# If no step value is set it defaults to 1. 
# A stop value must always be set.

print("Range from 0 to 100, step=1:", range(100))
print("Range from 0 to 100, step=2:", range(0, 100, 2))
print("Range from 10 to 65, step=3:", range(10, 65, 3))

Range from 0 to 100, step=1: range(0, 100)
Range from 0 to 100, step=2: range(0, 100, 2)
Range from 10 to 65, step=3: range(10, 65, 3)
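
Printing a range only shows its definition, not the numbers in it. To see the numbers, the range can be converted to a list, as in this short sketch (the variable names are only for illustration):

```python
first_five = list(range(5))            # [0, 1, 2, 3, 4]
even_numbers = list(range(0, 10, 2))   # [0, 2, 4, 6, 8]
by_threes = list(range(10, 20, 3))     # [10, 13, 16, 19]

print(first_five)
print(even_numbers)
print(by_threes)
```

Note that the stop value itself is never included.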

# The reversed(x) function returns an iterator over the input in reverse order.
# We can then loop through it backwards
l1_reverse = reversed(l1)

for e in l1_reverse:
    print(e)
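
One detail worth knowing is that reversed does not return a list but an iterator, which can only be run through once. The sketch below shows this by converting the same iterator to a list twice; first_pass and second_pass are made-up names.

```python
l1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

l1_reverse = reversed(l1)
first_pass = list(l1_reverse)    # [90, 80, ..., 0]
second_pass = list(l1_reverse)   # [] because the iterator is already used up

print(first_pass)
print(second_pass)
print(l1)                        # the original list is unchanged
```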

# The enumerate(x) function returns the index of the item as well as the item itself in sequence.
# With it, you can loop through things while keeping track of their position:
l2_enumerate = enumerate(l2)

for index, element in l2_enumerate:
    print(index, element)

0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j

# The zip(x,y,...) function "zips" together two or more iterables allowing you to loop through them pairwise:
l1l2_zip = zip(l1, l2)

for e1, e2 in l1l2_zip:
    print(e1, e2)

0 a
10 b
20 c
30 d
40 e
50 f
60 g
70 h
80 i
90 j

The how

You can also write your own python functions. A python function is defined with the def keyword, followed by a user-defined name of the function, the inputs to the function and a colon. On the following lines, the function body is written, indented by one TAB.

Functions use the keyword return to signal what values the function should return after doing its calculations on the inputs. For example, we can define a function named my_first_function seen in the cell below. Run the code below and explain the printed output.

def my_first_function(x): # takes input x
    x_squared = x ** 2 # x squared
    return x_squared + 1

print('Output for input of 0: ', my_first_function(0))
print('Output for input of 1: ', my_first_function(1))
print('Output for input of 2: ', my_first_function(2))
print('Output for input of 3: ', my_first_function(3))

Output for input of 0:  1
Output for input of 1:  2
Output for input of 2:  5
Output for input of 3:  10

We can also make more complex functions. The function below, named my_second_function, takes two inputs a and b that are used to compute the values \(a^b\) (written in python as a ** b) and \(b^a\) and returns the larger of the two.

Provide the function below with different inputs of a and b. Explain the output to yourself.

def my_second_function(a, b):
    v1 = a ** b
    v2 = b ** a
    
    if v1 > v2:
        return v1
    else:
        return v2
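
As a starting point, here are a few example calls. The function is repeated so that the block can be run on its own, and the inputs are arbitrary choices.

```python
def my_second_function(a, b):
    v1 = a ** b
    v2 = b ** a

    if v1 > v2:
        return v1
    else:
        return v2

print(my_second_function(2, 3))   # 2**3 = 8 and 3**2 = 9, so 9 is returned
print(my_second_function(2, 5))   # 2**5 = 32 and 5**2 = 25, so 32 is returned
print(my_second_function(2, 4))   # both are 16, so the else branch returns 16
```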

Problem 3.1

Ex. 3.1: Write a function called minimum that takes as input a list of numbers, and returns the index and value of the minimum number as a tuple. Use your function to calculate the index and value of the minimum number in the list [-342, 195, 573, -234, 762, -175, 847, -882, 153, -22].

Hint: A “pythonic” way to keep count of the index of the minimum value would be to loop over the list of numbers by using the enumerate function on the list of numbers.

### BEGIN SOLUTION
def minimum(numbers):
    min_num_index, min_num = float('inf'), float('inf')
    for (number_index, value) in enumerate(numbers):
        if value < min_num:
            min_num_index = number_index
            min_num = value
    return min_num_index, min_num


# # Alternative solution: 
# def minimum(numbers):
#     min_value = min(numbers)
#     idx_min_value = numbers.index(min_value)
#     return idx_min_value, min_value

numbers = [-342, 195, 573, -234, 762, -175, 847, -882, 153, -22]
answer_31 = minimum(numbers)
### END SOLUTION

Problem 3.2

Ex. 3.2: Write a function called average that takes as input a list of numbers, and returns the average of the values in the list. Use your function to calculate the average of the values [-1, 2, -3, 4, 0, -4, 3, -2, 1]

### BEGIN SOLUTION
def average(num_list):
    return sum(num_list) / len(num_list)
answer_32 = average([-1, 2, -3, 4, 0, -4, 3, -2, 1])
### END SOLUTION

Problem 3.3 (OPTIONAL)

Recall that the exponential function \(e^{x}\), where \(e\) is Euler's number, can be calculated as \[ e^{x}=\lim_{n\rightarrow \infty}\left(1+\frac{x}{n}\right)^{n} \] so that \(e\) itself is obtained for \(x=1\). Of course we cannot compute the limit on a finite memory computer. Instead we can calculate approximations by taking \(n\) large enough.

Ex. 3.3: Write a function named eulers_e that takes two inputs x and n, calculates \[ \left(1+\frac{x}{n}\right)^{n} \] and returns this value. Use your function to calculate eulers_e(1, 5) and store this value in the variable answer_33.

### BEGIN SOLUTION
def eulers_e(x, n):
    return (1 + x / n) ** n

answer_33 = eulers_e(1, 5) 
### END SOLUTION
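
To get a feeling for how large \(n\) must be, the following sketch compares the approximation for \(x=1\) with the value math.e from the standard library. The chosen values of n are arbitrary.

```python
import math

def eulers_e(x, n):
    return (1 + x / n) ** n

# The approximation gets closer to math.e as n grows
for n in [1, 10, 1000, 1000000]:
    approximation = eulers_e(1, n)
    print(n, approximation, abs(approximation - math.e))
```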

Problem 3.4 (OPTIONAL)

The inverse of the exponential is the logarithm. Like the exponential function, the logarithm can be defined as a limit, here as an infinite sum. One such definition, which holds for \(x > 0\), is \[ \log(x) = 2 \cdot \sum_{k=0}^{\infty} \frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1} \]

where \(\sum_{k=0}^{\infty}\) signifies the sum of infinitely many elements, starting from \(k=0\). Each element in the sum takes the value \(\frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}\) for some \(k\). As before, we must approximate this with a finite sum.

Ex. 3.4: Define another function called natural_logarithm which takes two inputs x and k_max. In the function body calculate \[ 2 \cdot \sum_{k=0}^{k\_max} \frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1} \] and return this value.

Hint: to calculate the sum, first initialize a value total = 0, loop through \(k\in \{0, 1, \ldots, k\_max\}\) and compute \(\frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}\). Add the computed value to your total in each step of the loop. After finalizing the loop you can then multiply the total by 2 and return the result.

### BEGIN SOLUTION
def natural_logarithm(x, k_max):
    return 2 * sum([1 / (2 * k + 1) * ((x - 1)/(x + 1)) ** (2 * k + 1) 
                    for k in range(k_max + 1)])
print(natural_logarithm(1,100))
### END SOLUTION

0.0
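
The solution above uses a compact list comprehension. A loop-based version that follows the steps of the hint could look like the sketch below. The name natural_logarithm_loop is made up to keep it apart from the solution, and math.log is only used for comparison.

```python
import math

def natural_logarithm_loop(x, k_max):
    total = 0
    for k in range(k_max + 1):
        # add the k'th element of the sum to the running total
        total = total + 1 / (2 * k + 1) * ((x - 1) / (x + 1)) ** (2 * k + 1)
    return 2 * total

print(natural_logarithm_loop(2, 100), math.log(2))
```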

Problem 3.5 (OPTIONAL)

Just like numbers, strings and other data types, functions are objects in Python. This means you can write functions that take a function as input and functions that return functions after being executed. This is sometimes a useful tool to have when you need to add extra functionality to an already existing function, or if you need to write function factories.
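
Before attempting the exercise, it may help to see both ideas in a simpler setting. In the sketch below, apply_twice takes a function as input and make_multiplier is a small function factory; both names are made up for illustration.

```python
def apply_twice(func, x):
    # takes a function as input and applies it two times
    return func(func(x))

def make_multiplier(factor):
    # a function factory: defines and returns a new function
    def multiply(x):
        return factor * x
    return multiply

triple = make_multiplier(3)

print(triple(5))                 # 15
print(apply_twice(triple, 5))    # 45
print(apply_twice(abs, -2))      # 2
```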

Ex. 3.5: Write a function called exponentiate that takes one input named func. In the body of exponentiate define a nested function (i.e. a function within a function) called with_exp that takes two inputs x and k. The nested function should return func(e, k) where e = eulers_e(x, k). The outer function should return the nested function with_exp, i.e. write something like

def exponentiate(func):
    def with_exp(x, k):
        e = eulers_e(x, k)
        value = #[FILL IN]
        return value
    return with_exp

Call the exponentiate function on natural_logarithm and store the result in a new variable called logexp.

Hint: You will not get exactly the same result as you put in due to approximations and numerical accuracy.

### BEGIN SOLUTION
def exponentiate(func):
    def with_exp(x, k):
        e = eulers_e(x, k)
        value = func(e, k)
        return value
    return with_exp

logexp = exponentiate(natural_logarithm)
print(logexp(1, 100))
### END SOLUTION

0.9950330853168091

Getting More General

Modules

Whatever we attempt in programming, it is likely nowadays that someone has done it before us. Therefore, we can reuse code, which allows us to 1. save time by using others’ code, and 2. learn from others’ code.

Moreover, often the code implemented by someone with more experience is likely to work better and faster than what we can come up with! That’s why we introduce modules. These are packages of Python code that we can load - and by doing that, we get access to powerful tools.

Let’s see how modules work. Run the code below to load a module called numpy which allows us to work with linear algebra and other numeric tools.

import numpy as np

Let’s create an array with numpy.

row1 = [1, 2]
row2 = [3, 4]
table = [row1, row2]

my_array = np.array(table)
my_array

array([[1, 2],
       [3, 4]])

What is a numpy array?

An n-dimensional container that can store specific data types, e.g. bool and float. The arrays come with certain available methods and tools. For example, a 2-d array can act like a matrix and a 3-d array can act like a tensor.

Objects can have useful attributes and methods that are built-in. These are accessed using ".". For example, an array can be transposed as follows:

my_array.T

array([[1, 3],
       [2, 4]])
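
A few more examples of attributes and methods on the same array are sketched below. shape is an attribute (no parentheses) while sum is a method (called with parentheses).

```python
import numpy as np

my_array = np.array([[1, 2], [3, 4]])

print(my_array.shape)    # (2, 2): two rows and two columns
print(my_array.sum())    # 10: adds all elements
print(my_array.T.shape)  # (2, 2): the transpose of a 2x2 array is also 2x2
```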

(Optional) Classes

In Python, we can also define our own types of objects, known as classes. Each class contains rules and properties that govern how objects of the class will behave. If you are curious and want to learn more, which is totally optional, then read more here (note: quite technical). Otherwise move on.
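
To give a first impression, here is a minimal sketch of a class. The class Animal, its attributes and its method speak are all made up for illustration.

```python
class Animal:
    def __init__(self, name, sound):
        # attributes: data stored on each object of the class
        self.name = name
        self.sound = sound

    def speak(self):
        # method: a function that belongs to the object
        return self.name + ' says ' + self.sound

pig = Animal('Pig', 'oink')   # create an object of the class
print(pig.name)
print(pig.speak())
```

As with numpy arrays, the attribute and the method are accessed using ".".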

4 Pandas for data structuring

You may ask yourself: Why do we need to learn data structuring?

Data never comes in the form of our model (unless you or someone else has done it in another program, which is perfectly fine). We need to ‘wrangle’ our data. As of right now, even the most advanced techniques need data in a structured format to work with.

An Overview

Tabular data is like the table below. Each row is an observation which consists of two entries, one for each of the columns/fields, i.e. Animal and Date.

index	Animal	Date
Observation 1	Elk	July 1, 2019
Observation 2	Pig	July 3, 2019

What pandas provides is a smart way of structuring data. It has two fundamental data types, see below. These are essentially just containers but come with a lot of extra functionality for structuring data and performing analysis.

Series: tabular data with a single column (field)
- akin to a vector in mathematics
- has named rows, called indices
DataFrame: tabular data that allows for more than one column (multiple fields)
- akin to a matrix in mathematics
- has labelled columns (e.g. Animal and Date above) and named rows, called indices.

Run the code below to make your first pandas dataframe. Try to print it and explain the content it shows.

import pandas as pd

df1 = pd.DataFrame(data=[[1, 2],[3, 4],[5, 6],[7, 8]],
                   index=['i', 'ii','iii','iv'],
                   columns=['A', 'B'])

The code below makes a series from a list. Note that the list contains all four elementary data types: int, float, str and bool.

L = [1, 1.2, 'abc', True]
ser1 = pd.Series(L)
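
To check what the series actually holds, we can inspect its dtype and the type of each element, as in the sketch below. Because the elements have different types, pandas stores them under the generic dtype object.

```python
import pandas as pd

L = [1, 1.2, 'abc', True]
ser1 = pd.Series(L)

element_types = [type(v).__name__ for v in ser1]

print(ser1.dtype)       # object
print(element_types)    # ['int', 'float', 'str', 'bool']
```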

Now you may ask yourself: why don’t we just use numpy?

There are many reasons. pandas is easier for loading, structuring and making simple analysis of tabular data. However, in many cases, if you are working with custom data or need to perform fast and complex array computations, then numpy is a better option. If you are interested see discussion here.

Switching Among Python, Numpy and Pandas

Pandas dataframes can be thought of as numpy arrays with some additional stuff. Note that columns can have different datatypes!

Most functions from numpy can be applied directly to Pandas. We can convert a DataFrame to a numpy array with the values attribute:

df1.values

array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]], dtype=int64)

In Python, we can describe it as a list of lists.

df1.values.tolist()

[[1, 2], [3, 4], [5, 6], [7, 8]]

Both dataframes and series have indices, which are both a blessing and a curse. These indices mean that we can often convert a Series into a dictionary:

ser1.to_dict()

{0: 1, 1: 1.2, 2: 'abc', 3: True}

WARNING!: Series indices are NOT necessarily unique, thus we may lose data if we convert to a dict, which requires unique keys.
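
The data loss can be seen in a small sketch with a made-up series in which the index 'a' appears twice. The series has three entries, but the dictionary only has two.

```python
import pandas as pd

ser_dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
as_dict = ser_dup.to_dict()

print(len(ser_dup), len(as_dict))   # 3 2
print(ser_dup.index.is_unique)      # False
```

Checking index.is_unique before converting is one way to catch this situation.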

Inspection

Often we want to see what our dataframe contains. This can be done by putting the dataframe at the end of our cell; it will then automatically be displayed.

The example below consists of 100 rows with 5 columns of random data. We see that putting the dataframe at the end of the cell displays it.

df2 = pd.DataFrame(data=np.random.rand(100, 5), 
                   columns=['A','B','C','D','E'])
df2

	A	B	C	D	E
0	0.291100	0.094683	0.550356	0.911622	0.374320
1	0.933523	0.830997	0.430150	0.235283	0.129003
2	0.900874	0.708393	0.950499	0.171770	0.503687
3	0.214144	0.735157	0.651842	0.580469	0.448282
4	0.756690	0.119340	0.269215	0.099179	0.411532
...	...	...	...	...	...
95	0.728482	0.232860	0.854766	0.784101	0.711444
96	0.706587	0.819365	0.090774	0.303287	0.224769
97	0.796380	0.783840	0.740566	0.747527	0.969443
98	0.433955	0.938853	0.932820	0.845110	0.583784
99	0.658342	0.699536	0.337664	0.424492	0.458236

100 rows × 5 columns

We can also use the head and tail methods, which select the first and last observations in a DataFrame, respectively. The code below selects and displays the first four rows.

df3 = df2.head(n=4)
df3

	A	B	C	D	E
0	0.291100	0.094683	0.550356	0.911622	0.374320
1	0.933523	0.830997	0.430150	0.235283	0.129003
2	0.900874	0.708393	0.950499	0.171770	0.503687
3	0.214144	0.735157	0.651842	0.580469	0.448282

Input-output

We can load and save dataframes from our computer or the internet. Try the code below to save our dataframe as a CSV file called my_data.csv. If you are unsure what a CSV file is then check the Wikipedia description.

df3.to_csv('my_data.csv')
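
Since to_csv also writes the row labels (the index) as the first column, it is worth telling read_csv about this when loading the file again. The sketch below shows the round trip with a small made-up dataframe; to avoid creating files it writes to an in-memory buffer from the standard library module io, but a file name works the same way.

```python
import io
import pandas as pd

df_small = pd.DataFrame(data=[[1, 2], [3, 4]],
                        index=['i', 'ii'],
                        columns=['A', 'B'])

buffer = io.StringIO()        # an in-memory text file, used instead of a file on disk
df_small.to_csv(buffer)
buffer.seek(0)                # go back to the start before reading

df_loaded = pd.read_csv(buffer, index_col=0)   # the first column holds the row labels
print(df_loaded)
```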

Loading data is just as easy. Some data sources are open and easy to collect data from. They do not require formatting as they come in a table format. The code below loads a CSV file on school test data from NYC.

my_url = 'https://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv'
my_df = pd.read_csv(my_url)

my_df.head(10)

	DBN	School Name	Number of Test Takers	Critical Reading Mean	Mathematics Mean	Writing Mean
0	01M292	Henry Street School for International Studies	31.0	391.0	425.0	385.0
1	01M448	University Neighborhood High School	60.0	394.0	419.0	387.0
2	01M450	East Side Community High School	69.0	418.0	431.0	402.0
3	01M458	SATELLITE ACADEMY FORSYTH ST	26.0	385.0	370.0	378.0
4	01M509	CMSP HIGH SCHOOL	NaN	NaN	NaN	NaN
5	01M515	Lower East Side Preparatory High School	154.0	314.0	532.0	314.0
6	01M539	New Explorations into Sci, Tech and Math HS	47.0	568.0	583.0	568.0
7	01M650	CASCADES HIGH SCHOOL	35.0	411.0	401.0	401.0
8	01M696	BARD HIGH SCHOOL EARLY COLLEGE	138.0	630.0	608.0	630.0
9	02M047	AMERICAN SIGN LANG ENG DUAL	11.0	405.0	415.0	385.0

Working with weather data

We will now work with a dataset regarding weather. Our source will be the National Oceanic and Atmospheric Administration (NOAA), which has a global data collection going back a couple of centuries. This collection is called the Global Historical Climatology Network (GHCN). The data contains daily weather recorded at the weather stations. A description of GHCN can be found here.

Problem 4.1

Ex. 4.1: Use Pandas’ CSV reader to fetch daily weather data from 1863 for various stations - available somewhere on your common drive. If you cannot find it, it can also be found at this website.

Hint: you will need to give read_csv some keywords. Here are some suggestions:
- Specify the path, using either a string or the pathlib module, see documentation (nice for interoperability between macOS + Windows and relative paths).
- For compressed files you may need to specify the keyword compression when calling the read_csv function.
- header can be specified as the CSV has no column names.

import pandas as pd

### BEGIN SOLUTION
# using online url
path = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/1863.csv.gz"

# ASSUMING data is in a folder called 'data':

# using string 
#path = 'data/1863.csv.gz'

# using pathlib
#from pathlib import Path
#cwd = Path.cwd()
#path = cwd / 'data' / '1863.csv.gz'

df_weather = pd.read_csv(
    path,
    compression='gzip', # decompress gzip
    header=None # use no header information from the csv
) 
### END SOLUTION

Selecting Rows and Columns

In pandas there are two canonical ways of accessing subsets of a dataframe.
- The iloc attribute: access rows and columns using integer positions (like a list).
- The loc attribute: access rows and columns using labels, e.g. numbers or strings (like a dictionary).

In what follows we will describe some different ways of selecting using .iloc and .loc, as well as a simpler way of accessing the dataframe using []. The different ways are meant to give you an overview.

Using list of keys/indices

Below is an example of using the iloc attribute to select specific rows:

df1 # show df1 before indexing it with .iloc[]

	A	B
i	1	2
ii	3	4
iii	5	6
iv	7	8

my_irows = [0, 3]
df1.iloc[my_irows]

	A	B
i	1	2
iv	7	8

We can select columns and rows simultaneously. Below is an example of using the loc attribute, which does that:

my_rows = ['i', 'iii']
my_cols = ['A']
df1.loc[my_rows, my_cols]

	A
i	1
iii	5

Using thresholds

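We can also use iloc and loc for selecting rows and/or columns before or after some threshold, see below. Note that the placement of the : determines the direction: with the : in front of the threshold we select everything before it, and with the : after the threshold we select everything from it onwards.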

df2.iloc[:3, :4]

	A	B	C	D
0	0.291100	0.094683	0.550356	0.911622
1	0.933523	0.830997	0.430150	0.235283
2	0.900874	0.708393	0.950499	0.171770
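
One difference between the two is worth noting: with iloc the end position is excluded (as in a list), whereas with loc the end label is included. The sketch below rebuilds df1 from earlier to show this; the names by_position, by_label and from_label are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8]],
                   index=['i', 'ii', 'iii', 'iv'],
                   columns=['A', 'B'])

by_position = df1.iloc[:2]      # rows at positions 0 and 1
by_label = df1.loc[:'ii']       # rows up to and including the label 'ii'
from_label = df1.loc['iii':]    # rows from the label 'iii' onwards

print(list(by_position.index), list(by_label.index), list(from_label.index))
```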

Using boolean data

If we provide the dataframe with a list of booleans, one for each row, it will select the rows where the value is True (this also works with iloc and loc). We will see soon that this is an extremely useful way of selecting certain rows.

df3[[True, False, False, True]]

	A	B	C	D	E
0	0.291100	0.094683	0.550356	0.911622	0.374320
3	0.214144	0.735157	0.651842	0.580469	0.448282
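
In practice the booleans are rarely typed by hand. Instead they come from a comparison on a column, as in this sketch that rebuilds df1 from earlier and keeps the rows where column A is greater than 3 (the names mask and selected are made up).

```python
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6], [7, 8]],
                   index=['i', 'ii', 'iii', 'iv'],
                   columns=['A', 'B'])

mask = df1['A'] > 3      # a boolean series with one value per row
selected = df1[mask]     # keeps only the rows where mask is True

print(mask.tolist())
print(selected)
```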

Selecting columns

Often we need to select specific columns. If we provide the dataframe with a list of column names, it returns a dataframe that keeps only these columns:

df3[['B', 'D']]

	B	D
0	0.094683	0.911622
1	0.830997	0.235283
2	0.708393	0.171770
3	0.735157	0.580469

Problem 4.2

Ex 4.2: Select the four left-most columns which contain: station identifier, date, observation type, observation value. Rename them as ‘station’, ‘datetime’, ‘obs_type’, ‘obs_value’.

Hint: Renaming can be done with df.columns = cols where cols is a list of column names.

### BEGIN SOLUTION
df_weather = df_weather.iloc[:, :4] # select only first four columns

column_names = ['station', 'datetime', 'obs_type', 'obs_value']
df_weather.columns = column_names # set column names
### END SOLUTION

Basic Operations

How do we perform elementary operations like the ones we learned for basic Python? E.g. numeric operations such as addition (+) or logical operations such as greater than (>). Actually we are in luck - they are exactly the same.

Let’s see how it works for numeric data using a numpy array (it works the same way in pandas).

my_arr1 = np.array([2, 3, 2, 1, 1])
my_arr2 = my_arr1 ** 2
my_arr2

array([4, 9, 4, 1, 1])

Can we do the same with two vectors? Yes, we can also do elementwise addition, multiplication, subtraction etc. of arrays and series. Example:

my_arr1 + my_arr2

array([ 6, 12,  6,  2,  2])
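
The same operators can be tried on pandas Series. The sketch below mirrors the numpy example with hypothetical Series s1 and s2 and also shows an elementwise comparison:

```python
import pandas as pd

s1 = pd.Series([2, 3, 2, 1, 1])
s2 = s1 ** 2          # elementwise power

total = s1 + s2       # elementwise addition
larger = s2 > s1      # elementwise comparison, gives a boolean Series

print(total.tolist())
print(larger.tolist())
```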

Changing and Copying Data

Everything in the dataframe can be changed. For instance, we can update our dataframe with new values, e.g. by making new variables or overwriting existing ones. In the example below we add a new column to a DataFrame.

df2['F'] = df2['A'] > df2['D']
df2.head(10)

	A	B	C	D	E	F
0	0.291100	0.094683	0.550356	0.911622	0.374320	False
1	0.933523	0.830997	0.430150	0.235283	0.129003	True
2	0.900874	0.708393	0.950499	0.171770	0.503687	True
3	0.214144	0.735157	0.651842	0.580469	0.448282	False
4	0.756690	0.119340	0.269215	0.099179	0.411532	True
5	0.634309	0.958614	0.330676	0.454304	0.098996	True
6	0.327120	0.263946	0.884487	0.238092	0.283622	True
7	0.180478	0.433104	0.719118	0.188784	0.674121	False
8	0.645979	0.667443	0.978808	0.531604	0.241179	True
9	0.940160	0.744014	0.657913	0.348178	0.940021	True

WARNING!: If you work on a subset of data taken from another dataframe, the subset may be what is known as a view of the original rather than an independent copy! In that case, changes made in the view will also be made in the original version.

In the example below, we try to change the dataframe df3, which is a view of df2, and we get a warning. Thus, changes to df3 also happen in df2. Notice that we can also use loc for changing the data.

df3.loc[:,'D'] = df3['A'] - df3['E']
print(df2['D'].head(3), '\n')
print(df3['D'].head(3))

0   -0.083219
1    0.804520
2    0.397187
Name: D, dtype: float64 

0   -0.083219
1    0.804520
2    0.397187
Name: D, dtype: float64

C:\Users\wkg579\AppData\Local\Temp\ipykernel_18024\2462267356.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3.loc[:,'D'] = df3['A'] - df3['E']

To avoid the problem of having a view, we can instead copy the data as in the example below. Try to verify that if you change things in df4 things do not change in df2.

df4 = df2.copy()

# Verify that the code from above doesn't throw the same "SettingWithCopyWarning" 
# when using the copied dataframe, df4, instead of df3.
df4.loc[:, 'D'] = df4['A'] - df4['E']
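
As a way of carrying out the verification suggested above, here is a minimal sketch on hypothetical toy data (df_orig and df_copy are illustrative names, not objects from this notebook). After copying, changes to the copy leave the original untouched.

```python
import pandas as pd

# Hypothetical toy data
df_orig = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [0.5, 0.5, 0.5]})

df_copy = df_orig.copy()                             # independent copy of the data
df_copy.loc[:, 'A'] = df_copy['A'] - df_copy['E']    # modify the copy only

print(df_orig['A'].tolist())   # original values are unchanged
print(df_copy['A'].tolist())   # copy holds the new values
```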

Problem 4.3

Ex. 4.3: Further, select the subset of data for the station UK000056225 and only observations for maximal temperature. Make a copy of the DataFrame and store this in the variable df_select. Explain in one or two sentences how copying works. Write your answer in a multi line comment like """ Your answer here """.

Hint: The & operator works elementwise on boolean series (like and in core python). This allows you to combine conditions for selections.

### BEGIN SOLUTION
select_stat = df_weather.station == 'UK000056225' # boolean: first weather station
select_tmax = df_weather.obs_type == 'TMAX' # boolean: maximal temp.

select_rows = select_stat & select_tmax # row selection - require both conditions

df_select = df_weather[select_rows].copy() # apply selection and copy

explanation = """Copying of the dataframe breaks the dependency with original DataFrame `df_weather`.
If dependency is not broken, then changing values in one of the two dataframes 
would imply changes in the other."""
print(explanation)
### END SOLUTION

Copying of the dataframe breaks the dependency with original DataFrame `df_weather`.
If dependency is not broken, then changing values in one of the two dataframes 
would imply changes in the other.

Problem 4.4

Ex 4.4: Make sure that max temperature is correctly formatted (how many decimals should we add? one? Look through this .txt file for an answer https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt). Make a new column called TMAX_F where you have converted the temperature variables to Fahrenheit.

Hint: Conversion is \(F = 32 + 1.8*C\) where \(F\) is Fahrenheit and \(C\) is Celsius.

### BEGIN SOLUTION
# In the readme.txt file downloaded from the link given in the exercise text,
# it can be seen, in the section 'III. FORMAT OF DATA FILES', that 
# TMAX = Maximum temperature (tenths of degrees C). 
# Therefore, we convert to degrees Celsius by dividing by 10. 
df_select['obs_value'] = df_select['obs_value'] / 10 
df_select['TMAX_F'] = 32 + 1.8 * df_select['obs_value']
### END SOLUTION
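
To sanity-check the two conversion steps, it can help to run them on a few values where the answer is easy to verify by hand. The sketch below uses a hypothetical dataframe df_temp with made-up readings in tenths of degrees Celsius; the column names tmax_c and tmax_f are illustrative only.

```python
import pandas as pd

# Hypothetical readings in tenths of degrees Celsius: 0.0, 10.0 and 21.5 degrees
df_temp = pd.DataFrame({'obs_value': [0, 100, 215]})

df_temp['tmax_c'] = df_temp['obs_value'] / 10       # tenths of degrees -> degrees Celsius
df_temp['tmax_f'] = 32 + 1.8 * df_temp['tmax_c']    # Celsius -> Fahrenheit

print(df_temp)
```

By hand, 0 °C corresponds to 32 °F and 10 °C to 50 °F, which gives a quick check of the formula.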

Changing and Rearranging Indices

In addition to replacing values of our data, we can also rearrange the order of variables and rows as well as make new ones. We have already seen how to change column names, but we can also reset the index, as seen below. Alternatively, we can set our own custom index using set_index, for instance an index of temporal data, which provides the DataFrame with new functionality.

df1_new_index = df1.reset_index(drop=True)
df1_new_index

	A	B
0	1	2
1	3	4
2	5	6
3	7	8
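
As an illustration of the custom index mentioned above, the following sketch sets a date column as the index of a hypothetical dataframe df_demo. With a datetime index, rows can for example be selected by a partial date string.

```python
import pandas as pd

# Hypothetical toy data with a column of pandas dates
df_demo = pd.DataFrame({
    'date': pd.to_datetime(['1863-01-01', '1863-01-02', '1863-02-01']),
    'value': [1.0, 2.0, 3.0],
})

df_by_date = df_demo.set_index('date')   # use the dates as row labels
january = df_by_date.loc['1863-01']      # all rows from January 1863

print(january)
```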

A powerful tool for re-organizing the data is to sort the data. That is, we can re-organize rows (or columns) such that they are ascending or descending according to one or more columns.

df3_sorted = df3.sort_values(by=['A','B'], ascending=True)
df3_sorted

	A	B	C	D	E
3	0.214144	0.735157	0.651842	-0.234138	0.448282
0	0.291100	0.094683	0.550356	-0.083219	0.374320
2	0.900874	0.708393	0.950499	0.397187	0.503687
1	0.933523	0.830997	0.430150	0.804520	0.129003
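
The ascending argument can also be given as a list with one entry per sorting column. A minimal sketch on hypothetical toy data, sorting by A ascending and, within ties, by B descending:

```python
import pandas as pd

# Hypothetical toy data with ties in column A
df_demo = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [0.3, 0.1, 0.9, 0.5]})

# A ascending; within equal values of A, B descending
sorted_mixed = df_demo.sort_values(by=['A', 'B'], ascending=[True, False])

print(sorted_mixed)
```

Note that the original index labels travel with the rows, just as in the output of df3_sorted above.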

Problem 4.5

Ex 4.5: Inspect the indices in df_select. Are they following the sequence of natural numbers, 0,1,2,…? If not, reset the index and make sure to drop the old.

### BEGIN SOLUTION
df_select = df_select.reset_index(drop=True)
### END SOLUTION

Problem 4.6

Ex 4.6: Make a new DataFrame df_sorted where you have sorted by the maximum temperature. What is the date for the first and last observations?

### BEGIN SOLUTION
df_sorted = df_select.sort_values(by=['obs_value'])
print(
    f"Date for the min temp: {df_sorted['datetime'].iloc[0]}", 
    f"Date for the max temp: {df_sorted['datetime'].iloc[-1]}",
    sep="\n"
)
### END SOLUTION

Date for the min temp: 18631231
Date for the max temp: 18630714

5 Pandas with datetimes and aggregations (OPTIONAL)

Pandas supports many more functions, many of which are covered in the Python for Data Analysis (PDA) book. These include more data cleaning (PDA chapter 7), merging and joining (PDA chapter 8), groupby functionality (PDA chapter 10), datetimes (PDA chapter 11) and more.

Problem 5.1

When working with datetimes, it is common to get them as pure strings from the data source. In the weather data, it is a string of the format YYYYMMDD, which can be converted to a date pandas understands using the pandas function to_datetime() (see the pandas documentation for details).

Ex 5.1: Convert the string date to a pandas date and add this to a new column called datetime_dt.

Hint: When converting string dates to pandas dates, it is always wise to specify the format. PDA has a table with format information.

### BEGIN SOLUTION
datetime_dt = pd.to_datetime(df_select['datetime'], format='%Y%m%d')
df_select['datetime_dt'] = datetime_dt
### END SOLUTION
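
To see what the format string does in isolation, here is a minimal sketch with a few hypothetical date strings in the YYYYMMDD layout (dates_raw and dates are illustrative names):

```python
import pandas as pd

# Hypothetical date strings in the format YYYYMMDD
dates_raw = pd.Series(['18630101', '18630714', '18631231'])

# %Y = four-digit year, %m = two-digit month, %d = two-digit day
dates = pd.to_datetime(dates_raw, format='%Y%m%d')

print(dates)
```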

Problem 5.2

Ex 5.2: Create a new column with the month of the observation

Hint: If a Series/column has a date in it, the datetime functionality can be accessed by calling .dt on it, which can be followed by further commands.

### BEGIN SOLUTION
month = datetime_dt.dt.month
df_select['month'] = month
### END SOLUTION

Problem 5.3

A very powerful method to analyse data is the split-apply-combine method. In pandas this corresponds to the groupby functionality.

Ex 5.3: Compute the mean and median maximum daily temperature for each month on the dataframe df_select using the split-apply-combine procedure. Store the results in new columns tmax_mean and tmax_median.

Hint: The groupby functionality can be ‘unwrapped’ using the transform method, such that it retains the original length. This is very handy when trying to create new columns, and not reporting statistics.

### BEGIN SOLUTION
df_select['tmax_mean'] = df_select.groupby(['month'])['obs_value'].transform('mean')
df_select['tmax_median'] = df_select.groupby(['month'])['obs_value'].transform('median')
### END SOLUTION
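
The difference between reporting statistics and creating new columns can be seen in a small sketch. It assumes a hypothetical dataframe df_demo with columns month and tmax: agg returns one row per group, while transform returns one value per original row.

```python
import pandas as pd

# Hypothetical toy data: two observations in month 1, three in month 2
df_demo = pd.DataFrame({
    'month': [1, 1, 2, 2, 2],
    'tmax': [2.0, 4.0, 10.0, 12.0, 20.0],
})

# One row per month: useful for reporting statistics
per_month = df_demo.groupby('month')['tmax'].agg(['mean', 'median'])

# One value per original row: useful for creating new columns
df_demo['tmax_mean'] = df_demo.groupby('month')['tmax'].transform('mean')

print(per_month)
print(df_demo)
```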

6 Linear regression with numpy (OPTIONAL)

NOTE: If you previously skipped 3.3, 3.4 or 3.5, you might benefit from completing these before continuing with this numpy section.

With numpy, Python supports all of the regular matrix computations, in case one wishes to implement a predictor or estimator on their own.

To showcase this, your task in this example is to convert a subset of a DataFrame into a numpy array. Based on this, you can implement estimators such as the ordinary least squares estimator:

\[\hat \beta = (X'X)^{-1}(X'y)\]

To test this out, we will estimate how age, passenger class and fare influenced chance of survival for the passengers of Titanic.
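
Before turning to the Titanic data, the formula can be tried on synthetic data where the true coefficients are known. This is a minimal sketch under stated assumptions: X_demo, y_demo and beta_true are hypothetical, and the data is generated without noise so the estimator should recover beta_true up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)                # fixed seed for reproducibility

X_demo = rng.normal(size=(200, 2))            # hypothetical regressors, N x 2
beta_true = np.array([[1.5], [-0.5]])         # hypothetical true coefficients, 2 x 1
y_demo = X_demo @ beta_true                   # noise-free outcome, N x 1

# OLS estimator: (X'X)^{-1} (X'y)
beta_hat = np.linalg.inv(X_demo.T @ X_demo) @ (X_demo.T @ y_demo)

print(np.allclose(beta_hat, beta_true))
```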

Problem 6.1

Ex 6.1: Load the titanic dataset from seaborn using the load_dataset function. Remove any rows with missing values.

Hint:
- The dataset is aptly named titanic.
- pandas DataFrames have a built-in method called dropna.

### BEGIN SOLUTION
import seaborn as sns
df_titanic = sns.load_dataset('titanic')
df_titanic = df_titanic.dropna()
### END SOLUTION

Problem 6.2

Ex 6.2: Convert the columns age, pclass and fare to an array with dimensions N*3 and the column survived to an array with dimensions N*1

Hint: Try subsetting the data in the DataFrame and then converting it to an array

### BEGIN SOLUTION

# Using pandas method
X = df_titanic[['age','pclass','fare']].to_numpy()
y = df_titanic[['survived']].to_numpy()

# Using as input to np.array

X_alt = np.array(df_titanic[['age','pclass','fare']])
y_alt = np.array(df_titanic[['survived']])

# equivalence

assert (X == X_alt).all()
assert (y == y_alt).all()

### END SOLUTION

Problem 6.3

Ex 6.3: Implement the ordinary least squares estimator with no intercept using numpy

Hint:
- numpy offers a lot of methods for arrays.
- numpy.linalg offers a lot of functionality for linear algebra.
- @ calculates a matrix product (for two 1-dimensional arrays, a dot product).
- inv inverts a matrix.
- If you’ve imported numpy as np, the functions in numpy.linalg can be accessed as np.linalg.function.
- You can also import specific functions with from numpy.linalg import function. This can reduce the clutter in your code (e.g. np.linalg.inv(X) versus inv(X)).

### BEGIN SOLUTION
from numpy.linalg import inv

beta = inv(X.T@X)@(X.T@y)

# equivalence with sklearn method

from sklearn.linear_model import LinearRegression
OLS = LinearRegression(fit_intercept=False)
OLS.fit(X, y)

# np.isclose instead of == due to numerical inaccuracies
assert np.isclose(OLS.coef_, beta.T).all()

### END SOLUTION
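
As a side note that is not part of the exercise, numpy also offers np.linalg.lstsq, which solves the least squares problem without forming the inverse explicitly. The sketch below uses hypothetical synthetic data (X_demo, y_demo) and checks that it agrees with the inverse-based formula from the solution.

```python
import numpy as np

rng = np.random.default_rng(1)                      # fixed seed for reproducibility

# Hypothetical synthetic data with some noise
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([[0.5], [1.0], [-2.0]]) + rng.normal(scale=0.1, size=(100, 1))

# Estimator via the explicit inverse, as in the solution above
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ (X_demo.T @ y_demo)

# Alternative: let numpy solve the least squares problem directly
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))
```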