Problem Set 7 - Due Tuesday, April 9, 2019

Reading

  1. Slides and notebooks from Lec 12 (Testing and Debugging), Lec 13 (Nested Loops), and Lec 15 (Sorting and Lambda).
  2. Problems and solutions from Lab 07 (List Comprehensions, Debugging, and Testing) and [Lab 08 (Nested Loops and Sorting)].
  3. Think Python, Ch. 7: Iteration and Think Python, Ch. 10: Lists

About this Problem Set

This problem set will give you practice with loop debugging, nested lists and loops, and sorting.

  1. In Task 1 (Individual Task) you will develop counterexamples that show why buggy solutions to a looping problem do not work.

  2. In Task 2 (Recommended Partner task), you will use loops and sorting to manipulate and display data.

Use this shared Google Doc to find a pair programming partner and record your partner.

For Task 2, having a partner is optional, but strongly recommended.

Other notes:

All code for this assignment is available in the ps07 folder in the cs111/download directory within your cs server account. This assignment also uses the Codder program to help you do some basic checks for Tasks 1 and 2.

(Click the '+' symbols to expand each section)

(Also, click on collapse or expand and refresh the page to change the default)


Task 1: Loop Debugging

Background: The correct function

On a previous midterm exam, students were asked to write a function that takes two arguments (a pivot string and a list of strings wordlist) and returns a list of all the strings of wordlist that precede pivot in dictionary order. The words in the returned list must have the same relative order that they did in the original list.

Here is a function named correct that correctly solves this problem using the filtering pattern studied in class.

def correct(pivot, wordlist):
    '''Correct solution to Exam 1, Problem 1: a function that takes a pivot 
    string and a list of strings wordlist and returns a list that includes 
    only those elements of wordlist that precede pivot in dictionary order.
    '''
    result = []
    for word in wordlist:
        if word < pivot: # If word precedes pivot in dictionary order
            result.append(word)
    return result

Note that word < pivot is a simple way in Python to test if the string word comes before pivot in dictionary order.

Here's a sample use of the correct function:

In [1]: correct('cat', ['bunny', 'elephant', 'dog', 'bat', 'ant', 'dog', 'bat'])
Out[1]: ['bunny', 'bat', 'ant', 'bat']

Note that elements in the output list have the same relative order they did in the input list. Also note that if a string preceding the pivot (such as 'bat') occurs multiple times in the input list, it must occur multiple times in the output list.

Background: counterexamples

In this problem you will study ten incorrect programs for this problem and determine why they don't work. Incorrect programs are said to be buggy because they contain bugs = reasons why they don't work. So your goal in this problem is to track down the bugs in each of the ten programs.

One way to show that a program is buggy is to provide a counterexample, which is a particular set of inputs for which the program does not behave correctly. For a counterexample, the buggy program might return a wrong answer, or it might raise an exception when the correct program does not.

The next two sections give examples of what a buggy function looks like and how to figure out the bugs in these functions (click to expand these sections):

First buggy function example: buggy0a

Here is a function buggy0a that is an incorrect version of the correct function:

def buggy0a(pivot, wordlist):
    result = []
    for word in wordlist:
        if word > pivot: 
            result.append(word)
    return result

buggy0a illustrates a common bug for this function on the exam: it performs the wrong comparison between word and pivot, and so returns all the words in wordlist that come after the pivot, not before the pivot.

In the case of buggy0a, counterexamples are very easy to find. Any pivot with a nonempty wordlist list that is not just a singleton list (a list of length 1) containing just the pivot is a counterexample. For instance, ('c', ['a', 'd', 'c', 'e', 'b']) is a counterexample, because correct and buggy0 return different results for these inputs:

In [2]: correct('c', ['b', 'd', 'a', 'c', 'e'])
Out[2]: ['b', 'a']

In [3]: buggy0a('c', ['b', 'd', 'a', 'c', 'e'])
Out[3]: ['d', 'e'] # Different answer from the correct one

However, ('c', []) and ('c', ['c']) are not counterexamples, because both correct and buggy0a return the same results for these:

In [4]: correct('c', [])
Out[4]: []

In [5]: buggy0a('c', [])
Out[5]: [] # The same answer as correct

In [6]: correct('c', ['c'])
Out[6]: []

In [7]: buggy0a('c', ['c'])
Out[7]: [] # The same answer as correct

For the exam problem, we will say that a counterexample is minimal if it uses the shortest wordlist that a counterexample can have. For buggy0a, the shortest wordlist is one that contains a single string that is not the pivot. Here are two examples of minimal counterexamples for buggy0a:

In [8]: correct('c', ['a']) # Can use any string before 'c' in dictionary order. 
Out[8]: ['a']

In [9]: buggy0('c', ['a'])
Out[9]: [] # A different answer from correct

In [10]: correct('c', ['d']) # Can use any string after 'c' in dictionary order.
Out[10]: []

In [11]: buggy0('c', ['d'])
Out[11]: ['d'] # A different answer from correct

Second buggy function example: buggy0b

As a more complex buggy funtion, consider buggy0b:

def buggy0b(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        i = 0
        while i < minlen and word[i] <= pivot[i]:
            i += 1
        if i == minlen and (word[:minlen] != pivot[:minlen] or (len(word) < len(pivot))):
            result.append(word)
    return result

This function is rather complex, so to think about it, let's start by assuming that word and pivot have the same length and ignoring the condition (word[:minlen] != pivot[:minlen] or (len(word) < len(pivot))) at the end.

What does the while loop do? It counts (in the variable i) the number of consecutive times that an index in word has a letter that is less than or equal to the letter at the corresponding index in pivot. The counting stops when either i reaches minlen or word[i] is strictly greater than pivot[i]. Only if every letter in the first minlen letters of word is less than or equal to the corresponding letter in pivot will word be added to the result list. For example, if pivot is 'do', then the word 'an' satisfies this condition, because 'a' <= 'd' and 'n' <= 'o'. But the word 'at' does not satisfy this condition, since 't' is not less than or equal to 'o'.

This is problematic, since 'at' comes before 'do' in alphabetical order, but buggy0b will not include it in the result list:

In [12]: correct('do', ['an', 'at', 'do', 'ex'])
Out[12]: ['an', 'at']

In [13]: buggy0b('do', ['an', 'at', 'do', 'ex'])
Out[13]: ['an']

A minimal counterexample for buggy0b involves a list with one string that is included by correct but not by buggy0b:

In [14]: correct('do', ['at'])
Out[14]: ['at']

In [15]: buggy0b('do', ['at'])
Out[15]: []

Now let's consider some of the complexities of buggy0b that we ignored at first.

Your Task

The file ps07/debugging.py contains 10 buggy implementations of the correct function, which are shown below. For each function buggyi, you must do two things:

  1. Define the variable counterExamplei to be a pair (2-tuple) of (1) a pivot string and (2) a wordlist. The pair counterExamplei should be a minimal counterexample that indicates why buggyi is incorrect.

    # Minimal counterexample for buggy0a
    counterExample0a = ('c', ['a'])
    
    # Minimal counterexample for buggy0b
    counterExample0b = ('do', ['at']) 
    
  2. Define the variable explanationi to be a triple-quoted string that briefly explains why the function is buggy. For example,

    # An explanation of why buggy0a is buggy
    explanation0a = """
    Returns all words *after* pivot rather than *before* pivot.
    """
    
    # An explanation of why buggy0b is buggy
    explanation0b = """
    Requires *each* index shared by word and pivot to contain a letter in word
    that is less than or equal to the letter at the corresponding index of
    pivot. But words that should correctly precede pivot in dictionary order and
    are not prefixes of pivot only need to have *one* index at which the letter
    in word is less than the letter in pivot (following a sequence of indices at
    which the letters are equal). So the result list excludes some words it
    should include. 
    """

Explanation strings must be delimited by three double-quotes, which allows them to contain multiple lines.

Notes:

The 10 Buggy Functions

The ten buggy functions can be found in the file ps07/debugging.py, which also contains definitions for the correct, buggy0a, and buggy0b function.

Below each buggyi function function is a counterExamplei variable that you must define as a pair (2-tuple) of (1) a pivot string and (2) a minimum length wordlist that serves as a counterexample relative to the pivot.

Below each buggyi function function is also a explanationi variable that you must define as a triple-quoted string that explains why buggyi is buggy.

Most of these buggy programs are based on incorrect programs that students wrote for Problem 1 on Exam 1. They have been cleaned up in various ways, but they illustrate reasoning issues that students actually exhibit in practice. Most commonly, students did not understand that strings could use < in Python to compare two strings in dictionary order, and came up with complex solutions involving loops that tested the strings letter-by-letter. It is possible to define a correct loop involving letter-by-letter comparisons, but it is rather challenging to do so, as several of the following functions indicate.

# ------------------------------------------------------------------------------
def buggy1(pivot, wordlist):
    result = []
    for word in wordlist:
        if word <= pivot: 
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy2(pivot, wordlist):
    result = []
    for word in wordlist:
        if word[0] < pivot[0]:
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy3(pivot, wordlist):
    result = []
    for word in wordlist:
        if word < pivot:
            result.append(word)
        return result

# ------------------------------------------------------------------------------                                                         
def buggy4(pivot, wordlist):
    result = []
    for word in wordlist:
        if word < pivot and word not in result:
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy5(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        for i in range(minlen):
            if word[i] < pivot[i]:
                result.append(word)            
            elif word[i] > pivot[i]:
                break # exit inner for loop if word comes after pivot
        if len(word) < len(pivot) and word == pivot[:minlen]:
            result.append(word) # Handle prefixes of pivot specially
    return result

# ------------------------------------------------------------------------------
def buggy6(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        allLess = True
        for i in range(minlen):
            if word[i] >= pivot[i]:            
                allLess = False
        if allLess:
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy7(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        i = 0
        while i < minlen and word[i] == pivot[i]:
            i += 1
        # After the while loop, either (1) i is equal to minlen or 
        # (2) i is the first index at which word and pivot differ
        if (i == minlen or word[i] < pivot[i]) and pivot != word: 
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy8(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        i = 0
        while i < minlen and word[i] == pivot[i]:
            i += 1
        # After the while loop, either (1) i is equal to minlen or 
        # (2) i is the first index at which word and pivot differ
        if i < minlen and word[i] < pivot[i]:
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy9(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        for i in range(minlen):
            if word[i] < pivot[i]:
                result.append(word)            
                break # Exit inner for loop
        if len(word) < len(pivot) and word == pivot[:minlen]:
            result.append(word) # Handle prefixes of pivot specially
    return result

# ------------------------------------------------------------------------------
def buggy10(pivot, wordlist):
    result = []
    for word in wordlist:
        for i in range(len(word)):
            if word[i] < pivot[i]:
                result.append(word)
                break # Exit the inner for loop
            elif word[i] > pivot[i]:
                break # Exit the inner for loop
            # Otherwise word[i] == pivot[i], and we continue loop
    return result

Task 2: Data Processing & Analysis

The ps07/temperatures.py and ps07/cities.py files contain data on global surface temperatures and on city locations and populations. Each file describes how its data is organized, but both use the same convention: a list of rows, where each row is a list where each entry holds a specific piece of data. For example, in the cities variable from ps07/cities.py, each row starts with the name of the city, so if we indexed once to get a row:

row = cities[0]

...we could then index again to get the name of the city:

name = row[0]

A for loop to print out the name of each city would look like this:

print("City names:")
for row in cities:
  print("  {}".format(row[0]))

If we know what all of the entries in a row are, we can also use multiple iteration variables to unpack the rows, like this:

print("City names:")
for name, lat, lon, pop, area, elev, coastal in cities:
  print("  {}".format(name))

In this example, we've taken advantage of the fact that we know each row will store the name, latitude, longitude, population, area, elevation, and coastal status of a city in that exact order, as specified in the comments at the top of ps07/cities.py.

You should use this pattern often while solving the problems in this task.

Each function in this task will take a dataset as an argument, possibly with some other arguments, and most will return a modified dataset. Your functions should be able to handle different datasets with the same format, for example, although the cities variable (defined in the cities.py file) happens to contain 21 rows with the first city Istanbul, your functions must be able to handle any number of city rows in any order.

Also note that you should not use print in this task (except for debugging). The testing code will use print to display the results of your functions, but the functions that you write should just return new lists.

Subtask A: Sorting Cities

Recall that the sorted function produces a new list, and its key= argument can be used to specify how to sort. In particular, the value of this argument should be a function (possibly a lambda function) which takes an item as an argument (in our case a row) and returns a value to be used in sorting (which can be a string, number, or a tuple to sort by multiple values in a hierarchy).

Using the sorted function, complete the definition of the following functions:

Subtask B: Processing Grids

The ps07/temperatures.py file contains global temperature data from the National Oceanic and Atmospheric Administration (NOAA) of the United States. The data organization is explained at the top of that file, and it is critical that you understand how the data is organized. To understand the data, open the file in Canopy and run it, and then try some of the following statements:

print(len(temperatures))

print(len(temperatures[0]))

print(type(temperatures[0]))

print(type(temperatures[0][0]))
print(type(temperatures[0][1]))
print(type(temperatures[0][2]))

print(len(temperatures[0][2]))
print(len(temperatures[0][2][0]))
print(type(temperatures[0][2][0]))

print(temperatures[0][2][0])
print(temperatures[0][2][3])
print(temperatures[0][2][15])

print(type(temperatures[0][2][0][0]))
print(type(temperatures[0][2][15][20]))

# Note: why do the first two contain so many None values?
# Hint: what area of the globe do they come from?

As you can see, there will be a lot of indexing in this part of the assignment. Before moving forward, it's a good idea to make sure you understand the structure of the various lists. Feel free to ask your classmates about this (you can discuss the structure of the data without sharing any code) and also of course consult instructors or tutors during drop-in hours if you want.

Hint: Because there are so many layers of lists here, it's good practice to use the divide/conquer/glue strategy: write one function to iterate over an outer list, which can call another function to deal with an inner list without having to worry about more than two layers of loops at once.

To get another perspective on what the data is like, this image was created by the NOAA based on similar data:

A map of the globe with red and blue colors in a grid indicating temperature difference relative to a 1981-2010 baseline. Most of North America and parts of Northwest China are a deep blue (much lower temperatures than in the past), while Eastern Europe, parts of Northeastern Russia, and Alaska are deep red (much higher temperatures than before), with the rest of the globe being a mix of pale blues and pale reds (slightly lower/higher temperatures).

(credit: NOAA's February 2019 Global Climate Report)

Each colored block in the image is represented by a number in the grid, with gray blocks holding None values. There are 72 blocks across the globe, and 36 blocks from the North Pole to the South Pole.

Explanation of Grid Data

For now, focus on the grid structure that each month of data contains. These grids are lists of lists, where each entry is a number representing the average temperature in that month for part of the world, using degrees Celsius compared against an baseline computed using data from 1971--2000. So if the number is 0.8, for example, it means that the average temperature for that part of the world was 0.8°C higher than it had been during 1971-2000 (this is called the temperature anomaly for that region). Where data is not available, the value None is used to indicate this, and all of the functions you write must be able to deal with these None values.

Each grid list contains 36 rows, each of which has 72 entries, with each grid cell drawn from a 5°×5° latitude/longitude area of the globe (36×5 = 180 degrees of latitude from -90 at the South Pole to 90 at the North Pole, and 72×5 = 360 degrees of longitude from 0 at the Greenwich Meridian to 360 at the same place). The latitudes and longitudes lists have length 36 and 72 respectively and provide the coordinate values for the center of each grid unit. So for example, the grid unit centered at 7.5° North and 2.5° East would be in the 1st column of the 20th row because:

To access this cell given a grid (already plucked out of a row), we would write (note the use of 19 and 0 for the 20th row and the 1st column):

entry = grid[19][0]

To access the temperature anomaly for this location in March of 2001, we need the grid from the 15th row of the data (it begins in January 2000, so March 2001 is the 12 + 3 = 15th month), which could be obtained as follows (note the use of 14 for the 15th row because indices start at 0, and we know that the grid is always in column 2 of each row):

march_2001_grid = temperatures[14][2]

The Grid Functions

Before writing functions to deal with the full temperature data over time, write the following functions that just process an individual grid from a single month.

Note: these functions each take a list-of-lists grid object as their argument and return a list of values.

Using these tools, we can compute the average temperature anomaly for entire horizontal or vertical regions of the world, or for the entire globe in the next subtask.

Subtask C: Combining Temperature Data

For subtask C, you need to write the following data-transformation functions, using your functions from subtask B to compute averages over different dimensions of the data. Each function will be given a dataset with the same structure as the temperatures data as an argument, and should produce a dataset with a different structure as output.

Note: The datasets your functions return should still be lists of lists, but the columns that they use for their inner lists will be different for each one.

Hint: The anomaliesByLocation function is provided for you, and you can study its structure, but it has a somewhat different structure than the other anomaliesBy functions, so don't try to imitate it exactly.

Subtask D: Analyzing and Combining the Data

For subtask D, you will write the following functions that use your work from subtask C along with the provided anomaliesByLocation function to reveal some trends in the data.

Note: In a real scientific context, statistics would be used to confirm that these trends were not just the result of random chance.

Note: A few of the functions in the ps07/analyze.py file include some extra questions if you want to think about how to actually draw conclusions from this data, although that isn't the focus of this assignment.


Task 3: Honor Code Form

As in the previous psets, your honor code submission for this pset will involve defining entering values for the variables in the honorcode.py file.


How to turn in this Problem Set

Softcopy (electronic) submission

Hardcopy (paper) submission

There is no hardcopy submission. Problem set submission is entirely electronic.