Problem Set 7 - Due Tuesday, November 5, 2019

Reading

  1. Slides and notebooks from Lec 12 (Testing and Debugging), Lec 13 (Nested Loops), and Lec 15 (Sorting).
  2. Problems and solutions from Lab 08 (Function Testing and Debugging) and Lab 09 (Nested Loops and Sorting).
  3. Think Python, Ch. 7: Iteration and Think Python, Ch. 10: Lists.

About this Problem Set

This problem set will give you practice with helper functions, nested lists and loops, and sorting.

Use this shared Google Doc to find a pair programming partner and record your partner.

For Task 2, having a partner is optional, but strongly recommended.

Other notes:

All code for this assignment is available in the ps07 folder in the cs111/download directory within your CS server account. There are no Codder checks for Task 1, and only basic checks for Task 2.


Task 0: Scrambled Solutions

This is an individual problem which you must complete on your own, although you can ask for help from the CS111 staff.

This time we have two puzzles for you to solve to review the solutions to problem set 6. Go to:

CS 111 Puzzles

and select the options under Problem Set 6.

As before, please download and submit your solution files, and email pmawhort@wellesley.edu if you run into trouble.


Task 1: Helper Functions

Note: In this task, you are being asked to modify code, but not write entire functions from scratch.

Flu Data

This problem set uses functions that process data on deaths caused by pneumonia and influenza. The data (see the file flu.py) is compiled by the U.S. Centers for Disease Control and Prevention (CDC) and includes one entry per week from 2012 to 2019. The processing functions are in the file helpers.py, which is the file you will modify for this task. The provided functions are as follows:

  1. epidemicWeeks: This function takes flu data as an argument and returns a list of (year, week) pairs indicating weeks in which the number of flu deaths rose above the threshold to be considered an epidemic. The epidemic threshold is included in each row of the data, and varies from week to week.
  2. worstWeeks: This function takes flu data as an argument as well as an integer to limit the number of weeks returned. It averages data across all years in the dataset to determine which weeks of the year are the worst for flu deaths. It returns a list of (week #, average deaths) pairs that contains just the requested number of entries for the worst weeks of the dataset.
  3. buildPctTable: This function takes flu data as an argument and builds a table (a list of lists) with weeks as the rows and years as the columns, putting the % of all deaths that week which were flu-related into each cell of the table.


Your Task

The provided functions already work correctly, but they are each quite complicated. In order to make them easier to understand, you must modify each of them to use one or more helper functions:

  1. For epidemicWeeks, define a new helper function called pctFluDeaths which takes as an argument a single row of the data and returns the percentage of all deaths for that row attributed to either pneumonia or the flu.

    To do so, you should remove the lines in epidemicWeeks which define the allDeaths, pneumoniaDeaths, and fluDeaths variables, and use those in your helper function instead. Then, edit the filter condition in the epidemicWeeks function to use your new pctFluDeaths function.

  2. For worstWeeks, define a new helper function called insertWorstWeek which will handle the part of the second loop which inserts worst weeks into the list to be returned.

    This helper function will need three parameters: the list to insert into, the (week, avg) pair to be inserted, and the limit value for the max length of the worst weeks list (the variable howMany in the worstWeeks function).

    Given these three values, it must modify the given list to include the new value (or leave it unchanged if the new value would be placed beyond the limit).

    It is fine if your function ends up making the list too long when inserting a new element, because the list will be trimmed afterwards.

    To define the new helper function, use all of the code from where it says whereToInsert = 0 to the end of that loop. Move that code into your helper function (and modify it a bit to use the helper function's parameters), and then in worstWeeks, replace the code you removed with a single call to your helper function.

  3. For buildPctTable, you will make three changes. First, replace the code that computes the value for the pct variable with a call to your pctFluDeaths helper function which you defined for epidemicWeeks. Now that we have that helper function, we should use it wherever possible.

    Second, take the entire first for loop which computes the first and last years and maximum week in the dataset, and move it into a helper function called getDataLimits. This helper function should take a flu dataset as an argument and return a three-valued tuple consisting of the maxWeek, firstYear, and lastYear values for the given dataset. This function should use the code from the first for loop, and that loop must be replaced with a call to this function.

    Third, take the second for loop which creates an empty two-dimensional array filled with None values, and turn it into a helper function called buildEmptyArray. This function should take two parameters: nrows and ncols and use those to determine the number of rows and columns in the result. You will have to modify the code slightly to use these parameters instead of the maxWeek, firstYear, and lastYear values it currently uses, and then figure out how to use those values to construct appropriate arguments for the new helper function when you use it to replace that loop within buildPctTable.

After you have made these modifications, the epidemicWeeks, worstWeeks, and buildPctTable functions should all continue to work exactly the same as before the changes. The only difference is that the code is now a bit easier to understand as it's been broken up into smaller pieces. You can test each function using the tests at the end of the helpers.py file.

You will be graded based on whether those functions continue to work, and whether you have removed the appropriate sections of code from the original functions and replaced them with a call to your helper functions. You will also be graded based on whether each of your helper functions has an appropriate docstring describing what it does.

Note that for this task, you do not need to write more than a few lines of code, because most of the code for your helper functions can be cut and pasted from the original functions.

There is no Codder testing for this task. Use the tests in the file to ensure that your modified functions still work correctly.

Task 2: Data Processing & Analysis

The ps07/temperatures.py and ps07/cities.py files contain data on global surface temperatures and on city locations and populations. Each file describes how its data is organized, but both use the same convention: a list of rows, where each row is a list where each entry holds a specific piece of data. For example, in the cities variable from ps07/cities.py, each row starts with the name of the city, so if we indexed once to get a row:

row = cities[0]

...we could then index again to get the name of the city:

name = row[0]

A for loop to print out the name of each city would look like this:

print("City names:")
for row in cities:
  print("  {}".format(row[0]))

If we know what all of the entries in a row are, we can also use multiple iteration variables to unpack the rows, like this:

print("City names:")
for name, lat, lon, pop, area, elev, coastal in cities:
  print("  {}".format(name))

In this example, we've taken advantage of the fact that we know each row will store the name, latitude, longitude, population, area, elevation, and coastal status of a city in that exact order, as specified in the comments at the top of ps07/cities.py.

You should use this pattern often while solving the problems in this task.

Each function in this task will take a dataset as an argument (possibly along with other arguments), and most will return a modified dataset. Your functions must handle any dataset with the same format: for example, although the cities variable (defined in the cities.py file) happens to contain 21 rows with Istanbul first, your functions must work for any number of city rows in any order.

Also note that you should not use print in this task (except for debugging). The testing code will use print to display the results of your functions, but the functions that you write should just return new lists.

Final note: do not attempt to open the temperatures_full.py file in Canopy: it is so large that it may cause Canopy to freeze up for a while (Canopy should eventually come unstuck but may continue to lag while editing the file). You can open this file in TextEdit or some other non-code-oriented text editor (like WordPad or Notepad on Windows) if you want to see what it contains.

Subtask A: Sorting Cities

Recall that the sorted function produces a new list, and its key= argument can be used to specify how to sort. In particular, the value of this argument should be a function (not the result of calling a function) which takes an item as an argument (in our case a row) and returns a value to be used in sorting (which can be a string, number, or a tuple to sort by multiple values in a hierarchy). Also remember that the sorted function has an optional reverse= argument which can be used to reverse the order of the result.
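As a quick illustration of the key= pattern, here is a small sketch using made-up city rows (the values below are invented for the example; the column order is the one given in the unpacking example above):

```python
# Hypothetical sample rows: name, lat, lon, pop, area, elev, coastal
sampleCities = [
    ["Istanbul", 41.01, 28.95, 15462452, 5343, 39, True],
    ["Lagos", 6.45, 3.39, 14862000, 1171, 41, True],
    ["Delhi", 28.61, 77.21, 31181000, 1484, 216, False],
]

def populationOf(row):
    """Key helper: returns the population entry of a city row."""
    return row[3]

# Note: we pass the function itself (populationOf), not a call to it.
byPopulation = sorted(sampleCities, key=populationOf, reverse=True)
```

Here byPopulation lists the cities from most to least populous; your own helpers for this subtask will follow the same shape, returning whichever entry (or tuple of entries) the sort should use.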

Using the sorted function, complete the definition of the following functions (note: for each function, you will have to define your own helper function to be used with the key= argument):

Subtask B: Processing Grids

The ps07/temperatures.py file contains global temperature data from the National Oceanic and Atmospheric Administration (NOAA) of the United States. The data organization is explained at the top of that file, and it is critical that you understand how the data is organized. To understand the data, open the file in Canopy and run it, and then try some of the following statements:

print(len(temperatures))

print(len(temperatures[0]))

print(type(temperatures[0]))

print(type(temperatures[0][0]))
print(type(temperatures[0][1]))
print(type(temperatures[0][2]))

print(len(temperatures[0][2]))
print(len(temperatures[0][2][0]))
print(type(temperatures[0][2][0]))

print(temperatures[0][2][0])
print(temperatures[0][2][1])
print(temperatures[0][2][2])

print(type(temperatures[0][2][0][0]))
print(type(temperatures[0][2][1][3]))

As you can see, there will be a lot of indexing in this part of the assignment. Before moving forward, it's a good idea to make sure you understand the structure of the various lists. Feel free to ask your classmates about this (you can discuss the structure of the data without sharing any code) and also of course consult instructors or tutors during drop-in hours if you want.

Hint: Because there are so many layers of lists here, it's good practice to use the helper functions: write one function to iterate over an outer list, which can call another function to deal with an inner list without having to worry about more than two layers of loops at once.
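For example, the two-layer pattern might look like this (the function names here are hypothetical, not ones the assignment asks for; this just shows an outer function delegating the inner-list work to a helper):

```python
def countMissingInRow(row):
    """Inner helper: how many None entries one grid row contains."""
    missing = 0
    for value in row:
        if value is None:
            missing += 1
    return missing

def countMissingInGrid(grid):
    """Outer function: total None entries across the whole grid,
    calling the inner helper once per row."""
    total = 0
    for row in grid:
        total += countMissingInRow(row)
    return total
```

Each function only ever deals with one layer of list structure, which keeps the loops easy to reason about.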

To get another perspective on what the data is like, this image was created by the NOAA based on similar data (the higher-resolution data in temperatures_full.py):

[Image: A map of the globe with red and blue colors in a grid indicating temperature difference relative to a 1981-2010 baseline. Most of North America and parts of Northwest China are a deep blue (much lower temperatures than in the past), while Eastern Europe, parts of Northeastern Russia, and Alaska are deep red (much higher temperatures than before), with the rest of the globe being a mix of pale blues and pale reds (slightly lower/higher temperatures).]

(credit: NOAA's February 2019 Global Climate Report)

Each colored block in the image is represented by a number in a grid, with gray blocks holding None values. In the simplified data in temperatures.py, each grid has just six blocks across the globe and three blocks from the North Pole to the South Pole (the image above was made from the higher-resolution data).

Explanation of Grid Data

For now, focus on the grid structure that each month of data contains. These grids are lists of lists, where each entry is a number representing the average temperature in that month for part of the world, in degrees Celsius compared against a baseline computed using data from 1971-2000. So if the number is 0.8, for example, it means that the average temperature for that part of the world was 0.8°C higher than it had been during 1971-2000 (this is called the temperature anomaly for that region). Where data is not available, the value None is used instead, and all of the functions you write must be able to handle these None values.

Note: the simplified data in temperatures.py does not contain any None values, but the full data in temperatures_full.py does. temperatures.py provides a variable testTemperatures that includes some None values so that you can test your code.

Each grid list contains three rows, each of which has six entries, with each grid cell drawn from a 60°x60° latitude/longitude area of the globe (60x3 = 180 degrees of latitude from -90 at the South Pole to 90 at the North Pole, and 60x6 = 360 degrees of longitude from 0 at the Greenwich Meridian to 360 at the same place). The latitudes and longitudes lists have lengths three and six respectively and provide the coordinate values for the center of each grid unit. So for example, the grid unit centered at 0° North and 30° East would be in the 1st column (index 0) of the 2nd row (index 1).

To access this cell given a grid (already plucked out of a row), we would write:

entry = grid[1][0]

To access the temperature anomaly for this location in March of 2018, we need the grid from the 15th row of the data (it begins in January 2017, so March 2018 is the 12 + 3 = 15th month), which could be obtained as follows (note the use of 14 for the 15th row because indices start at 0, and we know that the grid is always in column 2 of each row):

march2018Grid = temperatures[14][2]

The Grid Functions

Before writing functions to deal with the full temperature data over time, write the following functions that just process an individual grid from a single month.

Note: these functions each take a list-of-lists grid object as their argument and return a list of values.

Using these tools, we can compute the average temperature anomaly for entire horizontal or vertical regions of the world, or for the entire globe in the next subtask.
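As a rough sketch of what None-aware averaging over a grid can look like (the function names here are hypothetical illustrations, not necessarily the ones this subtask asks for):

```python
def averageIgnoringNone(values):
    """Average of the non-None entries, or None if there is no data."""
    present = [v for v in values if v is not None]
    if present == []:
        return None
    return sum(present) / len(present)

def rowAverages(grid):
    """One average per horizontal row of the grid (east-west bands)."""
    return [averageIgnoringNone(row) for row in grid]

def columnAverages(grid):
    """One average per vertical column of the grid (north-south bands)."""
    return [averageIgnoringNone([row[c] for row in grid])
            for c in range(len(grid[0]))]
```

Notice that a band made up entirely of None values averages to None rather than causing a division-by-zero error, which is the behavior your functions will need when processing the full dataset.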

Subtask C: Combining Temperature Data

For subtask C, you need to write the following data-transformation functions, using your functions from subtask B to compute averages over different dimensions of the data. Each function will be given a dataset with the same structure as the temperatures data as an argument, and should produce a dataset with a different structure as output.

Note: The datasets your functions return should still be lists of lists, but the columns that they use for their inner lists will be different for each one.

Hint: The anomaliesByLocation function is provided for you, and you can study it, but its structure differs somewhat from the other anomaliesBy functions, so don't try to imitate it exactly.

Subtask D: Analyzing and Combining the Data

For subtask D, you will write the following functions that use your work from subtask C along with the provided anomaliesByLocation function to reveal some trends in the data.

Note: In a real scientific context, statistics would be used to confirm that these trends were not just the result of random chance.

Note: A few of the functions in the ps07/analyze.py file include some extra questions if you want to think about how to actually draw conclusions from this data, although that isn't the focus of this assignment.


Task 3: Honor Code Form

As in the previous psets, your honor code submission for this pset will involve entering values for the variables in the honorcode.py file.


How to turn in this problem set