Problem Set 11 - Due Friday, May 10 at 4:30 p.m.


  1. Lecture slides and notebook: File Trees
  2. Lecture slides and notebook: Data Visualization
  3. Lab notes and exercises: File Traversal and Data Visualization

About this Problem Set

This problem set is intended to give you practice with recursive file traversal and data visualization.

This assignment will not have a reflection or a quiz. 50 points on this assignment are allocated to the code in Task 1, and 50 points are allocated to the code in Task 2.


Task 1: File Tree Sizes

This task is an individual task.

This task involves defining fruitful recursive functions that return information from a file tree. Before starting this task, you should review terminology and examples from Lec Recursive File Traversal. The material from lab 12 may also be useful.


All code for this task must be written in the provided file in the ps11 folder.

The ps11 directory contains a subdirectory testfiles that you will use for examples. This testfiles directory is similar to the testdir directory used in the slides and notebook of Lec File Trees.

Below is an indented list showing the entire file tree rooted at testfiles. The total size of all non-hidden files in this tree is 3000 bytes. The so-called hidden files are files whose names begins with a dot. These hidden files (highlighted below) are often not shown when displaying the contents of a directory, and such hidden files must be ignored by the functions you define. This is especially important, because some operating systems add hidden files to all directories (on Mac OS, every directory gets a hidden .DS_Store file).

The os module supplies a function os.path.getsize that returns the size of a file in bytes. (You can think of "one byte" as the size of a single character, but things are actually a bit more complex than that.) For example:

In [1]: import os.path

In [2]: os.path.getsize('testfiles/hollow/.hidden')
Out[2]: 1000

In [3]: os.path.getsize('testfiles/html/img/100.png')
Out[3]: 995

In [4]: os.path.getsize('testfiles/txt/hello.txt')
Out[4]: 180

The above outputs indicated that the files .hidden, 100.png, and hello.txt have sizes of 1000, 995, and 180 bytes, respectively.

When os.path.getsize is called on a directory, it returns the size of bookkeeping information associated with the directory that is unrelated to the sizes of files contained in the directory.

In [5]: os.path.getsize('testfiles/html')
Out[5]: 204

In this task, you must ignore the size of bookkeeping information associated with a directory.

Also, you must ignore hidden files entirely (use the provided isHidden function to check if a file counts as a hidden file).

Finally, instead of using Codder, you will be writing your own test cases and testing functions for this task.

Specification: fileTreeSize

For the first few subtasks, you will work on a function called fileTreeSize that takes as its single argument the pathname (a string) of a file or directory relative to the current working directory.

For non-directory files, fileTreeSize returns the same size as os.path.getsize (unless the file is a hidden file, in which case it returns 0).


In [1]: fileTreeSize('testfiles/html/img/100.png')
Out[1]: 995

In [2]: fileTreeSize('testfiles/txt/hello.txt')
Out[2]: 180

In [3]: fileTreeSize('testfiles/hollow/.hidden')
Out[3]: 0

For directories, fileTreeSize returns the sum of the sizes of the non-hidden non-directory files in the tree rooted at the directory:

In [4]: fileTreeSize('testfiles/txt')
Out[4]: 300 # 180 + 120

Note that the size of the bookkeeping information for testfiles/txt is ignored by fileTreeSize, as is the size of any hidden files it encounters.

As another example, consider the size of the whole testfiles directory:

In [5]: os.listdir('testfiles')
['.secret', # hidden file; ignore
'ballast.txt', 'html', 'hollow', 'py', 'txt']

In [6]: [fileTreeSize(os.path.join('testfiles', f)) for f in
   ...:   ['ballast.txt', 'html', 'hollow', 'py','txt']]
Out[6]: [250, 2300, 0, 150, 300] # sizes of 5 non-hidden 
                                  # children of testfiles

In [7]: fileTreeSize('testfiles')
Out[7]: 3000 # 250 + 2300 + 0 + 150 + 300

Subtask 1a: sizeTestCases

Before starting to write the fileTreeSize function, we will write a testing function, so that as we are working on the actual function, we will have some feedback to help us debug it. To start with, define a set of test cases in the sizeTestCases list in

Each test case should be a pair of input (a string containing a file path) and expected output (an integer) for the fileTreeSize function. For instance, according to the examples given above, the tuple ('testfiles', 3000) would be a valid test case, because if the input is the string 'testfiles', the correct output would be 3000.

Use the os.path.getsize function in Canopy's interactive window to find the sizes of various files in the testfiles directory. With that information, add at least the following test cases to the sizeTestCases list:

Subtask 1b: testFileTreeSize

Now that you have some test cases, define the testing function testFileTreeSize that will run all of your test cases and report whether the fileTreeSize function works for each one.

The output of this function should look exactly like this (with more lines where the ... is):

Testing fileTreeSize...
Test case 'testfiles/html/img/100.png' passed!
Test case 'testfiles/txt/hello.txt' passed!

If a test case fails, the output for that test case must look exactly like this (and output for all other test cases should still be produced):

Test case 'testfiles/txt' FAILED:
Output was 370, but should have been 300.


Subtask 1c: fileTreeSize

Now that you have test cases and a testing function, complete the definition of fileTreeSize according to the specification above. You'll know it is working if all of your test cases pass.


Specification: fileTreeList

For the next few subtasks, you will work on a function fileTreeList that takes as its single argument the pathname (a string) of a file or directory relative to the current working directory. It must return a list of all the pathnames (relative to the current working directory) of the non-hidden files and directories in the file tree rooted at the given pathname. The order of these pathnames should be similar to the order in which the files are printed by printFileTree in slide 17 in the Lec File Tree slides except that a directory should come right after all its descendants rather than right before all its descendants. In particular, files or subdirectories in a directory should be listed in alphabetical order.


In [8]: fileTreeList('testfiles/txt/hello.txt')
Out[8]: ['testfiles/txt/hello.txt']

In [9]: fileTreeList('testfiles/txt')

In [10]: fileTreeList('testfiles')

Subtask 1d: listTestCases

As with fileTreeSize, before working on fileTreeList directly, you will start by defining test cases in the listTestCases list. Each test case will be a tuple containing a string as the first element, and a list of strings as the second element. The examples above can each be made into test cases. In total, you must define at least the following test cases for fileTreeSize:

Note that in your test cases, files (or directories) in each expected output list must appear in alphabetical order (alphabetical by outermost directory, then inner directories, and so on). For example, given a directory structure:

The ordering of entries for the top-level directory 'stuff' would be:


Subtask 1e: testFileTreeList

Now that you have test cases, define the testFileTreeList testing function, which will do the same thing that testFileTreeSize did: it will run fileTreeList on each of the test cases in listTestCases, and print a message about whether the test passed or failed. The output should look the same, except in the case of failed tests, which will look like this:

Test case 'testfiles/hollow' FAILED:
Output was:
['testfiles/hollow/.hidden', 'testfiles/hollow']
But it should have been:

Or for longer output:

Test case 'testfiles/txt' FAILED:
Output was:
But it should have been:
['testfiles/txt/hello.txt', 'testfiles/txt/world.txt', 'testfiles/txt']

Do not worry about formatting the expected and actual output values: we have provided a function called printNice which you can use to do that. You just have to print the appropriate messages, like 'Output was:' or 'But it should have been:' and then call printNice with the appropriate value.

Subtask 1f: fileTreeList

For the final part of Task 1, you must implement the fileTreeList function. Since you already have test cases and a testing function, you will be able to see whether your implementation is working as you create it. Refer to the specification and examples above for how it should behave.


Task 2: Stargazing

This task is a partner-recommended problem in which it is strongly recommended that you work with a partner as part of a two-person team. Use this shared Google Doc to find a partner for this pset.

Material from the lecture on data visualization and lab 12 will be useful for this task.

This task requires you to read data from a file and write code using matplotlib to produce graphs of it. You should write all of your code in the file.

All of the subtasks are independent: a Python version of the data is provided in case you cannot get your file loading code to work properly.

Subtask 2a: loadStars

Data about stars is provided in the .csv files in the stars directory of the starter code. Each .csv file provides the same information about different groups of stars. Your task is to define a loadStars function that when given the filename of one of these files, can load the information from the file into a Python list of dictionaries.

For example, the neighbors.csv file contains data for just three stars:

α Centauri A,0.01,4.37
α Centauri B,1.33,4.37
Proxima Centauri,11.13,4.244

Your code should, given the string 'stars/neighbors.csv' as an argument, produce the following list of dictionaries (keys are shown in order here for clarity, but will not be ordered in Python):

  { "name": "α Centauri A", "brightness": 0.01, "distance": 4.37 },
  { "name": "α Centauri B", "brightness": 1.33, "distance": 4.37 },
  { "name": "Proxima Centauri", "brightness": 11.13, "distance": 4.244 },

Your code should work for any of the files in the stars/ directory (but do not worry if non-Latin characters do not work perfectly on your system).


Subtask 2b: compareStars

For this subtask, you will write a function compareStars, which will take as an argument a list of dictionaries in the format produced by loadStars from Subtask 2a. compareStars must create and display a bar graph of the brightness of each star in the dictionary, sorted by brightness and labeled by name.

This is complicated by the fact that the original data stores brightness values as "magnitudes" where smaller numbers are brighter, but the graph must use units of microlux (μlx), where a larger number is brighter. We have supplied a magnitudeToLux conversion function that can convert the magnitudes from the original data directly into microlux values. Both when sorting the data and when displaying it, you should use converted values.

Another complication is that the star names are quite long, so when graphing lots of stars, they will overlap on the x-axis. To avoid this, you must rotate the x labels (created using xticks) by 90 degrees so they are vertical instead of horizontal (use the rotate= keyword argument; see the xticks documentation and the rotation entry in the text properties documentation).



You can uncomment the testCompareStars function in the __main__ block at the bottom of the file to test your code. The graphs that are displayed should look like this:

First graph:
Comparison of brightness between stars in the α Centauri system.
Second graph:
Comparison of brightness between stars in the Pleiades.

Note: as long as your bars are the right heights, if the exact values that matplotlib uses for the y-axis ticks are not the same as this image that's fine.

Subtask 2c: scatterStars

For this subtask, you will write a function scatterStars, which will create and display a scatter plot of a group of stars with apparent brightness on the x-axis and absolute brightness on the y-axis.


As mentioned in Subtask 2b, the star data files store brightness using a magnitude scale where lower numbers are brighter. These magnitude numbers use a logarithmic scale, where a star with magnitude 6 is 100x as bright as a star with magnitude 1, and similarly a star with magnitude 11 is 100x as bright as a star with magnitude 6.

Although you do not need to understand the details of this scale (we provide a conversion function magnitudeToLux to convert into units of microlux, where larger numbers are brighter in a linear scale), an important detail is that the magnitudes in the data files are apparent magnitudes: they record how bright the star appears to us on Earth. However, two stars that have very different brightnesses may appear the same on Earth if their relative distance to Earth is also different: the further a star from Earth, the dimmer it appears to us, no matter how bright it actually is.

Because of this, we provide a helper function absoluteMagnitude which takes two arguments: a brightness value (expressed in magnitude, as in the original data), and a distance value (expressed in light-years, also as in the original data). This function returns the absolute magnitude (still expressed as a magnitude value) of the star given its apparent magnitude and distance from Earth (the absolute magnitude measures how bright the star would look if it were 10 parsecs from Earth). As an example, the Sun (which is much closer than 10 parsecs from Earth), has an apparent brightness value of -26.7, but an absolute brightness value of +4.7.

In this subtask, your goal is to graph the apparent and absolute magnitudes of each star in the same graph, using a scatterplot. The plot will include one point for each star, with the x-value being the apparent brightness (converted to microlux), and the y-value being the absolute brightness (also converted to microlux). This graph will allow you to see how different stars in a constellation relate to each other in terms of both how bright they appear on Earth and how bright they actually are.


Your scatterStars function must produce a scatterplot with apparent brightness (in microlux) on the x-axis and absolute brightness (also in microlux) on the y-axis. Use the provided absoluteMagnitude and magnitudeToLux conversion functions to generate the required x- and y- data lists from the raw data given as the argument to scatterStars.

Besides plotting each star as a point, you must use the provided annotate function to label each point with the name of the star (do not worry if labels overlap on some graphs). To do this, you must use a for loop after calling the scatter function that builds the plot, and you must call annotate within that for loop.

The annotate function accepts a label and x/y values and adds text to the graph at that location (see the example plots in lab 12 for an example of how to manage annotations, but use the provided annotate function instead of the plt.annotate matplotlib function).



You can uncomment the testScatterStars invocation in the __main__ block to test your function. It should produce a single plot that looks like this:

A scatter plot of the stars in the Pleiades.

Note: as long as your points are in the right places, if the exact values that matplotlib uses for the x- or y-axis ticks are not the same as this image that's fine.

OPTIONAL: If you have time, compare this scatter plot to those that you get from plotting other constellations, like Orion. What is the significance of the fact that the stars in the Pleiades are all quite close to being on the same diagonal line in the graph?

Task 3: Honor Code File

As in the previous psets, your honor code submission for this pset will involve defining entering values for the variables in the file.

How to turn in this Problem Set

Soft-copy submission

Hard-copy submission

There is no hard copy submission. Problem set submission is entirely electronic.