Problem Set 7 - Due Tue, Nov. 06 at 23:59 EST

Reading material and Resources

Reading

  1. Slides and notebooks from Lec 13 Dictionaries, and Lec 14 Accumulation Pattern with Lists and Dictionaries, and Lec 16 Sorting, lambda, map, and filter,
  2. Lab 08 on Dictionaries and Lab 09 on Data Extraction
  3. Think Python, Ch. 11: Dictionaries
  4. Python documentation on dictionary operations
  5. Python documentation on set operations
  6. Lab notes on "How to make a pie chart"
  7. Lab plotting examples
  8. matplotlib gallery
  9. matplotlib plotting commands
  10. matplotlib examples

About this Problem Set

This problem set will give you practice with dictionaries, which are an important data structure for the rest of the course. It will also give you practice with debugging (in the context of finding counterexamples to buggy loop programs), testing (in the context of finding differences between large dictionaries), and data analysis and visualization.

  1. In Task 1 (Individual Task) you will develop counterexamples that show why buggy solutions to a looping problem do not work.
  2. In Task 2 (Individual Task), you will write a program to find anagrams using dictionaries.
  3. In Task 3 (Partner-recommended task), you will perform data analysis and plotting on a dataset about passengers on the Titanic. Use this shared Google Doc to find a pair programming partner and record your partner. Remember that you can talk with other people about high-level problem-solving strategies, but sharing code in an Honor Code violation.

Other notes:


Task 1: Loop Debugging

Background: The correct function

On a previous midterm exam,

students were asked to write a function that takes two arguments (a pivot string and a list of strings wordlist) and returns a list of all the strings of wordlist that precede pivot in dictionary order. The words in the returned list should have the same relative order that they did in the original list.

Here is a function named correct that correctly solves this problem using the filtering pattern studied in class.

def correct(pivot, wordlist):
    '''Correct solution to Exam 1, Problem 1: a function that takes a pivot 
    string and a list of strings wordlist and returns a list that includes 
    only those elements of wordlist that precede pivot in dictionary order.
    '''
    result = []
    for word in wordlist:
        if word < pivot: # If word precedes pivot in dictionary order
            result.append(word)
    return result

Note that word < pivot is a simple way in Python to test if the string word comes before pivot in dictionary order.

Here's a sample use of the correct function:

In [1]: correct('cat', ['bunny', 'elephant', 'dog', 'bat', 'ant', 'dog', 'bat'])
Out[1]: ['bunny', 'bat', 'ant', 'bat']

Note that elements in the output list have the same relative order they did in the input list. Also note that if a string preceding the pivot (such as 'bat') occurs multiple times in the input list, it should occur multiple times in the output list.

Background: counterexamples

In this problem you will study ten incorrect programs for this problem and determine why they don't work. Incorrect programs are said to be buggy because they contain bugs = reasons why they don't work. So your goal in this problem is to track down the bugs in each of the ten programs.

One way to show that a program is buggy is to provide a counterexample, which is a particular set of inputs for which the program does not behave correctly. For a counterexample, the buggy program might return a wrong answer, or it might raise an exception when the correct program does not.

First buggy function example: buggy0a

Here is a function buggy0a that is an incorrect version of the correct function:

def buggy0a(pivot, wordlist):
    result = []
    for word in wordlist:
        if word > pivot: 
            result.append(word)
    return result

buggy0a illustrates a common bug for this function on the exam: it performs the wrong comparison between word and pivot, and so returns all the words in wordlist that come after the pivot, not before the pivot.

In the case of buggy0a, counterexamples are very easy to find. Any pivot with a nonempty wordlist list that is not just a singleton list (a list of length 1) containing just the pivot is a counterexample. For instance, ('c', ['a', 'd', 'c', 'e', 'b']) is a counterexample, because correct and buggy0 return different results for these inputs:

In [2]: correct('c', ['b', 'd', 'a', 'c', 'e'])
Out[2]: ['b', 'a']

In [3]: buggy0a('c', ['b', 'd', 'a', 'c', 'e'])
Out[3]: ['d', 'e'] # Different answer from the correct one

However, ('c', []) and ('c', ['c']) are not counterexamples, because both correct and buggy0a return the same results for these:

In [4]: correct('c', [])
Out[4]: []

In [5]: buggy0a('c', [])
Out[5]: [] # The same answer as correct

In [6]: correct('c', ['c'])
Out[6]: []

In [7]: buggy0a('c', ['c'])
Out[7]: [] # The same answer as correct

For the exam problem, we will say that a counterexample is minimal if it uses the shortest wordlist that a counterexample can have. For buggy0a, the shortest wordlist is one that contains a single string that is not the pivot. Here are two examples of minimal counterexamples for buggy0a:

In [8]: correct('c', ['a']) # Can use any string before 'c' in dictionary order. 
Out[8]: ['a']

In [9]: buggy0('c', ['a'])
Out[9]: [] # A different answer from correct

In [10]: correct('c', ['d']) # Can use any string after 'c' in dictionary order.
Out[10]: []

In [11]: buggy0('c', ['d'])
Out[11]: ['d'] # A different answer from correct

Second buggy function example: buggy0b

As a more complex buggy funtion, consider buggy0b:

def buggy0b(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        i = 0
        while i < minlen and word[i] <= pivot[i]:
            i += 1
        if i == minlen and (word[:minlen] != pivot[:minlen] or (len(word) < len(pivot))):
            result.append(word)
    return result

This function is rather complex, so to think about it, let's start by assuming that word and pivot have the same length and ignoring the condition (word[:minlen] != pivot[:minlen] or (len(word) < len(pivot))) at the end.

What does the while loop do? It counts (in the variable i) the number of consecutive times that an index in word has a letter that is less than or equal to the letter at the corresponding index in pivot. The counting stops when either i reaches minlen or word[i] is strictly greater than pivot[i]. Only if every letter in the first minlen letters of word is less than or equal to the corresponding letter in pivot will word be added to the result list. For example, if pivot is 'do', then the word 'an' satisfies this condition, because 'a' <= 'd' and 'n' <= 'o'. But the word 'at' does not satisfy this condition, since 't' is not less than or equal to 'o'.

This is problematic, since 'at' comes before 'do' in alphabetical order, but buggy0b will not include it in the result list:

In [12]: correct('do', ['an', 'at', 'do', 'ex'])
Out[12]: ['an', 'at']

In [13]: buggy0b('do', ['an', 'at', 'do', 'ex'])
Out[13]: ['an']

A minimal counterexample for buggy0b involves a list with one string that is included by correct but not by buggy0b:

In [14]: correct('do', ['at'])
Out[14]: ['at']

In [15]: buggy0b('do', ['at'])
Out[15]: []

Now let's consider some of the complexities of buggy0b that we ignored at first.

Finally, when debugging a complex function like buggy0b, it is often very helpful to add print statements to a function to better understand how it works. For example, here is a version of buggy0b augmented with numerous print statements:

def buggy0bWithPrints(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        print '-'*50
        print 'word ->', word, '; pivot ->', pivot, '; minlen ->', minlen
        i = 0
        if i < minlen: # while loop will only execute when i < minlen 
            print 'While loop: i ->', i, '; word[i] ->', word[i], '; pivot[i] ->', pivot[i],\
                 '; word[i] <= pivot[i] ->', word[i] <= pivot[i]
        while i < minlen and word[i] <= pivot[i]:
            i += 1
            if i < minlen: # while loop will only execute when i < minlen 
                print 'While loop: i ->', i, '; word[i] ->', word[i], '; pivot[i] ->', pivot[i],\
                      '; word[i] <= pivot[i] ->', word[i] <= pivot[i]
        print 'After while loop: i ->', i, '; i == minlen ->', i == minlen, ';'
        print '                  word[:minlen] != pivot[:minlen] ->', word[:minlen] != pivot[:minlen], ';'
        print '                  (len(word) < len(pivot)) ->', (len(word) < len(pivot))
        if i == minlen and (word[:minlen] != pivot[:minlen] or (len(word) < len(pivot))):
            print 'result.append ->', word
            result.append(word)
    return result

Running buggy0bWithPrints on an example provides a lot of insight into how it works:

In [18]: buggy0bWithPrints('do', ['a', 'an', 'ash', 'd', 'do', 'dog', 'egg'])
--------------------------------------------------
word -> a ; pivot -> do ; minlen -> 1
While loop: i -> 0 ; word[i] -> a ; pivot[i] -> d ; word[i] <= pivot[i] -> True
After while loop: i -> 1 ; i == minlen -> True ;
                  word[:minlen] != pivot[:minlen] -> True ;
                  (len(word) < len(pivot)) -> True
result.append -> a
--------------------------------------------------
word -> an ; pivot -> do ; minlen -> 2
While loop: i -> 0 ; word[i] -> a ; pivot[i] -> d ; word[i] <= pivot[i] -> True
While loop: i -> 1 ; word[i] -> n ; pivot[i] -> o ; word[i] <= pivot[i] -> True
After while loop: i -> 2 ; i == minlen -> True ;
                  word[:minlen] != pivot[:minlen] -> True ;
                  (len(word) < len(pivot)) -> False
result.append -> an
--------------------------------------------------
word -> ash ; pivot -> do ; minlen -> 2
While loop: i -> 0 ; word[i] -> a ; pivot[i] -> d ; word[i] <= pivot[i] -> True
While loop: i -> 1 ; word[i] -> s ; pivot[i] -> o ; word[i] <= pivot[i] -> False
After while loop: i -> 1 ; i == minlen -> False ;
                  word[:minlen] != pivot[:minlen] -> True ;
                  (len(word) < len(pivot)) -> False
--------------------------------------------------
word -> d ; pivot -> do ; minlen -> 1
While loop: i -> 0 ; word[i] -> d ; pivot[i] -> d ; word[i] <= pivot[i] -> True
After while loop: i -> 1 ; i == minlen -> True ;
                  word[:minlen] != pivot[:minlen] -> False ;
                  (len(word) < len(pivot)) -> True
result.append -> d
--------------------------------------------------
word -> do ; pivot -> do ; minlen -> 2
While loop: i -> 0 ; word[i] -> d ; pivot[i] -> d ; word[i] <= pivot[i] -> True
While loop: i -> 1 ; word[i] -> o ; pivot[i] -> o ; word[i] <= pivot[i] -> True
After while loop: i -> 2 ; i == minlen -> True ;
                  word[:minlen] != pivot[:minlen] -> False ;
                  (len(word) < len(pivot)) -> False
--------------------------------------------------
word -> dog ; pivot -> do ; minlen -> 2
While loop: i -> 0 ; word[i] -> d ; pivot[i] -> d ; word[i] <= pivot[i] -> True
While loop: i -> 1 ; word[i] -> o ; pivot[i] -> o ; word[i] <= pivot[i] -> True
After while loop: i -> 2 ; i == minlen -> True ;
                  word[:minlen] != pivot[:minlen] -> False ;
                  (len(word) < len(pivot)) -> False
--------------------------------------------------
word -> egg ; pivot -> do ; minlen -> 2
While loop: i -> 0 ; word[i] -> e ; pivot[i] -> d ; word[i] <= pivot[i] -> False
After while loop: i -> 0 ; i == minlen -> False ;
                  word[:minlen] != pivot[:minlen] -> True ;
                  (len(word) < len(pivot)) -> False
Out[18]: ['a', 'an', 'd']

Your Task

The file ps07/debugging.py contains 10 buggy implementations of the correct function, which are shown below. For each function buggyi, you should do two things:

  1. Define the variable counterExamplei to be a pair (2-tuple) of (1) a pivot string and (2) a wordlist. The pair counterExamplei should be a minimal counterexample that indicates why buggyi is incorrect.

    # Minimal counterexample for buggy0a
    counterExample0a = ('c', ['a'])
    
    # Minimal counterexample for buggy0b
    counterExample0b = ('do', ['at']) 
    
  2. Define the variable explanationi to be a triple-quoted string that briefly explains why the function is buggy. For example,

    # An explanation of why buggy0a is buggy
    explanation0a = """
    Returns all words *after* pivot rather than *before* pivot.
    """
    
    # An explanation of why buggy0b is buggy
    explanation0b = """
    Requires *each* index shared by word and pivot to contain a letter in word
    that is less than or equal to the letter at the corresponding index of pivot.
    But words that should correctly precede pivot in dictionary order and are not
    prefixes of pivot only need to have *one* index at which the letter in word
    is less than the letter in pivot (following a sequence of indices at which the
    letters are equal). So the result list excludes some words it should include. 
    """

Explanation strings should be delimited by three double-quotes, which allows them to contain multiple lines.

Notes:

The 10 Buggy Functions

The ten buggy functions can be found in the file ps07/debugging.py, which also contains definitions for the correct, buggy0a, and buggy0b function.

Below each buggyi function function is a counterExamplei variable that you should define as a pair (2-tuple) of (1) a pivot string and (2) a minimum length wordlist that serves as a counterexample relative to the pivot.

Below each buggyi function function is also a explanationi variable that you should define as a triple-quoted string that explains why buggyi is buggy.

Most of these buggy programs are based on incorrect programs that students wrote for Problem 1 on Exam 1. They have been cleaned up in various ways, but they illustrate reasoning issues that students actually exhibit in practice. Most commonly, students did not understand that strings could use < in Python to compare two strings in dictionary order, and came up with complex solutions involving loops that tested the strings letter-by-letter. It is possible to define a correct loop involving letter-by-letter comparisons, but it is rather challenging to do so, as several of the following functions indicate.

# ------------------------------------------------------------------------------
def buggy1(pivot, wordlist):
    result = []
    for word in wordlist:
        if word <= pivot: 
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy2(pivot, wordlist):
    result = []
    for word in wordlist:
        if word[0] < pivot[0]:
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy3(pivot, wordlist):
    result = []
    for word in wordlist:
        if word < pivot:
            result.append(word)
        return result

# ------------------------------------------------------------------------------                                                         
def buggy4(pivot, wordlist):
    result = []
    for word in wordlist:
        if word < pivot and word not in result:
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy5(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        for i in range(minlen):
            if word[i] < pivot[i]:
                result.append(word)            
            elif word[i] > pivot[i]:
                break # exit inner for loop if word comes after pivot
        if len(word) < len(pivot) and word == pivot[:minlen]:
            result.append(word) # Handle prefixes of pivot specially
    return result

# ------------------------------------------------------------------------------
def buggy6(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        allLess = True
        for i in range(minlen):
            if word[i] >= pivot[i]:            
                allLess = False
        if allLess:
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy7(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        i = 0
        while i < minlen and word[i] == pivot[i]:
            i += 1
        # After the while loop, either (1) i is equal to minlen or 
        # (2) i is the first index at which word and pivot differ
        if (i == minlen or word[i] < pivot[i]) and pivot != word: 
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy8(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        i = 0
        while i < minlen and word[i] == pivot[i]:
            i += 1
        # After the while loop, either (1) i is equal to minlen or 
        # (2) i is the first index at which word and pivot differ
        if i < minlen and word[i] < pivot[i]:
            result.append(word)
    return result

# ------------------------------------------------------------------------------
def buggy9(pivot, wordlist):
    result = []
    for word in wordlist:
        minlen = min(len(word), len(pivot))
        for i in range(minlen):
            if word[i] < pivot[i]:
                result.append(word)            
                break # Exit inner for loop
        if len(word) < len(pivot) and word == pivot[:minlen]:
            result.append(word) # Handle prefixes of pivot specially
    return result

# ------------------------------------------------------------------------------
def buggy10(pivot, wordlist):
    result = []
    for word in wordlist:
        for i in range(len(word)):
            if word[i] < pivot[i]:
                result.append(word)
                break # Exit the inner for loop
            elif word[i] > pivot[i]:
                break # Exit the inner for loop
            # Otherwise word[i] == pivot[i], and we continue loop
    return result

Task 2: Unjumble

This is an individual problem which you must complete on your own, though you may ask for help from the CS111 staff.

Word Jumble is a popular game that appears in many newspapers and online. The game involves "unjumbling" English words whose letters have been reordered. For instance, the jumbled word ytikt can be unjumbled to kitty. Here is one version of the game online.


You will create a Python program in unjumble.py that is able to successfully unjumble jumbled words. Here is how to produce candidate words from a jumbled word:


We have provided three lists of English words of different sizes.

Fill in the functions as described in the subtasks below. All your code should be in your unjumble.py file in the ps07 folder from the download directory.

Subtask 2a: unjumbleKey

The unjumbleKey function takes a single argument, a string, and returns a string that is the unjumble key of its input -- i.e., a string consisting of the lowercase versions of all characters of the string in alphabetical order. For example:

In [1]: unjumbleKey('argle')
Out[1]: 'aeglr'

In [2]: unjumbleKey('regal')
Out[2]: 'aeglr'

In [3]: unjumbleKey('Star')
Out[3]: 'arst'

In [4]: unjumbleKey('HISTRIONICS')
Out[4]: 'chiiinorsst'

Notes

Subtask 2b: makeUnjumbleDictionary

The makeUnjumbleDictionary function takes a single argument, a list of words, and returns a dictionary that associates the unjumble key of every word in the word list with the set of all words with that unjumble key. All words in the same set will be anagrams -- i.e., words that all have exactly the same letters (including repeated ones), but in different orders.

unjumble.py imports four word lists: tinyWordList (33 words), smallWordList (1,159 words), mediumWordList(45,423 words), andlargeWordList(445,528 words). TestmakeUnjumbleDictionary` on these lists.

In [1]: tinyDict = makeUnjumbleDictionary(tinyWordList)

In [2]: tinyDict
Out[2]:
{'acerst': {'caster', 'caters', 'crates', 'reacts', 'recast', 'traces'},
 'aegilnrt': {'alerting', 'altering', 'integral', 'relating', 'triangle'},
 'aeglr': {'glare', 'lager', 'large', 'regal'},
 'aeinrrst': {'restrain', 'retains', 'strainer', 'terrains', 'trainers'},
 'arst': {'arts', 'rats', 'star', 'tars', 'tsar'},
 'chiiinorsst': {'histrionics', 'trichinosis'},
 'opst': {'opts', 'post', 'pots', 'spot', 'stop', 'tops'}}

In [3]: smallDict = makeUnjumbleDictionary(smallWordList)

In [4]: print 'len of smallDict:', len(smallDict)
len of smallDict: 483

In [5]: mediumDict = makeUnjumbleDictionary(mediumWordList)

In [6]: print 'len of mediumDict:', len(mediumDict)
len of mediumDict: 42271

In [7]: largeDict = makeUnjumbleDictionary(largeWordList)

In [8]: print 'len of largeDct:', len(largeDict)
len of largeDict: 388509

Notes on Sets

The dictionary values in this problem are sets, not lists. Like lists, sets are mutable collections. Unlike lists, they are unordered and cannot contain duplicate elements. Below are some notes on creating and using sets. See this Python documentation for more details on sets manipulation.

Other Notes

Subtask 2c: unjumble

The unjumble function takes an unjumble dictionary (of the type created by makeUnjumbleDictionary) and a string (that doesn't need to be a meaningful word), and returns a set of all words in the dictionary to which the input string unjumbles. If there are no such words, it returns the empty set. For example,

In [1]: tinyDict = makeUnjumbleDictionary(tinyWordList)

In [2]: unjumble(tinyDict, 'esrat')
Out[2]: set()

In [3]: print unjumble(tinyDict, 'esrat')
set([])

In [4]: mediumDict = makeUnjumbleDictionary(mediumWordList)

In [5]: unjumble(mediumDict, 'esrat')
Out[5]: {'aster', 'rates', 'stare', 'tears'}

In [6]: print unjumble(mediumDict, 'esrat')
set(['stare', 'rates', 'tears', 'aster'])

In [7]: largeDict = makeUnjumbleDictionary(largeWordList)

In [8]: unjumble(largeDict, 'esrat')
Out[8]: {'arest', 'arets', 'aster', 'astre', 'earst', 'rates', 'reast', 'resat', 
'serta', 'stare', 'stear', 'strae', 'tares', 'tarse', 'taser', 'tears', 'teras'}

For fun, use your unjumble function as an assistant in playing the online jumble game

Notes

Subtask 2d: mostAnagrams

Which words have the most anagrams? Define a function named mostAnagrams that takes an unjumble dictionary (of the type created by makeUnjumbleDictionary) and returns the largest set of anagrams in that dictionary.

In [1]: tinyDict = makeUnjumbleDictionary(tinyWordList)

In [2]: mostAnagrams(tinyDict)
Out[2]: {'opts', 'post', 'pots', 'spot', 'stop', 'tops'}
# This is one of two correct results for mostAnagrams(tinyDict). 
# The other is {'caster', 'caters', 'crates', 'reacts', 'recast', 'traces'}

In [3]: mediumDict = makeUnjumbleDictionary(mediumWordList)

In [4]: mostAnagrams(mediumDict)
Out[4]: {'pares', 'parse', 'pears', 'rapes', 'reaps', 'spare', 'spear'}

In [5]: mostAnagrams(largeDict)
Out[5]: {'arest',
 'arets',
 'aster',
 'astre',
 'earst',
 'rates',
 'reast',
 'resat',
 'serta',
 'stare',
 'stear',
 'strae',
 'tares',
 'tarse',
 'taser',
 'tears',
 'teras'}

Notes

Subtask 2e: compareUnjumbleDicts

Motivation

One downside to using Codder for testing code is that it doesn't give you practice with writing your own testing code, which is an important real-world programming skill. For example, if you go on to take CS230 Data Structures, in that course no testing code will be provided; you will be expected to write all of your testing code from scratch by yourself!

This problem is designed to give you some practice with writing and using your own version of the kind of testing code used in Codder. In particular, you will address the following problem: how do you know whether or not the dictionaries created by makeUnjumbleDictionary in Subtask 2b are correct?

To simplify things, we provide you with with the correct unjumbler dictionaries tinyDictCorrect, smallDictCorrect mediumDictCorrect, and largeDictCorrect. Then your task becomes comparing the correct dictionary with your dictionary that is returned by your makeUnjumbleDictionary.

It turns out it is trivial in Python to compare two dictionaries dict1 and dict2 for equality: just use ==. So dict1 == dict2 is True if and only if:

  1. the dictionaries have exactly the same keys; and

  2. the values at each of the keys is the same (as determined by ==). In other words, it is the case that for all keys k in the dictionary, dict1[k] == dict2[k].

But when two unjumble dictionaries differ, you don't just want to know that they differ, you want to know how they differ, because this provides valuable information on possible bugs in makeUnjumbleDictionary. After all, you would be upset if Codder just reported that two huge dictionaries were different, but didn't give any indication of ways in which they differed.

Your Task

In this task, you will flesh out a function compareUnjumbleDicts that takes two arguments:

  1. the expected (correct) unjumble dictionary

  2. an actual unjumble dictionary returned by makeUnjumbleDictionary

This function should print out information highlighting any differences between the dictionaries, as detailed below.

When it is done, this function should also return True if the dictionaries are the same and False if they differ.

Here is the information that compareUnjumbleDicts should print:

1. Number of Keys:

If the number of keys in the two dictionaries differs, a message like this should be printed (where of course the correct number of keys should be used):

The expected dict has 483 keys but the actual dict has 509 keys.

But if the dictionaries have the same number of keys, a message like this should be printed:

Both the expected and actual dicts have 483 keys.

2. Differences in Keys:

If the two sets of keys are exactly the same, then this message should be printed:

Both the expected and actual dicts have exactly the same keys.

But if there are any expected keys that are not found in the actual dict, a message like this should be printed, which should list up to the first five such keys (in sorted order):

Some keys in the expected dict but not the actual dict:
["'aehor", "'aehos", "'aelory", "'aillnosuv", "'beinor"]

Additionally if there are any actual keys that are not found in the expected dict, a message like this should be printed, which should list up to the first five such keys (in sorted order):

Some keys in the actual dict but not the expected one:
["'boeinr", "'coellnno", "'connoor", "'doell", "'doellnno"]

3. Differences in Values:

If there are any keys in both dictionaries at which the two unjumble dictionaries have different values, then a message like the following should be printed for they first such key in sorted order:

Both dicts have key "abel", but the sets at this key differ:
* the expected set has sorted values
  ['Abel', 'Bela', 'Elba', 'abel', 'able', 'albe', 'bael', 'bale', 'beal', 'bela', 'blae', 'blea']
* the actual set has sorted values
  ['abel', 'able', 'albe', 'bael', 'bale', 'beal', 'bela', 'blae', 'blea']

But if the dictionaries have exactly the same values at all keys that they share, then this message should be printed

Both dicts have exactly the same sets for all keys that are in both dicts.

More Set Operations

Implementing the above behavior is simplified by using set operations on the keys and values in the unjumble dictionary. Here we illustrate these opertions in the context of some sample sets:

In [1]: s1 = {'cat', 'ferret', 'ant', 'dog'}

In [2]: s2 = {'bunny', 'dog', 'cat', 'emu'}

Even though sets are inherently unordered, Canopy will show their elements in sorted order when they are the result of an expression in the interaction pane:

In [3]: s1
Out[3]: {'cat', 'ferret', 'ant', 'dog'}

However, when print is used on a set in Canopy, the order of the elements is unpredictable:

In [4]: print(s1)
set(['ant', 'ferret', 'dog', 'cat'])

The union of sets s1 and s2, written s1.union(s2) or s1 | s2, returns a new set that has all the elements that are in either set:

In [5]: s1.union(s2)
Out[5]: {'ant', 'bunny', 'cat', 'dog', 'emu', 'ferret'}

In [6]: s1 | s2
Out[6]: {'ant', 'bunny', 'cat', 'dog', 'emu', 'ferret'}

The intersection of sets s1 and s2, written s1.intersection(s2) or s1 & s2, returns a new set that has all the elements that are in both sets:

In [7]: s1.intersection(s2)
Out[7]: {'cat', 'dog'}

In [8]: s1 & s2
Out[8]: {'cat', 'dog'}

The difference of sets s1 and s2, written s1.difference(s2) or s1 - s2, returns a all the elements that are in the first set but not the second:

In [9]: s1.difference(s2)
Out[9]: {'ant', 'ferret'}

In [10]: s1 - s2
Out[10]: {'ant', 'ferret'}

In [11]: s2.difference(s1)
Out[11]: {'bunny', 'emu'}

In [12]: s2 - s1
Out[12]: {'bunny', 'emu'}

Sets are tested for equality using ==. The order in which elemets are written in a set is irrelevant for determining equality

In [13]: s1 == s2
Out[13]: False

In [14]: {'cat', 'bunny', 'dog', 'ant'} == {'dog', 'ant', 'cat', 'bunny'} 
Out[14]: True

A set can be created from a list by use the set function, and a list can be create from a set via the list function:

In [15]: set(['cat', 'emu', 'ant'])
Out[15]: {'ant', 'cat', 'emu'}

In [16]: set(['emu', 'cat', 'emu', 'ant', 'emu', 'cat'])
Out[16]: {'ant', 'cat', 'emu'}

In [17]: list({'cat', 'bunny', 'dog', 'ant'})
Out[17]: ['ant', 'cat', 'dog', 'bunny']

Note that set removes duplicates from a list, and the order of elements returned by list from a set is unpredicatable.

Testing compareUnjumbleDicts

unjumble.py imports six versions of a small dictionary named smallDict1 through smallDict6. Some of these are the same as the correct dictionary smallDictCorrect, while others differ from the correct dictionary in their keys and/or values.

You can test your compareUnjumbleDicts on these test dictionaries. For example:


In [18]: from unjumbleDicts import*  # Need this line to import unjumble dictionaries. 

In [19]: compareUnjumbleDicts(smallDictCorrect, smallDictCorrect)

Both the expected and actual dicts have 483 keys.

Both the expected and actual dicts have exactly the same keys.

Both dicts have exactly the same sets for all keys that are in both dicts.

Out[19]: True

In [20] compareUnjumbleDicts(smallDictCorrect, smallDict1)

The expected dict has 483 keys but the actual dict has 509 keys.

Some keys in the expected dict but not the actual dict:
["'aehor", "'aehos", "'aelory", "'aillnosuv", "'beinor"]

Some keys in the actual dict but not the expected one:
["'boeinr", "'coellnno", "'connoor", "'doell", "'doellnno"]

Both dicts have key "abel", but the sets at this key differ:
* the expected set has sorted values
  ['Abel', 'Bela', 'Elba', 'abel', 'able', 'albe', 'bael', 'bale', 'beal', 'bela', 'blae', 'blea']
* the actual set has sorted values
  ['abel', 'able', 'albe', 'bael', 'bale', 'beal', 'bela', 'blae', 'blea']

Out[20]: False

In unjumble.py, the if __name__ == '__main__': block contains the following code that you can uncomment and run to test all six test dictionaries.

    def testCompareUnjumbleDicts(testCases):
        for (actualName, actualDict, expectedName, expectedDict) in testCases:
            print('-'*60)
            print('compareUnjumbleDicts({}, {})'.format(expectedName, actualName))
            result = compareUnjumbleDicts(expectedDict, actualDict)
            print('compareUnjumbleDicts returns {}'.format(result))

    smallDictTests = [
        ('smallDict1', smallDict1, 'smallDictCorrect', smallDictCorrect),
        ('smallDict2', smallDict2, 'smallDictCorrect', smallDictCorrect),
        ('smallDict3', smallDict3, 'smallDictCorrect', smallDictCorrect),
        ('smallDict4', smallDict4, 'smallDictCorrect', smallDictCorrect),
        ('smallDict5', smallDict5, 'smallDictCorrect', smallDictCorrect),
        ('smallDict6', smallDict6, 'smallDictCorrect', smallDictCorrect),
        ]

    testCompareUnjumbleDicts(smallDictTests)

Using compareUnjumbleDicts to test your makeUnjumbleDictionary

The whole point of defining compareUnjumbleDicts is to test if the dictioaries returned by your makeUnjumbleDictionary function are correct.

Let's assume that you first uncomment the following lines from the if __name__ == '__main__': block in unjumble.py:

    tinyDict = makeUnjumbleDictionary(tinyWordList)
    smallDict = makeUnjumbleDictionary(smallWordList)
    mediumDict = makeUnjumbleDictionary(mediumWordList)
    largeDict = makeUnjumbleDictionary(largeWordList)

    print 'loading unjumbleDictionaries for testing ...'
    from unjumbleDicts import *

Then you can uncomment the following lines to compare the expected dictionaries against your actual ones:

    testCompareUnjumbleDicts(smallDictTests)

    myDictTests = [
        ('tinyDict', tinyDict, 'tinyDictCorrect', tinyDictCorrect),
        ('smallDict', smallDict, 'smallDictCorrect', smallDictCorrect),
        ('mediumDict', mediumDict, 'mediumDictCorrect', mediumDictCorrect),
        ('largeDict', largeDict, 'largeDictCorrect', largeDictCorrect),
        ]

Task 3: Titanic Analysis

This task is a Partner-recommended problem in which it is recommended that you work with a partner as part of a two-person team. Use this shared Google Doc to find a pair programming partner and record your partner.

Background

In this task, you will analyze data about passengers from the Titanic, a large ship that tragically sank in the waters of the North Atlantic Ocean in 1912.

Passenger data can be found in the titanicdata.txt file provided in the ps07 folder. Study the structure of the data in titanicdata.txt. Then, study the code in the readtitanic.py file, which reads the contents of titanicdata.txt and transforms it into a list of dictionaries, where every list element is a passenger represented by a dictionary. Because the information for the passengers is incomplete, not all passenger dictionaries have the same keys. Try the examples below in your interactive console after running readtitanic.py. Notice how passenger 2168 has six keys in the dictionary, but passenger 2203 has only four keys.

In [25]: len(passengerList)
Out[25]: 2208

In[26]: passengerList[2168]
Out[26]:
{'age': '31.0',
 'class': '1st Class',
 'group': 'Servant',
 'job': 'Personal Maid',
 'name': 'Miss Helen Alice Wilson',
 'status': 'survivor'}

In[27]: passsengerList[2203]
Out[27]:
{'age': '22.0',
 'class': '3rd Class',
 'name': 'Mr Mapriededer Zakarian',
 'status': 'victim'}

There is even a passenger who has only three keys in the dictionary. UNGRADED CHALLENGE: Can you find this passenger using a single line list comprehension statement? UNGRADED EXTRA CHALLENGE: Can you find both this passenger and his index in the list with one line of code?

Open up titanic.py in your ps07 folder. This file imports the passengerList variable from readtitanic.py. You will define the functions described in the subtasks below in titanic.py.

Subtask 3a createSurvivalDictionary

In this task, you will define a createSurvivalDictionary function that takes a list of passenger dictionaries (such as passengerList) and creates a dictionary that keeps track of the number of survivors and victims in every cabin class, as well as the calculated survival rate by cabin class. Below is the contract of the function you'll need to write. (The header is already in titanic.py).

def createSurvivalDictionary(passengers):
   """Given a list of passenger dictionaries, return a dictionary
   whose keys are all the cabin classes that appear in passengers.
   The value associated with each cabin class name should
   itself be another dictionary that has three key/value pairs:
     (1) The key "survivors" maps to the number of survivors in
         the cabin class;
     (2) The key "victims" maps to the number of victims in the
         cabin class;
     (3) The key "survivalRate" maps to the survival rate in the
         cabin class (a floating point number rounded to 3 decimal digits)
    """

Testing:

In[30]: createSurvivalDictionary(passengerList)
Out[30]:
{'1st Class': {'survivalRate': 0.62, 'survivors': 201, 'victims': 123},
'2nd Class': {'survivalRate': 0.418, 'survivors': 119, 'victims': 166},
'3rd Class': {'survivalRate': 0.254, 'survivors': 180, 'victims': 528},
'A la Carte': {'survivalRate': 0.043, 'survivors': 3, 'victims': 66},
'Deck': {'survivalRate': 0.652, 'survivors': 43, 'victims': 23},
'Engine': {'survivalRate': 0.222, 'survivors': 72, 'victims': 253},
'Victualling': {'survivalRate': 0.218, 'survivors': 94, 'victims': 337}}

Algorithm Notes

Here we sketch two slightly different algorithms for solving this subtask. We recommend that you implement whichever one you find more intuitive. (And if you prefer yet another approach that works, that's fine, too.)

Algorithm 1

  1. Find all the cabin classes (using the function set). Hint, accumulate all the cabin classes, then apply set.

    • NOTE: if you feel comfortable with list comprehension, it's a great place to use it here.

    • DO NOT hard-code the keys by looking up what the cabin classes are. Imagine that we give you another data set like this, for example about the sinking of SS Hong Mo, where the cabin names are different.

  2. Make a survival dictionary that associates each cabin class with a mini-dictionary for cabins , {'survivors': 0, 'victims': 0}.

    • DO NOT create a single mini-dictionary miniDct = {'survivors': 0, 'victims': 0}, and then use the variable miniDict to map it to every key. This will give incorrect results due to the aliasing issues we've been warning you about, because all of your keys will be pointing to the same dictionary value.

      Instead, within a for loop assign the value {'survivors': 0, 'victims': 0} directly to each cabin class (from the set you generated in step 1.)

  3. Iterate through the passenger list, updating the mini-dictionary for the cabin class of each passenger.

  4. Once all the passengers have been processed, the correct survivalRate value can be assigned to each mini-dictionary based on the number of survivors and victims (adding a new key:value item to the existing mini-dicts).

Algorithm 2: Algorithm 1 iterates through all passengers in the passenger list twice (once to find the names of the cabin classes and create mini-dictionaries for each class; and a second time to populate the mini-dictionaries). Algorithm 2 iterates through all passengers only once, creating a mini-dictionary for each cabin class the first time it encounters that cabin class.

  1. Create an initially empty survival dictionary that will eventually associate each cabin class with a mini-dictionary for cabins.

  2. Iterate through the passenger list, performing the following substeps for each passenger:

    a. Determine the cabin class of the passenger.

    b. If this cabin class does not already exists in the survival dictionary, associate it in that dictionary with a new mini-dictionary {'survivors': 0, 'victims': 0}

    c. Update mini-dict for cabin class with survival status of passenger.

  3. Once all the passengers have been processed, the correct survivalRate value can be assigned to each mini-dictionary based on the number of survivors and victims (adding a new key:value item to the existing mini-dicts).

Other Notes

Subtask 3b: Bar Chart of Survival Rates by Class

In this subtask, you will generate this horizontal bar chart showing the percentage of survivors by cabin class, in decreasing order by survival rate:

To generate this bar chart, you will flesh out the function skeleton barChartOfSurvivalRates in titanic.py:

def barChartOfSurvivalRates(passengers):
    """Given a list of passenger dictionaries, displays a horizontal bar chart
    of the sorted percentage of survival rates for each cabin class.
    A percentage value is a floating point number between 0.0 and 1.0.
    """
    plt.figure(figsize=(6,4)) # Don't change this line                                        

    # flesh this out, writing all your plotting code here.                                    

    # do not change the following two lines:
    plt.tight_layout()
    plt.savefig('survivalBarChart.png') # saves plot in file                                  

    # Be sure to comment out following testing line before submitting!                        
    plt.show()

Notes:

Subtask 3c: Pie Chart of Victims' Cabin Classes

Implement the function pieChartOfVictimsCabinClasses which will create this pie chart showing the relative number of Titanic victims from each cabin class:

To generate this bar chart, you will flesh out the function skeleton pieChartOfVictimsCabinClasses in titanic.py:

def pieChartOfVictimsCabinClasses(passengers):
    """Given a list of passenger dictionaries, displays a pie chart
    showing the relative number of victims from each cabin class.
    Pie slices should be labeled with the cabin names, and ordered
    clockwise from smallest pie slice to largest pie slice
    starting at 12 o'clock.
    """

    """Given a list of passenger dictionaries, displays a pie chart showing                   
    the relative number of victims from each cabin class.                                     
    """
    plt.figure(figsize=(6,6), facecolor='white') # Don't change this line                     

    # flesh this out, writing all your plotting code here.                                    

    # do not change the following line:                                                       
    plt.savefig('cabinClassesPieChart.png') # saves plot in file                              

    # Be sure to comment out following testing line before submitting!                        
    plt.show()

Notes:


Task 4: Honor Code Form

As in the previous psets, your honor code submission for this pset will involve defining entering values for the variables in the honorcode.py file.


How to turn in this Problem Set

Softcopy (electronic) submission

Hardcopy (paper) submission

There is no hardcopy submission. Problem set submission is entirely electronic.