Problem Set 10 - Due Thursday, May 4 at 4:00pm EST

Reading

  1. Lecture slides and notebook: Lec 19 Python Programming with First-class Functions
  2. Lab notes and exercises: Lab 12 Functional Programming
  3. Lecture slides and notebook: Lec 20 Objects, Classes, and Inheritance
  4. Lecture slides and code folder: Lec 21 Animation
  5. Lab notes and exercises: Lab 13 Animation

About this Problem Set

This problem set is intended to give you practice with higher-order functions, and with classes, objects and inheritance in the context of animation.

This assignment will not have a reflection or a quiz. All points on this assignment will be allocated to the code.

There are two tasks in this pset.

  1. In Task 1 (Individual Task) you will compute word frequencies to characterize texts and generate random sentences in similar style.

  2. In Task 2 (Partner Task) you will create an open-ended animation using the animation framework presented in class. Use this shared Google Doc to find a pair programming partner.

    • Please DO NOT post to the cs111-spring17 Google group looking for a partner; use the Google Doc instead.

    • Remember that you can work with the same partner a maximum of TWICE this semester.

    • We will deduct 2 points per day from any student who has not recorded their name by the end of Fri, Apr 28.

Carefully study the lecture and lab materials, which will help you do the two tasks.

The CS111 Problem Set Guide gives an overview of psets, including a detailed description of individual and partner tasks.

In Fall 2016, students spent an average of 4 hours (min = 1 hour, max = 8 hours) on Task 2 (open-ended animation). Task 1 was not part of the assignment in Fall 2016, so we don't have data for it.


All code for this assignment is available in the ps10 folder in the cs111/download directory within your cs server account.


Task 1: Higher Order Fun with Word Frequencies

In this task, you will use higher-order functions to split a text by sentences and words (Subtask 1a), compute the frequency with which each word appears in a text file (Subtask 1b), and build a model of word usage to generate random sentences in the style of a given text (Subtask 1c).

This task provides less testing support (no Otter Inspect) than you have used on some previous tasks. It is time to take off the training wheels and learn how to test your own code.

Task 1 Rules: No Loops, No Recursion, No List Comprehensions

To help you embrace the power of higher-order functions, you must not write any loops, recursion, or list comprehensions in any part of this task. Across all parts of the task combined, you must use each of map, filter, reduce, and sorted at least once.

Subtask 1a. Extracting Sentences

Define a function named sentencesFromFile that takes the name of a file as its single argument. The result returned by this function should be a list of non-empty sentences, where each sentence is represented as a non-empty list of words (strings). For example, the first 7 sentences extracted from texts/green-eggs.txt are:

[Updated example 10:15pm 27 April]

In [2]: sentencesFromFile('texts/green-eggs.txt')[:7]
Out[2]: [[u'Sam', u'I', u'am'],
         [u'I', u'am', u'Sam'],
         [u'I', u'am', u'Sam'],
         [u'Sam', u'I', u'am'],
         [u'That', u'Sam-I-am'],
         [u'That', u'Sam-I-am'],
         [u'I', u'do', u'not', u'like', u'that', u'Sam-I-am']]

Open up texts/green-eggs.txt to see the text that we parsed. Note that sentencesFromFile should split the text first by sentence terminators (e.g., '. ' and pairs of consecutive newlines, not just single newlines), and then by words, stripping punctuation and spaces from the start and end of words.
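For concreteness, here is a minimal sketch of one way to structure sentencesFromFile under the no-loop rules. The regular expression and the punctuation characters stripped here are illustrative assumptions, not the required behavior:

import re
import string

def sentencesFromFile(filename):
    with open(filename) as f:
        text = f.read()
    # Split on sentence terminators: '.', '!', or '?' followed by
    # whitespace, or a blank line (two or more consecutive newlines).
    rawSentences = re.split(r'[.!?]\s+|\n\s*\n', text)
    # Split a sentence into words, stripping surrounding punctuation
    # and discarding any words that become empty.
    def cleanWords(sentence):
        words = map(lambda w: w.strip(string.punctuation), sentence.split())
        return list(filter(lambda w: w != '', words))
    # Keep only non-empty sentences.
    return list(filter(lambda s: s != [], map(cleanWords, rawSentences)))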

As another example, consider the first 3 sentences from texts/constitution.txt (where we've reformatted Canopy's output to take less space):

In [3]: sentencesFromFile('texts/constitution.txt')[:3]
Out[3]: 
[[u'We', u'the', u'People', u'of', u'the', u'United', u'States', u'in', u'Order', 
 u'to', u'form', u'a', u'more', u'perfect', u'Union', u'establish', u'Justice',
  u'insure', u'domestic', u'Tranquility', u'provide', u'for', u'the', u'common',
  u'defence', u'promote', u'the', u'general', u'Welfare', u'and', u'secure', 
  u'the', u'Blessings', u'of', u'Liberty', u'to', u'ourselves', u'and', u'our',
  u'Posterity', u'do', u'ordain', u'and', u'establish', u'this', u'Constitution',
  u'for', u'the', u'United', u'States', u'of', u'America'],
 [u'Article', u'1'],
 [u'Section', u'1', u'All', u'legislative', u'Powers', u'herein', u'granted',
  u'shall', u'be', u'vested', u'in', u'a', u'Congress', u'of', u'the', u'United',
  u'States', u'which', u'shall', u'consist', u'of', u'a', u'Senate', u'and',
  u'House', u'of', u'Representatives']]

To implement sentencesFromFile, you will:

Subtask 1b. Ranking Word Frequencies (a.k.a. you and reduce become friends)

Define a function wordFrequenciesFromSentences that takes a sentence list of the form discussed above as its single argument. The function should return a list of (word, frequency) pairs, ordered from most frequent to least frequent. Each word appearing in the sentence list should appear in exactly one such pair, associated with its frequency: the number of times that word appears across all the sentences. These pairs should be sorted from high frequency to low frequency; the order of pairs with the same frequency is not specified. Words should be converted to lowercase before counting.

In many examples, for easier reading we have (1) omitted the unicode prefix u'...' and (2) reformatted list results.

[Updated texts/green-eggs.txt at 10:15pm 27 April to fix missing lines.]

[Updated following result at 5:09pm 30 April to change ('samiam', 13) to ('sam-i-am', 13).]

In [1]: wordFrequenciesFromSentences(sentencesFromFile('texts/green-eggs.txt'))
Out[1]: 
[('not', 84), ('i', 73), ('them', 61), ('a', 59), ('like', 44), ('in', 40), 
 ('do', 37), ('you', 34), ('would', 26), ('and', 25), ('eat', 25), ('will', 21), 
 ('with', 19), ('could', 14), ('sam-i-am', 13), ('eggs', 11), ('here', 11), 
 ('ham', 11), ('green', 11), ('the', 11), ('there', 9), ('train', 9), 
 ('house', 8), ('mouse', 8), ('anywhere', 8), ('or', 8), ('sam', 7), ('box', 7), 
 ('fox', 7), ('dark', 7), ('on', 7), ('car', 7), ('tree', 6), ('so', 5), 
 ('say', 5), ('be', 4), ('am', 4), ('see', 4), ('try', 4), ('goat', 4), 
 ('may', 4), ('me', 4), ('let', 4), ('rain', 4), ('boat', 3), ('that', 3), 
 ('are', 2), ('thank', 2), ('good', 2), ('they', 2), ('if', 1)]

To implement wordFrequenciesFromSentences, you will:

Hints:
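
As one illustrative sketch (the helper names here are assumptions, and this is not the only valid decomposition), the counting and sorting might look like:

from functools import reduce  # reduce is built in on Python 2; this import also works there

def wordFrequenciesFromSentences(sentences):
    # Flatten the sentences into one list of lowercase words.
    allWords = reduce(lambda acc, sentence: acc + sentence, sentences, [])
    lowered = map(lambda w: w.lower(), allWords)
    # Fold the words into a dictionary mapping each word to its count.
    def countWord(counts, word):
        counts[word] = counts.get(word, 0) + 1
        return counts
    frequencies = reduce(countWord, lowered, {})
    # Sort the (word, frequency) pairs from most to least frequent.
    return sorted(frequencies.items(), key=lambda pair: pair[1], reverse=True)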

Subtask 1c. Random Sentence Generator

Finally, you will write a function to generate a simple n-gram model of a text. You will use this model to generate random sentences in the writing style of the text.

N-gram Models

An n-gram is a sequence of n consecutive words from a text; n-gram statistics are useful for characterizing the text. We provide a function ngrams that extracts a list of all the n-grams of a sentence. This function takes two arguments: n, the length of each n-gram, and a sentence (a list of word strings). For example:

In [2]: ngrams(2, ['I', 'like', 'green', 'eggs', 'and', 'ham'])
Out[2]: [('<start>', 'I'), ('I', 'like'), ('like', 'green'),
         ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'),
         ('ham', '<end>')]

Notice that these bigrams are overlapping. For example, both ('I', 'like') and ('like', 'green') appear. Also notice the special markers for the beginning and end of the sentence. These help capture which words appear at or near the beginning or end of sentences, as opposed to words that appear within sentences.

Remember, n can be any positive integer. Here are the trigrams of the same sentence.

In [2]: ngrams(3, ['I', 'like', 'green', 'eggs', 'and', 'ham'])
Out[2]: [('<start>', '<start>', 'I'), ('<start>', 'I', 'like'),
    ('I', 'like', 'green'), ('like', 'green', 'eggs'),
    ('green', 'eggs', 'and'), ('eggs', 'and', 'ham'),
    ('and', 'ham', '<end>')]
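
You do not need to write ngrams yourself, but the following sketch is consistent with the behavior shown above (the provided implementation may differ): pad the sentence with n-1 '<start>' markers and one '<end>' marker, then take every length-n sliding window.

def ngrams(n, sentence):
    # Pad with n-1 start markers and one end marker, then take
    # every length-n sliding window as a tuple.
    padded = ['<start>'] * (n - 1) + sentence + ['<end>']
    return list(map(lambda i: tuple(padded[i:i+n]),
                    range(len(padded) - n + 1)))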

For this task, an n-gram model of a collection of n-grams is a dictionary. Keys are tuples of length n-1, representing the prefixes of all n-grams included in the model. Values are lists of words that appear as the last word in n-grams whose first n-1 words are the corresponding key. The following examples show a single sentence and a pair of sentences, their collections of bigrams (2-grams), and their corresponding bigram (2-gram) models.

# sentence
sentence1   = ['I', 'am', 'Sam']
# list of bigrams (2-grams) for this sentence
bigramList1 = [('<start>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '<end>')]
# bigram (2-gram) model of this sentence
# keys are 1-tuples, since 2-1 = 1
model1 = {
  ('<start>',): ['I'],    # First word was 'I'
  ('I',):       ['am'],   # 'am' followed 'I'
  ('am',):      ['Sam'],  # 'Sam' followed 'am'
  ('Sam',):     ['<end>'] # Sentence end followed 'Sam'
}

# list of sentences
sentences2  = [['I', 'am', 'Sam'], ['Sam', 'I', 'am']]
# list of bigrams for each sentence
bigramList2 = [[('<start>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '<end>')],
               [('<start>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '<end>')]]
# bigram model for this pair of sentences
# keys are 1-tuples, since 2-1 = 1
model2 = {
  ('<start>',): ['I', 'Sam'],     # First word was either 'I' or 'Sam'
  ('I',):       ['am', 'am'],     # 'am' follows 'I' in two places
  ('am',):      ['Sam', '<end>'], # Both 'Sam' and sentence end followed 'am'
  ('Sam',):     ['<end>', 'I'],   # Both sentence end and 'I' followed 'Sam'
}

An n-gram model summarizes what words follow a given series of n-1 words in the texts from which the model is built. For example, the second entry in model2 indicates that we observed two words that followed 'I' in the text and both were the word 'am'. Note that 'am' is listed twice. By preserving duplicates, the model represents how often each following word appeared.

The next entry indicates that after the word 'am', we observed the word 'Sam' once and we observed the end of a sentence once. The first entry indicates that we observed both 'I' and 'Sam' as the first word of sentences.

As a second example, consider the same sentences with trigrams (3-grams). Considering words in threes, we can reason about what words follow a given pair of words in sequence.

# list of sentences
sentences3   = [['I', 'am', 'Sam'], ['Sam', 'I', 'am']]
# list of trigrams (3-grams) for each sentence
trigramList3 = [[('<start>', '<start>', 'I'), ('<start>', 'I', 'am'),
                 ('I', 'am', 'Sam'), ('am', 'Sam', '<end>')],
                [('<start>', '<start>', 'Sam'), ('<start>', 'Sam', 'I'),
                 ('Sam', 'I', 'am'), ('I', 'am', '<end>')]]
# trigram (3-gram) model for this pair of sentences
# keys are 2-tuples, since 3-1 = 2
model3 = {
    ('Sam', 'I'): ['am'],
    ('I', 'am'): ['Sam', '<end>'],
    ('am', 'Sam'): ['<end>'],
    ('<start>', 'I'): ['am'],
    ('<start>', 'Sam'): ['I'],
    ('<start>', '<start>'): ['I', 'Sam']
}

N-gram models have a wide range of applications in fields of computing such as natural language processing, computational biology, and machine learning. In this task, you will use very simple n-gram models for a less noble goal: generating silly sentences!

Your Task: Building the Model

Define a function buildModel that takes two arguments: the length of n-grams to use, and a list of sentences. buildModel will build a simple n-gram model dictionary from the given list of sentences using higher-order functions (especially reduce).

To implement buildModel, you will translate the provided function buildModelLoops, which uses loops for the same task. You must translate each loop into a reduce call, each with its own helper function (anonymous or named).

In [3]: buildModel(2, [['I', 'am', 'Sam'], ['Sam', 'I', 'am']])
Out[3]: {
  ('<start>',): ['I', 'Sam'],
  ('I',):       ['am', 'am'],
  ('am',):      ['Sam', '<end>'],
  ('Sam',):     ['<end>', 'I'],
}

To implement buildModel, you will:

Hints:
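
For instance, here is one hedged sketch of the overall shape of a reduce-based buildModel (the helper names are illustrative; your translation of buildModelLoops may differ):

from functools import reduce

def buildModel(n, sentences):
    # Fold one n-gram into the model: the key is the first n-1 words,
    # and the value list collects final words, preserving duplicates.
    def addNgram(model, gram):
        prefix = gram[:-1]
        model[prefix] = model.get(prefix, []) + [gram[-1]]
        return model
    # Fold one sentence into the model by folding in each of its n-grams.
    def addSentence(model, sentence):
        return reduce(addNgram, ngrams(n, sentence), model)
    return reduce(addSentence, sentences, {})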

Your Reward: Generating Sentences for Testing, Fun, and Profit

Since the n-gram model represents how often words tend to follow other words in the text we observed, we can also use it to predict what word should come next, given the words in a sentence so far. In other words, we can generate random sentences that tend to contain the same kinds of word sequences (n-grams) as the text we observed.

To start a random sentence using the bigram model model2 above, we choose a first word by randomly picking one of the words that have been observed to follow the '<start>' symbol: for example, 'I'. From there, we continue by randomly choosing a word that has followed 'I': 'am'. We repeat this process, choosing a word to follow the most recent words, until we eventually choose the '<end>' symbol, at which point the sentence is complete. For example, we might generate the sentence "I am Sam I am" by choosing randomly until choosing '<end>':

['I']
['I', 'am']
['I', 'am', 'Sam']
['I', 'am', 'Sam', 'I']
['I', 'am', 'Sam', 'I', 'am']

Generalizing from 2-grams to n-grams, we use the preceding n-1 words to predict the next word, rather than just the single preceding word.
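
As a hedged sketch of this process (the provided buildGenerator and generate may be organized differently; generateSentence is an illustrative name, and we assume n >= 2):

import random

def generateSentence(n, model):
    words = ['<start>'] * (n - 1)   # seed with the start padding
    nextWord = random.choice(model[tuple(words[-(n - 1):])])
    # Keep extending the sentence until '<end>' is chosen.
    while nextWord != '<end>':
        words = words + [nextWord]
        nextWord = random.choice(model[tuple(words[-(n - 1):])])
    return words[n - 1:]            # drop the '<start>' padding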

How to Generate Silly Sentences

To help you test and enjoy your implementation of buildModel, we provide a function buildGenerator that takes an n-gram length and a list of sentences and returns a function. The returned function can be called with no arguments to generate a random sentence. Read the provided generate function to see how this is accomplished. For example:

In [4]: sentences = sentencesFromFile('texts/grimm.txt')
In [5]: gen = buildGenerator(3, sentences)
In [6]: u' '.join(gen())
Out[6]: u'The wretch often disguises himself but you must cease using
and return or destroy all copies of this And the old woman was a fly
and in the middle of winter when the war to begin the willow-wren sent
down the steps'
In [7]: u' '.join(gen())
Out[7]: u'The bird delighted with its head hanging down as if an evil
spirit in the twinkling of an innocent boy who wanted to punish his
wicked sons but they made a bite and to try and set her free from care'
In [8]: u' '.join(gen())
Out[8]: u'This is a young fox is here At this the best of all until flames
burst forth throughout the castle all was heard and a bottle of wine'

The generated sentences lack punctuation, but can be quite entertaining. When your generator is working, try different values for n and explore other texts in the texts directory, your own writings (saved as plain text), or publicly available texts. Try building models from sentences from more than one text to blend their styles together. Share your favorites here.



Task 2: Create Your Own Animation

This task is a partner problem, in which you are required to work as part of a two-person team.

In the ps10/CS111Animation folder, we have included a copy of the core CS111 animation framework files you saw in Lec 21 and Lab 13. In this task, you and your partner should work together to create a new animation from scratch in the CS111Animation folder.

For your animation, you should design and implement new sprites. Feel free to reuse parts of the graphic scenes you designed for PS1.

Your animation must satisfy the following criteria:

Other Notes:



Task 3: Honor Code Form and Final Checks

As in the previous problem sets, your honor code submission for this pset will involve defining values for the variables in the honorcode.py file. This is a Python file, so your values must be valid Python code (strings or numbers).

Be sure to test your code thoroughly by calling your functions on new arguments of your own invention and verifying that the results are as expected. We have removed the Otter Inspect training wheels on this assignment, so it is your responsibility to test your code.

If you wrote any function invocations or print statements in your Python files to test your code, please remove them, comment them out, or wrap them in an if __name__ == '__main__' block before you submit. Points will be deducted for stray function invocations or superfluous print statements. Be sure to run the final version to make sure there are no syntax errors resulting from commenting out these pieces.
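
For example, a test call can be wrapped like this (the particular call shown is just an illustration):

if __name__ == '__main__':
    # Runs only when this file is executed directly, not when imported.
    print(wordFrequenciesFromSentences(sentencesFromFile('texts/green-eggs.txt')))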



How to turn in this Problem Set