Problem Set 10 - Due Thursday, May 4 at 4:00pm EST

Reading

  1. Lecture slides and notebook: Lec 19 Python Programming with First-class Functions
  2. Lab notes and exercises: Lab 12 Functional Programming
  3. Lecture slides and notebook: Lec 20 Objects, Classes, and Inheritance
  4. Lecture slides and code folder: Lec 21 Animation
  5. Lab notes and exercises: Lab 13 Animation

About this Problem Set

This problem set is intended to give you practice with higher-order functions, and with classes, objects and inheritance in the context of animation.

This assignment will not have a reflection or a quiz. All points on this assignment will be allocated to the code.

There are two tasks in this pset.

  1. In Task 1 (Individual Task) you will compute word frequencies to characterize texts and generate random sentences in similar style.

  2. In Task 2 (Partner Task) you will create an open-ended animation using the animation framework presented in class. Use this shared Google Doc to find a pair programming partner.

    • Please DO NOT post to the cs111-spring17 Google group looking for a partner; use the Google Doc instead.

    • Remember that you can work with the same partner a maximum of TWICE this semester.

    • We will deduct 2 points per day from any student who has not recorded their name by the end of Fri, Apr 28.

Carefully study the lecture and lab materials, which will help you do the two tasks.

The CS111 Problem Set Guide gives an overview of psets, including a detailed description of individual and partner tasks.

In Fall 2016, students spent an average of 4 hours (min = 1 hour, max = 8 hours) on Task 2 (open-ended animation). Task 1 was not part of the assignment in Fall 2016, so we don't have data for it.


All code for this assignment is available in the ps10 folder in the cs111/download directory within your cs server account.


Task 1: Higher Order Fun with Word Frequencies

In this task, you will use higher-order functions to split a text by sentences and words (Subtask 1a), compute the frequency with which each word appears in a text file (Subtask 1b), and build a model of word usage to generate random sentences in the style of a given text (Subtask 1c).

This task provides less testing support (no Otter Inspect) than you have used on some previous tasks. It is time to take off the training wheels and learn how to test your own code.

Task 1 Rules: No Loops, No Recursion, No List Comprehensions

To help you embrace the power of higher-order functions, you must not write any loops, recursion, or list comprehensions in any part of this task. Across all parts of the task combined, you must use each of map, filter, reduce, and sorted at least once.

Subtask 1a. Extracting Sentences

Define a function named sentencesFromFile that takes the name of a file as its single argument. The result returned by this function should be a list of non-empty sentences, where each sentence is represented as a non-empty list of words (strings). For example, the first 7 sentences extracted from texts/green-eggs.txt are:

[Updated example 10:15pm 27 April]

In [2]: sentencesFromFile('texts/green-eggs.txt')[:7]
Out[2]: [[u'Sam', u'I', u'am'],
         [u'I', u'am', u'Sam'],
         [u'I', u'am', u'Sam'],
         [u'Sam', u'I', u'am'],
         [u'That', u'Sam-I-am'],
         [u'That', u'Sam-I-am'],
         [u'I', u'do', u'not', u'like', u'that', u'Sam-I-am']]

Open up texts/green-eggs.txt to see the text that we parsed. Note that sentencesFromFile should split the text first by sentence terminators (e.g., '. ' and pairs of consecutive newlines, not just single newlines), and then by words, stripping punctuation and spaces from the start and end of words.
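For concreteness, here is a minimal sketch of one way to structure sentencesFromFile under the no-loop rules. The regular expression and the punctuation characters stripped here are illustrative assumptions, not the required behavior:

import re
import string

def sentencesFromFile(filename):
    with open(filename) as f:
        text = f.read()
    # Split on sentence terminators: '.', '!', or '?' followed by
    # whitespace, or a blank line (two or more consecutive newlines).
    rawSentences = re.split(r'[.!?]\s+|\n\s*\n', text)
    # Split a sentence into words, stripping surrounding punctuation
    # and discarding any words that become empty.
    def cleanWords(sentence):
        words = map(lambda w: w.strip(string.punctuation), sentence.split())
        return list(filter(lambda w: w != '', words))
    # Keep only non-empty sentences.
    return list(filter(lambda s: s != [], map(cleanWords, rawSentences)))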

As another example, consider the first 3 sentences from texts/constitution.txt (where we've reformatted Canopy's output to take less space):

In [3]: sentencesFromFile('texts/constitution.txt')[:3]
Out[3]: 
[[u'We', u'the', u'People', u'of', u'the', u'United', u'States', u'in', u'Order', 
 u'to', u'form', u'a', u'more', u'perfect', u'Union', u'establish', u'Justice',
  u'insure', u'domestic', u'Tranquility', u'provide', u'for', u'the', u'common',
  u'defence', u'promote', u'the', u'general', u'Welfare', u'and', u'secure', 
  u'the', u'Blessings', u'of', u'Liberty', u'to', u'ourselves', u'and', u'our',
  u'Posterity', u'do', u'ordain', u'and', u'establish', u'this', u'Constitution',
  u'for', u'the', u'United', u'States', u'of', u'America'],
 [u'Article', u'1'],
 [u'Section', u'1', u'All', u'legislative', u'Powers', u'herein', u'granted',
  u'shall', u'be', u'vested', u'in', u'a', u'Congress', u'of', u'the', u'United',
  u'States', u'which', u'shall', u'consist', u'of', u'a', u'Senate', u'and',
  u'House', u'of', u'Representatives']]

To implement sentencesFromFile, you will:

Subtask 1b. Ranking Word Frequencies (a.k.a. you and reduce become friends)

Define a function wordFrequenciesFromSentences that takes a sentence list of the form discussed above as its single argument. The function should return a list of (word, frequency) pairs, ordered from most frequent to least frequent. Each word appearing in the sentence list should appear in exactly one such pair, associated with its frequency: the number of times that word appears across all the sentences. These pairs should be sorted from high frequency to low frequency; the order of pairs with the same frequency is not specified. Words should be converted to lowercase before counting.

In many examples, for easier reading we have (1) omitted the unicode prefix u'...' and (2) reformatted list results.

[Updated texts/green-eggs.txt at 10:15pm 27 April to fix missing lines.]

[Updated following result at 5:09pm 30 April to change ('samiam', 13) to ('sam-i-am', 13).]

In [1]: wordFrequenciesFromSentences(sentencesFromFile('texts/green-eggs.txt'))
Out[1]: 
[('not', 84), ('i', 73), ('them', 61), ('a', 59), ('like', 44), ('in', 40), 
 ('do', 37), ('you', 34), ('would', 26), ('and', 25), ('eat', 25), ('will', 21), 
 ('with', 19), ('could', 14), ('sam-i-am', 13), ('eggs', 11), ('here', 11), 
 ('ham', 11), ('green', 11), ('the', 11), ('there', 9), ('train', 9), 
 ('house', 8), ('mouse', 8), ('anywhere', 8), ('or', 8), ('sam', 7), ('box', 7), 
 ('fox', 7), ('dark', 7), ('on', 7), ('car', 7), ('tree', 6), ('so', 5), 
 ('say', 5), ('be', 4), ('am', 4), ('see', 4), ('try', 4), ('goat', 4), 
 ('may', 4), ('me', 4), ('let', 4), ('rain', 4), ('boat', 3), ('that', 3), 
 ('are', 2), ('thank', 2), ('good', 2), ('they', 2), ('if', 1)]

To implement wordFrequenciesFromSentences, you will:

Hints:
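
As one illustrative sketch (the helper names here are assumptions, and this is not the only valid decomposition), the counting and sorting might look like:

from functools import reduce  # reduce is built in on Python 2; this import also works there

def wordFrequenciesFromSentences(sentences):
    # Flatten the sentences into one list of lowercase words.
    allWords = reduce(lambda acc, sentence: acc + sentence, sentences, [])
    lowered = map(lambda w: w.lower(), allWords)
    # Fold the words into a dictionary mapping each word to its count.
    def countWord(counts, word):
        counts[word] = counts.get(word, 0) + 1
        return counts
    frequencies = reduce(countWord, lowered, {})
    # Sort the (word, frequency) pairs from most to least frequent.
    return sorted(frequencies.items(), key=lambda pair: pair[1], reverse=True)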

Subtask 1c. Random Sentence Generator

Finally, you will write a function to generate a simple n-gram model of a text. You will use this model to generate random sentences in the writing style of the text.

N-gram Models

An n-gram is a sequence of n consecutive words from a text; n-gram statistics are useful for characterizing the text. We provide a function ngrams that extracts a list of all the n-grams of a sentence. This function takes two arguments: n, the length of each n-gram, and a sentence (a list of word strings). For example:

In [2]: ngrams(2, ['I', 'like', 'green', 'eggs', 'and', 'ham'])
Out[2]: [('<start>', 'I'), ('I', 'like'), ('like', 'green'),
         ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'),
         ('ham', '<end>')]

Notice that these bigrams are overlapping. For example, both ('I', 'like') and ('like', 'green') appear. Also notice the special markers for the beginning and end of the sentence. These help capture which words appear at or near the beginning or end of sentences, as opposed to words that appear within sentences.

Remember, n can be any positive integer. Here are the trigrams of the same sentence.

In [2]: ngrams(3, ['I', 'like', 'green', 'eggs', 'and', 'ham'])
Out[2]: [('<start>', '<start>', 'I'), ('<start>', 'I', 'like'),
    ('I', 'like', 'green'), ('like', 'green', 'eggs'),
    ('green', 'eggs', 'and'), ('eggs', 'and', 'ham'),
    ('and', 'ham', '<end>')]
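
You do not need to write ngrams yourself, but the following sketch is consistent with the behavior shown above (the provided implementation may differ): pad the sentence with n-1 '<start>' markers and one '<end>' marker, then take every length-n sliding window.

def ngrams(n, sentence):
    # Pad with n-1 start markers and one end marker, then take
    # every length-n sliding window as a tuple.
    padded = ['<start>'] * (n - 1) + sentence + ['<end>']
    return list(map(lambda i: tuple(padded[i:i+n]),
                    range(len(padded) - n + 1)))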

For this task, an n-gram model of a collection of n-grams is a dictionary. Keys are tuples of length n-1, representing the prefixes of all n-grams included in the model. Values are lists of words that appear as the last word in n-grams whose first n-1 words are the corresponding key. The following examples show a single sentence and a pair of sentences, their collections of bigrams (2-grams), and their corresponding bigram (2-gram) models.

# sentence
sentence1   = ['I', 'am', 'Sam']
# list of bigrams (2-grams) for this sentence
bigramList1 = [('<start>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '<end>')]
# bigram (2-gram) model of this sentence
# keys are 1-tuples, since 2-1 = 1
model1 = {
  ('<start>',): ['I'],    # First word was 'I'
  ('I',):       ['am'],   # 'am' followed 'I'
  ('am',):      ['Sam'],  # 'Sam' followed 'am'
  ('Sam',):     ['<end>'] # Sentence end followed 'Sam'
}

# list of sentences
sentences2  = [['I', 'am', 'Sam'], ['Sam', 'I', 'am']]
# list of bigrams for each sentence
bigramList2 = [[('<start>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '<end>')],
               [('<start>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '<end>')]]
# bigram model for this pair of sentences
# keys are 1-tuples, since 2-1 = 1
model2 = {
  ('<start>',): ['I', 'Sam'],     # First word was either 'I' or 'Sam'
  ('I',):       ['am', 'am'],     # 'am' follows 'I' in two places
  ('am',):      ['Sam', '<end>'], # Both 'Sam' and sentence end followed 'am'
  ('Sam',):     ['<end>', 'I'],   # Both sentence end and 'I' followed 'Sam'
}

An n-gram model summarizes what words follow a given series of n-1 words in the texts from which the model is built. For example, the second entry in model2 indicates that we observed two words that followed 'I' in the text and both were the word 'am'. Note that 'am' is listed twice. By preserving duplicates, the model represents how often each following word appeared.

The next entry indicates that after the word 'am', we observed the word 'Sam' once and we observed the end of a sentence once. The first entry indicates that we observed both 'I' and 'Sam' as the first word of sentences.

As a second example, consider the same sentences with trigrams (3-grams). Considering words in threes, we can reason about what words follow a given pair of words in sequence.

# list of sentences
sentences3   = [['I', 'am', 'Sam'], ['Sam', 'I', 'am']]
# list of trigrams (3-grams) for each sentence
trigramList3 = [[('<start>', '<start>', 'I'), ('<start>', 'I', 'am'),
                 ('I', 'am', 'Sam'), ('am', 'Sam', '<end>')],
                [('<start>', '<start>', 'Sam'), ('<start>', 'Sam', 'I'),
                 ('Sam', 'I', 'am'), ('I', 'am', '<end>')]]
# trigram (3-gram) model for this pair of sentences
# keys are 2-tuples, since 3-1 = 2
model3 = {
    ('Sam', 'I'): ['am'],
    ('I', 'am'): ['Sam', '<end>'],
    ('am', 'Sam'): ['<end>'],
    ('<start>', 'I'): ['am'],
    ('<start>', 'Sam'): ['I'],
    ('<start>', '<start>'): ['I', 'Sam']
}

N-gram models have a wide range of applications in fields of computing such as natural language processing, computational biology, and machine learning. In this task, you will use very simple n-gram models for a less noble goal: generating silly sentences!

Your Task: Building the Model

Define a function buildModel that takes two arguments: the length of n-grams to use, and a list of sentences. buildModel will build a simple n-gram model dictionary from the given list of sentences using higher-order functions (especially reduce).

To implement buildModel, you will translate the provided function buildModelLoops, which uses loops for the same task. You must translate each loop into a reduce call, each with its own helper function (anonymous or named).

In [3]: buildModel(2, [['I', 'am', 'Sam'], ['Sam', 'I', 'am']])
Out[3]: {
  ('<start>',): ['I', 'Sam'],
  ('I',):       ['am', 'am'],
  ('am',):      ['Sam', '<end>'],
  ('Sam',):     ['<end>', 'I'],
}

To implement buildModel, you will:

Hints:
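
For instance, here is one hedged sketch of the overall shape of a reduce-based buildModel (the helper names are illustrative; your translation of buildModelLoops may differ):

from functools import reduce

def buildModel(n, sentences):
    # Fold one n-gram into the model: the key is the first n-1 words,
    # and the value list collects final words, preserving duplicates.
    def addNgram(model, gram):
        prefix = gram[:-1]
        model[prefix] = model.get(prefix, []) + [gram[-1]]
        return model
    # Fold one sentence into the model by folding in each of its n-grams.
    def addSentence(model, sentence):
        return reduce(addNgram, ngrams(n, sentence), model)
    return reduce(addSentence, sentences, {})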

Your Reward: Generating Sentences for Testing, Fun, and Profit

Since the n-gram model represents how often words tend to follow other words in the text we observed, we can also use it to predict what word should come next, given the words in a sentence so far. In other words, we can generate random sentences that tend to contain the same kinds of word sequences (n-grams) as the text we observed.

To start a random sentence using the bigram model model2 above, we choose a first word by randomly picking one of the words that have been observed to follow the '<start>' symbol: for example, 'I'. From there, we continue by randomly choosing a word that has followed 'I': 'am'. We repeat this process, choosing a word to follow the most recent words, until we eventually choose the '<end>' symbol, at which point the sentence is complete. For example, we might generate the sentence "I am Sam I am" by choosing randomly until choosing '<end>':

['I']
['I', 'am']
['I', 'am', 'Sam']
['I', 'am', 'Sam', 'I']
['I', 'am', 'Sam', 'I', 'am']

Generalizing from 2-grams to n-grams, we use the preceding n-1 words to predict the next word, rather than just the single preceding word.
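
As a hedged sketch of this process (the provided buildGenerator and generate may be organized differently; generateSentence is an illustrative name, and we assume n >= 2):

import random

def generateSentence(n, model):
    words = ['<start>'] * (n - 1)   # seed with the start padding
    nextWord = random.choice(model[tuple(words[-(n - 1):])])
    # Keep extending the sentence until '<end>' is chosen.
    while nextWord != '<end>':
        words = words + [nextWord]
        nextWord = random.choice(model[tuple(words[-(n - 1):])])
    return words[n - 1:]            # drop the '<start>' padding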

How to Generate Silly Sentences

To help you test and enjoy your implementation of buildModel, we provide a function buildGenerator that takes an n-gram length and a list of sentences and returns a function. The returned function can be called with no arguments to generate a random sentence. Read the provided generate function to see how this is accomplished. For example:

In [4]: sentences = sentencesFromFile('texts/grimm.txt')
In [5]: gen = buildGenerator(3, sentences)
In [6]: u' '.join(gen())
Out[6]: u'The wretch often disguises himself but you must cease using
and return or destroy all copies of this And the old woman was a fly
and in the middle of winter when the war to begin the willow-wren sent
down the steps'
In [7]: u' '.join(gen())
Out[7]: u'The bird delighted with its head hanging down as if an evil
spirit in the twinkling of an innocent boy who wanted to punish his
wicked sons but they made a bite and to try and set her free from care'
In [8]: u' '.join(gen())
Out[8]: u'This is a young fox is here At this the best of all until flames
burst forth throughout the castle all was heard and a bottle of wine'

The generated sentences lack punctuation, but can be quite entertaining. When your generator is working, try different values for n and explore other texts in the texts directory, your own writings (saved as plain text), or publicly available texts. Try building models from sentences from more than one text to blend their styles together. Share your favorites here.



Task 2: Create Your Own Animation

This task is a partner problem, in which you are required to work as part of a two-person team.

In the ps10/CS111Animation folder, we have included a copy of the core CS111 animation framework files you saw in Lec 21 and Lab 13. In this task, you and your partner should work together to create a new animation from scratch in the CS111Animation folder.

For your animation, you should design and implement new sprites. Feel free to reuse parts of the graphic scenes you designed for PS1.

Your animation must satisfy the following criteria:

Other Notes:



Task 3: Honor Code Form and Final Checks

As in the previous problem sets, your honor code submission for this pset will involve defining values for the variables in the honorcode.py file. This is a Python file, so your values must be valid Python code (strings or numbers).

Be sure to test your code thoroughly by calling your functions on new arguments of your own invention and verifying that the results are as expected. We have removed the Otter Inspect training wheels on this assignment, so it is your responsibility to test your code.

If you wrote any function invocations or print statements in your Python files to test your code, please remove them, comment them out, or wrap them in an if __name__ == '__main__' block before you submit. Points will be deducted for stray function invocations or superfluous print statements. Be sure to run the final version to make sure there are no syntax errors resulting from commenting out these pieces.
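
For example, a test call can be wrapped like this (the particular call shown is just an illustration):

if __name__ == '__main__':
    # Runs only when this file is executed directly, not when imported.
    print(wordFrequenciesFromSentences(sentencesFromFile('texts/green-eggs.txt')))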



How to turn in this Problem Set