Problem Set 10 - Due Thursday, May 4 at 4:00pm EST
Reading
- Lecture slides and notebook: Lec 19 Python Programming with First-class Functions
- Lab notes and exercises: Lab 12 Functional Programming
- Lecture slides and notebook: Lec 20 Objects, Classes, and Inheritance
- Lecture slides and code folder: Lec 21 Animation
- Lab notes and exercises: Lab 13 Animation
About this Problem Set
This problem set is intended to give you practice with higher-order functions, and with classes, objects and inheritance in the context of animation.
This assignment will not have a reflection or a quiz. All points on this assignment will be allocated to the code.
There are two tasks in this pset.
- In Task 1 (Individual Task) you will compute word frequencies to characterize texts and generate random sentences in a similar style.
- In Task 2 (Partner Task) you will create an open-ended animation using the animation framework presented in class. Use this shared Google Doc to find a pair programming partner.
  - Please DO NOT post to the cs111-spring17 google group looking for a partner; use the Google Doc instead.
  - Remember that you can work with the same partner a maximum of TWICE this semester.
  - We will deduct 2 points per day for any students who have not recorded their names by the end of Fri, Apr 28th.
- Carefully study the lecture and lab materials, which will help you do the two tasks.
The CS111 Problem Set Guide gives an overview of psets, including a detailed description of individual and partner tasks.
In Fall 2016, students spent on average 4 hours (min = 1 hour, max = 8 hours) for Task 2 (open-ended animation). Task 1 was not part of the assignment in Fall 2016 and we don't have data for it.
All code for this assignment is available in the ps10 folder in the cs111/download directory within your cs server account.
Task 1: Higher Order Fun with Word Frequencies
In this task, you will use higher-order functions to split a text by sentences and words (Subtask 1a), compute the frequency with which each word appears in a text file (Subtask 1b), and build a model of word usage to generate random sentences in the style of a given text (Subtask 1c).
This task provides less testing support (no Otter Inspect) than you have used on some previous tasks. It is time to take off the training wheels and learn how to test your own code.
Task 1 Rules: No Loops, No Recursion, No List Comprehensions
To help you embrace the power of higher-order functions, you must not write any loops, recursion, or list comprehensions in any part of this task. Counting all uses in all parts together, you must use each of map, filter, reduce, and sorted at least once.
Subtask 1a. Extracting Sentences
Define a function named sentencesFromFile that takes the name of a file as its single argument. The result returned by this function should be a list of non-empty sentences, where each sentence is represented as a non-empty list of words (strings). For example, the first 7 sentences produced by inspecting texts/green-eggs.txt are:
[Updated example 10:15pm 27 April]
In [2]: sentencesFromFile('texts/green-eggs.txt')[:7]
Out[2]: [[u'Sam', u'I', u'am'],
[u'I', u'am', u'Sam'],
[u'I', u'am', u'Sam'],
[u'Sam', u'I', u'am'],
[u'That', u'Sam-I-am'],
[u'That', u'Sam-I-am'],
[u'I', u'do', u'not', u'like', u'that', u'Sam-I-am']]
Open up texts/green-eggs.txt to see the text that we parsed. Note that sentencesFromFile should split the text first by sentence terminators (e.g., '. ' and pairs of consecutive newlines, not just single newlines), and then by words, stripping punctuation and spaces from the start and end of words.

As another example, consider the first 3 sentences from texts/constitution.txt (where we've reformatted Canopy's output to take less space):
In [3]: sentencesFromFile('texts/constitution.txt')[:3]
Out[3]:
[[u'We', u'the', u'People', u'of', u'the', u'United', u'States', u'in', u'Order',
u'to', u'form', u'a', u'more', u'perfect', u'Union', u'establish', u'Justice',
u'insure', u'domestic', u'Tranquility', u'provide', u'for', u'the', u'common',
u'defence', u'promote', u'the', u'general', u'Welfare', u'and', u'secure',
u'the', u'Blessings', u'of', u'Liberty', u'to', u'ourselves', u'and', u'our',
u'Posterity', u'do', u'ordain', u'and', u'establish', u'this', u'Constitution',
u'for', u'the', u'United', u'States', u'of', u'America'],
[u'Article', u'1'],
[u'Section', u'1', u'All', u'legislative', u'Powers', u'herein', u'granted',
u'shall', u'be', u'vested', u'in', u'a', u'Congress', u'of', u'the', u'United',
u'States', u'which', u'shall', u'consist', u'of', u'a', u'Senate', u'and',
u'House', u'of', u'Representatives']]
To implement sentencesFromFile, you will:
- Follow the general rules for this Task.
- Implement your solution incrementally, stopping to test and inspect each stage with print statements as you go. Trying to implement this all at once will take much more time in the long run.
- IMPORTANT CHANGE (16:00 02 May): You may use as many map and filter calls as you need in this part, but no other higher-order functions. (This loosens the previous requirement of exactly two map calls and two filter calls.)
- Use the read method of files along with the decode method to read the whole file as a single unicode string, like this: openedFile.read().decode('utf-8-sig', 'ignore')
  - Update: If you see an odd character like \ufeff in your output, you can ignore it or use 'utf-8-sig' instead of 'utf-8' here.
- Use the provided helper function splitBySentenceStop to split the entire text string by sentence terminators. This function returns a list of strings, one per candidate sentence, but may include empty sentences. It uses a feature called regular expressions to do basic sentence boundary detection using patterns. You do not need to understand how it works, but you will learn about regular expressions in a later CS course!
- Use the provided helper function stripPunctuation to remove punctuation from the start and end of a word.
- Include only non-empty words (not '') in sentences.
- Include only non-empty sentences (not []) in the result.
- Test incrementally! See the testing code at the end of wordFrequency.py to test on various files in the texts directory.
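To see the shape of this pipeline without the provided helpers, here is a minimal sketch on a hard-coded string. It is NOT a solution: str.split stands in for splitBySentenceStop and str.strip stands in for stripPunctuation (the real helpers are more robust), but the nesting of map and filter is the idea to aim for.

```python
# Minimal sketch of the map/filter pipeline on a hard-coded string.
# str.split and str.strip merely stand in for the provided
# splitBySentenceStop and stripPunctuation helpers.
text = "Sam I am.  I am Sam."
candidates = text.split('.')  # stand-in for splitBySentenceStop

def words(candidate):
    # Strip punctuation/spaces from each word; keep only non-empty words.
    return list(filter(None, map(lambda w: w.strip('.,!? '), candidate.split())))

# Keep only non-empty sentences.
sentences = list(filter(None, map(words, candidates)))
print(sentences)  # [['Sam', 'I', 'am'], ['I', 'am', 'Sam']]
```

The list() calls make the sketch behave the same in Python 2 and Python 3 (where map and filter return iterators rather than lists).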
Subtask 1b. Ranking Word Frequencies (a.k.a. you and reduce become friends)
Define a function wordFrequenciesFromSentences
that takes a
sentence list of the form discussed above as its single argument. The
function should return a list of (word, frequency)
pairs, ordered
from most frequent to least frequent. Each word appearing in the
sentence list should appear in exactly one such pair, associated with
its frequency or count --- i.e., the number of times that word
appears in all the sentences. These pairs should be sorted from high
frequency to low frequency, but the order of pairs with the same
frequency is not specified. All words should use lowercase form for
the frequency counts.
In many examples, for easier reading we have (1) omitted the unicode prefix u'...' and (2) reformatted list results.
[Updated texts/green-eggs.txt at 10:15pm 27 April to fix missing lines.]
[Updated the following result at 5:09pm 30 April to change ('samiam', 13) to ('sam-i-am', 13).]
In [1]: wordFrequenciesFromSentences(sentencesFromFile('texts/green-eggs.txt'))
Out[1]:
[('not', 84), ('i', 73), ('them', 61), ('a', 59), ('like', 44), ('in', 40),
('do', 37), ('you', 34), ('would', 26), ('and', 25), ('eat', 25), ('will', 21),
('with', 19), ('could', 14), ('sam-i-am', 13), ('eggs', 11), ('here', 11),
('ham', 11), ('green', 11), ('the', 11), ('there', 9), ('train', 9),
('house', 8), ('mouse', 8), ('anywhere', 8), ('or', 8), ('sam', 7), ('box', 7),
('fox', 7), ('dark', 7), ('on', 7), ('car', 7), ('tree', 6), ('so', 5),
('say', 5), ('be', 4), ('am', 4), ('see', 4), ('try', 4), ('goat', 4),
('may', 4), ('me', 4), ('let', 4), ('rain', 4), ('boat', 3), ('that', 3),
('are', 2), ('thank', 2), ('good', 2), ('they', 2), ('if', 1)]
To implement wordFrequenciesFromSentences, you will:
- Follow the general rules for this Task.
- Use multiple standard higher-order functions and multiple lambdas or other functions.
- Use a dictionary as an accumulator for word frequencies, with words as keys and frequencies as values.
- Use the provided combiner function, countWord, that takes a dictionary and a word as arguments, counts the word's entry in the dictionary, and returns the dictionary.
- Define and use either a named helper function, countSentenceWords, or a lambda, that takes a dictionary and a sentence as arguments, adds the sentence's word frequencies to the dictionary, and returns the dictionary.
- Use the reduce function at least twice to accumulate a word frequency table (dictionary).
- Return a list with the word-frequency pairs in descending frequency order.
Hints:
- Build your solution incrementally. Use print statements often to test and inspect the intermediate results as you add each stage to your solution. Remove those prints before submitting.
- Follow this work plan:
  - Begin by attacking only the first sentence in the sentences list, sentences[0]. Use a single reduce call with the provided countWord function to accumulate a word frequency dictionary over the words of a single sentence.
  - Next, package your support for counting word frequencies of one sentence in a combiner function called countSentenceWords or a lambda. Use another reduce layer to apply this combiner and accumulate a word frequency dictionary over all sentences in the list.
  - Finally, extract a list of key-value pairs from the dictionary to present in sorted order.
- Test incrementally! See the testing code at the end of wordFrequency.py to test on various files in the texts directory.
- If you are stuck, think about how you would implement this with loops that do accumulation. Then transform each loop into a call to reduce, where the body of the loop becomes the body of the combiner function that is applied to reduce the given list.
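The nested-reduce shape can be sketched on a tiny hard-coded input. Note that this countWord is a stand-in written here only for illustration; use the version provided in wordFrequency.py.

```python
from functools import reduce  # reduce is a builtin in Python 2

# Stand-in for the provided countWord combiner (illustration only).
def countWord(freqs, word):
    word = word.lower()
    freqs[word] = freqs.get(word, 0) + 1
    return freqs

sentences = [['Sam', 'I', 'am'], ['I', 'am', 'Sam'], ['That', 'Sam-I-am']]

# The inner reduce folds countWord over one sentence; the outer reduce
# folds that combiner over all sentences, threading one dictionary through.
countSentenceWords = lambda freqs, sentence: reduce(countWord, sentence, freqs)
table = reduce(countSentenceWords, sentences, {})

# Extract (word, frequency) pairs sorted from most to least frequent.
pairs = sorted(table.items(), key=lambda pair: pair[1], reverse=True)
print(pairs[0])  # e.g. ('sam', 2) -- ties may appear in any order
```

Each reduce here corresponds to one accumulation loop in a loop-based solution, which is exactly the transformation the last hint describes.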
Subtask 1c. Random Sentence Generator
Finally, you will write a function to generate a simple n-gram model of a text. You will use this model to generate random sentences in the writing style of the text.
N-gram Models
An n-gram is a sequence of n consecutive words appearing in a text; n-grams are very useful for characterizing the text. We provide a function ngrams that extracts a list of all the n-grams of a sentence. This function takes two arguments: n, the length of each n-gram, and a sentence (a list of word strings). For example:
In [2]: ngrams(2, ['I', 'like', 'green', 'eggs', 'and', 'ham'])
Out[2]: [('<start>', 'I'), ('I', 'like'), ('like', 'green'),
('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'),
('ham', '<end>')]
Notice that these bigrams are overlapping. For example, both ('I', 'like')
and ('like', 'green')
appear. Also notice the special
markers for the beginning and end of the sentence. These help
understand what words appear at or near the beginning or end of
sentences, versus those words that appear within sentences.
Remember, n can be chosen as any positive integer. Here are trigrams of the same sentence.
In [2]: ngrams(3, ['I', 'like', 'green', 'eggs', 'and', 'ham'])
Out[2]: [('<start>', '<start>', 'I'), ('<start>', 'I', 'like'),
('I', 'like', 'green'), ('like', 'green', 'eggs'),
('green', 'eggs', 'and'), ('eggs', 'and', 'ham'),
('and', 'ham', '<end>')]
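For intuition about the padding and the sliding window, here is one possible way ngrams could be written; the provided version in wordFrequency.py may differ in details. (It uses a list comprehension, which is fine here: ngrams is provided to you, not written by you.)

```python
# One possible implementation of ngrams, shown only for intuition;
# use the version provided in wordFrequency.py.
def ngrams(n, sentence):
    # Pad with n-1 start markers and one end marker, then slide a
    # window of width n across the padded sentence.
    padded = ['<start>'] * (n - 1) + sentence + ['<end>']
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(ngrams(2, ['I', 'am', 'Sam']))
# [('<start>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '<end>')]
```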
For this task, an n-gram model of a collection of n-grams is a dictionary. Keys are tuples of length n-1, representing the prefixes of all n-grams included in the model. Values are lists of words that appear as the last word in n-grams whose first n-1 words are the corresponding key. The following examples show a single sentence and a pair of sentences, their collections of bigrams (2-grams), and their corresponding bigram (2-gram) models.
# sentence
sentence1 = ['I', 'am', 'Sam']
# list of bigrams (2-grams) for this sentence
bigramList1 = [('<start>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '<end>')]
# bigram (2-gram) model of this sentence
# keys are 1-tuples, since 2-1 = 1
model1 = {
('<start>',): ['I'], # First word was 'I'
('I',): ['am'], # 'am' followed 'I'
('am',): ['Sam'], # 'Sam' followed 'am'
('Sam',): ['<end>'] # Sentence end followed 'Sam'
}
# list of sentences
sentences2 = [['I', 'am', 'Sam'], ['Sam', 'I', 'am']]
# list of bigrams for each sentence
bigramList2 = [[('<start>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '<end>')],
[('<start>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '<end>')]]
# bigram model for this pair of sentences
# keys are 1-tuples, since 2-1 = 1
model2 = {
('<start>',): ['I', 'Sam'], # First word was either 'I' or 'Sam'
('I',): ['am', 'am'], # 'am' follows 'I' in two places
('am',): ['Sam', '<end>'], # Both 'Sam' and sentence end followed 'am'
('Sam',): ['<end>', 'I'], # Both sentence end and 'I' followed 'Sam'
}
An n-gram model summarizes what words follow a given series of n-1
words in the texts from which the model is built. For example, the
second entry in model2
indicates that we observed two words that
followed 'I'
in the text and both were the word 'am'
. Note that
'am'
is listed twice. By preserving duplicates, the model
represents how often each following word appeared.
The next entry indicates that after the word 'am'
, we observed the
word 'Sam'
once and we observed the end of a sentence once. The
first entry indicates that we observed both 'I'
and 'Sam'
as the
first word of sentences.
As a second example, consider the same sentences with trigrams (3-grams). Considering words in threes, we can reason about what words follow a given pair of words in sequence.
# list of sentences
sentences3 = [['I', 'am', 'Sam'], ['Sam', 'I', 'am']]
# list of trigrams (3-grams) for each sentence
trigramList3 = [[('<start>', '<start>', 'I'), ('<start>', 'I', 'am'),
('I', 'am', 'Sam'), ('am', 'Sam', '<end>')],
[('<start>', '<start>', 'Sam'), ('<start>', 'Sam', 'I'),
('Sam', 'I', 'am'), ('I', 'am', '<end>')]]
# trigram (3-gram) model for this pair of sentences
# keys are 2-tuples, since 3-1 = 2
model3 = {
('Sam', 'I'): ['am'],
('I', 'am'): ['Sam', '<end>'],
('am', 'Sam'): ['<end>'],
('<start>', 'I'): ['am'],
('<start>', 'Sam'): ['I'],
('<start>', '<start>'): ['I', 'Sam']
}
n-gram models have a wide range of applications in fields of computing such as natural language processing, computational biology, and machine learning. In this task, you will use very simple n-gram models for a less noble goal: generating silly sentences!
Your Task: Building the Model
Define a function buildModel that takes two arguments: the length of n-grams to use, and a list of sentences. buildModel will build a simple n-gram model dictionary from the given list of sentences using higher-order functions (especially reduce).

To implement buildModel, you will translate the function buildModelLoops, which uses loops for the same task. You must translate each loop to a reduce call, including a new helper function, anonymous or named.
In [3]: buildModel(2, [['I', 'am', 'Sam'], ['Sam', 'I', 'am']])
Out[3]: {
('<start>',): ['I', 'Sam'],
('I',): ['am', 'am'],
('am',): ['Sam', '<end>'],
('Sam',): ['<end>', 'I'],
}
To implement buildModel, you will:
- Follow the general rules for this Task.
- Follow the same general structure of nested reduce calls and the same incremental development style as you did in Subtask 1b.
- Filter out sentences shorter than n words and avoid recording their n-grams in the model, as in buildModelLoops. This helps make more interesting generated sentences.
- Use the provided ngrams function, which takes two arguments: a length n, and a sentence, as in buildModelLoops. ngrams extracts the sequence of length-n n-grams from the given sentence.
  - If reading the code for ngrams, recall that f(*[a,b,c]) is equivalent to f(a,b,c) in Python.
- Define and use a combiner function countGram that takes a model dictionary and a single n-gram as arguments. This function's body will be the same as the body of the inner for loop in buildModelLoops, but it will additionally return the model dictionary.
- Define and use a combiner function countSentenceGrams (or a lambda instead of a named function) that takes a model dictionary and a sentence as arguments. This function's body will be a conversion of the body of the outer for loop in buildModelLoops, using reduce and countGram in place of the inner loop.
- Use nested calls to reduce with these "body" functions to accumulate a model over the n-grams of all sentences.
Hints:
- As in earlier tasks, start by converting the inner loop to countGram and use a single reduce to accumulate a model dictionary for the n-grams of sentences[0], starting from an empty initial accumulator dictionary.
- Once this is working, package it in countSentenceGrams, using parameters for the model and sentence instead of hardcoded values. Reuse it to reduce over the n-grams of each sentence in turn, building a model dictionary capturing the n-grams of all sentences.
- Note that there will be a nice one-to-one mapping from each loop in buildModelLoops to each reduce/combiner pair in buildModel.
- Test incrementally! See the testing code at the end of wordFrequency.py to test your buildModel implementation on the first few sentences of files in the texts directory. Remember, you can always compare to buildModelLoops to double-check that your result is correct.
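The inner combiner has a simple shape. As a hedged illustration (your countGram, converted from buildModelLoops, may differ in details), a single reduce over a hard-coded bigram list rebuilds model1 from the examples above:

```python
from functools import reduce  # reduce is a builtin in Python 2

# Illustrative combiner: record that the last word of the gram
# followed its (n-1)-word prefix. Your conversion of the inner loop
# in buildModelLoops may differ in details.
def countGram(model, gram):
    prefix, last = gram[:-1], gram[-1]
    model.setdefault(prefix, []).append(last)
    return model

bigrams = [('<start>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '<end>')]
model = reduce(countGram, bigrams, {})
print(model[('I',)])  # ['am']
```

A second reduce layer over all sentences, mirroring the outer loop, then yields the full model.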
Your Reward: Generating Sentences for Testing, Fun, and Profit
Since the n-gram model represents how often words tend to follow other words in the text we observed, we can also use it to predict what word should come next, given the words in a sentence so far. In other words, we can generate random sentences that tend to have the same kinds of word sequences (n-grams) as the text we observed.
To start a random sentence using the bigram model model2
above, we
choose a first word by randomly choosing one of the words that has
been observed to follow the '<start>'
symbol. For example: 'I'
.
From there, we continue by randomly choosing a word that has followed
'I'
: 'am'
. We continue this process choosing a word to follow the
recent words until we eventually choose the '<end>'
symbol, when we
will conclude the sentence. For example, we might generate the
sentence "I am Sam I am" by choosing randomly until choosing
'<end>'
:
['I']
['I', 'am']
['I', 'am', 'Sam']
['I', 'am', 'Sam', 'I']
['I', 'am', 'Sam', 'I', 'am']
Generalizing from 2-grams to n-grams, we use the preceding n-1 words to predict the next word, rather than just the preceding one word.
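The process above can be sketched as follows. This generateSentence is a hypothetical illustration, not the provided generate function, which may work differently:

```python
import random

# Hypothetical sketch of sentence generation from an n-gram model;
# the provided generate function may differ in details.
def generateSentence(model, n=2):
    words = ['<start>'] * (n - 1)          # pad with start markers
    while words[-1] != '<end>':
        prefix = tuple(words[-(n - 1):])   # last n-1 words chosen so far
        words.append(random.choice(model[prefix]))
    return words[n - 1:-1]                 # drop the padding and '<end>'

model2 = {('<start>',): ['I', 'Sam'], ('I',): ['am', 'am'],
          ('am',): ['Sam', '<end>'], ('Sam',): ['<end>', 'I']}
print(' '.join(generateSentence(model2)))  # e.g. 'I am Sam I am'
```

Because each following word is drawn with duplicates preserved, frequent continuations in the source text are chosen proportionally more often.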
How to Generate Silly Sentences
To help test and enjoy your implementation of buildModel, we provide a function buildGenerator that takes an n-gram length and a list of sentences and returns a function. The function that is returned can be called with no arguments to generate a random sentence. Read generate to see how this is accomplished. For example:
In [4]: sentences = sentencesFromFile('texts/grimm.txt')
In [5]: gen = buildGenerator(3, sentences)
In [6]: u' '.join(gen())
Out[6]: u'The wretch often disguises himself but you must cease using
and return or destroy all copies of this And the old woman was a fly
and in the middle of winter when the war to begin the willow-wren sent
down the steps'
In [7]: u' '.join(gen())
Out[7]: 'The bird delighted with its head hanging down as if an evil
spirit in the twinkling of an innocent boy who wanted to punish his
wicked sons but they made a bite and to try and set her free from care'
In [8]: u' '.join(gen())
Out[8]: u'This is a young fox is here At this the best of all until flames
burst forth throughout the castle all was heard and a bottle of wine'
The generated sentences lack punctuation, but can be quite
entertaining. When your generator is working, try different values
for n and explore other texts in the texts
directory, your own
writings (saved as plain text), or publicly available texts. Try
building models from sentences from more than one text to blend their
styles together. Share your favorites
here.
Task 2: Create Your Own Animation
This task is a partner problem in which you are required to work with a partner as part of a two-person team.
In the ps10/CS111Animation folder, we have included a copy of the core CS111 animation framework files you saw in Lec 21 and Lab 13. In this task, you and your partner should work together to create a new animation from scratch in the CS111Animation folder.
For your animation, you should design and implement new sprites. Feel free to reuse parts of the graphic scenes you designed for PS1.
Your animation must satisfy the following criteria:
- Your animation file must be named userName1_userName2_Animation.py. For example: sbuck_slee_Animation.py.
- Your animation file must appropriately import the cs1graphics, Animation, and Sprite modules, as well as the modules in which you define your sprite classes.
- You should define at least 3 different sprite classes. At least two of these sprite classes should be related through inheritance (i.e., at least one class should inherit from another).
- Each sprite class should have state that changes over time and its own step method which implements a unique behavior (e.g. bouncing, flipping, dancing, dropping objects, etc.). If a class inherits from another, it should override the step method of the parent class.
- Each sprite class should define an appropriate stateString method that shows the state of the sprite for debugging purposes.
- Your animation should include at least 3 different kinds of sprites, but can also include multiple instances of the same sprite classes.
- Important: After completing this task, you must take a screen capture video of your animation. For making your screen capture video, use the QuickTime Player software (which is installed on all lab computers). Here are instructions for how to use this software to make a screen capture video. Name your video userName1_userName2_Animation.mov and upload it to this shared google drive folder.
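As a framework-free sketch of the step/stateString/inheritance criteria (the class names and update logic here are hypothetical; your real sprites must extend the framework's Sprite class and draw with cs1graphics):

```python
# Hypothetical, framework-free sketch of the sprite criteria.
# Real sprites must extend the course framework's Sprite class.
class Ball(object):
    def __init__(self, x, y, dy=1):
        self.x, self.y, self.dy = x, y, dy  # state that changes over time

    def step(self):
        self.y += self.dy  # drift steadily each animation step

    def stateString(self):
        return 'Ball at ({0}, {1})'.format(self.x, self.y)

class BouncingBall(Ball):  # related to Ball through inheritance
    def __init__(self, x, y, dy=1, floor=10):
        Ball.__init__(self, x, y, dy)
        self.floor = floor

    def step(self):  # overrides the parent's step behavior
        if self.y >= self.floor:
            self.dy = -abs(self.dy)  # bounce back up at the floor
        self.y += self.dy

    def stateString(self):
        return 'BouncingBall at ({0}, {1})'.format(self.x, self.y)
```

The subclass reuses the parent's state via Ball.__init__ and overrides step to specialize the behavior, which is the pattern the criteria ask for.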
Other Notes:
- Study the code in the Lec 21 animation materials for examples of how to define sprite classes and an animation class.
- By default, the state of all the sprites at each step of an animation will be displayed in the Canopy interaction pane. If you wish to turn off this display, invoke the .setDebug(False) method on your animation before invoking .start() on the animation.
Task 3: Honor Code Form and Final Checks
As in the previous problem sets, your honor code submission for this pset will involve defining values for the variables in the honorcode.py file. This is a Python file, so your values must be valid Python code (strings or numbers).
Be sure to test your code thoroughly by calling your functions on new arguments of your own invention and verifying that the results are as expected. We have removed the Otter Inspect training wheels on this assignment, so it is your responsibility to test your code.
If you wrote any function invocations or print statements in your Python files to test your code, please remove them, comment them out before you submit, or wrap them in an if __name__ == '__main__' block.
Points will be deducted for isolated function invocations or superfluous print statements.
Be sure to run the final version to make sure there are no syntax errors resulting from commenting these pieces.
How to turn in this Problem Set
- For Task 1, save your final wordFrequency.py in the ps10 folder.
- For Task 2:
  - Each team member should save all final sprite and animation files you developed for your animation in the CS111Animation subfolder.
  - Follow the instructions in Task 2 for creating and uploading the screen-capture video of your animation.
- For the Honor Code, save your filled-in honorcode.py file in the ps10 folder.
- Note: It is critical that the name of the folder you submit is ps10, and your submitted files/folders are wordFrequency.py, CS111Animation, and honorcode.py. In other words, do not rename the folder that you downloaded, do not create new files or folders (except your sprite and animation files for Task 2), and do not delete or rename any of the existing files in this folder. Improperly named files or functions will incur penalties.
- Drop your entire ps10 folder in your drop folder on the cs server using Cyberduck by 4:00pm on Thursday, May 4th, 2017.
- Failure to submit your code before the deadline will result in zero credit for the code portion of PS10.