CS 111 Lecture: Accumulation pattern with dictionaries and lists
Table of Contents
- Fun with English words
- English word length distribution
- Bonus: Visualize the length distribution
- Fun with statistics
- Building n-grams
- Creating the bigram distribution
- Creating more n-gram distributions
- [Optional]: Exercise 1: Sort the n-gram distributions
- [Optional]: Dictionary Comprehension
- A dictionary of dictionaries
- A dictionary of lists
- Exercise 2: Group n-length words by first letter
1. Fun with English words
englishwords is a list of words from the English language.
from vocabulary import englishwords
type(englishwords)
len(englishwords)
englishwords[10000:10010]
QUESTION 1: Generate a list called wordLengths of the length of each word in englishwords.
# Your code here
wordLengths = []
for word in englishwords:
    wordLengths.append(len(word))
# Can also use a list comprehension
wordLengths = [len(word) for word in englishwords]
What is the longest length?
print(max(wordLengths))
QUESTION 2: Generate a list called upperWords that converts all words to uppercase (use the string method upper).
# Your code here
upperWords = []
for word in englishwords:
    upperWords.append(word.upper())
# With a list comprehension
upperWords = [word.upper() for word in englishwords]
print(upperWords[:10])
QUESTION 3: Generate a list called startWithVowel of all words that start with a vowel. (There are 12119 such words in the list.)
# Your code here
startWithVowel = []
for word in englishwords:
    if word.lower()[0] in 'aeiou':
        startWithVowel.append(word)
# With a list comprehension
startWithVowel = [word for word in englishwords if word.lower()[0] in 'aeiou']
print(len(startWithVowel))
print(startWithVowel[:10])
QUESTION 4: Generate a list called wordLength5 of all words with length 5. There are 4684 such words.
# Your code here
wordLength5 = []
for word in englishwords:
    if len(word) == 5:
        wordLength5.append(word)
# With a list comprehension
wordLength5 = [word for word in englishwords if len(word) == 5]
print(len(wordLength5))
print(wordLength5[:10])
QUESTION 5: Generate a list called sameStartEnd of all words that start and end with the same letter. There are 4450 such words, e.g., yearly, scripts, etc.
# Your code here
sameStartEnd = []
for word in englishwords:
    if word[0] == word[-1]:
        sameStartEnd.append(word)
# With a list comprehension
sameStartEnd = [word for word in englishwords if word[0] == word[-1]]
print(len(sameStartEnd))
print(sameStartEnd[:10])
2. English words length distribution
We want to count how many words there are of each length. We can do this in two steps:
# STEP 1 - create a list of each word's length
lengthsList = []
for word in englishwords:
    lengthsList.append(len(word))
print(lengthsList[:20])
Create a dictionary to keep track of the count of each unique word length in lengthsList. The dictionary's keys should be word lengths and the values should be the total number of words in englishwords that have each respective length. Use lengthsList in your solution:
# STEP 2 - create a dictionary to keep track of the count of each unique word length in lengthsList
# Your code here
lengthsDct = {}
for length in lengthsList:
    lengthsDct[length] = lengthsDct.get(length, 0) + 1
lengthsDct
Note: The keys of the dictionary are displayed in the order in which they were inserted. For example, words of length 4 took a while to occur in the list of words, which is why that key is not shown right after the key 3.
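To read the counts in order of word length instead of insertion order, one option is to sort the items by key before displaying them. The sketch below uses a small made-up dictionary; the name sampleDct is a hypothetical stand-in for lengthsDct, and its numbers are invented for illustration.

```python
# sampleDct is a made-up stand-in for lengthsDct (keys inserted out of order)
sampleDct = {5: 3, 3: 7, 8: 2, 4: 6}

# sorted() orders the (length, count) pairs by length (the first tuple element),
# and dict() rebuilds a dictionary whose keys are inserted in that order
sortedByLength = dict(sorted(sampleDct.items()))
print(sortedByLength)
```

The new dictionary holds the same key/value pairs as the original; only the display order changes.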
Alternative and more efficient solution
The previous solution iterates twice over all words. With a big list this takes time and space in memory (the additional list of lengths that we create).
This is why a solution that does everything in one single loop is more efficient:
lengthsDct2 = {}
for word in englishwords:
    length = len(word)
    lengthsDct2[length] = lengthsDct2.get(length, 0) + 1
# Check that we get the same result
print(lengthsDct2 == lengthsDct)
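For comparison, Python's standard library offers collections.Counter, which does this kind of counting for you. It is not part of this lecture; the sketch below only checks, on a made-up list (sampleWords is a hypothetical stand-in for englishwords), that it agrees with the one-loop accumulation pattern.

```python
from collections import Counter

# sampleWords is a made-up stand-in for englishwords
sampleWords = ['cat', 'horse', 'dog', 'zebra', 'ox']

# the accumulation pattern from above, in a single loop
countsDct = {}
for word in sampleWords:
    length = len(word)
    countsDct[length] = countsDct.get(length, 0) + 1

# Counter counts how often each value occurs in the iterable it is given
countsCounter = Counter(len(word) for word in sampleWords)

print(countsDct)
print(dict(countsCounter) == countsDct)
```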
3. Bonus: Visualize the frequency distribution
We will not learn about visualization in this class, but if you are interested, talk to one of the instructors to get access to materials about visualization.
import matplotlib.pyplot as plt
# If you get an error saying that the matplotlib library is not found,
# in Thonny, use Tools>Manage Packages to install the matplotlib library,
# and then relaunch launchNotebook.py
plt.style.use('ggplot')
%matplotlib inline
# use the function histogram, specify how many bars
plt.hist(lengthsList, bins=len(lengthsDct))
# decorate the plot with labels
plt.title("English words frequency distribution")
plt.ylabel("Counts")
plt.xlabel("Word length")
plt.show()
4. Fun with statistics
Write code to find the following descriptive statistics:
- average value (or mean),
- the median,
- the mode,
- the variance, and
- the standard deviation
for the list lengthsList.
Built-in function sum
sum(range(10))
This function takes a sequence of numbers, such as a list or a range, and returns their sum. It makes your programming life easier.
# the average (or mean): the sum of all elements divided by total number
mean = sum(lengthsList)/len(lengthsList)
print(mean)
# the median: the middle value in a sorted list.
# Note: The .sort method on list will mutate the list to make the elements be in sorted order.
# The sorted function returns a new list with the elements in sorted order.
# Both of these can take an optional 'key' keyword parameter that specifies how the
# elements should be sorted.
lengthsList.sort() # sort the list - this mutates the list in-place
middleIndex = len(lengthsList)//2 # find the index in the middle of the sorted list
lengthsList[middleIndex]
# the mode: the word length that is most frequent (sort the items of the freq dict)
def sortingOrder(item):
    """Helper function: for a tuple return the element at index 1."""
    return item[1]

sortedPairs = sorted(lengthsDct.items(),
                     key=sortingOrder,
                     reverse=True)
sortedPairs
Note: Let's look at the top six items sorted above:
[(8, 10672),
(7, 10098),
(9, 9663),
(10, 7710),
(6, 7366),
(11, 5393),
]
Notice that the pairs (word_length, frequency) are sorted by the second item of the tuple (the one at index 1), in reverse order, from the largest to the smallest. The mode is therefore the word length in the first pair.
# variance: (with list comprehension) take the square of the
# differences of items with the mean and find their sum.
variance = sum([(value - mean)**2 for value in lengthsList]) / len(lengthsList)
variance
# the standard deviation: the square root of the variance. Use math.sqrt
import math
stdev = math.sqrt(variance)
stdev
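As a cross-check, the standard library module statistics provides functions for all five quantities. The sketch below runs them on a small made-up list (sampleLengths is a hypothetical stand-in for lengthsList). Two details are worth noting: pvariance and pstdev divide by the number of items, as the code above does, and statistics.median averages the two middle values when the list has an even number of items, whereas the indexing approach above picks a single element.

```python
import statistics

# sampleLengths is a made-up stand-in for lengthsList
sampleLengths = [3, 5, 5, 6, 8, 9]

print(statistics.mean(sampleLengths))       # sum of the values divided by their count
print(statistics.median(sampleLengths))     # averages the two middle values here (even count)
print(statistics.mode(sampleLengths))       # the most frequent value
print(statistics.pvariance(sampleLengths))  # population variance (divides by n)
print(statistics.pstdev(sampleLengths))     # square root of the population variance
```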
5. Building n-grams
n-grams are sequences of one (unigram), two (bigram), or up to n consecutive tokens. These tokens can be characters or words. n-grams are widely used in natural language processing and computational linguistics, for example as models for predictive text. See the Wikipedia article on n-grams.
word = 'boston'
print(list(word))
Explore the zip function:
zip(word, word)
Like several functions we have seen, zip does not return a list; it returns a zip object. To see the contents of the zip object, pass it into list(). You will see that the zip object yields tuples.
zipObject = zip(word, word)
list(zipObject)
The zip function can also take iterables of uneven length; it stops as soon as the shortest one runs out. Observe the output below.
list(zip(word, word[1:]))
Explore the join method for a sequence of strings:
"".join(('b', 'o'))
for pair in zip(word, word[1:]):
    print("".join(pair))
Let's write functions that create unigrams, bigrams, and trigrams:
def unigrams(word):
    """Returns a list of unigrams for the given word."""
    return list(word)

def bigrams(word):
    """Returns a list of bigrams for the given word."""
    grams = []
    for pair in zip(word, word[1:]):
        grams.append("".join(pair))
    return grams

def trigrams(word):
    """Returns a list of trigrams for the given word."""
    grams = []
    for triple in zip(word, word[1:], word[2:]):
        grams.append("".join(triple))
    return grams
Try them out:
word = 'boston'
print(unigrams(word))
print(bigrams(word))
print(trigrams(word))
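The three functions follow the same pattern, so they can be generalized to any n. The sketch below uses string slicing instead of zip; the function name ngrams is hypothetical and not part of the lecture code.

```python
def ngrams(word, n):
    """Returns a list of all n-grams (substrings of length n) of the given word.
    Hypothetical helper, not part of the lecture code.
    """
    # the last valid starting index is len(word) - n;
    # if n is longer than the word, the range is empty and so is the result
    return [word[i:i+n] for i in range(len(word) - n + 1)]

print(ngrams('boston', 1))
print(ngrams('boston', 2))
print(ngrams('boston', 3))
```

With this helper, ngrams(word, 2) and ngrams(word, 3) produce the same lists as bigrams(word) and trigrams(word).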
6. Create the bigram distribution
Suppose we have several words: 'boston', 'bogota', 'bolivia', 'bosphorus', 'bottomly'.
How many times does each bigram occur?
for word in ['boston', 'bogota', 'bolivia', 'bosphorus', 'bottomly']:
    print(bigrams(word))
We can use the .extend() method to create one list of all the bigrams:
allBigrams = []
for word in ['boston', 'boson', 'bosnia']:
    allBigrams.extend(bigrams(word))  # list1.extend(list2) mutates list1 by adding all the
                                      # elements in list2 to list1. list2 remains unchanged.
print(allBigrams)
print(len(allBigrams))
Let's see how to create the frequency distribution for a small number of words and watch the dictionary grow:
fewWords = ['boston', 'boson', 'bosnia']
bigramsDct = {} # accumulator dictionary
for word in fewWords:
    print("working with the word '{}'".format(word))
    bigramsList = bigrams(word)
    for bigram in bigramsList:
        bigramsDct[bigram] = bigramsDct.get(bigram, 0) + 1
        print(bigram, bigramsDct)
Now that we understand how this works for a few words, we can go and create a function that will work with all English words:
def createBigramFrequency():
    """Create and return the bigram frequency distribution of
    all words in 'englishwords'.
    """
    bigramsDct = {}  # accumulator dictionary
    for word in englishwords:
        bigramsList = bigrams(word)  # create ngrams as a list
        # add new bigrams or update counts of existing ones
        for bigram in bigramsList:
            bigramsDct[bigram] = bigramsDct.get(bigram, 0) + 1
    return bigramsDct
bigramsDct = createBigramFrequency()
print(len(bigramsDct))
print(list(bigramsDct.items())[:10])
7. Creating more n-gram distributions
The process of adding to or updating a frequency distribution is always the same, so we can capture it in a function:
def storeNgrams(ngramsList, ngramsDict):
    """Given a list of items and a dictionary,
    update the counts of the dictionary keys.
    """
    for ngram in ngramsList:
        ngramsDict[ngram] = ngramsDict.get(ngram, 0) + 1
Now that we have this function, we can create all n-grams at the same time:
unigramsDct = {}
bigramsDct = {}
trigramsDct = {}
for word in englishwords:
    # create ngrams
    ngrams1 = unigrams(word)
    ngrams2 = bigrams(word)
    ngrams3 = trigrams(word)
    # store ngrams in freq dict
    storeNgrams(ngrams1, unigramsDct)
    storeNgrams(ngrams2, bigramsDct)
    storeNgrams(ngrams3, trigramsDct)
print(len(unigramsDct))
print(len(bigramsDct))
print(len(trigramsDct))
Since unigramsDct has the smallest number of keys (equal to the number of letters in the English alphabet), let's print it out:
unigramsDct
If we were to look at bigrams and trigrams as combinations of all alphabet letters in groups of 2 and 3 letters, we should expect 26x26 = 676 bigrams and 26x26x26 = 17576 trigrams. But we see fewer than these expected numbers.
This is because certain combinations of letters never occur in English, for example: 'bbb', or 'cbz', etc.
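To see which combinations are missing, one approach is to generate every possible letter pair and keep the ones that are not keys of the frequency dictionary. The sketch below uses itertools.product from the standard library and a small made-up dictionary (sampleBigramsDct is a hypothetical stand-in for bigramsDct, with invented counts).

```python
from itertools import product
from string import ascii_lowercase

# sampleBigramsDct is a made-up stand-in for bigramsDct
sampleBigramsDct = {'bo': 3, 'os': 3, 'st': 1, 'to': 1, 'on': 2}

# every possible pair of lowercase letters: 26 * 26 = 676 strings
allPairs = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
print(len(allPairs))

# the pairs that never appear as keys in the frequency dictionary
missing = [pair for pair in allPairs if pair not in sampleBigramsDct]
print(len(missing))
print(missing[:5])
```

Running the same filter against the real bigramsDct would list the letter pairs that never occur in englishwords.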
8. Optional: Sort the n-gram distributions
We want to know the top 5 n-grams in each distribution.
If we naively use sorted with a dictionary, it will return only the keys, sorted alphabetically:
myDct = {'are': 10, 'zap': 2, 'bla': 14, 'ten': 8}
sorted(myDct)
Remember the method items:
myDct.items()
Sort the list of items of the dictionary by value in descending order, so that the most frequent n-grams come first. The .items() method when iterated over will yield tuples whose first element is the key and whose second element is the value.
def sortItemsInFreqDict(freqDict):
    """Takes a dictionary and returns a list of items sorted by the value,
    in descending order. Uses helper function 'sortingOrder'.
    """
    return sorted(freqDict.items(), key=sortingOrder, reverse=True)
# show the top 5 unigrams
sortItemsInFreqDict(unigramsDct)[:5]
# show the top 5 bigrams
sortItemsInFreqDict(bigramsDct)[:5]
# show the top 5 trigrams
sortItemsInFreqDict(trigramsDct)[:5]
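The key function does not have to be a named helper. The sketch below shows two common alternatives to sortingOrder on the small dictionary from above: operator.itemgetter from the standard library, and an inline lambda. Neither is required for this course; they are shown only as equivalent spellings.

```python
from operator import itemgetter

myDct = {'are': 10, 'zap': 2, 'bla': 14, 'ten': 8}

# itemgetter(1) builds a function that returns the element at index 1,
# which is what the helper sortingOrder does
byValue = sorted(myDct.items(), key=itemgetter(1), reverse=True)
print(byValue[:2])

# the same key function written inline as a lambda
byValueLambda = sorted(myDct.items(), key=lambda item: item[1], reverse=True)
print(byValue == byValueLambda)
```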
9. Optional: Dictionary Comprehension
This is very similar to list comprehension, except that it creates a dictionary of key/value pairs.
wordsLst = 'the autumn is dragging its feet'.split()
wordsLst
# the mapping pattern
{word: len(word) for word in wordsLst}
# the filtering pattern
{word: len(word) for word in wordsLst if word[0] in 'aeiou'}
YOUR TURN: Create a dictionary that maps ODD numbers to their square values like the one shown below:
{1: 1, 3: 9, 5: 25, 7: 49, 9: 81, 11: 121}
# your code goes here
{num: num**2 for num in range(1, 12) if num%2==1}
10. A dictionary of dictionaries
Let's group the bigrams by unigrams:
{'a': {'aa': 19,
'ab': 1665,
'ac': 2387,
'ad': 1685,
...},
'b': {'ba': 1431,
'bb': 417,
'bc': 25,
'bd': 35,
...},
...
}
# ascii_lowercase is a string constant defined in the module string. It is simply 'abcdefghijklmnopqrstuvwxyz'.
# The keyword as is used to give ascii_lowercase the simpler name of lowercase.
from string import ascii_lowercase as lowercase
ngramsByFirstLetter = {char: {} for char in lowercase}
print(ngramsByFirstLetter)
for word in englishwords:
    ngrams2 = bigrams(word)
    for ngram in ngrams2:
        firstLetter = ngram[0]
        letterDict = ngramsByFirstLetter[firstLetter]
        letterDict[ngram] = letterDict.get(ngram, 0) + 1
print(len(ngramsByFirstLetter))
ngramsByFirstLetter['q']
Print out how many bigrams are associated with each alphabet letter. We saw above that the letter 'q' has only 7 bigrams.
# your code here
{letter: len(ngramsByFirstLetter[letter]) for letter in ngramsByFirstLetter}
CHALLENGE YOURSELF: Can you organize the trigrams by bigrams? That is, repeat what we did for unigram/bigrams for bigram/trigrams.
# your code here
biAndTriDct = {}
for word in englishwords:
    ngrams3 = trigrams(word)  # creates trigrams such as 'bos', 'ost', 'sto'
    for ngram in ngrams3:
        bigram = ngram[:2]  # get first letters, 'bo', 'os', 'st'
        trigramDct = biAndTriDct.get(bigram, {})  # get the dict we want to update
        trigramDct[ngram] = trigramDct.get(ngram, 0) + 1  # update the dict with counts
        biAndTriDct[bigram] = trigramDct  # assign it to the key, because the first time there is nothing
print("length of dict:", len(biAndTriDct))
biAndTriDct['qu']
biAndTriDct['pr']
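The "get, update, assign back" steps in the inner loop can also be written with the dictionary method setdefault, which returns the value stored under a key and first stores a default if the key is missing. The sketch below applies it to a short made-up list of trigrams (sampleTrigrams and nestedDct are hypothetical names).

```python
# sampleTrigrams is a made-up list: the trigrams of 'boston' followed by those of 'boson'
sampleTrigrams = ['bos', 'ost', 'sto', 'ton', 'bos', 'oso', 'son']

nestedDct = {}
for ngram in sampleTrigrams:
    bigram = ngram[:2]
    # setdefault returns the inner dict for this bigram,
    # storing a new empty one under the key first if it is missing
    innerDct = nestedDct.setdefault(bigram, {})
    innerDct[ngram] = innerDct.get(ngram, 0) + 1

print(nestedDct)
```

Because setdefault stores the inner dictionary on first use, no separate assignment back to the outer dictionary is needed.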
11. Challenge Exercise: A dictionary of lists
We want to group all words based on their last two letters (one-letter words are grouped under that single letter), for example:
{'ed': ['abandoned',
'abased',
'abashed',
'abated',
...],
'ly': ['abjectly',
'ably',
'abnormally',
'abominably',
...],
...
}
wordsByEnding = {}
# Accumulate the proper dictionary for wordsByEnding
# Your code here
for word in englishwords:
    end = word[-2:]  # could potentially have single letters
    wordsByEnding[end] = wordsByEnding.get(end, [])
    wordsByEnding[end].append(word)
Check what endings were generated:
wordsByEnding.keys()
wordsByEnding['ky']
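Another way to avoid checking whether a key already exists is collections.defaultdict from the standard library, which creates a default value the first time a missing key is accessed. It is not covered in this lecture; the sketch below uses a made-up word list (sampleWords and endingsDct are hypothetical names).

```python
from collections import defaultdict

# sampleWords is a made-up stand-in for englishwords
sampleWords = ['abandoned', 'ably', 'abased', 'a', 'abjectly']

# defaultdict(list) creates an empty list the first time a key is accessed
endingsDct = defaultdict(list)
for word in sampleWords:
    endingsDct[word[-2:]].append(word)

# convert to a plain dict for display
print(dict(endingsDct))
```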
12. Challenge Exercise: group n-length words by first letter
Write a function organizeWordsByLength that, given a list of words and a word length, returns a dictionary of lists keyed by lowercase letters. Each list contains all n-length words that start with the same letter, for example:
{'a': ['abcs',
'abed',
'abet',
'able',
...],
'b': ['baas',
'babe',
'babu',
'baby',
...],
...
}
The function should work for any length.
Example of calling the function:
In []: fourLetterWords = organizeWordsByLength(englishwords, 4)
In []: fourLetterWords['q']
Out []: ['quad', 'quay', 'quid', 'quin', 'quip', 'quit', 'quiz', 'quod']
from string import ascii_lowercase as lowercase
def organizeWordsByLength(wordsList, length):
    # Your code here
    # Step 1: We start by creating an empty dictionary of lists
    # that will have as keys all the letters of the alphabet.
    wordsDct = {char: [] for char in lowercase}
    # Step 2: We iterate through the list of words and check their length
    # words that fit the criteria are appended to the list corresponding
    # to their first letter.
    for word in wordsList:
        if len(word) == length:
            firstLetter = word[0]
            wordsDct[firstLetter].append(word)
    return wordsDct
fourLetterWords = organizeWordsByLength(englishwords, 4)
fourLetterWords['q']
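The solution above assumes that every word starts with a lowercase letter; a word starting with an uppercase letter or a digit would raise a KeyError. Whether englishwords contains such words is not stated here, so the variant below is only a defensive sketch. The name organizeWordsByLengthSafe and the sample list are hypothetical.

```python
from string import ascii_lowercase as lowercase

def organizeWordsByLengthSafe(wordsList, length):
    """Hypothetical variant of organizeWordsByLength that lowercases the
    first letter and skips words that do not start with a letter a-z.
    """
    wordsDct = {char: [] for char in lowercase}
    for word in wordsList:
        if len(word) == length:
            # word[:1] is the first character, or '' for an empty string
            firstLetter = word[:1].lower()
            if firstLetter in wordsDct:
                wordsDct[firstLetter].append(word)
    return wordsDct

# a made-up list mixing lowercase, capitalized, and non-letter starts
sample = ['quad', 'Quiz', 'baby', '1984', 'abandon']
result = organizeWordsByLengthSafe(sample, 4)
print(result['q'])
print(result['b'])
```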