1. Fun with English words

englishwords is a list of words from the English language.

In [1]:
from vocabulary import englishwords
In [2]:
type(englishwords)
Out[2]:
list
In [3]:
len(englishwords)
Out[3]:
66230
In [4]:
englishwords[10000:10010]
Out[4]:
['cleans',
 'cleanse',
 'cleansed',
 'cleanser',
 'cleansers',
 'cleanses',
 'cleansing',
 'cleanup',
 'cleanups',
 'clear']

QUESTION 1: Generate a list called wordLengths of the length of each word in englishwords.

In [5]:
# Your code here
wordLengths = []
for word in englishwords:
    wordLengths.append(len(word))

# Can also use a list comprehension
wordLengths = [len(word) for word in englishwords]

What is the longest length?

In [6]:
print(max(wordLengths))
22

QUESTION 2: Generate a list called upperWords that converts all words to uppercase (use the string method upper).

In [7]:
# Your code here
upperWords = []
for word in englishwords:
    upperWords.append(word.upper())

# With a list comprehension
upperWords = [word.upper() for word in englishwords]
In [8]:
print(upperWords[:10])
['A', 'AA', 'AAS', 'ABACI', 'ABACK', 'ABACUS', 'ABACUSES', 'ABAFT', 'ABANDON', 'ABANDONED']

QUESTION 3: Generate a list called startWithVowel of all words that start with a vowel. There are 12119 such words in the list.

In [9]:
# Your code here
startWithVowel = []
for word in englishwords:
    if word.lower()[0] in 'aeiou':
        startWithVowel.append(word)

# With a list comprehension
startWithVowel = [word for word in englishwords if word.lower()[0] in 'aeiou']
In [10]:
print(len(startWithVowel))
print(startWithVowel[:10])
12119
['a', 'aa', 'aas', 'abaci', 'aback', 'abacus', 'abacuses', 'abaft', 'abandon', 'abandoned']
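
Indexing with word[0] raises an IndexError on an empty string. The word list here does not seem to contain any, but as a hedged alternative, the string method startswith accepts a tuple of prefixes and simply returns False for an empty string. The sample list and the name VOWELS below are made up for illustration:

```python
# Hypothetical sample data, including an empty string on purpose
sampleWords = ['apple', 'Egg', 'banana', '', 'under']

VOWELS = tuple('aeiou')  # ('a', 'e', 'i', 'o', 'u')

# startswith returns False for '' instead of raising IndexError
vowelStarters = [word for word in sampleWords
                 if word.lower().startswith(VOWELS)]
print(vowelStarters)
```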

QUESTION 4: Generate a list called wordLength5 of all words with length 5. There are 4684 such words.

In [11]:
# Your code here
wordLength5 = []
for word in englishwords:
    if len(word) == 5:
        wordLength5.append(word)
        
# With a list comprehension
wordLength5 = [word for word in englishwords if len(word) == 5]
In [12]:
print(len(wordLength5))
print(wordLength5[:10])
4684
['abaci', 'aback', 'abaft', 'abase', 'abash', 'abate', 'abbey', 'abbot', 'abeam', 'abets']

QUESTION 5: Generate a list called sameStartEnd of all words that start and end with the same letter. There are 4450 such words, e.g., yearly, scripts, etc.

In [13]:
# Your code here
sameStartEnd = []
for word in englishwords:
    if word[0] == word[-1]:
        sameStartEnd.append(word)

# With a list comprehension
sameStartEnd = [word for word in englishwords if word[0] == word[-1]]
In [14]:
print(len(sameStartEnd))
print(sameStartEnd[:10])
4450
['a', 'aa', 'abracadabra', 'acacia', 'addenda', 'africa', 'agenda', 'agora', 'agoraphobia', 'aha']

2. English words length distribution

We can compute the distribution of word lengths in two steps:

In [15]:
# STEP 1 - create a list of each word's length
lengthsList = []
for word in englishwords:
    lengthsList.append(len(word))
In [16]:
print(lengthsList[:20])
[1, 2, 3, 5, 5, 6, 8, 5, 7, 9, 10, 11, 8, 5, 6, 9, 6, 5, 7, 7]

Create a dictionary to keep track of the count of each unique word length in lengthsList. The dictionary's keys should be word lengths and the values should be the total number of words in englishwords that have each respective length. Use lengthsList in your solution:

In [17]:
# STEP 2 - create a dictionary to keep track of the count of each unique word length in lengthsList
# Your code here
lengthsDct = {}
for length in lengthsList:
    lengthsDct[length] = lengthsDct.get(length, 0) + 1
In [18]:
lengthsDct
Out[18]:
{1: 26,
 2: 135,
 3: 707,
 5: 4684,
 6: 7366,
 8: 10672,
 7: 10098,
 9: 9663,
 10: 7710,
 11: 5393,
 12: 3491,
 13: 1996,
 4: 2474,
 14: 987,
 15: 459,
 16: 218,
 17: 99,
 18: 34,
 19: 12,
 20: 3,
 21: 2,
 22: 1}

Note: The keys of the dictionary are displayed in the order in which they were inserted. For example, words of length 4 took a while to occur in the list of words, which is why that key is not shown right after the key 3.

Alternative and more efficient solution
The previous solution makes two passes: one over the words to build lengthsList and one over lengthsList to count. With a big list this costs time, and the additional list of lengths costs memory.

This is why a solution that does everything in one single loop is more efficient:

In [19]:
lengthsDct2 = {}
for word in englishwords:
    length = len(word)
    lengthsDct2[length] = lengthsDct2.get(length, 0) + 1

# Check that we get the same result
print(lengthsDct2 == lengthsDct)
True
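
As a side note, the standard library's collections.Counter performs the same kind of tallying. The sketch below uses a small made-up sample rather than englishwords, and compares the result with the manual .get pattern used above:

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be englishwords
sampleWords = ['a', 'aa', 'aas', 'abaci', 'aback', 'abacus']

# Counter tallies each length in a single pass over the generator
lengthCounts = Counter(len(word) for word in sampleWords)
print(lengthCounts)

# The same counts built by hand with the .get pattern
manualCounts = {}
for word in sampleWords:
    manualCounts[len(word)] = manualCounts.get(len(word), 0) + 1

# Convert to a plain dict before comparing
print(dict(lengthCounts) == manualCounts)
```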

3. Bonus: Visualize the frequency distribution

We will not learn about visualization in this class, but if you are interested, talk to one of the instructors to get access to materials about visualization.

In [20]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

# use the function histogram, specify how many bars
plt.hist(lengthsList, bins=len(lengthsDct))

# decorate the plot with labels
plt.title("English words frequency distribution")
plt.ylabel("Counts")
plt.xlabel("Word length")
plt.show()

4. Fun with statistics

Write code to find the following descriptive statistics:

  • average value (or mean),
  • the median,
  • the mode,
  • the variance, and
  • the standard deviation

for the list lengthsList.

Built-in function sum

In [21]:
sum(range(10))
Out[21]:
45

This function takes an iterable of numbers (such as a list or a range) and returns their sum. It makes your programming life easier.

In [22]:
# the average (or mean): the sum of all elements divided by total number

mean = sum(lengthsList)/len(lengthsList)
print(mean)
8.36568020534501
In [23]:
# the median: the middle value in a sorted list. 

# Note: The .sort method on list will mutate the list to make the elements be in sorted order.
# The sorted function returns a new list with the elements in sorted order. 
# Both of these can take an optional 'key' keyword parameter that specifies how the 
# elements should be sorted. 

lengthsList.sort() # sort the list - this mutates the list in-place
middleIndex = len(lengthsList)//2 # find the index in the middle of the sorted list
lengthsList[middleIndex] 
Out[23]:
8
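
One caveat: when a list has an even number of elements, the median is conventionally the mean of the two middle values. Taking the element at index len//2 picks the upper of the two, which gives the same answer whenever the two middle values are equal. A minimal sketch that handles both cases and does not mutate its input (the function name median and the sample data are illustrative, not part of the lab):

```python
import statistics

def median(values):
    """Return the median without mutating the input list."""
    ordered = sorted(values)          # sorted() returns a new list
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # odd count: single middle value
    # even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

sampleLengths = [5, 1, 3, 2, 8, 6]    # hypothetical sample data
print(median(sampleLengths))
print(statistics.median(sampleLengths))   # standard library equivalent
```
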
In [24]:
# the mode: the word length that is most frequent (sort the items of the freq dict)

def sortingOrder(item):
    """Helper function: for a tuple return the element at index 1."""
    return item[1]

sortedPairs = sorted(lengthsDct.items(), 
                     key=sortingOrder, 
                     reverse=True)
sortedPairs
Out[24]:
[(8, 10672),
 (7, 10098),
 (9, 9663),
 (10, 7710),
 (6, 7366),
 (11, 5393),
 (5, 4684),
 (12, 3491),
 (4, 2474),
 (13, 1996),
 (14, 987),
 (3, 707),
 (15, 459),
 (16, 218),
 (2, 135),
 (17, 99),
 (18, 34),
 (1, 26),
 (19, 12),
 (20, 3),
 (21, 2),
 (22, 1)]

Note: Let's look at the top five items sorted above:

[(8, 10672),
 (7, 10098),
 (9, 9663),
 (10, 7710),
 (6, 7366)]

Notice that the (word_length, frequency) pairs are sorted by the second item of the tuple (index 1), in reverse order, from the largest to the smallest. The mode is therefore the first element of the first pair: 8.
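
If only the mode is needed, sorting every item is more work than necessary. One possible shortcut, sketched here on a small excerpt of the counts shown earlier, is to let max compare the keys by their associated values:

```python
# Excerpt of the frequency dictionary: word length -> count
sampleFreq = {1: 26, 2: 135, 3: 707, 8: 10672, 7: 10098}

# max iterates over the keys and compares them by their associated count
mode = max(sampleFreq, key=sampleFreq.get)
print(mode, sampleFreq[mode])
```

If several lengths tie for the highest count, max returns the first one it encounters.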

In [25]:
# variance: the mean of the squared differences between each item and the mean
# (computed here with a list comprehension).

variance = sum([(value - mean)**2 for value in lengthsList]) / len(lengthsList)
variance
Out[25]:
6.294165651611816
In [26]:
# the standard deviation: the square root of the variance. Use math.sqrt

import math
stdev = math.sqrt(variance)
stdev
    
Out[26]:
2.50881758037762
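
As a cross-check, the standard library's statistics module offers the same quantities. The formula above divides by the number of items, which corresponds to the population variance (pvariance and pstdev). The sketch below uses only the first ten lengths shown earlier, so its numbers differ from the full-list results:

```python
import math
import statistics

sampleLengths = [1, 2, 3, 5, 5, 6, 8, 5, 7, 9]   # first ten lengths shown earlier

mean = sum(sampleLengths) / len(sampleLengths)
variance = sum((value - mean) ** 2 for value in sampleLengths) / len(sampleLengths)
stdev = math.sqrt(variance)

# pvariance/pstdev divide by N (population), matching the formula above;
# statistics.variance/stdev divide by N - 1 (sample) and would differ.
print(math.isclose(variance, statistics.pvariance(sampleLengths)))
print(math.isclose(stdev, statistics.pstdev(sampleLengths)))
```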

5. Building n-grams

n-grams are sequences of consecutive tokens: one (unigram), two (bigram), or up to n tokens. These tokens can be characters or words. n-grams are widely used in natural language processing and computational linguistics; for example, n-gram models are useful for predictive text. See the Wikipedia article on n-grams.

In [27]:
word = 'boston'
print(list(word))
['b', 'o', 's', 't', 'o', 'n']

Explore the zip function:

In [28]:
zip(word, word)
Out[28]:
<zip at 0x7fcd483a1320>

As with other objects we have seen, calling zip does not return a list; it returns a zip object. To see the contents of the zip object, pass it to list(). You will see that the zip object yields tuples, each pairing up the elements found at the same position in the inputs.

In [29]:
zipObject = zip(word, word)
list(zipObject)
Out[29]:
[('b', 'b'), ('o', 'o'), ('s', 's'), ('t', 't'), ('o', 'o'), ('n', 'n')]

zip also accepts iterables of unequal length, in which case it stops as soon as the shortest one is exhausted. Observe the output below.

In [30]:
list(zip(word, word[1:]))
Out[30]:
[('b', 'o'), ('o', 's'), ('s', 't'), ('t', 'o'), ('o', 'n')]
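
To make the stopping behavior explicit, the sketch below compares zip with itertools.zip_longest, which pads the shorter input instead of stopping. The fill value '_' is an arbitrary choice for illustration:

```python
from itertools import zip_longest

word = 'boston'

# zip stops as soon as the shorter input (word[1:]) runs out
pairs = list(zip(word, word[1:]))
print(len(word), len(word[1:]), len(pairs))

# zip_longest keeps going and pads the shorter input with fillvalue
padded = list(zip_longest(word, word[1:], fillvalue='_'))
print(padded[-1])
```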

Explore the join method for a sequence of strings:

In [31]:
"".join(('b', 'o'))
Out[31]:
'bo'
In [32]:
for pair in zip(word, word[1:]):
    print("".join(pair))
bo
os
st
to
on

Let's write functions that create unigrams, bigrams, and trigrams:

In [33]:
def unigrams(word):
    """Returns a list of unigrams for the given word."""
    return list(word)

def bigrams(word):
    """Returns a list of bigrams for the given word."""
    grams = []
    for pair in zip(word, word[1:]):
        grams.append("".join(pair))
    return grams

def trigrams(word):
    """Returns a list of trigrams for the given word."""
    grams = []
    for triple in zip(word, word[1:], word[2:]):
        grams.append("".join(triple))
    return grams

Try them out:

In [34]:
word = 'boston'
print(unigrams(word))
print(bigrams(word))
print(trigrams(word))
['b', 'o', 's', 't', 'o', 'n']
['bo', 'os', 'st', 'to', 'on']
['bos', 'ost', 'sto', 'ton']
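
The three functions follow the same pattern, so they can be generalized. The function below (the name ngrams is hypothetical and not part of the lab) is one possible sketch that uses slicing instead of zip:

```python
def ngrams(word, n):
    """Return the list of character n-grams of word (n >= 1)."""
    # A word of length L has L - n + 1 n-grams; range() is empty when n > L
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams('boston', 1))
print(ngrams('boston', 2))
print(ngrams('boston', 3))
print(ngrams('bo', 3))      # word shorter than n: no n-grams
```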

6. Create the bigram distribution

Suppose we have several words: 'boston', 'bogota', 'bolivia', 'bosphorus', 'bottomly'.
How many times does each bigram occur?

In [35]:
for word in ['boston', 'bogota', 'bolivia', 'bosphorus', 'bottomly']:
    print(bigrams(word))
['bo', 'os', 'st', 'to', 'on']
['bo', 'og', 'go', 'ot', 'ta']
['bo', 'ol', 'li', 'iv', 'vi', 'ia']
['bo', 'os', 'sp', 'ph', 'ho', 'or', 'ru', 'us']
['bo', 'ot', 'tt', 'to', 'om', 'ml', 'ly']

We can use the .extend() method to create one list of all the bigrams:

In [36]:
allBigrams = []
for word in ['boston', 'boson', 'bosnia']:
    allBigrams.extend(bigrams(word)) # list1.extend(list2) mutates list1 by adding all the 
                                     # elements in list2 to list1. list2 remains unchanged. 
    
print(allBigrams)
print(len(allBigrams))
['bo', 'os', 'st', 'to', 'on', 'bo', 'os', 'so', 'on', 'bo', 'os', 'sn', 'ni', 'ia']
14
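
To see why .extend() is used here rather than .append(), compare the two on small lists (the variable names are illustrative):

```python
first = ['bo', 'os']
second = ['st', 'to']

appended = list(first)      # copy so the demo does not mutate first
appended.append(second)     # adds the whole list as ONE element
print(appended)             # ['bo', 'os', ['st', 'to']]

extended = list(first)
extended.extend(second)     # adds each element of second individually
print(extended)             # ['bo', 'os', 'st', 'to']
```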

Let's see how to create the frequency distribution for a small number of words and watch the dictionary grow:

In [37]:
fewWords = ['boston', 'boson', 'bosnia']

bigramsDct = {}              # accumulator dictionary

for word in fewWords:
    print("working with the word '{}'".format(word))
    bigramsList = bigrams(word) 
    
    for bigram in bigramsList:
        
        bigramsDct[bigram] = bigramsDct.get(bigram, 0) + 1
        print(bigram, bigramsDct)
working with the word 'boston'
bo {'bo': 1}
os {'bo': 1, 'os': 1}
st {'bo': 1, 'os': 1, 'st': 1}
to {'bo': 1, 'os': 1, 'st': 1, 'to': 1}
on {'bo': 1, 'os': 1, 'st': 1, 'to': 1, 'on': 1}
working with the word 'boson'
bo {'bo': 2, 'os': 1, 'st': 1, 'to': 1, 'on': 1}
os {'bo': 2, 'os': 2, 'st': 1, 'to': 1, 'on': 1}
so {'bo': 2, 'os': 2, 'st': 1, 'to': 1, 'on': 1, 'so': 1}
on {'bo': 2, 'os': 2, 'st': 1, 'to': 1, 'on': 2, 'so': 1}
working with the word 'bosnia'
bo {'bo': 3, 'os': 2, 'st': 1, 'to': 1, 'on': 2, 'so': 1}
os {'bo': 3, 'os': 3, 'st': 1, 'to': 1, 'on': 2, 'so': 1}
sn {'bo': 3, 'os': 3, 'st': 1, 'to': 1, 'on': 2, 'so': 1, 'sn': 1}
ni {'bo': 3, 'os': 3, 'st': 1, 'to': 1, 'on': 2, 'so': 1, 'sn': 1, 'ni': 1}
ia {'bo': 3, 'os': 3, 'st': 1, 'to': 1, 'on': 2, 'so': 1, 'sn': 1, 'ni': 1, 'ia': 1}

Now that we understand how this works for a few words, we can go and create a function that will work with all English words:

In [38]:
def createBigramFrequency():
    """Create and return the bigram frequency distribution of 
    all words in 'englishwords'.
    """
    bigramsDct = {}              # accumulator dictionary

    for word in englishwords:
        bigramsList = bigrams(word) # create ngrams as a list

        # add new bigrams or update counts of existing ones
        for bigram in bigramsList:
            bigramsDct[bigram] = bigramsDct.get(bigram, 0) + 1

    return bigramsDct
In [39]:
bigramsDct = createBigramFrequency()
print(len(bigramsDct))
568
In [40]:
print(list(bigramsDct.items())[:10])
[('aa', 19), ('as', 2607), ('ab', 1665), ('ba', 1431), ('ac', 2387), ('ci', 1628), ('ck', 1730), ('cu', 1151), ('us', 2623), ('se', 3442)]

7. Creating more n-gram distributions

The process of adding to or updating a frequency distribution is always the same, so we can capture it in a function:

In [41]:
def storeNgrams(ngramsList, ngramsDict):
    """Given a list of items and a dictionary, 
    update the counts of the dictionary keys.
    """
    for ngram in ngramsList:
        ngramsDict[ngram] = ngramsDict.get(ngram, 0) + 1

Now that we have this function, we can create all n-grams at the same time:

In [42]:
unigramsDct = {}
bigramsDct = {}
trigramsDct = {}

for word in englishwords:
    # create ngrams
    ngrams1 = unigrams(word)
    ngrams2 = bigrams(word)
    ngrams3 = trigrams(word)

    # store ngrams in freq dict
    storeNgrams(ngrams1, unigramsDct)
    storeNgrams(ngrams2, bigramsDct)
    storeNgrams(ngrams3, trigramsDct)

print(len(unigramsDct))
print(len(bigramsDct))
print(len(trigramsDct))
26
568
5864

Since unigramsDct has the smallest number of keys (equal to the number of letters in the English alphabet), let's print it out:

In [43]:
unigramsDct
Out[43]:
{'a': 41799,
 's': 50785,
 'b': 10584,
 'c': 22556,
 'i': 48339,
 'k': 5076,
 'u': 18442,
 'e': 63258,
 'f': 7842,
 't': 38175,
 'n': 38837,
 'd': 21161,
 'o': 33261,
 'g': 16665,
 'm': 14717,
 'h': 12377,
 'r': 39292,
 'y': 9063,
 'v': 5628,
 'l': 29310,
 'j': 1069,
 'z': 2118,
 'p': 16195,
 'w': 4939,
 'q': 1086,
 'x': 1485}

If we think of bigrams and trigrams as all possible combinations of alphabet letters in groups of 2 and 3, we should expect 26x26 = 676 bigrams and 26x26x26 = 17576 trigrams. But we see fewer than these expected numbers.

This is because certain combinations of letters never occur in English, for example: 'bbb', or 'cbz', etc.
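
The gap can be quantified with a little arithmetic. This sketch plugs in the counts printed above (568 bigrams and 5864 trigrams) and reports what share of the possible combinations actually occurs, roughly 84% of bigrams and 33% of trigrams:

```python
ALPHABET_SIZE = 26

# Counts reported by the cell above
observedBigrams = 568
observedTrigrams = 5864

possibleBigrams = ALPHABET_SIZE ** 2     # 676
possibleTrigrams = ALPHABET_SIZE ** 3    # 17576

print("bigram coverage:  {:.1%}".format(observedBigrams / possibleBigrams))
print("trigram coverage: {:.1%}".format(observedTrigrams / possibleTrigrams))
```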

8. Optional: Sort the n-gram distributions

We want to know the top 5 n-grams of each distribution.

If we naively use sorted with a dictionary, it will sort it alphabetically by the keys:

In [44]:
myDct = {'are': 10, 'zap': 2, 'bla': 14, 'ten': 8}
sorted(myDct)
Out[44]:
['are', 'bla', 'ten', 'zap']

Remember the method items:

In [45]:
myDct.items()
Out[45]:
dict_items([('are', 10), ('zap', 2), ('bla', 14), ('ten', 8)])

Sort the list of items of the dictionary by value in descending order, so that the most frequent n-grams come first. When iterated over, the .items() method yields tuples whose first element is the key and whose second element is the value.

In [46]:
def sortItemsInFreqDict(freqDict):
    """Takes a dictionary and returns a list of items sorted by the value,
    in descending order. Uses helper function 'sortingOrder'.
    """
    return sorted(freqDict.items(), key=sortingOrder, reverse=True)
In [47]:
# show the top 5 unigrams
sortItemsInFreqDict(unigramsDct)[:5]
Out[47]:
[('e', 63258), ('s', 50785), ('i', 48339), ('a', 41799), ('r', 39292)]
In [48]:
# show the top 5 bigrams
sortItemsInFreqDict(bigramsDct)[:5]
Out[48]:
[('in', 12888), ('es', 11797), ('er', 11017), ('ti', 8266), ('ng', 7998)]
In [49]:
# show the top 5 trigrams
sortItemsInFreqDict(trigramsDct)[:5]
Out[49]:
[('ing', 7036), ('ion', 2963), ('ati', 2812), ('tio', 2489), ('ate', 2459)]
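
For comparison, the standard library offers two shortcuts for this kind of ranking: operator.itemgetter(1) can stand in for a hand-written key helper, and collections.Counter has a most_common method. A small sketch on the toy dictionary from above:

```python
from collections import Counter
from operator import itemgetter

myDct = {'are': 10, 'zap': 2, 'bla': 14, 'ten': 8}

# itemgetter(1) builds a function that returns the element at index 1
byValue = sorted(myDct.items(), key=itemgetter(1), reverse=True)
print(byValue[:2])

# Counter.most_common(n) returns the n pairs with the largest counts
topTwo = Counter(myDct).most_common(2)
print(topTwo)
```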

9. Optional: Dictionary Comprehension

This is very similar to list comprehension, except that it creates a dictionary of key/value pairs.

In [50]:
wordsLst = 'the autumn is dragging its feet'.split()
wordsLst
Out[50]:
['the', 'autumn', 'is', 'dragging', 'its', 'feet']
In [51]:
# the mapping pattern
{word: len(word) for word in wordsLst}
Out[51]:
{'the': 3, 'autumn': 6, 'is': 2, 'dragging': 8, 'its': 3, 'feet': 4}
In [52]:
# the filtering pattern
{word: len(word) for word in wordsLst if word[0] in 'aeiou'}
Out[52]:
{'autumn': 6, 'is': 2, 'its': 3}

YOUR TURN: Create a dictionary that maps ODD numbers to their square values like the one shown below:

{1: 1, 3: 9, 5: 25, 7: 49, 9: 81, 11: 121}

In [53]:
# your code goes here

{num: num**2 for num in range(1, 12) if num%2==1}
Out[53]:
{1: 1, 3: 9, 5: 25, 7: 49, 9: 81, 11: 121}

10. A dictionary of dictionaries

Let's group the bigrams by unigrams:

{'a': {'aa': 19,
       'ab': 1665,
       'ac': 2387,
       'ad': 1685,
       ...},

 'b': {'ba': 1431,
       'bb': 417,
       'bc': 25,
       'bd': 35,
       ...},
  ...
}
In [54]:
# ascii_lowercase is a string constant defined in the module string.  It is simply 'abcdefghijklmnopqrstuvwxyz'.
# The keyword as is used to give ascii_lowercase the simpler name of lowercase.

from string import ascii_lowercase as lowercase 

ngramsByFirstLetter = {char: {} for char in lowercase}

print(ngramsByFirstLetter)

for word in englishwords:
    ngrams2 = bigrams(word)
    for ngram in ngrams2:
        firstLetter = ngram[0]
        letterDict = ngramsByFirstLetter[firstLetter]
        letterDict[ngram] = letterDict.get(ngram, 0) + 1
        
print(len(ngramsByFirstLetter))
{'a': {}, 'b': {}, 'c': {}, 'd': {}, 'e': {}, 'f': {}, 'g': {}, 'h': {}, 'i': {}, 'j': {}, 'k': {}, 'l': {}, 'm': {}, 'n': {}, 'o': {}, 'p': {}, 'q': {}, 'r': {}, 's': {}, 't': {}, 'u': {}, 'v': {}, 'w': {}, 'x': {}, 'y': {}, 'z': {}}
26
In [55]:
ngramsByFirstLetter['q']
Out[55]:
{'qu': 1069, 'qi': 2, 'qa': 3, 'qe': 1, 'qq': 1, 'qr': 1, 'qt': 2}

Print out how many bigrams are associated with each alphabet letter. We saw above that the letter 'q' has only 7 bigrams.

In [56]:
# your code here

{letter: len(ngramsByFirstLetter[letter]) for letter in ngramsByFirstLetter}
Out[56]:
{'a': 26,
 'b': 23,
 'c': 22,
 'd': 25,
 'e': 26,
 'f': 21,
 'g': 23,
 'h': 24,
 'i': 26,
 'j': 8,
 'k': 23,
 'l': 24,
 'm': 24,
 'n': 26,
 'o': 26,
 'p': 24,
 'q': 7,
 'r': 26,
 's': 24,
 't': 24,
 'u': 26,
 'v': 15,
 'w': 22,
 'x': 19,
 'y': 23,
 'z': 11}

CHALLENGE YOURSELF: Can you organize the trigrams by bigrams? That is, repeat what we did for unigram/bigrams for bigram/trigrams.

In [57]:
# your code here

biAndTriDct = {}

for word in englishwords:
    ngrams3 = trigrams(word) # creates trigrams such as 'bos', 'ost', 'sto'
    for ngram in ngrams3:
        bigram = ngram[:2] # get first letters, 'bo', 'os', 'st'
        trigramDct = biAndTriDct.get(bigram, {}) # get the dict we want to update
        trigramDct[ngram] = trigramDct.get(ngram, 0) + 1 # update the dict with counts
        biAndTriDct[bigram] = trigramDct # assign it to the key, because the first time there is nothing
In [58]:
print("length of dict:", len(biAndTriDct))
length of dict: 555
In [59]:
biAndTriDct['qu']
Out[59]:
{'qua': 314, 'qui': 371, 'que': 341, 'quy': 3, 'quo': 39}
In [60]:
biAndTriDct['pr']
Out[60]:
{'pra': 122, 'pre': 795, 'pri': 354, 'pro': 882, 'pru': 40, 'pry': 8}
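
The get-then-reassign pattern above can also be written with collections.defaultdict, which creates the inner dictionary automatically the first time a key is accessed. The sketch below is an alternative, not the lab's intended solution; it redefines trigrams so that it runs on its own and uses a small sample instead of englishwords:

```python
from collections import defaultdict

def trigrams(word):
    """Same idea as the trigrams function defined earlier."""
    return ["".join(t) for t in zip(word, word[1:], word[2:])]

sampleWords = ['boston', 'boson', 'bosnia']   # small sample for illustration

# Missing bigram keys get an empty inner dict automatically
biAndTri = defaultdict(dict)
for word in sampleWords:
    for ngram in trigrams(word):
        inner = biAndTri[ngram[:2]]               # inner dict for this bigram
        inner[ngram] = inner.get(ngram, 0) + 1    # update the trigram count

print(biAndTri['bo'])
print(biAndTri['os'])
```

Note that merely looking up a missing key in a defaultdict inserts it, so membership should be tested with the in operator rather than by indexing.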

11. Challenge Exercise: A dictionary of lists

We want to group all words by their last two letters (one-letter words are grouped under that single letter), for example:

{'ed': ['abandoned',
        'abased',
        'abashed',
        'abated',
        ...],

 'ly': ['abjectly',
        'ably',
        'abnormally',
        'abominably',
        ...],
 ...
}
In [66]:
wordsByEnding = {}

# Accumulate the proper dictionary for wordsByEnding
# Your code here
for word in englishwords:
    end = word[-2:] # could potentially have single letters 
    wordsByEnding[end] = wordsByEnding.get(end, [])
    wordsByEnding[end].append(word)
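
The two-line get-and-assign step can be condensed with the dictionary method setdefault, which returns the existing value for a key or stores and returns the given default. A sketch on a small made-up sample:

```python
sampleWords = ['abandoned', 'abased', 'ably', 'a', 'sky']   # hypothetical sample

wordsByEnding = {}
for word in sampleWords:
    # setdefault returns the existing list, or stores and returns a new empty one
    wordsByEnding.setdefault(word[-2:], []).append(word)

print(wordsByEnding)
```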

Check what endings were generated:

In [67]:
wordsByEnding.keys()
Out[67]:
dict_keys(['a', 'aa', 'as', 'ci', 'ck', 'us', 'es', 'ft', 'on', 'ed', 'ng', 'nt', 'ns', 'se', 'sh', 'te', 'ir', 'rs', 'ss', 'ey', 'ys', 'ot', 'ts', 'bc', 'cs', 'en', 'al', 'ct', 'am', 'et', 'or', 'ce', 'de', 'ty', 'ly', 're', 've', 'ut', 'ze', 'le', 'er', 'st', 'bo', 'rd', 'ls', 'ne', 'rt', 'os', 'nd', 'ds', 'ra', 'ge', 'ad', 'pt', 'ee', 'sm', 'th', 'rb', 'bs', 'in', 'ia', 'ic', 'an', 'my', 'do', 'ry', 'im', 'ms', 'ny', 'cy', 'it', 'ue', 'om', 'yl', 'he', 'oo', 'id', 'fy', 'me', 'rn', 'ip', 'ps', 'at', 'ym', 'is', 'io', 'dc', 'dd', 'da', 'um', 'ph', 'hs', 'eu', 'ux', 'dj', 'ix', 'be', 'lt', 'od', 'dv', 'ol', 'ex', 'gy', 'ar', 'rm', 'ay', 'ni', 'ld', 'ht', 'ul', 'ca', 'ro', 'mp', 'ow', 'fe', 'pe', 'il', 'go', 'og', 'ae', 'ah', 'ha', 'em', 'oy', 'ke', 'ew', 'ws', 'op', 'un', 'gs', 'ks', 'hy', 'ch', 'lb', 'no', 'fa', 'co', 'ga', 'hm', 'bi', 'gn', 'li', 'll', 'to', 'ac', 'oe', 'of', 'ud', 'lp', 'dy', 'so', 'gh', 'na', 'ur', 'mo', 'ba', 'ok', 'ta', 'ma', 'vy', 'el', 'la', 'xe', 'ak', 'ax', 'ua', 'nk', 'if', 'fs', 'mb', 'xy', 'sy', 'gm', 'ox', 'ab', 'rc', 'ea', 'rk', 'lo', 'sk', 'ep', 'sp', 'ai', 'uk', 'ie', 'ky', 'hn', 'mn', 'we', 'wl', 'tl', 'ye', 'b', 'bu', 'by', 'lk', 'up', 'ag', 'ff', 'va', 'ka', 'lm', 'sa', 'au', 'hi', 'jo', 'tu', 'ub', 'ik', 'wd', 'ou', 'ug', 'ef', 'eg', 'lf', 'rg', 'ri', 'wn', 'xt', 'ib', 'ig', 'wy', 'iz', 'tz', 'ob', 'oc', 'zy', 'oa', 'oh', 'za', 'oi', 'ki', 'tt', 'br', 'vo', 'ln', 'nx', 'ei', 'hl', 'mf', 'py', 'ap', 'rp', 'rr', 'di', 'uy', 'zz', 'aw', 'c', 'ao', 'ti', 'yx', 'eo', 'rl', 'pi', 'pa', 'rh', 'cc', 'nc', 'ek', 'af', 'cm', 'eb', 'si', 'po', 'nj', 'uo', 'gi', 'pu', 'cp', 'wm', 'wt', 'd', 'dt', 'bt', 'ko', 'rv', 'dl', 'sc', 'iy', 'nn', 'dr', 'mt', 'rf', 'e', 'bb', 'cg', 'ho', 'ec', 'gg', 'eh', 'kg', 'yo', 'mu', 'ui', 'f', 'ez', 'hu', 'ji', 'lu', 'fm', 'fo', 'fr', 'g', 'wk', 'wp', 'ii', 'hq', 'lc', 'np', 'nu', 'gp', 'ru', 'uv', 'vs', 'yp', 'h', 'ku', 'du', 'mi', 'hp', 'hr', 'uh', 'hz', 'i', 'bm', 'zi', 'zo', 'iq', 'aq', 'qi', 'tn', 'tv', 
'j', 'nr', 'jp', 'jr', 'su', 'ju', 'k', 'ya', 'wi', 'je', 'kt', 'kw', 'l', 'kh', 'av', 'bw', 'sd', 'm', 'yr', 'rx', 'xi', 'mc', 'md', 'mg', 'mm', 'ov', 'pg', 'mr', 'n', 'fi', 'nb', 'nw', 'o', 'ja', 'oz', 'p', 'hd', 'pm', 'uf', 'pp', 'pr', 'vc', 'px', 'q', 'qq', 'qr', 'qt', 'qu', 'r', 'aj', 'cd', 'td', 'ev', 'vp', 's', 'wa', 'sf', 'gt', 'hh', 'sj', 'iv', 'sq', 'sr', 'sw', 't', 'tb', 'fu', 'az', 'wo', 'u', 'hf', 'v', 'vd', 'w', 'wc', 'x', 'y', 'yd', 'z'])
In [68]:
wordsByEnding['ky']
Out[68]:
['autarky',
 'bosky',
 'bulky',
 'chalky',
 'cheeky',
 'choky',
 'chunky',
 'cocky',
 'colicky',
 'cranky',
 'creaky',
 'darky',
 'dicky',
 'dinky',
 'droshky',
 'ducky',
 'dusky',
 'finicky',
 'flaky',
 'fluky',
 'freaky',
 'frisky',
 'funky',
 'gawky',
 'gimmicky',
 'hanky',
 'honky',
 'hooky',
 'husky',
 'inky',
 'jerky',
 'kinky',
 'lanky',
 'leaky',
 'lucky',
 'milky',
 'mucky',
 'murky',
 'musky',
 'narky',
 'panicky',
 'parky',
 'pawky',
 'peaky',
 'perky',
 'pesky',
 'picky',
 'plucky',
 'poky',
 'porky',
 'rheumaticky',
 'risky',
 'rocky',
 'sarky',
 'shaky',
 'silky',
 'sky',
 'slinky',
 'smoky',
 'snaky',
 'sneaky',
 'spiky',
 'spooky',
 'spunky',
 'squeaky',
 'sticky',
 'stocky',
 'streaky',
 'sulky',
 'swanky',
 'tacky',
 'tricky',
 'trotsky',
 'wacky',
 'whisky',
 'wonky']

12. Challenge Exercise: group n-length words by first letter

Write a function organizeWordsByLength that, given a list of words and a word length n, returns a dictionary of lists keyed by lowercase letters. Each list contains all words of length n that start with that letter, for example:

{'a': ['abcs',
       'abed',
       'abet',
       'able',
       ...],
 'b': ['baas',
       'babe',
       'babu',
       'baby',
       ...],
 ...
 }

The function should work for any length.

Example of calling the function:

In []: fourLetterWords = organizeWordsByLength(englishwords, 4)
In []: fourLetterWords['q']
Out []: ['quad', 'quay', 'quid', 'quin', 'quip', 'quit', 'quiz', 'quod']
In [69]:
from string import ascii_lowercase as lowercase


def organizeWordsByLength(wordsList, length):
    
    # Your code here

    # Step 1: We start by creating an empty dictionary of lists
    # that will have as keys all the letters of the alphabet.
    wordsDct = {char: [] for char in lowercase}

    # Step 2: We iterate through the list of words and check their length
    # words that fit the criteria are appended to the list corresponding
    # to their first letter. 
    for word in wordsList:
        if len(word) == length:
            firstLetter = word[0]
            wordsDct[firstLetter].append(word)
            
    return wordsDct
In [70]:
fourLetterWords = organizeWordsByLength(englishwords, 4)

fourLetterWords['q']
Out[70]:
['quad', 'quay', 'quid', 'quin', 'quip', 'quit', 'quiz', 'quod']
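
The solution above assumes every word starts with a lowercase letter; a word such as 'Quiz' or '4321' would raise a KeyError. If the input might contain such words, one possible variant (the name organizeWordsByLengthSafe and the sample list are hypothetical) skips them:

```python
from string import ascii_lowercase as lowercase

def organizeWordsByLengthSafe(wordsList, length):
    """Hypothetical variant that skips words whose first character
    is not a lowercase ASCII letter instead of raising KeyError."""
    wordsDct = {char: [] for char in lowercase}
    for word in wordsList:
        # word[:1] is '' for an empty string, so no IndexError is possible
        if len(word) == length and word[:1] in wordsDct:
            wordsDct[word[0]].append(word)
    return wordsDct

sample = ['quad', 'quay', 'Quiz', '4321', 'quit', 'queen']
result = organizeWordsByLengthSafe(sample, 4)
print(result['q'])
```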