CS 111 Lecture: Accumulation Pattern with Dicts, Lists, and JSON files

Table of Contents

1. Review: Accumulation across types

1.1 Accumulating with an integer
1.2 Accumulating with a string
1.3 Accumulating with a list
1.4 Accumulating with a dict
1.5 Exercise: Count file extensions
2. Accumulation with dicts and lists
2.1 Group names by their length
2.2 Exercise: Group files by their extensions
2.3 Group names by the vowels they contain
2.4 Dictionary comprehension
2.5 Challenge: Rewrite the solution of 2.3 using vowelsDct as the starting point
3. The JSON format
3.1 Load a JSON object
3.2 Dump a JSON object into a file
3.3 Exercise: From text to JSON
4. Challenge Problem: Work with JSON files
4.1 Convert a tweet into a list of hashtags
4.2 Write a new JSON file with the tweets transformed as hashtag lists
4.3 Create a frequency dictionary of all hashtags
4.4 Save the sorted hashtags into a JSON file

1. Review: Accumulation across types

In the following, we summarize accumulation problems that we have seen throughout the semester.

1.1 Accumulating with an integer

This is an example that we have seen often:

def isVowel(char):
    """Predicate function that returns true/false."""
    return char.lower() in list('aeiou')

# Code snippet to count vowels in a string
phrase = "the sun is shining"
count = 0 # accumulator variable
for letter in phrase:
    if isVowel(letter):
        count += 1
        
print(f"'{phrase}' has {count} vowels.")

'the sun is shining' has 5 vowels.
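
The same accumulation can also be written as a single expression. The following is a minimal sketch, not a replacement for the loop above: it redefines isVowel so that it runs by itself, and the variable name vowelCount is only illustrative.

```python
def isVowel(char):
    """Predicate function: returns True if char is a vowel, False otherwise."""
    return char.lower() in list('aeiou')

phrase = "the sun is shining"

# sum() does the accumulation: it adds 1 for every letter that is a vowel
vowelCount = sum(1 for letter in phrase if isVowel(letter))

print(f"'{phrase}' has {vowelCount} vowels.")
```

The explicit loop is still the better choice when the body needs to do more than add one.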

1.2 Accumulating with a string

Create a new string based on certain characteristics of a given one:

# Code snippet to encrypt a phrase
phrase = "the sun is shining"
newPhrase = ''
for letter in phrase:
    if isVowel(letter):
        newPhrase += '*'
    elif letter == ' ':
        newPhrase += ' '
    else:
        newPhrase += '_'

print(newPhrase)

__* _*_ *_ __*_*__
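
Repeated += on a string works well for short phrases. A common alternative, sketched below, is to collect the pieces in a list and join them once at the end. The helper maskLetter and the variable maskedPhrase are hypothetical names introduced only for this illustration.

```python
def isVowel(char):
    """Predicate function: returns True if char is a vowel, False otherwise."""
    return char.lower() in list('aeiou')

def maskLetter(letter):
    """Hypothetical helper: map one character to its masked form."""
    if isVowel(letter):
        return '*'
    elif letter == ' ':
        return ' '
    return '_'

phrase = "the sun is shining"

# Build the list of masked characters, then glue them into one string
maskedPhrase = ''.join([maskLetter(letter) for letter in phrase])
print(maskedPhrase)
```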

1.3 Accumulating with a list

This is also an example that we have seen often:

# Code snippet to find odd numbers in a list
numList = [-35, -16, -3, 0, 1, 5, 8, 11, 18, 25]
oddList = []
for num in numList:
    if num % 2 == 1:
        oddList.append(num)
        
print(oddList)

[-35, -3, 1, 5, 11, 25]

As you know, this can also be written in a more concise form with list comprehension:

oddList = [num for num in numList if num % 2 == 1]
oddList

[-35, -3, 1, 5, 11, 25]

1.4 Accumulating with a dict

Instead of accumulating only one variable (for example, the number of vowels in a phrase), we can accumulate multiple things at once using a dictionary. For example, we can count the frequency of all letters in a string.

# count all letters in a string
word = "abracadabra"
lettersDict = {} # accumulator variable
for letter in word:
    if letter not in lettersDict:
        lettersDict[letter] = 1
    else:
        lettersDict[letter] += 1
        
print(lettersDict)

{'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
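
The if/else test for a missing key can be shortened in two ways. The sketch below shows the dict method get, which returns a default value when the key is absent, and the Counter class from the standard collections module, which performs the whole accumulation. The variable name lettersCounter is an assumption made for this example.

```python
from collections import Counter

word = "abracadabra"

# Variant 1: get() returns 0 when the letter has not been seen yet
lettersDict = {}  # accumulator variable
for letter in word:
    lettersDict[letter] = lettersDict.get(letter, 0) + 1

# Variant 2: Counter counts every element of the sequence in one call
lettersCounter = Counter(word)

print(lettersDict)
print(lettersCounter.most_common(1))  # the single most frequent letter
```

Both variants produce the same counts as the loop above; which one reads best is a matter of taste.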

1.5 Exercise: Count file extensions

You are given a list of files:

['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py', 
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']

Write code that counts how many files of each type there are. You should expect the following result:

{'txt': 4, 
'csv': 3, 
'py': 2}

filesList = ['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py', 
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']

# Your code here
extensions = {}
for file in filesList:
    ext = file.split('.')[1]
    if ext not in extensions:
        extensions[ext] = 1
    else:
        extensions[ext] += 1
        
extensions

{'txt': 4, 'csv': 3, 'py': 2}
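
The expression file.split('.')[1] works here because every file name contains exactly one dot. For a name such as 'archive.tar.gz' it would return 'tar', and for a name without a dot it would raise an IndexError. The sketch below uses os.path.splitext from the standard library instead. The helper name getExtension and the two extra file names are made up for illustration.

```python
import os

def getExtension(filename):
    """Hypothetical helper: return the text after the last dot, or '' if none."""
    # splitext returns a pair such as ('poem1', '.txt'); [1:] drops the dot
    return os.path.splitext(filename)[1][1:]

print(getExtension('poem1.txt'))
print(getExtension('archive.tar.gz'))
print(getExtension('README'))
```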

2. Accumulation with dicts and lists

With dictionaries we can perform more complex accumulations. This section contains some examples.

2.1 Group names by their length

Given the following list:

names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']

group the elements by length, by writing code that creates the following dictionary:

{4: ['Andy'], 
 7: ['Carolyn'], 
 3: ['Eni', 'Lyn'], 
 5: ['Peter', 'Sohie']
}

names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']

nameLengthDct = {} # accumulator variable

for name in names:
    nameLen = len(name)
    if nameLen not in nameLengthDct:
        nameLengthDct[nameLen] = [name] 
    else:
        nameLengthDct[nameLen].append(name)
    
nameLengthDct

{4: ['Andy'], 7: ['Carolyn'], 3: ['Eni', 'Lyn'], 5: ['Peter', 'Sohie']}
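
The "create the list if the key is new, otherwise append" step is common enough that dictionaries have a method for it. The following sketch rewrites the loop with setdefault; it is an alternative formulation, not the solution used in this notebook.

```python
names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']

nameLengthDct = {}  # accumulator variable

for name in names:
    # setdefault returns the list already stored under this length;
    # if the key is missing, it first stores the empty list and returns it
    nameLengthDct.setdefault(len(name), []).append(name)

print(nameLengthDct)
```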

2.2 Exercise: Group files by their extensions

Using the same list of files as in Section 1.5, let's group files by their extensions, instead of simply counting them. Here is what your code should produce:

{'txt': ['counts1.txt', 'counts2.txt', 'poem1.txt', 'poem2.txt'],
 'csv': ['students.csv', 'cities.csv', 'states.csv'],
 'py': ['optimism.py', 'wordsearch.py']}

filesList = ['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py', 
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']

# Your code here
extensions2 = {}
for file in filesList:
    ext = file.split('.')[1]
    if ext not in extensions2:
        extensions2[ext] = [file]
    else:
        extensions2[ext].append(file)
        
extensions2

{'txt': ['counts1.txt', 'counts2.txt', 'poem1.txt', 'poem2.txt'],
 'csv': ['students.csv', 'cities.csv', 'states.csv'],
 'py': ['optimism.py', 'wordsearch.py']}

2.3 Group names by the vowels they contain

Given our (expanded and modified) list of names:

names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter', 'Sohie', 
'Ada', 'Smaranda', 'Catherine']

how would we go about grouping them according to the vowels they contain? Here is the dictionary we want to create:

{'a': ['Andy', 'Carolyn', 'Eniana', 'Ada', 'Smaranda', 'Catherine'], 
'e': ['Eniana', 'Peter', 'Sohie', 'Catherine'], 
'i': ['Eniana', 'Sohie', 'Catherine'], 
'o': ['Carolyn', 'Sohie']
}

Note: This is not a trivial problem. What makes it hard is that some names (for example, Peter and Ada) contain several occurrences of the same vowel, and we want to avoid listing such a name more than once under that vowel.

names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter', 
         'Sohie', 'Ada', 'Smaranda', 'Catherine']

vowelIndexDct = {} # accumulator variable

# Notice the nested for loops
for name in names:
    for letter in name.lower():
        if isVowel(letter):
            if letter not in vowelIndexDct:
                vowelIndexDct[letter] = [name]
            else:
                # notice that before appending, we first check if the name was added before
                if name not in vowelIndexDct[letter]:
                    vowelIndexDct[letter].append(name)
vowelIndexDct

{'a': ['Andy', 'Carolyn', 'Eniana', 'Ada', 'Smaranda', 'Catherine'],
 'o': ['Carolyn', 'Sohie'],
 'e': ['Eniana', 'Peter', 'Sohie', 'Catherine'],
 'i': ['Eniana', 'Sohie', 'Catherine']}
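
Another way to avoid duplicates is to visit each distinct letter of a name only once, so that no membership check on the list is needed. The sketch below assumes that the names list itself contains no repeated names; sorted is used only to make the order in which keys are created predictable.

```python
names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter',
         'Sohie', 'Ada', 'Smaranda', 'Catherine']

vowelIndexDct = {}  # accumulator variable

for name in names:
    # set() keeps each letter of the name once, so 'Ada' yields 'a' only once
    for letter in sorted(set(name.lower())):
        if letter in list('aeiou'):
            if letter not in vowelIndexDct:
                vowelIndexDct[letter] = []
            vowelIndexDct[letter].append(name)

print(vowelIndexDct)
```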

2.4 Dictionary comprehension

Just as a list comprehension builds a list, a dictionary comprehension builds a dictionary. The syntax is very similar:

{ aKey: aValue for aKey in sequence}

Below are some examples:

# Find the lengths of words in a list

wordList = ['red', 'yellow', 'blue', 'violet', 'black']

lengthsDct = {word: len(word) for word in wordList}
lengthsDct

{'red': 3, 'yellow': 6, 'blue': 4, 'violet': 6, 'black': 5}

# Shorten each day name into its first three letters

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

shortDays = {day: day[:3] for day in days}
shortDays

{'Monday': 'Mon',
 'Tuesday': 'Tue',
 'Wednesday': 'Wed',
 'Thursday': 'Thu',
 'Friday': 'Fri',
 'Saturday': 'Sat',
 'Sunday': 'Sun'}
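
As with list comprehensions, a dictionary comprehension can include a filtering condition. The sketch below also shows what happens when two items produce the same key: the later value replaces the earlier one, which is why grouping problems need the accumulation pattern from Section 2 rather than a plain comprehension. The names longWordsDct and firstLetterDct are illustrative.

```python
wordList = ['red', 'yellow', 'blue', 'violet', 'black']

# Keep only the words with more than four letters
longWordsDct = {word: len(word) for word in wordList if len(word) > 4}
print(longWordsDct)

# 'blue' and 'black' share the key 'b', so 'black' overwrites 'blue'
firstLetterDct = {word[0]: word for word in wordList}
print(firstLetterDct)
```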

A good use for the dict comprehension is to initialize data structures that then can be used for accumulation. For example, the following dict comprehension:

vowelsDct = {vowel: [] for vowel in 'aeiou'}
vowelsDct

{'a': [], 'e': [], 'i': [], 'o': [], 'u': []}

creates a data structure that we can use as the starting point for an alternative solution to the problem in Section 2.3.

2.5 Challenge: Rewrite the solution of 2.3 using `vowelsDct` as the starting point

Try it out. The solution is provided in the slides.

# Your code here
for vowel in vowelsDct:
    for name in names:
        if vowel in name.lower():
            if name not in vowelsDct[vowel]:
                vowelsDct[vowel].append(name)
                
vowelsDct

{'a': ['Andy', 'Carolyn', 'Eniana', 'Ada', 'Smaranda', 'Catherine'],
 'e': ['Eniana', 'Peter', 'Sohie', 'Catherine'],
 'i': ['Eniana', 'Sohie', 'Catherine'],
 'o': ['Carolyn', 'Sohie'],
 'u': []}

Because we started with all five vowels, but no name in the list contains the vowel 'u', its list stays empty and we can remove that key from the dictionary:

vowelsDct.pop('u')

[]

vowelsDct.keys()

dict_keys(['a', 'e', 'i', 'o'])

3. The JSON format

Refer to slides 7 and 8 in the lecture notes for a discussion of why the JSON format is useful. Python's standard library includes the json module, which deals with the complexities of reading and writing JSON objects. We will use two functions from this module, load and dump.

If fileObj is a file object opened for reading or writing, then the syntax for using these two functions is the following:

json.load(fileObj) # read the content of a file into a Python object
json.dump(someData, fileObj) # write the content of a variable into a file
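
The json module also provides loads and dumps (with a trailing "s"), which work with strings instead of file objects. They are convenient for experimenting without creating files. The sketch below uses a made-up dictionary, sampleDct, to show how Python values are written in JSON text.

```python
import json

# Made-up data covering several value types
sampleDct = {'apples': 2.99, 'inStock': True, 'tags': ['fruit', 'red'], 'note': None}

jsonText = json.dumps(sampleDct)     # Python object -> JSON string
print(jsonText)                      # True is written as true, None as null

restoredDct = json.loads(jsonText)   # JSON string -> Python object
print(restoredDct == sampleDct)
```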

3.1 Load a JSON object

Let's load the content from a file that contains a tweet stored in the JSON format:

import json

with open("tweet.json", 'r') as inFile:
    tweetDct = json.load(inFile)
    
tweetDct

{'id': 1079460557160247297,
 'source': 'Twitter for Android',
 'text': 'Looking fwd to 2019 working w @mediaaction to advance racial justice &amp; #mediajustice in a digital age! This looks like #NoDigitalPrisons, hold FB accountable &amp; ensure POC/ low-income communities can access basic necessities like phone &amp; web. Donate today! https://t.co/9iU1jRSLmt https://t.co/d2kb2Y9EHI',
 'public_metrics': {'retweet_count': 3,
  'reply_count': 0,
  'like_count': 5,
  'quote_count': 0},
 'entities': {'urls': [{'start': 268,
    'end': 291,
    'url': 'https://t.co/9iU1jRSLmt',
    'expanded_url': 'https://www.classy.org/fundraiser/1809542',
    'display_url': 'classy.org/fundraiser/180…',
    'status': 200,
    'unwound_url': 'https://support.mediajustice.org/fundraiser/1809542'},
   {'start': 292,
    'end': 315,
    'url': 'https://t.co/d2kb2Y9EHI',
    'expanded_url': 'https://twitter.com/mediajustice/status/1079436958667919361',
    'display_url': 'twitter.com/mediajustice/s…'}],
  'mentions': [{'start': 30,
    'end': 42,
    'username': 'mediaaction',
    'id': '14881478'}],
  'hashtags': [{'start': 75, 'end': 88, 'tag': 'mediajustice'},
   {'start': 123, 'end': 140, 'tag': 'NoDigitalPrisons'}]},
 'author_id': 80507653,
 'created_at': '2018-12-30 19:33:46+00:00'}

3.2 Dump a JSON object into a file

Let's store the content of a dictionary into a JSON file:

with open("tweet2.json", 'w') as outFile:
    json.dump(tweetDct, outFile)

Verify operation

We can read the content of the second file and compare it with the original tweet, to see if the command worked correctly.

with open("tweet2.json", 'r') as inFile:
    tweetDct2 = json.load(inFile)
    
tweetDct == tweetDct2

True
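
The comparison succeeds here because all keys of tweetDct are strings. JSON object keys are always strings, so a dictionary with integer keys, such as the one from Section 2.1, does not come back unchanged. The sketch below demonstrates this with a shortened version of that dictionary; it writes to a temporary folder (an assumption made so that no existing file is overwritten) and uses the optional indent argument of json.dump to make the file easier to read.

```python
import json
import os
import tempfile

nameLengthDct = {4: ['Andy'], 3: ['Eni', 'Lyn']}

# Write into a fresh temporary folder
filePath = os.path.join(tempfile.mkdtemp(), 'nameLengths.json')
with open(filePath, 'w') as outFile:
    json.dump(nameLengthDct, outFile, indent=2)  # indent=2 pretty-prints

with open(filePath, 'r') as inFile:
    loadedDct = json.load(inFile)

print(loadedDct)                   # the keys are now the strings '4' and '3'
print(loadedDct == nameLengthDct)  # False, because '4' != 4
```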

3.3 Exercise: From text to JSON

You are given the file 'fruitPrices.txt' that contains prices for various fruits. Write the function fruitConversion that will:

Read the content of the file 'fruitPrices.txt' and store it as a dictionary, making sure the prices are converted into floats.
Save the dictionary into the JSON file 'fruitPrices.json'.

fruitPrices.txt

apples: 2.99
oranges: 3.29
bananas: 1.49
grapes: 5.99
kiwis: 2.59

fruitPrices.json

{"apples": 2.99, "oranges": 3.29, "bananas": 1.49, "grapes": 5.99, "kiwis": 2.59}

def fruitConversion(filename):
    """Write a function that:
    1. reads the content from filename
    2. splits the lines and inserts the key:value pairs into a dictionary
    3. saves the dictionary into a JSON file (that shares the name with the original file)
    """
    # Your code here
    with open(filename, 'r') as inputData:
        fruitsDct = {}
        for line in inputData:
            fruit, price = line.split(':')
            fruitsDct[fruit] = float(price)
     
    newFileName = filename.split('.')[0] + '.json'
    with open(newFileName, 'w') as outFile:
        json.dump(fruitsDct, outFile)

# Test the function
fruitConversion('fruitPrices.txt')

Let's view the file:

more fruitPrices.json

If you were successful, you should be able to see the dictionary in the JSON file.

4. Challenge Problem: Work with JSON files

You are given a JSON file that contains tweets about the Black Lives Matter movement. These tweets were collected for a research project by a Wellesley research lab.

The JSON file, blmTweets.json, contains a list of 1000 tweets as simple dictionaries with two keys, an ID and the text of the tweet:

[{'id': 1072284009122586625, 'text': 'The case of Jacob Walter Anderson from @Baylor is the perfect amalgamation between the #MeToo and #BlackLivesMatter movements. #ThisIsWhyWeAreAngry'}, 
{'id': 1071990529448075264, 'text': 'Now, that you all have some background information to this short story, please go read it at 👉👉👉 https://t.co/KRGkjbNJbY 👈👈👈 #NoJusticeNoPeace #BlackLivesMatter #MissionFree #DefendOurFreedom 😎'},
...
]

Let's load the file to check it out:

import json

with open('blmTweets.json', 'r') as inputFile:
    blmTweets = json.load(inputFile)
    
len(blmTweets)

1000

blmTweets[0]

{'id': 1072284009122586625,
 'text': 'The case of Jacob Walter Anderson from @Baylor is the perfect amalgamation between the #MeToo and #BlackLivesMatter movements. #ThisIsWhyWeAreAngry'}

Here are some questions that we can ask of these data:

Which are the most frequently mentioned hashtags?
Do certain hashtags co-occur together more often than others?

To simplify the analysis, we will transform the data so that each tweet stores only its lowercased hashtags. For example, the two tweets above will become:

[{'id': 1072284009122586625, 'hashtags': ['#metoo', '#blacklivesmatter',
'#thisiswhyweareangry']}, 
{'id': 1071990529448075264, 'hashtags': ['#nojusticenopeace', '#blacklivesmatter',
'#missionfree', '#defendourfreedom']},
...
]

We will break down the problem to solve it step-by-step.

4.1 Convert a tweet into a list of hashtags

Write a function that, given a tweet as a string, returns a list of all (lowercased) hashtags from the tweet. If you want to challenge yourself, try to write the solution with a list comprehension (this is optional).

def tweetToHashtags(tweetText):
    """This is the version without list comprehension.
    """
    # Your solution here
    hashtags = []
    words = tweetText.lower().split() # get the list of words
    for word in words:
        if word.startswith('#'):
            hashtags.append(word)
        
    return hashtags

tweetToHashtags(blmTweets[0]['text'])

['#metoo', '#blacklivesmatter', '#thisiswhyweareangry']

Optional: Write the same function using list comprehension.

def tweetToHashtagsLC(tweetText):
    """This is the version with list comprehension.
    """
    # Your solution here
    return [word for word in tweetText.lower().split() if word.startswith('#')]

tweetToHashtagsLC(blmTweets[0]['text'])

['#metoo', '#blacklivesmatter', '#thisiswhyweareangry']
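
Splitting on whitespace keeps any punctuation attached to a hashtag, and it misses hashtags that follow an opening parenthesis. One possible refinement, sketched here with the standard re module, extracts only the '#' and the word characters that follow it. The function name tweetToHashtagsRegex and the sample text are made up for this illustration, and this is not the solution the rest of the notebook relies on.

```python
import re

def tweetToHashtagsRegex(tweetText):
    """Hypothetical variant: return lowercased hashtags without punctuation."""
    # '#\w+' matches a '#' followed by letters, digits, or underscores
    return re.findall(r'#\w+', tweetText.lower())

sampleText = 'So proud. #BlackLivesMatter, always! (#MeToo)'

print(tweetToHashtagsRegex(sampleText))

# For comparison, the whitespace-based approach on the same text
splitResult = [word for word in sampleText.lower().split() if word.startswith('#')]
print(splitResult)
```

A caveat: the pattern also matches a '#' that appears inside a URL fragment, so it trades one kind of error for another.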

4.2 Write a new JSON file with the tweets transformed as hashtag lists

Using the helper function tweetToHashtags, iterate over all the tweets, create a new list of dictionaries of the form {'id': someId, 'hashtags': [some hashtags]}, and save it into a JSON file that we will use to answer our questions.

Note: The JSON file, tweetsWithHashtags.json, can be found in the notebook folder. This way, you can continue with the other subsections without having to solve this step.

def storeNewTweetFormat(tweetsList, filename):
    """Given a list of tweet dicts, do the following:
    1. Create a new list which will store new dicts, each of them having an 
       id and list of hashtags.
    2. Dump this list into a JSON file, using the provided filename.
    """
    # Your code here
    newTweetList = [] # accumulator variable
    
    for tweet in tweetsList:
        hashtags = tweetToHashtags(tweet['text'])
        newDict = {'id': tweet['id'], 'hashtags': hashtags}
        newTweetList.append(newDict)
        
    with open(filename, 'w') as outputFile:
        json.dump(newTweetList, outputFile)

# Test the code
storeNewTweetFormat(blmTweets, 'tweetsWithHashtags.json')

more tweetsWithHashtags.json

4.3 Create a frequency dictionary of all hashtags

Using the file 'tweetsWithHashtags.json', you'll write a function that iterates through all the tweet dicts and accumulates a new frequency dictionary that keeps track of all hashtag occurrences.

def findFrequencyHashtags(filename):
    """Given a JSON file that has a list of dicts, each representing a tweet,
    do the following:
    1. Read the list of tweets (load the JSON file)
    2. Iterate over the tweets and keep track of all hashtag occurrences
    into a dictionary.
    3. Return the dictionary.
    """
    # Your code here
    with open(filename) as inputFile:
        tweetsList = json.load(inputFile)
        
    hashtagFreq = {} # accumulator variable
    for tweet in tweetsList:
        hashtags = tweet['hashtags']
        for ht in hashtags:
            hashtagFreq[ht] = hashtagFreq.get(ht, 0) + 1 
            
    return hashtagFreq

hashtagsDct = findFrequencyHashtags('tweetsWithHashtags.json')
len(hashtagsDct)

1285

Let's explore this dictionary:

list(hashtagsDct.keys())[:5]

['#metoo',
 '#blacklivesmatter',
 '#thisiswhyweareangry',
 '#interview',
 '#ferguson']

list(hashtagsDct.items())[:5]

[('#metoo', 31),
 ('#blacklivesmatter', 629),
 ('#thisiswhyweareangry', 1),
 ('#interview', 1),
 ('#ferguson', 22)]
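
The second question from the beginning of this section, whether certain hashtags co-occur, can be approached with the same accumulation pattern by using pairs of hashtags as keys. The following is only a sketch under stated assumptions: sampleTweets is a small made-up list in the same shape as the content of tweetsWithHashtags.json, and pairFreq is an illustrative name.

```python
from itertools import combinations

# Made-up sample in the same shape as the transformed tweets
sampleTweets = [
    {'id': 1, 'hashtags': ['#metoo', '#blacklivesmatter']},
    {'id': 2, 'hashtags': ['#blacklivesmatter', '#metoo', '#ferguson']},
    {'id': 3, 'hashtags': ['#ferguson']},
]

pairFreq = {}  # accumulator: (hashtag1, hashtag2) -> number of tweets with both
for tweet in sampleTweets:
    # set() drops repeats within a tweet; sorted() makes sure that
    # ('#a', '#b') and ('#b', '#a') are counted as the same pair
    uniqueTags = sorted(set(tweet['hashtags']))
    for pair in combinations(uniqueTags, 2):
        pairFreq[pair] = pairFreq.get(pair, 0) + 1

print(pairFreq)
```

Note that a dictionary with tuple keys cannot be passed to json.dump directly; the keys would first have to be turned into strings, or the items into a list.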

4.4 Save the sorted hashtags into a JSON file

A dictionary has no sort method, so instead of sorting the dictionary itself, we sort the list of its items, that is, the list of (key, value) tuples:

[('#metoo', 31),
 ('#blacklivesmatter', 629),
 ('#thisiswhyweareangry', 1),
 ('#interview', 1),
 ('#ferguson', 22)]

In the following, you'll write a function that first sorts the items of the dictionary, then saves the sorted list into a file.

def byFrequency(pair):
    """Helper function for the sorted function.
    Given a tuple such as ('#metoo', 31), it will return number 31.
    """
    return pair[1]

def sortAndSaveFrequentHashtags(hashtagsDict, filename):
    """
    First sort the items of the dictionary in the descending order (largest to lowest),
    then dump the list of tuples into a JSON file named 'filenmae'.
    """
    # Your code here
    sortedPairs = sorted(hashtagsDict.items(), 
                         key=byFrequency, 
                         reverse=True)
    
    with open(filename, 'w') as outputFile:
        json.dump(sortedPairs, outputFile)

# Let's test the function
sortAndSaveFrequentHashtags(hashtagsDct, 'sortedHashtags.json')

more sortedHashtags.json
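
One detail to keep in mind when reading this file back: JSON has no tuple type, so each (hashtag, count) tuple is stored as a JSON array and is loaded as a Python list. The sketch below shows the round trip with three of the counts from the output in Section 4.3. It writes to a temporary folder (an assumption, to avoid overwriting your file) and uses a lambda in place of byFrequency to stay self-contained.

```python
import json
import os
import tempfile

# Three of the counts shown in Section 4.3
hashtagsDct = {'#metoo': 31, '#blacklivesmatter': 629, '#ferguson': 22}
sortedPairs = sorted(hashtagsDct.items(), key=lambda pair: pair[1], reverse=True)

filePath = os.path.join(tempfile.mkdtemp(), 'sortedHashtags.json')
with open(filePath, 'w') as outputFile:
    json.dump(sortedPairs, outputFile)

with open(filePath) as inputFile:
    loadedPairs = json.load(inputFile)

print(sortedPairs[0])   # a tuple
print(loadedPairs[0])   # the same data, but as a list

# Convert back to tuples if the rest of the code expects them
topTwo = [tuple(pair) for pair in loadedPairs[:2]]
print(topTwo)
```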

This is the end of the notebook.