1. Review: Accumulation across types

In the following, we summarize accumulation problems that we have seen throughout the semester.

1.1 Accumulating with an integer

This is an example that we have seen often:

In [1]:
def isVowel(char):
    """Predicate function that returns true/false."""
    return char.lower() in list('aeiou')

# Code snippet to count vowels in a string
phrase = "the sun is shining"
count = 0 # accumulator variable
for letter in phrase:
    if isVowel(letter):
        count += 1
        
print(f"'{phrase}' has {count} vowels.")
'the sun is shining' has 5 vowels.

1.2 Accumulating with a string

Create a new string based on certain characteristics of a given one:

In [2]:
# Code snippet to encrypt a phrase
phrase = "the sun is shining"
newPhrase = ''
for letter in phrase:
    if isVowel(letter):
        newPhrase += '*'
    elif letter == ' ':
        newPhrase += ' '
    else:
        newPhrase += '_'

print(newPhrase)
__* _*_ *_ __*_*__

1.3 Accumulating with a list

This is also an example that we have seen often:

In [3]:
# Code snippet to find odd numbers in a list
numList = [-35, -16, -3, 0, 1, 5, 8, 11, 18, 25]
oddList = []
for num in numList:
    if num % 2 == 1:
        oddList.append(num)
        
print(oddList)
[-35, -3, 1, 5, 11, 25]

As you know, this can also be written in a more concise form with list comprehension:

In [4]:
oddList = [num for num in numList if num % 2 == 1]
oddList
Out[4]:
[-35, -3, 1, 5, 11, 25]

1.4 Accumulating with a dict

Instead of accumulating only one variable (for example, the number of vowels in a phrase), we can accumulate multiple things at once using a dictionary. For example, we can count the frequency of all letters in a string.

In [5]:
# count all letters in a string
word = "abracadabra"
lettersDict = {} # accumulator variable
for letter in word:
    if letter not in lettersDict:
        lettersDict[letter] = 1
    else:
        lettersDict[letter] += 1
        
print(lettersDict)
{'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
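
The if/else test in the loop above can be folded into a single line with the dictionary method get, which returns a default value when the key is missing. Below is a minimal sketch of that variant; the function name countLetters is introduced here only for illustration.

```python
def countLetters(word):
    """Return a dict mapping each character of word to its number of occurrences."""
    counts = {}  # accumulator variable
    for letter in word:
        # get returns 0 when the letter has not been seen yet
        counts[letter] = counts.get(letter, 0) + 1
    return counts

print(countLetters("abracadabra"))
```

Both versions build the same dictionary; which one reads better is largely a matter of taste.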

1.5 Exercise: Count file extensions

You are given a list of files:

['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py', 
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']

Write code that counts how many files of each type there are. You should expect the following result:

{'txt': 4, 
'csv': 3, 
'py': 2}
In [6]:
filesList = ['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py', 
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']

# Your code here
extensions = {}
for file in filesList:
    ext = file.split('.')[1]
    if ext not in extensions:
        extensions[ext] = 1
    else:
        extensions[ext] += 1
        
extensions
Out[6]:
{'txt': 4, 'csv': 3, 'py': 2}
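
Splitting on '.' works for the file names in this list, since each has exactly one dot. For names with several dots, or none at all, one option is os.path.splitext from the standard library. The sketch below assumes that the extension is whatever follows the last dot; the helper names getExtension and countExtensions are hypothetical.

```python
import os

def getExtension(filename):
    """Return the text after the last dot in filename, or '' if there is none."""
    # os.path.splitext splits 'poem1.txt' into ('poem1', '.txt')
    root, ext = os.path.splitext(filename)
    return ext[1:]  # drop the leading dot

def countExtensions(filesList):
    """Accumulate a dict that maps each extension to its number of files."""
    counts = {}  # accumulator variable
    for file in filesList:
        ext = getExtension(file)
        counts[ext] = counts.get(ext, 0) + 1
    return counts

print(countExtensions(['poem1.txt', 'archive.tar.gz', 'cities.csv', 'poem2.txt']))
```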

2. Accumulation with dicts and lists

With dictionaries we can perform more complex accumulations. This section contains some examples.

2.1 Group names by their length

Given the following list:

names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']

group the elements by length, by writing code that creates the following dictionary:

{4: ['Andy'], 
 7: ['Carolyn'], 
 3: ['Eni', 'Lyn'], 
 5: ['Peter', 'Sohie']
}
In [7]:
names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']

nameLengthDct = {} # accumulator variable

for name in names:
    nameLen = len(name)
    if nameLen not in nameLengthDct:
        nameLengthDct[nameLen] = [name] 
    else:
        nameLengthDct[nameLen].append(name)
    
nameLengthDct
Out[7]:
{4: ['Andy'], 7: ['Carolyn'], 3: ['Eni', 'Lyn'], 5: ['Peter', 'Sohie']}
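
The grouping pattern above (create the list on first sight, append afterwards) can also be written with the dictionary method setdefault, which inserts a default value for a missing key and returns the value stored under that key. Here is a short sketch; the function name groupByLength is made up for this example.

```python
def groupByLength(names):
    """Group names into a dict keyed by name length."""
    groups = {}  # accumulator variable
    for name in names:
        # setdefault returns the existing list, or stores and returns a new empty one
        groups.setdefault(len(name), []).append(name)
    return groups

names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']
print(groupByLength(names))
```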

2.2 Exercise: Group files by their extensions

Using the same list of files as in Section 1.5, let's group files by their extensions, instead of simply counting them. Here is what your code should produce:

{'txt': ['counts1.txt', 'counts2.txt', 'poem1.txt', 'poem2.txt'],
 'csv': ['students.csv', 'cities.csv', 'states.csv'],
 'py': ['optimism.py', 'wordsearch.py']}
In [8]:
filesList = ['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py', 
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']

# Your code here
extensions2 = {}
for file in filesList:
    ext = file.split('.')[1]
    if ext not in extensions2:
        extensions2[ext] = [file]
    else:
        extensions2[ext].append(file)
        
extensions2
Out[8]:
{'txt': ['counts1.txt', 'counts2.txt', 'poem1.txt', 'poem2.txt'],
 'csv': ['students.csv', 'cities.csv', 'states.csv'],
 'py': ['optimism.py', 'wordsearch.py']}

2.3 Group names by the vowels they contain

Given our (expanded and modified) list of names:

names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter', 'Sohie', 
'Ada', 'Smaranda', 'Catherine']

how would we go about grouping them according to the vowels they contain? Here is the dictionary we want to create:

{'a': ['Andy', 'Carolyn', 'Eniana', 'Ada', 'Smaranda', 'Catherine'], 
'e': ['Eniana', 'Peter', 'Sohie', 'Catherine'], 
'i': ['Eniana', 'Sohie', 'Catherine'], 
'o': ['Carolyn', 'Sohie']
} 

Note: This is not a trivial problem. What makes it hard is that some names (for example, Peter and Ada) contain the same vowel more than once, and we want to avoid listing such a name multiple times.

In [9]:
names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter', 
         'Sohie', 'Ada', 'Smaranda', 'Catherine']

vowelIndexDct = {} # accumulator variable

# Notice the nested for loops
for name in names:
    for letter in name.lower():
        if isVowel(letter):
            if letter not in vowelIndexDct:
                vowelIndexDct[letter] = [name]
            else:
                # notice that before appending, we first check if the name was added before
                if name not in vowelIndexDct[letter]:
                    vowelIndexDct[letter].append(name)
vowelIndexDct
Out[9]:
{'a': ['Andy', 'Carolyn', 'Eniana', 'Ada', 'Smaranda', 'Catherine'],
 'o': ['Carolyn', 'Sohie'],
 'e': ['Eniana', 'Peter', 'Sohie', 'Catherine'],
 'i': ['Eniana', 'Sohie', 'Catherine']}
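
Another way to avoid duplicates is to reduce each name to the set of distinct vowels it contains before accumulating, so that every (vowel, name) combination is visited only once. This is a sketch of that idea rather than the solution from the slides; the function name groupByVowels is hypothetical.

```python
def groupByVowels(names):
    """Map each vowel to the list of names that contain it at least once."""
    groups = {}  # accumulator variable
    for name in names:
        # distinct vowels of this name; sorted only to make the iteration order predictable
        vowelsInName = sorted(set(name.lower()) & set('aeiou'))
        for vowel in vowelsInName:
            groups.setdefault(vowel, []).append(name)
    return groups

names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter',
         'Sohie', 'Ada', 'Smaranda', 'Catherine']
print(groupByVowels(names))
```

Because the set removes repeated vowels up front, no membership check on the accumulated lists is needed.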

2.4 Dictionary comprehension

Similar to list comprehension, Python also supports dictionary comprehension. The syntax is very similar:

{ aKey: aValue for aKey in sequence}

Below are some examples:

In [10]:
# Find the lengths of words in a list

wordList = ['red', 'yellow', 'blue', 'violet', 'black']

lengthsDct = {word: len(word) for word in wordList}
lengthsDct
Out[10]:
{'red': 3, 'yellow': 6, 'blue': 4, 'violet': 6, 'black': 5}
In [11]:
# Shorten each day name into its first three letters

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

shortDays = {day: day[:3] for day in days}
shortDays
Out[11]:
{'Monday': 'Mon',
 'Tuesday': 'Tue',
 'Wednesday': 'Wed',
 'Thursday': 'Thu',
 'Friday': 'Fri',
 'Saturday': 'Sat',
 'Sunday': 'Sun'}

A good use for the dict comprehension is to initialize data structures that then can be used for accumulation. For example, the following dict comprehension:

In [12]:
vowelsDct = {vowel: [] for vowel in 'aeiou'}
vowelsDct
Out[12]:
{'a': [], 'e': [], 'i': [], 'o': [], 'u': []}

creates a data structure that we can use as a starting point for an alternative solution to the problem in Section 2.3.

2.5 Challenge: Rewrite the solution of 2.3 using vowelsDct as the starting point

Try it out. The solution is provided in the slides.

In [13]:
# Your code here
for vowel in vowelsDct:
    for name in names:
        if vowel in name.lower():
            if name not in vowelsDct[vowel]:
                vowelsDct[vowel].append(name)
                
vowelsDct
Out[13]:
{'a': ['Andy', 'Carolyn', 'Eniana', 'Ada', 'Smaranda', 'Catherine'],
 'e': ['Eniana', 'Peter', 'Sohie', 'Catherine'],
 'i': ['Eniana', 'Sohie', 'Catherine'],
 'o': ['Carolyn', 'Sohie'],
 'u': []}

Because we used all vowels as a starting point, but there is no name in the list that contains the vowel 'u', we can remove it from the dictionary:

In [14]:
vowelsDct.pop('u')
Out[14]:
[]
In [15]:
vowelsDct.keys()
Out[15]:
dict_keys(['a', 'e', 'i', 'o'])
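
When it is not known in advance which vowels end up with an empty list, a dictionary comprehension with a condition can drop all empty entries at once. The sketch below starts from a copy of the result shown in Out[13]; the name nonEmptyDct is introduced only for this example.

```python
vowelsDct = {'a': ['Andy', 'Carolyn', 'Eniana', 'Ada', 'Smaranda', 'Catherine'],
             'e': ['Eniana', 'Peter', 'Sohie', 'Catherine'],
             'i': ['Eniana', 'Sohie', 'Catherine'],
             'o': ['Carolyn', 'Sohie'],
             'u': []}

# keep only the entries whose list of names is not empty
nonEmptyDct = {vowel: namesList
               for vowel, namesList in vowelsDct.items()
               if namesList}

print(list(nonEmptyDct.keys()))
```

Unlike pop, this builds a new dictionary and leaves vowelsDct unchanged.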

3. The JSON format

Refer to slides 7 and 8 in the lecture notes for a discussion of why the JSON format is useful. Python's standard library includes the json module, which handles the details of reading and writing JSON. We will use two functions from this module, load and dump.

If fileObj is a file object opened for reading or writing, then the syntax for using these two functions is the following:

json.load(fileObj) # read the content of a file into a Python object
json.dump(someData, fileObj) # write the content of a variable into a file
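
To experiment without creating a file, the json module also offers dumps and loads, which convert to and from a string instead of a file object. They are not used elsewhere in this notebook. The small example below, with made-up data, shows two things to keep in mind when round-tripping Python objects: dictionary keys come back as strings, and tuples come back as lists.

```python
import json

original = {4: ['Andy'], 3: ['Eni', 'Lyn'], 'pair': ('#metoo', 31)}

text = json.dumps(original)   # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object

print(text)
print(restored)
print(original == restored)   # False: int keys became strings, the tuple became a list
```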

3.1 Load a JSON object

Let's load the content from a file that contains a tweet stored in the JSON format:

In [16]:
import json

with open("tweet.json", 'r') as inFile:
    tweetDct = json.load(inFile)
    
tweetDct
Out[16]:
{'id': 1079460557160247297,
 'source': 'Twitter for Android',
 'text': 'Looking fwd to 2019 working w @mediaaction to advance racial justice & #mediajustice in a digital age! This looks like #NoDigitalPrisons, hold FB accountable & ensure POC/ low-income communities can access basic necessities like phone & web. Donate today! https://t.co/9iU1jRSLmt https://t.co/d2kb2Y9EHI',
 'public_metrics': {'retweet_count': 3,
  'reply_count': 0,
  'like_count': 5,
  'quote_count': 0},
 'entities': {'urls': [{'start': 268,
    'end': 291,
    'url': 'https://t.co/9iU1jRSLmt',
    'expanded_url': 'https://www.classy.org/fundraiser/1809542',
    'display_url': 'classy.org/fundraiser/180…',
    'status': 200,
    'unwound_url': 'https://support.mediajustice.org/fundraiser/1809542'},
   {'start': 292,
    'end': 315,
    'url': 'https://t.co/d2kb2Y9EHI',
    'expanded_url': 'https://twitter.com/mediajustice/status/1079436958667919361',
    'display_url': 'twitter.com/mediajustice/s…'}],
  'mentions': [{'start': 30,
    'end': 42,
    'username': 'mediaaction',
    'id': '14881478'}],
  'hashtags': [{'start': 75, 'end': 88, 'tag': 'mediajustice'},
   {'start': 123, 'end': 140, 'tag': 'NoDigitalPrisons'}]},
 'author_id': 80507653,
 'created_at': '2018-12-30 19:33:46+00:00'}

3.2 Dump JSON object into a file

Let's store the content of a dictionary into a JSON file:

In [17]:
with open("tweet2.json", 'w') as outFile:
    json.dump(tweetDct, outFile)

Verify operation

We can read the content of the second file and compare it with the original tweet, to see if the command worked correctly.

In [18]:
with open("tweet2.json", 'r') as inFile:
    tweetDct2 = json.load(inFile)
    
tweetDct == tweetDct2
Out[18]:
True
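
By default, json.dump writes everything on one line. If the file is meant to be read by people as well, the optional indent parameter spreads the output over several lines. The sketch below uses a small made-up dictionary and a temporary folder so that it does not touch the notebook's files.

```python
import json
import os
import tempfile

sampleDct = {'id': 1, 'public_metrics': {'retweet_count': 3, 'like_count': 5}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'pretty.json')
    # indent=2 puts each key on its own line, indented by two spaces per level
    with open(path, 'w') as outFile:
        json.dump(sampleDct, outFile, indent=2)
    with open(path, 'r') as inFile:
        prettyText = inFile.read()

print(prettyText)
```

The indentation changes only the layout of the file; loading it back gives the same dictionary.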

3.3 Exercise: From text to JSON

You are given the file 'fruitPrices.txt' that contains prices for various fruits. Write the function fruitConversion that will:

  1. Read the content of the file 'fruitPrices.txt' and store it as a dictionary, making sure the prices are converted into floats.
  2. Save the dictionary into the JSON file 'fruitPrices.json'.

fruitPrices.txt

apples: 2.99
oranges: 3.29
bananas: 1.49
grapes: 5.99
kiwis: 2.59

fruitPrices.json

{"apples": 2.99, "oranges": 3.29, "bananas": 1.49, "grapes": 5.99, "kiwis": 2.59}
In [19]:
def fruitConversion(filename):
    """Write a function that:
    1. reads the content from filename
    2. splits the lines and inserts the key:value pairs into a dictionary
    3. saves the dictionary into a JSON file (that shares the name with the original file)
    """
    # Your code here
    with open(filename, 'r') as inputData:
        fruitsDct = {}
        for line in inputData:
            fruit, price = line.split(':')
            fruitsDct[fruit] = float(price)
     
    newFileName = filename.split('.')[0] + '.json'
    with open(newFileName, 'w') as outFile:
        json.dump(fruitsDct, outFile)
In [20]:
# Test the function
fruitConversion('fruitPrices.txt')

Let's view the file:

In [21]:
more fruitPrices.json

If you were successful, you should be able to see the dictionary in the JSON file.
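
The parsing step relies on every line having the form name: price. A slightly more defensive sketch is shown below, assuming that blank lines may appear in the file and should be skipped; the function name parsePriceLines and the sample lines are hypothetical.

```python
def parsePriceLines(lines):
    """Turn lines such as 'apples: 2.99' into a dict {'apples': 2.99}."""
    prices = {}  # accumulator variable
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split only at the first colon, then clean up both parts
        fruit, price = line.split(':', 1)
        prices[fruit.strip()] = float(price)
    return prices

sampleLines = ["apples: 2.99\n", "\n", "oranges: 3.29\n"]
print(parsePriceLines(sampleLines))
```

A line without a colon, or with a price that is not a number, would still raise a ValueError here.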

4. Challenge Problem: Work with JSON files

You are given a JSON file that contains tweets about the Black Lives Matter movement. These tweets were collected for a research project by a Wellesley research lab.

The JSON file, blmTweets.json, contains a list of 1000 tweets as simple dictionaries with two keys, an ID and the text of the tweet:

[{'id': 1072284009122586625, 'text': 'The case of Jacob Walter Anderson from @Baylor is the perfect amalgamation between the #MeToo and #BlackLivesMatter movements. #ThisIsWhyWeAreAngry'},
{'id': 1071990529448075264, 'text': 'Now, that you all have some background information to this short story, please go read it at 👉👉👉 https://t.co/KRGkjbNJbY 👈👈👈 #NoJusticeNoPeace #BlackLivesMatter #MissionFree #DefendOurFreedom 😎'},
...
]

Let's load the file to check it out:

In [22]:
import json

with open('blmTweets.json', 'r') as inputFile:
    blmTweets = json.load(inputFile)
    
len(blmTweets)
Out[22]:
1000
In [23]:
blmTweets[0]
Out[23]:
{'id': 1072284009122586625,
 'text': 'The case of Jacob Walter Anderson from @Baylor is the perfect amalgamation between the #MeToo and #BlackLivesMatter movements. #ThisIsWhyWeAreAngry'}

Here are some questions that we can ask of these data:

  1. Which are the most frequently mentioned hashtags?
  2. Do certain hashtags co-occur together more often than others?

To simplify working with the data, we will transform each tweet by storing only its lowercased hashtags. For example, the two tweets above will become:

[{'id': 1072284009122586625, 'hashtags': ['#metoo', '#blacklivesmatter',
'#thisiswhyweareangry']},
{'id': 1071990529448075264, 'hashtags': ['#nojusticenopeace', '#blacklivesmatter',
'#missionfree', '#defendourfreedom']},
...
]

We will break down the problem to solve it step-by-step.

4.1 Convert a tweet into a list of hashtags

Write a function that, given a tweet as a string, returns a list of all (lowercased) hashtags from the tweet. If you want to challenge yourself, try to write the solution with list comprehension (this is optional).

In [24]:
def tweetToHashtags(tweetText):
    """This is the version without list comprehension.
    """
    # Your solution here
    hashtags = []
    words = tweetText.lower().split() # get the list of words
    for word in words:
        if word.startswith('#'):
            hashtags.append(word)
        
    return hashtags
In [25]:
tweetToHashtags(blmTweets[0]['text'])
Out[25]:
['#metoo', '#blacklivesmatter', '#thisiswhyweareangry']

Optional: Write the same function using list comprehension.

In [26]:
def tweetToHashtagsLC(tweetText):
    """This is the version with list comprehension.
    """
    # Your solution here
    return [word for word in tweetText.lower().split() if word.startswith('#')]
In [27]:
tweetToHashtagsLC(blmTweets[0]['text'])
Out[27]:
['#metoo', '#blacklivesmatter', '#thisiswhyweareangry']
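
Splitting on whitespace keeps any punctuation attached to a hashtag, so '#BLM,' and '#BLM' would be counted as different hashtags. A possible refinement uses a regular expression, under the assumption that a hashtag is a '#' followed by letters, digits, or underscores. The function name tweetToHashtagsRE is introduced here for illustration.

```python
import re

def tweetToHashtagsRE(tweetText):
    """Return the lowercased hashtags of a tweet, without trailing punctuation."""
    # '#\w+' matches a '#' followed by one or more word characters
    return re.findall(r'#\w+', tweetText.lower())

print(tweetToHashtagsRE('Read this! #BLM, #MeToo.'))
```

Note that this pattern also matches a '#' in the middle of a word, which may or may not be what is wanted for a given dataset.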

4.2 Write a new JSON file with the tweets transformed as hashtag lists

Using the helper function tweetToHashtags, iterate over all the tweets, create a new list of dictionaries of the form {'id': someId, 'hashtags': [some hashtags]}, and save it into a JSON file that we will use to answer our questions.

Note: The JSON file, tweetsWithHashtags.json, can be found in the notebook folder. This way, you can continue with the following subsections without having to solve this step.

In [28]:
def storeNewTweetFormat(tweetsList, filename):
    """Given a list of tweet dicts, do the following:
    1. Create a new list which will store new dicts, each of them having an 
       id and list of hashtags.
    2. Dump this list into a JSON file, using the provided filename.
    """
    # Your code here
    newTweetList = [] # accumulator variable
    
    for tweet in tweetsList:
        hashtags = tweetToHashtags(tweet['text'])
        newDict = {'id': tweet['id'], 'hashtags': hashtags}
        newTweetList.append(newDict)
        
    with open(filename, 'w') as outputFile:
        json.dump(newTweetList, outputFile)
In [29]:
# Test the code
storeNewTweetFormat(blmTweets, 'tweetsWithHashtags.json')
In [30]:
more tweetsWithHashtags.json

4.3 Create a frequency dictionary of all hashtags

Using the file 'tweetsWithHashtags.json', you'll write a function that iterates through all the tweet dicts and accumulates a new frequency dictionary that keeps track of all hashtag occurrences.

In [31]:
def findFrequencyHashtags(filename):
    """Given a JSON file that has a list of dicts, each representing a tweet,
    do the following:
    1. Read the list of tweets (load the JSON file)
    2. Iterate over the tweets and keep track of all hashtag occurrences
    into a dictionary.
    3. Return the dictionary.
    """
    # Your code here
    with open(filename) as inputFile:
        tweetsList = json.load(inputFile)
        
    hashtagFreq = {} # accumulator variable
    for tweet in tweetsList:
        hashtags = tweet['hashtags']
        for ht in hashtags:
            hashtagFreq[ht] = hashtagFreq.get(ht, 0) + 1 
            
    return hashtagFreq
In [32]:
hashtagsDct = findFrequencyHashtags('tweetsWithHashtags.json')
len(hashtagsDct)
Out[32]:
1285

Let's explore this dictionary:

In [33]:
list(hashtagsDct.keys())[:5]
Out[33]:
['#metoo',
 '#blacklivesmatter',
 '#thisiswhyweareangry',
 '#interview',
 '#ferguson']
In [34]:
list(hashtagsDct.items())[:5]
Out[34]:
[('#metoo', 31),
 ('#blacklivesmatter', 629),
 ('#thisiswhyweareangry', 1),
 ('#interview', 1),
 ('#ferguson', 22)]
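
The second question from the start of this section, whether certain hashtags co-occur more often than others, can be approached with the same accumulation pattern by counting pairs instead of single hashtags. The following is only a sketch: the function name countHashtagPairs and the two sample tweets (with made-up ids) are assumptions for illustration.

```python
from itertools import combinations

def countHashtagPairs(tweetsList):
    """Count how many tweets contain each pair of distinct hashtags."""
    pairFreq = {}  # accumulator variable, keyed by (hashtag1, hashtag2) tuples
    for tweet in tweetsList:
        # remove repeats and sort, so a pair always appears in the same order
        uniqueTags = sorted(set(tweet['hashtags']))
        for pair in combinations(uniqueTags, 2):
            pairFreq[pair] = pairFreq.get(pair, 0) + 1
    return pairFreq

sampleTweets = [
    {'id': 1, 'hashtags': ['#metoo', '#blacklivesmatter', '#thisiswhyweareangry']},
    {'id': 2, 'hashtags': ['#blacklivesmatter', '#metoo']},
]
print(countHashtagPairs(sampleTweets))
```

Since the keys here are tuples, json.dump cannot write this dictionary directly; converting it to a list of items first is one way around that.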

4.4 Save the sorted hashtags into a JSON file

A dictionary has no sort method of its own, so instead of sorting the dictionary, we sort the list of its items, that is, the list of (key, value) tuples:

[('#metoo', 31),
 ('#blacklivesmatter', 629),
 ('#thisiswhyweareangry', 1),
 ('#interview', 1),
 ('#ferguson', 22)]

In the following, you'll write a function that first sorts the items of the dictionary, then saves the sorted list into a file.

In [35]:
def byFrequency(pair):
    """Helper function for the sorted function.
    Given a tuple such as ('#metoo', 31), it will return number 31.
    """
    return pair[1]
In [36]:
def sortAndSaveFrequentHashtags(hashtagsDict, filename):
    """
    First sort the items of the dictionary in descending order (largest to smallest),
    then dump the list of tuples into a JSON file named by 'filename'.
    """
    # Your code here
    sortedPairs = sorted(hashtagsDict.items(), 
                         key=byFrequency, 
                         reverse=True)
    
    with open(filename, 'w') as outputFile:
        json.dump(sortedPairs, outputFile)
In [37]:
# Let's test the function
sortAndSaveFrequentHashtags(hashtagsDct, 'sortedHashtags.json')
In [38]:
more sortedHashtags.json
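
As an aside, the standard library class collections.Counter offers a shortcut for the sorting step: its most_common method returns the (key, count) pairs ordered from the largest count to the smallest. The sketch below uses the five sample pairs shown earlier rather than the full dictionary.

```python
from collections import Counter

sampleFreq = {'#metoo': 31, '#blacklivesmatter': 629, '#thisiswhyweareangry': 1,
              '#interview': 1, '#ferguson': 22}

# most_common(3) returns the three pairs with the highest counts
topThree = Counter(sampleFreq).most_common(3)
print(topThree)
```

Writing the helper function byFrequency by hand, as done above, remains useful for seeing how the key parameter of sorted works.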

This is the end of the notebook.