Table of Contents
1.1 Accumulating with an integer
1.2 Accumulating with a string
1.3 Accumulating with a list
1.4 Accumulating with a dict
1.5 Exercise: Count file extensions
2. Accumulation with dicts and lists
2.1 Group names by their length
2.2 Exercise: Group files by their extension
2.3 Group names by the vowels they contain
2.4 Dict Comprehension
2.5 Challenge: Rewrite the solution of 2.3 using as the starting point vowelsDct
3. The JSON format
3.1 Load a JSON object
3.2 Dump a JSON object into a file
3.3 Exercise: From Text to JSON
4. Challenge Problem: Work with JSON files
4.1 Convert a tweet into a list of hashtags
4.2 Write a new JSON file with the tweets transformed as hashtag lists
4.3 Create a frequency dictionary of all hashtags
4.4 Save the sorted hashtags into a JSON file
def isVowel(char):
    """Predicate function that returns true/false."""
    return char.lower() in list('aeiou')
# Code snippet to count vowels in a string
phrase = "the sun is shining"
count = 0 # accumulator variable
for letter in phrase:
    if isVowel(letter):
        count += 1
print(f"'{phrase}' has {count} vowels.")
Create a new string based on certain characteristics of a given one:
# Code snippet to encrypt a phrase
phrase = "the sun is shining"
newPhrase = ''
for letter in phrase:
    if isVowel(letter):
        newPhrase += '*'
    elif letter == ' ':
        newPhrase += ' '
    else:
        newPhrase += '_'
print(newPhrase)
# Code snippet to find odd numbers in a list
numList = [-35, -16, -3, 0, 1, 5, 8, 11, 18, 25]
oddList = []
for num in numList:
    if num % 2 == 1:
        oddList.append(num)
print(oddList)
As you know, this can also be written in a more concise form with list comprehension:
oddList = [num for num in numList if num % 2 == 1]
oddList
Instead of accumulating only one variable (for example, the number of vowels in a phrase), we can accumulate multiple things at once using a dictionary. For example, we can count the frequency of all letters in a string.
# count all letters in a string
word = "abracadabra"
lettersDict = {} # accumulator variable
for letter in word:
    if letter not in lettersDict:
        lettersDict[letter] = 1
    else:
        lettersDict[letter] += 1
print(lettersDict)
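The same counting pattern can be written more compactly. The sketch below is one possible alternative, not the notebook's required solution: it uses dict.get with a default of 0 and, as a second variant, the Counter class from the standard collections module. The name lettersCounter is introduced here only for illustration.

```python
from collections import Counter

word = "abracadabra"

# Variant 1: dict.get supplies 0 when the letter has not been seen yet,
# so the if/else branch is no longer needed
lettersDict = {}  # accumulator variable
for letter in word:
    lettersDict[letter] = lettersDict.get(letter, 0) + 1

# Variant 2: Counter performs the same accumulation in one call
lettersCounter = Counter(word)

print(lettersDict)
print(lettersDict == dict(lettersCounter))
```

Both variants produce the same counts as the if/else version; which one reads best is largely a matter of taste.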
You are given a list of files:
['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py',
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']
Write code that counts how many files of each type there are. You should expect the following result:
{'txt': 4,
'csv': 3,
'py': 2}
filesList = ['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py',
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']
# Your code here
extensions = {}
for file in filesList:
    ext = file.split('.')[1]
    if ext not in extensions:
        extensions[ext] = 1
    else:
        extensions[ext] += 1
extensions
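Note that file.split('.')[1] takes the text after the first dot, which works for the file names in this exercise because each contains exactly one dot. As a hedged sketch of how names with several dots could be handled, the example below uses os.path.splitext from the standard library. The list trickyFiles is made up for illustration and is not part of the exercise.

```python
import os

# Hypothetical file names, not from the exercise
trickyFiles = ['notes.v2.txt', 'archive.tar.gz', 'poem1.txt']

extensions = {}  # accumulator variable
for file in trickyFiles:
    # splitext returns (root, '.ext'); lstrip removes the leading dot
    ext = os.path.splitext(file)[1].lstrip('.')
    extensions[ext] = extensions.get(ext, 0) + 1

print(extensions)
```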
With dictionaries we can perform more complex accumulations. This section contains some examples.
Given the following list:
names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']
group the elements by length, by writing code that creates the following dictionary:
{4: ['Andy'],
7: ['Carolyn'],
3: ['Eni', 'Lyn'],
5: ['Peter', 'Sohie']
}
names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']
nameLengthDct = {} # accumulator variable
for name in names:
    nameLen = len(name)
    if nameLen not in nameLengthDct:
        nameLengthDct[nameLen] = [name]
    else:
        nameLengthDct[nameLen].append(name)
nameLengthDct
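The "create the list if the key is new, otherwise append" pattern appears often enough that Python offers shortcuts for it. The following sketch shows two possible alternatives, dict.setdefault and collections.defaultdict. The names byLength and byLength2 are introduced here only for illustration.

```python
from collections import defaultdict

names = ['Andy', 'Carolyn', 'Eni', 'Lyn', 'Peter', 'Sohie']

# Variant 1: setdefault returns the existing list for the key,
# or stores and returns a new empty list if the key is missing
byLength = {}
for name in names:
    byLength.setdefault(len(name), []).append(name)

# Variant 2: defaultdict(list) creates the empty list automatically
byLength2 = defaultdict(list)
for name in names:
    byLength2[len(name)].append(name)

print(byLength)
print(dict(byLength2) == byLength)
```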
Using the same list of files as in Section 1.5, let's group files by their extensions, instead of simply counting them. Here is what your code should produce:
{'txt': ['counts1.txt', 'counts2.txt', 'poem1.txt', 'poem2.txt'],
'csv': ['students.csv', 'cities.csv', 'states.csv'],
'py': ['optimism.py', 'wordsearch.py']}
filesList = ['counts1.txt', 'counts2.txt', 'students.csv', 'optimism.py',
'wordsearch.py', 'cities.csv', 'states.csv', 'poem1.txt', 'poem2.txt']
# Your code here
extensions2 = {}
for file in filesList:
    ext = file.split('.')[1]
    if ext not in extensions2:
        extensions2[ext] = [file]
    else:
        extensions2[ext].append(file)
extensions2
Given our (expanded and modified) list of names:
names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter', 'Sohie',
'Ada', 'Smaranda', 'Catherine']
how would we go about grouping them according to the vowels they contain? Here is the dictionary we want to create:
{'a': ['Andy', 'Carolyn', 'Eniana', 'Ada', 'Smaranda', 'Catherine'],
'e': ['Eniana', 'Peter', 'Sohie', 'Catherine'],
'i': ['Eniana', 'Sohie', 'Catherine'],
'o': ['Carolyn', 'Sohie']
}
Note: This is not a trivial problem. What makes it hard is the fact that some names (for example, Peter, Ada, etc.) have several occurrences of the same letter, and we want to avoid listing them multiple times.
names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter',
'Sohie', 'Ada', 'Smaranda', 'Catherine']
vowelIndexDct = {} # accumulator variable
# Notice the nested for loops
for name in names:
    for letter in name.lower():
        if isVowel(letter):
            if letter not in vowelIndexDct:
                vowelIndexDct[letter] = [name]
            else:
                # notice that before appending, we first check if the name was added before
                if name not in vowelIndexDct[letter]:
                    vowelIndexDct[letter].append(name)
vowelIndexDct
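One way to sidestep the duplicate check is to first reduce each name to the set of distinct vowels it contains, so that each (vowel, name) pair is visited only once. The sketch below is an alternative under that assumption, not the solution from the slides. The variable vowelsInName is introduced here for illustration.

```python
names = ['Andy', 'Carolyn', 'Eniana', 'Lyn', 'Peter',
         'Sohie', 'Ada', 'Smaranda', 'Catherine']

vowelIndexDct = {}  # accumulator variable
for name in names:
    # Set intersection keeps each vowel of this name only once
    vowelsInName = set(name.lower()) & set('aeiou')
    for vowel in sorted(vowelsInName):
        vowelIndexDct.setdefault(vowel, []).append(name)

print(vowelIndexDct)
```

The keys may be inserted in a different order than in the nested-loop version, but the two dictionaries compare equal because dictionary equality ignores key order.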
Similarly to list comprehension, we can do dictionary comprehension. The syntax is very similar, concretely:
{ aKey: aValue for aKey in sequence}
Below are some examples:
# Find the lengths of words in a list
wordList = ['red', 'yellow', 'blue', 'violet', 'black']
lengthsDct = {word: len(word) for word in wordList}
lengthsDct
# Shorten each day name into its first three letters
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
shortDays = {day: day[:3] for day in days}
shortDays
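Like a list comprehension, a dict comprehension can also carry a filtering condition and can iterate over the items of another dictionary. The sketch below illustrates both; the names longWords and upperDct are made up for this example.

```python
wordList = ['red', 'yellow', 'blue', 'violet', 'black']

# Keep only the words longer than four letters
longWords = {word: len(word) for word in wordList if len(word) > 4}

# Build a new dict from the (key, value) pairs of an existing one
lengthsDct = {word: len(word) for word in wordList}
upperDct = {word.upper(): length for word, length in lengthsDct.items()}

print(longWords)
print(upperDct)
```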
A good use for the dict comprehension is to initialize data structures that then can be used for accumulation. For example, the following dict comprehension:
vowelsDct = {vowel: [] for vowel in 'aeiou'}
vowelsDct
creates a data structure that we can use as a starting point for an alternative solution for the exercise 2.3.
Try it out; the solution is provided in the slides.
# Your code here
for vowel in vowelsDct:
    for name in names:
        if vowel in name.lower():
            if name not in vowelsDct[vowel]:
                vowelsDct[vowel].append(name)
vowelsDct
Because we used all vowels as a starting point, but there is no name in the list that contains the vowel 'u', we can remove it from the dictionary:
vowelsDct.pop('u')
vowelsDct.keys()
Refer to slides 7 and 8 in the lecture notes for a discussion of why the JSON format is useful. Python's standard library includes the json module, which deals with the complexities of reading and writing JSON objects. We will use two functions from this module, load and dump.
If fileObj is a file object opened for reading or writing, then the syntax for using these two functions is the following:
json.load(fileObj) # read the content of a file into a Python object
json.dump(someData, fileObj) # write the content of a variable into a file
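To see the two functions work together without touching the disk, here is a minimal round-trip sketch. It assumes an in-memory io.StringIO object standing in for a real file opened with open(...), and the content of someData is made up for illustration.

```python
import io
import json

# Made-up data for illustration
someData = {'name': 'apples', 'price': 2.99, 'tags': ['fruit', 'red']}

fileObj = io.StringIO()        # stands in for a file opened with open(...)
json.dump(someData, fileObj)   # write the Python object as JSON text
fileObj.seek(0)                # go back to the start before reading
restored = json.load(fileObj)  # read the JSON text back into a Python object

print(restored == someData)
```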
Let's load the content from a file that contains a tweet stored in the JSON format:
import json
with open("tweet.json", 'r') as inFile:
    tweetDct = json.load(inFile)
tweetDct
Let's store the content of a dictionary into a JSON file:
with open("tweet2.json", 'w') as outFile:
    json.dump(tweetDct, outFile)
Verify operation
We can read the content of the second file and compare it with the original tweet, to see if the command worked correctly.
with open("tweet2.json", 'r') as inFile:
    tweetDct2 = json.load(inFile)
tweetDct == tweetDct2
You are given the file 'fruitPrices.txt' that contains prices for various fruits.
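Write the function fruitConversion that will read the content of the file, insert the key:value pairs into a dictionary, and save the dictionary into a JSON file that shares its name with the original file: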
fruitPrices.txt
apples: 2.99
oranges: 3.29
bananas: 1.49
grapes: 5.99
kiwis: 2.59
fruitPrices.json
{"apples": 2.99, "oranges": 3.29, "bananas": 1.49, "grapes": 5.99, "kiwis": 2.59}
def fruitConversion(filename):
    """Write a function that:
    1. reads the content from filename
    2. splits the lines and inserts the key:value pairs into a dictionary
    3. saves the dictionary into a JSON file (that shares the name with the original file)
    """
    # Your code here
    with open(filename, 'r') as inputData:
        fruitsDct = {}
        for line in inputData:
            fruit, price = line.split(':')
            fruitsDct[fruit] = float(price)
    newFileName = filename.split('.')[0] + '.json'
    with open(newFileName, 'w') as outFile:
        json.dump(fruitsDct, outFile)
# Test the function
fruitConversion('fruitPrices.txt')
Let's view the file:
more fruitPrices.json
If you were successful, you should be able to see the dictionary in the JSON file.
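The parsing step can be tried on its own, without any files. The sketch below assumes a made-up string sampleText in the same "name: price" layout, and additionally skips blank lines, which the exercise file does not contain. It uses json.dumps, the string-returning counterpart of json.dump, to show the resulting JSON text.

```python
import json

# Made-up sample in the same layout as fruitPrices.txt, with one blank line
sampleText = "apples: 2.99\noranges: 3.29\n\nbananas: 1.49\n"

fruitsDct = {}  # accumulator variable
for line in sampleText.splitlines():
    if not line.strip():
        continue  # skip blank lines, which would break the split below
    fruit, price = line.split(':')
    fruitsDct[fruit.strip()] = float(price)  # float() tolerates the leading space

jsonText = json.dumps(fruitsDct)
print(jsonText)
```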
You are given a JSON file that contains tweets about the Black Lives Matter movement. These tweets were collected for a research project by a Wellesley research lab.
The JSON file, blmTweets.json, contains a list of 1000 tweets as simple dictionaries with two keys, an ID and the text of the tweet:
[{'id': 1072284009122586625, 'text': 'The case of Jacob Walter Anderson from @Baylor is the perfect amalgamation between the #MeToo and #BlackLivesMatter movements. #ThisIsWhyWeAreAngry'},
{'id': 1071990529448075264, 'text': 'Now, that you all have some background information to this short story, please go read it at 👉👉👉 https://t.co/KRGkjbNJbY 👈👈👈 #NoJusticeNoPeace #BlackLivesMatter #MissionFree #DefendOurFreedom 😎'},
...
]
Let's load the file to check it out:
import json
with open('blmTweets.json', 'r') as inputFile:
    blmTweets = json.load(inputFile)
len(blmTweets)
blmTweets[0]
Here are some questions that we can ask of these data:
To simplify working with data, we will transform our data by storing the lowercased hashtags for each tweet. For example, the two tweets above will become:
[{'id': 1072284009122586625, 'hashtags': ['#metoo', '#blacklivesmatter',
                                           '#thisiswhyweareangry']},
 {'id': 1071990529448075264, 'hashtags': ['#nojusticenopeace', '#blacklivesmatter',
                                           '#missionfree', '#defendourfreedom']},
...
]
We will break down the problem to solve it step-by-step.
Write a function that given a tweet as a string, returns a list of all (lowercased) hashtags from the tweet. If you want to challenge yourself, try to write the solution with list comprehension (this is optional).
def tweetToHashtags(tweetText):
    """This is the version without list comprehension.
    """
    # Your solution here
    hashtags = []
    words = tweetText.lower().split() # get the list of words
    for word in words:
        if word.startswith('#'):
            hashtags.append(word)
    return hashtags
tweetToHashtags(blmTweets[0]['text'])
Optional: Write the same function using list comprehension.
def tweetToHashtagsLC(tweetText):
    """This is the version with list comprehension.
    """
    # Your solution here
    return [word for word in tweetText.lower().split() if word.startswith('#')]
tweetToHashtagsLC(blmTweets[0]['text'])
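Splitting on whitespace keeps any punctuation attached to a hashtag, so '#MeToo,' would be stored as '#metoo,'. As a hedged alternative, and not the notebook's intended solution, the sketch below uses a regular expression instead. The function name tweetToHashtagsRegex and the sample tweet are made up for illustration, and the pattern assumes a hashtag is '#' followed by letters, digits, or underscores.

```python
import re

def tweetToHashtagsRegex(tweetText):
    """Hypothetical variant: return lowercased hashtags without trailing punctuation."""
    # '#\w+' matches '#' followed by one or more word characters
    return re.findall(r'#\w+', tweetText.lower())

# Made-up tweet text for illustration
sampleTweet = "Read this: #MeToo, #BlackLivesMatter! (made-up tweet)"

# The split-based approach keeps the punctuation
splitBased = [w for w in sampleTweet.lower().split() if w.startswith('#')]

print(splitBased)
print(tweetToHashtagsRegex(sampleTweet))
```

One trade-off is that the pattern also matches a '#' that appears in the middle of a word, so the two approaches can disagree on unusual input.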
Using the helper function tweetToHashtags, iterate over all the tweets and create a new list of dictionaries {'id': someId, 'hashtags': [some hashtags]}, then save it into a JSON file that we will use to answer our questions.
Note: The JSON file, tweetsWithHashtags.json, can be found in the notebook folder. This way, you can continue with the other subsection, without having to solve this step.
def storeNewTweetFormat(tweetsList, filename):
    """Given a list of tweet dicts, do the following:
    1. Create a new list which will store new dicts, each of them having an
       id and list of hashtags.
    2. Dump this list into a JSON file, using the provided filename.
    """
    # Your code here
    newTweetList = [] # accumulator variable
    for tweet in tweetsList:
        hashtags = tweetToHashtags(tweet['text'])
        newDict = {'id': tweet['id'], 'hashtags': hashtags}
        newTweetList.append(newDict)
    with open(filename, 'w') as outputFile:
        json.dump(newTweetList, outputFile)
# Test the code
storeNewTweetFormat(blmTweets, 'tweetsWithHashtags.json')
more tweetsWithHashtags.json
Using the file 'tweetsWithHashtags.json', you'll write a function that iterates through all the tweet dicts and accumulates a new frequency dictionary that keeps track of all hashtag occurrences.
def findFrequencyHashtags(filename):
    """Given a JSON file that has a list of dicts, each representing a tweet,
    do the following:
    1. Read the list of tweets (load the JSON file)
    2. Iterate over the tweets and keep track of all hashtag occurrences
       into a dictionary.
    3. Return the dictionary.
    """
    # Your code here
    with open(filename) as inputFile:
        tweetsList = json.load(inputFile)
    hashtagFreq = {} # accumulator variable
    for tweet in tweetsList:
        hashtags = tweet['hashtags']
        for ht in hashtags:
            hashtagFreq[ht] = hashtagFreq.get(ht, 0) + 1
    return hashtagFreq
hashtagsDct = findFrequencyHashtags('tweetsWithHashtags.json')
len(hashtagsDct)
Let's explore this dictionary:
list(hashtagsDct.keys())[:5]
list(hashtagsDct.items())[:5]
A dictionary cannot be sorted easily. Instead of sorting the dictionary itself, we can sort the list of its items, that is, the list of (key, value) tuples:
[('#metoo', 31),
('#blacklivesmatter', 629),
('#thisiswhyweareangry', 1),
('#interview', 1),
('#ferguson', 22)]
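As a small sketch of the idea, the built-in sorted function accepts a key function that picks the value to sort by, plus reverse=True for descending order. The example below applies it to the five pairs shown above, using a lambda as the key function. The names pairs and sortedPairs are used only for this illustration.

```python
# The five (hashtag, frequency) pairs shown above
pairs = [('#metoo', 31),
         ('#blacklivesmatter', 629),
         ('#thisiswhyweareangry', 1),
         ('#interview', 1),
         ('#ferguson', 22)]

# Sort by the second element of each tuple, largest first;
# pairs with equal frequency keep their original relative order
sortedPairs = sorted(pairs, key=lambda pair: pair[1], reverse=True)

for hashtag, freq in sortedPairs:
    print(hashtag, freq)
```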
In the following, you'll write a function that first sorts the items of the dictionary, then saves the sorted list into a file.
def byFrequency(pair):
    """Helper function for the sorted function.
    Given a tuple such as ('#metoo', 31), it will return number 31.
    """
    return pair[1]

def sortAndSaveFrequentHashtags(hashtagsDict, filename):
    """
    First sort the items of the dictionary in descending order (largest to smallest),
    then dump the list of tuples into a JSON file named 'filename'.
    """
    # Your code here
    sortedPairs = sorted(hashtagsDict.items(),
                         key=byFrequency,
                         reverse=True)
    with open(filename, 'w') as outputFile:
        json.dump(sortedPairs, outputFile)
# Let's test the function
sortAndSaveFrequentHashtags(hashtagsDct, 'sortedHashtags.json')
more sortedHashtags.json
This is the end of the notebook.