Table of Contents
txt
and json
csv
module to work with CSV filescsv.DictReader
csv.DictWriter
writeheader
, writerows
, writerow
txt
and json
¶We have learned how to read/write text files and load/dump json files.
For both formats, we first open a file for reading and writing. The file object has methods to read and write text content, but not complex data structures such as dictionaries of lists. To do that, we use the special functions json.load
and json.dump
. Below are examples of how to perform certain operations. They all assume that there is a variable filepath
which stores the path to a desired file on our computer.
Read all content of a file
with open(filepath, 'r') as fileObj:
text = fileObj.read()
Read one line from a file
with open(filepath, 'r') as fileObj:
line = fileObj.readline()
Read all lines from a file
with open(filepath, 'r') as fileObj:
allLines = fileObj.readlines()
Write into a file
with open(filepath, 'w') as fileObj:
fileObj.write(someText)
Load a JSON object from a file
with open(filepath, 'r') as fileObj:
ourData = json.load(fileObj)
Dump a JSON object into a file
with open(filepath, 'w') as fileObj:
json.dump(ourData, fileObj)
A CSV (comma-separated values) file is a file within which the values are separated by commas. An appropriate application like Excel can display this data as a table of rows and columns. However, a CSV file is a text file, thus, we can read its content similarly to all text files we have seen so far, with the simple twist that we need to split values at commas. Below is a function that does that:
def tuplesFromFile(filename):
'''Read each line from opened file,
strip white space,
split at commas,
convert as tuple and
return a list of tuples.
'''
with open(filename, 'r') as inputFile:
theTuples = [tuple(line.strip().split(',')) # using list comprehension
for line in inputFile] # implicitely uses .readline()
return theTuples
Let's read the content of the file us-states.csv
:
statesData = tuplesFromFile("us-states.csv")
len(statesData)
There are 51 entries, because the first one is the tuple with the column names:
statesData[:4]
There are so many things we can do with this data, but for the moment, we will try something simple: create a CSV file that will look like this:
Let's first create the data that looks similar to what the file should contains:
capitals = [] # new list to store tuples of ("Capital, Abbr", Population)
for _,_,abbr,cap,capPop in statesData[1:]: # notice the tuple assignment, since each line is a tuple
# also, we're skipping the row at index 0
capitals.append((f"{cap}, {abbr}", capPop)) # use the f-string to join two values into a string
capitals[:4] # show the first four items
This worked well. Now we can store this data into a CSV file, making sure that we use comma to separate the values, as well as the newline character, '\n'.
with open("capitals-only.csv", "w") as outF:
for item in capitals:
line = f"{item[0]},{item[1]}\n" # create the line to write in the file
outF.write(line)
We can verify the the file was created:
more capitals-only.csv
The data is stored in this new file. Will our function tuplesFromFile
be able to correctly read this file?
capitals2 = tuplesFromFile("capitals-only.csv")
capitals == capitals2
We didn't get back the same content that we put in. That is cause for concern. Let's look at each list separately:
capitals[:3]
capitals2[:3]
We can see what happened: our function tuplesFromFile
has split the values at the comma, and created a three-element tuple, instead of the two-element tuple we wrote into the file. There are certainly ways to deal with this, but having to write custom-made code everytime we read a new file that we haven't seen before, is not sustainable. This is where Python libraries (or modules) come handy. They contain functions that take into account all possible scenarios, and allow us to abstract away the details of each file.
The json
module we have already encountered is such an example, and today we'll look at the csv
module.
csv
module to work with CSV data¶The csv
module has special functions (known as constructors) that create objects that know how to perform certain operations. These objects are similar to the file objects we learned about a few weeks ago.
csv.reader
--> an object that knows how to read a file object line by line. It returns a list of lists.csv.writer
--> an object that knows how to write lists into a file object.csv.DictReader
--> an object that knows how to read a file object into a list with each element as a dictionary.csv.DictWriter
--> an object that knows how to write into a file object a list of dictionaries.Note: In CS111, we will only focus on csv.DictReader
and csv.DictWriter
. Students who are interested to learn about the other operations should contact an instructor.
import csv
csv.DictReader
¶The csv.DictReader
allows us to create an object that knows how to read a line from a CSV file and represent that as a dictionary.
To understand how csv.DictReader
works, we will use a very simple CSV file, countries.csv:
country,capital
Canada,Ottawa
Mexico,Mexico City
South Korea,Seoul
Ukraine,Kiev
There are two steps to follow:
csv.DictReader
that is tied to a file objectwith open("countries.csv", "r") as inputFile:
dctReader = csv.DictReader(inputFile) # create an object of the csv.DictReader tied to the file object
dctRows = [dict(row) for row in dctReader] # read the content of dctReader, forcing it to iterate
dctRows
We received a list of dictionaries, where every dictionary represents a row of the CSV table. Let's look at the other objects involved in the block of code:
inputFile # shows the value stored by this variable, an object
dctReader # shows the value stored by this variable, an object
type(inputFile) # the type of our object, this is known as "class name"
type(dctReader) # the type of our object, this is known as "class name"
Thus, we have two different objects here. The first one knows how to open a file for reading, the second one how to read the content of the file line by line and convert each line into a dictionary.
What happens if we don't use an explicit dict
conversion?
with open("countries.csv", "r") as inputFile:
dctReader = csv.DictReader(inputFile) # create an object of the csv.DictReader tied to the file object
dctRows = [row for row in dctReader]
dctRows
Python by default creates something called an OrderedDict
, which is shown above as a list of tuples. Since this is not something we have learned, we prefer to use dict
to convert directly into a dictionary.
In the folder, you'll find the file "us-capitals.csv", which looks like this:
"Capital, Abbr",Population
"Montgomery, AL",198525
"Juneau, AK",32113
"Phoenix, AZ",1680992
"Little Rock, AR",197312
"Sacramento, CA",513624
Your task is to write a function that given the name of the file as a parameter, it will return the total population that lives in all these US capitals.
def totalCapitalsPopulation(filename):
"""This fuction should:
1. read the content of the given CSV file as a list of dictionaries
2. Use an accumulator variable to calculate the total population while
iterating over the csv.DictReader object.
Note: It is possible to solve this problem in at least two ways, one of which
uses list comprehension.
"""
# Your code here
totalPop = 0 # accumulator variable
with open(filename, 'r') as fileObj:
dctReader = csv.DictReader(fileObj)
for row in dctReader:
totalPop += int(row['Population'])
return totalPop
Test the function:
totalCapitalsPopulation("us-capitals.csv") # Expected value: 13584723
OPTIONAL: Alternate solution with list comprehension
Python has a useful built-in function, sum
, which can be used to get the sum of all numbers in a list. For example:
sum([1,2,3,4,5])
Write below the alternate solution with list comprehension. It needs to use the function sum
.
def totalCapitalsPopulationLC(filename):
"""An alternate solution that uses list comprehension.
It makes use of the function sum.
"""
# Your code here
with open(filename, 'r') as fileObj:
dctReader = csv.DictReader(fileObj)
totalSum = sum([int(row['Population']) for row in dctReader])
return totalSum
totalCapitalsPopulationLC("us-capitals.csv") # Expected value: 13584723
DictWriter
¶How to write into a CSV file? You already have created comma-separated files in the past. We also saw one example in this notebook:
capitals = [('Montgomery, AL', '198525'),
('Juneau, AK', '32113'),
...
]
with open("capitals-only.csv", "w") as outF:
for item in capitals:
row = f"{item[0]},{item[1]}\n" # create the line to write in the file
outF.write(row)
But, this created the issue of how to deal with values that contain commas. This is why we will now discuss csv.DictWriter
, which resolves this issue and many others.
Let's say we are given a list of dictionaries (csv.DictWriter
only works with a list of dictionaries):
oscarMovies = [{'title': 'CODA', 'year': 2022},
{'title': 'Nomadland', 'year': 2021},
{'title': 'Parasite', 'year': 2020}]
Below is a code snippet to write this list of dicts as a CSV file. It uses this structure:
csv.DictWriter
object tied to the file objectoscarMovies = [{'title': 'CODA', 'year': 2022},
{'title': 'Nomadland', 'year': 2021},
{'title': 'Parasite', 'year': 2020}]
columns = oscarMovies[0].keys() # find the names of the columns
with open('oscarWinners.csv', 'w') as outFile:
dctWriter = csv.DictWriter(outFile, fieldnames=columns)
dctWriter.writeheader() # no need for argument, it was provided above
dctWriter.writerows(oscarMovies)
Let's check the file:
more oscarWinners.csv
Swapping Columns
csv.DictWriter
has a cool feature. It allows us to easily swap the order of columns, through its parameter 'fieldnames'. Concretely:
columns2 = ['year', 'title']
with open('oscarWinners2.csv', 'w') as outFile:
dctWriter = csv.DictWriter(outFile, fieldnames=columns2) # new variable, columns2
dctWriter.writeheader()
dctWriter.writerows(oscarMovies) # list of dicts stays the same
more oscarWinners2.csv
We successfully rewrote the structure of our file with very little effort. That is the power of using Python libraries.
writeheader
, writerows
, writerow
¶In the previous example, we used two methods of the csv.DictWriter
object:
writeheader
--> writes the first row of the file, containing the column nameswriterows
--> writes all the rows, using the data from the list of dictionariesThe latter method is very handy when we already have the list of dictionaries. But sometimes we don't. Sometimes we might be creating a dictionary as we go. In this case, we can use writerow
instead. Let's see a simple example:
# writing a file with names and alteregos of famous superheroes
with open('superheroes.csv', 'w') as outFile:
dctW = csv.DictWriter(outFile,
fieldnames=['name', 'alterego'])
dctW.writerow({'name': 'Superman', 'alterego': 'Clark Kent'})
dctW.writerow({'name': 'Scarlet Witch', 'alterego': 'Wanda Maximoff'})
dctW.writerow({'name': 'Wonderwoman', 'alterego': 'Diana Prince'})
more superheroes.csv
We left out the method writeheader
intentionally, for you to see that if we don't explicitely call it, the csv.DictWriter
object will not write the names of the columns, even though it has access to them.
IMPORTANT: If you forget to use .writeheader()
and then read a CSV file with csv.DictReader
, it will use the first row as the list of column names.
with open('superheroes.csv', 'r') as inputFile:
dctR = csv.DictReader(inputFile)
rows = [dict(row) for row in dctR]
rows
As you can see, this is not what we want.
Remember this example, so that you don't forget to add the header for the file.
You are given an arbitrarely long list of names that are strings composed of a first, middle, and last name. Here is an example:
scientists = ['Maria Gaetana Agnesi',
'Marie-Anne Pierrette Paulze',
'Mary Fairfax Somerville'
'Elizabeth Garrett Anderson',
...
]
Use csv.DictWriter
to write the content of this list as a CSV file with three columns as shown below. Make use of the methods writeheader
, and writerow
within a for loop.
first,middle,last
Maria,Gaetana,Agnesi
Marie-Anne,Pierrette,Paulze
Mary,Fairfax,Somerville
Elizabeth,Garrett,Anderson
This is not straightforward, because we are not given a list of dictionaries, as in previous examples. How can we easily create dictionaries? We show some examples below.
dict
and zip
¶We saw in the lecture about dictionaries that Python has a built-in function, dict
, which can create a dictionary from a list of lists or a list of tuples. Here is an example:
fruitsDct = dict([('apple', 2.99), ('banana', 1.49), ('orange', 3.29)])
fruitsDct
What made this easy was the fact the we already had a list of tuples. What if our data is not already in this form? For example, we can have two separate lists:
fruits = ['apple', 'banana', 'orange']
prices = [2.99, 1.49, 3.29]
How do we create a list of pairs here? We can use list comprehension!
fruitPrices = [(fruits[i], prices[i]) for i in range(len(fruits))]
fruitPrices
Is there an easier way to do this? Yes, the built-in function zip
, which does what the word says, zips together two lists (or more), element-wise.
zip(fruits, prices) # creates an object for performing the zipping
list(zip(fruits, prices)) # forces zip to do the work
dict(zip(fruits, prices)) # create a dict from the zipped lists
Now that we know about how dict
and zip
work, we can solve our problem more easily.
# these are some early women scientists (18th and 19th century)
scientists = ['Maria Gaetana Agnesi',
'Marie-Anne Pierrette Paulze',
'Mary Fairfax Somerville',
'Elizabeth Garrett Anderson'
]
def writeNamesInColumns(namesList, filename):
"""For simplicity, assume all names have all three parts.
We leave it as an optional excersise for students to deal with names that
don't have a middle part.
You must use csv.DictWriter!
"""
columns = ['first', 'middle', 'last']
# Your code here
with open(filename, 'w') as outFile:
dctW = csv.DictWriter(outFile, fieldnames=columns)
dctW.writeheader()
for name in namesList:
nameparts = name.split()
dctRow = dict(zip(columns, nameparts))
dctW.writerow(dctRow)
Let's test the function:
writeNamesInColumns(scientists, 'scientists.csv')
more scientists.csv
It created the file that we intended.
Does it make sense to sort a dictionary? What happens if we try to do so?
fruitColors = {"banana": "yellow", "kiwi": "green",
"grapes": "purple", "apple": "red",
"lemon": "yellow", "pomegranate": "red"}
sorted(fruitColors)
We didn't get an error, but all that was sorted were the keys of the dictionary.
Try to predict what we will see in the following lines:
sorted(fruitColors.values())
sorted(fruitColors.items())
sorted(fruitColors.keys())
In conclusion we can say:
but not the dictionary itself. Meanwhile, we might be able to sort a list of dictionaries, though it's not straightforward.
First, we need to convert our dictionary of fruitColors
into a list of dictionaries, all with the same structure:
# Create a list of dictionaries
fruitColorLst = []
for fruit, color in fruitColors.items():
fruitColorLst.append({'fruit': fruit,
'color': color})
fruitColorLst
Can we sort this list?
sorted(fruitColorLst)
It is not working, becuase Python doesn't have rules for comparing dictionaries, the way that there are rules for comparing strings and tuples (as we have learned when we discussed sorting of sequences). That means, we would have to provide the key parameter to tell Python how to compare the items of this list of dictionaries for sorting.
def byFruit(fruitDct):
"""Helper function to be used in sorted."""
return fruitDct['fruit']
sorted(fruitColorLst, key=byFruit)
Your Turn: Do the same, but by sorting by color.
# Sort fruitColorLst by fruit color
# Your code here
def byColor(fruitDct):
"""Helper function to be used in sorted."""
return fruitDct['color']
sorted(fruitColorLst, key=byColor)
Now that we learned how to sort list of dictionaries, we will put this in practice by reading CSVs, sorting them by various columns, and writing them in CSV files.
In Section 2 of this notebook, we used the file us-states.csv, which contained information about each state, such as:
State,StatePop,Abbrev.,Capital,CapitalPop
Alabama,4921532,AL,Montgomery,198525
Alaska,731158,AK,Juneau,32113
Arizona,7421401,AZ,Phoenix,1680992
Arkansas,3030522,AR,Little Rock,197312
California,39368078,CA,Sacramento,513624
Colorado,5807719,CO,Denver,727211
These are real-world data retrieved from the Census website. In the United States, representation in Congress (House of Representatives) is decided from changes in states' population every ten years. Using the skills that you have learned in the past few lectures, you should be able to write Python code to answer questions with Census data:
You will write code to read the CSV file, sort it in different ways, and then write the results in new CSV files.
We are providing you with the code of reading a CSV file as a list of dictionaries, so that you can focus on sorting.
import csv
# Read content of CSV file as a list of dictionaries
with open("us-states.csv", 'r') as inputF:
dctReader = csv.DictReader(inputF)
stateDctList = [dict(row) for row in dctReader]
len(stateDctList)
Look up an item of this list:
stateDctList[0]
Important: All values in the dictionaries are strings, even the numbers. This is because the CSV format is fundamentally a text format. That means that if we want to use the numerical values, we need to convert them to integer or floats, as appropriate.
How to approach this problem? Here is the algorithm for it:
byStatePop
, which, given a dictionary with state data (one row from our file), returns the appropriate value. We will convert this value into an integer value.sorted
function to the list of dictionaries of state data (stateDctList
), using the key parameter that has as value byStatePop
.reverse
to change the order of sorting.def byStatePop(stateDct):
"""Helper function to be used in sorted."""
return int(stateDct['StatePop'])
# Sort the list, save it in a variable, check its three first values
sortedStates1 = sorted(stateDctList, key=byStatePop)
sortedStates1[:3]
We can see that it's not in the right order, so we need to use reverse
in the sorted
function:
sortedStates1 = sorted(stateDctList, key=byStatePop, reverse=True)
sortedStates1[:3]
We'll use below an advanced syntax for the f-string, that adds a comma to the integer representation of population values. Do not worry about it, we want you simply to be aware that f-string is very powerful.
# Let's print out the top 6 most populated countries (as shown in Slide 15)
print("Top six most populated US states:\n")
for stateDct in sortedStates1[:6]:
print(f"{stateDct['Abbrev.']} -> {int(stateDct['StatePop']):,}")
Your Turn:
Now it's your turn. Write the function writeSortedFile
, which takes two parameters:
writeSortedFile(sortedStates1,
'statesByPopulation_desc.csv')
The function doesn't return anything. Its side effect is to create the CSV file with the data.
def writeSortedFile(sortedList, # contains a list of sorted dictionaries
filename # the filename
):
"""Opens a file for writing,
finds the list of fields (the keys of each dictionary in the list)
creates a DictWriter object (don't forget the fieldnames)
writes the header,
writes the rows.
"""
# Your code here
with open(filename, 'w') as outputFile:
columns = sortedList[0].keys()
dctW = csv.DictWriter(outputFile, fieldnames=columns)
dctW.writeheader()
dctW.writerows(sortedList)
Let's call the function:
writeSortedFile(sortedStates1,
'statesByPopulation_desc.csv')
Let's see the created file:
more statesByPopulation_desc.csv
What if we want to write only two columns and not all of them?
We can pass to fieldnames
parameter a subset of the columns, but Python will complain about the dictionary to write having more keys. However, there is an additional parameter that will force Python to drop this complain:
dctW = csv.DictWriter(outputFile,
fieldnames=columns,
extrasaction='ignore' # ignore keys not in columns
)
Optional Task
Modify the function writeSortedFile
to only save some of the columns in the CSV file.
def writeSortedFile(sortedList, # contains a list of sorted dictionaries
columns, # contains a partial list of the columns
filename # the filename
):
"""Opens a file for writing,
creates a DictWriter object,
writes the header,
write the rows
"""
# Your code here
with open(filename, 'w') as outputFile:
dctW = csv.DictWriter(outputFile,
fieldnames=columns,
extrasaction='ignore'
)
dctW.writeheader()
dctW.writerows(sortedList)
writeSortedFile(sortedStates1,
['Abbrev.', 'StatePop'],
'statesByPopulation_desc.csv')
more statesByPopulation_desc.csv
This is very similar to the first part of the Section 5.2. Write code that prints out the following six states (the least populated in the US):
Top six least populated US states:
WY -> 582,328
VT -> 623,347
AK -> 731,158
ND -> 765,309
SD -> 892,717
DE -> 986,809
# Your code here
sortedStates2 = sorted(stateDctList, key=byStatePop)
print("Top six least populated US states:\n")
for stateDct in sortedStates2[:6]:
print(f"{stateDct['Abbrev.']} -> {int(stateDct['StatePop']):,}")
While the notebook makes it easy to solve problems, the solution is scattered in multiple cells. What we would like to have is a series of functions that can be put together to solve a problem once and for all, and then just use the solution with different arguments.
In this task, you will try to write a single function writeSortedColumns
, which can be used to solve both questions we encountered above (sort by the most or least populated states, and show a subset of columns). Concretely, the function should take these parameters:
The function should create a new CSV file to contain the sorted columsn. It should use as many helper functions as possible, for example, a function that creates a new file name which combines the existing file name with the words: "ascending" or "descending" (depending on the sorting direction).
# Write the helper function to change the file name
# Your code here
def createFileName(filename, reverseOrder):
"""Helper function to create a new filename.
"""
name, ext = filename.split('.')
if reverseOrder == True:
extra = "descending"
else:
extra = "ascending"
return f"{name}_{extra}.{ext}"
createFileName("us-states.csv", True)
createFileName("us-states.csv", False)
def writeSortedColumns(filename, reverseOrder, columnsList):
"""
A function that given a CSV filename will:
- read its content
- sort it
- create a new filename based on sorting order
- write the sorted rows in the new file
"""
# Your code here
# Step 1: Read the file
with open(filename, 'r') as inputFile:
dctReader = csv.DictReader(inputFile)
rowsDctList = [dict(row) for row in dctReader]
# Step 2: Sort the list of dicts
sortedRows = sorted(rowsDctList, key=byStatePop, reverse=reverseOrder)
# Step 3: create new file name
newfilename = createFileName(filename, reverseOrder)
# Step 4: write a new CSV file
with open(newfilename, 'w') as outputFile:
dctWriter = csv.DictWriter(outputFile,
fieldnames=columnsList,
extrasaction='ignore'
)
dctWriter.writeheader()
dctWriter.writerows(sortedRows)
Test the function to do the descending order (from the most populous to the least):
writeSortedColumns("us-states.csv", True, ['Abbrev.', 'StatePop'])
more us-states_descending.csv
Test the function to do the ascending order (from the most populous to the least):
# we are changing the arguments for the 2nd and 3rd parameter
writeSortedColumns("us-states.csv", False, ['State', 'StatePop'])
more us-states_ascending.csv
Additional Challenge
How would you need to modify the function writeSortedColumns
, so that instead of sorting by state population, it sorts by capital population?
This is all for today!