myFile = open('thesis.txt', 'r') # 'r' means open the file for reading
print(myFile)
The object returned from open() is an instance of a class named _io.TextIOWrapper. This class specifies how to interpret a file as a stream of text.
type(myFile)
The mode 'r' for reading the file is optional, because reading is the default mode when opening a file:
myFile2 = open('thesis.txt')
myFile2
Aside: The encoding 'UTF-8' that you see here is a universal standard for representing characters from all languages. It extends ASCII, which we mentioned a while back (you might still remember the ASCII table: http://www.asciitable.com/). However, the Unicode table is much bigger: ASCII uses 7 bits per character, while UTF-8 uses between 1 and 4 bytes (8 to 32 bits) per character. Here is where to find the Unicode table: https://unicode-table.com/en/#basic-latin.
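To make the encoding aside concrete, here is a small sketch that encodes four characters with UTF-8 and counts the resulting bytes. The sample characters are our own choice, written as escape codes so that the cell does not depend on how your editor stores them.

```python
# Four single characters: A, e-acute, the euro sign, and an emoji.
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']

byteLengths = []
for ch in samples:
    encoded = ch.encode('utf-8')          # convert the string to bytes
    byteLengths.append(len(encoded))
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s) in UTF-8")
```

The plain ASCII letter needs one byte, while the other characters need two, three, and four bytes respectively.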
The simplest way to read a file is to call .read() on a file object, which returns the contents of the file as a string of characters.
text = myFile.read() # read (as a string) the contents of the file myFile opened above
text
Reading the contents of an open file object mutates it: each read advances the object's position in the file. Once .read() has consumed the whole file, subsequent read operations on the same open file return the empty string:
myFile.read()
If we want to read the file again, we'd need to open a new Python file object for that file. But before we do that, we should close the file; see the next section.
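The behavior described above can be checked with a short, self-contained sketch. It assumes nothing about thesis.txt: it creates a throwaway file named demo.txt (a hypothetical name) in a temporary folder, using write mode, which is covered later in this notebook.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')   # hypothetical file name
    with open(path, 'w') as out:              # create a small file to read
        out.write('first line\nsecond line\n')

    demoFile = open(path)
    firstRead = demoFile.read()     # the whole file
    secondRead = demoFile.read()    # nothing left to read: ''
    demoFile.close()

    reopened = open(path)           # a fresh file object starts at the beginning
    thirdRead = reopened.read()
    reopened.close()

print(repr(firstRead))
print(repr(secondRead))
print(repr(thirdRead))
```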
It's important to close a file when you are done with it, to make sure that its contents get saved (if you have written to it) and to avoid taking up operating system resources (if you are just reading from it).
There are two ways to open/close files: one is explicit, the other is implicit. We prefer the implicit approach in CS111.
f = open('thesis.txt', 'r') # open the file
text = f.read() # read the contents of the file as a string
f.close() # close the file
text # return the file contents
If you try to perform operations on a closed file, you'll get an error.
f.read()
with ... as ...
This automatically closes the file opened by open upon exiting the with statement.
with open('thesis.txt', 'r') as f:
    contents = f.read()
# f is implicitly closed here
contents # display the contents
The file has been closed by with, even though we didn't close it explicitly:
f.read()
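If you would rather check the file's state than trigger an error, every file object has a closed attribute. The sketch below uses a throwaway file (demo.txt is a hypothetical name) to show the attribute before and after the with block, and catches the ValueError that Python raises when reading a closed file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')   # hypothetical file name
    with open(path, 'w') as out:
        out.write('hello\n')

    with open(path, 'r') as f:
        closedInside = f.closed     # False: the file is open inside the block
        contents = f.read()
    closedAfter = f.closed          # True: with closed the file on exit

    try:
        f.read()
        readFailed = False
    except ValueError as err:       # reading a closed file raises ValueError
        readFailed = True
        print('Error:', err)

print(closedInside, closedAfter)
```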
read, readlines, readline, and for loop
These approaches all read data from the file, but they behave differently.
read
As shown above, this method returns a single string with the entire contents of the file. For small files, this makes it easy to access the words with a single split call. This method is not recommended for big files, because the entire file is loaded into memory at once.
with open('cities.txt', 'r') as inputFile:
    allText = inputFile.read()
allText
allText.split()
readlines
This method returns a list of all lines in the file, where each line is a string ending in the newline character (except possibly the last line). If a list of lines is desired, this is easier than splitting the result of read.
Note: This creates a list of the lines in the file. If the file is big, this is a big list that needs to be stored in memory.
with open('cities.txt', 'r') as inputFile:
    allLines = inputFile.readlines()
allLines
readline
This returns the next line in the file as a string, and it keeps the newline character. Conceptually, it also moves a cursor in the file object to the beginning of the next line, and the next call to readline will read the line starting at that cursor. If the cursor is at the end of the file, readline returns the empty string to indicate that there are no more lines to read.
lines = []
with open('cities.txt', 'r') as inputFile:
    for _ in range(6):
        lines.append(inputFile.readline())
lines
Above, note that the first four calls to readline return the four cities, but the last two calls return the empty string because there are no more lines to read.
for loop
Most of the time, we will not be using any of the three methods introduced above. The file object is an iterator that, when used in a for loop, iterates over the lines of the file without using .readline() or .readlines() explicitly.
def linesFromFile(filename):
    '''Returns a list of all the lines from a file with the given filename.
    In each line, the terminating newline has been removed.
    '''
    with open(filename, 'r') as inputFile:
        lines = []
        for line in inputFile: # notice we're not using a method here
            lines.append(line.strip()) # .strip() removes surrounding whitespace, including the trailing newline
        return lines # file still closes even with return in `with` block
print(linesFromFile('cities.txt'))
You can think of the for loop
for line in inputFile:
    lines.append(line.strip())
as equivalent to the following while loop:
line = inputFile.readline()
while line != '':
    lines.append(line.strip())
    line = inputFile.readline()
Python Jupyter notebooks allow us to use some simple operating system (OS) commands to query the computer's filesystem. (More on this in a later lecture.)
In order to work, these commands must be used in a cell with no other Python code.
pwd
The pwd (print working directory) command shows which directory (folder) in the computer we're currently connected to. Other commands will operate on this directory.
pwd
The above string actually consists of a sequence of folder names separated by the / character. This sequence is known as a path because it describes how you navigate from the top of your file system to the particular folder that the system is connected to.
In a later lecture, we will see that folders and files on a computer are organized into file trees and that a path describes a way to move between nodes in the tree.
The ls command lists the files in the current working directory.
ls
The 'ls -l' command lists the files with extra information, including their size (in bytes) and a timestamp of when they were last modified.
ls -l
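The same information is also available from ordinary Python code through the os library. The sketch below is a rough counterpart of pwd and ls -l, not a replacement for them: it prints the working directory, then each file's name and size in bytes.

```python
import os

currentFolder = os.getcwd()                    # similar to pwd
entries = sorted(os.listdir(currentFolder))    # similar to ls

print(currentFolder)
for name in entries:
    fullPath = os.path.join(currentFolder, name)
    if os.path.isfile(fullPath):               # skip subfolders
        print(name, os.path.getsize(fullPath), 'bytes')
```

Unlike the OS commands, this code can share a cell with other Python code.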
cat
The cat command prints out the contents of a file. They will appear as the result of the cell; you can use more instead (see below) for larger files.
cat cities.txt
more
The more command displays the contents of a file. In a Jupyter notebook, they appear in a pop-up window at the bottom of the notebook page; the pop-up can be closed by clicking on the X in its upper right corner.
more cities.txt
Note that the file name does not appear in quotes.
To open a file for writing, we use open with the mode 'w'.
The following code will create a new file named memories.txt in the current working directory and write several lines of text in it.
with open('memories.txt', 'w') as memfileW:
    memfileW.write('get coffee\n') # need newlines
    memfileW.write('do CS111 homework\n')
    memfileW.write('vote!\n')
We can use ls -l to see that a new file memories.txt has been created:
ls -l
In your notebook, you should see the new file memories.txt that was just created, with a timestamp to prove it.
Use the OS command cat to view the contents of the file:
cat memories.txt
Alternatively, go to Finder (on a Mac) or Windows Explorer (PC) to view the contents of the file.
Let's write formatted strings to our files. F-strings are a mechanism for generating complex strings without concatenation. Below is an example; notice the letter f before the start of the string.
g, n = ('celery', 34) # tuple assignment
print(f"I need to buy {n} pounds of {g}.\n") # notice the curly braces for the variables
The great thing about f-strings is that Python will automatically convert non-string values to strings that will fill the holes (the curly braces) in the f-string.
groceries = [('celery', 34),
             ('rice', 27),
             ('chocolate chips', 3.5),
             ('sorbet', 73)]
with open('memories.txt', 'w') as memfile:
    for g, n in groceries:
        memfile.write(f"I need to buy {n} pounds of {g}.\n")
cat memories.txt
Note that writing to an existing file erases the previous contents and replaces them with the new contents.
ls -l
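Here is a minimal sketch of that overwriting behavior in isolation. It works on a throwaway file (notes.txt is a hypothetical name) in a temporary folder, so it does not touch memories.txt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'notes.txt')   # hypothetical file name

    with open(path, 'w') as notes:
        notes.write('first version\n')
    with open(path, 'w') as notes:             # opening with 'w' again empties the file
        notes.write('second version\n')

    with open(path, 'r') as notes:
        finalContents = notes.read()

print(repr(finalContents))
```

Only the text from the second with block survives.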
Take the contents of cities.txt and print out the following:
Line 1: Wilmington
Line 2: Philadelphia
Line 3: Boston
Line 4: Charlotte
Your implementation should safely open and close cities.txt. You should avoid using .read() or .readlines(). Remember that the lines in the file end with \n, so you should get rid of it.
Advice: solve this using the incremental programming strategy:
# Your code here
with open('cities.txt', 'r') as citiesFile:
    lineNumber = 0
    for cityLine in citiesFile:
        lineNumber += 1
        print(f"Line {lineNumber}: {cityLine.strip()}")
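As an alternative to keeping a counter by hand, Python's built-in enumerate can pair each line with its number. The sketch below is one possible variation, not the required solution; to stay self-contained it first writes its own copy of the four cities to a temporary folder.

```python
import os
import tempfile

numbered = []
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'cities.txt')   # private copy for this demo
    with open(path, 'w') as out:
        out.write('Wilmington\nPhiladelphia\nBoston\nCharlotte\n')

    with open(path, 'r') as citiesFile:
        # start=1 makes the numbering begin at 1 instead of 0
        for lineNumber, cityLine in enumerate(citiesFile, start=1):
            numbered.append(f"Line {lineNumber}: {cityLine.strip()}")

for text in numbered:
    print(text)
```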
How do we add lines to the end of an existing file?
We can't open the file in write mode (with a 'w'), because that erases all previous contents and starts with an empty file.
Instead, we open the file in append mode (with an 'a'). Any subsequent writes are made after the existing contents.
with open('memories.txt', 'a') as memfileA:
    memfileA.write('win Nobel prize\n')
    memfileA.write('eat big sundae\n')
View the file memories.txt again, using the OS command cat:
cat memories.txt
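The difference between the two modes can also be checked from Python. This sketch uses a throwaway file (todo.txt is a hypothetical name): it writes one line with 'w', appends one line with 'a', and then reads both lines back.

```python
import os
import tempfile

todoLines = []
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'todo.txt')   # hypothetical file name

    with open(path, 'w') as todo:             # start a new file
        todo.write('get coffee\n')
    with open(path, 'a') as todo:             # add to the end, keeping what is there
        todo.write('eat big sundae\n')

    with open(path, 'r') as todo:
        for line in todo:
            todoLines.append(line.strip())

print(todoLines)
```

Note that opening a file that does not exist yet in append mode creates it, just as write mode does.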
Suppose we misspelled a file name...
memories = linesFromFile("memory.txt")
An error like this will terminate the execution of a program, which we'd like to avoid.
Can we somehow handle the error programmatically, within our program?
Yes! There are two approaches.
if ... else
One way of avoiding the error is by using if ... else in conjunction with the os.path.exists function from the Python os library. This function indicates whether a given path names an existing file or directory; a relative name such as 'memories.txt' is looked up in the current working directory.
import os
def getLines(filename):
    """If filename names an existing file, return its lines.
    Otherwise return the empty list."""
    if os.path.exists(filename):
        return linesFromFile(filename)
    else:
        return []
getLines('memories.txt')
getLines('memory.txt') # function succeeds with empty list rather than terminating with error.
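One detail worth knowing: os.path.exists also answers True for folders, which cannot be opened as text files. The sketch below contrasts it with os.path.isfile, using a temporary folder and file names chosen only for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    filePath = os.path.join(folder, 'memories.txt')
    with open(filePath, 'w') as out:
        out.write('get coffee\n')
    missingPath = os.path.join(folder, 'memory.txt')   # never created

    folderExists = os.path.exists(folder)         # True: folders count as existing
    folderIsFile = os.path.isfile(folder)         # False: a folder is not a file
    fileIsFile = os.path.isfile(filePath)         # True
    missingExists = os.path.exists(missingPath)   # False

print(folderExists, folderIsFile, fileIsFile, missingExists)
```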
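try ... except
The other way is to attempt the operation anyway and handle the error if it occurs, using try ... except.
def getLines(filename):
    """If filename names an existing file, return its lines.
    Otherwise return the empty list."""
    try:
        return linesFromFile(filename)
    except IOError:
        return []
getLines('memories.txt')
getLines('memory.txt')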
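In Python 3, IOError is another name for OSError, and a missing file raises the more specific FileNotFoundError, which is a kind of OSError. The sketch below catches that specific error; tryToRead is a hypothetical helper written for this illustration.

```python
import os
import tempfile

def tryToRead(filename):
    """Hypothetical helper: return the file's text, or None if the file is missing."""
    try:
        with open(filename, 'r') as inputFile:
            return inputFile.read()
    except FileNotFoundError:       # only handles "no such file"
        return None

aliasCheck = IOError is OSError                           # same class in Python 3
subclassCheck = issubclass(FileNotFoundError, OSError)

with tempfile.TemporaryDirectory() as folder:
    # This path points into a brand new, empty folder, so the file cannot exist.
    missingResult = tryToRead(os.path.join(folder, 'memory.txt'))

print(aliasCheck, subclassCheck, missingResult)
```

Catching the narrower error means that other problems, such as lacking permission to read the file, would still be reported instead of being silently turned into None.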
Have you tried to divide by 0?
10/0
If we know the error name, we can use it in the except clause:
def print_100_divided_by(n):
    try:
        print(100/n)
    except ZeroDivisionError:
        print('Do not divide by zero.')
print_100_divided_by(4)
print_100_divided_by(0)
while True:
    try:
        # The next line waits for keyboard input; if you leave the prompt unanswered, you may need to restart the kernel later
        i = int(input('Please enter an integer: '))
        #i = 5 # Can use instead of input() if you want to terminate immediately
        print('Good, you entered', i)
        break # Python keyword to exit a loop
    except ValueError:
        print('Not a valid integer. Try again...')
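The same retry pattern can be tried without typing anything by feeding the loop a list of canned answers. In this sketch, firstValidInteger is a hypothetical helper and the answers are made up; it returns the first entry that int accepts.

```python
def firstValidInteger(answers):
    """Hypothetical helper: return the first answer that int() accepts, or None."""
    for answer in answers:
        try:
            return int(answer)          # leaves the loop on the first success
        except ValueError:
            print(f"'{answer}' is not a valid integer. Trying the next one...")
    return None

chosen = firstValidInteger(['abc', '4.5', '42', '7'])
print('Good, you entered', chosen)
```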
The file studentCities.txt contains a fictional list of US cities that students hail from. Write a function called cityCount that takes a city as a string and returns the number of students who come from that city. You should safely open and close the file studentCities.txt. Take a look at the contents of studentCities.txt.
cat studentCities.txt
Define your function below:
# Your code here
def cityCount(targetCity):
    with open("studentCities.txt", "r") as file:
        count = 0
        for city in file:
            city = city.strip()
            if city == targetCity:
                count += 1
        return count
cityCount("Philadelphia") # should return 4
cityCount("Boston") # should return 3
cityCount("Chicago") # should return 0
nums.txt is a file that contains a sequence of numbers (both integers and floating point), one per line:
cat nums.txt
Write a zero-argument function averageNums that takes the numbers in nums.txt and returns the average of those numbers.
# Your code here
def averageNums():
    with open('nums.txt', 'r') as file:
        summation = 0
        numLines = 0
        for numRow in file:
            num = float(numRow.strip())
            summation += num
            numLines += 1
        return summation/numLines
averageNums() # should return 6.9316666...
Write a zero-argument function intSum that sums only the integers in nums.txt. One possible approach is to use try/except and attempt to convert each row with the int function. The int function raises a ValueError if the string represents a float.
# Your code here
def intSum():
    with open('nums.txt', 'r') as file:
        summation = 0
        for numRow in file:
            try:
                num = int(numRow.strip())
                summation += num
            except ValueError:
                pass
        return summation
intSum() # should be 19
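To see exactly which strings int accepts, the sketch below classifies a few made-up strings. classifyNumber is a hypothetical helper, not part of the exercise.

```python
def classifyNumber(text):
    """Hypothetical helper: report whether text parses as an int, a float, or neither."""
    try:
        int(text)
        return 'int'
    except ValueError:
        pass                  # not an integer; try float next
    try:
        float(text)
        return 'float'
    except ValueError:
        return 'neither'

kinds = []
for sample in ['7', '-3', '2.5', '3.0', 'seven']:
    kinds.append(classifyNumber(sample))
print(kinds)
```

Note that '3.0' counts as a float here: int rejects any string containing a decimal point, even when the value is a whole number.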
A CSV file is a file that stores data in rows of comma-separated values, representing a 2D table of information. This is a format that is read/exported by Excel and Google Sheets. We will work more with CSV files in the coming weeks. For now, take a look at the files literature-nobel-prize.csv and peace-nobel-prize.csv. These files contain real-world data provided by the Nobel Prize Committee.
YOUR TASK: You will read the content of these files, count the number of countries that have received each prize and write the results into new files.
The two files have a similar structure. The first row contains the names of the columns:
For the Literature prize:
Year,Name,,Gender,Citizenship,Second Citizenship,Born,Remarks
For the Peace prize:
Year,Name,,Gender,"Citizenship or Headquarters Location",Second Citizenship,"Born/Established",Remarks,Affiliation
Open the files below to see their content.
more literature-nobel-prize.csv
more peace-nobel-prize.csv
There are many interesting things that we can do with this data. Today, we'll focus on finding the countries with the most prizes. We are providing the solutions for you to get a sense of what you'll produce.
more literature-nobel-solution.txt
more peace-nobel-solution.txt
Now that we know what we want to generate, let's make a plan for our solution. We will use the big ideas of modularity and abstraction:
Let's do some preliminary work. The column we will take into account is the fifth column: Citizenship. There is another column, "Second Citizenship", but that has few data points and we will ignore it.
Here is a typical row from the Literature file:
2018,Olga,Tokarczuk,Female,Poland,,1962,Awarded in 2019
Let's see what happens when we use the method split on this string value:
row = "2018,Olga,Tokarczuk,Female,Poland,,1962,Awarded in 2019"
row.split(',')
Notice that a list was created and the country of citizenship is in the fifth position (or at index 4).
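One caveat about split(','): it cuts at every comma, including commas inside a quoted value. The sketch below uses a made-up row (we have not checked whether the Nobel files contain rows like it) to compare split with Python's csv library, which understands quotes.

```python
import csv

# Made-up row whose citizenship value contains a comma inside quotes.
madeUpRow = '1999,Ada,Example,Female,"Korea, Republic of",,1950,made-up row'

naiveFields = madeUpRow.split(',')          # cuts inside the quotes too
csvFields = next(csv.reader([madeUpRow]))   # csv.reader takes any iterable of lines

print(naiveFields[4])
print(csvFields[4])
```

With split, index 4 holds only the first piece of the quoted value; with csv.reader it holds the whole country name.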
Here is how we will approach this problem:
1. getCountryNames returns a list of unique country names.
2. countCountryOccurrences returns an integer value.
3. writeCountryCounts does not return a value. Instead, its side effect is the creation of a new file. We will provide you with a helper function, createFileName (see below), that takes an original file name and returns a new file name. writeCountryCounts needs to call this function.
4. A final function combines getCountryNames, countCountryOccurrences, and writeCountryCounts in order to fulfill our initial goal. This function, named orderCountriesWithMostNobels, takes a single parameter, the file name, and calls the three functions above in turn to generate a new file with the names of countries sorted by their number of prizes. It requires a helper function for the sorting.
def createFileName(filename):
    """Given a filename like 'literature-nobel-prize.csv', create a new
    filename in the format: 'literature-country-counts.txt'.
    """
    parts = filename.split('-') # split at the dash character
    name = parts[0] # the first part is what we want
    newFileName = f"{name}-country-counts.txt" # create the new file name
    return newFileName
createFileName("peace-nobel-prize.csv")
Step 1: Write getCountryNames
def getCountryNames(filename):
    """This function takes one parameter, a file name. It does the following:
    1. It opens the file for reading.
    2. It reads the first line, but doesn't do anything with it.
    3. With a for loop, it reads all the other rows one by one,
       checking if it has encountered a country name before. If not, it keeps track of it.
    4. It returns a list of unique country names.
    This is an accumulation problem.
    """
    # Your code here
    countriesList = []
    with open(filename, "r") as inputFile:
        _ = inputFile.readline() # read first line that contains column names
        for line in inputFile:
            country = line.split(',')[4]
            if country not in countriesList:
                countriesList.append(country)
    return countriesList
Test the function to find out how many countries were read. The expected results are:
Literature prizes: 41 countries.
Peace prizes: 46 countries.
print("Literature prizes:", len(getCountryNames("literature-nobel-prize.csv")), "countries.")
print("Peace prizes:", len(getCountryNames("peace-nobel-prize.csv")), "countries.")
The next function we will create will read the file and count the occurrences of a given country name.
Step 2: Write countCountryOccurrences
def countCountryOccurrences(filename, countryName):
    """
    This function takes two parameters: a file name and a country name.
    It returns an integer that is the count of occurrences of the country in the file.
    This is also an accumulation problem (but with an integer).
    It shares a part of the solution with getCountryNames.
    """
    # Your code here
    countryOcc = 0 # accumulator variable
    with open(filename, "r") as inputFile:
        _ = inputFile.readline() # read first line that contains column names
        for line in inputFile:
            country = line.split(',')[4]
            if country == countryName:
                countryOcc += 1
    return countryOcc
Let's test the function with a few countries and each file of prizes:
countryNames = ['United States', 'France', 'Italy']
for countryName in countryNames:
    print(f"Literature Nobel Prize for {countryName}:",
          countCountryOccurrences("literature-nobel-prize.csv", countryName))
countryNames = ['United States', 'France', 'Italy']
for countryName in countryNames:
    print(f"Peace Nobel Prize for {countryName}:",
          countCountryOccurrences("peace-nobel-prize.csv", countryName))
Step 3: Write the function writeCountryCounts
Our next step is to write a function that given a list of tuples, such as:
[('United States', 27), ('France', 9), ('Italy', 2)]
writes them into a file. The new file gets its name from the original file, using the helper function createFileName.
def writeCountryCounts(countsList, filename0):
    """Function takes two parameters: a list of tuples and a filename.
    1. It first creates a new file name to save the output.
    2. It opens a file with the created name for writing.
    3. Writes a first line with the column names.
    4. Iterates through the list of tuples and writes all country names and counts.
    5. Prints a message to say that it is done creating the new file.
    """
    # Your code here
    # Create new name for the output file
    filename1 = createFileName(filename0)
    # Open file for writing; write a header and all rows
    with open(filename1, 'w') as outputFile:
        outputFile.write("Country,Total Wins\n")
        for country, count in countsList:
            line = f"{country},{count}\n"
            outputFile.write(line)
    # display a printed message
    print(f"Created new file {filename1}")
someCounts = [('United States', 27), ('France', 9), ('Italy', 2)]
writeCountryCounts(someCounts, 'peace-nobel-prize.csv')
more peace-country-counts.txt
Step 4: Write the function orderCountriesWithMostNobels
Now it's time to put together all the functions we wrote above.
# Create a helper function for sorting by the number of occurrences of a country
# Your code here
def byOccurrence(countryOcc):
    """Helper function to serve as key for sorting."""
    return countryOcc[1]
def orderCountriesWithMostNobels(filename):
    """
    This function takes a single parameter, the file name of a CSV with Nobel Prize data.
    It does the following:
    1. Calls the function getCountryNames
    2. Generates a list of tuples of (country, occurrences), making use of countCountryOccurrences
    3. Sorts the list in reverse order (country with most prizes is first)
    4. Writes the list of tuples into a new file, calling the function writeCountryCounts.
    """
    # Your code here
    # 1. Find the list of countries
    countriesList = getCountryNames(filename)
    # 2. Count how often each country occurs; save results into a list
    countryOccList = []
    for country in countriesList:
        occ = countCountryOccurrences(filename, country)
        countryOccList.append((country, occ))
    # 3. Sort by occurrence
    orderedCountryOccList = sorted(countryOccList, key=byOccurrence, reverse=True)
    # 4. Save results into the new file
    writeCountryCounts(orderedCountryOccList, filename)
Let's test our function:
orderCountriesWithMostNobels("literature-nobel-prize.csv")
more literature-country-counts.txt
orderCountriesWithMostNobels("peace-nobel-prize.csv")
more peace-country-counts.txt
This was a long solution, because you haven't yet encountered some Python constructs that will greatly simplify it, for example, dictionaries that keep track of frequencies, and libraries to work with CSV files. We will be covering these in the coming weeks and will return to this problem for a much shorter solution.
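As a small preview of the dictionary idea, and only a sketch of where the course is headed, the code below counts a made-up list of country names in a single pass and then sorts the counts. The names countCountries and byCount are hypothetical, and the sample data is invented rather than taken from the Nobel files.

```python
def countCountries(countries):
    """Hypothetical helper: map each country name to how often it appears."""
    counts = {}
    for country in countries:
        # get returns 0 the first time a country is seen
        counts[country] = counts.get(country, 0) + 1
    return counts

def byCount(pair):
    """Key function for sorting (country, count) pairs by count."""
    return pair[1]

sample = ['France', 'Poland', 'France', 'Italy', 'France', 'Italy']   # made-up data
counts = countCountries(sample)
ordered = sorted(counts.items(), key=byCount, reverse=True)
print(ordered)
```

Each name is visited once here, whereas the solution above rereads the whole file once for every country.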
That's all for today!