1. Opening a file

A file is an object that is created by the built-in function open.

In [1]:
myFile = open('thesis.txt', 'r') # 'r' means open the file for reading
print(myFile)
<_io.TextIOWrapper name='thesis.txt' mode='r' encoding='UTF-8'>

The object returned from open() is an instance of a class named _io.TextIOWrapper. This class specifies how to interpret a file as a stream of text.

In [2]:
type(myFile)
Out[2]:
_io.TextIOWrapper

The mode 'r' is optional, because reading is the default mode when opening a file:

In [3]:
myFile2 = open('thesis.txt')
myFile2
Out[3]:
<_io.TextIOWrapper name='thesis.txt' mode='r' encoding='UTF-8'>

Aside: The encoding 'UTF-8' that you see here is a widely used standard for storing Unicode characters, which cover the writing systems of essentially all languages. It is related to ASCII, which we mentioned a while back; you might still remember the ASCII table: http://www.asciitable.com/. However, the Unicode table is much bigger: ASCII defines only 128 characters (7 bits each), while Unicode has room for more than a million. UTF-8 stores each character in one to four bytes, and the ASCII characters keep their one-byte codes. Here is where to find the Unicode table: https://unicode-table.com/en/#basic-latin.
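
As a quick check of how UTF-8 stores characters, the sketch below encodes a few sample characters and reports how many bytes each one needs. The sample characters are chosen here purely for illustration; they do not come from the lecture files.

```python
# Sample characters chosen for illustration (written as escapes so the
# source stays plain ASCII): A, e-acute, the euro sign, and an emoji.
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']

byteCounts = {}
for ch in samples:
    encoded = ch.encode('utf-8')      # convert the character to its UTF-8 bytes
    byteCounts[ch] = len(encoded)
    # !a prints an ASCII-safe representation of the character
    print(f"{ch!a} takes {len(encoded)} byte(s) in UTF-8")
```

Characters that are also in ASCII, like 'A', take a single byte, which is why a plain English text file looks the same in ASCII and in UTF-8.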

2. Reading from a file: first cut

The simplest way to read a file is to call .read() on a file object, which returns the contents of the file as a string of characters.

In [4]:
text = myFile.read() # read (as a string) the contents of the file myFile opened above
text
Out[4]:
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis vel fringilla eros. Curabitur vel mollis odio. In vulputate ex et vulputate dignissim. Pellentesque vel leo et ex malesuada facilisis ac sit amet elit. Duis metus arcu, tincidunt vel suscipit quis, venenatis at ex. Duis id lacus sit amet felis tristique hendrerit vel nec nulla. Morbi odio elit, consectetur a rhoncus a, vestibulum at tortor. Suspendisse leo felis, molestie ac justo bibendum, molestie sagittis tortor. Etiam venenatis, leo eget imperdiet pharetra, nulla risus finibus arcu, et hendrerit sem enim eget tortor.\n\nVestibulum commodo euismod enim, eget lacinia eros vehicula id. Phasellus tempor odio in commodo sagittis. Aliquam non condimentum orci. Curabitur gravida blandit nulla id suscipit. Aenean lacus nisl, convallis at felis et, interdum pellentesque eros. Nunc tincidunt, purus at rutrum elementum, ipsum nulla pretium nibh, id placerat odio libero quis libero. In pulvinar tortor cursus orci scelerisque aliquet. Etiam neque purus, posuere ac interdum a, euismod eu tortor. Vestibulum in arcu odio.\n'

Reading from an open file object changes its state: the object keeps track of a position in the file, and read() leaves that position at the end of the file. As a result, a subsequent read operation on the same open file returns the empty string:

In [5]:
myFile.read()
Out[5]:
''

If we want to read the file again, the simplest approach is to open a new Python file object for that file. But before we do that, we should close the file; see the next section.

3. Closing Files

It's important to close a file when you are done with it, to make sure that its contents get saved (if you have written to it) and to avoid taking up operating system resources (if you are just reading from it).

There are two ways to open/close files: one is explicit, the other is implicit. We prefer the implicit approach in CS111.

3.1 Open and Close explicitly

In [6]:
f = open('thesis.txt', 'r')  # open the file
text = f.read()              # read the contents of the file as a string
f.close()                    # close the file
text # return the file contents
Out[6]:
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis vel fringilla eros. Curabitur vel mollis odio. In vulputate ex et vulputate dignissim. Pellentesque vel leo et ex malesuada facilisis ac sit amet elit. Duis metus arcu, tincidunt vel suscipit quis, venenatis at ex. Duis id lacus sit amet felis tristique hendrerit vel nec nulla. Morbi odio elit, consectetur a rhoncus a, vestibulum at tortor. Suspendisse leo felis, molestie ac justo bibendum, molestie sagittis tortor. Etiam venenatis, leo eget imperdiet pharetra, nulla risus finibus arcu, et hendrerit sem enim eget tortor.\n\nVestibulum commodo euismod enim, eget lacinia eros vehicula id. Phasellus tempor odio in commodo sagittis. Aliquam non condimentum orci. Curabitur gravida blandit nulla id suscipit. Aenean lacus nisl, convallis at felis et, interdum pellentesque eros. Nunc tincidunt, purus at rutrum elementum, ipsum nulla pretium nibh, id placerat odio libero quis libero. In pulvinar tortor cursus orci scelerisque aliquet. Etiam neque purus, posuere ac interdum a, euismod eu tortor. Vestibulum in arcu odio.\n'

If you try to perform operations on a closed file, you'll get an error.

In [7]:
f.read()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 f.read()

ValueError: I/O operation on closed file.

3.2 Automatic closing with the syntax: with ... as ...

A with statement automatically closes the file opened by open upon exiting the with block, even if an error occurs inside the block.

In [8]:
with open('thesis.txt', 'r') as f:
    contents = f.read()
# f is implicitly closed here
contents # return the contents
Out[8]:
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis vel fringilla eros. Curabitur vel mollis odio. In vulputate ex et vulputate dignissim. Pellentesque vel leo et ex malesuada facilisis ac sit amet elit. Duis metus arcu, tincidunt vel suscipit quis, venenatis at ex. Duis id lacus sit amet felis tristique hendrerit vel nec nulla. Morbi odio elit, consectetur a rhoncus a, vestibulum at tortor. Suspendisse leo felis, molestie ac justo bibendum, molestie sagittis tortor. Etiam venenatis, leo eget imperdiet pharetra, nulla risus finibus arcu, et hendrerit sem enim eget tortor.\n\nVestibulum commodo euismod enim, eget lacinia eros vehicula id. Phasellus tempor odio in commodo sagittis. Aliquam non condimentum orci. Curabitur gravida blandit nulla id suscipit. Aenean lacus nisl, convallis at felis et, interdum pellentesque eros. Nunc tincidunt, purus at rutrum elementum, ipsum nulla pretium nibh, id placerat odio libero quis libero. In pulvinar tortor cursus orci scelerisque aliquet. Etiam neque purus, posuere ac interdum a, euismod eu tortor. Vestibulum in arcu odio.\n'

The file has been closed by with, even though we didn't close it explicitly:

In [9]:
f.read()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 f.read()

ValueError: I/O operation on closed file.
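
One way to confirm this without triggering an error is to inspect the file object's closed attribute. The sketch below also shows, roughly, what the with statement does for us, written out with try ... finally. It uses a scratch file named demo.txt in a temporary folder (a hypothetical name, so the example does not depend on thesis.txt).

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('hello\n')
closedAfterWith = f.closed        # True: the with statement closed f for us

# Roughly what the with statement does behind the scenes:
g = open(path, 'r')
try:
    contents = g.read()
finally:
    g.close()                     # runs whether or not read() raised an error
closedAfterFinally = g.closed

print(closedAfterWith, closedAfterFinally, repr(contents))
```

The with form is shorter and harder to get wrong, which is why it is the preferred style.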

4. Four ways to read files: read, readlines, readline, and for loop

These methods read data from the file, but their behavior is different.

read

As shown above, this method returns a single string with the entire contents of the file. For small files, it makes it easy to access the words with only one split command. This method is not recommended for big files, because the entire contents must fit in memory at once.

In [10]:
with open('cities.txt', 'r') as inputFile:
    allText = inputFile.read()
    
allText
Out[10]:
'Wilmington\nPhiladelphia\nBoston\nCharlotte\n'
In [11]:
allText.split()
Out[11]:
['Wilmington', 'Philadelphia', 'Boston', 'Charlotte']

readlines

This method returns a list of all lines in the file, where each line is a string ending in the newline character (except possibly the last line). If a list of lines is desired, this is easier than splitting the result of read. Note: This creates a list of the lines in the file. If the file is big, this is a big list that needs to be stored in memory.

In [12]:
with open('cities.txt', 'r') as inputFile:
    allLines = inputFile.readlines()
    
allLines
Out[12]:
['Wilmington\n', 'Philadelphia\n', 'Boston\n', 'Charlotte\n']

readline

This returns the next line in the file as a string, and it keeps the newline character. Conceptually, it also moves a cursor in the file object to the beginning of the next line, and the next call to readline will read the line starting at that cursor. If the cursor is at the end of the file, readline returns the empty string to indicate that there are no more lines to read.

In [13]:
lines = []
with open('cities.txt', 'r') as inputFile:    
    for _ in range(6):
        lines.append(inputFile.readline())

lines
Out[13]:
['Wilmington\n', 'Philadelphia\n', 'Boston\n', 'Charlotte\n', '', '']

Above, note that the first four calls to readline return the four cities, but the last two calls return the empty string because there are no more lines to read.

Preferred reading method: for loop

Most of the time, we will not be using any of the three methods introduced above. The file object is an iterator that, when used in a for loop, will iterate over the lines of the file without using .readline() or .readlines() explicitly.

In [14]:
def linesFromFile(filename):
    '''Returns a list of all the lines from a file with the given filename. 
    In each line, the terminating newline has been removed.
    '''    
    with open(filename, 'r') as inputFile:
        lines = []
        for line in inputFile: # notice we're not using a method here
            lines.append(line.strip()) # .strip() removes surrounding whitespace, including the trailing newline
        return lines # file still closes even with return in `with` block

print(linesFromFile('cities.txt'))
['Wilmington', 'Philadelphia', 'Boston', 'Charlotte']

You can think of the for loop

for line in inputFile: 
  lines.append(line.strip())

as equivalent to the following while loop:

line = inputFile.readline()
while line != '':
    lines.append(line.strip())
    line = inputFile.readline()
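
To check that the two loops really behave the same way, here is a small self-contained sketch. It writes a scratch file (demoCities.txt in a temporary folder, a hypothetical name used only for this example) and collects its lines with each loop.

```python
import os
import tempfile

# Hypothetical scratch file so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'demoCities.txt')
with open(path, 'w') as outFile:
    outFile.write('Wilmington\nPhiladelphia\nBoston\n')

# Version 1: iterate over the file object directly
forLines = []
with open(path, 'r') as inputFile:
    for line in inputFile:
        forLines.append(line.strip())

# Version 2: call readline until it returns the empty string
whileLines = []
with open(path, 'r') as inputFile:
    line = inputFile.readline()
    while line != '':
        whileLines.append(line.strip())
        line = inputFile.readline()

print(forLines == whileLines)
```

Note that a blank line in the middle of a file is read as '\n', not as '', so the while loop does not stop early on blank lines.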

5. Some simple OS Commands

Python Jupyter notebooks allow us to use some simple operating system (OS) commands to query the computer's filesystem. (More on this in a later lecture.)

In order to work, these commands must be used in a cell with no other Python code.

pwd

The pwd (print working directory) command shows which directory (folder) in the computer we're currently connected to. Other commands will operate on this directory.

In [15]:
pwd
Out[15]:
'/Users/fturbak/lyn/cs111-s23-Airbook/cs111/lec15_files_solns'

The above string actually consists of a sequence of folder names separated by the / character. This sequence is known as a path because it describes how you navigate from the top of your file system to the particular folder that the system is connected to.

In a later lecture, we will see that folders and files on a computer are organized into file trees and that a path describes a way to move between nodes in the tree.
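
The same information is available from ordinary Python code. The sketch below assumes the standard os module (introduced in section 10.1) and uses os.getcwd() to get the working directory as a string, then splits it into folder names. The separator is os.sep, which is / on macOS and Linux and \ on Windows.

```python
import os

cwd = os.getcwd()               # the same path that the pwd command shows
folders = cwd.split(os.sep)     # the folder names along the path

print(cwd)
print(folders)
```

On macOS and Linux the first element of folders is the empty string, because the path starts with the separator.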

ls

The ls command lists the files in the current working directory.

In [16]:
ls
cities.txt                     nums.txt
launchLocalNotebook.py         peace-country-counts.txt
lec15_files_solns.html         peace-nobel-prize.csv
lec15_files_solns.ipynb        peace-nobel-solution.txt
literature-country-counts.txt  studentCities.txt
literature-nobel-prize.csv     test.txt
literature-nobel-solution.txt  thesis.txt
memories.txt

The 'ls -l' command lists the files with extra information, including their size (in bytes) and a timestamp of when they were last modified.

In [17]:
ls -l
total 1016
-rw-r--r--  1 fturbak  staff      41 Jan 18 12:30 cities.txt
-rw-r--r--  1 fturbak  staff   10254 Jan 30 10:47 launchLocalNotebook.py
-rw-r--r--  1 fturbak  staff  387933 Jan 18 12:30 lec15_files_solns.html
-rw-r--r--  1 fturbak  staff   65507 Mar 20 22:06 lec15_files_solns.ipynb
-rw-r--r--  1 fturbak  staff     456 Mar 20 21:52 literature-country-counts.txt
-rw-r--r--  1 fturbak  staff    5145 Jan 18 12:30 literature-nobel-prize.csv
-rw-r--r--@ 1 fturbak  staff     456 Jan 18 12:30 literature-nobel-solution.txt
-rw-r--r--  1 fturbak  staff     179 Mar 20 21:51 memories.txt
-rw-r--r--  1 fturbak  staff      45 Jan 18 12:30 nums.txt
-rw-r--r--  1 fturbak  staff     554 Mar 20 21:52 peace-country-counts.txt
-rw-r--r--  1 fturbak  staff    7373 Jan 18 12:30 peace-nobel-prize.csv
-rw-r--r--  1 fturbak  staff     554 Jan 18 12:30 peace-nobel-solution.txt
-rw-r--r--@ 1 fturbak  staff     171 Jan 18 12:30 studentCities.txt
-rw-r--r--  1 fturbak  staff       0 Jan 18 12:30 test.txt
-rw-r--r--@ 1 fturbak  staff    1086 Jan 18 12:30 thesis.txt

cat

The cat command prints out the contents of a file. They will appear as the result of the cell; you can use more instead (see below) for larger files.

In [18]:
cat cities.txt
Wilmington
Philadelphia
Boston
Charlotte

more

The more command displays the contents of a file. In a Jupyter notebook, they appear in a pop-up window at the bottom of the notebook page; the pop-up can be closed by clicking on the X in its upper right corner.

In [19]:
more cities.txt

Note that the file name does not appear in quotes.

6. Writing Files

To open a file for writing, we use open with the mode 'w'.

The following code will create a file named memories.txt in the current working directory (replacing any existing file of that name) and write several lines of text to it.

In [20]:
with open('memories.txt', 'w') as memfileW:
    memfileW.write('get coffee\n') # need newlines
    memfileW.write('do CS111 homework\n')
    memfileW.write('vote!\n')

We can use ls -l to see memories.txt with its new size and timestamp:

In [21]:
ls -l
total 1016
-rw-r--r--  1 fturbak  staff      41 Jan 18 12:30 cities.txt
-rw-r--r--  1 fturbak  staff   10254 Jan 30 10:47 launchLocalNotebook.py
-rw-r--r--  1 fturbak  staff  387933 Jan 18 12:30 lec15_files_solns.html
-rw-r--r--  1 fturbak  staff   65507 Mar 20 22:06 lec15_files_solns.ipynb
-rw-r--r--  1 fturbak  staff     456 Mar 20 21:52 literature-country-counts.txt
-rw-r--r--  1 fturbak  staff    5145 Jan 18 12:30 literature-nobel-prize.csv
-rw-r--r--@ 1 fturbak  staff     456 Jan 18 12:30 literature-nobel-solution.txt
-rw-r--r--  1 fturbak  staff      35 Mar 20 22:06 memories.txt
-rw-r--r--  1 fturbak  staff      45 Jan 18 12:30 nums.txt
-rw-r--r--  1 fturbak  staff     554 Mar 20 21:52 peace-country-counts.txt
-rw-r--r--  1 fturbak  staff    7373 Jan 18 12:30 peace-nobel-prize.csv
-rw-r--r--  1 fturbak  staff     554 Jan 18 12:30 peace-nobel-solution.txt
-rw-r--r--@ 1 fturbak  staff     171 Jan 18 12:30 studentCities.txt
-rw-r--r--  1 fturbak  staff       0 Jan 18 12:30 test.txt
-rw-r--r--@ 1 fturbak  staff    1086 Jan 18 12:30 thesis.txt

In your notebook, you should see the new file memories.txt that was just created and has the timestamp to prove it.

Use the OS command cat to view the contents of the file:

In [22]:
cat memories.txt
get coffee
do CS111 homework
vote!

Alternatively, go to Finder (on a Mac) or Windows Explorer (PC) to view the contents of the file.

7. Using f-strings when reading/writing files

Let's write formatted strings to our files. F-strings are a mechanism for generating complex strings without concatenation. Below is an example; notice the letter f before the start of the string.

In [23]:
g, n = ('celery', 34) # tuple assignment
print(f"I need to buy {n} pounds of {g}.\n") # notice the curly braces for the variables
I need to buy 34 pounds of celery.

The great thing about f-strings is that Python will automatically convert non-string values to strings that will fill the holes (the curly braces) in the f-string.

In [24]:
groceries = [('celery', 34), 
             ('rice', 27), 
             ('cholocate chips', 3.5), 
             ('sorbet', 73)]

with open('memories.txt', 'w') as memfile:
    for g, n in groceries:
        memfile.write(f"I need to buy {n} pounds of {g}.\n")
In [25]:
cat memories.txt
I need to buy 34 pounds of celery.
I need to buy 27 pounds of rice.
I need to buy 3.5 pounds of cholocate chips.
I need to buy 73 pounds of sorbet.

Note that opening an existing file in 'w' mode erases the previous contents, which are then replaced by whatever is written.

In [26]:
ls -l
total 1016
-rw-r--r--  1 fturbak  staff      41 Jan 18 12:30 cities.txt
-rw-r--r--  1 fturbak  staff   10254 Jan 30 10:47 launchLocalNotebook.py
-rw-r--r--  1 fturbak  staff  387933 Jan 18 12:30 lec15_files_solns.html
-rw-r--r--  1 fturbak  staff   65507 Mar 20 22:06 lec15_files_solns.ipynb
-rw-r--r--  1 fturbak  staff     456 Mar 20 21:52 literature-country-counts.txt
-rw-r--r--  1 fturbak  staff    5145 Jan 18 12:30 literature-nobel-prize.csv
-rw-r--r--@ 1 fturbak  staff     456 Jan 18 12:30 literature-nobel-solution.txt
-rw-r--r--  1 fturbak  staff     148 Mar 20 22:07 memories.txt
-rw-r--r--  1 fturbak  staff      45 Jan 18 12:30 nums.txt
-rw-r--r--  1 fturbak  staff     554 Mar 20 21:52 peace-country-counts.txt
-rw-r--r--  1 fturbak  staff    7373 Jan 18 12:30 peace-nobel-prize.csv
-rw-r--r--  1 fturbak  staff     554 Jan 18 12:30 peace-nobel-solution.txt
-rw-r--r--@ 1 fturbak  staff     171 Jan 18 12:30 studentCities.txt
-rw-r--r--  1 fturbak  staff       0 Jan 18 12:30 test.txt
-rw-r--r--@ 1 fturbak  staff    1086 Jan 18 12:30 thesis.txt

8. Exercise 1: Print formatted strings from cities.txt

Take the contents of cities.txt and print out the following:

Line 1: Wilmington
Line 2: Philadelphia
Line 3: Boston
Line 4: Charlotte

Your implementation should safely open and close cities.txt. You should avoid using .read() or .readlines(). Remember that each line in the file ends with \n, so you should remove it.

Advice: solve this using the incremental programming strategy:

  1. First, print out each city without worrying about the line number
  2. Only then should you add the line numbering aspect.
In [27]:
# Your code here
with open('cities.txt', 'r') as citiesFile:
    lineNumber = 0
    for cityLine in citiesFile:
        lineNumber += 1
        print(f"Line {lineNumber}: {cityLine.strip()}")
Line 1: Wilmington
Line 2: Philadelphia
Line 3: Boston
Line 4: Charlotte
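
One alternative to keeping the counter by hand is Python's built-in enumerate function, which pairs each line with a running number. This is a sketch of that variation, not the required solution; it writes its own scratch copy of the city list (demoCities.txt in a temporary folder, a hypothetical name) so that it runs anywhere.

```python
import os
import tempfile

# Hypothetical scratch copy of the city list.
path = os.path.join(tempfile.mkdtemp(), 'demoCities.txt')
with open(path, 'w') as outFile:
    outFile.write('Wilmington\nPhiladelphia\nBoston\nCharlotte\n')

numberedLines = []
with open(path, 'r') as citiesFile:
    # start=1 makes the numbering begin at 1 instead of 0
    for lineNumber, cityLine in enumerate(citiesFile, start=1):
        numberedLines.append(f"Line {lineNumber}: {cityLine.strip()}")

for text in numberedLines:
    print(text)
```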

9. Appending to files

How do we add lines to the end of an existing file?

We can't open the file in write mode (with a 'w'), because that erases all previous contents and starts with an empty file.

Instead, we open the file in append mode (with an 'a'). Any subsequent writes are made after the existing contents.

In [28]:
with open('memories.txt', 'a') as memfileA:
    memfileA.write('win Nobel prize\n')
    memfileA.write('eat big sundae\n')

View the file memories.txt again, using the OS command cat:

In [29]:
cat memories.txt
I need to buy 34 pounds of celery.
I need to buy 27 pounds of rice.
I need to buy 3.5 pounds of cholocate chips.
I need to buy 73 pounds of sorbet.
win Nobel prize
eat big sundae
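
The difference between the two modes can be seen side by side in the following sketch, which uses a scratch file (demoMemories.txt in a temporary folder, a hypothetical name) rather than memories.txt.

```python
import os
import tempfile

# Hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), 'demoMemories.txt')

with open(path, 'w') as f:      # 'w' creates the file
    f.write('first\n')

with open(path, 'a') as f:      # 'a' keeps 'first' and adds after it
    f.write('second\n')
with open(path, 'r') as f:
    afterAppend = f.read()

with open(path, 'w') as f:      # 'w' again: the old contents are erased
    f.write('third\n')
with open(path, 'r') as f:
    afterWrite = f.read()

print(repr(afterAppend))
print(repr(afterWrite))
```

Like 'w', the mode 'a' creates the file if it does not exist yet.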

10. Exceptions

Suppose we misspelled a file name...

In [30]:
memories = linesFromFile("memory.txt")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[30], line 1
----> 1 memories = linesFromFile("memory.txt")

Cell In[14], line 5, in linesFromFile(filename)
      1 def linesFromFile(filename):
      2     '''Returns a list of all the lines from a file with the given filename. 
      3     In each line, the terminating newline has been removed.
      4     '''    
----> 5     with open(filename, 'r') as inputFile:
      6         lines = []
      7         for line in inputFile: # notice we're not using a method here

File ~/Library/Python/3.10/lib/python/site-packages/IPython/core/interactiveshell.py:282, in _modified_open(file, *args, **kwargs)
    275 if file in {0, 1, 2}:
    276     raise ValueError(
    277         f"IPython won't let you open fd={file} by default "
    278         "as it is likely to crash IPython. If you know what you are doing, "
    279         "you can use builtins' open."
    280     )
--> 282 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: 'memory.txt'

An error like this will terminate the execution of a program, which we'd like to avoid.

Can we somehow handle the error programmatically, within our program?

Yes! There are two approaches.

10.1 Avoid the error with if ... else

One way of avoiding the error is by using if ... else in conjunction with the os.path.exists function from the Python os module. This function indicates whether a given path names an existing file or directory; a bare name like 'memory.txt' is looked up in the current working directory.

In [31]:
import os

def getLines(filename):
    """If filename names an existing file, return its lines. 
    Otherwise return the empty list."""
   
    if os.path.exists(filename):
        return linesFromFile(filename)
    else:
        return []
In [32]:
getLines('memories.txt')
Out[32]:
['I need to buy 34 pounds of celery.',
 'I need to buy 27 pounds of rice.',
 'I need to buy 3.5 pounds of cholocate chips.',
 'I need to buy 73 pounds of sorbet.',
 'win Nobel prize',
 'eat big sundae']
In [33]:
getLines('memory.txt') # function succeeds with empty list rather than terminating with error.
Out[33]:
[]
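
One caveat with this approach: os.path.exists also returns True for a folder, and calling open on a folder would still raise an error. A stricter check, sketched below, is os.path.isfile, which is True only for regular files. The scratch folder and the file name demo.txt are hypothetical and exist only for this example.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                  # a scratch folder
filePath = os.path.join(folder, 'demo.txt')  # hypothetical file inside it
with open(filePath, 'w') as f:
    f.write('hi\n')

folderExists = os.path.exists(folder)    # True: the folder exists
folderIsFile = os.path.isfile(folder)    # False: but it is not a file
fileIsFile = os.path.isfile(filePath)    # True

print(folderExists, folderIsFile, fileIsFile)
```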

10.2 Exception Handling with try ... except

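try: statements1 except exceptionType: statements2 first executes statements1. If statements1 executes without error, statements2 are not executed. But if executing statements1 raises an error of type exceptionType, the rest of statements1 is skipped and statements2 are executed. An error of any other type is not handled by this except clause and propagates as usual.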
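
The role of the exception type can be seen in a small sketch. The function safeDivide below is a hypothetical example: it handles ZeroDivisionError only, so a different kind of error (here a TypeError from dividing a string by a number) passes through it untouched.

```python
def safeDivide(a, b):
    """Hypothetical example: returns a / b, or None when b is zero."""
    try:
        return a / b
    except ZeroDivisionError:
        return None

result1 = safeDivide(10, 2)     # no error, so the except clause is skipped
result2 = safeDivide(10, 0)     # ZeroDivisionError is handled inside safeDivide

try:
    safeDivide('10', 2)         # str / int raises TypeError, which is not handled inside
    propagated = False
except TypeError:
    propagated = True           # the TypeError reached the caller

print(result1, result2, propagated)
```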

Dealing with IOError

In [34]:
def getLines(filename):
    """If filename names an existing file, return its lines. 
    Otherwise return the empty list."""
    
    try: 
        return linesFromFile(filename)
    except IOError:
        return []
In [35]:
getLines('memories.txt')
Out[35]:
['I need to buy 34 pounds of celery.',
 'I need to buy 27 pounds of rice.',
 'I need to buy 3.5 pounds of cholocate chips.',
 'I need to buy 73 pounds of sorbet.',
 'win Nobel prize',
 'eat big sundae']
In [36]:
getLines('memory.txt')
Out[36]:
[]
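
A note on the name IOError: in Python 3.3 and later it is simply another name for OSError, and the FileNotFoundError seen in the traceback above is a more specific kind of OSError. The sketch below checks both facts and shows a variant that handles only the missing-file case. The function name getLinesOrEmpty is hypothetical, and the path points at a file that is known not to exist.

```python
import os
import tempfile

isAlias = IOError is OSError                          # same class, two names
isSubclass = issubclass(FileNotFoundError, OSError)   # a more specific error

def getLinesOrEmpty(filename):
    """Hypothetical variant of getLines that handles only missing files."""
    try:
        with open(filename, 'r') as inputFile:
            lines = []
            for line in inputFile:
                lines.append(line.strip())
            return lines
    except FileNotFoundError:
        return []

# A path inside a fresh, empty temporary folder cannot name an existing file.
missingPath = os.path.join(tempfile.mkdtemp(), 'memory.txt')
missing = getLinesOrEmpty(missingPath)

print(isAlias, isSubclass, missing)
```

Handling the most specific exception that applies means other problems, such as a permissions error, are still reported instead of being silently turned into an empty list.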

Exception handling example 1: Dealing with ZeroDivision

Have you tried to divide by 0?

In [37]:
10/0
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[37], line 1
----> 1 10/0

ZeroDivisionError: division by zero

If we know the error name, we can use it in the except clause:

In [38]:
def print_100_divided_by(n):
    try:
        print(100/n)
    except ZeroDivisionError:
        print('Do not divide by zero.')
In [39]:
print_100_divided_by(4)
25.0
In [40]:
print_100_divided_by(0)
Do not divide by zero.

Exception handling example 2: Dealing with value errors

In [41]:
while True:
    try:
        # input() waits for the user; an unanswered prompt may require restarting the kernel later...
        i = int(input('Please enter an integer: '))
        #i = 5 # Can use instead of input() to terminate immediately
        print('Good, you entered', i)
        break # Python keyword to exit a loop
    except ValueError:
        print('Not a valid integer. Try again...')
Please enter an integer: hello
Not a valid integer. Try again...
Please enter an integer: 12.34
Not a valid integer. Try again...
Please enter an integer: 56
Good, you entered 56

11. Exercise 2: City Count

The file studentCities.txt contains a fictional list of US cities from where students hail. Write a function called cityCount that takes a city as a string and returns the number of students who come from that city. You should safely open and close the file studentCities.txt. First, take a look at the contents of studentCities.txt.

In [42]:
cat studentCities.txt
Philadelphia
Wilmington
Boston
Philadelphia
Boston
New Haven
Charleston
Charleston
Philadelphia
Sacramento
Reno
Boise
Sacramento
Los Angeles
Philadelphia
Boston
New Haven

Define your function below:

In [43]:
# Your code here
def cityCount(targetCity):
    with open("studentCities.txt", "r") as file:
        count = 0
        for city in file:
            city = city.strip()
            if city == targetCity:
                count += 1
        return count
In [44]:
cityCount("Philadelphia") # should return 4
Out[44]:
4
In [45]:
cityCount("Boston") # should return 3
Out[45]:
3
In [46]:
cityCount("Chicago") # should return 0
Out[46]:
0

12. Exercise 3: Number List Average

nums.txt is a file that contains a sequence of numbers (both integers and floating point), one per line:

In [47]:
cat nums.txt
3.14
2
4.5
8.91
0.2
4
5
8
9.43
2.1
23.2
12.7

Write a zero-argument function averageNums that takes the numbers in nums.txt and returns the average of those numbers.

In [48]:
# Your code here
def averageNums():
    with open('nums.txt', 'r') as file:
        summation = 0
        numLines = 0
        for numRow in file:
            num = float(numRow.strip())
            summation += num
            numLines += 1
        return summation/numLines
In [49]:
averageNums() # should return 6.9316666...
Out[49]:
6.9316666666666675

Bonus: Sum Integers

Write a zero-argument function intSum that sums only the integers in nums.txt. One possible approach is to use try/except and attempt to convert every row with the int function. The int function will raise a ValueError if the string represents a float (for example, '3.14').

In [50]:
# Your code here
def intSum():
    with open('nums.txt', 'r') as file:
        summation = 0
        for numRow in file:
            try: 
                num = int(numRow.strip())
                summation += num
            except ValueError:
                pass
        return summation
In [51]:
intSum() # should be 19
Out[51]:
19

13. Challenge Problem: reading and writing real-world data

A CSV file is a file that stores data in rows of comma-separated values, representing a 2D table of information. This is a format that can be read and exported by Excel and Google Sheets. We will work more with CSV files in the coming weeks. For now, take a look at the files literature-nobel-prize.csv and peace-nobel-prize.csv. These files contain real-world data provided by the Nobel Prize Committee.

YOUR TASK: You will read the content of these files, count the number of prizes awarded to each country, and write the results into new files.

The two files have a similar structure. The first row contains the names of the columns:

For the Literature prize:
Year,Name,,Gender,Citizenship,Second Citizenship,Born,Remarks

For the Peace prize:
Year,Name,,Gender,"Citizenship or Headquarters Location",Second Citizenship,"Born/Established",Remarks,Affiliation

Open the files below to see their content.

In [52]:
more literature-nobel-prize.csv
In [53]:
more peace-nobel-prize.csv

There are many interesting things that we can do with this data. Today, we'll focus on finding the countries with the most prizes. We are providing the solutions for you to get a sense of what you'll produce.

In [54]:
more literature-nobel-solution.txt
In [55]:
more peace-nobel-solution.txt

Now that we know what we want to generate, let's make a plan for our solution. We will use the big ideas of modularity and abstraction:

  • We will write several small functions, but each of them will be capable of working with each file.
  • We will combine the small functions together to build our solution.

Let's do some preliminary work. The column we will take into account is the fifth column: Citizenship. There is another column, "Second Citizenship", but that has few data points and we will ignore it.

Here is a typical row from the Literature file:

2018,Olga,Tokarczuk,Female,Poland,,1962,Awarded in 2019

Let's see what happens when we use the method split on this string value:

In [56]:
row = "2018,Olga,Tokarczuk,Female,Poland,,1962,Awarded in 2019"
row.split(',')
Out[56]:
['2018',
 'Olga',
 'Tokarczuk',
 'Female',
 'Poland',
 '',
 '1962',
 'Awarded in 2019']

Notice that a list was created and the country of citizenship is in the fifth position (or at index 4).
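
One caveat when splitting CSV rows by hand: if a value is written in quotes and itself contains a comma, split(',') cuts it in two and shifts every later column. Whether the Nobel files contain such rows is not something these notes state, so the second row below is a made-up illustration. The sketch compares split with Python's standard csv module, which understands the quoting.

```python
import csv

row = '2018,Olga,Tokarczuk,Female,Poland,,1962,Awarded in 2019'
# Made-up row (not real data) with a quoted value that contains a comma:
quotedRow = '1999,"Doe, Jane",,Female,Nowhere,,1950,'

plainSplit = quotedRow.split(',')        # the quoted name is cut in two
parsed = next(csv.reader([quotedRow]))   # csv.reader keeps it in one piece

print(plainSplit[4])   # no longer the citizenship column
print(parsed[4])       # the citizenship column

# For a row without quotes, both approaches agree:
sameResult = next(csv.reader([row])) == row.split(',')
print(sameResult)
```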

Here is how we will approach this problem:

  1. We will write a function that, given a file name, reads its content line by line and keeps track of the country names that it has not encountered yet. This function, getCountryNames, returns a list of unique country names.
  2. We will write a function that, given a file name and a country name, returns how often that country appears in the file. This function, countCountryOccurrences, returns an integer value.
  3. We will write a function that, given a list of tuples [(Country 1, count 1), (Country 2, count 2), ...] and the name of the original file, writes these values into a new file. This function, writeCountryCounts, does not return a value; instead, its side effect is the creation of a new file. We will provide you with a helper function, createFileName (see below), that returns a new file name given the original file name. writeCountryCounts needs to call this function.
  4. We will write a function that calls the three functions getCountryNames, countCountryOccurrences, and writeCountryCounts in order to fulfill our initial goal. This function, orderCountriesWithMostNobels, takes a single parameter, the file name, and calls each of the three functions above in turn to generate a new file with the names of countries sorted by their number of prizes. It will require a helper function for the sorting.
In [57]:
def createFileName(filename):
    """Given a filename like 'literature-nobel-prize.csv', create a new 
    filename in the format: 'literature-country-counts.txt'.
    """
    parts = filename.split('-') # split at the dash character
    name = parts[0] # the first part is what we want
    newFileName = f"{name}-country-counts.txt" # create the new file name
    return newFileName
In [58]:
createFileName("peace-nobel-prize.csv")
Out[58]:
'peace-country-counts.txt'

Step 1: Write getCountryNames

In [59]:
def getCountryNames(filename):
    """This function takes one parameter, a file name. It does the following:
    1. It opens the file for reading. 
    2. It reads the first line, but doesn't do anything with it.
    3. With a for loop, it reads all the other rows one by one,
    checking if it has encountered a country name before. If not, it keeps track of it.
    4. It returns a list of unique country names.
    This is an accumulation problem. 
    """
    # Your code here
    countriesList = []
    with open(filename, "r") as inputFile:
        _ = inputFile.readline() # read first line that contains column names
        for line in inputFile:
            country = line.split(',')[4]
            if country not in countriesList:
                countriesList.append(country)
                
    return countriesList

Test the function to find out how many countries were read. The expected results are:

Literature prizes: 41 countries.
Peace prizes: 46 countries.
In [60]:
print("Literature prizes:", len(getCountryNames("literature-nobel-prize.csv")), "countries.")
Literature prizes: 41 countries.
In [61]:
print("Peace prizes:", len(getCountryNames("peace-nobel-prize.csv")), "countries.")
Peace prizes: 46 countries.

The next function we will create will read the file and count the occurrences of a given country name.

Step 2: Write countCountryOccurrences

In [62]:
def countCountryOccurrences(filename, countryName):
    """
    This function takes two parameters: a file name and a country name.
    It returns an integer that is the count of occurrences of the country in the file.
    This is also an accumulation problem (but with an integer). 
    It shares a part of the solution with getCountryNames.
    """
    # Your code here
    countryOcc = 0 # accumulator variable
    with open(filename, "r") as inputFile:
        _ = inputFile.readline() # read first line that contains column names
        for line in inputFile:
            country = line.split(',')[4]
            if country == countryName:
                countryOcc += 1
                
    return countryOcc

Let's test the function with a few countries and each file of prizes:

In [63]:
countryNames = ['United States', 'France', 'Italy']
for countryName in countryNames:
    print(f"Literature Nobel Prize for {countryName}:", 
          countCountryOccurrences("literature-nobel-prize.csv", countryName))
Literature Nobel Prize for United States: 12
Literature Nobel Prize for France: 15
Literature Nobel Prize for Italy: 6
In [64]:
countryNames = ['United States', 'France', 'Italy']
for countryName in countryNames:
    print(f"Peace Nobel Prize for {countryName}:", 
          countCountryOccurrences("peace-nobel-prize.csv", countryName))
Peace Nobel Prize for United States: 27
Peace Nobel Prize for France: 9
Peace Nobel Prize for Italy: 2

Step 3: Write the function writeCountryCounts

Our next step is to write a function that given a list of tuples, such as:

[('United States', 27), ('France', 9), ('Italy', 2)]

writes them into a file. The new file gets its name from the original file, using the helper function createFileName.

In [65]:
def writeCountryCounts(countsList, filename0):
    """Function takes two parameters: a list of tuples and a filename.
    1. It first creates a new file name to save the output.
    2. It opens a file with the created name for writing.
    3. Writes a first line with the column names.
    4. Iterates through the list of tuples and writes all country names and counts.
    5. Prints a message to say that it was done creating the new file.
    """
    
    # Your code here
    # Create new name for the output file
    filename1 = createFileName(filename0)
    
    # Open file for writing; writes a header and all rows
    with open(filename1, 'w') as outputFile:
        outputFile.write("Country,Total Wins\n")
        for country, count in countsList:
            line = f"{country},{count}\n"
            outputFile.write(line)
        
    # displays a printed message
    print(f"Created new file {filename1}")
In [66]:
someCounts = [('United States', 27), ('France', 9), ('Italy', 2)]
writeCountryCounts(someCounts, 'peace-nobel-prize.csv')
Created new file peace-country-counts.txt
In [67]:
more peace-country-counts.txt

Step 4: Write the function orderCountriesWithMostNobels

Now it's the time to put together all functions we wrote above.

In [68]:
# Create a helper function for sorting by the number of occurrences of a country

# Your code here
def byOccurrence(countryOcc):
    """Helper function to serve as key for sorting."""
    return countryOcc[1]
In [69]:
def orderCountriesWithMostNobels(filename):
    """
    This function takes a single parameter, the file name of a CSV with Nobel Prize data.
    It does the following:
    1. Calls the function getCountryNames
    2. Generates a list of tuples of (country, occurrences), making use of countCountryOccurrences
    3. Sorts the list in reverse order (country with most prizes is first)
    4. Writes the list of tuples into a new file, calling the function writeCountryCounts.
    """
    # Your code here
    
    # 1. Find the list of countries
    countriesList = getCountryNames(filename)
    
    # 2. Count how often each country occurs; save results into a list
    countryOccList = []
    for country in countriesList:
        occ = countCountryOccurrences(filename, country)
        countryOccList.append((country, occ))
    
    # 3. Sort by occurrence
    orderedCountryOccList = sorted(countryOccList, key=byOccurrence, reverse=True)
    
    # 4. Save results into the new file
    writeCountryCounts(orderedCountryOccList, filename)

Let's test our function:

In [70]:
orderCountriesWithMostNobels("literature-nobel-prize.csv")
Created new file literature-country-counts.txt
In [71]:
more literature-country-counts.txt
In [72]:
orderCountriesWithMostNobels("peace-nobel-prize.csv")
Created new file peace-country-counts.txt
In [73]:
more peace-country-counts.txt

This was a long solution because you haven't yet encountered some Python constructs that will greatly simplify it, for example, dictionaries that keep track of frequencies and libraries for working with CSV files. We will be covering these in the coming weeks and will return to this problem for a much shorter solution.
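
As a preview only, here is a rough sketch of what such a shorter solution might look like, using a dictionary and the standard csv module. It is not the course's official solution. The function name countCountries and the miniature data file are made up for illustration; the rows are placeholders, not real prize data.

```python
import csv
import os
import tempfile

# Made-up miniature file in the same column layout as the literature file.
path = os.path.join(tempfile.mkdtemp(), 'demo-nobel-prize.csv')
with open(path, 'w') as outFile:
    outFile.write('Year,Name,,Gender,Citizenship,Second Citizenship,Born,Remarks\n')
    outFile.write('2001,Test,PersonA,Female,Poland,,1950,\n')
    outFile.write('2002,Test,PersonB,Male,France,,1951,\n')
    outFile.write('2003,Test,PersonC,Female,France,,1952,\n')

def countCountries(filename):
    """Hypothetical one-pass counter: returns a dict of country -> count."""
    counts = {}
    with open(filename, 'r', newline='') as inputFile:
        reader = csv.reader(inputFile)
        next(reader)                      # skip the row of column names
        for fields in reader:
            country = fields[4]           # the Citizenship column
            counts[country] = counts.get(country, 0) + 1
    return counts

def byCount(pair):
    """Sorting key: the count in a (country, count) tuple."""
    return pair[1]

demoCounts = countCountries(path)
ordered = sorted(demoCounts.items(), key=byCount, reverse=True)
print(ordered)
```

The file is read only once here, instead of once per country as in the solution above.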

That's all for today!