Table of Contents
key
parameterIn the previous lecture on File Formats, we worked with the file us-states-more.csv, which contained information about each state, such as:
State,StatePop,Abbrev.,Capital,CapitalPop
Alabama,4921532,AL,Montgomery,198525
Alaska,731158,AK,Juneau,32113
Arizona,7421401,AZ,Phoenix,1680992
Arkansas,3030522,AR,Little Rock,197312
California,39368078,CA,Sacramento,513624
Colorado,5807719,CO,Denver,727211
These are real-world data retrieved from the Census website. In the United States, representation in Congress (House of Representatives) is decided about changes in states' population every ten years. In this lecture, you will learn to write Python code to answer questions with Census data:
In order to answer these questions, we first need to learn about sorting sequences.
Let's start with the simplest case, sorting a list of unordered numbers, positive and negative.
numbers = [35, -2, 17, -9, 0, 12, 19]
We will use Python's built-in function, sorted
to sort the list. This function always returns a new list.
sorted(numbers)
And we can verify that numbers
hasn't changed:
numbers
Note: This suggests that if we want to use the result of sorted
, we must define a variable to save its returned value, for example:
sortedNumbers = sorted(numbers)
By default the list is sorted in the ascending order (from the smalled value to the largest), but we can easily reverse the order, using a named parameter of the function sorted
, as shown below:
sorted(numbers, reverse=True)
The use of the named parameter, reverse
, in this function call is new syntax that you haven't seen before. We will get back to it later in this lecture.
Strings and tuples can also be sorted in the same way. The result is always going to be a new list.
phrase = 'Red Code 1'
sorted(phrase)
Question: Why do we see a space as the first element in the sorted list?
Answer: Because of the ASCII representation of characters.
We can use the Python built-in function ord
to find the ASCII code of every character:
ord(' ')
We can write a for
loop to print the code for every character.
We use the f-string method to format the output.
for item in sorted(phrase):
print(f"'{item}' has ASCII code {ord(item)}")
Just as in the case of the list numbers
in the above example, the string value of phrase
hasn't changed:
phrase
This is to be expected, because strings are immutable.
digits = (9, 7, 5, 3, 1) # this is a tuple
type(digits) # check the type
sorted(digits)
Notice that the result of the sorting is a list, not a tuple. This is because the function sorted
always returns a list.
digits
The original tuple value hasn't changed.
We can sort list of sequences such as list of strings, list of tuples, and list of lists.
Sorting the list of tuples and the list of lists is going to be similar. The same principles will apply.
# a long string that we will split into a list of words
phrase = "99 red balloons *floating* in the Summer sky"
words = phrase.split()
words
sorted(words)
Question: Can you explain the results of sorting here? What rules are in place?
Answer: Words that contain special characters come first, then words that contain digits, words starting with uppercase letters, and finally, words with lowercase letters in alphabetical order. This ordering corresponds to the ASCII table numerical representations of each word's first character.
String characters are ordered by these rules:
sorted(words, reverse=True)
Remember, the original list is unchanged:
words
Tuples are compared element by element, starting with the one at index 0. This is known as lexicographic order, which is a generalization of dictionary order on strings in which each tuple element generalizes a character in a string.
triples = [(8, 'a', '$'), (7, 'c', '@'),
(7, 'b', '+'), (8, 'a', '!')]
sorted(triples)
Q: What happens in the case of ties for the first elements of tuples?
A: We keep comparing elements with the same indices until we find two that are not the same. (See example for the two tuples that start with 8.)
ord('!') < ord('$')
print(ord('!'), ord('$'))
That is, the reason '!'
is less than '$'
is that the first has a smaller ASCII code than the latter.
key
parameter¶Often there are cases in which we want to sort by an element that is not first in a sequence, for example, given the list of tuples people
(below), we want to sort by the age of a person.
people = [('Mary Beth Johnson', 18),
('Ed Smith', 17),
('Janet Doe', 25),
('Bob Miller', 31)]
Simply using sorted
as we have done so far will not work. But the function sorted
has been designed to deal with this scenario in mind. Let us read its doc string.
help(sorted)
Notice the phrase: A custom key function can be supplied to customize the sort order. This means that we can specify a function that for each element determines how it should be compared to other elements of the iterable. Let us see an example.
people = [('Mary Beth Johnson', 18),
('Ed Smith', 17),
('Janet Doe', 25),
('Bob Miller', 31)]
We'll create the function age
that given a person tuple (name, age) will return the age value.
def age(personTuple):
"""Helper function to use in sorted"""
return personTuple[1]
age(('Janet Doe', 25))
Now that we have this function, we will use it as the value for key in sorted
.
sorted(people, key=age)
The list was sorted by the age values! Let's see one more example. We will create a helper function lastName
that returns a person's last name.
def lastName(personTuple):
"""Helper function to use in sorted"""
return personTuple[0].split()[-1] # first access the whole name (has index=0 in the tuple)
# then split it (will create a list),
# then return its last element (index=-1)
lastName(('Bob Miller', 31))
sorted(people, key=lastName)
Important: The parameter key
is being assigned as its value a function value. Functions in Python are values, see the examples below:
age
lastName
We can create a variable, assign it a function value, and then call that variable as if it was a function (because indeed it's an alias for a function).
boo = age
boo(('Janet Doe', 25))
foo = lastName
foo(('Ed Smith', 17))
The variables boo
and foo
are aliases for the functions age
and lastName
, which we can easily verify:
boo
foo
key
functions¶Assume we have a new list of person tuples, where there are lots of ambiguities in terms of what comes first. Concretely:
people2 = [('Ed Jones', 18),
('Ana Doe', 25),
('Ed Doe', 18),
('Bob Doe', 25),
('Ana Jones', 18)]
Notice that we have several individuals with the same age, or the same first name, or the same last name. How should we sort elements in this situation?
We can create a function that uses a tuple to break the ties.
def ageLastFirst(person):
return (age(person), lastName(person), firstName(person))
Your Turn
Write a function firstName
, that mimics lastName
, but returns the first name of a person.
# Your code here
def firstName(personTuple):
"""Helper function to use in sorted"""
return personTuple[0].split()[0]
If you wrote firstName
, now we can write ageLastFirst
:
def ageLastFirst(person):
"""Helper function to use in sorted"""
return (age(person), lastName(person), firstName(person))
people2 = [('Ed Jones', 18),
('Ana Doe', 25),
('Ed Doe', 18),
('Bob Doe', 25),
('Ana Jones', 18)]
sorted(people2, key=ageLastFirst)
Notice that in the result, the tuples are sorted first by age, then by last name (when the same age), and then by first name (when same age and last name).
key
work?¶When sorted
is called with a key
parameter, the first thing it does is to invoke the function that is referred to by key
for each element of the sequence. If we think of the value returned by the key
function as keyvalue
, then what sorted
does is to create a tuple (keyvalue,value)
, sort the sequence based on this tuple, and then get rid of the tuple and return the sorted values only.
This process is also known as Decorate, Sort, Undecorate and we can try it too:
# Step 1 (Decorate): create a list of tuples (keyvalue, value)
decorated = [(age(person), person) for person in people]
decorated
# Step 2 (Sort): invoke the function sorted without the key function
decoratedSorted = sorted(decorated)
decoratedSorted
# Step 3 (Undecorate): extract now the value from each (keyvalue,value) pair to create the end result
undecoratedResult = [item[1] for item in decoratedSorted]
undecoratedResult
As you might remember, when we include key in sorted
the result is the same:
sorted(people, key=age)
Basically, the parameter key
works, becuase of the rules for sorting a list of tuples, that we saw earlier on the notebook.
Here is a list of the Space Missions where the second element in the tuple represents the number of days the mission lasted.
missions = [('Apollo 11', 8),
('Mercury-Redstone 3', 1),
('Apollo 13', 5),
('Gemini 3', 1),
('Little Joe 6', 1)]
Some quick terminology:
missions = [('Apollo 11', 8),
('Mercury-Redstone 3', 1),
('Apollo 13', 5),
('Gemini 3', 1),
('Little Joe 6', 1)]
Task 1: Sort by number of days
Using function sorted
with key, sort the list of missions based on the number of days on space. This should produce:
[('Mercury-Redstone 3', 1),
('Gemini 3', 1),
('Little Joe 6', 1),
('Apollo 13', 5),
('Apollo 11', 8)]
You will have to create first a helper function to be used as a value for key.
# Your code here
def days(missionTuple):
return missionTuple[1]
sorted(missions, key=days)
Task 2: Sort by mission number
This should produce the list:
[('Mercury-Redstone 3', 1),
('Gemini 3', 1),
('Little Joe 6', 1),
('Apollo 11', 8),
('Apollo 13', 5)]
Similarly as above, create a helper function to be assigned to key.
# Your code here
def missionNumber(missionTuple):
return int(missionTuple[0].split()[-1])
sorted(missions, key=missionNumber)
Task 3: Sort by program name
This should produce the list:
[('Apollo 11', 8),
('Apollo 13', 5),
('Gemini 3', 1),
('Little Joe 6', 1),
('Mercury-Redstone 3', 1)]
Again, start by creating a helper function to be the value for key.
# Your code here
def programName(missionTuple):
return ''.join(missionTuple[0].split()[:-1])
sorted(missions, key=programName)
Task 4: Sort by days, program name, and mission number
Here, you'll combine together all the functions you wrote above, to produce this list:
[('Gemini 3', 1),
('Little Joe 6', 1),
('Mercury-Redstone 3', 1),
('Apollo 13', 5),
('Apollo 11', 8)]
# Your code here
def daysProgramNumber(missionTuple):
return (days(missionTuple), programName(missionTuple), missionNumber(missionTuple))
sorted(missions, key=daysProgramNumber)
Lists have two methods for sorting. These methods mutate the original list. They are sort
and reverse
.
sort
¶numbers = [35, -2, 17, -9, 0, 12, 19]
numbers.sort()
Notice that no value was returned, because sort
mutates the original list.
numbers
reverse
¶numbers2 = [35, -2, 17, -9, 0, 12, 19]
numbers2.reverse()
This method also does not return a value, because it too mutates the list.
numbers2
In combination, sort and reverse can sort a list in reverse order:
numbers2.sort()
numbers2.reverse()
numbers2
key
and reverse
as well¶people.sort(key=age, reverse=True)
people
Does it make sense to sort a dictioanry? What happens when we try to do so?
fruitColors = {"banana": "yellow", "kiwi": "green",
"grapes": "purple", "apple": "red",
"lemon": "yellow", "pomegranate": "red"}
sorted(fruitColors)
We didn't get an error, but all that was sorted were the keys of the dictionary.
Try to predict what we will see in the following lines:
sorted(fruitColors.values())
sorted(fruitColors.items())
sorted(fruitColors.keys())
While it doesn't make sense to sort a dictionary, it certainly makes sense to sort a list of dictionaries, but we will need to do some work. Let's try it out:
peopleDctList = [{'name':'Mary Beth Johnson', 'age': 18},
{'name':'Ed Smith', 'age': 17},
{'name':'Janet Doe', 'age': 25},
{'name':'Bob Miller', 'age': 31}]
sorted(peopleDctList)
It doesn't happen out of the box, becuase Python doesn't have rules for comparing dictionaries, the ways that there are rules for comparing strings and tuples (as we saw at the beginning). That means, we would have to provide the key parameter to tell Python how to sort the items of this list of dictionaries.
def byAge(personDct):
"""Helper function to be used in sorted."""
return personDct['age']
sorted(peopleDctList, key=byAge)
Create a function byLastName
, that will sort the list of dictionaries by the last name of people. When used in sorted
, it should produce the list:
[{'name': 'Janet Doe', 'age': 25},
{'name': 'Mary Beth Johnson', 'age': 18},
{'name': 'Bob Miller', 'age': 31},
{'name': 'Ed Smith', 'age': 17}]
# Your code here
def byLastName(personDct):
"""Helper function to be used in sorted."""
return personDct['name'].split()[-1]
sorted(peopleDctList, key=byLastName)
Conclusion: You learned to sort a list of dictionaries. Now that you know that, you can answer our initial questions about the Census data.
We want to answer all these questions through sorting:
For the sake of space, you should only print the first 6 items of each sorted list. The final results for each question are shown in the slides.
We are providing you with the code of reading a CSV file as a list of dictionaries, so that you can focus on sorting.
import csv
# Read content of CSV file as a list of dictionaries
with open("us-states-more.csv", 'r') as inputF:
dctReader = csv.DictReader(inputF)
stateDctList = list(dctReader)
len(stateDctList)
Look up an item of this list:
stateDctList[0]
Remember that while Python displays this item as a list of tuples, this is in fact a dictionary, you can access its keys and values in the usual way:
for key in stateDctList[0]:
print(f"{key}: {stateDctList[0][key]}")
The steps for answering this question can be found also in Slide 20 of our lecture notes, but here they are for simplicity of access:
byStatePop
, which, given a dictionary with state data (one row from our file), returns the appropriate value. Remember that all values in the dictionary are strings, because they come from the CSV file.sorted
function to the list of dictionaries of state data (stateDctList
), using the key parameter that has as value byStatePop
.You can find the results of the printing in Slide 20.
# Write the function byStatePop
# Your code here
def byStatePop(stateDct):
"""Helper function to be used in sorted."""
return int(stateDct['StatePop'])
# Sort the list in descending order, save it in a variable
# Your code here
sortedStates1 = sorted(stateDctList, key=byStatePop, reverse=True)
# Print out the top 6 entries (see Slide 20)
# Do not worry if you cannot get the formatting of the numbers, check the solution later
# Your code here
print("Top six most populated US states:\n")
for stateDct in sortedStates1[:6]:
print(f"{stateDct['Abbrev.']} -> {int(stateDct['StatePop']):,}")
This is very similar to question 1. You do not need to write a helper function, because you can use byStatePop
. You only have to sort data in the ascending order (the default way that sorted
works) and then print the top six values.
# First sort (save in a variable), and then print out the top 6 least populated states in the US
# Your code here
sortedStates2 = sorted(stateDctList, key=byStatePop)
print("Top six least populated US states:\n")
for stateDct in sortedStates2[:6]:
print(f"{stateDct['Abbrev.']} -> {int(stateDct['StatePop']):,}")
This will be very similar to the two questions above, but first you'll have to write the helper function byCapitalPop
. Once you do that, check out Slide 21 to see what kind of output to print. Do not worry about the formatting, make sure you have the content.
# Write the function byCapitalPop
# Your code here
def byCapitalPop(stateDct):
"""Helper function to be used by sorted."""
return int(stateDct['CapitalPop'])
# Sort the list in descending order, save it in a variable
# Your code here
sortedCapitals1 = sorted(stateDctList, key=byCapitalPop, reverse=True)
# Print the top 6 capitals together with some other data (See slide 21 for what information to include)
# Your code here
print("Top six most populated US state capitals:\n")
for stateDct in sortedCapitals1[:6]:
combined = f"{stateDct['Capital']} ({stateDct['Abbrev.']})"
print(f"{combined:18s} -> {int(stateDct['CapitalPop']):10,d}")
## What does the formatting mean?
# :18s means allocate 18 characters for this string, padd with white space as necessary
# :10,d means allocate 10 digits that will be separated by , at the 1000s
Very similar to Question 3, but the capitals will now be in ascending order and you'll only print the top 6 least populated.
# First sort (save in a variable), and then print out the top 6 least populated US state capitals
# Your code here
sortedCapitals2 = sorted(stateDctList, key=byCapitalPop)
print("Top six least populated US state capitals:\n")
for stateDct in sortedCapitals2[:6]:
combined = f"{stateDct['Capital']} ({stateDct['Abbrev.']})"
print(f"{combined:18s} -> {int(stateDct['CapitalPop']):10,d}")
The general algorithm for answering this question remains the same.
byPercentage
, that finds the percentage of people in a state living in its capital# Write the function byPercentage
# Your code here
def byPercentage(stateDct):
"""Helper function to be used by sorted."""
return round(int(stateDct['CapitalPop'])/int(stateDct['StatePop'])*100, 2)
# Sort the list of state dicts by using the key function you wrote. Save the result in a variable.
# Your code here
sortedPercentages = sorted(stateDctList, key=byPercentage, reverse=True)
# Print the top six results (see what to print in Slide 22)
# Your code here
print("Top six US states with the largest population percentage living in the capital:\n")
for stateDct in sortedPercentages[:6]:
combined = f"{stateDct['State']:15s} {byPercentage(stateDct):2.2f}%"
print(f"{combined} of population lives in the capital, {stateDct['Capital']}.")
This is the end of this notebook!