As in our previous lecture, we will experiment with the file tree below.
Note: In this lecture, the top folder is actually named lec23_recursive_fileTrees rather than lec_fileTrees, and the notebook file is named lec23_recursive_fileTrees.ipynb rather than lec_fileTrees.ipynb. These changes will affect some of the examples and results below.
A file tree is recursively defined as:
a file (a leaf of the tree). In the above tree, files are represented by the document icon.
a directory (a.k.a. folder, an intermediate node of the tree) containing zero or more file trees. In the above tree, directories are represented by the folder icon.
The top node of the tree (in this case, the folder named lec_fileTrees) is called the root of the tree.
Tree-shaped structures consisting of nodes that branch out to subtrees that terminate in leaves are common in computer science. We focus on file trees in this lecture because everyone is familiar with them and they're an excellent domain for fruitful recursive functions.
A running program is often considered to be connected to a particular directory in the file tree of the computer. This is why you often have to change the directory in the Python console when you are executing code. In the Jupyter notebook, we can directly use certain OS (operating system) commands to figure out information about files and folders.
pwd: print the working directory
cd dirName: change the working directory to dirName
cd ..: change the working directory to the parent of the current working directory. (In general, .. means "parent of the current directory" and . means "the current directory".)
ls: list the contents of the working directory
ls dirName: list the contents of directory dirName
The -a flag to ls includes hidden files (names that begin with a dot)
The -l flag to ls puts each file on a line with extra information (size, timestamp, etc.)
The -a and -l flags can be combined as -al
Let's try them below to see the results.
pwd
ls
cd testdir
pwd
ls
ls -a
ls -al
cd ..
IMPORTANT: All of these commands can be used in the Terminal application on a Mac to navigate the folders on the computer. The same navigation can also be done with point-and-click operations in the Finder application.
Via the os module, Python provides a way to manipulate the directories and files in a file system. To use these features, we first need to import the os module:
import os
os functions
Function | Description |
---|---|
os.getcwd() | Returns the working directory as an absolute path |
os.listdir(pathName) | Returns the list of files and folders for the given path name |
os.path.exists(pathName) | Returns a boolean to indicate if a file or folder with the given path name exists |
os.path.isfile(pathName) | Returns a boolean to indicate if the given path name refers to a file |
os.path.isdir(pathName) | Returns a boolean to indicate if the given path name refers to a folder |
os.path.join(folderPath, filename) | Returns a string by joining together the given parameters. This new string is a valid path name. |
os.path.basename(pathName) | Returns a string that corresponds to the last part in a path name. |
os.path.getsize(pathName) | Returns a number that corresponds to the size in bytes of the file (or folder) referred to by the path name. |
os.getcwd
The os.getcwd function returns the current working directory as a string.
os.getcwd()
os.listdir
The os.listdir function returns a list of the names of all files and directories in the directory given as its argument.
os.listdir(os.getcwd())
So-called "dot files" whose names begin with the '.' character are special system files that are often hidden by the operating system when displaying files. We will tend to ignore them. Note that "dot files" do not include . (the current directory) or .. (the parent directory).
Depending on the settings for your computer's file browser (e.g., Finder on a Mac), you might or might not see dot files explicitly listed in the file browser. For example, here's a version of a Mac Finder window where hidden files are shown:
And here's a version of a Mac Finder window where dot files are not shown. Note that by default newer versions of Finder will not show .DS_Store files:
Let's see more examples of os.listdir:
os.listdir('testdir')
os.listdir('testdir/remember')
os.listdir('testdir/pset')
os.listdir('testdir/pset/scene')
os.listdir('testdir/pset/shrub')
YOUR TURN: Below, write a command that lists the contents of the subfolder "images".
The expected result is ['shrub1.png', 'shrub2.png'].
# Your code here
os.listdir('testdir/pset/shrub/images')
What happens if os.listdir is given the name of a nondirectory file or a nonexistent file?
os.listdir('testdir/hyp.py')
os.listdir('remember') # Not a subdirectory of the connected directory
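Both calls above fail with an exception rather than returning a list. The sketch below is one way to observe which exception is raised without crashing the notebook. It builds a throwaway folder with tempfile so it does not depend on testdir; the helper name tryListdir is hypothetical, not part of the os module, and the exception names shown are what Python typically raises on Mac and Linux.

```python
import os
import tempfile

def tryListdir(path):
    """Return the sorted listing of path, or the name of the exception raised."""
    try:
        return sorted(os.listdir(path))
    except OSError as err:
        # NotADirectoryError and FileNotFoundError are both kinds of OSError
        return type(err).__name__

# Build a throwaway folder containing a single file
with tempfile.TemporaryDirectory() as tmp:
    filePath = os.path.join(tmp, 'hyp.py')
    with open(filePath, 'w') as f:
        f.write('# placeholder\n')
    resultDir = tryListdir(tmp)                                # a real directory
    resultFile = tryListdir(filePath)                          # a file, not a directory
    resultMissing = tryListdir(os.path.join(tmp, 'remember'))  # does not exist

print(resultDir)
print(resultFile)
print(resultMissing)
```

Checking with os.path.isdir before calling os.listdir is a simple way to avoid both errors.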
The os.path.exists function determines whether the given name denotes a file/directory in the filesystem.
os.path.exists('testdir/remember/memories.txt')
os.path.exists('testdir/remember')
os.path.exists('catPlaysPiano.jpg')
os.path.exists('remember')
Note that the search for a file/directory begins in the working directory, which in the above examples is the lec_fileTrees directory. This is why os.path.exists('testdir/remember') is True but os.path.exists('remember') is False.
YOUR TURN: How would you check that the file cs1graphics.py is in the sample file tree?
# Your code here
os.path.exists('testdir/pset/scene/cs1graphics.py')
The os.path.isfile and os.path.isdir functions determine whether the given name is a file or directory, respectively. They both return False for a nonexistent file/directory name.
os.path.isfile('testdir/remember/memories.txt')
os.path.isdir('testdir/remember/memories.txt')
os.path.isfile('testdir/remember/')
os.path.isdir('testdir/remember/')
os.path.isfile('remember')
os.path.isdir('remember')
YOUR TURN: Verify that shrub1.png is a file, using the correct path.
# Your code here
os.path.isfile('testdir/pset/shrub/images/shrub1.png')
YOUR TURN: Verify that scene is a directory, using the correct path.
# Your code here
os.path.isdir('testdir/pset/scene')
os.path.join
We can use os.path.join to join two or more strings that contain parts of a path. We often need to do this together with os.listdir, which returns only the names of the contained files and directories, without their relative paths.
root = 'testdir'
for name in os.listdir(root):
    print(name)
But joining the root folder with the file name gives the entire relative path for a file/folder:
root = 'testdir'
for name in os.listdir(root):
    print(os.path.join(root, name))
We could instead use string concatenation via + to combine path elements, but os.path.join is more convenient for handling the slashes that separate path components.
os.path.join('testdir/', 'remember/', 'memories.txt')
os.path.join('testdir', 'remember', 'memories.txt')
'testdir' + '/' + 'remember' + '/' + 'memories.txt'
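To make the convenience concrete, here is a small sketch comparing the two approaches when a component already ends in a slash. The variable names naive and joined are just for illustration.

```python
import os

# Concatenation blindly adds a slash, even if one is already there
naive = 'testdir/' + '/' + 'remember'

# os.path.join adds a separator only when one is needed
joined = os.path.join('testdir/', 'remember')

print(naive)    # → testdir//remember
print(joined)   # → testdir/remember
```

A doubled slash is usually tolerated by the operating system, but it makes printed paths inconsistent and harder to compare as strings.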
YOUR TURN: Modify the above for loop to print out the relative paths of only the directories in testdir. You'll need to use one of the functions we learned in this section, in addition to os.path.join:
# Your code here:
root = 'testdir'
for name in os.listdir(root):
    pathName = os.path.join(root, name)
    if os.path.isdir(pathName):
        print(pathName)
os.path.basename
os.path.basename returns the last component in a file path.
os.path.basename('testdir/remember/memories.txt')
os.path.basename('testdir/remember')
os.path.basename('testdir/remember/')
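The last example above is worth a second look: when the path ends in a slash, the "last component" is the empty string. If that is not what you want, one option (a sketch, not the only approach) is to clean the path with os.path.normpath first, which removes the trailing slash.

```python
import os

withSlash = os.path.basename('testdir/remember/')
cleaned = os.path.basename(os.path.normpath('testdir/remember/'))

print(repr(withSlash))   # → ''
print(repr(cleaned))     # → 'remember'
```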
os.path.getsize
A file has a size measured in bytes. For a folder, the size only refers to some bookkeeping information; it does not refer to the total size of the files in the folder!
os.path.getsize("testdir/remember/ephemeral.py")
os.path.getsize("testdir/remember/memories.txt")
os.path.getsize("testdir/remember/persistent.py")
os.path.getsize("testdir/remember")
Note that 170 is less than (379 + 80 + 1634); it is unrelated to the sizes of the files in the remember directory!
os.path.getsize("testdir/tracks.csv")
os.path.getsize("testdir")
YOUR TURN: Write a for loop that will print (1) the relative path name of each element in the testdir folder along with (2) its size. The solution should look like this (note that your order may be different):
testdir/.DS_Store 6148 # This might or might not appear, depending on your operating system.
testdir/.numbers 21
testdir/hyp.py 94
testdir/pset 170
testdir/pubs.json 107062
testdir/remember 170
testdir/tracks.csv 18778
Note: this is trickier than it might first appear. Mastering this pattern is essential for writing functions that manipulate file trees (see the next section).
# Your code here:
folder = 'testdir'
for name in os.listdir(folder):
    wholeName = os.path.join(folder, name)
    print(wholeName, os.path.getsize(wholeName))
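Since os.path.getsize on a folder reports only bookkeeping information, the same join-then-query pattern can be used to add up the sizes of the files directly inside a folder. The sketch below uses a hypothetical helper named totalFileSize and a throwaway folder created with tempfile, so the numbers are not those of testdir. It does not look inside subfolders.

```python
import os
import tempfile

def totalFileSize(folder):
    """Sum the sizes of the regular files directly inside folder (no recursion)."""
    total = 0
    for name in os.listdir(folder):
        wholeName = os.path.join(folder, name)   # join before asking about the entry
        if os.path.isfile(wholeName):
            total += os.path.getsize(wholeName)
    return total

# Create two small files of known size: 10 bytes and 25 bytes
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [('a.txt', 'x' * 10), ('b.txt', 'y' * 25)]:
        with open(os.path.join(tmp, name), 'w') as f:
            f.write(text)
    total = totalFileSize(tmp)
    folderSize = os.path.getsize(tmp)   # bookkeeping size, varies by system

print('Sum of file sizes:', total)
print('Size reported for the folder itself:', folderSize)
```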
As noted above, a file tree is recursively defined as:
a file (a leaf of the tree); or
a directory (a.k.a. folder, an intermediate node of the tree) containing zero or more file trees
Because the structure of a file tree is recursive, it is natural to process such trees with recursive functions.
Here's the subtree of our file tree example rooted at the testdir directory. We'll be referring to this subtree in our examples.
printFileTree function
The goal of the printFileTree function is to print each directory or file (one per line) encountered in a traversal of the tree that "visits" a directory before visiting its contents. For example:
printFileTree('testdir')
should print
testdir
testdir/hyp.py
testdir/pset
testdir/pset/scene
testdir/pset/scene/cs1graphics.py
testdir/pset/shrub
testdir/pset/shrub/images
testdir/pset/shrub/images/shrub1.png
testdir/pset/shrub/images/shrub2.png
testdir/pset/shrub/shrub.py
testdir/pubs.json
testdir/remember
testdir/remember/ephemeral.py
testdir/remember/memories.txt
testdir/remember/persistent.py
testdir/tracks.csv
As indicated above, dot files should not be displayed by printFileTree.
We'll give the name printFileTreeBroken to our first attempt. Although it's close to correct, it doesn't work.
def printFileTreeBroken(root):
    '''Print all directories and files (one per line) starting at root,
    which itself is a directory or file name.
    '''
    if os.path.isfile(root):
        print(root)
    elif os.path.isdir(root):
        print(root)
        # Note that we use a loop to call the recursive function on all
        # the files and folders in a directory. This combination of a
        # loop with recursion is a common pattern in file tree traversals
        for fileOrDir in os.listdir(root):
            printFileTreeBroken(fileOrDir)
    # There's an implicit else: pass here that does nothing in all other cases,
    # e.g., nonexistent files and file types other than regular files and directories.
# check how it works for current directory
printFileTreeBroken('.')
# check how it works for the first subdirectory
printFileTreeBroken('testdir')
Why doesn't printFileTreeBroken('testdir') print the contents of testdir?
Well, trace through the code. For example:
os.listdir('testdir')
Within the function body, each recursive call performs an if/elif test, but both tests evaluate to False. Why?
os.path.isfile('hyp.py')
os.path.isdir('pset')
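One way to see why these tests fail is to compare a bare name returned by os.listdir with the same name joined to its folder. The sketch below uses a throwaway folder and a hypothetical file name (chosen to be unlikely to exist in your working directory) rather than testdir.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create one empty file inside root
    open(os.path.join(root, 'hypothetical_demo_file.txt'), 'w').close()

    name = os.listdir(root)[0]        # a bare name, with no folder in front
    bareIsFile = os.path.isfile(name)                        # looked up in the working directory
    joinedIsFile = os.path.isfile(os.path.join(root, name))  # looked up inside root

print(name)
print('bare name is a file?  ', bareIsFile)
print('joined path is a file?', joinedIsFile)
```

The bare name is looked up relative to the working directory, where no such file exists, while the joined path points at the file that os.listdir actually reported.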
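The problem is that the code doesn't correctly handle relative paths: os.listdir returns bare names such as 'hyp.py', and the recursive call looks for them in the working directory rather than inside root. Here's an improved version that joins root with each name, but it still has a problem. What problem does it still have?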
def printFileTreeBetter(root):
    """A step toward a better version of 'printFileTree'."""
    if os.path.isfile(root):
        print(root)
    elif os.path.isdir(root):
        print(root)
        for fileOrDir in os.listdir(root):
            printFileTreeBetter(os.path.join(root, fileOrDir))
printFileTreeBetter('testdir')
The remaining problem is that printFileTreeBetter also prints dot files such as testdir/.numbers. Let's define a helper function to test for hidden files:
def isHiddenFile(path):
    base = os.path.basename(path)
    return (len(base) > 0
            and base[0] == '.'
            and base != '.' # '.' ("the current directory") is a special case
            and base != '..' # '..' ("the parent of the current directory") is a special case
            )
isHiddenFile('.data')
isHiddenFile('testdir/.numbers')
isHiddenFile('testdir/psets/memories.txt')
isHiddenFile('.')
Now we can use isHiddenFile to filter out hidden files, leading to our final, correct version of printFileTree.
def printFileTree(root):
    '''Print all directories and files (one per line) starting at root,
    which itself is a directory or file name.
    '''
    if isHiddenFile(root):
        pass # filter out dot files
    elif os.path.isfile(root):
        print(root) # print the name of the file
    elif os.path.isdir(root):
        # 1. Print the name of the directory
        print(root)
        # 2. For each contained file or subdirectory, recursively print its file tree
        for fileOrDir in os.listdir(root):
            printFileTree(os.path.join(root, fileOrDir))
printFileTree('testdir')
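The order in which os.listdir returns names is not guaranteed, so the lines printed by printFileTree may not come out alphabetically on every computer. The sketch below is a hypothetical variant, listFileTree, that sorts each listing and returns the paths as a list instead of printing them, which also makes the result easy to check. It repeats the dot-file test inline so that it is self-contained, and it runs on a throwaway tree that mimics a small part of testdir.

```python
import os
import tempfile

def listFileTree(root):
    """Return the paths under root as a list, each directory before its contents.

    Hypothetical variant of printFileTree: sorts every directory listing
    and collects the paths rather than printing them.
    """
    base = os.path.basename(root)
    if base.startswith('.') and base not in ('.', '..'):
        return []                                  # skip dot files
    if os.path.isfile(root):
        return [root]
    if os.path.isdir(root):
        paths = [root]                             # the directory itself comes first
        for name in sorted(os.listdir(root)):      # sorted gives a repeatable order
            paths.extend(listFileTree(os.path.join(root, name)))
        return paths
    return []                                      # nonexistent or special file

# Build a small throwaway tree: testdir/{hyp.py, .numbers, pset/scene/cs1graphics.py}
with tempfile.TemporaryDirectory() as tmp:
    top = os.path.join(tmp, 'testdir')
    os.makedirs(os.path.join(top, 'pset', 'scene'))
    for rel in ['hyp.py', '.numbers', os.path.join('pset', 'scene', 'cs1graphics.py')]:
        open(os.path.join(top, rel), 'w').close()
    # Show paths relative to tmp, using '/' so the output reads the same everywhere
    found = [os.path.relpath(p, tmp).replace(os.sep, '/') for p in listFileTree(top)]

for path in found:
    print(path)
```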
printFiles
Starting with printFileTree, define a modified function printFiles that prints only the (nondirectory) files encountered in a file tree traversal starting at a given root. For example,
printFiles('testdir')
should print
testdir/hyp.py
testdir/pset/scene/cs1graphics.py
testdir/pset/shrub/images/shrub1.png
testdir/pset/shrub/images/shrub2.png
testdir/pset/shrub/shrub.py
testdir/pubs.json
testdir/remember/ephemeral.py
testdir/remember/memories.txt
testdir/remember/persistent.py
testdir/tracks.csv
def printFiles(root):
    '''Print only the (nondirectory) files encountered
    in a file tree traversal starting at root.
    '''
    # Your code here:
    if isHiddenFile(root):
        pass # filter out dot files
    elif os.path.isfile(root):
        print(root) # print the name of the file
    elif os.path.isdir(root):
        # Do *not* print the name of the directory!
        # For each contained file or subdirectory, recursively print *only* its files
        for fileOrDir in os.listdir(root):
            printFiles(os.path.join(root, fileOrDir))
# Test the printFiles function:
printFiles('testdir')
printDirs
Starting with printFileTree, define a modified function printDirs that prints only the directories encountered in a file tree traversal starting at a given root. For example,
printDirs('testdir')
should print
testdir
testdir/pset
testdir/pset/scene
testdir/pset/shrub
testdir/pset/shrub/images
testdir/remember
def printDirs(root):
    '''Print only the directories encountered in a
    file tree traversal starting at root.
    '''
    # Your code here:
    if isHiddenFile(root):
        pass # filter out dot files
    elif os.path.isdir(root):
        # 1. Print the name of the directory
        print(root)
        # 2. For each contained node, recursively print *only* its directories
        for fileOrDir in os.listdir(root):
            printDirs(os.path.join(root, fileOrDir))
    # for files, have an implicit else: pass
# Test the printDirs function:
printDirs('testdir')
countFiles
Starting with printFiles, define a modified function countFiles that returns the number of (nondirectory) files encountered in a file tree traversal starting at a given root. For example:
countFiles('testdir/pset/shrub/images/shrub1.png') => 1
countFiles('testdir/pset/shrub/images') => 2
countFiles('testdir/pset/shrub') => 3
countFiles('testdir/pset') => 4
countFiles('testdir') => 10
def countFiles(root):
    '''Returns the number of (nondirectory) files encountered
    in a file tree traversal starting at root.
    '''
    # Your code here:
    if isHiddenFile(root):
        return 0 # filter out dot files
    elif os.path.isfile(root):
        return 1 # count the file we're testing
    elif os.path.isdir(root):
        # # Solution using accumulation variable
        # filesHere = 0
        # for fileOrDir in os.listdir(root):
        #     filesHere += countFiles(os.path.join(root, fileOrDir))
        # return filesHere
        # Alternative solution using list comprehensions and sum
        return sum([countFiles(os.path.join(root, fileOrDir))
                    for fileOrDir in os.listdir(root)])
    else:
        return 0 # nonexistent paths and special files don't count (avoids returning None)
# Test the countFiles function:
for f in ['testdir/pset/shrub/images/shrub1.png',
          'testdir/pset/shrub/images',
          'testdir/pset/shrub', 'testdir/pset', 'testdir']:
    print(f"countFiles('{f}') => {countFiles(f)}")
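A useful way to cross-check a recursive counter is to compute the same number a different way. Python's os.walk function visits every directory in a tree for us, so the sketch below (the name countFilesWalk is hypothetical) counts files with a plain loop and no explicit recursion. It assumes root is a directory whose own name is not hidden; unlike countFiles, it returns 0 when root is a single file. It is demonstrated on a throwaway tree rather than testdir.

```python
import os
import tempfile

def countFilesWalk(root):
    """Count the non-hidden files under directory root using os.walk."""
    count = 0
    for dirPath, dirNames, fileNames in os.walk(root):
        # Prune hidden subdirectories in place so os.walk does not descend into them
        dirNames[:] = [d for d in dirNames if not d.startswith('.')]
        # Count only the files whose names do not begin with a dot
        count += len([f for f in fileNames if not f.startswith('.')])
    return count

# Throwaway tree: two visible files, one dot file, one file inside a hidden folder
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'pset'))
    os.makedirs(os.path.join(tmp, '.hidden'))
    for rel in ['hyp.py', '.numbers',
                os.path.join('pset', 'shrub.py'),
                os.path.join('.hidden', 'config')]:
        open(os.path.join(tmp, rel), 'w').close()
    walkCount = countFilesWalk(tmp)

print(walkCount)
```

If the two approaches disagree on a directory, that is a good hint that one of them mishandles dot files or path joining.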
countDirs
Starting with printDirs, define a modified function countDirs that returns the number of directories encountered in a file tree traversal starting at a given root. For example:
countDirs('testdir/pset/shrub/images/shrub1.png') => 0
countDirs('testdir/pset/shrub/images') => 1
countDirs('testdir/pset/shrub') => 2
countDirs('testdir/pset') => 4
countDirs('testdir') => 6
def countDirs(root):
    '''Returns the number of directories encountered in a
    file tree traversal starting at root.
    '''
    # Your code here:
    if isHiddenFile(root):
        return 0 # filter out dot files
    elif os.path.isdir(root):
        # # Solution using accumulation variable
        # dirsHere = 1 # count the root directory itself
        # for fileOrDir in os.listdir(root):
        #     dirsHere += countDirs(os.path.join(root, fileOrDir))
        # return dirsHere
        # Alternative solution using list comprehensions and sum
        return 1 + sum([countDirs(os.path.join(root, fileOrDir))
                        for fileOrDir in os.listdir(root)])
    else:
        return 0 # nondirectory files don't count
# Test the countDirs function:
for f in ['testdir/pset/shrub/images/shrub1.png', 'testdir/pset/shrub/images',
          'testdir/pset/shrub', 'testdir/pset', 'testdir']:
    print("countDirs('{}') => {}".format(f, countDirs(f)))
printLargestFiles
This function prints, for each folder, the largest file along with its size.
Hidden files are ignored, and for folders that contain no files an appropriate message is printed.
Here is the result for printLargestFiles('testdir'):
Biggest file in testdir is pubs.json with size = 107062 bytes.
No files to check in testdir/pset.
Biggest file in testdir/pset/scene is cs1graphics.py with size = 212018 bytes.
Biggest file in testdir/pset/shrub is shrub.py with size = 1842 bytes.
Biggest file in testdir/pset/shrub/images is shrub1.png with size = 27248 bytes.
Biggest file in testdir/remember is persistent.py with size = 1634 bytes.
def printLargestFiles(root):
    '''For every folder prints out the relative path of the folder, the
    name of the largest file, and its size. For folders that have no
    files, prints out an appropriate message.
    '''
    # Your code here:
    if os.path.isfile(root):
        pass
    elif os.path.isdir(root):
        # get the list of everything in the directory 'root'
        everything = os.listdir(root)
        # keep only files, discard directories and dot files
        onlyFiles = [f for f in everything
                     if os.path.isfile(os.path.join(root, f))
                     and not isHiddenFile(f)]
        # perform printing of results
        if len(onlyFiles) > 0:
            size, fName = max([(os.path.getsize(os.path.join(root, f)),
                                f) for f in onlyFiles])
            print(f"Biggest file in {root} is {fName} with size = {size} bytes.")
        else:
            print(f"No files to check in {root}.")
        # continue with recursion
        for fileOrDir in everything:
            printLargestFiles(os.path.join(root, fileOrDir))
# Test the printLargestFiles function:
printLargestFiles('testdir')
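The solution above finds the largest file by taking the max of a list of (size, name) tuples. Python compares tuples element by element, so the size decides the winner and the name only matters when two sizes are equal. The sketch below illustrates this with the sizes of three testdir files listed earlier in this lecture; the second list is a made-up tie, included only to show the tie-breaking behavior.

```python
# (size in bytes, file name) pairs for three files in testdir
sizesAndNames = [(94, 'hyp.py'), (107062, 'pubs.json'), (18778, 'tracks.csv')]

# max compares the first elements (sizes) first
size, fName = max(sizesAndNames)
print(fName, size)       # → pubs.json 107062

# Hypothetical tie: equal sizes, so the names are compared instead
tieWinner = max([(5, 'a.txt'), (5, 'b.txt')])
print(tieWinner)         # → (5, 'b.txt')
```

Putting the size first in each tuple is what makes this work; with (name, size) pairs, max would pick the alphabetically last name instead.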