1. Operating System (OS) Commands in the Notebook

Files are usually organized in folders (also known as directories). This is why you often have to change the directory in the Python console when you are executing code. In the Jupyter notebook (and the interactive pane on Canopy), we can directly use certain OS (operating system) commands to figure out information about files and folders.

  • pwd : print working directory
  • cd dirName : change the working directory to dirName
    • cd .. : change the working directory to the parent of the current working directory. (In general, .. means "parent of the current directory" and . means "the current directory".)
  • ls : list the contents of the working directory
    • ls dirName: list the contents of directory dirName
    • the -a flag to ls includes hidden files (those whose names begin with a dot)
    • the -l flag to ls puts each file on its own line with extra information (size, timestamp, etc.)
    • the -a and -l flags can be combined as -al

Let's try them below to see the results.

In [1]:
pwd
Out[1]:
'/Users/fturbak/Downloads/lec_fileTrees_solns'
In [2]:
ls
lec_fileTrees_solns.ipynb  testdir/
pics/
In [3]:
cd testdir
/Users/fturbak/Downloads/lec_fileTrees_solns/testdir
In [4]:
pwd
Out[4]:
'/Users/fturbak/Downloads/lec_fileTrees_solns/testdir'
In [5]:
ls
hyp.py      pset/       pubs.json   remember/   tracks.csv
In [6]:
ls -a
./          .DS_Store   hyp.py      pubs.json   tracks.csv
../         .numbers    pset/       remember/
In [7]:
ls -al
total 288
drwxr-xr-x  9 fturbak  staff     306 Apr 30 06:12 ./
drwxr-xr-x  8 fturbak  staff     272 Apr 30 06:41 ../
-rw-r--r--@ 1 fturbak  staff    6148 Apr 30 06:12 .DS_Store
-rw-r--r--  1 fturbak  staff      21 Apr 29 08:56 .numbers
-rw-r--r--  1 fturbak  staff      94 Apr 29 08:56 hyp.py
drwxr-xr-x  5 fturbak  staff     170 Apr 30 06:12 pset/
-rw-r--r--  1 fturbak  staff  107062 Apr 29 08:56 pubs.json
drwxr-xr-x  6 fturbak  staff     204 Apr 30 06:12 remember/
-rw-r--r--  1 fturbak  staff   18778 Apr 29 08:56 tracks.csv
In [8]:
cd ..
/Users/fturbak/Downloads/lec_fileTrees_solns

IMPORTANT: All of these commands can also be used in the Terminal application on a Mac to navigate the folders on the computer. The same navigation can be done via point-and-click operations in the Finder application.
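The same navigation can also be done from Python code. The sketch below is only an illustration: it uses os.getcwd, os.chdir, and os.listdir from the standard os module as rough counterparts of pwd, cd, and ls, and it works inside a throwaway folder created with tempfile. The subfolder name demo is made up for the example.

```python
import os
import tempfile

start = os.getcwd()                          # like pwd
with tempfile.TemporaryDirectory() as scratch:
    os.mkdir(os.path.join(scratch, 'demo'))  # made-up subfolder for the demo
    os.chdir(scratch)                        # like cd scratch
    listing = os.listdir('.')                # like ls -a, but without . and ..
    print(listing)
    os.chdir(start)                          # go back before scratch is deleted
print(os.getcwd() == start)
```

Unlike ls, os.listdir always includes dot files and never includes the . and .. entries.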

2. File System Operations

We will experiment with a file tree that looks like this:

<img src = "pics/fileTree.png" width=600>

A file tree is recursively defined as:

  • a file (a leaf of the tree). In the above tree, files are represented by the document icon.

  • a directory (a.k.a. folder, an intermediate node of the tree) containing zero or more file trees. In the above tree, directories are represented by the folder icon.

The top node of the tree (in this case, the folder named lec_fileTrees) is called the root of the tree.

Tree-shaped structures consisting of nodes that branch out to subtrees that terminate in leaves are common in computer science. We focus on file trees in this lecture because everyone is familiar with them and they're an excellent domain for fruitful recursive functions.
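To make the recursive definition concrete, here is a small sketch that models a file tree purely in Python data, without touching the file system. The encoding is an assumption chosen for illustration: a directory is a dict mapping names to subtrees, and a file is None. The contents mirror the non-hidden entries of the testdir folder in the example tree, and countLeaves is a hypothetical helper.

```python
# A directory is a dict mapping names to subtrees; a file is None.
# This encoding is an assumption made for illustration only.
tree = {
    'hyp.py': None,
    'pset': {
        'scene': {'cs1graphics.py': None},
        'shrub': {
            'images': {'shrub1.png': None, 'shrub2.png': None},
            'shrub.py': None,
        },
    },
    'pubs.json': None,
    'remember': {
        'ephemeral.py': None,
        'memories.txt': None,
        'persistent.py': None,
    },
    'tracks.csv': None,
}

def countLeaves(node):
    '''Count the files (leaves) in a nested-dict file tree.'''
    if node is None:      # base case: a file
        return 1
    # recursive case: a directory containing zero or more file trees
    return sum(countLeaves(child) for child in node.values())

print(countLeaves(tree))  # 10
```

The function has one case per clause of the definition: a base case for a file and a recursive case for a directory.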

Via the os module, Python provides a way to manipulate the directories and files in a file system. To use these features, we first need to import the os module:

In [9]:
import os

(a) Get working directory: os.getcwd

The os.getcwd function returns the current working directory as a string.

In [10]:
os.getcwd()
Out[10]:
'/Users/fturbak/Downloads/lec_fileTrees_solns'

(b) List directory: os.listdir

The os.listdir function returns a list of the names of all files and directories in the directory given as its argument.

In [11]:
os.listdir(os.getcwd())
Out[11]:
['.data',
 '.DS_Store',
 '.ipynb_checkpoints',
 'lec_fileTrees_solns.ipynb',
 'pics',
 'testdir']

So-called "dot files" whose names begin with the '.' character are special system files that are often hidden by the operating system when displaying files. We will tend to ignore them. Note that "dot files" do not include . (the current directory) or .. (the parent directory).
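Since os.listdir returns an ordinary list of strings, one simple way to ignore dot files is to filter the list by name. The following is a minimal sketch that hard-codes the listing shown above instead of reading the file system; the variable names are made up for the example.

```python
# The listing shown above, copied into a list so the example runs anywhere
names = ['.data', '.DS_Store', '.ipynb_checkpoints',
         'lec_fileTrees_solns.ipynb', 'pics', 'testdir']

# Keep only the names that do not begin with a dot
visible = [name for name in names if not name.startswith('.')]
print(visible)  # ['lec_fileTrees_solns.ipynb', 'pics', 'testdir']
```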

Depending on the settings for your computer's file browser (e.g., Finder on a Mac), you might or might not see dot files explicitly listed in the file browser. For example, here's a version of a Mac Finder window where hidden files are shown:

<img src = "pics/fileListingWithHiddenFiles.png" width=600>

And here's a version of a Mac Finder window where dot files are not shown. Note that by default newer versions of Finder will not show .DS_Store files:

<img src = "pics/fileListingWithoutHiddenFiles.png" width=600>

Let's see more examples of os.listdir:

In [12]:
os.listdir('testdir')
Out[12]:
['.DS_Store',
 '.numbers',
 'hyp.py',
 'pset',
 'pubs.json',
 'remember',
 'tracks.csv']
In [13]:
os.listdir('testdir/remember')
Out[13]:
['.DS_Store', 'ephemeral.py', 'memories.txt', 'persistent.py']
In [14]:
os.listdir('testdir/pset')
Out[14]:
['.DS_Store', 'scene', 'shrub']
In [15]:
os.listdir('testdir/pset/scene')
Out[15]:
['.DS_Store', 'cs1graphics.py']
In [16]:
os.listdir('testdir/pset/shrub')
Out[16]:
['.DS_Store', 'images', 'shrub.py']

YOUR TURN: Below, write a command that lists the content for the subfolder "images".
The expected result is ['shrub1.png', 'shrub2.png'] (on a Mac, a '.DS_Store' entry may also appear).

In [17]:
# Write your command here
os.listdir('testdir/pset/shrub/images')
Out[17]:
['.DS_Store', 'shrub1.png', 'shrub2.png']

What happens if os.listdir is given the name of a nondirectory file or a nonexistent file?

In [18]:
os.listdir('testdir/hyp.py')
---------------------------------------------------------------------------
NotADirectoryError                        Traceback (most recent call last)
<ipython-input-18-eb97991bd016> in <module>()
----> 1 os.listdir('testdir/hyp.py')

NotADirectoryError: [Errno 20] Not a directory: 'testdir/hyp.py'
In [19]:
os.listdir('remember') # Not a subdirectory of the connected directory
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-8d8dc5976bfa> in <module>()
----> 1 os.listdir('remember') # Not a subdirectory of the connected directory

FileNotFoundError: [Errno 2] No such file or directory: 'remember'
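If we would rather not have these errors stop our code, we can catch them. The sketch below is one possible approach, not the only one: safeListdir is a hypothetical helper that returns an empty list when the path is missing or is not a directory. It is demonstrated on a throwaway folder created with tempfile, and the file name notes.txt is made up.

```python
import os
import tempfile

def safeListdir(path):
    '''Hypothetical helper: return the names in path, or an empty list
    if path does not exist or is not a directory.'''
    try:
        return os.listdir(path)
    except (FileNotFoundError, NotADirectoryError) as err:
        print('Cannot list {}: {}'.format(path, type(err).__name__))
        return []

with tempfile.TemporaryDirectory() as scratch:
    filePath = os.path.join(scratch, 'notes.txt')  # made-up file name
    with open(filePath, 'w') as f:
        f.write('hello')
    asDir = safeListdir(scratch)                   # a real directory
    asFile = safeListdir(filePath)                 # a nondirectory file
    asMissing = safeListdir(os.path.join(scratch, 'nope'))  # nonexistent
print(asDir, asFile, asMissing)
```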

(c) Does this path exist?

The os.path.exists function determines whether the given name denotes a file/directory in the filesystem.

In [20]:
os.path.exists('testdir/remember/memories.txt')
Out[20]:
True
In [21]:
os.path.exists('testdir/remember')
Out[21]:
True
In [22]:
os.path.exists('catPlaysPiano.jpg')
Out[22]:
False
In [23]:
os.path.exists('remember')
Out[23]:
False

Note that the search for a file/directory begins in the working directory, which in the above examples is the lec_fileTrees_solns directory. This is why os.path.exists('testdir/remember') is True but os.path.exists('remember') is False.

YOUR TURN: How would you check that the file cs1graphics.py is in the sample file tree?

In [24]:
# Write your command here
os.path.exists('testdir/pset/scene/cs1graphics.py')
Out[24]:
True

(d) Determine file or directory status

The os.path.isfile and os.path.isdir functions determine whether the given name is a file or directory, respectively. They both return False for a nonexistent file/directory name.

In [25]:
os.path.isfile('testdir/remember/memories.txt')
Out[25]:
True
In [26]:
os.path.isdir('testdir/remember/memories.txt')
Out[26]:
False
In [27]:
os.path.isfile('testdir/remember/')
Out[27]:
False
In [28]:
os.path.isdir('testdir/remember/')
Out[28]:
True
In [29]:
os.path.isfile('remember')
Out[29]:
False
In [30]:
os.path.isdir('remember')
Out[30]:
False

YOUR TURN: Verify that shrub1.png is a file, using the correct path.

In [31]:
# Write your command here
os.path.isfile('testdir/pset/shrub/images/shrub1.png')
Out[31]:
True

YOUR TURN: Verify that scene is a directory, using the correct path.

In [32]:
# Write your command here
os.path.isdir('testdir/pset/scene')
Out[32]:
True

(e) Creating a path with os.path.join

We can use os.path.join to join strings that contain parts of a path. We often need to do this together with os.listdir, which only shows the names of the contained files and directories without their relative paths.

In [33]:
root = 'testdir'
for name in os.listdir(root):
    print(name)
.DS_Store
.numbers
hyp.py
pset
pubs.json
remember
tracks.csv

By contrast, joining the root folder with each name gives the entire relative path for a file/folder:

In [34]:
root = 'testdir'
for name in os.listdir(root):
    print(os.path.join(root, name))
testdir/.DS_Store
testdir/.numbers
testdir/hyp.py
testdir/pset
testdir/pubs.json
testdir/remember
testdir/tracks.csv

We could instead use string concatenation via + to combine path elements, but os.path.join is more convenient for handling the slashes that separate path components.

In [35]:
os.path.join('testdir/', 'remember/', 'memories.txt')
Out[35]:
'testdir/remember/memories.txt'
In [36]:
os.path.join('testdir', 'remember', 'memories.txt')
Out[36]:
'testdir/remember/memories.txt'
In [37]:
'testdir' + '/' + 'remember' + '/' + 'memories.txt'
Out[37]:
'testdir/remember/memories.txt'
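To see why os.path.join is more convenient, consider what happens when a component already ends with a slash. The sketch below, with made-up variable names, compares the two approaches.

```python
import os

# os.path.join does not add a second slash after 'testdir/'
joined = os.path.join('testdir/', 'remember')
print(joined)   # testdir/remember

# Naive concatenation doubles the slash
naive = 'testdir/' + '/' + 'remember'
print(naive)    # testdir//remember

# os.path.join inserts the separator of the current operating system,
# which is available as os.sep
portable = os.path.join('testdir', 'remember')
print(portable == 'testdir' + os.sep + 'remember')  # True
```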

YOUR TURN: Modify the above for loop to print out the relative paths of only the directories in testdir. You'll need to use one of the functions we learned in this section, in addition to os.path.join:

In [38]:
# Write your modified loop code here:
root = 'testdir'
for name in os.listdir(root):
    pathName = os.path.join(root, name)
    if os.path.isdir(pathName):
        print(pathName)
testdir/pset
testdir/remember

(f) Getting the last component of path with os.path.basename

os.path.basename returns the last component in a file path.

In [39]:
os.path.basename('testdir/remember/memories.txt')
Out[39]:
'memories.txt'
In [40]:
os.path.basename('testdir/remember')
Out[40]:
'remember'
In [41]:
os.path.basename('testdir/remember/')
Out[41]:
''
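The empty result for a path with a trailing slash can be surprising. One possible workaround, sketched below, is to clean the path with os.path.normpath before taking the basename. The sketch also shows os.path.split, which returns the parent path and the last component together. Neither function is covered above; both are part of the standard os.path module. The outputs in the comments assume a Mac or Linux system, where / is the separator.

```python
import os

path = 'testdir/remember/'
trimmed = os.path.normpath(path)   # drops the trailing slash
name = os.path.basename(trimmed)
print(name)                        # remember

# os.path.split gives (everything before the last component, last component)
parent, last = os.path.split('testdir/remember/memories.txt')
print(parent)                      # testdir/remember
print(last)                        # memories.txt
```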

(g) Getting the size of files and folders with os.path.getsize

A file has a size measured in bytes. For a folder, the size only refers to some bookkeeping information; it does not refer to the total size of the files in the folder!

In [42]:
os.path.getsize("testdir/remember/ephemeral.py")
Out[42]:
379
In [43]:
os.path.getsize("testdir/remember/memories.txt")
Out[43]:
80
In [44]:
os.path.getsize("testdir/remember/persistent.py")
Out[44]:
1634
In [45]:
os.path.getsize("testdir/remember")
Out[45]:
204

Note that 204 is less than (379 + 80 + 1634); it is unrelated to the sizes of the files in the remember directory!

In [46]:
os.path.getsize("testdir/tracks.csv")
Out[46]:
18778
In [47]:
os.path.getsize("testdir")
Out[47]:
306
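Because os.path.getsize on a folder reports only bookkeeping information, we have to add up file sizes ourselves if we want a total. Here is a minimal sketch under stated assumptions: sizeOfFilesIn is a hypothetical helper that sums only the files directly inside a folder (it does not enter subfolders), and the demo runs on a throwaway folder created with tempfile.

```python
import os
import tempfile

def sizeOfFilesIn(folder):
    '''Hypothetical helper: total size in bytes of the regular files
    directly inside folder (subfolders are not entered).'''
    total = 0
    for name in os.listdir(folder):
        path = os.path.join(folder, name)  # getsize needs the joined path
        if os.path.isfile(path):
            total += os.path.getsize(path)
    return total

with tempfile.TemporaryDirectory() as scratch:
    # Two files of 5 and 7 bytes, plus an empty subfolder that is skipped
    for name, data in [('a.txt', b'12345'), ('b.txt', b'1234567')]:
        with open(os.path.join(scratch, name), 'wb') as f:
            f.write(data)
    os.mkdir(os.path.join(scratch, 'sub'))
    total = sizeOfFilesIn(scratch)
print(total)  # 12
```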

YOUR TURN: Write a for loop that will print (1) the relative path name of each element in the testdir folder along with (2) its size. The solution should look like this (note that your order may be different):

testdir/.DS_Store 6148 # This might or might not appear, depending on your operating system. 
testdir/.numbers 21
testdir/hyp.py 94
testdir/pset 170
testdir/pubs.json 107062
testdir/remember 204
testdir/tracks.csv 18778

Note: this is trickier than it might first appear. Mastering this pattern is essential for writing functions that manipulate file trees (see the next section).

In [48]:
# Write your code here:
folder = 'testdir'
for name in os.listdir(folder):
    wholeName = os.path.join(folder, name)
    print(wholeName, os.path.getsize(wholeName))
testdir/.DS_Store 6148
testdir/.numbers 21
testdir/hyp.py 94
testdir/pset 170
testdir/pubs.json 107062
testdir/remember 204
testdir/tracks.csv 18778

3. File Tree Traversals

A file tree is recursively defined as:

  1. a file (a leaf of the tree); or

  2. a directory (a.k.a. folder, an intermediate node of the tree) containing zero or more file trees

Because the structure of a file tree is recursive, it is natural to process such trees with recursive functions.

Here's the subtree of our file tree example rooted at the testdir directory. We'll be referring to this subtree in our examples.

<img src = "pics/fileTreeRootedAtTestdir.png" width=600>

printFileTree function

The goal of the printFileTree function is to print each directory or file (one per line) encountered in a traversal of the tree that first "visits" a directory before visiting its contents. For example:

printFileTree('testdir')

should print

testdir
testdir/hyp.py
testdir/pset
testdir/pset/scene
testdir/pset/scene/cs1graphics.py
testdir/pset/shrub
testdir/pset/shrub/images
testdir/pset/shrub/images/shrub1.png
testdir/pset/shrub/images/shrub2.png
testdir/pset/shrub/shrub.py
testdir/pubs.json
testdir/remember
testdir/remember/ephemeral.py
testdir/remember/memories.txt
testdir/remember/persistent.py
testdir/tracks.csv

As indicated above, dot files should not be displayed by printFileTree.

We'll give the name printFileTreeBroken to our first attempt. Although it's close to correct, it doesn't work.

In [49]:
def printFileTreeBroken(root):    
    '''Print all directories and files (one per line) starting at root, 
    which itself is a directory or file name.
    '''
    if os.path.isfile(root):        
        print(root)        
    elif os.path.isdir(root):        
        print(root)
        # Note that we use a loop to call the recursive function on all
        # the files and folders in a directory. This combination of a 
        # loop with recursion is a common pattern in file tree traversals
        for fileOrDir in os.listdir(root):            
            printFileTreeBroken(fileOrDir)
    # There's an implicit "else: pass" here that does nothing in all other cases,
    # e.g., nonexistent files and file types other than regular files and directories.
In [50]:
# check how it works for current directory
printFileTreeBroken('.')
.
.data
.DS_Store
.ipynb_checkpoints
lec_fileTrees_solns.ipynb
pics
.DS_Store
testdir
.DS_Store
In [51]:
# check how it works for the first subdirectory
printFileTreeBroken('testdir') 
testdir
.DS_Store

Why doesn't printFileTreeBroken('testdir') print the contents of testdir?

Well, trace through the code. For example:

In [52]:
os.listdir('testdir')
Out[52]:
['.DS_Store',
 '.numbers',
 'hyp.py',
 'pset',
 'pubs.json',
 'remember',
 'tracks.csv']

Within the function body, we're performing an if/elif test, but both conditions evaluate to False for these names. Why?

In [53]:
os.path.isfile('hyp.py')
Out[53]:
False
In [54]:
os.path.isdir('pset')
Out[54]:
False
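The two False results above can be reproduced in isolation. The sketch below creates a throwaway folder with tempfile containing one file (the name only_in_root.txt is made up) and tests that name twice: once on its own and once joined to the folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    name = 'only_in_root.txt'              # made-up file name
    with open(os.path.join(root, name), 'w') as f:
        f.write('x')
    entries = os.listdir(root)             # bare names, no folder prefix
    # A bare name is looked up in the working directory, not in root
    bare = os.path.isfile(name)
    # Joining the name to root makes the lookup happen inside root
    joined = os.path.isfile(os.path.join(root, name))
print(entries, bare, joined)
```

As long as the working directory has no file with that name, bare is False while joined is True.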

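The problem is that os.listdir returns bare names, and printFileTreeBroken passes them to os.path.isfile and os.path.isdir without the enclosing directory, so they are looked up relative to the working directory instead of relative to root. Here's an improved version that joins each name to root, but it still has a problem. What problem does it still have?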

In [55]:
def printFileTreeBetter(root):
    """A step toward a better version of 'printFileTree'."""
    if os.path.isfile(root):
        print(root) 
    elif os.path.isdir(root):
        print(root) 
        for fileOrDir in os.listdir(root):
            printFileTreeBetter(os.path.join(root, fileOrDir)) 
            

printFileTreeBetter('testdir')
testdir
testdir/.DS_Store
testdir/.numbers
testdir/hyp.py
testdir/pset
testdir/pset/.DS_Store
testdir/pset/scene
testdir/pset/scene/.DS_Store
testdir/pset/scene/cs1graphics.py
testdir/pset/shrub
testdir/pset/shrub/.DS_Store
testdir/pset/shrub/images
testdir/pset/shrub/images/.DS_Store
testdir/pset/shrub/images/shrub1.png
testdir/pset/shrub/images/shrub2.png
testdir/pset/shrub/shrub.py
testdir/pubs.json
testdir/remember
testdir/remember/.DS_Store
testdir/remember/ephemeral.py
testdir/remember/memories.txt
testdir/remember/persistent.py
testdir/tracks.csv

Let's define a helper function to test for hidden files:

In [56]:
def isHiddenFile(path):
    '''Return True if the last component of path names a dot file.'''
    base = os.path.basename(path)
    return (len(base) > 0 
            and base[0] == '.'
            and base != '.' # '.' ("the current directory") is a special case
            and base != '..' # '..' ("the parent of the current directory") is a special case
           )
In [57]:
isHiddenFile('.DS_Store')
Out[57]:
True
In [58]:
isHiddenFile('testdir/psets/.DS_Store')
Out[58]:
True
In [59]:
isHiddenFile('testdir/psets/memories.txt')
Out[59]:
False
In [60]:
isHiddenFile('.')
Out[60]:
False

Now we can use isHiddenFile to filter out hidden files, leading to our final, correct version of printFileTree.

In [61]:
def printFileTree(root):
    '''Print all directories and files (one per line) starting at root, 
    which itself is a directory or file name.
    '''
    if isHiddenFile(root):
        pass # filter out dot files
    elif os.path.isfile(root):
        print(root)
    elif os.path.isdir(root):
        print(root)
        for fileOrDir in os.listdir(root):
            printFileTree(os.path.join(root, fileOrDir))
                
printFileTree('testdir')
testdir
testdir/hyp.py
testdir/pset
testdir/pset/scene
testdir/pset/scene/cs1graphics.py
testdir/pset/shrub
testdir/pset/shrub/images
testdir/pset/shrub/images/shrub1.png
testdir/pset/shrub/images/shrub2.png
testdir/pset/shrub/shrub.py
testdir/pubs.json
testdir/remember
testdir/remember/ephemeral.py
testdir/remember/memories.txt
testdir/remember/persistent.py
testdir/tracks.csv
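Printing is convenient for inspection, but a function that returns its results is easier to test. The following sketch is a hypothetical variant, not part of the lecture's solutions: listFileTree returns the visited paths as a list and sorts each directory listing so the order does not depend on the operating system. The isHiddenFile helper is repeated in shortened form to keep the block self-contained, and the demo builds a small throwaway tree with tempfile.

```python
import os
import tempfile

def isHiddenFile(path):
    '''Same test as the helper defined earlier, in shortened form.'''
    base = os.path.basename(path)
    return base.startswith('.') and base not in ('.', '..')

def listFileTree(root):
    '''Hypothetical variant of printFileTree: return the visited paths
    as a list, visiting directory entries in sorted order.'''
    if isHiddenFile(root):
        return []                          # filter out dot files
    elif os.path.isfile(root):
        return [root]
    elif os.path.isdir(root):
        paths = [root]                     # visit the directory first
        for name in sorted(os.listdir(root)):
            paths.extend(listFileTree(os.path.join(root, name)))
        return paths
    else:
        return []                          # nonexistent or special file

with tempfile.TemporaryDirectory() as scratch:
    os.makedirs(os.path.join(scratch, 'pset', 'scene'))
    for rel in ['.DS_Store', 'hyp.py',
                os.path.join('pset', 'scene', 'cs1graphics.py')]:
        with open(os.path.join(scratch, rel), 'w') as f:
            f.write('')
    # Show the paths relative to the scratch folder
    found = [os.path.relpath(p, scratch) for p in listFileTree(scratch)]
print(found)
```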

4. Exercise 1: printFiles

Starting with printFileTree, define a modified function printFiles that prints only the (nondirectory) files encountered in a file tree traversal starting at a given root. For example,

printFiles('testdir')

should print

testdir/hyp.py
testdir/pset/scene/cs1graphics.py
testdir/pset/shrub/images/shrub1.png
testdir/pset/shrub/images/shrub2.png
testdir/pset/shrub/shrub.py
testdir/pubs.json
testdir/remember/ephemeral.py
testdir/remember/memories.txt
testdir/remember/persistent.py
testdir/tracks.csv
In [62]:
def printFiles(root):
    '''Print only the (nondirectory) files encountered 
    in a file tree traversal starting at root.
    '''
    # Write your code here:
    if isHiddenFile(root):
        pass # filter out dot files
    elif os.path.isfile(root):
        print(root) 
    elif os.path.isdir(root):
        for fileOrDir in os.listdir(root):
            printFiles(os.path.join(root, fileOrDir))

# Test the printFiles function: 
printFiles('testdir')
testdir/hyp.py
testdir/pset/scene/cs1graphics.py
testdir/pset/shrub/images/shrub1.png
testdir/pset/shrub/images/shrub2.png
testdir/pset/shrub/shrub.py
testdir/pubs.json
testdir/remember/ephemeral.py
testdir/remember/memories.txt
testdir/remember/persistent.py
testdir/tracks.csv

5. Exercise 2: printDirs

Starting with printFileTree, define a modified function printDirs that prints only the directories encountered in a file tree traversal starting at a given root. For example,

printDirs('testdir')

should print

testdir
testdir/pset
testdir/pset/scene
testdir/pset/shrub
testdir/pset/shrub/images
testdir/remember
In [63]:
def printDirs(root):
    '''Print only the directories encountered in a 
    file tree traversal starting at root.
    '''
    # Write your code here:
    if isHiddenFile(root):
        pass # filter out dot files
    elif os.path.isdir(root):
        print(root) 
        for fileOrDir in os.listdir(root):
            printDirs(os.path.join(root, fileOrDir))

            
# Test the printDirs function: 
printDirs('testdir')
testdir
testdir/pset
testdir/pset/scene
testdir/pset/shrub
testdir/pset/shrub/images
testdir/remember

6. Exercise 3: countFiles

Starting with printFiles, define a modified function countFiles that returns the number of (nondirectory) files encountered in a file tree traversal starting at a given root. For example:

countFiles('testdir/pset/shrub/images/shrub1.png') => 1
countFiles('testdir/pset/shrub/images') => 2
countFiles('testdir/pset/shrub') => 3
countFiles('testdir/pset') => 4
countFiles('testdir') => 10
In [64]:
def countFiles(root):
    '''Returns the number of (nondirectory) files encountered 
    in a file tree traversal starting at root.
    '''
    # Write your code here:
    if isHiddenFile(root):
        return 0 # filter out dot files
    elif os.path.isfile(root):
        return 1 # count the file we're testing
    elif os.path.isdir(root):
        # # Solution using accumulation variable 
        # filesHere = 0 
        # for fileOrDir in os.listdir(root):
        #     filesHere += countFiles(os.path.join(root, fileOrDir))
        # return filesHere
        
        # Alternative solution using list comprehensions and sum
        return sum([countFiles(os.path.join(root, fileOrDir))
                    for fileOrDir in os.listdir(root)])
    
# Test the countFiles function: 
for f in ['testdir/pset/shrub/images/shrub1.png', 'testdir/pset/shrub/images', 
          'testdir/pset/shrub', 'testdir/pset', 'testdir']:
    print("countFiles('{}') => {}".format(f, countFiles(f)))
countFiles('testdir/pset/shrub/images/shrub1.png') => 1
countFiles('testdir/pset/shrub/images') => 2
countFiles('testdir/pset/shrub') => 3
countFiles('testdir/pset') => 4
countFiles('testdir') => 10
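A recursive count like this can be cross-checked against os.walk, a standard library function that performs the traversal for us and yields one (dirpath, dirnames, filenames) triple per directory. The sketch below is an illustration, not a replacement for the exercise: countFilesWalk is a hypothetical name, it assumes root is a directory, and the demo tree is built with tempfile using made-up names.

```python
import os
import tempfile

def countFilesWalk(root):
    '''Hypothetical os.walk-based count of the non-hidden files
    under the directory root.'''
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place keeps os.walk out of hidden folders
        dirnames[:] = [d for d in dirnames if not d.startswith('.')]
        total += len([f for f in filenames if not f.startswith('.')])
    return total

with tempfile.TemporaryDirectory() as scratch:
    os.makedirs(os.path.join(scratch, 'pset'))
    os.makedirs(os.path.join(scratch, '.hidden'))
    for rel in ['hyp.py', '.DS_Store',
                os.path.join('pset', 'shrub.py'),
                os.path.join('.hidden', 'secret.txt')]:
        with open(os.path.join(scratch, rel), 'w') as f:
            f.write('')
    count = countFilesWalk(scratch)
print(count)  # 2
```

Only hyp.py and pset/shrub.py are counted; the dot file and everything inside the hidden folder are skipped, matching the filtering done with isHiddenFile.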

7. Exercise 4: countDirs

Starting with printDirs, define a modified function countDirs that returns the number of directories encountered in a file tree traversal starting at a given root. For example:

countDirs('testdir/pset/shrub/images/shrub1.png') => 0
countDirs('testdir/pset/shrub/images') => 1
countDirs('testdir/pset/shrub') => 2
countDirs('testdir/pset') => 4
countDirs('testdir') => 6
In [65]:
def countDirs(root):
    '''Returns the number of directories encountered in a 
    file tree traversal starting at root.
    '''
    # Write your code here:
    if isHiddenFile(root):
        return 0 # filter out dot files
    elif os.path.isdir(root):
        # # Solution using accumulation variable 
        # dirsHere = 1 # count the root directory itself
        # for fileOrDir in os.listdir(root):
        #     dirsHere += countDirs(os.path.join(root, fileOrDir))
        # return dirsHere
        
        # Alternative solution using list comprehensions and sum
        return 1 + sum([countDirs(os.path.join(root, fileOrDir))
                    for fileOrDir in os.listdir(root)])
    else:
        return 0 # nondirectory files don’t count

    
# Test the countDirs function: 
for f in ['testdir/pset/shrub/images/shrub1.png', 'testdir/pset/shrub/images', 
          'testdir/pset/shrub', 'testdir/pset', 'testdir']:
    print("countDirs('{}') => {}".format(f, countDirs(f)))
countDirs('testdir/pset/shrub/images/shrub1.png') => 0
countDirs('testdir/pset/shrub/images') => 1
countDirs('testdir/pset/shrub') => 2
countDirs('testdir/pset') => 4
countDirs('testdir') => 6

8. Exercise 5 (Challenging): printLargestFiles

Define a function printLargestFiles that, for each folder in the tree, prints the largest file directly in that folder along with its size.
Hidden files are ignored, and for folders that contain no files an appropriate message is printed.

Here is the result for printLargestFiles('testdir'):

Biggest file in testdir is pubs.json with size = 107062 bytes.
No files to check in testdir/pset.
Biggest file in testdir/pset/scene is cs1graphics.py with size = 212018 bytes.
Biggest file in testdir/pset/shrub is shrub.py with size = 1842 bytes.
Biggest file in testdir/pset/shrub/images is shrub1.png with size = 27248 bytes.
Biggest file in testdir/remember is persistent.py with size = 1634 bytes.
In [66]:
def printLargestFiles(root):
    '''For every folder prints out the relative path of the folder, the
    name of the largest file, and its size. For folders that have no 
    files, prints out an appropriate message.
    '''
    # Write your code here:
    if isHiddenFile(root) or os.path.isfile(root):
        pass # skip dot files/folders; a lone file has nothing to report
    elif os.path.isdir(root):
        # get the list of everything in the directory 'root'
        everything = os.listdir(root)
        # keep only files, discard directories and dot files
        onlyFiles = [f for f in everything 
                     if os.path.isfile(os.path.join(root, f)) 
                        and not isHiddenFile(f)]
        # perform printing of results
        if len(onlyFiles) > 0:
            size, fName = max([(os.path.getsize(os.path.join(root, f)), f) for f in onlyFiles])
            print('Biggest file in {} is {} with size = {} bytes.'.format(root, fName, size))
        else:
            print("No files to check in {}.".format(root))
            
        # continue with recursion
        for fileOrDir in everything:
            printLargestFiles(os.path.join(root, fileOrDir))
            

# Test the printLargestFiles function:  
printLargestFiles('testdir')                
Biggest file in testdir is pubs.json with size = 107062 bytes.
No files to check in testdir/pset.
Biggest file in testdir/pset/scene is cs1graphics.py with size = 212018 bytes.
Biggest file in testdir/pset/shrub is shrub.py with size = 1842 bytes.
Biggest file in testdir/pset/shrub/images is shrub1.png with size = 27248 bytes.
Biggest file in testdir/remember is persistent.py with size = 1634 bytes.
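The solution above relies on the way Python compares tuples: max on a list of (size, name) pairs compares the sizes first and looks at the names only to break ties. Here is a small sketch using the sizes of the files in testdir/remember reported earlier; the tie example uses made-up names.

```python
# (size, name) pairs for the files in testdir/remember
sizes = [(379, 'ephemeral.py'), (80, 'memories.txt'), (1634, 'persistent.py')]
size, fName = max(sizes)   # the largest size wins
print(fName, size)         # persistent.py 1634

# When sizes are equal, the comparison falls through to the names
tie = max([(10, 'a.txt'), (10, 'b.txt')])
print(tie)                 # (10, 'b.txt')
```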