Lab 13: Working with Data
Summary
As usual, we'll work on this week's B notebook today.
Table of Contents
- Lab 13 Home
- Part 1: Exercises (start with B notebook)
- CS Research Overview
- Knowledge Check
Big Questions
- Why is a list of dictionaries a natural format for storing data?
There's no single best answer here, but at a basic level, a list of dictionaries mimics a spreadsheet: each list entry is a row, and the keys of each dictionary act as the labeled columns. In code, it's easy to iterate through the rows and then access multiple parts of each row by name. Why are spreadsheets so popular? They're a good way for non-programmers to record, inspect, and even analyze data, and they can represent a lot of different kinds of data in one common format. One caveat: although the list-of-dictionaries format is conceptually simple, it's computationally inefficient, since every row has to keep track of its own set of keys. Using a library like `pandas` for data processing would be better if you have a lot of data, but for small data sets, it probably won't matter.
- Why do we want to store data in files, as opposed to directly in Python variables?
The important thing about files is that they persist: they will still be there even after we quit the program or turn the computer off and on again. If we stored our data in Python variables, we'd have to keep the program running for as long as we wanted that data to stay around. Another advantage of files: you can send them in an email or put them on a USB drive and transfer them around, which you can't do with Python variables.

A related question is: why store data in CSV or JSON formats instead of just writing a `.py` file which, when run, defines the data structure we want? There are two big reasons for this. The first is interoperability: lots of programming languages and programs can read and/or write CSV and JSON files, but only someone who knows how to program in Python can interpret a Python data file, and programs written in other languages probably wouldn't be able to use it.

The second reason is security: if we ask you to run a Python file which defines our data, it's possible to sneak in some other Python commands that act as a virus. So if someone you don't trust sends you a `.py` file, it's probably best not to just run it without first looking inside and understanding what the code does. In contrast, CSV and JSON files aren't run as code when you load them: they just hold data. You might still want to be cautious about what data they contain, but it's much safer to open up and look at that data or load it into a processing program.
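
To make both answers concrete, here is a minimal sketch using only Python's built-in `json` and `csv` modules. The sample rows, the file names `people.json` and `people.csv`, and the helper functions `save_json`, `load_json`, `save_csv`, and `load_csv` are all made up for illustration; they are not from the lab notebook. The sketch builds a small list of dictionaries, reads values from each row by name, then saves the data to files and loads it back.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical sample data: each dictionary is one row, each key a column label.
rows = [
    {"name": "Ada", "year": 1815, "field": "mathematics"},
    {"name": "Grace", "year": 1906, "field": "computer science"},
]

# Iterate through the rows and access several parts of each row by name.
labels = [f"{row['name']} ({row['year']})" for row in rows]


def save_json(data, path):
    """Write a list of dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)


def load_json(path):
    """Read a JSON file back into Python lists and dictionaries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_csv(data, path):
    """Write a list of dictionaries to a CSV file, using the first row's keys as the header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)


def load_csv(path):
    """Read a CSV file into a list of dictionaries (one per row)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# A temporary folder keeps this demo tidy; a real program would save somewhere lasting.
with tempfile.TemporaryDirectory() as folder:
    json_path = Path(folder) / "people.json"
    csv_path = Path(folder) / "people.csv"
    save_json(rows, json_path)
    save_csv(rows, csv_path)
    from_json = load_json(json_path)
    from_csv = load_csv(csv_path)

print(labels)
print(from_json == rows)       # JSON keeps numbers as numbers
print(from_csv[0]["year"])     # CSV values come back as strings
```

One thing this sketch shows: JSON keeps the difference between numbers and strings, while every value read from a CSV file comes back as a string, so you'd need to convert `year` with `int()` yourself. In both cases, loading the file only reads data and never runs any code from it.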