Lesson 1.7: Working with data files
In this lesson, we will discuss the basics of working with data:
parsing text
opening and reading text files
writing text files
Parsing Strings to Lists
Separating text into words is a common programming task.
Individual words are called tokens.
The process of separating a line of text into individual tokens is called parsing text or tokenizing it.
The ability to tokenize a line of text depends on the choice of delimiter.
A delimiter is one or more characters that are used to separate a line of text into different tokens.
Separating text into words is so common that the functionality is built into strings via the split() method.
The default behavior of the split method is to use whitespace characters (space, tab, etc.) to separate text.
With default behavior, tokens are sequences of non-whitespace characters.
For example:
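A minimal sketch of the default behavior. The sentence is made up for illustration; the variable name listOfWords follows the one used later in this lesson.

```python
# Split a line of text on whitespace (the default delimiter).
sentence = "The quick brown fox jumps"   # illustrative text, not from the lesson
listOfWords = sentence.split()           # no argument: split on whitespace
print(listOfWords)   # ['The', 'quick', 'brown', 'fox', 'jumps']
```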
This is what is happening...
characters in the string are inspected one by one
delimiters separate the string into one or more sub-strings, which are placed in a list
the default delimiter is so-called "whitespace" characters (space, tab, etc.)
With the default whitespace delimiter, successive delimiters are treated as one, and leading and trailing whitespace is ignored.
The resulting listOfWords is the same in these three cases:
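A small sketch of three such cases. The three strings are illustrative assumptions: they hold the same words with different amounts and kinds of whitespace between them.

```python
# Three strings with the same words but different whitespace.
case1 = "one two three"
case2 = "one   two     three"        # several spaces in a row
case3 = "  one\ttwo \n three  "      # tabs, a newline, leading/trailing spaces

results = []
for text in (case1, case2, case3):
    listOfWords = text.split()       # default whitespace splitting
    results.append(listOfWords)
    print(listOfWords)               # ['one', 'two', 'three'] each time
```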
You can choose any delimiter you'd like...
Separators are whitespace characters unless otherwise specified.
To specify a different delimiter, simply give it as a string argument to split().
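A short sketch with explicit delimiters, using made-up strings. Note one difference from the default behavior: when a delimiter is given explicitly, successive delimiters are not merged, so an empty string appears between them.

```python
# Split on a comma instead of whitespace.
record = "apple,banana,,cherry"      # illustrative data with an empty field
fields = record.split(",")
print(fields)                        # ['apple', 'banana', '', 'cherry']

# A delimiter may be more than one character long.
parts = "a--b--c".split("--")
print(parts)                         # ['a', 'b', 'c']
```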
File IO: File Types
There are two types of files:
Text file (generic)
The content is saved in a text (ASCII) format
The content can be viewed by any text editor or word processor
Binary File (specific)
The content is saved in an encoded format, e.g., doc, jpeg, gif, etc.
The content is viewed by an application that recognizes the encoding, e.g., we can use any application that "understands" the Word doc format to view a Word doc file
In this course, we are going to restrict attention to text files only.
Simple Output Files: Redirecting Program Output
Perhaps the simplest way to create an output file is to "redirect" the normal output of a program (i.e., its print statements) into a file with a specific name.
From the (Unix) command line, the > operator writes program output to a file, for example:
    python myscript.py > outfile.txt
This is called "redirecting file output" and it has nothing to do with Python per se. When we are doing this, we are taking advantage of a feature of the operating system. Nonetheless, it's an easy and powerful way of creating an output file.
It's important to note that the > operator creates a new file if the named file (here, "outfile.txt") does not exist. If the file already exists, the previous file is overwritten by the new content.
From the (Unix) command line, the >> operator appends program output to a file (this can be used to run the program for different inputs or situations and collect all the results in one file), for example:
    python myscript.py >> outfile.txt
Each time the program myscript.py is called in this manner, its output is added to the end of the existing file outfile.txt.
Writing Directly to a File
We also have the ability for our Python program to write "directly" to a file. In order to do this, we must include Python commands to execute each of the following steps:
Open (or create if it does not exist) the file in write mode
Write the desired information
Close the file (data may be lost if the file is not closed)
Note that the file will be written to your current "working" directory.
Here is a simple example:
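A minimal sketch of the three steps (open, write, close). The file name sample1.txt comes from this lesson; the two lines of text are made up. Note that "w" mode overwrites the file if it already exists.

```python
# Step 1: open (or create) the file in write mode.
outfile = open("sample1.txt", "w")

# Step 2: write the desired information.
# write() does not add a newline, so we include "\n" ourselves.
outfile.write("This is the first line.\n")
outfile.write("This is the second line.\n")

# Step 3: close the file so the data is flushed to disk.
outfile.close()
```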
Having executed the code block above, we should now see a file named sample1.txt in the same directory as this notebook.
Reading Data Directly from a File
We can also create Python programs that read the contents of a file.
In order to do this, we must include Python commands to execute each of the following steps:
Open the file in read mode
Read the desired information
Close the file
Note that the file must already exist in your current "working" directory.
For example, we can use Python to read the file we just wrote.
Here are some simple file reading scripts:
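A minimal sketch that reads the whole file at once. So that it runs on its own, it first recreates sample1.txt with made-up contents; that setup step is an assumption, not part of the reading itself.

```python
# Setup: recreate the file so this sketch is self-contained.
outfile = open("sample1.txt", "w")
outfile.write("This is the first line.\nThis is the second line.\n")
outfile.close()

# Open in read mode, read everything into one string, then close.
infile = open("sample1.txt", "r")
contents = infile.read()
infile.close()

print(contents)
```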
In the block of code above, the read() method reads the entire file at once, which is typically not very useful in practice. Instead, we typically read a file line-by-line.
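One way to read line by line is to call readline() once per line, sketched below. The file contents are made up and recreated here so the example runs on its own.

```python
# Setup: recreate the file so this sketch is self-contained.
outfile = open("sample1.txt", "w")
outfile.write("This is the first line.\nThis is the second line.\n")
outfile.close()

infile = open("sample1.txt", "r")
line1 = infile.readline()   # first line, including its trailing "\n"
line2 = infile.readline()   # second line
line3 = infile.readline()   # empty string: there are no more lines
infile.close()

print(line1.strip())
print(line2.strip())
```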
What a pain. It would be much better to use a loop. As with all loops, there is more than one way to do this. Here's a common one:
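A sketch of the loop version: a file object can be used directly in a for loop, which yields one line per iteration. The file contents are made up and recreated here so the example runs on its own; the list name lines is a hypothetical choice.

```python
# Setup: recreate the file so this sketch is self-contained.
outfile = open("sample1.txt", "w")
outfile.write("This is the first line.\nThis is the second line.\n")
outfile.close()

lines = []
infile = open("sample1.txt", "r")
for line in infile:             # one line per pass through the loop
    line = line.rstrip("\n")    # drop the trailing newline
    lines.append(line)
    print(line)
infile.close()
```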
Comma Separated Value (CSV) data
One of the simplest data formats is known as Comma Separated Values or CSV data.
CSV is simply text in which commas separate individual values. By convention, such files use a ".csv" file extension to indicate that they have this format.
Many programs "know" how to read/write CSV data, including spreadsheet programs like Excel.
Often, the data in a spreadsheet can be converted to a Comma-Separated Value (.csv) format (for example, Microsoft Excel allows you to save spreadsheets as a .csv)
We can use what we've learned today to read and write from/to .csv files (you will probably need to do this often).
The examples below use the "CSVFile.csv" file you should have downloaded and have in the same directory as this .ipynb file.
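Since the contents of CSVFile.csv are not reproduced here, the sketch below builds a small stand-in file instead. The file name example_data.csv and its rows are hypothetical. It writes the rows with commas between values, then reads them back using split(",").

```python
# Hypothetical data: a header row followed by two records.
rows = [["name", "score"], ["Ada", "91"], ["Grace", "88"]]

# Write each row as one line of comma-separated text.
outfile = open("example_data.csv", "w")
for row in rows:
    outfile.write(",".join(row) + "\n")
outfile.close()

# Read the file back, splitting each line on commas.
table = []
infile = open("example_data.csv", "r")
for line in infile:
    fields = line.strip().split(",")   # strip() removes the newline
    table.append(fields)
infile.close()

print(table)
```

Every value comes back as a string, so numeric fields such as the score need to be converted (for example with int()) before doing arithmetic.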
The csv module
Reading and writing comma-separated value (CSV) data is so common that there is a Python module to make it easier. Check out https://docs.python.org/3/library/csv.html.
Key features of this module:
no need to split each line on comma (this happens automatically)
the module knows different "dialects" of CSV files (see the documentation for details)
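A minimal sketch using csv.writer and csv.reader. The file name example_quoted.csv and its rows are hypothetical. One value deliberately contains a comma, which a plain split(",") would break apart; the module quotes it on writing and restores it on reading.

```python
import csv

# Write two rows; the module adds quotes where a value contains a comma.
# newline="" is what the csv documentation recommends when opening files.
outfile = open("example_quoted.csv", "w", newline="")
writer = csv.writer(outfile)
writer.writerow(["name", "comment"])
writer.writerow(["Ada", "likes math, logic"])
outfile.close()

# Read the rows back; each row arrives as a list of strings.
infile = open("example_quoted.csv", "r", newline="")
reader = csv.reader(infile)
rows = []
for row in reader:
    rows.append(row)
infile.close()

print(rows)
```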