Lesson 1.7: Working with data files

In this lesson, we will discuss the basics of working with data:

  • parsing text

  • opening and reading text files

  • writing text files

Parsing Strings to Lists

  • Separating text into words is a common programming task

    • Individual words are called tokens.

    • The process of separating a line of text into individual tokens is called parsing text or tokenizing it.

    • The ability to tokenize a line of text depends on the choice of delimiter.

    • A delimiter is one or more characters that are used to separate a line of text into different tokens.

  • Separating text into words is so common that the functionality is built into strings via the split() method

  • The default behavior of the split method is to use whitespace characters (space, tab, etc.) to separate text.

  • With default behavior, tokens are sequences of non-whitespace characters.

  • For example:

someText = "Read my lips!"
someText

listOfWords = someText.split()  # default is to parse on white space
listOfWords

type(listOfWords)

# we can iterate over the items in the list as usual...
for word in listOfWords:
    print(word)

This is what is happening...

  • characters in the string are inspected one by one

  • delimiters split the string into one or more sub-strings, which are placed in a list

  • the default delimiter is so-called "whitespace" characters (space, tab, etc.)

Successive delimiters are treated as one...

The resulting listOfWords is the same in these three cases:
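A minimal sketch of three such cases, using hypothetical strings chosen here for illustration: the strings differ only in the amount and kind of whitespace between the words, and split() with no argument should give the same list for each. The last line shows that this merging applies only to the default behavior; with an explicit delimiter, successive delimiters produce empty strings.

```python
# Three hypothetical strings that differ only in their whitespace
text_a = "Read my lips!"
text_b = "Read    my    lips!"
text_c = "  Read\tmy \t lips!\n"

# Default split(): runs of whitespace act as a single delimiter,
# and leading/trailing whitespace is ignored
words_a = text_a.split()
words_b = text_b.split()
words_c = text_c.split()

print(words_a)
print(words_b)
print(words_c)

# With an explicit delimiter, successive delimiters are NOT merged:
# the result contains empty strings between adjacent spaces
explicit = text_b.split(" ")
print(explicit)
```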

You can choose any delimiter you'd like...

  • Separators are whitespace characters unless otherwise specified.

  • To specify a different delimiter, simply give it as an argument to split()

someText

# Example 1
listOfWords = someText.split('ead')
listOfWords

# quick example to show that you can cascade splits and selections
mypath = "/foo/bar/biff/baz/file.png"
mypath.split("/")[-1].split(".")[0]
'file'
# Example 2
lineOfText = "one fish two fish red fish blue fish"

tokens = lineOfText.split()  # default is to parse on white space
for item in tokens:
    print("next item: '%s'" % item)
next item: 'one'
next item: 'fish'
next item: 'two'
next item: 'fish'
next item: 'red'
next item: 'fish'
next item: 'blue'
next item: 'fish'
tokens = lineOfText.split('h')  # now, parse on 'h'
for item in tokens:
    print("next item: '%s'" % item)  # take note of resulting whitespace
next item: 'one fis'
next item: ' two fis'
next item: ' red fis'
next item: ' blue fis'
next item: ''
tokens = lineOfText.split('fish')  # now, parse on 'fish'
for item in tokens:
    print("next item: '%s'" % item.strip())  # strip() removes the whitespace left around each token

File IO: File Types

There are two types of files:

  • Textfile (generic)

    • The content is saved as plain text (characters in an encoding such as ASCII or UTF-8)

    • The content can be viewed by any text editor or word processor

  • Binary File (specific)

    • The content is saved in an application-specific encoded format, e.g., doc, jpeg, or gif

    • The content is viewed by an application that recognizes the encoding, e.g., we can use any application that "understands" the Word doc format to view a Word doc file

In this course, we are going to restrict attention to text files only.

Simple Output Files: Redirecting Program Output

  • Perhaps the simplest way to create an output file is to "redirect" the normal output of a program (i.e., its print statements) into a file with a specific name.

  • From the (Unix) command line, the > operator writes program output to a file:

python myscript.py > outfile.txt

This is called "redirecting file output" and it has nothing to do with Python per se. When we are doing this, we are taking advantage of a feature of the operating system. Nonetheless, it's an easy and powerful way of creating an output file.

  • It's important to note that the > operator creates a new file if the named file (here, "outfile.txt") does not exist. If the file already exists, the previous file is overwritten by the new content.

  • From the (Unix) command line, the >> operator appends program output to a file (this can be used to run the program computation for different inputs/situations and get one file containing all results back).

python myscript.py >> outfile.txt

Each time the program myscript.py is called in this manner, its output is added to the end of the existing file outfile.txt.

Writing Directly to a File

We also have the ability for our Python program to write "directly" to a file. In order to do this, we must include Python commands to execute each of the following steps:

  • Open (or create if it does not exist) the file in write mode

  • Write the desired information

  • Close the file (buffered data may be lost if the file is not closed)

Note that the file will be written to your current "working" directory.

Here is a simple example:

# open a file named "sample1.txt" in 'write' mode
myFile = open("sample1.txt", "w")  # the "w" indicates to open in write mode

# write (save) three lines of text
myFile.write("This is line 1.\n")  # note the use of \n to end the line with a newline
myFile.write("And here's line 2.\n")
myFile.write("Finally comes line 3.\n")

# close the file
myFile.close()  # buffered data may be lost if the file is not closed

Having executed the code block above, we should now see a file named sample1.txt in the same directory as this notebook.
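The open/write/close steps above can also be sketched with Python's with statement, which closes the file automatically when the block ends, even if an error occurs. The sketch below also shows append mode ("a"), which adds to the end of an existing file much as >> does on the command line. The file name sample2.txt and the temporary directory are assumptions made for illustration so that no existing file is touched.

```python
import os
import tempfile

# Work in a temporary directory (an assumption for this sketch)
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "sample2.txt")  # hypothetical file name

# "w" creates the file, or overwrites it if it already exists
with open(path, "w") as out_file:
    out_file.write("This is line 1.\n")
    out_file.write("And here's line 2.\n")
# the file is closed automatically here; no explicit close() is needed

# "a" appends to the end of the existing file instead of overwriting it
with open(path, "a") as out_file:
    out_file.write("Finally comes line 3.\n")

# read the file back to confirm that all three lines are present
with open(path, "r") as in_file:
    contents = in_file.read()
print(contents)
```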

Reading Data Directly from a File

We can also create Python programs that read the contents of a file.

In order to do this, we must include Python commands to execute each of the following steps:

  • Open the file in read mode

  • Read the desired information

  • Close the file (this releases the file so that other programs can use it)

Note that the file must already exist in your current "working" directory.

For example, we can use Python to read the file we just wrote.

Here are some simple file reading scripts:

# Reading Files Example 1:
# open the file in 'read' mode
myFile = open("sample1.txt", "r")  # the "r" indicates to open in read mode
whole_file = myFile.read()  # this reads the ENTIRE FILE into the variable
print(whole_file)  # to print the whole file
myFile.close()

In the block of code above, the read() method reads the entire file into a single string, which is often inconvenient in practice, especially for large files. Instead, we typically read a file line-by-line.

# Reading Files Example 2:
# open the file in 'read' mode
myFile = open("sample1.txt", "r")  # the "r" indicates to open in read mode
first_line = myFile.readline()  # this reads one line of the file into the variable
print(first_line)  # to print the contents
second_line = myFile.readline()  # this reads one line of the file into the variable
print(second_line)  # to print the contents
third_line = myFile.readline()  # this reads one line of the file into the variable
print(third_line)  # to print the contents
myFile.close()  # close the file

What a pain. It would be much better to use a loop. As with all loops, there is more than one way to do this. Here's a common one:

# Reading Files Example 3:
# open the file in 'read' mode
myFile = open("sample1.txt", "r")  # the "r" indicates to open in read mode
for line in myFile:  # the for-loop: a nice way to iterate over the lines in a file
    print(line)
myFile.close()  # close the file
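Each line read from a file still ends with its newline character, so printing it directly leaves a blank line between lines. A minimal sketch of one way to handle this is shown below: strip the trailing newline before using the line, and collect the cleaned lines in a list. To keep the sketch self-contained, it first writes its own copy of the sample file into a temporary directory; that setup and the variable names are assumptions for illustration.

```python
import os
import tempfile

# Setup (an assumption for this sketch): create a small file to read
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "sample1.txt")
with open(path, "w") as out_file:
    out_file.write("This is line 1.\n")
    out_file.write("And here's line 2.\n")
    out_file.write("Finally comes line 3.\n")

lines = []
with open(path, "r") as in_file:   # the with-block closes the file for us
    for line in in_file:           # iterate over the file one line at a time
        clean = line.rstrip("\n")  # remove the trailing newline character
        lines.append(clean)
        print(clean)
```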

Comma Separated Value (CSV) data

  • One of the simplest data formats is known as Comma Separated Values or CSV data.

  • CSV is simply text with commas, used to separate individual values. However, the convention is to use a ".csv" file extension to indicate that the file has this format.

  • Many programs "know" how to read/write CSV data, including spreadsheet programs like Excel.

  • Often, the data in a spreadsheet can be converted to a Comma-Separated Value (.csv) format (for example, Microsoft Excel allows you to save spreadsheets as a .csv)

  • We can use what we've learned today to read and write from/to .csv files (you will probably need to do this often).

  • The examples below use the "CSVFile.csv" file, which you should have downloaded and placed in the same directory as this .ipynb file.

# To read a .csv file
target = open("CSVFile.csv", "r")  # open the target file in read mode
my_data = []  # create an empty list
for line in target:  # run a loop over each line of the target file
    line = line.strip()  # this strips any leading/trailing whitespace, including the newline
    my_data.append(line.split(','))  # this splits the line into a list and appends that list to the my_data list
# note that all of the data returned is of type string!
target.close()  # close the target file
my_data
my_data.append(['E', '17', '18', '19', '20'])  # let's add another line - note everything is a string
my_data
# To write a .csv file
f = open('CSVFile.csv', 'w')  # open the file and refer to it as "f"
for sublist in my_data:  # loop over rows
    # join the columns with commas as the delimiter; writing item + ','
    # for every item would leave an extra comma at the end of each line
    f.write(','.join(sublist))
    f.write('\n')  # this writes a newline character at the end of the line
f.close()
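Because every value read from a CSV file arrives as a string, numeric columns must be converted before doing arithmetic. The following is a minimal sketch, assuming each row holds a text label followed by integer values; the two sample rows and the names converted and row_sums are made up for illustration.

```python
# Rows as they might come back from a file: every field is a string
my_data = [['A', '1', '2', '3', '4'],
           ['B', '5', '6', '7', '8']]

# Assumed layout: first column is a label, remaining columns are integers
converted = []
for row in my_data:
    label = row[0]                             # keep the label as a string
    numbers = [int(item) for item in row[1:]]  # convert the rest to int
    converted.append([label] + numbers)

print(converted)

# Once converted, the values can be used in calculations
row_sums = {row[0]: sum(row[1:]) for row in converted}
print(row_sums)
```

If a field might not be a valid integer, int() raises a ValueError, so real data may need a try/except around the conversion.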

The csv module

Reading and writing comma-separated value (CSV) data is so common that there is a Python module to make it easier. Check out https://docs.python.org/3/library/csv.html.

Key features of this module:

  • no need to split each line on comma (this happens automatically)

  • the module knows different "dialects" of CSV files (see the documentation for details)

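### Read data into list
import csv

# this is a real shortcut to reading everything into a list-of-lists...
with open('CSVFile.csv', 'r', newline='') as f:  # newline='' is recommended by the csv documentation
    reader = csv.reader(f)
    my_data = list(reader)
# no f.close() is needed: the with-block closes the file automatically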
my_data
[['A', '1', '2', '3', '4'], ['B', '5', '6', '7', '8'], ['C', '9', '10', '11', '12'], ['D', '13', '14', '15', '16']]
my_data.append(['F', '21', '22', '23', '24'])  # let's add another line - note everything is a string
my_data
[['A', '1', '2', '3', '4'], ['B', '5', '6', '7', '8'], ['C', '9', '10', '11', '12'], ['D', '13', '14', '15', '16'], ['F', '21', '22', '23', '24']]
with open('CSVFile.csv', 'w', newline='') as csvfile:  # newline='' avoids extra blank lines on some platforms
    writer = csv.writer(csvfile)  # pass additional parameters as appropriate
    for row in my_data:
        writer.writerow(row)
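One practical reason to prefer the csv module over splitting on commas by hand is that it handles fields which themselves contain a comma, by quoting them on output and unquoting them on input. The sketch below illustrates this under assumed sample data; the file name demo.csv, the rows, and the temporary directory are hypothetical.

```python
import csv
import os
import tempfile

# Work in a temporary directory (an assumption for this sketch)
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.csv")  # hypothetical file name

# Made-up rows; one field contains a comma
rows = [['name', 'note'],
        ['A', 'contains, a comma'],
        ['B', 'plain']]

# csv.writer quotes the field that contains a comma
with open(path, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(rows)

# csv.reader undoes the quoting, so the rows come back unchanged
with open(path, 'r', newline='') as csvfile:
    round_trip = list(csv.reader(csvfile))

# Splitting by hand breaks the quoted field into two pieces
with open(path, 'r') as f:
    naive = [line.strip().split(',') for line in f]

print(round_trip)
print(naive)
```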