Project: NUWC_U_Python_May_2021

Path: Day_1_Python_Basics / 1.7_parsingtext_fileIO.ipynb

Kernel: Python 3 (system-wide)

Lesson 1.7: Working with data files

In this lesson, we will discuss the basics of working with data:

parsing text
opening and reading text files
writing text files

Parsing Strings to Lists

Separating text into words is a common programming task.
- Individual words are called tokens.
- The process of separating a line of text into individual tokens is called parsing or tokenizing the text.
- The ability to tokenize a line of text depends on the choice of delimiter.
- A delimiter is one or more characters that are used to separate a line of text into different tokens.
Separating text into words is so common that the functionality is built into strings via the split() method.
The default behavior of the split() method is to use whitespace characters (space, tab, newline, etc.) to separate text.
With default behavior, tokens are sequences of non-whitespace characters.
For example:

In [0]:

someText = "Read my lips!"
someText

In [0]:

listOfWords = someText.split()   # default is to parse on white space
listOfWords

In [0]:

type(listOfWords)

In [0]:

# we can iterate over the items in the list as usual...
for word in listOfWords:
    print(word)

This is what is happening...

characters in the string are inspected one by one
delimiters separate the string into one or more substrings, which are collected in a list
the default delimiter is so-called "whitespace" characters (space, tab, etc.)

With the default (whitespace) delimiter, successive delimiters are treated as one...

The resulting listOfWords is the same in these three cases:
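
The three cases are not reproduced in this copy of the notebook. As a minimal sketch, assuming three hypothetical strings that differ only in the amount and kind of whitespace between the words, the default split() returns the same list each time. The last line shows the contrast: when an explicit delimiter such as " " is passed, successive delimiters are not merged.

```python
# Three hypothetical inputs (not from the original lesson) that differ only in whitespace
case1 = "Read my lips!"
case2 = "Read   my     lips!"
case3 = "Read\tmy \n lips!"

for text in (case1, case2, case3):
    print(text.split())          # ['Read', 'my', 'lips!'] in all three cases

# With an explicit delimiter, successive delimiters are NOT merged:
explicit = case2.split(" ")
print(explicit)                  # empty strings appear between adjacent spaces
```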

You can choose any delimiter you'd like...

Separators are whitespace characters unless otherwise specified.
To specify a different delimiter, simply give it as an argument to split()

In [0]:

someText

In [0]:

# Example 1

listOfWords = someText.split('ead')

listOfWords

In [4]:

# quick example to show that you can chain splits and indexing

mypath = "/foo/bar/biff/baz/file.png"
mypath.split("/")[-1].split(".")[0]

'file'

In [1]:

# Example 2

lineOfText = "one fish two fish red fish   blue fish"

In [2]:

tokens = lineOfText.split()   # default is to parse on white space

# default parsing
for item in tokens:
    print("next item: '%s'" % item)

next item: 'one'
next item: 'fish'
next item: 'two'
next item: 'fish'
next item: 'red'
next item: 'fish'
next item: 'blue'
next item: 'fish'

In [3]:

tokens = lineOfText.split('h')   # now, parse on 'h'

# parsing on 'h' (not the default)
for item in tokens:
    print("next item: '%s'" % item)    # take note of resulting whitespace

next item: 'one fis'
next item: ' two fis'
next item: ' red fis'
next item: '   blue fis'
next item: ''

In [0]:

tokens = lineOfText.split('fish')   # now, parse on 'fish'

for item in tokens:
    print("next item: '%s'" % item.strip())   # strip() removes the whitespace left around each token

File IO: File Types

There are two types of files:

Textfile (generic)
- The content is saved as plain text (e.g., ASCII or UTF-8)
- The content can be viewed by any text editor or word processor
Binary File (specific)
- The content is saved in an encoded format, e.g., doc, jpeg, gif, etc.
- The content is viewed by an application that recognizes the encoding, e.g., we can use any application that "understands" the Word doc format to view a Word doc file

In this course, we are going to restrict attention to text files only.

Simple Output Files: Redirecting Program Output

Perhaps the simplest way to create an output file is to "redirect" the normal output of a program (i.e., its print statements) into a file with a specific name.
From the (Unix) command line, the > operator writes program output to a file:

python myscript.py > outfile.txt

This is called "redirecting output" and it has nothing to do with Python per se. When we do this, we are taking advantage of a feature of the operating system's command shell. Nonetheless, it's an easy and powerful way of creating an output file.

It's important to note that the > operator creates a new file if the named file (here, "outfile.txt") does not exist. If the file already exists, the previous file is overwritten by the new content.
From the (Unix) command line, the >> operator appends program output to a file (this can be used to run the program computation for different inputs/situations and get one file containing all results back).

python myscript.py >> outfile.txt

Each time the program myscript.py is called in this manner, its output is added to the end of the existing file outfile.txt.
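
The lesson does not show myscript.py itself. As a minimal sketch, assuming a hypothetical script whose only job is to print a few results (the names format_result and main are illustrative, not from the lesson), anything it sends to print() is what ends up in outfile.txt when the script is run with > or >>:

```python
# myscript.py (hypothetical contents; the lesson does not show this script)

def format_result(n):
    # Build one line of output for the input n
    return "%d squared is %d" % (n, n * n)

def main():
    # Everything passed to print() goes to standard output,
    # which the shell redirects into the file named after > or >>.
    for n in range(1, 4):
        print(format_result(n))

if __name__ == "__main__":
    main()
```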

Writing Directly to a File

A Python program can also write "directly" to a file. In order to do this, we must include Python commands to execute each of the following steps:

Open (or create if it does not exist) the file in write mode
Write the desired information
Close the file (buffered data may be lost if the file is not closed)

Note that the file will be written to your current "working" directory.

Here is a simple example:

In [0]:

#open a file named "sample1.txt" in 'write' mode
myFile = open("sample1.txt", "w")         # the "w" indicates to open in write mode

#write (save) three lines of text
myFile.write("This is line 1.\n")         # note the use of \n to end the line with a newline character
myFile.write("And here's line 2.\n")
myFile.write("Finally comes line 3.\n")

#close the file
myFile.close() # buffered data may be lost if the file is not closed

Having executed the code block above, we should now see a file named sample1.txt in the same directory as this notebook.
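
Because forgetting close() is an easy mistake, the same steps can also be written with a with statement (the form used in the csv section of this lesson), which closes the file automatically. The sketch below is illustrative only: "sample2.txt" is a hypothetical file name, and the second block uses append mode ("a"), the in-Python counterpart of the >> operator described earlier.

```python
# "sample2.txt" is a hypothetical file name used only for this sketch
with open("sample2.txt", "w") as myFile:      # "w" creates the file or overwrites it
    myFile.write("This is line 1.\n")
    myFile.write("And here's line 2.\n")
# the file is closed automatically when the with-block ends

with open("sample2.txt", "a") as myFile:      # "a" appends to the end of the file
    myFile.write("Finally comes line 3.\n")

print(myFile.closed)                          # True
```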

Reading Data Directly from a File

We can also create Python programs that read the contents of a file.

In order to do this, we must include Python commands to execute each of the following steps:

Open the file in read mode
Read the desired information
Close the file

Note that the file must already exist in your current "working" directory.
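
If the file does not exist, open() in read mode raises FileNotFoundError. A minimal sketch of handling that case, assuming a hypothetical file name that is not present in the working directory (the variable opened is also just for illustration):

```python
# "no_such_file_12345.txt" is a hypothetical name for a file assumed not to exist
try:
    myFile = open("no_such_file_12345.txt", "r")
except FileNotFoundError as err:
    print("Could not open the file:", err)
    opened = False
else:
    myFile.close()
    opened = True
```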

For example, we can use Python to read the file we just wrote.

Here are some simple file reading scripts:

In [0]:

# Reading Files Example 1: 

#open the file in 'read' mode
myFile = open("sample1.txt", "r")     # the "r" indicates to open in read mode

whole_file = myFile.read()            # this reads the ENTIRE FILE into the variable

print(whole_file)                     # to print the whole file

myFile.close()

In the block of code above, the read() method reads the entire file into a single string, which is often inconvenient in practice (and can use a lot of memory for large files). Instead, we typically read a file line-by-line.

In [0]:

# Reading Files Example 2: 

#open the file in 'read' mode
myFile = open("sample1.txt", "r")     # the "r" indicates to open in read mode

first_line = myFile.readline()        # this reads one line of the file into the variable
print(first_line)                     # to print the contents

second_line = myFile.readline()       # this reads one line of the file into the variable
print(second_line)                    # to print the contents

third_line = myFile.readline()        # this reads one line of the file into the variable
print(third_line)                     # to print the contents

myFile.close()                        #close the file

What a pain. It would be much better to use a loop. As with all loops, there is more than one way to do this. Here's a common one:

In [0]:

# Reading Files Example 3: 

#open the file in 'read' mode
myFile = open("sample1.txt", "r")   # the "r" indicates to open in read mode

for line in myFile:                 # the for-loop: a nice way to iterate over the lines in a file
    print(line)                     # each line keeps its trailing "\n", so print() leaves a blank line between lines

myFile.close()                      #close the file
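
Each line read from a file keeps its trailing newline character. The sketch below removes it with rstrip() and collects the lines in a list, using a with statement so that the file is closed automatically. It first writes its own small file so that it can run on its own; "sample_lines.txt" and the variable lines are hypothetical names, not part of the lesson.

```python
# Create a small file first so the sketch is self-contained
# ("sample_lines.txt" is a hypothetical name, not part of the lesson)
with open("sample_lines.txt", "w") as f:
    f.write("This is line 1.\nAnd here's line 2.\nFinally comes line 3.\n")

lines = []
with open("sample_lines.txt", "r") as myFile:
    for line in myFile:
        line = line.rstrip("\n")    # drop the newline that ends each line
        lines.append(line)
        print(line)                 # no blank line between printed lines
```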

Comma Separated Value (CSV) data

One of the simplest data formats is known as Comma Separated Values or CSV data.
CSV is simply text in which commas separate the individual values. By convention, a ".csv" file extension indicates that the file has this format.
Many programs "know" how to read/write CSV data, including spreadsheet programs like Excel.
Often, the data in a spreadsheet can be converted to a Comma-Separated Value (.csv) format (for example, Microsoft Excel allows you to save spreadsheets as .csv files).
We can use what we've learned today to read from and write to .csv files (you will probably need to do this often).
The examples below use the "CSVFile.csv" file, which you should have downloaded and placed in the same directory as this .ipynb file.

In [0]:

# To read a .csv file

target = open("CSVFile.csv", "r")    # open the target file in read mode

my_data = []                         # create an empty list
for line in target:                  # run a loop over each line of the target file
    line = line.strip()              # this strips leading/trailing whitespace, including the newline at the end of the line
    my_data.append(line.split(','))  # this splits the line into a list and appends that list to the my_data list

target.close()                       # close the target file

my_data                              # note that all of the data returned is of type string!

In [0]:

my_data
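
Since every value comes back as a string, numeric columns usually need to be converted before they can be used in calculations. A minimal sketch, assuming rows shaped like those displayed later in this lesson (a text label followed by integers); the names rows and converted are illustrative only:

```python
# Rows shaped like the lesson's data: every value is a string
rows = [['A', '1', '2', '3', '4'],
        ['B', '5', '6', '7', '8']]

# Keep the first column as a label and convert the remaining columns to int
converted = []
for row in rows:
    label = row[0]
    numbers = [int(value) for value in row[1:]]
    converted.append([label] + numbers)

print(converted)   # [['A', 1, 2, 3, 4], ['B', 5, 6, 7, 8]]
```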

In [0]:

my_data.append(['E', '17', '18', '19', '20'])     # let's add another line - note everything is a string
my_data

In [0]:

# To write a .csv file

f = open('CSVFile.csv','w')                   # open the file and refer to it as "f"
for sublist in my_data:                      # loop over rows
    f.write(','.join(sublist) + '\n')        # join the items with commas (no trailing comma) and end the row with a newline

f.close()

The `csv` module

Reading and writing comma-separated value (CSV) data is so common that there is a Python module to make it easier. Check out https://docs.python.org/3/library/csv.html.

Key features of this module:

no need to split each line on comma (this happens automatically)
the module knows different "dialects" of CSV files (see the documentation for details)
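
One case where the module helps is a quoted field that itself contains a comma, which a plain split(',') breaks apart. The sketch below uses made-up data and io.StringIO (only so that no file is needed) to compare the two approaches, and then passes the delimiter keyword argument to csv.reader to read tab-separated text.

```python
import csv
import io

# Hypothetical data: the first field contains a comma, so it is quoted
text = '"Smith, John",42\n"Doe, Jane",37\n'

naive = [line.split(',') for line in text.splitlines()]
parsed = list(csv.reader(io.StringIO(text)))

print(naive[0])    # ['"Smith', ' John"', '42']
print(parsed[0])   # ['Smith, John', '42']

# A different delimiter can be passed as a keyword argument
tab_text = "A\t1\t2\nB\t3\t4\n"
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter='\t'))
print(tab_rows)    # [['A', '1', '2'], ['B', '3', '4']]
```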

In [5]:

### Read data into list

import csv

# this is a real shortcut to reading everything into a list-of-lists...
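with open('CSVFile.csv', 'r', newline='') as f:   # newline='' is recommended by the csv documentation
    reader = csv.reader(f)
    my_data = list(reader)
# no f.close() is needed: the with-block closes the file automatically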

In [6]:

my_data

[['A', '1', '2', '3', '4'],
 ['B', '5', '6', '7', '8'],
 ['C', '9', '10', '11', '12'],
 ['D', '13', '14', '15', '16']]

In [7]:

my_data.append(['F', '21', '22', '23', '24'])     # let's add another line - note everything is a string
my_data

[['A', '1', '2', '3', '4'],
 ['B', '5', '6', '7', '8'],
 ['C', '9', '10', '11', '12'],
 ['D', '13', '14', '15', '16'],
 ['F', '21', '22', '23', '24']]

In [0]:

with open('CSVFile.csv', 'w', newline='') as csvfile:   # newline='' avoids extra blank lines on some platforms
    writer = csv.writer(csvfile)  # pass additional parameters as appropriate
    for row in my_data:
        writer.writerow(row)
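
As a quick check that writing and reading are consistent, the sketch below writes a few rows and reads them back. It is illustrative only: the sample rows and the file name "roundtrip_demo.csv" are hypothetical, and writerows() is used to write all rows in a single call instead of looping.

```python
import csv

rows = [['A', '1', '2'], ['B', '3', '4']]      # hypothetical sample rows

# "roundtrip_demo.csv" is a hypothetical file name used only for this sketch
with open('roundtrip_demo.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(rows)        # write all rows in one call

with open('roundtrip_demo.csv', 'r', newline='') as csvfile:
    read_back = list(csv.reader(csvfile))

print(read_back == rows)                       # True
```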