# If you don't have Python, you can use Repl.it:

# https://repl.it/languages/Python3

## Let's write some text to disk!

In [1]:
f = open('input.txt', mode='w')
f.write("Hello World!")
f.close()

f = open('input.txt', mode='r')
print(f.read())
f.close()

Hello World!


## Files have a lot of fiddly bits. How do you look up the options and operations?

In [2]:
help(open)
# ? in ipython

Help on built-in function open in module io:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise IOError upon failure.
    
    file is either a text or byte string giving the name (and the path
    if the file isn't in the current working directory) of the file to
    be opened or an integer file descriptor of the file to be
    wrapped. (If a file descriptor is given, it is closed when the
    returned I/O object is closed, unless closefd is set to False.)
    
    mode is an optional string that specifies the mode in which the file
    is opened. It defaults to 'r' which means open for reading in text
    mode.  Other common values are 'w' for writing (truncating the file if
    it already exists), 'x' for creating and writing to a new file, and
    'a' for appending (which on some Unix systems, means that all writes
    append to the end of the file regardless of the current seek position

In [3]:
dir(f)
# ?f.read* in ipython

['_CHUNK_SIZE',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__next__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_checkClosed',
 '_checkReadable',
 '_checkSeekable',
 '_checkWritable',
 '_finalizing',
 'buffer',
 'close',
 'closed',
 'detach',
 'encoding',
 'errors',
 'fileno',
 'flush',
 'isatty',
 'line_buffering',
 'mode',
 'name',
 'newlines',
 'read',
 'readable',
 'readline',
 'readlines',
 'seek',
 'seekable',
 'tell',
 'truncate',
 'writable',
 'write',
 'writelines']

## What do you expect to happen if you read after closing?
## What do you expect to happen if you open many files without closing them?
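
One way to check the first question is to try it. The sketch below is only an illustration: the scratch directory and the `error_message` variable are assumptions added here, not part of the original cells. The second question is not demonstrated; operating systems generally cap the number of files a process may hold open at once.

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch your own files.
path = os.path.join(tempfile.mkdtemp(), 'input.txt')

f = open(path, mode='w')
f.write("Hello World!")
f.close()

f = open(path, mode='r')
f.close()

try:
    f.read()  # the file object is already closed
    error_message = None
except ValueError as err:
    # Reading from a closed file raises ValueError.
    error_message = str(err)

print(error_message)
```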

## Python Context Managers help us avoid these sorts of mistakes

In [4]:
with open('input.txt') as f:
    print(f.read())
    
print(f.closed)

Hello World!
True


## What do you expect to happen if you read twice?

In [5]:
with open('input.txt') as f:
    print(f.read())
    print(f.read())


Hello World!



## Files contain a position that advances as you read and write.
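
A minimal sketch of that position, using the `tell` and `seek` methods listed by `dir(f)` earlier. The scratch path and variable names are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch your own files.
path = os.path.join(tempfile.mkdtemp(), 'input.txt')

with open(path, 'w') as f:
    f.write("Hello World!")

with open(path) as f:
    start = f.tell()    # position before reading anything
    first = f.read()    # reads to the end of the file
    second = f.read()   # nothing left, so this is an empty string
    f.seek(0)           # move the position back to the beginning
    third = f.read()    # the whole file again

print(repr(first), repr(second), repr(third))
```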

## How do you read a file line by line?

In [6]:
coconuts = 'coconuts.txt'
with open(coconuts, 'w') as f:
    f.write("I've got a lovely bunch of coconuts\n")
    f.write("There they are, all standing in a row\n")
    f.write("Big ones, small ones, some as big as your head\n")
    f.write("Give them a twist a flick of the wrist\n")
    f.write("That's what the showman said")
    
with open(coconuts, 'r') as f:
    for line in f:
        print(line)

I've got a lovely bunch of coconuts

There they are, all standing in a row

Big ones, small ones, some as big as your head

Give them a twist a flick of the wrist

That's what the showman said
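
The blank lines in that output appear because each `line` keeps its trailing newline and `print` adds another. A small sketch of one way to avoid this, assuming a scratch file and a shortened copy of the lyrics:

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch your own files.
path = os.path.join(tempfile.mkdtemp(), 'coconuts.txt')

lyrics = [
    "I've got a lovely bunch of coconuts",
    "There they are, all standing in a row",
]

with open(path, 'w') as f:
    f.write('\n'.join(lyrics))

with open(path, 'r') as f:
    # Strip the trailing newline that each line carries.
    cleaned = [line.rstrip('\n') for line in f]

for line in cleaned:
    print(line)
```

Passing `end=''` to `print` would be another option.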


## This is great!  But it is kind of a blank slate.

## How do we make good life choices when storing data?

## JSON is a lightweight specification (15 pages) for text-based data containing:
* <span style="color:blue">strings</span>
* <span style="color:blue">booleans</span>
* <span style="color:blue">integers</span>
* <span style="color:blue">floats</span> (including exponential notation)
* <span style="color:blue">nulls</span> (called None in Python)
* <span style="color:red">Objects</span> (called dictionaries in Python)
* <span style="color:red">Arrays</span> (called lists in Python)

# What are these type things, and why do we need them?
* csv files don't have types
* Bash (shell) scripts don't have types

## Gene name errors are widespread in the scientific literature

Mark Ziemann, Yotam Eren and Assam El-Osta

https://doi.org/10.1186/s13059-016-1044-7
    
>The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating point numbers.  A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.

## Having ambiguous input/output multiplies the amount of work.

The Unix command to list files (`ls`) needs one flag (`-1`) for newline-separated output and another (`-m`) for comma-separated output.

Lots of commands have a flag for every possible interpretation of concepts like list and number.

## Poorly specified formats are hard to implement
## Many csv parsers fail when values contain commas or newlines
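
To make the failure concrete, here is a hedged sketch comparing a naive split on commas with Python's standard `csv` module. The sample rows are invented for illustration.

```python
import csv
import io

# Invented sample data: one value contains a comma and a newline.
rows = [
    ['gene', 'note'],
    ['SEPT2', 'renamed, see\nnotes'],
]

# Write the rows with the csv module, which quotes awkward values.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()

# A naive parser: split into lines, then split each line on commas.
naive_fields = text.splitlines()[1].split(',')

# The csv module understands the quoting and recovers the original rows.
parsed = list(csv.reader(io.StringIO(text, newline='')))

print(naive_fields)
print(parsed)
```

The naive version splits the second record in the wrong places, while `csv.reader` returns the rows unchanged.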

## At the lowest level, there is no difference between a string, integer, or float.

In [16]:
import struct

s = 'abcd'
b = bytes('abcd', 'utf-8')
print('string:', s)
print('integers:', list(ord(c) for c in s))
print('float:', struct.unpack('f', b))

string: abcd
integers: [97, 98, 99, 100]
float: (1.6777999408082104e+22,)


## Types determine representation and behavior

In [7]:
# The '+' operation varies wildly depending on the relevant type
print('conc' + 'atenation')
print(2 + 3)
print('conc' + 3)

concatenation
5


TypeError: must be str, not int

## List: [ , , ]

* Mutable sequence
* Think 'examples of X'
* Called an Array in JSON, represented the same way

In [10]:
vals = ['1', 1]
vals.append(1.0)
vals += ['one']
print(vals[3])

one


## Tuple    ( , , )

* Immutable sequence
* Think 'group of associated data'
* Also look up collections.namedtuple
* No JSON representation

In [11]:
attributes = ('some', 'associated', 'data')
val1, val2, val3 = attributes

print(attributes[2])

data
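
For the `collections.namedtuple` pointer above, a minimal sketch. The `Astronaut` type and its field names are hypothetical, chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical record type: a tuple whose positions also have names.
Astronaut = namedtuple('Astronaut', ['craft', 'name'])

person = Astronaut(craft='ISS', name='Oleg Artemyev')

# It still unpacks and indexes like a plain tuple.
craft, name = person
print(person[1], person.name)

# It is still immutable: assigning to a field raises AttributeError.
try:
    person.name = 'someone else'
    is_immutable = False
except AttributeError:
    is_immutable = True

print(is_immutable)
```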


## Dictionary    { : , : }

* Unique Key Value Pairs
* Keys must be immutable
* Think 'I want to look this up later'
* Confusingly, called an object in JSON.  Represented the same way.

In [12]:
d = {'name': 'Jared', 'date': '2018-10-01'}
print(d['name'])
del d['name']
print(d)

Jared
{'date': '2018-10-01'}


## Set    { , , }

* Unique Collection
* No JSON representation

In [13]:
s = {'one', 'two'}
s.add('two')
s.remove('one')
print(s)

{'two'}


## Describe the difference between "123" and 123 in a few sentences.

## Why doesn't JSON provide Set and Tuple collections?
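
As a starting point for that question, this sketch shows what Python's `json` module does with each collection. Converting the set to a sorted list is one possible workaround, not the only one.

```python
import json

# A tuple is written as a JSON array, so it comes back as a list.
encoded = json.dumps(('some', 'associated', 'data'))
decoded = json.loads(encoded)
print(decoded)

# A set has no JSON counterpart, so json.dumps refuses it.
try:
    json.dumps({'one', 'two'})
    set_error = None
except TypeError as err:
    set_error = str(err)
print(set_error)

# One workaround: store the members as a sorted list.
workaround = json.dumps(sorted({'one', 'two'}))
print(workaround)
```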

## Let's load some JSON data!

People in space -- Nathan Bergey

http://api.open-notify.org/astros.json

In [14]:
import json
from pprint import pprint

in_space = """
{
    "message": "success", 
    "people": [
        {"craft": "ISS", "name": "Oleg Artemyev"}, 
        {"craft": "ISS", "name": "Andrew Feustel"}, 
        {"craft": "ISS", "name": "Richard Arnold"}, 
        {"craft": "ISS", "name": "Sergey Prokopyev"}, 
        {"craft": "ISS", "name": "Alexander Gerst"}, 
        {"craft": "ISS", "name": "Serena Aunon-Chancellor"}
     ], 
     "number": 6
}
"""

pprint(json.loads(in_space))

{'message': 'success',
 'number': 6,
 'people': [{'craft': 'ISS', 'name': 'Oleg Artemyev'},
            {'craft': 'ISS', 'name': 'Andrew Feustel'},
            {'craft': 'ISS', 'name': 'Richard Arnold'},
            {'craft': 'ISS', 'name': 'Sergey Prokopyev'},
            {'craft': 'ISS', 'name': 'Alexander Gerst'},
            {'craft': 'ISS', 'name': 'Serena Aunon-Chancellor'}]}


## Strings
* Input with json.loads
* Output with json.dumps

## Files
* Input with json.load
* Output with json.dump
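
A small sketch of the file variants, combined with the context managers from earlier. The scratch path and the shortened record are assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch your own files.
path = os.path.join(tempfile.mkdtemp(), 'astros.json')

data = {
    "message": "success",
    "people": [{"craft": "ISS", "name": "Oleg Artemyev"}],
    "number": 1,
}

# json.dump writes to an open file object.
with open(path, 'w') as f:
    json.dump(data, f, indent=4)

# json.load reads from an open file object.
with open(path) as f:
    loaded = json.load(f)

print(loaded == data)
```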

# Add yourself to Space!

## Text data is much more robust than binary data
* Self documenting
* Much much easier to debug
* Easier to version
* Cost doesn't matter for small data
* Most data is small data

## What doesn't JSON have?
* Comments
* Dates
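
If you stay with JSON, one common convention (an assumption here, not something the JSON specification defines) is to store dates as ISO 8601 strings and convert them yourself. The `encode_date` helper is hypothetical.

```python
import json
from datetime import date, datetime


def encode_date(value):
    """Hypothetical helper: turn date objects into ISO 8601 strings."""
    if isinstance(value, date):
        return value.isoformat()
    raise TypeError('Cannot serialize %r' % (value,))


record = {'name': 'Jared', 'date': date(2018, 10, 1)}

# json.dumps calls `default` for objects it cannot serialize itself.
encoded = json.dumps(record, default=encode_date)
print(encoded)

# Reading it back gives a plain string; converting is up to the reader.
decoded = json.loads(encoded)
restored = datetime.strptime(decoded['date'], '%Y-%m-%d').date()
print(restored)
```

The type information is lost in the file itself, so both writer and reader have to agree on the convention.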

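## Use TOML/YAML Instead

## TOML vs YAML is a good argument for minimal data formats
* YAML is 86 pages
* TOML is comparable to JSON in size
* Loading YAML is a security risk by default (for example, PyYAML's `yaml.load` can construct arbitrary Python objects unless a safe loader is used)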
* Lots of variation between parsers.  Lots of incomplete implementation.

## Things to watch out for: 


### Writing down something in JSON that can't be represented in the language.

In [15]:
l = json.loads('{"number": 1.6000000000000000000001}')
print(l)

{'number': 1.6}
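
If the extra digits matter, one option is to ask `json.loads` to build `decimal.Decimal` values instead of floats through its `parse_float` parameter. A minimal sketch:

```python
import json
from decimal import Decimal

text = '{"number": 1.6000000000000000000001}'

# Default behavior: the value is rounded to the nearest float.
as_float = json.loads(text)

# parse_float receives the original digits as a string.
as_decimal = json.loads(text, parse_float=Decimal)

print(as_float['number'])
print(as_decimal['number'])
```

Other languages reading the same file may still round the value, so this only helps on the Python side.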


## Things to watch out for:

### Encoding/Decoding can get expensive
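
A rough way to see the cost is to time it with the standard `timeit` module. The payload below is invented, and the numbers depend entirely on your machine, so treat this as a sketch for measuring rather than a benchmark.

```python
import json
import timeit

# Invented payload: 1000 small records shaped like the astronaut data.
payload = {
    'people': [{'craft': 'ISS', 'name': 'person %d' % i} for i in range(1000)]
}
encoded = json.dumps(payload)

# Time 100 rounds of encoding and 100 rounds of decoding.
encode_seconds = timeit.timeit(lambda: json.dumps(payload), number=100)
decode_seconds = timeit.timeit(lambda: json.loads(encoded), number=100)

print('encode: %.4f s' % encode_seconds)
print('decode: %.4f s' % decode_seconds)
```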

# Takeaways!
* Look up file handling options with help, dir, and ?
* Avoid dangling files and other resources by using context managers (`with` statement)
* Use Python types and collections to unambiguously represent your data
* Look up David Beazley's talk "Built in Super Heroes" for more information on collections
* To make your software more robust, use text data
* Start with JSON, evolve as needed

In [0]:
License joke -- Douglas Crockford
"The Software shall be used for Good, not Evil."
"I give permission for IBM, its customers, partners, and minions, to use JSLint for evil."