If you don't have Python, you can use ReplIt:
https://repl.it/languages/Python3
Let's write some text to disk!
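A minimal sketch of writing text to disk. The file name hello.txt and the temporary directory are assumptions made for illustration, not part of the original material.

```python
import tempfile
from pathlib import Path

# Hypothetical file inside a throwaway directory, so nothing important is overwritten.
workdir = Path(tempfile.mkdtemp())
path = workdir / "hello.txt"

f = open(path, "w", encoding="utf-8")      # "w" creates the file, or truncates an existing one
chars_written = f.write("Hello, disk!\n")  # write() returns the number of characters written
f.close()                                  # closing flushes buffered text to disk

print(chars_written)
print(path.read_text(encoding="utf-8"))
```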
There are a lot of fiddly bits to files. How do you look up the options and operations?
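One possible way to explore what a file object offers, sketched with the standard library only. Filtering out underscore-prefixed names is a choice made here for readability.

```python
import io

# dir() lists the attributes of an object; text files opened with open()
# are io.TextIOWrapper instances, so we can inspect that class directly.
file_methods = [name for name in dir(io.TextIOWrapper) if not name.startswith("_")]
print(file_methods)

# help(open) prints the full documentation for open(), including the mode flags.
# Here we only show the first line of the same docstring to keep the output short.
print(open.__doc__.splitlines()[0])

# In IPython or Jupyter, typing `open?` shows the same documentation.
```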
What do you expect to happen if you read after closing?
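A short sketch of what reading after closing looks like in CPython. The file name is hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch file.
path = Path(tempfile.mkdtemp()) / "closed.txt"
path.write_text("some text\n", encoding="utf-8")

f = open(path, encoding="utf-8")
f.close()

try:
    f.read()                 # the file object still exists, but it is no longer usable
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```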
What do you expect to happen if you open many files without closing them?
Python Context Managers help us avoid these sorts of mistakes
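A minimal sketch of the with statement applied to a file. The file is closed when the block ends, even if an exception is raised inside it. The file name is an assumption for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch file.
path = Path(tempfile.mkdtemp()) / "managed.txt"

with open(path, "w", encoding="utf-8") as f:
    f.write("line one\n")
    closed_inside = f.closed   # still open inside the block

closed_after = f.closed        # closed automatically on leaving the block

# The file is also closed when the block exits through an exception.
try:
    with open(path, encoding="utf-8") as g:
        raise RuntimeError("something went wrong mid-read")
except RuntimeError:
    closed_after_error = g.closed

print(closed_inside, closed_after, closed_after_error)
```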
What do you expect to happen if you read twice?
Files contain a position that advances as you read and write.
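The effect of the file position can be sketched as follows: a second read starts where the first one stopped, and seek(0) rewinds to the beginning. The file name and contents are made up for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch file.
path = Path(tempfile.mkdtemp()) / "position.txt"
path.write_text("abc\n", encoding="utf-8")

with open(path, encoding="utf-8") as f:
    first = f.read()    # reads everything and leaves the position at the end
    second = f.read()   # nothing left to read, so this is an empty string
    f.seek(0)           # move the position back to the start
    third = f.read()    # the full contents again

print(repr(first), repr(second), repr(third))
```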
How do you read a file line by line?
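One common answer, sketched under the assumption of a small text file: a file object is iterable, and iterating over it yields one line at a time, including the trailing newline.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch file with three lines.
path = Path(tempfile.mkdtemp()) / "lines.txt"
path.write_text("first\nsecond\nthird\n", encoding="utf-8")

lines = []
with open(path, encoding="utf-8") as f:
    for number, line in enumerate(f, start=1):
        # Each line keeps its newline, so strip it before storing.
        lines.append((number, line.rstrip("\n")))

print(lines)
```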
This is great! But it is kind of a blank slate.
How do we make good life choices when storing data?
JSON is a lightweight specification (15 pages) for text based data containing:
strings
booleans
integers
floats (including exponential notation)
nulls (called None in Python)
objects (called dictionaries in Python)
arrays (called lists in Python)
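The mapping in the list above can be checked with the standard json module. The record below is invented for illustration.

```python
import json

# Hypothetical JSON text covering each kind of value listed above.
text = (
    '{"name": "Ada", "active": true, "age": 36, '
    '"height": 1.7e0, "nickname": null, "tags": ["a", "b"]}'
)

data = json.loads(text)

# Map each key to the name of the Python type it was decoded into.
types = {key: type(value).__name__ for key, value in data.items()}
print(type(data).__name__)
print(types)
```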
What are these type things, and why do we need them?
CSV files don't have types
Bash (shell) scripts don't have types
Gene name errors are widespread in the scientific literature
Mark Ziemann, Yotam Eren and Assam El-Osta
https://doi.org/10.1186/s13059-016-1044-7
The spreadsheet software Microsoft Excel, when used with default settings, is known to convert gene names to dates and floating point numbers. A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.
Having ambiguous input/output multiplies the amount of work.
The Unix command to list files (ls) needs one flag (-1) for newline-separated output, and another (-m) for comma-separated output.
Lots of commands have a flag for every possible interpretation of concepts like list and number.
Poorly specified formats are hard to implement
Many CSV parsers fail when values contain commas or newlines
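A sketch of the problem, comparing a naive split on commas with Python's csv module, which handles quoting. The row contents are invented for illustration.

```python
import csv
import io

# Hypothetical row: one value contains a comma, another contains a newline.
row = ["SEPT2", "Smith, Jane", "two\nlines"]

buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)   # the writer quotes the awkward values
text = buffer.getvalue()

# A naive parser that splits on commas breaks the quoted value apart.
naive = text.strip().split(",")

# The csv reader understands the quoting and recovers the original row.
parsed = next(csv.reader(io.StringIO(text, newline="")))

print(naive)
print(parsed)
```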
At the lowest level, there is no difference between a string, integer, or float.
Types determine representation and behavior
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-af1cce258651> in <module>()
2 print('conc' + 'atination')
3 print(2 + 3)
----> 4 print('conc' + 3)
TypeError: must be str, not int
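A rough sketch of both halves of this idea. The same four bytes give different values depending on the type used to interpret them, and the + operator behaves differently for strings and integers. The byte values are chosen here purely for illustration.

```python
import struct

# The same four bytes, interpreted two ways.
raw = bytes([0x00, 0x00, 0x80, 0x3F])
as_int = int.from_bytes(raw, "little")   # read as a little-endian integer
as_float = struct.unpack("<f", raw)[0]   # read as a little-endian 32-bit float
print(as_int, as_float)

# The same operator, different behavior depending on type.
joined = "2" + "3"   # string concatenation
summed = 2 + 3       # integer addition

# Mixing the two is ambiguous, so Python refuses instead of guessing.
try:
    "conc" + 3
except TypeError as err:
    error_name = type(err).__name__

fixed = "conc" + str(3)   # an explicit conversion states the intent
print(joined, summed, error_name, fixed)
```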
List [ , , ]
Mutable sequence
Think 'examples of X'
Called an Array in JSON, represented the same way
Tuple ( , , )
Immutable sequence
Think 'group of associated data'
Also look up collections.namedtuple
No JSON representation
Dictionary { : , : }
Unique Key Value Pairs
Keys must be immutable
Think 'I want to look this up later'
Confusingly, called an object in JSON. Represented the same way.
Set { , , }
Unique Collection
No JSON representation
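The four collections and their JSON behavior can be sketched like this. Point is a hypothetical namedtuple and the values are invented. The standard json module writes tuples as arrays, so they come back as lists, and it refuses sets outright.

```python
import json
from collections import namedtuple

# Hypothetical record type: a tuple whose fields have names.
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

# Tuples (including namedtuples) are written as JSON arrays.
encoded = json.dumps({"point": p, "samples": [1, 2, 3]})
decoded = json.loads(encoded)   # the tuple comes back as a plain list

# Sets have no JSON representation, so encoding one fails.
try:
    json.dumps({"genes": {"SEPT2", "MARCH1"}})
except TypeError as err:
    set_error = str(err)

# Dictionary keys must be immutable (hashable); a list is not.
try:
    lookup = {[1, 2]: "value"}
except TypeError as err:
    key_error = str(err)

print(encoded)
print(decoded)
print(set_error)
print(key_error)
```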
Describe the difference between "123" and 123 in a few sentences.
Why doesn't JSON provide Set and Tuple collections?
Strings
Input with json.loads
Output with json.dumps
Files
Input with json.load
Output with json.dump
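A minimal sketch of the four functions side by side. The crew dictionary and the file name are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical data.
crew = {"people": [{"name": "Ada", "craft": "ISS"}], "number": 1}

# Strings: dumps() encodes to a str, loads() decodes from a str.
text = json.dumps(crew)
from_text = json.loads(text)

# Files: dump() writes to an open file, load() reads from an open file.
path = Path(tempfile.mkdtemp()) / "crew.json"
with open(path, "w", encoding="utf-8") as f:
    json.dump(crew, f, indent=2)
with open(path, encoding="utf-8") as f:
    from_file = json.load(f)

print(text)
print(from_file == crew)
```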
Add yourself to Space!
Text data is much more robust than binary data
Self documenting
Much much easier to debug
Easier to version
Cost doesn't matter for small data
Most data is small data
What Doesn't JSON have?
Comments
Dates
Use TOML/YAML instead
TOML vs YAML is a good argument for minimal data formats
YAML is 86 pages
TOML is comparable to JSON in size
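If you stay with JSON anyway, one common workaround for the missing date type is to store dates as ISO 8601 strings and convert them explicitly. This is a sketch of that convention, not something the JSON specification defines; the date and key name are invented.

```python
import json
from datetime import date

# Hypothetical record containing a date.
record = {"sampled": date(2016, 8, 23)}

# The json module has no encoding for date objects.
try:
    json.dumps(record)
except TypeError as err:
    date_error = str(err)

# Workaround: write the date as an ISO 8601 string...
encoded = json.dumps({"sampled": record["sampled"].isoformat()})

# ...and convert it back by hand after loading. The reader of the file
# has to know that this particular string is meant to be a date.
restored = date.fromisoformat(json.loads(encoded)["sampled"])

print(encoded)
print(restored)
```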
Loading YAML is a security risk by default
Lots of variation between parsers. Lots of incomplete implementation.
Things to watch out for:
Writing down something in JSON that can't be represented in the language.
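A few cases, sketched with Python's json module, where the text is accepted by the parser but the result depends on what the language can represent. Other languages and parsers may behave differently on the same input.

```python
import json
import math

# A number too large for a float silently becomes infinity.
huge = json.loads("1e999")

# Repeated keys cannot both live in a dict, so the last one wins.
duplicate = json.loads('{"a": 1, "a": 2}')

# Python integers are arbitrary precision, so this survives exactly here;
# a language that stores every number as a 64-bit float would round it.
big = json.loads("12345678901234567890123")

print(huge, duplicate, big)
```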
Things to watch out for:
Encoding/Decoding can get expensive
Takeaways!
Look up file handling options with help, dir, and ?
Avoid dangling files and other resources by using context managers (with statement)
Use Python types and collections to unambiguously represent your data
Look up David Beazley's talk "Built in Super Heroes" for more information on collections
To make your software more robust, use text data
Start with JSON, evolve as needed