class: middle, center
Making open source data analysis software easily available and more collaborative
William Stein
SageMath, Inc. and University of Washington
August 24, 2018 at Center of Mathematical Sciences and Applications, 2018 Big Data Conference
???
Thank the organizers.
Remind people to interrupt me with questions at any time.
class: middle
Contents
§0. Background
§1. Demo
§2. Architecture
§3. Sync
§4. Functionality
§5. Culture
§6. Questions
class: middle, center
§0. Background
class: middle, center
I spent the last 6 years creating CoCalc...
(NOTE: there are several other repos not shown here)
Who am I? William Stein --
Started SageMath: in 2004 -- huge open source research-level mathematics Python library, with over 600 contributors, and a million lines of code.
Started Cython: I came up with the name "Cython" and launched the project as a fork of "Pyrex". The potential I saw in Pyrex was critical to choosing Python as the implementation language of Sage.
Started Sage Notebook: in 2006 -- first serious mature browser-based web notebook.
Mathematician: Berkeley Ph.D., faculty at Harvard (2000-2005), UCSD (2005-2006), Univ of Washington (2006-present). Full professor. Published 3 books and over 40 research papers.
CEO, SageMath, Inc: On leave from UW for the last two years to work fulltime on CoCalc.
class: middle, center
§1. Demo
Live Demo of https://CoCalc.com
If I'm going to suggest you use CoCalc when teaching your classes, I better put my arse on the line and do a live demo right now on the production website! So here we go...
Create a new project on https://cocalc.com
Jupyter notebook; select a kernel, then use assistant
Terminal: Jupyter console.
Chat on the side of the Jupyter notebook
Add a collaborator, and edit notebook
TimeTravel
Project snapshots
Sage worksheet
LaTeX file, RTex, RMarkdown
class: middle, center
§2. Architecture
The tech stack
Node.js: JavaScript on the server; highly async; the same code runs in the browser and the project!
PostgreSQL: database; we make heavy use of LISTEN/NOTIFY
Python 3: used for dozens of scripts that control things
React.js/Redux: the browser client heavily uses this; it provides a very functional and reactive approach to user interface implementation.
TypeScript/CoffeeScript: we used CoffeeScript heavily 2013-2017 for a more Python-looking approach to JavaScript, and to avoid the bad parts. JavaScript got way, way better, and now TypeScript is vastly superior, and we're halfway through switching. The Jupyter code in CoCalc is already fully in TypeScript.
Ways to Use CoCalc...
cocalc-docker -- 100% free and open source; run CoCalc on your own computer:
https://CoCalc.com -- main production site, regularly hosts well over 1000 simultaneous running projects. Uses Kubernetes extensively.
Install directly -- Not well supported yet, except from within a CoCalc project itself.
--
Big Data?
cocalc.com results in a lot of user data every day:
Challenge: lots of data to store, backup, etc.
Opportunity: we store every state of every document anybody works with.
Cleverly use it to make CoCalc better? improve error messages, suggest useful code snippets
But that is not what today's talk is about.
class: middle, center
§3. Sync
Realtime Synchronized Editing for Every Document Type
First implementation - differential sync (2013-2016)
I implemented the Differential Synchronization algorithm by Neil Fraser
Problems in practice, at scale.
--
Second implementation - revision log (2016-now)
Definitions:
Patch log: sequence of triples (time, user, patch), with distinct (time, user).
State: current state of the document is simply the result of applying all patches in the patch log in order, where patches are applied on a "best effort basis" (by definition, no merge conflicts).
Algorithm:
Time: sync clock with a central server (a few seconds accuracy is enough).
Edit and send: When user changes doc, broadcast (time, user, patch) describing their change.
Receive and update: Clients receive patches, integrate into the patch log, and document is updated.
TimeTravel: Get the version of the document at time t0 by applying all patches up to time t0.
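A minimal sketch of the revision-log idea in Python, not CoCalc's actual code. The names `Entry`, `PatchLog`, and `apply_patch` are hypothetical, and a patch is simplified here to an `(old, new)` substring replacement as a stand-in for a real diff/patch format. It shows the three pieces above: a log ordered by distinct (time, user), best-effort patch application, and TimeTravel as "apply everything up to t0".

```python
import bisect
from dataclasses import dataclass, field


@dataclass(frozen=True, order=True)
class Entry:
    time: float  # timestamp from the roughly synchronized clock
    user: str    # breaks ties, so (time, user) is distinct
    patch: tuple = field(compare=False)  # assumed format: (old, new)


def apply_patch(doc, patch):
    """Best-effort apply: replace the first occurrence of `old` by `new`.

    If `old` is no longer in the document, the patch is skipped instead
    of raising, so there are no merge conflicts by definition.
    """
    old, new = patch
    if old == "":
        return doc + new  # treat an empty `old` as an append
    if old in doc:
        return doc.replace(old, new, 1)
    return doc


class PatchLog:
    def __init__(self):
        self.entries = []  # always kept sorted by (time, user)

    def add(self, entry):
        """Integrate a received patch, even if it arrives out of order."""
        i = bisect.bisect_left(self.entries, entry)
        if i < len(self.entries) and self.entries[i] == entry:
            return  # same (time, user) already in the log
        self.entries.insert(i, entry)

    def state(self, t0=float("inf")):
        """Document at time t0 (TimeTravel); default is the current state."""
        doc = ""
        for e in self.entries:
            if e.time > t0:
                break
            doc = apply_patch(doc, e.patch)
        return doc


log = PatchLog()
log.add(Entry(1.0, "alice", ("", "hello world")))
log.add(Entry(3.0, "alice", ("world", "there")))
log.add(Entry(2.0, "bob", ("hello", "Hello")))  # arrives late, sorted into place
print(log.state())     # Hello there
print(log.state(2.5))  # Hello world
```

Because every client sorts the same set of patches the same way, all clients converge to the same document once they have received the same patches, regardless of arrival order.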
class: middle, center
§4. Functionality
Jupyter Notebooks, LaTeX, Terminals and much more!
Browser-Based Code Notebooks
Sage Notebook: In 2006, Alex Clemesha, Tom Boothby, and I created The Sage Notebook
first serious browser-based notebook, inspired by Mathematica and Macsyma notebooks and IPython's command line.
heavily developed until maybe 2010, by Tim Dumol, Jason Grout, and others.
Jupyter: In 2011, the IPython project launched a notebook that looked much like sagenb, but was rewritten from scratch using much more modern tools:
Renamed to "Jupyter notebook", to be much more inclusive,
Fantastic grant funding (due to hard work of F. Perez and many others),
Issues with Jupyter due to lack of full backend state.
Close your browser while running a computation, and lose output.
Multiple clients opening the same notebook at once.
Large images and output embedded in ipynb file.
???
Point out that closing your browser is the same thing as a network interruption (do your network connections ever temporarily fail?).
Jupyter in CoCalc
There are (at least) three completely different ways to use Jupyter notebooks in CoCalc.
1. Classical with realtime sync
Jupyter embedded in an iframe, then heavily monkey patched:
Factor out large images
Add realtime sync
TimeTravel for browsing past states of the notebook
A nightmare:
Large output, data, and images: Univ of Sheffield using R with Jupyter.
Subtle sync issues are impossible to avoid
Each new version of Jupyter can break it in subtle ways
Still loses output if connection interrupted.
Jupyter in CoCalc
1. Classical with realtime sync
2. CoCalc Jupyter: React.js rewrite
I got fed up with monkey patching classical Jupyter, since it was just too hackish.
In 2017, I reimplemented the entire Jupyter stack (except kernels), both frontend and back:
Maintained almost the same look and feel as classical Jupyter (unlike Google Colaboratory).
There are a ton of features when you try to implement them all!
This was a couple of months of hard slog.
CoCalc Jupyter has full knowledge of notebook state on the backend:
Large images served directly over http (not in the document)
No lost output when the network is flaky or the user refreshes their browser
Large output is buffered on the backend and can be obtained by explicit request
Shares some components and code with nteract (from Kyle Kelley at Netflix).
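A rough sketch of the "buffer large output on the backend" idea, in Python for illustration only. The class `OutputBuffer`, its message-count `limit`, and the method names are all assumptions made for this example, not CoCalc's API. The point is that the backend owns the full output, so the synced document only carries the first few messages and the rest are served on explicit request.

```python
class OutputBuffer:
    """Hypothetical backend-side holder for one cell's output."""

    def __init__(self, limit=3):
        self.limit = limit      # assumed cap on messages kept in the document
        self.in_document = []   # synced to every client
        self.overflow = []      # kept on the backend only

    def push(self, message):
        """Called for each output message the kernel produces."""
        if len(self.in_document) < self.limit:
            self.in_document.append(message)
        else:
            self.overflow.append(message)

    def summary(self):
        """What a client can display: how much is shown, how much is held back."""
        return {"shown": len(self.in_document), "more": len(self.overflow)}

    def fetch_more(self):
        """Explicit request from a client for the buffered remainder."""
        extra, self.overflow = self.overflow, []
        return extra


buf = OutputBuffer(limit=2)
for i in range(5):
    buf.push(f"line {i}")
print(buf.summary())     # {'shown': 2, 'more': 3}
print(buf.fetch_more())  # ['line 2', 'line 3', 'line 4']
```

Since the buffer lives on the backend, nothing here depends on a browser staying connected while the computation runs.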
Jupyter in CoCalc
1. Classical with sync (...)
2. CoCalc Jupyter (...)
3. Plain Classical Jupyter
Port forwarding and base URLs...
One click to start a classic server from a project, which only you and your collabs can access.
Fallback just in case (e.g., extension support)
Can also run JupyterLab this way.
Sage Worksheets in CoCalc
Weakness of Sage and Jupyter notebooks: editing that involves multiple cells can be awkward.
In 2013, I was designing a collaborative way to use Sage from CoCalc.
I was worried about the performance of hundreds of CodeMirror editors at once.
I created "Sage worksheets" which are a single CodeMirror editor with output "CodeMirror Widgets".
Rather than "cells" (lots of separate editors), you fully leverage using a single document:
multiple cursors
find/replace
range selection copy/paste (with any amount of input/output)
code folding
The current implementation is still a little bit flaky since it uses a single string, rather than a db-doc, for the state; also it does not use React.js. I'm doing a rewrite to fix this soon.
???
In Sage Notebook we used textareas (so no syntax highlighting, etc.) for performance reasons.
Collaborative IDE
Collaboratively do:
Sage development
JavaScript dev -- I've done all CoCalc development in CoCalc on a Chromebook since 2013!
CoCalc has a new tiled code editor built on CodeMirror
Arbitrarily many simultaneous views on a document, like in Emacs (say).
Syntax highlighting, code folding, auto formatting, color themes, etc.
Work in progress: support VS Code's "language server protocol".
LaTeX editor
You can write collaborative research papers using the LaTeX editor.
Code editor with extra compilation and pdf preview functionality.
Special code for dealing with multiple users compiling simultaneously
You can run arbitrary code/scripts and use data as part of writing your paper, unlike any other web-based LaTeX editor.
SageTeX is fully supported: easily run Python code in your document.
Command line terminal
CoCalc projects are Ubuntu 18.04 Docker containers.
I started with term.js and added color themes, support for "open filename", different char sets, etc.
Each terminal session has a corresponding .term file, so:
Multiple people can open the same terminal
Chat on the side of terminal
Click rocket ship to edit custom script that is run when terminal starts.
Type
open filename
to open a file from the terminal.
Chatrooms
You can chat on the side of any file
Chat messages are markdown, with math typesetting
Chatrooms: bigger, and have a rendered preview.
Anybody can edit any past message!
Chat notifications in upper right
Course Management
CoCalc has a full course management system, allowing an instructor to easily:
Create projects for all students,
Send assignments to students. An assignment is any directory of files (e.g., Jupyter notebook, LaTeX document, etc.)
Watch students in realtime while they are working on assignments, and chat with them.
Collect assignments
Grade and return assignments
See the full history of how all of their students did the assignments.
Peer grading
R
1. Jupyter Kernel
2. Command Line
3. RMarkdown (Knitr)
4. RTex (Knitr)
class: middle, center
§5. Culture
People
The Team
Current active code contributors: John Jeng, Harald Schilly, Hal Snyder, William Stein, Travis Scholl
Past significant contributors: Greg Bard, Rob Beezer, Keith Clawson, Tim Clemans, Andy Huchala, Jon Lee, Simon Luu, Nicholas Ruhland, Todd Zimmerman
Advisers/Investors: ask me
Rich diversity of users
College and high school teachers
Researchers from around the world
Wide range of disciplines (not just math!)
Commercial versus Academic
Sage and Jupyter are not-for-profit academic projects. CoCalc is a commercial project. WHY?
I hosted "the online pari/Magma calculator" and "the modular forms calculator" at Harvard 2000-2005.
I developed, maintained and hosted sagenb.org (the Sage Notebook) 2007--2014, and learned a lot about the challenges of hosting something like this at a University:
Attacks by hackers
Malware
Periodically getting our internet connection automatically cut by the University
Legal: violations by users expose me to liability.
I started CoCalc as a commercial project, not an academic one:
Generate money to support Sage development (instead, CoCalc has lost about $500K)
Be sustainable (not depend on grants)
Be fulltime: I work much better when I focus fulltime, and grants only reduce my teaching very little.
I wasn't getting grants anymore
Conversation with Fernando Perez about how "Project Jupyter" made the opposite decision...
I'm still faculty at UW, but have been on unpaid leave for 2 years.
???
Maybe mention that when I finally talked with the higher ups at UW about running sagenb, they said "absolutely no way".
class: middle, center