Making open source mathematical software easily available on the web

William Stein

SageMath, Inc. and University of Washington

July 24, 2018 at ICMS

Audio Recording of Talk

???

Thank the organizers.
Remind people to interrupt me with questions at any time.

This talk is about what I spent the last 6 years on...

§1. A live demo...

??? (15 minutes)

Open existing project on cocalc.com
Jupyter notebook: factor 2018
Terminal: run tmux, Maxima (factor 2018), htop
LaTeX: create a document and factor 2018 using SageTeX.
Chat on the side of a file
Add a collaborator
Show TimeTravel
Encourage people in audience to play around in this project, invite nearby people...
We will return to demo at end of talk.

§2. Architecture

CoCalc's three architectures

1. Insecure monolith (one account)

Server is a single monolithic node.js process

Security? None
Isolation? None, since everything runs as the same UNIX account.

Most CoCalc development: We run this monolithic app in a CoCalc project, and do most CoCalc development from inside CoCalc
Wishlist:
wrap this as an electron app
make this a simple npm install cocalc
get this into Linux distros...?

2. Small number of services (one account per project)

Separate Linux account for each project
HAProxy, Nginx, several node.js services to create projects, etc.
No use of Docker (but used by CoCalc-docker container)

Isolation? yes, different user per project, and cgroups.
Security? yes if on multiple VM's; none on cocalc-docker

Main production site: April 2013 - July 2017.
- Total nightmare: no autoscaling, no good health checks, painful upgrades and versions
- Basically, Harald Schilly and I had to manually keep it running, scale, etc.

???

say something about CoCalc Docker and encourage its use!

3. Kubernetes (one Docker container per project)

Kubernetes 1.11, Docker, Google Compute Engine (cloud storage, load balancers, node provisioning)
Scalable:
- Supports large number of simultaneous users
- Autoscaling of host VM's (many subtle details...)
- Sophisticated tiered storage:
  - Each project has its own ZFS pool
  - ZFS pools are thinly provisioned, distributed, network mounted
  - ZFS provides data integrity, compression, dedup, archive streams.
Main production site: August 2017 - now.

Distributed systems are hard (for me)

Implementing the Kubernetes based version of CoCalc and getting this to work well at scale is the most difficult thing I have ever done in my life!!!

Harder for me than getting Sage off the ground (which was also very hard!), because...

One bug deep in Linux or Docker meant real-world consequences, e.g., hundreds of students can't take an exam, very angry emails from professors (all while my company is bleeding money).

I had to do almost everything, due to lack of money. (Harald Schilly was the only one who helped.)

Stress => sleep deprivation, breathing problems.
Came close to just giving up.
I still haven't fully recovered physically...

... but at least CoCalc's scalable backend works really well overall!

???

Mention cancelling going to IMA sage days in 2017...

CoCalc's tech stack

Node.js: JavaScript on the server; highly async; the same code runs in the browser and the project!
PostgreSQL: database; we make heavy use of LISTEN/NOTIFY
Python3: used for dozens of scripts to control things
React.js/Redux: the browser client heavily uses this; it provides a very functional and reactive approach to user interface implementation.
TypeScript/CoffeeScript: used CoffeeScript heavily 2013-2017 for a more Python-looking approach to JavaScript, and to avoid the bad parts. JavaScript got way, way better, and now TypeScript is vastly superior, and we're halfway through switching.
Kubernetes: for the scalable cloud deployment.

§3. Realtime Synchronized Editing

Sync: 1st implementation - differential sync (2013)

I implemented the Differential Synchronization algorithm by Neil Fraser
Uses Google's diff-match-patch for Javascript diffs/patches.
Ran this in production April 2013-early 2016.
Huge problems in practice:
- Only implemented it for strings (built everything else we needed on strings via hackery)
- Involves each client syncing with a backend server.
  - Browsers sync'd with hubs; hubs sync'd with project...
  - Complexity: subtle horrible problems involving having to do O(n²) operations on the backend...
  - Brittle: sync would not robustly work if the project wasn't actively involved...
  - The algorithm is confusing and difficult (for me!) to think about. Bad abstractions.

???

try to describe the algorithm quickly in words:
- "sort of like git commit; git pull; git push" by every client every few seconds.
- server and client both do lots of diffs, which are expensive.
edge case of diff taking minutes, which was shifting
the "terminate after trying for a few seconds" diff functionality was documented, but not actually implemented...!

Sync: 2nd implementation - distributed revlog (2016)

Time travel: inspiration!
- Jonathan Lee did a GSoC project to provide a "TimeTravel slider" like hackpad, which shows all versions of a file
- We had to record all the diffs as a file is edited
- I started using React.js, which got me into functional reactive programming... and RethinkDB, a database that pushed out realtime changes to clients.
- Learned about how online multiplayer games sync state
- All this together suggested a different approach to sync...
I switched CoCalc to RethinkDB (from Cassandra) with lots of use of Changefeeds. This made things like "change your project's title" update in realtime for all users. (We eventually switched to LISTEN/NOTIFY in PostgreSQL, since RethinkDB failed.)
I replaced the differential sync algorithm by a new one based on a revision log.
Fixed many very subtle bugs over the last 3 years. (No known sync bugs today.)
Also implemented sync for structured objects, with very efficient update of strings fields. Needed for Jupyter notebooks, Sage worksheets, Todo lists, Chatrooms, etc.

Sync: revision log algorithm

Prerequisite: a database with changefeeds

All browser clients and the project need to have some way to have a synchronized "list of objects".

Some implementations of a database with changefeeds...

Meteor.js does something like this using MongoDB
Firebase (owned by Google) I think does this...
RethinkDB did a part of this, and Horizon.js was a new product from RethinkDB to do the client part, but RethinkDB went out of business. RethinkDB performance issues were a hellish nightmare for me, because RethinkDB tried to be too general.
In 2017, I implemented a complete solution for this using PostgreSQL's LISTEN/NOTIFY and Node.js. I think it's pretty amazing, and it's entirely open source. (But it's tangled up in the CoCalc source code.)
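
To make "a database with changefeeds" concrete, here is a minimal in-memory sketch in Python. It is not CoCalc's code and it does not talk to PostgreSQL; the names Table, Mirror, listen and on_change are made up for illustration. The table pushes every change to its listeners, and each client keeps a mirror that applies those changes, which is one way to get a synchronized "list of objects".

```python
class Table:
    """Hypothetical stand-in for a database table keyed by primary key."""

    def __init__(self):
        self.rows = {}
        self.listeners = []

    def listen(self, callback):
        # Register a callback that receives every later change.
        self.listeners.append(callback)

    def set(self, key, value):
        self.rows[key] = value
        self._notify(("set", key, value))

    def delete(self, key):
        if key in self.rows:
            del self.rows[key]
            self._notify(("delete", key, None))

    def _notify(self, change):
        for callback in list(self.listeners):
            callback(change)


class Mirror:
    """Hypothetical client-side copy kept current by the changefeed."""

    def __init__(self, table):
        self.rows = dict(table.rows)  # initial snapshot
        table.listen(self.on_change)  # then follow changes

    def on_change(self, change):
        action, key, value = change
        if action == "set":
            self.rows[key] = value
        else:
            self.rows.pop(key, None)


table = Table()
table.set("p1", {"title": "Untitled"})
browser = Mirror(table)
project = Mirror(table)
table.set("p1", {"title": "ICMS talk"})  # shows up in both mirrors
table.set("p2", {"title": "Scratch"})
table.delete("p2")
```

A real system also has to handle reconnects, permissions and only sending each client the rows it asked for; none of that is modeled here.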

???

todo: put small Meteor, Firebase, RethinkDB logos here?

Sync: CoCalc's Document Sync & TimeTravel Algorithm

Definitions:

A patch log is a sequence of triples (time, user, patch), with unique times.
The current state of a document is the result of applying all patches in the patch log in order, where patches are applied on a "best effort basis" (there are never merge conflicts).

Algorithm:

Clients synchronize their clock with a central server (up to a few seconds accuracy is more than enough).
When a client is editing, it computes diffs and sends a stream of patches describing changes.
If all clients stop editing, they will all eventually have identical patch logs.
As new patches arrive to clients, they are integrated into the patch log, and document updated.
Any past version of the document can be obtained by replaying the patch log up to some point in time.
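
The definitions and algorithm above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the patch format (insert or delete at a position, with the position clamped so a patch never fails) and the names Entry, PatchLog and apply_patch are invented here and are not CoCalc's actual data structures.

```python
from bisect import insort
from typing import NamedTuple


class Entry(NamedTuple):
    time: float   # unique timestamp
    user: str
    patch: tuple  # assumed format: ("insert", pos, text) or ("delete", pos, count)


def apply_patch(doc, patch):
    """Apply a patch on a best-effort basis: never raise, never conflict."""
    kind, pos, arg = patch
    pos = max(0, min(pos, len(doc)))  # clamp position into the document
    if kind == "insert":
        return doc[:pos] + arg + doc[pos:]
    if kind == "delete":
        return doc[:pos] + doc[pos + arg:]
    return doc  # unknown patch kinds are ignored


class PatchLog:
    def __init__(self):
        self.entries = []  # kept sorted by time

    def add(self, entry):
        # Patches may arrive in any order; times are unique, so skip repeats.
        if all(e.time != entry.time for e in self.entries):
            insort(self.entries, entry)

    def value(self, until=None):
        """Current document, or the version as of time `until` (TimeTravel)."""
        doc = ""
        for e in self.entries:
            if until is not None and e.time > until:
                break
            doc = apply_patch(doc, e.patch)
        return doc


edits = [
    Entry(1.0, "alice", ("insert", 0, "hello")),
    Entry(2.0, "bob", ("insert", 5, " world")),
    Entry(3.0, "alice", ("delete", 0, 1)),
    Entry(4.0, "bob", ("insert", 0, "J")),
]
alice, bob = PatchLog(), PatchLog()
for e in edits:
    alice.add(e)
for e in reversed(edits):  # bob receives the same patches in another order
    bob.add(e)

print(alice.value())           # Jello world
print(bob.value())             # Jello world
print(alice.value(until=2.0))  # hello world
```

Because the document is a pure function of the sorted log, two clients holding the same set of patches always agree, no matter the order in which the patches arrived.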

Implementation:

A ton of work (for me) in finding algorithms to do everything efficiently, both in time and space.
For example, have to work out how to do user specific undo/redo in terms of our data structures.
Storage tiering: "Oh crap, there are 150 million patches in the database..."

???

If I was a CS professor, maybe I would write a paper about this... I'm not.
Similar ideas will make it possible to make CoCalc itself more distributed, e.g., CoCalc project in a docker container on your laptop (hence super fast, offline, etc.) syncing with the main cocalc.com, and the main cocalc.com split into multiple regional sites.

Sync: The Source Code

Most sync-related code in CoCalc runs in the browser and in the backend project, so node.js is very helpful.

Here's the crucial code:

postgres-synctable.coffee: changefeeds on top of PostgreSQL (runs in node.js on server)
synctable.coffee: synchronized list of objects that have a primary key; uses immutable.js
syncstring.coffee: synchronized string with history of all edits.
db-doc.coffee: synchronized list of objects with a primary key and history of all edits (and efficient updates); basically a queryable, distributed, eventually consistent object database.

These files are all in the main CoCalc git repo:

https://github.com/sagemathinc/cocalc

???

§4. Notebooks, Worksheets, LaTeX, Terminals and more!

Browser Based Code Notebooks

In 2006, Alex Clemesha, Tom Boothby, and I created The Sage Notebook, which was the first serious browser based notebook, inspired by Mathematica and Macsyma notebooks and IPython's command line.
Heavily developed until maybe 2010, by Tim Dumol, Jason Grout, and others.

2011 -- the IPython project launched a notebook that looked much like sagenb, but was rewritten from scratch using more "modern" (as of 2011) tech.
- Renamed to "Jupyter notebook", to be less Python-centric.
- Bazillions of dollars in grant funding.
- Very popular
There are some problems in Jupyter due to lack of backend state.
- Close your browser while running a computation, and lose output
- Multiple clients opening the same notebook at once.
- Large images and large output, e.g., while True: print('hi')

???

Point out that closing your browser is the same thing as losing your network connection (do your network connections ever temporarily fail?).

Jupyter in CoCalc

There are (at least) three completely different ways to use Jupyter notebooks in CoCalc.

1. Classical with realtime sync

Jupyter embedded in an iframe, then heavily monkey patched:
- Factor out large images
- Add realtime sync
- TimeTravel for browsing past states of the notebook
A nightmare:
- Large output, data, and images: Univ of Sheffield using R with Jupyter.
- Subtle sync issues are impossible to avoid
- Each new version of Jupyter can break it in subtle ways
- Still loses output if connection interrupted.

2. CoCalc Jupyter: React.js rewrite

I got fed up with monkey patching classical Jupyter, since it was just too hackish.
In 2017, I reimplemented the entire Jupyter stack (except kernels), both frontend and back:
- Maintained almost the same look and feel as classical Jupyter (unlike Google Colaboratory).
- There's a ton of features when you try to implement them all!
- This was a couple of months of hard slog.
CoCalc Jupyter has full knowledge of notebook state on the backend:
- Large images served directly over http (not in the document)
- No lost output when network flaky or user refreshes browsers
- Large output is buffered on the backend and can be obtained by explicit request
Shares some components and code with nteract (from Kyle Kelley at Netflix).
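
As a rough illustration of "large output is buffered on the backend and can be obtained by explicit request", here is a small Python sketch. The class OutputBuffer, its limit parameter and the preview and more methods are hypothetical names for this example, not CoCalc's API.

```python
class OutputBuffer:
    """Hypothetical backend store for all output messages of one cell."""

    def __init__(self, limit=3):
        self.limit = limit   # how many messages clients get by default
        self.messages = []   # the backend keeps everything

    def append(self, message):
        self.messages.append(message)

    def preview(self):
        """Return the messages sent to every client, and how many were held back."""
        shown = self.messages[: self.limit]
        return shown, len(self.messages) - len(shown)

    def more(self, start, count):
        """Return held-back messages, only when a client explicitly asks."""
        return self.messages[start : start + count]


buf = OutputBuffer(limit=3)
for i in range(10):
    buf.append(f"hi {i}")

shown, hidden = buf.preview()
print(shown, hidden)    # ['hi 0', 'hi 1', 'hi 2'] 7
print(buf.more(3, 2))   # ['hi 3', 'hi 4']
```

The point of the design is that a runaway loop fills the backend buffer instead of the synchronized document, so the notebook stays small for every collaborator.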

3. Plain Classical Jupyter

Port forwarding and base url's...
One click to start a classic server from a project, which only you and your collabs can access.
Fallback just in case (e.g., extension support)
Can also run JupyterLab this way.

Sage Worksheets in CoCalc

Weakness of Sage and Jupyter notebooks: editing that involves multiple cells can be awkward.
In 2013, I was designing a collaborative way to use Sage from CoCalc.
I was worried about the performance of hundreds of separate CodeMirror editors on the same page.
I created "Sage worksheets" which are a single CodeMirror editor with output "CodeMirror Widgets".
Rather than "cells" (lots of separate editors), you fully leverage using a single document:
- multiple cursors
- find/replace
- range selection copy/paste (with any amount of input/output)
- code folding
The current implementation is still a little bit flaky since it uses a single string, rather than structured objects, for the state; also it does not use React.js. I'm doing a rewrite to fix this soon.

???

In Sage Notebook we used textareas (so no syntax highlighting, etc.) for performance reasons.

Collaborative IDE

Collaboratively do:

Sage development
Javascript dev (CoCalc itself)

CoCalc has a new fully tiled code editor built on CodeMirror

Arbitrarily many simultaneous views on a document, like in Emacs (say).
Syntax highlighting, code folding, auto formatting, color themes, etc.

CoCalc's LaTeX editor

You can write collaborative research papers using CoCalc's LaTeX editor.
Code editor with extra compilation and pdf preview functionality.
Special code for dealing with multiple users compiling simultaneously
You can run arbitrary code/scripts and use data as part of writing your paper, unlike any other web-based LaTeX editor.
SageTeX is fully supported.

CoCalc's command line terminal

CoCalc projects are Ubuntu 18.04 Docker containers.
I started with term.js and added color themes, support for "open filename", different char sets, etc.
Each terminal session has a corresponding .term file, so:
- Multiple people can open the same terminal
- Chat on the side of terminal

- When text selected, terminal pauses

Pressing Control+C when terminal is paused is copy; is interrupt if not paused.
Click rocket ship to edit custom script that is run when terminal starts.
Type open filename to open a file from the terminal.

???

Actually Ubuntu 16.04 today, but will be 18.04 very soon.
Mention where term.js came from.
todo: name Fabrice b?
Terminals currently do not work well for multiple users; will be fixed soon. Must use tmux.

CoCalc's Chatrooms

You can chat on the side of any file
Chat messages are markdown, with math typesetting
Chatrooms: bigger, and have a rendered preview.
Anybody can edit any past message!
Chat notifications in upper right

Course Management

CoCalc has a full course management system, allowing an instructor to easily:

Create projects for all students,
Send assignments to students. An assignment is any directory of files (e.g., Jupyter notebook, LaTeX document, etc.)
Watch students in realtime while they are working on assignments, and chat with them.
Collect assignments
Grade and return assignments
See full history of how all their students did the assignments.
Peer grading

§5. Culture

Commercial versus Academic?

Sage and Jupyter are not-for-profit academic projects. CoCalc is a commercial project. WHY?

I hosted "the online pari/Magma calculator" and "the modular forms calculator" at Harvard 2000-2005.
I developed, maintained and hosted sagenb.org (the Sage Notebook) 2007--2014, and learned a lot about the challenges of hosting something like this at a University:
- Attacks by hackers
- Malware
- Periodically getting our internet connection automatically cut by the University
- Legal: violations by users expose me to liability.
I started CoCalc as a commercial project, not an academic one:
- Generate money to support Sage development (instead, CoCalc has lost about $500K)
- Be sustainable (not depend on grants)
- Be fulltime: I work much better when I focus fulltime, and grants reduce my teaching only a little.
- I wasn't getting grants anymore
I'm still faculty at UW, but have been on unpaid leave for 2 years.

???

Maybe mention that when I finally talked with the higher ups at UW about running sagenb, they said "absolutely no way".

Make your software available via CoCalc

CoCalc includes (almost) everything anybody ever requested since April 2013.
- List of all installed software
- We have to update this image every few days atomically, even though 1000+ projects are running
- We had to solve the problem of running 1000+ copies of a several hundred GB Docker image across a cluster, and being able to update it without breaking existing projects. Designed and implemented many approaches...
Get your favorite software into CoCalc... by simply asking
Then link to CoCalc from your software's site as an option for people to easily try out your software or use it in teaching.

???

Explain how we solve the image-server problem, if time permits.
hsy: as part of "the story", you could add that cocalc being commercial forces you to leave your domain-specific academic area. e.g. suddenly it became relevant to efficiently host gigabytes of genome and astronomy data for students to process. (besides installing their specific software tools) ... this also underlines the progressive generalization of smc → cocalc
compgen -c | wc -l counts 5077 executables

§6. More Live Demoing?

???

Project Log
Filesystem Snapshots

Thanks!

To all the people who've significantly worked on CoCalc: Jonathan Lee, Nicholas Ruhland, Harald Schilly, Hal Snyder, John Jeng, Timothy Clemans, Travis Scholl, Keith Clawson, Andy Huchala, Simon Luu
To everybody who has contributed to open source mathematical software.
I have a ton of CoCalc (and Sage!) stickers and magnets!
Questions?