Author: William A. Stein
Description: JupyterCon 2018 Talk - William Stein

class: middle, center

Real-time collaboration with Jupyter notebooks

(Conference Link)


William Stein

SageMath, Inc. and University of Washington

August 23, 2018 at JupyterCon (New York City)

???

  • Thank the organizers.

  • Remind people to interrupt me with questions at any time.


class: middle

Contents

  • §0. Background
  • §1. Demo
  • §2. Architecture
  • §3. Sync
  • §4. Functionality
  • §5. Culture
  • §6. Questions

--

I will focus on §3, possibly to the exclusion of §4-§6.


class: middle, center

§0. Background


class: middle, center

§0. Background

I spent the last 6 years creating CoCalc...

(NOTE: there are several other repos not shown here)


§0. Background

Who am I? William Stein --

  • Started SageMath: in 2004 -- huge open source research-level mathematics Python library, with over 600 contributors, and a million lines of code.

  • Started Cython: I came up with the name "Cython" and launched the project as a fork of "Pyrex". The potential I saw in Pyrex was critical to choosing Python as the implementation language of Sage.

  • Started Sage Notebook: in 2006 -- first serious mature browser-based web notebook.

  • Mathematician: Berkeley Ph.D., faculty at Harvard (2000-2005), UCSD (2005-2006), Univ of Washington (2006-present). Full professor. Published 3 books and over 40 research papers.

  • CEO, SageMath, Inc: On leave from UW for the last two years to work fulltime on CoCalc.


class: middle, center

§1. Demo


§1. Demo

Live Demo of https://CoCalc.com

  1. Create a new project on https://cocalc.com
  2. Jupyter notebook; select a kernel, then use assistant
  3. Terminal: Jupyter console.
  4. Chat on the side of the Jupyter notebook
  5. Add a collaborator, and edit notebook
  6. TimeTravel
  7. Project snapshots
  8. Sage worksheet (draw a 3d plot)
  9. LaTeX file

class: middle, center

§2. Architecture


§2. Architecture

The tech stack

  • Node.js: Javascript on server; highly async; same code runs in browser and project!

  • PostgreSQL: database; we make heavy use of LISTEN/NOTIFY

  • Python3: use for dozens of scripts to control things

  • React.js/Redux: the browser client heavily uses this; it provides a very functional and reactive approach to user interface implementation.

  • Typescript/CoffeeScript: used CoffeeScript heavily 2013-2017 for a more Python-looking approach to Javascript, and to avoid the bad parts. Javascript got way, way better, and now Typescript is vastly superior, and we're halfway through switching. The Jupyter code in CoCalc is already fully in Typescript.


§2. Architecture

Ways to Use CoCalc...

  • cocalc-docker -- 100% free and open source; run CoCalc on your own computer:

        docker run -d -v ~/cocalc:/projects -p 443:443 sagemathinc/cocalc

  • https://CoCalc.com -- main production site, regularly hosts well over 1000 simultaneous running projects. Uses Kubernetes extensively.

  • Install directly -- Not well supported yet, except from within a CoCalc project itself.


class: middle, center

§3. Sync

Realtime Synchronized Editing for Every Document Type


§3. Sync

First implementation - differential sync (2013-2016)

  • I implemented the Differential Synchronization algorithm by Neil Fraser

  • Used Google's diff-match-patch for Javascript diffs/patches.

  • Problems in practice, at scale:

    • Strings: only implemented for strings (JSON doc corruption!)
    • Complexity: subtle horrible problems involving having to do O(n^2) operations on backend, leading to server hangs...
    • Brittle: sync would not robustly work if the project wasn't actively involved...
    • Bad abstractions: the algorithm is confusing and difficult (for me!) to think about.
             

???

  • try to describe the algorithm quickly in words:

    • "sort of like git commit; git pull; git push" by every client every few seconds.
    • server and client both do lots of diffs, which are expensive.
  • edge case of diff taking minutes, which was shifting

  • the "terminate after trying for a few seconds" diff functionality was documented, but not actually implemented...!


§3. Sync

Second implementation - revision log (2016-now)

  • Time travel: inspiration!

    • Jonathan Lee implemented a "TimeTravel slider", which shows all versions of a file
    • We had to record all the diffs as file is edited
    • I started using React.js, which got me into functional reactive programming... and RethinkDB, a database that pushed out realtime changes to clients.
    • Learned about how online multiplayer games sync state
    • All this together suggested a different approach to sync...
  • I switched CoCalc to RethinkDB (from Cassandra) with lots of use of "changefeeds". This made things like "change your project's title" update in realtime for all users. (We eventually switched to LISTEN/NOTIFY in PostgreSQL, since RethinkDB failed.)

  • I replaced the differential sync algorithm by a new one based on a revision log.

  • Fixed many very subtle bugs over the last 3 years. (No known sync bugs today.)

  • Also implemented sync for structured objects, with efficient update of string fields. Needed for Jupyter notebooks, Sage worksheets, Todo lists, Chatrooms, etc.


§3. Sync

Prerequisite: a shared log

  • All browser clients and the project (where Jupyter runs) need some way to maintain an eventually consistent shared log:
            (timestamp_0, patch_0)
            (timestamp_1, patch_1)
            ...
            (timestamp_n, patch_n)

  • (2) A Websocket channel:
    • Use a websocket to broadcast log entries, and ensure messages are received somehow.
    • Primus and the primus-multiplex plugin very nicely do this.

Note: I'm currently revamping CoCalc to use both to greatly reduce latency, while still providing safe long-term storage and connection robustness. I will also use (2) for chatty ephemeral sync.
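A minimal sketch of what "eventually consistent shared log" means in practice, assuming entries are plain (timestamp, patch) pairs. The names `LogEntry` and `SharedLog` are made up for illustration and are not CoCalc's API. The point is that merging is a set union, so the order and number of deliveries does not matter.

```python
from typing import Iterable, NamedTuple


class LogEntry(NamedTuple):
    timestamp: int   # milliseconds, as stamped by the sender
    patch: str       # opaque patch payload (the format is an assumption)


class SharedLog:
    """Grow-only set of log entries, always read back in timestamp order."""

    def __init__(self) -> None:
        self._entries = set()

    def receive(self, entries: Iterable[LogEntry]) -> None:
        # Set union is idempotent and order-independent, so duplicated or
        # reordered deliveries cannot make two replicas diverge.
        self._entries.update(entries)

    def entries(self) -> list:
        return sorted(self._entries)


e0, e1, e2 = LogEntry(100, "p0"), LogEntry(200, "p1"), LogEntry(300, "p2")
a, b = SharedLog(), SharedLog()
a.receive([e0, e1, e2])   # everything arrives in order
b.receive([e2, e0])       # out of order, one entry missing
b.receive([e1, e2])       # the missing entry arrives late, plus a duplicate
print(a.entries() == b.entries())   # True
```

Both replicas end up with the same log once every entry has been delivered at least once, which is all the sync algorithm needs from the transport.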


§3. Sync

Sync algorithm

Definitions:

  • Patch log: sequence of triples (time, user, patch), with distinct (time, user).
  • State: current state of the document is simply the result of applying all patches in the patch log in order, where patches are applied on a "best effort basis" (by definition, no merge conflicts).

Algorithm:

  • Time: sync clock with a central server (a few seconds accuracy is enough).
  • Edit and send: When user changes doc, broadcast (time,user,patch) describing their change.
  • Receive and update: Clients receive patches, integrate into the patch log, and document is updated.
  • TimeTravel: Get the version of the document at time t0 by applying all patches up to time t0.

NOTE: If all clients stop editing, they will eventually have identical patch logs (assuming message distribution is robust), hence agree on the state of the document.
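The definitions and algorithm above can be sketched in a few lines. This is a toy under stated assumptions, not CoCalc's code: a patch here is a hypothetical (pos, delete, insert) edit that is clamped to the document so it can never fail, standing in for the fuzzy patches of diff-match-patch.

```python
from bisect import insort
from typing import NamedTuple


class Patch(NamedTuple):
    pos: int       # where the edit starts
    delete: int    # how many characters to remove
    insert: str    # text to put in their place


class Entry(NamedTuple):
    time: int      # synced clock, milliseconds
    user: str      # makes (time, user) distinct
    patch: Patch


def apply_patch(doc: str, p: Patch) -> str:
    """Best-effort apply: clamp to the document, so there are no conflicts."""
    start = min(max(p.pos, 0), len(doc))
    end = min(start + max(p.delete, 0), len(doc))
    return doc[:start] + p.insert + doc[end:]


class PatchLog:
    def __init__(self) -> None:
        self.entries = []   # kept sorted by (time, user)

    def receive(self, entry: Entry) -> None:
        if entry not in self.entries:      # ignore duplicate deliveries
            insort(self.entries, entry)    # late arrivals land in order

    def state(self, up_to=None) -> str:
        """Document now, or as of time `up_to` (TimeTravel)."""
        doc = ""
        for e in self.entries:
            if up_to is not None and e.time > up_to:
                break
            doc = apply_patch(doc, e.patch)
        return doc


edits = [
    Entry(10, "alice", Patch(0, 0, "hello")),
    Entry(20, "bob", Patch(5, 0, " world")),
    Entry(30, "alice", Patch(0, 5, "goodbye")),
]
a, b = PatchLog(), PatchLog()
for e in edits:
    a.receive(e)
for e in reversed(edits):      # same patches, opposite arrival order
    b.receive(e)
print(a.state() == b.state(), a.state())   # True goodbye world
print(a.state(up_to=20))                   # hello world
```

Because the state is a pure function of the sorted log, two clients with the same log necessarily show the same document.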

Implementation:

  • Find algorithms: to do everything efficiently, e.g., insert recent patch and recompute current doc state.
  • Storage tiers: "Oh crap, there are 150 million patches in the database..."
  • What is a patch? Text documents versus structured documents.
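One way to make "insert recent patch and recompute current doc state" cheap is to cache intermediate states and replay only the tail of the log. The sketch below assumes the same toy (pos, delete, insert) patches as a stand-in for real ones, and `CachedPatchLog` is a made-up name. It keeps a snapshot after every entry, which a real system would presumably thin out.

```python
from bisect import bisect_left


def apply_patch(doc, patch):
    """Best-effort (pos, delete, insert) edit, clamped so it cannot fail."""
    pos, delete, insert = patch
    start = min(max(pos, 0), len(doc))
    end = min(start + max(delete, 0), len(doc))
    return doc[:start] + insert + doc[end:]


class CachedPatchLog:
    def __init__(self):
        self.entries = []    # (time, user, patch) tuples, kept sorted
        self.states = [""]   # states[i] is the document after i entries

    def receive(self, entry):
        """Insert an entry; return how many patches had to be re-applied."""
        i = bisect_left(self.entries, entry)
        if i < len(self.entries) and self.entries[i] == entry:
            return 0                       # duplicate delivery
        self.entries.insert(i, entry)
        del self.states[i + 1:]            # snapshots up to index i stay valid
        for _, _, patch in self.entries[i:]:
            self.states.append(apply_patch(self.states[-1], patch))
        return len(self.entries) - i

    def current(self):
        return self.states[-1]


log = CachedPatchLog()
print(log.receive((10, "alice", (0, 0, "abc"))))   # 1
print(log.receive((20, "bob", (3, 0, "def"))))     # 1
print(log.receive((30, "alice", (0, 0, ">"))))     # 1
print(log.receive((15, "carol", (0, 1, "X"))))     # 3 (late patch: replay the tail only)
print(log.current())                               # >Xbcdef
```

In the common case a patch is the newest one and costs a single application; only a patch that arrives late forces a replay, and only from its position onward.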

???

  • If I was a CS professor, maybe I would write a paper about this... I'm not.

  • Similar ideas will make it possible to make CoCalc itself more distributed, e.g., CoCalc project in a docker container on your laptop (hence super fast, offline, etc.) syncing with the main cocalc.com, and the main cocalc.com split into multiple regional sites.


§3. Sync

Documents

Two document types: plain text and "list of objects with primary keys" (db-doc).

Plain text documents, such as source code

  • defined by the diff-match-patch code

Structured documents, such as Jupyter notebooks

We view a structured document as a distributed eventually consistent object database table:

  • A table: give primary keys, specify that certain fields are atomic, and that other fields are strings that should be managed using the diff-match-patch algorithm (above).
  • Define get and set operations, inspired by Cassandra (a "big data" distributed database).
  • Figure out how to make everything efficient (e.g., immutable.js helps)
  • Do NOT try to synchronize ipynb files! Instead define mappings:
                            { .ipynb file }  <-->  { object databases }
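A toy version of such a table, to make the get/set idea concrete. `DbDoc` and its methods are hypothetical names, not CoCalc's implementation, and the declared string fields are simply replaced here rather than diffed.

```python
class DbDoc:
    """Toy 'list of objects with primary keys' document."""

    def __init__(self, primary_keys, string_fields=()):
        self.primary_keys = tuple(primary_keys)
        # Fields a real implementation would sync as string diffs; in this
        # sketch they are overwritten like any atomic field.
        self.string_fields = set(string_fields)
        self.records = {}   # primary-key tuple -> record dict

    def _key(self, obj):
        return tuple(obj.get(k) for k in self.primary_keys)

    def set(self, obj):
        """Upsert: merge the given fields into the record with the same key."""
        key = self._key(obj)
        merged = dict(self.records.get(key, {}))
        merged.update(obj)
        self.records[key] = merged

    def get(self, where):
        """Return copies of all records matching every field in `where`."""
        return [
            dict(r) for r in self.records.values()
            if all(r.get(k) == v for k, v in where.items())
        ]


doc = DbDoc(primary_keys=["type", "id"], string_fields=["input"])
doc.set({"type": "cell", "id": "5dd784", "pos": 0, "input": ""})
doc.set({"type": "settings", "kernel": "sagemath"})
doc.set({"type": "cell", "id": "5dd784", "input": "2 + 3"})   # touches one field
print(doc.get({"type": "cell"}))
# [{'type': 'cell', 'id': '5dd784', 'pos': 0, 'input': '2 + 3'}]
```

Since each `set` names a record by primary key and only the fields it changes, two users editing different cells, or different fields of the same cell, do not overwrite each other.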

§3. Sync

Documents: Jupyter Notebooks

For Jupyter, the primary key fields are type and id.

A new Jupyter notebook might correspond to this table:

~/tmp$ more .a.ipynb.sage-jupyter2
{"type":"cell","id":"5dd784","pos":0,"input":""}
{"type":"file","last_load":1534979207135}
{"type":"settings","backend_state":"running","trust":true,"kernel":"sagemath"}

If you then type "2 + 3" and hit shift-enter, the table becomes:

~/tmp$ more .a.ipynb.sage-jupyter2
{"output":{"0":{"data":{"text/plain":"5"},"exec_count":1}},"exec_count":1,"start":1534979235213,
"input":"2 + 3","state":"done","pos":0,"type":"cell","end":1534979235259,
"id":"5dd784","kernel":"sagemath"}
{"type":"cell","id":"c3d8fd","pos":1,"input":""}
{"type":"file","last_load":1534979207135}
{"type":"settings","backend_state":"running","trust":true,
"kernel":"sagemath","kernel_usage":{"cpu":0,"memory":7213056},"kernel_state":"idle"}

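A sketch of reading such a table back, assuming one JSON object per line (the listing above wraps long records for display) and using an abbreviated copy of the records. `load_table` and `cells_in_order` are illustrative helpers, not CoCalc functions.

```python
import json

TABLE = """\
{"type":"cell","id":"c3d8fd","pos":1,"input":""}
{"type":"cell","id":"5dd784","pos":0,"input":"2 + 3","state":"done","exec_count":1}
{"type":"file","last_load":1534979207135}
{"type":"settings","backend_state":"running","trust":true,"kernel":"sagemath"}
"""


def load_table(text):
    """Index records by the primary key (type, id); id is None when absent."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            table[(record["type"], record.get("id"))] = record
    return table


def cells_in_order(table):
    """Cells are stored unordered; the pos field gives the notebook order."""
    cells = [r for (kind, _), r in table.items() if kind == "cell"]
    return sorted(cells, key=lambda r: r["pos"])


table = load_table(TABLE)
print([c["id"] for c in cells_in_order(table)])   # ['5dd784', 'c3d8fd']
print(table[("settings", None)]["kernel"])        # sagemath
```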
§3. Sync

Source Code

Most sync-related code in CoCalc runs in the browser and in the backend project, so node.js is very helpful.

Here's the crucial code:

These files are all in the main CoCalc git repo:

https://github.com/sagemathinc/cocalc

The code all (seems to) work very well at this point, but I would love to clean it up and rewrite it in Typescript, add way more testing, make it a separate project, etc.

???


class: middle, center

§4. Functionality

Jupyter Notebooks, LaTeX, Terminals and much more!


§4. Functionality

Browser-Based Code Notebooks

  • Sage Notebook: In 2006, Alex Clemesha, Tom Boothby, and I created The Sage Notebook
    • first serious browser based notebook, inspired by Mathematica and Macsyma notebooks and IPython's command line.
    • heavily developed until maybe 2010, by Tim Dumol, Jason Grout, and others.
  • Jupyter: 2011, the IPython project launched a notebook that looked much like sagenb, but was rewritten from scratch using much more modern tools:

    • Renamed to "Jupyter notebook", to be much more inclusive,
    • Fantastic grant funding (due to hard work of F. Perez and many others),
  • Issues with Jupyter due to lack of full backend state.

    • Close your browser while running a computation, and lose output.
    • Multiple clients opening the same notebook at once.
    • Large images and output embedded in ipynb file.

???

Point out that closing your browser is the same thing as losing your network connection (do your network connections ever temporarily fail?).


§4. Functionality

Jupyter in CoCalc

There are (at least) three completely different ways to use Jupyter notebooks in CoCalc.

1. Classical with realtime sync

  • Jupyter embedded in an iframe, then heavily monkey patched:

    • Factor out large images
    • Add realtime sync
    • TimeTravel for browsing past states of the notebook
  • A nightmare:

    • Large output, data, and images: Univ of Sheffield using R with Jupyter.
    • Subtle sync issues are impossible to avoid
    • Each new version of Jupyter can break it in subtle ways
    • Still loses output if connection interrupted.

§4. Functionality

Jupyter in CoCalc

1. Classical with realtime sync

2. CoCalc Jupyter: React.js rewrite

  • I got fed up with monkey patching classical Jupyter, since it was just too hackish.

  • In 2017, I reimplemented the entire Jupyter stack (except kernels), both frontend and back:

    • Maintained almost the same look and feel as classical Jupyter (unlike Google Colaboratory).
    • There's a ton of features when you try to implement them all!
    • This was a couple of months of hard slog
  • CoCalc Jupyter has full knowledge of notebook state on the backend:

    • Large images served directly over http (not in the document)
    • No lost output when the network is flaky or the user refreshes the browser
    • Large output is buffered on the backend and can be obtained by explicit request
  • Shares some components and code with nteract (from Kyle Kelley at Netflix).
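A rough sketch of the "buffer large output on the backend" idea, under the assumption that browsers are pushed only a bounded prefix and fetch the rest on demand. `OutputBuffer`, `preview`, and `rest` are hypothetical names, and the character limit is arbitrary.

```python
class OutputBuffer:
    """Keep all kernel output on the backend; push only a bounded prefix."""

    def __init__(self, limit):
        self.limit = limit    # max characters sent to browsers unprompted
        self.chunks = []      # everything the kernel has produced so far

    def append(self, text):
        self.chunks.append(text)

    def preview(self):
        """What every connected browser gets pushed automatically."""
        full = "".join(self.chunks)
        return {"text": full[: self.limit], "more": max(len(full) - self.limit, 0)}

    def rest(self):
        """Served only when a user explicitly asks for the remaining output."""
        return "".join(self.chunks)[self.limit:]


buf = OutputBuffer(limit=10)
for i in range(5):
    buf.append(f"line {i}\n")          # 7 characters each, 35 in total
print(buf.preview()["more"])           # 25
print(buf.preview()["text"] + buf.rest() == "".join(buf.chunks))   # True
```

Because the buffer lives on the backend, the output survives a browser refresh, and the synced document carries only the short preview.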


§4. Functionality

Jupyter in CoCalc

1. Classical with sync (...)

2. CoCalc Jupyter (...)

3. Plain Classical Jupyter

  • Port forwarding and base URLs...

  • One click to start a classic server from a project, which only you and your collabs can access.

  • Fallback just in case (e.g., extension support)

  • Can also run JupyterLab this way.


§4. Functionality

Sage Worksheets in CoCalc

  • Weakness of Sage and Jupyter notebooks: editing that involves multiple cells can be awkward.

  • In 2013, I was designing a collaborative way to use Sage from CoCalc.

  • I was worried about the performance of hundreds of CodeMirror editors at once.

  • I created "Sage worksheets" which are a single CodeMirror editor with output "CodeMirror Widgets".

  • Rather than "cells" (lots of separate editors), you fully leverage using a single document:

    • multiple cursors
    • find/replace
    • range selection copy/paste (with any amount of input/output)
    • code folding
  • The current implementation is still a little bit flaky since it uses a single string, rather than a db-doc, for the state; also it does not use React.js. I'm doing a rewrite to fix this soon.

???

In Sage Notebook we used textareas (so no syntax highlighting, etc.) for performance reasons.


§4. Functionality

Collaborative IDE

Collaboratively do:

  • Sage development

  • Javascript dev -- I've done all CoCalc development in CoCalc on a Chromebook since 2013!

CoCalc has a new tiled code editor built on CodeMirror

  • Arbitrarily many simultaneous views on a document, like in Emacs (say).

  • Syntax highlighting, code folding, auto formatting, color themes, etc.

  • Work in progress: support VS Code's "language server protocol".


§4. Functionality

LaTeX editor

  • You can write collaborative research papers using the LaTeX editor.

  • Code editor with extra compilation and pdf preview functionality.

  • Special code for dealing with multiple users compiling simultaneously

  • You can run arbitrary code/scripts and use data as part of writing your paper, unlike any other web-based LaTeX editor.

  • SageTeX is fully supported: easily run Python code in your document.


§4. Functionality

Command line terminal

  • CoCalc projects are Ubuntu 18.04 Docker containers.

  • I started with term.js and added color themes, support for "open filename", different char sets, etc.

  • Each terminal session has a corresponding .term file, so:

    • Multiple people can open the same terminal
    • Chat on the side of terminal
  • Click rocket ship to edit custom script that is run when terminal starts.

  • Type open filename to open a file from the terminal.


§4. Functionality

Chatrooms

  • You can chat on the side of any file

  • Chat messages are markdown, with math typesetting

  • Chatrooms: bigger, with a rendered preview.

  • Anybody can edit any past message!

  • Chat notifications in upper right


§4. Functionality

Course Management

CoCalc has a full course management system, allowing an instructor to easily:

  • Create projects for all students,

  • Send assignments to students. An assignment is any directory of files (e.g., Jupyter notebook, LaTeX document, etc.)

  • Watch students in realtime while they are working on assignments, and chat with them.

  • Collect assignments

  • Grade and return assignments

  • See full history of how all their students did the assignments.

  • Peer grading


§4. Functionality

R

1. Jupyter Kernel

2. Command Line

3. RMarkdown (Knitr)

4. RTex (Knitr)


class: middle, center

§5. Culture


§5. Culture

People

The Team

  • Current active code contributors: John Jeng, Harald Schilly, Hal Snyder, William Stein, Travis Scholl

  • Past significant contributors: Greg Bard, Rob Beezer, Keith Clawson, Tim Clemans, Andy Huchala, Jon Lee, Simon Luu, Nicholas Ruhland, Todd Zimmerman

  • Advisers/Investors: ask me

Rich diversity of users

  • College and high school teachers

  • Researchers from around the world

  • Wide range of disciplines (not just math!)


§5. Culture

Commercial versus Academic

Sage and Jupyter are not-for-profit academic projects. CoCalc is a commercial project. WHY?

  • I hosted "the online pari/Magma calculator" and "the modular forms calculator" at Harvard 2000-2005.

  • I developed, maintained and hosted sagenb.org (the Sage Notebook) 2007--2014, and learned a lot about the challenges of hosting something like this at a University:

    • Attacks by hackers
    • Malware
    • Periodically getting our internet connection automatically cut by the University
    • Legal: violations by users expose me to liability.
  • I started CoCalc as a commercial project, not an academic one:

    • Generate money to support Sage development (instead, CoCalc has lost about $500K)
    • Be sustainable (not depend on grants)
    • Be fulltime: I work much better when I focus fulltime, and grants only reduce my teaching very little.
    • I wasn't getting grants anymore
    • Conversation with Fernando Perez about how "Project Jupyter" made the opposite decision...
  • I'm still faculty at UW, but have been on unpaid leave for 2 years.

???

  • Maybe mention that when I finally talked with the higher ups at UW about running sagenb, they said "absolutely no way".

class: middle, center

§6. Questions