Authors: ℏal Snyder, William A. Stein
Description: William Stein talk at 2018 Big Data on CoCalc

class: middle, center

Making open source data analysis software easily available and more collaborative


William Stein

SageMath, Inc. and University of Washington

August 24, 2018 at Center of Mathematical Sciences and Applications, 2018 Big Data Conference

???

  • Thank the organizers.

  • Remind people to interrupt me with questions at any time.


class: middle

Contents

  • §0. Background
  • §1. Demo
  • §2. Architecture
  • §3. Sync
  • §4. Functionality
  • §5. Culture
  • §6. Questions

class: middle, center

§0. Background


class: middle, center

§0. Background

I spent the last 6 years creating CoCalc...

(NOTE: there are several other repos not shown here)


§0. Background

Who am I? William Stein --

  • Started SageMath: in 2004 -- huge open source research-level mathematics Python library, with over 600 contributors, and a million lines of code.

  • Started Cython: I came up with the name "Cython" and launched the project as a fork of "Pyrex". The potential I saw in Pyrex was critical to choosing Python as the implementation language of Sage.

  • Started Sage Notebook: in 2006 -- first serious mature browser-based web notebook.

  • Mathematician: Berkeley Ph.D., faculty at Harvard (2000-2005), UCSD (2005-2006), Univ of Washington (2006-present). Full professor. Published 3 books and over 40 research papers.

  • CEO, SageMath, Inc: On leave from UW for the last two years to work fulltime on CoCalc.


class: middle, center

§1. Demo


§1. Demo

Live Demo of https://CoCalc.com

If I'm going to suggest you use CoCalc when teaching your classes, I better put my arse on the line and do a live demo right now on the production website! So here we go...

  1. Create a new project on https://cocalc.com
  2. Jupyter notebook; select a kernel, then use assistant
  3. Terminal: Jupyter console.
  4. Chat on the side of the Jupyter notebook
  5. Add a collaborator, and edit notebook
  6. TimeTravel
  7. Project snapshots
  8. Sage worksheet
  9. LaTeX file, RTex, RMarkdown

class: middle, center

§2. Architecture


§2. Architecture

The tech stack

  • Node.js: JavaScript on the server; highly async; the same code runs in the browser and in the project!

  • PostgreSQL: database; we make heavy use of LISTEN/NOTIFY

  • Python3: use for dozens of scripts to control things

  • React.js/Redux: the browser client heavily uses this; it provides a very functional and reactive approach to user interface implementation.

  • TypeScript/CoffeeScript: we used CoffeeScript heavily from 2013 to 2017 for a more Python-looking approach to JavaScript, and to avoid the bad parts. JavaScript has since gotten way, way better, TypeScript is now vastly superior, and we're halfway through switching. The Jupyter code in CoCalc is already fully in TypeScript.


§2. Architecture

Ways to Use CoCalc...

  • cocalc-docker -- 100% free and open source; run CoCalc on your own computer:
docker run -d -v ~/cocalc:/projects -p 443:443 sagemathinc/cocalc
  • https://CoCalc.com -- main production site, regularly hosts well over 1000 simultaneous running projects. Uses Kubernetes extensively.

  • Install directly -- Not well supported yet, except from within a CoCalc project itself.

--

Big Data?

cocalc.com generates a lot of user data every day:

  • Challenge: lots of data to store, backup, etc.
  • Opportunity: we store every state of every document anybody works with.
    • Cleverly use it to make CoCalc better? E.g., improve error messages, suggest useful code snippets.

    • But that is not what today's talk is about.


class: middle, center

§3. Sync

Realtime Synchronized Editing for Every Document Type


§3. Sync

First implementation - differential sync (2013-2016)

--

Second implementation - revision log (2016-now)

Definitions:

  • Patch log: sequence of triples (time, user, patch), with distinct (time, user).
  • State: current state of the document is simply the result of applying all patches in the patch log in order, where patches are applied on a "best effort basis" (by definition, no merge conflicts).

Algorithm:

  • Time: sync clock with a central server (a few seconds accuracy is enough).
  • Edit and send: When user changes doc, broadcast (time,user,patch) describing their change.
  • Receive and update: Clients receive patches, integrate them into the patch log, and update the document.
  • TimeTravel: Obtain the version of the document at time t0 by applying all patches up to time t0.
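
A minimal sketch of the revision-log idea above, not CoCalc's actual implementation. The names `Entry`, `PatchLog`, and `apply_patch` are hypothetical, and the toy patch format (replace the first occurrence of `old` with `new`, skip the patch if `old` is no longer present) is an assumption standing in for a real diff/patch library. It illustrates that clients receiving the same patches in different orders end up with the same document, and that TimeTravel is just replaying the log up to a time.

```python
from bisect import insort
from dataclasses import dataclass, field


@dataclass(frozen=True, order=True)
class Entry:
    time: float                          # timestamp from the synced clock
    user: str                            # who made the change
    patch: tuple = field(compare=False)  # toy patch: (old_text, new_text)


def apply_patch(doc: str, patch: tuple) -> str:
    """Apply a patch on a best-effort basis; never raise a merge conflict."""
    old, new = patch
    if old == "":
        return doc + new                 # pure insertion: append to the end
    if old in doc:
        return doc.replace(old, new, 1)  # replace the first occurrence only
    return doc                           # context is gone: skip the patch


class PatchLog:
    def __init__(self):
        self.entries = []                # kept sorted by (time, user)

    def add(self, time, user, patch):
        entry = Entry(time, user, patch)
        if entry not in self.entries:    # (time, user) pairs are distinct
            insort(self.entries, entry)

    def value(self, upto=None):
        """Document state; with `upto`, the TimeTravel version at that time."""
        doc = ""
        for e in self.entries:
            if upto is not None and e.time > upto:
                break
            doc = apply_patch(doc, e.patch)
        return doc


patches = [
    (1.0, "alice", ("", "hello world")),
    (2.0, "bob", ("world", "CoCalc")),
    (3.0, "alice", ("hello", "Hello")),
]
a, b = PatchLog(), PatchLog()
for p in patches:
    a.add(*p)
for p in reversed(patches):              # b receives the patches out of order
    b.add(*p)

print(a.value() == b.value())            # True
print(a.value(upto=2.0))                 # hello CoCalc
```

Because the state is defined purely by the sorted log, arrival order does not matter, which is why a clock that is only accurate to a few seconds is enough.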

class: middle, center

§4. Functionality

Jupyter Notebooks, LaTeX, Terminals and much more!


§4. Functionality

Browser-Based Code Notebooks

  • Sage Notebook: In 2006, Alex Clemesha, Tom Boothby, and I created The Sage Notebook
    • first serious browser based notebook, inspired by Mathematica and Macsyma notebooks and IPython's command line.
    • heavily developed until maybe 2010, by Tim Dumol, Jason Grout, and others.
  • Jupyter: 2011, the IPython project launched a notebook that looked much like sagenb, but was rewritten from scratch using much more modern tools:

    • Renamed to "Jupyter notebook", to be much more inclusive,
    • Fantastic grant funding (due to hard work of F. Perez and many others),
  • Issues with Jupyter due to lack of full backend state.

    • Close your browser while running a computation, and lose output.
    • Multiple clients opening the same notebook at once.
    • Large images and output embedded in ipynb file.

???

Point out that closing your browser is the same thing as losing your network connection (do your network connections ever temporarily fail?).


§4. Functionality

Jupyter in CoCalc

There are (at least) three completely different ways to use Jupyter notebooks in CoCalc.

1. Classical with realtime sync

  • Jupyter embedded in an iframe, then heavily monkey patched:

    • Factor out large images
    • Add realtime sync
    • TimeTravel for browsing past states of the notebook
  • A nightmare:

    • Large output, data, and images: Univ of Sheffield using R with Jupyter.
    • Subtle sync issues are impossible to avoid
    • Each new version of Jupyter can break it in subtle ways
    • Still loses output if connection interrupted.

§4. Functionality

Jupyter in CoCalc

1. Classical with realtime sync

2. CoCalc Jupyter: React.js rewrite

  • I got fed up with monkey patching classical Jupyter, since it was just too hackish.

  • In 2017, I reimplemented the entire Jupyter stack (except kernels), both frontend and back:

    • Maintained almost the same look and feel as classical Jupyter (unlike Google Colaboratory).
    • There are a ton of features when you try to implement them all!
    • This was a couple of months of hard slog.
  • CoCalc Jupyter has full knowledge of notebook state on the backend:

    • Large images served directly over http (not in the document)
    • No lost output when the network is flaky or the user refreshes the browser
    • Large output is buffered on the backend and can be obtained by explicit request
  • Shares some components and code with nteract (from Kyle Kelley at Netflix).
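
A rough sketch of what "full knowledge of notebook state on the backend" could look like for cell output. This is an illustration under assumptions, not CoCalc's code: the class `OutputBuffer`, its `inline_limit` parameter, and the method names are all hypothetical. The backend keeps every output message, sends browsers only the first few, and serves the rest on explicit request, so a refresh or a flaky connection loses nothing.

```python
class OutputBuffer:
    """Backend-side store for one cell's output (hypothetical sketch)."""

    def __init__(self, inline_limit=20):
        self.inline_limit = inline_limit  # how many messages browsers get by default
        self.messages = []                # full output lives on the backend

    def append(self, message):
        """Called as the kernel produces output, whether or not a browser is connected."""
        self.messages.append(message)

    def inline(self):
        """What any browser sees when it connects or reconnects."""
        shown = self.messages[: self.inline_limit]
        return {"messages": shown, "more": len(self.messages) - len(shown)}

    def more(self):
        """The remaining output, sent only on explicit request."""
        return self.messages[self.inline_limit :]


buf = OutputBuffer(inline_limit=3)
for i in range(5):
    buf.append(f"line {i}")

first_view = buf.inline()      # a browser connects
after_refresh = buf.inline()   # the same browser after a refresh
```

The browser is only a view here; the buffer, not the page, is the source of truth.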


§4. Functionality

Jupyter in CoCalc

1. Classical with sync (...)

2. CoCalc Jupyter (...)

3. Plain Classical Jupyter

  • Port forwarding and base URLs...

  • One click to start a classic server from a project, which only you and your collabs can access.

  • Fallback just in case (e.g., extension support)

  • Can also run JupyterLab this way.


§4. Functionality

Sage Worksheets in CoCalc

  • Weakness of Sage and Jupyter notebooks: editing that involves multiple cells can be awkward.

  • In 2013, I was designing a collaborative way to use Sage from CoCalc.

  • I was worried about the performance of hundreds of CodeMirror editors at once.

  • I created "Sage worksheets" which are a single CodeMirror editor with output "CodeMirror Widgets".

  • Rather than "cells" (lots of separate editors), you fully leverage using a single document:

    • multiple cursors
    • find/replace
    • range selection copy/paste (with any amount of input/output)
    • code folding
  • The current implementation is still a little bit flaky since it uses a single string, rather than a db-doc, for the state; also it does not use React.js. I'm doing a rewrite to fix this soon.

???

In Sage Notebook we used textareas (so no syntax highlighting, etc.) for performance reasons.


§4. Functionality

Collaborative IDE

Collaboratively do:

  • Sage development

  • Javascript dev -- I've done all CoCalc development in CoCalc on a Chromebook since 2013!

CoCalc has a new tiled code editor built on CodeMirror

  • Arbitrarily many simultaneous views on a document, like in Emacs (say).

  • Syntax highlighting, code folding, auto formatting, color themes, etc.

  • Work in progress: support VS Code's "language server protocol".


§4. Functionality

LaTeX editor

  • You can write collaborative research papers using the LaTeX editor.

  • Code editor with extra compilation and pdf preview functionality.

  • Special code for dealing with multiple users compiling simultaneously

  • You can run arbitrary code/scripts and use data as part of writing your paper, unlike any other web-based LaTeX editor.

  • SageTeX is fully supported: easily run Python code in your document.


§4. Functionality

Command line terminal

  • CoCalc projects are Ubuntu 18.04 Docker containers.

  • I started with term.js and added color themes, support for "open filename", different char sets, etc.

  • Each terminal session has a corresponding .term file, so:

    • Multiple people can open the same terminal
    • Chat on the side of terminal
  • Click rocket ship to edit custom script that is run when terminal starts.

  • Type open filename to open a file from the terminal.


§4. Functionality

Chatrooms

  • You can chat on the side of any file

  • Chat messages are markdown, with math typesetting

  • Chatrooms: bigger, with a rendered preview.

  • Anybody can edit any past message!

  • Chat notifications in upper right


§4. Functionality

Course Management

CoCalc has a full course management system, allowing an instructor to easily:

  • Create projects for all students,

  • Send assignments to students. An assignment is any directory of files (e.g., Jupyter notebook, LaTeX document, etc.)

  • Watch students in realtime while they are working on assignments, and chat with them.

  • Collect assignments

  • Grade and return assignments

  • See full history of how all their students did the assignments.

  • Peer grading
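
Since an assignment is just a directory of files, the send and collect steps can be pictured as directory copies. The sketch below is only an illustration under that assumption; the functions `assign`, `collect`, and `demo`, the `collected/` layout, and the use of local directories in place of student projects are all hypothetical and not how CoCalc is actually implemented.

```python
import shutil
import tempfile
from pathlib import Path


def assign(course_dir, student_dirs, name):
    """Copy the assignment directory `name` into every student project."""
    src = Path(course_dir) / name
    for student in student_dirs:
        shutil.copytree(src, Path(student) / name, dirs_exist_ok=True)


def collect(course_dir, student_dirs, name):
    """Copy each student's version back under collected/<name>/<student>."""
    for student in student_dirs:
        dest = Path(course_dir) / "collected" / name / Path(student).name
        shutil.copytree(Path(student) / name, dest, dirs_exist_ok=True)


def demo():
    """Assign and collect in a temporary directory; return the collected files."""
    with tempfile.TemporaryDirectory() as root:
        course = Path(root) / "course"
        students = [Path(root) / s for s in ("alice", "bob")]
        (course / "hw1").mkdir(parents=True)
        (course / "hw1" / "hw1.ipynb").write_text("{}")
        for s in students:
            s.mkdir()
        assign(course, students, "hw1")
        # Only alice edits her copy before collection.
        (students[0] / "hw1" / "hw1.ipynb").write_text('{"answer": 42}')
        collect(course, students, "hw1")
        return {
            s.name: (course / "collected" / "hw1" / s.name / "hw1.ipynb").read_text()
            for s in students
        }
```

Calling `demo()` returns one collected file per student, with alice's edited copy and bob's untouched one kept separately.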


§4. Functionality

R

1. Jupyter Kernel

2. Command Line

3. RMarkdown (Knitr)

4. RTex (Knitr)


class: middle, center

§5. Culture


§5. Culture

People

The Team

  • Current active code contributors: John Jeng, Harald Schilly, Hal Snyder, William Stein, Travis Scholl

  • Past significant contributors: Greg Bard, Rob Beezer, Keith Clawson, Tim Clemans, Andy Huchala, Jon Lee, Simon Luu, Nicholas Ruhland, Todd Zimmerman

  • Advisers/Investors: ask me

Rich diversity of users

  • College and high school teachers

  • Researchers from around the world

  • Wide range of disciplines (not just math!)


§5. Culture

Commercial versus Academic

Sage and Jupyter are not-for-profit academic projects. CoCalc is a commercial project. WHY?

  • I hosted "the online pari/Magma calculator" and "the modular forms calculator" at Harvard 2000-2005.

  • I developed, maintained and hosted sagenb.org (the Sage Notebook) 2007--2014, and learned a lot about the challenges of hosting something like this at a University:

    • Attacks by hackers
    • Malware
    • Periodically getting our internet connection automatically cut by the University
    • Legal: violations by users expose me to liability.
  • I started CoCalc as a commercial project, not an academic one:

    • Generate money to support Sage development (instead, CoCalc has lost about $500K)
    • Be sustainable (not depend on grants)
    • Be fulltime: I work much better when I focus fulltime, and grants reduce my teaching only a little.
    • I wasn't getting grants anymore
    • Conversation with Fernando Perez about how "Project Jupyter" made the opposite decision...
  • I'm still faculty at UW, but have been on unpaid leave for 2 years.

???

  • Maybe mention that when I finally talked with the higher ups at UW about running sagenb, they said "absolutely no way".

class: middle, center

§6. Questions