
Distributed

As we have seen so far, Dask allows you to simply construct graphs of tasks with dependencies, as well as have graphs created automatically for you using functional, NumPy or Pandas syntax on data collections. None of this would be very useful if there weren't also a way to execute these graphs in a parallel, memory-aware way. So far we have been calling thing.compute() or dask.compute(thing) without worrying about what this entails. Now we will discuss the options available for that execution, and in particular the distributed scheduler, which comes with additional functionality.

Dask comes with four available schedulers:

  • "threaded": a scheduler backed by a thread pool

  • "processes": a scheduler backed by a process pool

  • "single-threaded" (aka "sync"): a synchronous scheduler, good for debugging

  • distributed: a distributed scheduler for executing graphs on multiple machines, see below.

To select one of these for computation, you can specify the scheduler when asking for a result, e.g.,

myvalue.compute(scheduler="single-threaded") # for debugging

or set the current default, either temporarily or globally

with dask.config.set(scheduler='processes'):
    # set temporarily for this block only
    myvalue.compute()

dask.config.set(scheduler='processes')  # set until further notice
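You can also inspect which default is currently configured; a minimal sketch (the fallback string here is arbitrary):

import dask

# returns the configured default scheduler, or the fallback if none is set
dask.config.get('scheduler', 'no default set')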

Let's see the difference for the familiar case of the flights data:

import dask.dataframe as dd
import os

df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': object,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

# Maximum non-cancelled delay
largest_delay = df[~df.Cancelled].DepDelay.max()
largest_delay
# each of the following gives the same results (you can check!)
# any surprises?
import time

for sch in ['threading', 'processes', 'sync']:
    t0 = time.time()
    _ = largest_delay.compute(scheduler=sch)
    print(sch, time.time() - t0)

Some Questions to Consider:

  • How much speedup is possible for this task (hint: look at the graph)?

  • Given how many cores are on this machine, how much faster could the parallel schedulers be than the single-threaded scheduler?

  • How much faster was using threads over a single thread? Why does this differ from the optimal speedup?

  • Why is the multiprocessing scheduler so much slower here?

For single-machine use, the threaded and multiprocessing schedulers are fine choices. They are solid, mature and performant, and require absolutely no set-up. As a rule of thumb, threaded will work well when the functions called release the GIL, whereas multiprocessing will always have a slower start-up time and suffer where a lot of communication is required between tasks. The number of workers is, in general, also important.
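For example, the local schedulers accept a num_workers keyword to control the pool size; a minimal sketch, using an arbitrary pool of 4:

# run the same computation on the multiprocessing scheduler,
# but with an explicit pool of 4 worker processes
largest_delay.compute(scheduler='processes', num_workers=4)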

For scaling out work across a cluster, the distributed scheduler is required. Indeed, this is now generally preferred for all work, because it gives you additional monitoring information not available in the other schedulers. (Some of this monitoring is also available with an explicit progress bar and profiler, see here.)
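For the local schedulers, one such tool is the progress bar from dask.diagnostics; a minimal sketch:

from dask.diagnostics import ProgressBar

# prints a textual progress bar while a local scheduler runs the graph
with ProgressBar():
    largest_delay.compute()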

Making a cluster

Simple method

The dask.distributed system is composed of a single centralized scheduler and one or more worker processes. Deploying a remote Dask cluster involves some additional effort, but doing things locally just involves creating a Client object, which lets you interact with the "cluster" (local threads or processes on your machine). For more information see here.

Note that Client() takes a lot of optional arguments, to configure the number of processes/threads, memory limits and other settings.
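For example, a sketch of an explicitly configured local cluster (the particular values are illustrative, not recommendations):

from dask.distributed import Client

# 2 worker processes with 2 threads each, capped at 1GB of memory per worker
client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')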

from dask.distributed import Client

# Set up a local cluster.
# By default this sets up 1 worker per core
client = Client()
client.cluster

Be sure to click the Dashboard link to open up the diagnostics dashboard.

Executing with the distributed client

Consider some trivial calculations, like those we've used before, where sleep statements have been added in order to simulate real work being done.

from dask import delayed
import time

def inc(x):
    time.sleep(5)
    return x + 1

def dec(x):
    time.sleep(3)
    return x - 1

def add(x, y):
    time.sleep(7)
    return x + y

By default, creating a Client makes it the default scheduler. Any calls to .compute will use the cluster your client is attached to, unless you specify otherwise, as above.

x = delayed(inc)(1)
y = delayed(dec)(2)
total = delayed(add)(x, y)
total.compute()

The tasks will appear in the web UI as they are processed by the cluster and, eventually, a result will be printed as output of the cell above. The resulting task block graph might look something like the one below; hovering over each block shows which function it relates to and how long it took to execute. Note that the kernel is blocked while waiting for the result.
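If you would rather not block the kernel, the client can also hand back a future immediately; a minimal sketch, assuming the client and total defined above:

# client.compute submits the graph and returns a Future straight away;
# .result() only blocks when the value is actually requested
future = client.compute(total)
future.result()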

You can also see a simplified version of the graph being executed on the Graph pane of the dashboard, as long as the calculation is in flight.

Let's return to the flights computation from before, and see what happens on the dashboard (you may wish to have both the notebook and dashboard side-by-side). How does this perform compared to before?

%time largest_delay.compute()

In this particular case, this should be as fast as or faster than the best case above, threading. Why do you suppose this is? You should start your reading here, and in particular note that the distributed scheduler was a complete rewrite with more intelligence around sharing of intermediate results and which tasks run on which worker. This results in better performance in some cases, but the distributed scheduler still has larger latency and overhead than the threaded scheduler, so there will be rare cases where it performs worse. Fortunately, the dashboard now gives us a lot more diagnostic information. Look at the Profile page of the dashboard to find out which operations take the biggest fraction of CPU time for the computation we just performed.
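If you want to keep those diagnostics, dask.distributed also offers a performance_report context manager; a minimal sketch (the filename is arbitrary):

from dask.distributed import performance_report

# captures the dashboard diagnostics for this computation into an HTML file
with performance_report(filename="dask-report.html"):
    largest_delay.compute()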

If all you want to do is execute computations created using delayed, or run calculations based on the higher-level data collections (see the coming sections), then that is about all you need to know to scale your work up to a cluster. However, there is more detail to know about the distributed scheduler that will help with efficient usage. See the chapter Distributed, Advanced.

Exercise

Run the following computations while looking at the diagnostics page. In each case what is taking the most time?

# Number of flights
_ = len(df)

# Number of non-cancelled flights
_ = len(df[~df.Cancelled])

# Number of non-cancelled flights per-airport
_ = df[~df.Cancelled].groupby('Origin').Origin.count().compute()

# Average departure delay from each airport?
_ = df[~df.Cancelled].groupby('Origin').DepDelay.mean().compute()

# Average departure delay per day-of-week
_ = df.groupby(df.Date.dt.dayofweek).DepDelay.mean().compute()