Parallel and Distributed Machine Learning
Dask-ML has resources for parallel and distributed machine learning.
Types of Scaling
There are a couple of distinct scaling problems you might face. The scaling strategy depends on which problem you're facing.
Large Models: Data fits in RAM, but training takes too long. Many hyperparameter combinations, a large ensemble of many models, etc.
Large Datasets: Data is larger than RAM, and sampling isn't an option.
For in-memory problems, just use scikit-learn (or your favorite ML library).
For large models, use
dask_ml.joblib
and your favorite scikit-learn estimatorFor large datasets, use
dask_ml
estimators
Scikit-Learn in 5 Minutes
Scikit-Learn has a nice, consistent API.
You instantiate an
Estimator
(e.g.LinearRegression
,RandomForestClassifier
, etc.). All of the models hyperparameters (user-specified parameters, not the ones learned by the estimator) are passed to the estimator when it's created.You call
estimator.fit(X, y)
to train the estimator.Use
estimator
to inspect attributes, make predictions, etc.
Let's generate some random data.
We'll fit a LogisitcRegression.
Create the estimator and fit it.
Inspect the learned attributes.
Check the accuracy.
Hyperparameters
Most models have hyperparameters. They affect the fit, but are specified up front instead of learned during training.
Hyperparameter Optimization
There are a few ways to learn the best hyperparameters while training. One is GridSearchCV
. As the name implies, this does a brute-force search over a grid of hyperparameter combinations.
Single-machine parallelism with scikit-learn
Scikit-Learn has nice single-machine parallelism, via Joblib. Any scikit-learn estimator that can operate in parallel exposes an n_jobs
keyword. This controls the number of CPU cores that will be used.
Multi-machine parallelism with Dask
Dask can talk to scikit-learn (via joblib) so that your cluster is used to train a model.
If you run this on a laptop, it will take quite some time, but the CPU usage will be satisfyingly near 100% for the duration. To run faster, you would need a disrtibuted cluster. That would mean putting something in the call to Client
something like
Details on the many ways to create a cluster can be found here.
Let's try it on a larger problem (more hyperparameters).
Training on Large Datasets
Sometimes you'll want to train on a larger than memory dataset. dask-ml
has implemented estimators that work well on dask arrays and dataframes that may be larger than your machine's RAM.
We'll make a small (random) dataset locally using scikit-learn.
The small dataset will be the template for our large random dataset. We'll use dask.delayed
to adapt sklearn.datasets.make_blobs
, so that the actual dataset is being generated on our workers.
The algorithms implemented in Dask-ML are scalable. They handle larger-than-memory datasets just fine.
They follow the scikit-learn API, so if you're familiar with scikit-learn, you'll feel at home with Dask-ML.