Welcome to the Dask Tutorial
You can run this notebook in a live session or view it on GitHub.
Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
Dask can scale up to your full laptop capacity and out to a cloud cluster.
An example Dask computation
In the following lines of code, we’re reading the NYC taxi cab data from 2015 and finding the mean tip amount. Don’t worry about the code; this is just a quick demonstration. We’ll go over all of this in the next notebook. :)
Note for learners: This might be heavy for Binder.
Note for instructors: Don’t forget to open the Dask Dashboard!
[1]:
import dask.dataframe as dd
from dask.distributed import Client
[2]:
client = Client()
client
[2]:
Client: Client-845b1dac-168d-11ee-8eb8-6045bd777373
Connection method: Cluster object | Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Cluster: LocalCluster (2 workers, 2 threads total, 6.77 GiB memory)
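The dashboard address shown above can also be retrieved programmatically from the client object, which is handy if the HTML summary does not render. A minimal sketch, using the client created above:
# The dashboard URL is exposed as an attribute on the client.
print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status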
[3]:
ddf = dd.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    columns=["passenger_count", "tip_amount"],
    storage_options={"anon": True},
)
[4]:
result = ddf.groupby("passenger_count").tip_amount.mean().compute()
result
[4]:
passenger_count
0 1.590343
1 1.752130
2 1.705595
3 1.579748
4 1.459269
5 1.728534
6 1.680769
7 3.863473
8 5.060718
9 5.075917
Name: tip_amount, dtype: float64
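As a brief aside (covered properly in the next notebook), nothing is read or computed until .compute() is called; the groupby expression on its own only builds a lazy task graph. A minimal sketch of that distinction, reusing the ddf defined above:
# Without .compute(), we get a lazy Dask Series that only records the task graph.
lazy_result = ddf.groupby("passenger_count").tip_amount.mean()
print(type(lazy_result))  # a Dask Series, not a pandas Series
# Calling lazy_result.compute() runs the graph on the cluster and returns a pandas Series.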
What is Dask?
There are many parts to the Dask project:
Collections/API, also known as the “core library”
Distributed, used to create clusters
Integrations and the broader ecosystem
Dask Collections
Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.
We can think of Dask’s APIs (also called collections) at a high and a low level:
High-level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and pandas but can operate in parallel on datasets that don’t fit into memory.
Low-level collections: Dask also provides low-level Delayed and Futures collections that give you finer control to build custom parallel and distributed computations.
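To make that distinction concrete, here is a minimal sketch (separate from the taxi example above) that uses a high-level Array collection and the low-level delayed interface; the shapes and functions are just illustrative:
import dask
import dask.array as da

# High-level collection: a NumPy-like array split into chunks that are processed in parallel.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# Low-level interface: wrap ordinary Python functions to build a custom task graph.
@dask.delayed
def double(n):
    return 2 * n

total = dask.delayed(sum)([double(i) for i in range(5)])
print(total.compute())  # 20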
Dask Cluster
Most of the time when you are using Dask, you will be using a distributed scheduler, which exists in the context of a Dask cluster. A Dask cluster is structured as a client that submits work, a scheduler that coordinates it, and one or more workers that execute the tasks.
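The Client() call at the top of this notebook created exactly this structure on the local machine. A minimal sketch that spells the pieces out explicitly (the worker counts are just illustrative):
from dask.distributed import Client, LocalCluster

# The cluster object starts a scheduler plus two worker processes on this machine.
local_cluster = LocalCluster(n_workers=2, threads_per_worker=1)
# The client connects to the scheduler and is what your code talks to.
local_client = Client(local_cluster)
print(local_cluster.scheduler_address)  # tcp address of the scheduler
local_client.close()
local_cluster.close()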
Dask Ecosystem
In addition to the core Dask library and its distributed scheduler, the Dask ecosystem connects several additional initiatives, including:
Dask-ML (parallel scikit-learn-style API)
Dask-image
Dask-cuDF
Dask-sql
Dask-snowflake
Dask-mongo
Dask-bigquery
Community libraries that have built-in Dask integrations, like:
Xarray
XGBoost
Prefect
Airflow
Dask deployment libraries:
Dask-kubernetes
Dask-YARN
Dask-gateway
Dask-cloudprovider
Dask-jobqueue
…and more. When we talk about the Dask project, we include all these efforts as part of the community.
Dask Use Cases
Dask is used in multiple fields such as:
Geospatial
Finance
Astrophysics
Microbiology
Environmental science
Check out the Dask use cases page, which provides a number of sample workflows.
Prepare
You should clone this repository:
git clone https://github.com/dask/dask-tutorial
and then install the necessary packages. There are a few different ways to achieve this; pick the one that best suits you, and only pick one option. They are, in order of preference:
Option 1: Create a conda environment (recommended). In the main repo directory, run:
conda env create -f binder/environment.yml
conda activate dask-tutorial
Option 2: Install into an existing environment. You will need the following core libraries:
conda install -c conda-forge ipycytoscape jupyterlab python-graphviz matplotlib zarr xarray pooch pyarrow s3fs scipy dask distributed dask-labextension
Note that these options will alter your existing environment, potentially changing the versions of packages you already have installed.
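Whichever option you choose, you can check that the environment is ready with a quick import test (a minimal sketch; the printed versions will vary):
# Verify that dask and the distributed scheduler are importable.
import dask
import distributed

print(dask.__version__, distributed.__version__)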
Tutorial Structure
Each section is a Jupyter notebook. There’s a mixture of text, code, and exercises.
Overview - Dask’s place in the universe.
Dataframe - parallelized operations on many pandas dataframes spread across your cluster.
Array - blocked NumPy-like functionality with a collection of NumPy arrays spread across your cluster.
Delayed - the single-function way to parallelize general Python code.
Deployment/Distributed - Dask’s scheduler for clusters, with details of how to view the UI.
Distributed Futures - non-blocking results that compute asynchronously.
Conclusion
If you haven’t used JupyterLab, it’s similar to the Jupyter Notebook. If you haven’t used the Notebook, the quick intro is:
There are two modes: command and edit
From command mode, press Enter to edit a cell (like this markdown cell)
From edit mode, press Esc to change to command mode
Press shift+enter to execute a cell and move to the next cell.
The toolbar has commands for executing, converting, and creating cells.
Exercise: Print Hello, world!
Each notebook will have exercises for you to solve. You’ll be given a blank or partially completed cell, followed by a hidden cell with a solution. For example.
Print the text “Hello, world!”.
[5]:
# Your code here
The next cell has the solution. Click the ellipsis to expand the solution, and always make sure to run the solution cell, in case later sections of the notebook depend on the output from the solution.
[6]:
print("Hello, world!")
Hello, world!
Useful Links
Reference
Ask for help
dask tag on Stack Overflow (http://stackoverflow.com/questions/tagged/dask), for usage questions
GitHub issues for bug reports and feature requests
discourse forum for general, non-bug, questions and discussion
Attend a live tutorial