Welcome to the Dask Tutorial

An example Dask computation¶

In the following lines of code, we’re reading the NYC taxi cab data from 2015 and finding the mean tip amount. Don’t worry about the code, this is just for a quick demonstration. We’ll go over all of this in the next notebook. :)

Note for learners: This might be heavy for Binder.

Note for instructors: Don’t forget to open the Dask Dashboard!

[1]:

import dask.dataframe as dd
from dask.distributed import Client

[2]:

client = Client()
client

[2]:

Client

Client-845b1dac-168d-11ee-8eb8-6045bd777373

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

LocalCluster

91ea8798

Dashboard: http://127.0.0.1:8787/status	Workers: 2
Total threads: 2	Total memory: 6.77 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-7976f3a2-3aff-4784-95bb-9426ad30d2e0

Comm: tcp://127.0.0.1:34735	Workers: 2
Dashboard: http://127.0.0.1:8787/status	Total threads: 2
Started: Just now	Total memory: 6.77 GiB

Workers

Worker: 0

Comm: tcp://127.0.0.1:38123	Total threads: 1
Dashboard: http://127.0.0.1:41927/status	Memory: 3.38 GiB
Nanny: tcp://127.0.0.1:43723
Local directory: /tmp/dask-worker-space/worker-x071ar75

Worker: 1

Comm: tcp://127.0.0.1:45859	Total threads: 1
Dashboard: http://127.0.0.1:35955/status	Memory: 3.38 GiB
Nanny: tcp://127.0.0.1:37341
Local directory: /tmp/dask-worker-space/worker-c88ae90u

[3]:

ddf = dd.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    columns=["passenger_count", "tip_amount"],
    storage_options={"anon": True},
)

[4]:

result = ddf.groupby("passenger_count").tip_amount.mean().compute()
result

[4]:

passenger_count
0    1.590343
1    1.752130
2    1.705595
3    1.579748
4    1.459269
5    1.728534
6    1.680769
7    3.863473
8    5.060718
9    5.075917
Name: tip_amount, dtype: float64

What is Dask?¶

There are many parts to the “Dask” the project: * Collections/API also known as “core-library”. * Distributed – to create clusters * Intergrations and broader ecosystem

Dask Collections¶

Dask provides multi-core and distributed+parallel execution on larger-than-memory datasets

We can think of Dask’s APIs (also called collections) at a high and a low level:

High vs Low level clothes analogy

High-level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and pandas but can operate in parallel on datasets that don’t fit into memory.
Low-level collections: Dask also provides low-level Delayed and Futures collections that give you finer control to build custom parallel and distributed computations.

Dask Cluster¶

Most of the times when you are using Dask, you will be using a distributed scheduler, which exists in the context of a Dask cluster. The Dask cluster is structured as:

Distributed overview

Dask Ecosystem¶

In addition to the core Dask library and its distributed scheduler, the Dask ecosystem connects several additional initiatives, including:

Dask-ML (parallel scikit-learn-style API)
Dask-image
Dask-cuDF
Dask-sql
Dask-snowflake
Dask-mongo
Dask-bigquery

Community libraries that have built-in dask integrations like:

Xarray
XGBoost
Prefect
Airflow

Dask deployment libraries - Dask-kubernetes - Dask-YARN - Dask-gateway - Dask-cloudprovider - jobqueue

… When we talk about the Dask project we include all these efforts as part of the community.

Dask Use Cases¶

Dask is used in multiple fields such as:

Geospatial
Finance
Astrophysics
Microbiology
Environmental science

Check out the Dask use cases page that provides a number of sample workflows.

Prepare¶

git clone http://github.com/dask/dask-tutorial

and then install necessary packages. There are three different ways to achieve this, pick the one that best suits you, and only pick one option. They are, in order of preference:

In the main repo directory

conda env create -f binder/environment.yml
conda activate dask-tutorial

You will need the following core libraries

conda install -c conda-forge ipycytoscape jupyterlab python-graphviz matplotlib zarr xarray pooch pyarrow s3fs scipy dask distributed dask-labextension

Note that these options will alter your existing environment, potentially changing the versions of packages you already have installed.

Tutorial Structure¶

Each section is a Jupyter notebook. There’s a mixture of text, code, and exercises.

Overview - dask’s place in the universe.
Dataframe - parallelized operations on many pandas dataframes spread across your cluster.
Array - blocked numpy-like functionality with a collection of numpy arrays spread across your cluster.
Delayed - the single-function way to parallelize general python code.
Deployment/Distributed - Dask’s scheduler for clusters, with details of how to view the UI.
Distributed Futures - non-blocking results that compute asynchronously.
Conclusion

If you haven’t used Jupyterlab, it’s similar to the Jupyter Notebook. If you haven’t used the Notebook, the quick intro is

There are two modes: command and edit
From command mode, press Enter to edit a cell (like this markdown cell)
From edit mode, press Esc to change to command mode
Press shift+enter to execute a cell and move to the next cell.

The toolbar has commands for executing, converting, and creating cells.

Exercise: Print `Hello, world!`¶

Each notebook will have exercises for you to solve. You’ll be given a blank or partially completed cell, followed by a hidden cell with a solution. For example.

Print the text “Hello, world!”.

[5]:

# Your code here

The next cell has the solution. Click the ellipses to expand the solution, and always make sure to run the solution cell, in case later sections of the notebook depend on the output from the solution.

[6]:

print("Hello, world!")

Hello, world!

Useful Links¶

Reference
- Docs
- Examples
- Code
- Blog
Ask for help
- `dask <http://stackoverflow.com/questions/tagged/dask>`__ tag on Stack Overflow, for usage questions
- github issues for bug reports and feature requests
- discourse forum for general, non-bug, questions and discussion
- Attend a live tutorial

Dask Tutorial documentation

Contents

Welcome to the Dask Tutorial¶

An example Dask computation¶

Client

Cluster Info

LocalCluster

Scheduler Info

Scheduler

Workers

Worker: 0

Worker: 1

What is Dask?¶

Dask Collections¶

Dask Cluster¶

Dask Ecosystem¶

Dask Use Cases¶

Prepare¶

Tutorial Structure¶

Exercise: Print `Hello, world!`¶

Useful Links¶

Dask Tutorial documentation

Welcome to the Dask Tutorial

Contents

Welcome to the Dask Tutorial¶

An example Dask computation¶

Client

Cluster Info

LocalCluster

Scheduler Info

Scheduler

Workers

Worker: 0

Worker: 1

What is Dask?¶

Dask Collections¶

Dask Cluster¶

Dask Ecosystem¶

Dask Use Cases¶

Prepare¶

Tutorial Structure¶

Exercise: Print Hello, world!¶

Useful Links¶

Exercise: Print `Hello, world!`¶