You can run this notebook in a live session on Binder or view it on GitHub.

Introduction

Welcome to the Dask Tutorial.

Dask is a parallel computing library that scales the existing Python ecosystem. This tutorial will introduce Dask and parallel data analysis more generally.

Dask can scale down to your laptop and up to a cluster. Here, we’ll use an environment you set up on your laptop to analyze medium-sized datasets in parallel locally.

Overview

Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.

We can think of Dask at a high and a low level:

  • High-level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.

  • Low-level schedulers: Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also power custom, user-defined workloads. These schedulers are low-latency (around 1 ms) and work hard to run computations in a small memory footprint. In complex cases, Dask’s schedulers are an alternative to using the threading or multiprocessing libraries directly, or to other task scheduling systems like Luigi or IPython parallel.
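
As a rough illustration of the high-level collections described above, the following sketch builds a chunked Dask array and reduces it. The array shape and chunk sizes are arbitrary values chosen for this example, not recommendations.

```python
import dask.array as da

# A 4000 x 4000 array of ones, split into 1000 x 1000 chunks.
# Nothing is allocated yet; Dask only records how to build each chunk.
x = da.ones((4000, 4000), chunks=(1000, 1000))

# NumPy-style operations are lazy and extend the task graph.
total = x.sum()

# compute() runs the graph, processing chunks in parallel,
# and returns a concrete NumPy scalar.
result = total.compute()
print(x.numblocks, result)
```

Because each chunk is created and reduced independently, the full array never has to be held in memory at once. The same pattern applies to arrays far larger than this one.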

Different users operate at different levels but it is useful to understand both.
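
At the lower level, a custom workload can be handed to the schedulers with `dask.delayed`. This is a minimal sketch; `inc` and `add` are toy functions invented for this example.

```python
from dask import delayed


def inc(x):
    """Toy function standing in for real work."""
    return x + 1


def add(x, y):
    """Toy function that combines two intermediate results."""
    return x + y


# Calling a delayed function records a task instead of running it.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)

# compute() hands the task graph to a scheduler. The two inc tasks
# do not depend on each other, so the scheduler is free to run them in parallel.
result = total.compute()
print(result)
```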

The Dask use cases page provides a number of sample workflows where Dask should be a good fit.

Prepare

You should clone this repository:

git clone http://github.com/dask/dask-tutorial

The included file environment.yml in the binder subdirectory lists all of the packages needed to run this tutorial. To install them using conda, run:

conda env create -f binder/environment.yml
conda activate dask-tutorial

Do this before running this notebook.

Finally, run the following script to download and create data for analysis.

[ ]:
# in directory dask-tutorial/
# this takes a little while
%run prep.py

Tutorial Structure

Each section is a Jupyter notebook. There’s a mixture of text, code, and exercises.

If you haven’t used JupyterLab, it’s similar to the Jupyter Notebook. If you haven’t used the Notebook, the quick intro is:

  1. There are two modes: command and edit

  2. From command mode, press Enter to edit a cell (like this markdown cell)

  3. From edit mode, press Esc to change to command mode

  4. Press Shift+Enter to execute a cell and move to the next cell.

The toolbar has commands for executing, converting, and creating cells.

The layout of the tutorial will be as follows:

  • Foundations: an explanation of what Dask is, how it works, and how to use lower-level primitives to set up computations. Casual users may wish to skip this section, although we consider it useful knowledge for all users.

  • Distributed: information on running Dask on the distributed scheduler, which enables scale-up to distributed settings and enhanced monitoring of task operations. The distributed scheduler is now generally the recommended engine for executing task work, even on single workstations or laptops.

  • Collections: convenient abstractions giving a familiar feel to big data

      • bag: Python iterators with a functional paradigm, such as found in functools/itertools and toolz. Bags generalize lists/generators to big data; this will seem very familiar to users of PySpark’s RDD.

      • array: massive multi-dimensional numerical data, with NumPy functionality

      • dataframes: massive tabular data, with Pandas functionality
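
As a small taste of the functional style that the bag collection offers, here is a hedged sketch. The input sequence and partition count are arbitrary, and the threaded scheduler is selected explicitly only to keep the example lightweight.

```python
import dask.bag as db

# Split a plain Python sequence into two partitions.
numbers = db.from_sequence(range(10), npartitions=2)

# Chain functional operations lazily: keep the even numbers, square them, sum.
pipeline = (
    numbers
    .filter(lambda n: n % 2 == 0)
    .map(lambda n: n ** 2)
    .sum()
)

# Run the pipeline; partitions are processed independently.
result = pipeline.compute(scheduler="threads")
print(result)
```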

While there is a wealth of information in the documentation, here we aim to give practical advice to aid your understanding and application of Dask in everyday situations. This means that you should not expect every feature of Dask to be covered, but the examples are hopefully similar to the kinds of workflows that you have in mind.

Exercise: Print Hello, world!

Each notebook will have exercises for you to solve. You’ll be given a blank or partially completed cell, followed by a “magic” cell that will load the solution. For example:

Print the text “Hello, world!”.

[ ]:
# Your code here
[ ]:
%load solutions/00-hello-world.py

The above cell needs to be executed twice, once to load the solution and once to run it.