Welcome to the Dask Tutorial.
Dask is a parallel computing library that scales the existing Python ecosystem. This tutorial will introduce Dask and parallel data analysis more generally.
Dask can scale down to your laptop laptop and up to a cluster. Here, we’ll use an environment you setup on your laptop to analyze medium sized datasets in parallel locally.
Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.
We can think of Dask at a high and a low level
High level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.
Low Level schedulers: Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also power custom, user-defined workloads. These schedulers are low-latency (around 1ms) and work hard to run computations in a small memory footprint. Dask’s schedulers are an alternative to direct use of
multiprocessinglibraries in complex cases or other task scheduling systems like
Different users operate at different levels but it is useful to understand both.
The Dask use cases provides a number of sample workflows where Dask should be a good fit.
You should clone this repository
git clone http://github.com/dask/dask-tutorial
The included file
environment.yml in the
binder subdirectory contains a list of all of the packages needed to run this tutorial. To install them using
conda, you can do
conda env create -f binder/environment.yml conda activate dask-tutorial
Do this before running this notebook
Finally, run the following script to download and create data for analysis.
# in directory dask-tutorial/ # this takes a little while %run prep.py
Each section is a Jupyter notebook. There’s a mixture of text, code, and exercises.
If you haven’t used Jupyterlab, it’s similar to the Jupyter Notebook. If you haven’t used the Notebook, the quick intro is
There are two modes: command and edit
From command mode, press
Enterto edit a cell (like this markdown cell)
From edit mode, press
Escto change to command mode
shift+enterto execute a cell and move to the next cell.
The toolbar has commands for executing, converting, and creating cells.
The layout of the tutorial will be as follows: - Foundations: an explanation of what Dask is, how it works, and how to use lower-level primitives to set up computations. Casual users may wish to skip this section, although we consider it useful knowledge for all users. - Distributed: information on running Dask on the distributed scheduler, which enables scale-up to distributed settings and enhanced monitoring of task operations. The distributed scheduler is now generally the recommended engine for executing task work, even on single workstations or laptops. - Collections: convenient abstractions giving a familiar feel to big data - bag: Python iterators with a functional paradigm, such as found in func/iter-tools and toolz - generalize lists/generators to big data; this will seem very familiar to users of PySpark’s RDD - array: massive multi-dimensional numerical data, with Numpy functionality - dataframes: massive tabular data, with Pandas functionality
Whereas there is a wealth of information in the documentation, linked above, here we aim to give practical advice to aid your understanding and application of Dask in everyday situations. This means that you should not expect every feature of Dask to be covered, but the examples hopefully are similar to the kinds of work-flows that you have in mind.
Each notebook will have exercises for you to solve. You’ll be given a blank or partially completed cell, followed by a “magic” cell that will load the solution. For example
Print the text “Hello, world!”.
# Your code here
The above cell needs to be executed twice, once to load the solution and once to run it.