{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Welcome to the Dask Tutorial\n", "\n", "Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.\n", "\n", "Dask can scale up to your full laptop capacity and out to a cloud cluster." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An example Dask computation\n", "\n", "In the following lines of code, we're reading the NYC taxi cab data from 2015 and finding the mean tip amount for each passenger count. Don't worry about the code; this is just a quick demonstration. We'll go over all of this in the next notebook. :)\n", "\n", "**Note for learners:** This might be heavy for Binder.\n", "\n", "**Note for instructors:** Don't forget to open the Dask Dashboard!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "from dask.distributed import Client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = Client()\n", "client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ddf = dd.read_parquet(\n", "    \"s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet\",\n", "    columns=[\"passenger_count\", \"tip_amount\"],\n", "    storage_options={\"anon\": True},\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = ddf.groupby(\"passenger_count\").tip_amount.mean().compute()\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is [Dask](https://www.dask.org/)?\n", "\n", "There are many parts to the \"Dask\" project:\n", "* Collections/API, also known as the \"core library\".\n", "* Distributed -- to create clusters\n", "* Integrations and the broader ecosystem\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dask Collections\n", "\n", "Dask provides **multi-core** and 
**distributed+parallel** execution on **larger-than-memory** datasets\n", "\n", "We can think of Dask's APIs (also called collections) at a high and a low level:\n", "\n", "
\n", "
\n", "\n", "* **High-level collections:** Dask provides high-level Array, Bag, and DataFrame\n", " collections that mimic NumPy, lists, and pandas but can operate in parallel on\n", " datasets that don't fit into memory.\n", "* **Low-level collections:** Dask also provides low-level Delayed and Futures\n", " collections that give you finer control to build custom parallel and distributed computations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dask Cluster\n", "\n", "Most of the time when you are using Dask, you will be using a distributed scheduler, which exists in the context of a Dask cluster. A Dask cluster is made up of a scheduler that coordinates the work and one or more workers that carry it out; a client connects to the scheduler to submit computations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dask Ecosystem\n", "\n", "In addition to the core Dask library and its distributed scheduler, the Dask ecosystem connects several additional initiatives, including:\n", "\n", "- Dask-ML (parallel scikit-learn-style API)\n", "- Dask-image\n", "- Dask-cuDF\n", "- Dask-sql\n", "- Dask-snowflake\n", "- Dask-mongo\n", "- Dask-bigquery\n", "\n", "Community libraries that have built-in Dask integrations, such as:\n", "\n", "- Xarray\n", "- XGBoost\n", "- Prefect\n", "- Airflow\n", "\n", "Dask deployment libraries:\n", "\n", "- Dask-kubernetes\n", "- Dask-YARN\n", "- Dask-gateway\n", "- Dask-cloudprovider\n", "- Dask-jobqueue\n", "\n", "When we talk about the Dask project, we include all of these efforts as part of the community." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dask Use Cases\n", "\n", "Dask is used in multiple fields such as:\n", "\n", "* Geospatial\n", "* Finance\n", "* Astrophysics\n", "* Microbiology\n", "* Environmental science\n", "\n", "Check out the Dask [use cases](https://stories.dask.org/en/latest/) page, which provides a number of sample workflows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "#### 1. 
You should clone this repository\n", "\n", "\n", "    git clone http://github.com/dask/dask-tutorial\n", "\n", "and then install the necessary packages.\n", "There are two different ways to achieve this. Pick the one that best suits you, and ***only pick one option***.\n", "They are, in order of preference:\n", "\n", "#### 2a) Create a conda environment (preferred)\n", "\n", "In the main repo directory:\n", "\n", "\n", "    conda env create -f binder/environment.yml\n", "    conda activate dask-tutorial\n", "\n", "\n", "#### 2b) Install into an existing environment\n", "\n", "You will need the following core libraries:\n", "\n", "\n", "    conda install -c conda-forge ipycytoscape jupyterlab python-graphviz matplotlib zarr xarray pooch pyarrow s3fs scipy dask distributed dask-labextension\n", "\n", "Note that this option will alter your existing environment, potentially changing the versions of packages you already\n", "have installed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tutorial Structure\n", "\n", "Each section is a Jupyter notebook. There's a mixture of text, code, and exercises.\n", "\n", "0. [Overview](00_overview.ipynb) - Dask's place in the universe.\n", "\n", "1. [Dataframe](01_dataframe.ipynb) - parallelized operations on many pandas dataframes spread across your cluster.\n", "\n", "2. [Array](02_array.ipynb) - blocked NumPy-like functionality with a collection of NumPy arrays spread across your cluster.\n", "\n", "3. [Delayed](03_dask.delayed.ipynb) - the single-function way to parallelize general Python code.\n", "\n", "4. [Deployment/Distributed](04_distributed.ipynb) - Dask's scheduler for clusters, with details of how to view the UI.\n", "\n", "5. [Distributed Futures](05_futures.ipynb) - non-blocking results that compute asynchronously.\n", "\n", "6. Conclusion\n", "\n", "\n", "If you haven't used JupyterLab, it's similar to the Jupyter Notebook. If you haven't used the Notebook, the quick intro is:\n", "\n", "1. 
There are two modes: command and edit\n", "2. From command mode, press `Enter` to edit a cell (like this markdown cell)\n", "3. From edit mode, press `Esc` to change to command mode\n", "4. Press `Shift+Enter` to execute a cell and move to the next cell.\n", "\n", "The toolbar has commands for executing, converting, and creating cells." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Print `Hello, world!`\n", "Each notebook will have exercises for you to solve. You'll be given a blank or partially completed cell, followed by a hidden cell with a solution. For example:\n", "\n", "\n", "Print the text \"Hello, world!\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell has the solution. Click the ellipsis to expand the solution, and always make sure to run the solution cell,\n", "in case later sections of the notebook depend on the output from the solution." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [] }, "outputs": [], "source": [ "print(\"Hello, world!\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Useful Links" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Reference\n", "    * [Docs](https://dask.org/)\n", "    * [Examples](https://examples.dask.org/)\n", "    * [Code](https://github.com/dask/dask/)\n", "    * [Blog](https://blog.dask.org/)\n", "* Ask for help\n", "    * [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions\n", "    * [GitHub issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests\n", "    * [Discourse forum](https://dask.discourse.group/) for general, non-bug questions and discussion\n", "    * Attend a live tutorial" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }