{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Welcome to the Dask Tutorial\n", "\n", "\n", "\n", "Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.\n", "\n", "Dask can scale up to your full laptop capacity and out to a cloud cluster." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An example Dask computation\n", "\n", "In the following lines of code, we're reading the NYC taxi cab data from 2015 and finding the mean tip amount. Don't worry about the code, this is just for a quick demonstration. We'll go over all of this in the next notebook. :)\n", "\n", "**Note for learners:** This might be heavy for Binder.\n", "\n", "**Note for instructors:** Don't forget to open the Dask Dashboard!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "from dask.distributed import Client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = Client()\n", "client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ddf = dd.read_parquet(\n", " \"s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet\",\n", " columns=[\"passenger_count\", \"tip_amount\"],\n", " storage_options={\"anon\": True},\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = ddf.groupby(\"passenger_count\").tip_amount.mean().compute()\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is [Dask](\"https://www.dask.org/\")?\n", "\n", "There are many parts to the \"Dask\" the project:\n", "* Collections/API also known as \"core-library\".\n", "* Distributed -- to create clusters\n", "* Intergrations and broader ecosystem\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dask Collections\n", "\n", "Dask provides **multi-core** and **distributed+parallel** execution on **larger-than-memory** datasets\n", "\n", "We can think of Dask's APIs (also called collections) at a high and a low level:\n", "\n", "