diff --git a/00-README.ipynb b/00-README.ipynb new file mode 100644 index 0000000..4a343ee --- /dev/null +++ b/00-README.ipynb @@ -0,0 +1,124 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f3151511", + "metadata": {}, + "source": [ + "# Python in High Performance Computing\n", + "\n", + "This binder image includes several exercices from the CSC course \"Python in High Performance Computing\". The course is part of PRACE Training activity at CSC (https://www.futurelearn.com/courses/python-in-hpc). \n", + "\n", + "Also, it includes material from a Dask tutorial given at SciPy 2020 conference.\n", + "\n", + "## NOTE : Exercices with \"-->\" are suggestions to start\n", + "\n", + "## Exercises\n", + "\n", + "\n", + "\n", + "### Performance analysis\n", + "\n", + "1. Read the doc at [Profiling apps](performance/cprofile.ipynb)\n", + "2. Open a terminal and do the following exercice\n", + "\n", + " - **[--> Using cProfile](performance/cprofile)**\n", + "\n", + "### Multiprocessing\n", + "\n", + "1. Read the doc at [Python Multiprocessing](multiprocessing/Multiprocessing.ipynb)\n", + "2. Open a terminal and do the following exercices\n", + "\n", + " - [Simple calculation](multiprocessing/simple-calculation)\n", + " - [Work distribution](multiprocessing/work-distribution)\n", + "\n", + "### Parallel programming with mpi4py\n", + "\n", + "1. Read the doc at [MPI on Python](mpi/MPI_on_Python.ipynb)\n", + "2. Open a terminal and do the following exercices\n", + "\n", + " - **[--> Hello World](mpi/hello-world)**\n", + " - [Simple message exchange](mpi/message-exchange)\n", + " - [Message chain](mpi/message-chain)\n", + " - **[--> Non-blocking communication](mpi/non-blocking)**\n", + " - **[--> Collective operations](mpi/collectives)**\n", + "\n", + "### Dask\n", + "\n", + "1. 
Open the following notebooks to experiment (no need of a terminal)\n", + "\n", + " - **[--> Delayed](dask/01_dask.delayed.ipynb)**\n", + " - [Understanding 'Lazy'](dask/01x_lazy.ipynb)\n", + " - [Bags](dask/02_bag.ipynb)\n", + " - [Arrays](dask/03_array.ipynb)\n", + " - **[--> Dataframe](dask/04_dataframe.ipynb)**\n", + " - **[--> Distributed mode](dask/05_distributed.ipynb)**\n", + " - [Distributed advanced](dask/06_distributed_advanced.ipynb)\n", + " - [Storage optimization](dask/07_dataframe_storage.ipynb)\n", + " - [Machine Learning](dask/08_machine_learning.ipynb)\n", + "\n", + "### Bonus exercises\n", + "\n", + " - [Game of life](numpy/game-of-life)\n", + " - [Rotation with broadcasting](numpy/broadcast-rotation)\n", + " - [Two dimensional heat equation](numpy/heat-equation)\n", + " - [Parallel heat equation](mpi/heat-equation)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b4dbf0a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/00-overview.md b/00-overview.md new file mode 100644 index 0000000..51887f0 --- /dev/null +++ b/00-overview.md @@ -0,0 +1,39 @@ +--- +title: Python and High-Performance Computing +lang: en +--- + +# Efficiency + +- Python is an interpreted language + - no pre-compiled binaries, all code is translated on-the-fly to + machine instructions + - byte-code as a middle step which may be stored (.pyc) + +- All objects are dynamic in Python + - nothing is fixed == optimisation nightmare + - lot of overhead from metadata + +- Flexibility is good, but comes with a cost! + + +# Improving Python performance + +- Array based computations with NumPy +- Using extended Cython programming language +- Embed compiled code in a Python program + - C/C++, Fortran +- Utilize parallel processing + + +# Parallelisation strategies for Python + +- Global Interpreter Lock (GIL) + - CPython's memory management is not thread-safe + - no threads possible, except for I/O etc. + - affects overall performance of threading + +- Process-based "threading" with multiprocessing + - fork independent processes that have a limited way to communicate + +- **Message-passing** is the Way to Go to achieve true parallelism in Python diff --git a/02-performance-analysis.md b/02-performance-analysis.md new file mode 100644 index 0000000..bce003a --- /dev/null +++ b/02-performance-analysis.md @@ -0,0 +1,147 @@ +--- +title: Performance analysis +lang: en +--- + +# Performance measurement {.section} + +# Measuring application performance + +- Correctness is the most import factor in any application + - Premature optimization is the root of all evil\! +- Before starting to optimize application, one should measure where time is + spent + - Typically 90 % of time is spent in 10 % of application + +
+- Mind the algorithm!
+    - Recursive calculation of Fibonacci numbers (see the sketch after the table)
+ +
+
+
+| Implementation                 | Speedup |
+|--------------------------------|---------|
+| Pure Python                    | 1       |
+| Pure C                         | 126     |
+| Pure Python (better algorithm) | 24e6    |
+
+
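+
+As a rough illustration of the algorithmic point behind the table (a hypothetical
+snippet, not part of the course material):
+
+```python
+# Naive recursion recomputes the same subproblems over and over
+# (an exponential number of calls), while the iterative version
+# needs only n additions.
+from timeit import timeit
+
+def fib_recursive(n):
+    if n < 2:
+        return n
+    return fib_recursive(n - 1) + fib_recursive(n - 2)
+
+def fib_iterative(n):
+    a, b = 0, 1
+    for _ in range(n):
+        a, b = b, a + b
+    return a
+
+print(timeit(lambda: fib_recursive(30), number=1))  # slow: exponential work
+print(timeit(lambda: fib_iterative(30), number=1))  # fast: 30 additions
+```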
+
+
+
+# Measuring application performance
+
+- Application's own timers
+- **timeit** module
+- **cProfile** module
+- Full-fledged profiling tools: TAU, Intel Vtune, Python Tools for Visual
+  Studio ...
+
+
+# Measuring application performance
+
+- Python **time** module can be used for measuring time spent in a specific
+  part of the program
+    - `time.perf_counter()` : includes time spent in other processes
+    - `time.process_time()` : only time spent in the current process
+
+```python
+import time
+
+t0 = time.process_time()
+for n in range(niter):
+    heavy_calculation()
+t1 = time.process_time()
+
+print('Time spent in heavy calculation', t1-t0)
+```
+
+
+# timeit module
+
+- Easy timing of small bits of Python code
+- Tries to avoid common pitfalls in measuring execution times
+- Command line interface and Python interface
+- `%timeit` magic in IPython
+
+```python
+In [1]: from mymodule import func
+In [2]: %timeit func()
+
+10 loops, best of 3: 433 msec per loop
+```
+```bash
+$ python -m timeit -s "from mymodule import func" "func()"
+
+10 loops, best of 3: 433 msec per loop
+```
+
+
+# cProfile
+
+- Execution profile of a Python program
+    - Time spent in different parts of the program
+    - Call graphs
+- Python API:
+- Profiling a whole program from the command line
+
+```python
+import cProfile
+...
+
+# profile statement and save results to a file func.prof
+cProfile.run('func()', 'func.prof')
+```
+```bash
+$ python -m cProfile -o myprof.prof myprogram.py
+```
+
+
+# Investigating profile with pstats
+
+- Printing execution time of selected functions
+- Sorting by function name, time, cumulative time, ...
+- Python module interface and interactive browser
+
+ +``` +In [1]: from pstats import Stats +In [2]: p = Stats('myprof.prof') +In [3]: p.strip_dirs() +In [4]: p.sort_stats('time') +In [5]: p.print_stats(5) + +Mon Oct 12 10:11:00 2016 my.prof +... +``` + +
+
+ +```bash +$ python -m pstats myprof.prof + +Welcome to the profile statistics +% strip +% sort time +% stats 5 + +Mon Oct 12 10:11:00 2016 my.prof +... +``` + +
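+
+Two more views that are often useful are cumulative time and caller information;
+a small sketch assuming the same `myprof.prof` file and a hypothetical hot
+function called `evolve`:
+
+```python
+from pstats import Stats
+
+p = Stats('myprof.prof')
+p.strip_dirs()
+p.sort_stats('cumulative')   # which high-level calls dominate the runtime
+p.print_stats(5)
+p.print_callers('evolve')    # where the hot function is called from
+```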
+ + +# Summary + +- Python has various built-in tools for measuring application performance +- **time** module +- **timeit** module +- **cProfile** and **pstats** modules diff --git a/03-multiprocessing.md b/03-multiprocessing.md new file mode 100644 index 0000000..d468e43 --- /dev/null +++ b/03-multiprocessing.md @@ -0,0 +1,270 @@ +--- +title: Multiprocessing +lang: en +--- + +# Processes and threads + +![](img/processes-threads.png) + +
+ +## Process + +- Independent execution units +- Have their own state information and *own memory* address space + +
+
+ +## Thread + +- A single process may contain multiple threads +- Have their own state information, but *share* the *same memory* + address space + +
+ + +# Processes and threads + +![](img/processes-threads.png) + +
+ +## Process + +- Long-lived: created when parallel program started, killed when + program is finished +- Explicit communication between processes + +
+
+ +## Thread + +- Short-lived: created when entering a parallel region, destroyed + (joined) when region ends +- Communication through shared memory + +
+ +# Processes and threads + +![](img/processes-threads.png) + +
+ +## Process + +- MPI + - good performance + - scales from a laptop to a supercomputer + +
+
+ +## Thread + +- OpenMP + - C / Fortran, not Python +- threading module + - only for I/O bound tasks (maybe) + - Global Interpreter Lock (GIL) limits usability + +
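+
+A quick way to see the effect of the GIL (a hypothetical snippet, not one of the
+course exercises): a CPU-bound task gains nothing from threads, but scales with
+processes.
+
+```python
+import time
+from threading import Thread
+from multiprocessing import Process
+
+def work():
+    s = 0
+    for i in range(10_000_000):   # pure CPU work, no I/O
+        s += i * i
+
+if __name__ == '__main__':
+    for cls in (Thread, Process):
+        workers = [cls(target=work) for _ in range(4)]
+        t0 = time.perf_counter()
+        for w in workers:
+            w.start()
+        for w in workers:
+            w.join()
+        # threads run the loops one at a time (GIL), processes run them in parallel
+        print(cls.__name__, round(time.perf_counter() - t0, 2), 's')
+```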
+ + +# Processes and threads + +![](img/processes-threads.png) + +
+ +## Process + +- MPI + - good performance + - scales from a laptop to a supercomputer + +
+
+ +## ~~Thread~~ Process + +- multiprocessing module + - relies on OS for forking worker processes that mimic threads + - limited communication between the parallel processes + +
+ + +# Multiprocessing + +- Underlying OS used to spawn new independent subprocesses +- processes are independent and execute code in an asynchronous manner + - no guarantee on the order of execution +- Communication possible only through dedicated, shared communication + channels + - Queues, Pipes + - must be created before a new process is forked + + +# Spawn a process + +```python +from multiprocessing import Process +import os + +def hello(name): + print 'Hello', name + print 'My PID is', os.getpid() + print "My parent's PID is", os.getppid() + +# Create a new process +p = Process(target=hello, args=('Alice', )) + +# Start the process +p.start() +print 'Spawned a new process from PID', os.getpid() + +# End the process +p.join() +``` + + +# Communication + +- Sharing data + - shared memory, data manager +- Pipes + - direct communication between two processes +- Queues + - work sharing among a group of processes +- Pool of workers + - offloading tasks to a group of worker processes + + +# Queues + +- FIFO (*first-in-first-out*) task queues that can be used to distribute + work among processes +- Shared among all processes + - all processes can add and retrieve data from the queue +- Automatically takes care of locking, so can be used safely with minimal + hassle + + +# Queues + +```python +from multiprocessing import Process, Queue + +def f(q): + while True: + x = q.get() + if x is None: + break + print(x**2) + +q = Queue() +for i in range(100): + q.put(i) +# task queue: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 99] + +for i in range(3): + q.put(None) + p = Process(target=f, args=(q, )) + p.start() +``` + + +# Queues + +```python +from multiprocessing import Process, Queue + +def f(q): + while True: + x = q.get() + if x is None: # if sentinel, stop execution + break + print(x**2) + +q = Queue() +for i in range(100): + q.put(i) +# task queue: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 99] + +for i in range(3): + q.put(None) # add sentinels to the queue to signal STOP + p = Process(target=f, args=(q, )) + p.start() +``` + + +# Pool of workers + +- Group of processes that carry out tasks assigned to them + 1. Master process submits tasks to the pool + 2. Pool of worker processes perform the tasks + 3. Master process retrieves the results from the pool +- Blocking and non-blocking (= asynchronous) calls available + + +# Pool of workers + +```python +from multiprocessing import Pool +import time + +def f(x): + return x**2 + +pool = Pool(8) + +# Blocking execution (with a single process) +result = pool.apply(f, (4,)) +print(result) + +# Non-blocking execution "in the background" +result = pool.apply_async(f, (12,)) +while not result.ready(): + time.sleep(1) +print(result.get()) +# an alternative to "sleeping" is to use e.g. 
result.get(timeout=1) +``` + + +# Pool of workers + +```python +from multiprocessing import Pool +import time + +def f(x): + return x**2 + +pool = Pool(8) + +# calculate x**2 in parallel for x in 0..9 +result = pool.map(f, range(10)) +print(result) + +# non-blocking alternative +result = pool.map_async(f, range(10)) +while not result.ready(): + time.sleep(1) +print(result.get()) +``` + + +# Summary + +- Parallelism achieved by launching new OS processes +- Only limited communication possible + - work sharing: queues / pool of workers +- Non-blocking execution available + - do something else while waiting for results +- Further information: + https://docs.python.org/2/library/multiprocessing.html diff --git a/04-mpi4py.md b/04-mpi4py.md new file mode 100644 index 0000000..9e7d533 --- /dev/null +++ b/04-mpi4py.md @@ -0,0 +1,801 @@ +--- +title: MPI for Python +lang: en +--- + +# Message Passing Interface {.section} + +# Message passing interface + +- MPI is an application programming interface (API) for communication + between separate processes +- MPI programs are portable and scalable + - the same program can run on different types of computers, from PC's + to supercomputers + - the most widely used approach for distributed parallel computing +- MPI is flexible and comprehensive + - large (over 300 procedures) + - concise (often only 6 procedures are needed) +- MPI standard defines C and Fortran interfaces + - MPI for Python (mpi4py) provides an unofficial Python interface + + +# Processes and threads + +![](img/processes-threads-highlight-proc.svg){.center width=80%} + + +
+ +## Process + +- Independent execution units +- Have their own state information and *own memory* address space + +
+
+ +## Thread + +- A single process may contain multiple threads +- Have their own state information, but *share* the *same memory* + address space + +
+ + +# Execution model + +- MPI program is launched as a set of *independent*, *identical processes* + - execute the same program code and instructions + - can reside in different nodes (or even in different computers) +- The way to launch a MPI program depends on the system + - mpiexec, mpirun, srun, aprun, ... + - mpiexec/mpirun in training class + - srun on puhti.csc.fi + + +# MPI rank + +- Rank: ID number given to a process + - it is possible to query for rank + - processes can perform different tasks based on their rank + +```python +if (rank == 0): + # do something +elif (rank == 1): + # do something else +else: + # all other processes do something different +``` + + +# Data model + +- Each MPI process has its own *separate* memory space, i.e. all + variables and data structures are *local* to the process +- Processes can exchange data by sending and receiving messages + +![](img/data-model.svg){.center width=90%} + + +# MPI communicator + +- Communicator: a group containing all the processes that will participate + in communication + - in mpi4py most MPI calls are implemented as methods of a + communicator object + - `MPI_COMM_WORLD` contains all processes (`MPI.COMM_WORLD` in + mpi4py) + - user can define custom communicators + + +# Routines in MPI for Python + +- Communication between processes + - sending and receiving messages between two processes + - sending and receiving messages between several processes +- Synchronization between processes +- Communicator creation and manipulation +- Advanced features (e.g. user defined datatypes, one-sided communication + and parallel I/O) + + +# Getting started + +- Basic methods of communicator object + - `Get_size()` Number of processes in communicator + - `Get_rank()` rank of this process + +```python +from mpi4py import MPI + +comm = MPI.COMM_WORLD # communicator object containing all processes + +size = comm.Get_size() +rank = comm.Get_rank() + +print("I am rank %d in group of %d processes" % (rank, size)) +``` + + +# Running an example program + +```bash +$ mpiexec -n 4 python3 hello.py + +I am rank 2 in group of 4 processes +I am rank 0 in group of 4 processes +I am rank 3 in group of 4 processes +I am rank 1 in group of 4 processes +``` + +```python +from mpi4py import MPI + +comm = MPI.COMM_WORLD # communicator object containing all processes + +size = comm.Get_size() +rank = comm.Get_rank() + +print("I am rank %d in group of %d processes" % (rank, size)) +``` + + +# Point-to-Point Communication {.section} + +# MPI communication + +
+ +- Data is local to the MPI processes + - They need to *communicate* to coordinate work +- Point-to-point communication + - Messages are sent between two processes +- Collective communication + - Involving a number of processes at the same time + +
+ +
+ +![](img/communication-schematic.svg){.center width=50%} + +
+ + +# MPI point-to-point operations + +- One process *sends* a message to another process that *receives* it +- Sends and receives in a program should match - one receive per send +- Each message contains + - The actual *data* that is to be sent + - The *datatype* of each element of data + - The *number of elements* the data consists of + - An identification number for the message (*tag*) + - The ranks of the *source* and *destination* process +- With **mpi4py** it is often enough to specify only *data* and + *source* and *destination* + +# Sending and receiving data + +- Sending and receiving a dictionary + +```python +from mpi4py import MPI + +comm = MPI.COMM_WORLD # communicator object containing all processes +rank = comm.Get_rank() + +if rank == 0: + data = {'a': 7, 'b': 3.14} + comm.send(data, dest=1) +elif rank == 1: + data = comm.recv(source=0) +``` + + +# Sending and receiving data + +- Arbitrary Python objects can be communicated with the send and + receive methods of a communicator + +
+ +`.send(data, dest)` + : `data`{.input} + : Python object to send + + `dest`{.input} + : destination rank + +
+
+ +`.recv(source)` + : `source`{.input} + : source rank + : note: data is provided as return value + +
+ +- Destination and source ranks have to match! + + +# Blocking routines & deadlocks + +- `send()` and `recv()` are *blocking* routines + - the functions exit only once it is safe to use the data (memory) + involved in the communication +- Completion depends on other processes => risk for *deadlocks* + - for example, if all processes call `recv()` there is no-one left to + call a corresponding `send()` and the program is *stuck forever* + + +# Typical point-to-point communication patterns + +![](img/comm_patt.svg){.center width=100%} + +
+ +- Incorrect ordering of sends and receives may result in a deadlock + + +# Case study: parallel sum + +
+![](img/parallel-sum-0.svg){.center width=70%} +
+ +
+## Initial state
+
+An array A containing floating point numbers read from a file by the first
+MPI task (rank 0).
+
+## Goal
+
+Calculate the total sum of all elements in array A in parallel.
+
+ + +# Case study: parallel sum + +
+![](img/parallel-sum-0.svg){.center width=70%} +
+ +
+## Parallel algorithm + +
+1. Scatter the data
+   1.1. receive operation for scatter
+   1.2. send operation for scatter
+2. Compute partial sums in parallel
+3. Gather the partial sums
+   3.1. receive operation for gather
+   3.2. send operation for gather
+4. Compute the total sum (a code sketch of these steps follows below)
+
+ +
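+
+Before going through the steps figure by figure, here is a minimal mpi4py sketch
+of the whole algorithm (an illustration only, not the model answer; the input
+file name `array.dat` is assumed):
+
+```python
+from mpi4py import MPI
+import numpy
+
+comm = MPI.COMM_WORLD
+rank = comm.Get_rank()
+size = comm.Get_size()
+
+if rank == 0:
+    data = numpy.loadtxt('array.dat')      # assumed input file
+    chunks = numpy.array_split(data, size)
+    local = chunks[0]
+    for i in range(1, size):               # 1.2 send operation for scatter
+        comm.send(chunks[i], dest=i)
+else:
+    local = comm.recv(source=0)            # 1.1 receive operation for scatter
+
+partial = local.sum()                      # 2. compute partial sums in parallel
+
+if rank == 0:
+    total = partial
+    for i in range(1, size):               # 3.1 receive operation for gather
+        total += comm.recv(source=i)
+    print("Total sum:", total)             # 4. compute the total sum
+else:
+    comm.send(partial, dest=0)             # 3.2 send operation for gather
+```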
+ + +# Step 1.1: Receive operation for scatter + +![](img/parallel-sum-1.1.png){.center width=55%} + + +# Step 1.2: Send operation for scatter + +![](img/parallel-sum-1.2.png){.center width=55%} + + +# Step 2: Compute partial sums in parallel + +![](img/parallel-sum-2.png){.center width=55%} + + +# Step 3.1: Receive operation for gather + +![](img/parallel-sum-3.1.png){.center width=55%} + + +# Step 3.2: Send operation for gather + +![](img/parallel-sum-3.2.png){.center width=55%} + + +# Step 4: Compute the total sum + +![](img/parallel-sum-4.png){.center width=55%} + + +# Communicating NumPy arrays + +- Arbitrary Python objects are converted to byte streams (pickled) when + sending and back to Python objects (unpickled) when receiving + - these conversions may be a serious overhead to communication +- Contiguous memory buffers (such as NumPy arrays) can be communicated + with very little overhead using upper case methods: + - `Send(data, dest)` + - `Recv(data, source)` + - note the difference in receiving: the data array has to exist at the + time of call + + +# Send/receive a NumPy array + +- Note the difference between upper/lower case! + - send/recv: general Python objects, slow + - Send/Recv: continuous arrays, fast + +```python +from mpi4py import MPI +import numpy + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() + +data = numpy.empty(100, dtype=float) +if rank == 0: + data[:] = numpy.arange(100, dtype=float) + comm.Send(data, dest=1) +elif rank == 1: + comm.Recv(data, source=0) +``` + + +# Combined send and receive + +- Send one message and receive another with a single command + - reduces risk for deadlocks +- Destination and source ranks can be same or different + - `MPI.PROC_NULL` can be used for *no destination/source* + +```python +data = numpy.arange(10, dtype=float) * (rank + 1) +buffer = numpy.empty(data.shape, dtype=data.dtype) + +if rank == 0: + dest, source = 1, 1 +elif rank == 1: + dest, source = 0, 0 + +comm.Sendrecv(data, dest=dest, recvbuf=buffer, source=source) +``` + + +# MPI datatypes + +- MPI has a number of predefined datatypes to represent data + - e.g. `MPI.INT` for integer and `MPI.DOUBLE` for float +- No need to specify the datatype for Python objects or Numpy arrays + - objects are serialised as byte streams + - automatic detection for NumPy arrays +- If needed, one can also define custom datatypes + - for example to use non-contiguous data buffers + +# Summary + +- Point-to-point communication = messages are sent between two MPI + processes +- Point-to-point operations enable any parallel communication pattern (in + principle) +- Arbitrary Python objects (that can be pickled!) 
+ - `send` / `recv` + - `sendrecv` +- Memory buffers such as Numpy arrays + - `Send` / `Recv` + - `Sendrecv` + + +# Non-blocking Communication {.section} + +# Non-blocking communication + +- Non-blocking sends and receives + - `isend` & `irecv` + - returns immediately and sends/receives in background + - return value is a Request object +- Enables some computing concurrently with communication +- Avoids many common dead-lock situations + + +# Non-blocking communication + +- Have to finalize send/receive operations + - `wait()` + - Waits for the communication started with `isend` or `irecv` to + finish (blocking) + - `test()` + - Tests if the communication has finished (non-blocking) +- You can mix non-blocking and blocking p2p routines + - e.g., receive `isend` with `recv` + + +# Example: non-blocking send/receive + +```python +rank = comm.Get_rank() +size = comm.Get_size() + +if rank == 0: + data = arange(size, dtype=float) * (rank + 1) + req = comm.Isend(data, dest=1) # start a send + calculate_something(rank) # .. do something else .. + req.wait() # wait for send to finish + # safe to read/write data again + +elif rank == 1: + data = empty(size, float) + req = comm.Irecv(data, source=0) # post a receive + calculate_something(rank) # .. do something else .. + req.wait() # wait for receive to finish + # data is now ready for use +``` + + +# Multiple non-blocking operations + +- Methods `waitall()` and `waitany()` may come handy when dealing with + multiple non-blocking operations (available in the `MPI.Request` class) + - `Request.waitall(requests)` + - wait for all initiated requests to complete + - `Request.waitany(requests)` + - wait for any initiated request to complete +- For example, assuming `requests` is a list of request objects, one can wait + for all of them to be finished with: + +~~~python +MPI.Request.waitall(requests) +~~~ + + +# Example: non-blocking message chain + + + +~~~python +from mpi4py import MPI +import numpy + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() +size = comm.Get_size() + +data = numpy.arange(10, dtype=float) * (rank + 1) # send buffer +buffer = numpy.zeros(10, dtype=float) # receive buffer + +tgt = rank + 1 +src = rank - 1 +if rank == 0: + src = MPI.PROC_NULL +if rank == size - 1: + tgt = MPI.PROC_NULL + +req = [] +req.append(comm.Isend(data, dest=tgt)) +req.append(comm.Irecv(buffer, source=src)) + +MPI.Request.waitall(req) +~~~ + + + + +# Overlapping computation and communication + +
+~~~python +request_in = comm.Irecv(ghost_data) +request_out = comm.Isend(border_data) + +compute(ghost_independent_data) +request_in.wait() + +compute(border_data) +request_out.wait() +~~~ +
+ +
+![](img/non-blocking-pattern.png) +
+ + +# Summary + +- Non-blocking communication is usually the smart way to do point-to-point + communication in MPI +- Non-blocking communication realization + - `isend` / `Isend` + - `irecv` / `Irecv` + - `request.wait()` + + +# Communicators {.section} + +# Communicators + +- The communicator determines the "communication universe" + - The source and destination of a message is identified by process rank + *within* the communicator +- So far: `MPI.COMM_WORLD` +- Processes can be divided into subcommunicators + - Task level parallelism with process groups performing separate tasks + - Collective communication within a group of processes + - Parallel I/O + + +# Communicators + +
+- Communicators are dynamic +- A task can belong simultaneously to several communicators + - Unique rank in each communicator +
+
+![](img/communicator.svg){.center width=80%} +
+ + + +# User-defined communicators + +- By default a single, universal communicator exists to which all + processes belong (`MPI.COMM_WORLD`) +- One can create new communicators, e.g. by splitting this into + sub-groups + +```python +comm = MPI.COMM_WORLD +rank = comm.Get_rank() + +color = rank % 4 + +local_comm = comm.Split(color) +local_rank = local_comm.Get_rank() + +print("Global rank: %d Local rank: %d" % (rank, local_rank)) +``` + + +# Collective Communication {.section} + +# Collective communication + +- Collective communication transmits data among all processes in a process + group (communicator) + - these routines must be called by all the processes in the group + - amount of sent and received data must match +- Collective communication includes + - data movement + - collective computation + - synchronization +- Example + - `comm.barrier()` makes every task hold until all tasks in the + communicator `comm` have called it + + +# Collective communication + +- Collective communication typically outperforms point-to-point + communication +- Code becomes more compact (and efficient!) and easier to maintain: + - For example, communicating a Numpy array of 1M elements from task 0 to all + other tasks: + +
+ +```python +if rank == 0: + for i in range(1, size): + comm.Send(data, i) +else: + comm.Recv(data, 0) +``` + +
+
+ +```python +comm.Bcast(data, 0) +``` + +
+ + +# Broadcast + +- Send the same data from one process to all the other + +![](img/mpi-bcast.svg){.center width=80%} + + +# Broadcast + +- Broadcast sends same data to all processes + +```python +from mpi4py import MPI +import numpy + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() + +if rank == 0: + py_data = {'key1' : 0.0, 'key2' : 11} # Python object + data = np.arange(8) / 10. # NumPy array +else: + py_data = None + data = np.zeros(8) + +new_data = comm.bcast(py_data, root=0) + +comm.Bcast(data, root=0) +``` + + +# Scatter + +- Send equal amount of data from one process to others +- Segments A, B, ... may contain multiple elements + +![](img/mpi-scatter.svg){.center width=80%} + + +# Scatter + +- Scatter distributes data to processes + +```python +from mpi4py import MPI +from numpy import arange, empty + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() +size = comm.Get_size() +if rank == 0: + py_data = range(size) + data = arange(size**2, dtype=float) +else: + py_data = None + data = None + +new_data = comm.scatter(py_data, root=0) # returns the value + +buffer = empty(size, float) # prepare a receive buffer +comm.Scatter(data, buffer, root=0) # in-place modification +``` + + +# Gather + +- Collect data from all the process to one process +- Segments A, B, ... may contain multiple elements + +![](img/mpi-gather.svg){.center width=80%} + + +# Gather + +- Gather pulls data from all processes + +```python +from mpi4py import MPI +from numpy import arange, zeros + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() +size = comm.Get_size() + +data = arange(10, dtype=float) * (rank + 1) +buffer = zeros(size * 10, float) + +n = comm.gather(rank, root=0) # returns the value +comm.Gather(data, buffer, root=0) # in-place modification +``` + + +# Reduce + +- Applies an operation over set of processes and places result in + single process + +![](img/mpi-reduce.svg){.center width=80%} + +# Reduce + +- Reduce gathers data and applies an operation on it + +```python +from mpi4py import MPI +from numpy import arange, empty + +comm = MPI.COMM_WORLD +rank = comm.Get_rank() +size = comm.Get_size() + +data = arange(10 * size, dtype=float) * (rank + 1) +buffer = zeros(size * 10, float) + +n = comm.reduce(rank, op=MPI.SUM, root=0) # returns the value +comm.Reduce(data, buffer, op=MPI.SUM, root=0) # in-place modification +``` + + +# Other common collective operations + +Scatterv + : each process receives different amount of data + +Gatherv + : each process sends different amount of data + +Allreduce + : all processes receive the results of reduction + +Alltoall + : each process sends and receives to/from each other + +Alltoallv + : each process sends and receives different amount of data + + + +# Non-blocking collectives + +- New in MPI 3: no support in mpi4py +- Non-blocking collectives enable the overlapping of communication and + computation together with the benefits of collective communication +- Restrictions + - have to be called in same order by all ranks in a communicator + - mixing of blocking and non-blocking collectives is not allowed + + + + +# Common mistakes with collectives + +1. Using a collective operation within one branch of an if-else test based on + the rank of the process + - for example: `if rank == 0: comm.bcast(...)` + - all processes in a communicator must call a collective routine! +2. Assuming that all processes making a collective call would complete at + the same time. +3. 
Using the input buffer also as an output buffer: + - for example: `comm.Scatter(a, a, MPI.SUM)` + - always use different memory locations (arrays) for input and output! + + +# Summary + +- Collective communications involve all the processes within a + communicator + - all processes must call them +- Collective operations make code more transparent and compact +- Collective routines allow optimizations by MPI library +- MPI-3 contains also non-blocking collectives, but these are currently + not supported by MPI for Python + + +# On-line resources + +- Documentation for mpi4py is quite limited + - short on-line manual available at + [https://mpi4py.readthedocs.io/](https://mpi4py.readthedocs.io/) +- Some good references: + - "A Python Introduction to Parallel Programming with MPI" *by Jeremy + Bejarano* [http://materials.jeremybejarano.com/MPIwithPython/](http://materials.jeremybejarano.com/MPIwithPython/) + - "mpi4py examples" *by Jörg Bornschein* [https://github.com/jbornschein/mpi4py-examples](https://github.com/jbornschein/mpi4py-examples) + + +# Summary + +- mpi4py provides Python interface to MPI +- MPI calls via communicator object +- Possible to communicate arbitrary Python objects +- NumPy arrays can be communicated with nearly same speed as in C/Fortran diff --git a/README.md b/README.md index 22d94d9..87d93d2 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,13 @@ # Python in High Performance Computing -Exercise material and model answers for the CSC course "Python in High Performance Computing". The course is part of PRACE Training activity at CSC. +This binder image includes several exercices from the CSC course "Python in High Performance Computing". The course is part of PRACE Training activity at CSC (https://www.futurelearn.com/courses/python-in-hpc). -This master branch contains always the material for latest course, past -courses are stored in tags. +Also, it includes material from a Dask tutorial given at SciPy 2020 conference. -Online version of the course is run regularly in [FutureLearn](https://www.futurelearn.com/courses/python-in-hpc). - -Articles and videos of the course are also available in a simple form in this [site](docs/mooc/index.md). 
+## NOTE : Exercices with "-->" are suggestions to start ## Exercises -[General instructions](exercise-instructions.md) - ### Basic array manipulation @@ -41,18 +36,7 @@ Articles and videos of the course are also available in a simple form in this [s ### Performance analysis - - [Using cProfile](performance/cprofile) - -### Optimising with Cython - - - [Creating simple extension](cython/simple-extension) - - [Using static typing](cython/static-typing) - - [Using C-functions](cython/c-functions) - - [Optimising heat equation](cython/heat-equation) - -### Interfacing with libraries - - - [C libraries](interface/c) + - **[--> Using cProfile](performance/cprofile)** ### Multiprocessing @@ -61,11 +45,23 @@ Articles and videos of the course are also available in a simple form in this [s ### Parallel programming with mpi4py - - [Hello World](mpi/hello-world) + - **[--> Hello World](mpi/hello-world)** - [Simple message exchange](mpi/message-exchange) - [Message chain](mpi/message-chain) - - [Non-blocking communication](mpi/non-blocking) - - [Collective operations](mpi/collectives) + - **[--> Non-blocking communication](mpi/non-blocking)** + - **[--> Collective operations](mpi/collectives)** + +### Dask + + - **[--> Delayed](dask/01_dask.delayed.ipynb)** + - [Understanding 'Lazy'](dask/01x_lazy.ipynb) + - [Bags](dask/02_bag.ipynb) + - [Arrays](dask/03_array.ipynb) + - **[--> Dataframe](dask/04_dataframe.ipynb)** + - **[--> Distributed mode](dask/05_distributed.ipynb)** + - [Distributed advanced](dask/06_distributed_advanced.ipynb) + - [Storage optimization](dask/07_dataframe_storage.ipynb) + - [Machine Learning](dask/08_machine_learning.ipynb) ### Bonus exercises diff --git a/binder/apt.txt b/binder/apt.txt new file mode 100644 index 0000000..4d95609 --- /dev/null +++ b/binder/apt.txt @@ -0,0 +1 @@ +graphviz diff --git a/binder/environment.yml b/binder/environment.yml new file mode 100644 index 0000000..b85ec04 --- /dev/null +++ b/binder/environment.yml @@ -0,0 +1,41 @@ +name: env-hpc + +channels: + - conda-forge + - williamfgc + +dependencies: + - python=3.8 + - mpi4py + - openmpi + - cython + - cffi + - numexpr + - nodejs + - jupyterlab>=2.0.0,<3 + - numpy>=1.18.1 + - h5py + - scipy>=1.3.0 + - toolz + - bokeh>=2.0.0 + - dask=2021.08.0 + - dask-labextension>=2.0.0 + - distributed=2021.08.0 + - notebook + - matplotlib + - Pillow + - pandas>=1.0.1 + - pandas-datareader + - pytables + - scikit-learn>=0.22.1 + - scikit-image>=0.15.0 + - snakeviz + - ujson + - pip + - s3fs + - fastparquet + - dask-ml + - ipywidgets>=7.5 + - cachey + - python-graphviz + - zarr diff --git a/binder/jupyterlab-workspace.json b/binder/jupyterlab-workspace.json new file mode 100644 index 0000000..0a5ad1b --- /dev/null +++ b/binder/jupyterlab-workspace.json @@ -0,0 +1,94 @@ +{ + "data": { + "file-browser-filebrowser:cwd": { + "path": "" + }, + "dask-dashboard-launcher:individual-progress": { + "data": { + "route": "individual-progress", + "label": "Progress" + } + }, + "dask-dashboard-launcher:individual-task-stream": { + "data": { + "route": "individual-task-stream", + "label": "Task Stream" + } + }, + "layout-restorer:data": { + "main": { + "dock": { + "type": "split-area", + "orientation": "horizontal", + "sizes": [ + 0.5, + 0.5 + ], + "children": [ + { + "type": "tab-area", + "currentIndex": 0, + "widgets": [ + "notebook:00_overview.ipynb" + ] + }, + { + "type": "split-area", + "orientation": "vertical", + "sizes": [ + 0.5, + 0.5 + ], + "children": [ + { + "type": "tab-area", + "currentIndex": 0, + "widgets": [ + 
"dask-dashboard-launcher:individual-task-stream" + ] + }, + { + "type": "tab-area", + "currentIndex": 0, + "widgets": [ + "dask-dashboard-launcher:individual-progress" + ] + } + ] + } + ] + }, + "mode": "multiple-document", + "current": "notebook:00_overview.ipynb" + }, + "left": { + "collapsed": false, + "current": "filebrowser", + "widgets": [ + "filebrowser", + "running-sessions", + "dask-dashboard-launcher", + "command-palette", + "tab-manager" + ] + }, + "right": { + "collapsed": true, + "widgets": [] + } + }, + "dask-dashboard-launcher": { + "url": "DASK_DASHBOARD_URL", + "cluster": "" + }, + "notebook:00_overview.ipynb": { + "data": { + "path": "00_overview.ipynb", + "factory": "Notebook" + } + } + }, + "metadata": { + "id": "/lab" + } +} diff --git a/binder/postBuild b/binder/postBuild new file mode 100755 index 0000000..4925180 --- /dev/null +++ b/binder/postBuild @@ -0,0 +1,6 @@ +#!/bin/bash + +# Install the JupyterLab dask-labextension +jupyter labextension install dask-labextension +jupyter labextension install @jupyter-widgets/jupyterlab-manager +jupyter labextension install @bokeh/jupyter_bokeh diff --git a/binder/start b/binder/start new file mode 100755 index 0000000..792ee7a --- /dev/null +++ b/binder/start @@ -0,0 +1,10 @@ +#!/bin/bash + +# Replace DASK_DASHBOARD_URL with the proxy location +sed -i -e "s|DASK_DASHBOARD_URL|/user/${JUPYTERHUB_USER}/proxy/8787|g" binder/jupyterlab-workspace.json + +# Import the workspace +jupyter lab workspaces import binder/jupyterlab-workspace.json +export DASK_TUTORIAL_SMALL=1 + +exec "$@" diff --git a/cython/c-functions/README.md b/cython/c-functions/README.md deleted file mode 100644 index c79f63c..0000000 --- a/cython/c-functions/README.md +++ /dev/null @@ -1,36 +0,0 @@ -## Using C-functions - -Fibonacci numbers are a sequence of integers defined by the recurrence -relation - - Fn = Fn-1 + Fn-2 - -with the initial values F0=0, F1=1. - -The module [fib.py](fib.py) contains a function `fibonacci(n)` that -calculates recursively Fn. The function can be used e.g. as - -```python -from fib import fibonacci - -fibonacci(30) -``` - -Make a Cython version of the module, and investigate how adding type -information and making `fibonacci` a C-function affects performance -(hint: function needs to be called both from Python and C). Use -`timeit` for performance measurements, either from command line - -```bash -$ python3 -m timeit -s "from fib import fibonacci" "fibonacci(30)" -``` - -or within IPython - -```python -In []: %timeit fibonacci(30) -``` - -**Note:** this recursive algorithm is very inefficient way of calculating -Fibonacci numbers and pure Python implemention of better algorithm -outperforms Cython implementation drastically. 
diff --git a/cython/c-functions/fib.py b/cython/c-functions/fib.py deleted file mode 100644 index 0b2bfa4..0000000 --- a/cython/c-functions/fib.py +++ /dev/null @@ -1,4 +0,0 @@ -def fibonacci(n): - if n < 2: - return n - return fibonacci(n-2) + fibonacci(n-1) diff --git a/cython/c-functions/solution/fib.pyx b/cython/c-functions/solution/fib.pyx deleted file mode 100644 index f8946d6..0000000 --- a/cython/c-functions/solution/fib.pyx +++ /dev/null @@ -1,9 +0,0 @@ -cpdef int fibonacci(int n): - if n < 2: - return n - return fibonacci(n-2) + fibonacci(n-1) - -def fibonacci_py(n): - if n < 2: - return n - return fibonacci_py(n-2) + fibonacci_py(n-1) diff --git a/cython/c-functions/solution/fib_py.py b/cython/c-functions/solution/fib_py.py deleted file mode 100644 index 5b8692a..0000000 --- a/cython/c-functions/solution/fib_py.py +++ /dev/null @@ -1,12 +0,0 @@ -from functools import lru_cache - -def fibonacci(n): - if n < 2: - return n - return fibonacci(n-2) + fibonacci(n-1) - -@lru_cache(maxsize=None) -def fibonacci_cached(n): - if n < 2: - return n - return fibonacci_cached(n-2) + fibonacci_cached(n-1) diff --git a/cython/c-functions/solution/setup.py b/cython/c-functions/solution/setup.py deleted file mode 100644 index cb4073b..0000000 --- a/cython/c-functions/solution/setup.py +++ /dev/null @@ -1,11 +0,0 @@ -from distutils.core import setup, Extension -from Cython.Build import cythonize - -ext = Extension("fib", - sources=["fib.pyx"], - ) - -setup( - ext_modules=cythonize(ext) -) - diff --git a/cython/c-functions/solution/test_fib.py b/cython/c-functions/solution/test_fib.py deleted file mode 100644 index e333721..0000000 --- a/cython/c-functions/solution/test_fib.py +++ /dev/null @@ -1,26 +0,0 @@ -from fib import fibonacci -from fib_py import fibonacci as fibonacci_py, fibonacci_cached -from timeit import repeat - -ncython = 100 -npython = 10 -ncached = 10000000 - -# Pure Python -time_python = repeat("fibonacci_py(30)", number=npython, globals=locals()) -time_python = min(time_python) / npython - -# Cython -time_cython = repeat("fibonacci(30)", number=ncython, globals=locals()) -time_cython = min(time_cython) / ncython - -# Python, cached -time_cached = repeat("fibonacci_cached(30)", number=ncached, globals=locals()) -time_cached = min(time_cached) / ncached - -print("Pure Python: {:5.4f} s".format(time_python)) -print("Cython: {:5.4f} ms".format(time_cython*1.e3)) -print("Speedup: {:5.1f}".format(time_python / time_cython)) -print("Pure Python cached: {:5.4f} us".format(time_cached*1.e6)) -print("Speedup over Cython: {:5.1e}".format(time_cython / time_cached)) - diff --git a/cython/heat-equation/README.md b/cython/heat-equation/README.md deleted file mode 100644 index 6f3a72b..0000000 --- a/cython/heat-equation/README.md +++ /dev/null @@ -1,29 +0,0 @@ -## Optimising heat equation with Cython - -### Creating a Cython extension - -Write a `setup.py` for creating a Cython version of [heat.py](heat.py) -module, and use it from the main program [heat_main.py](heat_main.py). -How much does simple Cythonization (i.e. diminishing the interpreting -overhead) improve the performance? - -### Optimising - -Based on the profile in the performance measurement -[exercise](../../performance/cprofile) optimise the most time -consuming part of the algorithm. If you did not finish the profiling -exercise, you can look at example profile [here](profile.md). - -Utilize all the tricks you have learned so far (type declarations, -fast array indexing, compiler directives, C functions, ...). 
- -Investigate how the different optimizations affect the performance. You -can use applications own timers and/or **timeit**. Annotated HTML-report with -`cython -a …` can be useful when tuning performance. - -When finished with the optimisation, compare performance to -Python/NumPy model solution (in -[numpy/heat-equation](../../numpy/heat-equation)), which uses array -operations. You can play around also with larger input data as provided in -[bottle_medium.dat](bottle_medium.dat) and [bottle_large.dat](bottle_large.dat). - diff --git a/cython/heat-equation/bottle.dat b/cython/heat-equation/bottle.dat deleted file mode 120000 index fcc9630..0000000 --- a/cython/heat-equation/bottle.dat +++ /dev/null @@ -1 +0,0 @@ -../../numpy/heat-equation/bottle.dat \ No newline at end of file diff --git a/cython/heat-equation/bottle_large.dat b/cython/heat-equation/bottle_large.dat deleted file mode 120000 index 39f979b..0000000 --- a/cython/heat-equation/bottle_large.dat +++ /dev/null @@ -1 +0,0 @@ -../../numpy/heat-equation/bottle_large.dat \ No newline at end of file diff --git a/cython/heat-equation/bottle_medium.dat b/cython/heat-equation/bottle_medium.dat deleted file mode 120000 index d730869..0000000 --- a/cython/heat-equation/bottle_medium.dat +++ /dev/null @@ -1 +0,0 @@ -../../numpy/heat-equation/bottle_medium.dat \ No newline at end of file diff --git a/cython/heat-equation/heat.py b/cython/heat-equation/heat.py deleted file mode 100644 index cd6a03d..0000000 --- a/cython/heat-equation/heat.py +++ /dev/null @@ -1,54 +0,0 @@ -import numpy as np -import matplotlib -matplotlib.use('Agg') -import matplotlib.pyplot as plt - -# Set the colormap -plt.rcParams['image.cmap'] = 'BrBG' - -def evolve(u, u_previous, a, dt, dx2, dy2): - """Explicit time evolution. - u: new temperature field - u_previous: previous field - a: diffusion constant - dt: time step. 
""" - - n, m = u.shape - - for i in range(1, n-1): - for j in range(1, m-1): - u[i, j] = u_previous[i, j] + a * dt * ( \ - (u_previous[i+1, j] - 2*u_previous[i, j] + \ - u_previous[i-1, j]) / dx2 + \ - (u_previous[i, j+1] - 2*u_previous[i, j] + \ - u_previous[i, j-1]) / dy2 ) - u_previous[:] = u[:] - -def iterate(field, field0, a, dx, dy, timesteps, image_interval): - """Run fixed number of time steps of heat equation""" - - dx2 = dx**2 - dy2 = dy**2 - - # For stability, this is the largest interval possible - # for the size of the time-step: - dt = dx2*dy2 / ( 2*a*(dx2+dy2) ) - - for m in range(1, timesteps+1): - evolve(field, field0, a, dt, dx2, dy2) - if m % image_interval == 0: - write_field(field, m) - -def init_fields(filename): - # Read the initial temperature field from file - field = np.loadtxt(filename) - field0 = field.copy() # Array for field of previous time step - return field, field0 - -def write_field(field, step): - plt.gca().clear() - plt.imshow(field) - plt.axis('off') - plt.savefig('heat_{0:03d}.png'.format(step)) - - diff --git a/cython/heat-equation/heat_main.py b/cython/heat-equation/heat_main.py deleted file mode 100644 index b129e5c..0000000 --- a/cython/heat-equation/heat_main.py +++ /dev/null @@ -1,55 +0,0 @@ -from __future__ import print_function -import time -import argparse - -from heat import init_fields, write_field, iterate - - -def main(input_file='bottle.dat', a=0.5, dx=0.1, dy=0.1, - timesteps=200, image_interval=4000): - - # Initialise the temperature field - field, field0 = init_fields(input_file) - - print("Heat equation solver") - print("Diffusion constant: {}".format(a)) - print("Input file: {}".format(input_file)) - print("Parameters") - print("----------") - print(" nx={} ny={} dx={} dy={}".format(field.shape[0], field.shape[1], - dx, dy)) - print(" time steps={} image interval={}".format(timesteps, - image_interval)) - - # Plot/save initial field - write_field(field, 0) - # Iterate - t0 = time.time() - iterate(field, field0, a, dx, dy, timesteps, image_interval) - t1 = time.time() - # Plot/save final field - write_field(field, timesteps) - - print("Simulation finished in {0} s".format(t1-t0)) - -if __name__ == '__main__': - - # Process command line arguments - parser = argparse.ArgumentParser(description='Heat equation') - parser.add_argument('-dx', type=float, default=0.01, - help='grid spacing in x-direction') - parser.add_argument('-dy', type=float, default=0.01, - help='grid spacing in y-direction') - parser.add_argument('-a', type=float, default=0.5, - help='diffusion constant') - parser.add_argument('-n', type=int, default=200, - help='number of time steps') - parser.add_argument('-i', type=int, default=4000, - help='image interval') - parser.add_argument('-f', type=str, default='bottle.dat', - help='input file') - - args = parser.parse_args() - - main(args.f, args.a, args.dx, args.dy, args.n, args.i) - diff --git a/cython/heat-equation/profile.md b/cython/heat-equation/profile.md deleted file mode 100644 index b846704..0000000 --- a/cython/heat-equation/profile.md +++ /dev/null @@ -1,21 +0,0 @@ -## Example profile for heat equation solver - -``` - 591444 function calls (582598 primitive calls) in 15.498 seconds - - Ordered by: internal time - List reduced from 3224 to 10 due to restriction <10> - - ncalls tottime percall cumtime percall filename:lineno(function) - 200 14.837 0.074 14.837 0.074 heat.py:9(evolve) - 2 0.070 0.035 0.070 0.035 {built-in method matplotlib._png.write_png} - 241 0.052 0.000 0.052 0.000 {built-in method 
marshal.loads} - 2467 0.023 0.000 0.040 0.000 inspect.py:614(cleandoc) - 1052/959 0.023 0.000 0.066 0.000 {built-in method builtins.__build_class__} - 33/31 0.018 0.001 0.023 0.001 {built-in method _imp.create_dynamic} - 3228 0.014 0.000 0.014 0.000 {built-in method numpy.array} - 40000 0.014 0.000 0.017 0.000 npyio.py:771(floatconv) - 274/1 0.013 0.000 15.498 15.498 {built-in method builtins.exec} - 556 0.011 0.000 0.011 0.000 :78(acquire) - -``` diff --git a/cython/heat-equation/solution/heat.pyx b/cython/heat-equation/solution/heat.pyx deleted file mode 100644 index 6d4a0cb..0000000 --- a/cython/heat-equation/solution/heat.pyx +++ /dev/null @@ -1,70 +0,0 @@ -import numpy as np -cimport numpy as cnp -import cython - -import matplotlib -matplotlib.use('Agg') -import matplotlib.pyplot as plt - -# Set the colormap -plt.rcParams['image.cmap'] = 'BrBG' - -@cython.boundscheck(False) -@cython.wraparound(False) -@cython.cdivision(True) -@cython.profile(True) -cdef evolve(cnp.ndarray[cnp.double_t, ndim=2] u, - cnp.ndarray[cnp.double_t, ndim=2] u_previous, - double a, double dt, double dx2, double dy2): - """Explicit time evolution. - u: new temperature field - u_previous: previous field - a: diffusion constant - dt: time step. """ - - cdef int n = u.shape[0] - cdef int m = u.shape[1] - - cdef int i,j - - # Multiplication is more efficient than division - cdef double dx2inv = 1. / dx2 - cdef double dy2inv = 1. / dy2 - - for i in range(1, n-1): - for j in range(1, m-1): - u[i, j] = u_previous[i, j] + a * dt * ( \ - (u_previous[i+1, j] - 2*u_previous[i, j] + \ - u_previous[i-1, j]) * dx2inv + \ - (u_previous[i, j+1] - 2*u_previous[i, j] + \ - u_previous[i, j-1]) * dy2inv ) - u_previous[:] = u[:] - -def iterate(field, field0, a, dx, dy, timesteps, image_interval): - """Run fixed number of time steps of heat equation""" - - dx2 = dx**2 - dy2 = dy**2 - - # For stability, this is the largest interval possible - # for the size of the time-step: - dt = dx2*dy2 / ( 2*a*(dx2+dy2) ) - - for m in range(1, timesteps+1): - evolve(field, field0, a, dt, dx2, dy2) - if m % image_interval == 0: - write_field(field, m) - -def init_fields(filename): - # Read the initial temperature field from file - field = np.loadtxt(filename) - field0 = field.copy() # Array for field of previous time step - return field, field0 - -def write_field(field, step): - plt.gca().clear() - plt.imshow(field) - plt.axis('off') - plt.savefig('heat_{0:03d}.png'.format(step)) - - diff --git a/cython/heat-equation/solution/heat_main.py b/cython/heat-equation/solution/heat_main.py deleted file mode 100644 index b129e5c..0000000 --- a/cython/heat-equation/solution/heat_main.py +++ /dev/null @@ -1,55 +0,0 @@ -from __future__ import print_function -import time -import argparse - -from heat import init_fields, write_field, iterate - - -def main(input_file='bottle.dat', a=0.5, dx=0.1, dy=0.1, - timesteps=200, image_interval=4000): - - # Initialise the temperature field - field, field0 = init_fields(input_file) - - print("Heat equation solver") - print("Diffusion constant: {}".format(a)) - print("Input file: {}".format(input_file)) - print("Parameters") - print("----------") - print(" nx={} ny={} dx={} dy={}".format(field.shape[0], field.shape[1], - dx, dy)) - print(" time steps={} image interval={}".format(timesteps, - image_interval)) - - # Plot/save initial field - write_field(field, 0) - # Iterate - t0 = time.time() - iterate(field, field0, a, dx, dy, timesteps, image_interval) - t1 = time.time() - # Plot/save final field - write_field(field, 
timesteps) - - print("Simulation finished in {0} s".format(t1-t0)) - -if __name__ == '__main__': - - # Process command line arguments - parser = argparse.ArgumentParser(description='Heat equation') - parser.add_argument('-dx', type=float, default=0.01, - help='grid spacing in x-direction') - parser.add_argument('-dy', type=float, default=0.01, - help='grid spacing in y-direction') - parser.add_argument('-a', type=float, default=0.5, - help='diffusion constant') - parser.add_argument('-n', type=int, default=200, - help='number of time steps') - parser.add_argument('-i', type=int, default=4000, - help='image interval') - parser.add_argument('-f', type=str, default='bottle.dat', - help='input file') - - args = parser.parse_args() - - main(args.f, args.a, args.dx, args.dy, args.n, args.i) - diff --git a/cython/heat-equation/solution/setup.py b/cython/heat-equation/solution/setup.py deleted file mode 100644 index 3906cb8..0000000 --- a/cython/heat-equation/solution/setup.py +++ /dev/null @@ -1,6 +0,0 @@ -from distutils.core import setup, Extension -from Cython.Build import cythonize - -setup( - ext_modules=cythonize("heat.pyx"), -) diff --git a/cython/simple-extension/README.md b/cython/simple-extension/README.md deleted file mode 100644 index d4b33d5..0000000 --- a/cython/simple-extension/README.md +++ /dev/null @@ -1,21 +0,0 @@ -## Simple Cython extension - -### Creating a Cython extension -Create a simple Cython module (you can name it e.g. `cyt_module.pyx`) -containing the following function: -``` -def subtract(x, y): - result = x - y - return result -``` - -Create then a **setup.py** for building the extension module. -Try to utilize the module e.g. as -``` -from cyt_module import subtract - -subtract(4.5, 2) -``` -in interactive interpreter or in a simple script. Try different argument -types. - diff --git a/cython/simple-extension/solution/cyt_module.pyx b/cython/simple-extension/solution/cyt_module.pyx deleted file mode 100644 index 0059fa1..0000000 --- a/cython/simple-extension/solution/cyt_module.pyx +++ /dev/null @@ -1,3 +0,0 @@ -def subtract(x, y): - result = x - y - return result diff --git a/cython/simple-extension/solution/setup.py b/cython/simple-extension/solution/setup.py deleted file mode 100644 index c944b64..0000000 --- a/cython/simple-extension/solution/setup.py +++ /dev/null @@ -1,6 +0,0 @@ -from distutils.core import setup, Extension -from Cython.Build import cythonize - -setup( - ext_modules=cythonize("cyt_module.pyx") -) diff --git a/cython/static-typing/README.md b/cython/static-typing/README.md deleted file mode 100644 index 1c98277..0000000 --- a/cython/static-typing/README.md +++ /dev/null @@ -1,21 +0,0 @@ -## Using static typing - -Continue with the simple Cython module for subtracting two numbers: -``` -def subtract(x, y): - result = x - y - return result -``` - -Declare the function internal variable `result` as integer. Try to call the -function with different types of arguments (integers and floats), what kind of -results you do get? - -Next, declare also the function arguments as integers, and rebuild the module -(Note: if working with interactive interpreter you need to exit or -reload the module). What happens when you now call the function with -floating point arguments? - -Finally, try to declare arguments as floating point numbers (while keeping -`result` as integer), what happens? 
- diff --git a/cython/static-typing/cyt_module.pyx b/cython/static-typing/cyt_module.pyx deleted file mode 100644 index 0059fa1..0000000 --- a/cython/static-typing/cyt_module.pyx +++ /dev/null @@ -1,3 +0,0 @@ -def subtract(x, y): - result = x - y - return result diff --git a/cython/static-typing/setup.py b/cython/static-typing/setup.py deleted file mode 100644 index c944b64..0000000 --- a/cython/static-typing/setup.py +++ /dev/null @@ -1,6 +0,0 @@ -from distutils.core import setup, Extension -from Cython.Build import cythonize - -setup( - ext_modules=cythonize("cyt_module.pyx") -) diff --git a/cython/static-typing/solution/cyt_module.pyx b/cython/static-typing/solution/cyt_module.pyx deleted file mode 100644 index 942222b..0000000 --- a/cython/static-typing/solution/cyt_module.pyx +++ /dev/null @@ -1,4 +0,0 @@ -def subtract(int x, int y): - cdef int result - result = x - y - return result diff --git a/cython/static-typing/solution/setup.py b/cython/static-typing/solution/setup.py deleted file mode 100644 index c944b64..0000000 --- a/cython/static-typing/solution/setup.py +++ /dev/null @@ -1,6 +0,0 @@ -from distutils.core import setup, Extension -from Cython.Build import cythonize - -setup( - ext_modules=cythonize("cyt_module.pyx") -) diff --git a/dask/01_dask.delayed.ipynb b/dask/01_dask.delayed.ipynb new file mode 100644 index 0000000..e21bad8 --- /dev/null +++ b/dask/01_dask.delayed.ipynb @@ -0,0 +1,824 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "#0 - Prepare the environment\n", + "\n" + ], + "metadata": { + "id": "D__H_IQO1CYl" + } + }, + { + "cell_type": "code", + "source": [ + "!python -m pip install \"dask[complete]\"" + ], + "metadata": { + "id": "RnIBnGqRys4P" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5Ty3fHi3yjYM" + }, + "source": [ + "\"Dask\n", + "\n", + "# Parallelize code with `dask.delayed`\n", + "\n", + "In this section we parallelize simple for-loop style code with Dask and `dask.delayed`. Often, this is the only function that you will need to convert functions for use with Dask.\n", + "\n", + "This is a simple way to use `dask` to parallelize existing codebases or build [complex systems](https://blog.dask.org/2018/02/09/credit-models-with-dask). This will also help us to develop an understanding for later sections.\n", + "\n", + "**Related Documentation**\n", + "\n", + "* [Delayed documentation](https://docs.dask.org/en/latest/delayed.html)\n", + "* [Delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU)\n", + "* [Delayed API](https://docs.dask.org/en/latest/delayed-api.html)\n", + "* [Delayed examples](https://examples.dask.org/delayed.html)\n", + "* [Delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bkfgAv-myjYN" + }, + "source": [ + "As we'll see in the [distributed scheduler notebook](05_distributed.ipynb), Dask has several ways of executing code in parallel. We'll use the distributed scheduler by creating a `dask.distributed.Client`. For now, this will provide us with some nice diagnostics. We'll talk about schedulers in depth later." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MkkdG6SAyjYO" + }, + "outputs": [], + "source": [ + "from dask.distributed import Client\n", + "\n", + "client = Client(n_workers=4)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fFDIJJKEyjYO" + }, + "source": [ + "## Basics\n", + "\n", + "First let's make some toy functions, `inc` and `add`, that sleep for a while to simulate work. We'll then time running these functions normally.\n", + "\n", + "In the next section we'll parallelize this code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OL49VTeJyjYO" + }, + "outputs": [], + "source": [ + "from time import sleep\n", + "\n", + "def inc(x):\n", + " sleep(1)\n", + " return x + 1\n", + "\n", + "def add(x, y):\n", + " sleep(1)\n", + " return x + y" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KgZ2W0cVyjYP" + }, + "source": [ + "We time the execution of this normal code using the `%%time` magic, which is a special function of the Jupyter Notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JcECfN17yjYP" + }, + "outputs": [], + "source": [ + "%%time\n", + "# This takes three seconds to run because we call each\n", + "# function sequentially, one after the other\n", + "\n", + "x = inc(1)\n", + "y = inc(2)\n", + "z = add(x, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uUdJNtaiyjYP" + }, + "source": [ + "### Parallelize with the `dask.delayed` decorator\n", + "\n", + "Those two increment calls *could* be called in parallel, because they are totally independent of one-another.\n", + "\n", + "We'll transform the `inc` and `add` functions using the `dask.delayed` function. When we call the delayed version by passing the arguments, exactly as before, the original function isn't actually called yet - which is why the cell execution finishes very quickly.\n", + "Instead, a *delayed object* is made, which keeps track of the function to call and the arguments to pass to it.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eFTHIdtuyjYP" + }, + "outputs": [], + "source": [ + "from dask import delayed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vMTAK61lyjYP" + }, + "outputs": [], + "source": [ + "%%time\n", + "# This runs immediately, all it does is build a graph\n", + "\n", + "x = delayed(inc)(1)\n", + "y = delayed(inc)(2)\n", + "z = delayed(add)(x, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cKC-tohgyjYQ" + }, + "source": [ + "This ran immediately, since nothing has really happened yet.\n", + "\n", + "To get the result, call `compute`. Notice that this runs faster than the original code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YDZWoImxyjYQ" + }, + "outputs": [], + "source": [ + "%%time\n", + "# This actually runs our computation using a local thread pool\n", + "\n", + "z.compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x0t4hp3myjYQ" + }, + "source": [ + "## What just happened?\n", + "\n", + "The `z` object is a lazy `Delayed` object. This object holds everything we need to compute the final result, including references to all of the functions that are required and their inputs and relationship to one-another. We can evaluate the result with `.compute()` as above or we can visualize the task graph for this value with `.visualize()`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WVHdtHC4yjYQ" + }, + "outputs": [], + "source": [ + "z" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "anspmi42yjYQ" + }, + "outputs": [], + "source": [ + "# Look at the task graph for `z`\n", + "z.visualize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xt0LXrFbyjYQ" + }, + "source": [ + "Notice that this includes the names of the functions from before, and the logical flow of the outputs of the `inc` functions to the inputs of `add`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QuFdCk99yjYQ" + }, + "source": [ + "### Some questions to consider:\n", + "\n", + "- Why did we go from 3s to 2s? Why weren't we able to parallelize down to 1s?\n", + "- What would have happened if the inc and add functions didn't include the `sleep(1)`? Would Dask still be able to speed up this code?\n", + "- What if we have multiple outputs or also want to get access to x or y?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5BJ346pXyjYQ" + }, + "source": [ + "## Exercise: Parallelize a for loop\n", + "\n", + "`for` loops are one of the most common things that we want to parallelize. Use `dask.delayed` on `inc` and `sum` to parallelize the computation below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tHhIBtD-yjYQ" + }, + "outputs": [], + "source": [ + "data = [1, 2, 3, 4, 5, 6, 7, 8]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Sr9Yh9CJyjYR" + }, + "outputs": [], + "source": [ + "%%time\n", + "# Sequential code\n", + "\n", + "results = []\n", + "for x in data:\n", + " y = inc(x)\n", + " results.append(y)\n", + "\n", + "total = sum(results)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TWakOv6DyjYR" + }, + "outputs": [], + "source": [ + "total" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uZkEHmiYyjYR" + }, + "outputs": [], + "source": [ + "%%time\n", + "# Your parallel code here..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JQHBoqOxyjYR" + }, + "source": [ + "How do the graph visualizations compare with the given solution, compared to a version with the `sum` function used directly rather than wrapped with `delayed`? Can you explain the latter version? You might find the result of the following expression illuminating\n", + "```python\n", + "delayed(inc)(1) + delayed(inc)(2)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vklm4jM2yjYR" + }, + "source": [ + "## Exercise: Parallelizing a for-loop code with control flow\n", + "\n", + "Often we want to delay only *some* functions, running a few of them immediately. This is especially helpful when those functions are fast and help us to determine what other slower functions we should call. This decision, to delay or not to delay, is usually where we need to be thoughtful when using `dask.delayed`.\n", + "\n", + "In the example below we iterate through a list of inputs. If that input is even then we want to call `inc`. If the input is odd then we want to call `double`. This `is_even` decision to call `inc` or `double` has to be made immediately (not lazily) in order for our graph-building Python code to proceed." 
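For reference, here is one possible way to complete the for-loop exercise above. This is only a sketch (it assumes `inc` and `data` from the earlier cells); wrapping the final `sum` in `delayed` as well keeps the whole reduction inside a single task graph.

```python
from dask import delayed

# Sketch of the parallel for-loop exercise above
# (inc and data are defined in the earlier cells).
results = []
for x in data:
    y = delayed(inc)(x)        # each call becomes an independent task
    results.append(y)

total = delayed(sum)(results)  # the reduction is delayed too
print(total.compute())         # executes all inc calls in parallel
```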
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IvT-dT0gyjYR" + }, + "outputs": [], + "source": [ + "def double(x):\n", + " sleep(1)\n", + " return 2 * x\n", + "\n", + "def is_even(x):\n", + " return not x % 2\n", + "\n", + "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "g61VYVJDyjYR" + }, + "outputs": [], + "source": [ + "%%time\n", + "# Sequential code\n", + "\n", + "results = []\n", + "for x in data:\n", + " if is_even(x):\n", + " y = double(x)\n", + " else:\n", + " y = inc(x)\n", + " results.append(y)\n", + "\n", + "total = sum(results)\n", + "print(total)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ximAhsaAyjYR" + }, + "outputs": [], + "source": [ + "%%time\n", + "# Your parallel code here...\n", + "# TODO: parallelize the sequential code above using dask.delayed\n", + "# You will need to delay some functions, but not all" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hYEZxOBUyjYR" + }, + "outputs": [], + "source": [ + "%time total.compute()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0sX_AzlSyjYR" + }, + "outputs": [], + "source": [ + "total.visualize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h-eN3QMeyjYS" + }, + "source": [ + "### Some questions to consider:\n", + "\n", + "- What are other examples of control flow where we can't use delayed?\n", + "- What would have happened if we had delayed the evaluation of `is_even(x)` in the example above?\n", + "- What are your thoughts on delaying `sum`? This function is both computational but also fast to run." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C0WLxFhYyjYS" + }, + "source": [ + "## Exercise: Parallelizing a Pandas Groupby Reduction\n", + "\n", + "In this exercise we read several CSV files and perform a groupby operation in parallel. We are given sequential code to do this and parallelize it with `dask.delayed`.\n", + "\n", + "The computation we will parallelize is to compute the mean departure delay per airport from some historical flight data. We will do this by using `dask.delayed` together with `pandas`. In a future section we will do this same exercise with `dask.dataframe`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FMMozPrkyjYS" + }, + "source": [ + "## Create data\n", + "\n", + "Run this code to prep some data.\n", + "\n", + "This downloads and extracts some historical flight data for flights out of NYC between 1990 and 2000. The data is originally from [here](http://stat-computing.org/dataexpo/2009/the-data.html)." 
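Before moving on to the flight data, here is one possible sketch for the control-flow exercise above (again assuming `inc`, `double`, `is_even` and `data` from the earlier cells). The key point is that `is_even` runs immediately, while `inc` and `double` are delayed.

```python
from dask import delayed

# Sketch of the control-flow exercise above: the is_even test is evaluated
# eagerly so ordinary Python control flow can choose which task to build.
results = []
for x in data:
    if is_even(x):               # not delayed: decides the graph structure
        y = delayed(double)(x)
    else:
        y = delayed(inc)(x)
    results.append(y)

total = delayed(sum)(results)
print(total.compute())           # should match the sequential total
```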
+ ] + }, + { + "cell_type": "code", + "source": [ + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/dask/prep.py\n", + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/dask/accounts.py\n", + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/dask/config.py\n", + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/dask/sources.py\n", + "!mkdir data\n" + ], + "metadata": { + "id": "cfpI4Qs3zHhD" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "B0Tk0KDwyjYS" + }, + "outputs": [], + "source": [ + "%run prep.py -d flights" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L8imuytMyjYS" + }, + "source": [ + "### Inspect data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "E510UHCJyjYS" + }, + "outputs": [], + "source": [ + "import os\n", + "sorted(os.listdir(os.path.join('data', 'nycflights')))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WRgQis8xyjYV" + }, + "source": [ + "### Read one file with `pandas.read_csv` and compute mean departure delay" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Wf-74OJOyjYV" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(os.path.join('data', 'nycflights', '1990.csv'))\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VIKTRQDVyjYV" + }, + "outputs": [], + "source": [ + "# What is the schema?\n", + "df.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WB1MMhQGyjYV" + }, + "outputs": [], + "source": [ + "# What originating airports are in the data?\n", + "df.Origin.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OgojpaNxyjYW" + }, + "outputs": [], + "source": [ + "# Mean departure delay per-airport for one year\n", + "df.groupby('Origin').DepDelay.mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PlPr5RS-yjYW" + }, + "source": [ + "### Sequential code: Mean Departure Delay Per Airport\n", + "\n", + "The above cell computes the mean departure delay per-airport for one year. Here we expand that to all years using a sequential for loop." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "E3Q-VEQByjYW" + }, + "outputs": [], + "source": [ + "from glob import glob\n", + "filenames = sorted(glob(os.path.join('data', 'nycflights', '*.csv')))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AK6iiYPtyjYW" + }, + "outputs": [], + "source": [ + "%%time\n", + "\n", + "sums = []\n", + "counts = []\n", + "for fn in filenames:\n", + " # Read in file\n", + " df = pd.read_csv(fn)\n", + "\n", + " # Groupby origin airport\n", + " by_origin = df.groupby('Origin')\n", + "\n", + " # Sum of all departure delays by origin\n", + " total = by_origin.DepDelay.sum()\n", + "\n", + " # Number of flights by origin\n", + " count = by_origin.DepDelay.count()\n", + "\n", + " # Save the intermediates\n", + " sums.append(total)\n", + " counts.append(count)\n", + "\n", + "# Combine intermediates to get total mean-delay-per-origin\n", + "total_delays = sum(sums)\n", + "n_flights = sum(counts)\n", + "mean = total_delays / n_flights" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KdCgIehhyjYW" + }, + "outputs": [], + "source": [ + "mean" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gJ-3vJYiyjYW" + }, + "source": [ + "### Parallelize the code above\n", + "\n", + "Use `dask.delayed` to parallelize the code above. Some extra things you will need to know.\n", + "\n", + "1. Methods and attribute access on delayed objects work automatically, so if you have a delayed object you can perform normal arithmetic, slicing, and method calls on it and it will produce the correct delayed calls.\n", + "\n", + " ```python\n", + " x = delayed(np.arange)(10)\n", + " y = (x + 1)[::2].sum() # everything here was delayed\n", + " ```\n", + "2. Calling the `.compute()` method works well when you have a single output. When you have multiple outputs you might want to use the `dask.compute` function:\n", + "\n", + " ```python\n", + " >>> from dask import compute\n", + " >>> x = delayed(np.arange)(10)\n", + " >>> y = x ** 2\n", + " >>> min_, max_ = compute(y.min(), y.max())\n", + " >>> min_, max_\n", + " (0, 81)\n", + " ```\n", + " \n", + " This way Dask can share the intermediate values (like `y = x**2`)\n", + " \n", + "So your goal is to parallelize the code above (which has been copied below) using `dask.delayed`. You may also want to visualize a bit of the computation to see if you're doing it correctly." 
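For reference, one possible shape of the parallel version is sketched below (assuming `pd` and `filenames` from the cells above). It is not necessarily the intended solution; in particular, you could also call `.compute()` at different points and compare the resulting graphs.

```python
import pandas as pd
from dask import delayed, compute

# Sketch: one lazy read + groupby chain per file, computed together at the
# end so Dask can share intermediates (filenames comes from the earlier cell).
sums = []
counts = []
for fn in filenames:
    df = delayed(pd.read_csv)(fn)        # lazy read of one CSV
    by_origin = df.groupby('Origin')     # method calls on Delayed objects stay lazy
    sums.append(by_origin.DepDelay.sum())
    counts.append(by_origin.DepDelay.count())

total_delays, n_flights = compute(sum(sums), sum(counts))
mean = total_delays / n_flights
```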
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_DoKsD4CyjYW" + }, + "outputs": [], + "source": [ + "from dask import compute" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VdmkgZwqyjYW" + }, + "outputs": [], + "source": [ + "%%time\n", + "\n", + "# copied sequential code\n", + "\n", + "sums = []\n", + "counts = []\n", + "for fn in filenames:\n", + " # Read in file\n", + " df = pd.read_csv(fn)\n", + "\n", + " # Groupby origin airport\n", + " by_origin = df.groupby('Origin')\n", + "\n", + " # Sum of all departure delays by origin\n", + " total = by_origin.DepDelay.sum()\n", + "\n", + " # Number of flights by origin\n", + " count = by_origin.DepDelay.count()\n", + "\n", + " # Save the intermediates\n", + " sums.append(total)\n", + " counts.append(count)\n", + "\n", + "# Combine intermediates to get total mean-delay-per-origin\n", + "total_delays = sum(sums)\n", + "n_flights = sum(counts)\n", + "mean = total_delays / n_flights" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BCm_RNynyjYW" + }, + "outputs": [], + "source": [ + "mean" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yVdfUNAsyjYW" + }, + "outputs": [], + "source": [ + "%%time\n", + "# your code here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MDNhnnokyjYX" + }, + "outputs": [], + "source": [ + "# ensure the results still match\n", + "mean" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q3yaVgSZyjYX" + }, + "source": [ + "### Some questions to consider:\n", + "\n", + "- How much speedup did you get? Is this how much speedup you'd expect?\n", + "- Experiment with where to call `compute`. What happens when you call it on `sums` and `counts`? What happens if you wait and call it on `mean`?\n", + "- Experiment with delaying the call to `sum`. What does the graph look like if `sum` is delayed? What does the graph look like if it isn't?\n", + "- Can you think of any reason why you'd want to do the reduction one way over the other?\n", + "\n", + "### Learn More\n", + "\n", + "Visit the [Delayed documentation](https://docs.dask.org/en/latest/delayed.html). In particular, this [delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU) will reinforce the concepts you learned here and the [delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html) document collects advice on using `dask.delayed` well." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4MZwEJeUyjYX" + }, + "source": [ + "## Close the Client\n", + "\n", + "Before moving on to the next exercise, make sure to close your client or stop this kernel." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "V1HkMqOyyjYX" + }, + "outputs": [], + "source": [ + "client.close()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + }, + "colab": { + "provenance": [], + "include_colab_link": true + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/dask/01x_lazy.ipynb b/dask/01x_lazy.ipynb new file mode 100644 index 0000000..eb33178 --- /dev/null +++ b/dask/01x_lazy.ipynb @@ -0,0 +1,727 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "#0 - Prepare the environment" + ], + "metadata": { + "id": "tYclpfd31yB4" + } + }, + { + "cell_type": "code", + "source": [ + "!python -m pip install \"dask[complete]\"" + ], + "metadata": { + "id": "yfU09xHb1tVq" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/dask/prep.py\n", + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/dask/accounts.py\n", + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/dask/config.py\n", + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/dask/sources.py\n", + "!wget https://raw.githubusercontent.com/lsteffenel/hpc-python/refs/heads/master/README.md\n", + "!mkdir data" + ], + "metadata": { + "id": "_Dh1Hsed2GQx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dhWwKZYh1sPg" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gg9ecWva1sPg" + }, + "source": [ + "# Lazy execution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lqqrSo9J1sPg" + }, + "source": [ + "Here we discuss some of the concepts behind dask, and lazy execution of code. You do not need to go through this material if you are eager to get on with the tutorial, but it may help understand the concepts underlying dask, how these things fit in with techniques you might already be using, and how to understand things that can go wrong." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6emxpbi61sPh" + }, + "source": [ + "## Prelude" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-s2T32mH1sPh" + }, + "source": [ + "As Python programmers, you probably already perform certain *tricks* to enable computation of larger-than-memory datasets, parallel execution or delayed/background execution. Perhaps with this phrasing, it is not clear what we mean, but a few examples should make things clearer. 
The point of Dask is to make simple things easy and complex things possible!\n", + "\n", + "Aside from the [detailed introduction](http://dask.pydata.org/en/latest/), we can summarize the basics of Dask as follows:\n", + "\n", + "- process data that doesn't fit into memory by breaking it into blocks and specifying task chains\n", + "- parallelize execution of tasks across cores and even nodes of a cluster\n", + "- move computation to the data rather than the other way around, to minimize communication overhead\n", + "\n", + "All of this allows you to get the most out of your computation resources, but program in a way that is very familiar: for-loops to build basic tasks, Python iterators, and the NumPy (array) and Pandas (dataframe) functions for multi-dimensional or tabular data, respectively.\n", + "\n", + "The remainder of this notebook will take you through the first of these programming paradigms. This is more detail than some users will want, who can skip ahead to the iterator, array and dataframe sections; but there will be some data processing tasks that don't easily fit into those abstractions and need to fall back to the methods here.\n", + "\n", + "We include a few examples at the end of the notebooks showing that the ideas behind how Dask is built are not actually that novel, and experienced programmers will have met parts of the design in other situations before. Those examples are left for the interested." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Oouyiz9v1sPh" + }, + "source": [ + "## Dask is a graph execution engine" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y9K-tLwy1sPh" + }, + "source": [ + "Dask allows you to construct a prescription for the calculation you want to carry out. That may sound strange, but a simple example will demonstrate that you can achieve this while programming with perfectly ordinary Python functions and for-loops. We saw this in the previous notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rbpgtBdF1sPi" + }, + "outputs": [], + "source": [ + "from dask import delayed\n", + "\n", + "@delayed\n", + "def inc(x):\n", + " return x + 1\n", + "\n", + "@delayed\n", + "def add(x, y):\n", + " return x + y" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6azvl-Gs1sPi" + }, + "source": [ + "Here we have used the delayed annotation to show that we want these functions to operate lazily — to save the set of inputs and execute only on demand. `dask.delayed` is also a function which can do this, without the annotation, leaving the original function unchanged, e.g.,\n", + "```python\n", + " delayed_inc = delayed(inc)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "c1KTz3Ow1sPi" + }, + "outputs": [], + "source": [ + "# this looks like ordinary code\n", + "x = inc(15)\n", + "y = inc(30)\n", + "total = add(x, y)\n", + "# x, y and total are all delayed objects.\n", + "# They contain a prescription of how to carry out the computation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dgp-SLJ71sPi" + }, + "source": [ + "Calling a delayed function created a delayed object (`x, y, total`) which can be examined interactively. Making these objects is somewhat equivalent to constructs like the `lambda` or function wrappers (see below). 
Each holds a simple dictionary describing the task graph, a full specification of how to carry out the computation.\n", + "\n", + "We can visualize the chain of calculations that the object `total` corresponds to as follows; the circles are functions, rectangles are data/results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HNpSaktp1sPj" + }, + "outputs": [], + "source": [ + "total.visualize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iYZ0kfwM1sPj" + }, + "source": [ + "But so far, no functions have actually been executed. This demonstrated the division between the graph-creation part of Dask (`delayed()`, in this example) and the graph execution part of Dask.\n", + "\n", + "To run the \"graph\" in the visualization, and actually get a result, do:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dzsqSzja1sPj" + }, + "outputs": [], + "source": [ + "# execute all tasks\n", + "total.compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kcFdr2Bq1sPj" + }, + "source": [ + "**Why should you care about this?**\n", + "\n", + "By building a specification of the calculation we want to carry out before executing anything, we can pass the specification to an *execution engine* for evaluation. In the case of Dask, this execution engine could be running on many nodes of a cluster, so you have access to the full number of CPU cores and memory across all the machines. Dask will intelligently execute your calculation with care for minimizing the amount of data held in memory, while parallelizing over the tasks that make up a graph. Notice that in the animated diagram below, where four workers are processing the (simple) graph, execution progresses vertically up the branches first, so that intermediate results can be expunged before moving onto a new branch.\n", + "\n", + "With `delayed` and normal pythonic looped code, very complex graphs can be built up and passed on to Dask for execution. See a nice example of [simulated complex ETL](https://blog.dask.org/2017/01/24/dask-custom) work flow.\n", + "\n", + "![this](https://github.com/lsteffenel/hpc-python/blob/master/dask/images/grid_search_schedule.gif?raw=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3IY506_61sPj" + }, + "source": [ + "### Exercise" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mOf6qGMH1sPj" + }, + "source": [ + "We will apply `delayed` to a real data processing task, albeit a simple one.\n", + "\n", + "Consider reading three CSV files with `pd.read_csv` and then measuring their total length. We will consider how you would do this with ordinary Python code, then build a graph for this process using delayed, and finally execute this graph using Dask, for a handy speed-up factor of more than two (there are only three inputs to parallelize over)." 
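The next cells create the data and show the sequential version first. For reference, one possible delayed version of this exercise is sketched here; it is only a sketch, and the `data/accounts.*.csv` files it reads are the ones produced by `prep.py` in the cells below.

```python
import os
import pandas as pd
from dask import delayed

# Sketch of the delayed three-CSV length count described above
# (the CSV files are created by prep.py in the following cells).
filenames = [os.path.join('data', 'accounts.%d.csv' % i) for i in [0, 1, 2]]

lengths = [delayed(len)(delayed(pd.read_csv)(fn)) for fn in filenames]
total = delayed(sum)(lengths)

total.visualize()        # inspect the task graph
print(total.compute())   # run the three reads (and len calls) in parallel
```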
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sMErX_hH1sPj" + }, + "outputs": [], + "source": [ + "%run prep.py -d accounts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Bgjwxj9-1sPj" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import os\n", + "filenames = [os.path.join('data', 'accounts.%d.csv' % i) for i in [0, 1, 2]]\n", + "filenames" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gRV-xnMU1sPj" + }, + "outputs": [], + "source": [ + "%%time\n", + "\n", + "# normal, sequential code\n", + "a = pd.read_csv(filenames[0])\n", + "b = pd.read_csv(filenames[1])\n", + "c = pd.read_csv(filenames[2])\n", + "\n", + "na = len(a)\n", + "nb = len(b)\n", + "nc = len(c)\n", + "\n", + "total = sum([na, nb, nc])\n", + "print(total)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kl48ZPeD1sPj" + }, + "source": [ + "Your task is to recreate this graph again using the delayed function on the original Python code. The three functions you want to delay are `pd.read_csv`, `len` and `sum`.." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MMKgFwQX1sPk" + }, + "source": [ + "```python\n", + "delayed_read_csv = delayed(pd.read_csv)\n", + "a = delayed_read_csv(filenames[0])\n", + "...\n", + "\n", + "total = ...\n", + "\n", + "# execute\n", + "%time total.compute() \n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XEdUQZsx1sPk" + }, + "outputs": [], + "source": [ + "# your verbose code here" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DCb168xA1sPk" + }, + "source": [ + "Next, repeat this using loops, rather than writing out all the variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lDNqFku51sPk" + }, + "outputs": [], + "source": [ + "# your concise code here" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rWSTsFCU1sPk" + }, + "source": [ + "**Notes**\n", + "\n", + "Delayed objects support various operations:\n", + "```python\n", + " x2 = x + 1\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ui6Ca7mt1sPk" + }, + "source": [ + "if `x` was a delayed result (like `total`, above), then so is `x2`. Supported operations include arithmetic operators, item or slice selection, attribute access and method calls - essentially anything that could be phrased as a `lambda` expression.\n", + "\n", + "Operations which are *not* supported include mutation, setter methods, iteration (for) and bool (predicate)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KotMwxJ11sPk" + }, + "source": [ + "## Appendix: Further detail and examples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6V9pjoNU1sPk" + }, + "source": [ + "The following examples show that the kinds of things Dask does are not so far removed from normal Python programming when dealing with big data. These examples are **only meant for experts**, typical users can continue with the next notebook in the tutorial." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Azz3yWX1sPk" + }, + "source": [ + "### Example 1: simple word count" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tEF1NXEx1sPk" + }, + "source": [ + "This directory contains a file called `README.md`. 
How would you count the number of words in that file?\n", + "\n", + "The simplest approach would be to load all the data into memory, split on whitespace and count the number of results. Here we use a regular expression to split words." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ABzNlj1q1sPl" + }, + "outputs": [], + "source": [ + "import re\n", + "splitter = re.compile('\\w+')\n", + "with open('README.md', 'r') as f:\n", + " data = f.read()\n", + "result = len(splitter.findall(data))\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eOu1upyq1sPl" + }, + "source": [ + "The trouble with this approach is that it does not scale - if the file is very large, it, and the generated list of words, might fill up memory. We can easily avoid that, because we only need a simple sum, and each line is totally independent of the others. Now we evaluate each piece of data and immediately free up the space again, so we could perform this on arbitrarily-large files. Note that there is often a trade-off between time-efficiency and memory footprint: the following uses very little memory, but may be slower for files that do not fill a large faction of memory. In general, one would like chunks small enough not to stress memory, but big enough for efficient use of the CPU." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_28zNCG61sPl" + }, + "outputs": [], + "source": [ + "result = 0\n", + "with open('README.md', 'r') as f:\n", + " for line in f:\n", + " result += len(splitter.findall(line))\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uuh1D-k71sPl" + }, + "source": [ + "### Example 2: background execution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F2USbXgU1sPl" + }, + "source": [ + "There are many tasks that take a while to complete, but don't actually require much of the CPU, for example anything that requires communication over a network, or input from a user. In typical sequential programming, execution would need to halt while the process completes, and then continue execution. That would be dreadful for user experience (imagine the slow progress bar that locks up the application and cannot be canceled), and wasteful of time (the CPU could have been doing useful work in the meantime).\n", + "\n", + "For example, we can launch processes and get their output as follows:\n", + "```python\n", + " import subprocess\n", + " p = subprocess.Popen(command, stdout=subprocess.PIPE)\n", + " p.returncode\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "43Lga1h01sPl" + }, + "source": [ + "The task is run in a separate process, and the return-code will remain `None` until it completes, when it will change to `0`. To get the result back, we need `out = p.communicate()[0]` (which would block if the process was not complete)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AtB2jv0r1sPm" + }, + "source": [ + "Similarly, we can launch Python processes and threads in the background. Some methods allow mapping over multiple inputs and gathering the results, more on that later. The thread starts and the cell completes immediately, but the data associated with the download only appears in the queue object some time later." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oNKk21Ol1sPm" + }, + "outputs": [], + "source": [ + "# Edit sources.py to configure source locations\n", + "import sources\n", + "sources.lazy_url" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "44RX6ORx1sPm" + }, + "outputs": [], + "source": [ + "import threading\n", + "import queue\n", + "import urllib\n", + "\n", + "def get_webdata(url, q):\n", + " u = urllib.request.urlopen(url)\n", + " # raise ValueError\n", + " q.put(u.read())\n", + "\n", + "q = queue.Queue()\n", + "t = threading.Thread(target=get_webdata, args=(sources.lazy_url, q))\n", + "t.start()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nrF1w-S-1sPm" + }, + "outputs": [], + "source": [ + "# fetch result back into this thread. If the worker thread is not done, this would wait.\n", + "q.get()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Iug-wjFt1sPm" + }, + "source": [ + "Consider: what would you see if there had been an exception within the `get_webdata` function? You could uncomment the `raise` line, above, and re-execute the two cells. What happens? Is there any way to debug the execution to find the root cause of the error?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MX_6U4uL1sPm" + }, + "source": [ + "### Example 3: delayed execution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "emys9SFf1sPm" + }, + "source": [ + "There are many ways in Python to specify the computation you want to execute, but only run it *later*." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gLKPLF7s1sPm" + }, + "outputs": [], + "source": [ + "def add(x, y):\n", + " return x + y\n", + "\n", + "# Sometimes we defer computations with strings\n", + "x = 15\n", + "y = 30\n", + "z = \"add(x, y)\"\n", + "eval(z)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "J6NZkGHb1sPm" + }, + "outputs": [], + "source": [ + "# we can use lambda or other \"closure\"\n", + "x = 15\n", + "y = 30\n", + "z = lambda: add(x, y)\n", + "z()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9pjtVE0M1sPn" + }, + "outputs": [], + "source": [ + "# A very similar thing happens in functools.partial\n", + "\n", + "import functools\n", + "z = functools.partial(add, x, y)\n", + "z()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "D11ZABMw1sPn" + }, + "outputs": [], + "source": [ + "# Python generators are delayed execution by default\n", + "# Many Python functions expect such iterable objects\n", + "\n", + "def gen():\n", + " res = x\n", + " yield res\n", + " res += y\n", + " yield res\n", + "\n", + "g = gen()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8cGva_Fh1sPn" + }, + "outputs": [], + "source": [ + "# run once: we get one value and execution halts within the generator\n", + "# run again and the execution completes\n", + "next(g)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s5KWeb0y1sPn" + }, + "source": [ + "### Dask graphs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "spwcyqcl1sPn" + }, + "source": [ + "Any Dask object, such as `total`, above, has an attribute which describes the calculations necessary to produce that result. 
Indeed, this is exactly the graph that we have been talking about, which can be visualized. We see that it is a simple dictionary, in which the keys are unique task identifiers, and the values are the functions and inputs for calculation.\n", + "\n", + "`delayed` is a handy mechanism for creating the Dask graph, but the adventurous may wish to play with the full fexibility afforded by building the graph dictionaries directly. Detailed information can be found [here](http://dask.pydata.org/en/latest/graphs.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W1JTK90u1sPn" + }, + "outputs": [], + "source": [ + "total.dask" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CRac1LyO1sPn" + }, + "outputs": [], + "source": [ + "dict(total.dask)" + ] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.6" + }, + "colab": { + "provenance": [], + "include_colab_link": true + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/dask/02_bag.ipynb b/dask/02_bag.ipynb new file mode 100644 index 0000000..dc5da4a --- /dev/null +++ b/dask/02_bag.ipynb @@ -0,0 +1,717 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Bag: Parallel Lists for semi-structured data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Dask-bag excels in processing data that can be represented as a sequence of arbitrary inputs. We'll refer to this as \"messy\" data, because it can contain complex nested structures, missing fields, mixtures of data types, etc. The *functional* programming style fits very nicely with standard Python iteration, such as can be found in the `itertools` module.\n", + "\n", + "Messy data is often encountered at the beginning of data processing pipelines when large volumes of raw data are first consumed. The initial set of data might be JSON, CSV, XML, or any other format that does not enforce strict structure and datatypes.\n", + "For this reason, the initial data massaging and processing is often done with Python `list`s, `dict`s, and `set`s.\n", + "\n", + "These core data structures are optimized for general-purpose storage and processing. Adding streaming computation with iterators/generator expressions or libraries like `itertools` or [`toolz`](https://toolz.readthedocs.io/en/latest/) let us process large volumes in a small space. If we combine this with parallel processing then we can churn through a fair amount of data.\n", + "\n", + "Dask.bag is a high level Dask collection to automate common workloads of this form. 
In a nutshell\n", + "\n", + " dask.bag = map, filter, toolz + parallel execution\n", + " \n", + "**Related Documentation**\n", + "\n", + "* [Bag documentation](https://docs.dask.org/en/latest/bag.html)\n", + "* [Bag screencast](https://youtu.be/-qIiJ1XtSv0)\n", + "* [Bag API](https://docs.dask.org/en/latest/bag-api.html)\n", + "* [Bag examples](https://examples.dask.org/bag.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%run prep.py -d accounts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, we'll use the distributed scheduler. Schedulers will be explained in depth [later](05_distributed.ipynb)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from dask.distributed import Client\n", + "\n", + "client = Client(n_workers=4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can create a `Bag` from a Python sequence, from files, from data on S3, etc.\n", + "We demonstrate using `.take()` to show elements of the data. (Doing `.take(1)` results in a tuple with one element)\n", + "\n", + "Note that the data are partitioned into blocks, and there are many items per block. In the first example, the two partitions contain five elements each, and in the following two, each file is partitioned into one or more bytes blocks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# each element is an integer\n", + "import dask.bag as db\n", + "b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], npartitions=2)\n", + "b.take(3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# each element is a text file, where each line is a JSON object\n", + "# note that the compression is handled automatically\n", + "import os\n", + "b = db.read_text(os.path.join('data', 'accounts.*.json.gz'))\n", + "b.take(1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Edit sources.py to configure source locations\n", + "import sources\n", + "sources.bag_url" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Requires `s3fs` library\n", + "# each partition is a remote CSV text file\n", + "b = db.read_text(sources.bag_url,\n", + " storage_options={'anon': True})\n", + "b.take(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Manipulation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`Bag` objects hold the standard functional API found in projects like the Python standard library, `toolz`, or `pyspark`, including `map`, `filter`, `groupby`, etc..\n", + "\n", + "Operations on `Bag` objects create new bags. Call the `.compute()` method to trigger execution, as we saw for `Delayed` objects. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def is_even(n):\n", + " return n % 2 == 0\n", + "\n", + "b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])\n", + "c = b.filter(is_even).map(lambda x: x ** 2)\n", + "c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# blocking form: wait for completion (which is very fast in this case)\n", + "c.compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example: Accounts JSON data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We've created a fake dataset of gzipped JSON data in your data directory. This is like the example used in the `DataFrame` example we will see later, except that it has bundled up all of the entires for each individual `id` into a single record. This is similar to data that you might collect off of a document store database or a web API.\n", + "\n", + "Each line is a JSON encoded dictionary with the following keys\n", + "\n", + "* id: Unique identifier of the customer\n", + "* name: Name of the customer\n", + "* transactions: List of `transaction-id`, `amount` pairs, one for each transaction for the customer in that file" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "filename = os.path.join('data', 'accounts.*.json.gz')\n", + "lines = db.read_text(filename)\n", + "lines.take(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our data comes out of the file as lines of text. Notice that file decompression happened automatically. We can make this data look more reasonable by mapping the `json.loads` function onto our bag." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "js = lines.map(json.loads)\n", + "# take: inspect first few elements\n", + "js.take(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Basic Queries" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once we parse our JSON data into proper Python objects (`dict`s, `list`s, etc.) we can perform more interesting queries by creating small Python functions to run on our data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# filter: keep only some elements of the sequence\n", + "js.filter(lambda record: record['name'] == 'Alice').take(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def count_transactions(d):\n", + " return {'name': d['name'], 'count': len(d['transactions'])}\n", + "\n", + "# map: apply a function to each element\n", + "(js.filter(lambda record: record['name'] == 'Alice')\n", + " .map(count_transactions)\n", + " .take(5))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pluck: select a field, as from a dictionary, element[field]\n", + "(js.filter(lambda record: record['name'] == 'Alice')\n", + " .map(count_transactions)\n", + " .pluck('count')\n", + " .take(5))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Average number of transactions for all of the Alice entries\n", + "(js.filter(lambda record: record['name'] == 'Alice')\n", + " .map(count_transactions)\n", + " .pluck('count')\n", + " .mean()\n", + " .compute())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Use `flatten` to de-nest" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the example below we see the use of `.flatten()` to flatten results. We compute the average amount for all transactions for all Alices." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "(js.filter(lambda record: record['name'] == 'Alice')\n", + " .pluck('transactions')\n", + " .take(3))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "(js.filter(lambda record: record['name'] == 'Alice')\n", + " .pluck('transactions')\n", + " .flatten()\n", + " .take(3))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "(js.filter(lambda record: record['name'] == 'Alice')\n", + " .pluck('transactions')\n", + " .flatten()\n", + " .pluck('amount')\n", + " .take(3))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "(js.filter(lambda record: record['name'] == 'Alice')\n", + " .pluck('transactions')\n", + " .flatten()\n", + " .pluck('amount')\n", + " .mean()\n", + " .compute())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Groupby and Foldby" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Often we want to group data by some function or key. We can do this either with the `.groupby` method, which is straightforward but forces a full shuffle of the data (expensive) or with the harder-to-use but faster `.foldby` method, which does a streaming combined groupby and reduction.\n", + "\n", + "* `groupby`: Shuffles data so that all items with the same key are in the same key-value pair\n", + "* `foldby`: Walks through the data accumulating a result per key\n", + "\n", + "*Note: the full groupby is particularly bad. 
In actual workloads you would do well to use `foldby` or switch to `DataFrame`s if possible.*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `groupby`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Groupby collects items in your collection so that all items with the same value under some function are collected together into a key-value pair." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "b = db.from_sequence(['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'])\n", + "b.groupby(len).compute() # names grouped by length" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "b = db.from_sequence(list(range(10)))\n", + "b.groupby(lambda x: x % 2).compute()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "b.groupby(lambda x: x % 2).starmap(lambda k, v: (k, max(v))).compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `foldby`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Foldby can be quite odd at first. It is similar to the following functions from other libraries:\n", + "\n", + "* [`toolz.reduceby`](http://toolz.readthedocs.io/en/latest/streaming-analytics.html#streaming-split-apply-combine)\n", + "* [`pyspark.RDD.combineByKey`](http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/)\n", + "\n", + "When using `foldby` you provide \n", + "\n", + "1. A key function on which to group elements\n", + "2. A binary operator such as you would pass to `reduce` that you use to perform reduction per each group\n", + "3. A combine binary operator that can combine the results of two `reduce` calls on different parts of your dataset.\n", + "\n", + "Your reduction must be associative. It will happen in parallel in each of the partitions of your dataset. Then all of these intermediate results will be combined by the `combine` binary operator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "b.foldby(lambda x: x % 2, binop=max, combine=max).compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example with account data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We find the number of people with the same name." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "# Warning, this one takes a while...\n", + "result = js.groupby(lambda item: item['name']).starmap(lambda k, v: (k, len(v))).compute()\n", + "print(sorted(result))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "# This one is comparatively fast and produces the same result.\n", + "from operator import add\n", + "def incr(tot, _):\n", + " return tot + 1\n", + "\n", + "result = js.foldby(key='name', \n", + " binop=incr, \n", + " initial=0, \n", + " combine=add, \n", + " combine_initial=0).compute()\n", + "print(sorted(result))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise: compute total amount per name" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We want to groupby (or foldby) the `name` key, then add up the all of the amounts for each name.\n", + "\n", + "Steps\n", + "\n", + "1. Create a small function that, given a dictionary like \n", + "\n", + " {'name': 'Alice', 'transactions': [{'amount': 1, 'id': 123}, {'amount': 2, 'id': 456}]}\n", + " \n", + " produces the sum of the amounts, e.g. `3`\n", + " \n", + "2. Slightly change the binary operator of the `foldby` example above so that the binary operator doesn't count the number of entries, but instead accumulates the sum of the amounts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## DataFrames" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the same reasons that Pandas is often faster than pure Python, `dask.dataframe` can be faster than `dask.bag`. We will work more with DataFrames later, but from the point of view of a Bag, it is frequently the end-point of the \"messy\" part of data ingestion—once the data can be made into a data-frame, then complex split-apply-combine logic will become much more straight-forward and efficient.\n", + "\n", + "You can transform a bag with a simple tuple or flat dictionary structure into a `dask.dataframe` with the `to_dataframe` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df1 = js.to_dataframe()\n", + "df1.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This now looks like a well-defined DataFrame, and we can apply Pandas-like computations to it efficiently." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using a Dask DataFrame, how long does it take to do our prior computation of numbers of people with the same name? It turns out that `dask.dataframe.groupby()` beats `dask.bag.groupby()` by more than an order of magnitude; but it still cannot match `dask.bag.foldby()` for this case." 
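Going back to the "total amount per name" exercise above, one possible sketch looks like the following. It assumes the `js` bag from the earlier cells; the helper names `sum_amounts` and `add_record` are just illustrative.

```python
from operator import add

# Sketch of the "total amount per name" exercise above (js comes from the
# earlier cells; sum_amounts and add_record are illustrative helper names).
def sum_amounts(record):
    # sum of the amounts in one customer record
    return sum(t['amount'] for t in record['transactions'])

def add_record(total, record):
    # binary operator for foldby: accumulate the per-record sum
    return total + sum_amounts(record)

result = js.foldby(key='name',
                   binop=add_record,
                   initial=0,
                   combine=add,        # combine partial totals from different partitions
                   combine_initial=0).compute()
print(sorted(result)[:5])
```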
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%time df1.groupby('name').id.count().compute().head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Denormalization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This DataFrame format is less-than-optimal because the `transactions` column is filled with nested data so Pandas has to revert to `object` dtype, which is quite slow in Pandas. Ideally we want to transform to a dataframe only after we have flattened our data so that each record is a single `int`, `string`, `float`, etc.." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def denormalize(record):\n", + " # returns a list for each person, one item per transaction\n", + " return [{'id': record['id'], \n", + " 'name': record['name'], \n", + " 'amount': transaction['amount'], \n", + " 'transaction-id': transaction['transaction-id']}\n", + " for transaction in record['transactions']]\n", + "\n", + "transactions = js.map(denormalize).flatten()\n", + "transactions.take(3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = transactions.to_dataframe()\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "# number of transactions per name\n", + "# note that the time here includes the data load and ingestion\n", + "df.groupby('name')['transaction-id'].count().compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Limitations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Bags provide very general computation (any Python function.) This generality\n", + "comes at cost. Bags have the following known limitations\n", + "\n", + "1. Bag operations tend to be slower than array/dataframe computations in the\n", + " same way that Python tends to be slower than NumPy/Pandas\n", + "2. ``Bag.groupby`` is slow. You should try to use ``Bag.foldby`` if possible.\n", + " Using ``Bag.foldby`` requires more thought. Even better, consider creating\n", + " a normalised dataframe." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learn More\n", + "\n", + "* [Bag documentation](https://docs.dask.org/en/latest/bag.html)\n", + "* [Bag screencast](https://youtu.be/-qIiJ1XtSv0)\n", + "* [Bag API](https://docs.dask.org/en/latest/bag-api.html)\n", + "* [Bag examples](https://examples.dask.org/bag.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Shutdown" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.shutdown()" + ] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/dask/03_array.ipynb b/dask/03_array.ipynb new file mode 100644 index 0000000..9ab5068 --- /dev/null +++ b/dask/03_array.ipynb @@ -0,0 +1,994 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Arrays" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Dask array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. Simply put: distributed Numpy.\n", + "\n", + "* **Parallel**: Uses all of the cores on your computer\n", + "* **Larger-than-memory**: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.\n", + "* **Blocked Algorithms**: Perform large computations by performing many smaller computations\n", + "\n", + "In this notebook, we'll build some understanding by implementing some blocked algorithms from scratch.\n", + "We'll then use Dask Array to analyze large datasets, in parallel, using a familiar NumPy-like API.\n", + "\n", + "**Related Documentation**\n", + "\n", + "* [Array documentation](https://docs.dask.org/en/latest/array.html)\n", + "* [Array screencast](https://youtu.be/9h_61hXCDuI)\n", + "* [Array API](https://docs.dask.org/en/latest/array-api.html)\n", + "* [Array examples](https://examples.dask.org/array.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%run prep.py -d random" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from dask.distributed import Client\n", + "\n", + "client = Client(n_workers=4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Blocked Algorithms" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A *blocked algorithm* executes on a large dataset by breaking it up into many small blocks.\n", + "\n", + "For example, consider taking the sum of a billion numbers. 
We might instead break up the array into 1,000 chunks, each of size 1,000,000, take the sum of each chunk, and then take the sum of the intermediate sums.\n", + "\n", + "We achieve the intended result (one sum on one billion numbers) by performing many smaller results (one thousand sums on one million numbers each, followed by another sum of a thousand numbers.)\n", + "\n", + "We do exactly this with Python and NumPy in the following example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load data with h5py\n", + "# this creates a pointer to the data, but does not actually load\n", + "import h5py\n", + "import os\n", + "f = h5py.File(os.path.join('data', 'random.hdf5'), mode='r')\n", + "dset = f['/x']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Compute sum using blocked algorithm**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before using dask, let's consider the concept of blocked algorithms. We can compute the sum of a large number of elements by loading them chunk-by-chunk, and keeping a running total.\n", + "\n", + "Here we compute the sum of this large array on disk by \n", + "\n", + "1. Computing the sum of each 1,000,000 sized chunk of the array\n", + "2. Computing the sum of the 1,000 intermediate sums\n", + "\n", + "Note that this is a sequential process in the notebook kernel, both the loading and summing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compute sum of large array, one million numbers at a time\n", + "sums = []\n", + "for i in range(0, 1_000_000_000, 1_000_000):\n", + " chunk = dset[i: i + 1_000_000] # pull out numpy array\n", + " sums.append(chunk.sum())\n", + "\n", + "total = sum(sums)\n", + "print(total)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise: Compute the mean using a blocked algorithm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we've seen the simple example above, try doing a slightly more complicated problem. Compute the mean of the array, assuming for a moment that we don't happen to already know how many elements are in the data. You can do this by changing the code above with the following alterations:\n", + "\n", + "1. Compute the sum of each block\n", + "2. Compute the length of each block\n", + "3. Compute the sum of the 1,000 intermediate sums and the sum of the 1,000 intermediate lengths and divide one by the other\n", + "\n", + "This approach is overkill for our case but does nicely generalize if we don't know the size of the array or individual blocks beforehand." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compute the mean of the array" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "%load solutions/03_mean_by_block.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`dask.array` contains these algorithms\n", + "--------------------------------------------\n", + "\n", + "Dask.array is a NumPy-like library that does these kinds of tricks to operate on large datasets that don't fit into memory. It extends beyond the linear problems discussed above to full N-Dimensional algorithms and a decent subset of the NumPy interface." 
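+ ,"\n",
+ "As a teaser of what this buys us, the hand-written blocked sum above collapses to a couple of lines (built up properly in the next cells):\n",
+ "\n",
+ "```python\n",
+ "import dask.array as da\n",
+ "x = da.from_array(dset, chunks=(1_000_000,))\n",
+ "x.sum().compute()   # same blocked strategy, but constructed and scheduled for us\n",
+ "```"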
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Create `dask.array` object**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can create a `dask.array` `Array` object with the `da.from_array` function. This function accepts\n", + "\n", + "1. `data`: Any object that supports NumPy slicing, like `dset`\n", + "2. `chunks`: A chunk size to tell us how to block up our array, like `(1_000_000,)`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import dask.array as da\n", + "x = da.from_array(dset, chunks=(1_000_000,))\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Manipulate `dask.array` object as you would a numpy array**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have an `Array` we perform standard numpy-style computations like arithmetic, mathematics, slicing, reductions, etc..\n", + "\n", + "The interface is familiar, but the actual work is different. `dask_array.sum()` does not do the same thing as `numpy_array.sum()`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**What's the difference?**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`dask_array.sum()` builds an expression of the computation. It does not do the computation yet. `numpy_array.sum()` computes the sum immediately." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Why the difference?*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Dask arrays are split into chunks. Each chunk must have computations run on that chunk explicitly. If the desired answer comes from a small slice of the entire dataset, running the computation over all data would be wasteful of CPU and memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "result = x.sum()\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Compute result**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Dask.array objects are lazily evaluated. Operations like `.sum` build up a graph of blocked tasks to execute. \n", + "\n", + "We ask for the final result with a call to `.compute()`. This triggers the actual computation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "result.compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise: Compute the mean" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And the variance, std, etc.. This should be a small change to the example above.\n", + "\n", + "Look at what other operations you can do with the Jupyter notebook's tab-completion." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Does this match your result from before?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Performance and Parallelism\n", + "-------------------------------\n", + "\n", + "\n", + "\n", + "In our first examples we used `for` loops to walk through the array one block at a time. For simple operations like `sum` this is optimal. However for complex operations we may want to traverse through the array differently. 
In particular we may want the following:\n", + "\n", + "1. Use multiple cores in parallel\n", + "2. Chain operations on a single blocks before moving on to the next one\n", + "\n", + "`Dask.array` translates your array operations into a graph of inter-related tasks with data dependencies between them. Dask then executes this graph in parallel with multiple threads. We'll discuss more about this in the next section.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1. Construct a 20000x20000 array of normally distributed random values broken up into 1000x1000 sized chunks\n", + "2. Take the mean along one axis\n", + "3. Take every 100th element" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import dask.array as da\n", + "\n", + "x = da.random.normal(10, 0.1, size=(20000, 20000), # 400 million element array \n", + " chunks=(1000, 1000)) # Cut into 1000x1000 sized chunks\n", + "y = x.mean(axis=0)[::100] # Perform NumPy-style operations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x.nbytes / 1e9 # Gigabytes of the input processed lazily" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "y.compute() # Time to compute the result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Performance comparison\n", + "---------------------------\n", + "\n", + "The following experiment was performed on a heavy personal laptop. Your performance may vary. If you attempt the NumPy version then please ensure that you have more than 4GB of main memory." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**NumPy: 19s, Needs gigabytes of memory**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```python\n", + "import numpy as np\n", + "\n", + "%%time \n", + "x = np.random.normal(10, 0.1, size=(20000, 20000)) \n", + "y = x.mean(axis=0)[::100] \n", + "y\n", + "\n", + "CPU times: user 19.6 s, sys: 160 ms, total: 19.8 s\n", + "Wall time: 19.7 s\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Dask Array: 4s, Needs megabytes of memory**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```python\n", + "import dask.array as da\n", + "\n", + "%%time\n", + "x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))\n", + "y = x.mean(axis=0)[::100] \n", + "y.compute() \n", + "\n", + "CPU times: user 29.4 s, sys: 1.07 s, total: 30.5 s\n", + "Wall time: 4.01 s\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Discussion**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the Dask array computation ran in 4 seconds, but used 29.4 seconds of user CPU time. The numpy computation ran in 19.7 seconds and used 19.6 seconds of user CPU time.\n", + "\n", + "Dask finished faster, but used more total CPU time because Dask was able to transparently parallelize the computation because of the chunk size." 
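+ ,"\n",
+ "When reasoning about the questions below, it helps to peek at the chunk layout of `x` (the numbers assume the `size=(20000, 20000)`, `chunks=(1000, 1000)` example above):\n",
+ "\n",
+ "```python\n",
+ "x.numblocks      # (20, 20): 400 blocks, so plenty of independent tasks\n",
+ "x.chunks         # ((1000, ..., 1000), (1000, ..., 1000))\n",
+ "x.nbytes / 1e9   # 3.2 GB of float64, but only a few blocks are in memory at once\n",
+ "```"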
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Questions*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* What happens if the dask chunks=(20000,20000)?\n", + " * Will the computation run in 4 seconds?\n", + " * How much memory will be used?\n", + "* What happens if the dask chunks=(25,25)?\n", + " * What happens to CPU and memory?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise: Meteorological data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is 2GB of somewhat artifical weather data in HDF5 files in `data/weather-big/*.hdf5`. We'll use the `h5py` library to interact with this data and `dask.array` to compute on it.\n", + "\n", + "Our goal is to visualize the average temperature on the surface of the Earth for this month. This will require a mean over all of this data. We'll do this in the following steps\n", + "\n", + "1. Create `h5py.Dataset` objects for each of the days of data on disk (`dsets`)\n", + "2. Wrap these with `da.from_array` calls \n", + "3. Stack these datasets along time with a call to `da.stack`\n", + "4. Compute the mean along the newly stacked time axis with the `.mean()` method\n", + "5. Visualize the result with `matplotlib.pyplot.imshow`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%run prep.py -d weather" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import h5py\n", + "from glob import glob\n", + "import os\n", + "\n", + "filenames = sorted(glob(os.path.join('data', 'weather-big', '*.hdf5')))\n", + "dsets = [h5py.File(filename, mode='r')['/t2m'] for filename in filenames]\n", + "dsets[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dsets[0][:5, :5] # Slicing into h5py.Dataset object gives a numpy array" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "import matplotlib.pyplot as plt\n", + "\n", + "fig = plt.figure(figsize=(16, 8))\n", + "plt.imshow(dsets[0][::4, ::4], cmap='RdBu_r');" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Integrate with `dask.array`**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Make a list of `dask.array` objects out of your list of `h5py.Dataset` objects using the `da.from_array` function with a chunk size of `(500, 500)`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "arrays = [da.from_array(dset, chunks=(500, 500)) for dset in dsets]\n", + "arrays" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Stack this list of `dask.array` objects into a single `dask.array` object with `da.stack`**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Stack these along the first axis so that the shape of the resulting array is `(31, 5760, 11520)`." 
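+ ,"\n",
+ "Hint, in case it helps: `da.stack` creates a new axis, whereas `da.concatenate` joins along an existing one (a tiny sketch with a made-up array):\n",
+ "\n",
+ "```python\n",
+ "a = da.ones((2, 3), chunks=(2, 3))\n",
+ "da.stack([a, a], axis=0).shape        # (2, 2, 3): new leading axis\n",
+ "da.concatenate([a, a], axis=0).shape  # (4, 3): the existing axis grows\n",
+ "```"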
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "x = da.stack(arrays, axis=0)\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Plot the mean of this array along the time (`0th`) axis**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "# complete the following:\n", + "fig = plt.figure(figsize=(16, 8))\n", + "plt.imshow(..., cmap='RdBu_r')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "result = x.mean(axis=0)\n", + "fig = plt.figure(figsize=(16, 8))\n", + "plt.imshow(result, cmap='RdBu_r');" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Plot the difference of the first day from the mean**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "result = x[0] - x.mean(axis=0)\n", + "fig = plt.figure(figsize=(16, 8))\n", + "plt.imshow(result, cmap='RdBu_r');" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise: Subsample and store" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the above exercise the result of our computation is small, so we can call `compute` safely. Sometimes our result is still too large to fit into memory and we want to save it to disk. In these cases you can use one of the following two functions\n", + "\n", + "1. `da.store`: Store dask.array into any object that supports numpy setitem syntax, e.g.\n", + "\n", + " f = h5py.File('myfile.hdf5')\n", + " output = f.create_dataset(shape=..., dtype=...)\n", + " \n", + " da.store(my_dask_array, output)\n", + " \n", + "2. `da.to_hdf5`: A specialized function that creates and stores a `dask.array` object into an `HDF5` file.\n", + "\n", + " da.to_hdf5('data/myfile.hdf5', '/output', my_dask_array)\n", + " \n", + "The task in this exercise is to **use numpy step slicing to subsample the full dataset by a factor of two in both the latitude and longitude direction and then store this result to disk** using one of the functions listed above.\n", + "\n", + "As a reminder, Python slicing takes three elements\n", + "\n", + " start:stop:step\n", + "\n", + " >>> L = [1, 2, 3, 4, 5, 6, 7]\n", + " >>> L[::3]\n", + " [1, 4, 7]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ..." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "import h5py\n", + "from glob import glob\n", + "import os\n", + "import dask.array as da\n", + "\n", + "filenames = sorted(glob(os.path.join('data', 'weather-big', '*.hdf5')))\n", + "dsets = [h5py.File(filename, mode='r')['/t2m'] for filename in filenames]\n", + "\n", + "arrays = [da.from_array(dset, chunks=(500, 500)) for dset in dsets]\n", + "\n", + "x = da.stack(arrays, axis=0)\n", + "\n", + "result = x[:, ::2, ::2]\n", + "\n", + "da.to_zarr(result, os.path.join('data', 'myfile.zarr'), overwrite=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example: Lennard-Jones potential" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Lennard-Jones potential](https://en.wikipedia.org/wiki/Lennard-Jones_potential) is used in partical simuluations in physics, chemistry and engineering. It is highly parallelizable.\n", + "\n", + "First, we'll run and profile the Numpy version on 7,000 particles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "# make a random collection of particles\n", + "def make_cluster(natoms, radius=40, seed=1981):\n", + " np.random.seed(seed)\n", + " cluster = np.random.normal(0, radius, (natoms,3))-0.5\n", + " return cluster\n", + "\n", + "def lj(r2):\n", + " sr6 = (1./r2)**3\n", + " pot = 4.*(sr6*sr6 - sr6)\n", + " return pot\n", + "\n", + "# build the matrix of distances\n", + "def distances(cluster):\n", + " diff = cluster[:, np.newaxis, :] - cluster[np.newaxis, :, :]\n", + " mat = (diff*diff).sum(-1)\n", + " return mat\n", + "\n", + "# the lj function is evaluated over the upper traingle\n", + "# after removing distances near zero\n", + "def potential(cluster):\n", + " d2 = distances(cluster)\n", + " dtri = np.triu(d2)\n", + " energy = lj(dtri[dtri > 1e-6]).sum()\n", + " return energy" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cluster = make_cluster(int(7e3), radius=500)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%time potential(cluster)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the most time consuming function is `distances`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# this would open in another browser tab\n", + "# %load_ext snakeviz\n", + "# %snakeviz potential(cluster)\n", + "\n", + "# alternative simple version given text results in this tab\n", + "%prun -s tottime potential(cluster)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dask version" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here's the Dask version. 
Only the `potential` function needs to be rewritten to best utilize Dask.\n", + "\n", + "Note that `da.nansum` has been used over the full $NxN$ distance matrix to improve parallel efficiency.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import dask.array as da\n", + "\n", + "# compute the potential on the entire\n", + "# matrix of distances and ignore division by zero\n", + "def potential_dask(cluster):\n", + " d2 = distances(cluster)\n", + " energy = da.nansum(lj(d2))/2.\n", + " return energy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's convert the NumPy array to a Dask array. Since the entire NumPy array fits in memory it is more computationally efficient to chunk the array by number of CPU cores." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from os import cpu_count\n", + "\n", + "dcluster = da.from_array(cluster, chunks=cluster.shape[0]//cpu_count())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This step should scale quite well with number of cores. The warnings are complaining about dividing by zero, which is why we used `da.nansum` in `potential_dask`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "e = potential_dask(dcluster)\n", + "%time e.compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Limitations\n", + "-----------\n", + "\n", + "Dask Array does not implement the entire numpy interface. Users expecting this\n", + "will be disappointed. Notably Dask Array has the following failings:\n", + "\n", + "1. Dask does not implement all of ``np.linalg``. This has been done by a\n", + " number of excellent BLAS/LAPACK implementations and is the focus of\n", + " numerous ongoing academic research projects.\n", + "2. Dask Array does not support some operations where the resulting shape\n", + " depends on the values of the array. For those that it does support\n", + " (for example, masking one Dask Array with another boolean mask),\n", + " the chunk sizes will be unknown, which may cause issues with other\n", + " operations that need to know the chunk sizes.\n", + "3. Dask Array does not attempt operations like ``sort`` which are notoriously\n", + " difficult to do in parallel and are of somewhat diminished value on very\n", + " large data (you rarely actually need a full sort).\n", + " Often we include parallel-friendly alternatives like ``topk``.\n", + "4. Dask development is driven by immediate need, and so many lesser used\n", + " functions, like ``np.sometrue`` have not been implemented purely out of\n", + " laziness. 
These would make excellent community contributions.\n", + " \n", + "* [Array documentation](https://docs.dask.org/en/latest/array.html)\n", + "* [Array screencast](https://youtu.be/9h_61hXCDuI)\n", + "* [Array API](https://docs.dask.org/en/latest/array-api.html)\n", + "* [Array examples](https://examples.dask.org/array.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.shutdown()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/dask/04_dataframe.ipynb b/dask/04_dataframe.ipynb new file mode 100644 index 0000000..2296a9b --- /dev/null +++ b/dask/04_dataframe.ipynb @@ -0,0 +1,836 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Dask\n", + "\n", + "\n", + "# Dask DataFrames\n", + "\n", + "We finished Chapter 1 by building a parallel dataframe computation over a directory of CSV files using `dask.delayed`. In this section we use `dask.dataframe` to automatically build similiar computations, for the common case of tabular computations. Dask dataframes look and feel like Pandas dataframes but they run on the same infrastructure that powers `dask.delayed`.\n", + "\n", + "In this notebook we use the same airline data as before, but now rather than write for-loops we let `dask.dataframe` construct our computations for us. The `dask.dataframe.read_csv` function can take a globstring like `\"data/nycflights/*.csv\"` and build parallel computations on all of our data at once.\n", + "\n", + "## When to use `dask.dataframe`\n", + "\n", + "Pandas is great for tabular datasets that fit in memory. Dask becomes useful when the dataset you want to analyze is larger than your machine's RAM. The demo dataset we're working with is only about 200MB, so that you can download it in a reasonable time, but `dask.dataframe` will scale to datasets much larger than memory." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "The `dask.dataframe` module implements a blocked parallel `DataFrame` object that mimics a large subset of the Pandas `DataFrame` API. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrames` separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n", + "\n", + "**Related Documentation**\n", + "\n", + "* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n", + "* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n", + "* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n", + "* [DataFrame examples](https://examples.dask.org/dataframe.html)\n", + "* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)\n", + "\n", + "**Main Take-aways**\n", + "\n", + "1. Dask DataFrame should be familiar to Pandas users\n", + "2. 
The partitioning of dataframes is important for efficient execution" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%run prep.py -d flights" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from dask.distributed import Client\n", + "\n", + "client = Client(n_workers=4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We create artifical data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from prep import accounts_csvs\n", + "accounts_csvs()\n", + "\n", + "import os\n", + "import dask\n", + "filename = os.path.join('data', 'accounts.*.csv')\n", + "filename" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Filename includes a glob pattern `*`, so all files in the path matching that pattern will be read into the same Dask DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import dask.dataframe as dd\n", + "df = dd.read_csv(filename)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# load and count number of rows\n", + "len(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What happened here?\n", + "- Dask investigated the input path and found that there are three matching files \n", + "- a set of jobs was intelligently created for each chunk - one per original CSV file in this case\n", + "- each file was loaded into a pandas dataframe, had `len()` applied to it\n", + "- the subtotals were combined to give you the final grand total." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Real Data\n", + "\n", + "Lets try this with an extract of flights in the USA across several years. This data is specific to flights out of the three airports in the New York City area." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n", + " parse_dates={'Date': [0, 1, 2]})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the respresentation of the dataframe object contains no data - Dask has just done enough to read the start of the first file, and infer the column names and dtypes." 
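+ ,"\n",
+ "A few cheap ways to see what was inferred, without loading any data (a quick sketch):\n",
+ "\n",
+ "```python\n",
+ "df.npartitions   # number of pandas pieces (typically one or more per CSV file)\n",
+ "df.columns       # read from the header of the first file\n",
+ "df.dtypes        # inferred from a sample at the start of the data\n",
+ "```"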
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can view the start and end of the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "df.tail() # this fails" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What just happened?\n", + "\n", + "Unlike `pandas.read_csv` which reads in the entire file before inferring datatypes, `dask.dataframe.read_csv` only reads in a sample from the beginning of the file (or first file if using a glob). These inferred datatypes are then enforced when reading all partitions.\n", + "\n", + "In this case, the datatypes inferred in the sample are incorrect. The first `n` rows have no value for `CRSElapsedTime` (which pandas infers as a `float`), and later on turn out to be strings (`object` dtype). Note that Dask gives an informative error message about the mismatch. When this happens you have a few options:\n", + "\n", + "- Specify dtypes directly using the `dtype` keyword. This is the recommended solution, as it's the least error prone (better to be explicit than implicit) and also the most performant.\n", + "- Increase the size of the `sample` keyword (in bytes)\n", + "- Use `assume_missing` to make `dask` assume that columns inferred to be `int` (which don't allow missing values) are actually floats (which do allow missing values). In our particular case this doesn't apply.\n", + "\n", + "In our case we'll use the first option and directly specify the `dtypes` of the offending columns. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n", + " parse_dates={'Date': [0, 1, 2]},\n", + " dtype={'TailNum': str,\n", + " 'CRSElapsedTime': float,\n", + " 'Cancelled': bool})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.tail() # now works" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Computations with `dask.dataframe`\n", + "\n", + "We compute the maximum of the `DepDelay` column. With just pandas, we would loop over each file to find the individual maximums, then find the final maximum over all the individual maximums\n", + "\n", + "```python\n", + "maxes = []\n", + "for fn in filenames:\n", + " df = pd.read_csv(fn)\n", + " maxes.append(df.DepDelay.max())\n", + " \n", + "final_max = max(maxes)\n", + "```\n", + "\n", + "We could wrap that `pd.read_csv` with `dask.delayed` so that it runs in parallel. Regardless, we're still having to think about loops, intermediate results (one per file) and the final reduction (`max` of the intermediate maxes). This is just noise around the real task, which pandas solves with\n", + "\n", + "```python\n", + "df = pd.read_csv(filename, dtype=dtype)\n", + "df.DepDelay.max()\n", + "```\n", + "\n", + "`dask.dataframe` lets us write pandas-like code, that operates on larger than memory datasets in parallel." 
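+ ,"\n",
+ "For comparison, the `dask.delayed` wrapping mentioned above might look roughly like this (a sketch only; `filenames` stands for the list of CSV paths from the pseudo-code and is not defined in this notebook):\n",
+ "\n",
+ "```python\n",
+ "import dask\n",
+ "import pandas as pd\n",
+ "\n",
+ "@dask.delayed\n",
+ "def file_max(fn):\n",
+ "    return pd.read_csv(fn).DepDelay.max()   # one pandas task per file\n",
+ "\n",
+ "final_max = dask.delayed(max)([file_max(fn) for fn in filenames]).compute()\n",
+ "```"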
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%time df.DepDelay.max().compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This writes the delayed computation for us and then runs it. \n", + "\n", + "Some things to note:\n", + "\n", + "1. As with `dask.delayed`, we need to call `.compute()` when we're done. Up until this point everything is lazy.\n", + "2. Dask will delete intermediate results (like the full pandas dataframe for each file) as soon as possible.\n", + " - This lets us handle datasets that are larger than memory\n", + " - This means that repeated computations will have to load all of the data in each time (run the code above again, is it faster or slower than you would expect?)\n", + " \n", + "As with `Delayed` objects, you can view the underlying task graph using the `.visualize` method:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# notice the parallelism\n", + "df.DepDelay.max().visualize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises\n", + "\n", + "In this section we do a few `dask.dataframe` computations. If you are comfortable with Pandas then these should be familiar. You will have to think about when to call `compute`.\n", + "\n", + "### 1.) How many rows are in our dataset?\n", + "\n", + "If you aren't familiar with pandas, how would you check how many records are in a list of tuples?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "%load solutions/04_exo1.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.) In total, how many non-canceled flights were taken?\n", + "\n", + "With pandas, you would use [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "%load solutions/04_exo2.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.) In total, how many non-cancelled flights were taken from each airport?\n", + "\n", + "*Hint*: use [`df.groupby`](https://pandas.pydata.org/pandas-docs/stable/groupby.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "%load solutions/04_exo3.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.) 
What was the average departure delay from each airport?\n", + "\n", + "Note, this is the same computation you did in the previous notebook (is this approach faster or slower?)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "%load solutions/04_exo4.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.) What day of the week has the worst average departure delay?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your code here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "%load solutions/04_exo5.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sharing Intermediate Results\n", + "\n", + "When computing all of the above, we sometimes did the same operation more than once. For most operations, `dask.dataframe` hashes the arguments, allowing duplicate computations to be shared, and only computed once.\n", + "\n", + "For example, lets compute the mean and standard deviation for departure delay of all non-canceled flights. Since dask operations are lazy, those values aren't the final results yet. They're just the recipe required to get the result.\n", + "\n", + "If we compute them with two calls to compute, there is no sharing of intermediate computations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "non_cancelled = df[~df.Cancelled]\n", + "mean_delay = non_cancelled.DepDelay.mean()\n", + "std_delay = non_cancelled.DepDelay.std()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "\n", + "mean_delay_res = mean_delay.compute()\n", + "std_delay_res = std_delay.compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But let's try by passing both to a single `compute` call." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "\n", + "mean_delay_res, std_delay_res = dask.compute(mean_delay, std_delay)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using `dask.compute` takes roughly 1/2 the time. This is because the task graphs for both results are merged when calling `dask.compute`, allowing shared operations to only be done once instead of twice. In particular, using `dask.compute` only does the following once:\n", + "\n", + "- the calls to `read_csv`\n", + "- the filter (`df[~df.Cancelled]`)\n", + "- some of the necessary reductions (`sum`, `count`)\n", + "\n", + "To see what the merged task graphs between multiple results look like (and what's shared), you can use the `dask.visualize` function (we might want to use `filename='graph.pdf'` to save the graph to disk so that we can zoom in more easily):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dask.visualize(mean_delay, std_delay)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How does this compare to Pandas?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Pandas is more mature and fully featured than `dask.dataframe`. If your data fits in memory then you should use Pandas. The `dask.dataframe` module gives you a limited `pandas` experience when you operate on datasets that don't fit comfortably in memory.\n", + "\n", + "During this tutorial we provide a small dataset consisting of a few CSV files. This dataset is 45MB on disk that expands to about 400MB in memory. This dataset is small enough that you would normally use Pandas.\n", + "\n", + "We've chosen this size so that exercises finish quickly. Dask.dataframe only really becomes meaningful for problems significantly larger than this, when Pandas breaks with the dreaded \n", + "\n", + " MemoryError: ...\n", + " \n", + "Furthermore, the distributed scheduler allows the same dataframe expressions to be executed across a cluster. To enable massive \"big data\" processing, one could execute data ingestion functions such as `read_csv`, where the data is held on storage accessible to every worker node (e.g., amazon's S3), and because most operations begin by selecting only some columns, transforming and filtering the data, only relatively small amounts of data need to be communicated between the machines.\n", + "\n", + "Dask.dataframe operations use `pandas` operations internally. Generally they run at about the same speed except in the following two cases:\n", + "\n", + "1. Dask introduces a bit of overhead, around 1ms per task. This is usually negligible.\n", + "2. When Pandas releases the GIL `dask.dataframe` can call several pandas operations in parallel within a process, increasing speed somewhat proportional to the number of cores. For operations which don't release the GIL, multiple processes would be needed to get the same speedup." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dask DataFrame Data Model\n", + "\n", + "For the most part, a Dask DataFrame feels like a pandas DataFrame.\n", + "So far, the biggest difference we've seen is that Dask operations are lazy; they build up a task graph instead of executing immediately (more details coming in [Schedulers](05_distributed.ipynb)).\n", + "This lets Dask do operations in parallel and out of core.\n", + "\n", + "In [Dask Arrays](03_array.ipynb), we saw that a `dask.array` was composed of many NumPy arrays, chunked along one or more dimensions.\n", + "It's similar for `dask.dataframe`: a Dask DataFrame is composed of many pandas DataFrames. For `dask.dataframe` the chunking happens only along the index.\n", + "\n", + "\n", + "\n", + "We call each chunk a *partition*, and the upper / lower bounds are *divisions*.\n", + "Dask *can* store information about the divisions. 
For now, partitions come up when you write custom functions to apply to Dask DataFrames" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Converting `CRSDepTime` to a timestamp\n", + "\n", + "This dataset stores timestamps as `HHMM`, which are read in as integers in `read_csv`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "crs_dep_time = df.CRSDepTime.head(10)\n", + "crs_dep_time" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To convert these to timestamps of scheduled departure time, we need to convert these integers into `pd.Timedelta` objects, and then combine them with the `Date` column.\n", + "\n", + "In pandas we'd do this using the `pd.to_timedelta` function, and a bit of arithmetic:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Get the first 10 dates to complement our `crs_dep_time`\n", + "date = df.Date.head(10)\n", + "\n", + "# Get hours as an integer, convert to a timedelta\n", + "hours = crs_dep_time // 100\n", + "hours_timedelta = pd.to_timedelta(hours, unit='h')\n", + "\n", + "# Get minutes as an integer, convert to a timedelta\n", + "minutes = crs_dep_time % 100\n", + "minutes_timedelta = pd.to_timedelta(minutes, unit='m')\n", + "\n", + "# Apply the timedeltas to offset the dates by the departure time\n", + "departure_timestamp = date + hours_timedelta + minutes_timedelta\n", + "departure_timestamp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Custom code and Dask Dataframe\n", + "\n", + "We could swap out `pd.to_timedelta` for `dd.to_timedelta` and do the same operations on the entire dask DataFrame. But let's say that Dask hadn't implemented a `dd.to_timedelta` that works on Dask DataFrames. What would you do then?\n", + "\n", + "`dask.dataframe` provides a few methods to make applying custom functions to Dask DataFrames easier:\n", + "\n", + "- [`map_partitions`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions)\n", + "- [`map_overlap`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_overlap)\n", + "- [`reduction`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.reduction)\n", + "\n", + "Here we'll just be discussing `map_partitions`, which we can use to implement `to_timedelta` on our own:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Look at the docs for `map_partitions`\n", + "\n", + "help(df.CRSDepTime.map_partitions)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The basic idea is to apply a function that operates on a DataFrame to each partition.\n", + "In this case, we'll apply `pd.to_timedelta`." 
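+ ,"\n",
+ "One extra keyword worth knowing about is `meta`: when Dask cannot infer what your function returns, you can describe the expected output yourself (a sketch with a hypothetical Dask series `s`):\n",
+ "\n",
+ "```python\n",
+ "s_td = s.map_partitions(pd.to_timedelta, unit='h',\n",
+ "                        meta=pd.Series(dtype='timedelta64[ns]'))  # declares the output type\n",
+ "```"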
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hours = df.CRSDepTime // 100\n", + "# hours_timedelta = pd.to_timedelta(hours, unit='h')\n", + "hours_timedelta = hours.map_partitions(pd.to_timedelta, unit='h')\n", + "\n", + "minutes = df.CRSDepTime % 100\n", + "# minutes_timedelta = pd.to_timedelta(minutes, unit='m')\n", + "minutes_timedelta = minutes.map_partitions(pd.to_timedelta, unit='m')\n", + "\n", + "departure_timestamp = df.Date + hours_timedelta + minutes_timedelta" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "departure_timestamp" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "departure_timestamp.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise: Rewrite above to use a single call to `map_partitions`\n", + "\n", + "This will be slightly more efficient than two separate calls, as it reduces the number of tasks in the graph." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def compute_departure_timestamp(df):\n", + " pass # TODO: implement this" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "departure_timestamp = df.map_partitions(compute_departure_timestamp)\n", + "\n", + "departure_timestamp.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "outputs": [], + "source": [ + "%load solutions/04_map_partitions.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Limitations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What doesn't work?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Dask.dataframe only covers a small but well-used portion of the Pandas API.\n", + "This limitation is for two reasons:\n", + "\n", + "1. The Pandas API is *huge*\n", + "2. Some operations are genuinely hard to do in parallel (e.g. sort)\n", + "\n", + "Additionally, some important operations like ``set_index`` work, but are slower\n", + "than in Pandas because they include substantial shuffling of data, and may write out to disk." 
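+ ,"\n",
+ "If you do pay for a `set_index`, it is worth checking that Dask has recorded the new partition boundaries, since that is what later index-based operations exploit (a sketch):\n",
+ "\n",
+ "```python\n",
+ "df.known_divisions          # False for a freshly read CSV\n",
+ "df2 = df.set_index('Date')  # expensive: shuffles data between partitions\n",
+ "df2.known_divisions         # True: the divisions are now recorded\n",
+ "```"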
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learn More\n", + "\n", + "\n", + "* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n", + "* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n", + "* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n", + "* [DataFrame examples](https://examples.dask.org/dataframe.html)\n", + "* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.shutdown()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/dask/05_distributed.ipynb b/dask/05_distributed.ipynb new file mode 100644 index 0000000..fcfe2b6 --- /dev/null +++ b/dask/05_distributed.ipynb @@ -0,0 +1,424 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Distributed" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we have seen so far, Dask allows you to simply construct graphs of tasks with dependencies, as well as have graphs created automatically for you using functional, Numpy or Pandas syntax on data collections. None of this would be very useful, if there weren't also a way to execute these graphs, in a parallel and memory-aware way. So far we have been calling `thing.compute()` or `dask.compute(thing)` without worrying what this entails. Now we will discuss the options available for that execution, and in particular, the distributed scheduler, which comes with additional functionality.\n", + "\n", + "Dask comes with four available schedulers:\n", + "- \"threaded\" (aka \"threading\"): a scheduler backed by a thread pool\n", + "- \"processes\": a scheduler backed by a process pool\n", + "- \"single-threaded\" (aka \"sync\"): a synchronous scheduler, good for debugging\n", + "- distributed: a distributed scheduler for executing graphs on multiple machines, see below.\n", + "\n", + "To select one of these for computation, you can specify at the time of asking for a result, e.g.,\n", + "```python\n", + "myvalue.compute(scheduler=\"single-threaded\") # for debugging\n", + "```\n", + "\n", + "You can also set a default scheduler either temporarily\n", + "```python\n", + "with dask.config.set(scheduler='processes'):\n", + " # set temporarily for this block only\n", + " # all compute calls within this block will use the specified scheduler\n", + " myvalue.compute()\n", + " anothervalue.compute()\n", + "```\n", + "\n", + "Or globally\n", + "```python\n", + "# set until further notice\n", + "dask.config.set(scheduler='processes')\n", + "```\n", + "\n", + "Let's try out a few schedulers on the familiar case of the flights data." 
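+ ,"\n",
+ "If you are ever unsure which scheduler a computation will use, you can also inspect the configuration (a small sketch; by default nothing is set and each collection falls back to its own default):\n",
+ "\n",
+ "```python\n",
+ "import dask\n",
+ "dask.config.get('scheduler', None)       # None: collections pick their own default\n",
+ "with dask.config.set(scheduler='processes'):\n",
+ "    print(dask.config.get('scheduler'))  # 'processes' inside this block only\n",
+ "```"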
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%run prep.py -d flights" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dd.Scalar" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import dask.dataframe as dd\n", + "import os\n", + "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n", + " parse_dates={'Date': [0, 1, 2]},\n", + " dtype={'TailNum': object,\n", + " 'CRSElapsedTime': float,\n", + " 'Cancelled': bool})\n", + "\n", + "# Maximum average non-cancelled delay grouped by Airport\n", + "largest_delay = df[~df.Cancelled].groupby('Origin').DepDelay.mean().max()\n", + "largest_delay" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " threading, 0.2795 s; result, 17.05 hours\n", + " processes, 1.9635 s; result, 17.05 hours\n", + " sync, 0.2051 s; result, 17.05 hours\n" + ] + } + ], + "source": [ + "# each of the following gives the same results (you can check!)\n", + "# any surprises?\n", + "import time\n", + "for sch in ['threading', 'processes', 'sync']:\n", + " t0 = time.time()\n", + " r = largest_delay.compute(scheduler=sch)\n", + " t1 = time.time()\n", + " print(f\"{sch:>10}, {t1 - t0:0.4f} s; result, {r:0.2f} hours\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Some Questions to Consider:\n", + "\n", + "- How much speedup is possible for this task (hint, look at the graph).\n", + "- Given how many cores are on this machine, how much faster could the parallel schedulers be than the single-threaded scheduler.\n", + "- How much faster was using threads over a single thread? Why does this differ from the optimal speedup?\n", + "- Why is the multiprocessing scheduler so much slower here?\n", + "\n", + "The `threaded` scheduler is a fine choice for working with large datasets out-of-core on a single machine, as long as the functions being used release the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) most of the time. NumPy and pandas release the GIL in most places, so the `threaded` scheduler is the default for `dask.array` and `dask.dataframe`. The distributed scheduler, perhaps with `processes=False`, will also work well for these workloads on a single machine.\n", + "\n", + "For workloads that do hold the GIL, as is common with `dask.bag` and custom code wrapped with `dask.delayed`, we recommend using the distributed scheduler, even on a single machine. Generally speaking, it's more intelligent and provides better diagnostics than the `processes` scheduler.\n", + "\n", + "https://docs.dask.org/en/latest/scheduling.html provides some additional details on choosing a scheduler.\n", + "\n", + "For scaling out work across a cluster, the distributed scheduler is required." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Making a cluster" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Simple method" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `dask.distributed` system is composed of a single centralized scheduler and one or more worker processes. [Deploying](https://docs.dask.org/en/latest/setup.html) a remote Dask cluster involves some additional effort. 
But doing things locally just involves creating a `Client` object, which lets you interact with the \"cluster\" (local threads or processes on your machine). For more information see [here](https://docs.dask.org/en/latest/setup/single-distributed.html). \n",
+ "\n",
+ "Note that `Client()` takes a lot of optional [arguments](https://distributed.dask.org/en/latest/local-cluster.html#api), to configure the number of processes/threads, memory limits and other settings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "d42d8fb969e9440ea17267bfa2fc6373",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "VBox(children=(HTML(value='

LocalCluster

'), HBox(children=(HTML(value='\\n
\\n