Python Concurrency Tutorial

CONCURRENCY IN PYTHON

Concurrency in Python can be confusing. There are multiple modules (threading, _thread, multiprocessing, subprocess). There’s also the much-hated GIL, but only in some implementations: CPython and PyPy have one, while Jython and IronPython don’t. Collections are not thread-safe, except for some implementation details in CPython. This is a quick guide/tutorial on how to effectively write concurrent programs using Python.

The objective of this tutorial is to provide down-to-earth recommendations on how to approach each use case of concurrent (or parallel) code.

You’ll notice that, unlike some other more traditional “multithreading only” Python tutorials, I start with a summary of both Computer Architecture and Operating System concepts. Understanding these basic concepts is fundamental to building concurrent programs correctly. Please skip these sections if you’re already familiar with these topics.

This tutorial is divided into the following sections:

Concurrency vs Parallelism
Computer Architecture
The role of the Operating System
Threads with Python
Thread Synchronization
Multiprocessing
High level libraries: concurrent.futures and parallel

This is a summary of my PyCon 2020 tutorial. You can find the original video and the source code here: https://github.com/santiagobasulto/pycon-concurrency-tutorial-2020

CONCURRENCY VS PARALLELISM

Parallelism is when several tasks are running at the same time. It’s the ultimate objective of concurrent programs. Concurrency is weaker than parallelism: it means we’re starting several tasks and juggling them within the same time period, but at any particular moment we’re working on only one of them.

Applied to cooking, these would be examples of concurrency and parallelism.

Concurrency has only one person in the kitchen:

Start cutting onions
Start heating the pan
Finish cutting the onions

In this case it’s clear that we can’t do multiple things AT THE SAME TIME. We can start them all, but we have to bounce back and forth to control them.

Parallelism has multiple people in the kitchen:

Person 1 is cutting onions
Person 2 is cutting red peppers
Person 3 waits for the pan to heat up

In this case there are several tasks being done at the same time.

COMPUTER ARCHITECTURE

All modern computers can be described, in simplified form, by the von Neumann architecture, which has 3 main components: Computing (CPU), Memory (RAM), and I/O (hard drives, networks, video output).

Your code will always be performing tasks related to each of these components. Some instructions are “computational” (x + 1), some are memory (x = 1), and some are I/O (fp = open('data.csv')).

To properly understand the details of concurrency in Python, it’s important to keep in mind the different “access times” of these components. Accessing the CPU is a lot faster than accessing RAM, let alone I/O. There’s a study comparing computer latency to human relative times: if a CPU cycle represents 1 second of human time, a network request from San Francisco to Hong Kong takes 11 years.
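To see these gaps on your own machine, here is a minimal timing sketch. It compares an in-memory addition with opening and reading a small temporary file. The helper names (`average_seconds`, `compute`, `read_file`) are hypothetical, and the absolute numbers will vary by machine. The OS will most likely serve the file from its page cache, so a real disk or network round trip would be slower still.

```python
import os
import tempfile
import time

def average_seconds(operation, repeat):
    """Return the average wall-clock time, in seconds, of one call to operation()."""
    start = time.perf_counter()
    for _ in range(repeat):
        operation()
    return (time.perf_counter() - start) / repeat

x = 1

def compute():
    # "Computational" instruction: a single addition in memory
    return x + 1

# Write a tiny CSV file so we have something to read back from disk
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b,c\n1,2,3\n")
    path = tmp.name

def read_file():
    # I/O instruction: ask the OS to open the file and read its contents
    with open(path) as fp:
        return fp.read()

cpu_time = average_seconds(compute, 100_000)
io_time = average_seconds(read_file, 1_000)
os.remove(path)

print(f"x + 1:           {cpu_time * 1e9:12.1f} ns")
print(f"open() + read(): {io_time * 1e9:12.1f} ns")
print(f"ratio: about {io_time / cpu_time:,.0f}x")
```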

Latency at a human scale

The radical difference in time between these “instructions” will be fundamental when we discuss threads and the Python GIL.

THE ROLE OF THE OPERATING SYSTEM

If I were building a university course about concurrency, I’d have an entire lecture just covering the history of Operating Systems. It’s the only way to understand how concurrency mechanisms evolved.

The first consumer-focused operating systems (think MS-DOS) didn’t support multitasking: only one program would run at a time. They were followed by “cooperative multitasking” systems (Windows 3.0). This type of multitasking more closely resembled an “event loop” (think Python’s asyncio) than a modern multithreaded program. In a cooperative multitasking OS, each application is in charge of “releasing” the CPU when it no longer needs it, so that other apps can run. This didn’t work well in practice, because you can’t trust every app (or every developer) to make the right call on when to release the CPU: a single misbehaving program could freeze the whole system. Modern Operating Systems (Windows 95, Linux, etc.) switched to “Preemptive Multitasking”, in which the Operating System decides which apps receive CPU time, how often, and for how long.
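As a rough illustration of cooperative multitasking, the sketch below uses plain Python generators as “apps”. Each task runs until it voluntarily `yield`s, and a round-robin scheduler then resumes the next one. The names `task` and `run_cooperatively` are hypothetical. This is a toy model of the idea, not how any real OS was implemented.

```python
from collections import deque

def task(name, steps, log):
    """A cooperative task: it does one step of work, then yields control."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntarily hand the CPU back to the scheduler

def run_cooperatively(tasks):
    """Round-robin scheduler: resume each task until it yields, then move on."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run the task until its next yield
            ready.append(current)  # it yielded, so put it back in line
        except StopIteration:
            pass                   # the task finished; drop it

log = []
run_cooperatively([task("A", 3, log), task("B", 2, log)])
print(log)  # ['A:0', 'B:0', 'A:1', 'B:1', 'A:2']
```

If one task looped forever without yielding, `run_cooperatively` would never get control back and every other task would starve. Preemptive multitasking fixes exactly this weakness.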

The Operating System, then, is the one deciding which processes get to run, and for how long. This is what allows a computer to feel like it’s “multitasking” even if it has only 1 CPU: the OS Scheduler switches Processes back and forth very quickly.

In the following diagram, 3 processes (P1, P2, P3) are running concurrently, sharing the same CPU (the blue line is CPU time). P1 gets to run for quite some time before quickly being switched to P2, and then P3, and so on.

3 processes are running “concurrently”

THE PROCESS

A Process is the abstraction that the Operating System uses to run your code. The OS wraps your code in a process and assigns it memory and other resources. The process abstraction allows the OS to isolate running programs from each other, so they don’t create conflicts: for example, Process 1 can’t access memory reserved by Process 2. It’s also important for users’ security: if User 1 starts a process, that process will only be able to read the files accessible to User 1.

The following diagram contains a visual representation of a Process. The process contains the code it has to execute, its allocated RAM, and all the data created by the program (variables, open files, etc.).

Now we can see the same “program” (the same piece of code) running in multiple processes concurrently.
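A quick way to see that processes don’t share memory is to change a global variable in a child process and then check it in the parent. This is a minimal sketch using the standard `multiprocessing.Process` class. The `increment` function is hypothetical. The `if __name__ == "__main__":` guard is required whenever child processes are started by re-importing the main module (the `spawn` start method, which is the default on Windows and macOS).

```python
from multiprocessing import Process

counter = 0  # each process ends up with its own copy of this variable

def increment():
    """Hypothetical task: change the global counter inside the child process."""
    global counter
    counter += 100
    print(f"child sees counter = {counter}")

if __name__ == "__main__":
    child = Process(target=increment)
    child.start()
    child.join()
    # Still 0: the child modified its own memory, not the parent's
    print(f"parent sees counter = {counter}")
```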

Concurrency in Modern Systems

We can now move on to the second phase: how is concurrency (or even parallelism) actually achieved? From your perspective as a developer, there are two mechanisms you can employ to achieve “concurrency”: multithreading or multiprocessing.

MULTITHREADING

Threads are concurrent execution flows within the same process. The process starts several threads of execution, which “evolve” independently based on what they have to do.

For example, let’s say we have a program that must download data from three websites, and then combine it for a final report. Each website takes around two seconds to respond with the data. A single threaded program would be something like:

d1 = get_website_1() # 2 secs
d2 = get_website_2() # 2 secs
d3 = get_website_3() # 2 secs

combine(d1, d2, d3) # very fast

The total execution time will be >6 seconds:

A multithreaded program will start to download the data from the three websites at the same time, and wait until they’re all done to make the final report.

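from threading import Thread

results = {}

def run_and_store(key, func):
    # Thread discards the target's return value, so we save it ourselves
    results[key] = func()

t1 = Thread(target=run_and_store, args=("d1", get_website_1))
t2 = Thread(target=run_and_store, args=("d2", get_website_2))
t3 = Thread(target=run_and_store, args=("d3", get_website_3))

for t in (t1, t2, t3):
    t.start()  # all three downloads begin right away
for t in (t1, t2, t3):
    t.join()   # wait until every thread has finished

combine(results["d1"], results["d2"], results["d3"])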

Execution time now drops to about 2 seconds, since the three websites respond at around the same time:

In Python, threads are created with the Thread class, from the threading module, and you can see several examples in the first notebook from the repo of my tutorial: 1. Thread Basics.ipynb.
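To run the example above without real websites, the sketch below simulates each download with `time.sleep()`. The `get_website` and `combine` functions are hypothetical stand-ins. The 2-second delay is scaled down to 0.5 seconds so the example runs quickly.

```python
import time
from threading import Thread

def get_website(name, delay, results):
    """Hypothetical download: time.sleep() stands in for network latency."""
    time.sleep(delay)
    results[name] = f"data from {name}"

def combine(*parts):
    """Hypothetical report builder: join the downloaded pieces."""
    return " | ".join(parts)

results = {}
threads = [Thread(target=get_website, args=(f"site{i}", 0.5, results)) for i in (1, 2, 3)]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

report = combine(results["site1"], results["site2"], results["site3"])
print(report)
print(f"elapsed: {elapsed:.2f}s")
```

`time.sleep()` blocks without using the CPU, so the three threads wait at the same time. The elapsed time should therefore be close to 0.5 seconds rather than 1.5.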

THREADS AND SHARED DATA

Threads are explicitly created by your program and live within its process. That means they share all of the process’s data: they can read and modify global variables and any other object they both hold a reference to, and access the same memory or same resources (files, network sockets, etc.). Only the local variables of each function call are private to the thread running it.

This can cause issues, especially unexpected mutation of shared data and Race Conditions. You can see an example of a race condition in the notebook 2. Thread Data & Race Conditions.ipynb. In the same notebook, we introduce some mechanisms to “synchronize” threads to avoid Race Conditions, but those same mechanisms can potentially introduce the problem of a Deadlock. There’s an example of a Deadlock in the notebook: 3. Deadlocks.ipynb.
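Here is a minimal sketch of the most common synchronization mechanism, `threading.Lock`, guarding a shared counter. The `deposit` function is hypothetical. `counter += 1` is a read-modify-write, so without the lock two threads could read the same old value and one increment would be lost. Whether that loss actually appears in a given run depends on timing and on the interpreter version.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def deposit(times):
    """Hypothetical worker: increment the shared counter `times` times."""
    global counter
    for _ in range(times):
        # Only one thread at a time can be inside this block; the `with`
        # statement releases the lock even if an exception is raised.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=deposit, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000
```

Locks are also where deadlocks come from. Suppose one thread holds lock A while waiting for lock B, and another holds B while waiting for A. Neither thread can proceed. Always acquiring multiple locks in the same order avoids that particular deadlock.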

PRODUCERS, CONSUMERS AND QUEUES

Synchronization issues can be resolved by just changing the paradigm. The best way to avoid synchronization problems is to avoid synchronization altogether.

This is achieved by using the Producer-Consumer pattern. Producer threads create “pending tasks” to perform and then place them on a shared queue. Consumer threads will read the tasks from the queue and do the actual work.

In Python, the Queue class from the queue module is a thread-safe queue designed for exactly this purpose.

There’s an example of a Producer-Consumer model in the notebook 4. Producer-Consumer model.ipynb.
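With `queue.Queue`, the pattern might look like this sketch. It has one producer and one consumer. The producer puts a hypothetical `SENTINEL` value (`None`) on the queue to signal that no more work is coming. Neither function needs an explicit lock, because the queue handles synchronization internally.

```python
import queue
import threading

SENTINEL = None  # hypothetical "no more work" marker

def producer(tasks, items):
    for item in items:
        tasks.put(item)  # thread-safe: no explicit lock needed
    tasks.put(SENTINEL)

def consumer(tasks, results):
    while True:
        item = tasks.get()  # blocks until an item is available
        if item is SENTINEL:
            break
        results.append(item * item)  # the "actual work"

tasks = queue.Queue()
results = []
threads = [
    threading.Thread(target=producer, args=(tasks, range(5))),
    threading.Thread(target=consumer, args=(tasks, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [0, 1, 4, 9, 16]
```

With several consumers, the producer would need to put one sentinel per consumer so that every consumer thread knows when to stop.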

THE PYTHON GLOBAL INTERPRETER LOCK

The GIL is one of Python’s most hated features. It’s actually not that bad, but it has a terrible reputation.

The GIL is a “global” lock held by the Python interpreter, which guarantees that only 1 thread can execute Python code at a given time. This does sound pointless: why bother writing multithreaded code if only one thread can run at a time anyway?

Here’s where it’s important to understand that your code is not all about using CPU. Your code will do different tasks, like reading files, getting information from the network, etc. Remember our latency access times? Here they are again:

This is the key to working around the limitation of the GIL. The moment a thread needs to read something from a file, it releases the GIL and lets other threads run, since it’ll take a long time (in CPU terms) until the hard drive returns the data. This matters because most programs are what we call I/O-Bound: they spend most of their time waiting on I/O (network, file system, etc.) rather than computing.

You can see a “proof” of this with an I/O-bound task in the notebook: 5. The Python GIL.ipynb.

MULTIPROCESSING

Some other programs are inevitably CPU-Bound, such as programs that need to do a lot of computation, and sometimes you want to speed them up. If you use multithreading for CPU-Bound tasks, you’ll find that your programs don’t get any faster because of the GIL (they can even get slower).
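To check this claim yourself, here is a sketch that runs a pure-Python countdown loop twice in a row and then in two threads. The `count_down` function and `N` are hypothetical. On a standard CPython build (with the GIL), the threaded run is typically no faster than the sequential one, because only one thread can execute Python code at a time. The exact numbers depend on your machine and Python version.

```python
import time
from threading import Thread

def count_down(n):
    """Hypothetical CPU-bound loop: it never waits on I/O."""
    while n > 0:
        n -= 1

N = 2_000_000

# Run the work twice, one call after the other
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Run the same two calls in two threads
start = time.perf_counter()
threads = [Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s")
print(f"2 threads:  {threaded:.2f}s")
```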

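Enter multiprocessing, the second mechanism we have to write concurrent programs. Instead of having multiple threads, we spawn multiple child processes that take care of performing different tasks concurrently. Each child process runs its own Python interpreter with its own GIL, so CPU-Bound work can run truly in parallel on multiple CPU cores.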

There’s an example of a multiprocessing-based program in the notebook 6. Multiprocessing.ipynb.
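Here is a minimal sketch of the same idea using `multiprocessing.Pool`. `pool.map` splits the inputs across worker processes, and each worker has its own interpreter and GIL. The `count_primes` function is a hypothetical CPU-bound workload (naive trial division), and the pool size of 4 is an arbitrary choice.

```python
import time
from multiprocessing import Pool

def count_primes(limit):
    """Hypothetical CPU-bound task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    limits = [50_000] * 4
    start = time.perf_counter()
    # Each count_primes call runs in a worker process with its own GIL
    with Pool(processes=4) as pool:
        results = pool.map(count_primes, limits)
    print(results, f"in {time.perf_counter() - start:.2f}s")
```

The function passed to `pool.map` and its arguments are pickled and sent to the workers. That is why `count_primes` is defined at module level rather than inside the `if __name__ == "__main__":` block.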

CONCURRENT.FUTURES

To finish my tutorial, I’d like to point out that there’s a higher-level module that is part of the Python Standard Library that should be used when possible: concurrent.futures.

It provides high-level abstractions to create thread and process pools, and it’s mature enough to be considered, whenever possible, the default choice for writing concurrent code.
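For example, the three-website download from the multithreading section could be written with `ThreadPoolExecutor`, as in this sketch. Here `get_website` is again a hypothetical stand-in that sleeps for 0.5 seconds. `executor.map` returns results in the same order as its inputs, so there’s no need to collect return values by hand.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def get_website(name):
    """Hypothetical download: time.sleep() stands in for network latency."""
    time.sleep(0.5)
    return f"data from {name}"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as executor:
    # map() submits all three calls at once and yields results in input order
    d1, d2, d3 = executor.map(get_website, ["site1", "site2", "site3"])
elapsed = time.perf_counter() - start

print(d1, d2, d3, sep="\n")
print(f"elapsed: {elapsed:.2f}s")
```

For CPU-bound work, swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` keeps the same interface. The code then typically needs the same `if __name__ == "__main__":` guard as `multiprocessing`.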