Concurrency in Python covers the techniques for running multiple tasks simultaneously or in overlapping time periods: threading (for I/O-bound tasks), asyncio (for high-concurrency I/O with cooperative scheduling), and multiprocessing (for CPU-bound tasks that must bypass the GIL). The right choice depends on whether the workload is I/O-bound or CPU-bound, and on its requirements for parallelism, memory sharing, and integration with async frameworks such as those used in LLM API clients.
What Is Concurrency in Python?
- Definition: The ability to manage multiple tasks that make progress within overlapping time periods. Concurrency (tasks interleave, possibly on a single core) differs from parallelism (tasks execute at the same instant on multiple cores); Python supports both through different mechanisms.
- GIL (Global Interpreter Lock): In standard CPython builds, the GIL allows only one thread to execute Python bytecode at a time. Threading therefore does NOT provide true parallelism for CPU-bound Python code, but it DOES let I/O operations overlap, because a thread releases the GIL while it waits on blocking I/O.
- Choosing the Right Tool: I/O-bound tasks (API calls, database queries, file I/O) benefit from threading or asyncio. CPU-bound tasks (data processing, model inference) need multiprocessing or external libraries (NumPy, PyTorch) that release the GIL during computation.
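The effect of the GIL being released during I/O waits can be seen in a minimal sketch. Here `fake_io_call` is a hypothetical stand-in for an API call or database query (it only sleeps), and the task count and delay are arbitrary assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_call(task_id: int, delay: float = 0.2) -> int:
    """Hypothetical stand-in for an API call or database query."""
    time.sleep(delay)  # blocking wait; CPython releases the GIL while sleeping
    return task_id


def run_sequential(n: int) -> float:
    """Run n simulated I/O calls one after another; return elapsed seconds."""
    start = time.perf_counter()
    for i in range(n):
        fake_io_call(i)
    return time.perf_counter() - start


def run_threaded(n: int) -> float:
    """Run n simulated I/O calls in n threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(fake_io_call, range(n)))  # consume results to wait for all
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {run_sequential(4):.2f}s")
    print(f"threaded:   {run_threaded(4):.2f}s")
```

The threaded run should take roughly the duration of one call rather than the sum of all four, because the waits overlap. Replacing the sleep with a pure-Python computation would be expected to remove most of that gain under the GIL.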
Concurrency Models
| Model | Best For | Python Module | True Parallelism | Memory |
|---|---|---|---|---|
| Threading | I/O-bound, simple | threading | No (GIL) | Shared |
| Asyncio | I/O-bound, many connections | asyncio | No (single thread) | Shared |
| Multiprocessing | CPU-bound | multiprocessing | Yes (separate processes) | Separate |
| ProcessPoolExecutor | CPU-bound, simple API | concurrent.futures | Yes | Separate |
| ThreadPoolExecutor | I/O-bound, simple API | concurrent.futures | No (GIL) | Shared |
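The two `concurrent.futures` rows in the table share one interface, so switching between threads and processes can be a one-argument change. The sketch below assumes a hypothetical CPU-bound function `count_primes` and a hypothetical helper `run_with`; neither comes from the standard library.

```python
import math
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List, Type


def count_primes(limit: int) -> int:
    """CPU-bound stand-in: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def run_with(
    executor_cls: Type[Executor],
    fn: Callable[[int], int],
    items: Iterable[int],
    max_workers: int = 4,
) -> List[int]:
    """Map `fn` over `items` using the given executor class."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))  # results come back in input order


if __name__ == "__main__":
    limits = [20_000, 30_000, 40_000, 50_000]
    # Same call, different executor: threads share memory but contend for the
    # GIL, while processes run in parallel with separate memory.
    print(run_with(ThreadPoolExecutor, count_primes, limits))
    print(run_with(ProcessPoolExecutor, count_primes, limits))
```

The `if __name__ == "__main__":` guard matters for the process pool: on platforms that start workers by re-importing the main module, unguarded pool creation would run again in every worker.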
Async for LLM APIs
- Why Async: LLM API calls typically take from 500 ms to 30 s. Async lets a single thread keep hundreds of requests in flight at once, maximizing throughput when calling OpenAI, Anthropic, or self-hosted models.
- AsyncOpenAI: The OpenAI Python client provides an async interface: `await client.chat.completions.create()` enables non-blocking API calls.
- asyncio.gather: Run multiple async calls concurrently. `results = await asyncio.gather(*[call_api(p) for p in prompts])` starts every call at once and returns the results in input order.
- Rate Limiting: Use `asyncio.Semaphore` to cap the number of requests in flight, which helps avoid API rate limit errors while maintaining high throughput.
- Streaming: Async streaming (`async for chunk in response`) enables real-time token delivery to users while other requests are processed concurrently.
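The gather, semaphore, and streaming patterns above can be sketched without a network connection. In this example `call_api` and `stream_tokens` are hypothetical stand-ins that simulate latency with `asyncio.sleep`; a real client call would take their place. `MAX_CONCURRENT` is an assumed limit, not a value from any provider.

```python
import asyncio
from typing import AsyncIterator, List

MAX_CONCURRENT = 5  # assumed cap; tune to the provider's actual limits


async def call_api(prompt: str, sem: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for an awaitable LLM client call."""
    async with sem:  # at most MAX_CONCURRENT calls hold the semaphore at once
        await asyncio.sleep(0.05)  # simulated network latency
        return f"response to: {prompt}"


async def run_batch(prompts: List[str]) -> List[str]:
    """Issue all prompts concurrently, limited by a semaphore."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    # gather returns results in the same order as the awaitables passed in
    return await asyncio.gather(*(call_api(p, sem) for p in prompts))


async def stream_tokens(text: str) -> AsyncIterator[str]:
    """Hypothetical streaming response: yields one word at a time."""
    for token in text.split():
        await asyncio.sleep(0.01)  # simulated delay between chunks
        yield token


async def collect_stream(text: str) -> List[str]:
    """Consume a stream with `async for` and return the chunks as a list."""
    return [chunk async for chunk in stream_tokens(text)]


async def main() -> None:
    results = await run_batch([f"prompt {i}" for i in range(20)])
    print(len(results), results[0])
    async for chunk in stream_tokens("tokens arrive one at a time"):
        print(chunk, end=" ", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())
```

Note that a semaphore bounds how many calls run at the same time, not how many start per second; a provider with a strict requests-per-minute quota may need additional throttling.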
When to Use Each Approach
- Threading: Simple I/O concurrency (downloading files, making a few API calls). Easy to use, but scales poorly to thousands of connections.
- Asyncio: High-concurrency I/O (web servers, LLM API batching, websockets). Scales to thousands of concurrent connections on a single thread, but requires async-compatible libraries.
- Multiprocessing: CPU-intensive work (data preprocessing, model inference without a GPU). Provides true parallelism at the cost of higher memory overhead, since each process gets its own memory space.
- External Libraries: NumPy, PyTorch, and other C-extension libraries release the GIL during many of their heavy computations, which enables true parallelism within threads for numerical workloads.
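When an asyncio program has to call a library that offers only a blocking API, one common way to combine the approaches above is to push the blocking call onto a worker thread. The sketch below uses `asyncio.to_thread` (available in Python 3.9 and later); `blocking_fetch` is a hypothetical blocking function that merely sleeps.

```python
import asyncio
import time
from typing import Iterable, List


def blocking_fetch(name: str) -> str:
    """Hypothetical blocking call, e.g. a client library with no async API."""
    time.sleep(0.1)  # would block the event loop if called directly
    return f"{name}: done"


async def fetch_all(names: Iterable[str]) -> List[str]:
    """Run each blocking call in a worker thread and await them together."""
    # asyncio.to_thread hands the call to a thread so the event loop stays
    # free to schedule other coroutines while the call waits.
    tasks = (asyncio.to_thread(blocking_fetch, n) for n in names)
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(fetch_all(["users", "orders", "invoices"])))
```

This keeps the event loop responsive, but each blocking call still occupies a thread, so it inherits the scalability limits of threading rather than those of native async I/O.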
Concurrency in Python is an essential skill for building performant ML applications. The core decision is choosing between threading, asyncio, and multiprocessing based on whether the workload is I/O-bound or CPU-bound. Async programming is particularly important for LLM applications, which must efficiently manage hundreds of concurrent API calls and streaming responses.