Async/Await (Asynchronous Programming)

Keywords: async, await, concurrency

Async/Await (Asynchronous Programming) is the concurrency model that allows a single thread to handle many concurrent I/O-bound operations by suspending and resuming coroutines at await points rather than blocking the thread while I/O completes — the standard approach for building high-throughput LLM API servers, RAG pipelines, and AI services where network I/O dominates latency.

What Is Async/Await?

- Definition: A programming model built on coroutines — functions that can be paused at await points (while waiting for I/O) and resumed later, allowing a single event loop thread to interleave execution of thousands of concurrent operations without blocking.
- Event Loop: The central scheduler that manages coroutine execution. When a coroutine awaits an I/O operation (network request, database query), the event loop pauses it and runs other ready coroutines — no thread blocking, no wasted CPU cycles.
- Python asyncio: Python's built-in async framework — async def declares a coroutine, await suspends until the awaited operation completes, asyncio.run() starts the event loop.
- Key Distinction: Async/await is concurrent (many tasks interleaved) but not parallel (only one thing running at a time per thread) — it is ideal for I/O-bound work, not CPU-bound computation.
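
A minimal sketch shows how these pieces fit together; the asyncio.sleep() calls stand in for I/O waits such as network requests:

import asyncio

async def fake_io(name: str) -> str:
    await asyncio.sleep(1)  # suspension point: the event loop runs other coroutines while this one waits
    return f"{name} done"

async def main():
    # Both coroutines wait concurrently, so this takes ~1 second rather than ~2
    results = await asyncio.gather(fake_io("a"), fake_io("b"))
    print(results)

asyncio.run(main())  # start the event loop and run main() to completion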

Why Async Matters for AI Services

- LLM APIs Are I/O-Bound: Calling OpenAI, Anthropic, or a local vLLM server to generate a 500-token response takes 3-10 seconds. A synchronous (blocking) server ties up a thread for every active request — serving 100 concurrent users requires 100 threads.
- Thread Cost: Each Python thread reserves ~8MB of stack memory and adds context-switching overhead, so 10,000 concurrent users would mean 10,000 threads and on the order of 80GB of reserved stack, which is not a workable scaling strategy.
- Async Solution: 100 concurrent LLM API calls need only 1 async event loop thread — when request 1 is waiting for OpenAI to respond, the event loop processes requests 2 through 100.
- Streaming Responses: Server-sent events (token-by-token streaming) require the server to hold many open connections simultaneously — async makes this trivially efficient.
- Parallel RAG Steps: Retrieval from a vector DB, a metadata lookup, and a reranker API call can all be awaited concurrently with asyncio.gather(), reducing total latency from the sum of the steps to the slowest single step (see the RAG example below).

Async/Await in Practice

Basic Pattern:
import asyncio
import os
import httpx

async def call_llm(prompt: str) -> str:
    # LLM calls can take several seconds, so raise httpx's default 5-second timeout
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},  # API key from the environment
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

async def main():
    # Sequential: ~20 seconds for 4 calls
    # result1 = await call_llm("Q1")
    # result2 = await call_llm("Q2")

    # Concurrent: ~5 seconds for 4 calls (all four requests in flight at once)
    results = await asyncio.gather(
        call_llm("Q1"), call_llm("Q2"), call_llm("Q3"), call_llm("Q4")
    )
    return results

asyncio.run(main())
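
When fanning out many calls at once, a common refinement is to cap how many are in flight so the upstream API's rate limits are not exceeded. A sketch using asyncio.Semaphore, where the limit of 10 is an arbitrary illustrative value:

async def fan_out(prompts: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(10)  # at most 10 requests in flight at any moment (arbitrary limit)

    async def bounded_call(prompt: str) -> str:
        async with semaphore:
            return await call_llm(prompt)

    # gather still schedules every task, but only 10 hold the semaphore at a time
    return await asyncio.gather(*(bounded_call(p) for p in prompts))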

RAG Pipeline with Async:
async def rag_query(query: str) -> str:
    # These three run concurrently — total time = max(embedding, cache check, metadata), not sum
    embedding, cached_result, doc_metadata = await asyncio.gather(
        embed_query(query),           # ~50ms embedding API call
        check_semantic_cache(query),  # ~5ms Redis lookup
        fetch_recent_docs()           # ~20ms database query
    )
    if cached_result:
        return cached_result

    chunks = await vector_search(embedding)  # ~30ms
    context = build_context(chunks, doc_metadata)
    return await call_llm(context, query)    # ~3000ms
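
One robustness detail: by default asyncio.gather() propagates the first exception raised by any of the awaited calls. Passing return_exceptions=True returns exceptions as results instead, so a failed optional step (such as the cache lookup) does not abort the whole query. A sketch of how the gather call inside rag_query could be adapted, reusing the same hypothetical helpers:

embedding, cached_result, doc_metadata = await asyncio.gather(
    embed_query(query),
    check_semantic_cache(query),
    fetch_recent_docs(),
    return_exceptions=True,  # exceptions come back as values instead of being raised
)
if isinstance(cached_result, Exception):
    cached_result = None  # treat a cache failure as a cache miss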

FastAPI + Async:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

class GenerateResponse(BaseModel):
    text: str

@app.post("/generate")
async def generate(request: GenerateRequest) -> GenerateResponse:
    response = await call_llm(request.prompt)
    return GenerateResponse(text=response)

FastAPI automatically runs async endpoints on the event loop — thousands of concurrent requests with a single worker process.
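
The same model covers the token-by-token streaming mentioned earlier: an async generator yields chunks as they arrive, and each open connection costs only a suspended coroutine. A sketch using FastAPI's StreamingResponse and the OpenAI SDK's AsyncOpenAI streaming interface; the endpoint path and SSE framing are illustrative choices, and GenerateRequest is the model defined above:

from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def stream_llm(prompt: str):
    # With stream=True the SDK returns an async iterator of completion chunks
    stream = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield f"data: {delta}\n\n"  # server-sent-event framing: a "data:" line plus a blank line

@app.post("/generate/stream")
async def generate_stream(request: GenerateRequest):
    return StreamingResponse(stream_llm(request.prompt), media_type="text/event-stream")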

Async Libraries for AI

| Library | Use Case |
|---------|---------|
| httpx | Async HTTP client (LLM APIs, webhooks) |
| redis (redis.asyncio, formerly aioredis) | Async Redis (caching, rate limiting) |
| asyncpg | Async PostgreSQL (vector DB, metadata) |
| aiofiles | Async file I/O |
| FastAPI | Async web framework |
| OpenAI SDK | Built-in AsyncOpenAI client |
| LangChain | ainvoke(), astream() for async chains |

Common Pitfalls

Blocking the event loop: Calling a CPU-intensive or synchronously blocking function inside a coroutine stalls the event loop and every other coroutine with it.
Fix: Run the blocking code in a thread pool with loop.run_in_executor().

loop = asyncio.get_running_loop()
result = await loop.run_in_executor(None, blocking_function, *args)
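
On Python 3.9 and later, asyncio.to_thread() is a shorthand for the same thread-pool pattern (blocking_function and args are the placeholders from the lines above):

result = await asyncio.to_thread(blocking_function, *args)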

Forgetting await: async def functions return coroutine objects, not values; forgetting await gives you an unexecuted coroutine object instead of its result. Python emits a "coroutine was never awaited" RuntimeWarning, and running the loop in debug mode (asyncio.run(main(), debug=True)) surfaces this and other misuse earlier.
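
A minimal illustration, using the call_llm() coroutine defined earlier:

result = call_llm("Q1")        # wrong: result is an unexecuted coroutine object, no request was made
result = await call_llm("Q1")  # right: suspends until the request completes, result is the response string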

Async/await is the concurrency model that makes high-throughput AI serving economically feasible — by allowing a single process to handle thousands of concurrent LLM API calls, database queries, and streaming responses without proportional thread overhead, async/await is the architectural foundation of every modern AI API gateway and inference serving platform.
