Streaming LLM Responses
Why Streaming?
Instead of waiting for the complete generation, stream tokens as they are produced:
- Better UX: Users see immediate response
- Lower perceived latency: First token appears quickly
- Flexibility: User can stop generation early
Server-Sent Events (SSE)
SSE is a standard protocol for streaming events from server to client over a single long-lived HTTP response.
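To make the wire format concrete, the sketch below serializes and parses SSE messages by hand. The helper names `format_sse` and `parse_sse` are illustrative and not part of any library. The parser is deliberately simplified: it assumes the whole stream is already in memory and that lines end with `\n` only.

```python
import json
from typing import List, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Serialize one SSE message: optional event name, data lines, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload is sent as one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # The blank line (double newline) marks the end of the message.
    return "\n".join(lines) + "\n\n"


def parse_sse(stream: str) -> List[str]:
    """Return the data payload of each message in a complete SSE text stream."""
    messages = []
    for block in stream.split("\n\n"):
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("data:"):
                value = line[5:]
                # One leading space after the colon is not part of the value.
                if value.startswith(" "):
                    value = value[1:]
                data_lines.append(value)
        if data_lines:
            messages.append("\n".join(data_lines))
    return messages


wire = format_sse(json.dumps({"token": "Hel"})) + format_sse("[DONE]")
print(parse_sse(wire))  # ['{"token": "Hel"}', '[DONE]']
```

In a browser, `EventSource` performs this parsing itself; a hand-written parser like this is only needed for clients that read the raw response.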
Server Implementation (FastAPI)
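from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json

app = FastAPI()

# `llm` is assumed to be a model client defined elsewhere that exposes a
# blocking generate_stream(prompt) iterator.

# Accept GET as well as POST: the browser EventSource API only issues GET requests.
@app.api_route("/chat", methods=["GET", "POST"])
async def chat(prompt: str):  # `prompt` is read from the query string
    # A plain (sync) generator: Starlette iterates it in a thread pool,
    # so the blocking model loop does not stall the event loop.
    def generate():
        for token in llm.generate_stream(prompt):
            payload = json.dumps({"token": token})
            # Each SSE message is a "data:" line terminated by a blank line.
            yield f"data: {payload}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )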
Client Implementation (JavaScript)
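// EventSource always issues a GET request, so the endpoint must accept GET.
const eventSource = new EventSource("/chat?prompt=" + encodeURIComponent("Hello"));

eventSource.onmessage = function(event) {
    // The server sends a literal "[DONE]" sentinel when generation finishes.
    if (event.data === "[DONE]") {
        eventSource.close();
        return;
    }
    const data = JSON.parse(event.data);
    document.getElementById("output").textContent += data.token;
};

// Close on error; otherwise EventSource reconnects and re-sends the request.
eventSource.onerror = function() {
    eventSource.close();
};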
Python Client
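import json
import httpx

# Assumes a local server on port 8000 (httpx needs an absolute URL).
# `prompt` goes in the query string, where FastAPI reads a bare `prompt: str`.
with httpx.stream("POST", "http://localhost:8000/chat", params={"prompt": "Hello"}, timeout=None) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[6:]
        if payload == "[DONE]":  # end-of-stream sentinel, not JSON
            break
        print(json.loads(payload)["token"], end="", flush=True)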
OpenAI-Style Streaming
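from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in stream:
    if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
        continue
    content = chunk.choices[0].delta.content
    if content:  # role-only and final chunks carry no content
        print(content, end="", flush=True)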
Key Streaming Metrics
| Metric | Description | Target |
|---|---|---|
| TTFT | Time to First Token | Less than 500ms |
| TPOT | Time Per Output Token | Less than 50ms |
| ITL | Inter-Token Latency | Low variance |
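The metrics above can be computed from per-token arrival times. The sketch below is one way to do it, under stated assumptions: TPOT is taken as the mean gap between consecutive tokens after the first, and ITL variability is reported as the standard deviation of those gaps. `measure_stream` and `fake_token_stream` are hypothetical names; the fake stream stands in for a real model client.

```python
import time
from statistics import mean, pstdev
from typing import Callable, Dict, Iterable, Iterator


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for a model stream; sleeps to simulate latency."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)
        yield token


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Consume a token stream and report TTFT, TPOT, and ITL spread in seconds."""
    start = clock()
    stamps = []
    for _ in tokens:
        stamps.append(clock())  # arrival time of each token
    if not stamps:
        raise ValueError("stream produced no tokens")
    # Gaps between consecutive tokens are the inter-token latencies.
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return {
        "ttft_s": stamps[0] - start,
        "tpot_s": mean(gaps) if gaps else 0.0,
        "itl_stdev_s": pstdev(gaps) if gaps else 0.0,
        "tokens": len(stamps),
    }


print(measure_stream(fake_token_stream()))
```

Measuring on the client side, as here, includes network delay; measuring inside the server isolates model latency. Which one matters depends on what the target is meant to capture.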
WebSocket Alternative
For bidirectional real-time communication:
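from fastapi import WebSocket, WebSocketDisconnect

@app.websocket("/ws/chat")
async def chat_websocket(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            # Each incoming text message is treated as a new prompt.
            prompt = await websocket.receive_text()
            for token in llm.generate_stream(prompt):
                await websocket.send_text(token)
    except WebSocketDisconnect:
        pass  # client closed the connection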
Best Practices
- Handle connection drops gracefully
- Consider buffering (send every N tokens)
- Implement backpressure for slow clients
- Add heartbeats for long generations
- Log complete generations for debugging
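As an illustration of the buffering practice in the list above, the sketch below groups tokens into chunks of N before they are sent. `batch_tokens` is a hypothetical helper, and a fixed count is only one possible flush policy; flushing on a timer or on punctuation are alternatives.

```python
from typing import Iterable, Iterator


def batch_tokens(tokens: Iterable[str], n: int = 4) -> Iterator[str]:
    """Yield the concatenation of every n tokens, then any remainder."""
    if n < 1:
        raise ValueError("n must be at least 1")
    buffer = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= n:
            yield "".join(buffer)
            buffer.clear()
    # Flush whatever is left so the tail of the response is not lost.
    if buffer:
        yield "".join(buffer)


print(list(batch_tokens(["a", "b", "c", "d", "e"], n=2)))  # ['ab', 'cd', 'e']
```

Larger batches mean fewer messages on the wire but a choppier display, and the first batch delays the first visible output, so a small N (or flushing the first token immediately) is usually the safer starting point.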