Streaming LLM Responses

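Why Streaming?

Instead of waiting for the complete generation, stream tokens to the client as they are produced. The user starts reading as soon as the first token is ready rather than after the whole response has been generated.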

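Server-Sent Events (SSE)

SSE is a standard HTTP-based protocol for one-way streaming from server to client. The response uses the text/event-stream media type, and each event is sent as a "data:" line followed by a blank line.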
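To make the wire format concrete, here is a minimal sketch of SSE framing and parsing using only the standard library. The helper names format_sse and parse_sse are hypothetical, the [DONE] sentinel follows the convention used in this article's examples, and the parser assumes one "data:" line per event (the full SSE format also allows multi-line data, event names, and ids).

```python
import json
from typing import Iterable, Iterator

DONE_SENTINEL = "[DONE]"  # end-of-stream marker used by the examples in this article


def format_sse(payload: dict) -> str:
    """Serialize one payload as an SSE event: a data line plus a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded payloads from SSE lines, stopping at the sentinel."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        data = line[len("data:"):].strip()
        if data == DONE_SENTINEL:
            return
        yield json.loads(data)


# Round trip: frame two tokens, then parse them back from the raw text.
wire = format_sse({"token": "Hel"}) + format_sse({"token": "lo"}) + "data: [DONE]\n\n"
tokens = [event["token"] for event in parse_sse(wire.splitlines())]
print("".join(tokens))
```

Keeping framing and parsing in small functions like these makes it easy to unit-test the protocol handling without starting a server.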

Server Implementation (FastAPI)

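from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json

app = FastAPI()

# Accept GET as well as POST: the browser EventSource API can only issue GET requests.
@app.api_route("/chat", methods=["GET", "POST"])
async def chat(prompt: str):  # prompt is read from the query string
    async def generate():
        # llm is assumed to be a model client that exposes generate_stream(prompt)
        for token in llm.generate_stream(prompt):
            # Each SSE event is a "data:" line terminated by a blank line
            yield f"data: {json.dumps({'token': token})}\n\n"
        # Sentinel telling clients that the stream is complete
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )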
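One caveat about the server example above: generate() iterates a synchronous generator inside an async function, so the event loop is blocked while the model produces each token. The following is a rough sketch of one way around that, assuming Python 3.9+ for asyncio.to_thread. The names iterate_in_thread and fake_generate_stream are hypothetical, and fake_generate_stream only stands in for llm.generate_stream.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterator

_END = object()  # private marker meaning "the blocking iterator is exhausted"


async def iterate_in_thread(make_iter: Callable[[], Iterator[str]]) -> AsyncIterator[str]:
    """Drive a blocking iterator from a worker thread, one item per await."""
    iterator = await asyncio.to_thread(make_iter)
    while True:
        # next() runs in a thread, so the event loop stays free while the model works.
        item = await asyncio.to_thread(next, iterator, _END)
        if item is _END:
            break
        yield item


def fake_generate_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for llm.generate_stream; it just echoes words."""
    for word in prompt.split():
        yield word + " "


async def collect(prompt: str) -> list[str]:
    """Gather every streamed token into a list (for demonstration only)."""
    return [tok async for tok in iterate_in_thread(lambda: fake_generate_stream(prompt))]


result = asyncio.run(collect("streaming keeps users engaged"))
print(result)
```

In the endpoint, the inner loop could then become an async for over iterate_in_thread(lambda: llm.generate_stream(prompt)). If the model client already offers a native async streaming API, that is likely the simpler choice.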

Client Implementation (JavaScript)

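// EventSource always issues a GET request, so the prompt travels in the query string
const eventSource = new EventSource("/chat?prompt=" + encodeURIComponent("Hello"));

eventSource.onmessage = function(event) {
    if (event.data === "[DONE]") {
        eventSource.close();
        return;
    }
    const data = JSON.parse(event.data);
    document.getElementById("output").textContent += data.token;
};

// Without this, EventSource reconnects automatically and would resend the prompt
eventSource.onerror = function() {
    eventSource.close();
};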

Python Client

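import json
import httpx

# httpx needs an absolute URL; the host and port here are an assumed local dev server
url = "http://localhost:8000/chat"
# The server reads prompt from the query string, so send it via params
with httpx.stream("POST", url, params={"prompt": "Hello"}, timeout=None) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue  # skip the blank lines that separate events
        payload = line[6:]
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        print(json.loads(payload)["token"], end="", flush=True)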

OpenAI-Style Streaming

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

Key Streaming Metrics

| Metric | Description           | Target           |
|--------|-----------------------|------------------|
| TTFT   | Time to First Token   | Less than 500 ms |
| TPOT   | Time Per Output Token | Less than 50 ms  |
| ITL    | Inter-Token Latency   | Low variance     |
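The metrics in the table above can be measured by timestamping each token as it arrives. Below is a minimal sketch, assuming TPOT is taken as the mean gap between consecutive tokens and ITL variability as the standard deviation of those gaps. The function names and the simulated stream are hypothetical; in practice the iterable would be a real token stream.

```python
import time
from statistics import mean, pstdev
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and report TTFT, TPOT, and ITL spread in seconds."""
    start = time.perf_counter()
    arrivals = []
    for _ in tokens:
        arrivals.append(time.perf_counter())  # record when each token arrived
    if not arrivals:
        raise ValueError("stream produced no tokens")
    # Gaps between consecutive tokens are the inter-token latencies.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft": arrivals[0] - start,
        "tpot": mean(gaps) if gaps else 0.0,
        "itl_stdev": pstdev(gaps) if gaps else 0.0,
        "tokens": len(arrivals),
    }


def fake_stream(n: int, first_delay: float, step_delay: float) -> Iterator[str]:
    """Simulated model: a slow first token followed by steady decoding."""
    time.sleep(first_delay)
    yield "tok0"
    for i in range(1, n):
        time.sleep(step_delay)
        yield f"tok{i}"


stats = measure_stream(fake_stream(5, first_delay=0.05, step_delay=0.01))
print(stats)
```

Measuring on the client side includes network latency, which is usually what the user experiences. Measuring on the server isolates the model's own behavior.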

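WebSocket Alternative

Use a WebSocket when the client and server need bidirectional real-time communication, for example to send follow-up prompts over the same connection: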

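from fastapi import WebSocket, WebSocketDisconnect

# app is the FastAPI instance created in the server example above
@app.websocket("/ws/chat")
async def chat_websocket(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            # Each text frame from the client is treated as a new prompt
            prompt = await websocket.receive_text()
            # llm is assumed to be a model client that exposes generate_stream(prompt)
            for token in llm.generate_stream(prompt):
                await websocket.send_text(token)
    except WebSocketDisconnect:
        pass  # client closed the connection; exit the handler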
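Because the WebSocket handler above sends bare tokens, the client cannot tell where one response ends and the next begins. One possible convention, sketched here as an assumption rather than a standard, is to wrap each token in a small JSON message and finish with a "done" message. The names frame_response and read_response and the message fields are hypothetical.

```python
import json
from typing import Iterable, Iterator


def frame_response(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a JSON message and finish with a 'done' message."""
    for token in tokens:
        yield json.dumps({"type": "token", "token": token})
    yield json.dumps({"type": "done"})


def read_response(messages: Iterable[str]) -> str:
    """Client side: join tokens until the 'done' message arrives."""
    parts = []
    for raw in messages:
        message = json.loads(raw)
        if message["type"] == "done":
            break
        parts.append(message["token"])
    return "".join(parts)


# Simulate one exchange without a network connection.
frames = list(frame_response(["Hi", " ", "there"]))
reply = read_response(frames)
print(reply)
```

On the server, each framed string would be passed to websocket.send_text in place of the raw token.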

Best Practices
