Latency

Keywords: latency, deployment

Latency in the context of AI and LLM deployment refers to the delay between sending a request to a model and receiving its response, whether measured to the first output token or to the full completion. It is one of the most critical performance metrics for any real-time AI application.

Components of LLM Latency

- Network Latency: The round-trip time for the request to reach the inference server and the response to return. Typically 1–50 ms depending on geography and infrastructure.
- Queue Wait Time: Time spent waiting for a GPU to become available if the system is under load.
- Prefill Latency: Time to process all input tokens (the prompt) through the model. Scales with prompt length.
- Time to First Token (TTFT): The total delay before the first output token is generated; it includes network, queue, and prefill time.
- Decode Latency: Time to generate each subsequent output token. Determines the perceived streaming speed. (A client-side way to measure both TTFT and decode speed is sketched below.)
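
As a concrete illustration, here is a minimal client-side sketch that times TTFT and decode speed for one streamed generation. It assumes a hypothetical `stream_tokens` generator that yields output tokens as they arrive; substitute your provider's actual streaming API. From the client, network, queue, and prefill time are only observable in aggregate, as TTFT.

```python
import time

def measure_latency(stream_tokens, prompt):
    """Time TTFT and decode speed for one streamed generation."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _token in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        n_tokens += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise RuntimeError("stream produced no tokens")
    ttft = first_token_at - start  # network + queue + prefill, seen from the client
    decode_tps = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else 0.0
    return ttft, decode_tps
```

Because the client only sees the aggregate, lowering TTFT in practice means profiling the server side to see whether queueing or prefill dominates.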

Typical Latency Targets

- Interactive Chat: TTFT under 500 ms, decode at 30+ tokens/second for a smooth conversational experience.
- API Calls: End-to-end response within 1–5 seconds for most applications.
- Real-Time Systems: Sub-100 ms TTFT required for voice assistants, gaming, and robotics. (A simple threshold check against these targets is sketched below.)
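
As a rough illustration, these targets can be encoded as thresholds and checked against measurements (for example, from `measure_latency` above). The numbers are copied from the list; real budgets vary by product. The real-time profile specifies no decode speed, so only TTFT is checked there.

```python
TARGETS = {
    # Thresholds taken from the targets listed above.
    "interactive_chat": {"max_ttft_s": 0.5, "min_decode_tps": 30.0},
    "real_time": {"max_ttft_s": 0.1, "min_decode_tps": None},
}

def meets_target(profile, ttft_s, decode_tps):
    t = TARGETS[profile]
    if ttft_s > t["max_ttft_s"]:
        return False
    return t["min_decode_tps"] is None or decode_tps >= t["min_decode_tps"]

# Example: 320 ms TTFT at 45 tokens/second meets the interactive-chat bar.
assert meets_target("interactive_chat", ttft_s=0.32, decode_tps=45.0)
```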

Optimization Techniques

- KV Cache: Stores previously computed key-value pairs so attention over earlier tokens is not recomputed at each autoregressive decoding step (first sketch after this list).
- Speculative Decoding: Uses a smaller draft model to propose several tokens ahead, which the main model then verifies in a single pass (second sketch after this list).
- Model Distillation: Smaller, faster models trained to mimic larger ones.
- Hardware Upgrades: Faster GPUs with higher memory bandwidth (such as NVIDIA H100/H200) directly reduce latency, since autoregressive decoding is typically memory-bandwidth bound.
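
To make the KV-cache idea concrete, here is a minimal NumPy sketch of single-head attention during autoregressive decoding. The names and shapes are illustrative rather than from any particular framework; the point is that each decode step appends one new key/value pair to the cache and attends over it, instead of recomputing projections for the entire sequence.

```python
import numpy as np

d = 64  # head dimension (illustrative)

def attend(q, K, V):
    """One query attending over all cached keys/values."""
    scores = K @ q / np.sqrt(d)             # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over cached positions
    return weights @ V                      # (d,)

# Decode loop: only the newest token is projected each step; everything
# earlier is reused from the cache. A real model would derive k, v, q
# from hidden states and run many heads, but the caching logic is the same.
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(8):
    k_new, v_new, q_new = rng.standard_normal((3, d))  # stand-in projections
    K_cache = np.vstack([K_cache, k_new])
    V_cache = np.vstack([V_cache, v_new])
    out = attend(q_new, K_cache, V_cache)   # O(t) work per step, not O(t^2)
```

Without the cache, every decode step would re-run the key/value projections for all previous tokens, so per-token cost would grow with sequence length far faster.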
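Speculative decoding can likewise be sketched in a few lines. The draft and target models below are toy stand-ins over a three-token vocabulary, not real networks, and a production system would verify all drafted positions with a single batched target-model pass and, on rejection, resample from an adjusted residual distribution; this sketch keeps only the accept/reject core.

```python
import random

VOCAB = ["a", "b", "c"]

def draft_next(ctx):
    """Toy draft model: uniform over the vocabulary."""
    tok = random.choice(VOCAB)
    return tok, 1.0 / len(VOCAB)

def target_probs(ctx):
    """Toy target model: mildly prefers 'a'."""
    return {"a": 0.5, "b": 0.25, "c": 0.25}

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then accept/reject them with the target model."""
    ctx, drafted = list(prefix), []
    for _ in range(k):
        tok, p_draft = draft_next(ctx)
        drafted.append((tok, p_draft))
        ctx.append(tok)
    # Accept each drafted token with probability min(1, p_target / p_draft);
    # the first rejection ends this speculative run.
    ctx, accepted = list(prefix), []
    for tok, p_draft in drafted:
        p_target = target_probs(ctx).get(tok, 0.0)
        if random.random() < min(1.0, p_target / p_draft):
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    return accepted

print(speculative_step(["<s>"]))
```

The full algorithm preserves the target model's output distribution; when the draft model agrees with the target often, several tokens are accepted per target-model pass, cutting decode latency.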
