Batching and throughput optimization


Batching and throughput optimization is the practice of combining multiple inference requests into single GPU operations. Processing prompts in batches rather than one at a time raises GPU utilization and tokens-per-second throughput, which is essential for cost-effective LLM serving at scale.

What Is Batching?

Definition: Processing multiple requests in a single forward pass.
Goal: Maximize GPU utilization and throughput.
Trade-off: Higher throughput vs. increased per-request latency.
Context: Critical for production LLM serving economics.

Why Batching Matters

GPU Utilization: Single requests underutilize GPU compute.
Cost Efficiency: More tokens per GPU-hour = lower cost per token.
Scale: Handle more users with same hardware.
Memory Amortization: Model weights are read once per forward pass regardless of batch size, so that fixed cost is spread across more requests.
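
The cost-efficiency point can be made concrete with a little arithmetic. The sketch below is illustrative only: the GPU hourly price and both throughput figures are hypothetical values chosen for the example, not measurements.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD needed to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical price and throughputs, for illustration only.
GPU_HOURLY_USD = 2.00
for label, tps in [("single request", 50), ("batched", 1000)]:
    cost = cost_per_million_tokens(GPU_HOURLY_USD, tps)
    print(f"{label}: ${cost:.2f} per 1M tokens")
```

Under these assumed numbers, a 20× throughput gain translates directly into a 20× lower cost per token, because the GPU-hour price is fixed.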

Batching Strategies

Static Batching:

Fixed batch size, wait until batch is full.
All requests start and end together.
Simple but wasteful (padding, waiting).
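
The padding cost of static batching can be estimated from the sequence lengths alone. A minimal sketch, assuming every sequence in the batch is padded to the length of the longest one (the function name is hypothetical):

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of token slots in a padded batch that hold padding."""
    if not lengths:
        return 0.0
    padded_slots = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_slots


# Three requests of 100, 250, and 500 tokens padded to 500 each.
print(f"{padding_waste([100, 250, 500]):.1%} of slots are padding")
```

For this mix of lengths, more than 40% of the batch is padding, which is the waste the other strategies try to recover.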

Dynamic Batching:

Accumulate requests within time window.
Variable batch size based on arrivals.
Better utilization than static.
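
One way to picture dynamic batching is as a grouping rule over arrival times. The sketch below is a simplified offline simulation, not a server implementation: `window` and `max_batch` are assumed parameter names, and a batch is closed when it is full or when its first request has waited longer than the window.

```python
def dynamic_batches(arrivals: list[float], window: float, max_batch: int) -> list[list[float]]:
    """Group arrival timestamps (seconds) into batches.

    A batch is dispatched when it holds max_batch requests or when the
    next arrival is more than `window` seconds after its first request.
    """
    batches: list[list[float]] = []
    current: list[float] = []
    for t in sorted(arrivals):
        if current and (t - current[0] > window or len(current) >= max_batch):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


print(dynamic_batches([0.00, 0.01, 0.02, 0.20, 0.21], window=0.05, max_batch=4))
```

A shorter window lowers time-in-queue at the cost of smaller batches; a longer window does the opposite.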

Continuous Batching (State-of-the-art):

Requests join/leave batch dynamically.
New request can start while others are in progress.
No waiting for batch completion.
Implemented in vLLM, TGI, TensorRT-LLM.
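
The difference between static and continuous batching shows up in a toy simulation that counts decode iterations. This is a sketch under simplifying assumptions (every iteration costs the same, prefill is ignored, and each request has a positive output length); it is not how vLLM, TGI, or TensorRT-LLM are implemented.

```python
from collections import deque


def static_steps(lengths: list[int], slots: int) -> int:
    """Iterations when each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths: list[int], slots: int) -> int:
    """Iterations when a freed slot is refilled before the next iteration."""
    queue = deque(lengths)
    active: list[int] = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Each running request emits one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


output_lengths = [100, 250, 500, 110, 60]
print("static:", static_steps(output_lengths, slots=3))
print("continuous:", continuous_steps(output_lengths, slots=3))
```

In this example the static schedule needs 610 iterations and the continuous one needs 500, because the two short trailing requests run in slots that would otherwise sit idle behind the 500-token request.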

In-Flight Batching:

Mix prefill and decode phases in same batch.
Balance compute-bound prefill work with memory-bandwidth-bound decode work so both resources stay busy.
Most efficient for heterogeneous request lengths.

Batch Size Trade-offs

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 760 470" width="100%" height="100%" style="background:#0F172A;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Helvetica,Arial,sans-serif;">
  <defs>
    <linearGradient id="cardGrad" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" stop-color="#1E293B"/>
      <stop offset="100%" stop-color="#0F172A"/>
    </linearGradient>
  </defs>

  <!-- Title Banner -->
  <rect x="20" y="20" width="720" height="56" rx="14" fill="url(#cardGrad)" stroke="#334155" stroke-width="1"/>
  <text x="36" y="46" font-size="18" font-weight="800" fill="#F8FAFC">Batching Strategies: Static vs Continuous Batching</text>
  <text x="36" y="64" font-size="12" font-weight="600" fill="#94A3B8">Inference Throughput Optimization · Iteration-Level Scheduling · Bubble Waste Reduction</text>

  <!-- Panel 1: Traditional Static Batching (Bubble Waste) -->
  <rect x="20" y="90" width="350" height="350" rx="16" fill="url(#cardGrad)" stroke="#334155" stroke-width="1"/>
  <text x="40" y="118" font-size="14" font-weight="700" fill="#F43F5E">1. Traditional Static Batching</text>

  <rect x="40" y="135" width="310" height="180" rx="10" fill="#0F172A" stroke="#334155" stroke-width="1"/>
  <text x="55" y="158" font-size="12" font-weight="800" fill="#F43F5E">Sequence Execution Timelines:</text>

  <!-- Req 1 (Short) -->
  <text x="55" y="182" font-size="10" font-weight="700" fill="#CBD5E1">Req 1 (100 tok):</text>
  <rect x="150" y="172" width="70" height="15" rx="3" fill="#38BDF8"/>
  <rect x="220" y="172" width="110" height="15" rx="3" fill="#334155" stroke="#F43F5E" stroke-dasharray="2,2"/>
  <text x="275" y="184" font-size="9" font-weight="700" fill="#F43F5E" text-anchor="middle">GPU Idle Bubble</text>

  <!-- Req 2 (Medium) -->
  <text x="55" y="212" font-size="10" font-weight="700" fill="#CBD5E1">Req 2 (250 tok):</text>
  <rect x="150" y="202" width="120" height="15" rx="3" fill="#A78BFA"/>
  <rect x="270" y="202" width="60" height="15" rx="3" fill="#334155" stroke="#F43F5E" stroke-dasharray="2,2"/>

  <!-- Req 3 (Long - Bottleneck) -->
  <text x="55" y="242" font-size="10" font-weight="700" fill="#CBD5E1">Req 3 (500 tok):</text>
  <rect x="150" y="232" width="180" height="15" rx="3" fill="#F59E0B"/>

  <line x1="330" y1="165" x2="330" y2="260" stroke="#F43F5E" stroke-width="1.5" stroke-dasharray="3,3"/>
  <text x="55" y="285" font-size="10" font-weight="800" fill="#F43F5E">Batch Bottleneck: Waits for longest sequence to finish!</text>

  <!-- Summary Box -->
  <rect x="40" y="325" width="310" height="100" rx="10" fill="#1E293B" stroke="#475569" stroke-width="1"/>
  <text x="55" y="348" font-size="12" font-weight="800" fill="#F8FAFC">Static Batch Limitations:</text>
  <text x="55" y="370" font-size="11" font-weight="600" fill="#CBD5E1">• High memory padding waste &amp; low GPU utilization.</text>
  <text x="55" y="392" font-size="11" font-weight="600" fill="#CBD5E1">• New incoming requests blocked until full batch finishes.</text>
  <text x="55" y="412" font-size="10" font-weight="700" fill="#F43F5E">Low overall serving throughput (tokens/sec/GPU)</text>

  <!-- Panel 2: Continuous / In-Flight Batching -->
  <rect x="390" y="90" width="350" height="350" rx="16" fill="url(#cardGrad)" stroke="#334155" stroke-width="1"/>
  <text x="410" y="118" font-size="14" font-weight="700" fill="#38BDF8">2. Continuous / Iteration-Level Batching</text>

  <rect x="410" y="135" width="310" height="180" rx="10" fill="#0F172A" stroke="#334155" stroke-width="1"/>
  <text x="425" y="158" font-size="12" font-weight="800" fill="#38BDF8">Dynamic Token Iteration Scheduling:</text>

  <!-- Dynamic slot filling -->
  <text x="425" y="182" font-size="10" font-weight="700" fill="#CBD5E1">Slot 1:</text>
  <rect x="475" y="172" width="70" height="15" rx="3" fill="#38BDF8"/>
  <rect x="550" y="172" width="110" height="15" rx="3" fill="#10B981"/>
  <text x="605" y="184" font-size="9" font-weight="800" fill="#F8FAFC" text-anchor="middle">+ Req 4 (Immediate Join)</text>

  <text x="425" y="212" font-size="10" font-weight="700" fill="#CBD5E1">Slot 2:</text>
  <rect x="475" y="202" width="120" height="15" rx="3" fill="#A78BFA"/>
  <rect x="600" y="202" width="60" height="15" rx="3" fill="#EC4899"/>
  <text x="630" y="214" font-size="9" font-weight="800" fill="#F8FAFC" text-anchor="middle">+ Req 5</text>

  <text x="425" y="242" font-size="10" font-weight="700" fill="#CBD5E1">Slot 3:</text>
  <rect x="475" y="232" width="185" height="15" rx="3" fill="#F59E0B"/>

  <text x="425" y="285" font-size="10" font-weight="800" fill="#10B981">Zero Bubble Waste: Completed requests exit instantly!</text>

  <!-- Continuous Batching Benefits Box -->
  <rect x="410" y="325" width="310" height="100" rx="10" fill="#1E293B" stroke="#475569" stroke-width="1"/>
  <text x="425" y="348" font-size="12" font-weight="800" fill="#38BDF8">Continuous Batching Advantages:</text>
  <text x="425" y="370" font-size="11" font-weight="600" fill="#F8FAFC">• 2x to 4x higher throughput vs static batching.</text>
  <text x="425" y="392" font-size="11" font-weight="600" fill="#CBD5E1">• Implemented in vLLM, TGI, TensorRT-LLM engines.</text>
  <text x="425" y="412" font-size="10" font-weight="700" fill="#10B981">Optimal for SLA latency &amp; high QPS LLM serving</text>
</svg>

Memory Constraints

KV Cache Scaling:

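KV Cache Memory (bytes) = 2 × layers × hidden_size × seq_len × batch_size × bytes_per_element

The leading 2 accounts for the separate key and value tensors.

Example (Llama 70B dimensions, 4K context, FP16, assuming full multi-head attention):
= 2 × 80 × 8192 × 4096 × 1 × 2 bytes
≈ 10.7 GB per sequence

Batch of 16 ≈ 172 GB just for KV cache!

Models that use grouped-query attention, such as Llama 2 70B, store fewer key/value heads, so their actual KV cache is proportionally smaller than this estimate.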
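
The KV cache estimate is easy to reproduce in code. In this sketch `kv_width` is an assumed parameter name: it equals the hidden size under full multi-head attention, or `num_kv_heads × head_dim` for a model with grouped-query attention.

```python
def kv_cache_bytes(layers: int, kv_width: int, seq_len: int,
                   batch_size: int, bytes_per_element: int = 2) -> int:
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_width * seq_len * batch_size * bytes_per_element


GB = 1e9
# 80 layers, hidden size 8192, 4K context, FP16, full multi-head attention.
print(f"per sequence: {kv_cache_bytes(80, 8192, 4096, 1) / GB:.1f} GB")
print(f"batch of 16:  {kv_cache_bytes(80, 8192, 4096, 16) / GB:.1f} GB")
# Same model shape with 8 KV heads of dimension 128 (grouped-query attention).
print(f"with GQA:     {kv_cache_bytes(80, 8 * 128, 4096, 1) / GB:.2f} GB")
```

The calculation makes the scaling visible: memory grows linearly with both context length and batch size, so the KV cache rather than the weights usually sets the maximum batch size.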

PagedAttention Solution:

Allocate KV cache in pages, not contiguous blocks.
Share common prefixes across requests.
Dynamic allocation reduces fragmentation.
Reported to enable 2-4× higher throughput by fitting more sequences into the same GPU memory.
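
The paging idea can be sketched as a small block allocator. This is a toy model written for illustration, not vLLM's actual data structures: the class name, the block size of 16 tokens, and the method names are all assumptions, and prefix sharing is left out.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))     # physical block ids not in use
        self.tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.lengths: dict[str, int] = {}       # request id -> tokens stored

    def append_token(self, req: str) -> None:
        """Reserve space for one more token, taking a new block only when needed."""
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


alloc = BlockAllocator(num_blocks=4)
for _ in range(17):
    alloc.append_token("req-a")
print(len(alloc.tables["req-a"]), "blocks used,", len(alloc.free), "free")
```

Because a request only ever wastes the unused tail of its last block, memory does not have to be reserved up front for the maximum possible sequence length.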

Throughput Optimization Techniques

Prefill Chunking:

Split long prompts into smaller chunks.
Interleave those chunks with decode steps of other requests.
Reduces time-to-first-token (TTFT) variance.
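
A minimal sketch of the chunking step, assuming one prefill chunk is paired with one decode token per running request in each iteration. The function names and the dictionary layout are hypothetical and chosen only for illustration.

```python
def chunk_prompt(prompt_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return (start, end) token ranges that cover the prompt."""
    return [(start, min(start + chunk_size, prompt_len))
            for start in range(0, prompt_len, chunk_size)]


def schedule(prompt_len: int, chunk_size: int, decoding: list[str]) -> list[dict]:
    """One iteration per chunk: a prefill slice plus one decode token per running request."""
    return [{"prefill": token_range, "decode": list(decoding)}
            for token_range in chunk_prompt(prompt_len, chunk_size)]


for step in schedule(prompt_len=1000, chunk_size=512, decoding=["req-a", "req-b"]):
    print(step)
```

Running requests keep producing a token every iteration instead of stalling while the whole 1000-token prompt is processed at once.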

Request Scheduling:

Priority queues for latency-sensitive requests.
Separate queues for long vs. short requests.
Preemption for high-priority requests.
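
Priority queuing can be sketched with a binary heap from the standard library. The class below is a hypothetical illustration: lower numbers mean higher priority, ties are served in arrival order, and preemption of already-running requests is not modeled.

```python
import heapq
import itertools


class PriorityScheduler:
    """Serve the lowest priority value first; FIFO within the same priority."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, request_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def next_batch(self, max_batch: int) -> list[str]:
        """Pop up to max_batch request ids in priority order."""
        batch = []
        while self._heap and len(batch) < max_batch:
            batch.append(heapq.heappop(self._heap)[2])
        return batch


sched = PriorityScheduler()
sched.submit("bulk-1", priority=5)
sched.submit("chat-1", priority=0)
sched.submit("bulk-2", priority=5)
sched.submit("chat-2", priority=0)
print(sched.next_batch(max_batch=3))
```

The interactive requests are scheduled ahead of the bulk ones even though they arrived later.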

Multi-GPU Strategies:

Tensor Parallel: Split each layer's weight matrices across GPUs.
Pipeline Parallel: Split by layers.
Data Parallel: Replicate model, split batches.
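
Two of these strategies reduce to simple partitioning rules, sketched below with hypothetical helper names. Tensor parallelism is omitted because it splits the matrix multiplications inside each layer and cannot be shown as a simple list partition.

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Pipeline parallel: assign a contiguous range of layers to each stage."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # spread any remainder over early stages
        stages.append(range(start, start + size))
        start += size
    return stages


def split_batch(requests: list, num_replicas: int) -> list[list]:
    """Data parallel: distribute requests round-robin across model replicas."""
    return [requests[i::num_replicas] for i in range(num_replicas)]


print(split_layers(80, 4))
print(split_batch(["r1", "r2", "r3", "r4", "r5"], 2))
```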

Throughput Benchmarks

Configuration                | Throughput (tokens/sec) | Latency (ms/token)
-----------------------------|-------------------------|-------------------
Single request               | 50-80                   | 20
Batch 8, static              | 300-400                 | 35
Batch 32, continuous         | 800-1200                | 50
Batch 64, PagedAttention     | 1500-2500               | 70

Monitoring Metrics

Queue Depth: Pending requests waiting for processing.
Batch Utilization: Actual vs. maximum batch size.
GPU Memory: KV cache utilization percentage.
Time-in-Queue: Wait time before processing starts.
Tokens/Second: Overall throughput metric.
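
Several of these metrics are simple ratios over counters a serving process would already track. The sketch below assumes a hypothetical snapshot structure; the field names are illustrative and do not correspond to any particular engine's metrics API.

```python
from dataclasses import dataclass


@dataclass
class ServingSnapshot:
    """Counters sampled over one monitoring window (hypothetical fields)."""
    queued_requests: int      # queue depth
    running_requests: int
    max_batch_size: int
    kv_blocks_used: int
    kv_blocks_total: int
    tokens_generated: int
    window_seconds: float

    def batch_utilization(self) -> float:
        return self.running_requests / self.max_batch_size

    def kv_cache_utilization(self) -> float:
        return self.kv_blocks_used / self.kv_blocks_total

    def tokens_per_second(self) -> float:
        return self.tokens_generated / self.window_seconds


snap = ServingSnapshot(queued_requests=12, running_requests=24, max_batch_size=32,
                       kv_blocks_used=900, kv_blocks_total=1200,
                       tokens_generated=30_000, window_seconds=30.0)
print(f"batch utilization: {snap.batch_utilization():.0%}")
print(f"KV cache utilization: {snap.kv_cache_utilization():.0%}")
print(f"throughput: {snap.tokens_per_second():.0f} tokens/sec")
```

A non-empty queue alongside batch utilization below 100% usually points at KV cache memory, not compute, as the limiting resource.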

Batching and throughput optimization is the key to LLM serving economics. Without efficient batching, GPU utilization can stay below 20% and costs become prohibitive; with modern continuous batching and PagedAttention, the same hardware can serve up to 10× more users at a fraction of the cost.
