Triton Inference Server
What is Triton?
NVIDIA Triton Inference Server is a production-grade serving platform that supports multiple frameworks, dynamic batching, and GPU orchestration.
Key Features
| Feature | Description |
|---|---|
| Multi-framework | PyTorch, TensorFlow, ONNX, TensorRT |
| Dynamic batching | Automatically batch requests |
| Model versioning | Serve multiple model versions |
| Ensemble models | Chain models together |
| GPU/CPU execution | Flexible resource allocation |
| Metrics | Prometheus metrics built-in |
Model Repository Structure
model_repository/
├── llama/
│   ├── config.pbtxt      # Model configuration
│   └── 1/                # Version 1
│       └── model.onnx    # Model file
└── embeddings/
    ├── config.pbtxt
    └── 1/
        └── model.pt
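The layout above can be checked before the server is started. The following is a minimal sketch rather than a Triton tool: `find_layout_problems` is a hypothetical helper that enforces only the conventions shown here (one `config.pbtxt` per model and at least one non-empty numeric version directory). Triton can infer the configuration for some backends, so a missing `config.pbtxt` is not always an error in practice.

```python
import tempfile
from pathlib import Path


def find_layout_problems(repo_root):
    """Return human-readable problems found in a model repository.

    Hypothetical helper: it checks only the layout shown above.
    """
    problems = []
    root = Path(repo_root)
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        # Version directories are named with integers, e.g. "1".
        versions = sorted(
            d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()
        )
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
        for version_dir in versions:
            if not any(version_dir.iterdir()):
                problems.append(
                    f"{model_dir.name}/{version_dir.name}: empty version directory"
                )
    return problems


# Demo: build a throwaway repository in a temp directory and check it.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "llama" / "1").mkdir(parents=True)
    (repo / "llama" / "config.pbtxt").write_text('name: "llama"\n')
    (repo / "llama" / "1" / "model.onnx").write_bytes(b"")  # placeholder file
    (repo / "embeddings" / "1").mkdir(parents=True)  # no config, no model file
    problems = find_layout_problems(repo)

print(problems)
```

In the demo, `llama` passes and `embeddings` is reported twice: once for the missing configuration and once for the empty version directory.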
Model Configuration
# config.pbtxt
name: "llama"
platform: "onnxruntime_onnx"
max_batch_size: 16
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]          # Variable length; the batch dimension is implicit when max_batch_size > 0
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 32000 ]   # Batch dimension is implicit here as well
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 50000   # Wait up to 50 ms to form a batch
}
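Because `max_batch_size` is greater than zero in this configuration, `dims` describes a single request and Triton prepends a batch dimension to form the full tensor shape. A small sketch of that rule follows; `full_shape` and `shape_matches` are hypothetical helper names used for illustration and are not part of any Triton library.

```python
def full_shape(dims, max_batch_size):
    """Full tensor shape: [-1] + dims when batching is enabled, else dims."""
    return ([-1] if max_batch_size > 0 else []) + list(dims)


def shape_matches(actual, dims, max_batch_size):
    """Check a concrete tensor shape against config dims (-1 matches any size)."""
    expected = full_shape(dims, max_batch_size)
    if len(actual) != len(expected):
        return False
    # The leading dimension is the batch size and may not exceed the limit.
    if max_batch_size > 0 and actual[0] > max_batch_size:
        return False
    return all(e == -1 or e == a for a, e in zip(actual, expected))


print(full_shape([-1], 16))                 # input_ids
print(full_shape([-1, 32000], 16))          # logits
print(shape_matches([1, 10], [-1], 16))     # one sequence of 10 tokens
print(shape_matches([10], [-1], 16))        # batch dimension missing
```

Under these assumptions an `input_ids` tensor of shape `[1, 10]` is accepted, while a bare `[10]` is rejected because the batch dimension is missing.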
Running Triton
# Start server (port 8000: HTTP, port 8001: gRPC)
docker run --gpus all -p 8000:8000 -p 8001:8001 \
  -v /path/to/models:/models \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models
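Model loading can take a while after the container starts, so clients usually wait for the server to report readiness. Triton's HTTP endpoint answers `GET /v2/health/ready` with status 200 once it is ready. The sketch below polls that endpoint using only the standard library; `http_ready` and `wait_until_ready` are hypothetical helper names, and the retry counts are arbitrary defaults.

```python
import time
import urllib.request


def http_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(
            f"{base_url}/v2/health/ready", timeout=timeout
        ) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and connection errors all derive from OSError.
        return False


def wait_until_ready(probe, attempts=30, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

With the container from the command above running, `wait_until_ready(lambda: http_ready("http://localhost:8000"))` would block for up to roughly 30 seconds. Passing the probe as a callable keeps the retry loop testable without a live server.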
Client Usage
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Create input: one sequence of 10 token IDs (placeholder values)
input_array = np.zeros((1, 10), dtype=np.int64)
inputs = [httpclient.InferInput("input_ids", [1, 10], "INT64")]
inputs[0].set_data_from_numpy(input_array)

# Infer
outputs = [httpclient.InferRequestedOutput("logits")]
response = client.infer("llama", inputs, outputs=outputs)
result = response.as_numpy("logits")
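Under the hood the HTTP client speaks the KServe v2 inference protocol, posting to `/v2/models/<model>/infer`. As an illustration of what travels over the wire, the sketch below builds the JSON form of such a request with the standard library only. `build_infer_request` is a hypothetical helper and the token IDs are made-up sample values; the real client may use a binary tensor encoding instead of JSON.

```python
import json


def build_infer_request(input_name, token_ids, output_name):
    """Build the JSON body for POST /v2/models/<model>/infer.

    token_ids is a list of equal-length lists of ints (batch x sequence).
    """
    batch = len(token_ids)
    seq_len = len(token_ids[0])
    if any(len(row) != seq_len for row in token_ids):
        raise ValueError("all sequences in a batch must have the same length")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [batch, seq_len],
                "datatype": "INT64",
                # Tensor data is sent flattened in row-major order.
                "data": [tok for row in token_ids for tok in row],
            }
        ],
        "outputs": [{"name": output_name}],
    }


# Sample values only; real token IDs come from the model's tokenizer.
body = build_infer_request(
    "input_ids", [[101, 2023, 2003], [101, 7592, 102]], "logits"
)
print(json.dumps(body, indent=2))
```

The `shape` field carries the batch dimension explicitly, which is why the client example passes `[1, 10]` even though `config.pbtxt` lists only `[ -1 ]`.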
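Dynamic Batching
Triton automatically batches requests on the server side, combining requests that arrive within the configured queue delay into a single batch: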
Request 1: batch_size=1 ─┐
Request 2: batch_size=1 ─┼─► Combined batch_size=4
Request 3: batch_size=2 ─┘
Benefits:
- Better GPU utilization
- Higher throughput
- Configurable latency trade-offs
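The grouping in the diagram can be imitated with a toy model. This is a deliberately simplified sketch and not Triton's scheduler: it ignores arrival times, the queue delay, and preferred batch sizes, and simply packs queued requests in arrival order until the next one would exceed `max_batch_size`. `form_batches` is a hypothetical name.

```python
def form_batches(request_sizes, max_batch_size):
    """Greedily pack queued requests, in arrival order, into batches.

    request_sizes: batch size of each queued request.
    Returns a list of batches, each a list of the request sizes it contains.
    """
    batches, current, current_size = [], [], 0
    for size in request_sizes:
        if size > max_batch_size:
            raise ValueError(f"request of size {size} exceeds max_batch_size")
        # Close the current batch if this request would overflow it.
        if current and current_size + size > max_batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# The three requests from the diagram fit into one batch of size 4.
print(form_batches([1, 1, 2], max_batch_size=16))
# With a tighter limit the same queue is split across several batches.
print(form_batches([1, 1, 2, 3, 2], max_batch_size=4))
```

A smaller `max_batch_size` produces more, smaller batches, which is one side of the latency and throughput trade-off listed above.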
Scaling
- Horizontal: Multiple Triton instances behind a load balancer
- Multi-GPU: Multiple model instances across GPUs
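- Kubernetes: Run Triton as a Deployment behind a Service, for example with the Helm charts in the Triton repository or through a serving layer such as KServe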
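For the multi-GPU case, the number and placement of model instances is controlled by an `instance_group` block in `config.pbtxt`. The sketch below renders such a block from Python so the values are easy to vary; `render_instance_group` is a hypothetical helper, and the GPU indices are assumptions about the host rather than anything taken from the configuration shown earlier.

```python
def render_instance_group(count, gpus):
    """Render an instance_group block for config.pbtxt.

    count: model instances to create on each listed GPU.
    gpus:  GPU device indices, e.g. [0, 1] (assumed to exist on the host).
    """
    if count < 1 or not gpus:
        raise ValueError("need count >= 1 and at least one GPU index")
    gpu_list = ", ".join(str(g) for g in gpus)
    return (
        "instance_group [\n"
        "  {\n"
        f"    count: {count}\n"
        "    kind: KIND_GPU\n"
        f"    gpus: [ {gpu_list} ]\n"
        "  }\n"
        "]\n"
    )


print(render_instance_group(2, [0, 1]))
```

The printed block can be appended to a model's `config.pbtxt`. More instances allow more requests to execute concurrently, at the cost of additional GPU memory per instance.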