Triton Inference Server

What is Triton?

NVIDIA Triton Inference Server is a production-grade serving platform that supports multiple frameworks, dynamic batching, and GPU orchestration.

Key Features

Feature              Description
Multi-framework      PyTorch, TensorFlow, ONNX, TensorRT
Dynamic batching     Automatically batch requests
Model versioning     Serve multiple model versions
Ensemble models      Chain models together
GPU/CPU execution    Flexible resource allocation
Metrics              Prometheus metrics built-in

Model Repository Structure

model_repository/
├── llama/
│   ├── config.pbtxt        # Model configuration
│   └── 1/                   # Version 1
│       └── model.onnx       # Model file
└── embeddings/
    ├── config.pbtxt
    └── 1/
        └── model.pt
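
The layout above can be checked before the server is started. The following is a minimal sketch, not part of Triton: the helper name `find_layout_problems` is hypothetical, and it assumes every model ships an explicit config.pbtxt (Triton can generate configuration automatically for some backends, in which case the first check would be too strict).

```python
from pathlib import Path


def find_layout_problems(repo_root):
    """Return a list of human-readable problems found in a model repository."""
    problems = []
    for model_dir in sorted(Path(repo_root).iterdir()):
        if not model_dir.is_dir():
            continue
        # Assumption: every model carries an explicit config.pbtxt.
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        # Version directories are named with integers, e.g. "1".
        versions = sorted(
            d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()
        )
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
        for version_dir in versions:
            # A version directory should contain at least one model file.
            if not any(version_dir.iterdir()):
                problems.append(
                    f"{model_dir.name}/{version_dir.name}: empty version directory"
                )
    return problems
```

Calling `find_layout_problems("model_repository")` returns an empty list when every model directory has a configuration file and at least one non-empty version directory.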

Model Configuration

# config.pbtxt
name: "llama"
platform: "onnxruntime_onnx"
max_batch_size: 16

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]  # Variable length
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 32000 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 50000
}
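
Because max_batch_size is greater than zero, the batch dimension is implicit: the dims in config.pbtxt describe a single item, and the tensor a client sends carries one extra leading dimension. The rule can be sketched in Python as follows; `full_shape` and `shape_matches` are hypothetical helpers written for illustration, not Triton APIs, and the sketch does not check the batch size against max_batch_size.

```python
def full_shape(max_batch_size, dims):
    """Shape a client-side tensor must have, with -1 marking a variable dimension."""
    # With batching enabled, a leading batch dimension is implied.
    return ([-1] if max_batch_size > 0 else []) + list(dims)


def shape_matches(expected, actual):
    """True if `actual` fits `expected`, where -1 accepts any size."""
    return len(expected) == len(actual) and all(
        e == -1 or e == a for e, a in zip(expected, actual)
    )


input_shape = full_shape(16, [-1])          # "input_ids" from the config above
output_shape = full_shape(16, [-1, 32000])  # "logits" from the config above

print(input_shape)                                  # [-1, -1]
print(shape_matches(input_shape, [1, 10]))          # True
print(shape_matches(output_shape, [1, 10, 32000]))  # True
print(shape_matches(input_shape, [10]))             # False: batch dimension missing
```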

Running Triton

# Start server (trailing backslashes continue the command across lines)
docker run --gpus all -p 8000:8000 -p 8001:8001 \
  -v /path/to/models:/models \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models
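
Model loading can take a while after the container starts, so a client usually waits until the server reports ready before sending requests. A minimal sketch using only the Python standard library, assuming the HTTP port 8000 mapped above; the function name `wait_until_ready` and its default timeout values are illustrative choices, not part of Triton.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout_s=60.0, interval_s=1.0):
    """Poll the HTTP readiness endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting connections yet, or is not ready.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A script could call `wait_until_ready()` once and exit with an error if it returns False, rather than letting the first inference request fail.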

Client Usage

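import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Placeholder token IDs with shape [batch, sequence_length]; replace with real tokenizer output
input_array = np.zeros((1, 10), dtype=np.int64)

# Create input; the declared shape and datatype must match the array
inputs = [httpclient.InferInput("input_ids", list(input_array.shape), "INT64")]
inputs[0].set_data_from_numpy(input_array)

# Infer
outputs = [httpclient.InferRequestedOutput("logits")]
response = client.infer("llama", inputs, outputs=outputs)
result = response.as_numpy("logits")  # Logits as a NumPy array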

Dynamic Batching

Triton automatically batches requests:

Request 1: batch_size=1  ─┐
Request 2: batch_size=1  ─┼─► Combined batch_size=4
Request 3: batch_size=2  ─┘
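
The combination shown in the diagram can be imitated with a small greedy packer. This is a simplified sketch for intuition only, not Triton's actual scheduler: it ignores preferred_batch_size and max_queue_delay_microseconds, and the names `Request` and `form_batches` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    batch_size: int


def form_batches(queue, max_batch_size):
    """Pack queued requests, in arrival order, into batches of at most max_batch_size."""
    batches = []
    current, current_size = [], 0
    for req in queue:
        if req.batch_size > max_batch_size:
            raise ValueError(f"{req.name} exceeds max_batch_size")
        # Close the current batch if this request would overflow it.
        if current and current_size + req.batch_size > max_batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(req)
        current_size += req.batch_size
    if current:
        batches.append(current)
    return batches


queue = [Request("Request 1", 1), Request("Request 2", 1), Request("Request 3", 2)]
for batch in form_batches(queue, max_batch_size=16):
    combined = sum(r.batch_size for r in batch)
    print([r.name for r in batch], "combined batch_size =", combined)
```

With max_batch_size set to 16, as in the configuration earlier on this page, the three requests fit in a single batch of size 4.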

Benefits:

Scaling
