Model orchestration and routing

Keywords: orchestrator, router, multi-model, routing, model selection, cascade, ensemble, cost optimization

Model orchestration and routing is the technique of directing requests to different AI models based on query characteristics — using intelligent routing to send simple queries to fast/cheap models and complex queries to powerful/expensive models, optimizing cost, latency, and quality across a portfolio of AI capabilities.

What Is Model Routing?

- Definition: Dynamically selecting which model handles each request.
- Goal: Optimize cost, latency, and quality simultaneously.
- Methods: Rule-based, classifier-based, or LLM-based routing.
- Context: Multiple models with different cost/capability trade-offs.

Why Routing Matters

- Cost Optimization: Use expensive models only when needed; on workloads dominated by simple queries, routing can cut spend dramatically.
- Latency: Fast models for simple queries, powerful for complex.
- Quality: Match model capability to task requirements.
- Reliability: Fallback to alternate models on failures.
- Scalability: Distribute load across model portfolio.

Router Architectures

Rule-Based Routing:
```python
def route(query: str) -> str:
    if len(query) < 50 and "?" not in query:
        return "gpt-3.5-turbo"    # Simple, cheap
    elif "code" in query.lower():
        return "claude-3-sonnet"  # Good at code
    else:
        return "gpt-4o"           # Default capable
```

Classifier-Based Routing:
```
Train classifier on:
- Query difficulty labels
- Query category labels
- Historical model performance

At inference:
Query → Classifier → Predicted best model
```
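The shape above can be sketched in a few lines. This is a minimal stand-in for a trained classifier: the feature extraction is real, but the class weights here are illustrative values, not learned ones — in practice they would come from training on labeled query-difficulty data.

```python
def extract_features(query: str) -> list[float]:
    """Turn a query into a small numeric feature vector."""
    return [
        len(query) / 100.0,                  # length (scaled)
        float(query.count("?")),             # question marks
        float(any(kw in query.lower()
                  for kw in ("code", "prove", "analyze"))),  # hard keywords
        1.0,                                 # bias term
    ]

# Hypothetical "learned" weights: one row per difficulty class.
WEIGHTS = {
    "simple":   [-1.0, 0.2, -2.0,  1.0],
    "moderate": [ 0.5, 0.5,  0.0,  0.0],
    "complex":  [ 1.0, 0.3,  3.0, -1.0],
}

def predict_model(query: str) -> str:
    """Score each class with a linear model and pick the highest."""
    feats = extract_features(query)
    scores = {
        label: sum(w * f for w, f in zip(ws, feats))
        for label, ws in WEIGHTS.items()
    }
    return max(scores, key=scores.get)
```

A real version would replace the linear scorer with any lightweight classifier (logistic regression, a small fine-tuned encoder) trained on historical routing outcomes.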

LLM-Based Routing:
```
Use small, fast LLM to analyze query:
"Based on this query, which model should handle it?"
→ Route to recommended model
```
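A sketch of this pattern, with the small router LLM left as an injected callable (`call_small_llm` is a placeholder for any cheap completion call, and the model names are examples). The key practical detail is validating the router's free-form output against an allowlist before trusting it:

```python
# Models the router is allowed to choose between (example names).
ALLOWED = {"gpt-4o-mini", "gpt-4o", "claude-3-5-sonnet"}

ROUTER_PROMPT = """You are a router. Given the user query below, answer with
exactly one model name from: {options}.
Query: {query}
Model:"""

def llm_route(query: str, call_small_llm) -> str:
    prompt = ROUTER_PROMPT.format(
        options=", ".join(sorted(ALLOWED)), query=query
    )
    answer = call_small_llm(prompt).strip()
    # Never trust free-form output blindly; fall back to a safe default.
    return answer if answer in ALLOWED else "gpt-4o-mini"
```

The routing call itself adds latency and cost, so this approach pays off mainly when the router model is much cheaper than the models it chooses between.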

Cascading Strategy

```
┌─────────────────────────────────────────────┐
│ User Query                                  │
│      ↓                                      │
│ Try cheap/fast model first                  │
│      ↓                                      │
│ Check confidence/quality                    │
│      ↓                                      │
│ If good      → Return response              │
│ If uncertain → Escalate to powerful model   │
└─────────────────────────────────────────────┘

Example cascade:
1. Llama-3.1-8B (fast, cheap)
2. If confidence < 0.8 → GPT-4o-mini
3. If still uncertain → Claude-3.5-Sonnet
```
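The diagram leaves "check confidence/quality" open. Common proxies include mean token log-probability or a model self-rating. This sketch fakes that signal with stub model functions so the escalation control flow itself is runnable; the tier names and confidences are illustrative only:

```python
def run_cascade(query, tiers, threshold=0.8):
    """tiers: list of (name, call_fn) ordered cheap → expensive.
    Each call_fn returns (response, confidence in [0, 1])."""
    response = None
    for name, call_fn in tiers:
        response, confidence = call_fn(query)
        if confidence >= threshold:
            return name, response   # good enough, stop escalating
    return tiers[-1][0], response   # keep the strongest model's answer

# Stubs standing in for Llama-3.1-8B → GPT-4o-mini → Claude-3.5-Sonnet.
tiers = [
    ("llama-3.1-8b",      lambda q: ("short answer", 0.6)),
    ("gpt-4o-mini",       lambda q: ("better answer", 0.9)),
    ("claude-3.5-sonnet", lambda q: ("best answer", 0.95)),
]
```

Note the cost trade-off: every escalation pays for the cheaper attempts too, so cascades only win when most queries stop at the first tier.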

Multi-Model Portfolios

```
Model             | Cost/1M tk | Latency | Capability | Use For
------------------|------------|---------|------------|-------------------
GPT-3.5-turbo     | $0.50      | ~200ms  | Basic      | Simple Q&A, chat
GPT-4o-mini       | $0.15      | ~300ms  | Good       | General tasks
GPT-4o            | $5.00      | ~500ms  | Strong     | Complex reasoning
Claude-3.5-Sonnet | $3.00      | ~400ms  | Strong     | Code, writing
Claude-3-Opus     | $15.00     | ~800ms  | Strongest  | Critical tasks
Llama-3.1-8B      | ~$0.05*    | ~100ms  | Basic      | High-volume simple
```
*Self-hosted estimate

Routing Signals

Query Characteristics:
- Length: Short queries → simpler model.
- Keywords: Domain-specific → specialized model.
- Complexity: Multi-hop reasoning → powerful model.
- Format: Code, math, writing → specialized model.

User/Context:
- Customer tier: Premium → best model.
- History: Past failures → try different model.
- SLA: Low latency required → fast model.

System State:
- Load: High traffic → distribute to cheaper models.
- Errors: Primary down → automatic fallback.
- Cost budget: Near limit → prefer cheaper.
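The three signal families above can be folded into a single routing decision. This is an illustrative sketch — the keyword lists, weights, and thresholds are made-up values that a real system would tune against observed quality and cost:

```python
def pick_model(query: str, premium_user: bool, system_load: float) -> str:
    """Combine query, user, and system signals into one routing score.
    system_load is a 0..1 utilization estimate."""
    score = 0.0
    score += min(len(query) / 200.0, 1.0)   # longer queries tend to be harder
    if any(kw in query.lower() for kw in ("why", "prove", "design")):
        score += 0.5                        # reasoning-style keywords
    if premium_user:
        score += 0.5                        # customer tier gets better models
    score -= system_load * 0.5              # under load, shed to cheap models
    if score >= 1.0:
        return "gpt-4o"
    elif score >= 0.4:
        return "gpt-4o-mini"
    return "gpt-3.5-turbo"
```

Keeping the signals in one scoring function makes the routing policy easy to log, audit, and adjust as the model portfolio changes.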

Ensemble Strategies

Best-of-N:
```
1. Send query to N models
2. Collect all responses
3. Use judge model to pick best
4. Return winning response

Expensive but highest quality
```
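The fan-out-and-judge loop reduces to a few lines. In this sketch the judge is a stub scoring function (a real system would make another model call there), and the candidate models are stubs:

```python
def best_of_n(query, models, judge):
    """models: list of (name, call_fn). judge(query, response) -> score.
    Returns the (name, response) pair the judge scores highest."""
    candidates = [(name, call_fn(query)) for name, call_fn in models]
    return max(candidates, key=lambda pair: judge(query, pair[1]))

# Stub candidates and a stub judge that simply prefers longer answers.
models = [
    ("a", lambda q: "terse"),
    ("b", lambda q: "a thorough, well-sourced answer"),
]
judge = lambda q, resp: len(resp)
```

Since every query pays for N generations plus a judge call, best-of-N is usually reserved for high-stakes requests rather than used as the default path.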

Consensus Checking:
```
1. Send to 2+ models
2. If responses agree → return any
3. If different → escalate to powerful model

Good for factual accuracy
```
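A minimal consensus check, with stubs for the model calls. The `normalize` step here is a crude stand-in for real answer matching — production systems typically compare embeddings or extracted answers rather than raw strings:

```python
def normalize(text: str) -> str:
    """Crude canonicalization so trivially different strings can match."""
    return " ".join(text.lower().split())

def consensus(query, model_a, model_b, escalate):
    """Query two cheap models; escalate only if they disagree."""
    a, b = model_a(query), model_b(query)
    if normalize(a) == normalize(b):
        return a                  # agreement → return either answer
    return escalate(query)        # disagreement → powerful model decides
```

The appeal is that the expensive model is only invoked on the disagreement cases, which is where it is most likely to add value.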

Orchestration Platforms

- LiteLLM: Unified API for 100+ model providers.
- Portkey: AI gateway with routing, caching, fallbacks.
- Martian: Intelligent model router.
- OpenRouter: Multi-provider routing.
- Custom: Build with simple routing logic.
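For the custom option, the core of what these gateways provide — ordered fallback across providers — is small enough to sketch directly. The provider callables here are placeholders for real API clients:

```python
def call_with_fallback(query, providers):
    """providers: list of (name, call_fn), tried in order until one succeeds.
    Returns (provider_name, response); raises if every provider fails."""
    errors = []
    for name, call_fn in providers:
        try:
            return name, call_fn(query)
        except Exception as exc:      # real code would catch narrower errors
            errors.append((name, str(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

Dedicated gateways add the pieces this sketch omits: retries with backoff, caching, rate-limit awareness, and per-provider health tracking.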

Implementation Example

```python
class ModelRouter:
    def __init__(self):
        self.classifier = load_classifier("router_model.pt")
        self.models = {
            "simple": "gpt-3.5-turbo",
            "moderate": "gpt-4o-mini",
            "complex": "gpt-4o",
        }

    def route(self, query: str) -> str:
        # Classifier returns "simple", "moderate", or "complex"
        complexity = self.classifier.predict(query)
        model = self.models[complexity]
        return call_model(model, query)

    def cascade(self, query: str) -> str:
        response = None
        for tier in ["simple", "moderate", "complex"]:
            response, confidence = call_with_confidence(
                self.models[tier], query
            )
            if confidence > 0.85:
                return response
        return response  # Fall back to the strongest model's answer
```

Model orchestration and routing is essential for production AI economics — without intelligent routing, teams either overspend on powerful models for simple tasks or underserve complex queries with weak models, making routing architecture critical for balancing cost, quality, and user experience.
