Together AI is a cloud inference platform serving 100+ open-weight language models via an OpenAI-compatible API at 3-10x lower cost than proprietary models — enabling developers to switch from GPT-4 to Llama-3-70B or DeepSeek-V3 by changing only a base URL and a model name, while Together AI handles the GPU infrastructure, inference optimization, and model hosting.
What Is Together AI?
- Definition: A cloud inference platform founded in 2022 that specializes in hosting and serving open-weight language models (Llama, Mistral, Mixtral, Qwen, DeepSeek) via a REST API compatible with OpenAI's SDK — so existing OpenAI integrations work with different model weights instantly.
- Mission: Democratize access to open-source AI by providing the infrastructure to run large open-weight models affordably — without requiring teams to manage GPU infrastructure, CUDA drivers, or serving frameworks.
- OpenAI-Compatible API: Together AI's inference API mirrors OpenAI's chat completions endpoint — change base_url to api.together.xyz and swap the model name to use Llama or Mixtral instead of GPT-4.
- Custom Inference Stack: Together AI builds optimized inference kernels for throughput and latency — delivering faster time-to-first-token and higher tokens/second than standard self-hosted vLLM on equivalent hardware.
- Founded: 2022, backed by NVIDIA, Salesforce Ventures, and Andreessen Horowitz — with a mission to build the decentralized cloud for AI.
Why Together AI Matters for AI Engineers
- Cost Reduction vs OpenAI: Llama-3.1-70B at ~$0.88/million tokens vs GPT-4o at $5/million input tokens — 5x+ cost reduction for comparable capability on many tasks.
- Open-Weight Access: 100+ open-weight models available via simple API — no hosting infrastructure needed to use Llama, Mistral, DBRX, Qwen, DeepSeek, or Code Llama.
- Zero-Migration API: Build on OpenAI SDK, switch to Together AI with two config lines — no refactoring of prompts, parsers, or application logic.
- Fine-Tuning Service: Upload LoRA fine-tuned adapters or train custom models on Together AI infrastructure — serve custom models via the same inference API.
- No Vendor Lock-in: Build on open-weight models — if Together AI changes pricing, migrate to self-hosted vLLM or alternative provider with same model weights and prompts.
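The zero-migration point above can be sketched concretely. The snippet below isolates the two configuration values that change when repointing an existing OpenAI-SDK integration at Together AI; the model name and environment-variable name are illustrative, and the SDK call shown in comments assumes the stock `openai` package is already in use:

```python
# The only configuration values that change when migrating an
# OpenAI-SDK integration to Together AI (model name is illustrative):
TOGETHER_BASE_URL = "https://api.together.xyz/v1"
MODEL = "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"  # was e.g. "gpt-4o"

# With the stock OpenAI SDK, the swap is just:
#   client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"],
#                   base_url=TOGETHER_BASE_URL)
#   client.chat.completions.create(model=MODEL, messages=[...])
# Prompts, parsers, and application logic remain untouched.
print(TOGETHER_BASE_URL, MODEL)
```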
Together AI Services
Inference API (Chat Completions):
```python
from together import Together

client = Together(api_key="your-key")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain RLHF in AI training"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
Fine-Tuning:
- Upload training data in JSONL format (instruction/response pairs)
- Fine-tune base models (Llama, Mistral) on custom domain data
- Serve fine-tuned models via same API with your custom model ID
- Pricing: per training token + per inference token
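JSONL, as used in the workflow above, is simply one JSON object per line. A minimal sketch of preparing such a training file follows; the `instruction`/`response` field names reflect a common convention and are assumptions, not Together AI's exact schema:

```python
import json

# Illustrative instruction/response pairs for a domain fine-tune
examples = [
    {"instruction": "Summarize: RLHF aligns models with human preferences.",
     "response": "RLHF fine-tunes a model using human feedback as a reward signal."},
    {"instruction": "Define LoRA in one sentence.",
     "response": "LoRA trains small low-rank adapter matrices instead of all weights."},
]

# JSONL = one JSON object per line, the format fine-tuning services ingest
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Verify the file round-trips cleanly before uploading
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```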
Embeddings:
- Embed documents with BAAI/bge-large, M2-Bert, and other embedding models
- Returns vectors for RAG pipelines at competitive pricing
- Compatible with LangChain and LlamaIndex embedding integrations
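Once the embeddings endpoint returns vectors, retrieval in a RAG pipeline reduces to similarity ranking. A self-contained sketch with stand-in 3-dimensional vectors (real embeddings from a model like BAAI/bge-large would have hundreds of dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-in vectors; real embedding models return 768-1024 dimensions
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]

# Rank documents by similarity to the query vector
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # doc_a is closest to the query
```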
Key Models Available:
- Meta Llama 3.1 405B / 70B / 8B Instruct Turbo
- Mixtral 8x7B / 8x22B Instruct
- DeepSeek-V3, DeepSeek-R1 (reasoning)
- Qwen 2.5 72B / Qwen 1.5 110B
- DeepSeek Coder, Code Llama (code generation)
- FLUX.1 (image generation)
Pricing Model:
- Pay per million tokens (input + output separately priced)
- No subscription, no minimum spend
- Larger models cost more per token; smaller/quantized models cost less
- Fine-tuning priced per training token
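Per-token billing makes cost estimation straightforward arithmetic. A sketch with illustrative rates (the figures below are examples for the calculation, not a current price sheet):

```python
# Illustrative per-million-token rates in USD; check the live price sheet
INPUT_RATE = 0.88   # $ per 1M input tokens
OUTPUT_RATE = 0.88  # $ per 1M output tokens

def estimate_cost(input_tokens, output_tokens):
    # Cost scales linearly with token volume; no subscription or minimum spend
    return (input_tokens / 1e6) * INPUT_RATE + (output_tokens / 1e6) * OUTPUT_RATE

# One million requests averaging 500 input + 300 output tokens each
cost = estimate_cost(500 * 1_000_000, 300 * 1_000_000)
print(f"${cost:,.2f}")  # 800M tokens at $0.88/M = $704.00
```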
Together AI vs Alternatives
| Provider | Cost | Model Selection | API Compat | Latency | Notes |
|----------|------|----------------|-----------|---------|-------|
| Together AI | Low | 100+ open | OpenAI | Fast | Broad model library |
| Groq | Very Low | Limited | OpenAI | Very Fast | Custom LPU hardware |
| Fireworks AI | Low | 50+ open | OpenAI | Fast | Good for code models |
| OpenAI | High | GPT-4o/o1/o3 | Native | Fast | Proprietary only |
| Self-hosted | Compute cost | Any | OpenAI | Variable | Full control |
Together AI is the inference cloud that makes open-weight models as accessible as OpenAI's API at a fraction of the cost — by providing a production-grade, OpenAI-compatible inference layer over the best open-source models, Together AI enables teams to build cost-effective AI applications without managing GPU infrastructure or serving frameworks.