Instructor

Keywords: instructor,structured,pydantic

Instructor is a Python library that makes LLMs return validated Pydantic models by patching official provider SDKs. It combines JSON mode, function calling, and automatic retry-with-error-feedback into a single interface, making structured LLM output as simple as defining a Python class and as reliable as a typed API endpoint.

What Is Instructor?

- Definition: An open-source Python library (by Jason Liu, 2023) that wraps OpenAI, Anthropic, Google, and other LLM provider SDKs to add a response_model parameter — specify any Pydantic BaseModel subclass and Instructor guarantees the LLM response parses into a valid instance of that class.
- Core Mechanism: Instructor uses the provider's native structured output mechanism (OpenAI JSON mode, function calling, or tool use) and adds Pydantic validation on top — if validation fails, it automatically re-prompts the LLM with the validation error message and retries.
- Pydantic Integration: Every field definition, validator, and description in your Pydantic model becomes a prompt signal — Field(description="Must be a positive integer") is automatically included in the schema sent to the LLM.
- Automatic Retries: Configure max_retries=3 and Instructor handles the retry loop — catching Pydantic ValidationErrors, formatting them as feedback to the LLM, and requesting a corrected response.
- Multi-Provider: Supports OpenAI, Anthropic Claude, Google Gemini, Cohere, Mistral, Ollama, and any OpenAI-compatible endpoint — same code, different providers.
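The schema-as-prompt mechanism described above can be seen with plain Pydantic and no API call. The sketch below (model and field names are illustrative) shows how field descriptions and constraints end up in the JSON schema that Instructor sends to the provider:

```python
# Sketch: how Pydantic field metadata becomes the JSON schema Instructor
# sends to the provider (pure Pydantic, no API call; names illustrative).
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    total: float = Field(gt=0, description="Must be a positive amount")
    vendor: str = Field(description="Legal name of the issuing vendor")

schema = Invoice.model_json_schema()
# Both the description and the gt=0 constraint appear in the schema,
# so the LLM sees them as part of the expected output format.
```

Because the schema is derived entirely from the model definition, there is no separate prompt template to keep in sync with your types.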

Why Instructor Matters

- Developer Ergonomics: Defining a Pydantic model is already standard Python practice — Instructor makes it the complete interface for LLM structured output, requiring zero prompt engineering for format compliance.
- Validation as Specification: Pydantic validators serve as both input specification and output guarantee — a @field_validator("age") method such as age_must_be_positive becomes both documentation and enforcement.
- Streaming Support: Stream Pydantic model instances as they generate — useful for progressive UI updates where you want to show partial results as the LLM generates each field.
- Observability Integration: First-class integration with Langfuse, Logfire, and OpenTelemetry — every Instructor call is automatically traced with input schema, output, validation errors, and retry count.
- Widely Adopted: One of the most-starred structured output libraries on GitHub — used by thousands of production applications for data extraction, classification, and agent tool responses.
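The "validation as specification" idea can be sketched with plain Pydantic (Person here is a hypothetical extraction target, not Instructor API):

```python
# Sketch: a Pydantic validator acting as both specification and enforcement
# (pure Pydantic; Person is a hypothetical extraction target).
from pydantic import BaseModel, ValidationError, field_validator

class Person(BaseModel):
    name: str
    age: int

    @field_validator("age")
    @classmethod
    def age_must_be_positive(cls, v: int) -> int:
        # The documented constraint is also enforced at parse time
        if v < 0:
            raise ValueError("age must be non-negative")
        return v

ok = Person(name="Ada", age=36)   # passes validation
try:
    Person(name="Ada", age=-1)    # fails: in Instructor, this triggers a retry
except ValidationError as e:
    error_message = str(e)        # this text is what gets fed back to the LLM
```

When such a model is passed as response_model, a failing validator produces exactly the error text that Instructor forwards to the LLM on retry.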

Core Usage Pattern

```python
import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field

client = instructor.from_anthropic(Anthropic())

class Person(BaseModel):
    name: str = Field(description="Full name of the person")
    age: int = Field(ge=0, le=150, description="Age in years")
    occupation: str

person = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{"role": "user", "content": "Extract: John Smith, 34, works as a software engineer"}],
    response_model=Person,
)
# person.name == "John Smith", person.age == 34, always a valid Person
```

Advanced Instructor Features

Nested Models:
```python
class Address(BaseModel):
    street: str
    city: str
    country: str

class Company(BaseModel):
    name: str
    headquarters: Address    # Nested Pydantic model works automatically
    employees: list[Person]  # List of models also works
```
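Why nesting "just works" is visible in the generated schema: Pydantic emits each nested model once under $defs and references it, and Instructor sends that combined schema to the provider. A minimal sketch (pure Pydantic, mirroring the models above):

```python
# Sketch: nested models compile into a single JSON schema with $defs
# references (pure Pydantic; mirrors the Address/Company models above).
from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str
    country: str

class Company(BaseModel):
    name: str
    headquarters: Address

schema = Company.model_json_schema()
# Address is emitted once under $defs and referenced from headquarters
```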

Partial Streaming:
```python
from instructor import Partial

for partial_person in client.messages.create(..., stream=True, response_model=Partial[Person]):
    print(partial_person)  # Progressive output as fields generate
```
(Use response_model=Iterable[Person] instead to stream a sequence of complete objects.)

Validation with Feedback:
When the LLM outputs "age": "thirty-four", Pydantic raises a ValidationError (age must be a valid integer). Instructor automatically appends a message like "The previous response had a validation error: age must be a valid integer. Please correct and retry." and the LLM self-corrects without developer intervention.

Instructor vs Alternatives

| Feature | Instructor | Outlines | Guidance | Raw JSON mode |
|---------|-----------|---------|---------|--------------|
| Pydantic integration | Native | Good | Limited | Manual |
| API model support | Excellent | Limited | Good | Full |
| Retry on failure | Automatic | N/A | N/A | Manual |
| Learning curve | Very low | Low | Medium | Low |
| Streaming | Yes | No | Limited | Manual |
| Validation feedback | Yes (auto) | No | No | No |

Common Use Cases

- Document Extraction: Extract invoices, contracts, and reports into typed Python objects for downstream processing.
- Classification: Multi-label classification with `Literal` type hints — `category: Literal["tech", "sports", "politics"]`.
- Agent Tool Responses: Ensure tool-calling agents return well-formed tool results that downstream functions can consume without error handling.
- Data Pipeline ETL: Transform unstructured text sources into structured database records with guaranteed schema compliance.
- API Response Generation: Build LLM-powered API endpoints that always return valid JSON matching your OpenAPI schema.
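The Literal-based classification pattern can be sketched with plain Pydantic (model and label names are illustrative):

```python
# Sketch: a Literal field becomes an enum constraint in the JSON schema,
# restricting the LLM to the allowed labels (pure Pydantic; names illustrative).
from typing import Literal
from pydantic import BaseModel, ValidationError

class ArticleLabel(BaseModel):
    category: Literal["tech", "sports", "politics"]

schema = ArticleLabel.model_json_schema()
labels = schema["properties"]["category"]["enum"]  # allowed values in schema

valid = ArticleLabel(category="tech")
try:
    ArticleLabel(category="cooking")  # rejected; would trigger a retry
    rejected = False
except ValidationError:
    rejected = True
```

Passing such a model as response_model turns a free-text classification prompt into a closed-set choice enforced at parse time.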

Instructor is the simplest path from a Pydantic model to reliable structured LLM output. By leveraging the validation infrastructure Python developers already use daily, it makes LLM-powered data extraction and classification as trustworthy and maintainable as any other typed function in a production codebase.
