Promptfoo

Keywords: promptfoo, testing, eval

Promptfoo is an open-source command-line tool for systematically testing and evaluating LLM prompts across multiple models and providers — it lets developers define test cases in YAML, run them against OpenAI, Anthropic, Ollama, and other providers simultaneously, and get quantitative scores that replace "vibes-based" prompt engineering with data-driven iteration.

What Is Promptfoo?

- Definition: An open-source CLI tool and library (MIT license, 4,000+ GitHub stars) that runs structured evaluations of LLM prompts — taking test case inputs, running them through one or more models, applying scoring assertions (regex match, LLM-as-judge, semantic similarity, custom Python/JavaScript functions), and producing a comparison report.
- YAML-First Configuration: Evaluations are defined in a promptfooconfig.yaml file — prompts, providers, test cases, and assertions are all declarative, making evaluations version-controllable and reproducible.
- Multi-Provider Testing: Run the same prompt through GPT-4o, Claude 3.5 Sonnet, Llama 3, and a local Ollama model in a single command — compare quality and cost across providers simultaneously (see the provider sketch after this list).
- Assertion Types: Built-in assertions include exact string match, regex, cosine similarity, LLM-based quality scoring (LLM-as-judge), and arbitrary JavaScript/Python evaluation functions.
- CI/CD Integration: Runs as a CLI command (npx promptfoo eval) — integrates into GitHub Actions, GitLab CI, or any pipeline to catch prompt regressions automatically.
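
Provider entries in promptfooconfig.yaml can be plain strings or objects carrying per-provider options. A minimal sketch, assuming the standard id/config object form (the option values below are illustrative, not recommendations):

```yaml
# Provider entries: shorthand strings or objects with per-provider options.
# The temperature and max_tokens values here are placeholders.
providers:
  - openai:gpt-4o                             # shorthand string form
  - id: anthropic:claude-3-5-haiku-20241022   # object form with options
    config:
      temperature: 0.2
      max_tokens: 256
```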

Why Promptfoo Matters

- Systematic vs Ad-Hoc Testing: Most prompt development involves manually trying a few examples and deciding "that looks good." Promptfoo forces definition of test cases upfront and evaluates them all consistently — the same discipline software testing brings to code.
- Multi-Model Comparison: Evaluating GPT-4o vs Claude 3.5 Haiku on your specific use case is one command — real performance data on your actual task replaces benchmark comparisons that may not generalize.
- Red Teaming: Built-in adversarial test generation for safety testing — promptfoo can automatically generate jailbreak attempts, prompt injection attacks, and bias-revealing inputs to identify vulnerabilities before deployment.
- Cost Visibility: Each test run reports token usage and estimated cost per provider — model selection becomes a cost/quality optimization with real numbers.
- Open Source and Self-Hosted: No data leaves your environment — test proprietary prompts without concerns about model providers training on your evaluation data.

Core Usage

Basic Configuration (promptfooconfig.yaml):
```yaml
prompts:
  - "Summarize the following in one sentence: {{input}}"
  - "Provide a concise one-sentence summary of: {{input}}"

providers:
  - openai:gpt-4o
  - anthropic:claude-3-5-haiku-20241022
  - ollama:llama3

tests:
  - vars:
      input: "The quick brown fox jumps over the lazy dog near the riverbank."
    assert:
      - type: contains
        value: "fox"
      - type: llm-rubric
        value: "Is the summary accurate and under 20 words?"
  - vars:
      input: "Quarterly earnings exceeded analyst expectations by 15% on strong cloud revenue."
    assert:
      - type: regex
        value: "earnings|revenue|quarter"
```

Run with: npx promptfoo eval
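
After a run completes, npx promptfoo view opens a local web viewer for browsing outputs side by side across prompts and providers.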

Assertion Types

- contains: Response must include a specific substring — simple factual checks.
- regex: Response must match a regular expression — structured data extraction validation.
- llm-rubric: An LLM grades the response against a natural-language criterion — flexible quality assessment.
- similar: Cosine similarity to a reference answer above a threshold — semantic correctness without exact match.
- javascript: Custom JavaScript function — any logic expressible in JS.
- python: Custom Python function — leverage any Python library for evaluation.
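
To illustrate the last three types, a single test case can combine a similarity check with an inline JavaScript expression. A minimal sketch, where the reference answer, threshold, and length limit are placeholder values:

```yaml
tests:
  - vars:
      input: "The quick brown fox jumps over the lazy dog near the riverbank."
    assert:
      - type: similar
        value: "A fox jumps over a dog by the river."  # reference answer (placeholder)
        threshold: 0.8                                 # cosine similarity cutoff (placeholder)
      - type: javascript
        value: "output.length < 200"                   # inline expression; `output` is the model response
```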

Red Teaming:
```yaml
redteam:
  plugins:
    - harmful:hate    # Test for hate speech generation
    - jailbreak       # Test jailbreak resistance
    - pii:direct      # Test PII leakage
  strategies:
    - jailbreak
    - prompt-injection
```
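
Recent promptfoo releases drive these scans through the redteam subcommand: npx promptfoo redteam run generates the adversarial test cases and evaluates the target against them.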

CI/CD Integration:
```yaml
# .github/workflows/eval.yml
- name: Run LLM Evals
  run: npx promptfoo eval --ci
  # Fails if any assertion fails — blocks PR merge
```
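
For context, a complete minimal workflow might look like the following sketch; the pull_request trigger, Node version, and secret names are assumptions to adapt to your repository:

```yaml
# .github/workflows/eval.yml -- minimal sketch; the trigger, Node version,
# and secret names are assumptions, not promptfoo requirements.
name: LLM Evals
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run LLM Evals
        run: npx promptfoo eval --ci   # same command as the step above
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```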

Promptfoo vs Alternatives

| Feature | Promptfoo | Braintrust | DeepEval | Langfuse |
|---------|----------|-----------|---------|---------|
| Open source | Yes (MIT) | No | Yes | Yes |
| CLI-first | Yes | No | Yes (pytest) | No |
| Multi-provider | Excellent | Good | Good | Good |
| Red teaming | Built-in | No | Limited | No |
| CI/CD integration | Excellent | Good | Good | Good |
| Setup time | Minutes | Hours | Hours | Hours |

Promptfoo is the open-source evaluation tool that brings test-driven development discipline to prompt engineering — by making it trivial to define test cases, run them across multiple models, and integrate evaluation into CI/CD, promptfoo enables any developer to replace subjective prompt quality judgments with objective, reproducible, data-driven iteration.
