Promptfoo

Keywords: promptfoo, testing, eval

Promptfoo is an open-source command-line tool for systematically testing and evaluating LLM prompts across multiple models and providers — it lets developers define test cases in YAML, run them against OpenAI, Anthropic, Ollama, and other providers simultaneously, and get quantitative scores that replace "vibes-based" prompt engineering with data-driven iteration.

What Is Promptfoo?

- Definition: An open-source CLI tool and library (MIT license, 4,000+ GitHub stars) that runs structured evaluations of LLM prompts — taking test case inputs, running them through one or more models, applying scoring assertions (regex match, LLM-as-judge, semantic similarity, custom Python/JavaScript functions), and producing a comparison report.
- YAML-First Configuration: Evaluations are defined in a promptfooconfig.yaml file — prompts, providers, test cases, and assertions are all declarative, making evaluations version-controllable and reproducible.
- Multi-Provider Testing: Run the same prompt through GPT-4o, Claude 3.5 Sonnet, Llama 3, and a local Ollama model in a single command — compare quality and cost across providers simultaneously (see the provider sketch after this list).
- Assertion Types: Built-in assertions include exact string match, regex, cosine similarity, LLM-based quality scoring (LLM-as-judge), and arbitrary JavaScript/Python evaluation functions.
- CI/CD Integration: Runs as a CLI command (npx promptfoo eval) — integrates into GitHub Actions, GitLab CI, or any pipeline to catch prompt regressions automatically.
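
Provider entries in promptfooconfig.yaml can be plain strings or objects carrying per-provider options. A minimal sketch, assuming the standard id/config object form (the option values below are illustrative, not recommendations):

```yaml
# Provider entries: shorthand strings or objects with per-provider options.
# The temperature and max_tokens values here are placeholders.
providers:
  - openai:gpt-4o                             # shorthand string form
  - id: anthropic:claude-3-5-haiku-20241022   # object form with options
    config:
      temperature: 0.2
      max_tokens: 256
```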

Why Promptfoo Matters

- Systematic vs Ad-Hoc Testing: Most prompt development involves manually trying a few examples and deciding "that looks good." Promptfoo forces definition of test cases upfront and evaluates them all consistently — the same discipline software testing brings to code.
- Multi-Model Comparison: Evaluating GPT-4o vs Claude 3.5 Haiku on your specific use case is one command — real performance data on your actual task replaces benchmark comparisons that may not generalize.
- Red Teaming: Built-in adversarial test generation for safety testing — promptfoo can automatically generate jailbreak attempts, prompt injection attacks, and bias-revealing inputs to identify vulnerabilities before deployment.
- Cost Visibility: Each test run reports token usage and estimated cost per provider — model selection becomes a cost/quality optimization with real numbers.
- Open Source and Self-Hosted: No data leaves your environment — test proprietary prompts without concerns about model providers training on your evaluation data.

Core Usage

Basic Configuration (promptfooconfig.yaml):
```yaml
prompts:
  - "Summarize the following in one sentence: {{input}}"
  - "Provide a concise one-sentence summary of: {{input}}"

providers:
  - openai:gpt-4o
  - anthropic:claude-3-5-haiku-20241022
  - ollama:llama3

tests:
  - vars:
      input: "The quick brown fox jumps over the lazy dog near the riverbank."
    assert:
      - type: contains
        value: "fox"
      - type: llm-rubric
        value: "Is the summary accurate and under 20 words?"
  - vars:
      input: "Quarterly earnings exceeded analyst expectations by 15% on strong cloud revenue."
    assert:
      - type: regex
        value: "earnings|revenue|quarter"
```

Run with: npx promptfoo eval
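
After a run completes, npx promptfoo view opens a local web viewer for browsing outputs side by side across prompts and providers.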

Assertion Types

- contains: Response must include a specific substring — simple factual checks.
- regex: Response must match a regular expression — structured data extraction validation.
- llm-rubric: An LLM grades the response against a natural-language criterion — flexible quality assessment.
- similar: Cosine similarity to a reference answer above a threshold — semantic correctness without exact match.
- javascript: Custom JavaScript function — any logic expressible in JS.
- python: Custom Python function — leverage any Python library for evaluation.
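
To illustrate the last three types, a single test case can combine a similarity check with an inline JavaScript expression. A minimal sketch, where the reference answer, threshold, and length limit are placeholder values:

```yaml
tests:
  - vars:
      input: "The quick brown fox jumps over the lazy dog near the riverbank."
    assert:
      - type: similar
        value: "A fox jumps over a dog by the river."  # reference answer (placeholder)
        threshold: 0.8                                 # cosine similarity cutoff (placeholder)
      - type: javascript
        value: "output.length < 200"                   # inline expression; `output` is the model response
```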

Red Teaming:
```yaml
redteam:
  plugins:
    - harmful:hate    # Test for hate speech generation
    - jailbreak       # Test jailbreak resistance
    - pii:direct      # Test PII leakage
  strategies:
    - jailbreak
    - prompt-injection
```
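
Recent promptfoo releases drive these scans through the redteam subcommand: npx promptfoo redteam run generates the adversarial test cases and evaluates the target against them.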

CI/CD Integration:
```yaml
# .github/workflows/eval.yml
- name: Run LLM Evals
  run: npx promptfoo eval --ci
  # Fails if any assertion fails — blocks PR merge
```
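
For context, a complete minimal workflow might look like the following sketch; the pull_request trigger, Node version, and secret names are assumptions to adapt to your repository:

```yaml
# .github/workflows/eval.yml -- minimal sketch; the trigger, Node version,
# and secret names are assumptions, not promptfoo requirements.
name: LLM Evals
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run LLM Evals
        run: npx promptfoo eval --ci   # same command as the step above
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```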

Promptfoo vs Alternatives

| Feature | Promptfoo | Braintrust | DeepEval | Langfuse |
|---------|----------|-----------|---------|---------|
| Open source | Yes (MIT) | No | Yes | Yes |
| CLI-first | Yes | No | Yes (pytest) | No |
| Multi-provider | Excellent | Good | Good | Good |
| Red teaming | Built-in | No | Limited | No |
| CI/CD integration | Excellent | Good | Good | Good |
| Setup time | Minutes | Hours | Hours | Hours |

Promptfoo is the open-source evaluation tool that brings test-driven development discipline to prompt engineering — by making it trivial to define test cases, run them across multiple models, and integrate evaluation into CI/CD, promptfoo enables any developer to replace subjective prompt quality judgments with objective, reproducible, data-driven iteration.
