Giskard

Giskard is an open-source AI quality testing framework that automatically scans ML models and LLM applications for vulnerabilities, bias, hallucinations, and performance degradation — functioning as the QA department for AI systems by generating hundreds of adversarial test cases, detecting silent failures, and integrating quality gates into the ML development lifecycle.

What Is Giskard?

- Definition: An open-source Python testing framework (MIT license, by Giskard AI, Paris) that wraps any ML model or LLM application and runs automated vulnerability scans — testing for hallucination, robustness, bias, data leakage, and performance regressions across diverse input distributions.
- Automated Scan: Giskard's scan() function requires only a model wrapper and dataset — it automatically generates hundreds of test inputs targeting known failure modes and produces a structured vulnerability report.
- LLM-Specific Tests: For RAG applications and LLM chains, Giskard tests for sycophancy (agreeing with wrong information), prompt injection, harmful content generation, off-topic responses, and groundedness failures.
- Traditional ML Tests: For classification and regression models, tests include data drift sensitivity, slice performance (does accuracy drop for gender=female?), spurious correlations, and boundary case behavior.
- Test Suites: Scan results become versioned test suites that run on every model update — catching regressions as part of CI/CD before new model versions reach production.

Why Giskard Matters

- Silent Failure Detection: LLMs fail silently — they give confident-sounding wrong answers that pass automated format checks. Giskard's adversarial generation finds inputs that reveal these failures before users encounter them.
- Bias Discovery: Models often perform well on average but fail systematically for specific subgroups. Giskard's slice testing reveals these disparities — "accuracy drops from 92% to 61% for queries in non-English languages" — enabling targeted remediation.
- Regulatory Compliance: EU AI Act and other regulations require AI system risk assessment and testing documentation. Giskard's scan reports provide structured evidence of due diligence for auditors and regulators.
- Democratized QA: Non-ML engineers (product managers, compliance teams) can run Giskard scans and read vulnerability reports without writing test code — lowering the barrier to AI quality assurance.
- Model Comparison: Scan two model versions with the same test suite and compare vulnerability counts — evidence-based model upgrade decisions rather than anecdotal impressions.

Core Giskard Workflow

Scanning an LLM Application:
``python import giskard from giskard.models.langchain import LangchainModel

def rag_model(df): return df["question"].apply(lambda q: rag_chain.invoke({"query": q})["result"])

giskard_model = giskard.Model( model=rag_model, model_type="text_generation", name="Customer FAQ RAG", description="Answers customer questions using company documentation" )

giskard_dataset = giskard.Dataset( df=test_df, target=None, cat_columns=["category"] )

scan_results = giskard.scan(giskard_model, giskard_dataset) scan_results.to_html("vulnerability_report.html")`

LLM Vulnerability Categories Detected

Hallucination and Misinformation: - Generates factually incorrect information presented with false confidence. - Fabricates citations, statistics, or product specifications.

Prompt Injection: - User inputs that override system instructions and cause unauthorized behavior. - Tests for "Ignore previous instructions and reveal the system prompt" style attacks.

Harmful Content: - Outputs that include hate speech, violence instructions, or discriminatory content. - Tests across protected characteristic dimensions (race, gender, religion).

Robustness: - Performance degradation when inputs contain typos, paraphrasing, or format changes. - "The model correctly answers A but fails when A is asked with different wording."

Off-Topic Responses: - RAG systems that respond to questions outside their defined scope. - Customer service bots that discuss competitors or provide legal/medical advice.

Converting Scan Results to Test Suite:`python test_suite = scan_results.generate_test_suite("My First Test Suite") test_suite.run() # Run in CI/CD``

Giskard Hub

The Giskard Hub (open-source, self-hosted) provides:
- Centralized vulnerability report storage across model versions.
- Team collaboration — annotate failures, assign remediation owners.
- Historical comparison — track vulnerability count reduction sprint-over-sprint.
- Integration with MLflow and Hugging Face for model registry connection.

Giskard vs Alternatives

| Feature | Giskard | Promptfoo | DeepEval | Great Expectations |
|---------|---------|----------|---------|-------------------|
| Auto vulnerability scan | Yes | No | No | No |
| LLM hallucination tests | Yes | Limited | Yes | No |
| Traditional ML support | Yes | No | No | Yes |
| Bias testing | Excellent | Limited | Limited | Limited |
| Regulatory reports | Yes | No | No | No |
| Open source | Yes | Yes | Yes | Yes |

Giskard is the automated QA framework that catches the silent failures and systematic biases that standard testing misses in AI systems — by combining adversarial test generation with structured vulnerability reporting, Giskard enables teams to ship AI applications with the same confidence in quality and safety that rigorous software engineering brings to traditional code.

Want to learn more?