Fuzzing with LLMs combines fuzz testing (automated generation of test inputs) with large language models to produce diverse, semantically meaningful inputs that explore program behavior and uncover bugs. By drawing on an LLM's familiarity with code structure, input formats, and common bug patterns, these campaigns can be more effective than purely random fuzzing.
What Is Fuzzing?
- Fuzz testing: Automatically generating random or semi-random inputs to test programs — looking for crashes, hangs, assertion failures, or security vulnerabilities.
- Traditional fuzzing: Random byte mutations, grammar-based generation, or coverage-guided evolution.
- Goal: Find bugs by exploring unusual, unexpected, or malicious inputs that developers didn't anticipate.
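To make "fuzz testing" concrete, here is a minimal sketch of a traditional dumb fuzzer: random bytes thrown at a parser, with exceptions standing in for crashes. The `parse` target is hypothetical, chosen only for illustration.

```python
import json
import random

def random_bytes(max_len=64):
    """Classic 'dumb' fuzzing input: a random byte string."""
    return bytes(random.randrange(256) for _ in range(random.randrange(max_len)))

def fuzz_once(target, data):
    """Run the target on one input; return the exception if it raised."""
    try:
        target(data)
        return None
    except Exception as exc:  # a real fuzzer would also watch for hangs and signals
        return exc

# Hypothetical target: a parser that chokes on non-UTF-8 or malformed JSON.
def parse(data):
    json.loads(data.decode("utf-8"))

crashes = [exc for _ in range(1000)
           if (exc := fuzz_once(parse, random_bytes())) is not None]
print(f"{len(crashes)} of 1000 random inputs raised an exception")
```

Nearly every random byte string fails here, which illustrates the weakness LLMs address: dumb inputs rarely get past the parser into deeper logic.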
Why Combine LLMs with Fuzzing?
- Semantic Awareness: LLMs understand input structure — generate valid JSON, SQL, code, etc., not just random bytes.
- Bug Patterns: LLMs learn common vulnerability patterns — buffer overflows, SQL injection, XSS.
- Context Understanding: LLMs can generate inputs tailored to specific code — understanding what the program expects.
- Diversity: LLMs can generate diverse inputs that explore different program paths.
How LLM-Based Fuzzing Works
1. Code Analysis: LLM analyzes the target program to understand input format and expected behavior.
2. Seed Generation: LLM generates initial test inputs based on code understanding.
```python
import json

# Target function:
def parse_json_config(json_str):
    config = json.loads(json_str)
    return config["database"]["host"]

# LLM-generated seeds:
'{"database": {"host": "localhost"}}'  # Valid
'{"database": {}}'                     # Missing "host" key
'{"database": null}'                   # Null database
'{}'                                   # Missing "database" key
'invalid json'                         # Malformed JSON
```
3. Mutation: LLM mutates seeds to create variations — adding edge cases, boundary values, malicious patterns.
4. Execution: Run program with generated inputs, monitor for crashes or errors.
5. Feedback Loop: Use execution results to guide further generation — focus on inputs that trigger new code paths or interesting behavior.
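The five steps above can be sketched as a single loop. `llm_generate` is a placeholder for a real model call; here it returns canned seeds so the sketch is runnable, and the mutation step is only indicated in a comment.

```python
import json

def llm_generate(prompt):
    """Placeholder for a real LLM call; returns canned seeds so the sketch runs."""
    return ['{"database": {"host": "localhost"}}', '{"database": null}', 'not json']

def parse_json_config(json_str):
    config = json.loads(json_str)
    return config["database"]["host"]

def fuzz_loop(target, rounds=3):
    corpus = llm_generate("Generate seed inputs for parse_json_config")  # steps 1-2
    findings = []
    for _ in range(rounds):
        survivors = []
        for inp in corpus:                                   # step 4: execution
            try:
                target(inp)
            except Exception as exc:
                findings.append((inp, type(exc).__name__))
            else:
                survivors.append(inp)
        # Step 5 (feedback): a real loop would ask the LLM to mutate survivors
        # and interesting inputs here (step 3); this sketch just reuses them.
        corpus = survivors or corpus
    return findings
```

Running `fuzz_loop(parse_json_config)` flags the null-database and malformed-JSON seeds, mirroring the seed list above.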
LLM Fuzzing Strategies
- Grammar-Aware Generation: LLM generates inputs conforming to expected grammar (JSON, XML, SQL, etc.) but with edge cases.
- Vulnerability-Targeted: LLM generates inputs designed to trigger specific vulnerability types — injection attacks, buffer overflows, integer overflows.
- Coverage-Guided: Combine with coverage feedback — LLM generates inputs to maximize code coverage.
- Semantic Mutation: LLM mutates inputs while preserving semantic validity — change values but keep structure valid.
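As an illustration of semantic mutation, the sketch below replaces one leaf value of a JSON document with a boundary value while keeping the document parseable. The `EDGE_VALUES` list is an arbitrary illustrative choice, not a standard corpus.

```python
import json
import random

# Boundary and edge values a semantically aware mutator might substitute
# (an arbitrary illustrative list).
EDGE_VALUES = ["", "a" * 10_000, 0, -1, 2**31, None, "localhost'; --"]

def semantic_mutate(json_str):
    """Replace one leaf value of a JSON document, keeping it parseable."""
    doc = json.loads(json_str)

    def walk(node):
        if isinstance(node, dict) and node:
            key = random.choice(list(node))
            if isinstance(node[key], dict):
                walk(node[key])           # descend into nested objects
            else:
                node[key] = random.choice(EDGE_VALUES)

    walk(doc)
    return json.dumps(doc)
```

The structure stays valid JSON, so the input reaches past the parser, while the unusual value exercises the code that consumes it.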
Example: SQL Injection Fuzzing
```python
# Target: Web application with SQL query
def search_users(username):
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return execute_query(query)

# LLM-generated fuzz inputs:
"admin"                                       # Normal input
"admin' OR '1'='1"                            # SQL injection attempt
"admin'; DROP TABLE users; --"                # Destructive injection
"admin' UNION SELECT password FROM users --"  # Data exfiltration
"admin' AND SLEEP(10) --"                     # Time-based blind injection

# Fuzzer detects: SQL injection vulnerability!
```
Applications
- Security Testing: Find vulnerabilities — buffer overflows, injection attacks, authentication bypasses.
- Robustness Testing: Discover crashes and hangs from unexpected inputs.
- API Testing: Generate diverse API requests to test web services.
- Compiler Testing: Generate programs to test compiler correctness and robustness.
- Protocol Testing: Generate network packets to test protocol implementations.
LLM Advantages Over Traditional Fuzzing
- Semantic Validity: Generate inputs that are structurally valid but semantically unusual — more likely to reach deep code paths.
- Targeted Generation: Focus on specific bug types or code regions — more efficient than random fuzzing.
- Format Understanding: Handle complex input formats (JSON, XML, protobuf) without manual grammar specification.
- Contextual Mutations: Mutate inputs in semantically meaningful ways — not just random bit flips.
Challenges
- Computational Cost: LLM inference is slower than traditional mutation — need to balance quality vs. speed.
- Determinism: LLMs are stochastic — may not reproduce the same inputs, complicating bug reproduction.
- Bias: LLMs may focus on common patterns, missing rare edge cases that random fuzzing would find.
- Validation: Need to verify that LLM-generated inputs are actually valid for the target program.
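One way to tackle the validation challenge is a cheap pre-check that filters unparseable LLM outputs before spending an execution on them. This sketch assumes a JSON-consuming target; other targets need other checks.

```python
import json

def is_valid_seed(candidate):
    """Cheap validity pre-check for an LLM-generated input.
    Assumes the target consumes JSON; other formats need other oracles."""
    try:
        json.loads(candidate)
        return True
    except (ValueError, TypeError):  # JSONDecodeError is a ValueError
        return False

seeds = ['{"database": {"host": "localhost"}}', 'not json', '{"database": null}']
valid = [s for s in seeds if is_valid_seed(s)]
```

Deliberately malformed inputs are still useful for fuzzing the parser itself, so a real campaign would keep rejects in a separate corpus rather than discard them.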
Hybrid Approaches
- LLM + Coverage-Guided Fuzzing: Use LLM to generate seeds, then use coverage-guided fuzzing (AFL, libFuzzer) to mutate and evolve them.
- LLM + Grammar Fuzzing: LLM generates grammar rules, traditional fuzzer uses them to generate inputs.
- LLM-Guided Mutation: LLM suggests which parts of inputs to mutate and how.
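A minimal sketch of the first hybrid: the LLM supplies a structurally valid seed, and an AFL-style "havoc" stage applies random bit flips to evolve it. The seed and flip count are illustrative.

```python
import random

def havoc_mutate(data: bytes, n_flips: int = 4) -> bytes:
    """AFL-style 'havoc' step: flip a few random bits in a seed input."""
    buf = bytearray(data)
    for _ in range(n_flips):
        if not buf:
            break
        i = random.randrange(len(buf))
        buf[i] ^= 1 << random.randrange(8)   # flip one bit of one byte
    return bytes(buf)

# The LLM supplies structurally valid seeds; the traditional fuzzer evolves them.
llm_seeds = [b'{"database": {"host": "localhost"}}']
mutated = [havoc_mutate(llm_seeds[0]) for _ in range(10)]
```

This division of labor keeps the expensive LLM calls rare (seed generation) while the cheap byte-level mutator runs millions of iterations per second.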
Tools and Frameworks
- FuzzGPT: LLM-based fuzzing framework.
- WhiteBox Fuzzing + LLM: Combine symbolic execution with LLM-generated inputs.
- AFL++ with LLM: Integrate LLMs into AFL++ fuzzing workflow.
Evaluation Metrics
- Bug Discovery Rate: How many bugs found per unit time?
- Code Coverage: What percentage of code is exercised?
- Unique Crashes: How many distinct bugs are discovered?
- Time to First Bug: How quickly is the first bug found?
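Unique-crash counting is usually done by bucketing crashes, for example by exception type and innermost raising frame. A rough Python sketch, using exception tracebacks as a stand-in for stack hashes:

```python
from collections import Counter

def crash_bucket(exc):
    """Bucket a crash by exception type and the innermost raising frame."""
    tb = exc.__traceback__
    while tb is not None and tb.tb_next is not None:  # walk to innermost frame
        tb = tb.tb_next
    loc = (tb.tb_frame.f_code.co_name, tb.tb_lineno) if tb else ("?", 0)
    return (type(exc).__name__, *loc)

def unique_crashes(inputs, target):
    """Count distinct crash buckets over a set of inputs."""
    buckets = Counter()
    for inp in inputs:
        try:
            target(inp)
        except Exception as exc:
            buckets[crash_bucket(exc)] += 1
    return buckets
```

Two malformed-JSON inputs that fail at the same parser line land in one bucket, so they count as one bug rather than two.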
Benefits
- Higher Quality Inputs: LLM-generated inputs are more likely to be semantically meaningful.
- Faster Bug Discovery: Targeted generation finds bugs faster than random fuzzing.
- Reduced Manual Effort: No need to manually write input grammars or seed corpora.
- Adaptability: LLMs can adapt to different input formats and program types.
Fuzzing with LLMs represents the next generation of automated testing — combining the thoroughness of fuzz testing with the intelligence of language models to find bugs more effectively.