To triage vulnerabilities, we use a number of tools: composable agents, workflows, zero-shot LLM calls, deep research, knowledge bases, code analysis tools. You get the idea.
But does any of it matter? We needed to know whether a simple "AI wrapper" from a competitor could achieve good-enough results with a state-of-the-art reasoning model.
So we ran a benchmark.
We assembled a dataset of vulnerabilities for testing: a human-labeled mix of real-world open source vulnerabilities, anonymized examples from consenting customers, hand-crafted cases, and synthetically generated cases.
The dataset includes both straightforward cases and challenging ones — some intentionally undecidable given the available inputs.
We measured concrete factors like false positive identification, as well as fuzzier assessments like severity classification accuracy. We intentionally graded the comparison baseline more leniently on those fuzzier measures.
Pixee's agent achieved 89% classification accuracy on the benchmark, while the vanilla SWE agent scored approximately 51%. The specialized system performed significantly better on vulnerability classification across multiple dimensions.
Performance patterns observed:

- The naive agent succeeded more often on simple, single-file cases
- It consistently failed on complex scenarios
- It tended to overestimate risk levels
- Its performance degraded when detailed framework knowledge was essential
There are two primary reasons why our purpose-built system dramatically outperforms a generic AI wrapper approach:
Large language models lack deep expertise in application security nuances. Specialized knowledge regarding Java XML parser configurations, jQuery API execution behaviors, and exploitation specifics remains underrepresented in training data.
Examples include:

- Java XML parser flag combinations and their security implications (see the sketch after this list)
- jQuery API behavior and edge cases
- Specific vulnerability patterns that are unlikely to be well-represented in training data
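The Java XML parser point is a good illustration of how fiddly these details are. As a minimal sketch, using the standard JAXP/Xerces feature URIs from public XXE-hardening guidance rather than code from our product, here is the kind of flag combination a triage agent has to reason about:

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class SafeXmlParsing {

    // Builds a DocumentBuilderFactory with the feature combination commonly
    // recommended to prevent XXE. An agent triaging an XXE finding has to know
    // which of these flags the code under review actually sets, because a
    // single missing feature can leave the parser exploitable.
    static DocumentBuilderFactory hardenedFactory() throws ParserConfigurationException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

        // Rejects DOCTYPE declarations entirely: the strongest single control.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        // If DOCTYPEs must be allowed, these disable external entity resolution.
        dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
        dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);

        // Blocks external DTD loading and XInclude-based inclusion.
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        dbf.setXIncludeAware(false);
        dbf.setExpandEntityReferences(false);

        // Enables the JAXP secure-processing limits as defense in depth.
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return dbf;
    }
}
```

An agent without this background tends to flag any XML parsing as risky; one that knows which features actually matter can check what the code sets and rule a finding in or out.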
These details matter immensely in security contexts, but are often too niche to be thoroughly covered in general training data. Domain-specific knowledge bases and structured classification matrices provide measurable advantages.
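We won't spell out our internal data model here, but purely as a hypothetical Java sketch, a structured classification entry might pair a finding type with the evidence needed to confirm or dismiss it, so the verdict rests on a defined rule rather than the model's intuition:

```java
import java.util.Map;

// Hypothetical sketch (not Pixee's actual schema): a structured
// classification rule that states what evidence confirms a finding
// and what evidence is sufficient to dismiss it as a false positive.
record ClassificationRule(
        String findingType,
        String confirmingEvidence,
        String dismissingEvidence,
        String defaultSeverity) {}

class KnowledgeBase {
    static final Map<String, ClassificationRule> RULES = Map.of(
            "xxe", new ClassificationRule(
                    "xxe",
                    "parser reachable from untrusted input and DOCTYPEs not disallowed",
                    "disallow-doctype-decl enabled on every reachable parser",
                    "high"),
            "dom-xss", new ClassificationRule(
                    "dom-xss",
                    "untrusted string reaches an HTML-interpreting jQuery sink such as $(...) or .html()",
                    "value only ever reaches a text-interpreting sink such as .text()",
                    "medium"));
}
```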
Large language models are trained to be helpful and harmless, which means they hedge when facing inconclusive evidence. They're programmed to say things like "it's probably nothing, but consult a doctor."
But in security contexts, we are the doctors, and there is no one else to call.
When evidence is inconclusive, that reluctance to commit pushes the burden back onto already-overwhelmed security analysts. In security contexts, organizations need definitive assessments rather than hedged recommendations: AI systems that can make firm, defensible decisions based on the available evidence.
While triage accuracy represents table stakes for security platforms, superior analytics alone are not enough. Anyone can hook up an LLM to a vulnerability scanner, but that doesn't create an effective security solution.
The real opportunity — and the real challenge — is building a comprehensive Resolution Layer that:
1. Integrates deeply with your company's systems and workflows
2. Understands organizational context dynamically
3. Incorporates both inductive and deductive learning capabilities
4. Makes confident, defensible decisions that security teams can trust
This is the difference between an AI wrapper and an AI system. Wrappers might handle the easy cases, but systems handle the reality of modern application security.
About the Author: Arshan Dabirsiaghi is CTO and Co-Founder of Pixee, where he leads the development of the world's first enterprise-grade automated remediation platform.