To triage vulnerabilities, we use a number of tools: composable agents, workflows, zero-shot LLM calls, deep research, knowledge bases, code analysis tools. You get the idea.
But does any of it matter? We needed to know whether a simple "AI wrapper" from a competitor could achieve good-enough results with a state-of-the-art reasoning model.
So we ran a benchmark.
We assembled a dataset of vulnerabilities for testing: a human-labeled mix of real-world open source vulnerabilities, anonymized examples from consenting customers, hand-crafted cases, and synthetically generated cases.
The dataset includes both straightforward cases and challenging ones — some intentionally undecidable given the available inputs.
We measured concrete factors like false positive identification, as well as fuzzier assessments like severity classification accuracy. We intentionally graded the comparison baseline more leniently on those fuzzier measures.
Pixee's agent achieved 89% classification accuracy on the benchmark, while the vanilla SWE agent scored approximately 51%. The specialized system performed significantly better on vulnerability classification across multiple dimensions.
Performance patterns observed:

- The naive agent succeeded more often on simple, single-file cases
- It consistently failed on complex scenarios
- It tended to overestimate risk levels
- Its performance degraded when detailed framework knowledge was essential
There are two primary reasons why our purpose-built system dramatically outperforms a generic AI wrapper approach:
Large language models lack deep expertise in application security nuances. Specialized knowledge regarding Java XML parser configurations, jQuery API execution behaviors, and exploitation specifics remains underrepresented in training data.
Examples include:

- Java XML parser flag combinations and their security implications (see the sketch after this list)
- jQuery API behavior and edge cases
- Specific vulnerability patterns that are unlikely to be well-represented in training data
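The Java XML parser point is a good illustration of how fiddly these details are. As a minimal sketch, using the standard JAXP/Xerces feature URIs from public XXE-hardening guidance rather than code from our product, here is the kind of flag combination a triage agent has to reason about:

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class SafeXmlParsing {

    // Builds a DocumentBuilderFactory with the feature combination commonly
    // recommended to prevent XXE. An agent triaging an XXE finding has to know
    // which of these flags the code under review actually sets, because a
    // single missing feature can leave the parser exploitable.
    static DocumentBuilderFactory hardenedFactory() throws ParserConfigurationException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

        // Rejects DOCTYPE declarations entirely: the strongest single control.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        // If DOCTYPEs must be allowed, these disable external entity resolution.
        dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
        dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);

        // Blocks external DTD loading and XInclude-based inclusion.
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        dbf.setXIncludeAware(false);
        dbf.setExpandEntityReferences(false);

        // Enables the JAXP secure-processing limits as defense in depth.
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return dbf;
    }
}
```

An agent without this background tends to flag any XML parsing as risky; one that knows which features actually matter can check what the code sets and rule a finding in or out.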
These details matter immensely in security contexts, but are often too niche to be thoroughly covered in general training data. Domain-specific knowledge bases and structured classification matrices provide measurable advantages.
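We won't spell out our internal data model here, but purely as a hypothetical Java sketch, a structured classification entry might pair a finding type with the evidence needed to confirm or dismiss it, so the verdict rests on a defined rule rather than the model's intuition:

```java
import java.util.Map;

// Hypothetical sketch (not Pixee's actual schema): a structured
// classification rule that states what evidence confirms a finding
// and what evidence is sufficient to dismiss it as a false positive.
record ClassificationRule(
        String findingType,
        String confirmingEvidence,
        String dismissingEvidence,
        String defaultSeverity) {}

class KnowledgeBase {
    static final Map<String, ClassificationRule> RULES = Map.of(
            "xxe", new ClassificationRule(
                    "xxe",
                    "parser reachable from untrusted input and DOCTYPEs not disallowed",
                    "disallow-doctype-decl enabled on every reachable parser",
                    "high"),
            "dom-xss", new ClassificationRule(
                    "dom-xss",
                    "untrusted string reaches an HTML-interpreting jQuery sink such as $(...) or .html()",
                    "value only ever reaches a text-interpreting sink such as .text()",
                    "medium"));
}
```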
Large language models are trained to be helpful and harmless, which means they hedge when facing inconclusive evidence. They're programmed to say things like "it's probably nothing, but consult a doctor."
But in security contexts, we are the doctors, and there is no one else to call.
When evidence is inconclusive, that reluctance to commit pushes the burden back onto already-overwhelmed security analysts. In security contexts, organizations need definitive assessments rather than hedged recommendations: AI systems that can make firm, defensible decisions based on the available evidence.
While triage accuracy represents table stakes for security platforms, superior analytics alone are not enough. Anyone can hook up an LLM to a vulnerability scanner, but that doesn't create an effective security solution.
The real opportunity — and the real challenge — is building a comprehensive Resolution Layer that:
1. Integrates deeply with your company's systems and workflows
2. Understands organizational context dynamically
3. Incorporates both inductive and deductive learning capabilities
4. Makes confident, defensible decisions that security teams can trust
This is the difference between an AI wrapper and an AI system. Wrappers might handle the easy cases, but systems handle the reality of modern application security.
About the Author: Arshan Dabirsiaghi is CTO and Co-Founder of Pixee, where he leads the development of the world's first enterprise-grade automated remediation platform.