We’re proud to announce that Zellic has been awarded $1M from DARPA for our submission to the AIxCC Small Business Track↗! DARPA’s AI Cyber Challenge (AIxCC) is a two-year program where entrants will create AI systems that find and fix bugs fully automatically.
In this post, we’d like to share some of our perspectives on the challenge and discuss how they informed our eventual design. This post serves only as a high-level overview of our winning proposal; you can read the full whitepaper here↗.
Defining the Problem
In AIxCC, the goal is to create systems that autonomously discover and patch security vulnerabilities. Given that the entire Linux kernel is one of the exemplar challenge programs↗, such systems must be both incredibly general and powerful, combining breadth and depth. In some sense, truly successful systems would replicate the prowess of human researchers in a scalable, automated way.
Past attempts to create such cyber reasoning systems have failed primarily for two reasons:
- Lack of generality. The system only works for certain specific bug classes or patterns, like buffer overflows. Moreover, it cannot identify “logic bugs”: nuanced issues with business logic that have no clear “buggy/not buggy” oracle (such as a crash or hang).
- Lack of intuition. The system lacks the ability to reason fluidly. Instead, it over-relies on hardcoded heuristics (like block/edge coverage) or on rigid, slow, and undecidable techniques like constraint solving. For the same reason, it cannot reject clearly benign behaviors (e.g., false positives in web scanners).
Each of these problems contributes to the other, and both have remained unsolved despite decades of research. However, we believe recent advances in AI—in particular, LLMs—offer the opportunity to overcome both of these shortcomings. LLMs are able to reason fluidly and creatively like humans; LLMs handle ambiguity and nuance deftly; and finally, LLMs are generalists with a broad range of knowledge across many domains.
Instead of focusing on any particular technique (e.g., fuzzing or symbolic execution), we believe it is necessary to develop a generalized research system. Such a system must break down complex problems into simpler ones, prioritize among competing tasks and objectives, reason critically and creatively, and troubleshoot independently. These are the skills that empower human researchers to tackle any vulnerability research campaign, not just find memory corruption in binutils. The previous generation of cyber reasoning systems provided only the tools used by the true drivers of vulnerability research: the researchers themselves.
In other words, we must build a robot scientist, not another microscope. We call this concept an Automated Vulnerability Research System (AVRS). This system would follow something akin to the scientific method, iteratively performing experiments and refining its knowledge about the program as it looks for bugs.
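To make that loop concrete, here is a minimal sketch of the hypothesize-experiment-refine cycle. The `Hypothesis` and `Knowledge` types and the `propose`/`experiment` callables are purely illustrative stand-ins, not pieces of our actual system.

```python
from dataclasses import dataclass, field

# Illustrative types only; a real system tracks much richer state.
@dataclass
class Hypothesis:
    description: str
    priority: float

@dataclass
class Knowledge:
    notes: list[str] = field(default_factory=list)

    def update(self, hypothesis: Hypothesis, outcome: str) -> None:
        self.notes.append(f"{hypothesis.description}: {outcome}")

def research_loop(propose, experiment, knowledge: Knowledge, budget: int) -> list[str]:
    """Iterate the hypothesize -> experiment -> refine cycle until the budget is spent."""
    findings = []
    for _ in range(budget):
        hypotheses = propose(knowledge)              # 1. hypothesize about the target
        if not hypotheses:
            break
        best = max(hypotheses, key=lambda h: h.priority)
        outcome = experiment(best)                   # 2. run a concrete experiment
        knowledge.update(best, outcome)              # 3. refine the knowledge base
        if "vulnerable" in outcome:
            findings.append(outcome)
    return findings
```

The important part is the shape of the loop: every iteration produces new knowledge about the target, whether or not it produces a bug.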
What does a successful AVRS look like? It must address several challenges to emulate a human security researcher effectively:
- Autonomy. It must operate in a fully autonomous fashion, given minimal guidance other than a target program and an overall sense of what kinds of bugs are in scope. It needs to create and execute its own plans, recursively breaking them down into smaller, more manageable steps. Results and information also need to be organized in some manner, and lines of investigation converging to the same answer need to be merged. In some sense, we need to impart the AVRS with its own executive function.
- Accuracy. It must minimize both false positives and false negatives. Due to LLM hallucinations, false positives can easily arise without careful engineering and prompting. False negatives are even harder to prevent, and require augmenting LLMs with external capabilities or structure to ensure the LLM can “see” all bugs given a limited context window.
- Extensibility. It must be inspectable, observable, debuggable, and extensible in order to keep up with the constantly evolving field of information security. This includes tool use: it must be able to use external tools at its discretion rather than rely solely on LLMs. For example, the AVRS must be able to use AFL to fuzz a program if it decides that is a good way to proceed (see the sketch after this list).
- Scalability. It must explore and analyze arbitrarily large codebases. Moreover, it must be highly parallelizable in order to target large codebases like the Linux kernel in a reasonable time frame.
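As a concrete (and heavily simplified) illustration of the extensibility point, the sketch below shows one way external tools could be exposed to the system: a minimal tool interface plus a wrapper that launches an AFL fuzzing campaign on demand. The interface, class names, and paths are hypothetical; only the `afl-fuzz` command line itself reflects standard AFL usage.

```python
import subprocess
from typing import Protocol

class Tool(Protocol):
    """Minimal interface for an external capability the AVRS may choose to invoke."""
    name: str
    def run(self, **kwargs) -> str: ...

class AFLFuzzer:
    """Wraps a standard afl-fuzz invocation so an agent can launch a fuzzing
    campaign when it decides that is the right next step."""
    name = "afl_fuzz"

    def run(self, target: str, corpus: str, outdir: str, seconds: int = 3600) -> str:
        # afl-fuzz -i <corpus> -o <findings dir> -- <target> @@  (file input via @@)
        cmd = ["afl-fuzz", "-i", corpus, "-o", outdir, "--", target, "@@"]
        try:
            subprocess.run(cmd, timeout=seconds)
        except subprocess.TimeoutExpired:
            pass  # fuzzing ran for the allotted time budget
        return f"fuzzing findings written to {outdir}"
```

An agent would pick tools like this out of a registry by name, so adding a new capability (a symbolic executor, a decompiler, a static analyzer) means writing a wrapper rather than reworking the whole system.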
This is an ambitious concept. The challenges include open research questions and novel obstacles. To assess the viability of a practical LLM-based AVRS, we conducted experiments to see whether LLMs would be a suitable foundation.
What Do We Know About LLM Reasoning?
We have been experimenting with LLMs since the public release of GPT-2 in early 2019. Even with current SOTA models like GPT-4, our internal experiments show poor reasoning performance as LLM context length grows.
Modern LLMs have demonstrated strong reasoning capabilities on small code samples.↗ However, high performance on needle-in-the-haystack evaluations does not seem to correlate with stable reasoning skills over large inputs. While some research (Liu et al.↗, Yuan et al.↗) supports this observation, we relied primarily on expert human evaluation by skilled security researchers. These experiments led us to believe that real-world performance on code analysis tasks may be worse than current benchmarks can measure quantitatively.
As a result, the trivial strategy of putting the entire codebase into a single long-context LLM is not feasible. We consider this a critical limiting factor for AI code analysis. Therefore, more sophisticated prompting and planning strategies (multi-step, recursive, self-directed, adversarial, and so on) are necessary. This mirrors how human researchers work: a person can only think carefully about so much code at once. Instead, they break targets down into a tree of problems and questions, which they approach one at a time. Given that LLMs do reason well over small pieces of information, the natural approach is to model this familiar human process.
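Here is a minimal sketch of that decomposition idea, assuming a hypothetical `ask_llm` callable (prompt in, answer out) and a crude token estimate in place of a real tokenizer:

```python
# A minimal sketch of recursive decomposition: instead of stuffing the whole
# codebase into one prompt, split it until each piece fits a context budget,
# analyze the pieces, then merge the partial answers.

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic, not a real tokenizer

def analyze(code: str, question: str, ask_llm, budget_tokens: int = 6000) -> str:
    if approx_tokens(code) <= budget_tokens:
        return ask_llm(f"{question}\n\nCode:\n{code}")
    # Split roughly in half; a real system would split on file/function boundaries.
    mid = len(code) // 2
    left = analyze(code[:mid], question, ask_llm, budget_tokens)
    right = analyze(code[mid:], question, ask_llm, budget_tokens)
    # Merge the two partial answers with another small prompt.
    return ask_llm(f"Combine these partial answers to: {question}\n\n1. {left}\n\n2. {right}")
```

Naive halving loses cross-cutting context, which is exactly why the merge step, and more generally the agent graph described below, matters.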
Task Prioritization
As with many LLM-based reasoning systems (such as AutoGPT↗ and AgentGPT↗), we take an agent-based approach to task prioritization. An agent is an LLM (e.g., GPT-4) prompted with a goal (e.g., find and patch bugs).
Agents split their goals into subgoals and spawn subagents (e.g., first try to understand what the program does). A root agent serves as an orchestrator that initiates other agents to pursue different lines of investigation.
These agents may be specialized for specific roles: they may use different models, have access to different external tools, or operate under different resource constraints.
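A heavily simplified sketch of this orchestration, assuming a hypothetical `complete(model, prompt)` callable in place of a real LLM API and with the subgoal decomposition hardcoded for brevity:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    role: str                 # e.g., "orchestrator", "code-reader", "bug-hunter"
    model: str                # agents may run on different models
    tools: list[str] = field(default_factory=list)
    max_steps: int = 20       # per-agent resource constraint

    def pursue(self, goal: str, complete) -> str:
        prompt = f"You are a {self.role}. Goal: {goal} Report your findings."
        return complete(self.model, prompt)

def orchestrate(target_description: str, complete) -> list[str]:
    root_goal = f"Find and patch bugs in: {target_description}."
    # The root agent would decompose the goal itself; subgoals are hardcoded here.
    subgoals = [
        ("code-reader", "Summarize what the program does and map its attack surface."),
        ("bug-hunter", "Look for memory-safety issues in the parsing code."),
        ("bug-hunter", "Look for logic bugs in the authentication flow."),
    ]
    reports = []
    for role, subgoal in subgoals:
        agent = Agent(role=role, model="gpt-4", tools=["read_file", "run_tests"])
        reports.append(agent.pursue(f"{root_goal} Subgoal: {subgoal}", complete))
    return reports
```

In the real system, subagents are spawned recursively rather than from a fixed list; the point here is only the shape: one orchestrator, many specialized workers.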
We organize agents into a Graph of Thoughts↗, obtaining a number of desirable properties:
- Work can be parallelized across a large number of threads.
- Semantic memory can be namespaced by task or topic (e.g., using a vector database / RAG), avoiding overloading individual agents with larger prompts than are required for a task while preserving all relevant information for the reasoning system as a whole.
- Agents can query each other, allowing for information compression and refinement.
- Work can be refined through self-correction (“Are you sure?”) and debate amongst agents↗.
- Traceability and monitoring tools allow for rapid human review of agent, tool, and prompt performance.
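To make these properties concrete, here is a rough sketch of the node structure, again with a hypothetical `run_agent(task, context)` callable standing in for an actual agent invocation:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class ThoughtNode:
    task: str
    namespace: str                                   # memory is scoped per topic
    memory: dict[str, str] = field(default_factory=dict)
    edges: list["ThoughtNode"] = field(default_factory=list)

    def query(self, question: str) -> str:
        # Neighbors answer from their own compressed memory rather than dumping
        # raw context into this node's prompt; a real system would use vector
        # search / RAG here instead of an exact-match lookup.
        return self.memory.get(question, "unknown")

def run_layer(nodes: list[ThoughtNode], run_agent) -> list[str]:
    """Execute one layer of independent nodes in parallel."""
    def work(node: ThoughtNode) -> str:
        context = {n.namespace: n.query(node.task) for n in node.edges}
        result = run_agent(node.task, context)
        node.memory[node.task] = result              # persist for later queries
        return result

    with ThreadPoolExecutor() as pool:
        return list(pool.map(work, nodes))
```

Because each node only sees compressed answers from its neighbors rather than their raw context, prompts stay small; and because nodes within a layer are independent, they can run in parallel.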
We built a prototype version of this Graph of Thoughts planning system, shown below.
Conclusion
By combining LLMs with traditional static analyses and tools, we can create an AVRS that emulates the process of a human researcher. This allows us to tackle large codebases efficiently and effectively, paving the way for autonomous vulnerability research.
Acknowledgements
Special thanks to Keegan Novik, Avi Weinstock, Alex Vanderpot, and Luna Tong for their work on Zellic’s AIxCC concept white paper. Of course, this post only provides an overview of our overall Small Business Track submission. If you’re interested, we recommend you check it out here↗.
About Us
Zellic specializes in securing emerging technologies. Our clients include Cognition Labs, Axiom, Polymarket, MyShell, and more. Our engineers bring a rich set of skills and backgrounds, including cryptography, machine learning, web and mobile security, low-level exploitation, and finance.
Contact us↗ for a security review that’s better than the rest. Real audits, not rubber stamps.