We are thrilled to announce the release of Smart Contract Fiesta↗, an open-source, high-quality dataset containing over 175 million lines of Ethereum smart contract source code. This extensive dataset covers approximately 150,000 unique contract sources across 30 million smart contracts up to March 2023, making it an invaluable resource for researchers and developers in the blockchain community.
Access the dataset here: Smart Contract Fiesta on Hugging Face↗
The Problem
Smart contracts are written in idiosyncratic, domain-specific languages like Solidity. These contracts also often implement financial products and are highly mission-critical. At the same time, there has been a lack of high-quality, comprehensive datasets available for smart contract source code. This presents a problem for security (and AI) researchers and engineers seeking to “raise the floor” for smart contract security.
There exist datasets such as Smart Contract Sanctuary↗. However, they are not necessarily comprehensive. To our knowledge, our dataset is the most complete collection of verified contracts on Ethereum mainnet as of block 16860349 (March 19, 2023).
A Look at the Dataset
Smart Contract Fiesta contains over 175 million lines of Ethereum smart contract source code. With approximately 150,000 unique contract sources across roughly 3.9 million verified smart contracts, this dataset is a valuable resource for blockchain researchers and developers.
Here’s a breakdown of the key statistics:
- Total contracts: 30,586,657
- Contracts with code available: 3,897,319 (>10%)
- Contracts with code and unique bytecode: 149,386
- Total lines of code (LoC): 177,552,050
We also analyzed the distribution of code, comments, and blank lines within the dataset:
- Code LoC: 90,562,628
- Comments LoC: 62,503,873
- Blank LoC: 24,485,549
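To put the breakdown above in proportion, here is a minimal Python sketch that converts the three line counts into percentages and checks that they add up to the reported total. The numbers are copied from the lists above; the names `LINE_COUNTS`, `REPORTED_TOTAL`, and `line_shares` are ours, chosen only for illustration.

```python
# Line counts copied from the statistics listed above.
LINE_COUNTS = {
    "code": 90_562_628,
    "comments": 62_503_873,
    "blank": 24_485_549,
}
REPORTED_TOTAL = 177_552_050  # "Total lines of code (LoC)" from the list above


def line_shares(counts):
    """Return each category's share of all lines, as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# Sanity check: the three categories should add up to the reported total.
assert sum(LINE_COUNTS.values()) == REPORTED_TOTAL

for name, pct in line_shares(LINE_COUNTS).items():
    print(f"{name:>8}: {pct:.1f}%")
```

In other words, roughly 51% of the lines are code, about 35% are comments, and about 14% are blank.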
Goals of This Dataset
We’re releasing the dataset with three main goals in mind:
- Foster novel research and tool development targeting smart contracts
- Help fortify existing tools that parse smart contract code
- Promote the principles of the Web3 community
Let’s briefly discuss each of these goals.
Fostering New Research and Tool Development
By providing a comprehensive library of smart contract code, we aim to remove data-related bottlenecks and facilitate the creation of innovative tools. Imagine training large language models (LLMs) to write smart contracts from natural language descriptions (or even legalese), or to identify and eliminate bugs. Foundational datasets like this enable many lines of research. We hope that this will make smart contract development safer and more accessible.
Robustifying Existing Tools
Smart Contract Fiesta can be used to improve existing tools by exposing them to a wide variety of smart contract code. At Zellic, we’ve already experienced the benefits. While building an in-house Solidity parser and static analyzer (which we plan to open-source!), we used Smart Contract Fiesta as a test suite. This revealed numerous unexpected, unusual, or undocumented Solidity quirks we would never have thought of. Without this stress test, these edge cases would likely have gone unnoticed. These are areas where specification and documentation can and should be improved.
Of course, we will be sure to blog about those curious oddities, so stay tuned!
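For readers who want to try the same kind of stress test on their own tooling, here is a minimal Python sketch. It is not our in-house harness: it assumes the sources are available locally as `.sol` files under some directory (the layout is an assumption), and `toy_parse` is a hypothetical stand-in for whatever parser you want to exercise.

```python
import tempfile
from pathlib import Path


def stress_test(root, parse):
    """Run `parse` on every .sol file under `root`.

    Returns a dict mapping each failing file path to a short error message.
    """
    failures = {}
    for path in sorted(Path(root).rglob("*.sol")):
        source = path.read_text(encoding="utf-8", errors="replace")
        try:
            parse(source)
        except Exception as exc:  # record the failure and keep going
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return failures


def toy_parse(source):
    """Hypothetical placeholder parser: only checks that braces balance."""
    depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched closing brace")
    if depth != 0:
        raise ValueError("unclosed brace")


# Tiny demonstration on two throwaway files instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ok.sol").write_text("contract A { }", encoding="utf-8")
    Path(tmp, "broken.sol").write_text("contract B { ", encoding="utf-8")
    failures = stress_test(tmp, toy_parse)

for path, error in failures.items():
    print(path, "->", error)
```

Collecting failures instead of stopping at the first one matters at this scale: the interesting output is the full list of inputs that break the tool, which can then be triaged by error message.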
Promoting Web3 Principles
As champions of decentralization and transparency, we believe in making data accessible and easy to use. By consolidating this vast amount of data into a single, easy-to-download dataset, we hope to lower barriers and enable the community to build even more incredible projects. Our Web3 community is fundamentally one founded by builders, innovators, and developers. We believe that helping developers is the least we can do to give back.
Interesting Insights
Based on the broad dataset-level statistics alone, we can draw two interesting conclusions.
First, over 10% of smart contracts on Ethereum are verified, which is surprisingly high. Second, the vast majority of verified contracts share identical bytecode. We encountered only ~150,000 unique code hashes, whereas there were roughly 3.9 million verified contracts.
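The two observations above follow directly from the dataset statistics. As a quick sketch, the ratios can be recomputed in a few lines of Python; the constants are the figures reported earlier in this post, and the variable names are ours.

```python
# Figures reported in the dataset statistics earlier in this post.
TOTAL_CONTRACTS = 30_586_657
VERIFIED_CONTRACTS = 3_897_319   # contracts with code available
UNIQUE_BYTECODES = 149_386       # contracts with code and unique bytecode

# Share of all deployed contracts that have verified source code.
verified_share = VERIFIED_CONTRACTS / TOTAL_CONTRACTS

# Average number of verified contracts that share each unique bytecode.
contracts_per_bytecode = VERIFIED_CONTRACTS / UNIQUE_BYTECODES

# Share of verified contracts that remain after deduplication.
unique_share = UNIQUE_BYTECODES / VERIFIED_CONTRACTS

print(f"verified share:         {verified_share:.1%}")
print(f"contracts per bytecode: {contracts_per_bytecode:.1f}")
print(f"unique share:           {unique_share:.1%}")
```

That works out to about 12.7% of contracts being verified and, on average, about 26 verified contracts per unique bytecode. The average hides the shape of the distribution, which is likely dominated by a small number of heavily cloned contracts.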
Next, we examined the frequency distribution of words in smart contract source code.
One interesting observation was the frequent occurrence of the string 0x360894a13ba1a3210667c828492db98dca3e2076cc3735a920a3ca505d382bbc (over 64,000 times). It turns out that this is the EIP-1967 implementation↗ storage slot, where upgradeable proxies store the address of their logic contract, which is why it shows up so much! There are several other addresses and unusual strings that appear with high frequency. We encourage you to explore the full histogram on Hugging Face↗.
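A word histogram like this is simple to reproduce. The sketch below is an illustration using Python's `collections.Counter`, not the exact script we ran: the tokenizer regex is an assumption (hex literals, identifiers, and decimal numbers), and the three sample sources are toy inputs standing in for the dataset.

```python
import re
from collections import Counter

# Assumed tokenizer: hex literals first, then identifiers, then decimal numbers.
TOKEN_RE = re.compile(r"0x[0-9a-fA-F]+|[A-Za-z_$][A-Za-z0-9_$]*|\d+")

# The EIP-1967 implementation slot discussed above.
EIP1967_IMPLEMENTATION_SLOT = (
    "0x360894a13ba1a3210667c828492db98dca3e2076cc3735a920a3ca505d382bbc"
)


def token_histogram(sources):
    """Count how often each token occurs across an iterable of source strings."""
    counts = Counter()
    for source in sources:
        counts.update(TOKEN_RE.findall(source))
    return counts


# Toy inputs standing in for real contract sources.
samples = [
    "bytes32 internal constant _IMPLEMENTATION_SLOT = "
    + EIP1967_IMPLEMENTATION_SLOT + ";",
    "contract Proxy { bytes32 constant SLOT = "
    + EIP1967_IMPLEMENTATION_SLOT + "; }",
    "contract Token { uint256 public totalSupply; }",
]

histogram = token_histogram(samples)
for token, count in histogram.most_common(5):
    print(count, token)
```

Because the hex-literal pattern comes first in the regex, a long constant such as the EIP-1967 slot is counted as a single token instead of being split into pieces.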
How We Did It
At Zellic, we develop and maintain an in-house fork of Geth, the mainstream Ethereum full node implementation. One of the features of this fork is that it continuously records a variety of useful information, such as a list of all addresses that have ever been involved in a transaction. It’s possible to access information like this through data providers like Dune Analytics↗. Nevertheless, we prefer to keep this in-house because it is fast to access and query. Plus, it’s free (beyond the cost of running a node).
We performed a full sync using this Geth fork and progressively built up a list of all smart contracts ever deployed. If that sounds useful to you, we’ve made available the list of all contract addresses as part of this dataset as well. We encourage you to explore—you never know what you might find↗!
Finally, using that list, we deduplicated contracts by their runtime code hash. We then cross-referenced this set of unique contracts with online smart contract repositories to assemble the dataset.
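The deduplication step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production pipeline: Ethereum's code hash is Keccak-256, which is not in the Python standard library, so the sketch uses `hashlib.sha3_256` (the NIST SHA-3 variant, which pads differently) purely as a stand-in. The grouping logic is the same, but the digests will not match on-chain code hashes. The addresses and bytecode below are made-up placeholders.

```python
import hashlib
from collections import defaultdict


def code_hash(runtime_bytecode: bytes) -> str:
    """Stand-in for the on-chain code hash.

    Assumption: real Ethereum code hashes use Keccak-256; sha3_256 is used
    here only because it ships with Python. Swap in a Keccak-256
    implementation to reproduce on-chain values.
    """
    return hashlib.sha3_256(runtime_bytecode).hexdigest()


def group_by_code_hash(contracts):
    """Group (address, runtime_bytecode) pairs by the hash of their bytecode."""
    groups = defaultdict(list)
    for address, bytecode in contracts:
        groups[code_hash(bytecode)].append(address)
    return dict(groups)


# Made-up placeholder contracts: the first two share identical runtime bytecode.
contracts = [
    ("0x" + "11" * 20, bytes.fromhex("6080604052")),
    ("0x" + "22" * 20, bytes.fromhex("6080604052")),
    ("0x" + "33" * 20, bytes.fromhex("60806040526004")),
]

groups = group_by_code_hash(contracts)

# One representative address per unique bytecode is enough for source lookup.
representatives = [addresses[0] for addresses in groups.values()]
print(len(contracts), "contracts ->", len(representatives), "unique bytecodes")
```

Keeping the full list of addresses per hash, instead of only one representative, preserves the mapping from every deployed contract back to its source.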
Get Started with Smart Contract Fiesta
We invite you to explore the potential of Smart Contract Fiesta by accessing the dataset on Hugging Face↗. Use this resource to help drive the blockchain community forward and contribute to a more secure and innovative future.