Today, I'm open sourcing the blockchain indexer software behind TzStats.com in the hope it will enable a wave of innovation in the Tezos ecosystem.
There's plenty of good indexers like Conseil, lib_indexer and even the TzScan indexer from OcamlPro, but all share the dreadful drawback that they require a heavy SQL datastore. This not only makes them resource hungry, but also slow.
In this post I will explain the motivation and technology behind a new generation of blockchain indexers. The main takeaway is that if indexers are built on top of efficient storage layers they allow for more powerful techniques to extract, verify and analyze on-chain data which results in a richer dataset and much faster access to data. A win for everybody.
This post will loosely follow the talk I gave at TQuorum Global Summit in New York City. If you are interested, you can download my slides here.
Block Explorer Use Cases
Block explorers are not only made for casual users to check if their coins are still safe or for delegators who are afraid their bakers cheat on them. These use cases are big and important, but there's more.
First, there are blockchain engineers who develop and debug new features in sandboxes every day. Also dapp developers need debugging tools and visibility into the current and past state of their contracts who run on internal testnets and on mainnet. Many useful features of dapps actually depend on full history, so explorers should provide long-term access to all historic state of each contract.
Then there's bakers and staking services who need reliable data about delegation and earnings history for calculating correct payouts and also reliably planning their bond pools and operations.
Then there's less technical user groups like auditors and regulators with strict requirements on data quality and the need to access a trusted copy of the full on-chain history in a format that's easy to digest for their spreadsheets and compliance tools. Finally, there's asset researchers and analysts who use sophisticated statistics packages to find correlations and hidden treasures.
Many people think it's the web frontend which matters, but what's actually important is a powerful API geared towards a diverse set of users which helps them achieve their individual goals with minimal effort.
Whats wrong with traditional explorers
Typical block explorers consist of an indexer that extracts on-chain data and stores it into a database, an API that queries the database and a frontend that displays the data. Traditional indexer backends share a common architecture centered around some SQL database and a relatively light-weight indexer process.
An indexer extracts, transforms and loads on-chain data into the SQL database by mapping the data into a pre-defined schema of tables with referential integrity. Indexers are optimized for simplicity and fast implementation time. The hope is that anything interesting can later be extracted using complex SQL queries.
It turns out that this one-way data extraction model has a couple of problems
- extracting complex state from a blockchain-node is expensive (especially things like listing balances across all accounts)
- SQL stores optimize for transactional workloads like finding individal rows, but are bad for analytical workloads that aggregate across columns (because full rows are loaded from disk)
- raw data storage requirements and I/O loads are extremely high since SQL data is stored row-wise
- some queries are even so expensive, they have to run offline (i.e. once a night or once a cycle) and their results have to be stored in extra tables
- most online queries to lists or joins across tables are limited by available main memory and I/O bandwidth and the only way to speed them up is replicating the database or using larger instances.
Enter: The next Generation of Indexers
The Blockwatch indexer replaces the slow and expensive SQL datastore with a high-performance columnar data store that allows for extremely fast analytical queries. It's a custom-made database for blockchain analytics. Avoiding the storage bottleneck allows for more complex data processing, in fact, this process becomes so fast, that I reconstruct the entire on-chain state including balances, grace periods, delegator set and even full snapshots in the database. State updates happen at each block, which means all balance updates are always verified and the indexer will follow chain reorganizations in real-time. On the side, because there's plenty of time, I even run complex analytical queries for global ranks and end-of-day statistics.
Because the data storage layer uses smart compression of column vectors I can store much more data points and do not have to worry about schema normalization. In fact my data model and table schemas are only designed with query efficiency in mind. I even duplicate columns and use extra boolean columns just to filter data more efficiently.
Costs and Benefits
The complex extraction process in the Blockwatch indexer comes at the cost of increased code size and longer implementation time. But that's about it. The rest of the architecture is superior along all other dimensions. On-disk data sets are much smaller even though more data points are stored, indexing performance increases tremendously, and on-line query times are consistently faster.
The biggest benefits, however, are speed and extra flexibility on the API layer. Besides the typical fat-object model of an explorer API, the Blockwatch indexer also supports a lightning fast table and a time-series API for free. Users can conveniently filter by every data field and select the specific columns they are interested in. The underlying data store never loads more data than necessary from disk. Results are streamed to clients as soon as matching entries are found in a table. Together, this results in the best end-to-end latency and shortest time-to-first-byte across the industry. Since result sets are never stored in memory you don't even need high-end servers.
For those of you who like to run your own block explorer at home, you can find the indexer and our TzStats frontend source code on Github. We also publish up-to-date Docker Images for the indexer and our frontend.
If this feels too involved, you can access all data on our hosted indexer version on api.tzstats.com. You can find all API docs here.
Please remember that the indexer comes with a community license that lets you do everything you want for personal purposes. If you use our software as a business we kindly ask you to purchase a license from us. Reach out to firstname.lastname@example.org for details.