What follows is a brief overview of the data flow within the Firehose system. We’ll then dive deeper with a detailed explanation of the various components involved.
At a high level, the Firehose consists of:
- An instrumented version of the native node process, which streams pieces of block data over a custom text-based protocol.
- `Extractor`, which reads from that stream, reassembles the chunks into a Firehose block, writes it to persistent storage, and broadcasts it downstream.
- `Relayer`, which reads the broadcast blocks to provide a live source of blocks from multiple `Extractor` instances.
- `Merger`, which bundles the Firehose blocks together in batches of 100 merged blocks and stores them to an object store or to disk.
- `Indexer`, whose job is to provide a targeted summary of the contents of each 100-blocks file, represented by an index file.
- `IndexProvider`, which knows how to unpack the index files and provide fast answers about block contents for filtering purposes.
- The `Firehose` service, which receives blocks from either a file source (merged blocks), a live source (relayer), or an indexed file source (indexer + index provider), and returns them in a “joined” manner to consumers over a gRPC connection.
In deeper detail, we’ll now see how block data is stored and moved around within the various components
that form the
Firehose stack. We will go from the very low level up to the gRPC streaming layer,
discussing the tradeoffs and benefits of each element.
Our pipeline begins with an instrumented version of the process used to sync the target blockchain.
The patch (which we internally baptized
Deep Mind) is built directly into the node’s source code, where we manually
instrument critical block and transaction processing code paths. We currently have such a patch for
Geth (and a few of its derivatives), for OpenEthereum, for Solana, and for another high-throughput chain
(which was merged into the upstream repository).
The instrumented code outputs small chunks of data using a simple text-based protocol over the standard
output pipe, chosen for simplicity, performance, and reliability. The messages sent can be seen as small events, like
`START BLOCK` or `RECORD STATE CHANGE`.
Each message contains the specific payload for its event, like the block number and block hash for a start block.
For an example Ethereum client, each such message arrives as a `DMLOG`-prefixed line on standard output.
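To make the shape of that stream concrete, here is a purely hypothetical illustration; the real event names, fields, and layout are chain- and version-specific:

```
DMLOG START_BLOCK 14690200
DMLOG RECORD_STATE_CHANGE 0xabc... balance 1 2
DMLOG END_BLOCK 14690200 0xdef...
```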
Those messages are then read by the
`Extractor` component. It is responsible for launching the instrumented process,
connecting to its standard output pipe, and reading the
`DMLOG` messages. It collects and organizes the various
small chunks, assembling state changes, calls, and transactions into a fully fledged block for a specific protocol.
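The assembly loop can be sketched in a few lines of Python (purely illustrative; the event names and fields are hypothetical, and the real `Extractor` is a much richer component):

```python
def assemble_blocks(lines):
    """Group DMLOG-style text events into per-block dicts (toy sketch)."""
    block = None
    for line in lines:
        if not line.startswith("DMLOG "):
            continue  # ignore the node's regular log output
        parts = line.split()
        event, args = parts[1], parts[2:]
        if event == "START_BLOCK":
            # a new block begins: remember its number, collect its events
            block = {"number": int(args[0]), "events": []}
        elif event == "END_BLOCK":
            # the block is complete: hand it downstream
            yield block
            block = None
        elif block is not None:
            block["events"].append((event, args))

# Hypothetical sample stream
stream = [
    "DMLOG START_BLOCK 100",
    "DMLOG RECORD_STATE_CHANGE 0xabc balance 1 2",
    "DMLOG END_BLOCK 100",
]
blocks = list(assemble_blocks(stream))
```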
The assembled block is actually a protocol-buffer-generated object, built from our protobuf definitions. Once a block has been formed, it is serialized in binary format, stored in persistent storage, and at the same time broadcast to all listeners using gRPC streaming.
By storing a persistent version of the block, we enable historical access to all blocks of the chain without relying on the native node process. And by having them easily accessible, we are able to create highly parallelized reprocessing tools to slice and dice different sections of the chain at will.
A multiplexer called the
`Relayer` connects to multiple
`Extractor` instances and receives live blocks from them.
Multiple connections provide redundancy in case an
`Extractor` crashes or needs maintenance.
The `Relayer` deduplicates incoming blocks, so it serves its own clients at the speed of the fastest `Extractor`.
We like to say that they race to push data to consumers!
The relayer then becomes the “live source” of blocks in the system, as it serves the same interface as the extractor in a simple (non-HA) setup.
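The dedup behaviour can be sketched as: consume the interleaved feeds, let the first copy of each block ID win, and drop the rest (illustrative only, not the actual implementation):

```python
def deduplicate(feeds_interleaved):
    """Yield each (block_id, payload) once, in arrival order.

    `feeds_interleaved` stands in for blocks arriving from several
    Extractor connections; whichever feed delivers a block first wins.
    """
    seen = set()
    for block_id, payload in feeds_interleaved:
        if block_id in seen:
            continue  # a slower feed delivered a block we already have
        seen.add(block_id)
        yield block_id, payload
```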
The `Merger` is responsible for creating bundles of blocks (100 per bundle) from persisted one-block files.
This improves performance and reduces storage costs through better compression, and it makes metered network access more efficient: one request fetches a whole 100-block bundle rather than a single block.
The bundled blocks become the “file source” (a.k.a. the historical source) of blocks for all components.
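The arithmetic behind bundling is simple; as a sketch (the naming scheme below is hypothetical, not the actual file layout):

```python
def bundle_base(block_num, bundle_size=100):
    """Base block number of the bundle a given block belongs to."""
    return block_num - (block_num % bundle_size)

def bundle_name(block_num, bundle_size=100):
    # Hypothetical zero-padded object name for the bundle containing
    # `block_num`; real deployments may use a different layout.
    return f"{bundle_base(block_num, bundle_size):010d}"
```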
The `Indexer` is a background process which digests merged blocks and creates targeted summaries
of their contents. It writes these summaries to object storage as index files.
The targeted summaries are variable in nature, and are generated when an incoming
Firehose query contains optional
`Transforms`, which in turn contain the desired properties of a series of blocks. The
`Transforms` can be likened to
filter expressions, and are represented by protobuf definitions.
The `Index Provider` is a chain-agnostic component whose job is to accept
Firehose queries containing `Transforms`.
It interprets these
`Transform` expressions according to their protobuf definitions, and passes them along
to chain-specific filter functions that apply the desired filtering to Blocks in the stream.
The `Index Provider` delivers knowledge about the presence (or absence!) of specific data in large ranges
of Block data. This helps us avoid unnecessary operations on merged block files.
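As a toy illustration of the idea (not the actual index format): summarise each bundle by the set of keys it contains, then consult those sets to skip bundles that cannot possibly match:

```python
class IndexProvider:
    """Toy index provider: one set of keys per 100-block bundle."""

    def __init__(self):
        self.indexes = {}  # bundle base number -> set of keys seen in it

    def add_index(self, base, keys):
        # In the real system this would be loaded from an index file.
        self.indexes[base] = set(keys)

    def bundles_to_read(self, bases, key):
        # Only bundles whose index contains the key need to be fetched.
        return [b for b in bases if key in self.indexes.get(b, set())]
```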
Finally, the last component, the one that serves the actual stream of blocks to the consumer, is the `Firehose` itself.
It connects to both a file source and a live source, and starts serving blocks to the consumer.
The sources are joined together using an intelligent “joining source” that knows when to switch over from the file source
to the live source.
As such, if a consumer’s request is for historical blocks, they are simply fetched from persistent storage,
passed inside a
ForkDB (more info about that below), and sent to the consumer with a cursor which uniquely identifies
the block as well as its position in the chain. In so doing, we can resume even from forked blocks, as they are all preserved.
The `Firehose` component also has the responsibility of filtering a block’s content according to the request’s
filter expression, represented by a
`Transform`. This filtering is achieved by applying the chain-specific filter functions described above.
Transactions that have no matching unit are removed from the block, and execution units are flagged as matching or not matching the filter expression. Block metadata is always sent, even when there are no matches, to guarantee sequentiality on the receiving end.
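A minimal Python sketch of that behaviour (hypothetical structures; the real filtering operates on protobuf objects with chain-specific matchers):

```python
def filter_block(block, matches):
    """Drop transactions with no matching unit; flag units; keep metadata.

    `matches` stands in for a chain-specific predicate derived from a
    Transform. Block-level metadata always survives, even with no matches.
    """
    kept = []
    for trx in block["transactions"]:
        units = [{"unit": u, "matched": matches(u)} for u in trx["units"]]
        if any(u["matched"] for u in units):
            kept.append({**trx, "units": units})
    return {**block, "transactions": kept}
```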
The underlying library that powers all of the components above is
`bstream` (a portmanteau of Block Stream),
available in our GitHub organization.
It is the core code within our stack which abstracts away the files and the streaming of blocks from an instrumented protocol node, to present to the user an extremely simple interface that deals with all reorgs. The library was built, tweaked and enhanced over several years with high speed and fast throughput in mind. For example, the file source has features like downloading multiple files in parallel, decoding multiple blocks in parallel, inline filtering, etc.
Within `bstream`, one of the most important elements for proper blockchain linearity is the
`ForkDB`, a graph-based
data structure that mimics the forking logic used by the native node.
The `ForkDB` receives all blocks and orders them
based on the parent-child relationship defined by the chain, keeping around active forked branches and the reorgs that occur along the way.
When a branch of blocks becomes the longest chain of blocks, the
`ForkDB` switches to it, emitting a series of
events for proper handling of forks (signaling new blocks on the winning branch and undoing those on the losing one). Active forks are
kept until a certain level of confirmation is achieved (the exact rules can be configured per chain), and dropped
when blocks become final (a.k.a. irreversible). Specific irreversibility events are emitted by the `ForkDB` as well.
Each event emitted contains the step’s type (such as new, undo, or
irreversible), the block it relates to,
and a cursor. The cursor contains the information required to reconstruct an equivalent instance of the `ForkDB`
in the correct branch, forked or canonical, enabling a perfect resume of the stream of events where the
consumer left off. To visualize it, the cursor points to a specific position in the stream of events emitted by the `ForkDB`,
and as such, in the blockchain itself.
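The fork-switching behaviour can be sketched as a toy in Python (a drastically simplified stand-in for the real `ForkDB`: finality rules and redo handling are omitted, and all names are illustrative):

```python
class ForkDB:
    """Toy fork tracker: blocks keyed by ID with parent links. When a
    longer branch appears, emit "undo" steps back to the common ancestor,
    then "new" steps up the winning branch."""

    def __init__(self):
        self.blocks = {}  # block id -> (parent id, block number)
        self.head = None

    def _chain(self, block_id):
        # Walk parent links from `block_id` back to the root, head-first.
        chain = []
        while block_id in self.blocks:
            chain.append(block_id)
            block_id = self.blocks[block_id][0]
        return chain

    def add_block(self, block_id, parent_id, number):
        self.blocks[block_id] = (parent_id, number)
        events = []
        new_chain = self._chain(block_id)
        old_chain = self._chain(self.head) if self.head else []
        if len(new_chain) > len(old_chain):
            # Undo blocks on the losing branch down to the common ancestor.
            common = set(new_chain)
            for b in old_chain:
                if b in common:
                    break
                events.append(("undo", b))
            # Announce blocks on the winning branch, oldest first.
            stale = set(old_chain)
            for b in reversed(new_chain):
                if b not in stale:
                    events.append(("new", b))
            self.head = block_id
        return events
```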
The `bstream` library is chain-agnostic, and is only concerned with the concept of a
`Block`, containing the minimal
metadata required to maintain the consistency of the chain.
A `Block` carries a protocol buffer bytes payload which is
decoded by the consumer into one of the supported chain-specific
`Block` definitions, such as those for Ethereum or Solana.
We’ve now covered everything required to understand the
Firehose data flow, from data acquisition to producing
a consumable stream of blocks.