StreamingFast Firehose design principles
Firehose was heavily inspired by large-scale data science machinery and other processes previously developed by the StreamingFast team.
Firehose was designed with the following truths and assumptions carefully taken into account.
- Flat files are more efficient than live processes that continuously consume CPU and RAM.
- Fast iteration is preferred for data processes because data is inherently messy.
- Data agility is only achievable when data processes can be parallelized.
- Clear data contracts between tasks and processes, including APIs, RPC query formats, and data model definitions, are critical.
- Maximum precision is required for defining, referencing, and identifying concepts or data models. Leave no stone unturned.
- The only guarantee in data science is that data processes change and evolve indefinitely.
- Migrating data is annoying, so careful consideration must be given to file formats, forward and backward compatibility, and performance.
StreamingFast strives to create the shortest possible path from the deterministic execution of blocks and transactions down into a flat file. The high-level goals identified for the extraction process include:
- Develop simple, robust, laser-focused processes.
- Avoid coupling extraction with indexing or any other service.
- Guarantee maximum node performance during data extraction for instrumented nodes, for all protocols.
Firehose achieves data completeness through the extraction of all available data from instrumented nodes.
Because the data collected during extraction is complete, rich, and verifiable, Firehose avoids having to revisit instrumented nodes.
During a transaction, the balance for an address could change from one value to `200`. Firehose will save the storage key that was altered, along with the previous and next values.
Forward and backward atomic updates and integrity checks are made possible due to the fidelity of data being tracked by Firehose.
In the example above, `200` should be the next changed value for the `previous_data` key. If a discrepancy is encountered, it means there is an issue with the extraction mechanism, and data quality will be negatively impacted.
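The forward and backward updates described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Firehose's actual data model: the `StorageChange` record, its field names, the `apply_forward` and `apply_backward` helpers, the storage key, and the starting balance of 150 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageChange:
    """Hypothetical record of one storage mutation (not Firehose's real schema)."""
    key: str
    old_value: int
    new_value: int


class IntegrityError(Exception):
    """Raised when a recorded change does not line up with the tracked state."""


def apply_forward(state: dict, change: StorageChange) -> None:
    # Going forward, the state must currently hold the recorded previous value.
    found = state.get(change.key)
    if found != change.old_value:
        raise IntegrityError(f"{change.key}: expected {change.old_value}, found {found}")
    state[change.key] = change.new_value


def apply_backward(state: dict, change: StorageChange) -> None:
    # Going backward, the state must currently hold the recorded next value.
    found = state.get(change.key)
    if found != change.new_value:
        raise IntegrityError(f"{change.key}: expected {change.new_value}, found {found}")
    state[change.key] = change.old_value


# The starting value 150 is an arbitrary placeholder chosen for this sketch.
state = {"balance:0xabc": 150}
change = StorageChange("balance:0xabc", old_value=150, new_value=200)

apply_forward(state, change)
after_forward = dict(state)

apply_backward(state, change)
after_backward = dict(state)
```

Because each change carries both values, the same record can be replayed in either direction, and a mismatch surfaces as an error rather than as silently corrupted state.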
Complete data means accounting for the relationships between a transaction and:
- its block's schedule,
- its execution,
- its expiration,
- the events produced by any of its side effects,
- its call tree, and each call's state transition and effects.
Detailed transaction relationship information is difficult to obtain from typical blockchain data.
Firehose provides thorough and complete transaction data so that opportunities to build data applications are not missed.
Some JSON-RPC protocols allow query requests for either transaction status or state. However, status and state aren't both available together.
Data processes triggered by Ethereum log events can benefit from knowing the source of each event. The event could have been generated by the current contract, its parent contract, or another known and trusted contract.
Without access to rich, complete data, smart contract developers are led to emit additional events. Emitting additional events leads to increased gas fees.
The lack of availability of rich data also has effects on contract design.
Contract designers are required to reason and plan out how stored data will be queried and consumed by their application.
Having access to richer external data processes allows developers to simplify contracts, reducing on-chain operation costs.
The data model used by StreamingFast to ingest protocol data was created with extreme diligence and care.
Interpreting subtleties in bits of data, for things like the meaning of a reverted call in an Ethereum call stack, becomes impossible farther downstream.
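As one illustration of such a subtlety: in the EVM, when a call fails, its own state changes and those of all its sub-calls are rolled back, even if a sub-call itself completed successfully. The sketch below derives that "effectively reverted" status from a call tree. The `Call` record and its fields are hypothetical names chosen for this example and are not taken from the Firehose data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Call:
    """Hypothetical, simplified view of one call in a transaction's call tree."""
    index: int                   # position of the call within the transaction
    parent_index: Optional[int]  # None for the top-level call
    failed: bool                 # True if this call itself reverted or errored


def effectively_reverted(calls: List[Call]) -> Dict[int, bool]:
    """Map each call index to True if the call or any of its ancestors failed."""
    by_index = {call.index: call for call in calls}
    result: Dict[int, bool] = {}
    for call in calls:
        current: Optional[Call] = call
        reverted = False
        # Walk up the tree until a failed call or the top-level call is reached.
        while current is not None:
            if current.failed:
                reverted = True
                break
            parent = current.parent_index
            current = by_index.get(parent) if parent is not None else None
        result[call.index] = reverted
    return result


calls = [
    Call(index=1, parent_index=None, failed=False),  # top-level call
    Call(index=2, parent_index=1, failed=True),      # fails
    Call(index=3, parent_index=2, failed=False),     # succeeds, but its parent failed
    Call(index=4, parent_index=1, failed=False),     # unaffected sibling
]
reverted = effectively_reverted(calls)
```

Here call 3 reports success on its own, yet its effects never reach the chain state. That information is only recoverable while the full call tree is still available.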
Firehose provides enough comprehensive data to conceptually boot and run a full archive node.
StreamingFast chose to use flat files instead of the traditional request and response model of data acquisition. Using flat files alleviates the challenges of querying pools of nodes, which are generally load-balanced and run in some type of replication configuration.
Relying on flat files reduces consistency issues, retries, and incurred latency. In addition, flat files greatly simplify the code developers must write to interface with blockchain node data.
Flat-file and data stream abstractions adhere to the Unix philosophy of writing programs that do one thing, do it well, and work together with other programs by handling streams of data as input and output.
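The stream-oriented style described above can be sketched as a chain of small stages, each doing one thing and passing records along. The example assumes a newline-delimited JSON file with `number` and `transactions` fields purely for illustration. The actual Firehose file format is not described in this section, and all function names here are hypothetical.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def read_blocks(stream: TextIO) -> Iterator[dict]:
    """Decode one block per non-empty line (assumed JSON Lines layout)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def with_transactions(blocks: Iterable[dict]) -> Iterator[dict]:
    """Pass through only the blocks that contain at least one transaction."""
    for block in blocks:
        if block.get("transactions"):
            yield block


def block_numbers(blocks: Iterable[dict]) -> Iterator[int]:
    """Reduce each block to its number."""
    for block in blocks:
        yield block["number"]


# An in-memory stand-in for a flat file on disk.
flat_file = io.StringIO(
    '{"number": 1, "transactions": []}\n'
    '{"number": 2, "transactions": ["0xaa"]}\n'
    '{"number": 3, "transactions": ["0xbb", "0xcc"]}\n'
)

numbers = list(block_numbers(with_transactions(read_blocks(flat_file))))
```

Each stage is a generator, so blocks flow through one at a time and any stage can be swapped or reused without touching the others.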
StreamingFast uses state transitions scoped to calls, indexed within transactions, indexed within a block.
The basic unit of execution always remains a single smart contract execution resulting in a single EVM call. However, calling another contract from within the first contract means a second execution will occur.
Contracts lose state precision when the state is changed in the middle of a transaction or block.
Attempting to look up a balance at the exact point it is needed, for example while processing a log event, will instead return the balance as of the end of the block. That value may already reflect a change made by a later transaction than the one currently being indexed.
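The difference between an end-of-block balance and a balance at an exact point of execution can be sketched as follows. The `BalanceChange` record, the block-wide `ordinal` counter used to order events, and the `balance_at` helper are assumptions made for this example. They are not presented as the Firehose schema, and the values are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BalanceChange:
    """Hypothetical balance change scoped to a call within a transaction."""
    tx_index: int    # transaction position within the block
    call_index: int  # call position within the transaction
    ordinal: int     # assumed block-wide, strictly increasing event counter
    address: str
    old_value: int
    new_value: int


def balance_at(changes: List[BalanceChange], address: str, ordinal: int) -> Optional[int]:
    """Return the balance of `address` as of `ordinal`, or None if it never changed."""
    relevant = sorted(
        (c for c in changes if c.address == address), key=lambda c: c.ordinal
    )
    if not relevant:
        return None
    # Before the first recorded change, the balance is that change's old value.
    balance = relevant[0].old_value
    for change in relevant:
        if change.ordinal > ordinal:
            break
        balance = change.new_value
    return balance


changes = [
    BalanceChange(tx_index=0, call_index=0, ordinal=3, address="0xabc", old_value=100, new_value=80),
    BalanceChange(tx_index=0, call_index=1, ordinal=7, address="0xabc", old_value=80, new_value=120),
    BalanceChange(tx_index=1, call_index=0, ordinal=15, address="0xabc", old_value=120, new_value=50),
]

log_ordinal = 9  # a log event emitted in the first transaction
balance_at_log = balance_at(changes, "0xabc", log_ordinal)
balance_end_of_block = balance_at(changes, "0xabc", 10**9)
```

In this sketch the balance seen by the log event is 120, while an end-of-block query would report 50 because of the later transaction.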
Consuming blockchain state is difficult and each blockchain presents its own issues.
Solidity, for example, uses `bytes32` mapping, making such a data retrieval endeavour rather opaque and difficult to reason about. This data is available with additional effort, but not easily.
Optional and required fields were removed in version 3 of Google's Protocol Buffers, which simplifies the data extraction process.