Design Principles

Firehose was heavily inspired by large-scale data science machinery and other processes previously developed by the StreamingFast team.

The Firehose "North Star"

Truths & Assumptions

Firehose was designed with the following truths and assumptions carefully taken into consideration.
  • Flat files are more efficient than live-running processes that consume CPU and RAM.
  • Fast iteration is preferred for data processes because data is inherently messy.
  • Data agility is only achievable when data processes can be parallelized.
  • Clear data contracts between tasks and processes, including APIs, RPC query formats, and data model definitions, are critical.
  • Maximum precision is required for defining, referencing, and identifying concepts or data models. Leave no stone unturned.
  • The only guarantee in data science is that data processes change and evolve indefinitely.
  • Migrating data is annoying, so careful consideration must be given to:
    • file formats,
    • forward and backward compatibility,
    • versioning,
    • and performance.

Extraction

Minding Deterministic Block Execution

StreamingFast strives to create the shortest path available from the deterministic execution of blocks and transactions down into a flat file. High-level goals surrounding the extraction process were identified and conceptualized including:
  • Develop simple, robust, laser-focused processes.
  • Create core system components including the Extractor, Merger, Relayer, and gRPC Server.
  • Avoid the coupling of extraction and indexing and any other services.
  • Guarantee maximum node performance during data extraction for instrumented nodes, for all protocols.

Data Completeness

Full Data Extraction

Firehose achieves data completeness through the extraction of all available data from instrumented nodes.
Because the data collected during extraction is complete, rich, and verifiable, Firehose avoids having to revisit instrumented nodes.

Finite Data Tracking

During a transaction, the balance for an address could change from 100 to 200. Firehose will save the storage key that was altered, and the previous and next values.
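
As a minimal sketch of this idea, each write can be captured together with its key, previous value, and next value. The `StorageChange` record and `RecordingStorage` wrapper are hypothetical names used only for illustration. They are not part of Firehose, and the key format is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StorageChange:
    """One recorded write: which key changed, from what, to what (hypothetical type)."""
    key: str
    old_value: int
    new_value: int


@dataclass
class RecordingStorage:
    """Toy key/value store that logs every write (illustrative only)."""
    values: dict = field(default_factory=dict)
    changes: list = field(default_factory=list)

    def set(self, key: str, new_value: int) -> None:
        old_value = self.values.get(key, 0)  # assumption: unset keys read as 0
        self.values[key] = new_value
        self.changes.append(StorageChange(key, old_value, new_value))


# The balance for an address changes from 100 to 200 during a transaction.
storage = RecordingStorage(values={"balance:alice": 100})
storage.set("balance:alice", 200)
print(storage.changes[0])
```

Recording the previous value alongside the next one is what makes the later integrity checks possible.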

Integrity & Fidelity

The fidelity of the data tracked by Firehose makes forward and backward atomic updates and integrity checks possible.
In the example above, the next recorded change to the same key should carry 200 as its previous value. A discrepancy indicates an issue with the extraction mechanism, which negatively impacts data quality.
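
The check described above can be sketched as follows. This is an illustration under assumptions, not Firehose code: `StorageChange`, `check_continuity`, `apply_forward`, and `apply_backward` are hypothetical names, and the changes are assumed to arrive in execution order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageChange:
    """Hypothetical record of one write to a storage key."""
    key: str
    old_value: int
    new_value: int


def check_continuity(changes):
    """Return the index of the first change whose old_value differs from the
    new_value of the previous change to the same key, or None if all match."""
    last_seen = {}
    for index, change in enumerate(changes):
        if change.key in last_seen and last_seen[change.key] != change.old_value:
            return index
        last_seen[change.key] = change.new_value
    return None


def apply_forward(state, changes):
    """Move state forward by writing each change's new value, in order."""
    for change in changes:
        state[change.key] = change.new_value
    return state


def apply_backward(state, changes):
    """Undo changes by writing each old value back, in reverse order."""
    for change in reversed(changes):
        state[change.key] = change.old_value
    return state


changes = [
    StorageChange("balance", 100, 200),
    StorageChange("balance", 200, 150),
]
state = apply_forward({"balance": 100}, changes)
rolled_back = apply_backward(dict(state), changes)

# A change claiming a previous value of 175 does not follow from 150.
broken = changes + [StorageChange("balance", 175, 300)]
print(check_continuity(changes), check_continuity(broken))
```

Because every change carries both values, the same list can be replayed in either direction, and a gap in the chain is detectable without going back to the node.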

Complete Data in Detail

Complete data means accounting for a transaction and its relationships to:
  • its block's schedule,
  • its execution,
  • its expiration,
  • events produced by any transaction side effects,
  • the transaction call tree, and each call's state transition and effects.

Transaction Relationships & Data

Detailed transaction relationship information is difficult to obtain from typical blockchain data.
Firehose provides thorough and complete transaction data so that opportunities for building data applications are not missed.

Transaction & State Data Together

Some JSON-RPC protocols support queries for either transaction status or state. However, the two are not available together.
Data processes triggered by Ethereum log events can benefit from knowing the source of an event. The event could have been generated by the current contract, its parent contract, or another known and trusted contract.

Reduced Need for Smart Contract Events

The need to access rich, complete data leads smart contract developers to emit additional events. Emitting additional events leads to increased gas fees.
Note: Enriched and complete transaction data is otherwise not readily available.

Contract Design Issues

The lack of availability of rich data also has effects on contract design.
Contract designers are required to reason and plan out how stored data will be queried and consumed by their application.

Contract Simplification & Cost Reduction

Having access to richer external data processes allows developers to simplify contracts, reducing on-chain operation costs.

Modeling With Extreme Care

Data Model for Ingestion

The data model used by StreamingFast to ingest protocol data was created with extreme diligence and care.
Tip: StreamingFast encountered several peculiarities within many protocols during the design and development process of Firehose.

Subtleties in Reverted Calls

Interpreting subtleties in the data, such as the meaning of a reverted call in an Ethereum call stack, becomes impossible further downstream.
Note: Firehose provides complete node data through carefully considered and implemented model definitions created with Protocol Buffer schemas.

Running Full Archive Nodes

Firehose provides data comprehensive enough to, conceptually, boot and run a full archive node.

Pure Data, Files & Streams

Flat Files

StreamingFast chose to use flat files instead of the traditional request and response model of data acquisition. Using flat files alleviates the challenges of querying pools of nodes, generally load-balanced, in some type of replication configuration.

Simplification with Flat Files

Relying on flat files helps reduce massive consistency issues, retries, and incurred latency. In addition, flat files greatly simplify the task developers face when writing code to interface with blockchain node data.

Adhering to the Unix Philosophy

Flat-file and data stream abstractions adhere to the Unix philosophy of writing programs that do one thing, do it well, and work together with other programs by handling streams of data as input and output.
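
The stream-oriented style described here can be sketched as a chain of small generators, each doing one thing. The JSON-lines encoding and the `number` and `transactions` field names are assumptions made to keep the example self-contained. They do not reflect the Firehose file format, which relies on Protocol Buffers.

```python
import io
import json


def read_blocks(stream):
    """Parse one JSON-encoded block per line from any file-like object."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


def only_with_transactions(blocks):
    """Pass through only the blocks that contain at least one transaction."""
    for block in blocks:
        if block["transactions"]:
            yield block


def block_numbers(blocks):
    """Reduce each block to its number."""
    for block in blocks:
        yield block["number"]


# An in-memory stand-in for a flat file on disk.
flat_file = io.StringIO(
    '{"number": 1, "transactions": []}\n'
    '{"number": 2, "transactions": ["0xabc"]}\n'
    '{"number": 3, "transactions": ["0xdef", "0x123"]}\n'
)

# Stages compose like a Unix pipeline: read | filter | project.
numbers = list(block_numbers(only_with_transactions(read_blocks(flat_file))))
print(numbers)
```

Each stage only consumes and produces a stream, so stages can be replaced or reordered independently.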

State Transition Hierarchy

StreamingFast uses state transitions scoped to calls, indexed within transactions, indexed within a block.
Tip: Blockchains typically “round up” state changes for all transactions into a block to facilitate consensus.
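
A minimal sketch of this hierarchy, assuming hypothetical `Block`, `Transaction`, `Call`, and `StateChange` types, is shown below. These are not the Firehose Protocol Buffer definitions. The point is only that each state change can be located by its transaction index and call index within a block.

```python
from dataclasses import dataclass, field


@dataclass
class StateChange:
    key: str
    old_value: int
    new_value: int


@dataclass
class Call:
    index: int  # position of the call within its transaction
    state_changes: list = field(default_factory=list)


@dataclass
class Transaction:
    index: int  # position of the transaction within its block
    calls: list = field(default_factory=list)


@dataclass
class Block:
    number: int
    transactions: list = field(default_factory=list)


def iter_state_changes(block):
    """Yield (transaction index, call index, change) in execution order."""
    for tx in block.transactions:
        for call in tx.calls:
            for change in call.state_changes:
                yield tx.index, call.index, change


block = Block(
    number=1,
    transactions=[
        Transaction(0, [
            Call(0, [StateChange("a", 1, 2)]),
            Call(1, [StateChange("b", 5, 6)]),
        ]),
        Transaction(1, [Call(0, [StateChange("a", 2, 3)])]),
    ],
)
positions = [(tx, call, change.key) for tx, call, change in iter_state_changes(block)]
print(positions)
```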

Smart Contract Execution

The basic unit of execution always remains a single smart contract execution resulting in a single EVM call. However, calling another contract from within the first contract means a second execution will occur.

Keeping Track of State

State precision is lost when the state changes in the middle of a transaction or block.
Querying a node for a balance at the exact point needed, for example while processing a log event, returns the balance as of the end of the block. That value may already reflect a later transaction in the same block than the one currently being indexed.
Important: Querying nodes can cause substantial issues for developers who need fine-grained access to node data.
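
The difference between a value at an exact point and the end-of-block value can be illustrated with a small replay. The tuple layout `(transaction index, call index, key, new value)` and the `value_at` helper are assumptions for this sketch, and the changes are assumed to be listed in execution order.

```python
def value_at(changes, key, start_value, tx_index, call_index):
    """Replay changes to `key` up to and including the given call position."""
    value = start_value
    for change_tx, change_call, change_key, new_value in changes:
        # Stop once we pass the position we are interested in.
        if (change_tx, change_call) > (tx_index, call_index):
            break
        if change_key == key:
            value = new_value
    return value


# Two transactions in the same block touch the same balance.
changes = [
    (0, 0, "balance", 200),  # transaction 0 raises the balance to 200
    (1, 0, "balance", 50),   # transaction 1 later lowers it to 50
]

# Value seen while indexing transaction 0, versus the end-of-block value.
mid_block = value_at(changes, "balance", 100, tx_index=0, call_index=0)
end_of_block = value_at(changes, "balance", 100, tx_index=1, call_index=0)
print(mid_block, end_of_block)
```

A process indexing transaction 0 needs the first value. A query answered with end-of-block state would return the second.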

Blockchain Data Consumption

Consuming blockchain state is difficult and each blockchain presents its own issues.
Solidity, for example, stores state in a bytes32 => bytes32 mapping, which makes retrieving that data rather opaque and difficult to reason about. The data is available, but only with additional effort.
Tip: Developers having access to state data presents tremendous opportunities for indexing and application development.

Protocol Buffers

Google's Protocol Buffers version 3 met the requirements StreamingFast identified for versioning, compatibility, and speed of file content.
Version 3 also removed the distinction between optional and required fields, which simplifies the data extraction process.