Title: Three Different Approaches to Proof of Data Storage and Integrity in Distributed Systems
Introduction
Since their inception, blockchain and cryptocurrency have aimed to change the financial landscape by providing broader access and eliminating intermediaries. The development of web3 has expanded the application scenarios and emphasized the potential of blockchain technology in creating an internet where creators thrive and users control their data.
Ensuring decentralization and empowering end-users requires a resilient and censorship-resistant database for data storage. Distributed data storage systems meet the demand for fault tolerance and high resilience storage by building a network of nodes to store, manage, and share data.
The need to remove central authority and distribute data in a peer-to-peer manner enhances security and transparency. While distributed storage systems rely on unused storage capacity, they face challenges in terms of regulations and interoperability.
Proof of data storage and integrity is crucial for an open and accessible network. Deterministic and probabilistic proofs are two types of proofs used to address the issue of how data is stored and maintained.
Tagion
Architecture
Tagion is a decentralized network focused on high-capacity transactions and aims to establish a unique currency system based on technology and democratic governance. It relies on an innovative database architecture and encryption technology to achieve scalability. Tagion’s proof mechanism is an example of deterministic proof.
The core function of the DART database is to store data based on hash keys. As information storage increases, the structure naturally generates more branches, with each branch supporting up to 256 combined records and sub-branches.
In addition to the distributed hash table, Tagion’s infrastructure can be understood as a Sparse Merkle Tree (SMT). SMT is an authentication data structure based on key-value pairs, supporting standard database operations such as lookup, insertion, update, and deletion. Each key-value pair represents a leaf, and hashing child nodes recursively to the Merkle root derives the hash of the parent branch.
SMT significantly improves efficiency by allowing verifiers to validate the existence of elements without accessing unrelated data elements or downloading specific data fragments. The independence of values in the tree allows updates in any order without changing the final structure of the tree.
Tagion’s system utilizes the root hash containing all sub-branch hashes to quickly verify the data state with minimal computation. To further enhance processing capacity, the system can create child DARTs for specific ecosystems, similar to sharded blockchains.
By using HiBON (Hashed Immutable Binary Object), Tagion simplifies the storage process, ensuring data remains hash invariant upon entry. This verified technology accelerates data retrieval and writing in the database.
Data Integrity
All subsystems in Tagion undergo random walks to check if data is stored and provided as needed. Nodes that fail to pass the verification challenge will be excluded from the network.
All records include timestamps and require payment to extend storage time. During the walk, the system checks for payment, and if none is received, it deletes the data and frees up space.
Filecoin
Filecoin is a decentralized storage network that incentivizes miners to provide storage capacity through its native token, Filecoin. To earn rewards, miners must generate proofs to verify their storage capabilities.
The basic storage unit in Filecoin is called a sector, with a standard size and a lifespan that providers can extend. This design carefully balances security and availability. All user data stored on Filecoin is encrypted, and multiple copies are distributed across the network to ensure that miners cannot access file contents.
The influence of miners in the Filecoin network is proportional to their provided storage capacity, allowing them to participate in the network’s consensus mechanism. The Filecoin virtual machine is responsible for executing smart contracts and facilitating market operations, such as pairing storage providers with users.
Filecoin’s architecture is modular, allowing nodes to operate specific parts of the system as needed. For example, a node can act solely as a storage node without participating in market operations.
To ensure data integrity and availability, Filecoin relies on two algorithms: storage proof and replication proof.
Storage Proof
Miners in Filecoin generate proofs to verify that they hold data copies at any given time. This verification is achieved through challenges presented by the system, where miners must answer correctly only if they have the data.
To prevent miners from duplicating data only when challenged, the challenges are designed to target different parts of the data at unpredictable intervals. The combination of randomness and uncertainty in time intervals makes it economically and logically infeasible for miners to obtain data only when challenged.
Proof of Spacetime (PoSt)
Filecoin introduces Proof of Spacetime (PoSt) to ensure continuous storage and data availability. PoSt verifies storage within a specific time interval. Miners can only pass the challenge if the file is stored within the specified timeframe.
PoSt includes two types of challenges:
– Winning PoSt: Miners verify that they stored data copies at specific times, usually when the algorithm selects miners to mine the next block. Short-term deadlines ensure that miners hold the data.
– WindowPoSt: A repetitive challenge where miners submit proof that they have maintained the data as required. Sealing data only when submitted is more expensive for miners.
Sealing is part of the replication proof algorithm, which is compute-intensive, encouraging miners to reduce the frequency of sealing. Replication proof ensures that miners create and store unique copies on their physical hardware.
By requiring miners to generate both types of proofs, Filecoin provides a guarantee of secure file storage for users, and only miners who provide actual storage can receive rewards. Due to the large size of the proofs, miners generate zkSNARKs (Zero-Knowledge Succinct Non-Interactive Argument of Knowledge) and submit them to the chain, making Filecoin the largest user of zkSNARKs, generating 6 to 7 million proofs daily.
In conclusion, Filecoin combines deterministic proof (PoRep) and probabilistic proof (PoSt) methods using a hybrid approach.
Celestia
The third example in this article is Celestia, a so-called data availability blockchain that provides execution and data storage for modular blockchains, allowing them to outsource core functionality.
With the rise of Ethereum Rollup, data availability solutions like Celestia have become popular for offering a cheaper alternative for storing Rollup transaction data compared to Ethereum archive nodes.
Proof of Data Availability
Unlike Filecoin, Celestia does not provide storage solutions for end-users but focuses on addressing data availability issues. Data availability ensures that blockchain data is correctly published. Typically, blockchain nodes must download the entire block to verify availability, which is a resource-intensive process that may hinder verification.
To simplify this process, Celestia uses Data Availability Sampling (DAS). This method involves light nodes downloading only a small portion of data until they reach a predetermined confidence level. If the data in the sample is available, it is considered a probabilistic proof of data availability.
The process works as follows:
– A proposer creates a data block.
– The block data is split into k×k chunks to form a matrix.
– The matrix is expanded by adding parity data using Reed-Solomon encoding to create a 2k×2k matrix. This type of encoding allows the recovery of the entire dataset from a subset of data.
– Independent Merkle roots of each row and column in the expanded matrix are computed and combined.
– Finally, the combined roots’ Merkle root is added to the block header’s data commitment, confirming the data’s availability.
To verify availability, light nodes randomly select unique coordinates from the expanded matrix and query the corresponding Merkle proof of the data block from full nodes. If the response is correct, it indicates a high probability that the entire block’s data is available.
The nodes then broadcast the received data blocks with the correct Merkle proofs to the rest of the network. With sufficient sampling, nodes can reconstruct the entire block, allowing Celestia to rely more on resource-limited nodes for verification, contributing to decentralization.
At the time of writing, Celestia is still very new. However, Data Availability Sampling is a technology that may be adopted outside of Celestia, and Ethereum core developers are discussing adding it to the protocol to help with scalability.
Conclusion
In conclusion, various methods for storing and verifying data availability in distributed networks are actively in operation and being used.
Tagion uses the DART database to improve throughput and support the development of specialized ecosystems protected by a Sparse Merkle Tree.
Filecoin’s architecture utilizes two different algorithms, Proof of Spacetime and replication proof, allowing miners to verify and prove reliable data storage. These proofs are subsequently recorded on the chain in the form of zero-knowledge proofs.
Celestia, as a data availability layer, uses Reed-Solomon encoding to expand data blocks into a matrix. This structure allows lightweight clients to confirm data availability through random sampling, bypassing the need to download the entire dataset.
As the landscape of distributed storage systems continues to evolve, Tagion, Filecoin, and Celestia have each proposed unique strategies to ensure data integrity, availability, and accessibility. These platforms collectively contribute to the development of resilient data publishing and storage systems that support decentralized networks.