Filecoin RetroPGF Round 2: Prolly Tree Indexed Data Oracle for IPLD
Category: Infrastructure
Drips Project: KenCloud-Tech/prolly-tree-oracle
GitHub URL: https://github.com/KenCloud-Tech/prolly-tree-oracle
Funding amount: 0 FIL

Peer-to-peer and decentralized applications have been popular lately, but almost all of them rely on synchronizing append-only logs of operations (or an equivalent). This leads to performance degradation over time as data accumulates and needs to be processed on new peers before they can start querying and updating data within an existing swarm. This project deals with how pre-indexed peer-to-peer databases can be built and how they can sidestep the need to process operation logs and query data more efficiently. The short of it is that with this new data structure, the network becomes your database instead of needing to build a database after loading everything from the network.
Most existing p2p databases like OrbitDB, Secure Scuttlebutt, Automerge, or Cabal work by having each user create a log of all the changes they're making to a dataset. Each change that needs to happen or new piece of data that gets created is stored as an "operation" in the user's log. When a peer wants to load data, it will start pulling changes out of the logs of other peers and processing them locally in order to arrive at the latest "state" of the dataset. What's interesting is that this isn't just the case for p2p databases, but is also the building block of distributed databases in general like MongoDB or CockroachDB. In fact, most blockchains are just another case of databases built on append-only operation logs, but with new ways of doing log consistency which distribute trust across miners or validators. At the end of the day it's all operation logs plus some mechanism for ensuring the consistency of the log and for deciding whether the DB can process operations in different orders and arrive at the same state.
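As a rough illustration of that pattern, here is a minimal sketch in Go (not tied to any of the databases named above) of how state gets derived from operation logs. The point is that a joining peer has to apply every operation in every relevant log before its local view is usable, so startup cost grows with total history.

```go
// A minimal sketch of log-based state derivation, not any specific database's API.
package main

import "fmt"

// Op is a single entry in a peer's append-only log.
type Op struct {
	Kind  string // "put" or "del"
	Key   string
	Value string
}

// replay folds one or more peers' logs into a key/value state.
// Real systems also need a conflict-resolution rule (CRDT, last-writer-wins, ...);
// here we simply apply logs in the order given.
func replay(logs ...[]Op) map[string]string {
	state := make(map[string]string)
	for _, log := range logs {
		for _, op := range log { // cost is proportional to total history
			switch op.Kind {
			case "put":
				state[op.Key] = op.Value
			case "del":
				delete(state, op.Key)
			}
		}
	}
	return state
}

func main() {
	alice := []Op{{"put", "greeting", "hi"}, {"put", "topic", "prolly trees"}}
	bob := []Op{{"del", "greeting", ""}}
	fmt.Println(replay(alice, bob)) // map[topic:prolly trees]
}
```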
This setup works well in that it separates the view of the data from how the data gets replicated, and it leaves a lot of room for how to manage consistency and fault tolerance in your system. However, it works best for databases that are always online and have a relatively consistent set of peers to replicate to. If you have large time ranges where two peers are unable to communicate, or if you have new peers joining and leaving, you end up requiring large numbers of operations to be processed before a peer can actually make use of the data.
This can be seen in scenarios where somebody follows a lot of users with a lot of history in SSB and needs to wait for all the logs to be processed, or when a new database replica is added and needs to fetch all the existing state (whether it's a regular DB or a blockchain full node). The new node will need to sit still and focus all of its resources on catching up with the append-only log and updating its local database indexes before it's able to do anything else. Relying on append-only logs leads to this scenario as your data grows and as the number of peers trying to consume the data grows, and the same situation shows up across all systems with this core building block. You can attempt to optimize your core sync logic, but as your data grows in either volume or frequency, you'll end up facing the same limitations.
One way to get around needing to "catch up" with a writer is to get "snapshots" of the current state of the data from trusted peers. From there you can skip a lot of the extra processing and just load the new changes since your snapshot. This of course presumes that you have a trusted peer and that they are in fact trustworthy, which isn't always possible in distributed systems (in particular p2p). In some cases the snapshot might not be getting generated in time or might be so large that auditing it becomes infeasible.
Another approach is to split your data into multiple oplogs for different "regions" so that peers only need to replicate the data that is actually relevant to them. This is used by certain databases like CockroachDB, by the "off-chain" processing methods we're seeing in some cryptocurrency projects, and by projects like PeerBit which split up the logs for individual documents. The Secure Scuttlebutt community is also looking at a new feed standard where they can split up users' data into a bunch of sparsely replicated feeds. Splitting up the logs can reduce the overall number of operations you need to replicate, but for data that has a lot of "history" you can expect everything to get slower and slower as time progresses. For example, if you have a chat room with a multi-year history and a new user has to rely on append-only logs to get the latest data, they'll have no choice but to sit there while it loads.
All of this together leads either to a centralization of power in long-lived nodes that do the replication for you (e.g. traditional databases, blockchains) and then give you access to the data as a "client", leaving you to rely on those nodes being truthful and constantly reachable, or, for applications that don't have the ability to centralize this way (like p2p databases), to performance degradation as your data grows.
Similar to B+ Trees, prolly trees store ordered key ranges in a tree structure. Unlike B+ Trees, they use content addressability and probabilistic chunking to deterministically generate the tree structure. In practice this means that the same key-value pairs will generate the same tree structure regardless of insertion order, and two trees can be combined deterministically.
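To make the chunking idea concrete, here is a minimal sketch in Go (an illustration, not the prolly-tree-oracle code): leaf boundaries are chosen from a hash of each entry, so the same sorted set of key-value pairs always splits into the same chunks no matter what order the entries were inserted in.

```go
// A minimal sketch of content-defined chunking for leaf nodes.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

type Entry struct{ Key, Value string }

// isBoundary hashes the entry and closes the current chunk when the first
// 32 bits fall under a threshold (~1/4 here, so roughly 4 entries per chunk).
func isBoundary(e Entry) bool {
	h := sha256.Sum256([]byte(e.Key + "\x00" + e.Value))
	return binary.BigEndian.Uint32(h[:4]) < (1 << 30)
}

// chunk sorts the entries and splits them at content-defined boundaries.
func chunk(entries []Entry) [][]Entry {
	sort.Slice(entries, func(i, j int) bool { return entries[i].Key < entries[j].Key })
	var chunks [][]Entry
	var cur []Entry
	for _, e := range entries {
		cur = append(cur, e)
		if isBoundary(e) {
			chunks = append(chunks, cur)
			cur = nil
		}
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	a := []Entry{{"a", "1"}, {"b", "2"}, {"c", "3"}, {"d", "4"}}
	b := []Entry{{"d", "4"}, {"c", "3"}, {"b", "2"}, {"a", "1"}} // different insertion order
	fmt.Println(fmt.Sprint(chunk(a)) == fmt.Sprint(chunk(b)))   // identical chunk layout: true
}
```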
Everything that applies to B+ Trees and database indexes can now apply to prolly trees, with the added benefit of being p2p, sparsely loaded, and mergeable. In particular, when merging two datasets you can easily skip over blocks that are the same on both sides without having to traverse into the tree, and you can insert entire ranges that are new into your tree without having to fully traverse the individual items within.
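Here is a small sketch of why merges can skip shared work, using a heavily simplified node type (an assumption-laden illustration, not the real data layout): if two subtrees have the same content hash, they are byte-for-byte identical and never need to be descended into.

```go
// A minimal sketch of skipping shared subtrees during a diff/merge walk.
package main

import (
	"crypto/sha256"
	"fmt"
)

// Node is a drastically simplified tree node: leaves carry entries,
// inner nodes carry children. Hash() stands in for a real CID.
type Node struct {
	Entries  []string
	Children []*Node
}

func (n *Node) Hash() [32]byte {
	h := sha256.New()
	for _, e := range n.Entries {
		h.Write([]byte(e))
	}
	for _, c := range n.Children {
		ch := c.Hash()
		h.Write(ch[:])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// countDiffering walks two same-shaped trees and counts the nodes it actually
// had to visit, skipping any pair of subtrees whose hashes already match.
func countDiffering(a, b *Node, visited *int) {
	if a.Hash() == b.Hash() {
		return // identical subtree on both sides: nothing to fetch or compare
	}
	*visited++
	for i := range a.Children {
		countDiffering(a.Children[i], b.Children[i], visited)
	}
}

func main() {
	shared := &Node{Entries: []string{"a", "b", "c"}}
	left := &Node{Children: []*Node{shared, {Entries: []string{"x"}}}}
	right := &Node{Children: []*Node{shared, {Entries: []string{"y"}}}}
	visited := 0
	countDiffering(left, right, &visited)
	fmt.Println("nodes visited:", visited) // 2: the root and the one differing leaf
}
```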
Use Case 1: P2P Social Apps. A lot of social P2P apps right now use the append-only log method of synchronizing data and end up with large histories if they actually get used for a long period of time. Generally these apps also have users coming in and out of the swarm frequently and being offline for long periods of time. This leads users to commonly face the downsides of using append-only logs. New users have to sit and wait for their databases to sync (sometimes for minutes, sometimes tens of minutes), and users that haven't been online for a while can still need to sit around a bit to get up to speed. Very active users that might be in a bunch of chatrooms at once will have that initial load multiplied by each chat they're participating in, which can lead to sluggishness (see Matrix chat clients when you're in hundreds of rooms with thousands of people).
If these apps instead used prolly tree based p2p databases, they could drastically improve the initial load times.
Use Case 2: Blockchain State. Most "clients" trying to use blockchains have to trust that the full node they're using isn't omitting data and that it will be online for them indefinitely. These full nodes typically keep a local database (like Postgres or the like) and expose an HTTP API for clients to connect to and ask for state. This HTTP API is where full nodes can impose arbitrary restrictions and start charging for access to the chain state. For example, if you're trying to use MetaMask to access some data, or some NFT platform, they effectively control your access to the chain via their centralized management of these HTTP APIs.
With prolly tree based p2p databases, we can sidestep the need for these HTTP APIs and central-ish databases in blockchain ecosystems. Instead of a blockchain existing to keep an append-only log of blocks, with full nodes acting as indexers over that state in order to make it actually usable, what if the chain published both the latest block and a link to the root of a prolly tree containing the indexes of all the data?
From there a client could talk to any node participating in keeping the chain running and do its actual query on the data by traversing the peer-to-peer indexes. Instead of loading everything from a single node and depending on it to be online, a client can load subsets of the index from multiple nodes in parallel, and the more nodes exist in the chain, the more available the data will be.
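A minimal sketch of what such a client-side lookup could look like, with a made-up `fetch` callback standing in for asking peers for blocks by CID (none of the names here come from a real API): only the handful of blocks on the path from the published root to the key ever need to be fetched.

```go
// A minimal sketch of querying a published index root block by block.
package main

import "fmt"

// Block is a simplified index node addressed by an opaque ID (a CID in practice).
type Block struct {
	Keys     []string          // split points (inner nodes)
	Children []string          // child block IDs (inner nodes)
	Values   map[string]string // key/value pairs (leaf nodes)
}

// FetchFunc returns the block for an ID; a real implementation would ask any
// peer advertising that block and verify the bytes against the ID.
type FetchFunc func(id string) Block

// get walks from the root block down to the leaf that could contain key.
func get(rootID, key string, fetch FetchFunc) (string, bool) {
	id := rootID
	for {
		b := fetch(id)
		if b.Values != nil { // leaf reached
			v, ok := b.Values[key]
			return v, ok
		}
		// descend into the child whose key range covers the key
		child := 0
		for child < len(b.Keys) && key > b.Keys[child] {
			child++
		}
		id = b.Children[child]
	}
}

func main() {
	// Toy "network": a map of block ID -> block standing in for many peers.
	blocks := map[string]Block{
		"root":  {Keys: []string{"m"}, Children: []string{"leafA", "leafB"}},
		"leafA": {Values: map[string]string{"apple": "1", "fig": "2"}},
		"leafB": {Values: map[string]string{"pear": "3", "plum": "4"}},
	}
	fetch := func(id string) Block { return blocks[id] }
	v, _ := get("root", "plum", fetch)
	fmt.Println(v) // two block fetches instead of the whole dataset; prints "4"
}
```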
This can also be used to "prove" that a certain result was part of the blockchain state by linking to the root block and the subset of the prolly tree which was used to satisfy the query. From there one could send these proofs around to different systems, which has implications for inter-chain communication and for sharing proofs without needing to be fully connected to the chain.
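The verification side can be sketched along the same lines, under simplified assumptions about how nodes are encoded and hashed: the prover ships the blocks on the path from root to leaf, and the verifier only has to re-hash them, checking each child against its parent and the first block against the trusted root.

```go
// A minimal sketch of checking a query result against a published root hash.
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// hashNode stands in for hashing a node's canonical encoding into its CID.
func hashNode(payload string) string {
	h := sha256.Sum256([]byte(payload))
	return fmt.Sprintf("%x", h[:])
}

// verifyPath checks that the first node hashes to the trusted root and that
// each subsequent node's hash appears in its parent's payload. If it does,
// the leaf (last payload) provably belongs to the state the root commits to.
func verifyPath(trustedRoot string, pathPayloads []string) bool {
	if len(pathPayloads) == 0 || hashNode(pathPayloads[0]) != trustedRoot {
		return false
	}
	for i := 1; i < len(pathPayloads); i++ {
		if !strings.Contains(pathPayloads[i-1], hashNode(pathPayloads[i])) {
			return false // child not actually referenced by its parent
		}
	}
	return true
}

func main() {
	leaf := "balance(alice)=42"
	inner := "child:" + hashNode(leaf)
	root := "child:" + hashNode(inner)
	trustedRoot := hashNode(root) // e.g. taken from the latest block header

	fmt.Println(verifyPath(trustedRoot, []string{root, inner, leaf}))          // true
	fmt.Println(verifyPath(trustedRoot, []string{root, inner, "balance=999"})) // false
}
```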
Backups and fresh starts can also be improved in that a node can get the latest block from the chain and start loading the underlying prolly trees from the history while prioritizing blocks that are relevant to any smart contract calculations first. To be specific, a node can execute code that relies on the state before having a full copy of the state by being able to load the part of the prolly tree it needs on the fly from the rest of the network.
Storage of the blockchain state can also be optimized thanks to the content addressability. If only a small subset of the index changed, then all the other nodes in the tree will remain the same. It's kind of like compressing your DB right from the get-go without losing the ability to roll back to earlier snapshots of your state.
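A toy example of that deduplication, using a plain content-addressed block store (a sketch of the general idea, not the project's storage layer): writing a second version of an index whose chunks mostly match the first only adds the changed blocks, and the old root hash still names a complete snapshot.

```go
// A minimal sketch of structural sharing in a content-addressed block store.
package main

import (
	"crypto/sha256"
	"fmt"
)

// Store keeps blocks keyed by their hash; Put is a no-op for known content.
type Store struct{ blocks map[[32]byte][]byte }

func NewStore() *Store { return &Store{blocks: make(map[[32]byte][]byte)} }

func (s *Store) Put(b []byte) [32]byte {
	id := sha256.Sum256(b)
	if _, ok := s.blocks[id]; !ok {
		s.blocks[id] = b
	}
	return id
}

// putVersion stores every leaf chunk plus a root block listing their hashes.
func putVersion(s *Store, chunks [][]byte) [32]byte {
	var root []byte
	for _, c := range chunks {
		id := s.Put(c)
		root = append(root, id[:]...)
	}
	return s.Put(root)
}

func main() {
	s := NewStore()
	v1 := putVersion(s, [][]byte{[]byte("a..f"), []byte("g..m"), []byte("n..z")})
	after1 := len(s.blocks)

	// Version 2 changes only the last chunk; the first two are reused as-is.
	v2 := putVersion(s, [][]byte{[]byte("a..f"), []byte("g..m"), []byte("n..z UPDATED")})
	fmt.Println("new blocks for v2:", len(s.blocks)-after1) // 2: the changed chunk and the new root

	_, v1Kept := s.blocks[v1]
	_, v2Kept := s.blocks[v2]
	fmt.Println("both snapshots addressable:", v1Kept && v2Kept) // true: rollback stays cheap
}
```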
Use Case 3: Cross-Organizational Search. Another place where p2p search indexes would be useful is indexing institutional data for text search. Specifically, it could be useful for projects archiving data with WebRecorder to build up large datasets of crawled data. At the moment it's possible to index data from these sources using traditional centralized databases like Elasticsearch, but this requires running centralized infrastructure and doesn't allow indexes to be shared easily between groups. Often this is done via external companies like Google which come in and set up search indexing on the institution's behalf, leaving the institution paying a third party to keep access to its own data.
With prolly tree based p2p databases we can begin to build up search indexes collaboratively with community members, where individuals or smaller groups participate in generating indexed data and larger indexes are formed by combining the smaller ones. For example, cultural institutions can start tracking data for use by their community, and they can also build bridges to other institutions in order to search through larger amounts of data without needing to delegate authority and ownership to a third party like Google.
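As a sketch of the merge step behind this (plain maps instead of actual prolly tree nodes, so purely illustrative; the institution and document names are made up): each group maintains a small inverted index, and the combined index is just the union of their posting lists, which is exactly the kind of key-range merge prolly trees make cheap.

```go
// A minimal sketch of combining two independently built search indexes.
package main

import (
	"fmt"
	"sort"
)

type Index map[string][]string // search term -> document IDs

// merge unions posting lists term by term; with prolly trees the same result
// is reached by merging the underlying key ranges and skipping shared blocks.
func merge(a, b Index) Index {
	out := Index{}
	for term, docs := range a {
		out[term] = append(out[term], docs...)
	}
	for term, docs := range b {
		out[term] = append(out[term], docs...)
	}
	for term := range out {
		sort.Strings(out[term])
	}
	return out
}

func main() {
	museum := Index{"oral history": {"museum/rec-001"}, "maps": {"museum/atlas-07"}}
	library := Index{"oral history": {"library/tape-1938"}}
	fmt.Println(merge(museum, library)["oral history"])
	// [library/tape-1938 museum/rec-001]
}
```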
This can apply outside of cultural institutions as well and can be used for curated search indexes for all sorts of use cases. Communities interested in a given topic can collaborate on search via prolly trees, scientific publishers and communities can combine their data, etc. Instead of having a centralized search engine like Google or Bing which is vulnerable to spam originating from untrusted and irrelevant parties, we can have smaller scale and more focused search indexes which can be used and combined based on who is interested in what. These are some of the paths we're hoping to explore with the IPFS Search team by taking their existing indexed data, finding ways to ingest it into prolly trees, and performing searches on top.
Use Case 4: Big Data. Introducing prolly tree based p2p databases into the mix gives us new ways of approaching big data. First, it gives us a new way to scale out and shard large datasets across multiple storage backends by storing subsets of a prolly tree across several machines. Decoupling the storage from the authoring and reading layer can make it possible to spin up different components within a single cloud or across people's home machines. Ingestion can be completely changed too: instead of running ingestion on a single database or via some sort of map-reduce cluster, one can split up sections of data to be ingested and indexed across many worker nodes and combine the resulting prolly trees at a different layer.
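Here is a minimal sketch of that ingestion pattern (the worker and shard names are made up for illustration): each worker indexes its slice of the input in parallel, and the per-worker outputs are combined at the end. With prolly trees the combine step would be a tree merge rather than the full re-sort shown here.

```go
// A minimal sketch of sharded, parallel ingestion with a final combine step.
package main

import (
	"fmt"
	"sort"
	"sync"
)

// ingestShard stands in for a worker building an index over its slice of data.
func ingestShard(shard []string) []string {
	out := append([]string(nil), shard...)
	sort.Strings(out) // each worker produces an independently ordered result
	return out
}

func main() {
	shards := [][]string{
		{"zebra", "apple"},
		{"mango", "banana"},
		{"cherry", "yak"},
	}

	results := make([][]string, len(shards))
	var wg sync.WaitGroup
	for i, shard := range shards {
		wg.Add(1)
		go func(i int, shard []string) { // one worker per shard
			defer wg.Done()
			results[i] = ingestShard(shard)
		}(i, shard)
	}
	wg.Wait()

	// Combine the per-worker outputs into one index.
	var combined []string
	for _, r := range results {
		combined = append(combined, r...)
	}
	sort.Strings(combined)
	fmt.Println(combined)
}
```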
Apart from scaling, we now have new ways of structuring data marketplaces where authors of datasets can encrypt prolly trees and sell decryption keys to others, who can then process and refine the data and publish it themselves. The sale and publishing can be done via centralized marketplaces, blockchains, or manually coordinated contracts across organizations. In fact, having the underlying distribution and processing be decentralized can be a boon to making Data DAOs which manage and profit off of datasets.
Use Case 5: AI.
- Verifiability is important for tracing where data came from
- Indexable data structures are useful for RAG and tool calls
- Decentralized datasets paired with instantiated agents