"I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
Datopia has at its center a permissionless, immutable database — we maintain, persist and federate parallel indices to efficiently service structured queries over the chain state. Our specific approach to index implementation has involved extending Greenberg’s Hitchhiker tree, and, in datahike, providing facilities for arranging HH trees in such a way as to support interrogation with Datalog, a declarative query language.
This post is aimed at readers unfamiliar with the Hitchhiker tree — or write-optimized B trees, more generally. We’ll begin by revising properties of related data structures, while incrementally suggesting the design of the HH tree — a functional implementation of a B+ tree, employing similar write-optimizations to fractal tree indices or Bε trees. Before parting, we’ll briefly touch on authentication and replication. We can’t cover everything: details of how indices are represented in the HH trees will be the topic of a follow-up post.
From The Ground Up
Binary Search Trees
The canonical (and simplest) tree data structure suitable for storing sorted collections is the binary search tree — an arrangement in which inner nodes point to at most two subtrees (left and right). While simple, an unbalanced BST is vulnerable to structural deformation — if, say, entries are inserted in sorted order — and may degenerate into an approximation of a linked list. Accordingly, we’ll set unbalanced trees aside, and focus only on the balanced, or self-balancing subset of trees — those in which all leaves are approximately equidistant from the root.
In any balanced tree, the distance between the root and a leaf scales logarithmically with the number of entries. In a balanced BST having entries, the cost of a lookup is comparisons, where the base of the logarithm — 2 — is the tree’s branching factor. Intuitively, this makes sense: we’re halving the remaining search space whenever we select a subtree for descent.
“You need the power of the logarithm.”
If we think about balanced tree lookups as involving comparisons — where is the branching factor — we might consider increasing to yield a more favourable base for the logarithm. The B tree family of structures can be understood as self-balancing generalizations of the BST, leveraging the logarithm for practical gain.
In B+ trees, specifically, we’ve a sorted layer of data nodes — likely implemented as a contiguous/linked list — beneath sorted layers of index nodes, through which we route key lookups. On each descent we logarithmically search the sorted pivot values, locating the appropriate subtree — or, terminally, the data node — for the input key.
In practice, B+ trees tend to be impressively shallow — they're realized with far greater branching factors than may coherently be depicted. We'll stick with smaller numbers, for the sake of the diagrams.
So, there’s a lot going on here. We’ve got a B+ tree containing the integer keys 1-29, with a minimum branching factor — — of 3. Each index node has between 3 () and 5 () children — excepting the root, which may have fewer. When an operation would otherwise violate these bounds, the tree maintains balance via unsurprising internal manipulations (joining/splitting nodes. etc.) Navigating via the pivot values is straightforward: the greatest (rightmost) value in any subtree is its parent’s pivot.
While B+ trees leave little to be desired for reading data, they’re not write-optimal — an insert or deletion also costs , as we’re required to walk to the respective leaf prior to operating on it1. The Hitchhiker tree attempts to asymptotically improve upon this, by buffering write operations in fixed-length append logs associated with each index (inner) node — an optimization common to fractal and Bε trees. While append log length is configurable, we’re using a maximum of two entries per index node in the below examples.
In Figure 2 we see such a tree, with the append logs rendered vertically at the rightmost of each index node. If we attempt to append a write to a full log, the contents are flushed downwards a level. Eventually, an element arrives at a leaf and is inserted into the data node. Per Greenberg’s summary of the benefits of this approach:
- Most inserts are a single append to the root’s event log.
- Although there are a linear number of events, nodes are exponentially less likely to overflow the deeper they are in the tree.
- All data needed for a query exists along a path of nodes between the root and a specific leaf node. Since the logs are constant in size, queries still only read nodes.
Given our ability to independently alter the log length and branching factor, we can view this data structure either as a write-optimized B+ tree, or a read-optimized append log. The latter property is one we’re interested in exploiting to efficiently replicate write-intensive event-logs in replikativ, an associated project.
Figure 3: A small Hitchhiker Tree.
In Figure 3 you can see a small HH tree — specifically a 2-3 B+ tree (, so per index node) with append logs of length 2 — containing the keys 0-12. Note how a few of the elements remain in the logs (0, 11, 12, 13) — let’s walk through the insertion of further elements, to develop our intuitions around how the append logs are flushed down the tree.
Figure 4: We get off easily, via a vacancy in the root's log.
First we insert 14, observing that it requires a single write operation on the root node’s log per Figure 4 — leaving the root’s append log at capacity. Let’s tempt fate and attempt to insert another element, -1:
Figure 5: An insert causes an overflow and flushes the elements down to the leaf nodes.
The root’s append log overflows, and elements 13 and 14 move rightwards, causing another flush per Figure 5 — triggering their insertion into the data layer of the tree, and a split of the B+ tree’s index node. Critical for the reduction in I/O costs is the fact that the newly inserted element — -1 — only migrates to the node on the left, generating a single I/O operation until the append log is filled up.
When querying a Hitchhiker tree, we have to downwardly project unincorporated append log values — interpolating them in memory after loading the required data nodes. In that sense, the appended values hitchhike with the query operation to their appropriate position. This doesn’t require any additional I/O, only the CPU work of sorting a few elements in the nodes. For range queries, we selectively project downwards only the elements that belong on the relevant path. See the asymptotic costs appendix for more detailed information about the complexity of these operations.
Clojure’s data structures are trivial to structurally hash, and arbitrary data structures may be authenticated without difficulty. We’ve exploited these facts to merkelize (urgh) the Hitchhiker tree, by indirecting parent-child relationships with recursive SHA512 subtree hashes, yielding a write-optimized Merkle B+ tree.
Consequently, index segments (addressed by hash) may be replicated peer-to-peer (e.g. Dat, Bittorrent), and selectively retrieved by light clients in response to local queries — database consumers maintain local copies of whatever subset of the indices their queries touch, without losing the ability to authenticate the entire structure, given a trusted root hash.
I hope we’ve explained the Hitchhiker tree’s background sufficiently to communicate its attraction to us as a building block for distributed databases. We believe authenticated, optimized B trees to be a far better choice for high-performance data storage solutions (blockchains, P2P filesystems) than direct materialization of DAGs or chains. If these ideas interest you, or you’re motivated by the composition of clearly-delineated components into surprisingly powerful systems, consider joining us!
Appendix: Asymptotic Costs
It’s perhaps clarifying to discuss complexity in the notation of Bε trees. While a B+ tree has a fanout of — each node has at least children — a tree has children (e.g. children for ). Each node has elements: are navigational pointers while the remainder belong to the append log.
To calculate the amortized insertion cost informally, we can say that we have to flush an element times to the leaf. On each flush we move elements down to each child. For a detailed explanation see Section 2.2. of Jannen et al.
|Cost (IO ops)||B+ tree||HH tree|
Table 1: Comparison of the asymptotic complexity of operations between a B+ tree and a HH tree.
Note that while the query cost increases slightly by , we can target larger node sizes, as they’re not rewritten as often as is the case for a B+ tree. If we consider as a fixed constant (e.g. ), it’s eliminated from the asymptotic cost expressions and yields the theoretic superiority of a fractal tree.