Blockchains as Information Systems

Blockchains positioned as application deployment targets must confront the ubiquity and volume of highly structured information in all but the most trivial applications. Expensive, crude or ad-hoc approaches to modeling, storing and retrieving data are typical in the blockchain space. This need not be the case.

Axioms

  1. A blockchain is a history of facts.
  2. Inference is the application of reason to histories of facts.

Goals

An integrated set of components for collecting, storing, and processing data and for providing information, knowledge, and digital products.
— Vladimir Zwass, “Information system,” Encyclopædia Britannica.

Below, we’ll attempt to convince you that neither traditional databases nor blockchains represent epistemologically robust information systems — the former forgets, the latter is often unable to remember when it counts1. More consequentially, we’ll suggest that sound information systems are a prerequisite for solving many of the problems which engage us as a community.

Through unceasing appeal to immutable databases as a solution space from which we’ve plenty to learn, we hope to establish that it’s both possible and necessary to offer comparable facilities in a permissionless, distributed information system. In parallel, we’ll suggest that “smart contracts” often function in practice as gatekeepers for data, which may otherwise be directly read or conditionally written — if only there existed a generic mechanism for modelling, interrogating, permissioning and storing structured histories of values.

Finally, we’ll make a case that user-schematized attributes — with optional, logical constraints over their usage — may represent a simpler and more expressive means of talking about data than ad-hoc key value stores and serial imperatives. By enforcing write constraints, apparently fundamental logical invariants (those around asset transfer, say) may be excised from the core of the system, while being trivially implementable atop it. Our contention is that the resulting environment — in which data is collaboratively modelled, freely shared and uniformly accessed — approaches an ideal substrate for transparently tackling complex analytic and inferential problems.

1 By, say, emulating "mutable cell" storage semantics in on-chain computation protocols, surfacing only the most recently committed value for each cell — requiring that histories be maintained explicitly, expensively, and in data structures immune to efficient or expressive interrogation.

Motivation

“Man is certainly stark mad; he cannot make a worm, and yet he will be making gods by dozens.”
— Michel de Montaigne

Consider a boardwalk lemonade stand in high season, managed with a commitment to transparency and self-optimization. The price and provenance of every lemon, the yield of each squeeze, hourly ambient temperatures — a sample of the facts that our revolutionaries may want to structure and record into a public, immutable history — the memory of their organization, and the substrate for its analytic/inferential processes.

Blockchain enthusiasts invoke heady problems — deterministic arbitration, on-chain governance — while seldom acknowledging the explosive volumes of information consumed and emitted by these processes, even within narrow domains. Platforms — blockchains, or otherwise — without cost-effective solutions to structured, historic data retrieval are suited only to the most comically trivial class of problem: they couldn’t govern a lemonade stand. Unsurprisingly, businesses solving complex problems already know this.

“A database that updates in place is not an information system. I'm sorry.”
— Rich Hickey, The Database as a Value

A billion-dollar market is emerging around the private, distributed analysis of append-only logs. Clearly, databases built atop mutable cells — yesterday’s value is obliterated by today’s — are unsuited to many of the problems faced by their customers. Immutable databases take a far more interesting position: your structured data’s history is data of the same order — equivalently structured and interrogable. When the architectural predecessors of contemporary databases were conceived in the early 1970s, this approach would’ve been ostentatious in the extreme. Fortunately, an awful lot has happened to the price of storage media in the intervening decades1.

“Peter had seen many tragedies, but he had forgotten them all.”
— J.M. Barrie, Peter Pan

Fortuitously, a blockchain’s fundamental responsibility is that of securing a coherent, immutable, ordered history of facts. Just the thing we need, to make intelligent decisions! In a truly curious turn, this history — the network’s identity — tends, as a matter of precedent, to be obscured from on-chain computation protocols determined to privilege the present. Given the effort and coordination involved in maintaining these histories, to fail to offer a transparent means of analyzing them — as a line, not a point — is astonishingly profligate and short-sighted.

1 In late 2018, we can get for $99 what would've cost three quarters of a billion dollars in the early 70s.

How We Got Here

“One of the poets, whose name I cannot recall, has a passage, which I am unable at the moment to remember, in one of his works, which for the time being has slipped my mind, which hits off admirably this age-old situation.”
— P.G. Wodehouse

In the main, blockchains have tended towards composing solutions to consensus, Sybil resistance and replication in a manner at odds with the needs of cost- and feature-competitive data storage — long block times and exorbitant storage costs being among the more salient consequences. The pervasive use of the word ledger — unaccompanied by the concession that ledgers are special-purpose databases — has likely cemented our aversion to considering blockchains as information systems, in any broad sense.

Significant progress has been made on the above technical concerns — less on the cultural ones. On the solutions side, there exist a number of sound, high-throughput consensus algorithms (metastable, classical, etc.), composable with responsive, intuitive mechanisms for Sybil resistance (PoS, DPoS). Elsewhere (Ethereum, Filecoin), novel approaches to state distribution are required for platforms to function at projected demand. We’ve a vision problem, not a technical one.

First as Tragedy, Then as Farce

From a developer’s perspective, one of the more disappointing trends in compute-oriented blockchains is the conflation of information and implementation at the center of the dominant programming model. We’re recapitulating the worst of object-orientation, atop systems embarrassed to describe themselves as such. Data isn’t an implementation detail, and mediating its access through domain-specific methods1 is a thoroughly debased strategy, at odds with the needs of sustainable, composable systems.

These aren’t stylistic concerns. The absence of a fundamental means of global, structural interrogation/insertion consigns contracts to the re-implementation2 of a small set of access patterns over their “internal state” — whatever that means, and however it’s been jerry-rigged together. While it’s awkward to obtain empirical data3, we’ve the intuition that an astonishing percentage of deployed contracts are concerned with trivial, imperative data brokerage — compensating for the shortcomings of their platforms, not doing anything smart. Briefly, an excerpt from Solidity by Example:

contract Ballot {
  ...
  mapping(address => Voter) public voters;
  ...
  function giveRightToVote(address voter) public {
    require(msg.sender == chairperson,
            "Only chairperson can give right to vote.");
    require(!voters[voter].voted,
            "The voter already voted.");
    require(voters[voter].weight == 0);
    voters[voter].weight = 1;
  }
  ...
}

This is fairly typical Solidity code — after an imperative sequence of runtime assertions, the giveRightToVote method sets a nested, persistent property to 1. All of the other methods on the Ballot contract are in a similar line of work — delicate, sequential assertions, followed by trivial data manipulation. This is not code; it’s data disguised by blush and carmine.
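For contrast, and as a taste of what follows, the same operation might be expressed as plain data, with the chairperson and prior-vote checks relegated to a declarative invariant on the attribute’s schema. A hypothetical sketch, in the notation developed below; the ballot namespace is illustrative:

{:datopia/entity <voter>
 :ballot/weight  1}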

1 cf. generated getters and setters per Solidity, etc.
2 It may surprise you to learn that grave mistakes are often made in these implementations.
3 Expect a follow-up.

A Better Way

“So far it's perfectly simple!... A galvano-plastic overstress on a centrifugal pin!... A simple matter of computation!... The factors involved are child's play... Radio-diffusible lighting with a Valadon projector!... My word, all it takes is a little spunk and initiative!”
— Courtial des Pereires

At a high level, our principal interest is in developing a trustless, immutable, deductive information system, sufficiently expressive to serve as the substrate for a ledger, governance platform, etc. — without spilling the details of those domains all over the core system design. While we’ll resist the impulse to wade too deeply into the weeds in this introductory post, below is a sketch of a design in which the fundamental network interaction, a transaction, denotes something much closer to that word’s use in database systems.

The Tyranny of Structurelessness

What follows is a tedious — but mercifully brief — exploration of the requirement of a single, flexible means of structuring arbitrary data entrusted to the network.

;; The angle brackets are an ad-hoc metasyntax for the purpose of
;; abstracting incidental values --- entity identifiers, here.

{:datopia/entity <sally>
 :email          "sally@gmail.com"}

Here we’ve an entity — a thing — represented as a map/dictionary, with the entity’s attributes as its keys. For those of us unrestrained by type and struct fetishes, this ought to appear a perfectly familiar, open (i.e. no fixed set of permissible attributes per entity), universal means of talking about things. Let’s talk about <sally> from the perspective of another entity, <joe>:

{:datopia/entity <joe>
 :age            72
 :balance        27
 :friend         #datopia/ref <sally>}

These map representations are trivially isomorphic to the Entity-attribute-value information model, in which data is typically structured as global1 triples of the form — wait for it — entity, attribute, value (e.g. <joe>, age, 72). Like RDF, without the megalomania.
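Concretely, <joe>’s map above decomposes into triples along these lines:

[[<joe> :age     72]
 [<joe> :balance 27]
 [<joe> :friend  <sally>]]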

A key feature of our system is that any of the attributes referenced above may be (optionally) schematized — to express type, cardinality or uniqueness, or, more interestingly, to logically constrain the attribute’s use in transactions. This latter facility is a general means of establishing global invariants, such as those demanded by a ledger (balance sufficiency, zero-sum exchange, etc.) — though far more interesting examples abound. Users deploy attribute schemas, and the genesis block includes some helpful, primitive schemas essential to maintaining the network itself.
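By way of illustration, a schema for the email attribute above might look something like the following. This is a sketch only; the schema vocabulary is improvised for this post, not a finalized format:

{:datopia/attribute :email
 :datopia/type      :string
 :datopia/unique?   true
 :datopia/invariant <query>}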

Here’s where it gets a little steampunk — we really like Datalog (an ancient, declarative, uncannily expressive Turing-incomplete subset of Prolog2) as a domain-specific logic language for database interrogation. Queries and invariants, like most everything we traffic in, are structured data. A trivial Datalog query over our example data, declared in Clojure’s literal notation, for readability:

[:find  ?age ?balance
 :where [?e     :friend  ?other]
        [?other :email   "sally@gmail.com"]
        [?e     :balance ?balance]
        [?e     :age     ?age]]

All of the bare symbols (?-prefixed, by convention) are logic variables for which we’re seeking concrete substitutions. Run against the example entities above, the only satisfying substitution binds ?e to <joe>, yielding the relation #{[72 27]}.

On transaction receipt, all applicable attribute invariants are evaluated against an in-memory Datalog engine containing only the union of the facts asserted by the transaction and the result of an optional, arbitrary pre-query against the chain state, on which the attribute schema may declare a dependency3. If the transaction is accepted, its facts are incorporated into the persistent, authenticated indices which comprise the network’s database. If you’ve some grasp of the above query, there’s not much mystery to attribute invariants — they’re simply queries of the same form, required to unify in order for an attribute’s usage — and any transaction containing it — to be considered valid.
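To make this concrete: a minimal invariant for the age attribute, unifying only when the asserted value is non-negative, might read as follows. As ever, a sketch; a real engine’s predicate vocabulary may differ:

[:find  ?e
 :where [?e :age ?age]
        [(>= ?age 0)]]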

Why not SQL?

While both SQL and Datalog are rooted in similar formalisms, in practice, comparison is confounded by implementation-specific extensions — on both sides — which may drastically alter the properties of a given system. Assuming some SQL implementation capable of recursive queries, we'd first appeal to Datalog on the basis of expressive power: it's far less operational (what, rather than how), inferentially more succinct, and amenable to structural query representation, per the above example.

The EAV data model is well suited to domains in which a voluminous set of attributes are sparsely associated with entities — e.g. an open system with user-defined attribute schemas. While it's certainly possible to use SQL to interrogate EAV-spaces, it's not our idea of a good time.

1 Some EAV databases organize triples within named tables — we don't find tables to be a motivating organizational scheme.
2 Shouts out Alain Colmerauer.
3 e.g. the invariant component of some balance attribute's schema may declare something like "I need the current balance for every entity referenced in the transaction, in order to evaluate the correctness of the transaction's use of balance".

Facta, non verba

How might we flexibly support value transfer in such a system, in more detail? As our fundamental interaction is the submission of arbitrary, structured data, let’s idealize a transaction — in this case, a vector of two facts, each concerned with a distinct entity:

[{:simoleon/balance (- 99),
  :datopia/entity   <sender>},

 {:simoleon/balance (+ 99),
  :datopia/entity   <recipient>}]

Here we’re imagining Simoleons to be some user-defined asset, which happens to use the namespace simoleon for its qualified keywords1. The simoleon/balance values are submitted not as absolute values, but relative ones — we’re declaring something like “<sender>’s simoleon/balance shrinks by 99; <recipient>’s grows by 99.”
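One plausible reading of those relative values treats each as an operation over the attribute’s current value. A minimal Clojure sketch, assuming that semantic:

;; Apply a relative value, e.g. (- 99), to an attribute's current
;; value --- a sketch, not the network's definitive delta semantics.
(defn apply-delta [current [op amount]]
  (case op
    - (- current amount)
    + (+ current amount)))

(apply-delta 126 '(- 99)) ;; => 27
(apply-delta 0   '(+ 99)) ;; => 99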

To say that Simoleons are a user-defined asset implies only that there exists a schematized attribute within the network — simoleon/balance — logically constraining its own use in such a way as to render inexpressible “unsound transfers” — whatever that meant to the author of the attribute’s schema2. Datopia nodes have no intrinsic conception of a transfer — when it comes to transaction processing, their primary concern is the evaluation of user-defined invariants.

Nodes — prior to applying transactions — synthesize additional attributes from low-level metadata not explicitly represented in the transaction’s body (e.g. that its envelope was signed by <sender>, rather than <recipient> — handy). It’s trivial to see that the sum of this data, considered alongside the transaction — and a pre-query resulting in <sender>’s simoleon/balance — would constitute sufficient input for a relatively brief logical declaration of the conditions of value transfer. It’s first-order logic all the way down — Datopia’s native asset is defined and exchanged via identical means.
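To gesture at the shape of such a declaration, the hypothetical invariant below unifies only when the debited party signed the transaction and holds sufficient balance. Every name here, the synthesized attributes included, is illustrative:

[:find  ?tx
 :where [?tx     :datopia/signer   ?sender]
        [?tx     :simoleon/debit   ?amount]
        [?sender :simoleon/balance ?balance]
        [(>= ?balance ?amount)]]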

1 An equitable mechanism for granting exclusive access to particular namespace prefixes — or fully-qualified attribute names — is outside the scope of this post, and perhaps need not be a platform-level feature.
2 Semantics no doubt acceptable to those who volunteer to traffic in Simoleons.

The Ecstasy of Immanence

Detective Deutsch: What else?

Barton Fink: Trying to think. Nothing, really. He... he said he liked Jack Oakie pictures.

Detective Mastrionotti: You know, ordinarily we say anything you might remember could be helpful. But I'll be frank with you, Fink. That is not helpful.
— Barton Fink (1991)

It’s difficult to conceive of a less attractive transformation than that undergone by systems at the point they develop a dependency on a database. In the absence of persistence, functional transformation of values is about as delightful as software development can get, for many of us. More often than not, databases invite us to replace these transparent inputs and outputs with result sets and opaque connection handles — in exchange for the privilege of competing to submit strings to a distant, volatile authority.

Imagine a network in which we’ve a class of nodes responsible, in turns, for the deterministic application of transactions — and, hopefully, a larger class of participants issuing transactions, and interrogating the database they constitute. With an immutable architecture, participants needn’t issue queries over connection handles, or issue them at all — we embed a Datalog query engine in clients, and retrieve authenticated index segments as required by queries (via a peer-to-peer distribution protocol). The client maintains as large an authenticated subset of the network’s database / history as it needs, and executes queries locally — without contesting shared compute resources1.
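In Clojure terms, the experience we’re after might feel something like this; datopia.client and its functions are hypothetical stand-ins, not a published API:

(require '[datopia.client :as datopia])

;; A local database value, lazily replicating authenticated index
;; segments from peers as queries demand them.
(def db (datopia/db {:peers <peers>}))

;; Queries execute locally, against the replicated subset.
(datopia/q '[:find  ?balance
             :where [?e :email   "sally@gmail.com"]
                    [?e :balance ?balance]]
           db)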

1 For network participants with a need to execute queries contingent on the consumption of data surplus to local bandwidth/storage capacity — embedded devices, say — a generic mechanism for on-chain query evaluation is planned.

A Brief History of Time

For the purposes of convenience, we’ve been ignoring a crucial dimension — the temporal — and its centrality to coherent information systems. No longer! In Rich Hickey’s Datomic talk The Database as a Value, we’re beguiled by an appeal to the virtues of an epochal model of time — an unambiguously ordered accretion of immutable facts — as a more sound basis for reasoning about database semantics than the prevailing mutable cell model. This approach ought to be uncontroversial among blockchain enthusiasts — consider a network attaining consensus over a single numerical value, v, for four successive blocks.

It doesn’t require a philosopher’s wit to surmise there’s little to discuss about any of the values of v without reference to the corresponding block height. The insight gleaned from systems like Datomic is that the temporality required by sound information systems needn’t be some inconvenience — we gain tremendous expressivity and leverage by embracing it wherever we can.

In the abstract, we can model this property by conceiving of our Entity Attribute Value triples as EAVT quads, incorporating a dimension we’ll call, uh, Time:

Entity   Attribute  Value  Time
<joe>    balance    27     0
<sally>  balance    1      0
<joe>    balance    26     1
<sally>  balance    2      1

Each row is an immutable fact — Joe’s balance at T1 doesn’t invalidate, overwrite, or otherwise supersede his balance at T0 (indefinitely accessible to anyone nostalgic for T0). This behaviour extends on-chain, where we might — for example — deploy a contract (or execute a query) partly concerned with transparent, deterministic computations over the full or partial history of Joe’s balance. Similarly, light clients/applications may express identical traversals locally, via the selective replication mechanism outlined above.

Often, we can afford not to care about the temporal dimension — such as in the earlier transaction submission examples — but there are instances where it’s the only means of solving a problem. We can realize the database at any T, diff or join two databases as of different times, inspect the history of entities over time, etc. These are superpowers.
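In a Datomic-flavoured (and, once more, hypothetical) API, those superpowers read roughly as follows:

;; Realize the database as it stood at T0.
(def db-at-t0 (datopia/as-of db 0))

;; Every value <joe>'s balance has ever held, with its T.
(datopia/q '[:find  ?balance ?t
             :where [<joe> :balance ?balance ?t]]
           (datopia/history db))
;; => #{[27 0] [26 1]}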

Project Status

We’ve a functional, preliminary Clojure testnet combining Tendermint’s ABCI (classical consensus) with Datahike1, which, in concert, can do some — but not yet all — of what’s described above. While our primary focus is the design and delivery of a trustless, permissionless, neutral deployment of Datopia, we intend to encourage radical arrangements of its components — e.g. to experiment with alternative consensus/Sybil-resistance algorithms, trusted/closed deployments, etc.

Over the course of the next weeks and months, we’ll continue to publicly articulate the project’s goals and technical approach, with a view to attracting potential contributors, advisors, critics and investors.

1 An authenticated Hitchhiker tree (write-optimized B+ tree) capable of satisfying Datalog queries.
