Swarm v0.4 intro. Technology (2/5)

This post continues the v0.4 intro series, discussing the basics of the Swarm technology.

Auto-merged CRDTs

The choice to support both real-time and offline modes was mostly driven by the use case of collaborative apps running on mobile devices. That was an unusual ice-and-fire tradeoff, as the two conditions are perceived as opposites. Still, their respective complexities stem from exactly the same root cause: the asynchrony of the environment. If new events happen faster than other parts of the system become aware of them, things soon get difficult.

Swarm deals with asynchrony in the most fundamental way: by exclusively using Commutative Replicated Data Types (CRDTs). These enable both real-time sync (as in Google Docs) and disconnected operation (offline replicas are fully functional). CRDTs guarantee automatic merge of concurrent changes in an asynchronous environment. Obviously, that comes at a cost. With CRDTs, the most basic integer data type diverges into a variety of integer types with different behaviors: counter, register, and so on. But at least these types have a specified behavior. An average ad-hoc implementation is guaranteed to be full of surprises once exposed to asynchronous conditions. Even very professional implementations tend to reveal surprises repeatedly.

Hence, Swarm sticks to CRDT. No manual merge of changes allowed.
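
To illustrate what “automatic merge” buys us, here is a minimal sketch (in TypeScript; hypothetical, not Swarm’s actual API) of an op-based counter. Increments commute, so concurrent changes merge by themselves:

    // An op-based counter CRDT: the op is "add n".
    // Additions commute, so any delivery order yields the same sum.
    class Counter {
      private sum = 0;
      apply(inc: number): void { this.sum += inc; }
      get value(): number { return this.sum; }
    }

    // Two replicas receive the same concurrent ops in different orders:
    const a = new Counter(); a.apply(+2); a.apply(-5);
    const b = new Counter(); b.apply(-5); b.apply(+2);
    // a.value === b.value === -3: the merge is automatic.

A plain assignable integer (a register) has no such luck: concurrent assignments do not commute and need a tie-breaking rule, which is exactly why the single integer type splits into several CRDT types.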

On the other hand, if developers want to, they are free to create their own CRDT data types on top of Swarm’s POLO layer (partially ordered op log).

Still, CRDT is quite a broad definition: it covers any type whose states converge or whose ops commute. So, which ones?

Op-based CmRDT

Swarm chose op-based Commutative RDTs over the state-based Convergent RDTs that server-side implementations prefer. Indeed, op-based types are much better suited to bandwidth-restricted environments (client-side, mobile). Op-based CRDTs are somewhat reminiscent of Operational Transformation: every change to a data structure is represented as an operation (op), which is sent around, applied, and stored. Indeed, imagine if Google Docs used some state-based approach instead of OT: sending the full document state on every key press would be a nightmare.
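
To see the bandwidth difference, compare the data shipped per keystroke under the two approaches (a hypothetical sketch; the field names are illustrative, not Swarm’s wire format):

    // Op-based: a keystroke travels as a small self-contained op,
    // a few dozen bytes regardless of the document size.
    interface InsertOp {
      object: string;  // id of the document object
      stamp: string;   // logical timestamp: "<time>+<origin replica>"
      after: string;   // stamp of the preceding character
      char: string;    // the character typed
    }

    // State-based: every update ships the whole object state,
    // i.e. O(document size) per keystroke.
    interface StateUpdate {
      object: string;
      fullText: string;
    }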

What is our difference from OT, then? OT operations are repeatedly transformed (by definition) to match the asynchronous environment and its partial event orders. Contrary to that, Swarm CRDT ops are immutable. Data structures are defined in such a way that slight variations in op order do not affect the outcome, as long as cause-and-effect ordering is obeyed.
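
Here is that order-insensitivity in miniature: a last-write-wins register (a hypothetical sketch, not Swarm’s API) where ops are immutable and concurrent ops are ranked by their logical timestamps, so the delivery order does not matter:

    // Ops are immutable records. Stamps "<time>+<origin>" are unique,
    // so comparing them yields a total order consistent with causality:
    // an op that causally follows another has a greater time component.
    // (Real stamps need an order-preserving encoding; a plain string
    // comparison is used here for brevity.)
    interface RegOp { stamp: string; value: string }

    class LWWRegister {
      private last: RegOp | null = null;
      apply(op: RegOp): void {
        // Keep the op with the highest stamp; applying two ops in
        // either order leaves the register in the same state.
        if (this.last === null || op.stamp > this.last.stamp)
          this.last = op;
      }
      get value(): string | null {
        return this.last && this.last.value;
      }
    }

    const x = new LWWRegister(), y = new LWWRegister();
    const op1 = { stamp: "5+alice", value: "cat" };
    const op2 = { stamp: "7+bob",   value: "dog" };
    x.apply(op1); x.apply(op2);  // one delivery order...
    y.apply(op2); y.apply(op1);  // ...and the other
    // x.value === y.value === "dog": the replicas converge.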

So, every object has multiple replicas; replica states are changed by ops, and ops eventually propagate to all the replicas to make their states converge. To a first approximation, Swarm is a regular CmRDT implementation.

Practical question #1. Suppose we download a CRDT object. Should we download its entire op history? Answer: not in Swarm. More or less in line with the classic state machine replication (SMR) model, new replicas are initialized with a state snapshot and then continuously updated by a stream of ops. The key difference from classic SMR is subtle, but important nevertheless: SMR assumes a single linear order of all the ops, while we use a partially ordered op log (POLO), so our ops can arrive in different orders at different replicas.
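
A sketch of that initialization, with hypothetical names (Swarm’s actual API differs): a new replica takes a snapshot plus the tail of the op log, and skips any op the snapshot already covers:

    type Stamp = { time: number; origin: string };
    type Op<S> = { stamp: Stamp; apply: (state: S) => S };

    // A version vector: the latest time seen from each origin replica.
    type VersionVector = Map<string, number>;

    const covers = (vv: VersionVector, s: Stamp): boolean =>
      (vv.get(s.origin) ?? -1) >= s.time;

    function initReplica<S>(
      snapshot: { state: S; version: VersionVector },
      ops: Op<S>[],
    ): S {
      let state = snapshot.state;
      for (const op of ops)
        if (!covers(snapshot.version, op.stamp))  // skip ops already in the snapshot
          state = op.apply(state);
      return state;
    }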

Practical question #2. Papers tend to consider “steady state” synchronization. In practice, steady state is a temporary condition, and synchronization can be interrupted halfway, especially in mobile networks. How does a replica catch up after such a gap?

Hence, Swarm employs a sort of reconciliation handshake protocol. First, replicas exchange batches of ops to equalize their states. Then, they proceed with steady-state synchronization. As we deal with partial orders, a handshake is based either on version vectors or on linear positions in the respective op logs. The latter option is cheaper bandwidth-wise, but slightly more expensive in terms of state to maintain.
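
In code, the version-vector flavor of the handshake boils down to filtering the op log (reusing the types from the previous sketch; the actual protocol is the subject of part 4/5):

    // Each side sends its version vector; the peer replies with every
    // op the vector does not cover, then switches to live op streaming.
    function missingOps<S>(log: Op<S>[], remote: VersionVector): Op<S>[] {
      return log.filter(op => !covers(remote, op.stamp));
    }

    // The cheaper alternative: a peer we already track sends a single
    // integer, its position in our log, and receives log.slice(position).
    // Less bandwidth on the wire, but we must remember a position per peer.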

The resulting scheme is nicknamed “a shashlyk”:

0╌╌╌╌┴╌╌┴╌╌╌╌┴╌╌╌╌>

0  "zero" (default) state
╌  ops
┴  intermediary states
>  "head" (recentmost) state

Unlike both SMR and OT, the accent here is on immutable ops, not states. States are not exactly ephemeral, but the exact state sequence may vary from replica to replica. Also, your replica might pass through a state no other replica has ever had. Still, once all the ops reach all the replicas, they converge to the same state.
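
In other words, ops are primary and states are derived: any state on the “shashlyk” is just a fold of ops over the zero state (same hypothetical types as above):

    // Any replica state is the zero state with some causally-closed
    // set of ops applied. Intermediate states may differ per replica,
    // but the fold over the full op set is the same everywhere.
    const stateAt = <S>(zero: S, ops: Op<S>[]): S =>
      ops.reduce((state, op) => op.apply(state), zero);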

Practical question #3. How do these ops get to those replicas?

Spanning trees

It is easy to sketch a simple one-to-one client-server sync scenario. But here we approach practical question #4: is there such a thing as simple client-server sync at all?

First, the client is often intermittently connected and needs to resync on reconnection; that aspect was covered earlier. Second, even a relatively simple single-page web app opened in two browser tabs creates certain complications. Some apps simply forbid that: please close the other tab. Third, a web app is rarely served by a single server, and multiple servers require server-to-server sync. Fourth, greater autonomy requires a local cache. Imagine an office subnet that keeps working when “the cloud” goes down, or two apps that need to sync on a single mobile device which is currently offline.

Those challenges necessitate a multilevel sync structure: not just one-to-one sync, not a “star”, but at least a tree. A spanning tree of replicas is our single unifying abstraction for both server-side and client-side sync. Ops spread along the spanning tree from replica to replica, using the minimal number of transmissions. That differs from the “gossip” approach, which sends updates either in all possible directions or randomly.
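
A sketch of the relay rule, with hypothetical names: apply an incoming op locally, then forward it to every attached link except the one it arrived on. Since a tree has no cycles, each op crosses every edge exactly once and no duplicate suppression is needed (unlike gossip):

    interface Link { send(op: unknown): void }

    class TreeNode {
      constructor(private links: Set<Link>) {}

      // `from` is the link the op arrived on,
      // or null if the op originated locally.
      deliver(op: unknown, from: Link | null): void {
        // ...apply the op to the local replica here...
        for (const link of this.links)
          if (link !== from)   // never echo an op back
            link.send(op);     // one transmission per tree edge
      }
    }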

Self-stabilizing spanning tree construction algorithms are a separate interesting topic, so let’s skip that for now.

We assume that, for each object, there is a “responsible” server-side storage that serves as the root of the respective spanning tree. Everything that gets into the root replica is truly saved. Still, isolated subtrees may function just fine; technically, any node may be appointed a root.

So, let’s reiterate question #2. Remember that the graph of replicas is far from steady state: connectivity is intermittent. Still, we need to propagate all the ops to all the replicas while avoiding causality violations.

The problem is resolved by the handshake algorithm, which merges subtrees of replicas. It exchanges the ops that are missing on each respective side, so steady-state sync may continue from that point on. Thus, handshakes build the spanning tree out of two-way subscriptions. The inner workings of a Swarm handshake will be discussed in part 4/5.

At this point, we have a bird’s-eye view of the Swarm system: Ops propagate along a Spanning Tree to all the Replicas to make their States converge.

The next part (3/5) will explain what constitutes a Swarm node and how its parts work together: the storage engine, op routing, and API objects.