Swarm v 0.4 intro (1/5)

Swarm v 0.4 intro (1/5)

On its long road from version 0.3 to version 0.4, Swarm changed a lot. While Swarm 0.3 was built along the lines of a M-of-MVC JavaScript framework, 0.4 is more like a cross of a NoSQL database, ORM library and a pub/sub service, although it does not fit either category.

  • The API is object based, much like in ORM.
  • Once fetched, the data keeps being updated, like in pub/sub.
  • Finally, Swarm supports no SQL :)

Hence, Swarm is safely described as “data sync middleware”.

I start a series of five posts discussing the project’s objective (1/5), the underlying technology (2/5), its parts and inner workings (3/5), its network protocol (4/5) and exposed APIs (5/5) as of version 0.4 (which is currently available as an unstable branch and will hopefully replace 0.3 once we reach post 5/5).

Objectives

Swarm’s mission is to synchronize data in collaborative apps, web and mobile alike. Swarm is the M of MVC. That shapes its very specific feature set:

  • it synchronizes data in real time,
  • it works fine on intermittent internet connections (mobile/wireless),
  • it works with partial datasets (the client can’t have the complete DB).

More on that below.

Real-time

These days, collaborative software tends to sync in real-time as the default. If users share a room or a teleconference, your software has to keep up with the natural speed of events. If you are doing a mobile app, then your app can’t be worse than SMSes and SMSes are pretty much real time already.

Going further on the realtimeness scale, interesting things start to happen. Arguably, Google Docs was the first Web app to experience those effects. Once the timescale of changes starts being smaller than the timescale of change propagation, it gets curiouser and curiouser.

Consider chats. Despite all the experience in building that kind of apps, their sync is imperfect, typically. For example, it is not uncommon to see messages evaporating from Facebook, GTalk or Skype message histories. And chats are just barely realtime.

Well, data syncing is fundamentally hard. Nobody builds their in-house DB or crypto these days. Soon, nobody will risk to build their own sync.

Intermittent connectivity

On one hand, “there will be no offline soon”; mobile data is ubiquitous. On the other hand, mobile+wireless means intermittent internet connection. Physics, contention, hardware faults, and hand-overs lead to temporary network unavailability. If a user is actually mobile (i.e. walks, holds a device in the hand), then he/she expects that device to be responsive now. That assumes some level of autonomy.

Swarm is able to cache everything it fetches and work off the local data set. Also, it is able to stash changes to sync them later.

Partial datasets

The classic DB “dataset replication” ideology assumes the full dataset being copied over and synced. Obviously, that does not work for client-side syncing. A client needs a tiny arbitrary subset of the data.

One popular shortcut is to work with per-user or per-document datasets. That trick shifts the complexity somewhere else, sometimes too much of it. A nice example is a simple chat roster (who’s online, who’s offline). The overwhelming majority of typical chat app traffic is roster updates. Statuses change more often and their fanout is higher than those of regular chat messages. The problem is, every user needs to see his own subset of “friends”.

Swarm aims to make it straight. There must be no less-synced UI parts. Swarm subscriptions work on the object-by-object basis. All the data fetched keeps being updated.

P2P sync

There is a bunch of political, engineering and usability arguments in favor of P2P architectures. Unfortunately, building P2P systems is order of magnitude harder.

Like git repos, Swarm replicas are peers, so there is no master copy. As a byproduct, Swarm supports local data caches and P2P serverless sync. That warrants some level of flexibility and resilience. For example, an on-premises cache can localize an organization’s traffic. Similarly, an on-device caching daemon can sync data locally between applications.

Consider modern smartphones. Their local data cache is virtually unlimited. Meanwhile, mobile data connections are by no means reliable. Given that trade-off, caching and prefetching come in handy. Direct device-to-device sync is also potentially useful.

In the next post , I’ll explain how Swarm achieves those objectives by employing CRDT data types.