
All Blog Posts

  • Supporting Directory Operations With a Flat Name Index
  • Snapshot Manifests
  • Unique Version IDs and Generations, but no Version Numbers
  • Feature that made ZFS famous
  • Multiple Tenant Access To A Shared Storage Cluster
  • Beyond the Virtual Disk
  • Immutable Metadata Not Enough
  • Namespace Manifests
  • Location Independent References
  • Consensus, Who Needs It?

Supporting Directory Operations With a Flat Name Index

June 27, 2018

Caitlin Bestler

Traditional file systems allow renaming of directories or moving files from one directory to another, but with a giant asterisk. Rename operations are not allowed unless the existing and new names are both part of the same file system.

You probably think of your entire file namespace as 'the file system', so breaking that namespace down into multiple 'file systems' may well strike you as just an excuse for telling you that they won't support the rename that you just requested. It is especially annoying that you are expected to remember which of your directories you created as 'file systems' and which are mere directories.

However, there are actually reasons why renaming across file systems is difficult. These issues ultimately require a different implementation strategy to support efficient rename operations for scale-out cloud storage.

Hierarchical directories enable efficient renaming within a single-server file system, but are ultimately incompatible with scale-out storage. Most object storage solutions have solved this problem by disallowing rename operations, but there is a way to support directory operations while keeping the efficient "flat" name indexes used in object storage.

Distributed multi-user storage systems either have to be copy-on-write or have very complex algorithms to ensure read operations complete consistently. Nobody wants a read operation to return a mix of old data and partial new data. They want the read either to return the data as it was before someone else's write or to yield data entirely from after the other user's write. Both NexentaStor and NexentaEdge are copy-on-write solutions.

The challenge for copy-on-write systems is that any update to an object/file theoretically requires updating all of its parent directories. This would result in tremendous write amplification if each write required its own write to each of its parent directories. Each server is required to acknowledge a write only after the write and all derived writes are "done". You can't say "it's done" and later say "oops, sorry". "Done" means "done".

Fortunately the acknowledgement does not require all of those writes be "done" in the sense that they are in their final format, just that the write is "persistently stored". Updating of parent directories can be deferred as long as the deferred edits will be reflected in all transactions going forward and that the changes will not revert if the system reboots. Basically, the edit has to be reflected in working memory and saved persistently to disk. But it does not have to be saved to disk in the permanent format.

Instead the final format writes can be deferred until later. Until then journal entries guarantee the writes will be eventually completed and not forgotten during a system restart.
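
As a rough illustration of this pattern (all names here are hypothetical, not the actual NexentaEdge implementation), a directory edit can be acknowledged as soon as its journal record is durable, with the final-format directory write folded in later:

```python
import json
import os

class DirectoryJournal:
    """Hypothetical journal: directory edits are acknowledged once the
    journal record is durable; the final-format write happens later."""

    def __init__(self, path):
        self.path = path
        self.pending = []          # in-memory view of un-applied edits

    def record_edit(self, directory, name, version_id):
        entry = {"dir": directory, "name": name, "version": version_id}
        # Append the record and force it to stable storage before acking.
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.pending.append(entry)
        return "acknowledged"      # safe: the edit will survive a restart

    def replay(self):
        # On restart, re-apply every journaled edit that was never folded
        # into the permanent directory format.
        with open(self.path) as f:
            return [json.loads(line) for line in f]

    def fold_into_directory(self, apply_fn):
        # Deferred, batched write: one committing write to the parent
        # directory can cover many pending transactions.
        for entry in self.pending:
            apply_fn(entry)
        self.pending.clear()
```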

Deferring writes spreads them out. The number of writes required before the initial write can be acknowledged is greatly reduced. Even the eventual number of writes can be reduced, because a single committing write to a parent directory can frequently reflect more than one pending transaction.

The reason that writing to the journal is considered safe is that each file system will reapply pending writes from the journal when restarting.

Ensuring that an acknowledged transaction will always look as though it was fully completed, no matter what, is not simple logic. But it is at least a tractable problem on a single-server file system. Maintaining a journal across multiple servers is far more complex. Just consider how many different recovery scenarios have to be dealt with. Any one of the servers can fail. A second server might fail while the first is attempting to recover. Enumerating the failure scenarios is difficult; solving them correctly is even more so.

There are ways to maintain a journal across multiple servers, but not without slowing down each and every commit.

Federated NAS solutions, such as NFSv4, solve this by mapping prefix names to different 'mount points'. Each is its own root file system, with its own journal and its own repository for the file system metadata, backed by a distinct storage pool at least for that metadata. The question is how big should each file system be? Larger file systems enable a wider scope of rename operations but also create a processing bottleneck for metadata operations.

That is why object storage systems have favored eliminating bottlenecks. They effectively make each object its own file system. Having a scope of a single file means there are no directory rename operations.

You can clone an existing object with a new name and hopefully have the system copy only metadata, not payload. But even that requires two operations: creating the clone and then marking the old name for deletion. Cloning an entire tree, if supported, requires copying metadata for each object within the renamed tree. Not having to replicate the payload is a big advantage, but an entire tree of metadata is still an extensive copy operation before the rename can be acknowledged.

Object storage systems restrict move operations for a very good reason. Getting and putting objects that cannot be moved elsewhere in the namespace hierarchy is simple and straight-forward. Creation of each replica can be acknowledged by the single storage target creating that replica. The creation of each replica can be a simple atomic operation that a single target either completes or does not. These simplifications allow quick completion of the most common transactions. It is hard to even imagine a scenario where directory rename operations would be more common than puts.

What we want is to keep the benefits of simple puts while enabling directory move operations to be performed. Our goals are simple:

  • The number of operations required to get an object should not be dependent on the number of sub-directories in the path name.
  • The put of each replica of the Version Manifest must remain a single atomic transaction performed on a single storage target.
  • Directory Operations need to have linear execution times that are not dependent on the number of objects that are conceptually enclosed in a directory.

Directory Operations can be supported by defining a new type of Object Version that encodes Links. These objects are otherwise like any other Object Version except that they have abbreviated Version Manifest formats.

A File Link encodes minimal key-value metadata, including a reference to the linked-to Version Manifest. This specifies the linked-to NHID and the VM-CHID of the specific version.

A Directory Link encodes a mapping of all names matching a prefix to a new prefix. Versions must be created for both the Directory Link From and Directory Link To names. These records enable resolving names created either by an 'ln' command or by a 'mv' command.

The namespace manifest records for these Link objects enable mapping a time-dependent name that may have been linked or renamed to a stable name in the fully distributed namespace. It avoids the need to edit the names in the Namespace Manifest.
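
A minimal sketch of how such Link records might be applied during name resolution, assuming simplified record shapes (the real Version Manifest fields are richer than shown):

```python
from dataclasses import dataclass

@dataclass
class FileLink:
    # Points one name at a specific version of another object.
    name: str
    linked_nhid: str      # name hash ID of the linked-to object
    linked_vm_chid: str   # content hash of the specific Version Manifest

@dataclass
class DirectoryLink:
    # Maps every name under `from_prefix` to the same name under `to_prefix`.
    from_prefix: str
    to_prefix: str

def resolve(name, directory_links):
    """Rewrite a user-visible name to its stable name in the flat index
    by applying Directory Links; the flat index itself is never edited."""
    for link in directory_links:
        if name.startswith(link.from_prefix):
            name = link.to_prefix + name[len(link.from_prefix):]
    return name

# Example: after "/A/B" was moved to "/A/B2", the new name still resolves
# to the stable name under which the records were originally indexed.
links = [DirectoryLink(from_prefix="/A/B2/", to_prefix="/A/B/")]
print(resolve("/A/B2/C", links))   # -> /A/B/C
```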

NexentaEdge's ability to make coherency guarantees while operating with the efficiency of eventual consistency is dependent on two rules:

  • Every operation is performed atomically by a specific storage target. That storage target can acknowledge the creation of a replica once the atomic put of the chunk is safe via journaling local to that storage target.
  • All operations on metadata are commutable. It does not matter in which order the operations are performed; as long as they are all eventually applied the final results are identical (the same records are in all replicas of the namespace manifest and the fully distributed name indexes all have the same current version of the object). The sketch below illustrates this.
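
A minimal illustration of the second rule, using a plain set of immutable records as a stand-in for a namespace manifest shard: because inserts of uniquely identified records commute, applying the same batches in any order yields the same replica state.

```python
def apply(records, updates):
    """Inserting immutable, uniquely identified records is commutative:
    the replica converges on the same set regardless of arrival order."""
    return records | set(updates)

base = frozenset()
batch_a = [("obj/X", "uvid-1"), ("obj/Y", "uvid-2")]
batch_b = [("obj/Y", "uvid-3")]

# Two replicas receive the batches in opposite order...
replica_1 = apply(apply(set(base), batch_a), batch_b)
replica_2 = apply(apply(set(base), batch_b), batch_a)
assert replica_1 == replica_2        # ...and still hold identical records
```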

Modifying Namespace Manifest records from '/X/Y' to '/X/Z' would break these rules:

  • Two rename operations cannot be applied in either order if they both seek to change the same name.
  • Changing all '/X/Y' records to '/X/Z' records can involve two or more namespace manifest shards.

We can solve this with Directory Links. A Directory Link specifies where objects/files within this Directory can be found in the fully distributed namespace. This is itself a versioned object. Of course moving a directory requires creating two of these object versions, one to activate the new name and one to hide the old name.

Updating a Link is just putting a new object version. The new object versions have unique version identifiers, making it safe to note their existence in the Namespace Manifest. An Initiator can transactionally 'edit' a directory by putting a new version of the Directory Link. We can extend the Version Manifest put transaction to define a 'conditional put', where the new version can only complete successfully if it is the only version for this generation.
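
A sketch of such a conditional put, with a hypothetical in-memory index standing in for a single storage target: the put succeeds only if it is the sole version for that generation, otherwise the conflict is reported back to the Initiator.

```python
import threading

class VersionIndex:
    """Hypothetical per-target name index keyed by (object, generation)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_generation = {}     # (nhid, generation) -> uvid

    def conditional_put(self, nhid, generation, uvid):
        # Atomic on this single target: succeed only if this is the sole
        # version for the generation, otherwise report the conflict.
        with self._lock:
            key = (nhid, generation)
            if key in self._by_generation:
                return False, self._by_generation[key]   # lost the race
            self._by_generation[key] = uvid
            return True, uvid

index = VersionIndex()
ok, winner = index.conditional_put("dirlink:/A/B", generation=8, uvid="ts1+initiatorA")
assert ok
ok, winner = index.conditional_put("dirlink:/A/B", generation=8, uvid="ts2+initiatorB")
assert not ok and winner == "ts1+initiatorA"
```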

Rather than requiring all name resolutions to go through this process we can require that all namespaces follow the normal object rules, that is they are unmoveable, unless movability has been specifically enabled with a metadata key-value pair for an enclosing directory.

These extra object types enable support for directory renames without negatively impacting the performance of object gets and puts. The user regains the ability to publish using links and to re-organize their namespace by moving directories.

Defining Link objects effectively allows directory operations while keeping the benefits of a flat name index. Further, it does not require pre-designating specific directories as "file systems" and limiting the scope of certain operations to conform to file system boundaries. Nor are we forcing all objects into top-level "buckets" with no support at all for directory rename operations.

Snapshot Manifests

June 13, 2018

Caitlin Bestler

Snapshot Manifests are NexentaEdge's implementation of a traditional copy-on-write snapshot applied to a fully distributed eventually consistent storage cluster.

Snapshot Manifests are versioned objects which specify a precise set of object versions. They are created as a subset of a master distributed record store describing all object versions within a specific namespace, typically a Tenant. We will describe the Namespace Manifest and its construction in a later blog.

Snapshot Manifests can be used in all the ways that traditional file system snapshots can be used. They can even be used in many additional ways.

Snapshot Manifests are created by users. Unlike traditional file system snapshots they are not file system objects. Snapshots can be taken of any subset of a Namespace Manifest, not just of pre-designated file systems. There are no file system limits imposed on the number of snapshots that can be kept. They are created by end users for end user purposes.

Deferred Enumeration Snapshots

A new version of a Snapshot Manifest can be created in an instant, capturing a true point-in-time snapshot of the selected subset of the namespace. The version is first expressed as a rule which specifies the set of information to be extracted from a Namespace Manifest. The extracted records replace this rule once the extraction completes. Before then the version exists, but cannot be retrieved.

The set of objects referenced by a Snapshot Manifest does not have to be known by the creator of the Snapshot. The Snapshot does not even have to be for 'now'. Snapshots can be created covering any point-in-time in the past where the referenced data has not yet been expunged.
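
A sketch of the deferred-enumeration idea under assumed record shapes: the snapshot is first just a rule (here a prefix and a point in time, possibly in the past), and the matching records are extracted from the Namespace Manifest afterwards.

```python
from dataclasses import dataclass

@dataclass
class NamespaceRecord:
    name: str
    uvid_timestamp: float    # high-precision put time
    vm_chid: str             # content hash of the Version Manifest

@dataclass
class SnapshotRule:
    prefix: str              # subset of the namespace to capture
    as_of: float             # snapshot time, possibly in the past

def extract(rule, records):
    """Deferred enumeration: keep, per name, the latest version whose
    timestamp is at or before the snapshot time."""
    latest = {}
    for r in records:
        if r.name.startswith(rule.prefix) and r.uvid_timestamp <= rule.as_of:
            cur = latest.get(r.name)
            if cur is None or r.uvid_timestamp > cur.uvid_timestamp:
                latest[r.name] = r
    return sorted(latest.values(), key=lambda r: r.name)

records = [
    NamespaceRecord("/tenant/a.mp3", 100.0, "vm1"),
    NamespaceRecord("/tenant/a.mp3", 205.0, "vm2"),   # after the snapshot time
    NamespaceRecord("/tenant/b.mp3", 150.0, "vm3"),
]
print(extract(SnapshotRule(prefix="/tenant/", as_of=200.0), records))
```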

Because snapshots are a subset of a Namespace Manifest they can even be the starting point for further refining the subset by applying further qualifiers to further limit the included set. New snapshots can also be created by merging the records from existing snapshots as long as they were both subsets of the same original namespace.

Consistent Snapshots

By following some simple rules for time-stamping Object Versions it is easy to create Snapshot Manifests which can be consistent for a specific application. They are consistent in that they either fully include a transaction or fully exclude it; a snapshot never captures a transaction that is halfway done.

When it is complete it is a true point-in-time snapshot of a distributed cluster that never stalls any Initiator from creating new object versions because of network issues or the actions of any other initiator. The snapshot time is not when a set of target machines 'takes' a bunch of distributed snapshots, it is the filter that identifies the set of records to be extracted.

It is not even necessary to identify the snapshot time in real-time. Rather than co-ordinating a cluster-wide quiescent point in real time it would be possible to compare transaction logs after the fact to determine when the application was quiescent after the fact.

No 'Freeze' Allowed

Chandy and Lamport in their 1985 paper[1] compare the problem of taking a snapshot of a distributed system to that of taking a photograph of a sky full of migrating birds. The image is too immense to be captured by a single photograph, and the photographers cannot demand that the birds "freeze" to enable photo gathering.

Any "snapshot" solution that impedes ongoing transactions in any way is the equivalent of demanding the migrating birds to freeze. The birds won't listen to you, and the distributed storage servers shouldn't be interfered with.

Some "snapshot" solutions require the application layer to effectively gather this ultra-wide screen photo collage themselves by "snapshotting" only the versions known to a specific node at the time of the snapshot. You can snapshot a directory, but only if you already know the latest version of everything in that directory.

This is not effective when the whole purpose of the snapshot is to enable processing of all data collected as of time X. You can only get the list of all data collected by time X if you already know all data collected by time X.

Other solutions merely support creating a clone of a specific object version and then calling that clone a "snapshot".

NexentaEdge provides a true distributed snapshot. The Chandy-Lamport algorithm requires end-to-end communication. Ours does not require end-to-end communication to take the snapshot, merely to publish an extract. Periodic snapshots can be taken every hour on the hour, not every hour at the first second of that hour when the network was fully healthy.

With this strategy, NexentaEdge can be accepting new object versions from hundreds of sources concurrently, and still take a point-in-time snapshot at any time that specifies a consistent set of object versions. These snapshots can be taken automatically at specified times even when the network is partitioned. The 2:00 AM snapshot will reflect 2:00 AM, not the first moment after 2:00 AM when the network was fully healed.

Because all of the information about a Version Manifest is unique and immutable, a Snapshot can cache any portion of the information from the Version Manifest in the snapshot itself. While this makes the snapshot object larger, it can speed up access to the snapshot objects. This can allow distributed compute jobs to publish results as a snapshot, allowing single-step access to the referenced chunks by clients who effectively "mount" the snapshot.

File System Snapshots

Traditional file system snapshots from copy-on-write filesystems such as ZFS enabled taking snapshots without making the world 'freeze' first. Snapshots enable rollback to prior points, reading old versions of files and efficient replication of file system changes between multiple sites.

The initial design goal for NexentaEdge snapshots was simply to replicate those features. Taking point-in-time snapshots on single-system file servers is relatively easy, because ultimately there is a single transaction queue. Snapshots can be taken between transactions. Doing that in a distributed system would require complex cluster-wide synchronization. But building snapshots as an extract of a collection of transaction log entries enables something even more powerful - the snapshots are no longer bound to a limited number of pre-identifiable snapshot-eligible mount points. Snapshots can be for any set of objects/files no matter what directory they are located 'within', and selection is not limited to specifying a directory and all descendants.

The chosen subset can be for any set of matching object/file names.

The subset can be further refined by requiring any number of metadata field matches. For example a snapshot could be limited to objects owned by a specific user.

The snapshot time can be chosen after the fact to be a time when no node had initiated a transaction but not yet completed it. It is not necessary to determine this instant in real-time, merely before the earliest possible expunging of data that the snapshot should protect.

There is an even easier method for an application to avoid the risk of a snapshot including only a portion of a transaction: it can force all object versions created as part of a transaction to have the same timestamp.
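
A minimal sketch of that convention, where `put_version` stands in for a hypothetical client call that accepts an explicit timestamp:

```python
import time

def put_transaction(put_version, names_and_payloads):
    """Stamp every object version created by one application transaction
    with the same timestamp, so any time-based snapshot filter either
    includes all of them or none of them."""
    ts = time.time_ns()                 # one timestamp for the whole batch
    for name, payload in names_and_payloads:
        put_version(name, payload, timestamp=ts)
```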

Multiple Applications Can Snapshot The Same Data

Because a snapshot is an application object that merely pins the referenced data, different applications can create their own snapshots that reference the same files.

Multiple snapshots taken "at the same time" do not slow the system down any at the nominal snapshot time. There is a non-zero amount of metadata creation required for each snapshot, but no payload chunks need to be replicated and there are no critical path dependencies waiting for the new snapshots to finish their extraction job.

One Application Can Snapshot The Same Data Many Times

It even makes sense for a single application to take different snapshots of different subsets of a larger pool of data. Machine Learning applications work with both Training Sets and Testing Sets. Both are subsets of a larger set of objects. Snapshots are a storage- and communications-efficient method of identifying these sets.

Snapshots provide the same "containerizing" of a set of data as aggregating container formats such as HDF5, but without replicating the actual payload.

Snapshots can become the representation of "datasets" as used in Machine Learning applications. Multiple snapshots can each specify different selections of data, providing a single handle for referencing that set.

Snapshots can be any set of object versions, making them useful for reporting the results of object queries.

Snapshot Optimized Data Retrieval

Snapshots are not limited to identifying the set of object versions referenced behind a given tag. They can be used to optimize retrieval of the data referenced in a specific snapshot:

  • Any record within a Version Manifest can be included in the Snapshot Manifest. This can eliminate some or all access to the Version Manifest when the client accesses the object using the snapshot.
  • The local storage layer can pre-fetch non-cached Version Manifests and/or payload chunks as soon as the client indicates that it intends to access data via a snapshot (by "mounting" the snapshot). This can particularly optimize multi-stage HPC jobs.

Summary

Snapshot Manifests enable creation of true point-in-time snapshots of an eventually consistent distributed storage system. Any criteria can be applied to identify the specific subset of objects to be included in the snapshot. Once created, the Snapshot Manifest can be used to access that data, to optimize processing of that data as a set, to clone that point in history or roll back to it.

The Snapshot Manifest goes beyond the traditional file system snapshot and becomes a far more generalized expression of a dataset.

[1]: K. Mani Chandy, Leslie Lamport: Distributed Snapshots: Determining Global States of a Distributed System. ACM Transactions on Computer Systems 3, Nr. 1, February 1985.

Unique Version IDs and Generations, but no Version Numbers

June 12, 2018

Caitlin Bestler

How do you assign a unique version number for a new object version even when the network is split? Two clients, A and B, both want to create a new version of Object X, but there is no network connectivity between them. What do you do?

  • Only allow the client connected with a quorum to put their new version. Keep in mind that neither might be within the quorum. Any version not accepted is in danger of loss from client machine or drive failure.
  • Create a unique identifier that does not require any form of serialization.

NexentaEdge does the latter, creating a Unique Version Identifier (UVID) by extending a high precision timestamp with the network id of the client. Now the versions created by A and B will be unique, and all nodes will agree on which one was posted later.

While it works the UVID is somewhat awkward. It is large, it is difficult to sequence, and it does not identify when there are two potentially conflicting edits.

NexentaEdge solves this by adding a "Generation" metadata field. A client sets this to 1 greater than what it perceives the current Generation to be.

So, A and B both observe object X with UVID Y and Generation 7. A then puts object X UVID Y1 Generation 8, while B puts object X UVID Y2 Generation 8. Even if they put the versions at exactly the same tick their network address will break the tie. All nodes will agree that B's version is later than A's because even though their timestamp is identical the tie is broken by B's "later" address.
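
A sketch of that ordering rule (the field names are illustrative, not the wire format): the timestamp orders versions, and the client's network id breaks exact ties.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UVID:
    timestamp_ns: int      # high-precision put time
    network_id: str        # the client's network id breaks exact ties

    def sort_key(self):
        return (self.timestamp_ns, self.network_id)

# A and B put at exactly the same tick...
a = UVID(timestamp_ns=1_000_000, network_id="10.0.0.1")   # client A
b = UVID(timestamp_ns=1_000_000, network_id="10.0.0.2")   # client B
# ...and every node agrees that B's version is "later".
assert max(a, b, key=UVID.sort_key) == b
```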

We now have two Generation 8 object versions. This is actually good, because the alternative is to not accept one of them. What the Generation field does is flag that a version has been eclipsed: a 'later' version has superseded it without its contents having been fetched while preparing that 'later' version.

Was information lost? There is no automated method that can correctly evaluate the two versions. A human editor is going to have to review the two edits and decide what change, if any, is required to create a Generation 9 version. Citing the additional 'based upon' versions signals that the split has been resolved. The split no longer needs to be highlighted or to have an extended hold on those versions.

Normally the 'Generation' field will look like a normal "version number". It will monotonically increase. Putting more than one version during the short get-then-put cycle will be rare. But when it happens the system can notify the authors of the conflict and automatically hold the eclipsed versions until a merged later Generation is posted.

Feature that made ZFS famous

April 25, 2018

Dmitry Yusupov

It was back in 2005 when Sun Microsystems unveiled OpenSolaris and a young yet brave new file system called ZFS. Sun did lots of marketing to get ZFS widely recognized and known. Mostly it was a grass-roots effort via engineering blogs and forums, driven by people who were very passionate about what they had architected and engineered.

But what was it that made ZFS so prominent among the other filesystems of the time? What was wrong with the filesystem world that made ZFS so widely accepted?

Some believe that ZFS's differentiation was a direct result of its performance. However, ZFS wasn't the fastest. In fact, it was on average merely acceptable in the majority of workloads, and EXT4 actually outperformed ZFS. ZFS was greedy for available DRAM and CPU resources. And when those components became affordable over time, following Moore's law, ZFS outperformed EXT4 on write-intensive workloads. The improved performance of ZFS, however, was not the leading reason for its fame.

ZFS stands for "Zettabyte File System". But I don't think its vast addressability was what mattered at the time. In fact, a typical array size (a ZFS pool cannot be accessed by two nodes simultaneously) was in the range of 100-200TB. You do not need ZB+ addressability if everything you plan to address will fit within 1PB at most. No, it wasn't its great scalability either.

Features? Usability? Well, those did help attract hundreds of thousands of enthusiasts to play with it, and certainly gave it the necessary momentum. But this couldn't compare to what Linux adoption did, with millions of people using EXT4 every day, even now. And by the way, the idea of breaking the layers was "too revolutionary" at the time: while it gained usability by integrating the LVM and filesystem layers, it created unnecessary controversy, alongside the CDDL vs. GPL discussions. This didn't help much.

Looking back, I'd say the most important feature, the one that really made ZFS so appreciated in enterprise circles, is its built-in end-to-end integrity, a result of the Copy-On-Write technique it used so masterfully. As a matter of fact this technique was the core of the dispute between NetApp and Sun (later Oracle) until it was settled in court, due in part to the expiration dates of the patent claims.

To understand this, you need to be an admin, or a friend of an admin, whose job is on the line if their enterprise suddenly loses data or silently corrupts its archives. When ZFS was released to the public, in the open, free as in speech, or with support contracts from companies like Nexenta, it was a big deal to those people! No longer did they need to buy expensive EMC subscriptions and NetApp contracts. ZFS let people keep critical data safe, with guaranteed integrity and self-healing capabilities built in. Snapshots and Clones were natural by-products of Copy-On-Write rather than separate inspirations; ZFS just harvested its benefits smartly. By using the ARC cache and clever transaction algorithms, it delivered acceptable performance even for workloads that are typically hard for Copy-On-Write, i.e. random writes. As I mentioned earlier, Moore's law worked in its favor.

It was in 2011 that a group of talented engineers and architects at Nexenta realized that the world needed to extend the Copy-On-Write technique to stay relevant in the Cloud era. This is when the Cloud-Copy-On-Write technique was born, and with it a new product delivering it: NexentaEdge.

Learn more about NexentaEdge at http://nexentaedge.io.

Multiple Tenant Access To A Shared Storage Cluster

April 12, 2018

Caitlin Bestler

In the prior blog we discussed how to use Kubernetes to provision a class of storage clusters which protects against loss of stored assets itself via erasure encoding and/or replication across already provisioned resources, rather than relying on Kubernetes to supply replacement resources.

In this blog we will extend that concept to show how to use Kubernetes to enable multiple tenants to each access the same storage cluster via tenant-isolated networks.

Kubernetes is currently capable of scheduling a storage cluster which provides storage services to a flat namespace. It can also create multiple tenant clusters with isolated pods which cannot accept connections from nodes of other tenants. The proposal discussed here allows storage clusters to offer storage services to pods via tenant-isolated access networks.

The convention proposed is to first provision a storage cluster using Kubernetes and then independently provision tenant access networks for each tenant to access this same storage cluster.

Each tenant will have their own access network and their own storage namespace on a common backend storage cluster. The storage cluster will typically have a common backend storage network which serves all tenants. The storage cluster is in control of allocation of resources to different tenants. While it is not required to fully provision storage resources to each tenant it will typically enforce quota limitations on each tenant.

Isolated Tenant Access Networks

For the frontend tenant access networks the goal is to provide isolated pods that are all connected to the same tenant access network. Any underlying technology that prevents Tenant B clients from connecting to Tenant A pods can be selected. VLAN, VxLAN and firewall methods all work. What Kubernetes lacks is a uniform strategy allowing multiple tenants to each have isolated networks to a shared backend storage cluster.

Each tenant would be able to enumerate the participants in their access network:

  • Client Pods
  • Service Pods providing tenant specific access to services such as Active Directory or LDAP.
  • Access Pods providing tenant-specific access to the storage cluster. These Pods would run on hosts that already had the desired Storage Cluster backend pods launched.

Shared Backend Storage Cluster

Why not have each tenant simply clone the storage service for themselves?

Per-tenant storage clusters force resources to be rigidly divided between tenants. This inevitably results in idle resources reserved for tenants that are not currently doing enough to keep them busy.

Having separate tenants access a shared storage cluster over tenant-specific networks allows for dynamic resource sharing between tenants. Given the bursty nature of storage traffic it is very unlikely that each tenant will always be using the same portion of the cluster's resources. Limiting the share any one tenant uses over the long term is obviously good business - it forces those using more resources to pay for better SLA guarantees. Strictly dividing resources for each and every second just ensures that there will be a lot of idle resources being wasted.

Storage traffic, especially for active document archives, is very bursty. Sharing raw resources is very cost effective.

Creating a storage cluster to provide service to a single tenant or a flat namespace certainly works. But providing a shared storage cluster enables economies of scale in providing persistent storage services. It no more has resources assigned to it for a specific tenant than a Bank has your cash pre-identified for you to withdraw. Serving multiple users from a single set of resources is efficient.

Limitations on Sharing

The economies of a shared storage cluster would be shunned if they came at the cost of exposing one tenant's assets to any other tenant. Sharing costs is great, sharing data is not.

Presumably the backend storage cluster would also be capable of throttling requests in a tenant-specific way, so that no single tenant could monopolize the cluster's resources, without statically dividing physical resources into isolated pools that are incapable of lending idle resources to meet usage bursts.

All requests for the backend storage network Pods are tied to an authenticated tenant. So the backend pod can apportion its provisioned storage and network resources according to its own policies so that it can offer QoS guarantees to its Tenants.

Typically the Tenant Access networks would be best effort, but bandwidth guarantees can be provisioned on the L2 network and/or in firewall rules.

For each tenant the scope of available storage resources would be scoped by the Tenant ID. If access is through Tenant X's network then only Tenant X assets may be accessed. Tenant-specific ACLs may be additionally applied after that.
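
A sketch of that two-layer check, with a hypothetical ACL shape (user mapped to allowed verbs); tenant isolation is applied before any tenant-supplied ACL is even consulted.

```python
def authorize(request_tenant, asset_tenant, user, verb, acl):
    """Two layers: strict tenant isolation first, then the tenant-supplied
    ACLs. `acl` is a hypothetical mapping of user -> allowed verbs."""
    if request_tenant != asset_tenant:
        return False                       # assets of other tenants are invisible
    return verb in acl.get(user, set())    # tenant-specific ACL check

acl = {"alice": {"read", "write"}}
assert authorize("tenantX", "tenantX", "alice", "read", acl)
assert not authorize("tenantY", "tenantX", "alice", "read", acl)   # isolated by tenant
```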

This layering of access control enables use of open-source software that has not been designed for multi-tenant access. Open-source file system daemons were largely designed when NAS was implemented over corporate intranets, not shared data center networks.

Problems with the Flat Namespace

Merely publishing a Storage Service to a flat namespace creates issues for multi-tenant access.

  • In a flat namespace, any advertisement of available mount points would be available to all clients.
  • The Storage Service would have to reject logins from users that were not part of the Tenant-specific network authorized to access the storage server.
  • Denial of Service attacks would hit tenant-independent resources, threatening to bring down service for all tenants. With Tenant-specific networks the Storage Service would be able to confine the impact of a DoS attack to that specific network.

The Ganesha NFS daemon supports a pluggable File System Abstraction Layer (FSAL) that makes it very convenient for providing NFS access over any storage service that resembles a file system, including object storage. Multi-tenant support with Ganesha requires creating Ganesha instances for each tenant. Similar implementation strategies are also employed with iSCSI targets.

Multiple Storage Backends

This proposal also supports provisioning multiple different backend storage clusters, and even assigning generic "storage server" nodes to specific storage clusters. Tenants could then choose which storage backend their clients would be granted access to. While provisioning and configuring storage clusters typically varies between vendors it is very common for different storage clusters to support the same data plane APIs to access storage (NFS, CIFS, S3, Swift, iSCSI, etc.)

When a storage cluster requires a backend storage network it is typically relying on some form of custom congestion control other than TCP/IP. Examples include FCoE's or RoCEE's reliance on no-drop Ethernet, FibreChannel, or InfiniBand. NexentaEdge uses UDP but relies on traffic isolation provided by lossless Ethernet or some other equivalent L2 solution. Network isolation not only enables these strategies, it protects non-storage traffic from being drowned out by storage traffic.

Kubernetes network policy defaults to non-isolated pods. A Network Policy is required to impose access restrictions, and those restrictions may be enforced by accepting/rejecting TCP connections rather than by L2 traffic isolation.

The best L2 traffic engineering, based on the Datacenter Bridging 802.1 protocols, is only supported by switch-specific solutions such as those from Big Switch Networks.

This proposal is compatible with scheduling multiple different storage backends in the same cluster, however the backend network resources are allocated.

While this allows "dynamically" allocating a storage cluster, it must be remembered that each new storage cluster would format the local storage volumes allocated to it. There is no cluster-independent definition of persistent storage. This is not a solution for dynamically creating storage clusters on demand, but rather for allowing long-term reallocation of resources.

Many storage clusters have similar demands for a "storage server", and the ability to periodically rebalance the allocation of storage servers between different storage clusters is certainly desirable. Even if storage clusters are completely static, the convenience of using Kubernetes to provision them and to keep Container code images up to date would be valuable.

Special Scheduling Requirements for the Backend Storage Service

The backend storage cluster must be provisioned before any tenant access networks are added or removed. The backend storage network frequently has specialized requirements, and specific characteristics may be desired for local storage. Kubernetes has sufficient options to customize host selection to enable provisioning of even very picky storage clusters. For certain backend storage clusters the 'scheduling' may end up specifying the exact set of hosts to be used, but it is possible.

After this backend network is provisioned we need to tag the selected hosts so that Tenant Access Pods can be provisioned to hosts that already have the backend storage service Pod running. Ideally, only nodes requiring optimized access to the backend storage network should be scheduled directly on these gateway hosts.

An example of an end result is depicted below, with two different Tenant Access Pods communicating with a single Backend Storage Service Pod.

graph TD;
  A(Tenant A Access Pod)-->FrontNIC(Frontend NIC);
  A-->LocalHostIPC(Localhost IPC);
  B(Tenant B Access Pod)-->FrontNIC;
  B-->LocalHostIPC;
  S(Backend Storage Service Pod)-->FrontNIC;
  S-->LocalHostIPC;
  S-->BackendNIC(Backend NIC);

There would be a minimum of one Pod for each tenant, although tenants could have multiple Pods to accommodate varying scheduling requirements. Separation of services into different address spaces is frequently desirable, but it is not a requirement for providing multi-tenant service. What is required is that each tenant Pod has tenant-quarantined access to the Frontend NIC.

A BackendStorageServicePod (or pods) has previously been provisioned with access to the BackendNIC. For many storage clusters this will be a network with specific quality of service guarantees and specialized resource allocation needs. Some may require lossless Ethernet service, Fibre Channel, Infiniband or merely some guaranteed bandwidth.

The Tenant Access Pods communicate with the Backend Storage Service POD via localhost IPC. This may be used for all communication, or merely to set up shared memory message queuing.

The Backend Storage Service Pod could authenticate each Tenant Access Pod, but would find life simpler if the scheduling policy simply enforced that Tenant access Pods would be scheduled on this host. The Pod could certainly have hyper-converged Containers.

When a new Tenant Pod was added it would use the LocalHostIPC to register itself with the Backend Storage Service Pod, and establish itself as being associated with a specific Tenant.

If the performance demands for a given Tenant Access Pod were high enough the Access Pod and Backend Pod would use initial IPC messages to set up a higher performance channel such as a message queue through shared memory. Shared memory should be allocated by and owned by the Tenant Access pod since it should be deallocated when the Tenant Access Pod is deactivated.
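
As a sketch of the registration step, assuming a Unix domain socket as the localhost IPC channel and an invented message format (the real protocol is not specified here):

```python
import json
import socket

SOCKET_PATH = "/var/run/backend-storage.sock"    # hypothetical IPC endpoint

def register_tenant_pod(tenant_id, pod_name, credentials):
    """Sketch: the Tenant Access Pod introduces itself to the Backend
    Storage Service Pod over localhost IPC and may then be told how to
    talk to it from now on (e.g. a shared-memory queue)."""
    msg = {"op": "register", "tenant": tenant_id,
           "pod": pod_name, "credentials": credentials}
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCKET_PATH)
        s.sendall(json.dumps(msg).encode() + b"\n")
        reply = json.loads(s.makefile().readline())
    # The backend's answer names the channel to use for data-path traffic.
    return reply.get("channel", "ipc")   # e.g. "ipc" or "shared-memory:..."
```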

When a Tenant C is added, those pods would register with the Backend Storage Service Pod. It would then add the interfaces to the Tenant C Pods to the list of interfaces it was polling. Load-balancing and prioritizing among Tenants A, B and C would be left to the discretion of the backend pod.

What is important is that the backend pod be able to determine which Tenant is behind each request, and that only approved Tenant Access Pods can try to access it.

Summary

The steps required for dynamic multiple tenant support are:

  • Schedule the Storage Cluster on N machines. Mark those that are eligible to act as gateway/proxy machines as being access hosts for this specific Storage Service.
  • To add a Tenant Access Network:
    • Schedule Tenant Access Pods on hosts marked as providing Storage Service X.
    • Configure these Pods to access a Virtual Access Network which includes external clients and required Tenant-specific support services (such as AD/LDAP).
    • When launched, the Tenant Access Pod will register with the Backend Storage Service Pod, authenticating itself as working for the specific tenant.
    • The Backend Storage Service Pod will now provide service via IPC, or a communication path set up using IPC, to access Tenant-specific storage services. For example, each Tenant accessing NFS services would only have access to mount points defined by the tenant.
  • To drop a Tenant Access service:
    • Logoff the Backend Storage Service Pod using the IPC channel.
    • Shutdown the Tenant Access Pod.
  • When there are no Tenant Access Pods a Backend Storage Service may be terminated.

Membership in either the Tenant Access Networks or the Backend Storage Network may be changed at any time using normal Kubernetes procedures. This does not change the relationships between the networks.

Beyond the Virtual Disk

April 12, 2018

Caitlin Bestler

Kubernetes defines numerous options for Persistent Volumes and Persistent Volume Claims. The problem is that they are too numerous, or perhaps more to the point too diverse.

A Persistent Volume can be anything from a set of storage held by a backend SAN to locally mounted partitions to mount points for network file systems such as NFS.

Kubernetes does not easily supply a Pod with a way of knowing exactly what storage services will be supplied until a Persistent Volume Claim's request has been matched.

The Virtual Disk as an interface is very primitive, so offering more options is good. But this is of limited benefit if there is no unifying concept on what is being provided. A Kubernetes PV (Persistent Volume) is some kind of storage resource that can be assigned to a Pod via a PVC (Persistent Volume Claim). But what is being claimed is not well defined other than by iteration of many options.

By contrast, Rook IO (https://rook.io) defines three different storage services (block, object and file), and the definitions of what a user of those services can expect is clear.

Rook IO uses Kubernetes, but provides storage services using CEPH. The project is currently working on supporting other similar storage backends, including NexentaEdge.

NexentaEdge and Ceph are both examples of storage services that are very different from what is offered to Virtual Machines by VMware and most legacy providers.

By default Kubernetes views storage as being just one more replaceable resource. You need W-cores with X GHz utilizing Y GB of RAM and Z TB of storage. If the resources backing any of those assignments fails the Pod is simply relaunched at a new location where that set of resources is available. With Persistent Volumes the contents stored on the storage resources can even be preserved when the Pod is relaunched.

CEPH, NexentaEdge and other storage clusters handle failures of storage target servers or drives very differently. They do not rely on Kubernetes to allocate a replacement, with optional replication of content. Rather they have already replicated or erasure encoded the stored content so that the content can be preserved even after the loss of a server or drive.

This enables the storage cluster to respond to the loss of a resource more quickly than a general purpose Kubernetes algorithm could. Further a storage cluster can have more fine grained responses here. For example, replicating or erasure encoding can be of object versions rather than entire disk drives. When providing a file or object storage service replication/repair resources could be limited to actual content rather than the logical capacity of a drive. Unreferenced sectors do not need to be preserved.

Similarly, a storage server may choose to create pre-bonded active-passive pairs of Pods that can fail over the responsibility of providing a service more quickly than Kubernetes can restart a Pod with persistent storage to replace a failed pod.

Expedited fail-overs for both front-end daemon and backend storage target may or may not be needed for any given application. Cluster managed failover of storage targets or storage devices can offer faster recovery and greater resource utilization. Self-pairing of Active/Passive frontends merely provides faster fault recovery.

In the next blog I'll explain how this type of storage cluster can be scheduled using Kubernetes using custom schedulers and the CNI and CSI plug-ins. Further, I'll explain how to enable multiple tenants to share a single backend storage cluster without any risk of leaking content across tenant lines.

Immutable Metadata Not Enough

March 30, 2018

Caitlin Bestler

In prior blogs I've explained how NexentaEdge has immutable, self-validating, location-independent metadata referencing self-validating, location-independent payload. The same can be said about IPFS, the Interplanetary File System (https://ipfs.io). While the two storage solutions' handling of payload chunks is very similar, their approaches to how objects are named and found are almost as different as possible.

Payload Chunks

The end result of putting a chunk to IPFS is that it is identified and validated with a cryptographic hash, and that the cryptographic hash can be used to find the chunk for retrieval.

This is very similar to NexentaEdge, but there are differences:

  • IPFS accepts the chunk and then generates its cryptographic hash. A NexentaEdge client directly interfacing to NexentaEdge cryptographically hashes the chunk before requesting that it be put. This avoids transmission of duplicate chunks (see the sketch after this list).
  • IPFS routing is a consistent hashing solution. NexentaEdge hashes to a Target Group and then does rapid negotiations within the group to find and dynamically place new chunks on the least burdened targets.
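
A sketch of the first difference above, with SHA-256 standing in for the actual content hash and `have_chunk`/`send_chunk` as hypothetical cluster calls: hashing before the put lets the cluster report a duplicate without the payload ever crossing the wire.

```python
import hashlib

def put_chunk(chunk, have_chunk, send_chunk):
    """Hash first, ship second. `have_chunk` asks the target group whether
    the content hash is already stored; `send_chunk` transfers the payload
    only when it is new."""
    chid = hashlib.sha256(chunk).hexdigest()   # content hash identifies the chunk
    if have_chunk(chid):
        return chid                            # duplicate: no payload on the wire
    send_chunk(chid, chunk)
    return chid

stored = {}
chid = put_chunk(b"payload", stored.__contains__, stored.__setitem__)
# The second put of the same payload finds the chunk already present.
assert put_chunk(b"payload", stored.__contains__, stored.__setitem__) == chid
```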

Different Metadata Philosophy

The IPFS naming system is still a work-in-progress, but all of their examples suggest a very different method for publishing content accessible by name.

They take the cryptographic hash of the atomic object and embed those references in other documents, which basically function as directories. Each of these directory objects is also immutable, referencing specific frozen-in-time content. The directory object itself has a cryptographic hash, which can be referenced in higher layer directories. Finally a "root" directory is published which is then pointed to by a mutable name to directory object mapping.

From the examples given and the suggested implementations it is clear that this is not intended as a high transaction rate solution. This is something more akin to publishing the daily release of an open-source project. This new root is collected, authorized and published by a single authoritative user.

This is not that bad of an approach for creating a "permanent web", although it would not even seem applicable for sites such as cnn.com that publish continuously.

One of the primary objectives of NexentaEdge is to be a shared repository for versioned documents that can be accessed and updated by thousands of tenant approved users. Any tenant-approved user should be able to post a new object version, subject to tenant-specified ACLs, at any time without interference from other users. Any tenant-approved user should be able to fetch any version of any tenant object at any time without interference from other users beyond contention for bandwidth. Information about new object versions is propagated asynchronously, but rapidly, and with known and measured propagation delay.

A storage service, as opposed to a publishing service, needs to treat stored payload as opaque blobs. The storage service is not allowed to find references within the payload because it should embrace client driven end-to-end encryption. The storage service should presume that all payload is encrypted and never try to analyze it.[1]

So information that supports finding a stored object, by name or by other search criteria, must be stored as metadata separate from the payload. Metadata also serves the closely interlocked issue of how, and even whether, to retain content.

Immutable Version Metadata

By definition, most metadata about a specific version of an object must be immutable. Certain metadata can be independent of the version contents, such as metadata controlling retention of the object version. We can meaningfully talk about changing the erasure encoding algorithm used to store a specific document version, but if we are changing the Author of the document we are creating a new version.

In particular, whether or not a given version is the current version of the object is obviously subject to change without changing the version itself. One of the strong points for IPFS is that it does not change the storage for a directory object when the mutable naming reference is changed to point at a new version. This is far preferable to the practice of creating an explicitly versioned name for non-current versions, such as used by Swift object storage.

However, there are many features of the Metadata system required for versioned document storage that IPFS simply does not address:

  • Single Step searches.
  • Directory/Folder searches with single edit Hierarchical Path Edits.
  • New Metadata must be propagated quickly.
  • Predictable search times building upon short RTOs.
  • Tenant control over access-to and modification-of tenant metadata.
  • Metadata driven retention of metadata and referenced Payload.

Single Step searches

IPFS describes a multi-step process to resolve a pathname:

  • The root of the path name is looked up in the mutable naming system (IPNS). That leads to a directory object encoding references.
  • Each layer of the "/" delimited name is then iterated. For "/A/B/C/D", "B" is looked up in the "/A" directory, "C" in the resulting directory, etc.
  • Finally the reference object is retrieved.

This is common for "distributed" storage systems which have effectively just ported the Unix inode to the cloud. Iterative descent is a great theory and very general, but it has not been a performant solution for some time. Single-node storage servers work around this by caching the top level directories. Web-servers have been caching mappings of fully qualified URLs to files for some time as well. But iterative descent results in terrible performance when you have to jump to different storage servers for each step of the iteration. Once you have distributed storage it is very unlikely that the servers handling "/A" will be the same as the servers handling "/A/B". The same applies for "/A/B/C". Even if the entries are cached everywhere, the process requires too many network round trips. If the object name is "/A/B/C/D" the metadata system has to be able to look that up, within the context of the tenant, in a single-step search.

NexentaEdge can resolve a name using the TargetGroup search or a Namespace Manifest search. It involves many servers, but the search is conducted in parallel, not iteratively.

In both cases a single query[2] is sent either to the TargetGroup or to the Namespace Manifest shards. The addressed targets send their responses back to the Initiator.

The Initiator collects as many responses as are required to find the requested CHID to be retrieved.
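
A sketch of that parallel resolution, using a thread pool as a stand-in for the group messaging; one fan-out query replaces the per-component round trips of iterative descent. The `shards` callables are hypothetical stand-ins for the real per-target lookups.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def resolve_name(name, shards, enough):
    """Send one query to every Namespace Manifest shard (or every member
    of the Target Group) in parallel and stop once enough responses
    identify the Version Manifest CHID to retrieve."""
    answers = []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(shard, name) for shard in shards]
        for fut in as_completed(futures):
            result = fut.result()
            if result is not None:
                answers.append(result)
            if len(answers) >= enough:
                break
    return answers                      # the Initiator picks the CHID from these

# One round trip regardless of how many '/' components the name has.
shards = [lambda n, r="vm-chid-42": r, lambda n: None, lambda n, r="vm-chid-42": r]
print(resolve_name("/A/B/C/D", shards, enough=1))
```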

Directory/Folder searches With Single-edit Hierarchical Path Edits

Like most cloud-inode solutions, IPFS supports querying directories by iterating from the root Directory until the desired layer and simply reading the directory.

NexentaEdge sends a query to the Namespace Manifest shards requesting all records relevant to resolving a given path. This includes "rename" records which allow single edit updates to hierarchical path names.

Recursive descent allows renaming the path to all objects by simply renaming one directory in the path. "/A/B/" becomes "/A/B2/" simply by renaming the "B" entry within the "/A" directory to "B2". That is a lot more difficult with distributed directories in a storage cloud. If you support finding an object with its full path name then you are ultimately hashing based upon the fully qualified path name ("/A/B/C/D"). When you change "B" to "B2" you change the hash for all objects that are conceptually contained within "/A/B". Executing that synchronously, before completing the request, would be impossible. There could be billions of objects contained within a single directory.

NexentaEdge solves this by creating "rename" entries that record when "B" was renamed to "B2". In the worst case this may force the Initiator to issue a second search using the original folder name to guarantee that it had found all objects in "/A/B2". But the path edit from "/A/B" to "/A/B2" only requires creating a single entry in the Namespace Manifest.
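
A minimal sketch of that fallback, with hypothetical in-memory views of the Namespace Manifest: a single rename record covers the whole subtree, and the Initiator issues at most one extra search under the original prefix.

```python
def search(prefix, records, renames):
    """One rename entry covers the whole subtree: look up the new prefix,
    and if a rename record maps it back to an original prefix, issue one
    more search under that original name."""
    found = [r for r in records if r.startswith(prefix)]
    original = renames.get(prefix)        # e.g. "/A/B2/" was once "/A/B/"
    if original:
        found += [r for r in records if r.startswith(original)]
    return found

records = ["/A/B/C", "/A/B/D"]            # stored under the original names
renames = {"/A/B2/": "/A/B/"}             # the single edit for the rename
print(search("/A/B2/", records, renames)) # -> ['/A/B/C', '/A/B/D']
```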

New Metadata Is propagated

NexentaEdge has a two-track method for searching metadata. The search for Version Manifests can be conducted within the Negotiating Group (selected by the NHID) or by searching a sharded Namespace Manifest. The Negotiating Group search is limited to searching for an exact name, and will be limited to searching for the current version once the Namespace Manifest implementation is mature enough.

The Negotiating Group metadata is available as soon as it is put. The Namespace Manifest is updated by post-processing of transaction journals. Updates are sent to the Namespace Manifest shards. The source can be configured to be either the Initiators or the Targets that create new Version Manifests. These updates are batched. The granularity of batches is configurable. Further, the Namespace Manifest records the latest batch info from each source. This means that a query resolver knows the time as of which it knows all Version Manifests, and which Version Manifests might exist but not yet have been propagated.

IPFS, and other distributed inode solutions, either have to confirm update through the root inode (which would greatly slow down transaction speeds) or live with asynchronous upward posting of the inode tree (with no way to track when this is done). On a functioning network both solutions will propagate this data very quickly, but NexentaEdge can let the querier know when propagation has been delayed.

Predictable search times building upon short RTOs

NexentaEdge maintains metadata and Namespace Manifests so that the RTO to reach all required replicas/shards has a short maximum RTO. The time to resolve any query is directly determined by this RTO.

Other systems, including IPFS, do not guarantee that a name can be resolved within the current site. Therefore the query may be dependent on long-haul RTOs. This takes longer, and it takes longer before retry operations can begin after a failure. Combined this greatly increases the time that must be allowed to complete any query.

Tenant control over access to and modification of tenant Metadata

NexentaEdge enforces a two-layer access control strategy. The first layer imposes strict tenant isolation. All metadata belongs to a specific tenant, and is accessible only by users approved by that tenant's authentication server. The second tier is enforcement of ACL rules, where the specific rules are part of tenant-supplied metadata and the permissions/roles granted to the tenant-approved users.

IPFS creates a global, visible namespace. If security is desired it must be provided by user-controlled encryption.

Metadata driven retention of metadata and referenced Payload

IPFS control of data retention is a bolt-on. Pinning of IPFS files is done on a per target basis. Cluster-driven retention requires execution of a RAFT-derived consensus algorithm. Requiring cluster-wide consensus for a routine operation seems to be contrary to the goal of being a scale-out storage solution.

NexentaEdge Chunks are retained if they are referenced. There is a MapReduce algorithm to distribute back-reference requirements. Once this information has been distributed, each storage target is free to delete older chunks that have not been retained.

Version Manifests are retained when they are referenced in Snapshots or they are current.

Tenants are allowed to expunge their own Version Manifests. This enables them to expunge content from their account in order to comply with legal requirements to remove content. Tenants will be able to subscribe to receive notices if expunged chunks are re-added.

Metadata for Enterprise storage

NexentaEdge's metadata is not just immutable, self-validating and location-independent. It supports rapid metadata searches that are designed to meet the needs of a document/object storage system holding tenant-private objects. IPFS is inherently limited to publishing the permanent web, and will never be suitable as an active archive for versioned projects.


  1. It can try to compress the data to save storage resources, but obviously that will not work if the payload was in fact already encrypted. ↩

  2. As will be noted, having renamed directories in the queried path can require an additional query round. However, that is easily avoided by placing all content in 'permanent' directories and effectively moving an alias for the 'current' site to point at a specific permanent directory. ↩

Namespace Manifests

March 26, 2018

Caitlin Bestler

With efficient group messaging a group of storage targets can efficiently manage the collective responsibility for storing Chunks within the group while allowing metadata references to the stored chunks to omit the specific storage targets selected.

That can be extended to find old versions of the stored objects by having each Target track the list of versions stored for each object. But that increases the number of persistent write operations required for each new object version by one.

As covered in the prior blogs, each Version Manifest is immutable. That means that information about a Version Manifest is also immutable. If each Version Manifest is uniquely identified, then the records describing each Version Manifest are also uniquely identified. What NexentaEdge takes advantage of is that a vast distributed collection of immutable, unique records can be coalesced into fewer locations where they can be searched efficiently.

We call this master manifest that collects information about all Version Manifests a Namespace Manifest. Each Namespace Manifest deals with one slice of the cluster's namespace and may be sharded over multiple Target machines.

The sharded Namespace Manifest can be organized in a variety of ways to efficiently process richer queries, such as all objects contained within a given scope name, or all object versions with names ending in ".mp3" created in 2015 by a specific user.
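To make that concrete, here is a minimal sketch of such a query over Namespace Manifest records, assuming a simplified record layout (the field names and UVID string format are illustrative, not the actual NexentaEdge schema):

    from fnmatch import fnmatch

    # Each record describes one immutable Version Manifest.
    records = [
        {"name": "/acme/finance/music/track01.mp3", "uvid": "2015-03-02T10:00:00.0001+10.1.1.7", "user": "alice"},
        {"name": "/acme/finance/reports/q1.doc",    "uvid": "2016-01-10T09:30:00.0002+10.1.1.9", "user": "bob"},
    ]

    def query(records, name_mask, year=None, user=None):
        """Return records matching a wildcard name mask, optionally filtered by year and user."""
        hits = []
        for r in records:
            if not fnmatch(r["name"], name_mask):
                continue
            if year and not r["uvid"].startswith(str(year)):
                continue
            if user and r["user"] != user:
                continue
            hits.append(r)
        return hits

    # e.g. all ".mp3" versions created in 2015 by alice:
    mp3_2015 = query(records, "*.mp3", year=2015, user="alice")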

The open question with this asynchronous collection of records describing Version Manifests is not the data associated with any given Version Manifest (that is immutable), but rather which Version Manifests might exist that are not yet known to the collected record store.

That is addressed by including data from each Initiator about its cutoff date for new Version Manifests. When an Initiator forwards a batch of data about the Version Manifests it has created, it notes that it will no longer create new Version Manifests with a timestamp prior to some cutoff.

The collective master manifest therefore knows that it knows all Version Manifests dated earlier than these cutoff timestamps.
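A small sketch of that completeness rule, using hypothetical tick counts as the cutoff values (an assumption for illustration only):

    # Each Initiator reports the cutoff below which it will create no further
    # Version Manifests. The collected store is complete for any timestamp
    # earlier than the minimum of those cutoffs.
    initiator_cutoffs = {
        "initiator-a": 1000,
        "initiator-b": 1015,
        "initiator-c": 998,
    }

    complete_before = min(initiator_cutoffs.values())   # 998

    def snapshot_is_complete(snapshot_time):
        """A point-in-time query is final once every Initiator's cutoff has passed it."""
        return snapshot_time < complete_before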

Snapshots

The Namespace Manifest can answer a query as to what the current Version Manifest was for any set of objects at one point in time.[1] If there might be Version Manifests from that time that it does not yet know about, the resulting subset is not yet complete. The results of such a query can be saved as a version of a Snapshot object.

When it is complete it is a true point-in-time snapshot of a distributed cluster that never stalls any Initiator from creating new object versions because of network issues or the actions of any other initiator.

In photographic terms this is a true point-in-time snapshot; you just have to develop the film before you can make a print. The developing time is the lag required to collect the records.

Most "snapshots" of distributed storage are anything but "snapshots". They may require a cluster-wide "freeze" to take the snapshot.

Chandy and Lamport, in their 1985 paper[2], compare the problem of taking a snapshot of a distributed system to that of taking a photograph of a sky full of migrating birds. The image is too immense to be captured by a single photograph, and the photographers cannot demand the birds "freeze" to enable photo gathering.

Some "snapshot" solutions require the application layer to effectively gather this ultra-wide screen photo collage themselves by "snapshotting" only the versions known to a specific node at the time of the snapshot. You can snapshot a directory, but only if you already know the latest version of everything in that directory.

Other solutions merely support creating a clone of a specific object version and then calling that clone a "snapshot".

NexentaEdge provides a true distributed snapshot. The Chandy-Lamport algorithm requires end-to-end communication; ours does not require end-to-end communication to take the snapshot, merely to publish it. Periodic snapshots can be taken every hour on the hour, not every hour at the first second of that hour when the network was fully healthy.

Because all of the information about a Version Manifest is unique and immutable, a Snapshot can cache any portion of the information from the Version Manifest in the snapshot itself. While this makes the snapshot object larger, it can speed up access to the snapshotted objects. This allows distributed compute jobs to publish results as a snapshot, giving single-step access to the referenced chunks for clients who effectively "mount" the snapshot.

Not Block-Chain

The fact that our metadata is immutable and additive might cause some to think of it as being similar to blockchain algorithms. There is an important difference: we always allow any Initiator to create a new version of any object (constrained only by the limitation of one new version per Initiator per Object per tick). This means that the one-tick rule is the only bottleneck on the creation of new object versions. A blockchain requires each new ledger entry to be authenticated through the deliberately expensive "mining" process, which creates a major bottleneck on the recording of new ledger entries.

All Derived from Unique, Immutable Metadata

The benefits outlined here are all enabled by the definition of NexentaEdge metadata. The methods of collecting, indexing and publishing these derivatives will vary as NexentaEdge evolves as a product, but all of these solutions are enabled by the fact that the information about a Version Manifest can never become obsolete.


  1. This requires following certain rules on how you timestamp things, such as never allowing a clock to run backwards and starting with fairly well synchronized clocks. ↩

  2. K. Mani Chandy, Leslie Lamport: Distributed Snapshots: Determining Global States of a Distributed System. ACM Transactions on Computer Systems 3(1), February 1985. ↩

Location Independent References

March 22, 2018

Caitlin Bestler

In the prior blog on NexentaEdge we mentioned that Chunks were unique and immutable and that Chunk References merely identify how a Chunk is used to rebuild an object, but do not specify the locations where the chunk is stored.

This time we will expand on how Location Independent References work.

The Version Manifest specifies a specific version of an object. It specifies the metadata for the version, including a few mandatory fields, and a series of Chunk References which reference the payload chunks.

A typical Chunk Reference contains:

  • The CHID of the referenced chunk.
  • The Logical Offset of the Chunk in the object version.
  • The Logical Length of the decompressed payload.

What it does not specify is any location where the replicas are held. This means that the content can be migrated, for maintenance or load-balancing purposes, without updating the Version Manifest.
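Pictured as a record, a location-free Chunk Reference might look like the sketch below; the field names and sizes are illustrative assumptions, not the on-disk format:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ChunkReference:
        """Illustrative location-free reference: identity and placement within the object, no target addresses."""
        chid: bytes            # cryptographic hash of the stored chunk payload
        logical_offset: int    # where the chunk's decompressed payload starts in the object
        logical_length: int    # decompressed length of the chunk

    # Migrating the chunk between targets never touches this record.
    ref = ChunkReference(chid=bytes.fromhex("ab" * 32), logical_offset=0, logical_length=131072)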

Actually, lots of systems have location-free Chunk References. What is different about NexentaEdge is that a location-free Chunk Reference resolves to a dynamic set of locations that can change without the addition or removal of any storage target.

This is done by hashing the relevant cryptographic hash (content or name) to a Negotiating Group rather than to a set of target machines. Storing and retrieving chunks is negotiated within the group.[1]
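As a rough illustration (the group count and the modulo reduction are assumptions, not the actual NexentaEdge mapping), hashing to a Negotiating Group can be as simple as:

    import hashlib

    NUM_NEGOTIATING_GROUPS = 1024   # assumed cluster configuration value

    def negotiating_group(hash_id: bytes) -> int:
        """Map a CHID or NHID to a Negotiating Group index via a stable reduction of the hash."""
        return int.from_bytes(hash_id, "big") % NUM_NEGOTIATING_GROUPS

    chid = hashlib.sha256(b"chunk payload").digest()
    group = negotiating_group(chid)   # every replica negotiation for this chunk goes to this group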

We'll explain the four most critical transactions that implement this strategy:

  • Getting a Payload Chunk
  • Putting a Payload Chunk
  • Getting a Version Manifest
  • Putting a Version Manifest

Get Chunk with CHID

sequenceDiagram
    Initiator->>TargetGroup: Get Chunk with CHID=X
    TargetGroup->>Initiator: Have Chunk Can Deliver at T | Not here
    Note left of TargetGroup: Response is from each Target in TargetGroup
    Note over Initiator: Select best offer
    Initiator->>TargetGroup: Select Target to Supply Chunk
    Note over TargetGroup: Wait till specified time
    TargetGroup->>Initiator: Requested Chunk
    Note left of TargetGroup: From the selected target
    Note over Initiator: Initiator validates received chunk, retries on error.

Payload chunks are found by sending a find request identifying the CHID (Content Hash IDentifier) of the desired chunk to every member of the TargetGroup. This target group is selected by hashing the CHID.

Each receiving Target responds to the Initiator with either an indication that it has Chunk X and could deliver it at time Y, or that it does not have Chunk X.

Once sufficient replies have been received to make a selection the Initiator sends a message to the TargetGroup specifying the selection it has made. This is sent to the same group so that nodes not selected can cancel tentative resource reservations.

Lastly the selected storage target delivers the requested chunk at the specified time. Because this was negotiated, a network with a non-blocking core can transmit the chunks at the full rate provisioned for payload transfers.
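A minimal sketch of the Initiator's selection step, assuming each response is reduced to a (target, earliest delivery time) pair (the data shapes are illustrative only):

    # Offers are (target_id, earliest_delivery_time) or (target_id, None) for "not here".
    def select_get_offer(offers):
        have = [(t, when) for (t, when) in offers if when is not None]
        if not have:
            return None                     # no replica in the group; caller retries or fails
        return min(have, key=lambda o: o[1])   # pick the target that can deliver soonest

    offers = [("target-3", 120), ("target-7", None), ("target-9", 95)]
    best = select_get_offer(offers)         # ("target-9", 95)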

Put Chunk With CHID

sequenceDiagram
    Initiator->>TargetGroup: Put Chunk with CHID=X
    TargetGroup->>Initiator: Could Accept at Time I-J | Already Stored
    Note left of TargetGroup: Response is from each Target in TargetGroup
    Note over Initiator: Select best set of Targets
    Initiator->>TargetGroup: Select Targets to Receive Chunk at Time T
    Note over Initiator: Wait till specified time
    Initiator->>TargetGroup: Chunk
    TargetGroup->>Initiator: Receipt Ack
    Note left of TargetGroup: Optional Receipt Ack from each receiving Target
    TargetGroup->>Initiator: Chunk Saved Ack
    Note left of TargetGroup: Chunk Saved Ack from each receiving Target
    Note over Initiator: Initiator Retries unless sufficient replicas were confirmed

Of course before we can get Chunk X from somewhere within a TargetGroup we have to put it to that group.

Each member of the group identifies when it could accept the transfer. The Initiator picks the best set of targets with an overlapping delivery window to receive the required number of replicas.

The number of replicas to be written can be reduced when some replicas already exist, and the response can even complete the transaction immediately if there are already sufficient replicas.
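A sketch of choosing a target set with an overlapping accept window, reduced by replicas that already exist (purely illustrative; the bid format and search strategy are assumptions):

    from itertools import combinations

    def choose_put_targets(bids, replicas_needed, already_stored):
        """bids: {target_id: (window_start, window_end)}; returns (targets, delivery_time) or None."""
        needed = replicas_needed - already_stored
        if needed <= 0:
            return [], None                          # transaction completes immediately
        for combo in combinations(bids.items(), needed):
            start = max(w[0] for _, w in combo)      # latest window start in the candidate set
            end = min(w[1] for _, w in combo)        # earliest window end in the candidate set
            if start <= end:
                return [t for t, _ in combo], start  # overlapping window found
        return None                                  # no overlap; renegotiate

    chosen = choose_put_targets({"t1": (100, 200), "t2": (150, 260), "t3": (240, 300)}, 2, 0)
    # -> (["t1", "t2"], 150)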

There is also a nearly identical Replicate Chunk transaction that tests whether there are sufficient replicas of an already existing Chunk and puts the missing replicas if there are not.

Get Version Manifest With NHID

sequenceDiagram
    Initiator->>TargetGroup: Get Version Manifest with NHID=X
    TargetGroup->>Initiator: Have Version Manifest with UVID Y Can Deliver at T | Not here
    Note left of TargetGroup: Response is from each Target in TargetGroup
    Note over Initiator: Select best offer
    Initiator->>TargetGroup: Select Target to Supply Version Manifest
    Note over TargetGroup: Wait till specified time
    TargetGroup->>Initiator: Requested Version Manifest
    Note left of TargetGroup: From the selected target
    Note over Initiator: Initiator validates received Version Manifest, retries on error.
    Note over Initiator: Typically then fetch the referenced chunks.

Of course, a storage system that only let you retrieve previously stored content if you remembered a 256- or 512-bit arbitrary identifier wouldn't be very useful. We need to put and get named objects, and typically we want the current version of a named object.

Each object version is described by a Version Manifest. Version Manifests are also Chunks, but they are assigned to TargetGroups based upon their fully qualified object name (it is fully qualified because what the tenant perceives as the "fully qualified" name is prefixed by the Tenant name).

Current Version Manifests are found by sending a named find request identifying the NHID (Name Hash IDentifier) of the desired Version Manifest. This is sent to the TargetGroup hashed from the NHID; the group is derived from the NHID rather than the CHID. The default request seeks the most current version stored by each target in the group.

Each receiving Target responds saying it could deliver a Version Manifest with NHID X and UVID Y (the unique version identifier, which includes the version's timestamp and is made unique by adding the original Initiator's IP address as a tie-breaker). Each offer is the most current version known to that Target.

Once sufficient replies have been collected, the Initiator selects the Version Manifest it wants and sends a message to the TargetGroup specifying which Target should supply the Version Manifest and at what time. Again, this allows the non-selected targets to release tentative resource claims.

Lastly the selected storage target delivers the selected Version Manifest to the Initiator at the negotiated time at the configured full rate.
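The selection step can be pictured as ordering the offered UVIDs; below, a UVID is assumed to reduce to a (timestamp tick, initiator IP) pair for comparison, which is an illustration of the ordering described above rather than the actual encoding:

    # offers: list of (target_id, (timestamp_tick, initiator_ip))
    def most_current(offers):
        """Newest timestamp wins; the originating Initiator's IP breaks ties."""
        return max(offers, key=lambda o: o[1])

    offers = [("t2", (1000, "10.1.1.9")), ("t5", (1003, "10.1.1.7")), ("t8", (1003, "10.1.1.9"))]
    winner = most_current(offers)        # ("t8", (1003, "10.1.1.9"))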

Put Version Manifest

sequenceDiagram
    Initiator->>TargetGroup: Put Version Manifest with NHID=X
    TargetGroup->>Initiator: Could Accept Delivery at Times I - J
    Note left of TargetGroup: Response is from each Target in TargetGroup
    Note over Initiator: Select best set of Targets
    Initiator->>TargetGroup: Select Target Set to store Version Manifest at time T
    Note over Initiator: Wait till specified time
    Initiator->>TargetGroup: Version Manifest
    Note left of TargetGroup: To each Target previously selected
    TargetGroup->>Initiator: Receipt Ack
    Note left of TargetGroup: Optional Receipt Ack from each receiving Target
    TargetGroup->>Initiator: Chunk Saved Ack
    Note left of TargetGroup: Chunk Saved Ack from each receiving Target
    Note over Initiator: Initiator Retries unless sufficient replicas were confirmed

Putting a new Version Manifest is nearly identical to putting a Payload Chunk, except that the Put request is sent to the NHID-derived group (rather than CHID-derived) and that there will not be a pre-existing Version Manifest with the same UVID (Unique Version IDentifier).

Dealing With Old Versions and More

The early releases of NexentaEdge implemented Version searches by having each Target maintain a list of Version Manifests they stored for each Object they stored.

We have a new approach that uses a two track system:

  • The Targets only track the current version. This is the most common version requested, and we save one persistent storage write for each new object version by only tracking the current version.
  • A "Namespace Manifest" which is a distributed object that uses MapReduce techn\niques to collect and query a distributed key-value store of all Version Manifests logged by any target in the cluster.

Namespace Manifests enable queries on any directory, or even any wildcard mask. Other object stores use some equivalent of Swift's ContainerDB to enumerate all versions within a single container; the Namespace Manifest allows queries on any directory, not just the root directories. This design also allows the Namespace Manifest to be updated asynchronously, but reliably.

We'll cover the Namespace Manifest next time, and then how the Namespace Manifest enables true point-in-time snapshots even in a cluster with no cluster-wide synchronization.


  1. This is done with multicast groups confined to the backend network by default, or by iterative unicasting otherwise. ↩

Consensus, Who Needs It?

March 20, 2018

Caitlin Bestler

The conventional tradeoff for distributed storage clusters is between transactional consistency and eventual consistency. Eventual consistency is usually viewed as the cheaper solution, in both what it delivers and what it costs the system. The critical cost of transactional consistency is the need to reach a consensus on the ordering of updates.

Eventual consistency is usually portrayed as simply tolerance for inconsistency on the presumption that momentary contradictions are acceptable as long as they go away eventually.

NexentaEdge takes a different approach. All stored chunks, whether metadata or payload, are unique, immutable and self-validated. References to these chunks do not include the locations where they are stored, but still enable those chunks to be efficiently retrieved.

This strategy allows NexentaEdge to provide guarantees beyond making things consistent "eventually":

  • No client will ever retrieve a version older than the most recent version it has put itself. The changes in any version will never be automatically erased.
  • A version will only be expunged according to policy and after a version that is a successor to it is published.
  • No network partition will prevent a client from putting a new object version. Indeed no client will ever prevent another client from putting a new object version.

Never blocking a new object version because of the actions of another client is a feature, not a bug or limitation. Transactional systems can only guarantee non-overlapping edits after reaching a consensus. The consensus may be on which updater has the exclusive right to update an object now (distributed locking) or on which of multiple conflicting updates can be committed (MVCC, or multi-version concurrency control). Effectively the cluster must be serialized either before the update is initiated or before it can be completed. Distributed locking is more efficient when conflicting edits are common; MVCC is more efficient for the far more common situation where conflicts are rare.

So eventual consistency is actually what end consumers want for versioned document archives. Refusing, or even delaying, the recording of a new version is not acceptable. What users would prefer is to reliably know when other versions are created, and to minimize the time during which a new update is not visible to someone wanting to further edit the same document.

What NexentaEdge offers is eventual consistency with the benefits of immutability and knowledge of what range of possible object versions could exist but which have not yet been propagated.

What lies behind this capability is simple: NexentaEdge has defined its metadata so that no consensus algorithm is needed. Other storage solutions may have clever consensus algorithms, but you cannot be more clever than no consensus algorithm at all.

Consensus is Expensive

The fundamental issue is that it is impossible to update the same information at multiple locations at exactly the same time. This has been expressed in many ways, including the CAP theorem.

Distributed storage systems that offer transactional consistency require a cluster-wide consensus before the put transaction creating a new object version can complete. This may be based upon a priori locks, or upon optimistic locking that detects conflicting edits and applies them before reapplying the attempted edit.

Either strategy requires end-to-end communication covering at least a relevant quorum of cluster members. Of course, a quorum-based consensus is dependent on agreement about how many votes are needed, which is why consensus algorithms always get complex. If a quorum consensus on either the lock or the specific edit cannot be achieved, then the requested operation cannot proceed or complete. Disallowing puts of new versions is the last thing that a storage cluster supporting versioned documents should do.

Unique Chunks Do Not Require Consensus

NexentaEdge defines object versions in metadata chunks called Version Manifests. These chunks include the fully qualified object name and a unique version identifier.

Chunks are located using a cryptographic hash identifier of the chunk. For Version Manifests this is the Name Hash Identifier (NHID). All Version Manifests for a given object are stored within storage servers addressed by a single multicast group derived from the NHID. Payload Chunks, by contrast, are located based on the Content Hash Identifier (CHID). Depending on options selected the cryptographic hashes may be 256 or 512 bits.

Further, because the Version Manifests include their unique identifier, their Content Hash Identifiers (CHIDs) are also unique.

NexentaEdge Always Accepts New Versions

All NexentaEdge chunks are unique. They either have unique payload identified by a 256- or 512-bit cryptographic hash, or they have a Version Manifest that includes a unique identifier of the version.

Two nodes can both put the same payload chunks without harm. Because the unique version identifier includes the IP address of the originating node, there is only a single source for any new version. This does impose the (hardly onerous) constraint that no single source can put two versions of the same object within a single system tick, currently 1/10,000th of a second; the round trips needed to negotiate and confirm putting a new chunk take longer than that anyway.
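A minimal sketch of how such a unique version identifier could be assembled locally, with no consultation of any other node (the tuple layout and the local tick counter are assumptions for illustration, not the actual UVID encoding):

    import time

    TICKS_PER_SECOND = 10_000            # a tick is 1/10,000th of a second, per the text

    _last_tick = 0

    def new_uvid(initiator_ip: str):
        """Build a unique version identifier: current tick plus the originating
        Initiator's IP as tie-breaker. The one-version-per-tick rule is enforced locally."""
        global _last_tick
        tick = int(time.time() * TICKS_PER_SECOND)
        if tick <= _last_tick:           # never reuse a tick; never let the clock appear to run backwards
            tick = _last_tick + 1
        _last_tick = tick
        return (tick, initiator_ip)

    uvid = new_uvid("10.1.1.7")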

Because each Version Manifest has a unique identifier, the source creating it does not need to consult any other node before creating it. The only entity qualified to have an opinion on whether it created a new version of a given object at a given timestamp is the source itself. Instant consensus.

Namespace Manifest and Snapshots

Because Version Manifests are unique, they can always be created. NexentaEdge collects and processes the transaction log entries noting each new Version Manifest to create a permanent registry of all Version Manifests that we call a Namespace Manifest. The Namespace Manifest can support complex metadata queries and makes it possible to take true point-in-time snapshots of a distributed storage cluster without any consensus-derived blocking.

We'll follow up on the Namespace Manifest and Snapshots in our next blog.
