John Bennett's blog

An overview of how RavenDB uses Etags

Saturday, December 15, 2012

Raven uses Etags for many purposes, including caching, concurrency control, indexing and replication.  In this post I’ll talk about Etags and give a quick rundown of where and how Raven makes use of them.

In the context of Raven, an Etag is just a big number (128 bits).  In Raven’s APIs, Etags are represented as Guids.  However, unlike a normal Guid, Raven Etags are sequential: they start at 00000000-0000-0000-0000-000000000000 and count up from there.

Etags are only meaningful within a single Raven database.  Each database has its own series of Etags starting at 0.  That means Etags are not universally unique like you expect Guids to be (for all practical purposes).  But within a single database on a single Raven server, there is a guarantee that an Etag is unique.

“Unique for what?” you might ask.  A database’s Etag is incremented on every single creation or modification of a document or attachment.

In that respect, a Raven Etag is similar to a timestamp (rowversion) value in SQL Server.  The value itself is meaningless; it is unique for every data modification in a single database; and it is larger than all previously generated values.
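To make the counter idea concrete, here is a minimal sketch in Python.  It is a toy model, not Raven’s implementation: the EtagCounter class is hypothetical, and I’m assuming the simplest possible mapping from the 128-bit number to a Guid string (Raven’s actual byte layout may differ).

```python
import uuid


class EtagCounter:
    """Toy per-database Etag generator: a 128-bit counter displayed as a Guid."""

    def __init__(self):
        self._value = 0  # every database starts its own series at zero

    def next(self) -> uuid.UUID:
        # Each document write takes the next value in the series.
        self._value += 1
        return uuid.UUID(int=self._value)


db_foo = EtagCounter()
db_bar = EtagCounter()

first = db_foo.next()
second = db_foo.next()
other_db_first = db_bar.next()

print(first)                    # 00000000-0000-0000-0000-000000000001
print(second > first)           # True: later writes get larger Etags
print(other_db_first == first)  # True: Etags are not unique across databases
```

The last line is the point of the paragraph above: two databases hand out the same values, so an Etag only identifies a modification within one database.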

You can see the latest Etag in a database in the Studio, by clicking the Statistics link in the bottom left corner or by going to http://localhost:8080/databases/foo/stats where localhost:8080 is the Raven server and foo is the database name.  The “LastDocEtag” value reflects the most recent document modification.  The “LastAttachmentEtag” reflects the most recent attachment modification.

Attachments in a database are similarly tracked with a separate series of Etags.

Caching

The most obvious use of Etags is as a standard HTTP ETag.  (I’m using “Etag” for Raven, since that’s how it appears in the API, and “ETag” for HTTP, since that’s the spelling in the spec.)

All communication with a Raven server happens over HTTP (the .NET client is a wrapper around the HTTP API).  Most applications will load a given document many times over the app lifetime.  The first time Load() is called, the HTTP response contains an ETag header.  Its value is the Raven Etag representing that exact version of the document.

The Raven client caches the response, including the Etag.  When another Load() occurs for the same document ID, the HTTP GET request includes an HTTP If-None-Match header with the Etag value.  The Raven server compares that Etag with the current one for the document.  If they are the same, the Raven server returns an HTTP 304 Not Modified response.  That saves bandwidth by not sending the entire contents of the document back across the wire when the client already has the latest version. [Added based on comments: There is a highly optimized code path on the server for this case.]
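The exchange can be sketched as a toy model in Python.  Nothing here is Raven code: handle_get, CachingClient, SERVER_DOCS and STATUS_LOG are hypothetical names, the document and Etag values are made up for illustration, and the “server” is just a dictionary.  It only shows the If-None-Match / 304 logic described above.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical server-side state: document ID -> (Etag, body).
SERVER_DOCS: Dict[str, Tuple[str, str]] = {
    "users/1": ("00000000-0000-0000-0000-000000000007", '{"Name": "Ann"}'),
}
STATUS_LOG: List[int] = []  # status codes the toy server has returned


def handle_get(doc_id: str, if_none_match: Optional[str] = None):
    """Toy GET handler returning (status, etag, body)."""
    etag, body = SERVER_DOCS[doc_id]
    if if_none_match == etag:
        STATUS_LOG.append(304)
        return 304, etag, None  # client copy is current, so no body is sent
    STATUS_LOG.append(200)
    return 200, etag, body


class CachingClient:
    """Toy client that remembers the Etag and body of each loaded document."""

    def __init__(self):
        self._cache: Dict[str, Tuple[str, str]] = {}

    def load(self, doc_id: str) -> str:
        cached = self._cache.get(doc_id)
        known_etag = cached[0] if cached else None
        status, etag, body = handle_get(doc_id, known_etag)
        if status == 304:
            return cached[1]  # serve the cached body
        self._cache[doc_id] = (etag, body)
        return body


client = CachingClient()
first = client.load("users/1")   # full response, cached with its Etag
second = client.load("users/1")  # Etag matches, server answers 304

# The document changes on the server, so it gets a new Etag.
SERVER_DOCS["users/1"] = ("00000000-0000-0000-0000-000000000008", '{"Name": "Anne"}')
third = client.load("users/1")   # Etag differs, full response again

print(STATUS_LOG)  # [200, 304, 200]
```

Note that the 304 case still costs a round-trip; it only avoids sending the document body.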

(Aggressive caching can be used to prevent even that round-trip, but that’s a topic for another day.)

Optimistic concurrency control

By default in Raven, when you Load() and then Store() a document in a session, calling SaveChanges() will always result in the document being saved.  That’s great if only one process may ever be saving the document at a given time.

As soon as you have multiple instances of your application, or multiple users modifying the same document, that isn’t good enough.  Well, if you’re okay with “last person/process to save wins” it’s good enough.  Wouldn’t life be easy if that were the rule and not the exception?

Raven provides optimistic concurrency control simply by enabling a single setting:

session.Advanced.UseOptimisticConcurrency = true;

This is so often the correct thing to do that I set it in the infrastructure code that injects session objects into controllers and other types.

Optimistic concurrency works by taking advantage of Etags.  As mentioned under Caching above, when you call Load() the Raven client caches the document and its Etag.  With optimistic concurrency enabled, saving a document sends not just the new document data, but also the previously retrieved Etag value.  The Raven server checks that Etag against the document’s current Etag.  If they aren’t exactly the same, a ConcurrencyException is thrown.
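The server-side check amounts to a compare-and-set.  Here is a minimal sketch in Python, assuming integer Etags for brevity.  ToyStore and its methods are hypothetical, and the ConcurrencyException class is a stand-in for the one Raven throws, not the real type.

```python
class ConcurrencyException(Exception):
    """Stand-in for the exception raised when the Etags do not match."""


class ToyStore:
    """Toy document store with an Etag check on write."""

    def __init__(self):
        self._last_etag = 0
        self._docs = {}  # document ID -> (etag, body)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, body, expected_etag=None):
        # With no expected Etag the write always succeeds (last write wins).
        if expected_etag is not None:
            current = self._docs.get(doc_id)
            current_etag = current[0] if current else None
            if current_etag != expected_etag:
                raise ConcurrencyException(
                    f"{doc_id}: expected Etag {expected_etag}, found {current_etag}"
                )
        self._last_etag += 1
        self._docs[doc_id] = (self._last_etag, body)
        return self._last_etag


store = ToyStore()
store.put("orders/1", "new")

# Two sessions load the same version of the document.
etag_a, _ = store.get("orders/1")
etag_b, _ = store.get("orders/1")

store.put("orders/1", "shipped", expected_etag=etag_a)  # succeeds, Etag moves on

try:
    store.put("orders/1", "cancelled", expected_etag=etag_b)
    outcome = "saved"
except ConcurrencyException:
    outcome = "conflict"  # session B was working from a stale version

print(outcome)  # conflict
```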

That is often all you need.  In cases where multiple human users might be editing the same document, you might need to do more work.  Optimistic concurrency on its own only prevents conflicts that occur during the milliseconds from when a doc is loaded to when it is changed, SaveChanges() is called, and the PUT reaches the server.

It would be easy for two people to load the same document into an editing UI, make conflicting changes, and save.  As long as the saves don’t arrive at your application within milliseconds of one another, the last person to hit Save will overwrite the earlier changes.  (The second request will load the document with the new Etag created from the first update.)

For this reason, you may need to track the originally loaded Etag in your editing UI when you render it, send it back to your app when the user submits changes, and pass it along to Raven when you call Store() on the document.  This tells the Raven server: “only save if the existing Etag matches this one I’m giving you; otherwise throw a ConcurrencyException”.

I’ll cover that process in more detail in a later post. For now, the way to retrieve the Etag after loading a document is:

var etag = session.Advanced.GetEtagFor(entity);

Store() has an overload that accepts the Etag you want the Raven server to use when doing its concurrency check:

session.Store(entity, etag);
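To show why carrying the Etag through the editing UI matters, here is a toy simulation in Python of two users editing the same document.  It is not Raven code: db, save, submit_after_reload and submit_with_form_etag are hypothetical names, Etags are plain integers, and ConcurrencyException is a stand-in class.

```python
class ConcurrencyException(Exception):
    """Stand-in for the exception raised on an Etag mismatch."""


# Toy database state; Etags are plain integers here.
db = {"last_etag": 1, "docs": {"posts/1": (1, "original text")}}


def save(doc_id, body, expected_etag):
    """Write only if the stored Etag still equals expected_etag."""
    current_etag, _ = db["docs"][doc_id]
    if current_etag != expected_etag:
        raise ConcurrencyException(doc_id)
    db["last_etag"] += 1
    db["docs"][doc_id] = (db["last_etag"], body)


def submit_after_reload(doc_id, body):
    # The request reloads the document, so the Etag it checks is always fresh.
    etag, _ = db["docs"][doc_id]
    save(doc_id, body, etag)


def submit_with_form_etag(doc_id, body, form_etag):
    # The Etag seen when the form was rendered travels with the submission.
    save(doc_id, body, form_etag)


# Case 1: both users submit through reload-then-save. Nothing fails,
# and Alice's edit is silently overwritten.
submit_after_reload("posts/1", "Alice's edit")
submit_after_reload("posts/1", "Bob's edit")
after_case_1 = db["docs"]["posts/1"][1]

# Case 2: both users open the form on the same version, and the form
# carries that Etag back on submit.
alice_form_etag = bob_form_etag = db["docs"]["posts/1"][0]
submit_with_form_etag("posts/1", "Alice's second edit", alice_form_etag)
try:
    submit_with_form_etag("posts/1", "Bob's second edit", bob_form_etag)
    bob_result = "saved"
except ConcurrencyException:
    bob_result = "conflict"

print(after_case_1)           # Bob's edit
print(bob_result)             # conflict
print(db["docs"]["posts/1"])  # (4, "Alice's second edit")
```

In the second case Bob’s application gets the chance to show him Alice’s changes instead of quietly discarding them.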

Indexing

Raven uses Etags to track the indexing process.  Unlike in SQL Server and most other relational databases, Raven indexing is an asynchronous background process.  That is, a document is not indexed inside the transaction in which it is saved.

Documents are queued to be indexed in order of modification — which is also Etag order.  Documents are indexed in batches to improve performance.  When indexing of a batch succeeds, Raven records the Etag of the last document in the batch.  The next batch begins indexing by loading documents with Etags greater than the previously recorded value.  When a server is started, it can pick up indexing where it left off before it was stopped (or failed).
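The batching and resume behavior can be sketched in a few lines of Python.  This is a toy model under simplifying assumptions: Etags are integers, the “index” is a list, and index_next_batch is a hypothetical function, not anything in Raven.

```python
def index_next_batch(docs_by_etag, last_indexed_etag, batch_size, index):
    """Index the next batch of documents and return the new last indexed Etag.

    docs_by_etag is a list of (etag, doc) pairs sorted by Etag.
    """
    batch = [(etag, doc) for etag, doc in docs_by_etag if etag > last_indexed_etag]
    batch = batch[:batch_size]
    if not batch:
        return last_indexed_etag  # nothing new to index
    for _, doc in batch:
        index.append(doc)  # stand-in for the real indexing work
    # Only after the whole batch succeeds is the last Etag recorded.
    return batch[-1][0]


docs = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
index = []

last_indexed = 0
last_indexed = index_next_batch(docs, last_indexed, 2, index)  # indexes a, b
after_first_batch = last_indexed

# Pretend the server restarts here. All it needs to resume is last_indexed.
last_indexed = index_next_batch(docs, last_indexed, 2, index)  # indexes c, d
last_indexed = index_next_batch(docs, last_indexed, 2, index)  # indexes e

print(after_first_batch)  # 2
print(last_indexed)       # 5
print(index)              # ['a', 'b', 'c', 'd', 'e']
```

Because a modified document gets a new, larger Etag, it naturally falls into a later batch and is indexed again.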

Map reduce indexes work in two separate phases:  one for the map (the indexing step) and another for the reduce.  Raven tracks the latest Etag indexed separately from the latest Etag reduced.

You can see the data Raven uses for its internal index tracking in the same statistics mentioned above.  Each index has a LastIndexedEtag and LastReducedEtag value.

Replication

Raven manages the replication process using Etags, similar to how it manages the indexing process.  There are some additional interesting considerations, since multiple databases are involved.

I’ll go into (excruciating) detail on the replication process in future posts.  The short version is that it is a one-way, push-based process.  The source asks the destination “What document Etag did you last successfully get from me?”.  If the Etag in the response is not the latest Etag in the source database, the source starts POST-ing batches of documents to the destination, starting with the next highest Etag.

The interesting part is that Etags are per database.  Replication is between two independent databases.  Raven does not consider them two replicas of the same database; they are entirely separate.  That means each is separately incrementing its own document Etags.  There is really no way around that, since clients may send updates into either one of the databases, and there is no way to synchronize the Etags across them that is both reliable and performant.  (If you doubt that, see https://en.wikipedia.org/wiki/CAP_theorem.)

The destination database in a replication pair records the last source document Etag it successfully received, and this is what it reports back to the source when asked.  That Etag is totally unrelated to document Etags in the destination.  The source does not record what it has previously replicated to that destination.  If you delete and recreate the destination database, all of the source’s documents will re-replicate to it from scratch.
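Here is a toy model in Python of the handshake described above.  It is a sketch under heavy simplification, not Raven’s replication code: ToyDatabase, last_etag_from, receive_batch and replicate are hypothetical names, Etags are integers, and function calls stand in for the HTTP requests.

```python
class ToyDatabase:
    """Toy database that can act as a replication source or destination."""

    def __init__(self, name):
        self.name = name
        self.last_etag = 0
        self.docs = {}           # document ID -> (this database's Etag, body)
        self.last_received = {}  # source name -> last source Etag received

    def put(self, doc_id, body):
        self.last_etag += 1
        self.docs[doc_id] = (self.last_etag, body)

    def last_etag_from(self, source_name):
        """Destination side: answer 'what did you last get from me?'."""
        return self.last_received.get(source_name, 0)

    def receive_batch(self, source_name, batch):
        for _source_etag, doc_id, body in batch:
            self.put(doc_id, body)  # stored under the destination's own Etag
        self.last_received[source_name] = batch[-1][0]


def replicate(source, destination, batch_size=2):
    """Push documents the destination has not yet received, in Etag order."""
    while True:
        since = destination.last_etag_from(source.name)
        pending = sorted(
            (etag, doc_id, body)
            for doc_id, (etag, body) in source.docs.items()
            if etag > since
        )
        if not pending:
            return
        destination.receive_batch(source.name, pending[:batch_size])


source = ToyDatabase("source")
for doc_id in ("a", "b", "c"):
    source.put(doc_id, f"body of {doc_id}")

destination = ToyDatabase("destination")
destination.put("x", "written directly to the destination")

replicate(source, destination)
print(destination.last_received)  # {'source': 3}
print(destination.docs["a"][0])   # 2: unrelated to the source's Etag of 1

# Delete and recreate the destination: it reports 0, so everything is re-sent.
destination = ToyDatabase("destination")
replicate(source, destination)
print(sorted(destination.docs))   # ['a', 'b', 'c']
```

The source keeps no record of its own here; it asks the destination every time, which is why recreating the destination triggers a full re-replication.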

As with indexing, replication of attachments is managed in the same way, but using the separate set of attachment Etags.

You can see the Etag tracking data for the replication process in the Studio by clicking the Replication Statistics link at the bottom center or by going to http://localhost:8080/databases/foo/replication/info where localhost:8080 is the source Raven server and foo is the source database name.  Each destination has a LastReplicatedEtag value.  (If you don’t see anything, you may not have replication enabled or any replication destinations actually configured.)

The astute reader may wonder how the source database knows that information, since I said above that it is recorded only in the destination.  The truth is that it is only persisted to disk in the destination.  The source keeps the same information up-to-date locally in memory (after retrieving it from the destination), primarily for displaying it in the Studio.

You can see the real, persisted truth in the destination by going to http://localhost:8081/databases/foo/docs?startsWith=raven/replication/sources where localhost:8081 is the destination Raven server and foo is the destination database name.  There will be one document for each source that has replicated to this destination.  (You’ll have a difficult time loading these documents individually due to their naming convention.  The “http://” in the ID causes problems.  The startsWith query string value is the way to go.)