4 things to know before using Azure DocumentDB/CosmosDB (NoSQL)

When I first started using DocumentDB, I thought it was just Microsoft's version of MongoDB. Six months later, I found that this first impression was not quite true. Here are four things I wish I had known before I started.

DocumentDB Collection design

When it comes to NoSQL, a lot of people think of irregular data, images, JSON, graphs, etc. DocumentDB is a document database designed for storing large amounts of “documents” and serving large numbers of real-time queries. One of the most important lessons I learned using NoSQL databases is this: design your queries first. The same applies if you are using HBase, Azure Table storage or MongoDB.

The “schema” of your collection (some NoSQL experts might say there is NO SCHEMA) should be optimized to serve your queries. In my experience, you should de-normalize your collections so that one query, or a few independent queries, can get all the data you want. (Note that there is no such thing as “cross-collection joining”.) The best-case scenario is that each of your application queries hits one collection.
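As a rough illustration of designing for the query first, the sketch below builds a de-normalized order document in plain Python. The collections and field names (customers, products, orders, total) are hypothetical and only serve the example; the point is that one document carries everything a single application query needs.

```python
# Hypothetical data: names and fields are illustrative only.

# Normalized, relational-style data: showing order 1001 needs three lookups.
customers = {"c1": {"name": "Ada", "city": "London"}}
products = {
    "p1": {"title": "Keyboard", "price": 30.0},
    "p2": {"title": "Mouse", "price": 12.5},
}
orders = {"1001": {"customer_id": "c1", "lines": [("p1", 1), ("p2", 2)]}}


def denormalize_order(order_id):
    """Build one self-contained document that a single query can return."""
    order = orders[order_id]
    customer = customers[order["customer_id"]]
    items = []
    for product_id, qty in order["lines"]:
        product = products[product_id]
        items.append({"title": product["title"], "price": product["price"], "qty": qty})
    return {
        "id": order_id,
        "customer": dict(customer),  # copied into the document, not referenced
        "items": items,
        "total": sum(item["price"] * item["qty"] for item in items),
    }
```

The trade-off is duplication: if the customer's city changes, every document that embeds it has to be rewritten, so this shape suits data that is read far more often than it is updated.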

Reading from one or more collections is pretty straightforward: you query (preferably asynchronously) for the handful of documents you need, then combine them however you want in your application or controller. Writing documents, however, can be a little tricky. Basic operations in DocumentDB guarantee “document-level” consistency (see the BASE principle), meaning a document is updated as a whole. Transactional updates across multiple documents are not supported by these basic operations. To perform a transaction in DocumentDB, please refer to my other post: Manage DocumentDB stored procedures – Partition, scale and limits.
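The read pattern described above can be sketched with asyncio. The two fetch functions here are hypothetical stand-ins that read from in-memory dictionaries; in a real application they would wrap whatever client calls you use to query each collection.

```python
import asyncio

# Stand-ins for two collections; a real client call would replace these lookups.
PROFILES = {"u1": {"id": "u1", "name": "Ada"}}
ORDERS = {"u1": [{"id": "1001", "total": 55.0}]}


async def fetch_profile(user_id):
    await asyncio.sleep(0)  # placeholder for network I/O
    return PROFILES[user_id]


async def fetch_orders(user_id):
    await asyncio.sleep(0)  # placeholder for network I/O
    return ORDERS.get(user_id, [])


async def load_user_page(user_id):
    # Issue both independent queries concurrently, then combine in the application.
    profile, orders = await asyncio.gather(
        fetch_profile(user_id), fetch_orders(user_id)
    )
    return {"profile": profile, "orders": orders}
```

Calling `asyncio.run(load_user_page("u1"))` returns one combined dictionary. Because the two queries do not depend on each other, running them concurrently costs roughly the latency of the slower one rather than the sum of both.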

DocumentDB Throughput and Scaling

Scaling DocumentDB is just a click of a button in most cases. Throughput is measured in RUs (request units). While the actual equation for calculating RUs remains a black box to us, the rule of thumb is that reading a 1 KB document costs about 1 RU. You could use this tool to calculate your usage: RU calculator. As of 4/27/2017, a single partition is limited to 10,000 RU/s of throughput and 10 GB of storage. To scale your collection beyond that, you need to either migrate to a “partitioned collection” or submit a service ticket asking for help (which will very likely end with you doing it yourself). With a partitioned collection you get 250 GB of storage and up to 100,000 RU/s. At 10 GB per partition, those 250 GB correspond to 25 partitions, and documents are distributed across them by the partition key you set when creating the collection. To scale further, talk to customer support.
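The partition arithmetic can be written down as a small helper. This is only a back-of-the-envelope sketch: the two constants are the per-partition limits quoted in this post as of 4/27/2017, and the function name is made up for illustration. The service decides the real partition layout, not this formula.

```python
import math

# Per-partition limits quoted in this post (as of 4/27/2017); they may have changed.
GB_PER_PARTITION = 10
RU_PER_PARTITION = 10_000


def min_partitions(storage_gb, ru_per_second):
    """Smallest partition count that satisfies both the storage and throughput limits."""
    by_storage = math.ceil(storage_gb / GB_PER_PARTITION)
    by_throughput = math.ceil(ru_per_second / RU_PER_PARTITION)
    return max(1, by_storage, by_throughput)
```

For the partitioned collection described above, `min_partitions(250, 100_000)` is driven by storage: 250 GB needs 25 partitions, while 100,000 RU/s alone would need only 10.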

DocumentDB Stored procedure

Stored procedures in DocumentDB support transactional operations across multiple documents. They do this through an auto-generated property named “_etag”, which is used to make sure none of the documents involved are changed by someone else while the procedure is running. A stored procedure is executed in a “sandbox” before it is committed. Once committed, the change is visible to all subsequent queries. However, getting stored procedures right in a partitioned collection is a little tricky. For more details, see Manage DocumentDB stored procedures – Partition, scale and limits.
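To make the ETag idea concrete, here is a minimal in-memory simulation of an ETag check. It is not the DocumentDB API (real stored procedures are written in JavaScript and run on the server); `TinyStore`, `replace_if_match` and `ETagMismatch` are hypothetical names used only to show the pattern: read a document, remember its tag, and only write if the tag has not changed.

```python
import uuid


class ETagMismatch(Exception):
    """Raised when a document changed between the read and the write."""


class TinyStore:
    """Toy document store that stamps every write with a fresh _etag."""

    def __init__(self):
        self._docs = {}

    def upsert(self, doc_id, body):
        # Keyword arguments override any stale id/_etag carried in body.
        doc = dict(body, id=doc_id, _etag=uuid.uuid4().hex)
        self._docs[doc_id] = doc
        return dict(doc)

    def read(self, doc_id):
        return dict(self._docs[doc_id])

    def replace_if_match(self, doc_id, body, expected_etag):
        # Refuse the write if someone else updated the document in the meantime.
        if self._docs[doc_id]["_etag"] != expected_etag:
            raise ETagMismatch(doc_id)
        return self.upsert(doc_id, body)
```

A caller that gets `ETagMismatch` would normally re-read the document and retry, which is the usual optimistic concurrency loop.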

DocumentDB Indexing

Indexing is also a very important part of a DocumentDB collection. It works much the same as in a traditional relational database: you index a field, and queries on that field become faster. You can choose different kinds of index, such as hash, range or spatial, to speed up your queries. See here for details: Indexing in DocumentDB
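The difference between a hash index and a range index can be sketched conceptually in a few lines. This is a toy model of the two ideas, not how DocumentDB implements them internally; the documents and helper names are invented for illustration.

```python
import bisect
from collections import defaultdict

docs = [
    {"id": "a", "age": 31},
    {"id": "b", "age": 25},
    {"id": "c", "age": 31},
    {"id": "d", "age": 40},
]

# Hash-style index: value -> ids. Fast for equality, useless for ranges.
hash_index = defaultdict(list)
for doc in docs:
    hash_index[doc["age"]].append(doc["id"])

# Range-style index: entries kept sorted by value, so ranges are a slice.
range_index = sorted((doc["age"], doc["id"]) for doc in docs)
range_keys = [age for age, _ in range_index]


def ids_with_age(age):
    """Equality lookup served by the hash index."""
    return sorted(hash_index.get(age, []))


def ids_with_age_between(low, high):
    """Inclusive range lookup served by the sorted index via binary search."""
    start = bisect.bisect_left(range_keys, low)
    end = bisect.bisect_right(range_keys, high)
    return [doc_id for _, doc_id in range_index[start:end]]
```

Equality filters only need the first structure, while filters such as "age between 26 and 35" or an ORDER BY need the second, which is why the kind of index you pick should follow the queries you designed up front.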
