Manage AppInsights telemetry with DocumentDB (NoSQL)

Using Azure Application Insights + Stream Analytics, you can easily create a data pipeline in minutes that keep track of all your application activities.
However things gets a little tricky when you are querying data base on the “customDimensions” field.
When you are using DocumentDB to store Azure telemetry data, you have data like below:

{id":"xxxxxxxxxxxxxxxx","customdimensions":[{"dimension_1":"data1"},{"dimension_2":"data2"},{"dimension_3":"data3"}]}

 

 Single filter query

Essentially, the custom dimensions are converted into an “object array”. Each field is wrapped inside an anonymous object.
Everything will works just fine util you need to do query pivot on like this:

select * from c where c.customDimensions.dimension_1 == 'data1'

This is an invalid query since “dimension_1” is inside an anonymous object array. To solve this problem, you can use a “self-join” function provided by DocumentDB to “flatten” the object array. This query will give you the correct result:

select c from c join field_alias in c.customDimensions where field_alias.dimension_1 == 'data1'

This “self-join” essentially creates a list of documents like:

{id":"xxxxxxxxxxxxxxxx",tag:{"dimension_1":"data1"},"customdimensions":[{"dimension_1":"data1"},{"dimension_2":"data2"},{"dimension_3":"data3"}]};
{id":"xxxxxxxxxxxxxxxx",tag:{"dimension_2":"data2"},"customdimensions":[{"dimension_1":"data1"},{"dimension_2":"data2"},{"dimension_3":"data3"}]};
{id":"xxxxxxxxxxxxxxxx",tag:{"dimension_3":"data3"},"customdimensions":[{"dimension_1":"data1"},{"dimension_2":"data2"},{"dimension_3":"data3"}]};

Multiple filters query

Things will work like a charm until you need to run query with multiple filters:

select c from c join field_alias in c.customDimensions where field_alias.dimension_1 == 'data1' and field_alias.dimension_2 == 'data2'

Why? As explained above, when you are doing a self-join, you created a new column. There is no row that contains both dimension_1 and dimension_2. To solve this, use multiple self-join.

select c from c join field_alias_1 in c.customDimensions join field_alias_2 in c.customDimensions where field_alias_1.dimension_1 == 'data1' and field_alias_2.dimension_2 == 'data2'

Performance

As you can see, the more filters you use, the slower the query. The complexity of this query is O(n^filter_count).

So far there isn’t a good solution to handle telemetry data in DocumentDB (as far as I know).
In the next post, I am going to show you how to extract a field from custom dimension in Azure Stream Analytics pipeline.

Cheers.

Leave a Reply