This post recaps ideas from a brilliant session hosted by Barr Moses, CEO and co-founder of Monte Carlo Data, at Data+AI Summit 2022. In the session, Barr shared her insights on data observability and presented a solution to a problem most data engineering teams face today.
Session video: link
At a glance

- The problem of data downtime has been around for some time, but there are very few solutions that address it. It is costly due to lost business.
- Data downtime is hard to detect exhaustively with code. Most data outages (~90%) are reported by customers. (“The data looks wrong?”)
- App observability vs. data observability: metrics, traces, and logs vs. freshness, volume, distribution, schema, and lineage.
- Data observability lifecycle: detect, resolve, and prevent.
The data downtime problem


The “data downtime problem” is prevalent in teams whose day-to-day job deals with business intelligence, analytics, reporting, and machine learning. It takes different forms depending on the role of the data consumer. For example, “data downtime” for the data platform team usually means the data is not available to downstream customers, while from a BI perspective it could mean missing or incomplete data returned from the platform.
Moreover, the knowledge gap among the disciplines within an organization makes it even harder to detect data outages in a reasonable timeframe; oftentimes the outage is reported by a customer, which then triggers the process of investigating and recovering the data.
Why is it hard to deal with data downtime?
From my own experience, there are many reasons why data downtime is hard to detect. To name a few:
- Lack of a mature data lifecycle management framework that is followed by all roles.
- Detecting data problems is hard, especially when the nature of the problem relates to “data quality”, such as skewed values and outliers.
- Missing tooling and automation for incident response.
- Need to fix the culture within the organization to establish a “data mindset”.
- …
In summary, there needs to be a framework to manage data in an organized and modern way.
“App Observability” vs. “Data Observability”

Traditional application observability is defined by three pillars: metrics, traces, and logs. Here Barr presents a definition of data observability in five pillars: freshness, volume, distribution, schema, and lineage.
Freshness
Freshness refers to the age of the data, or the time that elapses before the data is available to downstream consumers. Depending on the consumption pattern, freshness can play a key role in the success of the business, such as detecting trending topics in digital marketing. A tight freshness requirement can be very challenging for the platform, as it might call for some form of streaming infrastructure that is costly and difficult to operate.
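As a rough illustration, a minimal freshness monitor simply compares the newest record’s timestamp against an SLA. The sketch below assumes a one-hour SLA and a timezone-aware timestamp; both are assumptions for the example, not anything from the talk.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: flag a table as stale when its newest record is older
# than an agreed freshness SLA. The one-hour SLA is an assumed value; in
# practice it comes from the downstream consumer's requirements.
FRESHNESS_SLA = timedelta(hours=1)

def is_stale(last_event_time: datetime) -> bool:
    # last_event_time would typically come from something like
    # SELECT MAX(event_time) FROM my_table, as a timezone-aware timestamp.
    age = datetime.now(timezone.utc) - last_event_time
    return age > FRESHNESS_SLA
```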
Distribution
Distribution of data plays a key role in use cases such as machine learning and anomaly detection. As an old saying goes, “garbage in, garbage out”: skewed input data can produce a biased model that is misleading and behaves differently between experimentation and production.
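To make the idea concrete, here is a minimal and deliberately crude drift check that flags a batch whose mean wanders too far from a trusted baseline. The 3-sigma threshold is an assumption for the sketch; real monitors typically apply proper statistical tests per column.

```python
import statistics

# Hypothetical sketch: flag drift when the current batch's mean falls more
# than z_threshold baseline standard deviations away from the baseline mean.
def has_drifted(baseline: list[float], current: list[float],
                z_threshold: float = 3.0) -> bool:
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(current) - mu) > z_threshold * sigma
```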
Volume
Volume of data can be represented by different metrics, such as “data completeness” in deterministic terms or “coverage” in non-deterministic terms. A deterministic check compares the expected volume against the actual volume, whereas a non-deterministic check oftentimes applies a heuristic with some sort of threshold to decide whether the volume of data received meets expectations. Fine-tuning that threshold can be very challenging in some cases and is subject to change.
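A minimal sketch of the two flavors of volume check described above, with an assumed 50% tolerance band for the heuristic case (the hard-to-tune threshold):

```python
# Hypothetical sketch of the two volume checks described above.
def deterministic_check(expected_rows: int, actual_rows: int) -> bool:
    # Deterministic: we know exactly how many rows to expect.
    return actual_rows == expected_rows

def heuristic_check(recent_daily_counts: list[int], todays_count: int,
                    tolerance: float = 0.5) -> bool:
    # Non-deterministic: accept today's volume if it sits within a tolerance
    # band around the recent average. The 50% tolerance is an assumed
    # starting point; tuning it is the hard part, as noted above.
    avg = sum(recent_daily_counts) / len(recent_daily_counts)
    return abs(todays_count - avg) <= tolerance * avg
```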
Schema
Schema changes are often difficult to roll out due to the coordination required between services and teams. There are literally thousands of things that can go wrong with a change to an existing schema, including but not limited to consistency, compatibility, backfilling data, and versioning. It is also worth noting that once a schema change has rolled out, it can be very difficult to roll back, as the change might be partial or require dedicated steps to revert the state to LKG (last known good).
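One way to catch breaking changes early, sketched here under the assumption that a schema can be snapshotted as a simple column-to-type mapping, is to diff the live schema against the stored LKG snapshot:

```python
# Hypothetical sketch: diff the live schema against a last-known-good (LKG)
# snapshot, both modeled as {column_name: type_name} dicts for illustration.
def diff_schema(lkg: dict[str, str], live: dict[str, str]) -> dict[str, list[str]]:
    return {
        "dropped_columns": [c for c in lkg if c not in live],  # breaks consumers
        "added_columns": [c for c in live if c not in lkg],    # usually additive/safe
        "type_changes": [c for c in lkg
                         if c in live and lkg[c] != live[c]],  # compatibility risk
    }
```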
Lineage
Knowing that some piece of data is “broken”, lineage helps define and evaluate the impact of the outage. Moreover, in the modern data business, lineage is becoming a key requirement for staying in business, as regulations impose stricter rules on how user data must be tracked and managed. Data lineage is also an important part of the infrastructure that provides compliance and governance for the product.
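To illustrate the impact-evaluation point, here is a minimal sketch that walks a lineage graph, assumed to be an adjacency list mapping each asset to its direct consumers, to estimate the blast radius of a broken asset; the table names are made up for the example.

```python
# Hypothetical sketch: given lineage as {asset: [direct consumers]}, collect
# every downstream asset affected when one asset breaks.
def downstream_impact(lineage: dict[str, list[str]], broken: str) -> set[str]:
    impacted, stack = set(), [broken]
    while stack:
        for consumer in lineage.get(stack.pop(), []):
            if consumer not in impacted:
                impacted.add(consumer)
                stack.append(consumer)
    return impacted

lineage = {"raw_events": ["sessions"], "sessions": ["bi_dashboard", "ml_features"]}
print(downstream_impact(lineage, "raw_events"))
# -> {'sessions', 'bi_dashboard', 'ml_features'} (set order may vary)
```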
Data observability in action: go beyond detection to “resolve” incidents


Using the right toolset and engineering processes to manage data is key. At Microsoft, we use a framework of “Detection -> Investigation -> Mitigation -> Triage -> Repair -> Post-mortem” to address an incident.
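For tooling that tracks incidents through such a framework, the stages can be modeled as a simple ordered state; the sketch below is my own illustration, not part of any Microsoft or Monte Carlo tooling.

```python
from enum import Enum

# Hypothetical sketch: the incident stages named above as an ordered enum.
class IncidentStage(Enum):
    DETECTION = 1
    INVESTIGATION = 2
    MITIGATION = 3
    TRIAGE = 4
    REPAIR = 5
    POST_MORTEM = 6

def advance(stage: IncidentStage) -> IncidentStage:
    # Move an incident to the next stage; post-mortem is terminal.
    if stage is IncidentStage.POST_MORTEM:
        return stage
    return IncidentStage(stage.value + 1)
```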
In the end
At the end of the day, a data mentality and a proper framework/toolset that guides the different disciplines across the org are the keys to success. Thank you, Barr Moses and Monte Carlo Data, for the excellent sharing!