This post is Part #2 of my recap of a great tech talk at Databricks Data+AI Summit 2022 by Aakrati Talati and Avesh Singh. It demonstrates how to use Databricks Feature Store to get your ML project into production effortlessly.
Original talk: Enable Production ML with Databricks Feature Store

At a glance
- What makes it hard to productionize AI/ML projects? See Part #1: Why 90% of AI/ML projects never make it to production. Enable Production ML with Databricks Feature Store (Databricks Data+AI Summit 2022) – Part #1
- Databricks Feature Store addresses these challenges by unifying online and offline featurization logic, using a stream-oriented data processing model for batch/stream compatibility, and providing point-in-time joins on feature tables for version control.
- The “as-of join” function joins feature tables using the feature values that were current at a given point in time.
- Beyond feature management, Databricks Feature Store provides functionality such as data lineage, feature discovery, model packaging, and real-time features that streamline the inference experience.
- An online+offline store setup covers most AI/ML project use cases.
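The “as-of join” idea can be illustrated outside Databricks with pandas’ `merge_asof`, which for each label row picks the most recent feature value at or before that row’s timestamp. This is a minimal sketch; the table and column names (`user_id`, `purchases_7d`) are invented for illustration:

```python
import pandas as pd

# Label events: for each user we want features as they looked at event time.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2022-07-02", "2022-07-05", "2022-07-03"]),
})

# Feature table: each row is the feature value effective from its timestamp.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2022-07-01", "2022-07-04", "2022-07-01"]),
    "purchases_7d": [3, 5, 1],
})

# As-of join: take the latest feature row whose ts <= the label's ts.
train_df = pd.merge_asof(
    labels.sort_values("ts"),
    features.sort_values("ts"),
    on="ts", by="user_id", direction="backward",
)
print(train_df.sort_values(["user_id", "ts"])["purchases_7d"].tolist())
# → [3, 5, 1]
```

Note the `direction="backward"` argument: it guarantees no feature value from *after* the label timestamp leaks into training, which is exactly the leakage the point-in-time join exists to prevent.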
Recap of Part #1: The challenges in AI/ML projects




The nature of productionizing an AI/ML project:
Production Machine Learning = Production Software + Production Data
The data challenges faced by AI/ML projects:
- Data silos: “spaghetti code” and isolated experiments.
- Online-offline skew: training vs. inference, with different tech stacks and different performance characteristics.
- Client configuration: version management for every piece of the E2E pipeline (data, featurizer, training, inference, client).
- Data freshness: coordinating updates to data schemas and feature properties.
What is Databricks Feature Store?

Databricks Feature Store documentation: What is a feature store?
A feature store is a centralized repository that enables data scientists to find and share features and also ensures that the same code used to compute the feature values is used for model training and inference.
Databricks Feature Store organizes features in “feature tables”. A feature table can be thought of as a relational database table with rich features: it requires a partition_key for partitioning/indexing and a timestamp_key for data versioning. Under the hood, each feature table is backed by a Delta table plus some metadata.
Databricks Feature Store – Using Feature Store to manage features online/offline
To consume data from a feature table, Databricks offers two options: an offline store and an online store. These two options target different use cases in the AI/ML project lifecycle: the offline store for feature discovery, model training, and batch inference; the online store for low-latency, real-time model inference.

The basic workflow for creating and consuming a feature table looks like below:
- An existing pipeline pulls raw data from multiple sources into a Delta Lake (live) table. It can be a streaming job or a scheduled batch job.
- Instead of ingesting new data into a Delta Live table, ingest it into a “feature table” in the Feature Store. Now you have a feature table with up-to-date data. For batch jobs, you can define a trigger so the ingestion job runs on a schedule.
- Next, you can define a DataFrame that sources from multiple feature tables and joins them by a specified “lookup_key”, the column name used to join your DataFrame with each feature table. This produces a DataFrame holding the features from those tables.
- While creating the feature DataFrame, you can also specify a timestamp to use the latest feature values as of that point in time.
- Now you have a DataFrame ready for model training.
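The steps above can be sketched with plain pandas (in Databricks you would use the Feature Store client with FeatureLookup objects instead; the table names, the `user_id` lookup key, and the feature columns here are all invented for illustration):

```python
import pandas as pd

# Two hypothetical feature tables, as they might live in the Feature Store.
user_features = pd.DataFrame({
    "user_id": [1, 2, 3],          # the lookup_key column
    "avg_session_minutes": [12.0, 7.5, 30.2],
})
device_features = pd.DataFrame({
    "user_id": [1, 2, 3],
    "is_mobile": [True, False, True],
})

# The training labels hold only the lookup_key plus the target column.
labels = pd.DataFrame({"user_id": [1, 2, 3], "churned": [0, 1, 0]})

# Joining each feature table on the lookup_key yields the training DataFrame.
train_df = (
    labels
    .merge(user_features, on="user_id", how="left")
    .merge(device_features, on="user_id", how="left")
)
print(list(train_df.columns))
# → ['user_id', 'churned', 'avg_session_minutes', 'is_mobile']
```

The key design point: the training code never re-implements feature computation; it only declares which feature tables to join and on which key.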
Databricks Feature Store – Feature Discovery

Having a centralized feature store makes feature sharing and discovery much easier. As described above, features are managed in feature tables, each backed by a Delta table. One data engineering team can easily create a feature table and share it across the organization for other use cases. Databricks Feature Store also provides search by table name and feature name.
Training a model with joined feature tables

Once you have the join query ready, train your model just as you normally would with a DataFrame! The Feature Store SDK takes care of the rest under the hood.
Model Packaging

Databricks Feature Store lets you package your model together with metadata about the features used in training. This package can be deployed seamlessly into the inference environment, and the inference server can then do batch scoring without further work. This capability is crucial for managing model quality when your model runs online, or when you re-train.
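Conceptually, the packaged artifact bundles the model with the metadata of which feature tables and lookup keys it was trained on, so a batch scorer only needs to supply the keys. Below is a plain-Python sketch of that idea (the class names, feature table, and stand-in model are all invented; the real mechanism is the Feature Store SDK's model packaging, not this code):

```python
import pandas as pd

# Hypothetical feature table as it might exist in the store.
user_features = pd.DataFrame({
    "user_id": [1, 2],
    "purchases_7d": [3, 1],
})

class PackagedModel:
    """Sketch of a model packaged with its feature metadata: the caller
    supplies only lookup keys; the package knows which features to fetch."""

    def __init__(self, model, feature_table, lookup_key):
        self.model = model
        self.feature_table = feature_table
        self.lookup_key = lookup_key

    def score_batch(self, keys: pd.DataFrame) -> pd.Series:
        # Fetch features by key, then delegate to the wrapped model.
        inputs = keys.merge(self.feature_table, on=self.lookup_key, how="left")
        return self.model.predict(inputs)

class ThresholdModel:
    def predict(self, df):
        return df["purchases_7d"] > 2  # trivial stand-in model

packaged = PackagedModel(ThresholdModel(), user_features, "user_id")
scores = packaged.score_batch(pd.DataFrame({"user_id": [1, 2]}))
print(scores.tolist())  # → [True, False]
```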

Databricks Feature Store – Real-time features(features on-the-fly)
There may be cases where your model takes features from live traffic (for example, isTodayNationalHoliday based on the user’s location). pyfunc in MLflow provides a solution for this case: in the training code, use pyfunc to create a custom wrapper that enriches the input DataFrame before it reaches your model logic.
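The shape of such a wrapper looks roughly like the sketch below. In real MLflow code this class would subclass `mlflow.pyfunc.PythonModel`; here it is kept as plain Python so it stands alone, and the holiday list and column names are assumptions for illustration:

```python
import pandas as pd

# Assumed fixed holiday list for the sketch; a real implementation would
# consult a holiday calendar for the user's location.
HOLIDAYS = {"2022-07-04", "2022-12-25"}

class HolidayAwareModel:
    """Shaped like an MLflow pyfunc model: in real code this would
    subclass mlflow.pyfunc.PythonModel and wrap the trained model."""

    def __init__(self, model):
        self.model = model

    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        # Compute the on-the-fly feature from the live request, then score.
        enriched = model_input.copy()
        enriched["is_national_holiday"] = enriched["date"].isin(HOLIDAYS)
        return self.model.predict(enriched)

# Usage with a stand-in model that just echoes the computed feature.
class EchoModel:
    def predict(self, df):
        return df["is_national_holiday"]

wrapped = HolidayAwareModel(EchoModel())
out = wrapped.predict(None, pd.DataFrame({"date": ["2022-07-04", "2022-07-05"]}))
print(out.tolist())  # → [True, False]
```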



Databricks Feature Store – Publish feature tables to online store
As of 07/26/2022, Databricks Feature Store supports only three AWS online store offerings: DynamoDB, Aurora, and RDS MySQL.

Now you have your feature table(s) and a model trained on the existing data. What is next? Inference.
Inference generally refers to a web application, with a built-in AI/ML model, making decisions according to user inputs. Note that in most cases (if not all), the user payload does not arrive in a schema the model understands. This is where the online store comes in to do the heavy lifting of transformation for the web service, at low latency. Publishing your feature table to an online store deploys the featurizer logic to the online store, where it is used to transform the raw payload.
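At serving time the flow is roughly: look up the precomputed features by key in the online store, merge them with the raw payload, and hand the result to the model. A minimal sketch with an in-memory dict standing in for DynamoDB/MySQL (the keys and feature names are invented):

```python
# In-memory stand-in for the online store: feature rows keyed by lookup key.
online_store = {
    "user_42": {"avg_session_minutes": 12.0, "purchases_7d": 3},
}

def featurize_request(payload: dict) -> dict:
    """Turn a raw request payload into the model's input schema by
    joining it with precomputed features from the online store."""
    features = online_store.get(payload["user_id"], {})
    return {**payload, **features}

request = {"user_id": "user_42", "page": "checkout"}
model_input = featurize_request(request)
print(sorted(model_input))
# → ['avg_session_minutes', 'page', 'purchases_7d', 'user_id']
```

The latency budget is why a key-value store is used here: the web service does a single key lookup per request instead of recomputing features from raw history.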

Connecting the dots – End user experience
At the end of the day, all the heavy-lifting processes are invisible to the end user. The end user simply makes an HTTP request and receives a personalized result, derived from the confidence computation that happened on the server. The server calls the online store for payload conversion and serves the response based on the score.

Databricks Feature Store – Summary
Using a feature store is becoming strategically more and more important for productionizing AI/ML projects: it democratizes data ownership, increases the efficiency and reusability of data, and closes the gaps that bring instability, risk, and COGS into the production system. Starting to think about a feature store is definitely the right choice for your org in the long run.
