Feature Stores and Why You Need Them

Machine Learning and DevOps are generally well established within the data industry: most teams have some experience with one or both and understand the value they add to workflows and implementations.

MLOps, sometimes called Operational Machine Learning, is by contrast still relatively new. It aims to resolve the challenges of getting machine learning models and processes into production for operational use.

One of the major challenges is how to manage features and data pipelines while ensuring consistency between training and production. This is exactly what Feature Stores were designed to do. This blog will introduce you to the basics of Feature Stores and how they solve one of the largest impediments to machine learning success.

Want to learn more about what MLOps involves? Check out my previous blog here.

What is a Feature Store?

Feature Stores, as their name suggests, store and organise features for the explicit purpose of training models or making predictions. A Feature Store is a centralised location in which to create and update features derived from different data sources, so that you can retrieve them when needed instead of computing them each time.

Basic Feature Store Concepts

Features and Feature Values

Features are individual independent variables that act as inputs to your model. A feature is a column within a dataset, and a feature value is a single value within that column.
For example, below is an image from Tecton’s Blog ‘What is a feature store?’:

Diagram showing the difference between a Feature and a Feature Value.
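To make the distinction concrete, here is a minimal sketch in plain Python. The dataset, the customer IDs and the feature names (avg_order_value, orders_last_30d) are invented for illustration and are not taken from the diagram.

```python
# Hypothetical dataset: each key is a feature (a column) and each list
# entry is the feature value for one customer.
dataset = {
    "customer_id": [101, 102, 103],
    "avg_order_value": [25.0, 40.5, 12.75],
    "orders_last_30d": [3, 1, 7],
}

# The feature is the whole column.
feature_name = "avg_order_value"
feature_column = dataset[feature_name]

# A feature value is the single entry in that column for one customer.
row = dataset["customer_id"].index(102)
feature_value = feature_column[row]

print(f"{feature_name} for customer 102: {feature_value}")
```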

Registry

A feature registry is the central interface for managing feature definitions and their associated metadata, so that it acts as a single source of information for an organisation. A good feature store will also track the lineage of each feature definition, i.e. where the feature was created, who owns the definition, and so on.
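As a rough illustration of the idea, a registry can be thought of as a versioned lookup of feature definitions and their metadata. The sketch below is a toy in-memory version and not the API of any real feature store; FeatureDefinition, FeatureRegistry and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FeatureDefinition:
    """Metadata for one version of a feature (hypothetical fields)."""
    name: str
    owner: str
    source: str
    version: int


class FeatureRegistry:
    """Toy in-memory registry that keeps every version of each definition."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[FeatureDefinition]] = {}

    def register(self, name: str, owner: str, source: str) -> FeatureDefinition:
        # Each new registration of the same name becomes the next version.
        history = self._versions.setdefault(name, [])
        definition = FeatureDefinition(name, owner, source, version=len(history) + 1)
        history.append(definition)
        return definition

    def latest(self, name: str) -> FeatureDefinition:
        return self._versions[name][-1]

    def lineage(self, name: str) -> List[FeatureDefinition]:
        # Full history of the definition, oldest first.
        return list(self._versions[name])


registry = FeatureRegistry()
registry.register("avg_order_value", owner="growth-team", source="orders table")
registry.register("avg_order_value", owner="growth-team", source="orders_v2 table")
print(registry.latest("avg_order_value"))
```

Keeping old versions rather than overwriting them is what makes it possible to answer lineage questions such as which source a feature was built from at a given version.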

Offline Store

An offline feature store holds large volumes of feature data used to create training and test datasets for model development, as well as for batch inference. It is usually a SQL-backed database optimised for storing large volumes of feature data.
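The sketch below shows the general shape of this using Python's built-in sqlite3 module as a stand-in for a SQL-backed offline store. The table names, columns and the churn label are assumptions made up for the example; a production offline store would be a much larger warehouse or lakehouse table.

```python
import sqlite3

# In-memory SQLite database standing in for the offline store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_features ("
    "customer_id INTEGER PRIMARY KEY, avg_order_value REAL, orders_last_30d INTEGER)"
)
conn.execute("CREATE TABLE labels (customer_id INTEGER PRIMARY KEY, churned INTEGER)")

# Hypothetical precomputed feature values and labels.
conn.executemany(
    "INSERT INTO customer_features VALUES (?, ?, ?)",
    [(101, 25.0, 3), (102, 40.5, 1), (103, 12.75, 7)],
)
conn.executemany("INSERT INTO labels VALUES (?, ?)", [(101, 0), (102, 1), (103, 0)])

# Retrieve stored features joined to labels instead of recomputing them.
training_rows = conn.execute(
    """
    SELECT f.customer_id, f.avg_order_value, f.orders_last_30d, l.churned
    FROM customer_features AS f
    JOIN labels AS l ON l.customer_id = f.customer_id
    ORDER BY f.customer_id
    """
).fetchall()

# Naive positional split into train and test sets, for illustration only.
train_rows, test_rows = training_rows[:2], training_rows[2:]
print(train_rows, test_rows)
```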

Online Store

An online feature store is required for real-time inference. It is a low-latency database (typically a key-value store) used to enrich real-time datasets, using the feature definitions that were stored in the registry when the offline store was populated.
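Conceptually, enrichment is a key lookup followed by a merge. The following sketch uses a plain Python dictionary in place of a real key-value database; the enrich function, the event fields and the stored feature values are all hypothetical.

```python
from typing import Any, Dict

# A dict keyed by entity ID stands in for the low-latency key-value store.
online_store: Dict[int, Dict[str, Any]] = {
    101: {"avg_order_value": 25.0, "orders_last_30d": 3},
    102: {"avg_order_value": 40.5, "orders_last_30d": 1},
}


def enrich(event: Dict[str, Any], store: Dict[int, Dict[str, Any]]) -> Dict[str, Any]:
    """Attach the latest stored feature values to an incoming event."""
    features = store.get(event["customer_id"], {})
    # Unknown customers pass through unchanged; a real system would
    # probably fall back to default feature values here.
    return {**event, **features}


incoming = {"customer_id": 102, "basket_total": 18.0}
print(enrich(incoming, online_store))
```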

Why do I need a Feature Store?

There are a number of benefits to using a Feature Store for batch scoring, real-time scoring, or both, at all stages of the Machine Learning lifecycle.

Feature stores allow you to easily create new features for reuse. This means data scientists across teams can explore features already used in other models, which encourages consistent feature definitions in different areas of the organisation.
With feature definitions shared across offline and online stores, promoting models to production is easier as the data pipelines can be shared across both training and serving.
Features are consistent between training and serving data, which reduces the risk of a model behaving differently in production than it did in development.
The registry allows the version, lineage and metadata of an individual feature to be tracked for security and governance.
Finally, with features being created and saved over the course of the ML lifecycle, they can be monitored so that alerts are raised when drift occurs in productionised models.
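The consistency benefit above comes down to having one definition of each feature that both the batch and real-time paths call. Here is a minimal sketch of that idea, assuming a hypothetical avg_order_value feature computed from a customer's order totals; the function names and data are illustrative only.

```python
from typing import Dict, List


def avg_order_value(order_totals: List[float]) -> float:
    """Single shared definition of the feature (hypothetical)."""
    return sum(order_totals) / len(order_totals) if order_totals else 0.0


def build_offline_features(history: Dict[int, List[float]]) -> Dict[int, float]:
    # Batch path: compute the feature for every customer for training.
    return {cid: avg_order_value(totals) for cid, totals in history.items()}


def serve_online_feature(order_totals: List[float]) -> float:
    # Real-time path: same definition, applied to one customer at request time.
    return avg_order_value(order_totals)


history = {101: [20.0, 30.0], 102: [40.5]}
offline = build_offline_features(history)
online = serve_online_feature(history[101])
print(offline[101] == online)
```

Because both paths call the same function, a change to the definition reaches training and serving together, rather than being reimplemented separately in each pipeline.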

What’s Next?

There is a growing number of open source feature store implementations to get started with. Each has its own strengths and weaknesses, and which one is best to implement will depend on the use case in question.

Our favourites include Feast, one of the most popular cloud and on-prem feature stores, and Databricks Feature Store, a cloud feature store hosted in Databricks with a great UI and additional data pipeline features.

While Feature Stores don’t resolve all the challenges in machine learning, they are essential to creating an MLOps workflow and offer a huge range of benefits.