AWS Feed
Getting started with Feast, an open source feature store running on AWS Managed Services
This post was written by Willem Pienaar, Principal Engineer at Tecton and creator of Feast.
Feast is an open source feature store and a fast, convenient way to serve machine learning (ML) features for training and online inference. Feast lets you build point-in-time correct training datasets from feature data, allows you to deploy a production-grade feature serving stack to Amazon Web Services (AWS) in seconds, and simplifies tracking the features your models use.
Why Feast?
Most ML teams today are well versed in shipping machine learning models into production, but deployment is only a small part of the MLOps lifecycle. Most teams don’t have a declarative way to ship data into production for consumption by machine learning models. That’s where Feast helps.
- Tracking and sharing features: Feast allows teams to define and track feature metadata (such as data sources, entities, and features) through declarative definitions that are version controlled in Git. This allows teams to maintain a versioned history of operationalized features, helping teams understand how features are performing in production, and enabling reuse and sharing of features across teams.
- Managed serving infrastructure: Feast takes the work out of setting up data infrastructure. It configures your data stores for serving features, makes populating those stores with feature values easy, and provides an SDK for reading feature values from them at low latency.
- A consistent view of data: Machine learning models need to see the same view of features in training that they will see in production. Feast ensures this consistency through time-travel-based training dataset generation and through a unified serving interface, so your models see a consistent view of features during both training and inference.
Feast on AWS
With the latest release of Feast, you can take advantage of AWS storage services to run an open source feature store:
- Amazon Redshift and Amazon Simple Storage Service (Amazon S3) can be used as an offline store, which supports feature serving for training and batch inference of large amounts of feature data.
- Amazon DynamoDB, a NoSQL key-value database, can be used as an online store. Amazon DynamoDB supports feature serving at low latency for real-time prediction.
Use case: Real-time credit scoring
When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is made through a statistical model. Often, this model uses information about a customer to determine the likelihood that they will repay or default on a loan. This process is called credit scoring.
For this use case, we will demonstrate how a real-time credit scoring system can be built using Feast and scikit-learn.
This real-time system must accept a loan request from a customer and respond within 100 ms with a decision on whether the loan is approved or rejected.
A fully working demo repository for this use case is available on GitHub.
Data model
We have three datasets at our disposal to build this credit scoring system.
The first is a loan dataset. This dataset has features based on historic loans for current customers. Importantly, this dataset contains the target column, `loan_status`. This column denotes whether a customer has defaulted on a loan.
| Column | Description | Sample |
| --- | --- | --- |
| loan_id | Unique id for the loan | 12208 |
| dob_ssn | Date of birth joined to SSN | 19790429_9552 |
| zipcode | Zip code of the customer | 30721 |
| person_age | Age of customer | 24 |
| person_income | Yearly income of the customer | 30000 |
| person_home_ownership | Home ownership class for customer | RENT |
| person_emp_length | How long the customer has been employed (months) | 2.0 |
| loan_intent | Reason for taking out loan | EDUCATION |
| loan_amnt | Loan amount | 3000 |
| loan_int_rate | Loan interest rate | 5.2 |
| loan_status | Status of loan | 0 |
| event_timestamp | When the loan was issued or updated | 2021-07-28 17:09:19 |
| created_timestamp | When this record was written to storage | 2021-07-28 17:09:19 |
The second dataset we will use is a zip code dataset. This dataset is used to enrich the loan dataset with supplementary features about a specific geographic location.
| Column | Description | Sample |
| --- | --- | --- |
| zipcode | Zip code to which features relate | 94546 |
| city | City to which features relate | CASTRO VALLEY |
| state | State to which features relate | CA |
| tax_returns_filed | Number of tax returns filed in this zip code | 20616 |
| population | Total population of this zip code | 35351 |
| wages | Combined yearly earnings for all individuals in this zip code | 987939047 |
| event_timestamp | When the zip code features were collected | 2017-01-01 12:00:00 |
| created_timestamp | When this record was written to storage | 2017-01-01 12:00:00 |
The third and final dataset is a credit history dataset. This dataset contains credit history on a per-person basis and is updated frequently by the credit institution. Every time a credit check is done on an individual, this dataset is updated.
| Column | Description | Sample |
| --- | --- | --- |
| dob_ssn | Date of birth joined to SSN | 19530219_5179 |
| credit_card_due | How much this person owes on their credit cards | 0 |
| mortgage_due | How much this person owes on their mortgages | 91803 |
| student_loan_due | How much this person owes on their student loans | 0 |
| vehicle_loan_due | How much this person owes on their vehicle loans | 0 |
| hard_pulls | How many hard credit checks this person has had | 1 |
| missed_payments_2y | How many missed payments this person has had in the last 2 years | 1 |
| missed_payments_1y | How many missed payments this person has had in the last year | 0 |
| missed_payments_6m | How many missed payments this person has had in the last 6 months | 0 |
| bankruptcies | How many bankruptcies this person has had | 0 |
| event_timestamp | When the credit check was executed | 2017-01-01 12:00:00 |
| created_timestamp | When this record was written to storage | 2017-01-01 12:00:00 |
The preceding loan, zip code, and credit history features will be combined into a single training dataset when building a credit-scoring model. However, historic loan data is not useful for making predictions based on new customers. Therefore, we will register and serve only the zip code and credit history features with Feast, and we will assume that the incoming request contains the loan application features.
An example of the loan application payload is as follows:
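The exact payload in the demo repository may differ; a representative sketch, built from the sample values in the tables above, looks like this:

```python
# Illustrative loan application payload (field values are examples only)
loan_request = {
    "zipcode": 30721,
    "dob_ssn": "19790429_9552",
    "person_age": 24,
    "person_income": 30000,
    "person_home_ownership": "RENT",
    "person_emp_length": 2.0,
    "loan_intent": "EDUCATION",
    "loan_amnt": 3000,
    "loan_int_rate": 5.2,
}
```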
Amazon S3 and Redshift as a data source and offline store
A Redshift data source allows you to fetch historical feature values from Redshift for building training datasets and materializing features into an online store.
Install Feast using pip:
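Installing with the `aws` extra pulls in the Redshift and DynamoDB dependencies (the extra is available in recent Feast releases):

```bash
pip install 'feast[aws]'
```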
Initialize a blank feature repository:
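For example (the project name is just an illustration; Feast also ships an AWS template via `feast init -t aws` that prompts for your Redshift and S3 details):

```bash
feast init credit_scoring
cd credit_scoring
```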
This command will create a feature repository for your project. Let’s edit our feature store configuration using the provided `feature_store.yaml`:
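A minimal configuration with Redshift as the offline store and DynamoDB as the online store might look like the following. The cluster, database, user, role, and bucket names are placeholders you would replace with your own, and the exact field names may vary slightly between Feast versions:

```yaml
project: credit_scoring
registry: registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2
offline_store:
  type: redshift
  cluster_id: my-redshift-cluster
  region: us-west-2
  user: admin
  database: dev
  s3_staging_location: s3://my-feast-bucket/staging
  iam_role: arn:aws:iam::123456789012:role/my-redshift-s3-role
```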
A data source is defined as part of the Feast Declarative API in the feature repo directory’s Python files. Now that we’ve configured our infrastructure, let’s register the zip code and credit history features we will use during training and serving.
Create a file called `features.py` within the `credit_scoring/` directory. Then add the following feature definitions to `features.py`:
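The following is a sketch of the zip code feature view; the full definitions, including the credit history feature view keyed on `dob_ssn`, are in the demo repository. The Redshift table name is an assumption, and import paths and argument names (for example, `event_timestamp_column`) differ between Feast versions; this sketch follows the API of Feast releases current at the time of writing:

```python
from datetime import timedelta

from feast import Entity, Feature, FeatureView, RedshiftSource, ValueType

# Entity used to join zip code features to loan applications
zipcode = Entity(name="zipcode", value_type=ValueType.INT64)

zipcode_source = RedshiftSource(
    table="zipcode_features",  # assumed table name; point at your own table or query
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created_timestamp",
)

zipcode_features = FeatureView(
    name="zipcode_features",
    entities=["zipcode"],
    ttl=timedelta(days=3650),
    features=[
        Feature(name="city", dtype=ValueType.STRING),
        Feature(name="state", dtype=ValueType.STRING),
        Feature(name="tax_returns_filed", dtype=ValueType.INT64),
        Feature(name="population", dtype=ValueType.INT64),
        Feature(name="wages", dtype=ValueType.INT64),
    ],
    batch_source=zipcode_source,
)

# A credit_history feature view is defined the same way, keyed on a dob_ssn entity.
```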
Feature views allow users to register their organization’s data sources with Feast, and then use those data sources for both training and online inference. The preceding feature view definition tells Feast where to find zip code and credit history features.
Now that we have defined our first feature view, we can apply the changes to create our feature registry and configure our infrastructure:
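Run the following from the feature repository directory:

```bash
feast apply
```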
The preceding `apply` command will:
- Store all entity and feature view definitions in a local file called `registry.db`.
- Create an empty DynamoDB table for serving zip code and credit history features.
- Ensure that your data sources on Redshift are available.
Building a training dataset
Our loan dataset contains our target variable, so we will load that first:
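Assuming the loan table is available as a Parquet file (the file path is an assumption; the demo repository ships the data alongside the code):

```python
import pandas as pd

# Load the historical loan table, which contains the loan_status target column
loans = pd.read_parquet("data/loan_table.parquet")
```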
But this dataset does not contain all the features we need in order to make an accurate scoring prediction. We also must join our zip code and credit history features, and we need to do so in a point-in-time correct way.
First, we create a feature store object from our feature repository:
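A minimal sketch, assuming the feature repository created earlier lives in `credit_scoring/`:

```python
from feast import FeatureStore

# Point Feast at the directory containing feature_store.yaml
fs = FeatureStore(repo_path="credit_scoring/")
```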
Then we identify the features we want to query from Feast:
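Feature references follow Feast’s `feature_view:feature` naming. The view names below match the definitions sketched earlier (the `credit_history` view name is an assumption):

```python
feast_features = [
    "zipcode_features:city",
    "zipcode_features:state",
    "zipcode_features:tax_returns_filed",
    "zipcode_features:population",
    "zipcode_features:wages",
    "credit_history:credit_card_due",
    "credit_history:mortgage_due",
    "credit_history:student_loan_due",
    "credit_history:vehicle_loan_due",
    "credit_history:hard_pulls",
    "credit_history:missed_payments_2y",
    "credit_history:missed_payments_1y",
    "credit_history:missed_payments_6m",
    "credit_history:bankruptcies",
]
```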
Then we make a query to Feast to enrich our loan dataset. Feast will automatically detect the `zipcode` and `dob_ssn` join columns and join the feature data in a point-in-time correct way. It does this by joining only features that were available at the time the loan was active.
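A sketch of that query, using the store and feature list defined above (the parameter names follow recent Feast releases and may differ in older versions):

```python
# The entity dataframe is the loan table; Feast joins features as of each
# loan's event_timestamp, so no future information leaks into training.
training_df = fs.get_historical_features(
    entity_df=loans,
    features=feast_features,
).to_df()
```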
Once we have retrieved the complete training dataset, we can (a sketch of these steps follows the list):
- Drop the timestamp columns and the `loan_id` column.
- Encode categorical features.
- Split the training dataframe into train, validation, and test sets.
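A minimal sketch of these steps; column names follow the tables above, and the exact preprocessing (including how `dob_ssn` is handled) in the demo repository may differ:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Drop identifier and timestamp columns that should not be used as features
train_df = training_df.drop(
    columns=["loan_id", "event_timestamp", "created_timestamp"], errors="ignore"
)

# Encode string-valued columns as integers (dob_ssn is treated as categorical here)
categorical = ["dob_ssn", "person_home_ownership", "loan_intent", "city", "state"]
train_df[categorical] = OrdinalEncoder().fit_transform(train_df[categorical])

X = train_df.drop(columns=["loan_status"])
y = train_df["loan_status"]

# Hold out validation and test sets
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```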
Finally, we can train our classifier:
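For example, with a tree-based classifier from scikit-learn (the model choice here is illustrative; see the repository for the actual training code):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

print("Validation accuracy:", clf.score(X_val, y_val))
```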
The full model training code is on GitHub.
DynamoDB as an online store
Before we can make online loan predictions with our credit scoring model, we must populate our online store with feature values. To load features into the online store, we use `materialize-incremental`:
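For example:

```bash
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```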
This command will load features from our zip code and credit history data sources up to `$CURRENT_TIME`. The materialize command can be called repeatedly as more data becomes available in order to keep the online store fresh.
Fetching a feature vector at low latency
Now we have everything we need to make a loan prediction.
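A sketch of the serving-time lookup and prediction, reusing the `loan_request` payload and `feast_features` list from earlier; the encoding of the combined feature vector is omitted for brevity, and parameter names may differ across Feast versions:

```python
# Fetch the latest zip code and credit history features from DynamoDB
feature_vector = fs.get_online_features(
    features=feast_features,
    entity_rows=[{
        "zipcode": loan_request["zipcode"],
        "dob_ssn": loan_request["dob_ssn"],
    }],
).to_dict()

# Combine the request features with the online features and score the loan
features = {**loan_request, **feature_vector}
# ...apply the same encoding used at training time, then:
# prediction = clf.predict(features_df)
```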
Conclusion
That’s it! We have a functional real-time credit scoring system.
Check out the Feast GitHub repository for the latest features, such as on-demand transformations, Feast server deployment to AWS Lambda, and support for streaming sources.
The complete end-to-end real-time credit scoring system is available on GitHub. Feel free to deploy it and try it out.
If you want to participate in the Feast community, join us on Slack, or read the Feast documentation to get a better understanding of how to use Feast.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.