Effective data lakes using AWS Lake Formation, Part 3: Using ACID transactions on governed tables
Data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for enterprise data and serve as a common choice for a large number of users querying from a variety of analytics and ML tools. Often, you want to ingest data continuously into the data lake from multiple sources and query against the data lake from many analytics tools concurrently. Previously, getting consistent results meant that you had to build custom pipelines to keep data consistent, which caused delays before the data became available to the consuming tools.
We announced AWS Lake Formation transactions, row-level security, and acceleration for preview at AWS re:Invent 2020. In previous posts, we focused on setting up Lake Formation governed tables (Part 1) and streaming ingestion into the data lake (Part 2). In this post, we focus on atomicity, consistency, isolation, and durability (ACID) transactions. We demonstrate how ACID transactions work on Lake Formation governed tables on Amazon S3.
ACID transactions in Lake Formation
AWS Lake Formation ACID transactions make it easy to keep S3 data updated while continuously ingesting new data and simultaneously running analytics queries that return consistent and up-to-date results. Lake Formation ACID transactions provide snapshot isolation, which ensures that concurrent transactions see a consistent version of the database. These versions are isolated from each other and run concurrently without impacting each other. ACID transactions enable multiple users to concurrently and reliably add and delete S3 objects in an atomic manner, while isolating any existing queries by maintaining read consistency for queries against the data lake.
Lake Formation introduces a new table type called governed tables that support ACID transactions on S3 objects. A transaction can have multiple operations and span multiple governed tables. Furthermore, a transaction can be performed by multiple users using multiple tools. If an operation within a transaction fails, the entire transaction is rolled back. Any transactions that are in progress are isolated from the changes performed by another transaction. Transactions protect the integrity of governed tables by ensuring that you always see a consistent view of your metadata and data in S3.
To illustrate read consistency with ACID transactions, consider the following example. Assume that a user, Andy, starts a query in Amazon Athena that scans all partitions of a table to aggregate sales figures. Assume that another user, Betty, adds a partition to the same table while Andy's scan is in progress and not yet completed. If Andy performed the query within a transaction, they see query results that reflect the state of the table at the start of the query; the results don't include data from Betty's newly added partition.
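The Andy/Betty scenario can be modeled in a few lines of Python. This is an illustrative sketch of snapshot isolation in general, not Lake Formation's implementation: each transaction pins the table version that was current when it began.

```python
# Toy model of snapshot isolation (hypothetical, not the Lake Formation
# implementation). A transaction reads the committed version of the table
# that was current when the transaction began.

class SnapshotTable:
    def __init__(self):
        self._versions = [[]]            # list of committed snapshots

    def begin(self):
        # A transaction pins the latest committed version at start time.
        return len(self._versions) - 1   # "transaction" = snapshot index

    def read(self, txn):
        return list(self._versions[txn])

    def commit_append(self, rows):
        # A writer commits a new version; running readers keep theirs.
        self._versions.append(self._versions[-1] + rows)

table = SnapshotTable()
table.commit_append(["partition-1", "partition-2"])

andy = table.begin()                     # Andy starts his scan
table.commit_append(["partition-3"])     # Betty adds a partition mid-scan

print(table.read(andy))                  # → ['partition-1', 'partition-2']
print(table.read(table.begin()))         # a new transaction sees all three
```

A transaction begun after Betty's commit sees the new partition, which matches the "most recent transaction ID" behavior demonstrated later in this post.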
Setting up resources with AWS CloudFormation
In this post, I demonstrate how you can create a new governed table and read from and write to it using an Amazon SageMaker notebook and an AWS Glue development endpoint. The setup for this demonstration is a fictional e-commerce company that sells multiple products. We use two separate notebooks to illustrate ACID transactions; both update and read the products table. Notebook1 starts off by populating the products table, which is then read and modified by Notebook2. Within the same transaction, Notebook1 then reads the products table and is isolated from the changes made by Notebook2. After Notebook2 has finished its transaction, Notebook1 starts a new transaction and sees the modified data.
This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.
The CloudFormation template generates the following resources:
- AWS Identity and Access Management (IAM) users, roles, and policies
- Lake Formation data lake settings and permissions
- An AWS Glue development endpoint
- An AWS Glue SageMaker notebook
When following the steps in this section, use the us-east-1 or us-west-2 Region because, as of this writing, this Lake Formation preview feature is available only in us-east-1 and us-west-2.
To create your resources, complete the following steps:
- Sign in to the CloudFormation console in the us-east-1 or us-west-2 Region.
- Choose Launch Stack:
- Choose Next.
- For DatalakeAdminUserName and DatalakeAdminUserPassword, enter the IAM user name and password for the data lake admin user.
- For DataLakeBucketName, enter the name of your data lake bucket.
- For DatabaseName, leave as the default.
- Choose Next.
- On the next page, choose Next.
- Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create.
Stack creation can take approximately 15 minutes.
Setting up sample notebooks on your SageMaker notebook
We also provide two Jupyter notebooks to replicate the same code in your environment.
- Sign in to the Amazon SageMaker console in the same Region using the DatalakeAdmin3 user.
- Choose Notebook instances.
- Select the notebook instance aws-glue-<CloudFormation stack name> and choose Open JupyterLab.
- Choose Terminal.
- Run the following command in the terminal:
Now you can see these two new notebooks in JupyterLab’s left pane.
Running ACID transactions on SageMaker notebook
To run your ACID transactions, complete the following steps:
- On the Amazon SageMaker console, choose Notebook instances.
- Select the notebook instance aws-glue-<CloudFormation stack name> and choose Open JupyterLab.
Let’s go through the notebooks and see how Lake Formation transactions work.
- Open the tab of notebook1.ipynb. This notebook simulates the ETL job and creates the products in the company catalog.
- In the first cell, replace datalake-bucket with the bucket name you used for the CloudFormation template's DataLakeBucketName parameter.
- Run the first cell for initialization.
When the first cell is completed, Spark application information is visible.
- Run the cell under Create a DynamicFrame from sample data to create sample data as a Spark DataFrame and convert it to an AWS Glue DynamicFrame.
Now you can see a sample DynamicFrame including three product records: Television, USB charger, and Blender.
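For reference, the three sample records can be approximated as plain Python dicts. The actual notebook builds a Spark DataFrame and converts it to a Glue DynamicFrame; the exact schema here is an assumption (only the product names come from the walkthrough, and the IDs follow the 00004/00005 pattern used later).

```python
# Approximation of the notebook's sample data as plain dicts.
# Schema and IDs are assumptions; the names match the walkthrough.
products = [
    {"product_id": "00001", "product_name": "Television"},
    {"product_id": "00002", "product_name": "USB charger"},
    {"product_id": "00003", "product_name": "Blender"},
]

print([p["product_name"] for p in products])
# → ['Television', 'USB charger', 'Blender']
```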
- Run the cell under Begin a new transaction to start a new Lake Formation transaction by calling the begin_transaction API.
The transaction ID is printed out. We use this transaction ID for writing the data.
- Run the cell under Write the DynamicFrame into a Governed table to write the sample DynamicFrame into a new Lake Formation governed table.
A new governed table is automatically created. You do not have to create this new table in advance in the Lake Formation console or the AWS SDK.
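The begin-and-write flow above can be sketched as follows. This is a hedged sketch: the client and option names (begin_transaction, transactionId) follow the preview as described in this post and may differ in your SDK version, and the database name is a placeholder. Only the small helper at the top is runnable outside a Glue environment.

```python
# Hedged sketch of notebook1's write step. The AWS calls are shown as
# comments because they require a Glue development endpoint and
# credentials; names marked as placeholders are assumptions.

def governed_write_options(tx_id):
    """Build the additional options that tie a Glue write to a transaction."""
    return {"transactionId": tx_id}

# Inside the Glue notebook, the flow looks roughly like:
#   tx_id = lf_client.begin_transaction()["TransactionId"]   # preview API
#   glue_context.write_dynamic_frame.from_catalog(
#       frame=products_dyf,
#       database="lakeformation_tutorial",    # placeholder database name
#       table_name="products",
#       additional_options=governed_write_options(tx_id))

print(governed_write_options("example-tx-id"))
# → {'transactionId': 'example-tx-id'}
```

Passing the transaction ID with the write is what lets Lake Formation keep the new S3 objects invisible to other readers until the transaction commits.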
- On the Lake Formation console, choose Tables.
You can see the products table on the Lake Formation console.
- On the Amazon S3 console, choose your data lake bucket.
You can see that new data has also been ingested into Amazon S3.
- Go back to JupyterLab, and run the cell under Get transaction ID on notebook1 to start another transaction.
- Run the cell under Read original table from notebook1 to read from the governed table.
You can see the data being read from the governed table using the transaction ID.
Next, let's open another notebook and run its cells to simulate another user's query activity.
- Open the browser tab of notebook2.ipynb.
- In the first cell, replace datalake-bucket with the bucket name you used for the CloudFormation template's DataLakeBucketName parameter.
- Run the first cell for initialization.
- Run the cell under Get transaction ID on notebook 2 to start a new Lake Formation transaction. This ensures that notebook2 uses its own transaction ID, within which all of its operations occur.
- Run the cell under Read original table from notebook 2 to read from the governed table you created in notebook1.ipynb. This read uses a new transaction ID, different from the one used in notebook 1.
You can see the data being read from the governed table which you created in notebook1.
- Run the subsequent two cells under Notebook 2 – Write 2 additional rows to write two new records (product_id=00004 and product_id=00005) into the governed table.
Note that this transaction is not committed yet.
- Run the next cell to read from the governed table using the same transaction ID, which is not committed yet.
You can see the uncommitted changes (product_id=00004 and product_id=00005) because you're using the same transaction ID that you used to write the records.
Then let’s move to notebook1.ipynb and see how the uncommitted changes look in the other user’s notebook.
- Switch to the browser tab of notebook1.ipynb.
- Run the cell under No change – waiting for notebook 2 to write to confirm that you don’t see the two new records yet.
This is because the transaction you performed in notebook2.ipynb hasn’t been committed.
Now let’s move to notebook2.ipynb to commit the ongoing transaction to see how it looks in notebook1.ipynb after the commit.
- Switch to the browser tab of notebook2.ipynb.
- Run the cell under Commit notebook 2 updates to commit the change you made in notebook2.ipynb.
- Switch to the browser tab of notebook1.ipynb.
- Run the cell under Notebook 2 committed but still notebook 1 pointing original transaction ID to confirm that you don’t see the two new records yet.
This is because you're using the original transaction ID txId2 when reading the data. This ensures that you are isolated from changes happening outside of this transaction.
- Run the subsequent cells under Notebook 1 gets new transaction ID which reflects the changes from notebook 2 to commit txId2, start another new transaction txId3, and read the data committed in Notebook 2.
Now you can see the committed change.
- Switch to the browser tab of Notebook 2.
- Run the subsequent cells under Both notebooks using the most recent transaction ID to start another new transaction, and read the data.
Now you can see that both notebooks show the latest data when using the most recent transaction ID.
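The behavior the two notebooks just demonstrated can be condensed into a toy model. This is illustrative only, not the Lake Formation implementation: writes made under a transaction ID are visible to that transaction immediately, to other transactions only after commit, and then only through a transaction begun after the commit.

```python
# Toy model of the two-notebook walkthrough (illustrative only).
# Transaction IDs mirror the ones in the post; "nb2" stands in for
# notebook 2's transaction.

class ToyGovernedTable:
    def __init__(self, rows):
        self.committed = [list(rows)]   # committed snapshots
        self.pending = {}               # tx_id -> uncommitted rows
        self.snapshots = {}             # tx_id -> pinned snapshot index

    def begin(self, tx_id):
        self.snapshots[tx_id] = len(self.committed) - 1
        self.pending[tx_id] = []

    def write(self, tx_id, rows):
        self.pending[tx_id].extend(rows)

    def read(self, tx_id):
        # Own uncommitted writes layer on top of the pinned snapshot.
        return self.committed[self.snapshots[tx_id]] + self.pending[tx_id]

    def commit(self, tx_id):
        if self.pending[tx_id]:         # read-only commits change nothing
            self.committed.append(self.read(tx_id))
        del self.pending[tx_id], self.snapshots[tx_id]

t = ToyGovernedTable(["00001", "00002", "00003"])
t.begin("txId2")                        # notebook 1
t.begin("nb2")                          # notebook 2
t.write("nb2", ["00004", "00005"])
assert "00004" in t.read("nb2")         # notebook 2 sees its own writes
assert "00004" not in t.read("txId2")   # notebook 1 is isolated
t.commit("nb2")
assert "00004" not in t.read("txId2")   # still pinned to the old snapshot
t.commit("txId2")
t.begin("txId3")                        # new transaction after the commit
assert "00004" in t.read("txId3")       # now the change is visible
```

The assertions trace the exact sequence of the walkthrough: isolation before commit, isolation of an already-running transaction after commit, and visibility in a fresh transaction.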
Cleaning up
Now for the final step: cleaning up the resources.
- Empty the Amazon S3 data lake bucket, then delete the bucket.
- Delete the CloudFormation stack.
The governed table you created is also automatically deleted when you delete the CloudFormation stack.
Conclusion
In this post, we explained how to use Lake Formation transactions and illustrated atomicity and isolation between two applications simultaneously reading from and writing to the same table. With Lake Formation transactions, you can achieve ACID transactions over multiple tables and statements, and you can get consistent and up-to-date results.
Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) and US West (Oregon) AWS Regions. To get early access to these capabilities, sign up for the preview.
Appendix: Setting up resources via the console
When following the steps in this section, use the us-east-1 or us-west-2 Region because, as of this writing, this Lake Formation preview feature is available only in us-east-1 and us-west-2.
Configuring IAM roles and users
First, you need to set up two IAM roles: one for AWS Glue ETL jobs and another for the Lake Formation data lake location.
IAM policies
To create the IAM policies for your roles, complete the following steps:
- On the IAM console, create a new policy for Amazon S3.
- Save the IAM policy as S3DataLakePolicy using the instructions below (replace <datalake-bucket> with your bucket name):
- Create a new IAM policy named LFTransactionPolicy with the following statements:
- Create a new IAM policy named LFLocationPolicy with the following statements:
- Create a new IAM policy named LFQueryPolicy with the following statements:
IAM role for the AWS Glue development endpoint
Create a new IAM role for the AWS Glue development endpoint:
- On the IAM console, create the role GlueETLServiceRole with the following trust relationship for AWS Glue:
- Attach the following AWS managed policies:
AWSGlueServiceRole
AWSLakeFormationDataAdmin
- Attach the following customer managed policies:
S3DataLakePolicy
LFTransactionPolicy
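The trust relationship for AWS Glue is the standard service trust policy (allowing glue.amazonaws.com to assume the role). Here it is expressed as the policy document you would pass to IAM when creating GlueETLServiceRole; the boto3 call is shown commented out and should only be run with appropriate credentials.

```python
# Standard service trust policy for AWS Glue, for use when creating
# GlueETLServiceRole. Creating the role via boto3 is optional and is
# shown commented out.
import json

glue_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# import boto3
# iam = boto3.client("iam")
# iam.create_role(
#     RoleName="GlueETLServiceRole",
#     AssumeRolePolicyDocument=json.dumps(glue_trust_policy))

print(json.dumps(glue_trust_policy, indent=2))
```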
IAM role for AWS Lake Formation data lake location
To create your IAM role for Lake Formation, complete the following steps:
- Create a new IAM role called LFRegisterLocationServiceRole with the following Lake Formation trust relationship:
- Attach the following customer managed policies:
S3DataLakePolicy
LFLocationPolicy
IAM users
To create your IAM users, complete the following steps:
- Create an IAM user named DatalakeAdmin3.
- Attach the following AWS managed policies:
AWSLakeFormationDataAdmin
AmazonAthenaFullAccess
AWSGlueConsoleFullAccess
AWSGlueConsoleSageMakerNotebookFullAccess
IAMReadOnlyAccess
- Attach the customer managed policy LFQueryPolicy.
Creating your AWS Glue development endpoint
To create your AWS Glue development endpoint, complete the following steps:
- On the AWS Glue console, choose Dev endpoints.
- Choose Add endpoint.
- For Development endpoint name, enter acid-transaction.
- For IAM role, choose GlueETLServiceRole.
- Under Security configuration, script libraries, and job parameters (optional), for Worker type, choose G.1X.
- Choose Next.
- For How do you want to provide the network information used to create the development endpoint?, leave as the default, and choose Next.
- Choose Next.
- Choose Finish.
Development endpoint creation can take up to 10 minutes to complete.
Creating your AWS Glue SageMaker notebook
To create your AWS Glue SageMaker notebook, complete the following steps:
- On the AWS Glue console, choose Notebooks.
- Under SageMaker notebooks, choose Create notebook.
- For Notebook name, enter acid-transactions-notebook.
- For Attach to development endpoint, choose acid-transaction.
- Select Create an IAM role.
- For IAM role, enter acid-transactions-notebook.
- Choose Create notebook.
Notebook creation can take up to 5 minutes to complete.
About the Author
Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue & Lake Formation. He loves distributed data processing systems, especially Apache Spark and Apache Hadoop. He enjoys working on launching new features by collaborating with talented engineers across multiple teams.