Effective data lakes using AWS Lake Formation, Part 1: Getting started with governed tables
Thousands of customers are building their data lakes on Amazon Simple Storage Service (Amazon S3). You can use AWS Lake Formation to build your data lakes easily—in a matter of days as opposed to months. However, there are still some difficult challenges to address with your data lakes:
- Supporting streaming updates and deletes in your data lakes, for example, database replication, and supporting privacy regulations such as GDPR and CCPA
- Achieving fine-grained secure sharing not only with table-level or column-level access control, but with row-level access control
- Optimizing the layout of various tables and files on Amazon S3 to improve analytics performance
We announced Lake Formation transactions, row-level security, and acceleration for preview at AWS re:Invent 2020. These capabilities are available via new update and access APIs that extend the governance capabilities of Lake Formation with row-level security, and provide transactions over data lakes.
In this series of posts, we provide step-by-step instructions on how to use these new capabilities. This post focuses on setting up governed tables.
Governed Table
The Data Catalog now supports a new type of table: governed table. Governed tables are a new Amazon S3 table type that supports atomic, consistent, isolated, and durable (ACID) transactions. Lake Formation transactions simplify ETL script and workflow development, and allow multiple users to concurrently and reliably insert, delete, and modify multiple governed tables. Lake Formation automatically compacts and optimizes storage of governed tables in the background to improve query performance.
Setting up resources with AWS CloudFormation
In this post, we demonstrate how you can create a new governed table using existing data on Amazon S3. In particular, we focus on creating a simple governed table based on a public S3 bucket to get you started without incurring any Amazon S3 storage costs. We use the Amazon Customer Reviews Dataset as sample data. Because we are using a public S3 bucket, we limit this post to read-only use cases. In real-world use, you may want to do more, such as putting objects into your own S3 buckets, adding them to the governed table, and enabling compaction.
This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.
The CloudFormation template generates the following resources:
- AWS Identity and Access Management (IAM) users, roles, and policies
- AWS Lake Formation data lake settings and permissions
When following the steps in this section, use the Region us-east-1 because, as of this writing, these Lake Formation preview features are available only in us-east-1. Please check the availability of the features in other Regions in the future.
To create your resources, complete the following steps:
- Sign in to the CloudFormation console in the us-east-1 Region.
- Choose Launch Stack.
- Choose Next.
- For DatalakeAdminUserName and DatalakeAdminUserPassword, enter the IAM user name and password for the data lake admin user.
- For DataAnalystUserName and DataAnalystUserPassword, enter the IAM user name and password for the data analyst user.
- For DatabaseName, leave as the default.
- Choose Next.
- On the next page, choose Next.
- Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create.
Stack creation can take up to 2 minutes.
Setting up a governed table
Now you can create and configure your first governed table in AWS Lake Formation.
Creating a governed table
To create your governed table, complete the following steps:
- Sign in to the Lake Formation console in the us-east-1 Region using the DatalakeAdmin1 user.
- Choose Tables.
- Choose Create table.
- For Name, enter amazon_reviews_governed.
- For Database, enter lakeformation_tutorial_amazon_reviews.
- Select Enable governed data access and management.
- Select Enable row based permissions.
- For Data is located in, choose Specified path in another account.
- Enter the path s3://amazon-reviews-pds/parquet/.
- For Classification, choose PARQUET.
- Choose Upload Schema.
- Enter the following JSON array into the text box:
- Choose Upload.
- Choose Add column.
- For Column name, enter product_category.
- For Data type, choose String.
- Select Partition Key.
- Choose Add.
- Choose Submit.
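As a rough sketch of what a schema JSON array for the Upload Schema step could look like, the following Python snippet generates one. The column names follow the published schema of the Amazon Customer Reviews Dataset. The column types and the Name/Type key layout are our assumptions for illustration, so compare them with the schema you actually intend to upload.

```python
import json

# Column names follow the Amazon Customer Reviews Dataset documentation.
# The types and the {"Name", "Type"} layout are assumptions for illustration.
# product_category is omitted because the steps add it as a partition key.
COLUMNS = [
    ("marketplace", "string"),
    ("customer_id", "string"),
    ("review_id", "string"),
    ("product_id", "string"),
    ("product_parent", "string"),
    ("product_title", "string"),
    ("star_rating", "int"),
    ("helpful_votes", "int"),
    ("total_votes", "int"),
    ("vine", "string"),
    ("verified_purchase", "string"),
    ("review_headline", "string"),
    ("review_body", "string"),
    ("review_date", "bigint"),
    ("year", "int"),
]


def build_schema(columns):
    """Return one {"Name", "Type"} dictionary per (name, type) pair."""
    return [{"Name": name, "Type": col_type} for name, col_type in columns]


schema_json = json.dumps(build_schema(COLUMNS), indent=2)
print(schema_json)
```

Keeping the columns in one list makes it easy to adjust a type and regenerate the JSON before pasting it into the console.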
Now you can see that the new governed table has been created.
When you choose the table name, you can see the details of the governed table, including Governance: Enabled. This means that the table is a Lake Formation governed table. Tables that are not governed show Governance: Disabled.
You can also see lakeformation.aso.status: true under Table properties. This means that automatic compaction is enabled for this table. For the example in this post, we don’t need automatic compaction. To disable automatic compaction, complete the following steps:
- Choose Edit table.
- Deselect Automatic compaction.
- Choose Save.
Currently, this governed table does not contain any data or partitions. In the next step, we add existing S3 objects to this governed table using manifest APIs.
Even if your data is located in the table location of the governed table, the data isn’t recognized until you add it to the governed table. Before adding objects to the governed table, let’s configure Lake Formation permissions to grant required permissions to users.
Configuring Lake Formation permissions
You need to grant Lake Formation permissions to your governed table. Complete the following steps:
Table-level permissions
- Sign in to the Lake Formation console in the us-east-1 Region using the DatalakeAdmin1 user.
- Under Permissions, choose Data permissions.
- Under Data permission, choose Grant.
- For Database, choose lakeformation_tutorial_amazon_reviews.
- For Table, choose amazon_reviews_governed.
- For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the user DatalakeAdmin1.
- Select Table permissions.
- Under Table permissions, select Alter, Insert, Drop, Delete, Select, and Describe.
- Choose Grant.
- Under Data permission, choose Grant.
- For Database, choose lakeformation_tutorial_amazon_reviews.
- For Table, choose amazon_reviews_governed.
- For IAM users and roles, choose the user DataAnalyst1.
- Under Table permissions, select Select and Describe.
- Choose Grant.
Row-level permissions
- Under Permissions, choose Data permissions.
- Under Data permission, choose Grant.
- For Database, choose lakeformation_tutorial_amazon_reviews.
- For Table, choose amazon_reviews_governed.
- For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the users DatalakeAdmin1 and DataAnalyst1.
- Select Row-based permissions.
- For Filter name, enter allowAll.
- For Choose filter type, select Allow access to all rows.
- Choose Grant.
Adding table objects into the governed table
To add S3 objects to a governed table, you need to call the UpdateTableObjects API. You can call it using the AWS Command Line Interface (AWS CLI) or an AWS SDK, or through the AWS Glue ETL library (which calls the API implicitly). For this post, we use the AWS CLI to explain the behavior at the API level. If you don’t have the AWS CLI, see Installing, updating, and uninstalling the AWS CLI. You also need to install the service model file provided in the Lake Formation preview program. Run the following commands using the credentials of the DatalakeAdmin1 user.
First, begin a new transaction with the BeginTransaction API:
Now you can add files to this table within this transaction. For this post, we choose one sample partition, product_category=Camera, from the amazon-reviews-pds bucket, and choose one file under this partition. You need to know the Uri, ETag, and Size of each file you add, so let’s find this information and copy it.
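A minimal sketch of collecting these three values in Python follows. It assumes you already have a response dictionary shaped like the output of the S3 HeadObject API, which reports the entity tag under ETag (wrapped in double quotes) and the size in bytes under ContentLength. The object key and response values below are hypothetical, and stripping the surrounding quotes from the ETag is our assumption about the form the manifest expects.

```python
def object_details(bucket, key, head_response):
    """Collect the Uri, ETag, and Size of one S3 object.

    head_response is a dict shaped like an S3 HeadObject response.
    """
    return {
        "Uri": f"s3://{bucket}/{key}",
        # Assumption: the surrounding double quotes are not part of the value.
        "ETag": head_response["ETag"].strip('"'),
        "Size": head_response["ContentLength"],
    }


# Hypothetical key and response values, for illustration only.
example_response = {"ETag": '"0123456789abcdef-8"', "ContentLength": 1024}
details = object_details(
    "amazon-reviews-pds",
    "parquet/product_category=Camera/example-file.snappy.parquet",
    example_response,
)
print(details)
```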
Create a new file named write-operations1.json and enter the following JSON (replace Uri, ETag, and Size with the values you copied):
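The following Python sketch generates a write-operations document of this kind. The AddObject layout with Uri, ETag, Size, and PartitionValues keys is an assumption based on the UpdateTableObjects request shape, and the angle-bracket values are placeholders, not real object details.

```python
import json


def add_object_operation(uri, etag, size, partition_values):
    """Describe one S3 object to add to the governed table."""
    return {
        "AddObject": {
            "Uri": uri,
            "ETag": etag,
            "Size": size,
            "PartitionValues": list(partition_values),
        }
    }


# Placeholder values: replace them with the Uri, ETag, and Size you copied.
operations = [
    add_object_operation(
        uri="s3://amazon-reviews-pds/parquet/product_category=Camera/<file-name>",
        etag="<etag>",
        size=0,
        partition_values=["Camera"],
    )
]

write_operations_json = json.dumps(operations, indent=4)
print(write_operations_json)
```

You can redirect the printed output to write-operations1.json. The partition value Camera corresponds to the product_category partition key defined on the table.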
To add the file to the governed table, call the UpdateTableObjects API with the write-operations1.json file you just created (replace <transaction-id> with the transaction ID returned by the begin-transaction command):
Note the current date and time right after making the UpdateTableObjects API call here. We use this timestamp for time travel queries later in this example. You can inspect changes before a transaction is committed by making the GetTableObjects API call with the same transaction ID (replace <transaction-id> with the transaction ID returned by the begin-transaction command):
Now let’s commit the transaction so that this data is available to other users outside this transaction. To do this, call the CommitTransaction API (replace <transaction-id> with the transaction ID returned by the begin-transaction command):
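To make the visibility rules concrete, here is a toy in-memory model of the begin, update, inspect, and commit sequence. It is not the Lake Formation API. The class, its method names, and the example URI are hypothetical, and the model only illustrates that objects added in a transaction stay invisible to other readers until the commit.

```python
import itertools


class ToyGovernedTable:
    """In-memory stand-in for a governed table's object manifest."""

    def __init__(self):
        self._committed = []   # objects visible to every reader
        self._pending = {}     # transaction ID -> uncommitted objects
        self._ids = itertools.count(1)

    def begin_transaction(self):
        tx_id = f"tx-{next(self._ids)}"
        self._pending[tx_id] = []
        return tx_id

    def update_table_objects(self, tx_id, uris):
        self._pending[tx_id].extend(uris)

    def get_table_objects(self, tx_id=None):
        # Readers inside a transaction also see that transaction's changes.
        visible = list(self._committed)
        if tx_id is not None:
            visible += self._pending[tx_id]
        return visible

    def commit_transaction(self, tx_id):
        self._committed.extend(self._pending.pop(tx_id))


table = ToyGovernedTable()
tx = table.begin_transaction()
table.update_table_objects(tx, ["s3://example-bucket/camera/file1.parquet"])
inside_before = table.get_table_objects(tx)   # sees the new object
outside_before = table.get_table_objects()    # sees nothing yet
table.commit_transaction(tx)
outside_after = table.get_table_objects()     # now sees the object
print(inside_before, outside_before, outside_after)
```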
To simplify the example, we only add one partition with one file. In real-world usage, you may want to add all the files within all the partitions that you need.
Add another partition with the following commands:
- Call the BeginTransaction API to start another Lake Formation transaction:
- List the Amazon S3 objects located in the amazon-reviews-pds bucket to choose another sample file:
- Call the HeadObject API against one sample file to copy its ETag and Size:
- Create a new file named write-operations2.json and enter the following JSON (replace Uri, ETag, and Size with the values you copied):
- Call the UpdateTableObjects API using write-operations2.json (replace <transaction-id> with the transaction ID returned by the begin-transaction command):
- Call the CommitTransaction API to commit this transaction, as you did for the first one, so that the new file becomes visible to other users.
Querying the governed table using Amazon Athena
Now your governed table is ready! Let’s start querying the governed table using Amazon Athena. Sign in to the Athena console in the us-east-1 Region as the DataAnalyst1 user.
If it’s the first time you are running queries in Athena, you need to configure a query result location. For more information, see Specifying a Query Result Location.
To use the Lake Formation preview features, you need to create a special workgroup named AmazonAthenaLakeFormationPreview and join it. For more information, see Managing Workgroups.
Running a simple query
Sign in to the Athena console in the us-east-1 Region as the DataAnalyst1 user. First, let’s preview 10 records stored in the governed table:
You should see query results like the following.
Running an analytic query
Next, let’s run an analytic query with aggregation to simulate real-world use cases:
The following screenshot shows the results. This query returned the total number of reviews and average rating per product category.
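The two queries described above could look roughly like the SQL built by this Python sketch. The database and table names come from this post. The star_rating column and the output aliases are assumptions based on the dataset schema, so adjust them to match your table.

```python
TABLE = "lakeformation_tutorial_amazon_reviews.amazon_reviews_governed"


def preview_query(table, limit=10):
    """SQL that previews the first rows of a table."""
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


def category_summary_query(table):
    """SQL that counts reviews and averages ratings per product category."""
    return (
        "SELECT product_category, "
        "count(*) AS total_reviews, "
        "avg(star_rating) AS average_rating "  # star_rating is an assumed column
        f"FROM {table} "
        "GROUP BY product_category"
    )


print(preview_query(TABLE))
print(category_summary_query(TABLE))
```

You can paste the printed statements into the Athena query editor.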
Running an analytic query with time travel
Governed tables enable time travel: you can query a table as of a previous time. To do this, in Athena, add a WHERE clause that sets the column __asOfDate to the epoch time (long integer, in milliseconds) representation of the required date and time. Let’s run the time travel query (replace <epoch-milliseconds> with a timestamp taken right after you made the CommitTransaction call associated with the first UpdateTableObjects call; to retrieve the epoch milliseconds, see the tips after the screenshots in this post):
The following screenshot shows the query results. The result only includes the record for product_category=Camera. This is because the file under product_category=Books was added after the timestamp (1612267920000 ms = 2021/02/02 12:12:00 UTC) specified in the time travel column __asOfDate.
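As a sketch, a time travel query can be assembled as follows. The equality comparison on __asOfDate follows the description above. The aggregate columns and aliases are assumptions carried over from the dataset schema, not taken from the original query.

```python
def time_travel_query(table, as_of_epoch_ms):
    """SQL that summarizes the table as of a point in time (epoch milliseconds)."""
    if not isinstance(as_of_epoch_ms, int) or as_of_epoch_ms < 0:
        raise ValueError("as_of_epoch_ms must be a non-negative integer")
    return (
        "SELECT product_category, "
        "count(*) AS total_reviews, "
        "avg(star_rating) AS average_rating "  # star_rating is an assumed column
        f"FROM {table} "
        f"WHERE __asOfDate = {as_of_epoch_ms} "
        "GROUP BY product_category"
    )


query = time_travel_query(
    "lakeformation_tutorial_amazon_reviews.amazon_reviews_governed",
    1612267920000,  # 2021/02/02 12:12:00 UTC, the example timestamp in this post
)
print(query)
```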
To get the epoch time from the command line, you can run one of the following commands.
The following command is for Linux (GNU date command):
The following command is for OSX (BSD date command):
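If you prefer a platform-independent alternative to the date commands, the same conversion can be sketched in Python. The helper name is ours. The example value matches the timestamp quoted earlier in this post.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


as_of = to_epoch_millis(datetime(2021, 2, 2, 12, 12, 0, tzinfo=timezone.utc))
print(as_of)  # 1612267920000
```

To capture the current time instead, pass datetime.now(timezone.utc).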
Cleaning up
Now to the final step, cleaning up the resources.
- Delete the CloudFormation stack. The governed table you created is automatically deleted with the stack.
- Delete the Athena workgroup AmazonAthenaLakeFormationPreview.
Conclusion
In this blog post, we explained how to create a Lake Formation governed table with existing data in an AWS public dataset. In addition, we explained how to query governed tables and how to run time travel queries for governed tables. In Part 2 of this series, we will show you how to create a governed table to ingest streaming data and demonstrate how Lake Formation transactions can simplify streaming ETL.
Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) AWS Region. To get early access to these capabilities, please sign up for the preview.
Appendix: Setting up resources via the console
Configuring IAM roles and IAM users
First, you need to set up two IAM roles: one for AWS Glue ETL jobs and another for the Lake Formation data lake location.
IAM policies
To create your policies, complete the following steps:
- On the IAM console, create a new policy for Amazon S3.
- Save the policy as S3DataLakePolicy as follows:
- Create a new IAM policy named LFLocationPolicy with the following statements:
- Create a new IAM policy named LFQueryPolicy with the following statements:
IAM role for AWS Lake Formation
To create your IAM role for the Lake Formation data lake location, complete the following steps:
- Create a new Lake Formation role called LFRegisterLocationServiceRole with a Lake Formation trust relationship:
- Attach the customer managed policies S3DataLakePolicy and LFLocationPolicy you created in the previous step.
This role is used to register locations with Lake Formation, which in turn performs credential vending for Athena at query time.
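A trust relationship that lets the Lake Formation service assume the role typically looks like the document produced by this sketch. It uses the standard IAM trust policy layout with the lakeformation.amazonaws.com service principal. Whether the original role trusts additional services is not covered here, so treat it as a starting point.

```python
import json


def lake_formation_trust_policy():
    """Trust policy allowing the Lake Formation service to assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lakeformation.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(lake_formation_trust_policy(), indent=2)
print(trust_policy_json)
```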
IAM users
To create your users, complete the following steps:
- Create an IAM user named DatalakeAdmin.
- Attach the following AWS managed policies:
  - AWSLakeFormationDataAdmin
  - AmazonAthenaFullAccess
  - IAMReadOnlyAccess
- Attach the customer managed policy LFQueryPolicy.
- Create an IAM user named DataAnalyst that can use Athena to query data.
- Attach the AWS managed policy AmazonAthenaFullAccess.
- Attach the customer managed policy LFQueryPolicy.
Configuring Lake Formation
If you’re new to Lake Formation, you can follow the steps below to get started with AWS Lake Formation.
- On the Lake Formation console, under Permissions, choose Admins and database creators.
- In the Data lake administrators section, choose Grant.
- For IAM users and roles, choose your IAM user DatalakeAdmin.
- Choose Save.
- In the Database creators section, choose Grant.
- For IAM users and roles, choose the LFRegisterLocationServiceRole.
- Select Create Database.
- Choose Grant.
- Under Register and ingest, choose Data lake locations.
- Choose Register location.
- For Amazon S3 path, enter your Amazon S3 path to the bucket where your data is stored. This needs to be the same bucket you listed in LFLocationPolicy. Lake Formation uses this role to vend temporary Amazon S3 credentials to query services that need read/write access to the bucket and all prefixes under it.
- For IAM role, choose the LFRegisterLocationServiceRole.
- Choose Register location.
- Under Data catalog, choose Settings.
- Make sure that both check boxes for Use only IAM access control for new databases and Use only IAM access control for new tables in new databases are deselected.
- Under Data catalog, choose Databases.
- Choose Create database.
- Select Database.
- For Name, enter lakeformation_tutorial_amazon_reviews.
- Choose Create database.
About the Author
Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue & Lake Formation. His passion is for implementing software artifacts for building data lakes more effectively and easily. During his spare time, he loves to spend time with his family, especially hunting bugs—not software bugs, but bugs like butterflies, pill bugs, snails, and grasshoppers.