Effective data lakes using AWS Lake Formation, Part 1: Getting started with governed tables

Mar 05 2021


Thousands of customers are building their data lakes on Amazon Simple Storage Service (Amazon S3). You can use AWS Lake Formation to build your data lakes easily—in a matter of days as opposed to months. However, there are still some difficult challenges to address with your data lakes:

Supporting streaming updates and deletes in your data lakes, for example, database replication, and supporting privacy regulations such as GDPR and CCPA
Achieving fine-grained secure sharing not only with table-level or column-level access control, but with row-level access control
Optimizing the layout of various tables and files on Amazon S3 to improve analytics performance

We announced Lake Formation transactions, row-level security, and acceleration for preview at AWS re:Invent 2020. These capabilities are available via new update and access APIs that extend the governance capabilities of Lake Formation with row-level security, and provide transactions over data lakes.

In this series of posts, we provide step-by-step instructions on how to use these new capabilities. This post focuses on setting up governed tables.

Governed Table

The Data Catalog now supports a new type of table: governed table. Governed tables are a new Amazon S3 table type that supports atomic, consistent, isolated, and durable (ACID) transactions. Lake Formation transactions simplify ETL script and workflow development, and allow multiple users to concurrently and reliably insert, delete, and modify multiple governed tables. Lake Formation automatically compacts and optimizes storage of governed tables in the background to improve query performance.

Setting up resources with AWS CloudFormation

In this post, we demonstrate how you can create a new governed table using existing data on Amazon S3. In particular, we focus on creating a simple governed table based on a public S3 bucket to get you started without incurring any Amazon S3 storage costs. We use the Amazon Customer Reviews Dataset as sample data. Because we are using a public S3 bucket, we limit this post to read-only use cases. In the real world, you may want to do more by putting objects in your own S3 buckets, adding them to the governed table, and enabling compaction.

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.

The CloudFormation template generates the following resources:

AWS Identity and Access Management (IAM) users, roles, and policies
AWS Lake Formation data lake settings and permissions

When following the steps in this section, use the Region us-east-1 because, as of this writing, these Lake Formation preview features are available only in us-east-1. Check the availability of the features in other Regions in the future.

To create your resources, complete the following steps:

Sign in to the CloudFormation console in us-east-1 Region.
Choose Launch Stack:
Choose Next.
For DatalakeAdminUserName and DatalakeAdminUserPassword, enter your IAM user name and password for data lake admin user.
For DataAnalystUserName and DataAnalystUserPassword, enter your IAM user name and password for data analyst user.
For DatabaseName, leave as the default.
Choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation can take up to 2 minutes.

Setting up a governed table

Now you can create and configure your first governed table in AWS Lake Formation.

Creating a governed table

To create your governed table, complete the following steps:

Sign in to the Lake Formation console in us-east-1 Region using the DatalakeAdmin1 user.
Choose Tables.
Choose Create table.
For Name, enter amazon_reviews_governed.
For Database, enter lakeformation_tutorial_amazon_reviews.
Select Enable governed data access and management.
Select Enable row based permissions.

For Data is located in, choose Specified path in another account.
Enter the path s3://amazon-reviews-pds/parquet/.
For Classification, choose PARQUET.
Choose Upload Schema.
Enter the following JSON array into the text box:

[ { "Name": "marketplace", "Type": "string" }, { "Name": "customer_id", "Type": "string" }, { "Name": "review_id", "Type": "string" }, { "Name": "product_id", "Type": "string" }, { "Name": "product_parent", "Type": "string" }, { "Name": "product_title", "Type": "string" }, { "Name": "star_rating", "Type": "int" }, { "Name": "helpful_votes", "Type": "int" }, { "Name": "total_votes", "Type": "int" }, { "Name": "vine", "Type": "string" }, { "Name": "verified_purchase", "Type": "string" }, { "Name": "review_headline", "Type": "string" }, { "Name": "review_body", "Type": "string" }, { "Name": "review_date", "Type": "bigint" }, { "Name": "year", "Type": "int" }
]

Choose Upload.
Choose Add column.
For Column name, enter product_category.
For Data type, choose String.
Select Partition Key.
Choose Add.
Choose Submit.

Now you can see that the new governed table has been created.

When you choose the table name, you can see the details of the governed table, and you can also see Governance: Enabled in this view. This means that this table is a Lake Formation governed table. Tables that are not governed should show as Governance: Disabled.

You can also see lakeformation.aso.status: true under Table properties. This means that automatic compaction is enabled for this table. For the example in this post, we don’t need automatic compaction. To disable automatic compaction, complete the following steps:

Choose Edit table.
Deselect Automatic compaction.
Choose Save.

Currently, this governed table does not contain any data or partitions. In the next step, we add existing S3 objects to this governed table using manifest APIs.

Even if your data is located in the table location of the governed table, the data isn’t recognized until you add it to the governed table. Before adding objects to the governed table, let’s configure Lake Formation permissions to grant required permissions to users.

Configuring Lake Formation permissions

You need to grant Lake Formation permissions to your governed table. Complete the following steps:

Table-level permissions

Sign in to the Lake Formation console in us-east-1 Region using the DatalakeAdmin1 user.
Under Permissions, choose Data permissions.
Under Data permission, choose Grant.
For Database, choose lakeformation_tutorial_amazon_reviews.
For Table, choose amazon_reviews_governed.
For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the user DatalakeAdmin1.
Select Table permissions.
Under Table permissions, select Alter, Insert, Drop, Delete, Select, and Describe.
Choose Grant.
Under Data permission, choose Grant.
For Database, choose lakeformation_tutorial_amazon_reviews.
For Table, choose amazon_reviews_governed.
For IAM users and roles, choose the user DataAnalyst1.
Under Table permissions, select Select and Describe.
Choose Grant.

Row-level permissions

Under Permissions, choose Data permissions.
Under Data permission, choose Grant.
For Database, choose lakeformation_tutorial_amazon_reviews.
For Table, choose amazon_reviews_governed.
For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the users DatalakeAdmin1 and DataAnalyst1.
Select Row-based permissions.
For Filter name, enter allowAll.
For Choose filter type, select Allow access to all rows.
Choose Grant.

Adding table objects into the governed table

To add S3 objects to a governed table, you need to call the UpdateTableObjects API. You can call it using the AWS Command Line Interface (AWS CLI), the SDK, or the AWS Glue ETL library (the library calls the API implicitly). For this post, we use the AWS CLI to explain the behavior at the API level. If you don’t have the AWS CLI, see Installing, updating, and uninstalling the AWS CLI. You also need to install the service model file provided in the Lake Formation preview program. Run the following commands using the DatalakeAdmin1 user’s credentials.

First, begin a new transaction with the BeginTransaction API:

$ aws lakeformation-preview begin-transaction
{ "TransactionId": "7e5d506a757f32252ae3402a10191b13bfd1d7aa1c26a099d4a1911241589b8f"
}

Now you can add files to this table within this transaction. For this post, we choose one sample partition product_category=Camera from the amazon-reviews-pds table, and choose one file under this partition. You need to know the Uri, ETag, and Size of the files you add. So let’s find this information and copy it.

$ aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Camera/
2018-04-09 15:37:05 65386769 part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:06 65619234 part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:06 64564669 part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07 65148225 part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07 65227429 part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07 65269357 part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:08 65595867 part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:08 65012056 part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:09 65137504 part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:09 64992488 part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

$ aws s3api head-object --bucket amazon-reviews-pds --key parquet/product_category=Camera/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
{ "AcceptRanges": "bytes", "LastModified": "Mon, 09 Apr 2018 06:37:07 GMT", "ContentLength": 65227429, "ETag": "\"980669fcf6ccf31d2d686b9cccdd45e3-8\"", "ContentType": "binary/octet-stream", "Metadata": {}
}

Create a new file named write-operations1.json and enter the following JSON (replace Uri, ETag, and Size with the values you copied):

[ { "AddObject": { "Uri": "s3://amazon-reviews-pds/parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet", "ETag": "d4c25c40f33071620fb31cf0346ed2ec-8", "Size": 65386769, "PartitionValues": [ "Camera" ] } }
]
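The ETag that head-object returns is wrapped in literal double quotes, while the write operation carries the bare value. The following is a minimal Python sketch of that mapping, using the values from the head-object output above. The helper name `build_add_object` is hypothetical and is not part of any AWS SDK.

```python
import json


def build_add_object(bucket, key, head_object, partition_values):
    """Build one AddObject write operation from an S3 head-object response.

    build_add_object is a hypothetical helper used only for illustration.
    """
    return {
        "AddObject": {
            "Uri": f"s3://{bucket}/{key}",
            # head-object returns the ETag wrapped in literal double quotes.
            "ETag": head_object["ETag"].strip('"'),
            "Size": head_object["ContentLength"],
            "PartitionValues": list(partition_values),
        }
    }


# Values taken from the head-object output for part-00004 shown above.
head = {"ContentLength": 65227429, "ETag": '"980669fcf6ccf31d2d686b9cccdd45e3-8"'}
key = (
    "parquet/product_category=Camera/"
    "part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet"
)
operations = [build_add_object("amazon-reviews-pds", key, head, ["Camera"])]
print(json.dumps(operations, indent=2))
```

The printed array has the same shape as write-operations1.json, so it could be saved to that file instead of editing the values by hand.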

To add a file to the governed table, call the UpdateTableObjects API with the write-operations1.json file you just created (replace <transaction-id> with the transaction ID returned by the begin-transaction command):

$ aws lakeformation-preview update-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id> --write-operations file://./write-operations1.json

Note the current date-time right after making the UpdateTableObjects API call here. We use this timestamp for time travel queries later in this example.

$ date -u
Tue Feb 2 12:12:00 UTC 2021

You can inspect changes before a transaction is committed by making the GetTableObjects API call with the same transaction ID (replace <transaction-id> with the ID returned by the begin-transaction command):

$ aws lakeformation-preview get-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id>

{ "Objects": [ { "PartitionValues": [ "Camera" ], "Objects": [ { "Uri": "s3://amazon-reviews-pds/parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet", "ETag": "d4c25c40f33071620fb31cf0346ed2ec-8", "Size": 65386769 } ] } ]
}
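When a transaction adds many files, it can help to summarize the GetTableObjects response rather than read it by eye. The following is a small Python sketch that assumes the response has the shape shown above. `summarize_table_objects` is a hypothetical helper name.

```python
import json

# Response copied from the GetTableObjects output shown above.
response_text = """
{"Objects": [{"PartitionValues": ["Camera"], "Objects": [
  {"Uri": "s3://amazon-reviews-pds/parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
   "ETag": "d4c25c40f33071620fb31cf0346ed2ec-8", "Size": 65386769}]}]}
"""


def summarize_table_objects(response):
    """Return {partition values: (file count, total bytes)} for a response."""
    summary = {}
    for partition in response.get("Objects", []):
        values = tuple(partition.get("PartitionValues", []))
        files = partition.get("Objects", [])
        count, size = summary.get(values, (0, 0))
        summary[values] = (count + len(files), size + sum(f["Size"] for f in files))
    return summary


summary = summarize_table_objects(json.loads(response_text))
for values, (count, total_bytes) in summary.items():
    print(values, count, total_bytes)
```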

Now let’s commit the transaction so this data is available outside this transaction to other users. To do this, call the CommitTransaction API (replace <transaction-id> with the transaction ID returned by the begin-transaction command):

$ aws lakeformation-preview commit-transaction --transaction-id <transaction-id>

After you commit the transaction, you can see the partition on the Lake Formation console.

To simplify the example, we only add one partition with one file. In real-world usage, you may want to add all the files within all the partitions that you need.
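For the multi-file case, the write operations can be generated rather than typed. The sketch below assumes you have the `Contents` entries from `aws s3api list-objects-v2` output (each with `Key`, `ETag`, and `Size`) and that partition values can be read from the `product_category=` folder in each key. `build_write_operations` is a hypothetical helper. The post doesn't cover any per-call limit on the number of write operations, so check that before sending large batches.

```python
import json
import re

# Hive-style partition folder used by this dataset, e.g. product_category=Camera/.
PARTITION_PATTERN = re.compile(r"product_category=([^/]+)/")


def build_write_operations(bucket, contents):
    """Turn list-objects-v2 "Contents" entries into AddObject operations."""
    operations = []
    for obj in contents:
        match = PARTITION_PATTERN.search(obj["Key"])
        if match is None or not obj["Key"].endswith(".parquet"):
            continue  # Skip keys outside a partition folder and non-data files.
        operations.append({
            "AddObject": {
                "Uri": f"s3://{bucket}/{obj['Key']}",
                "ETag": obj["ETag"].strip('"'),  # Listing ETags are quoted too.
                "Size": obj["Size"],
                "PartitionValues": [match.group(1)],
            }
        })
    return operations


# Two entries built from the file names, ETags, and sizes used in this post.
FILE_NAME = "part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet"
contents = [
    {"Key": f"parquet/product_category=Camera/{FILE_NAME}",
     "ETag": '"d4c25c40f33071620fb31cf0346ed2ec-8"', "Size": 65386769},
    {"Key": f"parquet/product_category=Books/{FILE_NAME}",
     "ETag": '"9805c2c9a0459ccf337e01dc727f8efc-131"', "Size": 1094842361},
]
write_operations = build_write_operations("amazon-reviews-pds", contents)
print(json.dumps(write_operations, indent=2))
```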

Add a second partition with the following commands:

Call the BeginTransaction API to start another Lake Formation transaction:

$ aws lakeformation-preview begin-transaction
{ "TransactionId": "d70c60e859e832b312668723cf48c1b84ef9109c5dbf6e9dbe8834c481c0ec81"
}

List Amazon S3 objects located on amazon-reviews-pds bucket to choose another sample file:

$ aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Books/
2018-04-09 15:35:58 1094842361 part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:35:59 1093295804 part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:36:00 1095643518 part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:36:00 1095218865 part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:36:00 1094787237 part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:36:33 1094302491 part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:36:35 1094565655 part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:36:35 1095288096 part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:36:35 1092058864 part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:36:35 1093613569 part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

Call the HeadObject API against one sample file to copy its ETag and Size:

$ aws s3api head-object --bucket amazon-reviews-pds --key parquet/product_category=Books/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
{ "AcceptRanges": "bytes", "LastModified": "Mon, 09 Apr 2018 06:35:58 GMT", "ContentLength": 1094842361, "ETag": "\"9805c2c9a0459ccf337e01dc727f8efc-131\"", "ContentType": "binary/octet-stream", "Metadata": {}
}

Create a new file named write-operations2.json and enter the following JSON (replace Uri, ETag, and Size with the values you copied):

[ { "AddObject": { "Uri": "s3://amazon-reviews-pds/parquet/product_category=Books/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet", "ETag": "9805c2c9a0459ccf337e01dc727f8efc-131", "Size": 1094842361, "PartitionValues": [ "Books" ] } }
]

Call the UpdateTableObjects API using write-operations2.json (replace <transaction-id> with the transaction ID returned by the begin-transaction command):
```
$ aws lakeformation-preview update-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id> --write-operations file://./write-operations2.json
```
Call the CommitTransaction API (replace <transaction-id> with the transaction ID returned by the begin-transaction command):
```
$ aws lakeformation-preview commit-transaction --transaction-id <transaction-id>
```
Now the two partitions are visible on the Lake Formation console.
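The begin, update, and commit sequence used for both partitions can be wrapped in one function. The following Python sketch shells out to the same `lakeformation-preview` CLI commands used in this post. The function names and the injectable `run` parameter are hypothetical. The `abort-transaction` subcommand is an assumption, inferred from the AbortTransaction API rather than shown anywhere in this post.

```python
import json
import subprocess


def run_aws(args):
    """Run an AWS CLI command and return its parsed JSON output ({} if empty)."""
    result = subprocess.run(
        ["aws", *args], check=True, capture_output=True, text=True
    )
    return json.loads(result.stdout) if result.stdout.strip() else {}


def add_objects_in_transaction(database, table, operations_file, run=run_aws):
    """Begin a transaction, add objects, then commit; abort if a step fails."""
    service = "lakeformation-preview"
    transaction_id = run([service, "begin-transaction"])["TransactionId"]
    try:
        run([service, "update-table-objects",
             "--database-name", database,
             "--table-name", table,
             "--transaction-id", transaction_id,
             "--write-operations", f"file://{operations_file}"])
        run([service, "commit-transaction", "--transaction-id", transaction_id])
    except Exception:
        # Assumption: abort-transaction is the CLI form of AbortTransaction.
        run([service, "abort-transaction", "--transaction-id", transaction_id])
        raise
    return transaction_id
```

With the preview service model installed, a call such as `add_objects_in_transaction("lakeformation_tutorial_amazon_reviews", "amazon_reviews_governed", "./write-operations2.json")` would run the same three commands in order. Passing a different `run` callable makes the flow easy to dry-run without touching AWS.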

Querying the governed table using Amazon Athena

Now your governed table is ready! Let’s start querying the governed table using Amazon Athena. Sign in to the Athena console in us-east-1 Region using DataAnalyst1 user.

If it’s the first time you are running queries in Athena, you need to configure a query result location. For more information, see Specifying a Query Result Location.

To utilize Lake Formation preview features, you need to create a special workgroup named AmazonAthenaLakeFormationPreview, and join the workgroup. For more information, see Managing Workgroups.

Running a simple query

While signed in to the Athena console as the DataAnalyst1 user, first preview 10 records stored in the governed table:

SELECT * FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed
LIMIT 10

The following screenshot shows the query results.

Running an analytic query

Next, let’s run an analytic query with aggregation to simulate real-world use cases:

SELECT product_category, count(*) as TotalReviews, avg(star_rating) as AverageRating
FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed GROUP BY product_category

The following screenshot shows the results. This query returned the total number of reviews and average rating per product category.

Running an analytic query with time travel

Governed tables enable time travel: you can query a table as of a previous time. To do this in Athena, add a WHERE clause that sets the column __asOfDate to the epoch time (long integer) representation of the required date and time. Let’s run the time travel query (replace <epoch-milliseconds> with a timestamp taken right after the CommitTransaction call associated with the first UpdateTableObjects call; to get the epoch milliseconds, see the commands at the end of this section):

SELECT product_category, count(*) as TotalReviews, avg(star_rating) as AverageRating
FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed
WHERE __asOfDate = <epoch-milliseconds>
GROUP BY product_category

The following screenshot shows the query results. The result includes only the records for product_category=Camera. This is because the file under product_category=Books was added after the timestamp (1612267920000 ms = 2021/02/02 12:12:00 UTC) specified in the time travel column __asOfDate.
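If you generate the time travel query from a script, the timestamp can be substituted programmatically. The following is a minimal Python sketch that only builds the SQL text shown above and does not call Athena. `time_travel_query` is a hypothetical helper name.

```python
def time_travel_query(table, as_of_epoch_ms):
    """Return the aggregation query from this post pinned to a point in time."""
    if not isinstance(as_of_epoch_ms, int) or as_of_epoch_ms < 0:
        raise ValueError("as_of_epoch_ms must be a non-negative integer")
    return (
        "SELECT product_category, count(*) as TotalReviews, "
        "avg(star_rating) as AverageRating\n"
        f"FROM {table}\n"
        f"WHERE __asOfDate = {as_of_epoch_ms}\n"
        "GROUP BY product_category"
    )


query = time_travel_query(
    "lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed",
    1612267920000,  # 2021/02/02 12:12:00 UTC, the timestamp used in this post
)
print(query)
```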

To get the epoch time in milliseconds from the command line, run one of the following commands.

The following command is for Linux (GNU date command):

$ echo $(($(date -u -d '2021/02/02 12:12:00' +%s%N)/1000000))
1612267920000

The following command is for OSX (BSD date command):

$ echo $(($(date -u -j -f "%Y/%m/%d %T" "2021/02/02 12:12:00" +'%s * 1000 + %-N / 1000000')))
1612267920000
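If you prefer not to depend on the differences between GNU and BSD date, the same conversion can be done in Python. This is a small sketch using only the standard library. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time surprises")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(epoch_ms):
    """Convert epoch milliseconds back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=epoch_ms)


noted = datetime(2021, 2, 2, 12, 12, 0, tzinfo=timezone.utc)
epoch_ms = to_epoch_millis(noted)
print(epoch_ms)                     # 1612267920000
print(from_epoch_millis(epoch_ms))  # 2021-02-02 12:12:00+00:00
```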

Cleaning up

Now to the final step, cleaning up the resources.

Delete the CloudFormation stack. The governed table you created is automatically deleted with the stack.
Delete the Athena workgroup AmazonAthenaLakeFormationPreview.

Conclusion

In this blog post, we explained how to create a Lake Formation governed table with existing data in an AWS public dataset. In addition, we explained how to query governed tables and how to run time travel queries for governed tables. In Part 2 of this series, we will show you how to create a governed table to ingest streaming data and demonstrate how Lake Formation transactions can simplify streaming ETL.

Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) AWS Region. To get early access to these capabilities, please sign up for the preview.

Appendix: Setting up resources via the console

Configuring IAM roles and IAM users

First, you need to set up two IAM roles: one for AWS Glue ETL jobs and another for the Lake Formation data lake location.

IAM policies

To create your policies, complete the following steps:

On the IAM console, create a new Policy for Amazon S3.

Save the policy as S3DataLakePolicy as follows:

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::amazon-reviews-pds/*" ] }, { "Effect": "Allow", "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::amazon-reviews-pds" ] } ]
}

Create a new IAM policy named LFLocationPolicy with the following statements:

{ "Version": "2012-10-17", "Statement": [ { "Sid": "LFPreview1", "Effect": "Allow", "Action": "execute-api:Invoke", "Resource": "arn:aws:execute-api:*:*:*/*/POST/reportStatus" }, { "Sid": "LFPreview2", "Effect": "Allow", "Action": [ "lakeformation:BeginTransaction", "lakeformation:CommitTransaction", "lakeformation:AbortTransaction", "lakeformation:GetTableObjects", "lakeformation:UpdateTableObjects" ], "Resource": "*" } ]
}

Create a new IAM policy named LFQueryPolicy with the following statements:

{ "Version": "2012-10-17", "Statement": [ { "Sid": "LFPreview1", "Effect": "Allow", "Action": "execute-api:Invoke", "Resource": "arn:aws:execute-api:*:*:*/*/POST/reportStatus" }, { "Sid": "LFPreview2", "Effect": "Allow", "Action": [ "lakeformation:BeginTransaction", "lakeformation:CommitTransaction", "lakeformation:AbortTransaction", "lakeformation:ExtendTransaction", "lakeformation:PlanQuery", "lakeformation:GetTableObjects", "lakeformation:GetQueryState", "lakeformation:GetWorkUnits", "lakeformation:Execute" ], "Resource": "*" } ]
}
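If you adapt S3DataLakePolicy to your own bucket, generating the document avoids mistakes in the two resource ARNs (objects use the `/*` suffix, while `s3:ListBucket` applies to the bucket itself). The following Python sketch mirrors the policy shown earlier in this appendix. The `s3_data_lake_policy` helper and its `read_only` option are hypothetical additions for illustration.

```python
import json


def s3_data_lake_policy(bucket, read_only=False):
    """Build an S3DataLakePolicy-style document for one bucket.

    read_only is a hypothetical option that drops the write and delete actions.
    """
    object_actions = (
        ["s3:GetObject"]
        if read_only
        else ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"]
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": [f"arn:aws:s3:::{bucket}/*"],  # objects in the bucket
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],  # the bucket itself
            },
        ],
    }


policy = s3_data_lake_policy("amazon-reviews-pds")
print(json.dumps(policy, indent=2))
```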

IAM role for AWS Lake Formation

To create your IAM role for the Lake Formation data lake location, complete the following steps:

Create a new Lake Formation role called LFRegisterLocationServiceRole with a Lake Formation trust relationship:
```
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "lakeformation.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ]
}
```
Attach the customer managed policies S3DataLakePolicy and LFLocationPolicy you created in the previous step.

This role is used to register locations with Lake Formation, which in turn performs credential vending for Athena at query time.

IAM users

To create your users, complete the following steps:

Create an IAM user named DatalakeAdmin.
Attach the following AWS managed policies:
1. AWSLakeFormationDataAdmin
2. AmazonAthenaFullAccess
3. IAMReadOnlyAccess
Attach the customer managed policy LFQueryPolicy.
Create an IAM user named DataAnalyst that can use Athena to query data.
Attach the AWS managed policy AmazonAthenaFullAccess.
Attach the customer managed policy LFQueryPolicy.

Configuring Lake Formation

If you’re new to Lake Formation, follow the steps below to get started.

On the Lake Formation console, under Permissions, choose Admins and database creators.
In the Data lake administrators section, choose Grant.
For IAM users and roles, choose your IAM user DatalakeAdmin.
Choose Save.
In the Database creators section, choose Grant.
For IAM users and roles, choose the LFRegisterLocationServiceRole.
Select Create Database.
Choose Grant.
Under Register and ingest, choose Data lake locations.
Choose Register location.
For Amazon S3 path, enter the Amazon S3 path to the bucket where your data is stored. This needs to be the same bucket you listed in S3DataLakePolicy. Lake Formation uses the role you choose in the next step to vend temporary Amazon S3 credentials to query services that need read/write access to the bucket and all prefixes under it.
For IAM role, choose the LFRegisterLocationServiceRole.
Choose Register location.
Under Data catalog, choose Settings.
Make sure that both check boxes for Use only IAM access control for new databases and Use only IAM access control for new tables in new databases are deselected.
Under Data catalog, choose Databases.
Choose Create database.
Select Database.
For Name, enter lakeformation_tutorial_amazon_reviews.
Choose Create database.

About the Author

Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue & Lake Formation. His passion is for implementing software artifacts for building data lakes more effectively and easily. During his spare time, he loves to spend time with his family, especially hunting bugs (not software bugs, but bugs like butterflies, pill bugs, snails, and grasshoppers).
