Today, we announced the next generation of Amazon SageMaker, which is a unified platform for data, analytics, and AI, bringing together widely-adopted AWS machine learning and analytics capabilities. At its core is SageMaker Unified Studio (preview), a single data and AI development environment for data exploration, preparation and integration, big data processing, fast SQL analytics, model development and training, and generative AI application development. This announcement includes Amazon SageMaker Lakehouse, a capability that unifies data across data lakes and data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data.
In addition to these launches, I’m happy to announce data catalog and permissions capabilities in Amazon SageMaker Lakehouse, helping you connect, discover, and manage permissions to data sources centrally.
Organizations today store data across various systems to optimize for specific use cases and scale requirements. This often results in data siloed across data lakes, data warehouses, databases, and streaming services. Analysts and data scientists face challenges when trying to connect to and analyze data from these diverse sources. They must set up specialized connectors for each data source, manage multiple access policies, and often resort to copying data, leading to increased costs and potential data inconsistencies.
The new capability addresses these challenges by simplifying the process of connecting to popular data sources, cataloging them, applying permissions, and making the data available for analysis through SageMaker Lakehouse and Amazon Athena. You can use the AWS Glue Data Catalog as a single metadata store for all data sources, regardless of location. This provides a centralized view of all available data.
Data source connections are created once and can be reused, so you don’t need to set up connections repeatedly. As you connect to the data sources, databases and tables are automatically cataloged and registered with AWS Lake Formation. Once cataloged, you grant access to those databases and tables to data analysts, so they don’t have to go through separate steps of connecting to each data source and don’t have to know built-in data source secrets. Lake Formation permissions can be used to define fine-grained access control (FGAC) policies across data lakes, data warehouses, and online transaction processing (OLTP) data sources, providing consistent enforcement when querying with Athena. Data remains in its original location, eliminating the need for costly and time-consuming data transfers or duplications. You can create or reuse existing data source connections in Data Catalog and configure built-in connectors to multiple data sources, including Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Aurora, Amazon DynamoDB (preview), Google BigQuery, and more.
Getting started with the integration between Athena and Lake Formation
To showcase this capability, I use a preconfigured environment that incorporates Amazon DynamoDB as a data source. The environment is set up with appropriate tables and data to effectively demonstrate the capability. I use the SageMaker Unified Studio (preview) interface for this demonstration.
To begin, I go to SageMaker Unified Studio (preview) through the Amazon SageMaker domain. This is where you can create and manage projects, which serve as shared workspaces. These projects allow team members to collaborate, work with data, and develop ML models together. Creating a project automatically sets up AWS Glue Data Catalog databases, establishes a catalog for Redshift Managed Storage (RMS) data, and provisions necessary permissions.
To manage projects, you can either view a comprehensive list of existing projects by selecting Browse all projects, or you can create a new project by choosing Create project. I use two existing projects: sales-group, where administrators have full access privileges to all data, and marketing-project, where analysts operate under restricted data access permissions. This setup effectively illustrates the contrast between administrative and limited user access levels.
In this step, I set up a federated catalog for the target data source, which is Amazon DynamoDB. I go to Data in the left navigation pane and choose the + (plus) sign to Add data. I choose Add connection and then I choose Next.
I choose Amazon DynamoDB and choose Next.
I enter the details and choose Add data. Now, I have the Amazon DynamoDB federated catalog created in SageMaker Lakehouse. This is where your administrator gives you access using resource policies. I’ve already configured the resource policies in this environment. Now, I’ll show you how fine-grained access controls work in SageMaker Unified Studio (preview).
I begin by selecting the sales-group project, which is where administrators maintain and have full access to customer data. This dataset contains fields such as zip codes, customer IDs, and phone numbers. To analyze this data, I can execute queries using Query with Athena.
Upon selecting Query with Athena, the Query Editor launches automatically, providing a workspace where I can compose and execute SQL queries against the lakehouse. This integrated query environment offers a seamless experience for data exploration and analysis.
In the second part, I switch to marketing-project to show what an analyst experiences when they run their queries and observe that the fine-grained access control permissions are in place and working.
In the second part, I demonstrate the perspective of an analyst by switching to the marketing-project environment. This helps us verify that the fine-grained access control permissions are properly implemented and effectively restricting data access as intended. Through example queries, we can observe how analysts interact with the data while being subject to the established security controls.
Using the Query with Athena option, I execute a SELECT statement on the table to verify the access controls. The results confirm that, as expected, I can only view the zipcode and cust_id columns, while the phone column remains restricted based on the configured permissions.
With these new data catalog and permissions capabilities in Amazon SageMaker Lakehouse, you can now streamline your data operations, enhance security governance, and accelerate AI/ML development while maintaining data integrity and compliance across your entire data ecosystem.
Now available
Data catalog and permissions in Amazon SageMaker Lakehouse simplifies interactive analytics through federated query when connecting to a unified catalog and permissions with Data Catalog across multiple data sources, providing a single place to define and enforce fine-grained security policies across data lakes, data warehouses, and OLTP data sources for a high-performing query experience.
You can use this capability in US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), and Asia Pacific (Tokyo) AWS Regions.
To get started with this new capability, visit the Amazon SageMaker Lakehouse documentation.