Today, we announced the next generation of Amazon SageMaker, which is a unified platform for data, analytics, and AI, bringing together widely-adopted AWS machine learning and analytics capabilities. This announcement includes Amazon SageMaker Data and AI Governance, a set of capabilities that streamline the management of data and AI assets.
Data teams often face challenges when trying to locate, access, and collaborate on data and AI models across their organizations. The process of discovering relevant assets, understanding their context, and obtaining proper access can be time-consuming and complex, potentially hindering productivity and innovation.
SageMaker Data and AI Governance offers a comprehensive set of features by providing a unified experience for cataloging, discovering, and governing data and AI assets. It’s centered around SageMaker Catalog built on Amazon DataZone, providing a centralized repository that is accessible through Amazon SageMaker Unified Studio (preview). The catalog is built directly into the SageMaker platform, offering seamless integration with existing SageMaker workflows and tools, helping engineers, data scientists, and analysts to safely find and use authorized data and models through advanced search features. With the SageMaker platform, users can safeguard and protect their AI models using guardrails and implementing responsible AI policies.
Here are some of the key Data and AI governance features of SageMaker:
- Enterprise-ready business catalog – To add business context and make data and AI assets discoverable by everyone in the organization, you can customize the catalog with automated metadata generation which uses machine learning (ML) to automatically generate business names of data assets and columns within those assets. We improved metadata curation functionality, helping you attach multiple business glossary terms to assets and glossary terms to individual columns in the asset.
- Self-service for data and AI workers – To provide data autonomy for users to publish and consume data, you can customize and bring any type of asset to the catalog using APIs. Data publishers can automate metadata discovery through data source runs or manually published files from the supported data sources and enrich metadata with generative AI–generated data descriptions automatically as datasets are brought into the catalog. Data consumers can then use faceted search to quickly find, understand, and request access to data.
- Simplified access to data and tools – To govern data and AI assets based on business purpose, projects serve as business use case–based logical containers. You can create a project and collaborate on specific business use case–based groupings of people, data, and analytics tools. Within the project, you can create an environment that provides the necessary infrastructure to project members such as analytics and AI tools and storage so that project members can easily produce new data or consume data they have access to. This helps you add multiple capabilities and analytics tools to the same project, depending on your needs.
- Governed data and model sharing – Data producers own and manage access to data with a subscription approval workflow that allows consumers to request access and data owners to approve. You can now set up subscription terms to be attached to assets when published and automate subscription grant fulfillment for AWS managed data lakes and Amazon Redshift with customizations using Amazon EventBridge events for other sources.
- Bring a consistent level of AI safety across all your applications: Amazon Bedrock Guardrails helps evaluate user inputs and Foundation Model (FM) responses based on use case specific policies, and provides an additional layer of safeguards regardless of the underlying Foundation Models. AWS AI portfolio provides hundreds of built-in algorithms with pre-trained models from model hubs, including TensorFlow Hub, PyTorch Hub, Hugging Face, and MxNet GluonCV. You can also access built-in algorithms using the SageMaker Python SDK. Built-in algorithms cover common ML tasks, such as data classifications (image, text, tabular) and sentiment analysis.
For seamless integration with existing processes, SageMaker Data and AI Governance provides API support, enabling programmatic access for setup and configuration.
How to use Amazon SageMaker Data and AI Governance
For this demonstration, I use a preconfigured environment. I go to the Amazon SageMaker Unified Studio (preview) console, which provides an integrated development experience for all your data and AI use cases. This is where you can create and manage projects, which serve as shared workspaces. These projects allow team members to collaborate, work with data, and develop ML models together.
Let me start with the Govern menu in the navigation bar.
New data governance capabilities called domain units and authorization policies that help you create business unit- and team-level organization and manage policies according to your business needs. With the addition of domain units, you can organize, create, search, and find data assets and projects associated with business units or teams. With authorization policies, you can set access policies for creating projects and glossaries.
Domain units also help you with self-service governance over critical actions such as publishing data assets and utilizing compute resources within Amazon SageMaker. I choose a project and navigate to the Data sources tab in the left navigation pane. You can use this section to add new or manage existing data sources for publishing data assets to the business data catalog, making them discoverable for all users.
I return to the homepage and continue exploring by choosing Data Catalog, which serves as a centralized hub where users can explore and discover all available data assets across multiple data sources within the organization. This catalog connects to various data sources, including Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and AWS Glue.
The semantic search feature helps you find relevant data assets quickly and efficiently using natural language queries, which makes data discovery more intuitive. I enter events in the Search data area.
You can apply filters based on asset type, such as AWS Glue table and Amazon Redshift.
Amazon Q Developer integration helps you interact with data using conversational language, making it easier for users to find and understand data assets. You can use example commands such as “Show me datasets that relate to events” and “Show me datasets that relate to revenue.” The detailed view provides comprehensive information about each dataset, including AI-generated descriptions, data quality metrics, and data lineage, helping you understand the content and origin of the data.
The subscription process implements a controlled access mechanism where users must justify their need for data access, providing proper data governance and security. I choose Subscribe to request access.
In the pop-up window, I select a Project, provide a Reason for request such as need access, and choose Request. The request is sent to the data owner.
This final step makes sure that data access is properly governed through a structured approval workflow, maintaining data security and compliance requirements. During the owner approval process, the data owner receives a notification and can review the request details before choosing to approve or deny access, after which the requester can access the data table if approved.
Now available
Amazon SageMaker Data and AI Governance offers significant benefits for organizations looking to improve their data and AI asset management. The solution helps data scientists, engineers, and analysts overcome challenges in discovering and accessing resources by offering comprehensive features for cataloging, discovering, and governing data and AI assets, while providing security and compliance through structured approval workflows.
For pricing information, visit Amazon SageMaker pricing.
To get started with Amazon SageMaker Data and AI Governance, visit Amazon SageMaker Documentation.