Using AWS DevOps Tools to model and provision AWS Glue workflows
This post provides a step-by-step guide on how to model and provision AWS Glue workflows utilizing a DevOps principle known as infrastructure as code (IaC) that emphasizes the use of templates, source control, and automation. The cloud resources in this solution are defined within AWS CloudFormation templates and provisioned with automation features provided by AWS CodePipeline and AWS CodeBuild. These AWS DevOps tools are flexible, interchangeable, and well suited for automating the deployment of AWS Glue workflows into different environments such as dev, test, and production, which typically reside in separate AWS accounts and Regions.
AWS Glue workflows allow you to manage dependencies between multiple components that interoperate within an end-to-end ETL data pipeline by grouping together a set of related jobs, crawlers, and triggers into one logical run unit. Many customers using AWS Glue workflows start by defining the pipeline using the AWS Management Console and then move on to monitoring and troubleshooting using either the console, AWS APIs, or the AWS Command Line Interface (AWS CLI).
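For example, once a workflow exists, you can retrieve its definition and the components it groups together from the AWS CLI (the workflow name below is a placeholder; the workflow created later in this post is named Covid_19, and the --include-graph flag corresponds to the IncludeGraph option of the Glue GetWorkflow API):
$ aws glue get-workflow --name <WORKFLOW_NAME> --include-graph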
Solution overview
The solution uses COVID-19 datasets. For more information on these datasets, see the public data lake for analysis of COVID-19 data, which contains a centralized repository of freely available and up-to-date curated datasets made available by the AWS Data Lake team.
Because the primary focus of this solution is to showcase how to model and provision AWS Glue workflows using AWS CloudFormation and CodePipeline, we don’t spend much time describing the intricate transform capabilities that can be performed in AWS Glue jobs. As shown in the Python scripts, the business logic is optimized for readability and extensibility so you can easily home in on the functions that aggregate data based on monthly and quarterly time periods.
The ETL pipeline reads the source COVID-19 datasets directly and writes only the aggregated data to your S3 bucket.
The solution exposes the datasets in the following tables:
Table Name | Description | Dataset location | Provider |
countrycode | Lookup table for country codes | s3://covid19-lake/static-datasets/csv/countrycode/ | Rearc |
countypopulation | Lookup table for the population of each county | s3://covid19-lake/static-datasets/csv/CountyPopulation/ | Rearc |
state_abv | Lookup table for US state abbreviations | s3://covid19-lake/static-datasets/json/state-abv/ | Rearc |
rearc_covid_19_nyt_data_in_usa_us_counties | Data on COVID-19 cases at US county level | s3://covid19-lake/rearc-covid-19-nyt-data-in-usa/csv/us-counties/ | Rearc |
rearc_covid_19_nyt_data_in_usa_us_states | Data on COVID-19 cases at US state level | s3://covid19-lake/rearc-covid-19-nyt-data-in-usa/csv/us-states/ | Rearc |
rearc_covid_19_testing_data_states_daily | Data on COVID-19 cases at US state level | s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/ | Rearc |
rearc_covid_19_testing_data_us_daily | US total test daily trend | s3://covid19-lake/rearc-covid-19-testing-data/csv/us_daily/ | Rearc |
rearc_covid_19_testing_data_us_total_latest | US total tests | s3://covid19-lake/rearc-covid-19-testing-data/csv/us-total-latest/ | Rearc |
rearc_covid_19_world_cases_deaths_testing | World total tests | s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/ | Rearc |
rearc_usa_hospital_beds | Hospital beds and their utilization in the US | s3://covid19-lake/rearc-usa-hospital-beds/ | Rearc |
world_cases_deaths_aggregates | Monthly and quarterly aggregates of the world data | s3://<your-S3-bucket-name>/covid19/world-cases-deaths-aggregates/ | Aggregate |
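The source datasets in the preceding table live in the public covid19-lake bucket, so you can inspect them with the AWS CLI before running the pipeline. For example, the following command lists the country code lookup files (assuming your credentials are allowed to read the public bucket):
$ aws s3 ls s3://covid19-lake/static-datasets/csv/countrycode/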
Prerequisites
This post assumes you have the following:
- Access to an AWS account
- The AWS CLI (optional)
- Permissions to create a CloudFormation stack
- Permissions to create AWS resources, such as AWS Identity and Access Management (IAM) roles, Amazon Simple Storage Service (Amazon S3) buckets, and various other resources
- General familiarity with AWS Glue resources (triggers, crawlers, and jobs)
Architecture
The CloudFormation template glue-workflow-stack.yml defines all the AWS Glue resources shown in the following diagram.
Modeling the AWS Glue workflow using AWS CloudFormation
Let’s start by exploring the template used to model the AWS Glue workflow: glue-workflow-stack.yml
We focus on two resources in the following snippet:
- AWS::Glue::Workflow
- AWS::Glue::Trigger
From a logical perspective, a workflow contains one or more triggers that are responsible for invoking crawlers and jobs. Building a workflow starts with defining the crawlers and jobs as resources within the template and then associating them with triggers.
Defining the workflow
This is where the definition of the workflow starts. In the following snippet, we specify the type as AWS::Glue::Workflow and the property Name as a reference to the parameter GlueWorkflowName.
Parameters:
  GlueWorkflowName:
    Type: String
    Description: Glue workflow that tracks all triggers, jobs, crawlers as a single entity
    Default: Covid_19

Resources:
  Covid19Workflow:
    Type: AWS::Glue::Workflow
    Properties:
      Description: Glue workflow that tracks specified triggers, jobs, and crawlers as a single entity
      Name: !Ref GlueWorkflowName
Defining the triggers
This is where we define each trigger and associate it with the workflow. In the following snippet, we specify the property WorkflowName on each trigger as a reference to the workflow name parameter GlueWorkflowName.
These triggers allow us to create a chain of dependent jobs and crawlers as specified by the properties Actions and Predicate.
The trigger t_Start is of type SCHEDULED, which means that it starts at a defined time (in our case, one time a day at 8:00 AM UTC). Every time it runs, it starts the job with the logical ID Covid19WorkflowStarted.
The trigger t_GroupA is of type CONDITIONAL, which means that it starts when the resources specified within the property Predicate have reached a specific state (when the jobs listed under Conditions reach the state SUCCEEDED). Every time t_GroupA runs, it starts the crawlers with the logical IDs CountyPopulation and Countrycode, per the Actions property containing a list of actions.
TriggerJobCovid19WorkflowStart:
  Type: AWS::Glue::Trigger
  Properties:
    Name: t_Start
    Type: SCHEDULED
    Schedule: cron(0 8 * * ? *) # Runs once a day at 8 AM UTC
    StartOnCreation: true
    WorkflowName: !Ref GlueWorkflowName
    Actions:
      - JobName: !Ref Covid19WorkflowStarted

TriggerCrawlersGroupA:
  Type: AWS::Glue::Trigger
  Properties:
    Name: t_GroupA
    Type: CONDITIONAL
    StartOnCreation: true
    WorkflowName: !Ref GlueWorkflowName
    Actions:
      - CrawlerName: !Ref CountyPopulation
      - CrawlerName: !Ref Countrycode
    Predicate:
      Conditions:
        - JobName: !Ref Covid19WorkflowStarted
          LogicalOperator: EQUALS
          State: SUCCEEDED
Provisioning the AWS Glue workflow using CodePipeline
Now let’s explore the template used to provision the CodePipeline resources: codepipeline-stack.yml
This template defines an S3 bucket that is used as the source action for the pipeline. Any time source code is uploaded to a specified bucket, AWS CloudTrail logs the event, which is detected by an Amazon CloudWatch Events rule configured to start running the pipeline in CodePipeline. The pipeline orchestrates CodeBuild to get the source code and provision the workflow.
For more information on any of the available source actions that you can use with CodePipeline, such as Amazon S3, AWS CodeCommit, Amazon Elastic Container Registry (Amazon ECR), GitHub, GitHub Enterprise Server, GitHub Enterprise Cloud, or Bitbucket, see Start a pipeline execution in CodePipeline.
We start by deploying the stack that sets up the CodePipeline resources. This stack can be deployed in any Region where CodePipeline and AWS Glue are available. For more information, see AWS Regional Services.
Cloning the GitHub repo
Clone the GitHub repo with the following command:
$ git clone https://github.com/aws-samples/provision-codepipeline-glue-workflows.git
Deploying the CodePipeline stack
Deploy the CodePipeline stack with the following command:
$ aws cloudformation deploy \
--stack-name codepipeline-covid19 \
--template-file cloudformation/codepipeline-stack.yml \
--capabilities CAPABILITY_NAMED_IAM \
--no-fail-on-empty-changeset \
--region <AWS_REGION>
The preceding screenshot shows that the pipeline failed. This is because we haven’t uploaded the source code yet.
In the following steps, we zip and upload the source code, which triggers another (successful) run of the pipeline.
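You can also check the state of each pipeline stage from the AWS CLI (the pipeline name below is a placeholder; use the name shown on the CodePipeline console):
$ aws codepipeline get-pipeline-state --name <PIPELINE_NAME> --region <AWS_REGION>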
Zipping the source code
Zip the source code, which contains the AWS Glue scripts, CloudFormation templates, and buildspec files, with the following command:
$ zip -r source.zip . -x images/* *.history* *.git* *.DS_Store*
You can omit *.DS_Store* from the preceding command if you are not a Mac user.
Uploading the source code
Upload the source code with the following command:
$ aws s3 cp source.zip s3://covid19-codepipeline-source-<AWS_ACCOUNT_ID>-<AWS_REGION>
Make sure to provide your account ID and Region in the preceding command. For example, if your AWS account ID is 111111111111 and you’re using Region us-west-2, use the following command:
$ aws s3 cp source.zip s3://covid19-codepipeline-source-111111111111-us-west-2
Now that the source code has been uploaded, view the pipeline again to see it in action.
Choose Details within the Deploy stage to see the build logs.
To change any of the commands that run within the Deploy stage, modify deploy-glue-workflow-stack.yml.
Try uploading the source code a few more times. Each time it’s uploaded, CodePipeline starts and runs another deploy of the workflow stack. If nothing has changed in the source code, AWS CloudFormation automatically determines that the stack is already up to date. If something has changed in the source code, AWS CloudFormation automatically determines that the stack needs to be updated and proceeds to run the change set.
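You can also confirm the status of the workflow stack that the pipeline deploys by querying AWS CloudFormation directly (the stack name glue-covid19 is the same one referenced in the cleanup section later in this post):
$ aws cloudformation describe-stacks --stack-name glue-covid19 --query "Stacks[0].StackStatus" --region <AWS_REGION>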
Viewing the provisioned workflow, triggers, jobs, and crawlers
To view your workflows on the AWS Glue console, in the navigation pane, under ETL, choose Workflows.
To view your triggers, in the navigation pane, under ETL, choose Triggers.
To view your crawlers, under Data Catalog, choose Crawlers.
To view your jobs, under ETL, choose Jobs.
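If you prefer the AWS CLI, you can list the same resources with the following commands:
$ aws glue list-workflows --region <AWS_REGION>
$ aws glue list-triggers --region <AWS_REGION>
$ aws glue list-crawlers --region <AWS_REGION>
$ aws glue list-jobs --region <AWS_REGION>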
Running the workflow
The workflow runs automatically at 8:00 AM UTC. To start the workflow manually, you can use either the AWS CLI or the AWS Glue console.
To start the workflow with the AWS CLI, enter the following command:
$ aws glue start-workflow-run --name Covid_19 --region <AWS_REGION>
To start the workflow on the AWS Glue console, on the Workflows page, select your workflow and choose Run on the Actions menu.
To view the run details of the workflow, choose the workflow on the AWS Glue console and choose View run details on the History tab.
The following screenshot shows a visual representation of the workflow as a graph with your run details.
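You can retrieve the same run details from the AWS CLI. The start-workflow-run command shown earlier returns a run ID, or you can list recent runs first (the run ID below is a placeholder, and the --include-graph flag corresponds to the IncludeGraph option of the Glue GetWorkflowRun API):
$ aws glue get-workflow-runs --name Covid_19 --region <AWS_REGION>
$ aws glue get-workflow-run --name Covid_19 --run-id <RUN_ID> --include-graph --region <AWS_REGION>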
Cleaning up
To avoid additional charges, delete the stacks created by the CloudFormation templates and the contents of the buckets you created.
1. Delete the contents of the covid19-dataset bucket with the following command:
$ aws s3 rm s3://covid19-dataset-<AWS_ACCOUNT_ID>-<AWS_REGION> --recursive
2. Delete your workflow stack with the following command:
$ aws cloudformation delete-stack --stack-name glue-covid19 --region <AWS_REGION>
To delete the contents of the covid19-codepipeline-source bucket, it’s simplest to use the Amazon S3 console because it makes it easy to delete multiple object versions at once.
3. Navigate to the S3 bucket named covid19-codepipeline-source-<AWS_ACCOUNT_ID>-<AWS_REGION>.
4. Choose List versions.
5. Select all the files to delete.
6. Choose Delete and follow the prompts to permanently delete all the objects.
7. Delete the contents of the covid19-codepipeline-artifacts bucket:
$ aws s3 rm s3://covid19-codepipeline-artifacts-<AWS_ACCOUNT_ID>-<AWS_REGION> --recursive
8. Delete the contents of the covid19-cloudtrail-logs bucket:
$ aws s3 rm s3://covid19-cloudtrail-logs-<AWS_ACCOUNT_ID>-<AWS_REGION> --recursive
9. Delete the pipeline stack:
$ aws cloudformation delete-stack --stack-name codepipeline-covid19 --region <AWS_REGION>
Conclusion
In this post, we stepped through how to use AWS DevOps tooling to model and provision an AWS Glue workflow that orchestrates an end-to-end ETL pipeline on a real-world dataset.
You can download the source code and templates from this GitHub repository and adapt them as you see fit for your data pipeline use cases. Feel free to leave comments letting us know about the architectures you build for your environment. To learn more about building ETL pipelines with AWS Glue, see the AWS Glue Developer Guide and the AWS Data Analytics learning path.