Migrating IBM Netezza to Amazon Redshift using the AWS Schema Conversion Tool
The post How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime described a high-level strategy to move from an on-premises Netezza data warehouse to Amazon Redshift. In this post, we explain how a large European Enterprise customer implemented a Netezza migration strategy spanning multiple environments, using the AWS Schema Conversion Tool (AWS SCT) to accelerate schema and data migration. We also walk you through validating that the schema and data content were migrated as expected and that Amazon Redshift best practices were followed.
Solution overview
It’s important to build a migration plan unique to your organization’s processes and non-functional requirements. The following plan is a real-world use case from a large European Enterprise customer. It details the different environments migrated to and the tasks, tools, and scripts used to complete the work:
- Assess migration tasks:
  - Understand the scope of the migration
  - Record objects to be migrated into a migration runbook
- Set up the migration environment:
  - Install AWS SCT
  - Configure AWS SCT for Netezza source environments
- Migrate to the development environment:
  - Create users, groups, and schema
  - Convert schema
  - Migrate data
  - Validate data
  - Transform ETL, UDF, and procedures
- Migrate to other pre-production environments:
  - Create users, groups, and schema
  - Convert schema
  - Migrate data
  - Validate data
  - Transform ETL, UDF, and procedures
- Migrate to the production environment:
  - Create users, groups, and schema
  - Convert schema
  - Migrate data
  - Validate data
  - Transform ETL, UDF, and procedures
  - Business validation (including optional dual-running)
  - Cut over
Assessing migration tasks
To plan and keep track of the migration tasks, you should produce a tracker of all the Netezza databases, tables, and views in scope. This information forms a migration runbook that is updated during the migration to document the progress of data migration from Netezza to Amazon Redshift. For each table identified, record the number of rows and size in GB.
Some Netezza source systems contain two Netezza data warehouses, for example, one for ETL loading throughout the day and one for end-user reporting. Make sure it’s clear which data warehouses are in scope for the migration.
Setting up the migration environment
The migration strategy uses the AWS SCT to accelerate schema object conversion and migrate the data from the Netezza database to the Amazon Redshift cluster. The following diagram illustrates this architecture.
The migration should ensure the following:
- The AWS SCT is installed within the AWS account onto an Amazon Elastic Compute Cloud (Amazon EC2) instance to facilitate migration operations, orchestrate the AWS SCT data extraction agents, and provide access via a user-friendly console.
- The AWS SCT data extraction agents are installed and run as close to the Netezza data warehouse as possible. AWS strongly recommends installing them on premises within the same subnet as the Netezza data warehouse.
During the transfer of data from the on-premises data center to the AWS account, you can use either a direct connection or offline storage. AWS Snowball is a petabyte-scale offline solution for moving large amounts of data into the AWS account when sufficient bandwidth over a direct connection isn’t available. AWS Direct Connect is a cloud service solution that makes it easy to establish a dedicated network connection from your premises to an AWS account. You can establish private connectivity between your AWS account and your data center, office, or co-location environment by using Direct Connect, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections. Using Direct Connect also adds flexibility in case extract jobs need to be re-run.
Configuring AWS SCT for the Netezza source environment
The AWS SCT is installed on an EC2 instance running Microsoft Windows 10 with administrator privileges. Choosing Microsoft Windows as the operating system allows your users to graphically control the creation of projects, modify profiles, start and view the progress of the conversions, and view the output of the migration assessment reports.
Because you don’t perform the data migration directly on the AWS SCT console, a general purpose EC2 instance with 4 vCPU, 16 GB memory, 100 GB storage, and moderate network bandwidth is sufficient.
You should configure several AWS SCT data extraction agents to match the amount of data to be concurrently transferred and the number of Netezza connections available. You can install the data extraction agents on on-premises VM instances running Linux with root administration privileges. The size of each instance is 8 vCPU, 32 GB memory, and up to 10 Gbps network capacity. For disk storage, we use 1 TB of 500 IOPS Provisioned SSD because intermediate results are stored on disk.
It’s preferable that the on-premises instances are located as close as possible to the Netezza data warehouse, ideally only a single network hop away. This is important because each data extraction agent stages the extracted data for each table on the instance file system. Also, each agent needs a relatively powerful CPU because compressing the extracted data is processor intensive.
As stated earlier, the number of agents should be proportionate to the amount of concurrent data streams being transferred and the number of Netezza connections available for the transfer. A rule of thumb is to have one data extraction agent for each TB of compressed Netezza data to be migrated in parallel. For optimum performance, it’s recommended that each agent is installed on a single VM instance.
You should work with the DBA team to ensure as many Netezza concurrent connections are made available to the data extraction agents as possible. For the best performance, allocating all the available connections gives the extraction agents the full power of the source database, but if you need to run workloads in parallel with the data extracts, a smaller number (for example, 21) can suffice. This is a trade-off between the resources made available and the time required to migrate the data.
For this use case, we allocated seven extraction agents, because the largest project phase extracted 6 TB of Netezza data. The DBA team configured 21 Netezza concurrent connections, so each agent was configured with three parallel data extraction processes (known as threads; see the following configuration file).
Two parameters on the data extraction agents can impact the length of time it takes for the data to migrate from Netezza to the agents: the number of connections and the number of threads.
Tuning is required for each data extraction agent to maximize throughput during the data migration phase. Tuning is achieved by modifying the file /usr/share/aws/sct-extractor/conf/settings.properties, and the file must be applied against each agent. See the following code:
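The following is a minimal sketch of the two relevant settings in settings.properties, using the values from this use case (three threads per agent and a connection pool roughly 1.5 times larger); the actual file contains additional properties that you can leave unchanged:

```
# /usr/share/aws/sct-extractor/conf/settings.properties (excerpt, illustrative values)

# Number of connections this agent opens against the Netezza data warehouse
# (recommended to be ~1.5x the thread pool size)
extractor.source.connection.pool.size=5

# Number of parallel extraction jobs (threads) this agent runs concurrently
extractor.extracting.thread.pool.size=3
```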
The preceding code has the following features:
- extractor.source.connection.pool.size defines the number of connections the agent opens against the Netezza data warehouse.
- extractor.extracting.thread.pool.size defines the number of parallel jobs the agent can spawn concurrently. The sum of this parameter across all the agents should be smaller than the maximum number of concurrent connections configured on Netezza.
- It’s an AWS recommendation to have extractor.source.connection.pool.size 1.5 times larger than extractor.extracting.thread.pool.size. This is because, while a task is running, the AWS SCT may need additional connections to retrieve metadata from Netezza to create additional tasks or for other operations, such as collecting table statistics.
Migrating to the development environment
The first task to undertake is data model schema transformation. It consists of transforming the Netezza schema objects into Amazon Redshift-compliant syntax and deploying them into the Amazon Redshift development environment. Before migrating the Netezza tables and views, you must create the schemas, groups, and users.
Creating schemas, users, and groups
If you don’t follow this step, all the objects are created in the Amazon Redshift public schema, which isn’t recommended. The following best practices aren’t specific to Netezza migration, but you can use them as a checklist during this step (see the example after this list):
- Create schemas to logically separate views and tables.
- Groups are easier to maintain than many users because you can grant permissions to groups, and you can add and remove users from groups. Also, groups can direct all traffic from all users in the group to a specific Amazon Redshift WLM queue (which can control priorities as well as QMR limits).
- Grant permissions at the schema level to allow selected groups to access the schema. This is independent of the permissions for the objects within the schema.
- Finally, assign users to groups.
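The following is a minimal sketch of these steps with hypothetical schema, group, and user names; adapt the object names, permissions, and password to your own standards:

```sql
-- Create a schema to hold the migrated tables and views (hypothetical name)
CREATE SCHEMA sales;

-- Create a group and grant it access at the schema level
CREATE GROUP reporting_ro;
GRANT USAGE ON SCHEMA sales TO GROUP reporting_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO GROUP reporting_ro;

-- Make sure future tables in the schema are readable by the group
ALTER DEFAULT PRIVILEGES IN SCHEMA sales GRANT SELECT ON TABLES TO GROUP reporting_ro;

-- Finally, create users and assign them to the group
CREATE USER report_user PASSWORD 'Str0ngPassw0rd1' IN GROUP reporting_ro;
```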
Transforming the schema
The AWS SCT analyzes the Netezza data model schema, converts the syntax into Amazon Redshift-compliant DDL statements, and applies the target schema to the Amazon Redshift cluster. The AWS SCT accelerates this phase by making sure Amazon Redshift best practices are taken into account during the transformation.
Within Amazon Redshift, column-level encoding makes sure that the most performant level of compression is applied to every data block of storage for the tables. It’s recommended that the latest ZSTD encoding is applied to all varchar, char, Boolean, and geometry columns, and the AZ64 encoding is applied to all other columns, including integers and decimals.
To improve zone map performance, don’t encode the first column of a sort key (set to raw encoding).
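To illustrate these recommendations, the following is a hypothetical DDL sketch (the table, columns, and keys are invented for this example): ZSTD on the character and Boolean columns, AZ64 on the numeric and date columns, and RAW on the leading sort key column.

```sql
CREATE TABLE sales.orders
(
    order_date    DATE          ENCODE RAW,   -- leading sort key column: leave unencoded
    order_id      BIGINT        ENCODE AZ64,
    customer_id   INTEGER       ENCODE AZ64,
    order_status  VARCHAR(20)   ENCODE ZSTD,
    is_priority   BOOLEAN       ENCODE ZSTD,
    order_total   DECIMAL(18,2) ENCODE AZ64
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);
```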
Netezza supports both character-length and byte-length semantics.
If character length semantics (the default) is selected, the length is specified in terms of characters, and it can consume more bytes than the length indicates. For example, if the varchar datatype length is set to 100, it allows multi-byte characters from 1–4 bytes to a maximum of 400 bytes.
If byte-length semantics is selected, the length is specified in terms of bytes, and the column can store only that number of bytes. For example, if the varchar datatype length is set to 100, it allows storing up to 100 bytes, whether the characters are single-byte or multi-byte.
As of this writing, Amazon Redshift doesn’t support character-length semantics, which can lead to String length exceeds DDL length errors while loading the data into Amazon Redshift tables. The simplest solution is to multiply the length of such attributes by 4. A more efficient solution requires determining the maximum length of each varchar column in bytes in Netezza, adding an additional 20% buffer to the maximum length, and setting that as the maximum value for the Amazon Redshift varchar datatype column.
If the Netezza column maximum length in bytes is less than Amazon Redshift column length in bytes, you don’t need to increase the size of the column length in Amazon Redshift. The following query gets the column datatype in Netezza:
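A sketch against the Netezza _v_relation_column system view (column names follow the standard Netezza catalog views; the table name is a placeholder):

```sql
-- Netezza: list varchar columns and their declared lengths for a table in scope
SELECT name        AS table_name,
       attname     AS column_name,
       format_type AS data_type
FROM _v_relation_column
WHERE name = 'YOUR_TABLE_NAME'                 -- placeholder: table to check
  AND format_type LIKE 'CHARACTER VARYING%'
ORDER BY attnum;
```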
The following script generates a query to get the maximum amount of bytes actually used for each varchar column:
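A sketch that builds one SELECT statement per varchar column, using OCTET_LENGTH to measure the stored bytes (the table filter is a placeholder; remove it to cover all tables):

```sql
-- Netezza: generate a query per varchar column returning the maximum bytes actually stored
SELECT 'SELECT ''' || name || ''' AS table_name, '''
       || attname || ''' AS column_name, '
       || 'MAX(OCTET_LENGTH(' || attname || ')) AS max_bytes FROM ' || name || ';'
FROM _v_relation_column
WHERE name = 'YOUR_TABLE_NAME'                 -- placeholder: table to check
  AND format_type LIKE 'CHARACTER VARYING%';
```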
During data migration, you can use the following query to identify the reason for the load failure:
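A sketch against the Amazon Redshift STL_LOAD_ERRORS system table, joined to SVV_TABLE_INFO to resolve the table name:

```sql
-- Amazon Redshift: most recent load errors with the failing column and reason
SELECT le.starttime,
       ti."table"         AS table_name,
       le.colname,
       le.col_length,
       le.err_reason,
       le.raw_field_value
FROM stl_load_errors le
JOIN svv_table_info ti ON ti.table_id = le.tbl
ORDER BY le.starttime DESC
LIMIT 20;
```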
If the error reason is String length exceeds DDL length, you need to increase the length of the affected column.
As recommended earlier, based on the maximum column length in Netezza, you should add an additional 20% buffer to it and set that as maximum length to Amazon Redshift.
The following in Netezza is example output from the preceding SQL command:
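(Hypothetical values only, assuming a column customer_name declared as VARCHAR(100) whose longest stored value is 150 bytes.)

```
TABLE_NAME | COLUMN_NAME   | DATA_TYPE              | MAX_BYTES
-----------+---------------+------------------------+----------
CUSTOMER   | CUSTOMER_NAME | CHARACTER VARYING(100) | 150
```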
The following code is output in Amazon Redshift:
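(The same hypothetical column after applying the 20% buffer: 150 bytes x 1.2 = 180.)

```
table_name | column_name   | data_type
-----------+---------------+------------------------
customer   | customer_name | character varying(180)
```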
AWS SCT uses statistics from the source database with user-specified optimization strategies to determine the appropriate distribution key and sort key strategies for the target schema. These optimization strategies require collecting statistics from the source database in order to activate the most relevant optimization rule for each table.
It’s recommended to do the following:
- Choose the current Netezza key distribution style as a good starting point for an Amazon Redshift table’s key distribution strategy. When the table is within Amazon Redshift with representative workloads, you can optimize the distribution choice if needed.
- Set the Amazon Redshift distribution style to auto for all Netezza tables with random distribution. This makes sure that Amazon Redshift automatically chooses the most performant distribution style depending on the number of rows in the table.
Migrating the data
You use the AWS SCT to migrate the data from the source Netezza data warehouse to the Amazon Redshift cluster. The AWS SCT migrates data with a three-phase approach:
- Extract – Extracts data from Netezza and stores it into the file system of on-premises AWS SCT data extraction agents
- Upload – Uploads data from the agents to Amazon Simple Storage Service (Amazon S3)
- Copy – Loads the data from Amazon S3 into Amazon Redshift via the COPY command
For any migration, especially ones with large volumes of data or many objects to migrate, it’s important to plan and migrate the tables in smaller tasks. This is where tracking the runs and progress via the migration runbook from the assessment phase is important.
Segment the source tables based on their size. The following choices were successful for a 60 TB Netezza migration:
- One AWS SCT task for all tables less than 5 GB
- One AWS SCT task for all tables 5–15 GB
- Multiple AWS SCT tasks for tables under 50 GB; a few tables per task
- One AWS SCT task for each table bigger than 50 GB
You should refine configuration according to the available migration windows. The approach ensures the following:
- A task is an atomic process; if it succeeds, all the managed tables are migrated successfully
- If it fails, it might be more convenient to run the entire task from scratch rather than double-check the status and consistency of each table
- Task size should strike a balance between these two considerations
To manage the substitution of special characters during these phases, set the following parameters:
- For NULL values as a string, enter ~~~~. By default, this is not checked. Numeric and date type nulls are by default extracted as 'N' and loaded to Amazon Redshift as nulls.
  - If it is unchecked, or if it is checked and the value is left blank, AWS SCT extracts char and varchar type nulls as 'N' and the COPY command has the NULL AS 'N' parameter set. This causes issues during COPY operations when the data contains the value 'N' in any column.
  - If it is checked and the value is ~~~~, AWS SCT extracts char and varchar type nulls as ~~~~ and the COPY command has the NULL AS ~~~~ parameter set. Using junk characters (such as ~~~~) means char and varchar null values are extracted as ~~~~, and the COPY command replaces and loads ~~~~ as NULL. This way, we can extract and load char and varchar null values, and it doesn’t cause issues during COPY when the data contains the value 'N' in any column.
  - If it is checked and the value is '', AWS SCT extracts char and varchar type nulls as '' and the COPY command has the NULL AS '' parameter set. NULL AS '' is equivalent to EMPTYASNULL.
- Deselect Use blank as null value. If BLANKASNULL is set (which is the default setting), it replaces whitespace-only values (' ') with NULL for char and varchar datatypes, and if the column is NOT NULL, inserting NULL fails. Deselecting BLANKASNULL loads the data as it is in the source.
- Deselect Use empty as null value. If EMPTYASNULL is set (which is the default setting), it replaces empty data (two delimiters in succession with no characters between them) with NULL for char and varchar datatypes. This is not needed.
The following screenshot shows our configuration for the AWS SCT tasks.
To keep track of the tasks and record them accurately in the migration runbook, on the Amazon S3 settings tab, set the folder name to be the same as the task name. Using a consistent naming convention allows easier tracking of progress in the runbook and is useful when troubleshooting any issues encountered.
For each subject area in scope, the extraction can run during the day, sharing connections and threads with other processes. However, for the initial data load, it’s recommended to schedule the tasks during the evening, a weekend, or another agreed window with as many Netezza resources as possible.
Breaking the migration down into smaller tasks allows you to log the progress in the migration runbook and run individual tasks to completion during the allocated migration window.
It’s recommended to migrate a small sample table first to test the parameter settings. The following sample table contains specific examples of edge cases that can provide quick feedback as to the suitability of the parameter settings:
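A minimal sketch of such a sample table (hypothetical names and values), covering true NULLs, empty strings, whitespace-only strings, the literal value 'N', trailing whitespace, and multi-byte characters:

```sql
-- Hypothetical Netezza test table covering edge cases for the extraction settings
CREATE TABLE edge_cases
(
    id            INTEGER NOT NULL,
    varchar_col   VARCHAR(20),
    nvarchar_col  NVARCHAR(20),    -- multi-byte (UTF-8) content
    char_col      CHAR(5),
    numeric_col   NUMERIC(18,2),
    date_col      DATE
)
DISTRIBUTE ON (id);

INSERT INTO edge_cases VALUES (1, NULL, NULL, NULL, NULL, NULL);                           -- true NULLs
INSERT INTO edge_cases VALUES (2, '', '', '', 0, '2020-01-01');                            -- empty strings
INSERT INTO edge_cases VALUES (3, '   ', '   ', '  ', 1.10, '2020-01-02');                 -- whitespace-only strings
INSERT INTO edge_cases VALUES (4, 'N', 'N', 'N', 2.20, '2020-01-03');                      -- literal value 'N'
INSERT INTO edge_cases VALUES (5, 'trailing  ', 'café 日本語', 'abc', 3.30, '2020-01-04');  -- trailing whitespace and multi-byte characters
```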
When migrating large Netezza tables, data is migrated on a table-by-table basis using multiple data extraction agents. You should split large tables (for example, tables with more than 20 million rows or greater than 50 GB) into partitions using the AWS SCT virtual partitions functionality. Using virtual partitioning is a recommended best practice for data warehouse migrations using the AWS SCT extractors.
Virtual partitions decrease the migration timeline of a table by parallelizing the extraction of a configurable amount of subsections. You can migrate partitions in parallel, and extract failure is limited to a single partition instead of the entire table.
The AWS SCT creates a subtask for each table partition. Then, when the migration is running, AWS SCT assigns the subtask to an available data extractor to run. The AWS SCT orchestrates which subtask runs on which extractor, thereby keeping all extractors as busy as possible throughout the migration.
To use virtual partitioning, you should identify an attribute that you can use to evenly split the table. It’s important that the virtual partitions are well balanced in order to exploit the benefit of the parallelism. The AWS SCT usually virtually defines such partitions at extraction time—virtual partitions aren’t related to how data is stored into the source data warehouse.
AWS SCT provides three types of virtual partitioning: list, range, and auto split. For more information, see Use virtual partitioning in the AWS Schema Conversion Tool.
When using list partitioning for very large tables (over 100 GB), the Netezza data slice IDs are a good option for the partition key.
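As a sketch, the following query uses the Netezza datasliceid pseudo column (the table name is a placeholder) to return the row count per data slice, which you can use as the list of partition values:

```sql
-- Netezza: row counts per data slice for a large table (candidate list-partition values)
SELECT datasliceid,
       COUNT(*) AS row_count
FROM your_schema.your_big_table      -- placeholder: table to partition
GROUP BY datasliceid
ORDER BY datasliceid;
```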
Migrating to other pre-production environments
After the data migration has successfully been proven in the development environment, you may choose to migrate to other pre-production environments. Apply the same steps and validation checks, including:
- Validate that the schema deployment matches the development environment.
- Validate the data migration has completed successfully, and that no data load errors are logged into the STL_LOAD_ERRORS table. The typical reasons for errors at this stage include schema mismatch, different input file formats, or insufficient varchar length for the input data.
- Validate the ETL deployment is loading the data as expected.
Migrating to the production environment
Migration to the production environment follows the same processes as the non-production environments, with the addition of the following steps:
- Undertake the task of business validation with your stakeholders to measure the accuracy of the migration in meeting the program goals:
  - Undertake a period of dual-running the ETL deployment, with production data being dual-loaded into the Netezza data warehouse and the production Amazon Redshift cluster.
  - Compare the result sets from the Netezza data warehouse and the production Amazon Redshift cluster (the data validation scripts in the following section support this task).
  - Update the migration runbook for each source table to record the number of records migrated, which validation checks have been run, and any discrepancies found during the checks.
  - Run reports and dashboards against the Netezza data warehouse and the production Amazon Redshift cluster and ensure the results match.
  - Obtain sign-off upon successful completion of these business validation tests.
- After you successfully complete the dual-running of both ETL and reporting deployments, the source of truth is transferred from the Netezza data warehouse to the production Amazon Redshift cluster by decommissioning the Netezza ETL deployment and the Netezza data warehouse, and re-pointing all reporting and dashboard connections to the Amazon Redshift cluster.
- When the Amazon Redshift cluster is live, monitor the cluster and ensure data model best practices are being followed.
Validating the data
After you migrate the data model schema and data contents to Amazon Redshift, you should perform data-validation tests to measure the migration’s success. The scripts included in this section cover checks commonly undertaken during migration engagements. All these scripts must be run by a superuser account.
Amazon Redshift utilities
The Amazon Redshift Utilities GitHub repo contains a set of utilities to accelerate troubleshooting or analysis on Amazon Redshift. Such utilities consist of queries, views, and scripts. These scripts aren’t deployed by default into Amazon Redshift clusters. The recommendation is to deploy the views into an admin schema.
Comparing source vs. target table and view counts
For Netezza, enter the following code:
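A sketch using the Netezza _v_table and _v_view system views (verify the objtype filter values against your Netezza version):

```sql
-- Netezza: count tables and views in the current database
SELECT 'TABLE' AS object_type, COUNT(*) AS object_count
FROM _v_table
WHERE objtype = 'TABLE'
UNION ALL
SELECT 'VIEW', COUNT(*)
FROM _v_view
WHERE objtype = 'VIEW';
```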
For Amazon Redshift, enter the following code:
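A sketch using the Amazon Redshift SVV_TABLES system view:

```sql
-- Amazon Redshift: count tables and views per schema
SELECT table_schema,
       table_type,
       COUNT(*) AS object_count
FROM svv_tables
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
GROUP BY table_schema, table_type
ORDER BY table_schema, table_type;
```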
Comparing source vs. target table constraints
For Netezza, enter the following code:
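A sketch using the Netezza _v_relation_keydata system view (verify the view and column names against your Netezza version):

```sql
-- Netezza: declared constraints per table
SELECT relation       AS table_name,
       constraintname AS constraint_name,
       contype,        -- constraint type (primary key, unique, foreign key)
       attname        AS column_name
FROM _v_relation_keydata
ORDER BY relation, constraintname, conseq;
```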
For Amazon Redshift, enter the following code:
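A sketch using the PostgreSQL-style catalog tables available in Amazon Redshift:

```sql
-- Amazon Redshift: declared constraints per table
SELECT n.nspname   AS schema_name,
       c.relname   AS table_name,
       con.conname AS constraint_name,
       con.contype -- 'p' = primary key, 'u' = unique, 'f' = foreign key
FROM pg_constraint con
JOIN pg_class c     ON c.oid = con.conrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY 1, 2, 3;
```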
Generating missing constraints from Netezza
Run the following SQL statements in Netezza to generate the DDL statements to add any missing constraints in Amazon Redshift:
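A sketch that generates ALTER TABLE statements for single-column primary keys from _v_relation_keydata; composite keys need the attname values concatenated in conseq order, and you may want to add schema qualification:

```sql
-- Netezza: generate DDL to recreate primary key constraints in Amazon Redshift
SELECT 'ALTER TABLE ' || LOWER(relation)
       || ' ADD CONSTRAINT ' || LOWER(constraintname)
       || ' PRIMARY KEY (' || LOWER(attname) || ');'
FROM _v_relation_keydata
WHERE contype = 'p';   -- adjust the filter to the constraint types missing in the target
```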
Run the generated script against the Amazon Redshift database.
Identifying tables with insufficient varchar column length
For Netezza, enter the following code:
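A sketch using _v_relation_column to list varchar columns and their declared lengths:

```sql
-- Netezza: varchar columns and declared lengths
SELECT name        AS table_name,
       attname     AS column_name,
       format_type AS data_type
FROM _v_relation_column
WHERE format_type LIKE 'CHARACTER VARYING%'
ORDER BY name, attnum;
```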
For Amazon Redshift, enter the following code:
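A sketch using SVV_COLUMNS; compare the declared lengths (in bytes) with the Netezza output and with the maximum stored bytes measured earlier:

```sql
-- Amazon Redshift: varchar columns and declared lengths (in bytes)
SELECT table_schema,
       table_name,
       column_name,
       character_maximum_length
FROM svv_columns
WHERE data_type = 'character varying'
  AND table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name, ordinal_position;
```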
Comparing source vs. target row count
Remove the final UNION ALL from the output of the following two scripts before running it.
For Netezza, enter the following:
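A sketch that generates the row-count script from _v_table:

```sql
-- Netezza: generate a row-count query for every user table
SELECT 'SELECT ''' || tablename
       || ''' AS table_name, COUNT(*) AS row_count FROM ' || tablename
       || ' UNION ALL'
FROM _v_table
WHERE objtype = 'TABLE';
```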
For Amazon Redshift, enter the following code:
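A sketch that generates the row-count script from SVV_TABLES:

```sql
-- Amazon Redshift: generate a row-count query for every user table
SELECT 'SELECT ''' || table_schema || '.' || table_name
       || ''' AS table_name, COUNT(*) AS row_count FROM '
       || table_schema || '.' || table_name || ' UNION ALL'
FROM svv_tables
WHERE table_type = 'BASE TABLE'
  AND table_schema NOT IN ('pg_catalog', 'information_schema');
```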
Comparing source vs. target columns
For Netezza, enter the following code:
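A sketch using _v_relation_column (add a filter on the owner or database if the view also returns system relations in your environment):

```sql
-- Netezza: table, column, datatype, and ordinal position
SELECT name        AS table_name,
       attname     AS column_name,
       format_type AS data_type,
       attnum      AS column_position
FROM _v_relation_column
ORDER BY name, attnum;
```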
For Amazon Redshift, enter the following code:
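A sketch using SVV_COLUMNS:

```sql
-- Amazon Redshift: table, column, datatype, and ordinal position
SELECT table_schema,
       table_name,
       column_name,
       data_type,
       ordinal_position
FROM svv_columns
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name, ordinal_position;
```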
Comparing source vs. target distribution key
For Netezza, enter the following code:
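A sketch using the Netezza _v_table_dist_map system view (verify the view and column names against your Netezza version):

```sql
-- Netezza: distribution column(s) for each table
SELECT tablename,
       distseqno,
       distattname AS distribution_column
FROM _v_table_dist_map
ORDER BY tablename, distseqno;
```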
For Amazon Redshift, enter the following code:
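A sketch joining SVV_TABLE_INFO with PG_TABLE_DEF (PG_TABLE_DEF only returns tables in schemas on the current search_path):

```sql
-- Amazon Redshift: distribution style and distribution key column per table
SELECT ti.schema,
       ti."table",
       ti.diststyle,
       td."column" AS dist_key_column
FROM svv_table_info ti
LEFT JOIN pg_table_def td
       ON td.schemaname = ti.schema
      AND td.tablename  = ti."table"
      AND td.distkey    = true
ORDER BY ti.schema, ti."table";
```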
Verifying if any invalid UTF-8 characters were replaced
For Amazon Redshift, enter the following code:
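A sketch using STL_REPLACEMENTS, which is populated when the COPY command runs with the ACCEPTINVCHARS option:

```sql
-- Amazon Redshift: rows where invalid UTF-8 characters were replaced during COPY
SELECT ti."table" AS table_name,
       COUNT(*)   AS replaced_rows
FROM stl_replacements r
JOIN svv_table_info ti ON ti.table_id = r.tbl
GROUP BY ti."table"
ORDER BY replaced_rows DESC;
```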
Identifying COPY errors
For Amazon Redshift, enter the following code:
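A sketch using STL_LOAD_ERRORS joined to SVV_TABLE_INFO:

```sql
-- Amazon Redshift: most recent COPY errors with the failing table, column, and reason
SELECT le.starttime,
       ti."table"        AS table_name,
       le.colname,
       le.type           AS column_type,
       le.err_reason,
       le.raw_field_value
FROM stl_load_errors le
JOIN svv_table_info ti ON ti.table_id = le.tbl
ORDER BY le.starttime DESC
LIMIT 100;
```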
Additional data validation checks
In addition to checking the row count for each table, you should perform tests on data quality to guarantee production data readiness:
- During this activity, run tailored queries and validate them against Amazon Redshift tables and views. The recommendation is to run such checks against records that include NULL values as well as strings including trailing whitespaces.
- Compute and compare statistics (min, max, average, sum, checksums) on numeric attributes against Netezza equivalents.
Conclusion
In this post, we detailed a project migration plan to migrate from Netezza to Amazon Redshift. We included examples of sizing the AWS SCT data extraction agents depending on the volume of data to migrate and the resources made available for the transfer. Validation of successful schema and data migration is vital, and we included several scripts to validate that the data model and data content meet expectations.
Special thanks go to AWS colleagues Arturo Bayo, Boopathi P, and Sunil Vora for their project delivery and support with this post.
About the Authors
Mattia Berlusconi is a Data & Analytics consultant with AWS Professional Services supporting enterprises in adopting innovative solutions for organizing and exploiting data to achieve their business objectives. He is specialized in building data platforms and managing database migrations.
Simon Dimaline has specialised in data warehousing and data modelling for more than 20 years. He currently works for the Data & Analytics practice within AWS Professional Services accelerating customers’ adoption of AWS analytics services.