AWS Feed
Migrate data into Amazon ES using remote reindex

Amazon Elasticsearch Service (Amazon ES) recently launched support for remote reindexing. This feature adds the ability to copy data to an Amazon ES domain from self-managed Elasticsearch running on-premises, self-managed on Amazon Elastic Compute Cloud (Amazon EC2) on AWS, or another Amazon ES domain.

Remote reindex supports Elasticsearch 1.5 and higher for the remote Elasticsearch cluster and Amazon ES 6.7 and higher for the local domain.

The remote reindex feature migrates data from the remote cluster using the Elasticsearch scroll API and reindexes each document into the local Amazon ES domain.

In this post, we cover four common use cases for using remote reindex to migrate data into an Amazon ES domain.

Use case 1: Copying from a self-managed Elasticsearch cluster using ELB

Our first use case has the following configuration:

  • Remote – Self-hosted Elasticsearch version 1.5 or higher running on Amazon EC2
  • Local – Amazon ES domain version 6.7 or higher

The following diagram illustrates our architecture.

[Architecture diagram: remote reindex, use case 1]

Before getting started, make sure you have the following prerequisites:

  • An Amazon EC2 server running in a public subnet with access to Amazon ES running in a private subnet within the same VPC
  • ELB running in a public subnet of the same VPC as the remote Elasticsearch cluster, with a security group allowing inbound traffic on port 443 and a listener on port 443 forwarding to the Elasticsearch DNS endpoint

To copy your data, complete the following steps:

  1. Open the Kibana dashboard on the Amazon EC2 server to connect to the local Amazon ES domain (for example, https://vpc-abc123.us-east-1.es.amazonaws.com/_plugin/kibana).
  2. The connection to the remote Elasticsearch cluster must be authorized to perform reindex operations. If the remote cluster is secured with basic authorization, it only needs a username and password. However, if it’s using fine-grained access control, the user performing reindex operations needs reindex privileges on the local Amazon ES domain and read index privileges on the remote Elasticsearch cluster.
  3. Run the POST reindex operation on the local Amazon ES domain using Kibana Dev Tools to reindex data from the remote Elasticsearch cluster. See the following code:
    POST _reindex/?pretty=true&scroll=10h&wait_for_completion=true
    {
      "source": {
        "remote": {
          "host": "https://<remote endpoint>:443",
          "username": "<username>",
          "password": "<password>",
          "socket_timeout": "30m"
        },
        "size": 1000,
        "index": "movies"
      },
      "dest": { "index": "movies" }
    }

You can perform the same reindex operation using curl commands:

curl -XPOST -u <username>:<password> "https://<local-domain-endpoint>/_reindex/?pretty=true&scroll=10h&wait_for_completion=false" -H 'Content-Type: application/json' -d '{ "source": { "remote": { "host": "https://<remote-endpoint>:443", "socket_timeout": "60m", "external": true, "username": "<username>", "password": "<password>" }, "size": 1000, "index": "movies" }, "dest": { "index": "movies" } }'

Check the progress of index migration on the local Amazon ES domain using the following command:

GET <local-domain-endpoint>/movies/_search

In the preceding code, you copy the movies index from the remote Elasticsearch cluster to the local Amazon ES domain. The remote reindex operation sends a scroll request to the remote domain with the following default values:

  • Search context of 5 minutes
  • Socket timeout of 30 seconds
  • Batch size of 1,000

Refer to the Performance improvements section later in this post for information about tuning these values.

We use the "external": true flag to indicate that the remote index is hosted outside of Amazon ES.
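The same request can also be issued programmatically. The following is a minimal Python sketch using only the standard library; the endpoint URLs and credentials are placeholders, not real values:

```python
import json
import urllib.request

def build_reindex_body(remote_host, username, password, index,
                       batch_size=1000, socket_timeout="30m"):
    """Build the _reindex request body shown in the examples above."""
    return {
        "source": {
            "remote": {
                "host": remote_host,            # e.g. "https://<remote endpoint>:443"
                "username": username,
                "password": password,
                "socket_timeout": socket_timeout,
                "external": True,               # remote cluster lives outside Amazon ES
            },
            "size": batch_size,                 # documents fetched per scroll batch
            "index": index,
        },
        "dest": {"index": index},
    }

# Placeholder endpoint and credentials -- substitute your own.
body = build_reindex_body("https://remote.example.com:443",
                          "user", "secret", "movies")
req = urllib.request.Request(
    "https://local-domain.example.com/_reindex?scroll=10h&wait_for_completion=false",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would submit the request; omitted here.
```

Building the body as a dictionary makes it easy to tune scroll duration, batch size, and socket timeout from one place when you experiment with the values discussed later in this post.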

Use case 2: Copying from self-managed Elasticsearch using an NGINX proxy server

Our second use case has the following configuration:

  • Remote – Self-hosted Elasticsearch on premises
  • Local – Amazon ES domain version 6.7 or higher

The following diagram illustrates our architecture.

[Architecture diagram: remote reindex, use case 2]

Make sure you have the following prerequisites:

  • Amazon ES version 6.7 or higher with software release version R2020117 with an Amazon VPC
  • An Amazon EC2 server running in a public subnet with network connectivity to the local Elasticsearch cluster
  • Public internet connectivity to an NGINX reverse proxy server that can connect to the remote Elasticsearch cluster

Connectivity to the remote cluster is secured with TLS encryption, so you need a certificate signed by a public certificate authority. For instructions on configuring security credentials for NGINX, see Update: Using Free Let’s Encrypt SSL/TLS Certificates with NGINX. If you’re generating certificates for external domains, see Manual for additional options.

To properly route the reindex requests, modify the NGINX reverse proxy default configuration file default.conf in the /etc/nginx/conf.d directory. Update the following key variables:

  • /etc/nginx/cert.crt
  • /etc/nginx/cert.key
  • $ES_endpoint
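Extending the proxy_set_header directives shown later in this post, a minimal default.conf sketch might look like the following. The server_name and the upstream endpoint (standing in for $ES_endpoint) are hypothetical; adapt them, and the certificate paths, to your environment:

```nginx
server {
    listen 443 ssl;
    server_name proxy.example.com;                 # hypothetical public DNS name

    ssl_certificate     /etc/nginx/cert.crt;       # certificate signed by a public CA
    ssl_certificate_key /etc/nginx/cert.key;

    location / {
        # $ES_endpoint: the remote Elasticsearch cluster, e.g. https://es.internal:9200
        proxy_pass https://es.internal:9200;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```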

For more information, see Using a Proxy to Access Amazon ES from Kibana.

After you establish connectivity with the remote Elasticsearch cluster, you can run the reindex operation as outlined in the first use case. Be sure to change the host argument to the DNS name of the publicly accessible NGINX reverse proxy.

Use case 3: Copying from a public Amazon ES domain using IAM credentials

Our next use case has the following configuration:

  • Remote – Publicly accessible Amazon ES domain version 1.5 or higher
  • Local – Amazon ES domain version 6.7 or higher

The following diagram illustrates our architecture.

[Architecture diagram: remote reindex, use case 3]

To copy the data, complete the following steps:

  1. Create an IAM user that has been granted access to both the local and remote Amazon ES domain. The following code is an example access policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "AWS": "<arn of user>" },
          "Action": "es:*",
          "Resource": "<arn of Amazon ES domain>/<index name>/*"
        }
      ]
    }
  2. On the IAM console, create an access key for the specified user.
  3. Record the access key ID and secret in a secure location.
  4. Make the call to the _reindex request, using IAM credentials to sign the request with the Signature Version 4 (SigV4) signing process.

To simplify the signing process, you can use the Postman application and the AWS Signature authorization type.

  1. Launch Postman.
  2. Enter the endpoint URL of the local Amazon ES domain in the address bar, followed by /_reindex/?pretty=true&scroll=10h&wait_for_completion=true.
  3. On the HTTP method drop-down menu, choose POST.
  4. On the Authorization tab, for the authorization type, choose AWS Signature.
  5. For AccessKey, enter your IAM user’s access key ID.
  6. For SecretKey, enter your IAM user’s secret key.
  7. Specify the appropriate AWS Region that matches the Region of your Amazon ES domain.
  8. For Service Name, enter es.
  9. On the Body tab, select raw.
  10. Enter the source and destination JSON as shown in use case 1, but set "external": false.
  11. Choose Send.
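Behind the AWS Signature option, Postman derives a SigV4 signing key from your secret key, the date, the Region, and the service name. If you prefer to script the call instead of using Postman, the key derivation can be sketched in Python with only the standard library; the credential and date values below are placeholders:

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key, date_stamp, region, service):
    """Derive the SigV4 signing key via the standard HMAC-SHA256 chain."""
    def sign(key, msg):
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()
    k_date = sign(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, "aws4_request")

# Placeholder credentials -- substitute your IAM user's secret key and the
# Region of your Amazon ES domain; the service name for Amazon ES is "es".
key = sigv4_signing_key("EXAMPLE-SECRET-KEY", "20210101", "us-east-1", "es")
```

The resulting key is then used to sign the canonical request; in practice a signing library (or Postman itself) handles the full canonicalization for you.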

You can check the progress using Kibana on the local Amazon ES domain through Dev Tools by issuing a search on the remote index similar to use case 1.
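If you started the reindex with wait_for_completion=false, you can also monitor the running task itself with the standard Elasticsearch _tasks API, shown here as a sketch in Dev Tools:

```
GET _tasks?actions=*reindex&detailed=true
```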

Use case 4: Copying from an Amazon ES domain in the same VPC using IAM credentials

Our final use case has the following configuration:

  • Remote – Amazon ES domain with VPC access version 1.5 or higher
  • Local – Amazon ES domain with VPC access version 6.7 or higher

The following diagram illustrates our architecture.

[Architecture diagram: remote reindex, use case 4]

Every Amazon ES domain runs on its own internal VPC infrastructure. When you create a new Amazon ES domain in an existing VPC, an elastic network interface (ENI) is created in your VPC for each data node. Because the remote reindex operation runs from the local Amazon ES domain, inside its own private VPC, it can’t reach the remote Amazon ES domain’s VPC endpoint directly. Instead, you need a publicly accessible reverse proxy.

To copy the data, complete the following steps:

  1. Create an IAM user that has been granted access to both the local and remote Amazon ES domain. The following code is an example access policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "AWS": "<arn of user>" },
          "Action": "es:*",
          "Resource": "<arn of Amazon ES domain>/<index name>/*"
        }
      ]
    }
  2. On the IAM console, create an access key for the specified user.
  3. Record the access key ID and secret in a secure location.
  4. Set up an EC2 instance with an NGINX reverse proxy for the remote Amazon ES VPC endpoint, as outlined in use case 2.

This EC2 instance must be within the same VPC as the Amazon ES domain. Because you’re signing your requests, make sure that the NGINX configuration contains the following:

proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
  5. Use a machine in the same VPC as the Amazon ES domain (either a running EC2 instance or a local machine connected via VPN) to make the call to the _reindex request, using IAM credentials to sign the request with the Signature Version 4 (SigV4) signing process.

To simplify the signing process, you can use the Postman application and the AWS Signature authorization type.

  1. Launch Postman.
  2. Enter the endpoint URL of the local Amazon ES domain in the address bar, followed by /_reindex/?pretty=true&scroll=10h&wait_for_completion=true.
  3. On the HTTP method drop-down menu, choose POST.
  4. On the Authorization tab, for the authorization type, choose AWS Signature.
  5. For AccessKey, enter your IAM user’s access key ID.
  6. For SecretKey, enter your IAM user’s secret key.
  7. Specify the appropriate AWS Region that matches the Region of your Amazon ES domain.
  8. For Service Name, enter es.
  9. On the Body tab, select raw.
  10. Enter the source and destination JSON as shown in use case 1, but set "external": false.
  11. For the "host" value in the "source" section, use the externally accessible URL of the NGINX reverse proxy.
  12. Choose Send.

You can check the progress using Kibana on the local Amazon ES domain through Dev Tools by issuing a search on the remote index similar to use case 1.

Performance improvements

The remote reindex operation allows you to modify local index settings before copying data, for example to adjust the number of primary shards. A best practice is to create an index with the required settings on your local domain before starting the reindex operation.

To speed up reindex performance, disable the refresh interval and replica shards using the following settings:

PUT movies/_settings
{
  "refresh_interval": "-1",
  "number_of_replicas": 0
}

When the reindex operation is complete, adjust the replicas count and refresh interval to your desired settings.
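This disable-then-restore workflow can be wrapped in a small helper so the settings are always restored, even if the reindex fails. The following is a hedged Python sketch; apply_settings stands in for whatever client call issues PUT <index>/_settings in your environment, and the restore values are assumptions to replace with the ones your workload needs:

```python
import contextlib

@contextlib.contextmanager
def fast_reindex_settings(apply_settings, refresh_interval="1s", replicas=1):
    """Disable refresh and replicas for the duration of a reindex, then restore.

    `apply_settings` is a hypothetical callable that issues
    PUT <index>/_settings with the given payload (via curl, a client, etc.).
    """
    apply_settings({"refresh_interval": "-1", "number_of_replicas": 0})
    try:
        yield
    finally:
        # Restored even if the reindex raises an exception.
        apply_settings({"refresh_interval": refresh_interval,
                        "number_of_replicas": replicas})

# Example with a stand-in that just records the payloads it would send.
sent = []
with fast_reindex_settings(sent.append):
    pass  # run the _reindex call here
```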

The reindex operation additionally provides options such as copying only a subset of documents, copying unique documents, or even combining one or more indexes. In the following example code, the remote reindex operation copies only those documents from the kibana_sample_data_ecommerce index whose currency field has the value EUR:

POST _reindex/?pretty=true&scroll=10h&wait_for_completion=true
{
  "source": {
    "remote": {
      "host": "https://<remote endpoint>:443",
      "username": "<username>",
      "password": "<password>",
      "socket_timeout": "30m"
    },
    "query": {
      "bool": {
        "filter": { "term": { "currency": "EUR" } }
      }
    },
    "size": 10000,
    "index": "kibana_sample_data_ecommerce"
  },
  "dest": { "index": "kibana_sample_data_ecommerce" }
}

For more information about the available reindexing options, see Reindex data.

The local cluster pulls data from the remote cluster using scroll queries. Depending on the dataset, you need to set the duration for which the scroll context remains valid on the remote cluster. To make sure the remote reindex operation doesn’t time out on large datasets, set the scroll value higher (10–36 hours).

The size parameter determines the batch size for every scroll call; its optimal value depends on the nature of the data and the cluster configuration. Initially set it to a lower value (such as 100), and increase it only if doing so improves performance.

The socket_timeout parameter is the maximum period of inactivity allowed on the HTTP connection between the local and remote cluster. The local cluster fetches data in batches using the scroll query and triggers bulk indexing calls; if too many bulk requests are pending, it waits before fetching the next batch of documents. If that wait exceeds the configured socket timeout, the reindex fails. We recommend setting a higher timeout value (1–2 hours) to prevent such failures.

Limitations

Keep in mind the following limitations when using remote reindex:

  • As of this writing, the remote reindex operation doesn’t support scroll slicing, which would allow multiple scroll operations for the same request to run in parallel. The operation is only as fast as an index operation with a single client connection.
  • You can’t restart the task if a failure occurs. If the node performing the operation dies, you have to re-trigger the reindex operation.
  • The remote reindex operation simply copies a snapshot of the index at that particular time. In situations where the indexes are continuously being updated on the remote cluster, repeat the reindex operation to sync data between the two clusters.
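For the repeated syncs mentioned in the last limitation, one hedged pattern is to restrict each pass to recently changed documents, assuming your mapping has a timestamp field (the field name and window below are illustrative, not from the original example):

```
POST _reindex/?pretty=true&scroll=10h&wait_for_completion=true
{
  "source": {
    "remote": {
      "host": "https://<remote endpoint>:443",
      "username": "<username>",
      "password": "<password>"
    },
    "query": { "range": { "timestamp": { "gte": "now-1d/d" } } },
    "index": "movies"
  },
  "dest": { "index": "movies" }
}
```

Documents with the same _id are overwritten in the destination, so repeated passes converge on the remote cluster’s current state for the matched window.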

Conclusion

In this post, we covered how to use the remote reindex operation in Amazon ES to copy index data from your remote cluster into an Amazon ES domain. We also looked at several performance tuning options available with the reindex operation.

If you have questions or suggestions, please leave a comment.


About the authors

Ryan Peterson is a Senior Solutions Architect at Amazon Web Services based in Irvine, CA. Ryan works closely with the Amazon CloudSearch and Amazon Elasticsearch Service teams, providing help and guidance to a broad range of customers that have search workloads they want to move to the AWS Cloud.

Viral Shah is a Senior Solutions Architect with the AWS Data Lab team based out of New York, NY. He has over 20 years of experience working with enterprise customers and startups, primarily in the data and database space. He loves to travel and spend quality time with his family.