Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon Kinesis Data Firehose

Apr 25 2020

Off

Today we are adding a new Amazon Kinesis Data Firehose feature to set up VPC delivery to your Amazon Elasticsearch Service domain from the Kinesis Data Firehose. If you have been managing a custom application on Amazon Kinesis Data Streams to keep traffic private, you can now use Kinesis Data Firehose and load your data into an Amazon Elasticsearch Service endpoint in a VPC without having to invest, operate, and scale ingestion and delivery infrastructure. You can start using this new feature from Kinesis Data Firehose console, AWS CLI, and API by selecting Amazon Elasticsearch Service as the destination, the specific domain with VPC access, and setting the VPC configuration with subnets and the optional security groups.

Before this feature

Amazon Elasticsearch Service domains can have public or private endpoints. Public endpoints are backed by IP addresses on the public internet. Private endpoints are backed by IP addresses within the IP space of your VPC.

If you have been using an Amazon Elasticsearch Service VPC endpoint, you most likely use Kinesis Data Streams or similar soultion to ingest streaming data. This means running a custom application on the stream that delivers it to the Amazon Elasticsearch Service VPC domain. You likely had to perform the following actions:

Implement buffering
Format conversions
Perform compression
Apply transformation
Manage backup
Handle transient delivery failures

Additionally, you have to build, scale, monitor, update, and maintain this custom application.

Kinesis Data Firehose delivery to Amazon Elasticsearch Service VPC endpoint

Kinesis Data Firehose can now deliver data into an Amazon Elasticsearch Service VPC endpoint. This provides a secure and easy way to ingest, transform, and deliver streaming data. You don’t need to worry about managing your data ingestion and delivery infrastructure. With this new feature, Kinesis Data Firehose enables additional secure communication to Amazon Elasticsearch Service VPC endpoints. Amazon Elasticsearch Service endpoints that live within a VPC give you an extra layer of security.

How it works

When you create a Kinesis Data Firehose delivery stream that delivers data to an Amazon Elasticsearch Service VPC endpoint, Kinesis Data Firehose creates an Elastic Network Interface (ENI) in each subnet you select. If you only use one Availability Zone, Kinesis Data Firehose places an endpoint into only one subnet. Similarly, when you create an Amazon Elasticsearch Service VPC endpoint, it creates endpoints in the subnets you chose. Kinesis Data Firehose uses ENI to deliver the data to your Amazon Elasticsearch Service ENI, all inside your VPC. The following screenshot outlines the resulting architecture with a single subnet.

For this walkthrough, you have two security groups:

kdf-sec-grp for your Kinesis Data Firehose endpoint
es-sec-grp for your Amazon Elasticsearch Service endpoint

To let Kinesis Data Firehose access your Amazon Elasticsearch Service VPC endpoint, security group es-sec-grp needs to allow the ENI that Kinesis Data Firehose created to make HTTPS calls. Kinesis Data Firehose scales the ENIs automatically to meet the throughput requirements. As Kinesis Data Firehose scales ENIs, the outbound rules of the enclosing security group kdf-sec-grp control the data stream. You should configure the Amazon Elasticsearch Service security group (es-sec-grp) to allow HTTPS traffic from the Kinesis Data Firehose security group (kdf-sec-grp). The Kinesis Data Firehose security group needs to allow outbound HTTPS traffic, and its destination is the Amazon Elasticsearch Service security group. With Kinesis Data Firehose VPC delivery, you do not need to make the Firehose security group open to outside traffic.

You can also use the same security group for Kinesis Data Firehose and Amazon Elasticsearch Service endpoints. If you use the same security group for both, make sure the security group inbound rule allows HTTPS traffic.

For your existing delivery streams, you can change the destination endpoint. The new destination must be accessible within the same VPC, subnets, and security groups. Changing either of the VPC, subnets, and security groups requires you to recreate a delivery stream.

All existing Kinesis Data Firehose limits apply to this capability. For example, you can increase the default 50 delivery streams per account by submitting a quota increase request. Also, Kinesis Data Firehose creates one or more ENIs per VPC destination subnet per delivery stream. Kinesis Data Firehose automatically scales the number of ENIs as needed based on the actual throughput. The default throughput limit per delivery stream is 5 MB/second (dependent on Region). You can request an increase to this limit by submitting a support case.

You need to make sure you have enough ENIs available. By default, VPC has a quota of 5000 ENIs per Region. For more information, see Amazon VPC Quotas.

The advantage of using a managed service like Kinesis Data Firehose is that you can focus on the value of your data and not the underlying plumbing. You can configure the frequency of data delivery from your delivery stream to your Amazon Elasticsearch Service domain. Kinesis Data Firehose buffers incoming data before delivering it to Amazon ES. You can configure the values for Amazon Elasticsearch Service buffer size (1 MB–100 MB) or buffer interval (60–900 seconds), and the condition satisfied first triggers data delivery to Amazon Elasticsearch Service. In case data delivery fails for an Amazon Elasticsearch Service destination, you can specify a retry duration between 0 and 7,200 seconds when you create the delivery stream. If data delivery to your Amazon Elasticsearch Service endpoint fails, Kinesis Data Firehose retries data delivery for the specified time duration. After the retrial period, Kinesis Data Firehose skips the current batch of data and moves on to the next batch. Skipped documents go to your Amazon S3 bucket elasticsearch_failed folder, which you can use for manual backfill.

For more information about sizing, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

Solution overview

To show you how to use this new feature, this post uses stock demo data available on the Kinesis Data Firehose console to deliver to an Amazon Elasticsearch Service endpoint in VPC. The following diagram illustrates the workflow.

This use case simulates a producer sending stock ticker data to the delivery stream (A). You use an AWS Lambda function (B) to add a timestamp to the stock records so that you can create Kibana visualization. Kinesis Data Firehose streams the stock records to the Amazon Elasticsearch Service endpoint (C) in your VPC. Finally, you can visualize the data using Kibana (D).

This post uses the Amazon Management Console to implement this solution, but you can also use AWS CLI.

Creating security groups

Start by creating two security groups: one for the Amazon Elasticsearch Service VPC endpoint (es-sec-grp) and another for the delivery stream (kdf-sec-grp). Create security groups without any rules first. After you have created them, set the inbound and outbound rules. The following table summarizes these rules.

Creating an Amazon Elasticsearch Service VPC endpoint

To create an Amazon Elasticsearch Service endpoint in VPC, complete the following steps:

On the Amazon Elasticsearch Service console, choose Create a new Domain.
For Deployment Type and Latest Version, choose Development and Testing.
Choose Next.
Give your Amazon Elasticsearch Service endpoint a name.
Select your instance type.

This post uses m5.xlarge.elasticsearch. For production environments, select the appropriately sized instance type. For this post, leave the number of nodes at 1, though best practice is to set it to 2.

Set EBS storage size per node to 100 GiB.
Leave the rest of the settings at their defaults and choose Next.
Select the VPC and private subnet for your Amazon Elasticsearch Service endpoint and the security group for Amazon Elasticsearch Service that you created previously (es-sec-grp).
To access Kibana, choose fine-grained access.
Choose Create Master User.

In this post, we are using internal user database enabled with HTTP basic authentication. For production environments, use IAM roles and configure the appropriate fine-grained access. For more information, see Fine-Grained Access Control in Amazon Elasticsearch Service.

Choose Allow open access to Domain.

Security groups already enforce IP-based access policies. This step opens access to your Amazon Elasticsearch Service endpoint to resources in your VPC, and your Amazon Elasticsearch Service endpoint is not accessible to the internet. For an additional layer of security in your Amazon Elasticsearch Service endpoint, use access policies that specify IAM users or roles. For more information about controlling access to your domains, see Identity and Access Management in Amazon Elasticsearch Service.

Choose Next.
Review your settings and choose Confirm.

The following screenshot shows an example of what your Amazon Elasticsearch Service endpoint VPC settings should look like.

Creating a Lambda function for record transformation

Create a Lambda function to add a timestamp to the data feed. Complete the following steps:

On the Lambda console, choose Create Function.
Choose Author from scratch.
Name your function; for example, tmakAddTSToStream.
Choose Python 3.7 as your runtime.
Choose Create.

The following code is for your Lambda function (under the basic settings section, change the timeout from 3 sec to 45 sec):

import base64
import json
from datetime import datetime

def lambda_handler(event, context):
    
    send_back = []
    now = datetime.utcnow().isoformat()

    for record in event['records']:
        stock_rec = json.loads(base64.b64decode(record['data']))
        stock_rec["timestamp"] = now
       
        record_w_ts = {
                'recordId': record['recordId'],
                'result': 'Ok',
                'data': base64.b64encode(json.dumps(stock_rec).encode('utf-8') + b'n').decode('utf-8')
        }
        send_back.append(record_w_ts)

    return {'records': send_back}

Creating a Kinesis Data Firehose delivery stream

To create your delivery stream, complete the following steps:

On the Kinesis Data Firehose console, under Data Firehose, choose Create Delivery Stream.
Enter a name for your stream; for example, tmak-kdf-stock-delivery-stream.
For source, choose Direct PUT or other sources.
Choose Next.
For Data transformation, choose Enabled.
Choose the Lambda function you created.
Choose Next.
Choose Amazon Elasticsearch Service as the destination for your delivery stream.
For Index, enter stockdata.

The VPC section populates automatically. Make sure you use the security group you created for Kinesis Data Firehose (kdf-sec-grp).

For Backup Mode, choose Failed records only.

You can select an existing S3 bucket or create a new one. The following screenshot shows an example of your delivery stream settings.

Choose Next.
Review the buffering settings and set any tags to identify your stream.

A delivery stream that delivers to VPC destinations needs permissions to manage ENIs, list VPCs, and subnets. The console gives you the option to create a new role based on a template that includes all the needed permissions. You can also use an existing role if you already created one.

Choose Next.
Review the settings and choose Create Stream.

It may take up to a few minutes to see the stream status show as Active. See the following screenshot.

On the Amazon EC2 console, under Network and Security, you can see the endpoints created in your VPC by Kinesis Data Firehose and Amazon ES. See the following screenshot.

Configuring Kibana fine-grained access for Kinesis Data Firehose

You need to give Kinesis Data Firehose permissions to deliver stock data to your Amazon Elasticsearch Service endpoint. You can accomplish this via the Kibana console or API. For more information, see API on the Open Distro for Elasticsearch website.

For more information about controlling access to your Amazon Elasticsearch Service endpoint, see How to Control Access to Your Amazon Elasticsearch Service Domain.

Because your Amazon Elasticsearch Service endpoint is in the VPC to access Kibana, you must first connect to the VPC. This process varies by network configuration, but likely involves connecting to a VPN or corporate network. For this post, create a remote desktop EC2 instance public subnet of your VPC. The newly created security group (rdp-sec-grp) protects the instance. You can modify the es-sec-grp security group and allow inbound RDP traffic from rdp-sec-grp so you can access the Kibana URL. The following diagram illustrates this architecture.

Kinesis Data Firehose uses the delivery role to sign HTTP (Signature Version 4) requests before sending the data to the Amazon Elasticsearch Service endpoint. You manage Amazon Elasticsearch Service fine-grained access control permissions using roles, users, and mappings. This section describes how to create roles and set permissions for Kinesis Data Firehose.

The roles you create in this section are different from IAM roles. For more information, see Key Concepts.

Complete the following steps:

Navigate to Kibana (you can find the URL on the Amazon Elasticsearch Service console).
Enter the master user and password that you set up when you created the Amazon Elasticsearch Service endpoint.
Under Security, choose Roles.
Choose Add New Role.
Name your role; for example, firehose-role.
For cluster permissions, add cluster_composite_ops and cluster_monitor.
Under Index permissions, choose Index Patterns and enter stockdata*.
Under Permissions, add three action groups: crud, create_index, and manage.
Choose Save Role Definition.

In the next step, you map the IAM role that Kinesis Data Firehose uses to the role you just created.

Under Security, choose Role Mappings.
Choose the role you just created (firehose-role).
For Backend Roles, choose Add Backend Role.
Enter the IAM ARN of the role Kinesis Data Firehose uses: arn:aws:iam::123456789012:role/firehose_stream_role_name.

You can find your delivery stream ARN on the Kinesis Data Firehose console.

Streaming stock data through Kinesis Data Firehose

To stream your stock data, complete the following steps:

On the Kinesis Data Firehose console, choose the stream you created.
Choose Test with demo data.
Choose Start sending demo data.

If everything is working, you see message Demo data is being sent to your delivery stream. Wait a few minutes before you choose Stop sending demo data.

Analyzing and visualizing data

To analyze and visualize your data, complete the following steps:

On the Kibana console, choose Management.
Choose Index patterns.
For Index pattern, enter stockdata*.
Choose Next.
For the Time filter field, choose timestamp.
Choose Visualize.
Create a new visualization and choose Line.
For Index pattern, choose stockdata*.
For Y-Axis, choose Aggregation=Average and Field=price.
For X-Axis, choose Aggregation=Data Histogram, Field=timestamp, and Interval=seconds.
Under X-Axis, choose Add Sub-buckets.
Choose Split Series.
Set Sub-Aggregation=Terms and Field=ticker_symbol.keyword.
Choose Apply Changes.

The following screenshot shows an example visualization.

You can see the raw data by choosing Discover on the Kibana dashboard. See the following screenshot.

Summary

This post demonstrated how you can move an Amazon Elasticsearch Service endpoint inside your VPC with Kinesis Data Firehose. Additionally, you do not need to enable and secure public access to your Amazon Elasticsearch Service endpoint. If you have been reluctant to expose your Amazon Elasticsearch Service endpoint to the internet but want to stream data, you can now do so with Kinesis Data Firehose.

About the Authors

Tarik Makota is a Principal Solutions Architect with the Amazon Web Services. He provides technical guidance, design advice and thought leadership to AWS’ customers across US Northeast. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.

from AWS Big Data Blog: https://ift.tt/357I1z0

The post Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon Kinesis Data Firehose appeared first on Amazon Web Services Feed.

Posted inAWS News