Real-time processing provides a notable advantage over batch processing — data becomes available to consumers faster. With traditional ETL, you could not analyze today's events until the nightly jobs finished the next day. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new data events. Live dashboards are updated automatically as new data comes in.
Despite all the benefits, real-time streaming adds considerable complexity to the overall data processes, tooling, and even data formats. Therefore, it's crucial to carefully weigh the pros and cons of switching to real-time data pipelines. In this article, we'll look at several options to reap the benefits of a real-time paradigm with the least amount of architectural changes and maintenance effort.
Traditional approach
When you hear about real-time data pipelines, you may immediately start thinking about Apache Kafka, Flink, Spark Streaming, and similar frameworks, all of which require considerable expertise to operate as a distributed event streaming platform. Those open-source platforms are best suited to the following scenarios:
- when you need to continuously ingest and process reasonably large amounts of real-time data,
- when you anticipate multiple producers and consumers and want to decouple their communication,
- or when you want to own the underlying infrastructure, possibly on-prem (e.g. compliance).
While many companies and services attempt to facilitate the management of the underlying distributed clusters, the architecture remains fairly complex. Therefore, you need to consider:
- whether you have the resources to operate those clusters,
- how much data you plan to process with this platform,
- whether the added complexity is worth the effort.
In the next sections, we’ll look at alternative options if your real-time needs don’t justify the added complexity and costs of a self-managed distributed streaming platform.
Amazon Kinesis
AWS recognized customers' difficulties in managing message-bus architectures a long time ago (2013). As a result, they came up with Kinesis — a family of services that aims to make real-time analytics easier. By leveraging serverless Kinesis Data Streams, you can create a data stream with a few clicks in the AWS Management Console. Once you have configured your estimated throughput and the number of shards, you can start implementing data producers and consumers. Even though Kinesis is serverless, you still need to monitor the message size and the number of shards to ensure that you don't run into unexpected write throttles.
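For reference, the console setup can also be scripted. Here is a minimal boto3 sketch, where the stream name and shard count are placeholders that you would size for your own throughput:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a stream with a single shard (placeholder name and capacity)
kinesis.create_stream(StreamName="crypto-price-events", ShardCount=1)

# Block until the stream is ACTIVE and ready to accept records
kinesis.get_waiter("stream_exists").wait(StreamName="crypto-price-events")
```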
In my previous article, you can find an example of a Kinesis producer (source) sending data to a Kinesis data stream using a Python client, and how to continuously send micro-batches of data records to S3 (consumer/destination) by leveraging a Kinesis Data Firehose delivery stream.
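Schematically, such a producer boils down to a put_record call. The sketch below uses a made-up stream name and payload rather than the exact code from that article:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Send one event; the partition key determines which shard receives the record
kinesis.put_record(
    StreamName="crypto-price-events",
    Data=json.dumps({"symbol": "BTC", "price_usd": 34406.27}).encode("utf-8"),
    PartitionKey="BTC",
)
```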
Alternatively, to consume data from a Kinesis data stream, we could:
- aggregate and analyze data with Kinesis Data Analytics,
- use Apache Flink to send this data into Amazon Timestream.
The main benefits of using Kinesis Data Streams, as compared to sending data directly to your desired application, are latency and decoupling. Kinesis allows you to retain data in the stream beyond the default 24 hours (up to seven days, or even longer with extended retention) and to have multiple consumers receive the same data at the same time. This means that if a new application needs to collect the same data, you can simply add a new consumer to the process. Thanks to decoupling at the Kinesis architecture level, this new consumer does not affect the other data consumers or producers.
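To illustrate that decoupling, a new consumer is simply another reader of the same stream. Here is a minimal polling sketch, reusing the placeholder stream name from above and assuming a single shard:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Look up the first (and here, only) shard of the stream
stream = kinesis.describe_stream(StreamName="crypto-price-events")
shard_id = stream["StreamDescription"]["Shards"][0]["ShardId"]

# Start reading new records from the tip of the stream
iterator = kinesis.get_shard_iterator(
    StreamName="crypto-price-events",
    ShardId=shard_id,
    ShardIteratorType="LATEST",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    print(record["Data"])
```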
Amazon Timestream
As mentioned in the previous section, the major advantage of Kinesis is decoupling. If you don't need multiple applications regularly consuming data from the stream, you can considerably streamline the process by using Amazon Timestream — a serverless time-series data store that lets you analyze data in (near) real-time. The underlying architecture is smart enough to ingest data first into an in-memory store for fast retrieval of real-time data, and then automatically move "old" data to cheaper long-term storage according to the specified retention periods. For more about Timestream, have a look at this article.
Why would you use a time-series database for real-time data?
Any new data record comes into the stream at a particular time. You may be tracking price changes over time, sensor measurements, logs, CPU utilization — practically any real-time streaming data is some sort of time series. Therefore, it makes sense to consider a time-series database such as Timestream. The simplicity of the service makes it very appealing, especially if you would like to use SQL to retrieve data for analytics.
When comparing the SQL interface of Timestream against the one available in Kinesis Data Analytics, Timestream is a clear winner. Kinesis SQL is quite obscure and introduces a lot of service-specific vocabulary. In contrast, Timestream provides an intuitive SQL interface with many useful built-in time-series functions, making time-based aggregation (e.g., one-minute or hourly time buckets) much easier.
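As a taste of that interface, here is a sketch of a time-bucketed aggregation run from Python. The database, table, dimension, and measure names are placeholders that I also use in the demo below:

```python
import boto3

query_client = boto3.client("timestream-query", region_name="us-east-1")

# Average price per currency in one-minute buckets over the last 15 minutes
query = """
SELECT currency,
       bin(time, 1m) AS minute,
       avg(measure_value::double) AS avg_price
FROM "crypto"."crypto_prices"
WHERE measure_name = 'price_usd'
  AND time > ago(15m)
GROUP BY currency, bin(time, 1m)
ORDER BY minute
"""

response = query_client.query(QueryString=query)
for row in response["Rows"]:
    print(row["Data"])
```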
**Side note:** don't use semicolons at the end of your queries in Timestream. If you do, you'll get an error.
Demo: Real-time ingestion into Timestream using Python
To demonstrate how Timestream works, we’ll be sending cryptocurrency price changes into a Timestream table.
Let's start by creating a Timestream database and table. We can do all that from the AWS Management Console, from the AWS CLI, or with a few lines of boto3:
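A minimal boto3 sketch of that step; the database name crypto is a placeholder that I will reuse throughout:

```python
import boto3

ts_write = boto3.client("timestream-write", region_name="us-east-1")

# Create the Timestream database (placeholder name)
ts_write.create_database(DatabaseName="crypto")
```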
The above code should create a database in your AWS region. Make sure that you use one of the regions in which Timestream is available.
**Side note:** the easiest way to find available regions for any AWS service is to check its pricing page: https://aws.amazon.com/timestream/pricing/.
Now we can create a table. You need to specify the retention periods for the in-memory and magnetic stores:
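A boto3 sketch of the table creation; the retention values below are examples, not recommendations:

```python
import boto3

ts_write = boto3.client("timestream-write", region_name="us-east-1")

# Keep recent data in the fast in-memory store, older data on magnetic storage
ts_write.create_table(
    DatabaseName="crypto",
    TableName="crypto_prices",
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,
        "MagneticStoreRetentionPeriodInDays": 7,
    },
)
```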
Our database and table are created. Now we can get the latest price data from the Cryptocompare API. This API provides many useful endpoints to get the latest information about the cryptocurrency market. We will focus on getting real-time price data for selected cryptocurrencies.
We’ll get data in the following format:
{'BTC': {'USD': 34406.27},
 'DASH': {'USD': 178.1},
 'ETH': {'USD': 2263.64},
 'REP': {'USD': 26.6}}
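One way to fetch data in exactly this shape is CryptoCompare's pricemulti endpoint. A sketch with the requests library (the symbol list is just an example):

```python
import requests

# Latest USD prices for a few selected cryptocurrencies
URL = "https://min-api.cryptocompare.com/data/pricemulti"
params = {"fsyms": "BTC,DASH,ETH,REP", "tsyms": "USD"}

prices = requests.get(URL, params=params, timeout=10).json()
print(prices)  # e.g. {'BTC': {'USD': 34406.27}, 'ETH': {'USD': 2263.64}, ...}
```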
Additionally, we need to convert this data to the proper Timestream format with a time column, measures, and dimensions. Here is the full script that we can use to ingest new data every 10 seconds: https://gist.github.com/d00b8173d7dbaba08ba785d1cdb880c8.
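The gist contains the full version; a stripped-down sketch of the conversion and write step looks roughly like this. The dimension name currency and measure name price_usd are my own placeholders and may differ from the gist:

```python
import time
import boto3

ts_write = boto3.client("timestream-write", region_name="us-east-1")

def to_timestream_records(prices: dict) -> list:
    """Map {'BTC': {'USD': 34406.27}, ...} to Timestream records."""
    now_ms = str(int(time.time() * 1000))
    return [
        {
            "Dimensions": [{"Name": "currency", "Value": symbol}],
            "MeasureName": "price_usd",
            "MeasureValue": str(quote["USD"]),
            "MeasureValueType": "DOUBLE",
            "Time": now_ms,
            "TimeUnit": "MILLISECONDS",
        }
        for symbol, quote in prices.items()
    ]

# Sample payload in the format shown above
prices = {"BTC": {"USD": 34406.27}, "ETH": {"USD": 2263.64}}

ts_write.write_records(
    DatabaseName="crypto",
    TableName="crypto_prices",
    Records=to_timestream_records(prices),
)
```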
That's it! The most time-consuming part is the definition of your dimensions and measures (lines 21–44 of the gist). You should be careful about the design of your measures and dimensions: with Timestream you can only query data from a single table, as no JOINs between tables are allowed. Therefore, it's important to think ahead about your access patterns before you start ingesting data into Timestream.
Here is what the data looks like in the end. Note that the ingestion time is presented in UTC:
AWS Timestream: exploring the results in the query console — image by author
We could now easily connect Timestream to Grafana for near real-time visualization. But that’s a story for another article.
Never-ending script
In the Timestream example above, which runs in a single process, we used a never-ending loop defined with while True. This is a common approach for a simple service that ingests data all the time, typically running as a background process or as a service in a container orchestration platform.
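Schematically, that loop looks like this; ingest_once is a placeholder for the fetch-and-write logic sketched earlier:

```python
import time

def ingest_once():
    """Fetch the latest prices and write them to Timestream (see the snippets above)."""
    ...

# Never-ending ingestion: one write roughly every ten seconds
while True:
    try:
        ingest_once()
    except Exception as exc:
        # Keep the loop alive on transient errors (throttling, network hiccups)
        print(f"Ingestion failed: {exc}")
    time.sleep(10)
```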
Minutely scheduled jobs
An alternative to a continuously running script is a service that is scheduled to run every minute. The benefit of this approach is that it allows you to treat this near real-time process as a batch job, which simplifies your architecture. You can think of it as a reversed Kappa architecture: while Kappa processes batch in the same way as real-time data (streaming-first approach), this approach “batchifies” real-time data streams (batch-first approach) into micro-batches.
Instead of while True, we still ingest data roughly every ten seconds, but the actual process is executed once per minute. This lets us track which runs succeeded and removes the dependence on the health of a single long-running job:
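A sketch of such a minutely job, assuming it is triggered by a scheduler such as a cron rule or an Amazon EventBridge schedule; the function names are placeholders, and ingest_once again stands in for the fetch-and-write logic:

```python
import time

def ingest_once():
    """Fetch the latest prices and write them to Timestream (see the snippets above)."""
    ...

def run_minutely_job(event=None, context=None):
    """One scheduled run per minute: six ingestions, ten seconds apart."""
    for _ in range(6):
        ingest_once()
        time.sleep(10)
```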
There is no “right” or “wrong” approach. The main purpose of this method is to treat near real-time ingestion as a batch job. Here is a full Gist: https://gist.github.com/d953cdbc6edbf8b224815cc5d8b53f73.
Which option should you choose?
The following questions may help you to make the right decision for your use case:
- Which problem(s) do you want to solve using real-time streaming: is it anomaly detection, alerting, product recommendation, dynamic pricing algorithm, tracking current market prices, understanding user behavior? Having a specific use case in mind can help you determine the right tool for the job, especially because there are a lot of specialized tools on the market.
- Which latency is acceptable in your use case? Is it OK if your data is available for analytics one minute after the event or stream has been received? Or, on the contrary, do you need millisecond latency because otherwise the data would no longer be actionable?
- How many resources (employees and budget) do you have to keep your platform operational? Does it make sense to spin up your own Kafka cluster, use a managed service, or can a serverless option such as Amazon Kinesis or Amazon Timestream address your needs?
- How do you plan to monitor and observe the health of your data streams? If you leverage serverless technologies, Dashbird may be a good option to easily monitor your serverless AWS stack and notify you about failures and anomalies in real-time.
- How much training would be needed to teach your team how to use this specific platform?
- Which data sources would need to be ingested in real-time, i.e. data producers?
- What is the target datastore (data lake, data warehouse, specific database) from which you would want to retrieve this data, i.e. data consumers? And how do you want to retrieve this data — via SQL, Python, or perhaps only via analytical dashboards?
- In which way (architecture-wise) would you want to process this data? Is Kappa, Lambda, or other architecture worth considering to distinguish between real-time and batch?
Conclusion
Ultimately, it depends on the problem you are trying to solve with real-time processing technologies, the scale of that problem, and the available resources. In many scenarios, a simple minutely batch job may be sufficient. It lets you keep a single architecture for all data processing needs while still making data available within a few minutes, or even seconds, of its generation.
For other scenarios, Kinesis Data Streams or Amazon Timestream may provide simple yet effective means of adding (near) real-time capabilities with very little maintenance effort. Lastly, if you do have employees who know how to operate Kafka, Flink, or Spark Streaming, those platforms can be helpful if you want to own your infrastructure and not be reliant on cloud providers. As always, thinking about the problem at hand will help you assess the trade-offs and make the best decision for your use case.
Further reading:
How to save hundreds of hours on Lambda debugging?
AWS Kinesis vs SNS vs SQS (with Python examples)