Skip to content

Data Streaming in Cloud Computing

dataStreaming cloudComputing cloudNative microservice #microservice

is the continuous transmission of data from a source to a destination. With streaming, data sources send data frequently, sometimes multiple times per second, and in small quantities. Contrast that with the more traditional batch processing, where operations run infrequently and transmit larger amounts of data each time.
Source: https://www.udacity.com/blog/2021/08/what-is-data-streaming.html

People who work with this: Data scientist

Aka: Streaming data processing

Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes). This data needs to be processed sequentially and incrementally on a record-by-record basis or over sliding time windows, and used for a wide variety of analytics including correlations, aggregations, filtering, and sampling.
...
Stream processing requires ingesting a sequence of data, and incrementally updating metrics, reports, and summary statistics in response to each arriving data record. It is better suited for real-time monitoring and response functions.

https://aws.amazon.com/streaming-data/

In contrast there is: Batch processing

Batch processing can be used to compute arbitrary queries over different sets of data. It usually computes results that are derived from all the data it encompasses, and enables deep analysis of big data sets. MapReduce-based systems, like Amazon EMR, are examples of platforms that support batch jobs

Many organizations are building a hybrid model by combining the two approaches, and maintain a real-time layer and a batch layer. Data is first processed by a streaming data platform such as Amazon Kinesis to extract real-time insights, and then persisted into a store like S3, where it can be transformed and loaded for a variety of batch processing use cases.

treaming data processing requires two layers: a storage layer and a processing layer.

The storage layer needs to support record ordering and strong consistency to enable fast, inexpensive, and replayable reads and writes of large streams of data.

The processing layer is responsible for consuming data from the storage layer, running computations on that data, and then notifying the storage layer to delete data that is no longer needed.

... You also have to plan for scalability, data durability, and fault tolerance in both the storage and processing layers.

Infrastructure tools to build streaming data applications

Common in event driven architecture

Storage and processing layer: Kafka

  • Does: collect, process, and store continuous streams of event data or data that has no precise beginning or end.

Distributed data store optimized to store and handle real time data. Uses pub/sub to stream records and fault tolerance storage.

Kafka provides three main functions to its users:

  • Publish and subscribe to streams of records
  • Effectively store streams of records in the order in which records were generated
  • Process streams of records in real time

https://aws.amazon.com/msk/what-is-kafka/

RabbitMQ

Open source message broker that uses messaging queue approach. Queues are spread across a cluster of nodes and optionally replicated, with each message only being delivered to a single consumer.

https://www.rabbitmq.com/
https://aws.amazon.com/msk/what-is-kafka/

Storage and processing layer: AWS Kinesis

  • Does: collects, processes and analyzes streaming data

process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information

Amazon Kinesis is a real-time, fully managed, and highly scalable cloud service for streaming large volumes of data on AWS
https://www.whizlabs.com/blog/what-is-aws-kinesis/