Chapter 11: Analytics (again)
I feel like I'm stuck in Groundhog Day, or seeing the black cat repeat its movements in The Matrix. I just can't avoid Analytics with AWS.
This makes perfectly good sense: Data Analytics is one of the driving use cases for cloud adoption by any account. You need a flexible, powerful, scalable, durable, elastic infrastructure that can cope with large and unpredictable amounts of streaming data, spikes in processing requirements, low latency, and lots of data storage and retrieval over long periods of time.
So I was checking Chapter 11 and realised that even though I'd looked at Analytics previously I still had more questions. Some of them are:
- What role do the different AWS databases play in Data Analytics pipelines?
- Where does each fit in terms of NoSQL/SQL, high write throughput, low read latency, etc.?
- When should you (or can you) use DynamoDB? Cassandra?
- Can you build a Kappa architecture using AWS services?
- What are all the current AWS Data Analytics services?
- How do Data Analytics services relate to IoT services? (The same thing, in my mind.)
- How do you architect a complex Data Analytics system taking into account the price, limits and different purposes of each component and database?
- If some look similar, which should you pick? What are the real differences?
First (which as usual I found last), the best starting point is the updated data analytics whitepaper. It makes for easy and compact reading; most of the answers to the questions above can be found in it.
What are the current AWS Data Analytics services? Some of these are not mentioned in "the book" (Athena, QuickSight, Glue?). From the docs:
| Service | Product Type | Description |
|---|---|---|
| Amazon Athena | Serverless Query Service | Easily analyze data in Amazon S3 using standard SQL. Pay only for the queries you run. |
| Amazon EMR | Hadoop | Provides a managed Hadoop framework to process vast amounts of data quickly and cost-effectively. Run open source frameworks such as Apache Spark, HBase, Presto, and Flink. |
| Amazon Elasticsearch Service | Elasticsearch | Makes it easy to deploy, operate, and scale Elasticsearch on AWS. |
| Amazon Kinesis | Streaming Data | Easiest way to work with streaming data on AWS. |
| Amazon QuickSight | Business Analytics | Very fast, easy-to-use, cloud-powered business analytics for 1/10th the cost of traditional BI solutions. |
| Amazon Redshift | Data Warehouse | Fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data using your existing business intelligence tools. |
| AWS Glue | ETL | Prepare and load data to data stores. |
| AWS Data Pipeline | Data Workflow Orchestration | Helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. |
And the current IoT services (summary):
Connected devices, such as sensors, actuators, embedded devices, smart appliances, and wearable devices, connect to AWS IoT over HTTPS, WebSockets, or secure MQTT. Included in AWS IoT is a Device Gateway that allows secure, low-latency, low-overhead, bi-directional communication between connected devices and your cloud and mobile applications.
The AWS IoT service also contains a Rules Engine which enables continuous processing of data sent by connected devices. You can configure rules to filter and transform the data. You also configure rules to route the data to other AWS services such as DynamoDB, Kinesis, Lambda, SNS, SQS, CloudWatch, Elasticsearch Service with built-in Kibana integration, as well as to non-AWS services, via Lambda for further processing, storage, or analytics.
There is also a Device Registry where you can register and keep track of devices connected to AWS IoT, or devices that may connect in the future. Device Shadows in the AWS IoT service enable cloud and mobile applications to query data sent from devices and send commands to devices, using a simple REST API, while letting AWS IoT handle the underlying communication with the devices.
To me this looks similar: the Rules Engine looks like an event/stream processing system, and integration with the expected AWS services such as DynamoDB, Kinesis, Lambda, etc. is supported.
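To make that concrete, here is roughly what defining such a rule looks like with boto3. This is a minimal sketch only; the rule name, topic filter, table and role ARN are all hypothetical, and the exact action fields should be checked against the current AWS IoT API.

```python
import boto3

# Sketch only: an AWS IoT topic rule that filters sensor messages and
# routes them to a DynamoDB table. All names and the ARN are hypothetical.
iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="HighTempToDynamoDB",
    topicRulePayload={
        # Rules Engine SQL: filter and transform messages on a topic
        "sql": "SELECT deviceId, temperature, timestamp() AS ts "
               "FROM 'sensors/+/telemetry' WHERE temperature > 60",
        "actions": [
            {
                "dynamoDB": {
                    "tableName": "SensorEvents",   # hypothetical table
                    "roleArn": "arn:aws:iam::123456789012:role/iot-dynamodb-role",
                    "hashKeyField": "deviceId",
                    "hashKeyValue": "${deviceId}",
                    "rangeKeyField": "ts",
                    "rangeKeyValue": "${ts}",
                    "rangeKeyType": "NUMBER",
                }
            }
        ],
        "ruleDisabled": False,
    },
)
```

So the "rules" here are essentially a SQL filter/projection per topic plus a routing action, which is a long way from a Rete engine, but it is stream processing of a sort.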
Kinesis: Where can/should streaming data be persisted?
I'll start where I left off ages ago (way back in March), looking at Kinesis.
Kinesis has three services: Firehose (where's the fire?), Streams, and Analytics. Part of my problem started when I read that Firehose receives streaming data and stores it in S3, Redshift or Elasticsearch.
Only? This is where I started to wonder if I'd understood anything up till now. I have some experience with event/streaming systems (well, a few actually), and my recent experience involved having to use Cassandra to write lots of event data in a hurry. That is, with event streams you need low-latency, high-throughput write storage.
How many of the above databases satisfy this requirement? I would have guessed "none of the above". My first pick would probably be DynamoDB, based on what I know of AWS so far. However, it doesn't seem to be an option, which is odd. On the other hand, I think I see what they are doing: these are all good options for writing data for subsequent demanding reads/analysis (OLAP), but not for real-time stream processing and persisting. After a bit of hunting, here's what I found.
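Before going further, for reference, getting data into Firehose is at least straightforward. A minimal boto3 sketch, assuming a hypothetical delivery stream called events-to-s3 that has already been configured to buffer and deliver into S3 (it could equally target Redshift or Elasticsearch):

```python
import json
import boto3

# Sketch only: push one event into a Kinesis Firehose delivery stream, which
# buffers and delivers it to S3 / Redshift / Elasticsearch as configured.
# The delivery stream name is hypothetical.
firehose = boto3.client("firehose")

event = {"deviceId": "sensor-42", "temperature": 63.1, "ts": 1490000000}
firehose.put_record(
    DeliveryStreamName="events-to-s3",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```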
This has a good summary of the three types of data analysis you need at some time or other.
It has a list of AWS services underneath for each. One obvious problem is that they are all different!
From p31, possible storage options are explored with pros and cons: S3, Redshift, RDS and DynamoDB.
The summary is:
- Storage options considered: Amazon S3, Amazon Redshift, Amazon RDS, Amazon DynamoDB
- Amazon S3:
  - Actions can directly write into (JSON) files on S3
  - Very simple to configure, just provide the bucket name
  - Results in 1 file per event; lots of small files can be hard to handle
  - Inefficient when processing with Hadoop / Amazon EMR or when importing into Redshift
  - Useful when you have a very low frequency of events, e.g. when you only want to log outliers to S3
  - Buffer data using Amazon Kinesis or Amazon Kinesis Firehose to get fewer, larger files
  - Buffering, compression and output to S3 are built into Firehose, so no other infrastructure is needed
  - The Kinesis Connector Library can be extended to transform, filter or serialize data, giving additional control over buffering and output formats
  - Added complexity: requires Amazon EC2 workers running the Kinesis Connector Library
- Amazon Redshift:
  - Actions can forward data via Amazon Kinesis Firehose
  - Buffering and output to Redshift are built into Firehose; very easy to set up, fully managed
  - Use Amazon Kinesis (Streams) as an alternative for more control: the Kinesis Connector Library can transform, filter or serialize data
  - Added complexity: requires the Kinesis Connector Library etc. to execute on Amazon EC2
- Amazon DynamoDB:
  - Actions can directly write into Amazon DynamoDB
  - Creates one row per event; you can define the hash key, range key and attributes to store (e.g. hash key = deviceID, range key = timestamp)
  - Very simple to configure, just provide the table and field names
  - Adding GSIs and LSIs provides additional flexibility and enables different queries
  - SELECTs can read from DynamoDB for fast lookups
  - An AWS Lambda function provides additional flexibility: transform data, write into different/multiple tables, or enrich data with contextual information pulled in from other sources
  - Lambda (when called from AWS IoT) is only able to process one event at a time, i.e. it cannot aggregate events before writing to DynamoDB
And the winner(s) are:
Recommendations:
- Want to run a lot of queries constantly? Use Kinesis Firehose to write into Amazon Redshift.
- Need fast lookups, e.g. in Rules or Lambda functions? Write into DynamoDB, adding indices if necessary.
- Have a need for heavy queries, but not always-on? Use Kinesis Firehose and S3, and process with Amazon EMR.
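The DynamoDB "fast lookups" recommendation is easy to picture. A minimal boto3 sketch, assuming a hypothetical SensorEvents table keyed as in the slides above (hash key = deviceID, range key = timestamp):

```python
from decimal import Decimal
import time

import boto3
from boto3.dynamodb.conditions import Key

# Sketch only: one row per event, keyed as the slides suggest
# (hash key = deviceID, range key = timestamp). Names are hypothetical.
table = boto3.resource("dynamodb").Table("SensorEvents")

table.put_item(
    Item={
        "deviceId": "sensor-42",           # hash (partition) key
        "ts": int(time.time() * 1000),     # range (sort) key, ms since epoch
        "temperature": Decimal("63.1"),    # DynamoDB wants Decimal, not float
    }
)

# The fast-lookup case: latest 10 events for one device
resp = table.query(
    KeyConditionExpression=Key("deviceId").eq("sensor-42"),
    ScanIndexForward=False,   # newest first
    Limit=10,
)
print(resp["Items"])
```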
It would be nice to know what the (default) performance and limitations are for each of the database options, and how much it costs to scale/elasticise them.
What do other people suggest?
This article uses DynamoDB and DynamoDB Streams for a Lambda architecture:
And this AWS doc says that Rules can be used to write IoT data to DynamoDB, so why not streaming data?
Kappa on AWS?
Another question I had was: how would you implement a Kappa architecture on AWS (real-time and persisted data in the same immutable store, queried via a single interface/language), and is Kinesis it? Partially; it looks like Kappa, but has the (only?) limitation that data can only be stored for 24 hours (maybe 7 days at the maximum).
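If the 24-hour default is the main sticking point, the retention period can at least be bumped towards the 7-day maximum (at the time of writing). A minimal boto3 sketch, with a hypothetical stream name:

```python
import boto3

# Sketch only: extend a Kinesis stream's retention from the 24 h default
# towards the 7-day maximum (168 hours). Stream name is hypothetical.
kinesis = boto3.client("kinesis")

kinesis.increase_stream_retention_period(
    StreamName="event-stream",
    RetentionPeriodHours=168,
)

desc = kinesis.describe_stream(StreamName="event-stream")
print(desc["StreamDescription"]["RetentionPeriodHours"])
```

Anything beyond that still means copying the stream out to S3, DynamoDB, etc., which is where the comparison with Kappa's single immutable log starts to fall down.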
Kappa on AWS
DynamoDB is suggested as part of the solution in this article, which also compares AWS with Google Cloud. It also mentions Apache Beam, which is a Kappa thingy.
It was by Tyler Akidau, and he may have ventured into the madness of the idea I had (Data Bogs); he calls it Accumulation.
And this blog uses DynamoDB as part of a Kinesis data analytic pipeline (Raven).
Architecting with AWS Data Analytics services?
Maybe my brain is not big enough, but to me this all seems a bit ad hoc. How about tool support for architecting data analytics (including IoT) solutions and pipelines in AWS (including 3rd party and open source tools as required)?
You at least need to be able to take into account service limits, and the performance, scalability, elasticity and price requirements and constraints, at each service boundary (inputs and outputs): latency, throughput, price, etc.
How would you do this? A simple table may suffice, but it would still be manual and error-prone to use.
Our software performance modelling tool would be entirely suitable as it models workloads (inputs), services (software components with different inputs/outputs and measured performance/scalability characteristics), and servers (capacity/elasticity). Metrics predicted include performance, throughput, utilisation and cost.
How would this work?
AWS services can be defined as "services" with defined latency and throughput for inputs and outputs, and different functions such as NoSQL or SQL. Elasticity and scalability can be handled by allowing more "servers" to be allocated (with increased price), and/or by using different services with higher capacities (e.g. corresponding to a different number of Kinesis stream shards).
A model could be built to explore different architectural alternatives, and simulations run with different workloads (e.g. arrival rates for different data/event stream types). The pipeline throughput and latency of the system can then be predicted, and any resource bottlenecks detected (and, where possible, removed, for example by increasing resources or changing the AWS product/service).
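As a toy illustration of the kind of calculation such a model automates (not the actual tool), here is a back-of-the-envelope sketch: throughput is capped by the slowest stage, latency is additive, and cost scales with allocated units. Every number in it is invented purely for illustration.

```python
# Toy sketch only: back-of-the-envelope pipeline model.
# All figures below are made up for illustration.

stages = {
    # stage: (units, per-unit throughput rec/s, latency ms, $ per unit-hour)
    "kinesis_shards":    (4, 1000,  70, 0.015),
    "lambda_processors": (10, 500, 120, 0.020),
    "dynamodb_writes":   (5, 1000,  10, 0.065),
}

def pipeline_metrics(stages, arrival_rate):
    # Capacity of each stage is units * per-unit throughput
    capacity = {name: units * tput for name, (units, tput, _, _) in stages.items()}
    bottleneck = min(capacity, key=capacity.get)
    throughput = min(arrival_rate, capacity[bottleneck])
    latency = sum(lat for (_, _, lat, _) in stages.values())
    cost_per_hour = sum(units * price for (units, _, _, price) in stages.values())
    return throughput, bottleneck, latency, cost_per_hour

tput, bottleneck, latency, cost = pipeline_metrics(stages, arrival_rate=6000)
print(f"throughput={tput} rec/s (bottleneck: {bottleneck}), "
      f"latency~{latency} ms, cost~${cost:.2f}/hour")
```

A real modelling tool would add contention, queueing, elasticity rules and per-service limits on top of this, but even the toy version makes the bottleneck and cost trade-offs explicit.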
TODO: Collect the performance/scalability/elasticity/price defaults and limitations for each AWS Data Analytics service in a table.
EMR, Data Pipeline, Import/Export
Other services covered in this section are Amazon Elastic MapReduce (EMR), including Hadoop, Hive, Pig, and Spark; AWS Data Pipeline, a workflow service for moving and transforming data (e.g. with EMR), where you can schedule tasks and tasks can have preconditions which determine whether a task runs; and AWS Import/Export, which uses appliances to move lots of data (Snowball is a big box thing with encryption enforced; Disk is a disk, with optional encryption).
Note: other AWS workflow services are AWS Step Functions (orchestrating Lambda) and Amazon Simple Workflow Service (SWF). How is Data Pipeline similar? Different?
Doesn't look like a Snowball to me, more like "big grey box"
PS
Looking at the docs again it seems that Kinesis Analytics is "just" SQL (with temporal operators). If you want something more sophisticated to process the streams then you need to write your own Kinesis applications:
And guess what? Apps have state, so use DynamoDB to keep track of it (this seems to be a common AWS pattern: use DynamoDB to keep track of state for restarts etc.).
And you can use Kinesis applications (KCL) to process DynamoDB Streams (how many different "stream" services are there? More than one, obviously).
This seems to be common practice (i.e. using DynamoDB for session state persistence).
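A rough sketch of the pattern, hand-rolled rather than using the real KCL (which handles leases, shard splits and retries properly): poll one shard with GetRecords and checkpoint the last processed sequence number in a DynamoDB table so the application can resume after a restart. The stream, shard and table names here are hypothetical.

```python
import time
import boto3

# Sketch only: hand-rolled Kinesis consumer (the KCL does this properly).
# Polls one shard and checkpoints the last sequence number in DynamoDB so
# the application state survives restarts. All names are hypothetical.
kinesis = boto3.client("kinesis")
checkpoints = boto3.resource("dynamodb").Table("StreamCheckpoints")

STREAM, SHARD = "event-stream", "shardId-000000000000"

def get_iterator():
    saved = checkpoints.get_item(Key={"shardId": SHARD}).get("Item")
    if saved:  # resume just after the last processed record
        return kinesis.get_shard_iterator(
            StreamName=STREAM, ShardId=SHARD,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=saved["sequenceNumber"])["ShardIterator"]
    return kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=SHARD,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

iterator = get_iterator()
while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["Data"])                      # process the record
    if resp["Records"]:                            # checkpoint progress
        checkpoints.put_item(Item={
            "shardId": SHARD,
            "sequenceNumber": resp["Records"][-1]["SequenceNumber"]})
    iterator = resp["NextShardIterator"]
    time.sleep(1)   # polling; also keeps under the 5 reads/sec/shard limit
```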
PPS
From past experience I think this is vital; it's very hard to debug multiple concurrent temporal stream processing queries running at the same time on lots of data.
Also, do any current data analytics languages/operators support future operators? I recall now that I have come across the problem of reasoning about the future in modal and temporal logics (e.g. in machine learning, with autonomous learners reasoning about what will happen in the future when they do something, such as stacking a block on top of another block).
I can't find any reference to future-time temporal operators and models in current data analytics platforms, which is odd. Do they just ignore this problem? This book says:

Another aspect of temporal models addresses the future. Within a linear temporal model only one future is assumed, whereas a branching temporal model allows the existence of at least one but also multiple futures (paths). Moreover, a circular temporal model defines the future to be recurring. In the majority of cases regarding temporal data analysis, a linear temporal model is used. This is plausible because of the temporal concepts and operators mostly used within the field. If a branching or circular temporal model is utilized, simple concepts like before or after may be difficult to apply. Thus, within this book a linear temporal model is assumed.
Actually a chapter from this book.
My previous random thoughts on future operators.
I also wonder where "rules engines" fit with (AWS) data analytics. There are rules for IoT; how about a fast rules engine for streaming data? E.g. could you use a Rete rules engine (I briefly had the fastest in the world in the late 1980s) for streaming data?
Also what are the actual requirements for streaming data processing?
The 8 Requirements of Real-Time Stream Processing paper and this blog make a good case for some basic features. I agree with one of their observations, that polling shouldn't be a feature. But SQL? No thanks. This reminds me of a potential issue I had with Kinesis during the AWS Summit presentation in Sydney recently. The speaker implied that the only way to get data out of Kinesis (Streams?) was by polling, and when I talked with him afterwards he didn't seem to understand the potential issue with this. Odd. TODO: check.
Looks to be correct.
Maybe Lambda is a workaround?
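It is, at least in the sense that AWS does the polling for you: a Lambda function can be attached to a Kinesis stream via an event source mapping, and the function is invoked with batches of records. A minimal boto3 sketch, with hypothetical function name and stream ARN:

```python
import boto3

# Sketch only: have AWS poll the stream and push batches of records to a
# Lambda function via an event source mapping. Names/ARNs are hypothetical.
lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:ap-southeast-2:123456789012:stream/event-stream",
    FunctionName="process-events",
    StartingPosition="LATEST",
    BatchSize=100,
)
```

Under the hood this is still the Lambda service polling the shards on your behalf, so it pushes the problem down a level rather than removing it.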
Based on this I would say that Kinesis is NOT A REAL TIME STREAM PROCESSING SYSTEM (pity). I would like to be wrong...?
This blog looks at Kinesis in production (cf. Kafka). They conclude that latency and throughput are OK for a distributed real-time stream processing system (but not perfect, particularly the 5 reads per second per shard limit).