Chapter 7: Amazon Redshift

Amazon Redshift is a cloud data warehouse service for Big Data. It supports SQL, is based on PostgreSQL, and executes queries in parallel for low latency and scalability. I believe it offers an immediate consistency model (cf. eventual consistency for some of the other Amazon databases).

A B(l)og Monster

Another service is Redshift Spectrum, which supports SQL queries against S3 "Data Lakes". The Data Lake concept worries me slightly, as without proper drainage and flow a Lake can very quickly end up as a putrid Bog, complete with a (dancing) Bog (Blog?) Monster!



Clusters

An Amazon Redshift warehouse is made up of a cluster consisting of a leader node and one or more compute nodes. Multiple node types are supported, with different mixes of CPU, memory, and storage.

Your client application communicates only with the leader node (via JDBC/ODBC), which in turn coordinates query execution across the data, which is distributed in multiple "slices" across the compute nodes.
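Out of curiosity, you can see the slice layout for yourself. A minimal sketch, assuming you're connected to the leader node with any SQL client (STV_SLICES is a real Redshift system table, although the available columns may vary by version):

    -- Show how slices map onto the compute nodes in this cluster
    SELECT node, slice
    FROM stv_slices
    ORDER BY node, slice;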

Query performance can be increased by adding more compute nodes. Clusters can be resized (there are a couple of tutorials on resizing). During a resize Redshift creates a new cluster and migrates the data from the old cluster to the new one; while this happens the database is read-only.

Tuning

Tuning early valve radios was tricky enough for the user, but when I worked in a TV repair shop as a kid my least favourite job was dial cord re-stringing, as the cord had invariably come off all the pulleys, drums, shafts, etc., so trial and error was the only way to work out how it was supposed to be strung. Unfortunately, Redshift tuning looks equally frustrating.

When creating a table you need to choose how to distribute the records across the nodes and slices in a cluster. The distribution style (incorrectly titled "distribution strategy" in the book) has a big impact on query performance, storage requirements, data loading, etc. The styles are EVEN, KEY, and ALL. Sort keys (one or more columns) may also impact performance.
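As a concrete (but hedged) sketch, choosing the style and sort key happens at table creation time. The sales table and its columns here are made up for illustration:

    -- Distribute rows by hashing customer_id, and keep rows sorted by date
    CREATE TABLE sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(10,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date);

KEY distribution co-locates rows with the same customer_id on the same slice (good for joins on that column), EVEN spreads rows round-robin, and ALL copies the whole table to every node (useful for small dimension tables).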

In fact, tuning Redshift performance appears to be a critical activity, important enough to have a whole tutorial devoted to it. This appears to be a potentially time-consuming and ongoing task. I wonder if it can be automated and optimised by Redshift itself, or if performance modelling (e.g. given sufficient monitoring and table data) could be used to run predictive simulations? Some extra advice is in blogs, slides (which mention some open source tools), and an evaluation.

Redshift supports INSERT and UPDATE to create and modify table records, but for bulk data COPY is more efficient. COPY can read from multiple data sources concurrently (including S3 and DynamoDB). After a COPY you should use VACUUM and ANALYZE to optimise space and update statistics.
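A minimal sketch of the load-then-maintain sequence, reusing the hypothetical sales table from above (the S3 bucket and IAM role ARN are placeholders, not real values):

    -- Bulk load CSV files from S3 in parallel across the compute nodes
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV;

    VACUUM sales;   -- reclaim deleted space and re-sort rows
    ANALYZE sales;  -- refresh the statistics used by the query planner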

For large Redshift clusters and multiple users, configure Workload Management (WLM) to define multiple queues and concurrency levels. Again this looks potentially complex and open ended, and I wonder what automation and/or predictive modelling is available or possible?
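Queues are configured through the cluster's parameter group rather than via SQL, but you can inspect and tweak them from a SQL session. A hedged sketch (the system table STV_WLM_SERVICE_CLASS_CONFIG and the wlm_query_slot_count setting are real, though details may differ between Redshift versions):

    -- Inspect the configured queues; the first few service classes
    -- are reserved for system use
    SELECT service_class, num_query_tasks, query_working_mem
    FROM stv_wlm_service_class_config
    WHERE service_class > 4;

    -- Temporarily claim more concurrency slots in the current queue,
    -- e.g. before running a large VACUUM or a heavy query
    SET wlm_query_slot_count TO 3;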

PS
A few follow-up questions: Can the leader node be saturated? Is it Elastic? Is there a sensible default for the distribution style? How much impact can distribution style have (worst, typical, best case) in practice? In theory?

A quick answer to the first question appears to be "yes", as some SQL operations run only on the leader node:

An anti-pattern is to insert data directly into Amazon Redshift, with single record inserts or the use of a multi-value INSERT statement, which allows up to 16 MB of data to be inserted at one time. These are leader node–based operations, and can create significant performance bottlenecks by maxing out the leader node network as data is distributed by the leader to the compute nodes.
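To make the contrast concrete, here's a hedged sketch using the hypothetical sales table from earlier. Even a multi-value INSERT is funnelled through the leader node, whereas COPY (shown above) loads in parallel on the compute nodes:

    -- Leader node-based: fine for trickle inserts, a bottleneck in bulk
    INSERT INTO sales VALUES
        (1, 42, '2018-01-01',  9.99),
        (2, 43, '2018-01-01', 19.99);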

PPS
As usual, don't forget price and limits. This blog post set up a Redshift cluster for a few hours and gives example pricing.

PPPS
And complexity: the Redshift REST API has 62 operations and 39 data types, a total of 101.

PPPPS
This site supports comparisons between multiple different databases.

P5S
It appears that I've accidentally invented the term "Data Bog", or Databog (which I've been using for at least a year or two, after hearing about Data Lakes while working on a client project). If you do a Google search the word "Databoge" comes up, which is German for "Databook". Now that I've "invented" it I guess I should eventually try to define it.

First, Data Lake is now a common term for a body of data which isn't as organised or structured as other types of data storage architecture, but which is more organised than a "Sea of Data" (given that, according to Revelation, there won't be any Seas in heaven, only Rivers and Lakes of Fire); it has metadata and is structured upon consumption by applications (I think).

Also possible (and referenced) are "Data Puddles" (isolated splashes of data, not connected or well organised?), Data Buckets (think S3), Data Ponds (holding ponds for data as it enters the Data Lake?), Data Droplets (from Semantic Web terminology?), Data Streams (that's been around for a while), and Data Swamps (what you get if you don't keep track of where your data comes from in a Data Lake).

Some of these terms refer to Wetlands, and include Marsh, Swamp, Bog, and Fen. The categories of Wetlands in Australia are more complex. The main feature of a Bog appears to be that it fills with rainwater and has no outlet (it only drains through the soil).

What sort of things do you find in Bogs? According to a children's poem and book, a dog and a frog and a hog are all on a log on a bog, but then end up in the bog (except the log, I guess).

I had some first-hand experience of swamps and bogs on my honeymoon. We went bushwalking (with tents and seven days' worth of food) along the East coast track on Hinchinbrook Island. On the first night we encountered a bog which we had to cross to get to the camp site. I crossed it five times, as my wife "couldn't see the bottom" and was afraid something would eat her. She almost ended up in the bog, as by the final trip across - after first carrying my pack, then her pack, then finally her - I was ready to "accidentally" drop her in! And then the same again the next day (to get back onto the main track, making a total of 10 times). A few days later we had a choice between crossing a narrow-looking creek (but with a warning sign) or spending an extra day going around it via a swamp, in order to avoid potential crocs in the creek (we went around).



What do I learn from this? That bogs are unpleasant but won't kill you, whereas swamps may have nasty bitey things (e.g. crocs).

Bogs are also known to preserve things well; for example, Bog people (bodies, mummies) have been found in Bogs, often thousands of years after they fell (or were pushed?) in.


Even odder is the idea (from the above link) that:

Bogs were both resources and possibly ominous supernatural portals to/from other worlds.


It appears that gifts (even offerings of people?) may have been made in exchange for the valuable resource of peat obtained from bogs. Maybe this suggests a possible definition for "Data Bog"?!



Definition 1: A Data Bog is a data architecture for both querying and storing results, where the data/results are preserved but potentially transformed over a long period of time (what you get out over time isn't necessarily the same as what you put in?!)

Or:


Definition 2: A Data Bog is a data architecture that requires offerings of data to be put in, in proportion to the quantity and quality of the data that you take out (otherwise you may get some very unpleasant things out!). I.e. if you don't pay sufficient attention to what's going in, the quality of what comes out may deteriorate.


