Lambda, Kappa and now (maybe) a [Mu, Tardis, Multiverse, "Bog" or Ouroboros architecture] (pick one)?


The Doctor's timeline.

I've been interested in stream and event processing approaches for 10+ years now (and have some background in AI, using temporal logics and autonomous machine learning in temporal robot domains, etc.).

Some of my experience included integrating stream (real-time) data with historical data for processing and visualisation at the CSIRO ICT Centre 10+ years ago (e.g. data from sensor networks arriving via Jabber/XMPP protocols, real-time in-memory CEP over the streams, and persisting some of the streams in databases/memory for subsequent lookups and processing).

However, my more recent experience is with building our performance modelling technology, which now includes the ability to automatically build performance models from lots of APM data, more or less in real-time if the APM product and API can keep up (which is debatable: some of the APM data is big and there is lots of it, e.g. each transaction trace from Dynatrace can be up to 1MB of XML, and the REST API can only serve about 10 TPS of trace data, while the system being monitored can have 100s to 1000s of TPS of transactions).
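A quick back-of-envelope calculation with those (rough) numbers shows why keeping up is debatable:

```python
# Back-of-envelope numbers from the paragraph above (rates are indicative only).
trace_size_mb = 1.0    # up to 1 MB of XML per transaction trace
api_tps = 10           # roughly what the REST API can serve
system_tps = 1_000     # what the monitored system can produce

print(f"sample rate: {api_tps / system_tps:.1%}")                       # 1.0%
print(f"data rate:   {api_tps * trace_size_mb * 3600 / 1024:.0f} GB/hour")  # ~35
```

So at 1000 TPS you can only pull down about 1% of the traces, and even that is tens of GB an hour.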

We have a number of different approaches for getting, processing, and ingesting APM data, and have experimented with Apache Hive and Spark. We spit the metrics from our simulation engine out into Cassandra, as nothing else can keep up with the fact that we are simulating hours of monitoring data in seconds or minutes.

So recently there's been renewed interest in stream processing architectures. The Lambda architecture was one of the first (recent) ones to gain attention. Here's a picture; the idea is that there are three layers: one for real-time/speed/streaming data, another for "batches", and a "serving" layer which combines the real-time and historical data. This is more or less (probably exactly) the approach I used 10+ years ago, and it makes good sense as a first approximation. But it turns out to be unnecessarily complicated!

[Image: Lambda architecture diagram]
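To make the three layers concrete, here's a minimal sketch (plain Python; all the names and numbers are made up) of a serving-layer query that merges a precomputed batch view with a live speed-layer view:

```python
# Minimal Lambda architecture sketch (illustrative only).
# Batch layer: precomputed aggregates over historical data.
# Speed layer: incremental aggregates over events not yet in a batch.
# Serving layer: merges both views to answer a query.
from collections import defaultdict

batch_view = {"checkout": 120_000}   # counts from the last batch run
speed_view = defaultdict(int)        # counts since the last batch run

def on_event(event):
    """Speed layer: update the real-time view as each event arrives."""
    speed_view[event["txn"]] += 1

def query_count(txn):
    """Serving layer: combine batch (historical) and speed (real-time) views."""
    return batch_view.get(txn, 0) + speed_view.get(txn, 0)

for e in [{"txn": "checkout"}, {"txn": "checkout"}]:
    on_event(e)

print(query_count("checkout"))  # 120002: batch + real-time combined
```

Part of the "unnecessarily complicated" complaint is visible even here: the same counting logic effectively has to exist twice, once for the batch path and once for the speed path.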
Along came the Kappa architecture. Where's the batch/historical layer gone? Well, if you think about it (hard), you actually don't need to keep track of real-time and historical data in different layers; you can just combine them (in an immutable data store, yeah) and use this for any sort of query you like. Clever, but it probably assumes you have good caching and can keep most of the real-time data in memory? (It depends on window sizes etc., I guess.)
[Image: Kappa architecture diagram]
From https://www.ericsson.com/research-blog/data-knowledge/data-processing-architectures-lambda-and-kappa/ 

Oh, the other downside is that you may have to "regenerate" (like Doctor Who?!) or "reprocess" the data sometimes. I keep forgetting how this works exactly, so here's the explanation from the above link (if it's wrong I can blame them):

The key idea is to handle both real-time data processing and continuous data reprocessing using a single stream processing engine. Data reprocessing is an important requirement for making visible the effects of code changes on the results. As a consequence, the Kappa architecture is composed of only two layers: stream processing and serving. The stream processing layer runs the stream processing jobs. Normally, a single stream processing job is run to enable real-time data processing. Data reprocessing is only done when some code of the stream processing job needs to be modified. This is achieved by running another modified stream processing job and replaying all previous data.

Note that this may take time! And I think you have to have access to spare compute resources on demand, to enable sufficient concurrency and to keep the existing stream job running while you are reprocessing...
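As a rough illustration of what reprocessing can look like in practice (a sketch assuming Kafka as the immutable log and the kafka-python client; the topic and group names are made up): the modified job simply re-reads the same log from the beginning under a new consumer group, while the existing job keeps running.

```python
# Kappa-style reprocessing sketch (assumes a local Kafka broker and the
# kafka-python package; topic and group names are hypothetical).
from kafka import KafkaConsumer

def process_v2(payload: bytes) -> None:
    # The *modified* stream processing logic would go here.
    print(len(payload))

# The existing real-time job keeps consuming under its own consumer group.
# The modified job replays the whole immutable log from offset zero by
# using a *new* group id (fresh offsets) and auto_offset_reset="earliest".
consumer = KafkaConsumer(
    "apm-metrics",
    bootstrap_servers="localhost:9092",
    group_id="metrics-job-v2",
    auto_offset_reset="earliest",
)
for msg in consumer:
    process_v2(msg.value)
```

The concurrency cost mentioned above is that both consumer groups are pulling from the broker at once until the replay catches up.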

Ok, so here's my "clever" idea. Our performance modelling tool can take real-time APM data and build performance models on the fly (continuous performance modelling and prediction), and spit out real-time predictive metrics. But guess what (ignoring the real-time aspect for the time being, it's not so important): the predicted metrics have the same form as the input metrics, just for the future. So you can have a recursive loop and consume the output predictive metrics as input metrics to build a performance model from, and so on.

Is this useful? Maybe. I've used it for conducting experiments on how good the performance models are (accuracy and fidelity), as the more times you can repeat this process and not end up with complete nonsense, the better the accuracy and fidelity is. (In theory the ideal case is being able to repeat it infinitely with no degradation; in practice the only way to do that is to build a "model" using all of the actual raw input data, since models by definition introduce abstraction/simplification. But a "model" of just the raw data isn't a model and isn't really all that useful. Oh well.)

The other idea is continuous performance modelling, maybe of multiple alternatives (at once?). In this case things can get complicated, as you have to keep track of the real-time/historical input data and all the possible alternative future predictions (potentially over different time frames, from minutes to months, if load forecasting is part of the problem and you have sufficiently long-term historical data available). We have done this; it's not fast to do at present.
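As a toy illustration of that recursive loop (this is not our modelling engine; the "model" here is just a straight-line fit and all the numbers are made up): fit the model to a window of metrics, extrapolate the next window, feed the predictions back in as if they were observations, and watch how quickly things degrade.

```python
# Toy version of the recursive modelling loop (illustrative only).
# "Model" = a linear trend fitted to a window of metrics; "prediction" =
# the extrapolated next window; predictions then become the next input.
import statistics  # covariance() requires Python 3.10+

def fit_and_predict(window):
    """Fit a linear trend to the window and extrapolate one window ahead."""
    n = len(window)
    xs = list(range(n))
    slope = statistics.covariance(xs, window) / statistics.variance(xs)
    intercept = statistics.mean(window) - slope * statistics.mean(xs)
    return [intercept + slope * (n + i) for i in range(n)]

observed = [100, 102, 101, 105, 107, 106, 110, 112]  # made-up response times
window = observed
for generation in range(3):
    window = fit_and_predict(window)  # predictions become the next input
    print(f"generation {generation + 1}: {[round(v) for v in window]}")

# Each generation amplifies whatever the model got wrong, so the fewer
# generations it takes to reach nonsense, the worse the model's fidelity.
```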

Ok, so here's the thing. What if you change the Kappa architecture and feed the results of the "queries" (or the predictive analytics, in our case) back into the input of the real-time engine, continuously? Would this work? Is it useful? The "concept" is that the real-time immutable store now has real-time (from the "outside" world), historical, and predicted/future data (possibly multiple alternatives) all in the one store.
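One way to keep that single store sane (a minimal sketch under assumptions, not a worked-out design; every field name here is invented) is to tag each record with its provenance and timeline, so queries can distinguish the observed past/present from predicted futures:

```python
# Sketch: one immutable store holding both observed and predicted data,
# with provenance tags so queries can tell them apart (names invented).
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class Record:
    metric: str
    value: float
    valid_time: float  # when the value is (or is predicted to be) true
    kind: Literal["observed", "predicted"] = "observed"
    timeline: str = "real"  # which alternative future, if predicted

store: list[Record] = []  # stand-in for the immutable log

store.append(Record("latency_ms", 105.0, valid_time=1000.0))
store.append(Record("latency_ms", 130.0, valid_time=1060.0,
                    kind="predicted", timeline="scenario-A"))

def query(metric, start, end, kinds=("observed", "predicted")):
    return [r for r in store
            if r.metric == metric and start <= r.valid_time < end
            and r.kind in kinds]

print(query("latency_ms", 0, 2000))  # past, present and futures together
```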

At this point we need to invoke Doctor Who and time travel metaphors. You have the past, the present, and possibly multiple futures available (maybe this should be called a "Tardis" architecture!).
Rather than modifying the above diagrams or drawing one from scratch, I think the idea can be captured from mythology: the symbol for infinity, or Ouroboros (a dragon eating its own tail, ouch).

[Image: Ouroboros]
To make this work, you would need to supplement the stream processing with temporal operators that also support future time, and possibly multiple time-lines? Also known as the Multiverse theory.
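As a guess at what such operators might look like (reusing the hypothetical Record and store from the sketch above), here's a window that spans both sides of "now", plus an "eventually" operator evaluated per timeline:

```python
# Guess at future-aware temporal operators (hypothetical; builds on the
# Record/store sketch above). A window spans both sides of "now", and
# predicates are evaluated once per timeline (real + each predicted future).
from collections import defaultdict

def window_by_timeline(records, now, past=60.0, future=60.0):
    """Group records in [now - past, now + future) by timeline."""
    groups = defaultdict(list)
    for r in records:
        if now - past <= r.valid_time < now + future:
            groups[r.timeline].append(r)
    return groups

def eventually(records, now, predicate, horizon=60.0):
    """Per timeline: does some record within the future horizon satisfy it?"""
    return {tl: any(predicate(r) for r in rs)
            for tl, rs in window_by_timeline(records, now, 0.0, horizon).items()}

# e.g. which timelines breach a latency SLA within the next two minutes?
alarms = eventually(store, now=1000.0, horizon=120.0,
                    predicate=lambda r: r.metric == "latency_ms" and r.value > 120)
print(alarms)  # {'real': False, 'scenario-A': True}
```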

I also wondered about calling this a "Mu" architecture (which doesn't seem to be taken yet), but I sort of like "Tardis" architecture; or maybe my recent idea for a "Data Bog" could be repurposed for this role, given its nature as a portal to another dimension???

The Doctor's timelines!

PS
Note that you DON'T WANT TO GET TRAPPED IN A TIME LOOP!
This is a "very bad thing" and can actually happen (easily) with stream processing systems if (a) the architecture isn't designed to detect duplicate events, and (b) you subscribe to events that you can accidentally end up producing (easier said than done to avoid if you have a distributed broker architecture and are dealing with overlapping spatial and temporal data regions; whoops, I've done it before). You end up with a data cascade (not sure if this is the right term) or chain reaction: a self-amplifying chain of events.
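A cheap defence (a sketch, not a recipe; all the names are invented) is to stamp every event with an id and a hop count, drop anything you've already seen, and kill events that have been round the loop too many times:

```python
# Sketch of two cheap time-loop defences (all names invented):
# 1. dedupe on event id; 2. a hop-count / TTL so re-emitted events die out.
import uuid

MAX_HOPS = 5
seen_ids: set[str] = set()

def make_event(payload, parent=None):
    hops = (parent["hops"] + 1) if parent else 0
    return {"id": str(uuid.uuid4()), "hops": hops, "payload": payload}

def process(event, emit):
    if event["id"] in seen_ids:
        return  # duplicate: already processed, drop it
    seen_ids.add(event["id"])
    if event["hops"] >= MAX_HOPS:
        return  # been round the loop too many times: break the chain
    # ... real processing here; suppose it (accidentally) re-emits a result
    emit(make_event(event["payload"], parent=event))

pending = [make_event({"metric": "latency_ms", "value": 105})]
while pending:  # the "broker": our own output fed straight back to us
    process(pending.pop(), emit=pending.append)
print(len(seen_ids))  # 6, not infinity: the loop terminated
```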

More on event cascades: https://www.nature.com/articles/srep33321 

P2S
I'm still stuck on what to call this architecture. Mu architecture (not taken yet, as far as I can tell), Tardis architecture, Multiverse architecture, Bog architecture, or Ouroboros architecture?

P3S
This blog post points out problems with Kappa architectures (one size doesn't fit all) and also considers the use of machine learning and predictions in this context:
https://jornfranke.wordpress.com/2016/11/11/lambda-kappa-microservice-and-enterprise-architecture-for-big-data/
from
https://jornfranke.wordpress.com/

And what's this? Another architecture? The Zeta!

P4S
And just what is time anyway? Does it go in directions? Is there an "arrow of time"? Is it just "negative" in the past and "positive" in the future? There is an interesting article in Wired by a philosopher of physics who argues that time is more common-sense than some physicists believe: it has a direction, and it is a "result" of the laws of physics. I.e. the universe is generative, and one "state" results from the laws grinding away and producing the next (my explanation). Of course, this doesn't necessarily preclude you running it backwards?

