Predictive Analytics Workflows vs. Machine Learning Workflows?



Is one workflow the same as another workflow? (When they produce the same outcome?)


Recently I was wondering how similar our predictive analytics software performance modelling workflow is to current best-practice Machine Learning/Data Analytics workflows (maybe called pipelines).

Our basic workflow is:


  1. Get the data (APM and other data)
  2. Examine a subset and see if it makes "sense", and whether there is sufficient quality and quantity (i.e. the data is complete: we have all the types of information needed, for the whole system, covering all transaction types with transaction samples for the period of time required, etc.)
  3. Take a sample of the data and pre-process it to remove exceptions, anomalous transactions, etc (often requires an incremental and manual/visualisation approach as well as our automated algorithms)
  4. Decide what type of model to build (depending on the question to be answered, the complexity of the system, what level of aggregation of workloads, software and infrastructure etc).
  5. Automatically build an initial model from the sample data. This builds a model of the required type and structure which is correctly parameterised from the performance data. Many different sorts of pre-processing are required depending on the number, type, format and semantics of the data sources (e.g. aggregation, correlation and integration of data sources, statistical and regression analysis to get the correct distributions of metrics, etc). We have used a variety of technologies for this including XML/JSON/Apache Hive and Spark, etc.
  6. Calibrate the model from the same or different data samples. This basically scales some of the parameters so that the model produces the correct results for the observed/modelled load on the given infrastructure (see the first sketch after this list). Calibration requires extra data sources including infrastructure and workload data and metrics.
  7. Model validation. How well does it work against test data? How useful is the model to answer the questions posed? Is it high enough fidelity? Does it show the things we need to see and hide the things we don't care about? I.e. is it the right level of abstraction?
  8. Sensitivity analysis is often done here to determine if the model is robust enough for the purpose required; if some components are too sensitive, that may indicate that more data is required, and we start again from 3 with larger sample sizes (see the second sketch after this list).
  9. If model accuracy for the purpose required is sufficient then use the model (e.g. explore alternative workloads, transaction mixes, software changes, infrastructure changes, etc) and make predictions. 
  10. If the model is not accurate enough then change the model type and repeat from 4, and/or repeat from 3 with more or different data. In some cases we need to go back to 1, as the data itself may be problematic (incomplete, biased, etc.) and different data sampling is needed.
  11. If the amount of data (in terms of time or size) is too large, or we have a time deadline for modelling (e.g. 2 hours max until we need predictions), then incremental/dynamic sampling can be done at 3 using increasing sizes of data until a model is produced within the given constraints (see the third sketch after this list).
  12. And of course you don't stop there: the whole process can be incremental, piecemeal and continuous, with, say, a model built from data from both the current production system and the most recent changes from DevOps, then applied to past workload data (e.g. the worst-case load from last month, with forecasting applied to next month and inference for the theoretically predicted worst-case transaction mix), and then run against the latest code commit in production for the next 2 hours in a Canary environment to check if the commit is causing any problems.
  13. And of course we want this workflow to be repeatable, automatic, scalable, fast, usable (by non-experts), transparent, error-proof, to work with multiple different vendor APM products, etc.
  14. Data, model and prediction provenance (i.e. can you trace it back? Repeat it? Vary it?) is also critical (see the fourth sketch after this list). We found you have to build this into your tool chain/pipeline (not the first time I've found this; the same came up at CSIRO in the 1990s for a modelling tool architecture, which was conceptually similar: you have lots of questions, some data, and a bunch of tools/methods to solve them, and you need to pick and chain them together, try them out, repeat, etc.). Also, for predictive analytics the predictions are just metrics too; in theory you can use them as a source of data for further modelling, so they need to be stored somewhere with this in mind (see my idea of a data analytics architecture that allows you to treat past, present and future data identically). You can also use the predictions to test the model by seeing how many times you can learn from the previous output until you just get noise. An ideal model would allow this a large number (infinite?!) of times.
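
To make step 6 a bit more concrete, here is a minimal sketch of calibration, assuming a toy model where utilisation is just arrival rate times service demand. The function names and numbers are made up for illustration; they are not our actual tooling.

```python
# Toy calibration sketch (step 6): scale a model parameter so that the model
# reproduces the observed utilisation. All names and numbers are illustrative.

def predicted_utilisation(arrival_rate, service_demand, scale=1.0):
    """Utilisation predicted by the (toy) model for a scaled service demand."""
    return arrival_rate * service_demand * scale

def calibrate_scale(arrival_rate, service_demand, observed_utilisation):
    """Solve for the scale factor that makes the model match the observation."""
    # This toy model can be solved directly; a real calibration would usually
    # fit the scale numerically (e.g. least squares over many observations).
    return observed_utilisation / (arrival_rate * service_demand)

# Example: APM data shows 50 req/s at 60% CPU; the initial model assumed 10 ms/request.
scale = calibrate_scale(arrival_rate=50.0, service_demand=0.010,
                        observed_utilisation=0.60)
print(f"calibration scale factor: {scale:.2f}")      # 1.20
print(predicted_utilisation(50.0, 0.010, scale))     # 0.6, matches the observation
```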
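And a rough sketch of step 8, using a toy M/M/1-style response time model as a stand-in for a real performance model: perturb each parameter by a few percent and see how much the prediction moves. Everything here is illustrative only.

```python
# Toy sensitivity analysis sketch (step 8). The model R = S / (1 - lambda*S)
# is only a stand-in; names, numbers and perturbation sizes are illustrative.
import itertools

def response_time(arrival_rate, service_demand):
    utilisation = arrival_rate * service_demand
    assert utilisation < 1.0, "toy model only valid below saturation"
    return service_demand / (1.0 - utilisation)

base = {"arrival_rate": 50.0, "service_demand": 0.012}
base_r = response_time(**base)

for name, delta in itertools.product(base, (-0.05, +0.05)):
    perturbed = dict(base, **{name: base[name] * (1 + delta)})
    change = (response_time(**perturbed) - base_r) / base_r
    print(f"{name} {delta:+.0%} -> response time change {change:+.1%}")

# A +5% change in service demand moves response time by roughly +13% here:
# near saturation the model is very sensitive, which is exactly the kind of
# result that sends us back to step 3 for more (or better) data.
```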
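Step 11 (incremental/dynamic sampling under a time budget) can be sketched like this; build_model(), accuracy() and the 2-hour budget are placeholders for whatever the surrounding pipeline actually provides.

```python
# Incremental sampling sketch (step 11): grow the sample until the model is
# accurate enough or the time budget runs out. build_model() and accuracy()
# are hypothetical hooks supplied by the surrounding pipeline.
import time

def incremental_modelling(data, build_model, accuracy,
                          budget_s=2 * 3600, target=0.9,
                          start_frac=0.01, growth=2.0):
    deadline = time.monotonic() + budget_s
    frac, best = start_frac, None
    while time.monotonic() < deadline and frac <= 1.0:
        sample = data[: max(1, int(len(data) * frac))]   # simple prefix sample
        best = build_model(sample)
        if accuracy(best) >= target:
            break                                        # good enough, stop early
        frac *= growth                                   # otherwise take more data
    return best
```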
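Finally, a toy illustration of what step 14 (provenance) implies: record enough about every model and prediction to trace it back, repeat it, and feed the predictions back in as ordinary metrics. The field names are invented for illustration; any metadata store would do.

```python
# Toy provenance record (step 14). Field names are invented for illustration.
import datetime
import hashlib
import json

def provenance_record(data_sample, model_params, predictions):
    return {
        "created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "data_hash": hashlib.sha256(
            json.dumps(data_sample, sort_keys=True).encode()).hexdigest(),
        "model_params": model_params,   # enough to rebuild/repeat the model
        "predictions": predictions,     # stored as ordinary metrics so they can
                                        # become input data for further modelling
    }

record = provenance_record(
    data_sample=[{"txn": "login", "ms": 42}],
    model_params={"model_type": "queueing", "scale": 1.2},
    predictions={"p95_response_ms": 180},
)
print(json.dumps(record, indent=2))
```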
How does this compare with ML workflows?

A couple of ML pipelines online include basic steps such as preprocessing, feature extraction, training, testing; or data cleaning, feature (selection, preprocessing, construction), model selection, parameter optimization, model validation.
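
For comparison, here is that generic ML pipeline written out with scikit-learn (assumed to be installed): preprocessing, feature selection, model fitting, then validation on held-out test data. It's purely illustrative, not any particular vendor's workflow.

```python
# Generic ML pipeline: preprocessing -> feature selection -> model -> validation.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("clean", StandardScaler()),                     # data cleaning / preprocessing
    ("features", SelectKBest(f_regression, k=10)),   # feature selection
    ("model", Ridge(alpha=1.0)),                     # model selection + parameters
])
pipeline.fit(X_train, y_train)                       # training
print("test R^2:", pipeline.score(X_test, y_test))   # model validation on test data
```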








One of the main differences appears to be "features". Do we have features in performance models? Well yes, and these relate to the type of questions we need answered, the type of system being monitored and modelled, and the monitoring data and the choice of model to build. So the features are extracted in the pre-processing stage.

How about parameter optimisation and training? This is similar to model calibration, but also to our incremental approach used for sampling data from larger data sets. Model validation is similar and is done using test data (if available) or new, previously unseen data from the future.

There's probably no exact match between model sensitivity analysis and ML? Oh, OK, actually there is: it's part of the ML workflow for model assessment using emulators.
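
For the curious, here's a rough sketch of the emulator idea, assuming scikit-learn: fit a cheap Gaussian-process surrogate to a handful of runs of an "expensive" model, then probe sensitivity on the surrogate instead of the real thing. The expensive_model() function here is just a stand-in.

```python
# Emulator sketch: a Gaussian-process surrogate for an expensive model,
# used to probe input/output sensitivity cheaply. Purely illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_model(x):                 # stand-in for a slow simulation run
    return np.sin(3 * x) + 0.5 * x

X_train = np.linspace(0.0, 2.0, 15).reshape(-1, 1)   # a few real model runs
y_train = expensive_model(X_train).ravel()

emulator = GaussianProcessRegressor().fit(X_train, y_train)

# Probe many points cheaply on the emulator to estimate output variability.
X_probe = np.linspace(0.0, 2.0, 500).reshape(-1, 1)
y_probe = emulator.predict(X_probe)
print("approximate output range over the input range:",
      float(y_probe.min()), float(y_probe.max()))
```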

And where in the ML workflow do you select the correct type and variant of ML algorithm depending on data and problem to solve? This is perhaps a "meta-step"? E.g. a basic breakdown is around the "5 tribes of ML".

Actually I (and of course most other ML people) think there are more than 5 classes of algorithms.



See also Google Cloud ML workflow.

The latest AWS ML guide, including workflows.

And yes this stuff is all getting too complex. It needs automation (i.e. ML for ML!)


PS
This process for Data Mining was brought to my attention recently. It's similar to the above, but it reminds me that I forgot to mention the (obvious, but often implicit) stages of business problem understanding and deployment! Both are critical.
