Big Data/Data Analytics Performance and Scalability: Opportunities for Performance Modelling Data Analytics Platforms and Applications

Paul Brebner, Draft V1 February 2016, Draft V2 April 2017

Data Analytics and Performance Modelling

My last 10 years of R&D with NICTA, followed by start-up CTO experience, resulted in a Software Performance Modelling technology which is itself an example of data analytics.  The most recent innovation is the ability to automatically build performance models from large volumes of software monitoring data.   Our processing pipeline (and experience with client technologies) includes commercial and open source data analytics tools (e.g. Splunk, Hive, R, Cassandra, the Amazon cloud, etc.).


·       I have recently applied automatic performance modelling to a client problem (Department of Immigration, Visa Risk System) involving predicting real-time analytics performance and scalability issues caused by code changes during their DevOps lifecycle.

·       Some recent published research concludes that there are potentially big problems, but also significant opportunities, with the performance and scalability of Big Data/Data Analytics.  Some workloads run up to 10 times slower under different configurations; some won't run at all on a given infrastructure; others will simply cost more to solve.

·       Apache Spark was motivated/invented through performance modelling of previous approaches (TODO: track down the paper again). Here is a paper on performance modelling of Spark:

http://ieeexplore.ieee.org/document/7336160/


And this one on Spark performance (measurement, not modelling):

https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf



·       Some previous colleagues and co-authors have published in this area, including Professor Wolfgang Emmerich on data analytics for business innovation, and Ian Gorton on data analytics architecture.

An example of Performance Modelling for Data Analytics

·       Following is a simple example data analytics performance model.  It shows the impact on total time and cost for a data analytics problem, with performance data borrowed from a publication.  The problem is a MapReduce job with 1TB of input data and 100 cores available, running on a public cloud such as Amazon EC2 at 10c/core/hour.

·       It shows the difference between the same job run on Hadoop and then on the SPARK architecture (single pass): a speedup of about 20% and a reduced cost of $10 per run. Over a year, assuming a typical business use case of one run per day, this amounts to a cost saving of $3,650.  Similar modelling may be used to optimise more complex problems and configuration options, including checking whether workloads and architectures will scale up to significantly larger sizes (e.g. what if there is 1 petabyte of data? What if there are 10,000 cores? What is the optimal price/performance?).
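The saving can be checked with a back-of-envelope calculation. In the Python sketch below, only the 100 cores, the 10c/core/hour rate, the ~20% speedup, and the one-run-per-day assumption come from the example above; the 5-hour Hadoop run time is a hypothetical value implied by those figures (a $10/run saving at a 20% speedup means a $50, i.e. 500 core-hour, Hadoop run):

```python
# Back-of-envelope cost model for the example above.
# The Hadoop run time is an assumption implied by the quoted figures,
# not a measurement.
CORES = 100
PRICE_PER_CORE_HOUR = 0.10        # USD; e.g. a public cloud such as EC2
HADOOP_HOURS = 5.0                # implied Hadoop elapsed time (assumed)
SPARK_HOURS = HADOOP_HOURS * 0.8  # ~20% faster, single in-memory pass

def run_cost(hours):
    """Cost of one run: elapsed hours x cores x hourly per-core rate."""
    return hours * CORES * PRICE_PER_CORE_HOUR

saving_per_run = run_cost(HADOOP_HOURS) - run_cost(SPARK_HOURS)  # ~$10
annual_saving = saving_per_run * 365                             # ~$3,650 at 1 run/day
print(f"Saving per run: ${saving_per_run:.2f}, per year: ${annual_saving:.2f}")
```

The same three-line cost function scales directly to the "what if" questions above: change the core count or run time and the price/performance trade-off falls out immediately.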

·       Given that performance models can now be routinely and automatically built for problems such as this, for systems of arbitrary size and complexity, from APM vendor data (e.g. Dynatrace), it would be relatively straightforward to investigate some typical data analytics applications and platforms and determine whether there are performance and scalability problems or possible optimisations.

·       There is also the substantial opportunity to invent a new, better-performing, more scalable, cost-effective, cloud-capable data analytics solution through performance modelling of existing platforms and alternatives, to solve specific novel problems (e.g. in the areas of Big Data, IoT, and application/cloud performance management: real-time autonomic systems, such as consuming AWS X-Ray monitoring data and automatically scaling AWS services and infrastructure as load and problems change over time).

·       There is still a lot of manual configuration and performance engineering for AWS, particularly for databases, which is time-consuming, error-prone, and insufficiently elastic and automatic for a cloud platform, and which could potentially be addressed through a combination of monitoring, performance modelling, and prediction.

HADOOP Model

The following performance model is for the standard MapReduce HADOOP solution.
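As a sketch of what such a model computes, the following Python fragment models a MapReduce job as sequential map, shuffle (intermediate data written to and read from disk), and reduce phases. The per-core throughputs and shuffle fraction are illustrative assumptions, not the published figures used in the actual model:

```python
# Simple phase-based model of a Hadoop MapReduce job.
# All rates are per-core throughputs in GB/hour and are assumed values.
def hadoop_time_hours(input_gb, cores,
                      map_rate=5.0,       # assumed map throughput per core
                      shuffle_rate=10.0,  # assumed disk shuffle throughput per core
                      reduce_rate=8.0,    # assumed reduce throughput per core
                      shuffle_frac=0.5):  # assumed fraction of input shuffled
    map_t = input_gb / (cores * map_rate)
    shuffle_gb = input_gb * shuffle_frac
    shuffle_t = shuffle_gb / (cores * shuffle_rate)  # intermediate disk I/O
    reduce_t = shuffle_gb / (cores * reduce_rate)
    return map_t + shuffle_t + reduce_t

# The 1TB / 100-core example problem:
t = hadoop_time_hours(1000.0, 100)
```

With these assumed rates the example job takes 3.125 hours; doubling the cores roughly halves the time, which is the kind of scalability question the model answers.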
  

SPARK Model

The following performance model is for the Apache SPARK version of the same problem (in-memory, single-pass processing).
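Again as an illustrative sketch (the rate below is an assumed value, not the published data): because intermediate results stay in memory, the Spark version collapses to a single processing pass with no disk shuffle phase. The assumed throughput is chosen so that the example reproduces roughly the ~20% speedup quoted above:

```python
# Single-pass in-memory model of the SPARK version: one combined
# processing phase, no intermediate disk shuffle. The rate is an
# assumed per-core GB/hour throughput for the whole pipeline.
def spark_time_hours(input_gb, cores, process_rate=4.0):
    return input_gb / (cores * process_rate)

# The 1TB / 100-core example problem: 2.5 hours with the assumed rate,
# i.e. about 20% faster than the multi-phase Hadoop model.
t = spark_time_hours(1000.0, 100)
```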



And the results:

