Big Data/Data Analytics Performance and Scalability: Opportunities for Performance Modelling Data Analytics Platforms and Applications

Paul Brebner, Draft V1 February 2016, Draft V2 April 2017

Data Analytics and Performance Modelling

My last ten years of R&D with NICTA, followed by start-up CTO experience, resulted in a software performance modelling technology which is itself an example of data analytics.  The most recent innovation is the ability to automatically build performance models from software monitoring data (lots of it).  Our processing pipeline (and experience with client technologies) includes commercial and open-source data analytics tools (e.g. Splunk, Hive, R, Cassandra, Amazon cloud, etc.).


·       I have recently applied automatic performance modelling to a client problem (Department of Immigration, Visa Risk System), predicting real-time analytics performance and scalability issues due to code changes during their DevOps lifecycle.

·       Some recently published research concludes that there are potentially big problems, but also significant opportunities, in the performance and scalability of Big Data/data analytics.  Some problems run up to 10 times slower under different configurations; some won't run at all on given infrastructure; others will simply cost more to solve.

·       Apache Spark was motivated/invented through performance modelling of previous approaches (TODO: track down the paper again). Here are some papers on performance modelling of Spark:

http://ieeexplore.ieee.org/document/7336160/


And this one on measured performance (not modelling):

https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf



·       Some previous colleagues and co-authors have published in this area, including Professor Wolfgang Emmerich on data analytics for business innovation, and Ian Gorton on data analytics architecture.

An example of Performance Modelling for Data Analytics

·       The following is a simple example data analytics performance model.  It shows the impact on total time and cost for a data analytics problem, with performance data borrowed from a publication.  The problem is a MapReduce job with 1 TB of input data and 100 cores available, running on a public cloud such as Amazon EC2 at 10c/core/hour.

·       It shows the difference between the same job run on Hadoop and then on the Spark architecture (single pass): a speedup of about 20% and a reduced cost of $10 per run. Over a year, assuming a typical business use case of one run per day, this amounts to a cost saving of $3,650.  Similar modelling may be used to optimise more complex problems and configuration options, including checking whether workloads and architectures will scale up to significantly larger sizes (e.g. what if there is 1 petabyte of data? What if there are 10,000 cores? What is the optimal price/performance? etc.).

·       Given that performance models can now be routinely and automatically built from APM vendor data (e.g. Dynatrace) for problems such as this, for systems of arbitrary size and complexity, it would be relatively straightforward to investigate some typical data analytics applications and platforms and determine whether there are performance and scalability problems, possible optimisations, etc.

·       There is also a substantial opportunity to invent a new, better-performing, more scalable, cost-effective, cloud-capable data analytics solution through performance modelling of existing platforms and alternatives to solve specific novel problems (e.g. in the areas of Big Data, IoT, and application/cloud performance management: real-time autonomic systems that, for example, consume AWS X-Ray monitoring data and automatically scale AWS services and infrastructure as load and problems change over time).

·       There is still a lot of manual configuration and performance engineering for AWS, particularly for databases, which is time-consuming, error-prone, and insufficiently elastic and automatic for a cloud platform, and which could potentially be addressed through a combination of monitoring, performance modelling, and prediction.
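The "what if 1 petabyte? what if 10,000 cores?" questions above can be sketched with a deliberately naive scaling model. This is only a sketch under stated assumptions: the 5-hour baseline run time is invented for illustration, and run time is taken to scale linearly with data size and inversely with core count, which ignores serial and shuffle phases and so gives an optimistic bound.

```python
# Naive what-if scaling sketch (all numbers are assumptions, not from a
# measured system).  Run time is modelled as linear in data size and inverse
# in core count; real MapReduce jobs have serial and shuffle phases, so this
# is an optimistic bound on speedup.

BASE_DATA_TB = 1.0   # baseline: 1 TB input
BASE_CORES = 100     # baseline: 100 cores
BASE_HOURS = 5.0     # assumed baseline run time (hypothetical)
RATE = 0.10          # $/core/hour (10c/core/hour)

def predict(data_tb, cores):
    """Return (hours, dollars) for one run under the naive linear model."""
    hours = BASE_HOURS * (data_tb / BASE_DATA_TB) * (BASE_CORES / cores)
    return hours, cores * hours * RATE

# 1 PB = 1000 TB: compare the baseline, a petabyte on the same cluster,
# and a petabyte on a 100x larger cluster.
for data_tb, cores in [(1, 100), (1000, 100), (1000, 10_000)]:
    hours, cost = predict(data_tb, cores)
    print(f"{data_tb:>5} TB on {cores:>6} cores -> {hours:9.1f} h, ${cost:,.2f}")
```

One insight even this toy model makes visible: with perfectly linear scaling, total cost is independent of core count (cores × hours is constant), so adding cores buys elapsed time, not cheaper runs. Deviations from that in measured monitoring data are exactly what a real performance model captures.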

HADOOP Model

The following performance model is for the standard MapReduce Hadoop solution.

[Figure: Hadoop MapReduce performance model]

SPARK Model

The following performance model is for the Apache Spark version of the same problem (in-memory, single-pass processing).

[Figure: Spark performance model]

And the results:
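The results can also be reproduced as a few lines of arithmetic. This is a minimal sketch, not the original model: the 5-hour Hadoop baseline is an assumption, chosen so that the 20% Spark speedup yields the $10/run and $3,650/year savings quoted above.

```python
# Sketch of the Hadoop vs Spark cost comparison.
# Assumption (not in the text above): a 5-hour Hadoop baseline run time,
# picked so the 20% speedup and $10/run saving are mutually consistent.

CORES = 100
RATE_PER_CORE_HOUR = 0.10   # $0.10/core/hour (10c/core/hour)
RUNS_PER_YEAR = 365         # one run per day

def run_cost(hours):
    """Cloud cost of one run: cores x hours x hourly rate."""
    return CORES * hours * RATE_PER_CORE_HOUR

hadoop_hours = 5.0                 # assumed baseline run time
spark_hours = hadoop_hours * 0.8   # ~20% speedup (single-pass, in-memory)

saving_per_run = run_cost(hadoop_hours) - run_cost(spark_hours)
annual_saving = saving_per_run * RUNS_PER_YEAR

print(f"Hadoop: ${run_cost(hadoop_hours):.2f}/run over {hadoop_hours:.1f} h")
print(f"Spark:  ${run_cost(spark_hours):.2f}/run over {spark_hours:.1f} h")
print(f"Saving: ${saving_per_run:.2f}/run, ${annual_saving:.2f}/year")
```

Under this cost model the saving per run is simply cores × rate × hours saved: at $10/hour for 100 cores, the 20% speedup saves exactly one hour, hence $10 per run.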

