Who am I? What does a Computer Scientist do during a typical career? Cloud and Internet-scale technologies

Cloud and Internet Scale Computing



Picture: Guess! Is it small, big, or "virtual"? (It's actually an internet visualisation.)

"Cloud" computing isn't really new, it's been around in one form another since the internet and distributed computing (i.e. decades).

Amazon Web Services (AWS) I attended the AWS Summit 2017 in Sydney (sessions of particular interest covered Kinesis, microservice architectures, workflows and monitoring, security architecture, scientific workloads, data analytics and machine learning, the new AWS X-Ray monitoring service, and enterprise migration). I am currently studying the AWS Certified Solutions Architect Official Study Guide (Sybex, 2017) and hope to be certified by June 2017. I am also writing a blog from the perspective of a “computer scientist” learning AWS, which I hope will include insightful architectural observations and trade-offs (https://acomputerscientistlearnsaws.blogspot.com.au). As a practical exercise I am working out how several example applications that I have experience with could be deployed to AWS, taking into account alternative workloads and cost trade-offs. I am also using the resources at https://aws.amazon.com/architecture/ and https://acloud.guru/

Cloud Benchmarking In March this year I conducted an independent performance assessment of cloud platforms for a private cloud company. The plan is a phased evaluation, increasing the number of platforms and what is measured as more sponsors become available. The initial round used two industry-standard benchmarks and focussed on CPU and memory speed on AWS, Azure, and private clouds. Initial results show that AWS is the slowest for “high speed CPU instances” and scales non-linearly with increasing instance size (vCPUs/threads). Other observations (with potential impacts on architecting cloud systems) include higher variability of results (for some cloud/instance combinations, particularly smaller public cloud instances) and unpredictable bursting and throttling of resources. Future rounds will expand to benchmark storage, multiple instances, and LAN and WAN speed. Candidate benchmarks for future rounds have been identified and trialled, including the SPECjbb®2015 benchmark and the new SPEC cloud elasticity benchmark (SPEC Cloud™ IaaS 2016).
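
To illustrate the kind of scalability check involved, here is a minimal sketch (the benchmark scores are made-up placeholders, not the measured results) of computing speedup and scaling efficiency from per-instance-size scores; efficiency well below 1.0 at larger sizes is the non-linear scaling referred to above.

```python
# Minimal sketch with hypothetical benchmark scores (higher is better); the
# real inputs would be the per-instance-size CPU/memory benchmark results.
scores = {1: 100.0, 2: 195.0, 4: 360.0, 8: 610.0, 16: 980.0}  # vCPUs -> score

baseline = scores[1]
for vcpus in sorted(scores):
    speedup = scores[vcpus] / baseline
    efficiency = speedup / vcpus  # 1.0 would be perfectly linear scaling
    print(f"{vcpus:>2} vCPUs: speedup {speedup:5.2f}x, scaling efficiency {efficiency:.2f}")
```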

Cloud/Internet scale performance I was an invited participant at the Workshop on Performance and Reliability (WOPR25), Wellington, NZ, February 2017 (accepted abstract on performance modelling for DevOps): three days of robust discussions on internet/cloud-scale performance testing and engineering (e.g. with Dynatrace, flood.io, Xero and Facebook experts). Challenges identified included incomplete top-to-bottom application monitoring, cloud “bill-shock”, unexpected throttling of cloud resources, and erroneous behaviour of 3rd-party services under load.

Cloud elasticity & cost R&D on the impact of cloud elasticity on real applications. Using performance data from three real client systems, I conducted benchmarking on multiple cloud provider platforms and modelled the impact of different instance sizes and type mixes, instance spin-up times, and workload patterns/spikes on response-time SLAs and cost. The results were published in CMG and ICPE.
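
As a rough illustration of this kind of elasticity model, here is a minimal sketch (all parameters and the workload are hypothetical, not the client data or the published results) of a discrete-time simulation of an auto-scaled pool with an instance spin-up delay, reporting cost and the minutes in which demand exceeded capacity as a crude SLA-risk proxy.

```python
# Illustrative elasticity sketch: naive auto-scaling with a spin-up delay.
SPIN_UP_MINUTES = 5            # assumed delay before a new instance takes load
CAPACITY_PER_INSTANCE = 100    # requests/min one instance can serve (assumed)
COST_PER_INSTANCE_MIN = 0.002  # assumed $ per instance-minute

# Hypothetical workload: steady load with a spike in the middle (requests/min)
workload = [300] * 60 + [900] * 30 + [300] * 60

running = 3      # instances currently serving traffic
pending = []     # minutes remaining until each pending instance is ready
cost = 0.0
breached_minutes = 0

for demand in workload:
    # Pending instances come online once their spin-up delay has elapsed
    pending = [m - 1 for m in pending]
    running += sum(1 for m in pending if m <= 0)
    pending = [m for m in pending if m > 0]

    capacity = running * CAPACITY_PER_INSTANCE
    if demand > capacity:
        breached_minutes += 1
    # Naive scaling rule: request another instance when utilisation exceeds 80%
    if demand > 0.8 * capacity:
        pending.append(SPIN_UP_MINUTES)

    cost += (running + len(pending)) * COST_PER_INSTANCE_MIN

print(f"cost ${cost:.2f}, minutes over capacity: {breached_minutes}/{len(workload)}")
```

Varying the spin-up delay, instance size mix and spike shape in a model like this is what exposes the trade-off between cost and response-time SLA risk.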

Cloud migration performance engineering Many of our clients over the last 10 years have wanted to understand the impact on performance, scalability, reliability, availability and cost of migrating to and from various cloud and in-house platforms. For example, one client was conducting UAT on a hybrid public cloud/in-house platform (where the user experience performed poorly), and I measured and modelled the expected performance and capacity on the final target 100% in-house platform. For another client I measured and modelled the performance and capacity resulting from migrating from an in-house physical platform to a private virtualised cloud platform; this required significant re-architecting and risk mitigation because of the poorer performance that was benchmarked and predicted.

Security challenges Working with many government departments also gave me experience of data security challenges. Typically the data was performance monitoring data captured by APM tools such as Dynatrace running in their secured production environments. Different approaches were needed to satisfy each department's security policies before performance models could be built from the data, including: encryption of the data before removal to our lab, transfer of the data onto a laptop on their premises before removal, de-identification of the performance data before removal from their systems, secure remote login to Dynatrace on their systems (in some cases to a separate copy of Dynatrace and the production data), and complete installation of our modelling software in their environment.
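
As an example of one of these approaches, here is a minimal sketch of de-identifying performance records before they leave a secured environment (the field names and values are hypothetical, not a specific APM export format).

```python
# Minimal de-identification sketch: keep the timing metrics, pseudonymise
# identifiers, and drop anything else. The salt would be generated and kept
# inside the client environment, never published or reused.
import hashlib

SALT = "kept-on-client-site"  # placeholder value for illustration only

def pseudonymise(value: str) -> str:
    """One-way salted hash so the same host/transaction maps consistently
    without revealing the original identifier."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

def de_identify(record: dict) -> dict:
    """Keep timing metrics, pseudonymise identifiers, drop everything else."""
    return {
        "timestamp": record["timestamp"],
        "transaction": pseudonymise(record["transaction"]),
        "host": pseudonymise(record["host"]),
        "response_time_ms": record["response_time_ms"],
    }

sample = {"timestamp": "2017-03-01T10:15:00", "transaction": "SubmitApplication",
          "host": "prod-app-03.example.gov", "response_time_ms": 412,
          "user": "jsmith"}  # the user field is dropped entirely by de_identify()
print(de_identify(sample))
```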

Application Performance Monitoring For many past clients, one of the main barriers to performance engineering distributed systems was incomplete application performance monitoring visibility/coverage (missing workloads, missing components, partial depth, no insight into 3rd-party services or components, etc.). For managed cloud services this raises the question of whether, and how, complete end-to-end/top-to-bottom application performance monitoring is possible. If comprehensive monitoring is impossible, how can adequate digital performance management be achieved in cloud environments? How can systems be architected to guarantee response-time SLAs for end users?

Internet architecting With CSIRO I re-architected systems for the internet using custom HTTP protocols and Grid cluster computing, managed the development of a Java-based application built on the Jabber/XMPP protocol, and managed a detailed technology and architecture test-bed evaluation of the Open Geospatial Consortium (OGC) Sensor Web standards, including the web notification service and sensor alert services.

Grid Services, Cloud features I was a visiting senior research fellow managing a distributed project for the architectural and technical evaluation of the Open Grid Services Architecture (OGSA) infrastructure. The project was based at UCL, working with Professor Wolfgang Emmerich, and ran across four sites in the UK. We established and trialled the OGSA infrastructure at four locations (two in London, Newcastle, Edinburgh) and conducted an architectural trade-off evaluation, which was published in the Journal of Grid Computing (2006).

I developed a new service to automatically deploy, secure, discover and securely consume end-user web services and supporting code across the distributed infrastructure (i.e. services to deploy, manage, resource and execute end-user-supplied services). As a side effect of this work I identified some missing capabilities that later became common public cloud computing features, including: resource measurement and billing to ensure fair use of resources and prevent resource hogging (including an idea for different, possibly dynamic/bidding-based, charging models so that short-term/interactive services could obtain some resources quickly even in competition with long-running batch computations, c.f. AWS spot instances); virtual machines for more scalable and agile security and isolation of resources; and automatic deployment of web services/code across resources to make fine-grained (non-batch) realistic distributed applications easy to build.
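
A minimal sketch of the bidding idea (all numbers are hypothetical): free capacity goes to the highest bids per core, so a short interactive job willing to bid above the going rate can obtain resources ahead of cheap long-running batch work, much as AWS spot instances later allowed.

```python
# Illustrative bid-based allocation: highest bid per core is served first.
requests = [
    {"name": "batch-sim-1", "cores": 8, "bid_per_core": 0.02},
    {"name": "batch-sim-2", "cores": 8, "bid_per_core": 0.02},
    {"name": "interactive-ui", "cores": 2, "bid_per_core": 0.10},
]
free_cores = 12

allocated = []
for req in sorted(requests, key=lambda r: r["bid_per_core"], reverse=True):
    if req["cores"] <= free_cores:
        free_cores -= req["cores"]
        allocated.append(req["name"])

print("allocated:", allocated, "| cores left:", free_cores)
# -> the interactive job gets cores first despite the batch backlog
```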

Cloud technologies The technology stack for our SaaS performance modelling tool (including integration with APM tools) consists of JavaScript (Dojo, D3, React, MobX, npm, gulp), Grails/Groovy, Java 8, Gradle, MySQL, Cassandra for storage of simulation metrics, REST APIs to Dynatrace XML and Compass/AppDynamics JSON, Apache Hive and Spark processing of APM data, the Apache Commons Math library for linear and non-linear regression, R, etc.

UNIX I have 15+ years of UNIX experience, including 5 years as a UNIX systems and kernel programmer and consultant in distributed systems; for example, I designed and wrote a TCP/IP-based application for a client to replicate Oracle databases across geographical regions.

I was Senior Tutor for Professor John Lions' Networked Systems course at UNSW from 1987 to 1990 (covering networks and distributed systems).

Data Analytics I am generally familiar with data analytics technologies because: (a) our modelling tool is a predictive data analytics tool which consumes large amounts of APM data; (b) we use data analytics tools ourselves; (c) our clients use data analytics tools and some of their problems/solutions have focussed on them (e.g. the use of bin-packing algorithms combined with performance modelling to optimise load balancing for “R” models in an immigration visa processing system, sketched below); (d) I have built performance models of the difference between Hadoop and Spark performance; and (e) I’ve attended conferences where the performance of data analytics tools and architectures has been the topic of conversation. I also recently conducted an architectural risk assessment for a client in the financial regulation area whose technologies included a data lake, Kafka, Elasticsearch, the Greenplum data warehouse, Oracle, blue/green deployment, and a Kappa architecture.
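
A minimal sketch of that bin-packing idea (the model names and CPU costs are hypothetical, not the actual system's data), using a simple first-fit-decreasing heuristic to assign model runs to nodes:

```python
# First-fit-decreasing bin packing of model runs onto nodes by CPU demand.
NODE_CAPACITY = 100.0  # CPU units per node (assumed)

# model name -> measured CPU cost per run (hypothetical values)
model_costs = {"risk_score": 60, "doc_check": 45, "triage": 30, "fraud": 25, "dedupe": 20}

nodes = []  # each node is a list of (model, cost) pairs
for model, cost in sorted(model_costs.items(), key=lambda kv: kv[1], reverse=True):
    for node in nodes:
        if sum(c for _, c in node) + cost <= NODE_CAPACITY:
            node.append((model, cost))
            break
    else:
        nodes.append([(model, cost)])  # no existing node fits: open a new one

for i, node in enumerate(nodes, 1):
    used = sum(c for _, c in node)
    print(f"node {i}: {[m for m, _ in node]} ({used}/{NODE_CAPACITY} used)")
```

In practice the per-model costs would come from performance measurement and modelling, which is where bin packing and performance modelling complement each other.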

Example large-scale Software Architecture Project An example of a reasonably complex architectural project that I have been involved with was a multi-phase, multi-year Intelligence, Surveillance and Reconnaissance (ISR) infrastructure project for the Department of Defence:

·       This was a Department of Defence (RPDE/DSTO/CIOG) ISR integration solution evaluation and validation early-lifecycle (pre-tender) project. I was involved in the contract negotiations, was the technical lead, and managed many aspects of data collection, correlation and synthesis as well as the architectural modelling, detailed evaluation, report writing and presentation.

·       The initial phases were for RPDE and involved multiple stakeholders in Defence (e.g. the ISR community, ISR infrastructure providers, etc.), vendors, and multiple sources of data, including a multi-vendor interoperability test-bed run by RPDE and the collection and de-classification of ISR scenarios and user/location volumetric data, which were correlated with other data sources.

·       I also managed the planning and setup of a testbed and detailed performance benchmarking of ESB service performance and topologies in our laboratory for a single representative ESB product. Using these data sources as inputs I modelled alternative ESB architectures (centralised, hierarchical and hybrid/peer-to-peer with different sub-nets, including security architectures such as gateways and security levels, across multiple deployment zones and at continental and larger scale, including deployed/mobile/field nodes), built detailed performance models, ran simulations and sensitivity analyses to identify the most critical workloads, services and locations, produced comprehensive capacity, response-time and scalability predictions, and analysed the results to produce reports and conclusions.


·       A subsequent phase (for CIOG) incorporated more developed, concrete constraints on service types and locations, and included an analysis of failures. Because of the need to model at a larger scale, I used an enhanced method combining generative modelling and Monte Carlo simulation techniques to speed up the exploration of the larger search space before targeting a subset of likely solutions for more detailed performance modelling, final analysis and reporting of results (a simplified sketch of this screening step follows below).
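
Here is a simplified sketch of that Monte Carlo screening step (the cost model and all numbers are illustrative only, not the project's models): sample candidate topologies at random, score them with a cheap analytic estimate, and keep a shortlist for detailed performance modelling.

```python
# Monte Carlo screening of candidate ESB topologies (illustrative only).
import random

random.seed(1)
WAN_MS = 80    # assumed one-way WAN latency to a remote hub
LOCAL_MS = 5   # assumed local broker service time

def estimate_response_ms(hub_count: int, local_broker_fraction: float) -> float:
    """Crude analytic estimate: traffic served by a local broker avoids the WAN
    hop, but each additional hub adds some federation overhead."""
    wan_traffic = 1.0 - local_broker_fraction
    federation_overhead = 2.0 * hub_count
    return LOCAL_MS + wan_traffic * WAN_MS + federation_overhead

# Randomly sample candidate topologies and score them with the cheap estimate
candidates = []
for _ in range(10_000):
    hubs = random.randint(1, 6)
    local_fraction = random.uniform(0.0, 1.0)
    candidates.append((estimate_response_ms(hubs, local_fraction), hubs, local_fraction))

# Keep only the most promising candidates for detailed simulation modelling
for est, hubs, frac in sorted(candidates)[:5]:
    print(f"~{est:5.1f} ms  hubs={hubs}  local-broker fraction={frac:.2f}")
```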
