Who am I? What did a Computer Scientist do during a typical career? Cloud and Internet-scale technologies
Cloud and Internet Scale Computing
Picture: Guess! Is it small, big, or "virtual"? (It is actually an internet visualisation.)
"Cloud" computing isn't really new; it has been around in one form or another since the internet and distributed computing (i.e. for decades).
Amazon Web Services (AWS)
I attended the AWS Summit 2017 in Sydney (sessions of particular interest were on Kinesis, microservice architectures, workflows and monitoring, security architecture, scientific workloads, data analytics and machine learning, the new AWS X-Ray monitoring service, and enterprise migration). I am currently studying the AWS Certified Solutions Architect Official Study Guide (Sybex, 2017) and hope to be certified by June 2017. I am writing a blog from the perspective of a "computer scientist" learning AWS, which I hope will include insightful architectural observations and trade-offs (https://acomputerscientistlearnsaws.blogspot.com.au). As a practical exercise I am working out how several example applications that I have experience with could be deployed to AWS, taking into account alternative workloads and cost trade-offs. I am also using resources at https://aws.amazon.com/architecture/ and https://acloud.guru/.
Cloud Benchmarking
In March this year I conducted an independent performance assessment of cloud
platforms for a private cloud company.
The plan is a phased evaluation that increases the number of platforms and what is measured as more sponsors become available. The initial round included two industry-standard benchmarks and focussed on CPU and memory speed on AWS, Azure, and private clouds. Initial results show that AWS is the slowest for "high-speed CPU instances" and scales non-linearly with increasing instance size (vCPUs/threads). Other observations (with potential impacts on architecting cloud systems) include higher variability across results (for some cloud/instance combinations, particularly smaller public cloud instances) and unpredictable bursting and throttling of resources. Future rounds will expand to benchmark storage, multiple instances, and LAN and WAN speed. Further benchmarks have been identified and trialled, including the SPECjbb®2015 benchmark and the new SPEC Cloud™ IaaS 2016 elasticity benchmark.
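To give a flavour of the kind of measurement involved, here is a minimal sketch in Java of a CPU micro-benchmark that runs a fixed unit of work across an increasing number of threads and reports the mean run time and run-to-run variability. The unit of work, repetition counts and class name are illustrative assumptions, not the industry-standard benchmarks actually used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Minimal sketch of a CPU micro-benchmark: run a fixed unit of work on
// 1..N threads and report the mean and coefficient of variation of run times.
// The unit of work and repetition counts are illustrative only.
public class CpuBenchmarkSketch {

    // A fixed, CPU-bound unit of work (illustrative; real studies use
    // standardised workloads such as the SPEC suites).
    static double unitOfWork() {
        double x = 0.0;
        for (int i = 1; i <= 5_000_000; i++) {
            x += Math.sqrt(i) * Math.sin(i);
        }
        return x;
    }

    // Time one run of 'threads' concurrent units of work, in milliseconds.
    static double timedRun(int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Callable<Double>> tasks = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            tasks.add(CpuBenchmarkSketch::unitOfWork);
        }
        long start = System.nanoTime();
        pool.invokeAll(tasks);            // wait for all units of work to finish
        long elapsed = System.nanoTime() - start;
        pool.shutdown();
        return elapsed / 1_000_000.0;
    }

    public static void main(String[] args) throws Exception {
        int repetitions = 5;
        for (int threads = 1; threads <= Runtime.getRuntime().availableProcessors(); threads *= 2) {
            double[] times = new double[repetitions];
            for (int r = 0; r < repetitions; r++) {
                times[r] = timedRun(threads);
            }
            double mean = 0, sq = 0;
            for (double t : times) mean += t;
            mean /= repetitions;
            for (double t : times) sq += (t - mean) * (t - mean);
            double cv = Math.sqrt(sq / repetitions) / mean;  // run-to-run variability
            System.out.printf("threads=%d mean=%.1f ms cv=%.2f%n", threads, mean, cv);
        }
    }
}
```

Run on different instance sizes, the coefficient of variation gives a rough first look at the run-to-run variability that showed up for some cloud/instance combinations.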
Cloud/Internet-scale performance
I was an invited participant at the Workshop on Performance and Reliability (WOPR25) in Wellington, NZ, in February 2017 (accepted abstract on performance modelling for DevOps): three days of robust discussions on internet/cloud-scale performance testing and engineering (e.g. with Dynatrace, flood.io, Xero and Facebook experts). Challenges identified included incomplete top-to-bottom application monitoring, cloud "bill-shock", unexpected throttling of cloud resources, and erroneous behaviour of 3rd-party services under load.
Cloud elasticity & cost
I have carried out R&D on the impact of cloud elasticity on real applications. Using performance data from three real client systems I conducted benchmarking on multiple cloud provider platforms and modelled the impact of different instance sizes and type mixes, instance spin-up times, and workload patterns/spikes on response-time SLAs and cost. The results were published in CMG and ICPE.
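As an illustration of the style of model (not the actual client models or data), here is a minimal Java sketch that, for an assumed workload spike, estimates the fleet size needed to keep utilisation under a target, the utilisation while new instances spin up (when the response-time SLA is exposed), and the approximate cost. All rates, times and prices are invented for the example.

```java
// Minimal sketch of an elasticity/cost model: for a workload spike, estimate
// the instances needed to stay under a utilisation target, the load carried
// while new instances spin up, and the approximate cost.
// All rates, times and prices below are illustrative assumptions only.
public class ElasticityCostSketch {

    public static void main(String[] args) {
        double perInstanceThroughput = 200.0;  // requests/sec one instance can sustain (assumed)
        double utilisationTarget = 0.7;        // keep instances below 70% busy (assumed)
        double spinUpMinutes = 4.0;            // time to boot and warm a new instance (assumed)
        double pricePerInstanceHour = 0.50;    // $/instance-hour (assumed)

        double baselineRate = 500.0;           // steady-state requests/sec (assumed)
        double spikeRate = 2200.0;             // peak requests/sec during a spike (assumed)
        double spikeHours = 2.0;               // duration of the spike (assumed)

        int baselineInstances = instancesFor(baselineRate, perInstanceThroughput, utilisationTarget);
        int spikeInstances = instancesFor(spikeRate, perInstanceThroughput, utilisationTarget);

        // While new instances spin up, the baseline fleet absorbs the spike and
        // runs above the utilisation target, so the response-time SLA is at risk.
        double utilisationDuringSpinUp =
                spikeRate / (baselineInstances * perInstanceThroughput);

        double dailyCost = 24.0 * baselineInstances * pricePerInstanceHour
                + spikeHours * (spikeInstances - baselineInstances) * pricePerInstanceHour;

        System.out.printf("baseline fleet: %d instances, spike fleet: %d instances%n",
                baselineInstances, spikeInstances);
        System.out.printf("utilisation during the %.0f-minute spin-up window: %.0f%%%n",
                spinUpMinutes, 100 * utilisationDuringSpinUp);
        System.out.printf("approximate cost for the day: $%.2f%n", dailyCost);
    }

    // Smallest fleet size that keeps per-instance utilisation under the target.
    static int instancesFor(double arrivalRate, double perInstanceThroughput, double target) {
        return (int) Math.ceil(arrivalRate / (perInstanceThroughput * target));
    }
}
```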
Cloud migration performance engineering
Many of our clients over the last 10 years have wanted to understand the impact on performance, scalability, reliability, availability and cost of migrating to and from various cloud and in-house platforms. For example, one client was conducting UAT on a hybrid public cloud/in-house platform (the performance of the user experience in this environment was poor), and I measured and modelled the expected performance and capacity on the final target 100% in-house platform. For another client I measured and modelled the performance and capacity resulting from migrating from an in-house physical platform to a private virtualised cloud platform; this required significant re-architecting and risk mitigation because the benchmarked and predicted performance was poorer.
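A minimal sketch of the style of capacity projection used in this kind of migration work (the numbers are hypothetical, not client data): measured CPU demand per transaction is scaled by an assumed relative core speed on the target platform, and utilisation plus a simple single-queue response-time approximation are projected from it.

```java
// Minimal sketch of projecting capacity from a measured source platform to a
// target platform: scale the measured CPU demand per transaction by a relative
// core-speed ratio, then use a simple open-queue approximation R = S / (1 - U)
// to project the CPU component of response time. Numbers are hypothetical.
public class MigrationCapacitySketch {

    public static void main(String[] args) {
        double measuredCpuMsPerTxn = 40.0;   // CPU demand per transaction on the source platform (assumed)
        double speedRatio = 0.8;             // target core speed relative to source; <1 means slower (assumed)
        int targetCores = 16;                // cores on the target platform (assumed)
        double peakTxnPerSec = 250.0;        // peak arrival rate to size for (assumed)

        double targetCpuMsPerTxn = measuredCpuMsPerTxn / speedRatio;      // slower cores => more CPU time
        double utilisation = peakTxnPerSec * (targetCpuMsPerTxn / 1000.0) / targetCores;

        System.out.printf("projected CPU per txn on target: %.1f ms%n", targetCpuMsPerTxn);
        System.out.printf("projected utilisation at peak: %.0f%%%n", 100 * utilisation);

        if (utilisation < 1.0) {
            // Crude approximation for the CPU component of response time under load.
            double responseMs = targetCpuMsPerTxn / (1.0 - utilisation);
            System.out.printf("approximate CPU response time at peak: %.1f ms%n", responseMs);
        } else {
            System.out.println("target platform is saturated at this workload; add cores or re-architect");
        }
    }
}
```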
Security challenges
Working with many government departments also gave me experience of data
security challenges. Typically the data was performance monitoring data
captured by APM tools such as Dynatrace running in their secured production environments.
Different approaches were needed to satisfy their security policies in order to build performance models from the performance data, including: encryption of the data before removal to our lab; transfer of the data onto a laptop on their premises before removal; de-identification of performance data before removal from their systems; secure remote login to Dynatrace on their systems (in some cases to a separate copy of Dynatrace and the production data); and complete installation of our modelling software into their environment.
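As a minimal sketch of one of these approaches, de-identification, the following Java example replaces identifying fields (user, host) in exported performance records with salted one-way hashes before the data leaves the secured environment, so records can still be correlated for modelling. The record layout and salt handling are illustrative assumptions only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of de-identifying performance monitoring records before they
// leave a secured environment: identifying fields (user, host) are replaced by
// salted one-way hashes, while the timing fields needed for modelling are kept.
// The field layout and salt handling are illustrative only.
public class DeidentifySketch {

    private static final String SALT = "per-engagement-secret-salt";  // assumed: kept inside the client site

    // Hash an identifying value so repeated occurrences still correlate,
    // but the original value cannot be read from the exported data.
    static String pseudonym(String value) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest((SALT + value).getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) sb.append(String.format("%02x", digest[i]));  // short pseudonym
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Illustrative CSV records: timestamp,user,host,transaction,responseMs
        List<String> records = Arrays.asList(
                "2017-03-01T10:00:00,alice,prod-web-01,Login,120",
                "2017-03-01T10:00:01,bob,prod-web-02,Search,340");

        for (String record : records) {
            String[] f = record.split(",");
            // Keep timestamp, transaction name and response time; pseudonymise user and host.
            String exported = String.join(",",
                    f[0], pseudonym(f[1]), pseudonym(f[2]), f[3], f[4]);
            System.out.println(exported);
        }
    }
}
```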
Application Performance Monitoring
For many past clients one of the main barriers to performance engineering distributed systems was incomplete application performance monitoring visibility/coverage (missing workloads, missing components, partial depth into 3rd-party services or components, etc.). For managed cloud services this raises the question of whether, and how, complete end-to-end/top-to-bottom application performance monitoring is possible. If comprehensive monitoring is impossible, how can adequate digital performance management be achieved in cloud environments? How can systems be architected to guarantee response-time SLAs for end users?
Internet architecting
With CSIRO I gained experience re-architecting systems for the internet using custom HTTP protocols and Grid cluster computing, managed the development of a Java-based application built on the Jabber/XMPP protocol, and managed a detailed technology and architecture test-bed evaluation of the Open Geospatial Consortium (OGC) Sensor Web standards, including the web notification and sensor alert services.
Grid Services, Cloud features
I was a visiting senior research fellow managing a distributed project for the architectural and technical evaluation of the Open Grid Services Architecture (OGSA) infrastructure. It was based at UCL, across four sites in the UK, working with Professor Wolfgang Emmerich. We established and trialled the OGSA infrastructure at four locations (London x 2, Newcastle, Edinburgh) and conducted an architectural trade-off evaluation, which was published in the Journal of Grid Computing (2006).
I developed a new service to automatically deploy, secure, discover and securely consume end-user web services and supporting code across the distributed infrastructure (i.e. services to deploy, manage, resource and execute end-user supplied services). As a side-effect of this work I identified some missing aspects that later became common public cloud computing features, including: resource measurement and billing to ensure fair use of resources and prevent excessive resource hogging (this included an idea for different, possibly dynamic/bidding-based, charging models to ensure that short-term/interactive services could obtain some resources quickly even in competition with long-running batch computations, cf. AWS spot instances); virtual machines for more scalable and agile security and isolation of resources; and automatic deployment of web services/code across resources to provide ease of use for fine-grained (non-batch) realistic distributed applications.
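A minimal sketch of the bidding idea (cf. AWS spot instances), with invented bids and pool size: short interactive requests and long-running batch jobs bid for a fixed pool of resource units, the highest bids win, and the lowest accepted bid sets the clearing price, so interactive work can still obtain capacity quickly.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of a bid-based allocation of a fixed resource pool, so that
// short interactive requests can still obtain capacity alongside long-running
// batch jobs (cf. spot-style markets). Bids and pool size are illustrative.
public class BiddingAllocationSketch {

    static class Request {
        final String name;
        final int units;       // resource units requested
        final double bid;      // price offered per unit-hour
        Request(String name, int units, double bid) {
            this.name = name; this.units = units; this.bid = bid;
        }
    }

    public static void main(String[] args) {
        int poolUnits = 10;  // total resource units available (assumed)
        List<Request> requests = Arrays.asList(
                new Request("interactive-query", 1, 0.90),
                new Request("batch-simulation", 8, 0.30),
                new Request("interactive-report", 1, 0.80),
                new Request("batch-etl", 6, 0.25));

        // Allocate to the highest bidders first; the lowest accepted bid sets the price.
        requests.sort(Comparator.comparingDouble((Request r) -> r.bid).reversed());

        int remaining = poolUnits;
        double clearingPrice = 0.0;
        for (Request r : requests) {
            if (r.units <= remaining) {
                remaining -= r.units;
                clearingPrice = r.bid;  // lowest accepted bid so far
                System.out.printf("accepted %-20s %d units at bid $%.2f%n", r.name, r.units, r.bid);
            } else {
                System.out.printf("rejected %-20s %d units (only %d units left)%n", r.name, r.units, remaining);
            }
        }
        System.out.printf("clearing price: $%.2f per unit-hour%n", clearingPrice);
    }
}
```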
Cloud technologies
The technology stack for our SaaS Performance Modelling tool (including integration with APM tools) consists of JavaScript (Dojo, D3, React, MobX, npm, gulp), Grails, Groovy, Java 8, Gradle, MySQL, Cassandra for storage of simulation metrics, REST APIs to Dynatrace XML and Compass/AppDynamics JSON, Apache Hive and Spark processing of APM data, the Apache Commons Math library for linear and non-linear regression, R, etc.
UNIX
I have 15+ years of UNIX experience, and previously worked for 5 years as a UNIX systems and kernel programmer and consultant in distributed systems; for example, I designed and wrote a TCP/IP-based application for a client for Oracle database replication across geographical regions.
I was Senior Tutor for Professor John Lions' Networked Systems Course at UNSW from 1987 to 1990 (covering networks and distributed systems).
Data Analytics
I am generally familiar with data analytics technologies because (a) our modelling tool is a predictive data analytics tool which consumes large amounts of APM data, (b) we use data analytics tools ourselves, (c) our clients use data analytics tools and some of their problems and solutions have focussed on them (e.g. the use of bin-packing algorithms combined with performance modelling to optimise load balancing of "R" models in an immigration visa-processing system; a sketch of this idea appears below), (d) I have built performance models of the difference between Hadoop and Spark performance, and (e) I have attended conferences where the performance of data analytics tools and architectures was the topic of conversation. I recently conducted an architectural risk assessment for a client in the financial regulation area who used technologies including: a data lake, Kafka, Elasticsearch, the Greenplum data warehouse, Oracle, blue/green deployment, and the Kappa architecture.
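Here is a minimal sketch of the bin-packing idea mentioned in (c): first-fit-decreasing packing of model-scoring jobs, each with a measured CPU demand, onto workers of fixed capacity. The job names, demands and capacity are hypothetical, not client data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of first-fit-decreasing bin packing: assign model-scoring jobs
// (each with a measured CPU demand) to workers of fixed capacity, using as few
// workers as possible. Job demands and worker capacity are hypothetical.
public class BinPackingSketch {

    static class Job {
        final String name;
        final double cpuSeconds;   // measured CPU demand of the job (assumed)
        Job(String name, double cpuSeconds) { this.name = name; this.cpuSeconds = cpuSeconds; }
    }

    static class Worker {
        final List<Job> jobs = new ArrayList<>();
        double load = 0.0;
    }

    public static void main(String[] args) {
        double workerCapacity = 10.0;  // CPU-seconds of budget per worker per interval (assumed)
        List<Job> jobs = new ArrayList<>(Arrays.asList(
                new Job("risk-model-A", 7.0), new Job("risk-model-B", 5.5),
                new Job("risk-model-C", 4.0), new Job("risk-model-D", 3.0),
                new Job("risk-model-E", 2.5), new Job("risk-model-F", 2.0)));

        // First-fit-decreasing: sort jobs by demand, place each on the first
        // worker that still has room, opening a new worker only when needed.
        jobs.sort(Comparator.comparingDouble((Job j) -> j.cpuSeconds).reversed());
        List<Worker> workers = new ArrayList<>();
        for (Job job : jobs) {
            Worker target = null;
            for (Worker w : workers) {
                if (w.load + job.cpuSeconds <= workerCapacity) { target = w; break; }
            }
            if (target == null) { target = new Worker(); workers.add(target); }
            target.jobs.add(job);
            target.load += job.cpuSeconds;
        }

        for (int i = 0; i < workers.size(); i++) {
            Worker w = workers.get(i);
            System.out.printf("worker %d (load %.1f/%.1f): ", i + 1, w.load, workerCapacity);
            w.jobs.forEach(j -> System.out.print(j.name + " "));
            System.out.println();
        }
    }
}
```

In practice the packing would be driven by the performance model's predictions, and a worst-fit variant (placing each job on the least-loaded worker) can be used when spreading load evenly matters more than minimising the number of workers.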
Example large-scale Software Architecture Project
An example of a reasonably complex architectural project that I have been involved with was a multi-phase, multi-year Intelligence, Surveillance and Reconnaissance (ISR) infrastructure project for the Department of Defence:
·
This was a Department of Defence (RPDE/DSTO/CIOG) ISR Integration solution evaluation and validation early-lifecycle (pre-tender) project. I was involved in the contract negotiations, was the technical lead, and managed many aspects of data collection, correlation and synthesis, as well as the architectural modelling, detailed evaluation, report writing and presentation.
·
The initial phases were for RPDE and formed a complex project involving multiple stakeholders in Defence (e.g. the ISR community, ISR infrastructure providers, etc.) and vendors, and multiple sources of data, including a multi-vendor interoperability test-bed run by RPDE and the collection and de-classification of ISR scenarios and user/location volumetric data and their correlation with other data sources.
·
I also managed the planning and setup of a testbed and detailed benchmarking of ESB service performance and topologies in our laboratory for a single representative ESB product. Using these data sources as inputs I modelled alternative ESB architectures (centralised, hierarchical and hybrid/peer-to-peer with different sub-nets, including security architectures such as gateways and security levels, across multiple deployment zones and at continental and larger scale, for deployed/mobile/field nodes), built detailed performance models, ran simulations and sensitivity analysis to identify the most critical workloads, services and locations, produced comprehensive capacity, response time and scalability predictions, and analysed the results and produced reports and conclusions.
·
A subsequent phase (for CIOG) incorporated more developed, concrete constraints on service types and locations, and included an analysis of failures. Because of the need to model at a larger scale, I used an enhanced method including generative modelling and Monte Carlo simulation techniques to speed up the exploration of the larger search space before targeting a subset of likely solutions for more detailed performance modelling, final analysis and reporting of results (a minimal sketch of this exploration idea follows below).
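A minimal sketch of the Monte Carlo exploration idea, with a deliberately toy scoring model and invented parameters: sample random candidate configurations from a large design space, score each with a cheap analytic estimate, and keep only the best few for detailed performance modelling.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Minimal sketch of Monte Carlo exploration of a large architecture search
// space: sample random candidate configurations, score each with a cheap
// analytic model, and keep the best few for detailed performance modelling.
// The configuration parameters and scoring model are illustrative only.
public class MonteCarloExplorationSketch {

    static class Candidate {
        final int hubs;             // number of regional hub nodes (assumed parameter)
        final int servicesPerHub;   // services deployed per hub (assumed parameter)
        final double linkLatencyMs; // average inter-node link latency (assumed parameter)
        final double score;         // cheap estimate of mean response time (lower is better)
        Candidate(int hubs, int servicesPerHub, double linkLatencyMs, double score) {
            this.hubs = hubs; this.servicesPerHub = servicesPerHub;
            this.linkLatencyMs = linkLatencyMs; this.score = score;
        }
    }

    // Cheap illustrative scoring model: more hubs reduce per-hub load but add
    // cross-hub hops; a real study would use a calibrated performance model.
    static double score(int hubs, int servicesPerHub, double linkLatencyMs, double workload) {
        double perHubLoad = workload / (hubs * servicesPerHub);
        double processingMs = 20.0 + perHubLoad;                // assumed base service time plus load effect
        double networkMs = linkLatencyMs * Math.log(hubs + 1);  // assumed hop cost
        return processingMs + networkMs;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double workload = 5000.0;  // assumed aggregate request rate
        List<Candidate> samples = new ArrayList<>();

        // Sample 10,000 random configurations from the design space.
        for (int i = 0; i < 10_000; i++) {
            int hubs = 1 + rng.nextInt(20);
            int servicesPerHub = 1 + rng.nextInt(16);
            double linkLatencyMs = 5.0 + rng.nextDouble() * 95.0;
            samples.add(new Candidate(hubs, servicesPerHub, linkLatencyMs,
                    score(hubs, servicesPerHub, linkLatencyMs, workload)));
        }

        // Keep the best handful for detailed modelling and sensitivity analysis.
        samples.sort(Comparator.comparingDouble((Candidate c) -> c.score));
        System.out.println("top 5 candidate configurations:");
        for (Candidate c : samples.subList(0, 5)) {
            System.out.printf("hubs=%2d servicesPerHub=%2d linkLatency=%.0f ms -> est. %.1f ms%n",
                    c.hubs, c.servicesPerHub, c.linkLatencyMs, c.score);
        }
    }
}
```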