Complexity and Chaos (Engineering)

I recently came across this post on Chaos Engineering, and this article, which got me thinking again about cloud complexity. The idea behind both is that current software systems are so complex you just can't test them before production; you have to test them during production (and try not to annoy too many paying or advert-clicking customers, or just use Kiwis as Canaries). These were the closest canary-related (black polish) and kiwi-related (neutral polish, should be black!) photos I could find.





I also came across the DTA (Aussie government) initiative, cloud.gov.au, which provides an entry-level cloud experience so government software development teams can dip their toes in the cloud waters.
As the previous DTO head pointed out, the (good) idea was to find projects that were self-contained and wanted to work with the DTO (i.e. not too complex, and with pull already in place).

I think the other key thing they did was to provide a simple-to-use technology. My experience with AWS so far is that its origins as an IaaS platform provider are still too prominent in the certification. AWS now provides many managed services (SaaS) which are a better starting point, but I think their PaaS offerings are hidden (e.g. AWS Lambda). Cloud.gov.au uses a PaaS, Cloud Foundry, running on AWS (and maybe a private cloud).

Using a PaaS approach is a great idea as it reduces cloud complexity by abstraction. Sure, you probably can't do everything you can on an IaaS, but software developers would rather not worry about lower-level system, network and security admin concerns (thank you very much).
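To illustrate what that abstraction buys you, here's a minimal sketch of an AWS Lambda handler in Python (my own toy example, not from the DTA work; the "name" event field is an illustrative assumption, not a Lambda convention). The developer writes only the function; AWS looks after the servers, scaling and patching.

    import json

    def handler(event, context):
        # AWS invokes this function on demand; there is no server, OS patching
        # or load balancer for the developer to manage.
        name = event.get("name", "world")  # illustrative input field
        return {
            "statusCode": 200,
            "body": json.dumps({"message": f"hello {name}"}),
        }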

My experiences in 2003/2004 with Grid computing and early Grid middleware suggest that even back then a focus on web services and PaaS was critical. We experimented with getting complex scientific services and workflows running, initially on a Grid cluster (very time consuming: all the infrastructure had to be installed and debugged, and then the application software, data, etc. run on top of it), and then on a real distributed Grid (four locations in the UK). We invented a PaaS for deploying and running end-user code (securely and load-balanced, consumed by workflows) as web services on top of OGSA (a resource-focussed infrastructure). Our "Two Ways to Grid" paper in 2005/2006 suggested (to use modern terminology) that rather than focusing purely on IaaS you need PaaS. We "invented" many of the features of current public clouds to do this, but I still think PaaS is key.

The other observation, from a project I was with in CSIRO (1996-1999, a cross-divisional software process/engineering improvement project), was that (as Paul observed about the DTO's selection of projects) it's critical to select projects that want to work with you (at least). However, this doesn't mean they are necessarily simple or self-contained. Some of the projects we selected were not simple; we selected them precisely because they were large and complex, spanned multiple divisions, and were therefore critical to fix and get right as they were "platforms" for other projects. We therefore had to do something non-trivial to assist them, even putting on extra resources to work directly with some of them, and other resources to work across them (at the level of solving complexity, architecture, protocols, cross-project management, integration, testing, etc.).

So I wonder how cloud.gov.au will scale as projects start getting more complex and need to tackle integration issues. Will Cloud Foundry work adequately for more complex integration projects? Possibly. Can you just use AWS services directly in conjunction with Cloud Foundry? Maybe. Some hackathons and pilots focussed on more complex integration problems may be worth trying (and maybe they have been).

And going back to chaos engineering approaches in production: there are risks and limitations in doing this on your production systems. Our performance prediction approach, based on automatic modelling from APM data, is directly relevant here. We can (and have) built realistic, complete performance models of production systems from production APM data (e.g. Dynatrace). The models encompass the complete production system (end-to-end, top-to-bottom, including user experience). Even though they are complete they are abstractions, so they are still understandable, usable and fit for purpose to evaluate most QoS/non-functional "ilities" (performance, scalability, capacity, availability, reliability, recovery time, security). This can be done quickly (think hours not days), and once you have a model you can do "what if" experiments at the level of workloads, software changes or infrastructure changes (or combinations).
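To give a flavour of a workload "what if" (the real models are built automatically from APM data and simulate the whole system; this is just a hand-rolled, single-tier analytic M/M/c sketch with made-up numbers), take an arrival rate and service time of the kind an APM tool reports and ask what happens as the load grows:

    import math

    def mmc_response_time(arrival_rate, service_time, servers):
        """Mean response time (seconds) for an M/M/c queue, or None if it saturates."""
        mu = 1.0 / service_time              # service rate per server (req/s)
        offered = arrival_rate / mu          # offered load in Erlangs
        rho = offered / servers              # per-server utilisation
        if rho >= 1.0:
            return None                      # this tier cannot keep up
        # Erlang C: probability an arriving request has to queue
        summed = sum(offered**k / math.factorial(k) for k in range(servers))
        tail = (offered**servers / math.factorial(servers)) / (1.0 - rho)
        p_wait = tail / (summed + tail)
        wait = p_wait / (servers * mu - arrival_rate)   # mean time spent queueing
        return wait + service_time                      # queueing + service

    # Inputs of the kind an APM tool reports (numbers invented for illustration):
    baseline_tps, service_time_s, instances = 120.0, 0.050, 10

    for factor in (1.0, 1.5, 2.0):           # "what if" the workload grows?
        rt = mmc_response_time(baseline_tps * factor, service_time_s, instances)
        label = "saturated" if rt is None else f"{rt * 1000:.1f} ms mean response time"
        print(f"{factor:.1f}x load -> {label}")

Even this toy version shows the shape of the answer: response time creeps up under 1.5x load and the tier falls over at 2x.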

We propose that this can be done continuously and automatically during DevOps and production in the future. For example, in DevOps, just before CD, you automatically produce a performance model from the current production system combined with the changes about to take place from the latest commit. You then run a simulation using the highest peak workload from the previous week to determine if the production system resulting from the latest commit will survive that peak load next week. So you don't break anything in production, and you get warning before something breaks so you have time to go back and take remedial action. I've also done some prototyping and experimentation with load forecasting (e.g. based on the previous few weeks' observed workload, what range of loads can be expected with 90% confidence extrapolated into the next few weeks), and with what the worst-case peak load could be in terms of TPS and transaction mix, based on observation and inference from the previous few weeks' data.
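The forecasting side can be sketched very simply (this is a toy version with invented numbers, using a linear trend and a rough 90% band; the actual prototype may use a different technique): fit a trend to recent weekly peaks, extrapolate a few weeks ahead, and compare the upper bound to the capacity limit the performance model predicts.

    import numpy as np

    # Weekly peak workloads (TPS) over the last few weeks -- invented numbers
    # standing in for what an APM tool would report.
    weeks = np.arange(8)
    peak_tps = np.array([210, 225, 218, 240, 252, 249, 265, 278], dtype=float)

    # Fit a simple linear trend and extrapolate a few weeks ahead.
    slope, intercept = np.polyfit(weeks, peak_tps, 1)
    resid = peak_tps - (slope * weeks + intercept)
    sigma = resid.std(ddof=2)                # residual spread around the trend

    horizon = np.arange(8, 12)               # the next four weeks
    forecast = slope * horizon + intercept
    z90 = 1.64                               # normal quantile for a ~90% band
    upper = forecast + z90 * sigma           # rough upper bound on expected peaks

    capacity_tps = 320                       # invented capacity limit from the performance model
    for w, f, u in zip(horizon, forecast, upper):
        flag = "OK" if u < capacity_tps else "RISK: review before deploying"
        print(f"week {w}: forecast peak {f:.0f} TPS, upper bound {u:.0f} TPS -> {flag}")

The point of the CD gate is the last line: if the upper bound of next week's forecast peak exceeds what the model says the system (plus the latest commit) can handle, you get the warning before production does.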

This is one of my DevOps-related performance modelling talks from ICPE 2016. I'm still working on a version from WOPR 2016 this year.

Watch out for Melbourne trams ("Right turn from left lane only", i.e. DevOps: shift left to shift right).



PS
In terms of simplification for clouds, there are also alternatives to AWS such as Google Cloud, which according to this article has some advantages, including simpler cost structures (e.g. pay per minute, and no spot instances, but discounts the longer resources are used).

PPS
Has anyone thought of, or tried, doing chaos engineering for performance engineering? I.e. introduce performance problems on purpose in production systems and see how quickly they are detected, and whether the system is self-healing enough to prevent any SLAs being violated. What would you call this? Chaos Sloths?! (A toy sketch of the idea follows below.)
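A toy sketch of the idea (invented names, thresholds and numbers; nothing production-ready): wrap a handler so a small fraction of requests get an artificial delay, then count how many breach the SLA and ask whether monitoring or self-healing would have caught them in time.

    import random
    import time
    from functools import wraps

    def inject_latency(probability=0.05, delay_s=0.4):
        """Wrap a handler so a small fraction of calls get an artificial delay."""
        def decorator(fn):
            @wraps(fn)
            def wrapper(*args, **kwargs):
                if random.random() < probability:
                    time.sleep(delay_s)      # the deliberately injected slow-down
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    SLA_SECONDS = 0.3                        # invented SLA threshold

    @inject_latency(probability=0.1, delay_s=0.4)
    def handle_request():
        time.sleep(0.01)                     # stand-in for the real work
        return "ok"

    violations = 0
    for _ in range(100):                     # stand-in for production traffic
        start = time.perf_counter()
        handle_request()
        if time.perf_counter() - start > SLA_SECONDS:
            violations += 1                  # did detection/self-healing react in time?

    print(f"{violations} of 100 requests breached the {SLA_SECONDS * 1000:.0f} ms SLA")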



