Chapter 5: Auto Scaling



Auto scaling has a few (probably odd) interpretations...

Auto (small)



Scaling (i.e. the world's biggest car)


Auto(matic) fish scaling (sure beats doing it by hand but probably very messy):



And then there's scaling in the scientific sense:





Also related to Zipf's Law, which was originally formulated for the relative frequency of words in a language (e.g. http://io9.gizmodo.com/the-mysterious-law-that-governs-the-size-of-your-city-1479244159?IR=T), but which also turns out to hold for the relative sizes of cities, web site traffic, some networks, and the average weights of animal species. It's useful in performance engineering too (e.g. the service demand across sub-systems is typically a Zipf distribution). A trick application of Zipf's law is to work out the total weight of all the animals on Noah's Ark (the big boat from the Bible), given that you only know the average weights of the few largest species (and take two of each, etc.). It actually doesn't matter much whether you assume that dinosaurs made it on board or not. Just kidding, of course they did, otherwise where did Tuataras come from :-) One of the coolest beer bottles is from New Zealand and made in the "shape" of a Tuatara, but you really need to hold one to understand (see the spikes?).





From memory (I did the calculation a few years ago), it tends to a few hundred tonnes at most, no matter how many species you assume (fleas don't weigh much). Coincidentally, I was going down the Merwede River in the Netherlands after ICPE 2016 and came across a very odd-looking "boat" which turned out to be a replica ark. It was enormous:
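For what it's worth, here's a back-of-the-envelope sketch of that calculation (all the numbers are assumptions of mine, purely for illustration): if the k-th heaviest species weighs roughly 1/k of the heaviest, the total is dominated by the first few terms and grows only logarithmically with the number of species.

```python
# Back-of-the-envelope Zipf estimate of the total animal weight on the Ark.
# Assumptions (mine, purely illustrative): heaviest species pair ~12 tonnes
# (two elephants), the k-th heaviest species weighs ~1/k of that, two of each.
heaviest_pair_tonnes = 12.0

for n_species in (1_000, 10_000, 100_000):
    total = sum(heaviest_pair_tonnes / k for k in range(1, n_species + 1))
    print(f"{n_species:>7} species -> ~{total:.0f} tonnes")

# The harmonic sum grows like log(n), so assuming ten times more (ever smaller)
# species barely changes the answer -- it stays in the low hundreds of tonnes.
```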




By analogy, if you know the service demand of the biggest few sub-systems in your system, you can easily compute the total required capacity. It's also useful for synthetic generation of load data (a sketch follows below). NB: a few of us discussed Zipf's law and service demand during ICPE 2016 and concluded that one reason it may hold particularly for SOAs is that some service compositions heavily reuse other services, resulting in a few services receiving most of the calls (Daniel's talk made this observation). What are the implications of this? 1) You need lots of resources for only some of your services; 2) the few services with the highest service demand are likely to be the most sensitive to changes in workload and require the ability to scale up quickly (auto scale); and 3) sometimes the remaining lower-demand services can even be hosted on shared infrastructure (i.e. multi-tenancy), as their combined resource demands are much smaller than those of the few higher-demand services.
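Here's the synthetic load generation idea as a small sketch (the service count and Zipf exponent are illustrative assumptions of mine): drawing service indices from a Zipf distribution means a few services receive most of the calls.

```python
import numpy as np

# Sketch: generate synthetic per-service call counts that follow a Zipf-like
# distribution (exponent and service count are illustrative assumptions).
rng = np.random.default_rng(42)
n_services, n_calls = 20, 100_000

samples = rng.zipf(a=1.5, size=n_calls)           # Zipf-distributed ranks (1, 2, 3, ...)
samples = samples[samples <= n_services]          # keep ranks that map to a real service
calls_per_service = np.bincount(samples, minlength=n_services + 1)[1:]

for rank, calls in enumerate(calls_per_service[:5], start=1):
    print(f"service {rank}: {calls} calls")       # the top few dominate the total
```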

So back to AWS Auto Scaling (which is really none of the above)... It's a mechanism that allows the number of EC2 instances either to be maintained (replacing failed instances under a constant workload) or to increase and decrease automagically to cope with changes in load. It really is the key "magic" ingredient for elastic cloud computing, and it puts the large public cloud providers like Amazon in a different category from the smaller private cloud providers, due to the sheer scale and number of instances they can make available on demand. In theory this allows you to scale up quickly for load spikes, and to get lots of instances for a short period of time for a big job (e.g. data analytics).

Auto Scaling works in conjunction with EC2 instances, Elastic Load Balancing and CloudWatch (which gathers the EC2 metrics used to trigger rules). I'm still getting my head around some of the details (best check the AWS docs), but you can have multiple scaling groups and four different types of scaling (constant, manual, scheduled, dynamic). Each group has a name, min/max and an optional desired number of instances (the default is the min), either on-demand or spot instances (but not both, and obviously not reserved), a launch configuration (a template for new instances including name, AMI, type, security group, etc.), and policies (thresholds, adjusting by number or percentage).
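As a rough sketch of how the launch configuration and group fit together with the boto3 Python client (the names, AMI ID and sizes below are placeholders I've made up; check the AWS docs for the authoritative parameters):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Launch configuration: the template used for every new instance in the group.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-lc",          # placeholder name
    ImageId="ami-0123456789abcdef0",           # placeholder AMI
    InstanceType="t2.micro",
    SecurityGroups=["my-web-sg"],              # placeholder security group
)

# Auto Scaling group: name, min/max, optional desired capacity (defaults to min).
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-lc",
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
```

Scaling policies are attached to the group separately (a sketch appears further down).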

What happens if you have multiple policies that contradict each other or interact in unpredictable ways? Is there any way of testing out policies before use? (Performance modelling comes to mind.) Here's the AWS Dynamic Scaling documentation, but I'm not sure it answers these questions.

You also need to keep in mind the service limits (e.g. the maximum number of EC2 instances per region).
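Some of the Auto Scaling limits can be queried programmatically; a small boto3 sketch:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Auto Scaling's own per-region account limits (groups and launch configurations).
limits = autoscaling.describe_account_limits()
print("Max Auto Scaling groups:      ", limits["MaxNumberOfAutoScalingGroups"])
print("Max launch configurations:    ", limits["MaxNumberOfLaunchConfigurations"])
```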

Billing is in one-hour increments with a minimum of one hour, so be careful. The AWS advice is to scale up fast and scale down slowly, and there is a cool-down period option (which suspends further changes). Also remember that launching EC2 instances takes time (depending on what needs to be loaded), so it makes sense to use an image created from an existing, fully configured EC2 instance to reduce start-up time.
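One way to encode "scale up fast, scale down slowly" is an aggressive scale-out policy paired with a conservative scale-in policy and a longer cool-down; a sketch (the adjustments and cool-down values are illustrative assumptions, not recommendations):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale out aggressively: add 2 instances, short cool-down before the next change.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="scale-out-fast",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=120,
)

# Scale in cautiously: remove 1 instance at a time, long cool-down.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="scale-in-slow",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,
    Cooldown=600,
)
```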

There are two potential problems with scaling up fast. (1) The scale-up mechanisms are reactive, i.e. they only respond to metrics collected in the past. This means that the load on the system may already have exceeded the available capacity (or be increasing even faster), and it still takes time to add new instances. The only way around this is to be more aggressive in adding extra instances and hope you catch up in time (i.e. either by lowering the threshold or by adding more instances each time the threshold is triggered). However, the downside of this is (2) cost. Scaling up fast means you may overshoot (by a lot) the actual resources required, and given the minimum charge of an hour per instance this may be expensive.
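A back-of-the-envelope sketch of the reactive-lag problem (all numbers are made up for illustration):

```python
# How much headroom does reactive scaling need?  Illustrative numbers only.
load_growth_tps_per_min = 10      # how fast the load is rising
detection_delay_min = 5           # metric collection period + alarm evaluation
instance_launch_min = 4           # time to boot and warm up a new instance
capacity_per_instance_tps = 50    # what one instance can serve

lag_min = detection_delay_min + instance_launch_min
extra_load_tps = load_growth_tps_per_min * lag_min
instances_needed = -(-extra_load_tps // capacity_per_instance_tps)  # ceiling division

print(f"Load grows by ~{extra_load_tps} TPS before new capacity arrives;")
print(f"you need roughly {instances_needed} spare instance(s) of headroom.")
```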

The other approach is predictive scaling. Some work has been done in the academic community over the years (not just for cloud) on predictive dynamic capacity planning, but it's tricky to get right. One approach may be to combine performance modelling with auto scaling. I did some work recently for a customer which used both load forecasting (from APM data) and performance modelling of the required capacity for the forecast load (with the model built automatically from APM data). This took into account error margins (upper and lower confidence intervals) for both the load forecast and the capacity prediction, and allowed decisions to be made about the risk of having insufficient capacity vs. the cost of having over-capacity. A similar approach could work to optimise cost and performance for a specified risk level (e.g. <= 20% chance of overload) for AWS auto scaling. Here's an example graph from this customer showing six permutations of forecast vs. capacity (both in terms of TPS, but TPS can be converted to a number of instances in the model):


For example, assume the model predicts that in an hour's time there is a 90% chance of the load increasing to between 100 and 400 TPS (from whatever the current load is, assumed to be less); see permutation "2" in the above graph. The model also predicts that the capacity of the system (say 10 instances for this example) will be between 300 and 500 TPS (with 90% confidence). There will therefore be approximately a 20% chance that the system will be overloaded. If this is an acceptable risk, then extra instances can be spun up gradually over the next hour in expectation of the increased load. Alternative numbers of instances can be evaluated trivially, and/or the number of instances required to give a specified probability of the system overloading (or not) can be predicted, with cost taken into account to trade off cost vs. risk. The above graph was designed to illustrate a fixed capacity prediction range with variable load forecast ranges relative to the capacity range (rather than the other way around, which would be more useful for this auto scaling example).
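To show how that kind of overload probability could be estimated, here's a hedged Monte Carlo sketch. I assume the 90% intervals are symmetric intervals of normal distributions, which is my own simplification; the customer model used its own distributions, so the number this produces won't necessarily match the ~20% quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Treat each 90% interval as mean +/- 1.645 standard deviations of a normal.
# This is an assumption for illustration, not how the customer model worked.
def normal_from_interval(lo, hi):
    mean = (lo + hi) / 2
    sd = (hi - lo) / (2 * 1.645)
    return rng.normal(mean, sd, n)

load = normal_from_interval(100, 400)      # forecast load in TPS (permutation "2")
capacity = normal_from_interval(300, 500)  # predicted capacity of 10 instances, TPS

p_overload = np.mean(load > capacity)
print(f"Estimated probability of overload: {p_overload:.0%}")
# The exact value depends heavily on the assumed distributions; the point is the
# method: sample both ranges and count how often load exceeds capacity.
```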

Another option is serverless computing, such as AWS Lambda functions, but I'd need to look at this in more detail to see how it auto-scales. Another idea is to use Lambda in conjunction with Auto Scaling.

I also wondered why only on-demand or spot instances are allowed and not both, and how AWS decides which instances to terminate first when scaling in. Here are the comprehensive AWS docs with a useful diagram: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-termination.html



Can you have "hybrid auto scaling"? Is that useful? Maybe. E.g. try to get spot instances first, and if none are available for the price you are willing to bid, then use on-demand instances. Or if spot instances of a particular type aren't available in a region, then try other types/regions? Is this getting too complex? A few blogs have discussed this with potential solutions (e.g. multiple Auto Scaling groups behind the same ELB; does this actually work?).
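A very rough sketch of the "spot first, fall back to on-demand" idea outside of an Auto Scaling group (the bid price, AMI and crude polling are placeholders and assumptions of mine; the blogs mentioned above describe more robust approaches):

```python
import time
import boto3

ec2 = boto3.client("ec2")
launch_spec = {"ImageId": "ami-0123456789abcdef0", "InstanceType": "m4.large"}  # placeholders

# Try a spot request first at the price we're willing to bid.
request = ec2.request_spot_instances(
    SpotPrice="0.05", InstanceCount=1, LaunchSpecification=launch_spec
)
request_id = request["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

time.sleep(120)  # crude: give the request a couple of minutes to be fulfilled
state = ec2.describe_spot_instance_requests(SpotInstanceRequestIds=[request_id])
status = state["SpotInstanceRequests"][0]["State"]

if status != "active":
    # No spot capacity at our price: cancel the request and fall back to on-demand.
    ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=[request_id])
    ec2.run_instances(MinCount=1, MaxCount=1, **launch_spec)
```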

Another question I wondered about is whether it's possible to pay NOTHING for spot instances. If AWS terminates spot instances then they don't charge for the last partial hour. But do you always get spot instances for >= 1 hour? If not, it may be possible to get them for nothing. Someone has suggested "gaming" the system this way: Play "Chicken" with Spot Instances (and taken it a bit further).

This article is also interesting: http://santtu.iki.fi/2014/03/19/ec2-spot-usage 

And "On Why I Don't Like Auto-Scaling in the Cloud" (E.g. bill shock attack)

AWS now also has Application Auto Scaling, see https://aws.amazon.com/autoscaling/faqs/ 

Finally, a couple of references to my older work on performance modelling of elasticity on AWS (using real applications and APM data from in-house infrastructure, benchmarking on AWS EC2, and modelling that takes into account workload patterns, spin-up time, and the impact on response times).

https://www.researchgate.net/publication/254008779_Is_your_cloud_elastic_enough_performance_modelling_the_elasticity_of_infrastructure_as_a_service_IaaS_cloud_applications

A slightly longer version was originally published in the CMG magazine (part 1, part 2).

SPEC has a new cloud elasticity benchmark (although there aren't many results submitted yet), the SPEC Cloud IaaS 2016 benchmark: https://www.spec.org/cloud_iaas2016/

Postscript

A few questions I've since thought of...

Does AWS Auto Scaling support changing the instance size rather than the number of instances?

In policies, can you trigger a change with, say, I/O rather than CPU utilization? Or response time percentiles (e.g. 95th percentile > 5s, or some dynamically computed time value, perhaps from an APM)? Using CPU alone may not be an accurate metric of how loaded an instance is if I/O is the bottleneck, and using average response times isn't a great way of detecting problems (the average is not a statistically robust metric; medians or percentiles are far more meaningful). I see that you can define custom metrics for CloudWatch, which may be a solution: https://aws.amazon.com/cloudwatch/faqs/
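Custom metrics do look like a plausible route; here's a sketch of publishing a 95th-percentile response time (computed elsewhere, e.g. by an APM) and alarming on it. The namespace, metric name, threshold and ARN below are made-up placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a custom metric, e.g. a 95th-percentile response time computed by an APM.
cloudwatch.put_metric_data(
    Namespace="MyApp",                              # made-up namespace
    MetricData=[{"MetricName": "ResponseTimeP95", "Value": 6.2, "Unit": "Seconds"}],
)

# Alarm when the published 95th percentile exceeds 5 seconds; AlarmActions would
# point at the ARN of an Auto Scaling policy (placeholder below).
cloudwatch.put_metric_alarm(
    AlarmName="p95-response-time-high",
    Namespace="MyApp",
    MetricName="ResponseTimeP95",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=5.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:autoscaling:..."],       # placeholder policy ARN
)
```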

I also realised that I had planned on checking how complex each AWS service API is. For Auto Scaling, here's the API documentation: http://docs.aws.amazon.com/AutoScaling/latest/APIReference/Welcome.html
There are 52+ actions and 20+ data types.
The AWS Auto Scaling Java client documentation is 58 pages long.
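As a rough way to gauge that complexity from code, a small sketch using boto3's service model (which reflects whatever API version the installed SDK ships with):

```python
import boto3

# Count the operations the Auto Scaling API exposes, as boto3 sees it.
model = boto3.client("autoscaling").meta.service_model
print(len(model.operation_names), "actions")
```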


