My first doubt: is the Health Check Grace Period the same as Instance Warmup? The explanations of these two concepts read the same to me: a time period set so that the Auto Scaling group can check whether the instance is already healthy or not.
My second doubt is about the cooldown time. I understand the concept of cooldown perfectly. What I do not understand is which feature of the target tracking and step scaling policies ensures that one more instance than needed will not be added to the Auto Scaling group. The AWS documentation says about warmup: "While scaling out, we do not consider instances that are warming up as part of the current capacity of the group. This ensures that we don’t add more instances than you need."
I do not agree with this, because if the alarm triggers during this interval (the warmup interval), one more instance will be added to the group, even though the new instance's data is not part of the Auto Scaling group yet.
Warmup period – applies to both initiating health checks and then adding to the auto scaling group. So the capacity of the group changes only after the warmup period and successful health check.
Cooldown period on the other hand is: to ensure that your Auto Scaling group doesn’t launch or terminate additional instances before the previous scaling activity takes effect. So if a scaling activity has been initiated – no new scaling activity will be initiated till the Cooldown period is over.
Let's say the cooldown period is set to 1 minute, so a new scaling activity can be started only after 1 minute. If the warmup period is longer than 1 minute (the cooldown period), then the effect of the previous scaling activity will not yet have been realized, but a new scaling activity could already be initiated. So the cooldown period and the warmup period should be in sync, and the cooldown period should be longer than the warmup period.
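To make the timing argument concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the cooldown and warmup timers both start when the scaling activity begins; `unprotected_window` is a made-up helper name, not an AWS API.

```python
def unprotected_window(cooldown_s: int, warmup_s: int) -> int:
    """Seconds during which a new scaling activity may start while the
    previously launched instance is still warming up.

    Assumption (illustration only): both timers start at the same moment.
    """
    return max(0, warmup_s - cooldown_s)


# Cooldown of 1 minute, warmup of 3 minutes: 2 minutes are left uncovered.
print(unprotected_window(cooldown_s=60, warmup_s=180))   # 120
# Cooldown longer than warmup: no uncovered window.
print(unprotected_window(cooldown_s=300, warmup_s=180))  # 0
```

A result of 0 corresponds to the recommendation above: the cooldown period is at least as long as the warmup period.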
The above explanation is not 100% accurate in my view. At the time of writing (Dec 2020), the warm-up period has nothing to do with health checks.
"Warmup period – applies to both initiating health checks and then adding to the auto scaling group." << is incorrect (link)
The documentation says the warm-up time is there to give an instance sufficient time to fully boot and settle before it is included in the ASG metrics, so it serves a very similar purpose to the cooldown period.
Now on to the health-check grace period. This is how it’s described in AWS Console: "The amount of time until EC2 Auto Scaling performs the first health check on new instances after they are put into service."
This is tricky. First, note that it says "after they are put into service". Second, recall that an ASG continuously monitors the status of every EC2 instance in the group – it’s why we love ASGs. The ASG is going to know the status of every EC2 instance it launched, right? So why the heck does this value default to 5 minutes, especially when the following explanation is also displayed on the same page: "EC2 Auto Scaling automatically replaces instances that fail health checks. If you enabled load balancing, you can enable ELB health checks in addition to the EC2 health checks that are always enabled."
It took me two coffees and more than 5 biscuits to think through this. This link sheds light on the Health-Check Grace Period (HCGP) – think of it as an ELB thing: it is what links an ELB with an ASG and enables the ASG to act upon a failed ELB health check.
To ensure that your Auto Scaling group can determine an instance’s health based on additional tests provided by the load balancer, you can configure the Auto Scaling group to use Elastic Load Balancing (ELB) health checks. The load balancer periodically sends pings, attempts connections, or sends requests to test the EC2 instances and determines if an instance is unhealthy. If you configure the Auto Scaling group to use ELB health checks, it considers the instance unhealthy if it fails either the EC2 status checks or the ELB health checks. If you attach multiple load balancer target groups or Classic Load Balancers to the group, all of them must report that the instance is healthy in order for it to consider the instance healthy. If any one of them reports an instance as unhealthy, the Auto Scaling group replaces the instance, even if other ones report it as healthy. link
Warmup period – applies to the scale-out scenario; it keeps an instance that has not fully started from being included in the metrics of the ASG.
Cooldown period – "conceals" scale-in/scale-out events to prevent a storm of scaling activities (in other words, it ensures that your Auto Scaling group doesn’t launch or terminate additional instances before the previous scaling activity takes effect). So if a scaling activity has been initiated, no new scaling activity will be initiated until the cooldown period is over.
So how do all of these stack together?
– Cooldown period applies to simple-scaling ONLY.
– Warm-up period applies to target tracking and step scaling policies, and it does not prevent an ASG from firing off additional scale-out/in activities.
The main issue with simple scaling is that after a scaling activity is started, the policy must wait for the scaling activity or health check replacement to complete and the cooldown period to expire before responding to additional alarms. Cooldown periods help to prevent the initiation of additional scaling activities before the effects of previous activities are visible. In contrast, with step scaling the policy can continue to respond to additional alarms, even while a scaling activity or health check replacement is in progress. Therefore, all alarms that are breached are evaluated by Amazon EC2 Auto Scaling as it receives the alarm messages. link
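How step scaling can keep responding to alarms without over-provisioning can be sketched roughly as follows. This is a simplified model based on the AWS statement that warming-up instances are not counted as current capacity; the function and parameter names are hypothetical, the percentage rounding is reduced to floor division, and the real service logic is not public.

```python
def instances_to_launch(in_service: int, warming_up: int, percent_change: int) -> int:
    """Simplified model of a step scaling scale-out decision.

    Assumptions (illustration only):
    - the step adjustment is a percent change in capacity,
    - it is computed from in-service capacity, excluding warming-up instances,
    - instances already warming up count toward the requested increase.
    """
    wanted = in_service * percent_change // 100
    return max(0, wanted - warming_up)


# Group of 10; the first alarm falls into a "+10%" step: launch 1 instance.
print(instances_to_launch(in_service=10, warming_up=0, percent_change=10))  # 1
# The same step is breached again while that instance warms up: nothing more.
print(instances_to_launch(in_service=10, warming_up=1, percent_change=10))  # 0
# The metric climbs into a "+30%" step during warmup: 3 wanted, 1 pending, so 2.
print(instances_to_launch(in_service=10, warming_up=1, percent_change=30))  # 2
```

Under this model, a repeated alarm during the warmup interval does not add yet another instance, because the instance that is still warming up already covers the requested increase.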
Still following me?
A few further questions I wasn’t able to google a definitive answer to:
Does the health-check grace period (HCGP) stack on top of the warm-up period?
(My guess is they are unrelated.)
Should the HCGP always be greater than the warm-up period?
(I guess it should.)
What if you set the health-check grace period too short, so that your ELB considers a still-booting instance unhealthy and the ASG orders a replacement? Are you going to end up in an endless loop of replacing instances that are not ready yet?
(I guess it could happen.)
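A toy Python simulation of that scenario, under loud assumptions: health-check failures are ignored for `grace_s` seconds, the load balancer then needs `detect_s` more seconds to report the instance as unhealthy, and every replacement boots from scratch in `boot_s` seconds. All names are hypothetical and this is not how AWS implements it internally.

```python
def simulate_replacements(boot_s: int, grace_s: int, detect_s: int,
                          max_launches: int = 5) -> list[str]:
    """Toy model: an instance survives only if it finishes booting before
    the first health-check verdict that the ASG acts upon."""
    log = []
    verdict_at = grace_s + detect_s  # earliest moment a failure is acted upon
    for n in range(1, max_launches + 1):
        if boot_s <= verdict_at:
            log.append(f"instance {n}: healthy after {boot_s}s")
            break
        log.append(f"instance {n}: replaced at {verdict_at}s (needs {boot_s}s to boot)")
    return log


# Grace period far shorter than boot time: every launch gets replaced.
for line in simulate_replacements(boot_s=240, grace_s=60, detect_s=60):
    print(line)

# Grace period longer than boot time: the first launch survives.
print(simulate_replacements(boot_s=90, grace_s=300, detect_s=60))
```

In this model, the loop never resolves on its own when `boot_s` exceeds `grace_s + detect_s`, which is why the simulation caps the number of launches.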
Bonus consideration: by default an ELB health check needs 2 consecutive probes to fail before it declares an EC2 instance unhealthy, and it sends them 30 sec apart, which adds further delay. Not confused yet? I haven’t even mentioned lifecycle hooks.
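As a back-of-the-envelope check on that extra delay, here is a small Python sketch using the numbers above (2 consecutive failures, 30 seconds apart). It ignores the per-probe timeout, and the helper name is made up.

```python
def unhealthy_detection_window(interval_s: int = 30, threshold: int = 2) -> tuple[int, int]:
    """Approximate (best, worst) seconds between an instance starting to
    fail and the load balancer declaring it unhealthy.

    Best case: the instance fails right before a probe is sent.
    Worst case: it fails right after a probe has succeeded.
    """
    best = (threshold - 1) * interval_s
    worst = threshold * interval_s
    return best, worst


print(unhealthy_detection_window())  # (30, 60)
```

So with these values, roughly 30 to 60 seconds pass before the load balancer reports a failing instance as unhealthy.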