I understand the svc is updating the ep based on the API server's events to scale pods up and down. But what happens when a pod fails? Does the svc notice by not being able to talk to the pod, and then update the ep (scratch out the pod)? Also, some events would be thrown etc. Who is responsible for respawning a failing pod?
First up, I’d say that the EP object updates itself based on watching the API server (rather than the SVC updating the EP). All comms, events etc. go via the API server.
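To make the SVC/EP relationship concrete, here is a minimal sketch (names, IPs, and ports are made up): a Service selects Pods by label, and the control plane maintains a matching Endpoints object with the IPs of the healthy Pods. When a Pod dies, its address is removed from the Endpoints object.

```yaml
# Hypothetical Service; the endpoints controller watches the API
# server and maintains a same-named Endpoints object for it.
apiVersion: v1
kind: Service
metadata:
  name: web            # made-up name
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
---
# The Endpoints object the control plane maintains (addresses are
# illustrative). A failed Pod's IP gets dropped from this list.
apiVersion: v1
kind: Endpoints
metadata:
  name: web            # must match the Service name
subsets:
- addresses:
  - ip: 10.1.0.5       # Pod IPs, illustrative
  - ip: 10.1.0.6
  ports:
  - port: 8080
```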
Re noticing a failed Pod… Without looking at the code, I’d say this. Obviously there are different circumstances. However, usually the kubelet of the node running the Pod will notice it has failed and inform the API server (it may restart the container locally first, depending on the Pod’s restart policy – I honestly can’t remember the exact flow off the top of my head). It is then the responsibility of the relevant controller and the scheduler, coordinating through the API server, to find a replacement a new home. The EP object will be watching the API server, and will notice and update itself accordingly. The SVC object will leverage the updated EP object as normal.
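One common way the kubelet notices a failed container is a liveness probe. A minimal sketch (the name, image, and health path are all assumptions for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probed-pod           # made-up name
spec:
  containers:
  - name: app
    image: example/app:1.0   # illustrative image
    livenessProbe:
      httpGet:
        path: /healthz       # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3    # after 3 consecutive failures the
                             # kubelet kills and restarts the container
```

The kubelet handles the local restart per the Pod’s restartPolicy and reports the status change to the API server, which is what the watching controllers and the EP object react to.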
There is a Master tunable, terminated-pod-gc-threshold, that plays a part in this, along with the other abstractions, e.g. Deployments, ReplicaSets, and the scheduler. If a Pod is part of a ReplicationController, ReplicaSet, or Deployment, its restart policy is Always; if it is part of a Job, its restart policy is typically Never (or OnFailure). So if a Pod dies, the responsible controller will notice and spin up a new one; if a node fails, the same applies. All of this happens via the API server coordinating with the other controllers and the scheduler.
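To make the restart-policy contrast concrete (names and images are made up): a Deployment’s Pod template uses restartPolicy Always (the default, and the only value Deployments allow), while a Job must use Never or OnFailure:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # made-up name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # restartPolicy defaults to Always here; the ReplicaSet
      # controller also replaces Pods that disappear entirely.
      containers:
      - name: app
        image: example/app:1.0
---
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off              # made-up name
spec:
  template:
    spec:
      restartPolicy: Never   # Jobs only allow Never or OnFailure
      containers:
      - name: task
        image: example/task:1.0
```

The two behaviours answer the original question: the kubelet restarts failed containers in place, and the owning controller (ReplicaSet, Job controller, etc.) respawns Pods that are gone, with the scheduler placing the replacements.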