Abstract
Kubernetes probes are explained in detail in the official documentation, and numerous articles on the internet discuss their implementation. This article, however, spends a little more time addressing the confusion that many practitioners have around their applicability, functioning, & risks. It also focuses on using probes for console apps, background services, & daemons, because these are slightly trickier than web-enabled apps. Implementing probes requires meticulous study, because ignorance can lead to erratic results in production environments.
First of all, if your application gracefully exits or cleanly crashes after an error, then the container hosting it will stop too. Kubernetes will automatically restart that container (per the pod's restartPolicy) without the help of any special health probes.
If the application neither exits cleanly nor auto-recovers from the error, then the container will not stop, and k8s will never know that anything went wrong. In such cases, special health probes are required. K8s has 3 types of probes. Each node in a k8s cluster runs a kubelet instance, and the kubelet fires one or more probes against the pods for which probing is enabled. It updates the pod's status with the master / control plane so that unready pods are "temporarily" taken out of service, or it restarts failing containers itself.
Kubernetes Probes
- Startup (for legacy apps; introduced as alpha in Kubernetes 1.16, stable since 1.20):
- Function: Checks if the application has finished booting up. If it has, the probe returns success & k8s allows the readiness & liveness probes to fire, if they are configured. Otherwise, the startup probe is fired again after an interval, for a few more times, until the threshold (periodSeconds x failureThreshold) is reached. If the application doesn't respond before the threshold is reached, it is assumed to have failed & k8s restarts the container (or acts per the podSpec's restartPolicy).
- Frequency: The startup probe fires periodically from container start until it first succeeds; once it succeeds, it never fires again for that container.
- Purpose: Lets the application take sufficient time to bootstrap without being wrongly assumed failed & restarted.
The startup probe disables liveness & readiness checks until it succeeds. It is useful for legacy applications that boot slowly or perform critical bootstrap activities, & it also helps the liveness probe work correctly with such applications. If the startup probe never succeeds, the container is killed once its periodSeconds x failureThreshold budget is exhausted & is then subject to the pod's restartPolicy. (The 300s figure often quoted comes from the documentation's example of failureThreshold: 30 x periodSeconds: 10; it is not a built-in default.)
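As an illustration, a startup probe for a hypothetical web app that may need up to 150 seconds to boot could look like this (the endpoint path, port, & numbers are placeholders, not defaults):

```yaml
startupProbe:
  httpGet:
    path: /healthz      # illustrative endpoint; must exist in your app
    port: 8080
  periodSeconds: 10
  failureThreshold: 15  # 15 x 10s = up to 150s allowed for boot-up
```

Once this probe succeeds, liveness & readiness probes take over and the startup probe never fires again.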
- Readiness:
- Function: Checks if the application is reachable & responsive. If it is, the probe returns success, the application's pods are marked Ready, & Kubernetes sends network traffic or communication requests to them. Otherwise, the application is assumed to be temporarily busy serving previous work, its pods are considered unready, & k8s stops sending them network traffic or communication requests.
This is similar to removing the IP addresses of busy pods from a network load balancer's records. See the k8s documentation for the actual approach.
- Frequency: Readiness probe is fired periodically throughout the life of the application.
- Purpose: Helps applications offer higher availability & throughput, as requests are not channeled to busy pods. Otherwise, requests may take too long to be served, or time out if pods crash under overload.
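A minimal readiness probe sketch for a hypothetical web app (path, port, & thresholds are illustrative, not defaults):

```yaml
readinessProbe:
  httpGet:
    path: /ready          # illustrative endpoint; must exist in your app
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3     # marked unready after 3 consecutive failures
```

Unlike a failing liveness probe, a failing readiness probe does not restart the container; it only takes the pod out of the Service endpoints until the probe succeeds again.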
- Liveness:
- Function: Checks if the application is reachable & responsive. If it is, the probe returns success. Otherwise, the liveness probe is fired again after an interval, for a few more times, until the threshold (periodSeconds x failureThreshold) is reached. If the application doesn't respond before the threshold is reached, it is assumed to have failed & k8s restarts the container (or acts per the podSpec's restartPolicy).
- Frequency: Liveness probe is fired periodically throughout the life of the application.
- Purpose: Helps the application recover from rogue states such as deadlocks, memory leaks, & edge-case bugs where the application neither exits nor crashes; it just stops functioning & never auto-recovers.
Readiness & liveness probes do not follow any such cadence between themselves, though. They are fired independently of each other.
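A minimal liveness probe sketch (again, the endpoint & numbers are placeholders):

```yaml
livenessProbe:
  httpGet:
    path: /healthz      # illustrative endpoint; keep the handler cheap
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 2     # fail fast if the app hangs
  failureThreshold: 3   # container is restarted after 3 consecutive failures
```

Note that readiness & liveness probes can (& usually should) hit different endpoints, since they answer different questions: "can I take traffic right now?" versus "am I still functioning at all?".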
Benefits
- Enables zero downtime deployment
- Prevents deployment of broken images
- Ensures failed containers are auto-restarted
Recommendations
- Web/network applications
- Readiness probe
- Liveness probe
- Startup probe (only if startup/bootstrap takes a while; otherwise a liveness probe with a sufficiently long initial delay will suffice)
- Other non-web/non-network applications (background services, daemons, console apps)
- Readiness probe – not useful
- Liveness probe
- Startup probe (only if startup/bootstrap takes a while; otherwise a liveness probe with a sufficiently long initial delay will suffice)
K8s probes are suitable for apps that are supposed to be always on. Hence, probes are mainly useful with DaemonSet, StatefulSet, & ReplicaSet/Deployment workloads. They are rarely appropriate for Job & CronJob; for those, activeDeadlineSeconds, restartPolicy, & other specs are sufficient, unless you hosted a server inside one or made coding errors. Then those pods are never going to stop & will eventually devour your cluster resources.
Tips
- A pod is considered ready only when all of its containers are ready. Does this mean that if one container is busy, no traffic is routed to any container in the same pod? Yes. Kubernetes allows hosting more than one container per pod only when those containers are tightly coupled apps that communicate directly, e.g. by sharing a volume, or as a proxy sidecar. So, can you host multiple web-enabled apps in the same pod, each with its own independent network interface? No! All containers in a pod share the same IP address & port space; they cannot be contacted separately over the network. Don't try to hack your way around this.
- Keep network latency & other external factors in mind when you specify probe timeouts & periodicity, & use optimal values. Responding to probes is additional work for your apps anyway.
- Do not perform heavy, long-running activity when responding to the liveness probe, like checking access to application dependencies, databases, external services, etc.
- In web-enabled applications, the endpoint used for responding to health checks ideally should not be internet-facing, or should be protected, even if it is a read-only/GET endpoint with anonymous access. For intranet applications it is fine. Getting this right requires in-depth knowledge of Kubernetes security, kubelet configuration, & networking concepts.
- Sometimes a pod may go into an endless restart loop if it crashes due to a coding error: the pod restarts but fails again. Kubernetes handles this with a back-off, similar to how it backs off from attempting to download container images when the registry is inaccessible or the image is missing. The documentation states that crashing containers are restarted with an exponentially increasing delay, capped at 5 minutes (the CrashLoopBackOff state). Note that the cap applies to the delay between restart attempts; the pod is not removed or de-scheduled after 5 minutes, it just keeps being retried every 5 minutes, & the back-off resets after the container runs successfully for a while. So if you have seen pods restarting indefinitely, that is the documented behavior, not a bug in your k8s version.
I know; complicated stuff. It is a community-driven open-source project after all, & the community greatly influences its technical direction, especially given the existing k8s deployments around the world. Sometimes I feel the k8s stack is too much for its own good. Let's move ahead with what we have, for now.
All k8s probes currently use 3 separate methods to directly or indirectly interact with container apps to obtain their health status, viz. HTTP, TCP, & Command (exec). All of them are legitimate ways to perform probing, but ultimately it is the target application's design & coding that can make or break things. Applications that support interaction over a network interface can respond to probes directly, without additional coding effort. Others need to support the interaction through a command-line interface, or by proactively broadcasting their health status so that probes can notice it. This requires additional effort & can get tricky because of the synchronization needs of probing. This is why probing web-enabled applications is much simpler compared to console apps, background services, & daemons. The latter are often designed to perform scheduled & long-running jobs, or to respond to command-line requests; hence, they don't necessarily listen for incoming network requests. They pull their work instead of having it pushed to them.
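The three mechanisms look like this in a container spec (ports, paths, & commands are illustrative; a given probe uses exactly one mechanism, the three are shown here as alternatives):

```yaml
# 1. HTTP: kubelet expects a 2xx/3xx response from the endpoint
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080

# 2. TCP: kubelet only checks that the port accepts a connection
livenessProbe:
  tcpSocket:
    port: 5432

# 3. Command: run inside the container; exit code 0 means healthy
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
```

The exec variant is the escape hatch for non-network apps: the command runs inside the container, so it can inspect files, processes, or anything else the app leaves behind.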
On a side note: applications that do not use multi-threading, multi-tasking, or parallel computing concepts may never run into deadlocks. Such applications may not benefit much from a liveness probe, unless you want to enforce timeouts on their long-running operations.
For example, .NET console apps have no SynchronizationContext by default (unlike UI apps), so async-await continuations simply run on thread-pool threads & the classic sync-over-async deadlocks are unlikely, even when the Task Parallel Library or the async-await pattern is used.
One solution that seems quite obvious here is modifying apps to publish their health status periodically: just like firing a flare in the sky every 15 minutes, except this flare indicates that everything's fine. The absence of a flare for more than 15 minutes indicates that something's wrong.
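The flare idea can be sketched in a few lines of Python. This is a minimal illustration, not a production pattern: the file name, interval, & thread structure are assumptions, & a real service would run the flare loop for its whole lifetime alongside its queue-processing work.

```python
import threading
import time
from pathlib import Path

ALIVE_FILE = Path("alive.txt")  # hypothetical file name; must match the probe's command
INTERVAL = 15                   # seconds; keep it shorter than the probe's periodSeconds

def flare_loop(stop: threading.Event, interval: float) -> None:
    """Re-create the flare file at a fixed interval until asked to stop."""
    while not stop.is_set():
        ALIVE_FILE.write_text(str(time.time()))  # the flare: a fresh timestamp
        stop.wait(interval)                      # sleep, but wake early on stop

# Run the flare alongside the real work on a daemon thread.
stop = threading.Event()
flare = threading.Thread(target=flare_loop, args=(stop, 0.01), daemon=True)
flare.start()

time.sleep(0.05)  # stand-in for the service's real (long-running) work

stop.set()
flare.join()
print(ALIVE_FILE.exists())  # → True (the flare exists until the probe deletes it)
```

If the main work hangs, the flare thread keeps running, so in practice the heartbeat should be tied to actual progress (e.g. touched from inside the work loop), otherwise a deadlocked service can still look alive.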
Let's take a scenario:

A background service is running inside a Kubernetes cluster hosted on Microsoft Azure. It reads a message from a Request queue, performs some long-running activity, & sends a message to a Response queue. The service writes a started.txt file once it is up & all of its dependencies (both the request & response queues) are accessible to it. Then it writes an alive.txt file repeatedly, at a fixed interval.
The Kubernetes startup probe tries to delete the started.txt file; if it can, it assumes the service is up. Similarly, the Kubernetes liveness probe tries to delete the alive.txt file; if it can, it assumes the service is healthy.
Implementation
Below YAML configuration sample has both Startup & Liveness probes configured.
startupProbe:
  exec:
    command:
      - rm
      - started.txt
  initialDelaySeconds: 5
  periodSeconds: 15
livenessProbe:
  exec:
    command:
      - rm
      - alive.txt
  initialDelaySeconds: 15
  periodSeconds: 15
The background service creates/modifies the started.txt file after booting up successfully, to indicate that it is ready. The startup probe tries to delete that file to confirm that the background service is up & running. The background service then creates/modifies the alive.txt file every 15 seconds, in parallel with performing its core duty of processing jobs received on the request queue. The liveness probe tries to delete it on its own schedule to confirm that the service is healthy.
Code sample
Sample is under development. You can track its progress here.
