With the continued growth and transition to microservices it’s important to ensure that the time and money re-engineering systems to modern, cloud-based solutions lead to tangible benefits to the organization. In this multi-part series, we’ll look at different components and pitfalls that need to be considered when modernizing to microservices. 

In this blog, we’ll look at best practices for Kubernetes Health Endpoints to Achieve Self-Healing.

Save Time (and costs) by Changing Your Reliability Goals

The health endpoints are key to making the system resilient in the face of unexpected problems. They are also an opportunity to save time (and cost) by not shooting for near-perfection in reliability. Admitting that sometimes the services will fail is okay, as long as it can recover automatically with no adverse effect to end users. The way that this is achieved in Kubernetes/OpenShift is through the use of liveness and readiness probes. The liveness probe is used to restart the container if it fails more than its failure threshold, and the readiness probe is used to know whether or not to route traffic to that pod. Keep in mind that the liveness and readiness probes are completely independent. In other words, the liveness probe can fail, but as long as the readiness probe is successful then the traffic will be routed to the container (until the failure threshold of the liveness probe is reached, at which point the container will be killed).

Don’t Make Assumptions on the Health of the Cluster

As a best practice, there should be no assumption made on the health of the cluster, which means that the liveness and readiness probes need to support both a fast and slow startup of the container. Until Kubernetes 1.16, that created a dilemma: the only way to route traffic quickly after a fast startup time is to put a low initialDelay in the readiness probe. But to support a slow start of the container, the failureThreshold needs to be increased substantially, which delays reacting to problems in the service. Starting in Kubernetes 1.16, the startup probe can be leveraged to allow for a long startup time before the liveness probe takes over and not overly pad the liveness and readiness probes. So, try to use the startup probe whenever possible.

Another consideration is being more aware of cascading effects in readiness probes. If a readiness probe tests its downstream services’ readiness probes, a transient issue might cascade to other services, thereby inactivating themselves. For that reason alone, a readiness probe should not test downstream services, but be limited to concerns under its direct control like finishing its startup process.

One More Challenge With Health Endpoints

One last challenge: the default SpringBoot Actuator health endpoint does not test for the memory situation. So, if the functional endpoints need a particular amount of memory to respond properly, they might error out because there is not enough memory available (because of a memory leak or too many requests in flight consuming too much), while the health probes return that everything is well, in a perfect world. To improve the health endpoint and take into account the unavailability of memory, the first thought could be to check how much memory is left in the liveness probe. However, if the garbage collector behavior is unpredictable, a better workaround, in our opinion, is to attempt to allocate a substantial buffer in the liveness probe when it’s called (and not hold to it). If the allocation is successful, the liveness probe can return that it is alive. While not perfect and with slight overhead, it ensures that the microservice gets restarted automatically when the inability to allocate memory persists longer than the duration of the allowed failure threshold.

Need to catch-up? Previously, lessons included:

Part 1: The Importance of Starting with the Team

Part 2: Defining Ownership

Part 3: Process Management and Production Capacity

Part 4: Reserving Capacity for Innovation

Part 5: Microservices Communication Patterns

Part 6: Using Shadow Release Strategy

Part 7: Performance Testing Microservices

Part 8: Memory Configuration Between Java and Kubernetes

Part 9: Prioritizing Testing within Microservices

Part 10: Distributed Systems

Bonus: Connection Pools and Queues

Ready to modernize your organization’s microservices? Oteemo is a leader in cloud-native application development. Learn more: https://oteemo.com/cloud-native-application-development/