With the continued growth of and transition to microservices, it’s important to ensure that the time and money spent re-engineering systems into modern, cloud-based solutions leads to tangible benefits for the organization. In this multi-part series, we’ll look at different components and pitfalls that need to be considered when modernizing to microservices.
In this blog, we’ll look at how to properly plan for failures in distributed systems.
The Era of Cattle, Not Pets
In the legacy world, any error was a big enough deal to investigate and try to address. In the era of cattle and not pets, increased complexity makes it more likely for things to fail (more network traffic, an orchestration layer, pods scaling up and down, etc.). Individual errors are expected, so the error rate is what needs to be monitored, and a proper retry strategy needs to be implemented.
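To make "monitor the error rate, not the individual error" concrete, here is a minimal sketch of a sliding-window error-rate check. The class name, the window size, and the threshold are all hypothetical values chosen for illustration; in practice this job usually belongs to your metrics and alerting stack rather than application code.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error rate over the most recent `window_size` calls.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # Each entry is True for a failed call, False for a successful one.
        # The deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def breached(self) -> bool:
        # Alert on the rate crossing the threshold, not on any single error.
        return self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for failed in [False] * 9 + [True]:
    monitor.record(failed)

# One failure in ten calls stays under the assumed 20% threshold.
print(monitor.error_rate(), monitor.breached())
```

A single failed call does not trip the check; only a sustained run of failures inside the window does.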
Two Options for a Retry Strategy
There are two main options for a retry strategy. Either use a service mesh like Istio or, alternatively, add the strategy explicitly in the code with frameworks such as Resilience4j (Hystrix is end-of-life). If you know you’re eventually going to adopt a service mesh for all of its benefits (observability, security, and reliability), it is acceptable to temporarily forgo the reliability benefits, as long as the rate of errors stays within the SLA. You can also selectively add explicit retry strategies in the code to the few places that would benefit most.
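For the "explicit in the code" option, the following is a minimal sketch of what a retry with exponential backoff and jitter looks like. It is plain Python and is not the Resilience4j or Istio API; the `TransientError` class, the `call_with_retry` function, and the default values are assumptions made for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1,
                    max_delay=2.0, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: surface the failure to the caller.
                raise
            # Exponential backoff, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter so many callers do not retry in lockstep.
            sleep(random.uniform(0, delay))


# Usage: a stand-in operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


result = call_with_retry(flaky_operation, max_attempts=3)
```

Only errors explicitly marked as transient are retried here. Retrying every exception, or retrying operations that are not safe to repeat, tends to cause more problems than it solves.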
A word of caution on retry strategies: every retry adds latency. If you are already at the edge of your SLA, then, depending on how far into the processing the failure occurs, a retried request may be unable to respond in time. That’s another reason why your SLOs should be much more aggressive than your external SLAs.
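One way to respect that caution is to make the retry loop aware of the time budget, so it stops retrying once another attempt could no longer finish in time. The sketch below assumes a hypothetical per-request deadline and a rough estimate of how long one call takes; both names and values are illustrative, not taken from any particular framework.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_within_deadline(operation, deadline_seconds, expected_call_seconds,
                         clock=time.monotonic):
    """Retry `operation` only while enough of the time budget remains."""
    start = clock()
    last_error = None
    while True:
        remaining = deadline_seconds - (clock() - start)
        if remaining < expected_call_seconds:
            # Another attempt would likely overrun the deadline, so stop here.
            break
        try:
            return operation()
        except TransientError as error:
            last_error = error
    raise TimeoutError("not enough time left to retry") from last_error
```

Setting `deadline_seconds` from the internal SLO rather than the external SLA leaves headroom for the retries themselves.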
Need to catch up? Previous lessons in this series:
Part 1: The Importance of Starting with the Team
Part 2: Defining Ownership
Part 3: Process Management and Production Capacity
Part 4: Reserving Capacity for Innovation
Part 5: Microservices Communication Patterns
Part 6: Using Shadow Release Strategy
Part 7: Performance Testing Microservices
Part 8: Memory Configuration Between Java and Kubernetes
Part 9: Prioritizing Testing within Microservices
Part 10: Distributed Systems
Ready to modernize your organization’s microservices? Oteemo is a leader in cloud-native application development. Learn more: https://oteemo.com/cloud-native-application-development/