Distributed Systems Will Fail – Plan For It: Part 10

May 6, 2023 | By Brice Dardel

Distributed Systems

With the continued growth and transition to microservices it’s important to ensure that the time and money re-engineering systems to modern, cloud-based solutions lead to tangible benefits to the organization. In this multi-part series, we’ll look at different components and pitfalls that need to be considered when modernizing to microservices. 

In this blog, we’ll look at how to properly plan for failures in distributed systems.

The Era of Cattle, Not Pets

In the legacy world, any error was a big enough deal to investigate and try to address. In the era of cattle and not pets, where increased complexity makes it more likely for things to fail (more network traffic; orchestration layer; pod scaling up and down, etc.), the error rate is what needs to be monitored, and a proper retry strategy needs to be implemented.

Two Options for a Retry Strategy

There are two main options for a retry strategy. Either use a service mesh like Istio or, alternatively, add the strategy explicitly in the code with frameworks such as Resilience4j (Hystric is end-of-life). If you know you’re eventually going to have a service mesh for all of its benefits (observability, security, and reliability), it is acceptable to temporarily forgo the reliability benefits – as long as the rate of errors is under the SLA. You can also selectively add explicit retry strategies in the code to the few places that would benefit most.

A word of caution on retry strategies: if you are already at the edge of your SLA, then, depending on when the failure occurs in the processing, the service may be unable to respond in a timely manner. That’s another reason why your SLOs should be much more aggressive than your external SLAs.

Need to catch-up? Previously, lessons included:

Part 1: The Importance of Starting with the Team

Part 2: Defining Ownership

Part 3: Process Management and Production Capacity

Part 4: Reserving Capacity for Innovation

Part 5: Microservices Communication Patterns

Part 6: Using Shadow Release Strategy

Part 7: Performance Testing Microservices

Part 8: Memory Configuration Between Java and Kubernetes

Part 9: Prioritizing Testing within Microservices

Part 10: Distributed Systems

Ready to modernize your organization’s microservices? Oteemo is a leader in cloud-native application development. Learn more: https://oteemo.com/cloud-native-application-development/


Submit a Comment

Who We Are & What We Do

As passionate technologists, we love to push the envelope. We act as strategists, practitioners and coaches to enable enterprises to adopt modern technology and accelerate innovation.

We help customers win by meeting their business objectives efficiently and effectively.

icon         icon        icon

Newsletter Signup:

Join tens of thousands of your peers and sign-up for our best technology content curated by our experts. We never share or sell your email address!