With the continued growth of microservices and the transition toward them, it’s important to ensure that the time and money spent re-engineering systems into modern, cloud-based solutions lead to tangible benefits for the organization. In this multi-part series, we’ll look at different components and pitfalls that need to be considered when modernizing to microservices.
In this blog, we’ll look at Performance Testing Microservices.
The Challenge of Performance Testing Microservices
In the past, legacy application performance was tested as a whole, from the perspective of the users’ requests: there were no separate components to test. In a microservices world, final system performance should be monitored, but it shouldn’t be the product teams’ focus for three main reasons: delayed feedback, complexity, and lack of accountability.
Waiting for the whole system of microservices to be in a test environment creates coordination problems between teams, relies on a complex environment setup, and is slow to produce results. Worse, the results lack clear, actionable information about performance bottlenecks and regressions. If the first, orchestration-level service is always held responsible for the performance of the whole system, you unfairly make that team accountable for microservices that are the responsibility of other teams.
Why Performance Tests Should be Initially Done in Isolation
In a microservices world, performance testing needs to be done in isolation for each microservice. First, define service level objectives (SLOs) for all the services, leveraging meaningful service level indicators (SLIs). In the legacy application, you might have had a service level agreement (SLA) for each type of operation entering the monolith. Now each microservice (each endpoint) should have its own SLO. Here’s an example to illustrate the point: service A calls service B, which then calls service C, and the whole system has a 99th-percentile SLA of seven seconds. You could define an SLO of five seconds on A, three seconds on B, and one second on C.
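The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the function name `own_budgets` and the idea of an "own budget" (a service's SLO minus the SLO of the service it calls) are assumptions introduced here, not terms from a specific tool.

```python
def own_budgets(chain):
    """Return the time each service may spend itself.

    chain: list of (name, slo_seconds), ordered from caller to callee.
    A service's own budget is its SLO minus the SLO of the service it calls.
    """
    budgets = {}
    for i, (name, slo) in enumerate(chain):
        downstream_slo = chain[i + 1][1] if i + 1 < len(chain) else 0
        budgets[name] = slo - downstream_slo
    return budgets


SLA_SECONDS = 7  # published 99th-percentile SLA for the whole system
chain = [("A", 5), ("B", 3), ("C", 1)]  # internal SLOs from the example

budgets = own_budgets(chain)
print(budgets)  # {'A': 2, 'B': 2, 'C': 1}

# The outermost SLO should stay below the published SLA,
# and every service needs a positive budget of its own.
slos_are_consistent = chain[0][1] < SLA_SECONDS and all(b > 0 for b in budgets.values())
print(slos_are_consistent)  # True
```

With these numbers, A and B each have two seconds of their own and C has one, which leaves a two-second margin between A's SLO and the seven-second SLA.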
The idea here is twofold. First, keep your internal SLOs more aggressive than your published SLAs. Second, instead of performance testing the whole system at once and not understanding what is going on, performance test A by virtualizing B and giving it a constant response time of three seconds. A’s response time can then be verified to be under the five-second threshold (virtualization solutions can also add random delays and simulate more varied performance profiles). Similarly, the team owning B can virtualize C at its SLO of one second, ensuring that B responds in under three seconds. SLOs can then be further refined to include network latency.
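Here is a minimal Python sketch of that isolated test. Everything in it is hypothetical: `make_virtual_service`, `service_a`, and `p99` are invented names, the virtual service is a plain function with a fixed delay rather than a real virtualization product, and the delays are scaled down (10 ms stands in for one second) so the sketch finishes quickly.

```python
import time


def make_virtual_service(delay_s):
    """Return a stand-in for a downstream service with a constant response time."""
    def call():
        time.sleep(delay_s)
        return "ok"
    return call


def service_a(call_b):
    """Hypothetical service A: a little local work, then a call to B."""
    time.sleep(0.001)  # placeholder for A's own processing
    return call_b()


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def measure(service, dependency, runs):
    """Call the service repeatedly and collect wall-clock latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        service(dependency)
        latencies.append(time.perf_counter() - start)
    return latencies


SCALE = 0.01          # assumption: 1 second in the example = 10 ms here
B_SLO = 3 * SCALE     # B is virtualized at its SLO
A_SLO = 5 * SCALE     # threshold A has to meet

latencies = measure(service_a, make_virtual_service(B_SLO), runs=20)
a_p99 = p99(latencies)
print(f"A p99: {a_p99 / SCALE:.2f} scaled seconds (SLO: {A_SLO / SCALE:.0f})")
print("within SLO" if a_p99 < A_SLO else "SLO violated")
```

The team owning B would run the same harness with C virtualized at one scaled second and a three-second threshold.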
This dependency on virtual services for performance testing improves the accountability and sanity of teams. It simplifies the dependencies on other services, and it can also help tremendously in automating performance testing and shifting the activity left. Each team can become responsible for their own service performance instead of it being relegated to a later stage.
More Lessons Learned When Testing
A few other lessons learned with regard to performance testing in a Kubernetes/OpenShift environment:
- To remove some variability in performance testing, set the CPU request and limit to the same value so that a performance test result can be compared with a previous one. Otherwise, you have no idea how much CPU the service actually got on a congested node. The CPU limit is not guaranteed; only the request is.
- While setting the CPU limit and request to the same value is a good practice in the performance test environment, production settings can benefit from a higher CPU limit, if available. The CPU request of the production environment can match the values in the test environment to ensure SLO consistency while increasing the CPU limit.
- If all services have a CPU request defined (highly recommended), you can remove CPU limits in production (and in all environments where you are not doing performance testing). Why throttle the compute resources already paid for? For a longer rationale and technical details, there are good articles here and here.
- Don’t stop at load testing the system: do stress tests to see when and how it starts breaking and, most importantly, how it recovers. Also, run soak tests over a long period to rule out subtle memory leaks or other long-term side effects.
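The CPU request and limit advice above could be enforced with a small check before a performance run. The following Python sketch assumes the container's `resources` section has already been loaded into a dictionary with the usual `requests` and `limits` keys; the helper names `parse_cpu` and `cpu_is_pinned` are made up for this example, and only plain core counts and the millicore suffix (`m`) are handled.

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1]) / 1000  # millicores to cores
    return float(text)


def cpu_is_pinned(resources):
    """True when the CPU request and limit are both set and equal."""
    request = resources.get("requests", {}).get("cpu")
    limit = resources.get("limits", {}).get("cpu")
    if request is None or limit is None:
        return False
    return parse_cpu(request) == parse_cpu(limit)


# Performance test environment: request and limit match.
perf_test = {"requests": {"cpu": "500m"}, "limits": {"cpu": "0.5"}}

# Production: same request, no CPU limit.
production = {"requests": {"cpu": "500m"}}

print(cpu_is_pinned(perf_test))   # True
print(cpu_is_pinned(production))  # False
```

A pipeline could refuse to start a performance test when `cpu_is_pinned` returns False, since the result would not be comparable with earlier runs.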
Need to catch up? Previously, lessons included:
Part 2: Defining Ownership
Part 6: Using Shadow Release Strategy
Part 10: Distributed Systems
Ready to modernize your organization’s microservices? Oteemo is a leader in cloud-native application development. Learn more: https://oteemo.com/cloud-native-application-development/