Microservices: What do you need to tweak to optimize throughput and response times?

Performance tuning usually goes something like this:

  • a performance problem occurs
  • an experienced person knows what is probably the cause and suggests a specific change
  • baseline performance is determined, the change is applied, and performance is measured again
  • if the performance has improved compared to the baseline, keep the change, else revert the change
  • if the performance is now considered sufficient, you're done. If not, return to the experienced person to ask what to change next and repeat the above steps
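
The loop above can be sketched in a few lines of Python. This is only an illustration of the process, not tooling I used: the `tune` function, the `settings` dictionary and the fake `measure` function are all hypothetical stand-ins for a real load test and real configuration changes.

```python
from typing import Callable, List, Tuple

Change = Tuple[Callable[[], None], Callable[[], None]]  # (apply, revert)

def tune(measure: Callable[[], float], changes: List[Change], target_ms: float) -> float:
    """Try suggested changes one by one and keep only those that improve on the baseline."""
    best = measure()  # determine the baseline
    for apply, revert in changes:
        if best <= target_ms:
            break  # performance is considered sufficient
        apply()
        after = measure()
        if after < best:
            best = after  # improvement: keep the change
        else:
            revert()  # no improvement: revert the change
    return best

# Hypothetical system under test: a settings dict and a fake measurement.
settings = {"pool_size": 10, "cache": False}

def measure() -> float:
    # Stand-in for a real load test; returns a mean response time in ms.
    ms = 100.0
    if settings["cache"]:
        ms -= 30.0
    if settings["pool_size"] > 10:
        ms += 20.0
    return ms

changes = [
    (lambda: settings.update(pool_size=50), lambda: settings.update(pool_size=10)),
    (lambda: settings.update(cache=True), lambda: settings.update(cache=False)),
]
result = tune(measure, changes, target_ms=80.0)
print(result, settings)
```

In this toy run the first suggestion makes things worse and is reverted, while the second is kept. Every rejected guess still costs a full measurement, which is exactly why better guesses save time.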

This entire process can be expensive, especially in complex environments where the suggestion of an experienced person is usually a (hopefully well-informed) guess. It will probably take quite a few iterations before the performance is sufficient. If you can make these guesses more accurate by augmenting this informed judgement, you can potentially tune more efficiently.

In this blog post I'll try to do just that. Of course, a major disclaimer applies here, since every application, environment, and hardware setup is different. The definition of performance and how to measure it is also something people can have different opinions on. In short, I looked at many different variables and measured the response times and throughput of minimal microservice implementations for every combination of those variables. I fed all that data to a machine learning model and asked the model which variables it relied on most to predict performance. I also presented on this topic at UKOUG Techfest 2019 in Brighton, UK. You can view the presentation here.



I varied several things:

  • wrote similar implementations in 10 frameworks (see code here)
  • varied the number of assigned cores
  • varied the assigned memory
  • varied the Java version (8, 11, 12, 13)
  • varied the JVM supplier (OpenJ9, Zing, OpenJDK, OracleJDK)
  • varied the garbage collection algorithm (tried all possible algorithms for every JVM / version)
  • varied the number of concurrent requests
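
To give an idea of how quickly such a test matrix grows, here is a minimal sketch that enumerates combinations with `itertools.product`. Only the Java versions, the JVM suppliers and the 20s test duration come from this post; the framework names, core counts, memory sizes, concurrency levels and the `gc_flags` helper are hypothetical placeholders, and the GC table is deliberately simplified.

```python
from itertools import product

# Values taken from the list above
java_versions = [8, 11, 12, 13]
jvms = ["OpenJ9", "Zing", "OpenJDK", "OracleJDK"]

# Hypothetical placeholder values, not the ones used in the actual tests
frameworks = ["framework_a", "framework_b"]
cores = [1, 2, 4]
memory_mb = [128, 512]
concurrency = [1, 4]

def gc_flags(jvm, version):
    """Illustrative only: GC options to try for a JVM / version pair.

    `version` is unused in this sketch; a real table would depend on it.
    """
    if jvm in ("OpenJDK", "OracleJDK"):
        return ["-XX:+UseSerialGC", "-XX:+UseParallelGC", "-XX:+UseG1GC"]
    return ["default"]  # placeholder for JVMs with their own collectors

scenarios = [
    {"framework": fw, "jvm": jvm, "java": version, "gc": gc,
     "cores": n_cores, "memory_mb": mem, "concurrency": conc}
    for fw, jvm, version, n_cores, mem, conc in product(
        frameworks, jvms, java_versions, cores, memory_mb, concurrency)
    for gc in gc_flags(jvm, version)
]

# At 20 seconds per test, excluding warm-up and start-up time
estimated_hours = len(scenarios) * 20 / 3600
print(len(scenarios), round(estimated_hours, 1))
```

Even with these small placeholder lists the matrix already contains hundreds of scenarios, so every extra value for a variable multiplies the total test time.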

What did I do?

I measured response times and throughput for every possible combination of variables. You can look at the data here.

Next, I put all the data into a Random Forest Regression model, confirmed it was accurate, and asked the model for its feature importances: which features were most important in the generated model for determining response time and throughput? Those are the features to start tweaking first; features with low importance are less relevant. Of course, as I already mentioned, the model was generated from the data I provided. I had to make some choices, because even with tests of 20s each, testing every combination took over a week. How accurate will the model be for situations outside my test scenario? I cannot tell; you have to check for yourself.
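
As a rough illustration of this step, the sketch below fits a Random Forest regressor and reads out its feature importances. It assumes scikit-learn and NumPy, and it uses synthetic data in which the framework dominates by construction; it is not my notebook and not my measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in data (NOT the measurements from this post)
framework = rng.integers(0, 3, size=n)  # three hypothetical frameworks
cores = rng.integers(1, 5, size=n)      # 1 to 4 assigned cores
# Assumed relationship: the framework dominates, cores barely matter
response_ms = (np.array([1.0, 2.0, 4.0])[framework]
               + 0.05 * cores
               + rng.normal(0.0, 0.05, size=n))

# One-hot encode the categorical framework column
X = np.column_stack([np.eye(3)[framework], cores])
column_owner = ["framework", "framework", "framework", "cores"]

X_train, X_test, y_train, y_test = train_test_split(X, response_ms, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# R^2 on held-out data, to confirm the model is accurate
r2 = model.score(X_test, y_test)

# Sum the importances of the one-hot columns back into one value per variable
importance = {}
for name, value in zip(column_owner, model.feature_importances_):
    importance[name] = importance.get(name, 0.0) + value

print(round(r2, 3), importance)
```

Summing the one-hot columns back into a single number per variable is one simple way to compare a categorical variable such as the framework with a numeric one such as the number of cores.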

Which tools did I use?

Of course I could write a book about this study: describe the details of the method used, explain all the different microservice frameworks tested, elaborate on the test tooling used, etc. I won't. You can check the scripts yourself here, and I already wrote an article about most of the data here (in Dutch though). Some highlights:

- I used Apache Bench for load generation. Apache Bench might not be highly regarded by some, but it did the job well enough, and when comparing its results to wrk, for example, there is not much difference (see here).

- I used Python for running the different scenarios. This was easier than Bash, which I used before.

- For analysis and visualization of the data I used Jupyter Notebook.

- I first did some warm-up / priming before starting the actual tests.

- I took special care not to use virtualization tools such as VirtualBox or Docker.

- I also took specific care to avoid competition for resources, even though I measured on the same hardware that produced the load. Splitting load generation and the service across different machines would not have worked, since the performance differences were sometimes pretty small (sub-millisecond). These differences would be lost in network transport.
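
To illustrate the combination of Python and Apache Bench, here is a minimal sketch of a wrapper that runs `ab` and extracts throughput and mean response time from its report. The function names, the URL and the sample report text are made up for this example; my actual scripts are linked above.

```python
import re
import subprocess

def build_ab_command(url, concurrency, seconds=20):
    # -t: time limit in seconds, -c: number of concurrent requests.
    # ab's -t implies a cap of 50000 requests, so -n is given after it to raise the cap.
    return ["ab", "-t", str(seconds), "-n", "1000000", "-c", str(concurrency), url]

def parse_ab_output(text):
    """Extract throughput and mean response time from ab's textual report."""
    rps = re.search(r"Requests per second:\s+([\d.]+)", text)
    mean_ms = re.search(r"Time per request:\s+([\d.]+)\s+\[ms\]\s+\(mean\)", text)
    if rps is None or mean_ms is None:
        raise ValueError("unexpected ab output")
    return {"throughput_rps": float(rps.group(1)),
            "mean_response_ms": float(mean_ms.group(1))}

def run_ab(url, concurrency, seconds=20):
    # Requires the ab binary on the PATH; not called in this example.
    result = subprocess.run(build_ab_command(url, concurrency, seconds),
                            capture_output=True, text=True, check=True)
    return parse_ab_output(result.stdout)

# Made-up report fragment in ab's output format (not a real measurement)
sample_report = """\
Requests per second:    4321.12 [#/sec] (mean)
Time per request:       0.926 [ms] (mean)
Time per request:       0.231 [ms] (mean, across all concurrent requests)
"""
parsed = parse_ab_output(sample_report)
print(build_ab_command("http://localhost:8080/greeting", 4))
print(parsed)
```

A driver script would call something like `run_ab` once per scenario, after a warm-up run against the same endpoint, and write the parsed numbers to a file for later analysis.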


Confirm the model is accurate

In the plot below I've shown predicted values against actual values. The diagonal line indicates perfect accuracy. As you can see, the model's accuracy is pretty high. The R^2 value (coefficient of determination) was also around 0.99 for both response times and throughput, which is very nice!
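
For readers who want to see what the R^2 value expresses, here is a small sketch of the standard definition, R^2 = 1 - SS_res / SS_tot, in plain Python. The numbers are invented for the example.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Invented response times in ms (not from the real data set)
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(actual, predicted)
print(round(r2, 2))
```

A value of 1 means every point lies on the diagonal; a model that always predicts the mean of the actual values scores 0.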

Feature importance

The graphs below show the feature importance of the different variables.

However, I noticed that feature importance becomes less accurate when the number of different classes differs per variable. To compensate for that, I also looked at permutation feature importance, which is determined by calculating the reduction in model accuracy when the values of a specific variable are randomized. Luckily, this looked very similar:


As you can see, the feature importance of the framework/implementation used was highest. This indicates that the choice of implementation (within the scope of my tests, of course) mattered more for response times and throughput than, for example, the JVM supplier (Zing, OpenJ9, OpenJDK, OracleJDK). The JVM supplier in turn was more important than the choice of a specific garbage collection algorithm. The garbage collection algorithm did not appear to be that important at all, although it did appear to become more important when memory became limiting. The Java version did not make much difference.
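
The permutation feature importance mentioned earlier can be sketched in plain Python as follows: shuffle one variable at a time and record how much the model's score drops. The toy `predict` function and its data are hypothetical; with scikit-learn you would more likely use `sklearn.inspection.permutation_importance` on the fitted model.

```python
import random

def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Mean drop in R^2 when the values of a single feature are shuffled."""
    rng = random.Random(seed)
    baseline = r_squared(targets, [predict(r) for r in rows])
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
            drops.append(baseline - r_squared(targets, [predict(r) for r in permuted]))
        importances[feature] = sum(drops) / n_repeats
    return importances

# Hypothetical "trained model": the framework dominates, cores barely matter
BASE_MS = {"framework_a": 1.0, "framework_b": 2.0, "framework_c": 4.0}

def predict(row):
    return BASE_MS[row["framework"]] + 0.05 * row["cores"]

rows = [{"framework": fw, "cores": c} for fw in BASE_MS for c in (1, 2, 3, 4)]
targets = [predict(r) for r in rows]  # toy data the model fits perfectly
importances = permutation_importance(predict, rows, targets)
print(importances)
```

Because the score is computed on the model's predictions rather than on its internal tree structure, this measure does not depend on how many distinct values a variable has.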

The least important feature in these tests was the number of assigned cores. Apparently, assigning more cores did not improve performance much. Because I found this peculiar, I did some additional analyses on the data, and it appeared that certain frameworks are better than others at using more cores or dealing with higher concurrency.
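
One simple way to look at this per framework is to compare mean throughput at the highest and lowest core count. The sketch below does that on invented numbers; the `core_scaling` helper and the measurements are hypothetical and only show the shape of such an analysis, not my results.

```python
from collections import defaultdict
from statistics import mean

# Invented rows: (framework, assigned cores, throughput in requests/s)
measurements = [
    ("framework_a", 1, 1000.0), ("framework_a", 1, 1000.0),
    ("framework_a", 4, 1100.0), ("framework_a", 4, 1100.0),
    ("framework_b", 1, 900.0), ("framework_b", 4, 2700.0),
]

def core_scaling(rows):
    """Per framework: mean throughput at the most cores / mean throughput at the fewest."""
    grouped = defaultdict(lambda: defaultdict(list))
    for framework, cores, throughput in rows:
        grouped[framework][cores].append(throughput)
    scaling = {}
    for framework, by_cores in grouped.items():
        fewest, most = min(by_cores), max(by_cores)
        scaling[framework] = mean(by_cores[most]) / mean(by_cores[fewest])
    return scaling

scaling = core_scaling(measurements)
print(scaling)
```

A ratio close to 1 means extra cores hardly help that framework, while a ratio near the core multiplier means it scales well.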

You can check the notebook here.