Performance tuning usually goes something like followed:
This entire process can be expensive. Especially in complex environments where the suggestion of an experienced person is usually a (hopefully well informed) guess. This probably will require quite some iterations for the performance to be sufficient. If you can make these guesses more accurate by augmenting this informed judgement, you can potentially tune more efficiently.
In this blog post I'll try to do just that. Of course a major disclaimer applies here since every application, environment, hardware, etc is different. The definition of performance and how to measure it is also something which you can have different opinions on. In short what I've done is look at many different variables and measuring response times and throughput of minimal implementations of microservices for every combination of those variables. I fed all that data to a machine learning model and asked the model which variables it used to do predictions of performance with. I also presented on this topic at UKOUG Techfest 2019 in Brighton, UK. You can view the presentation here .
I varied several things
What did I do?
I measured response times and throughput for every possible combination of variables. You can look at the data here .
Next I put all data into a Random Forest Regression model, confirmed it was accurate and asked the model to provide me with feature importances. Which feature was most important in the generated model for determining the response time and throughput. These are then the features to start tweaking first. The features with low feature importance are less relevant. Of course as I already mentioned, the model has been generated based on the data I've provided. I had to make some choices, because even when using tests of 20s each, testing every combination took over a week. How accurate will the model be when looking at situations outside my test scenario? I cannot tell; you have to check for yourself.
Which tools did I use?
Of course I could write a book about this study. The details of the method used, explain all the different microservice frameworks tested, elaborate on the testtooling used, etc. I won't. You can check the scripts yourself here and I already wrote an article about most of the data here (in Dutch though). Some highlights;
- I used Apache Bench for load generation. Apache Bench might not be highly regarded by some but it did the job well enough and when for example comparing performance to wrk , there is not much difference (see here )
- I used Python for running the different scenario's. Easier than Bash, which I used before.
- For analyzing and visualization of the data I used Jupyter Notebook.
- I first did some warm-up / priming before starting the actual tests
- I took special care not to use virtualization tools such as VirtualBox or Docker
- I also looked specifically at avoiding competition for resources even though I measured on the same hardware as where I produced load. Splitting the load generation and service to different machines would not have worked since the performance differences, were sometimes pretty small (sub millisecond). These differences would be lost when transporting over a network.
Confirm the model is accurate
In the below plot I've shown predicted values against actual values. The diagonal line indicates perfect accuracy. As you can see accuracy is pretty high of the model. Also the R^2 value (coefficient of determination) was around 0.99 for both response times and throughput which is very nice!
The below graphs show the results for feature importance of the different variables.
However I noticed feature importance becomes less accurate when the number of different classes differs per variable. In order to fix that I also looked at permutation feature importance. Permutation feature importance is determined by calculating the reduction in model accuracy when a specific variable is randomized. Luckily this looked very similar:
As you can see, the feature importance of the used framework/implementation was highest. This indicates the choice of implementation (of course within the scope of my tests) was more important than for example the JVM supplier (Zing, OpenJ9, OpenJDK, OracleJDK) for the response times and throughput. The JVM supplier was more important than the choice for a specific garbage collection algorithm (the garbage collection algorithm did not appear to be that important at all, even though when memory became limiting, it did appear to become more important). The Java version did not show much differences.
The least important features during these test were the number of assigned cores. Apparently assigning more cores did not improve performance much. Because I found this peculiar, I did some additional analyses on the data and it appeared certain frameworks are better in using more cores or dealing with higher concurrency then others.
You can check the notebook here